Matthew Perez

The Dao of Gradient Descent

Gradient Descent

It's a technique used to find the optimum setting of something: the perfect temperature for the perfect amount of time to bake the perfect cake. The process is like baking thousands of cakes and, after each one, slightly turning the dials in whatever direction brings you closer to the right outcome.

People commonly use the analogy of standing somewhere on a mountain range, blindfolded, trying to make your way to the lowest point (the global minimum).
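
To make the blindfolded walk concrete, here's a minimal sketch in Python. The one-dimensional "mountain" f(x) = (x - 3)**2 and the step size are my own illustrative choices, not anything standard:

```python
# Plain gradient descent on a one-dimensional "mountain": f(x) = (x - 3)**2.
# The walker only ever knows two things: the slope under its feet
# (the gradient) and how big a step to take (the learning rate).

def gradient(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)**2

x = 0.0              # start somewhere on the mountain
learning_rate = 0.1  # step size

for step in range(50):
    x -= learning_rate * gradient(x)  # step downhill, against the slope

print(round(x, 4))  # ~3.0, the bottom of the valley
```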

Techniques have been developed to reduce the time it takes to get to this optimum point, one of which is called momentum. It's very similar to rolling a ball down the mountain: if you're in a steep area, it rolls downward even faster. Another technique, called annealing, is where you take large steps at first and then smaller ones.
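
A rough sketch of both ideas on the same made-up hill as before. The momentum coefficient and decay schedule below are illustrative values, not prescriptions:

```python
# Momentum treats the walk like a ball rolling downhill: steep stretches
# build up velocity. Annealing shrinks the step size as the walk goes on.
# Same made-up hill as before: f(x) = (x - 3)**2.

def gradient(x):
    return 2 * (x - 3)

x, velocity = 0.0, 0.0
base_lr, beta = 0.1, 0.9  # illustrative step size and momentum coefficient

for step in range(200):
    lr = base_lr / (1 + 0.01 * step)               # annealing: steps shrink over time
    velocity = beta * velocity - lr * gradient(x)  # momentum: velocity carries over
    x += velocity

print(round(x, 2))  # ~3.0: overshoots a little at first, then settles
```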

Probably the most popular version is Stochastic Gradient Descent (SGD), which has the effect of teleporting you around the mountain range, keeping you from getting stuck in a rut somewhere.
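
Here's a toy illustration of the stochastic part: instead of measuring the whole landscape, each step peeks at a single randomly chosen data point. The dataset and learning rate are invented for the example:

```python
import random

# SGD sketch: estimate the gradient from one random data point at a time.
# Fitting a single parameter m to minimize the average of (m - x_i)**2;
# the true optimum is the mean of the data.

data = [2.0, 3.5, 4.0, 2.5, 3.0]  # made-up points; optimum m = mean = 3.0
m = 0.0
learning_rate = 0.05

random.seed(0)
for step in range(1000):
    x_i = random.choice(data)           # one noisy glimpse of the landscape
    m -= learning_rate * 2 * (m - x_i)  # step using that sample's gradient alone

print(round(m, 2))  # close to 3.0; SGD jitters around the optimum rather than settling exactly
```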

There are quite a few varieties of gradient descent, all with the same purpose: find the minimum. Find the right minimum, find the optimum solution.

Optimizing Life

When asked if I'm hungry, instead of a yes or no, I start thinking about how eating now would benefit me versus 30 minutes from now. I do this a lot, with everything: making plans with friends, which road to take, chocolate or vanilla, etc. (This will annoy your partner, and I do not recommend it.)

You can see how someone with my personality was attracted to the concept of optimization. When I decided to study data science, I spent a month finding all the best resources and building a curriculum. In the end, I felt I had all my bases covered. Coding from Corey Schafer, linear algebra from Strang, probability from Blitzstein, machine learning from Hastie/Tibshirani/Witten, neural networks from Karpathy, etc.

For six months I grinded through everything. Before that, I had never used a Greek symbol in math nor coded anything more than a for-loop. Grinding really was the appropriate word for it. Sometimes it felt like I was stuck in quicksand, fighting to move forward even a few inches.

Optimizing everything ran my brain's CPU higher than Google Chrome on Windows. I'm surprised I lasted as long as I did. Looking back on it, I wasted a lot of time just letting my brain cool down from doing something I was no longer enjoying.

In the end, I'm glad I did it, but I don't know if I'd take the same approach again.

What does this have to do with Gradient Descent?

My approach was probably most similar to Batch Gradient Descent. For those who don't know, it is not the most efficient method, especially when applied to a problem with a large set of features (subjects, in my case). It traces the straightest line to the goal but is the slowest, and the most painful to watch, of the variations (you can see this in any video comparing the different techniques).
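
For contrast with the SGD sketch above, batch gradient descent averages the gradient over every data point before taking a single step. Same invented dataset:

```python
# Batch gradient descent: every step looks at ALL the data before moving.
# Same made-up dataset as the SGD sketch; optimum m = mean = 3.0.

data = [2.0, 3.5, 4.0, 2.5, 3.0]
m = 0.0
learning_rate = 0.05

for step in range(200):
    # Average the per-point gradients of (m - x_i)**2 over the whole dataset.
    grad = sum(2 * (m - x_i) for x_i in data) / len(data)
    m -= learning_rate * grad

print(round(m, 2))  # 3.0: a straight, steady march, but each step costs a full pass
```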

After taking a break and looking back on the experience, something about it all reminded me of Stoicism. Going back to the analogy of standing blindfolded in the mountain range, the algorithm doesn't know where the optimum is. It only knows two things: the slope beneath its feet and how big a step it will take next.

It was somewhat comforting to think that even these algorithms, specifically designed to find the perfect solution, are, at any given point, just as lost as I am.

Here are some other conclusions this thought experiment led me to. They might not be new, but connecting them to something rooted in mathematics made me appreciate them a bit more.

  • No matter what method you choose, you'll never reach your goal in a single step.
  • The fastest path often isn't a straight line.
  • Some time spent randomly exploring can help you put your previous assumptions into perspective, preventing you from stopping at local minima.
  • Ride the momentum when you feel the pull. You'll go farther with less effort.
  • Every map is different. Areas where one experiences momentum will differ from person to person.
  • Use methods that fit within your resource limits, mainly time and energy.
  • One problem's local minimum is another's global, and vice versa.
  • Even with the best techniques, convergence can be tricky. Often 'good enough' and moving on to the next parameter is better than perfection in one aspect of a model.

If you made it this far, you might have noticed this is not the best-written work of the last century. It’s more of a ramble. But this ramble is more real than any coherent, perfect thought that’s never left my mind.

I read a joke somewhere that says it best: “If a Jupyter notebook found the cure for cancer, would anyone hear it?” There’s also a fortune cookie I used to carry with me that said, “thoughts without actions will never be bigger than the neuron that created it”.

This brings us to our last bullet point.

  • It all starts with a single, imperfect step.

Note: I haven't started looking at how superconvergence might apply, but it's on my to-do list.
