Gradient-free Optimization

Optimizers other than gradient descent


Mathematical optimization is covered in Part 3 of my book across two chapters. One explores widely used optimization techniques, including hyperparameter tuning of machine learning models; the other covers optimization with gradient descent as well as gradient-free optimization. Which method suits a given problem depends entirely on the objective function we want to optimize.

Gradient Descent

Let’s recapitulate gradient descent.

It is an optimization algorithm used in regression models and neural networks to minimize a function called the loss function (also called the cost function). It works by iteratively adjusting the model’s parameters in the direction of the steepest descent of the function, aiming to reach the minimum.


Gradient & Hessian

Let us also recapitulate gradient and Hessian.

The gradient is the vector of first-order partial derivatives of a scalar field; it points in the direction of steepest ascent, guiding you through the landscape.


The Hessian is the matrix of second-order partial derivatives of a scalar field. The Hessian not only guides you, it tells you whether you are climbing, falling, or stuck in the landscape: those points are the function’s extrema, saddle points, and inflection points.


And the Jacobian is the matrix of first-order partial derivatives of a vector field: the gradients of its component functions stacked as rows.

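As a quick sanity check on these definitions, the gradient can be approximated numerically with central differences. The function f below is a hypothetical example chosen for illustration, not one from the book:

```python
# Numerically approximating the gradient of f(x, y) = x**2 + 3*y**2
# at a point, using central differences.
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + 3 * y ** 2

def gradient(f, p, h=1e-5):
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        # Central difference in coordinate i.
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0])
print(gradient(f, p))    # analytic gradient is (2x, 6y) = (2, 12)
```

The Hessian can be approximated the same way by differencing the gradient itself.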

Optimization problems

An objective function defines the goal of the optimization problem, specifying what needs to be maximized or minimized. It is the function we try to optimize (make as large or as small as possible) subject to the given constraints of the problem.

Based on the types of objective function and constraint variables, optimization problems fall into three broad categories.


Optimization uses a rigorous mathematical model to determine the most efficient solution to a given problem. One must first identify an objective function (loss or cost): a quantitative measure such as profit, cost, or energy. The objective is subject to constraints such as resources and time.

▶️ Linear Programming (LP)

Example:

Objective function: 
   The goal is to maximize total profit; products A and B sell for $25 and $20 respectively

Constraints:
   1. Product A requires 20 resource units, product B 12 units
   2. Only 1800 resource units are available per day
   3. Both products require a production time of 1/15 hour
   4. A working day has a total of 8 hours

Mathematical Formulation:

   With x1 = #items of product A and x2 = #items of product B, maximize total sales:

      maximize    25·x1 + 20·x2
      subject to  20·x1 + 12·x2 ≤ 1800        (resource constraint)
                  (1/15)·x1 + (1/15)·x2 ≤ 8   (time constraint)
                  x1, x2 ≥ 0

Solution: Optimal x1 and x2 are 45 and 75 respectively, for a maximum daily profit of $2625.


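This LP can be solved in a few lines with SciPy’s `linprog`. Since `linprog` minimizes, the profit coefficients are negated to turn maximization into minimization:

```python
# Solving the product-mix LP with SciPy's linprog.
from scipy.optimize import linprog

c = [-25, -20]                      # negated: maximize 25*x1 + 20*x2
A_ub = [[20, 12],                   # resource units per item of A and B
        [1/15, 1/15]]               # production hours per item
b_ub = [1800, 8]                    # daily resource and time budgets

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)        # optimal production quantities -> [45, 75]
print(-res.fun)     # maximum daily profit -> 2625
```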

▶️ Quadratic Programming (QP)

Quadratic programs are the first step beyond linear programming in convex optimization. In chapter 4 of my book, I briefly touched upon convex functions and their optimization.


When we talk about convex and concave functions, we think of the Hessian matrix.


The Hessian matrix has two important uses: to determine whether a function is concave or convex, and to determine whether a critical point is a local minimum, a local maximum, or a saddle point. (If the gradient of a function f is zero at some point x, then f has a critical point at x.)

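The eigenvalue test makes this concrete. For a hypothetical function f(x, y) = x² − y², the Hessian is constant, and its mixed-sign eigenvalues reveal a saddle:

```python
# Classifying a critical point of f(x, y) = x**2 - y**2 (a saddle)
# by the eigenvalues of its Hessian.
import numpy as np

# Hessian of f at any point: [[2, 0], [0, -2]]
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues = np.linalg.eigvalsh(H)
if np.all(eigenvalues > 0):
    kind = "local minimum"          # positive definite Hessian
elif np.all(eigenvalues < 0):
    kind = "local maximum"          # negative definite Hessian
else:
    kind = "saddle point"           # mixed signs
print(kind)                         # -> saddle point
```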

The LASSO regularization technique used in regression problems can be formulated as a QP problem - least squares optimization with linear inequality constraints.
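As a minimal sketch of solving a convex QP (a toy problem, not the full LASSO formulation), SciPy’s general-purpose `minimize` can handle a small quadratic objective with nonnegativity bounds:

```python
# A minimal QP sketch: minimize 1/2 x^T Q x + c^T x subject to x >= 0,
# using SciPy's general-purpose minimize rather than a dedicated QP solver.
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # positive definite -> convex QP
c = np.array([-1.0, -1.0])

def objective(x):
    return 0.5 * x @ Q @ x + c @ x

res = minimize(objective, x0=np.zeros(2), bounds=[(0, None), (0, None)])
print(res.x)    # unconstrained optimum solves Q x = -c, here (2/7, 6/7)
```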

▶️ Non-Linear Programming

A classic example of a problem solved by nonlinear programming is portfolio optimization in finance. There are multiple Python libraries that can be used to model and solve such problems.

Optimization without gradient

Optimization problems can also be categorized strictly by whether or not one can compute the gradient of the objective function. The differentiability of the function dictates which algorithm we choose for solving the problem. This is discussed in chapter 10 of my book.


Optimization methods using the gradient descent algorithm can be called parametric, as they rest on assumptions about the objective, while the ones not using gradient descent are non-parametric.

The space in which we search for a global or local minimum of the cost/objective function can be either continuous or discrete.


An example of optimization problem with discrete variables in the objective function is the travelling salesman problem (TSP). It asks, “what’s the shortest possible route visiting every city once and returning to the start?” As the number of cities grows, brute-force solutions become computationally infeasible.

A Self-organizing Map (SOM) can be implemented to find sub-optimal solutions of TSP.
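To see why brute force fails, here is an exhaustive TSP solver for a handful of hypothetical cities; the number of tours grows as (n−1)!, which is exactly why heuristics like SOM are needed as n grows:

```python
# Brute-force TSP: enumerate every tour and keep the shortest.
# Feasible only for a small number of cities.
from itertools import permutations
import math

cities = {"A": (0, 0), "B": (1, 5), "C": (5, 2), "D": (6, 6), "E": (8, 3)}

def tour_length(order):
    # Close the loop back to the starting city.
    route = list(order) + [order[0]]
    return sum(math.dist(cities[route[i]], cities[route[i + 1]])
               for i in range(len(order)))

names = list(cities)
# Fix the first city so rotations of the same tour aren't counted twice.
best = min((tuple([names[0]]) + p for p in permutations(names[1:])),
           key=tour_length)
print(best, round(tour_length(best), 2))
```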

Gradient-free optimizers

Well-known optimization algorithms for objective functions with continuous variables are simulated annealing, particle swarm optimization, and genetic algorithms. These methods do not use gradients to optimize the model.

➡️ Simulated Annealing

Simulated annealing is a probabilistic metaheuristic for approximating the global optimum, modeled on a physical annealing process in which the system’s energy is minimized. A hill climbing algorithm is a very basic optimizer that explores a local search space: it starts at an initial point, often chosen randomly, and keeps moving to positions within its neighbourhood that offer a better solution. To execute it, we need to define the search space, the step size, the number of iterations, and an objective function.

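Those ingredients fit in a few lines. Below is a minimal hill-climbing sketch on an assumed 1-D toy objective with a fixed step size:

```python
# Minimal hill climbing on a 1-D objective: accept a neighbour
# only if it improves on the current position.
import random

def objective(x):
    return (x - 3) ** 2            # minimum at x = 3

def hill_climb(low=-10.0, high=10.0, step=0.1, iterations=1000, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(low, high)     # random initial position
    for _ in range(iterations):
        # Propose a neighbour; move only if the objective improves.
        candidate = x + rng.uniform(-step, step)
        if low <= candidate <= high and objective(candidate) < objective(x):
            x = candidate
    return x

best = hill_climb()
print(best)    # close to 3
```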

Simulated annealing chooses its next candidate position much like hill climbing, but it also accepts worse solutions with a probability that decreases over time. It simulates a temperature that drops with each iteration, similar to a material cooling down. One of the algorithm’s parameters is the annealing rate, the multiplicative factor by which the temperature decreases each iteration.

With this multiplicative scheme, an annealing rate above 1 would increase the temperature over time, so in practice it is kept below 1.
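A simulated-annealing sketch on the same kind of 1-D toy objective, with multiplicative cooling and an assumed annealing rate of 0.99:

```python
# Simulated annealing: accept improvements always, and worse moves
# with probability exp(-delta / T), which shrinks as T cools.
import math
import random

def objective(x):
    return (x - 3) ** 2            # minimum at x = 3

def simulated_annealing(low=-10.0, high=10.0, step=0.5,
                        t0=10.0, rate=0.99, iterations=2000, seed=1):
    rng = random.Random(seed)
    x = rng.uniform(low, high)
    temperature = t0
    for _ in range(iterations):
        candidate = min(high, max(low, x + rng.uniform(-step, step)))
        delta = objective(candidate) - objective(x)
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
        temperature *= rate        # multiplicative cooling schedule
    return x

print(simulated_annealing())       # close to 3
```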

➡️ Genetic Algorithm

An evolutionary algorithm like GA is a stochastic, metaheuristic approach to solving problems that would take too long to process exhaustively with deterministic methods.

GA is inspired by the process of natural selection and belongs to the larger class of evolutionary algorithms. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems via biologically inspired operators such as selection, crossover, and mutation.

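A minimal GA sketch on the classic “OneMax” toy problem (maximize the number of 1s in a bit string), showing all three operators; the population size, generation count, and mutation rate are assumed for illustration:

```python
# Genetic algorithm with tournament selection, one-point crossover,
# and bit-flip mutation on the OneMax problem.
import random

rng = random.Random(42)
GENES, POP, GENERATIONS = 20, 30, 60

def fitness(ind):
    return sum(ind)                    # count of 1 bits

def tournament(pop):
    return max(rng.sample(pop, 3), key=fitness)   # selection

def crossover(a, b):
    point = rng.randrange(1, GENES)    # one-point crossover
    return a[:point] + b[point:]

def mutate(ind, p=0.02):
    return [g ^ 1 if rng.random() < p else g for g in ind]

population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(tournament(population),
                                   tournament(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best))                   # close to the optimum of 20
```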

➡️ Particle Swarm Optimization

PSO is a metaheuristic: it makes few or no assumptions about the problem and can search very large spaces of candidate solutions. It was first used to simulate social behavior.

PSO does not use the gradient, which means it does not require the objective of the optimization problem to be differentiable, as classical optimization methods such as gradient descent and quasi-Newton methods do. The caveat is that metaheuristics such as PSO do not guarantee that an optimal solution will ever be found.
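A bare-bones PSO sketch minimizing the 2-D sphere function; the inertia and cognitive/social coefficients (w = 0.7, c1 = c2 = 1.5) are commonly used values, assumed here for illustration:

```python
# Particle swarm optimization: each particle is pulled toward its own
# best-known position and the swarm's best-known position.
import random

rng = random.Random(7)
DIM, SWARM, STEPS = 2, 20, 100
W, C1, C2 = 0.7, 1.5, 1.5

def objective(x):
    return sum(v * v for v in x)       # minimum at the origin

positions = [[rng.uniform(-5, 5) for _ in range(DIM)] for _ in range(SWARM)]
velocities = [[0.0] * DIM for _ in range(SWARM)]
personal_best = [p[:] for p in positions]
global_best = min(personal_best, key=objective)[:]

for _ in range(STEPS):
    for i, pos in enumerate(positions):
        for d in range(DIM):
            r1, r2 = rng.random(), rng.random()
            velocities[i][d] = (W * velocities[i][d]
                                + C1 * r1 * (personal_best[i][d] - pos[d])
                                + C2 * r2 * (global_best[d] - pos[d]))
            pos[d] += velocities[i][d]
        if objective(pos) < objective(personal_best[i]):
            personal_best[i] = pos[:]
            if objective(pos) < objective(global_best):
                global_best = pos[:]

print(global_best, objective(global_best))   # near the origin
```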

Summary

Types of metaheuristic algorithms ⤵️


Local search -> The algorithm works by choosing new positions within a neighbourhood of the previous positions. It is recommended for convex optimization problems.

Global search -> The algorithm works by choosing new positions independently of the previous positions. It is recommended for non-convex optimization problems.