The Perfect Activation

Pol Monroig Company

It might be too bold to call an activation function perfect, given that the No Free Lunch Theorem of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, misleading as the title may be, I will try to summarize the most widely used activation functions and describe their main differences.

Linear (identity)

The linear activation function is essentially no activation at all.
Overhead: fastest, no computation at all
Performance: poor in hidden layers, since a stack of linear layers collapses into a single linear transformation and cannot model non-linear relationships
Advantages:

  • Differentiable at all points
  • Fast execution

Common issues:

  • Does not provide any non-linear output.
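To see why the identity activation adds nothing, here is a minimal sketch in plain Python. The weights and biases are made-up values chosen only for illustration: two stacked one-dimensional "layers" with no activation give exactly the same output as one linear layer.

```python
# Two "layers" with identity activation: y = w2 * (w1 * x + b1) + b2
# All weights and biases are hypothetical values picked for the example.
def two_linear_layers(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    hidden = w1 * x + b1       # first layer, identity activation
    return w2 * hidden + b2    # second layer, identity activation

# The equivalent single layer: y = (w2 * w1) * x + (w2 * b1 + b2)
def one_linear_layer(x, w1=2.0, b1=1.0, w2=-3.0, b2=0.5):
    return (w2 * w1) * x + (w2 * b1 + b2)

for x in (-2.0, 0.0, 3.5):
    # Both functions agree up to floating-point rounding
    assert abs(two_linear_layers(x) - one_linear_layer(x)) < 1e-12
```

No matter how many such layers are stacked, the result can always be rewritten as a single linear map, which is why a non-linear activation is needed between layers.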

Sigmoid

The Sigmoid activation function is one of the oldest. Initially designed to mimic neuron activations in the brain, it has been shown to perform poorly in the hidden layers of artificial neural networks. Nevertheless, it is commonly used as a classifier output to transform outputs into class probabilities.

Uses: it is commonly used in the output layer of binary classification where we need a probability value between 0 and 1.
Overhead: relatively expensive, because of the exponential term.
Performance: bad on hidden layers, mostly used on output layers
Advantages:

  • Outputs are bounded between 0 and 1, so activations cannot explode.
  • It is differentiable at every point.

Common issues:

  • Outputs are bounded, so for inputs of large magnitude the function saturates and its gradient approaches 0.
  • Vanishing gradients are possible.
  • Outputs are always positive (zero-centered functions help convergence).

Code:

# PyTorch: a module; apply it as torch.nn.Sigmoid()(x)
torch.nn.Sigmoid()
# TensorFlow: a function that takes the input tensor x
tf.keras.activations.sigmoid(x)
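As a framework-free illustration of the saturation issue listed above, here is a small sketch of Sigmoid and its derivative using only the standard library. The function names are mine, not part of any library.

```python
import math

def sigmoid(x):
    # Numerically stable form: never calls exp() on a large positive number
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for x = 0 and drops below 1e-4 by x = 10, which is what makes gradients vanish when many Sigmoid layers are chained.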

Softmax

A generalization of the Sigmoid function to more than two classes: it transforms a vector of raw scores into one probability per class. Used in multiclass classification.
Uses: used in the output layer of a multiclass neural network.
Overhead: similar to Sigmoid, plus a normalization over all the inputs.
Performance: bad on hidden layers, mostly used on output layers
Advantages:

  • Unlike Sigmoid applied to each output separately, it ensures that the outputs sum to 1, so they form a probability distribution over the classes.

Common issues:

  • Same as Sigmoid.

Code:

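# PyTorch: a module; dim selects the axis that holds the class scores
torch.nn.Softmax(dim=...)
# TensorFlow: a function that takes the input tensor x
tf.keras.activations.softmax(x)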
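For intuition, here is a minimal plain-Python sketch of Softmax. Subtracting the maximum before exponentiating is a common numerical-stability trick and an addition of mine, not something the frameworks require you to do yourself.

```python
import math

def softmax(logits):
    # Subtracting the maximum does not change the result,
    # but it keeps exp() from overflowing on large scores
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With two classes and one score fixed at 0, Softmax reduces to Sigmoid
x = 1.5
two_class = softmax([x, 0.0])[0]
sigmoid_x = 1.0 / (1.0 + math.exp(-x))
print(two_class, sigmoid_x)
```

The outputs always sum to 1 (up to rounding), and the two-class case matches Sigmoid, which is the sense in which Softmax generalizes it.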

Hyperbolic Tangent

Tanh has the same S shape as Sigmoid. In fact, it is a scaled and shifted Sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, and it works better in most cases.
Uses: generally used in hidden layers, since its outputs lie between -1 and 1 and are zero-centered, which tends to make learning faster.
Overhead: relatively expensive, since it uses exponential terms.
Performance: similar to Sigmoid but with some added benefits
Advantages:

  • Outputs are bounded between -1 and 1, so activations cannot explode.
  • It is differentiable at every point.
  • It is zero-centered, unlike Sigmoid.

Common issues:

  • Vanishing gradients.
  • Gradient saturation.

Code:

# PyTorch: a module; apply it as torch.nn.Tanh()(x)
torch.nn.Tanh()
# TensorFlow: a function that takes the input tensor x
tf.keras.activations.tanh(x)
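The relationship between Tanh and Sigmoid can be checked numerically. The sketch below uses only the standard library and verifies the identity tanh(x) = 2 * sigmoid(2x) - 1 at a few points; the helper names are mine.

```python
import math

def sigmoid(x):
    # Plain definition, fine for the moderate inputs used here
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # tanh is a Sigmoid stretched to (-1, 1) and compressed along x
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_from_sigmoid(x) - math.tanh(x)) < 1e-12

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

print(tanh_grad(0.0), tanh_grad(5.0))
```

The gradient is 1 at x = 0, four times the maximum of Sigmoid, but it still shrinks towards 0 for large inputs, so Tanh saturates as well.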

ReLU

ReLU, also called the rectified linear unit, is one of the most commonly used activations, both for its computational efficiency and its great performance. Multiple variations have been created to address its flaws.
Uses: a strong default for hidden layers, as it usually performs better than Tanh and Sigmoid and is cheaper to compute.
Overhead: Almost none, extremely fast.
Performance: great performance, recommended for most cases.
Advantages:

  • Adds non-linearity to the network.
  • Does not suffer from vanishing gradients for positive inputs, where its gradient is always 1.
  • Does not saturate for positive inputs.

Common issues:

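  • It suffers from the dying ReLU problem: a unit whose input is always negative outputs 0, receives zero gradient and stops learning.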
  • Not differentiable at x = 0

Code:

# PyTorch: a module; apply it as torch.nn.ReLU()(x)
torch.nn.ReLU()
# TensorFlow: a function that takes the input tensor x
tf.keras.activations.relu(x)
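To make the dying ReLU issue concrete, here is a small sketch with a single hypothetical neuron. The weight, bias and inputs are invented for the example, and using a gradient of 0 at x = 0 is a common convention rather than a mathematical fact.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def relu_grad(x):
    # Not differentiable at x = 0; returning 0 there is a common convention
    return 1.0 if x > 0.0 else 0.0

# Hypothetical neuron whose pre-activation is negative for every input
w, b = 0.5, -10.0
inputs = [0.1, 1.0, 2.5, 4.0]
pre_activations = [w * x + b for x in inputs]

# Gradient of the output with respect to w, one value per input
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
print(grads_w)  # every entry is 0.0, so gradient descent cannot move w
```

Once a unit ends up in this regime for all training inputs, no update can bring it back, which is the problem the variations below try to address.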

Leaky ReLU

ReLU suffers from the dying ReLU problem, where negative inputs are mapped to 0 and therefore pass back no gradient. Leaky ReLU tries to diminish the problem by replacing the 0 output with the input multiplied by a very small slope.
Uses: used in hidden layers.
Overhead: same as ReLU
Performance: great performance if the hyperparameter is chosen correctly
Advantages:

  • Similar to ReLU, and mitigates the dying ReLU problem by keeping a small gradient for negative inputs.

Common issues:

  • New hyperparameter to tune.

Code:

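# PyTorch: a module; negative_slope is the leak hyperparameter
torch.nn.LeakyReLU(negative_slope=...)
# TensorFlow: a layer (the argument is named negative_slope in Keras 3)
tf.keras.layers.LeakyReLU(alpha=...)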
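A minimal sketch of Leaky ReLU and its gradient in plain Python follows. The slope of 0.01 is only an illustrative value; it is the hyperparameter you would tune.

```python
def leaky_relu(x, negative_slope=0.01):
    # Positive inputs pass through; negative inputs are scaled by a small slope
    return x if x > 0.0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    # The gradient for negative inputs is small but never exactly 0
    return 1.0 if x > 0.0 else negative_slope

for x in (-5.0, -0.5, 0.5, 5.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient for negative inputs is the slope rather than 0, a unit that only sees negative inputs can still be updated.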

Parametric ReLU

Takes the same idea as Leaky ReLU, but instead of predefining the leak hyperparameter, it is added as a parameter that must be learned.
Uses: used in hidden layers.
Overhead: a new parameter must be learned for each PReLU in the network.
Performance: comparable to Leaky ReLU, and sometimes better, since the slope adapts to the data
Advantages:

  • Removes the need to tune a hyperparameter

Common issues:

  • The learned parameter is not guaranteed to be optimal, and it increases the overhead, so you might as well try a few values yourself with Leaky ReLU.

Code:

# Pytorch 
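torch.nn.PReLU()  # a module; the slope is a learnable parameter, not a constructor input
# TensorFlow
tf.keras.layers.PReLU()  # a layer; apply it to the input tensor after construction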
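To illustrate what "learning the slope" means, here is a toy sketch in plain Python. Everything in it is hypothetical: the data, the target slope of 0.2, the starting value of 0.25 and the learning rate are chosen only so that the example converges quickly. It fits the slope with gradient descent on a squared error.

```python
def prelu(x, alpha):
    return x if x > 0.0 else alpha * x

def prelu_grad_alpha(x):
    # Derivative of prelu with respect to alpha: x for non-positive inputs, else 0
    return x if x <= 0.0 else 0.0

# Toy data: targets generated with a "true" slope of 0.2
xs = [-2.0, -1.0, -0.5, 1.0, 3.0]
targets = [prelu(x, 0.2) for x in xs]

alpha, lr = 0.25, 0.1  # assumed starting value and learning rate
for _ in range(200):
    # Gradient of the mean squared error with respect to alpha
    grad = 0.0
    for x, t in zip(xs, targets):
        grad += 2.0 * (prelu(x, alpha) - t) * prelu_grad_alpha(x)
    grad /= len(xs)
    alpha -= lr * grad

print(round(alpha, 4))
```

Only the negative inputs contribute to the gradient of the slope, so the parameter is learned from exactly the region where PReLU differs from ReLU.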

ELU

The ELU was introduced as another alternative to fix the issues that you can encounter with ReLU.
Uses: used in hidden layers
Overhead: computationally expensive, since it uses an exponential term
Performance: comparable to ReLU and its variants in hidden layers, at a higher computational cost
Advantages:

  • Similar to ReLU.
  • Produces negative outputs.
  • Bends smoothly, unlike Leaky ReLU.
  • Differentiable at x = 0 (when alpha = 1)

Common issues:

  • Additional hyperparameter (alpha)

Code:

# PyTorch: a module; apply it as torch.nn.ELU()(x)
torch.nn.ELU()
# TensorFlow: a function that takes the input tensor x
tf.keras.activations.elu(x)
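For reference, ELU is x for positive inputs and alpha * (exp(x) - 1) otherwise. The sketch below, in plain Python with helper names of my own, shows the smooth bend and how the left-hand slope at 0 equals alpha.

```python
import math

def elu(x, alpha=1.0):
    # expm1(x) computes exp(x) - 1 accurately for small x
    return x if x > 0.0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    # Slope is 1 on the right; alpha * exp(x) on the left
    return 1.0 if x > 0.0 else alpha * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, round(elu(x), 5), round(elu_grad(x), 5))

# The left-hand slope at 0 is alpha, so the function is smooth there only when alpha = 1
print(elu_grad(0.0, alpha=1.0), elu_grad(0.0, alpha=0.5))
```

Negative inputs saturate towards -alpha instead of being cut to 0, which keeps a non-zero gradient for moderately negative values.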

Other alternatives

There are too many activation functions to cover them all in a single post. Here are some:

  • SELU
  • GELU
  • CELU
  • Swish
  • Mish
  • Softplus

Note: if the name ends in LU, it is usually a variation of ReLU.
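As a rough sketch of three of the functions listed above, here are Softplus, Swish and Mish in plain Python using their usual definitions. Swish is shown with its scaling parameter fixed to 1, which is an assumption made to keep the example short.

```python
import math

def softplus(x):
    # log(1 + exp(x)), rewritten so exp() never overflows
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def swish(x):
    # x * sigmoid(x), using sigmoid(x) = 0.5 * (1 + tanh(x / 2)) for stability
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))

def mish(x):
    # x * tanh(softplus(x))
    return x * math.tanh(softplus(x))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(softplus(x), 4), round(swish(x), 4), round(mish(x), 4))
```

All three are smooth everywhere and behave like ReLU for large positive inputs, which is why they are often tried as drop-in replacements.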

Summary

So, with so many choices, which activation should we use? As a rule of thumb, start with ReLU in the hidden layers, as it offers great performance with minimal computational overhead. After that (if you have enough computing power), you might want to try some more complex variations of ReLU or similar alternatives. I would never recommend using Sigmoid, Tanh or Softmax for any hidden layer. Sigmoid and Softmax should be used whenever we want probability outputs for a classification task. Finally, with the current pace of research in deep learning and AI, new and better functions will surely appear, so keep an eye out.
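The rule of thumb can be sketched as a tiny forward pass in plain Python: ReLU in the hidden layer, Softmax only at the output. The weights below are hand-picked, hypothetical values, and the helper names are mine.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(v, weights, biases):
    # weights holds one row per output unit
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, biases)]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 classes
W1, b1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.2]
W2, b2 = [[1.0, -0.5, 0.3], [-1.0, 0.5, 0.8]], [0.0, 0.0]

def forward(x):
    hidden = relu(dense(x, W1, b1))        # ReLU in the hidden layer
    return softmax(dense(hidden, W2, b2))  # Softmax only at the output

probs = forward([0.2, 0.7])
print(probs)
```

In a real project the same structure would be written with the PyTorch or TensorFlow layers listed in each section, and the weights would be learned rather than fixed.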

Remember to always experiment; you never know which function will work better for a specific task.
