*Memos:
- My post explains GELU() and Mish().
- My post explains SiLU() and Softplus().
- My post explains Step function, Identity and ReLU.
- My post explains Leaky ReLU, PReLU and FReLU.
- My post explains ELU, SELU and CELU.
- My post explains Tanh, Softsign, Sigmoid and Softmax.
- My post explains Vanishing Gradient Problem, Exploding Gradient Problem and Dying ReLU Problem.
- My post explains layers in PyTorch.
- My post explains loss functions in PyTorch.
- My post explains optimizers in PyTorch.
(1) GELU(Gaussian Error Linear Unit):
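- can convert an input value(x) to an output value by multiplying the input value by its probability under a standard Gaussian distribution, optionally approximated with Tanh. *The output value is 0 only when x = 0.
- 's formula is y = 0.5 * x * (1 + erf(x / √2)). Or, with the Tanh approximation, y = 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x^{3}))). *Both of them get almost the same results.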
- is GELU() in PyTorch.
- is used in:
  - Transformer. *Transformer() in PyTorch.
  - NLP(Natural Language Processing) based on Transformer such as ChatGPT, BERT(Bidirectional Encoder Representations from Transformers), etc. *Strictly speaking, ChatGPT and BERT are based on a Large Language Model(LLM), which is based on Transformer.
- 's pros:
  - It mitigates Vanishing Gradient Problem.
  - It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
- 's cons:
  - It's computationally expensive because of complex operations including Erf(Error function) or Tanh.
- 's graph in Desmos:
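
As a rough numerical check of the claim that the Erf form and the Tanh form of GELU give almost the same results, here is a minimal sketch in plain Python. The helper names `gelu_exact` and `gelu_tanh` are made up for this illustration and are not PyTorch APIs. The constants `sqrt(2 / pi)` and `0.044715` are the ones commonly quoted for the Tanh approximation:

```python
import math


def gelu_exact(x: float) -> float:
    # Erf form: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    # Tanh approximation of the same curve (illustrative helper).
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Print both forms side by side for a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  erf={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

In PyTorch, the same choice should be available as `GELU()` for the Erf form and `GELU(approximate='tanh')` for the Tanh form.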
(2) Mish:
- can convert an input value(x) to an output value by x * Tanh(Softplus(x)). *The output value is 0 only when x = 0.
- 's formula is y = x * tanh(log(1 + e^{x})).
- is Mish() in PyTorch.
- 's pros:
  - It mitigates Vanishing Gradient Problem.
  - It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
- 's cons:
  - It's computationally expensive because of Tanh and Softplus operations.
- 's graph in Desmos:
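
To make `x * Tanh(Softplus(x))` concrete, here is a minimal sketch in plain Python. The helper names `softplus` and `mish` are made up for this illustration, and the rearranged Softplus expression is one common way to avoid overflow, not necessarily what PyTorch does internally:

```python
import math


def softplus(x: float) -> float:
    # log(1 + e^x), rearranged so that exp() never receives a large positive argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Mish: the input scaled by tanh of its Softplus.
    return x * math.tanh(softplus(x))


# Negative inputs give small negative outputs instead of a hard 0.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  mish={mish(x):+.6f}")
```

The small negative outputs for negative inputs are what separate Mish from ReLU, which outputs exactly 0 there.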
(3) SiLU(Sigmoid-Weighted Linear Units):
- can convert an input value(x) to an output value by x * Sigmoid(x). *The output value is 0 only when x = 0.
- 's formula is y = x / (1 + e^{-x}).
- is also called Swish.
- is SiLU() in PyTorch.
- 's pros:
  - It mitigates Vanishing Gradient Problem.
  - It mitigates Dying ReLU Problem. *0 is still produced for the input value 0 so Dying ReLU Problem is not completely avoided.
- 's cons:
  - It's computationally expensive because of Sigmoid.
- 's graph in Desmos:
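
To make `x * Sigmoid(x)` concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `silu` are made up for this illustration, and the two-branch Sigmoid is just one common way to avoid overflow for large negative inputs:

```python
import math


def sigmoid(x: float) -> float:
    # Use whichever form keeps the exp() argument non-positive.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    # SiLU (Swish): the input scaled by its own Sigmoid.
    return x * sigmoid(x)


# Large positive inputs pass through almost unchanged.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  silu={silu(x):+.6f}")
```

Since `x * sigmoid(x)` equals `x / (1 + e^{-x})`, this is the same function as the formula in the section.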
(4) Softplus:
- can convert an input value(x) to an output value between 0 and ∞. *0 is exclusive.
- 's formula is y = log(1+e^{x}).
- is Softplus() in PyTorch.
- 's pros:
  - It normalizes input values.
  - The convergence is stable.
  - It mitigates Vanishing Gradient Problem.
  - It mitigates Exploding Gradient Problem.
  - It avoids Dying ReLU Problem.
- 's cons:
  - It's computationally expensive because of log and exponential operations.
- 's graph in Desmos:
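
To make `y = log(1+e^{x})` concrete, here is a minimal sketch in plain Python. The helper name `softplus` is made up for this illustration, and the rearranged expression is one common way to avoid overflow, not necessarily what PyTorch does internally:

```python
import math


def softplus(x: float) -> float:
    # log(1 + e^x), rearranged so that exp() never receives a large positive argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# The output stays above 0 and approaches x itself as x grows.
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:+.1f}  softplus={softplus(x):.6f}")
```

For moderate inputs the output stays strictly positive, which matches the "0 is exclusive" note. With floating-point numbers, a very large negative input can still round down to 0.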