
Super Kai (Kazuya Ito)


Layers in PyTorch (2)


*Memos:

  • My post explains the Input, Hidden and Output Layers, Fully-connected Layer, Convolutional Layer, Transposed Convolutional Layer, Pooling Layer, Batch Normalization Layer, Layer Normalization, Dropout Layer and Embedding Layer.
  • My post explains activation functions in PyTorch.
  • My post explains loss functions in PyTorch.
  • My post explains optimizers in PyTorch.

(1) Recurrent Layer (1986):

  • can only remember (capture) short-term (short-range) dependencies.
  • uses recurrence but not parallelization.
  • does simpler computation than LSTM, GRU and Transformer.
  • is faster than LSTM and GRU because of its simpler computation, but slower than Transformer because it uses recurrence rather than parallelization.
  • is used for NLP (Natural Language Processing). *NLP is the technology that enables computers to understand and communicate in human language.
  • is also called Recurrent Neural Network (RNN).
  • is RNN() in PyTorch. *My post explains RNN().
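
A minimal sketch of how RNN() can be used. The tensor sizes and hyperparameters below are illustrative assumptions, not values from my other posts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Recurrent Layer: 4 input features per time step, 6 hidden units.
rnn = nn.RNN(input_size=4, hidden_size=6, num_layers=1, batch_first=True)

x = torch.randn(2, 5, 4)  # (batch=2, sequence length=5, features=4)
output, hn = rnn(x)       # output: hidden states for every time step
                          # hn: hidden state of the last time step

print(output.shape)  # torch.Size([2, 5, 6])
print(hn.shape)      # torch.Size([1, 2, 6])
```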

(2) LSTM (Long Short-Term Memory) (1997):

  • can remember (capture) longer-term (longer-range) dependencies than Recurrent Layer.
  • uses recurrence but not parallelization.
  • is an improved version of Recurrent Layer.
  • has 3 gates: forget gate, input gate and output gate. *Memos:
    • Forget gate decides which data to forget from the cell state (long-term memory) using a value between 0 and 1. *For example, 0 means forget everything, 1 means keep everything, and an in-between value such as 0.1 means keep only about 10% of that value, because the gate's output scales the cell state.
    • Input gate decides which data to store in the cell state.
    • Output gate decides the value of the hidden state (short-term memory).
    • In LSTM, the cell state is the memory called long-term memory and the hidden state is the memory called short-term memory.
  • does more complex computation than Recurrent Layer and GRU but less complex computation than Transformer.
  • is slower than Recurrent Layer and GRU because of its more complex computation, and slower than Transformer because it uses recurrence rather than parallelization.
  • can mitigate Vanishing Gradient Problem.
  • is used for NLP.
  • is LSTM() in PyTorch. *My post explains LSTM().
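
A minimal sketch of how LSTM() can be used. The tensor sizes and hyperparameters below are illustrative assumptions, not values from my other posts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# LSTM: 4 input features per time step, 6 hidden units.
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=1, batch_first=True)

x = torch.randn(2, 5, 4)    # (batch=2, sequence length=5, features=4)
output, (hn, cn) = lstm(x)  # hn: hidden state (short-term memory)
                            # cn: cell state (long-term memory)

print(output.shape)  # torch.Size([2, 5, 6])
print(hn.shape)      # torch.Size([1, 2, 6])
print(cn.shape)      # torch.Size([1, 2, 6])
```

Unlike RNN(), LSTM() returns both a hidden state and a cell state, matching the short-term/long-term memory split described above.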

(3) GRU (Gated Recurrent Unit) (2014):

  • can remember (capture) longer-term (longer-range) dependencies than Recurrent Layer, but shorter-term (shorter-range) dependencies than LSTM and Transformer.
  • uses recurrence but not parallelization.
  • is a simplified LSTM.
  • has 2 gates: reset gate and update gate. *Memos:
    • Reset gate decides which data to forget from the hidden state.
    • Update gate decides which data to store in the hidden state.
    • In GRU, the hidden state is the memory, but it is not called long-term memory or short-term memory.
  • does more complex computation than Recurrent Layer but simpler computation than LSTM and Transformer.
  • is slower than Recurrent Layer because of its more complex computation, slower than Transformer because it uses recurrence rather than parallelization, but faster than LSTM because of its less complex computation.
  • can mitigate Vanishing Gradient Problem.
  • is used for NLP.
  • is GRU() in PyTorch. *My post explains GRU().
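
A minimal sketch of how GRU() can be used. The tensor sizes and hyperparameters below are illustrative assumptions, not values from my other posts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# GRU: 4 input features per time step, 6 hidden units.
gru = nn.GRU(input_size=4, hidden_size=6, num_layers=1, batch_first=True)

x = torch.randn(2, 5, 4)  # (batch=2, sequence length=5, features=4)
output, hn = gru(x)       # only a hidden state; GRU has no cell state

print(output.shape)  # torch.Size([2, 5, 6])
print(hn.shape)      # torch.Size([1, 2, 6])
```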

(4) Transformer (2017):

  • can remember (capture) longer-term (longer-range) dependencies than LSTM.
  • uses parallelization but not recurrence.
  • does more complex computation than Recurrent Layer, LSTM and GRU, but is faster than them because it uses parallelization rather than recurrence.
  • can mitigate Vanishing Gradient Problem.
  • is used for NLP.
  • is based on Multi-Head Attention, which is MultiheadAttention() in PyTorch. *LLMs (Large Language Models) are based on Transformer.
  • is Transformer() in PyTorch. *My post explains Transformer().
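
A minimal sketch of how Transformer() and MultiheadAttention() can be used. The tensor sizes and hyperparameters below are illustrative assumptions, not values from my other posts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Transformer: d_model=8 must be divisible by nhead=2.
transformer = nn.Transformer(d_model=8, nhead=2,
                             num_encoder_layers=2, num_decoder_layers=2,
                             dim_feedforward=16, batch_first=True)

src = torch.randn(2, 5, 8)  # source sequence (batch=2, length=5, d_model=8)
tgt = torch.randn(2, 3, 8)  # target sequence (batch=2, length=3, d_model=8)

output = transformer(src, tgt)
print(output.shape)  # torch.Size([2, 3, 8])

# Multi-Head Attention on its own, as used inside Transformer:
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
attn_output, attn_weights = mha(query=tgt, key=src, value=src)
print(attn_output.shape)  # torch.Size([2, 3, 8])
```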
