RNNs

1. Sequence Problems

one to many
- Text or music generation
many to one
- Time series forecasting
- Sentiment classification
many to many
- Translation
- Speech recognition and generation
- Image captioning
- Question Answering

Q) Why not use feedforward networks?

→ Pad every sequence (so it matches the max length)

→ That padding costs memory

→ The cost grows linearly in the number of timesteps

→ A matrix that maps the flattened sequence to the output ~ overkill

→ Ignores the sequential nature of the problem
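The padding and flattening cost above can be sketched in a few lines. This is a minimal illustration, not a real model: `pad_sequences` and `dense_weight_count` are hypothetical helper names, and the feature and output sizes (8 and 16) are arbitrary assumptions.

```python
def pad_sequences(seqs, pad_value=0.0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]


def dense_weight_count(max_len, feature_dim, output_dim):
    """Weights in one dense layer that maps the flattened sequence to the output."""
    return max_len * feature_dim * output_dim


seqs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0, 7.0]]
padded = pad_sequences(seqs)
print(padded)  # [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0], [4.0, 5.0, 6.0, 7.0]]

# The weight matrix grows linearly with the maximum number of timesteps.
for max_len in (10, 100, 1000):
    print(max_len, dense_weight_count(max_len, feature_dim=8, output_dim=16))
# 10 1280
# 100 12800
# 1000 128000
```

Most of the padded entries carry no information, yet each one still gets its own weights in the flattened mapping.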

Stateful Computation

→ statefulness

→ The states computed for the samples of one batch are reused as the initial states for the samples of the next batch
Apply a step function to each input ~ different output
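A rough sketch of stateful computation, assuming a scalar tanh RNN with hand-picked weights and one sequence per batch to keep it small. `rnn_step`, `run_batch`, and the parameter values are illustrative names and numbers, not taken from any library.

```python
import math


def rnn_step(x, h, w_xh, w_hh, b):
    """One step of a scalar RNN: h_new = tanh(w_xh * x + w_hh * h + b)."""
    return math.tanh(w_xh * x + w_hh * h + b)


def run_batch(xs, h0, params):
    """Apply the step function to each input in turn, starting from state h0."""
    h = h0
    outputs = []
    for x in xs:
        h = rnn_step(x, h, *params)
        outputs.append(h)
    return outputs, h


params = (0.5, 0.8, 0.0)  # (w_xh, w_hh, b), arbitrary values
batch_1 = [1.0, 1.0, 1.0]
batch_2 = [1.0, 1.0, 1.0]

out_1, h = run_batch(batch_1, 0.0, params)
out_2, h = run_batch(batch_2, h, params)              # stateful: reuse the final state
out_2_stateless, _ = run_batch(batch_2, 0.0, params)  # stateless: reset to zero
```

Even though every input is `1.0`, the outputs differ from step to step because the state changes, and `out_2` differs from `out_2_stateless` because the second batch starts from the state left by the first.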

→ GOOD for many-to-many problems!
What about many-to-one problems?
- encoder-decoder architecture
- Use $y_t$, the output vector of the last timestep, to summarize the whole sequence
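As a hedged sketch of the many-to-one case: run the recurrence over the whole sequence, keep only the last output, and feed it to a small classifier head. The scalar RNN, the sigmoid head, and all weights here are assumptions chosen for illustration (for example, a toy sentiment score).

```python
import math


def rnn_outputs(xs, w_xh=0.5, w_hh=0.8, b=0.0):
    """Return the output at every timestep of a scalar tanh RNN."""
    h, ys = 0.0, []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h + b)
        ys.append(h)
    return ys


def classify_sequence(xs, w_out=2.0, b_out=0.0):
    """Many-to-one: only the last timestep's output feeds the classifier."""
    y_last = rnn_outputs(xs)[-1]
    return 1.0 / (1.0 + math.exp(-(w_out * y_last + b_out)))  # sigmoid score


score_short = classify_sequence([0.2, -0.1, 0.9])
score_long = classify_sequence([0.2, -0.1, 0.9, -0.4, 0.3, 0.7])
```

Sequences of different lengths both reduce to a single score, with no padding needed.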
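What about one-to-many?
- encoder-decoder architecture
  
  → Think of them as separate networks (they are not actually separate! this is just to make it easier to understand)
- Image → text output
- Take a single vector from the last couple of fully connected layers and compute from it
  
  → hidden state of the model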
What about many-to-many?
- encoder-decoder architecture
  
  → both are RNNs
- The most common case, e.g. machine translation
- Use the previous step's output as the next step's input
- Use the output computed by the encoder as the decoder's initial input
  
  → repeat
  
  → encoder output ~ hidden state vector
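The encoder-decoder loop described above might look like the following toy sketch. It assumes scalar tanh RNNs with arbitrary weights, treats each step's output as equal to its hidden state, and hands the encoder's final hidden state to the decoder as its starting state. `encode`, `decode`, and `start_input` are hypothetical names.

```python
import math


def step(x, h, w_xh, w_hh):
    """One scalar RNN step."""
    return math.tanh(w_xh * x + w_hh * h)


def encode(source, w_xh=0.5, w_hh=0.8):
    """Read the whole source sequence and return the final hidden state."""
    h = 0.0
    for x in source:
        h = step(x, h, w_xh, w_hh)
    return h


def decode(context, num_steps, w_xh=0.7, w_hh=0.6, start_input=0.0):
    """Start from the encoder's state, then feed each output back in as the next input."""
    h = context
    y = start_input
    outputs = []
    for _ in range(num_steps):
        h = step(y, h, w_xh, w_hh)
        y = h  # toy assumption: output equals hidden state
        outputs.append(y)
    return outputs


context = encode([1.0, -0.5, 0.25])
translation = decode(context, num_steps=4)
```

The source and target lengths are independent: the encoder consumed three inputs and the decoder produced four outputs. A real translation model would work on vectors and stop on an end-of-sequence token rather than after a fixed number of steps.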

RNN Desiderata
Vanilla RNNs
Vanishing gradients
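- For sigmoid and tanh, the derivative is << 1 (approaches 0) as the input grows in magnitude
- The gradient magnitude shrinks at every step → vanishing gradients
- ReLU does not saturate → gradients can explode instead (exploding gradients)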
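A small numeric sketch of both effects, assuming a scalar RNN $h_t = f(w_{hh} h_{t-1} + 1)$ with a constant input of 1. In this scalar case the gradient of the last state with respect to the first is the product $\prod_t w_{hh} f'(a_t)$, which the loop accumulates. The function name, the weight 1.5, and the 50 steps are arbitrary choices for illustration.

```python
import math


def grad_through_time(w_hh, num_steps, activation):
    """Accumulate d h_T / d h_0 for a scalar RNN as a product of per-step factors."""
    h, grad = 0.5, 1.0
    for _ in range(num_steps):
        a = w_hh * h + 1.0  # constant input of 1.0 keeps the pre-activation positive
        if activation == "tanh":
            h = math.tanh(a)
            local = 1.0 - h * h  # tanh'(a), close to 0 once tanh saturates
        elif activation == "relu":
            h = max(0.0, a)
            local = 1.0 if a > 0 else 0.0  # ReLU'(a)
        else:
            raise ValueError(f"unknown activation: {activation}")
        grad *= w_hh * local
    return grad


tanh_grad = grad_through_time(1.5, 50, "tanh")
relu_grad = grad_through_time(1.5, 50, "relu")
print(tanh_grad, relu_grad)
```

With the same weight, the tanh gradient collapses toward zero because each saturated step contributes a tiny factor, while the ReLU gradient equals `1.5 ** 50` because every step contributes the full factor of 1.5.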