How does ChatGPT work


Created: 14 Mar 2023, 11:09 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, KnowledgeSharing


  • Adding one word at a time: based on the prompt, the model ranks the possible next words by predicted probability, and the next word is chosen at random from that probability list (see the sketch after this list)

    • Temperature controls how much randomness is used when selecting the next word
    • How to get the probabilities?
      • N-grams
        • Count word sequences in a large corpus of text, then predict the next word from the N previous letters / words; difficult, and the output is not great
        • Large number of possible contexts, so most are never seen in the corpus
      • LLM
  • One-hot encoding / vector representation

    • Does not take context into consideration
    • Use embeddings instead
  • Embeddings

    • Similar words are clustered together in embedding space
  • With the new embedding, get a probability for the next word

    • How to select the next word? Is it based on the similarity / distance search to the computed embedding?
    • Once you get the new embedding, how do you get the probabilities of the next word?
    • The probabilities come from a softmax output over all possible (~50k) words in the dictionary, where each value is the probability of that word being the next word in the sentence.
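
A minimal sketch of the loop described above, using a toy vocabulary and random weights instead of a trained model (all names and sizes here are made up for illustration): the context is turned into an embedding, projected to logits over the whole vocabulary, converted to probabilities with a softmax, and the next word is sampled at random, with temperature controlling how peaked the distribution is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and random weights; a real LLM learns these from data.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
d_model = 8
embeddings = rng.normal(size=(len(vocab), d_model))   # one vector per word
output_proj = rng.normal(size=(d_model, len(vocab)))  # hidden state -> logits

def sample_next_word(context_ids, temperature=1.0):
    # Stand-in for the model: average the context embeddings into one hidden state.
    hidden = embeddings[context_ids].mean(axis=0)
    logits = hidden @ output_proj                      # one score per vocabulary word
    # Softmax with temperature: low temperature -> near-greedy, high -> more random.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Draw the next word at random according to its probability.
    return vocab[rng.choice(len(vocab), p=probs)]

context = [vocab.index("the"), vocab.index("cat")]
print(sample_next_word(context, temperature=0.7))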

History:

Transformer: an encoder-decoder architecture, originally for the translation task

BERT

  • Stack of encoder layers
  • Get embeddings for sentences
  • WordPiece tokenisation (sketched after this list)
    • Surfboard → “surf”, “board”
    • Swimming → “swim”, “ing”
    • Will learn the context of the word pieces
  • Trained on 2 tasks
    • Masked language model: mask words in sentences, predict the masked word
    • Next sentence prediction
  • Finetuning
    • In practice, finetuning also needs a fairly large dataset, not the small dataset people often assume is enough
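
The WordPiece splitting can be inspected directly with the Hugging Face transformers library; the exact pieces below are illustrative, since they depend on the vocabulary BERT was trained with.

```python
# Requires: pip install transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into sub-word pieces; continuation pieces are marked "##".
print(tokenizer.tokenize("surfboard"))   # e.g. ['surf', '##board']
print(tokenizer.tokenize("swimming"))    # e.g. ['swimming'] or ['swim', '##ming']
```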

GPT

  • Decoder-only (what does this mean? Start from the embedding?)
  • Pretrained only on next-word prediction
  • No architecture change involved; it is more about finetuning on top of the pretrained model
  • Do finetuning by adding a linear layer (a rough sketch follows this list)
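
A rough sketch of "finetuning by adding a linear layer": keep the pretrained decoder-only backbone and bolt a new task head on top of its last hidden state. The classifier below (class name, pooling choice, number of labels) is an assumption for illustration, not the exact GPT recipe.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class GPTClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        # Backbone pretrained on next-word prediction; its weights are reused as-is.
        self.backbone = GPT2Model.from_pretrained("gpt2")
        # The only new part: a linear layer mapping hidden states to task labels.
        self.head = nn.Linear(self.backbone.config.n_embd, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the last token's hidden state as a summary of the whole sequence.
        return self.head(hidden[:, -1, :])
```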

GPT-2/3

  • No finetuning; uses zero-shot / few-shot prompting instead (examples sketched after this list)
  • Zero shot:
    • “give me result of 1+2”
  • One shot:
    • “1+2=3”, “give me the result of 3+4”
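
A small illustration of the difference; the prompts are made up, and the point is that the "training" happens entirely inside the prompt text rather than in the weights.

```python
# Zero-shot: ask for the task directly, with no examples.
zero_shot_prompt = "Give me the result of 3 + 4."

# One-/few-shot: prepend worked examples so the model can infer the task format.
few_shot_prompt = (
    "1 + 2 = 3\n"
    "5 + 5 = 10\n"
    "3 + 4 ="
)
```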

Combination of encoder and decoder?

  • It has been done - ELECTRA?

Generative capability

  • For generative tasks, decoder-only models (GPT) are better than BERT

Contextual

  • For similarity checks between sentences, BERT is better (a sketch follows below)
  • Since it has the context of the words
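
One simple (if crude) way to do the similarity check with BERT: mean-pool the token embeddings of each sentence and compare them with cosine similarity. Dedicated libraries such as sentence-transformers do this better, but the sketch below shows the idea.

```python
# Requires: pip install transformers torch
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)              # mean-pool over tokens

a = embed("The cat sat on the mat.")
b = embed("A cat is sitting on a rug.")
print(float(torch.cosine_similarity(a, b, dim=0)))    # closer to 1 = more similar
```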

Encoder /decoder

  • Needed when you need the context and also have to generate text afterwards
  • e.g. summarising an article? (sketched below)
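
For a task like summarisation, an encoder-decoder model (e.g. BART or T5) reads the whole article with the encoder and generates the summary with the decoder. A minimal sketch, assuming the Hugging Face summarization pipeline and the facebook/bart-large-cnn checkpoint:

```python
# Requires: pip install transformers
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # the full article text goes here
summary = summarizer(article, max_length=60, min_length=20)
print(summary[0]["summary_text"])
```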

Closest to ChatGPT is InstructGPT (which was published)

  • How to use GPT to overcome misalignment? Using RLHF (reinforcement learning with human feedback)

What is KL Loss?

  • Used to make sure that two output distributions stay similar, so the model does not produce irrelevant outputs
  • It is a divergence measure between two probability distributions (not a symmetric distance metric); a toy computation is sketched below
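
A toy computation of the KL term, assuming made-up logits for a single next-token distribution. In RLHF it is typically used as a penalty that keeps the finetuned policy's output distribution close to the original pretrained model's distribution.

```python
import torch
import torch.nn.functional as F

# Made-up next-token logits for the finetuned policy and the original (reference) model.
policy_logits = torch.tensor([2.0, 1.0, 0.1])
reference_logits = torch.tensor([1.8, 1.1, 0.3])

policy_logp = F.log_softmax(policy_logits, dim=-1)
ref_logp = F.log_softmax(reference_logits, dim=-1)

# KL(policy || reference) = sum_x policy(x) * (log policy(x) - log reference(x))
kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum()
print(float(kl))   # always >= 0; equals 0 only when the two distributions are identical
```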