How does ChatGPT work?
Created: 14 Mar 2023, 11:09 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, KnowledgeSharing
-
Adding one word at a time (based on the prompt, the model ranks predicted next words by probability, then picks the next word by sampling from that probability distribution rather than always taking the top-ranked one)
- Temperature - controls the randomness when sampling the next word (lower = more deterministic, higher = more varied); see the sampling sketch after this list
- How to get probs?
- N-grams
- Difficult, and the output is not great: use some large corpus of text, then based on the N previous letters / words, predict the next word
- The number of possible N-grams is huge
- LLM
- Learns the next-word probabilities instead of counting N-grams
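A minimal sketch of the sampling step above, with made-up candidate words and scores (not from any real model): softmax turns scores into probabilities, and temperature rescales how peaked that distribution is before sampling.

```python
import numpy as np

# Illustrative scores (logits) for a few candidate next words - invented for the example
candidates = ["cat", "dog", "car", "banana"]
logits = np.array([2.0, 1.5, 0.3, -1.0])

def sample_next_word(logits, temperature=1.0):
    # Lower temperature sharpens the distribution (more deterministic),
    # higher temperature flattens it (more random).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

idx = sample_next_word(logits, temperature=0.7)
print(candidates[idx])
```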
-
One-hot encoding / vector representation
- Does not take the context into consideration
- Use embeddings instead
-
Embeddings
- Similar words are clustered together in embedding space
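A toy illustration of "similar words are clustered together": cosine similarity between hand-made 3-dimensional embeddings. Real embeddings are learned and have hundreds of dimensions; these vectors are invented for the example.

```python
import numpy as np

# Hand-made toy embeddings (real ones are learned and much higher-dimensional)
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: similar words sit close together
print(cosine(emb["cat"], emb["car"]))  # lower: less related words are further apart
```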
-
With the new embedding, get a probability for the next word
- How to select the next word? Is it a similarity / distance search against the computed embedding?
- Once you get the new embedding, how to get the probabilities of the next word?
- The probabilities come from a softmax output over all possible (~50k) words in the dictionary, where each value is the probability of that word being the next word in the sentence (see the sketch below)
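A sketch of that last step, assuming an illustrative hidden size of 64 and a 50k-word vocabulary: the final hidden state / embedding is projected by a linear layer (the "LM head") to one logit per vocabulary word, and a softmax turns the logits into next-word probabilities. The weights here are random stand-ins for a trained model.

```python
import numpy as np

vocab_size, hidden_size = 50_000, 64   # illustrative sizes (real hidden sizes are larger, e.g. 768+)

# Pretend this is the final hidden state for the last position in the prompt
hidden_state = np.random.randn(hidden_size)

# The "LM head": a linear projection from hidden size to vocabulary size
W = np.random.randn(vocab_size, hidden_size) * 0.02
logits = W @ hidden_state              # one score per word in the dictionary

probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax: probabilities over all 50k words

print(probs.shape, probs.sum())        # (50000,) ~1.0
print(probs.argmax())                  # index of the most likely next word
```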
History:
Transformer → encoder-decoder architecture, originally for the translation task
BERT
- Stack of encoder layers
- Get embeddings for sentences
- WordPiece tokenisation (see the tokeniser sketch after this list)
- Surfboard ⇒ “surf”, “board”
- Swimming ⇒ “swim”, “ing”
- Will learn the context of the word pieces
- Trained on 2 tasks
- Masked language model: mask words in sentences, predict the masked word
- Next sentence prediction
- Finetuning
- Actually also needs a fairly large dataset, not the small dataset people usually assume finetuning requires
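A rough sketch of the WordPiece idea from the tokenisation bullets above: greedy longest-match-first splitting of a word into known pieces. The mini vocabulary is made up; real WordPiece vocabularies are learned from data and mark continuation pieces with a "##" prefix.

```python
# Made-up mini vocabulary; real WordPiece vocabs are learned and much larger
vocab = {"surf", "board", "swim", "##ming", "##ing", "##board"}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces get a ## prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                  # no known piece fits: unknown token
            return ["[UNK]"]
        start = end
    return pieces

print(wordpiece_tokenize("surfboard", vocab))  # ['surf', '##board']
print(wordpiece_tokenize("swimming", vocab))   # ['swim', '##ming']
```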
GPT
- Decoder only (i.e. only the Transformer decoder stack with causal self-attention is used; there is no encoder or cross-attention)
- Pretrained only on next-word prediction
- No architecture change involved; it is more about finetuning on top of the pretrained model
- Finetune by adding a linear layer (see the sketch after this list)
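A minimal PyTorch-style sketch of "finetune by adding a linear layer": a small classification head on top of the pretrained model's final hidden state. The hidden states here are random stand-ins; in practice they would come from the pretrained GPT/BERT, whose weights can be frozen or updated during finetuning.

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 768, 2   # illustrative sizes (e.g. sentiment: positive / negative)

class ClassifierHead(nn.Module):
    """Linear layer added on top of a pretrained model for finetuning."""
    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_state):
        return self.linear(hidden_state)   # logits over the task's classes

# Stand-in for the pretrained model's output for a batch of 4 inputs
hidden_states = torch.randn(4, hidden_size)

head = ClassifierHead(hidden_size, num_classes)
logits = head(hidden_states)
print(logits.shape)                         # torch.Size([4, 2])
```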
GPT-2/3
- No finetuning; instead zero-shot / few-shot prompting (see the prompt sketch after this list)
- Zero-shot:
- "give me the result of 1+2"
- One-shot:
- "1+2=3", "give me the result of 3+4"
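A small illustration of the difference: a zero-shot prompt contains only the task, while one-/few-shot prompts also include worked examples. These are just strings; no model is called here.

```python
zero_shot = "Give me the result of 1+2"

one_shot = (
    "1+2=3\n"             # one worked example shown first
    "Give me the result of 3+4"
)

few_shot = (
    "1+2=3\n"
    "2+5=7\n"             # several worked examples
    "Give me the result of 3+4"
)

print(zero_shot, one_shot, few_shot, sep="\n---\n")
```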
Combination of encoder and decoder?
- They have done it - ELECTRA?
Generative capability
- For generative tasks, decoder-only models / GPT are better than BERT
Contextual
- Similarity check between sentences - BERT is better
- Since it has the context of the word
Encoder / decoder
- Needed when you need the context and also need to generate afterwards
- e.g. summarising an article?
Closest to ChatGPT - InstructGPT (that was the one published)
- How to use GPT to overcome misalignment? Using RLHF (reinforcement learning with human feedback)
What is KL Loss?
- To make sure that two output distributions stay similar, so there are no irrelevant outputs
- It is a divergence / dissimilarity metric between probability distributions (see the sketch below)
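A small numeric sketch of KL divergence as a (non-symmetric) dissimilarity measure between two probability distributions; in RLHF it is used as a penalty to keep the finetuned model's output distribution close to the original model's. Both distributions below are made up.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); equals 0 when the distributions match."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]         # e.g. next-word distribution of the original model
q = [0.6, 0.25, 0.15]       # e.g. distribution of the finetuned model

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # small positive number: slightly different
```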