Domain specificity in language models


Created: 28 Mar 2023, 11:12 AM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, KnowledgeSharing


NLP language modelling - essentially classification of the next token given the context of the previous tokens, where the classes are the vocabulary of tokens
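A minimal sketch of this classification framing (numpy only; `next_token_logits` is a hypothetical stand-in for a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]

# Hypothetical stand-in for a trained model: one logit per vocabulary
# entry, computed from the context. Here the logits are just random;
# a real LM would compute them from embeddings and transformer layers.
def next_token_logits(context_tokens):
    return rng.normal(size=len(vocab))

def softmax(logits):
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Classification view: the "classes" are the vocabulary entries.
probs = softmax(next_token_logits(["the", "cat", "sat"]))
print(vocab[int(probs.argmax())])
```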

Word frequencies follow a long-tail distribution: a few words are very frequent, most occur rarely

  • Every token has the same chance of being masked in BERT, regardless of word frequency: rare words are masked as often as frequent words (see the sketch after this list)
  • This may not matter, since the dataset is huge, or
  • Transformers - attention lets a rare word draw signal from its context, so a word that attends strongly to its context can still be represented well even if it occurs infrequently
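A sketch of the uniform selection, assuming the standard 15% masking rate from the BERT paper (simplified: real BERT also replaces some chosen positions with random tokens or leaves them unchanged; the sentence is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["the", "cat", "sat", "on", "the", "xenolith"]  # "xenolith" is rare

# BERT-style masking picks ~15% of positions uniformly at random, so a
# rare word like "xenolith" is masked at the same rate as "the".
mask = rng.random(len(tokens)) < 0.15
print(["[MASK]" if m else t for t, m in zip(tokens, mask)])
```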
  • Top-p (nucleus) sampling
    • Keep a varying number of tokens: the smallest set whose cumulative probability reaches p
    • Unlike top-k, which always keeps a hard, fixed number of tokens
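A rough comparison of the two cutoffs (the probability arrays are made-up next-token distributions):

```python
import numpy as np

def top_k_candidates(probs, k):
    # Top-k: always keep exactly k tokens, however peaked or flat
    # the distribution is.
    return np.argsort(probs)[::-1][:k]

def top_p_candidates(probs, p):
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches p, so the candidate count adapts to the
    # shape of the distribution.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1
    return order[:cutoff]

peaked = np.array([0.85, 0.10, 0.03, 0.01, 0.01])
flat = np.array([0.25, 0.22, 0.20, 0.18, 0.15])

print(top_k_candidates(peaked, 3), top_k_candidates(flat, 3))      # 3 tokens both times
print(top_p_candidates(peaked, 0.9), top_p_candidates(flat, 0.9))  # 2 tokens vs all 5
```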

A perplexity of 1 means the model is not confused at all; larger values mean more confused, up to the vocabulary size (the perplexity of guessing uniformly over all tokens)
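A quick check of both endpoints with made-up per-token probabilities; a model that assigns every true token probability 1/V has perplexity exactly V:

```python
import numpy as np

def perplexity(token_probs):
    # exp of the average negative log-probability the model assigned
    # to the tokens that actually occurred.
    return float(np.exp(-np.mean(np.log(token_probs))))

confident = np.array([0.95, 0.99, 0.90, 0.97])  # near 1: barely confused
uniform = np.full(4, 1 / 50_000)                # clueless over a 50k vocab

print(perplexity(confident))  # ~1.05
print(perplexity(uniform))    # 50000.0: equals the vocabulary size
```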

Does computer vision have a latent space while NLP doesn't?

  • What is a latent space? Latent space vs. embedding space?