Vision Language Models Explained

Tags: omnivore, VLM, transformers


Highlights

Some of these models have a feature called “grounding,” which reduces model hallucinations. ⤴️

MMMU is the most comprehensive benchmark to evaluate vision language models. ⤴️

The main trick is to unify the image and text representations and feed them to a text decoder for generation. ⤴️

The most common and prominent models consist of an image encoder, an embedding projector to align image and text representations (often a dense neural network), and a text decoder, stacked in this order. ⤴️
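
As an illustration only, here is a minimal PyTorch sketch of that stack with toy dimensions and placeholder modules. The class name, layer sizes, and the generic transformer layers are assumptions for readability; real models plug in pretrained vision towers and LLM decoders.

```python
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Toy sketch of the common VLM stack: image encoder -> projector -> text decoder.
    Every module here is an illustrative placeholder, not any specific model."""

    def __init__(self, vision_dim=256, text_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT/CLIP tower).
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projector: maps image features into the text decoder's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )
        # Stand-in for a pretrained language-model decoder (causal masking omitted).
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.text_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_feats, input_ids):
        # image_feats: (batch, num_patches, vision_dim); input_ids: (batch, seq_len)
        vision_tokens = self.projector(self.image_encoder(image_feats))
        text_tokens = self.text_embed(input_ids)
        # Prepend the projected image tokens to the text tokens and decode.
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.lm_head(self.text_decoder(sequence))


# Shapes only, to show how the pieces fit together:
logits = TinyVLM()(torch.rand(1, 16, 256), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```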

LLaVA consists of a CLIP image encoder, a multimodal projector and a Vicuna text decoder. ⤴️
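
As a hedged usage sketch: recent releases of the transformers library ship a LLaVA integration, and the snippet below assumes that plus the community llava-hf/llava-1.5-7b-hf checkpoint; the image path and prompt wording are placeholders.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint on the Hugging Face Hub
processor = AutoProcessor.from_pretrained(model_id)
# device_map="auto" assumes the accelerate package is installed.
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # replace with your own image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The processor tokenizes the text and preprocesses the image in one call.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```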

The authors freeze the image encoder and text decoder and train only the multimodal projector to align the image and text features, feeding the model images and generated questions and comparing the model output to the ground-truth captions. After the projector pre-training, they keep the image encoder frozen, unfreeze the text decoder, and train the projector together with the decoder. ⤴️
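
A minimal sketch of that two-stage recipe in plain PyTorch, reusing the hypothetical TinyVLM stack from the earlier sketch; the helper function, learning rates, and elided training loops are placeholders, not LLaVA's actual hyperparameters.

```python
import torch


def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


model = TinyVLM()  # the illustrative stack sketched earlier

# Stage 1: freeze everything, then train only the multimodal projector.
set_trainable(model, False)
set_trainable(model.projector, True)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
# ... run the image/question pre-training loop, comparing outputs to captions ...

# Stage 2: keep the image encoder frozen, unfreeze the text decoder,
# and train the projector together with the decoder.
set_trainable(model.text_decoder, True)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
# ... run the fine-tuning loop ...
```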

In KOSMOS-2, the authors chose to train the model fully end-to-end, which is computationally more expensive than LLaVA-like pre-training. ⤴️

Fuyu-8B doesn’t even have an image encoder. Instead, image patches are directly fed to a projection layer and then the sequence goes through an auto-regressive decoder. ⤴️
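
A small sketch of that patch-projection idea with made-up patch and embedding sizes; it only shows how raw pixel patches can be flattened and linearly projected into decoder-space tokens, without any vision tower.

```python
import torch
import torch.nn as nn

# Illustrative numbers only: 30x30 RGB patches projected straight into the
# decoder's embedding space with a single linear layer (no image encoder).
patch_size, channels, hidden_dim = 30, 3, 512
patch_projection = nn.Linear(patch_size * patch_size * channels, hidden_dim)

image = torch.rand(1, channels, 300, 300)  # a dummy image batch

# Cut the image into non-overlapping patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(
    1, -1, channels * patch_size * patch_size
)

image_tokens = patch_projection(patches)  # (batch, num_patches, hidden_dim)
# These image tokens are concatenated with text token embeddings and the whole
# sequence is passed to the auto-regressive decoder.
print(image_tokens.shape)  # torch.Size([1, 100, 512])
```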