Testing CLIP directly with BDD images


Created: 07 Feb 2023, 04:21 PM | Tags: knowledge


When just using Euclidean distance to compare the image and text embeddings for image-text pairs:

  • Not very accurate when tested with standard ImageNet
  • Switched to cosine similarity as the distance metric instead (sketched below)
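
A minimal sketch of the two metrics, assuming the openai/CLIP package (linked further down) and an illustrative local frame `bdd_frame.jpg`:

```python
# Compare Euclidean distance vs cosine similarity between one image embedding
# and a few candidate text embeddings. "bdd_frame.jpg" is a placeholder path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize([
    "a photo of a street at night",
    "a photo of a street in the daytime",
    "a photo of a snowy street",
]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()
    txt_emb = model.encode_text(texts).float()

# Euclidean distance on the raw embeddings (what was tried first): lower = closer.
euclid = torch.cdist(img_emb, txt_emb)

# Cosine similarity on L2-normalised embeddings (what worked better): higher = closer.
img_n = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_n = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
cosine = img_n @ txt_n.T

print("euclidean:", euclid)
print("cosine:   ", cosine)
```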

Updated the phrase templates to allow for possibly better descriptions, i.e. prompt engineering: https://medium.com/mlearning-ai/having-fun-with-clip-features-part-i-29dff92bbbcd
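
A rough sketch of the template-averaged zero-shot scoring this refers to; the class names and template wording below are illustrative, not the exact ones used:

```python
# Zero-shot classification with multiple prompt templates per class,
# assuming the openai/CLIP package and a placeholder image path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["clear", "rainy", "snowy", "foggy"]
templates = [
    "a photo taken in {} weather",
    "a dashcam image of a road in {} conditions",
]

with torch.no_grad():
    # Average the text embeddings over all templates for each class
    # (the standard CLIP zero-shot trick).
    class_embs = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    class_embs = torch.stack(class_embs)
    class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

probs = (100.0 * img_emb @ class_embs.T).softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```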

Results with the updated templates:

  • For standard ImageNet
  • For winter scenes: does have some cases where it stated snow / cold weather, but still not great
  • For night scenes: doesn't seem to work very well

Using ClipCap

https://colab.research.google.com/drive/1NXRL7Sj3anwvNOywbU5VZDSdLZ8nBnd1

https://arxiv.org/abs/2111.09734
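
ClipCap maps the CLIP image embedding to a short "prefix" of GPT-2 token embeddings and lets GPT-2 continue it as a caption. A rough architectural sketch of that idea (not the authors' code; without the trained projection weights from the repo above the output is meaningless, this only shows how the pieces fit together):

```python
# Prefix-projection captioning sketch: MLP maps a CLIP embedding to a sequence
# of GPT-2 embeddings, and GPT-2 decodes greedily from that prefix.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LEN = 10          # number of prefix tokens (a ClipCap hyperparameter)
CLIP_DIM = 512           # ViT-B/32 embedding size

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt_dim = gpt2.transformer.wte.weight.shape[1]   # 768 for base GPT-2

# Projection from one CLIP embedding to PREFIX_LEN GPT-2 embeddings
# (untrained here; ClipCap trains this with a language-modelling loss).
prefix_proj = nn.Sequential(
    nn.Linear(CLIP_DIM, gpt_dim * PREFIX_LEN // 2),
    nn.Tanh(),
    nn.Linear(gpt_dim * PREFIX_LEN // 2, gpt_dim * PREFIX_LEN),
)

def caption_from_clip_embedding(clip_emb, max_tokens=20):
    """Greedy decoding conditioned on the projected prefix."""
    prefix = prefix_proj(clip_emb).view(1, PREFIX_LEN, gpt_dim)
    generated = prefix
    token_ids = []
    with torch.no_grad():
        for _ in range(max_tokens):
            out = gpt2(inputs_embeds=generated)
            next_id = out.logits[:, -1, :].argmax(dim=-1)        # greedy pick
            token_ids.append(next_id.item())
            next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)
            generated = torch.cat([generated, next_emb], dim=1)
    return tokenizer.decode(token_ids)

# e.g. clip_emb = model.encode_image(image).float() from the earlier snippet
print(caption_from_clip_embedding(torch.randn(1, CLIP_DIM)))
```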

Conclusions

  • Using CLIP alone and taking distance/similarity metrics between the current text prompts' embeddings and the image embeddings might not be enough

Follow-on questions

  • What if we used a different tokeniser or pretrained LLM than BERT?
    • This might improve the text embeddings and possibly get them closer to the image embeddings
  • What if we used a fine-tuned CLIP?
  • What if we used a captioning model?

TODO:

  • Shift this OneNote documentation to Confluence for others to read

Share on Thurs:

  • I tested extending the phrase templates to include “conditions” or “weather” or “lighting” at the end, to see if it makes sense (roughly as sketched below)
  • Improves a little bit, but not much
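
Roughly the kind of variants tested (the exact wording here is illustrative):

```python
# Hypothetical template variants: a base phrase plus one of the suffix words
# mentioned above appended at the end.
base = "a photo taken in {} "
suffixes = ["conditions", "weather", "lighting"]
labels = ["snowy", "rainy", "night-time"]

prompts = {lab: [base.format(lab) + s for s in suffixes] for lab in labels}
# prompts["snowy"] == ["a photo taken in snowy conditions",
#                      "a photo taken in snowy weather",
#                      "a photo taken in snowy lighting"]
```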

Can try fine-tuning CLIP ourselves using manually annotated images from BDD.

Or use the BDD100K labels, which include weather, scene, and timeofday attributes, as per https://doc.bdd100k.com/download.html (see the sketch below).

Can try either one; whichever seems easier, just go ahead with it.
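
A sketch of turning the BDD100K attribute labels into (image, caption) pairs for fine-tuning; the JSON field names and file paths below are assumptions based on the download page, so check them against the actual files:

```python
# Build (image path, caption) pairs from BDD100K attribute labels.
# Assumed entry shape: {"name": "xxxx.jpg",
#                       "attributes": {"weather": ..., "scene": ..., "timeofday": ...}}
import json
from pathlib import Path

IMAGE_DIR = Path("bdd100k/images/100k/train")          # assumed layout
LABEL_FILE = Path("bdd100k/labels/det_20/det_train.json")  # assumed file

def make_caption(attrs):
    return (f"a dashcam photo of a {attrs.get('scene', 'road')} "
            f"in {attrs.get('weather', 'unknown')} weather "
            f"at {attrs.get('timeofday', 'unknown')}")

pairs = []
for entry in json.loads(LABEL_FILE.read_text()):
    attrs = entry.get("attributes", {})
    img_path = IMAGE_DIR / entry["name"]
    if img_path.exists():
        pairs.append((img_path, make_caption(attrs)))

print(len(pairs), "image-caption pairs")
print(pairs[0] if pairs else "no pairs found")
# These pairs could then feed a CLIP fine-tuning setup, e.g. train-CLIP (linked below).
```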

Similarity search

https://github.com/arampacha/CLIP-rsicd

https://github.com/openai/CLIP

https://github.com/Zasder3/train-CLIP

https://github.com/rmokady/CLIP_prefix_caption
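
A minimal text-to-image similarity search sketch over pre-computed CLIP image embeddings, using the openai/CLIP repo above; the embedding and path files below are hypothetical, produced offline beforehand:

```python
# Query a bank of L2-normalised CLIP image embeddings with a free-text prompt.
import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# image_embs: (N, 512) float32 array of L2-normalised image embeddings,
# image_paths: list of N file paths (both hypothetical, computed offline).
image_embs = np.load("bdd_image_embeddings.npy")
image_paths = np.load("bdd_image_paths.npy", allow_pickle=True)

def search(query, k=5):
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        q = model.encode_text(tokens).float().cpu().numpy()[0]
    q = q / np.linalg.norm(q)
    scores = image_embs @ q                  # cosine similarity (both normalised)
    top = np.argsort(-scores)[:k]
    return [(image_paths[i], float(scores[i])) for i in top]

print(search("a snowy street at night"))
```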