Testing CLIP directly with BDD images
Created: 07 Feb 2023, 04:21 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge,
When just using Euclidean distance to compare the image and text embeddings:
- Not very accurate when tested on standard ImageNet
- Switched to cosine similarity as the distance metric instead (see the sketch below)
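A minimal sketch of this comparison, assuming the openai/CLIP package (linked at the bottom of this note); the image path and the candidate phrases are placeholders:

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder BDD frame and candidate phrases
image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize([
    "a photo of a city street at night",
    "a photo of a highway on a clear day",
    "a photo of a snowy residential road",
]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()   # (1, 512)
    txt_emb = model.encode_text(texts).float()    # (3, 512)

# Euclidean distance is dominated by embedding norms, which CLIP's contrastive
# loss never constrained; cosine similarity matches the training objective.
euclid = torch.cdist(img_emb, txt_emb)                                  # (1, 3) distances
cosine = (img_emb / img_emb.norm(dim=-1, keepdim=True)) @ \
         (txt_emb / txt_emb.norm(dim=-1, keepdim=True)).T               # (1, 3) similarities
print("euclidean:", euclid, "\ncosine:", cosine)
```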
Updated the phrase templates to try to get better descriptions:

I.e. prompt engineering, as in https://medium.com/mlearning-ai/having-fun-with-clip-features-part-i-29dff92bbbcd
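A rough sketch of the template-ensembling idea from that post (the templates and class names below are illustrative, not the exact ones used): embed every template for each class, average the normalised text embeddings, and score images against the averaged vectors.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["dog", "cat", "traffic light"]      # placeholder class names
templates = [                                  # placeholder templates
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of a {} in bad weather.",
    "a photo of a {} at night.",
]

with torch.no_grad():
    class_embs = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))      # ensemble over all templates
    class_embs = torch.stack(class_embs)        # (num_classes, 512); compare image embeddings against this
```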
For standard ImageNet:


For winter scenes:


⇒ in some cases it does mention snow / cold weather
⇒ but still not great overall
For night scenes:


⇒ doesn’t seem to work very well for night scenes.
Using ClipCap
https://colab.research.google.com/drive/1NXRL7Sj3anwvNOywbU5VZDSdLZ8nBnd1
https://arxiv.org/abs/2111.09734
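For reference, a stripped-down sketch of the ClipCap idea (CLIP image embedding, then a small mapping network, then prefix embeddings fed into GPT-2). The mapper here is randomly initialised, so the printed caption is nonsense; this only illustrates the data flow, not the released checkpoints.

```python
import torch
import torch.nn as nn
import clip
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix_len = 10  # number of "virtual" prefix tokens fed to GPT-2

# Mapping network: CLIP image embedding (512-d) -> prefix_len GPT-2 embeddings (768-d each).
# In the paper this MLP (or a transformer) is the part that actually gets trained.
mapper = nn.Sequential(
    nn.Linear(512, 768 * prefix_len // 2),
    nn.Tanh(),
    nn.Linear(768 * prefix_len // 2, 768 * prefix_len),
).to(device)

image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()           # (1, 512)
    prefix = mapper(clip_embed).view(1, prefix_len, 768)           # (1, 10, 768)

    # Greedy decoding: keep appending the embedding of the predicted token.
    generated = prefix
    tokens = []
    for _ in range(20):
        logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        tokens.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)    # (1, 1, 768)
        generated = torch.cat([generated, next_embed], dim=1)

print(tokenizer.decode(tokens))
```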


Conclusions
- Using CLIP alone and taking distance/similarity metrics between the current text and image embeddings might not be enough
Follow on questions
- What if we use a different tokeniser / pretrained LLM instead of BERT?
- This might improve the text embeddings and bring them closer to the image embeddings
- What if we use a finetuned CLIP?
- What if we use a captioning model?
TODO:
- Shift this OneNote documentation to Confluence for others to read
Share on Thurs:
- I tested extending the phrase templates to include “conditions”, “weather”, or “lighting” at the end, to see if it helps
- Improves things a little, but not much
Can try finetuning CLIP ourselves using manually annotated images from BDD
Or use the BDD labels, which already include weather, scene, and timeofday attributes, as per https://doc.bdd100k.com/download.html
Either works; start with whichever seems easier (see the sketch below)
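A hedged sketch of turning the BDD100K attribute labels into (image, caption) pairs for finetuning; the field names assume the bdd100k_labels_images_train.json layout described on that page, so adjust if the schema differs.

```python
import json

with open("bdd100k_labels_images_train.json") as f:
    labels = json.load(f)

pairs = []
for item in labels:
    attrs = item.get("attributes", {})
    # Build a simple caption from the weather / scene / timeofday attributes
    caption = "a photo of a {scene} during the {tod} in {weather} weather".format(
        scene=attrs.get("scene", "road scene"),
        tod=attrs.get("timeofday", "day"),
        weather=attrs.get("weather", "clear"),
    )
    pairs.append((item["name"], caption))   # (image filename, caption) pair

print(pairs[:3])
```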
Similarity search
- https://github.com/nmslib/nmslib
- https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6
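A sketch of indexing precomputed CLIP image embeddings with nmslib's HNSW index in cosine space (the .npy files here are hypothetical placeholders):

```python
import numpy as np
import nmslib

# Precomputed CLIP image embeddings, shape (num_images, 512), float32
embeddings = np.load("clip_image_embeddings.npy")

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(embeddings)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=True)

# Query with a CLIP text embedding (e.g. "a snowy road at night") to retrieve
# the closest BDD frames.
query = np.load("clip_text_query.npy")   # shape (512,)
ids, dists = index.knnQuery(query, k=10)
print(ids, dists)
```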
https://github.com/arampacha/CLIP-rsicd
https://github.com/openai/CLIP