Testing CLIP directly with BDD images


Created: 07 Feb 2023, 04:21 PM | Tags: knowledge


When just using Euclidean distance to compare the image and text embeddings for image-text pairs:

  • Not very accurate when tested with standard ImageNet
  • Switched to cosine similarity as the distance metric instead (sketched below)
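
A minimal sketch of the two metrics, assuming the openai/CLIP package (linked further down) and an illustrative local frame `bdd_frame.jpg`:

```python
# Compare Euclidean distance vs cosine similarity between one image embedding
# and a few candidate text embeddings. "bdd_frame.jpg" is a placeholder path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize([
    "a photo of a street at night",
    "a photo of a street in the daytime",
    "a photo of a snowy street",
]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()
    txt_emb = model.encode_text(texts).float()

# Euclidean distance on the raw embeddings (what was tried first): lower = closer.
euclid = torch.cdist(img_emb, txt_emb)

# Cosine similarity on L2-normalised embeddings (what worked better): higher = closer.
img_n = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_n = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
cosine = img_n @ txt_n.T

print("euclidean:", euclid)
print("cosine:   ", cosine)
```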

Updated the phrase templates to allow for possibly better descriptions, i.e. prompt engineering: https://medium.com/mlearning-ai/having-fun-with-clip-features-part-i-29dff92bbbcd
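
A rough sketch of the template-averaged zero-shot scoring this refers to; the class names and template wording below are illustrative, not the exact ones used:

```python
# Zero-shot classification with multiple prompt templates per class,
# assuming the openai/CLIP package and a placeholder image path.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["clear", "rainy", "snowy", "foggy"]
templates = [
    "a photo taken in {} weather",
    "a dashcam image of a road in {} conditions",
]

with torch.no_grad():
    # Average the text embeddings over all templates for each class
    # (the standard CLIP zero-shot trick).
    class_embs = []
    for c in classes:
        tokens = clip.tokenize([t.format(c) for t in templates]).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    class_embs = torch.stack(class_embs)
    class_embs = class_embs / class_embs.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("bdd_frame.jpg")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

probs = (100.0 * img_emb @ class_embs.T).softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```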

Results with the updated templates:

  • For standard ImageNet
  • For winter scenes: does have some cases where it stated snow / cold weather, but still not great
  • For night scenes: doesn't seem to work very well

Using ClipCap

https://colab.research.google.com/drive/1NXRL7Sj3anwvNOywbU5VZDSdLZ8nBnd1

https://arxiv.org/abs/2111.09734
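
ClipCap maps the CLIP image embedding to a short "prefix" of GPT-2 token embeddings and lets GPT-2 continue it as a caption. A rough architectural sketch of that idea (not the authors' code; without the trained projection weights from the repo above the output is meaningless, this only shows how the pieces fit together):

```python
# Prefix-projection captioning sketch: MLP maps a CLIP embedding to a sequence
# of GPT-2 embeddings, and GPT-2 decodes greedily from that prefix.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

PREFIX_LEN = 10          # number of prefix tokens (a ClipCap hyperparameter)
CLIP_DIM = 512           # ViT-B/32 embedding size

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt_dim = gpt2.transformer.wte.weight.shape[1]   # 768 for base GPT-2

# Projection from one CLIP embedding to PREFIX_LEN GPT-2 embeddings
# (untrained here; ClipCap trains this with a language-modelling loss).
prefix_proj = nn.Sequential(
    nn.Linear(CLIP_DIM, gpt_dim * PREFIX_LEN // 2),
    nn.Tanh(),
    nn.Linear(gpt_dim * PREFIX_LEN // 2, gpt_dim * PREFIX_LEN),
)

def caption_from_clip_embedding(clip_emb, max_tokens=20):
    """Greedy decoding conditioned on the projected prefix."""
    prefix = prefix_proj(clip_emb).view(1, PREFIX_LEN, gpt_dim)
    generated = prefix
    token_ids = []
    with torch.no_grad():
        for _ in range(max_tokens):
            out = gpt2(inputs_embeds=generated)
            next_id = out.logits[:, -1, :].argmax(dim=-1)        # greedy pick
            token_ids.append(next_id.item())
            next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)
            generated = torch.cat([generated, next_emb], dim=1)
    return tokenizer.decode(token_ids)

# e.g. clip_emb = model.encode_image(image).float() from the earlier snippet
print(caption_from_clip_embedding(torch.randn(1, CLIP_DIM)))
```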

Conclusions

  • Using CLIP alone and taking distance/similarity metrics between the current text prompts' embeddings and the image embeddings might not be enough

Follow-on questions

  • What if we used a different tokeniser or pretrained LLM than BERT?
    • This might improve the text embeddings and possibly get them closer to the image embeddings
  • What if we used a fine-tuned CLIP?
  • What if we used a captioning model?

TODO:

  • Shift this OneNote documentation to Confluence for others to read

Share on Thurs:

  • I tested extending the phrase templates to include “conditions” or “weather” or “lighting” at the end, to see if it makes sense (roughly as sketched below)
  • Improves a little bit, but not much
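
Roughly the kind of variants tested (the exact wording here is illustrative):

```python
# Hypothetical template variants: a base phrase plus one of the suffix words
# mentioned above appended at the end.
base = "a photo taken in {} "
suffixes = ["conditions", "weather", "lighting"]
labels = ["snowy", "rainy", "night-time"]

prompts = {lab: [base.format(lab) + s for s in suffixes] for lab in labels}
# prompts["snowy"] == ["a photo taken in snowy conditions",
#                      "a photo taken in snowy weather",
#                      "a photo taken in snowy lighting"]
```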

Can try fine-tuning CLIP ourselves using manually annotated images from BDD.

Or use the BDD100K labels, which include weather, scene, and timeofday attributes, as per https://doc.bdd100k.com/download.html (see the sketch below).

Can try either one; whichever seems easier, just go ahead with it.
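
A sketch of turning the BDD100K attribute labels into (image, caption) pairs for fine-tuning; the JSON field names and file paths below are assumptions based on the download page, so check them against the actual files:

```python
# Build (image path, caption) pairs from BDD100K attribute labels.
# Assumed entry shape: {"name": "xxxx.jpg",
#                       "attributes": {"weather": ..., "scene": ..., "timeofday": ...}}
import json
from pathlib import Path

IMAGE_DIR = Path("bdd100k/images/100k/train")          # assumed layout
LABEL_FILE = Path("bdd100k/labels/det_20/det_train.json")  # assumed file

def make_caption(attrs):
    return (f"a dashcam photo of a {attrs.get('scene', 'road')} "
            f"in {attrs.get('weather', 'unknown')} weather "
            f"at {attrs.get('timeofday', 'unknown')}")

pairs = []
for entry in json.loads(LABEL_FILE.read_text()):
    attrs = entry.get("attributes", {})
    img_path = IMAGE_DIR / entry["name"]
    if img_path.exists():
        pairs.append((img_path, make_caption(attrs)))

print(len(pairs), "image-caption pairs")
print(pairs[0] if pairs else "no pairs found")
# These pairs could then feed a CLIP fine-tuning setup, e.g. train-CLIP (linked below).
```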

Similarity search

https://github.com/arampacha/CLIP-rsicd

https://github.com/openai/CLIP

https://github.com/Zasder3/train-CLIP

https://github.com/rmokady/CLIP_prefix_caption
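
A minimal text-to-image similarity search sketch over pre-computed CLIP image embeddings, using the openai/CLIP repo above; the embedding and path files below are hypothetical, produced offline beforehand:

```python
# Query a bank of L2-normalised CLIP image embeddings with a free-text prompt.
import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# image_embs: (N, 512) float32 array of L2-normalised image embeddings,
# image_paths: list of N file paths (both hypothetical, computed offline).
image_embs = np.load("bdd_image_embeddings.npy")
image_paths = np.load("bdd_image_paths.npy", allow_pickle=True)

def search(query, k=5):
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        q = model.encode_text(tokens).float().cpu().numpy()[0]
    q = q / np.linalg.norm(q)
    scores = image_embs @ q                  # cosine similarity (both normalised)
    top = np.argsort(-scores)[:k]
    return [(image_paths[i], float(scores[i])) for i in top]

print(search("a snowy street at night"))
```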