Bridging Images and Text - a Survey of VLMs
Nanonets
SEPTEMBER 17, 2024
The only place where the text and vision embeddings come together is during loss computation, and this loss is typically a contrastive loss. Flamingo - The vision tokens are computed with a modified version of ResNet and from a specialized layer called the Perceiver Resampler, which is similar to DETR.
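To make the contrastive loss mentioned above concrete, here is a minimal NumPy sketch of a symmetric, CLIP-style contrastive loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the assumption that row i of each matrix belongs to the same image-text pair are illustrative choices, not details taken from the survey.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same image-text pair (hypothetical setup).
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities: logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # The matching pair sits on the diagonal, so the target for row i is i.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned pairs:   ", contrastive_loss(images, texts))
    print("misaligned pairs:", contrastive_loss(images, texts[::-1]))
```

In this sketch the two encoders never interact before the loss: each produces its embeddings independently, and the only coupling is the similarity matrix inside `contrastive_loss`. Correctly paired batches should score a lower loss than mismatched ones.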