Vision Language Models (VLMs)

Aditya Maurya
4 min read · Sep 29, 2024

Vision Language Models:

Vision language models (VLMs) combine computer vision and natural language processing techniques to make sense of the relationships within and between objects in images and to describe them in natural language.

In practice, this means combining various visual machine learning (ML) algorithms with transformer-based large language models (LLMs).

Common Visual ML Algorithms:

1. Convolutional Neural Networks (CNNs)

  • Use Case: Image classification, object detection, facial recognition.
  • Description: CNNs are designed specifically for processing grid-like data, such as images. They use convolutional layers to automatically learn spatial hierarchies in images, making them extremely powerful for visual tasks.
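To make this concrete, here is a minimal sketch of a CNN classifier in PyTorch (the framework, the class name SimpleCNN, the layer sizes, and the 32×32 input are all illustrative assumptions, not taken from a specific model):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two conv blocks learn a spatial hierarchy; a linear head classifies."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges/textures
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SimpleCNN()(torch.randn(1, 3, 32, 32))  # e.g. a CIFAR-10-sized image
```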

2. ResNet (Residual Networks)

  • Use Case: Image recognition, object detection, segmentation.
  • Description: ResNet, with its skip connections, helps mitigate the vanishing gradient problem, enabling the training of very deep networks. The introduction of residual blocks allows ResNet to achieve great performance on large datasets like ImageNet.
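The core idea is the residual (skip) connection; a rough PyTorch sketch of a basic residual block (channel count and layer choices are illustrative) looks like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # adding x keeps gradients flowing in deep stacks

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```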

3. YOLO (You Only Look Once)

  • Use Case: Real-time object detection.
  • Description: YOLO is known for its speed and efficiency in detecting multiple objects in an image in real-time. It divides the image into grids and predicts bounding boxes and class probabilities directly.
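In practice most people use a packaged implementation rather than writing YOLO from scratch. As one option (my assumption, not something the article prescribes), the ultralytics package exposes pretrained YOLO models; the image path below is hypothetical:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")              # small pretrained detector
results = model("street_scene.jpg")     # hypothetical image path
for box in results[0].boxes:            # one pass predicts all boxes at once
    print(box.cls, box.conf, box.xyxy)  # class id, confidence, bounding box
```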

4. Mask R-CNN (Region-based Convolutional Neural Networks)

  • Use Case: Image segmentation.
  • Description: Extending Faster R-CNN, Mask R-CNN adds a branch for predicting segmentation masks. It’s commonly used for instance segmentation, which involves detecting objects and distinguishing pixels that belong to each object.
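A quick way to try instance segmentation, assuming a recent torchvision, is its bundled Mask R-CNN model (the random tensor below stands in for a real image):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()  # pretrained on COCO
image = torch.rand(3, 480, 640)  # placeholder image tensor in [0, 1]
with torch.no_grad():
    output = model([image])[0]   # one dict per input image
print(output["boxes"].shape, output["masks"].shape)  # boxes plus a mask per instance
```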

5. U-Net

  • Use Case: Medical image segmentation.
  • Description: U-Net is a type of CNN designed for biomedical image segmentation. Its unique “U” shaped architecture captures both high-level and low-level information, making it effective for segmenting complex structures.
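A toy U-Net with a single encoder/decoder level shows how the skip connection merges low-level detail with high-level context; the class name TinyUNet and all sizes are illustrative:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """One encoder level, a bottleneck, one decoder level, one skip connection."""
    def __init__(self, num_classes=1):
        super().__init__()
        self.enc = conv_block(1, 16)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = conv_block(32, 16)             # 32 = 16 upsampled + 16 skipped
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        e = self.enc(x)                                  # low-level detail
        b = self.bottleneck(self.down(e))                # high-level context
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # skip connection
        return self.head(d)

mask = TinyUNet()(torch.randn(1, 1, 64, 64))  # e.g. a grayscale slice
```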

6. Vision Transformers (ViT)

  • Use Case: Image classification, object detection.
  • Description: Vision Transformers apply Transformer models, originally designed for NLP, to visual tasks. They break images into patches and model long-range dependencies using attention mechanisms, achieving state-of-the-art performance on classification tasks.
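The patch-and-attend idea can be sketched in a few lines; the 16×16 patch size and 768-dim embedding follow the original ViT-Base configuration, while the two-layer encoder is just to keep the example small:

```python
import torch
import torch.nn as nn

# Turn a 224x224 image into a sequence of 16x16 patch embeddings, then run a
# standard Transformer encoder over them (class token and positions omitted).
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # one token per patch
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

image = torch.randn(1, 3, 224, 224)
patches = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 196, 768) tokens
features = encoder(patches)                              # attention across all patches
```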

7. Generative Adversarial Networks (GANs)

  • Use Case: Image generation, super-resolution, style transfer.
  • Description: GANs consist of two neural networks, a generator and a discriminator, that compete in a game-like framework. GANs are widely used to generate realistic images, super-resolve low-resolution images, and perform style transfer.
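A bare-bones sketch of the two-player setup (all layer sizes and the 28×28 image shape are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

# Generator maps noise to a flattened image; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

noise = torch.randn(8, 100)
fake_images = G(noise)        # generator tries to fool D
fake_scores = D(fake_images)  # discriminator tries to catch fakes
g_loss = nn.functional.binary_cross_entropy_with_logits(
    fake_scores, torch.ones_like(fake_scores)  # G wants its fakes labelled "real"
)
```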

8. Autoencoders

  • Use Case: Image denoising, compression, feature learning.
  • Description: Autoencoders are unsupervised neural networks that learn efficient representations of data (encoding) and can reconstruct input data (decoding). They are useful for tasks like image compression, denoising, and anomaly detection.
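The encode-then-reconstruct loop fits in a few lines; the 28×28 inputs and 32-dim code are illustrative:

```python
import torch
import torch.nn as nn

# Encoder compresses a flattened 28x28 image into a 32-dim code;
# decoder reconstructs the input from that code.
encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, 28 * 28), nn.Sigmoid())

x = torch.rand(16, 28 * 28)              # a batch of flattened images
recon = decoder(encoder(x))
loss = nn.functional.mse_loss(recon, x)  # reconstruction error drives learning
```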

9. Faster R-CNN

  • Use Case: Object detection.
  • Description: Faster R-CNN improves upon traditional R-CNN by introducing a Region Proposal Network (RPN) to efficiently predict object bounding boxes, significantly speeding up object detection while maintaining accuracy.
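Running torchvision's Faster R-CNN in training mode makes the RPN visible, because the returned loss dictionary includes the RPN's objectness and box-regression terms (the single box and label below are dummy values; assumes a recent torchvision):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=2).train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),  # dummy ground truth
            "labels": torch.tensor([1])}]
losses = model(images, targets)
print(losses.keys())  # includes loss_objectness / loss_rpn_box_reg from the RPN
```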

10. K-Means Clustering (for Image Segmentation)

  • Use Case: Image segmentation.
  • Description: K-Means is an unsupervised learning algorithm that can cluster image pixels based on their similarity. It’s often used for simple segmentation tasks like background removal or grouping similar colors in an image.
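A compact pixel-clustering example with scikit-learn (the random array stands in for a real RGB image, and four clusters is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

image = np.random.rand(120, 160, 3)           # placeholder RGB image in [0, 1]
pixels = image.reshape(-1, 3)                 # one row per pixel (its colour)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segmented = kmeans.labels_.reshape(120, 160)  # per-pixel cluster id = segment label
```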

Examples of VLMs:

OpenAI GPT-4, Google Gemini, Qwen, Llama, and CLIP (Contrastive Language-Image Pre-training)

Algorithms used for training VLMs:

  • Contrastive learning. Models like CLIP train an image encoder and a text encoder jointly so that matching image-caption pairs sit close together in a shared embedding space while mismatched pairs are pushed apart (a minimal sketch of this loss follows the list). The open-source LLaVA uses a pretrained CLIP vision encoder as part of its pretraining step, which is then connected to a version of the Llama LLM.
  • PrefixLM. Models like SimVLM and VirTex train a transformer directly on image patches together with the start of a caption (the prefix), learning to predict the remaining words of an appropriate caption.
  • Multi-modal fusing with cross-attention. Models like VisualGPT, VC-GPT and Flamingo fuse visual features into the attention mechanism of an existing LLM through cross-attention layers.
  • Masked-language modeling and image-text matching. Models like BridgeTower, FLAVA, LXMERT and VisualBERT combine an objective that predicts masked words with another that decides whether an image and a caption belong together.
  • Knowledge distillation. Models like ViLD distill a larger, highly accurate teacher model into a more compact student model with fewer parameters that runs faster and cheaper while retaining similar performance.
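As promised above, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss; the embedding size, batch size, and temperature are illustrative, and the random tensors stand in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Pull matched image/text pairs together, push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(len(logits))              # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```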

Evaluation metrics for VLMs:

  • Bilingual Evaluation Understudy. BLEU measures how many n-grams of a machine-generated translation or caption also appear in one or more human-curated reference texts, with a penalty for overly short output.
  • Consensus-based Image Description Evaluation. CIDEr scores a machine-generated image description against a set of human reference descriptions, weighting n-grams by TF-IDF so that informative phrases count more than common ones.
  • Metric for Evaluation of Translation with Explicit Ordering. METEOR scores descriptions or translations on unigram precision and recall, with stemming and synonym matching and a penalty for word-order fragmentation, and was designed to track human quality judgments more closely than BLEU.
  • Recall-Oriented Understudy for Gisting Evaluation. ROUGE measures how much of a human-written reference summary or description is recovered by the VLM-generated text, using recall over overlapping n-grams (a simplified overlap sketch follows this list).
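As a rough intuition for the overlap these metrics measure, here is a simplified unigram sketch (real BLEU uses clipped n-grams up to length 4 plus a brevity penalty, and ROUGE has several variants; the example sentences are made up):

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped unigram overlap between a generated caption and a reference."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)  # BLEU-style: how much of the output is in the reference
    recall = overlap / len(ref)      # ROUGE-style: how much of the reference is covered
    return precision, recall

p, r = unigram_overlap("a dog runs on the beach", "a dog is running on the beach")
print(f"precision={p:.2f} recall={r:.2f}")
```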
