
πŸ–ΌοΈ Image Captioning with BLIP (Vision-Language Model)

This project demonstrates how to generate captions for images using the BLIP (Bootstrapped Language-Image Pretraining) model by Salesforce, powered by the 🤗 Hugging Face Transformers library.

It is designed to run in Google Colab and uses a dataset of images (such as a subset of Flickr8k) to generate natural language captions.


📌 Features

  • ✅ Uses Salesforce/blip-image-captioning-base for image captioning
  • ✅ Automatically loads and processes images from a ZIP file
  • ✅ GPU-accelerated via Google Colab
  • ✅ Shows sample outputs using matplotlib
  • ✅ Clean and modular Python code

πŸ“ Dataset

The dataset used is a 2,000-image subset of the Flickr8k dataset.

📥 Download here:
https://www.kaggle.com/datasets/sanjeetbeniwal/flicker8k-2k

Expected structure inside the ZIP file:


```
Flickr8k_2k.zip
└── Flicker8k_2kDataset/
    ├── image1.jpg
    ├── image2.jpg
    └── ...
```

Upload this ZIP file to your Colab environment before running the notebook.
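
Once uploaded, the notebook extracts the archive before captioning. A minimal sketch using Python's standard zipfile module (the file and folder names match the structure shown above; adjust them if your archive differs):

```python
import zipfile
from pathlib import Path

# Extract the uploaded archive into the Colab working directory.
with zipfile.ZipFile("Flickr8k_2k.zip") as zf:
    zf.extractall(".")

# Verify the images landed where the notebook expects them.
image_paths = sorted(Path("Flicker8k_2kDataset").glob("*.jpg"))
print(f"Found {len(image_paths)} images")
```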


πŸ› οΈ Dependencies

The following Python packages are required:

```bash
pip install torch torchvision torchaudio
pip install transformers
pip install matplotlib
```

All dependencies are automatically installed in the Colab notebook.


🚀 How It Works

  1. Setup: Install required libraries and enable GPU runtime.
  2. Dataset Unzipping: Upload and extract the dataset in Colab.
  3. Model Loading: Load BLIP processor and model to GPU.
  4. Captioning: Select and caption random images.
  5. Visualization: Display images with generated captions using matplotlib.
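
A minimal sketch of steps 3-5, using the standard Hugging Face Transformers BLIP API (the dataset folder name follows the structure above; the generation settings are illustrative):

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 3: load the BLIP processor and model, moving the model to the GPU.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

# Step 4: caption one randomly chosen image from the extracted dataset.
image_paths = sorted(Path("Flicker8k_2kDataset").glob("*.jpg"))
image = Image.open(random.choice(image_paths)).convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

# Step 5: display the image with its generated caption.
plt.imshow(image)
plt.axis("off")
plt.title(caption)
plt.show()
```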

📸 Sample Output

Below is an example of the model generating a caption for an image from the dataset:

Image: `screenshot_20250721_235853.jpg`
Generated Caption: "a child sitting in a play area"


💡 Model Info

This project uses Salesforce/blip-image-captioning-base, the base variant of BLIP (Bootstrapped Language-Image Pretraining, Li et al., 2022). The model pairs a ViT image encoder with a text decoder and was pretrained on large-scale image-text pairs, so it generates captions for arbitrary images without fine-tuning.

▶️ Usage Instructions

  1. Open the notebook in Google Colab.
  2. Upload your dataset ZIP file (Flickr8k_2k.zip) to Colab.
  3. Set the runtime to GPU: Runtime → Change runtime type → GPU (see the check below).
  4. Run all cells sequentially.
  5. View the images and their generated captions.
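
To confirm the GPU runtime took effect before running the heavier cells, a quick one-off check (not part of the notebook's required steps):

```python
import torch

# Prints True and the device name when a Colab GPU runtime is active.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```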


📄 License

This project is for educational and research purposes. It uses publicly available pretrained models under their respective licenses.
