LMM-ConceptExplainer-VQA

This project demonstrates a concept-based explainability framework for Large Multimodal Models (LMMs) on the VQAv2-small dataset. It adapts techniques from the paper "A Concept-Based Explainability Framework for Large Multimodal Models" to make visual question answering (VQA) outputs interpretable by analyzing how specific answer words are represented across the text and vision modalities.


📌 Objective

To identify and explain key entities (concepts) in VQA answers using textual and visual embeddings extracted with a Large Multimodal Model (LMM), and to assess their representational utility and stability.


📂 Dataset

  • Name: VQAv2-small
  • Accessed via HuggingFace's datasets library (a loading sketch follows this list).
  • A lightweight version of the VQAv2 dataset, used here for development and experimentation.
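A minimal loading sketch. The dataset ID (`merve/vqav2-small`), split name, and field names are assumptions; substitute whatever the notebook actually uses:

```python
from datasets import load_dataset

# Dataset ID and split are assumptions; adjust to match the notebook.
ds = load_dataset("merve/vqav2-small", split="validation")

sample = ds[0]
print(sample.keys())  # expected fields include an image, a question, and an answer
```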

🧪 Methodology

  1. Extract answers from a subset of the VQAv2 dataset.
  2. Identify and select 5 frequently occurring target entities (e.g., “dog”, “clock”, “woman”).
  3. Use a pretrained LMM (e.g., CLIP or BLIP) to extract the following (see the sketch after this list):
    • Textual embeddings of the selected words.
    • Visual embeddings from images whose answers contain those words.
  4. Visualize embeddings and compare alignment between text and image representations.
  5. Evaluate each entity’s:
    • Utility: How meaningfully the entity is represented.
    • Stability: How consistently the entity is represented across samples.
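Step 3 is the core of the pipeline. Below is a minimal sketch using CLIP via transformers; the dataset field names (`image`, `multiple_choice_answer`) and the target word `"dog"` are assumptions about the schema:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Collect a few PIL images whose answer mentions a target word;
# the field names are assumptions about the dataset schema.
images = [ex["image"] for ex in ds if "dog" in ex["multiple_choice_answer"]][:16]

with torch.no_grad():
    # Textual embeddings of the selected words.
    text_inputs = processor(text=["dog", "clock", "woman"],
                            return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

    # Visual embeddings of the matching images.
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# L2-normalise so that dot products equal cosine similarities.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
```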

🧠 Concept-Based Explainability

This project implements a concept-based explanation approach, inspired by research that seeks to understand how multimodal models internally represent human-interpretable concepts.

🧩 Key Concepts

1. Entity Selection

  • Identify 5 target words/entities from the answers dataset.
  • These words become the concepts whose representations are analyzed (a frequency-counting sketch follows).
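A minimal frequency-counting sketch, assuming a loaded dataset `ds` with the VQAv2 answer field `multiple_choice_answer`:

```python
from collections import Counter

# Count words across answers in a subset (field name is an assumption).
answers = [ex["multiple_choice_answer"].lower() for ex in ds.select(range(2000))]
counts = Counter(word for answer in answers for word in answer.split())

# Skip uninformative answers such as "yes"/"no" and keep the 5 most
# frequent remaining words as target entities.
skip = {"yes", "no", "none", "0", "1", "2", "3"}
entities = [w for w, _ in counts.most_common(50) if w not in skip][:5]
print(entities)  # e.g. ["dog", "clock", "woman", ...]
```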

2. Representation Extraction

  • For each entity (a pooling sketch follows this list):
    • Textual Representation: from the model's text encoder (e.g., the word "dog").
    • Visual Representation: from the model's vision encoder, using images whose answers contain the word.
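One common way to summarise the visual side, used here as an assumption rather than something prescribed by the paper, is to mean-pool each entity's image embeddings into a single concept prototype:

```python
import torch

def concept_prototype(image_embs: torch.Tensor) -> torch.Tensor:
    """Average one entity's L2-normalised image embeddings into a single
    concept vector, then re-normalise it to unit length."""
    proto = image_embs.mean(dim=0)
    return proto / proto.norm()

# e.g. prototypes = {e: concept_prototype(embs) for e, embs in per_entity_embs.items()}
```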

3. Concept Verification

  • Use cosine similarity and dimensionality reduction (sketched after this list) to verify that:
    • Images associated with the same word produce similar embeddings.
    • These are close to the word’s text embedding.
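A verification sketch. `image_emb` and `text_emb` are the normalised embeddings from the extraction step, and `entity_idx` is a placeholder for the row of the entity under inspection:

```python
import torch
from sklearn.decomposition import PCA

# Both sets are L2-normalised, so a dot product is the cosine similarity.
entity_idx = 0  # placeholder: index of the entity's text embedding
sims = image_emb @ text_emb[entity_idx]
print(f"mean image-text cosine similarity: {sims.mean().item():.3f}")

# Project all embeddings to 2-D for visual inspection of the clusters.
points = PCA(n_components=2).fit_transform(torch.cat([image_emb, text_emb]).numpy())
```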

4. Utility Evaluation

  • Check whether the representations are (a scoring sketch follows this list):
    • Distinct: Do different concepts have clearly separated embeddings?
    • Cohesive: Are instances of the same concept tightly clustered?
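Both criteria can be scored at once with a silhouette score; this is a substitution for illustration, not necessarily the paper's exact metric. `all_embs` is the stacked (N, D) matrix of image embeddings across entities and `labels` gives each row's entity index:

```python
from sklearn.metrics import silhouette_score

# High silhouette = same-concept embeddings are cohesive AND different
# concepts are well separated; the score ranges from -1 to 1.
score = silhouette_score(all_embs, labels, metric="cosine")
print(f"silhouette score: {score:.3f}")
```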

5. Stability Check

  • Analyze the variance in representations of the same entity across multiple samples.
  • Determine how sensitive concept embeddings are to input variations (a stability metric is sketched below).
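A simple stability metric, assuming an (N, D) NumPy array of one entity's image embeddings: the mean cosine similarity of each sample to the entity centroid.

```python
import numpy as np

def stability(embs: np.ndarray) -> float:
    """Mean cosine similarity of each sample to the entity centroid;
    values near 1.0 indicate a stable concept representation."""
    centroid = embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return float((embs @ centroid).mean())
```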

📈 Results

  • The notebook demonstrates:
    • Visual and textual embedding extraction.
    • Similarity analysis between representations.
    • Dimensionality reduction plots.
    • Entity-wise clustering consistency and variance metrics.

📦 Requirements

Install the dependencies using:

```bash
pip install -r requirements.txt
```

Required packages include:

  • transformers
  • datasets
  • torch
  • matplotlib
  • scikit-learn
  • Pillow
  • numpy

📄 License

MIT License

🙋‍♀️ Acknowledgments

  • Based on methods adapted from the paper: "A Concept-Based Explainability Framework for Large Multimodal Models"

  • Uses the VQAv2-small dataset via HuggingFace Datasets.