LMM-ConceptExplainer-VQA
This project demonstrates a concept-based explainability framework for Large Multimodal Models (LMMs) using the VQAv2-small dataset. It adapts techniques inspired by the paper "A Concept-Based Explainability Framework for Large Multimodal Models" to provide interpretable visual question answering (VQA) outputs by analyzing how specific words in answers are represented across text and vision modalities.
📌 Objective
To identify and explain key entities (concepts) from VQA answers using both textual and visual embeddings extracted via a Large Multimodal Model (LMM), and to assess their stability and representational utility.
📂 Dataset
- Name: VQAv2-small
- Accessed via the Hugging Face `datasets` library.
- A lightweight version of the VQAv2 dataset used for development and experimentation.
🧪 Methodology
- Extract answers from a subset of the VQAv2 dataset.
- Identify and select 5 frequently occurring target entities (e.g., “dog”, “clock”, “woman”).
- Use a pretrained LMM (e.g., CLIP or BLIP) to extract (see the sketch after this list):
  - Textual embeddings of the selected words.
  - Visual embeddings from images whose answers contain those words.
- Visualize embeddings and compare alignment between text and image representations.
- Evaluate each entity’s:
- Utility: How meaningfully the entity is represented.
- Stability: How consistently the entity is represented across samples.
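As a rough illustration of the extraction step, the snippet below shows how text and image embeddings for a target word could be obtained with CLIP via `transformers`. The checkpoint name is only an example; the notebook may use a different model (e.g., BLIP).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # example checkpoint, not necessarily the one in the notebook
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def text_embedding(word: str) -> torch.Tensor:
    """L2-normalised CLIP text embedding for a single word/phrase."""
    inputs = processor(text=[word], return_tensors="pt", padding=True)
    emb = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(emb, dim=-1).squeeze(0)

@torch.no_grad()
def image_embedding(image: Image.Image) -> torch.Tensor:
    """L2-normalised CLIP image embedding for a PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    emb = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(emb, dim=-1).squeeze(0)
```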
🧠 Concept-Based Explainability
This project implements a concept-based explanation approach, inspired by research that seeks to understand how multimodal models internally represent human-interpretable concepts.
🧩 Key Concepts
1. Entity Selection
- Identify 5 frequently occurring target words/entities from the dataset's answers.
- These words become the concepts whose representations are analyzed.
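A minimal sketch of the selection step, assuming the dataset exposes a `multiple_choice_answer` field as in the standard VQAv2 schema (the dataset id below is a placeholder for the VQAv2-small variant actually used):

```python
from collections import Counter
from datasets import load_dataset

# Placeholder dataset id -- substitute the VQAv2-small id used in the notebook.
ds = load_dataset("HuggingFaceM4/VQAv2", split="validation[:2000]")

# Field name assumed from the standard VQAv2 schema; adjust if the small
# variant names it differently.
counts = Counter(ex["multiple_choice_answer"].lower() for ex in ds)

# Keep single-word, non-yes/no answers and take the five most frequent.
candidates = [w for w, _ in counts.most_common() if w.isalpha() and w not in {"yes", "no"}]
target_entities = candidates[:5]
print(target_entities)  # e.g. ['dog', 'clock', 'woman', ...]
```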
2. Representation Extraction
- For each entity:
- Textual Representation: From the model's text encoder (e.g., the word "dog").
- Visual Representation: From the model's visual encoder, using images whose answer includes the word.
3. Concept Verification
- Use cosine similarity and dimensionality reduction to verify that:
- Images associated with the same word produce similar embeddings.
- These image embeddings lie close to the word's text embedding.
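One way to run this check, assuming `image_embs` maps each entity to a list of image embeddings and `text_embs` maps it to its text embedding (all as NumPy arrays):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_concept(entity, image_embs, text_embs):
    """Return (mean image-image similarity, mean image-text similarity) for one entity."""
    imgs = np.stack(image_embs[entity])
    # (1) images associated with the same word should be mutually similar
    intra = np.mean([cosine(imgs[i], imgs[j])
                     for i in range(len(imgs)) for j in range(i + 1, len(imgs))])
    # (2) and close to the word's text embedding
    text_sim = np.mean([cosine(v, text_embs[entity]) for v in imgs])
    return float(intra), float(text_sim)
```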
4. Utility Evaluation
- Check if representations are:
- Distinct: Do different concepts have clearly separated embeddings?
- Cohesive: Are instances of the same concept tightly clustered?
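A simple proxy for both properties is the silhouette score computed over all image embeddings grouped by entity; this is one possible metric, not necessarily the one used in the notebook.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def utility_score(image_embs: dict) -> float:
    """Silhouette score over all image embeddings, labelled by entity.
    Higher values mean concepts are both cohesive and well separated."""
    X = np.concatenate([np.stack(vecs) for vecs in image_embs.values()])
    labels = np.concatenate([[i] * len(vecs) for i, vecs in enumerate(image_embs.values())])
    return float(silhouette_score(X, labels, metric="cosine"))
```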
5. Stability Check
- Analyze the variance in representations of the same entity across multiple samples.
- Determine how sensitive concept embeddings are to input variations.
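Stability can be summarised, for example, by the per-dimension variance of an entity's embeddings and by how closely each sample sits to the entity centroid (a sketch under the same `image_embs` assumption as above):

```python
import numpy as np

def stability_metrics(image_embs: dict) -> dict:
    """Per-entity spread statistics: lower variance and higher centroid
    similarity indicate a more stable concept representation."""
    out = {}
    for entity, vecs in image_embs.items():
        X = np.stack(vecs)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        centroid = Xn.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        out[entity] = {
            "mean_variance": float(X.var(axis=0).mean()),
            "centroid_similarity": float((Xn @ centroid).mean()),
        }
    return out
```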
📈 Results
- The notebook demonstrates:
- Visual and textual embedding extraction.
- Similarity analysis between representations.
- Dimensionality reduction plots.
- Entity-wise clustering consistency and variance metrics.
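A hypothetical plotting helper for the dimensionality-reduction view, using PCA from scikit-learn (t-SNE or UMAP would work the same way):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_concepts(image_embs: dict) -> None:
    """2-D PCA scatter of all image embeddings, coloured by entity."""
    X = np.concatenate([np.stack(vecs) for vecs in image_embs.values()])
    coords = PCA(n_components=2).fit_transform(X)
    start = 0
    for entity, vecs in image_embs.items():
        end = start + len(vecs)
        plt.scatter(coords[start:end, 0], coords[start:end, 1], s=12, label=entity)
        start = end
    plt.legend()
    plt.title("Concept embeddings (PCA)")
    plt.show()
```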
📦 Requirements
Install the dependencies using:
`pip install -r requirements.txt`
Required packages include:
- `transformers`
- `datasets`
- `torch`
- `matplotlib`
- `scikit-learn`
- `Pillow` (PIL)
- `numpy`
📄 License
MIT License
🙋‍♀️ Acknowledgments
- Based on methods adapted from the paper "A Concept-Based Explainability Framework for Large Multimodal Models".
- Uses the VQAv2 dataset via Hugging Face Datasets.