# 🧠 textify-VQA

textify-VQA is a Visual Question Answering (VQA) project focused on answering product-related questions using both the visual and textual information in product images. The project leverages an Amazon product dataset and introduces a custom model that significantly improves performance on questions about structured data such as nutrition facts tables.
## 📌 Project Objective

The goal of textify-VQA is to accurately answer questions about product attributes, especially those embedded in tables, by integrating:

- Visual content of product images
- OCR-based textual features
- Pretrained and fine-tuned VQA models
- A custom hybrid architecture for deeper visual-textual understanding
## 🧰 Key Features

- ✅ Pretrained BLIP model for answering general visual questions
- ✅ Fine-tuned BLIP model adapted to Amazon product QA data
- ✅ Custom VQA architecture combining:
  - OCR text features (text, bounding boxes, orientation)
  - Visual embeddings
  - Question embeddings
- ✅ Strong performance on table-related questions (e.g., nutritional facts, calorie content)
## 📦 Dataset

The dataset is an Amazon product VQA dataset, consisting of:
- Product images
- User-generated questions and answers
- A mix of visual and text-based queries, especially from packaging and tables
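
For illustration only, one record in such a dataset could be organized like this (the field names are hypothetical, not the actual schema produced by `prepare_dataset.ipynb`):

```python
# Hypothetical layout of a single QA record (field names are illustrative,
# not the project's actual metadata schema).
sample = {
    "image_path": "images/B000123ABC.jpg",                 # product photo
    "question": "How many calories are in one serving?",   # user-generated question
    "answer": "120",                                        # ground-truth answer
    "question_type": "table",                               # e.g. "visual" vs. "table"
}
```
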
## 🔍 Methodology

### 1. Raw BLIP Evaluation

Run zero-shot inference with the base BLIP model to answer general visual queries (e.g., product color, shape).
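
As a reference point, zero-shot inference with a pretrained BLIP VQA checkpoint from HuggingFace can be run roughly like this (the checkpoint name, image path, and question are placeholders; the actual evaluation code is in `test_raw_blip.ipynb`):

```python
# Zero-shot VQA with a pretrained BLIP checkpoint (paths and question are placeholders).
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("images/example_product.jpg").convert("RGB")
question = "What color is the bottle?"

inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
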
### 2. Fine-Tuned BLIP

Adapt BLIP to domain-specific questions by fine-tuning it on the Amazon dataset to improve accuracy.
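
A condensed sketch of what this fine-tuning could look like, assuming (image, question, answer) triples from the Amazon dataset. The hyperparameters and data handling below are illustrative; the actual training code lives in `finetune_blip.ipynb`:

```python
# Illustrative fine-tuning loop for BLIP VQA (hyperparameters and data handling
# are assumptions; see finetune_blip.ipynb for the real training code).
import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")


def collate(batch):
    # batch: list of (PIL.Image, question string, answer string) tuples
    images, questions, answers = zip(*batch)
    enc = processor(images=list(images), text=list(questions),
                    padding=True, return_tensors="pt")
    # For simplicity, padded label positions are not masked out of the loss.
    enc["labels"] = processor.tokenizer(list(answers), padding=True,
                                        return_tensors="pt").input_ids
    return enc


def finetune(train_samples, epochs=3, lr=5e-5, batch_size=8):
    """train_samples: list of (PIL.Image, question, answer) tuples from the Amazon VQA data."""
    loader = DataLoader(train_samples, batch_size=batch_size,
                        shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```
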
### 3. Custom OCR-Based Model

Introduce a novel architecture (a minimal sketch follows this list) that integrates:

- OCR features (extracted text, position, angle)
- BLIP visual features
- Question embeddings from a language model
- Multimodal fusion for enhanced understanding of structured data in images
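
Below is a minimal sketch of how such a fusion could be wired together, assuming per-token OCR features (e.g. a text embedding concatenated with normalized box coordinates and a rotation angle, as produced by an OCR engine such as Tesseract), pooled BLIP image features, and a question embedding from a language model. The module names, dimensions, and fusion strategy are illustrative, not the actual architecture defined in `custom_model.ipynb`:

```python
# Illustrative fusion model: OCR token features + image features + question features.
# Dimensions and layer choices are assumptions, not the project's exact architecture.
import torch
import torch.nn as nn


class OCRFusionVQA(nn.Module):
    """Answers a question by attending from the question over OCR tokens and
    combining the result with global image and question representations."""

    def __init__(self, ocr_dim=773, vis_dim=768, q_dim=768, hidden=512, num_answers=1000):
        # ocr_dim: e.g. 768-dim token text embedding + 4 normalized box coords + 1 angle
        super().__init__()
        self.ocr_proj = nn.Linear(ocr_dim, hidden)
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        # Cross-attention lets the question focus on relevant OCR tokens,
        # e.g. the "Calories" row of a nutrition facts table.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, num_answers)
        )

    def forward(self, ocr_feats, vis_feat, q_feat):
        # ocr_feats: (B, N, ocr_dim)  vis_feat: (B, vis_dim)  q_feat: (B, q_dim)
        ocr = self.ocr_proj(ocr_feats)
        vis = self.vis_proj(vis_feat)
        q = self.q_proj(q_feat)
        attended, _ = self.cross_attn(q.unsqueeze(1), ocr, ocr)   # (B, 1, hidden)
        fused = torch.cat([attended.squeeze(1), vis, q], dim=-1)  # (B, 3 * hidden)
        return self.classifier(fused)                             # answer logits


# Smoke test with random stand-ins for OCR, image, and question features.
model = OCRFusionVQA()
logits = model(torch.randn(2, 32, 773), torch.randn(2, 768), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 1000])
```

Treating the answer as a classification over frequent answers is one common VQA formulation; a generative decoder over the fused representation would be an alternative design.
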
## 📊 Results Summary

| Model           | Visual Question Accuracy | Table Question Accuracy |
|-----------------|--------------------------|-------------------------|
| Raw BLIP        | ✅ Good                  | ❌ Poor                 |
| Fine-tuned BLIP | ✅ Better                | ⚠️ Moderate             |
| Custom Model    | ✅ Good                  | ✅ Excellent            |
## 🚀 Getting Started

**Install dependencies**

```bash
pip install -r requirements.txt
```

**Prepare the dataset**

Run `prepare_dataset.ipynb` to download the images and convert the metadata.

**Run raw BLIP inference**

Run `test_raw_blip.ipynb`.

**Fine-tune BLIP**

Run `finetune_blip.ipynb`.

**Train and evaluate the custom model**

Run `custom_model.ipynb`.
## 📌 Future Work

- Improve OCR accuracy with larger models (e.g., Donut, LayoutLMv3)
- Unified transformer architecture for joint learning
- Deployment as an API or web app for live inference
- Multilingual support
## 🤝 Acknowledgements

- BLIP (Salesforce)
- HuggingFace Transformers
- Amazon VQA Dataset
## 📃 License

MIT License