Vision-language models have improved dramatically, but their capabilities and limitations remain poorly understood. This page documents a series of experiments probing what these models can and cannot do.
## Experiment 1: Spatial Reasoning
Question: Can VLMs accurately describe spatial relationships between objects?
Method: Generated 100 images with known object positions. Asked models to describe relative positions (left/right, above/below, in front/behind).
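To make the setup concrete, here is a minimal sketch of how such ground-truth image/label pairs can be generated (assuming Pillow). The function name, the two-square design, and all parameters are illustrative assumptions, not the actual harness:

```python
# Minimal sketch of the synthetic-image setup; make_pair_image and its
# two-square design are illustrative, not the actual harness.
import random
from PIL import Image, ImageDraw

def make_pair_image(size=512):
    """Draw a red and a blue square at random positions; return image + ground truth."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    (x1, y1), (x2, y2) = [
        (random.randint(50, size - 100), random.randint(50, size - 100))
        for _ in range(2)
    ]
    draw.rectangle([x1, y1, x1 + 50, y1 + 50], fill="red")
    draw.rectangle([x2, y2, x2 + 50, y2 + 50], fill="blue")
    truth = {
        "red_left_of_blue": x1 < x2,
        "red_above_blue": y1 < y2,  # image coordinates: smaller y is higher
    }
    return img, truth
```

Note that a flat 2D sketch like this can only probe planar relations; testing in front/behind requires occlusion cues or rendered 3D scenes.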
Results:
| Model | Accuracy |
|---|---|
| GPT-4V | 78% |
| LLaVA 1.5 | 62% |
| Claude 3 | 81% |
Finding: Models struggle with depth relationships (in front/behind) more than with planar ones.
## Experiment 2: Counting
Question: How accurately can VLMs count objects in images?
Method: Images with 1–20 instances of various objects. Asked for exact counts.
Results: Accuracy drops sharply above 7 items, and all models systematically under-count larger quantities.
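A sketch of how the per-count breakdown behind this result might be computed; `count_error_profile` is hypothetical, and the `(true_count, predicted_count)` pair format is an assumption about how results are logged:

```python
# Hypothetical per-count analysis; assumes results are logged as
# (true_count, predicted_count) pairs.
from collections import defaultdict

def count_error_profile(results):
    """Map each true count to exact-match accuracy and mean signed error."""
    by_count = defaultdict(list)
    for true, pred in results:
        by_count[true].append(pred - true)  # negative error = under-count
    return {
        true: {
            "accuracy": sum(e == 0 for e in errs) / len(errs),
            "mean_signed_error": sum(errs) / len(errs),
        }
        for true, errs in sorted(by_count.items())
    }
```

In a profile like this, the systematic under-counting reported above shows up as an increasingly negative `mean_signed_error` at higher true counts.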
## Experiment 3: Text in Images
Question: Can VLMs reliably extract and reason about text within images?
Method: Screenshots of documents, signs, and interfaces. Asked both extraction and comprehension questions.
Results: Extraction is reliable for clearly legible text; comprehension depends heavily on the context supplied in the prompt.
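For scoring extraction, a lenient string-similarity measure avoids penalizing trivial whitespace and casing differences. The sketch below uses the standard library's `SequenceMatcher` as a stand-in for a proper OCR metric such as character error rate; it is not necessarily the metric used in these experiments:

```python
# Lenient extraction scoring; SequenceMatcher is a stand-in for a proper
# OCR metric such as character error rate.
from difflib import SequenceMatcher

def extraction_score(predicted: str, reference: str) -> float:
    """Similarity in [0, 1] after collapsing whitespace and lowercasing."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(predicted), norm(reference)).ratio()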
## Experiment 4: Temporal Reasoning from Stills
Question: Can VLMs infer what happened before/after a single image?
Method: Images of scenes with clear temporal implications (broken glass, melting ice, etc.). Asked about preceding and subsequent events.
Results: Models show strong priors but sometimes confabulate details that cannot be inferred from the image.
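One way to suppress such confabulation is to force the model to tie every inference to visible evidence. The prompt below is illustrative of this kind of explicit grounding, not the prompt used in these experiments:

```python
# Illustrative grounding prompt; not the prompt used in these experiments.
GROUNDED_TEMPORAL_PROMPT = (
    "Describe only what is visible in this image. "
    "First, list the concrete visual cues you can see. "
    "Then, for each inference about what happened before or after this moment, "
    "cite the cue that supports it. "
    "If no visible cue supports an inference, answer 'cannot be determined'."
)
```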
## Key Takeaways
- Spatial reasoning is weaker than expected, especially for 3D relationships
- Counting has hard limits around 7 items, consistent with limits reported in the human cognition literature
- Text extraction works; text reasoning requires careful prompting
- Temporal inference is possible but unreliable without explicit grounding
## Ongoing
Currently exploring:
- Chain-of-thought prompting for spatial tasks
- Ensemble methods for counting (a minimal majority-vote sketch follows this list)
- Multimodal RAG architectures
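As a starting point for the counting ensemble, a simple majority vote over repeated queries; `ensemble_count` is a sketch, and `query_count` is a hypothetical stand-in for a real VLM call:

```python
# Majority-vote counting ensemble; query_count is a hypothetical stand-in
# for a real VLM call that returns an integer count.
from collections import Counter

def ensemble_count(image, query_count, n_samples=5):
    """Query the model n_samples times and return the modal count.

    Ties break toward the smaller count (an arbitrary but deterministic choice).
    """
    votes = Counter(query_count(image) for _ in range(n_samples))
    return max(votes.items(), key=lambda kv: (kv[1], -kv[0]))[0]
```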
All code and datasets available upon request.