Vision-language models (VLMs) have improved dramatically, but their capabilities and limitations remain poorly understood. This page documents a series of experiments probing what these models can and cannot do.

Experiment 1: Spatial Reasoning

Question: Can VLMs accurately describe spatial relationships between objects?

Method: Generated 100 images with known object positions. Asked models to describe relative positions (left/right, above/below, in front/behind).
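
A minimal sketch of the evaluation loop, assuming a synthetic scene generator and a generic `ask_model(image_path, prompt)` callable for whichever VLM is under test; both names and the prompt wording are hypothetical, not the exact harness used:

```python
import random
from typing import Callable

RELATIONS = ["left of", "right of", "above", "below", "in front of", "behind"]

def make_scene(seed: int) -> dict:
    """Hypothetical stand-in for the synthetic scene generator: places two
    objects and records the ground-truth relation of A to B."""
    rng = random.Random(seed)
    return {"image": f"scene_{seed}.png", "relation": rng.choice(RELATIONS)}

def score_spatial(ask_model: Callable[[str, str], str], n_scenes: int = 100) -> float:
    """ask_model(image_path, prompt) -> answer text from the model under test."""
    prompt = ("Describe where object A is relative to object B "
              "(left of / right of / above / below / in front of / behind).")
    correct = 0
    for i in range(n_scenes):
        scene = make_scene(i)
        answer = ask_model(scene["image"], prompt)
        # Count the answer as correct only if it names the ground-truth relation.
        correct += scene["relation"] in answer.lower()
    return correct / n_scenes
```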

Results:

Model        Accuracy
GPT-4V       78%
LLaVA 1.5    62%
Claude 3     81%

Finding: Models struggle more with depth relations (in front/behind) than with planar ones (left/right, above/below).

Experiment 2: Counting

Question: How accurately can VLMs count objects in images?

Method: Images with 1-20 instances of various objects. Asked for exact counts.
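
A rough sketch of how counting accuracy can be binned by true count and checked for bias; `ask_model` is again a hypothetical stand-in for the VLM call, and the prompt wording is illustrative:

```python
import re
from collections import defaultdict
from typing import Callable

def parse_count(answer: str) -> int | None:
    """Pull the first integer out of a free-form answer, if any."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def count_eval(samples: list[tuple[str, int]],
               ask_model: Callable[[str, str], str]) -> dict[int, float]:
    """samples: (image_path, true_count) pairs covering 1-20 objects.
    Returns accuracy per true count; the mean signed error is printed,
    with negative values indicating under-counting."""
    prompt = "How many objects are in this image? Answer with a single number."
    hits, totals, errors = defaultdict(int), defaultdict(int), []
    for image, truth in samples:
        pred = parse_count(ask_model(image, prompt))
        totals[truth] += 1
        if pred is not None:
            hits[truth] += int(pred == truth)
            errors.append(pred - truth)
    print("mean signed error:", sum(errors) / len(errors) if errors else "n/a")
    return {k: hits[k] / totals[k] for k in sorted(totals)}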

Results: Accuracy dropped sharply above 7 items, and all models systematically under-counted larger quantities.

Experiment 3: Text in Images

Question: Can VLMs reliably extract and reason about text within images?

Method: Screenshots of documents, signs, and interfaces. Asked both extraction and comprehension questions.
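
A sketch of the two question types, with hypothetical prompt templates (the exact wording used in the experiment may differ):

```python
from typing import Callable

EXTRACTION_PROMPT = "Transcribe all text visible in this image exactly as written."
COMPREHENSION_PROMPT = ("Context: {context}\n"
                        "Using the text visible in the image, answer: {question}")

def text_probe(image: str, context: str, question: str,
               ask_model: Callable[[str, str], str]) -> dict[str, str]:
    """Ask one extraction and one comprehension question about the same image.
    The amount of context passed to the comprehension prompt is the variable
    the results suggest matters most."""
    return {
        "extraction": ask_model(image, EXTRACTION_PROMPT),
        "comprehension": ask_model(
            image, COMPREHENSION_PROMPT.format(context=context, question=question)),
    }
```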

Results: Extraction is reliable for clear text. Comprehension depends heavily on context provided in the prompt.

Experiment 4: Temporal Reasoning from Stills

Question: Can VLMs infer what happened before/after a single image?

Method: Images of scenes with clear temporal implications (broken glass, melting ice, etc.). Asked about preceding and subsequent events.
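
A sketch of the before/after prompting, including a grounded variant that adds an explicit instruction against unsupported detail; the prompt wording and function names are illustrative, not the exact protocol:

```python
from typing import Callable

BEFORE_PROMPT = "What most likely happened just before this photo was taken?"
AFTER_PROMPT = "What is most likely to happen next in this scene?"
GROUNDING = " Mention only things that can be inferred from visible evidence."

def temporal_probe(image: str, ask_model: Callable[[str, str], str],
                   grounded: bool = False) -> dict[str, str]:
    """Ask for preceding and subsequent events; the grounded variant appends
    an instruction meant to discourage confabulated detail."""
    suffix = GROUNDING if grounded else ""
    return {
        "before": ask_model(image, BEFORE_PROMPT + suffix),
        "after": ask_model(image, AFTER_PROMPT + suffix),
    }
```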

Results: Models showed strong priors but sometimes confabulated details that could not be inferred from the image.

Key Takeaways

  1. Spatial reasoning is weaker than expected, especially for 3D relationships
  2. Counting has hard limits around 7 items, consistent with human cognitive literature
  3. Text extraction works; text reasoning requires careful prompting
  4. Temporal inference is possible but unreliable without explicit grounding

Ongoing

Currently exploring:

  • Chain-of-thought prompting for spatial tasks
  • Ensemble methods for counting (sketched below)
  • Multimodal RAG architectures
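
For the ensemble idea, one direction is to query the model several times with varied prompts and take the median of the parsed counts; a minimal sketch, assuming the same hypothetical `ask_model` callable as above:

```python
import re
import statistics
from typing import Callable

def ensemble_count(image: str, ask_model: Callable[[str, str], str],
                   n_queries: int = 5) -> int | None:
    """Query the model n_queries times with varied prompts and return the
    median of the parsed counts. This only helps if errors are not
    perfectly correlated across samples."""
    prompts = [
        "How many objects are in this image? Answer with a number.",
        "Count the objects one by one, then give the total as a number.",
    ]
    counts = []
    for i in range(n_queries):
        answer = ask_model(image, prompts[i % len(prompts)])
        match = re.search(r"\d+", answer)
        if match:
            counts.append(int(match.group()))
    return round(statistics.median(counts)) if counts else None
```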

All code and datasets available upon request.