Francesco Mattia

Posted on May 22, 2024 • Edited on May 25, 2024

Unlocking Vision: Evaluating LLMs for Home Security

#genai #computervision #homeautomation #languagemodels

Introduction

I am diving into the vision capabilities of large language models (LLMs) to see if they can classify images accurately, specifically by spotting the position of a door handle to tell whether the door is locked or unlocked. This experiment includes basic tests of accuracy, speed, and token usage, offering an initial comparison across models.

Code on GitHub

Scenario

Imagine using a webcam to monitor door security, providing images of door handles in different lighting conditions (day and night). The system’s goal is to classify the handle’s position—vertical (locked) or horizontal (unlocked)—and report the status in a parseable format like JSON. This could be a valuable feature in home automation systems. While traditional machine learning models, which require specific training, might achieve better performance, this experiment explores the potential of large language models (LLMs) in this task.

Approach

First, I took some pictures and uploaded them, without writing any code, to the web interfaces of two leading models (Claude 3 Opus and OpenAI GPT-4) to see whether they could classify door handle positions accurately. Was this approach viable, or would it end up being a waste of time?

The initial results were encouraging, but I needed to verify if the models could consistently perform well. With a binary classifier, there’s a 50% chance of guessing correctly, so I wanted to ensure the accuracy was truly meaningful.
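To put a number on "truly meaningful", here is a minimal sketch that computes how likely a given score is under pure guessing. It assumes a balanced test set (so each guess is right with probability 0.5) and independent guesses; the function names and the example scores are illustrative, not part of the original script.

```javascript
// Binomial coefficient C(n, k), built up iteratively so every step stays an integer.
function binomialCoefficient(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability of getting at least k of n images right by guessing,
// where each guess is correct with probability p (assumed 0.5 here).
function chanceOfAtLeast(k, n, p = 0.5) {
  let total = 0;
  for (let i = k; i <= n; i++) {
    total += binomialCoefficient(n, i) * p ** i * (1 - p) ** (n - i);
  }
  return total;
}

// Illustrative scores out of 20 images.
console.log(chanceOfAtLeast(12, 20).toFixed(3)); // 0.252
console.log(chanceOfAtLeast(16, 20).toFixed(4)); // 0.0059
```

In other words, a coin flip reaches 12 out of 20 about a quarter of the time, while 16 out of 20 is very unlikely to be luck.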

To make the outputs as repeatable as possible, I set the temperature parameter of each API request to 0.0. To save on tokens and improve processing speed, I resized the images to fit within 200x200 pixels using ImageMagick's convert command:

convert original_image.jpg -resize 200x200 resized.jpg

Next, I wrote a script to access Anthropic models, comparing the classification results to the actual positions indicated by the image filenames (v for vertical, h for horizontal).

./locks_classifier.js -m Haiku -v
🤖 Haiku
images/test01_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 794 ms
images/test02_v.jpg ✅
📊 In: 202 tkn Out: 11 Time: 1073 ms
images/test03_h.jpg ❌
📊 In: 202 tkn Out: 11 Time: 604 ms

Correct Responses: (12 / 20) 60%
Total In Tokens: 3976
Total Out Tokens: 220
Avg Time: 598 ms
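The scoring step can be sketched roughly as follows. This is not the author's script: the function names are hypothetical, and the reply format `{"position": "vertical"}` is an assumption, since the post only says the status should come back as parseable JSON. The only detail taken from the post is the `_v` / `_h` filename convention.

```javascript
// Ground truth from the filename: "_v" means vertical, "_h" means horizontal.
function expectedFromFilename(filename) {
  const match = /_([vh])\.[a-z]+$/i.exec(filename);
  if (!match) throw new Error(`No _v/_h suffix in ${filename}`);
  return match[1].toLowerCase() === "v" ? "vertical" : "horizontal";
}

// Assumed reply shape: {"position": "vertical"} or {"position": "horizontal"}.
function parsePosition(replyText) {
  try {
    const position = JSON.parse(replyText).position;
    return position === "vertical" || position === "horizontal" ? position : null;
  } catch {
    return null; // unparseable replies count as wrong answers
  }
}

// Count how many replies match the label encoded in the filename.
function score(results) {
  const correct = results.filter(
    ({ file, reply }) => parsePosition(reply) === expectedFromFilename(file)
  ).length;
  const percent = Math.round((100 * correct) / results.length);
  return { correct, total: results.length, percent };
}

// Made-up replies, for illustration only.
const sample = [
  { file: "images/test01_v.jpg", reply: '{"position": "vertical"}' },
  { file: "images/test02_v.jpg", reply: '{"position": "vertical"}' },
  { file: "images/test03_h.jpg", reply: '{"position": "vertical"}' },
];
console.log(score(sample)); // { correct: 2, total: 3, percent: 67 }
```

Treating malformed JSON as a wrong answer keeps the accuracy figure honest when a model ignores the requested format.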

At 60% accuracy, Haiku was only slightly better than chance, which was underwhelming. Sonnet was less accurate still, at a similar speed.

I experimented with few-shot examples embedded in the prompt, but this did not improve the results.

Out of curiosity, I also tested OpenAI models, adapting my scripts to accommodate their slightly different APIs (it’s frustrating that there isn’t a standard yet, right?).
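To illustrate how the two APIs differ, here is a sketch of the request bodies as I understand them: Anthropic's Messages API takes the image as a base64 `source` block, while OpenAI's Chat Completions API takes it as a data URL. The prompt text, the `max_tokens` value, and the function names are assumptions; the post does not show the actual prompt or script internals.

```javascript
// Hypothetical prompt; the one used in the experiment is not shown in the post.
const PROMPT =
  'Is the door handle vertical or horizontal? Reply with JSON only, e.g. {"position": "vertical"}.';

// Body for POST https://api.anthropic.com/v1/messages
function anthropicBody(model, base64Jpeg) {
  return {
    model,
    max_tokens: 50, // assumed cap; the reply is a tiny JSON object
    temperature: 0,
    messages: [
      {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: "image/jpeg", data: base64Jpeg } },
          { type: "text", text: PROMPT },
        ],
      },
    ],
  };
}

// Body for POST https://api.openai.com/v1/chat/completions
function openaiBody(model, base64Jpeg) {
  return {
    model,
    max_tokens: 50,
    temperature: 0,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: PROMPT },
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` } },
        ],
      },
    ],
  };
}
```

Keeping the two builders side by side means the rest of a script (file loading, timing, scoring) can stay identical across providers.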

The results with OpenAI models were significantly better. They were roughly three to four times slower than Haiku on average, but much more accurate.

GPT-4-Turbo:

./locks_classifier.js -m GPT4 -v
Responses: (16 / 20) 80% 
In Tokens: 6360 Out Tokens: 240
Avg Time: 2246 ms

The just-released GPT-4o:

./locks_classifier.js -m GPT4o -v
Responses: (20 / 20) 100% 
In Tokens: 6340 Out Tokens: 232
Avg Time: 1751 ms

What I learnt

1) LLM Performance: I was curious to see how the models would perform, and I am quite impressed by GPT-4o. It delivered high accuracy and reasonable speed. On the other hand, Haiku’s performance was somewhat disappointing, although its lower cost and faster response time make it appealing for many applications. There’s definitely potential to explore Haiku further.

2) Temperature 0.0: I was surprised by the varying responses even with the temperature set to 0.0, which should theoretically produce consistent results. This variability was unexpected and suggests that other factors may be influencing the outputs. Any ideas on why this might be happening?

🤖 Haiku *Run #1*
Responses: (5 / 11) 45%
In Tokens: 2222 Out Tokens: 121
Avg Time: 585 ms

🤖 Haiku *Run #2*
Correct Responses: (7 / 11) 64% 
In Tokens: 2222 Out Tokens: 121 
Avg Time: 585 ms

🤖 Haiku *Run #3*
Correct Responses: (4 / 11) 36% 
In Tokens: 2222 Out Tokens: 121
Avg Time: 583 ms
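Hosted models are not necessarily fully deterministic even at temperature 0.0, so one way to investigate is to find out which images flip between runs. The sketch below assumes each run is stored as an object mapping filename to predicted position; that structure, the function name, and the sample data are all hypothetical.

```javascript
// Return the files whose prediction is not identical across all runs.
function unstableFiles(runs) {
  const files = Object.keys(runs[0]);
  return files.filter((file) => new Set(runs.map((run) => run[file])).size > 1);
}

// Made-up predictions, for illustration only.
const runs = [
  { "test01_v.jpg": "vertical", "test03_h.jpg": "vertical" },
  { "test01_v.jpg": "vertical", "test03_h.jpg": "horizontal" },
  { "test01_v.jpg": "vertical", "test03_h.jpg": "vertical" },
];
console.log(unstableFiles(runs)); // [ 'test03_h.jpg' ]
```

If the same few images flip every time, the model is probably near a decision boundary on them, which is a different problem from random noise spread across the whole set.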

3) Variability in Tokenization: The number of input tokens counted for the same image varies significantly between models, from 156 for Sonnet to 318 for GPT-4. This affects cost estimates and efficiency, since token usage directly drives the expense of using these models.

Model	Input tokens	Output tokens	$/M input tokens	$/M output tokens	Images per $1
Haiku	202	11	$0.25	$1.25	15,563
Sonnet	156	11	$3.00	$15.00	1,579
GPT-4	318	12	$5.00	$15.00	565
GPT-4o	317	12	$10.00	$30.00	283
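The "Images per $1" column follows from the token counts and prices: cost per image is input tokens times input price plus output tokens times output price, with prices quoted per million tokens. A minimal sketch, using the Sonnet row as input (the function and field names are my own):

```javascript
// Prices are in dollars per million tokens.
function costPerImage({ inTokens, outTokens, inPrice, outPrice }) {
  return (inTokens * inPrice + outTokens * outPrice) / 1e6;
}

function imagesPerDollar(usage) {
  return 1 / costPerImage(usage);
}

// Values from the Sonnet row of the table above.
const sonnet = { inTokens: 156, outTokens: 11, inPrice: 3.0, outPrice: 15.0 };
console.log(costPerImage(sonnet).toFixed(6)); // 0.000633
console.log(Math.floor(imagesPerDollar(sonnet))); // 1579
```

Because the replies are only about a dozen tokens long, the input side of the formula accounts for most of the cost.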

4) Variability in Response Time: I did not expect the same model, given the same input size, to have such a wide range of response times. This variability suggests that there are underlying factors affecting the inference speed.

Model	Avg response time (ms)	Min response time (ms)	Max response time (ms)
Haiku	598	351	1073
Sonnet	605	468	1011
GPT-4	2246	1716	6037
GPT-4o	1751	1172	4559
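With maximums several times larger than the minimums, a single slow request can pull the average up noticeably, so it may be worth reporting a median alongside it. A small sketch of such a summary follows; the function name is hypothetical and the sample timings are the three Haiku requests shown in the verbose log earlier.

```javascript
// Summarize a list of response times in milliseconds.
function summarizeTimes(timesMs) {
  const sorted = [...timesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const sum = sorted.reduce((a, b) => a + b, 0);
  return {
    avg: Math.round(sum / sorted.length),
    median,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

console.log(summarizeTimes([794, 1073, 604]));
// { avg: 824, median: 794, min: 604, max: 1073 }
```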

Overall, while the accuracy and results are interesting, they can vary significantly depending on the images used. For instance, would larger images improve the performance of models like Haiku and Sonnet?

Next steps

Here are a few ideas to dive deeper into:

1. Explore Different Challenges: Consider swapping the current challenge with a different task to further test the capabilities of LLMs in various scenarios.

2. Test Local Vision-Enabled Models: Evaluate models like LLaVA 1.5 7B running locally through tools such as LM Studio or Ollama. Would a local LLM be a viable option?

3. Compare with Traditional ML Models: Conduct tests against more traditional machine learning models to see how many sample images are needed to achieve similar or better accuracy.

Let me know if you have any comments or questions. I’d love to hear your suggestions on where to go next and what tests you’d like to see conducted!

Top comments (1)

Francesco Mattia • May 4 '25

Nearly a year on, the pricing landscape has changed dramatically. GPT‑4o’s rate has fallen from $10/M input tokens and $30/M output tokens to just $2.50 and $10 respectively, a cut of 75% on input and about 67% on output. New arrivals such as GPT‑4.1 and its “mini” variant have proved reliable, with GPT‑4.1‑mini standing out as both the fastest and the cheapest of the models that stayed accurate.

Keep in mind that your real cost per image is price × tokens consumed. Because this task is simple classification, the outputs are tiny, so input tokens dominate the bill. Models like GPT‑4.1‑mini and the Sonnet line keep input usage low, whereas GPT‑4.1 and GPT‑4o need roughly 2 to 3 times as many input tokens per image. The table below shows how those two factors, token price and token usage, combine.

Model	Success %	Input tokens (avg)	Output tokens (avg)	$/M input	$/M output	Images per $1	Avg time (ms)
Haiku 3 (Bedrock)	70 %	122.8	8.0	—	—	—	834
GPT‑4o	100 %	317.0	11.8	—	—	—	1 556
Sonnet 3.5	95 %	122.8	9.6	$3	$15	1 952	1 237
Sonnet 3.7	90 %	122.8	19.2	$3	$15	1 522	1 458
GPT‑4.1	100 %	317.0	9.8	$2	$8	1 403	1 630
GPT‑4.1‑mini	90 %	138.4	9.9	$0.40	$1.60	14 029	701
GPT‑4.1‑nano	50 %	177.2	6.0	$0.10	$0.40	—	685