Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at photos of laptops and identify physical damage, whether a cracked screen, missing keys, broken hinges or something else. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly became more complicated.
Along the way, we ran into problems with hallucinations, unreliable outputs and images that weren't laptops at all. To solve them, we applied an agentic framework in an atypical way: not to automate tasks, but to improve the model's performance.
In this post, we'll walk through what we tried, what didn't work, and how a combination of approaches ultimately helped us build something we could trust.
Where we started: Monolithic Prompt
Our first approach was fairly standard for multimodal models: we handed the image to an image-capable LLM with a single large prompt and asked it to identify visible damage. This monolithic prompting strategy is easy to implement and works reasonably well for clean, well-defined tasks. But real-world data rarely plays along.
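As a rough illustration (not our exact setup), a monolithic prompt call might look like the sketch below; the OpenAI client, the gpt-4o model and the prompt wording are stand-ins.

```python
# Minimal sketch of the monolithic-prompt approach (illustrative only).
# Provider, model name and prompt text are assumptions, not the production setup.
import base64
from openai import OpenAI

client = OpenAI()

MONOLITHIC_PROMPT = (
    "You are inspecting a photo of a laptop. List every piece of visible "
    "physical damage (cracked screen, missing keys, broken hinges, dents, etc.). "
    "If the photo does not show a laptop, say so."
)

with open("laptop.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed image-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": MONOLITHIC_PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```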
We ran into three major issues early on.
- Hallucinations: The model would invent damage that didn't exist or mislabel what it saw.
- Junk image detection: There was no reliable way to flag images that weren't laptops at all; photos of desks, walls or people occasionally slipped through and received meaningless damage reports.
- Inconsistent accuracy: The combination of these issues made the model too unreliable for operational use.
This was the point at which it became clear we would need to iterate.
First fix: Mixing image resolution
One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, from sharp and high-resolution to blurry. This led us to research highlighting how image resolution affects the performance of deep learning models.
We trained and tested the model on a mix of high-resolution and low-resolution images, the idea being to make it more resilient to the wide range of image quality we encounter in practice. This helped improve consistency, but the core issues of hallucination and junk image handling persisted.
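We won't go into the exact augmentation pipeline here, but the general idea can be sketched as below; the degradation factors and probability are placeholders, not our actual settings.

```python
# Illustrative sketch of resolution mixing: randomly degrade some training
# images by downscaling and re-upscaling them to mimic blurry uploads.
import random
from PIL import Image

def degrade(image: Image.Image, factor: int) -> Image.Image:
    """Simulate a low-resolution capture by downscaling, then upscaling back."""
    w, h = image.size
    small = image.resize((max(1, w // factor), max(1, h // factor)), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def augment(image: Image.Image, low_res_prob: float = 0.5) -> Image.Image:
    # Keep roughly half the images sharp; degrade the rest by a random factor.
    if random.random() < low_res_prob:
        return degrade(image, factor=random.choice([2, 4, 8]))
    return image
```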
Multimodal detour: Making a text-only LLM multimodal
Inspired by recent experiments combining image captioning with text-only LLMs, such as the techniques covered in The Batch, we decided to test whether captions generated from the image could be interpreted by a language model.
Here's how it works:
- The LLM starts by generating multiple possible captions for an image.
- A multimodal embedding model then checks how well each caption fits the image. In our case, we used SigLIP to score the similarity between the image and each caption.
- The system keeps the top few captions based on these scores.
- The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
- The process repeats until the captions stop improving or an iteration limit is reached (a rough sketch of this loop follows the list).
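In the sketch below, the SigLIP scoring uses the Hugging Face transformers API with an assumed public checkpoint, and generate_captions() is a hypothetical stand-in for the text-only LLM call; this is illustrative, not our production code.

```python
# Rough sketch of the caption-and-score loop (illustrative only).
# generate_captions() is a hypothetical text-only LLM call; the SigLIP
# checkpoint name is an assumption.
import torch
from PIL import Image
from transformers import SiglipModel, SiglipProcessor

model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")
processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")

def score_captions(image: Image.Image, captions: list[str]) -> list[float]:
    """Return an image-text similarity score for each candidate caption."""
    inputs = processor(text=captions, images=image, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
    return logits.squeeze(0).tolist()

def refine_captions(image: Image.Image, max_rounds: int = 3, keep_top: int = 3) -> list[str]:
    captions = generate_captions(seed_captions=None)        # hypothetical LLM call
    best = captions[:keep_top]
    for _ in range(max_rounds):
        scores = score_captions(image, captions)
        ranked = sorted(zip(scores, captions), reverse=True)
        best = [caption for _, caption in ranked][:keep_top]
        captions = generate_captions(seed_captions=best)    # ask the LLM to improve on the best captions
    return best
```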
Smart in theory, this approach introduced new problems for our use case.
- Persistent hallucinations: The captions themselves sometimes contained imaginary damage, which the LLM then confidently reported.
- Incomplete coverage: Even with multiple captions, some issues were missed entirely.
- More complexity for little benefit: The added steps made the system more complicated without reliably outperforming the previous setup.
It was an interesting experiment, but in the end it wasn’t a solution.
Creative use of agentic frameworks
This was the turning point. Agentic frameworks are usually used to orchestrate task flows (think of an agent coordinating calendar invites or customer service actions), but we wondered whether breaking the image interpretation task into smaller, specialized agents would help.
We built an agentic framework structured like this (a simplified sketch follows the list):
- Orchestrator agent: Checks the image to determine which laptop components are visible (screen, keyboard, chassis, ports).
- Component agents: Dedicated agents inspect each component for specific damage types, for example, one for cracked screens and one for missing keys.
- Junk detection agent: A separate agent first flags whether the image is of a laptop at all.
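To make the structure concrete, here is a simplified sketch of that pipeline. call_vision_llm() is a hypothetical helper that sends an image plus a narrow prompt to an image-capable LLM and returns its text answer; the prompts and component list are illustrative rather than the production ones.

```python
# Simplified sketch of the modular agent pipeline (illustrative only).
# call_vision_llm(image, prompt) is a hypothetical helper around an
# image-capable LLM; prompts and components are placeholders.
from dataclasses import dataclass, field

@dataclass
class DamageReport:
    is_laptop: bool = True
    findings: dict = field(default_factory=dict)

def junk_detection_agent(image) -> bool:
    """Flag whether the image shows a laptop at all."""
    answer = call_vision_llm(image, "Does this photo show a laptop? Answer yes or no.")
    return answer.strip().lower().startswith("yes")

def orchestrator_agent(image) -> list[str]:
    """Decide which laptop components are visible in the image."""
    answer = call_vision_llm(
        image,
        "Which of these components are visible: screen, keyboard, chassis, ports?"
    )
    return [c for c in ("screen", "keyboard", "chassis", "ports") if c in answer.lower()]

COMPONENT_PROMPTS = {
    "screen": "Is the screen cracked or otherwise damaged? Describe any visible damage.",
    "keyboard": "Are any keys missing or broken? Describe any visible damage.",
    "chassis": "Is the chassis dented, cracked or broken at the hinges?",
    "ports": "Do any ports look bent, blocked or damaged?",
}

def inspect_image(image) -> DamageReport:
    report = DamageReport()
    if not junk_detection_agent(image):
        report.is_laptop = False
        return report                      # short-circuit on junk images
    for component in orchestrator_agent(image):
        # Each component agent answers one narrow, focused question.
        report.findings[component] = call_vision_llm(image, COMPONENT_PROMPTS[component])
    return report
```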
This modular, task-driven approach produced far more accurate and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged, and each agent's task was simple and focused enough to control quality well.
Blind spots: Tradeoffs of the agentic approach
This worked, but it wasn't perfect. Two main limitations emerged.
- Increased latency: Running multiple sequential agents added to the total inference time.
- Coverage gaps: The agents could only detect issues they were explicitly built to look for. If an image showed something unexpected that no agent was responsible for identifying, it went unnoticed.
We needed a way to balance accuracy and coverage.
Hybrid solution: Combining agentic and monolithic approaches
To fill the gap, we created a hybrid system.
- Agentic framework first: The agents ran first, handling precise detection of known damage types and junk images. We limited them to the most critical agents to keep latency down.
- Monolithic prompt second: A monolithic image LLM prompt then scanned the image for anything else the agents might have missed.
- Targeted fine-tuning last: Finally, we fine-tuned the model on image sets curated for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability (a small sketch of this flow follows the list).
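Here is a small sketch of how the pieces fit together at inference time, reusing inspect_image() from the earlier agent sketch; the catch-all prompt and call_vision_llm() remain hypothetical, and the fine-tuning step happens offline so it isn't shown.

```python
# Sketch of the hybrid flow: agents first, then a monolithic catch-all prompt.
CATCH_ALL_PROMPT = (
    "Inspect this laptop photo and list any visible physical damage, "
    "including anything beyond the screen, keyboard, chassis or ports."
)

def hybrid_inspect(image) -> DamageReport:
    report = inspect_image(image)          # agents: junk filter + known damage types
    if not report.is_laptop:
        return report                      # skip the broad scan for junk images
    # Monolithic prompt as a safety net for anything the agents missed.
    report.findings["other"] = call_vision_llm(image, CATCH_ALL_PROMPT)
    return report
```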
This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting, and the confidence boost of targeted fine-tuning.
What we learned
By the time we concluded this project, several things became clear.
- Agentic frameworks are more versatile than they get credit for: They are usually associated with workflow management, but we found that applying them in a structured, modular way can meaningfully improve model performance.
- Blending approaches beats relying on just one: Combining agent-based precision detection with the broad coverage of LLMs, plus targeted fine-tuning where it mattered most, produced far more reliable results than any single method on its own.
- Visual models are prone to hallucination: Even the more advanced setups can jump to conclusions and see things that aren't there. Curbing these mistakes takes thoughtful system design.
- Diversity in image quality makes a difference: Training and testing on both crisp, high-resolution images and everyday low-quality ones helped the model stay resilient when faced with unpredictable real-world photos.
- You need a way to catch junk images: A dedicated check for junk or unrelated photos was one of the simplest changes we made, and it had an outsized impact on the reliability of the whole system.
Final Thoughts
What began as a simple idea, detecting physical damage in laptop images with LLM prompts, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable real-world problems. Along the way, we realized that some of the most useful tools were ones never originally designed for this kind of work.
Agentic frameworks, often thought of as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate but also easier to understand and maintain.
Shruti Tiwari is AI Product Manager at Dell Technologies.
Vadiraj Kulkarni is a data scientist at Dell Technologies.