Comparing Open-Source Vision Models for Photo Description Tasks Using .NET Aspire

In our ongoing series about building a local image summarisation system, we have explored how to combine various open-source technologies to generate meaningful descriptions of photos. Today, we'll tackle a crucial question: How do we choose the best vision model for our needs?

In this article we focus on a simple approach: using OpenAI's GPT-4o as an automated judge to evaluate the quality of summaries generated by different open-source models. In the next sections, we will explore the following:

  • Setting up an evaluation pipeline with .NET Aspire
  • Using GPT-4o to score model outputs
  • Visualising and analysing the results using Jupyter notebooks

Our evaluation covers six prominent open-source vision models:

  1. llama3.2-vision: Latest iteration of Meta's multimodal model
  2. llava-llama3: Vision-language model built on LLaMA architecture
  3. llava:7b: Compact vision-language model suitable for local deployment
  4. llava:13b: Larger variant offering enhanced capabilities
  5. Florence-2-large-ft: Microsoft's vision model known for detailed scene understanding
  6. llava-phi3: Recent addition combining efficiency with strong performance

These models run locally through our Aspire-based infrastructure, which handles the following (a minimal AppHost sketch is shown after the list):

  • Model inference and serving
  • Reverse geocoding for location context
  • Experiment tracking and result storage
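
To give a concrete picture, below is a minimal sketch of what the AppHost wiring could look like. The resource and project names are illustrative rather than the original project's, and Ollama is added as a plain container here (a community Aspire integration for Ollama also exists).

// AppHost Program.cs - a minimal sketch of the orchestration.
var builder = DistributedApplication.CreateBuilder(args);

// MongoDB holds the generated summaries and, later, the evaluation scores.
var photosDb = builder.AddMongoDB("mongo")
    .AddDatabase("photos");

// Ollama serves the open-source vision models listed above.
var ollama = builder.AddContainer("ollama", "ollama/ollama")
    .WithHttpEndpoint(port: 11434, targetPort: 11434, name: "api");

// The service that generates summaries, calls the evaluator and stores results.
// (Projects.PhotoSummary_Api is a placeholder project name.)
builder.AddProject<Projects.PhotoSummary_Api>("photo-api")
    .WithReference(photosDb)
    .WithReference(ollama.GetEndpoint("api"));

builder.Build().Run();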

Now that we can generate summaries, which model should we use for summarising our photo library? This post covers a simple approach to answering that question with the help of a commercial model.

Evaluation Process

As a weekend project, this post explores the following idea: "How about using a commercial model to judge the output generated by open-source models?"

Why GPT-4o?

GPT-4o (released May 2024) offers several advantages as our evaluation model:

  • Multimodal capabilities for analysing both images and text.
  • Consistent scoring methodology.
  • Cost-effective solution.

Besides being a good fit for the task, pricing was also an advantage for a fun project with no budget at all. For instance, 300 evaluation requests (50 images × 6 open-source models) cost around $0.80 in total, roughly a quarter of a cent per request. OpenAI API pricing is available here.

Approach

Our approach can be summarised as follows (a sketch of the request and result types is shown after the list):

  1. Input parameters:

    • Original photo (scaled to 256px width).
    • Model-generated summary.
    • Model used.
    • Categorisation predictions.
    • Top 10 detected objects.
  2. Scoring Criteria:

    • Quality and accuracy of the summary (0-100).
    • Accuracy of category predictions.
    • Precision of object detection.
    • Consistency with image content.
  3. Result Collection:

    • Structured score and justification storage.
    • Integration with existing MongoDB database.
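
For reference, here is a sketch of the request and result shapes implied by the steps above. The property names are assumptions inferred from this description and from how the types are used in the evaluation client later; the project's actual definitions may differ.

// What the evaluator receives per photo (property names are illustrative).
public record ImageSummaryEvaluationRequest(
    string Model,            // which open-source model produced the summary
    string Summary,          // the model-generated description of the photo
    string[] Categories,     // categorisation predictions
    string[] TopObjects);    // top 10 detected objects

// What GPT-4o returns, stored in MongoDB next to the original summary.
public record PhotoSummaryScore(double Score, string Justification, string Evaluator);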

Setting up

For this task, we utilise OpenAIClient from Aspire.OpenAI, as seen in the evaluation client further below.
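
For completeness, this is roughly how the client could be registered, assuming the keyed client registration from the Aspire.OpenAI integration and a connection string named openaiConnection in configuration:

// Program.cs of the evaluating service - a minimal sketch.
var builder = WebApplication.CreateBuilder(args);

// Registers OpenAIClient as a keyed service so it can be resolved with
// [FromKeyedServices("openaiConnection")] in the evaluation client below.
// The connection string (API key, optional endpoint) is read from
// ConnectionStrings:openaiConnection.
builder.AddKeyedOpenAIClient("openaiConnection");

builder.Services.AddSingleton<IPhotoSummaryEvaluator, OpenAiPhotoSummaryEvaluationClient>();

var app = builder.Build();
app.Run();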

Key Implementation Decisions:

  1. Temperature Setting (0.1f):

    • Chosen for consistent, deterministic evaluations.
    • Reduces random variation in scoring.
  2. JSON Schema Format:

    • Ensures structured, parseable responses.
    • Simplifies result processing and storage.
  3. Image Preprocessing:

    • 256px width limitation balances detail and API costs.
    • Consistent sizing ensures fair comparisons.

using System.Text.Json;
using Microsoft.Extensions.DependencyInjection;
using OpenAI;
using OpenAI.Chat;

public class OpenAiPhotoSummaryEvaluationClient([FromKeyedServices("openaiConnection")] OpenAIClient client)
    : IPhotoSummaryEvaluator
{
    private const string SystemPrompt =
        "You are a highly accurate and fair image summarisation evaluation model. "
        + " Your job is to evaluate the quality of summaries generated from images by different computer vision models. \n\n"
        + " When evaluating a summary of the provided image:\n\n"
        + " - Provide a single score ranging between 0 and 100 combining the following properties: \n\n"
        + "    - Quality and accuracyof the summary.\n\n"
        + "    - Quality and accuracy of the categories predicted for the image.\n\n"
        + "    - Quality and accuracy of the objects predicted to be in the image.\n\n"
        + "  - Be fair and consistent when evaluating. \n\n";
    private const string PromptSummary =
        "Please score the provided image summary based on the quality and accuracy of the summary, categories, and objects predicted in the image.";

    public async Task<PhotoSummaryScore> EvaluatePhotoSummary(string base64Image, ImageSummaryEvaluationRequest summary)
    {
        // ... omitted: resize the image to a maximum of 256 px wide and write it to memStream
        var img = ChatMessageContentPart.CreateImagePart(new BinaryData(memStream.ToArray()), "image/jpeg",
            ChatImageDetailLevel.Auto);
        List<ChatMessage> messages =
        [
            new UserChatMessage(PromptSummary, JsonSerializer.Serialize(summary), img),
            new SystemChatMessage(SystemPrompt)
        ];

        var options = new ChatCompletionOptions()
        {
            Temperature = 0.1f,
            ResponseFormat = ChatResponseFormat.CreateJsonSchemaFormat(jsonSchemaFormatName: "image_summary_result",
                jsonSchema: BinaryData.FromString("""
                {
                    "type": "object",
                    "properties": {
                        "Score": { "type": "number"},
                        "Justification": { "type": "string"}
                    },
                    "required": ["Score", "Justification"],
                    "additionalProperties": false
                }
                """),
            jsonSchemaIsStrict: true)
        };
        var completion = await client.GetChatClient("gpt-4o").CompleteChatAsync(messages, options);
        using var structuredJson = JsonDocument.Parse(completion.Value.Content[0].Text);
        var score = structuredJson.RootElement.GetProperty("Score").GetDouble();
        var justification = structuredJson.RootElement.GetProperty("Justification").GetString();
        return new PhotoSummaryScore(score, justification!, "OpenAI");
    }
}
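
The resize step omitted from the snippet above could look roughly like this, sketched here with SixLabors.ImageSharp (the original project may use a different imaging library):

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;

// Scales the image down so it is at most 256 px wide, preserving aspect ratio,
// and re-encodes it as JPEG into a memory stream for the API call.
static MemoryStream ResizeToMaxWidth(Stream source, int maxWidth = 256)
{
    using var image = Image.Load(source);
    if (image.Width > maxWidth)
    {
        var newHeight = (int)(image.Height * (maxWidth / (double)image.Width));
        image.Mutate(ctx => ctx.Resize(maxWidth, newHeight));
    }

    var memStream = new MemoryStream();
    image.SaveAsJpeg(memStream);
    memStream.Position = 0;
    return memStream;
}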

The results from this process are then stored in the database alongside the summaries for the original image. The OpenAI API has rate limits, so it is important to manage how often these calls are made; a simple throttling sketch is shown below.
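
One simple way to respect those limits is to cap concurrency and back off on 429 responses. The sketch below wraps the evaluator from the previous section; the concurrency and retry numbers are placeholders, since the actual limits depend on your OpenAI account tier. The OpenAI .NET library reports HTTP errors via System.ClientModel's ClientResultException, so a 429 can be caught and retried.

using System.ClientModel;

// Limits how many evaluation requests run in parallel and retries on 429s.
public class ThrottledPhotoSummaryEvaluator(IPhotoSummaryEvaluator inner) : IPhotoSummaryEvaluator
{
    private static readonly SemaphoreSlim Throttle = new(initialCount: 2);

    public async Task<PhotoSummaryScore> EvaluatePhotoSummary(string base64Image, ImageSummaryEvaluationRequest summary)
    {
        await Throttle.WaitAsync();
        try
        {
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    return await inner.EvaluatePhotoSummary(base64Image, summary);
                }
                catch (ClientResultException ex) when (ex.Status == 429 && attempt < 3)
                {
                    // Rate limited: wait a little longer on each retry.
                    await Task.Delay(TimeSpan.FromSeconds(10 * attempt));
                }
            }
        }
        finally
        {
            Throttle.Release();
        }
    }
}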

Analysis and visualisation

Our analysis notebook provides the following (a sketch of the score aggregation query appears after the list):

  1. Data Collection:

    • MongoDB query and result aggregation
  2. Visualisation Components:

    • Model comparison table.
    • Example evaluation cases
  3. Validating the outcome:

    • Filter results by best model.
    • Visualise evaluation justifications.
  4. Use an Aspire Command to download and upload the notebook between the development machine and the Jupyter Server running on the Docker host.
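
To illustrate the kind of aggregation the notebook performs, here is a sketch in C# using the MongoDB driver. The collection and field names are assumptions, and the actual notebook may do the equivalent in a different language.

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var evaluations = client.GetDatabase("photos")
    .GetCollection<BsonDocument>("summaryEvaluations");

// Average the GPT-4o score per open-source model and rank the models.
var pipeline = new[]
{
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$Model" },
        { "averageScore", new BsonDocument("$avg", "$Score") },
        { "images", new BsonDocument("$sum", 1) }
    }),
    new BsonDocument("$sort", new BsonDocument("averageScore", -1))
};

var rankings = await evaluations.Aggregate<BsonDocument>(pipeline).ToListAsync();
foreach (var row in rankings)
{
    Console.WriteLine($"{row["_id"]}: {row["averageScore"].ToDouble():F1} ({row["images"]} images)");
}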

Below is an example of the evaluation process where GPT-4o correctly identifies the inaccuracies in the generated summaries. The results look fair and accurate, making it easy to introduce more open-source models and then use the notebook to compare their performance.
It also makes it easier to tweak the prompts for better results. For example, incorrect location information is likely caused by including the address resolved from the photo's GPS tag, which leads some models to be more creative with their descriptions.

Evaluation result

Results and Remarks

Following the process outlined earlier, llava:13b comes out on top with an average score of 85.6, with Florence-2-large-ft in second place:

Model Rankings

Observations

  • Providing too much address detail can lead models to make up location information.
  • Larger models provide more detailed summaries.
  • OpenAIClient from Aspire.OpenAI also works well against an Ollama server.
  • The Aspire Command for the Jupyter notebook made it easy for me to pull and push the notebook between my machine and wherever Aspire is running the containers.
    • As a next step, it makes sense to consider periodic downloading of the notebook.
    • Jupyter Notebook Command in Aspire Dashboard

Conclusion and what's next

It is easy enough to utilise APIs that allow inference on image inputs. However, deciding which model to use is not so straightforward, given the need to run a large number of test images against each model. This is what makes the evaluation process crucial for getting the most out of such technology.

In this post, we have looked into using OpenAI's GPT-4o model to assess the quality of the image summaries generated by open-source models.

Our evaluation framework using GPT-4o provides a systematic approach to comparing vision model performance. Key takeaways include:

  1. Automated Evaluation Benefits:

    • Consistent scoring methodology.
    • Scalable to large image sets.
    • Cost-effective solution.
  2. Implementation Insights:

    • Aspire.OpenAI simplifies integration.
    • Jupyter notebooks enable flexible analysis.
    • .NET Aspire makes local development orchestration a breeze.

Next Steps

  1. Model Expansion:

    • Integration of newer vision models
    • Prompt engineering optimisation
    • Performance benchmarking
  2. Feature Development:

    • Natural language image search implementation
    • Enhanced evaluation metrics
    • Automated testing pipeline

The notebook can be accessed on GitHub along with the rest of the code.
