hil for SerpApi

Posted on • Originally published at serpapi.com

Image data parsing: From Image to data (Using Vision API)

AI gets scarily better every day. OpenAI now offers a vision API, which lets you extract information from an image.

We'll learn how to use OpenAI's Vision API on a simple image first, then use it to extract data from a more complex one.

OpenAI vision API to scrape data from an image

We experimented with parsing raw HTML data with AI before. Feel free to read the blog post: Web scraping experiment with AI (Parsing HTML with GPT-4)

Vision API tutorial step-by-step

Let's start by setting up a project to test the Vision API. I'll be using JavaScript (Node.js) in this sample, but feel free to use any language you're comfortable with.

Preparation

Create a new directory and initialize NPM

mkdir openai-vision-api && cd openai-vision-api 
npm init -y # Initialize npm with default settings
npm install openai dotenv --save # Install the openai and dotenv packages

Add API Key

Get your API key from the OpenAI dashboard and put it in a .env file in the project directory. Create the file if it doesn't exist yet.

OPENAI_API_KEY=YOUR_API_KEY

Basic code setup

Create a new index.js file, import the packages, and create a new openai instance:

require("dotenv").config();
const OpenAI = require('openai');

const { OPENAI_API_KEY } = process.env;

const openai = new OpenAI({
  apiKey: OPENAI_API_KEY,
});
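A small guard can make a missing or empty key fail with a message that points straight at the .env file. This is only a sketch: requireEnv is a hypothetical helper name I made up for illustration, not something the openai or dotenv packages provide.

```javascript
// Hypothetical helper (not part of openai or dotenv): read a required
// environment variable and fail early with a readable message.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error("Missing " + name + ". Add it to your .env file.");
  }
  return value.trim();
}

// Possible usage in index.js, after require("dotenv").config():
// const openai = new OpenAI({ apiKey: requireEnv("OPENAI_API_KEY") });
```

Passing the environment object as a second argument keeps the helper easy to try out without touching the real process.env.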

Add vision API method

Here is how to call the Vision API in your code:

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            },
          },
        ],
      },
    ]
  });
  console.log(response.choices[0].message.content);
}
main();

Now run the program with:

node index.js 

Here is the result:

Simple example of Vision API
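Since the only parts of the request that change between experiments are the prompt and the image URL, the nested structure can be wrapped in a helper. A minimal sketch, assuming a hypothetical function name buildVisionRequest and a placeholder example.com URL in the usage comment:

```javascript
// Hypothetical helper: builds the same request body as the call above,
// so the prompt and image URL can vary without repeating the nesting.
function buildVisionRequest(prompt, imageUrl, maxTokens) {
  const request = {
    model: "gpt-4-vision-preview", // model name taken from the example above
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  };
  // Only set max_tokens when the caller asks for a specific limit.
  if (Number.isInteger(maxTokens) && maxTokens > 0) {
    request.max_tokens = maxTokens;
  }
  return request;
}

// Possible usage inside main() (placeholder URL):
// const response = await openai.chat.completions.create(
//   buildVisionRequest("What's in this image?", "https://example.com/photo.jpg")
// );
```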

Parsing data from complex image with Vision API

We saw that it worked with a simple image. Now, let's try a complex one. I'm going to take a screenshot of the Google Shopping results.

I'll upload this image so that we can pass its public URL to the Vision API.

Google shopping results screenshot for coffee

I need to update two things. First, the max_tokens parameter, since the response will be longer. Second, the prompt, to tell the AI exactly what I want.

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Please share the detail information of each item on this product on a nice structure JSON" },
          {
            type: "image_url",
            image_url: {
              "url": "https://i.ibb.co/F8nGWk5/Clean-Shot-2024-01-17-at-13-46-43.png",
            },
          },
        ],
      },
    ],
    max_tokens: 1000 // Allow a longer response
  });
  console.log(response.choices[0].message.content);
}
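If max_tokens is too low, the JSON can be cut off mid-object. One way to notice this is to look at why the model stopped. The sketch below assumes the response carries the standard Chat Completions finish_reason field, where "length" means the token limit was hit; preview models may report this differently, so treat it as an assumption to verify. wasTruncated is a hypothetical helper name.

```javascript
// Returns true when the first choice stopped because it hit the token limit.
// Assumption: the response uses the standard finish_reason field, where
// "length" means the output was cut off by max_tokens.
function wasTruncated(response) {
  const choice = response && response.choices && response.choices[0];
  return Boolean(choice) && choice.finish_reason === "length";
}

// Possible usage after the API call:
// if (wasTruncated(response)) {
//   console.warn("Output was cut off; consider raising max_tokens.");
// }
```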

Here is the result:

Vision API result for a complex image

The result is very good, but there are catches:

- The response structure is not always consistent. I believe we can improve this by adjusting our prompt.

- Parsing this particular image took anywhere from 10+ to 20+ seconds. (That's just the parsing time, not the scraping time.)
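Part of the inconsistency is that the reply is free text: the JSON may arrive wrapped in a Markdown code fence or surrounded by a sentence or two. A rough sketch of a tolerant parser, assuming a hypothetical helper named extractJson (it does not fix inconsistent keys, only the wrapping):

```javascript
// Hypothetical helper: pull a JSON value out of a model reply that may
// wrap it in a Markdown code fence or surround it with prose.
function extractJson(text) {
  // \u0060 is the backtick character; this matches a fenced block
  // with an optional "json" language tag.
  const fenced = text.match(/\u0060{3}(?:json)?\s*([\s\S]*?)\u0060{3}/i);
  const candidate = fenced ? fenced[1] : text;
  // Take the span from the first { or [ to the last } or ].
  const start = candidate.search(/[\[{]/);
  const end = Math.max(candidate.lastIndexOf("}"), candidate.lastIndexOf("]"));
  if (start === -1 || end < start) return null;
  try {
    return JSON.parse(candidate.slice(start, end + 1));
  } catch (err) {
    return null; // not valid JSON; the caller can retry or log the raw text
  }
}

// Possible usage:
// const data = extractJson(response.choices[0].message.content);
```

Returning null instead of throwing leaves it up to the caller to decide whether to retry the request or keep the raw text for inspection.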

Can we use this as a web scraping solution?

As you might know, parsing data is just one part of web scraping. Other things are involved, like proxy rotation, solving CAPTCHAs, and so on. So we can't say that the Vision API is a web scraping solution by itself.

Here is an idea, though, for how to use it as part of a web scraping solution:

- Create a scraping solution, for example using Puppeteer in JavaScript, to take a screenshot.

- Upload the image to a public URL or get its base64 encoding.

- Pass the image to the Vision API call, like the one we wrote above.

- Return the results in a nice structured way.

- (Bonus) If you want to have a consistent data structure, you might want to learn about function calling by OpenAI.
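For the base64 route in the steps above, the image bytes can be sent as a data URL in the same image_url field instead of a public link. A minimal sketch, assuming the screenshot is already in memory as a Buffer or Uint8Array; toDataUrl and imagePartFromBytes are hypothetical helper names:

```javascript
// Hypothetical helper: turn raw image bytes (for example, a screenshot)
// into a base64 data URL that can be used in place of a public URL.
function toDataUrl(bytes, mimeType = "image/png") {
  const base64 = Buffer.from(bytes).toString("base64");
  return "data:" + mimeType + ";base64," + base64;
}

// Build the image part of the message content from the raw bytes.
function imagePartFromBytes(bytes, mimeType) {
  return { type: "image_url", image_url: { url: toDataUrl(bytes, mimeType) } };
}

// Possible usage with Puppeteer (not run here):
// const bytes = await page.screenshot(); // PNG by default
// const content = [
//   { type: "text", text: "Return each product as JSON" },
//   imagePartFromBytes(bytes),
// ];
```

This skips the upload step entirely, at the price of a larger request body.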

Summary

It's fun to experiment with OpenAI features like the Vision API and see how they could help us with web scraping and parsing.

In the example above, where we tried to parse the Google Shopping results page, the approach is still far from production-ready compared to the Google Shopping API, which takes only 1-3 seconds to scrape and return the Google Shopping page in a consistent, structured format.

FAQ

How much does vision API cost?

The gpt-4-1106-vision-preview model costs $0.01 per 1K input tokens and $0.03 per 1K output tokens.
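To get a feel for what a single call costs, the token counts in the response's usage object can be multiplied by those prices. A rough sketch, with the prices hard-coded from the figures above (they may change) and estimateCostUsd as a hypothetical helper name:

```javascript
// Prices quoted above for gpt-4-1106-vision-preview, in USD per 1K tokens.
const INPUT_USD_PER_1K = 0.01;
const OUTPUT_USD_PER_1K = 0.03;

// Estimate the cost of one call from the usage object on the response
// (usage.prompt_tokens and usage.completion_tokens).
function estimateCostUsd(usage) {
  const input = (usage.prompt_tokens / 1000) * INPUT_USD_PER_1K;
  const output = (usage.completion_tokens / 1000) * OUTPUT_USD_PER_1K;
  return input + output;
}

// Possible usage after the API call:
// console.log(estimateCostUsd(response.usage).toFixed(4));
```

For example, a call that used 1,200 prompt tokens and 800 completion tokens (made-up numbers) would come to roughly $0.036.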

Does it support function calling?

Not right now. As of 17 January 2024, gpt-4-1106-vision-preview does not support function calling yet.
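Without function calling, one fallback is to check the parsed items yourself and set aside anything that doesn't match the shape you asked for. This is only a sketch: splitValidItems is a hypothetical helper, and the title and price field names are illustrative, so swap in whatever keys your prompt requests.

```javascript
// Hypothetical shape check: the field names below are illustrative and
// should match whatever keys your prompt asks the model to return.
const REQUIRED_FIELDS = ["title", "price"];

// Split parsed items into those that have every required field as a
// non-empty string and those that do not.
function splitValidItems(items, requiredFields = REQUIRED_FIELDS) {
  const valid = [];
  const invalid = [];
  for (const item of Array.isArray(items) ? items : []) {
    const ok =
      item !== null &&
      typeof item === "object" &&
      requiredFields.every((f) => typeof item[f] === "string" && item[f] !== "");
    (ok ? valid : invalid).push(item);
  }
  return { valid, invalid };
}
```

Items that land in invalid can be logged or sent back through the API with a stricter prompt.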

Reference: OpenAI Vision API
