AI gets scarily better every day. OpenAI now offers the Vision API, which lets you extract information from an image.
In this post, we'll learn how to use OpenAI's Vision API on a simple image, and then extract data from a complex one.
We experimented with parsing raw HTML with AI before; feel free to read the blog post: Web scraping experiment with AI (Parsing HTML with GPT-4).
Vision API tutorial step-by-step
Let's start by setting up a project to test the Vision API. I'll be using JavaScript (Node.js) in this sample, but feel free to use any language you're comfortable with.
Preparation
Create a new directory and initialize NPM
mkdir openai-vision-api && cd openai-vision-api
npm init -y                      # initialize a new NPM package
npm install openai dotenv --save # install the openai and dotenv packages
Add API Key
Get your API key from the OpenAI dashboard and put it in the .env file. Feel free to create a new .env file if you don't have one yet.
OPENAI_API_KEY=YOUR_API_KEY
Basic code setup
Create a new index.js file, import the related packages, and create a new OpenAI client instance:
require("dotenv").config();
const OpenAI = require('openai');
const { OPENAI_API_KEY } = process.env;
const openai = new OpenAI({
apiKey: OPENAI_API_KEY,
});
Add the Vision API method
Here is how to call the Vision API in your code:
async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });

  console.log(response.choices[0].message.content);
}

main();
Now run the program with
node index.js
Here is the result:
Parsing data from a complex image with the Vision API
We saw that it worked with a simple image. Now let's try a complex one: I'm going to take a screenshot of Google Shopping results.
I'll upload this image somewhere public, so we can pass its URL to the Vision API.
I need to update two things: first, the max_tokens parameter, since the response will be longer; second, the prompt, to tell the AI exactly what I want.
async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Please share the detailed information of each item on this product page in a nice structured JSON" },
          {
            type: "image_url",
            image_url: {
              url: "https://i.ibb.co/F8nGWk5/Clean-Shot-2024-01-17-at-13-46-43.png",
            },
          },
        ],
      },
    ],
    max_tokens: 1000, // allow a longer response
  });

  console.log(response.choices[0].message.content);
}

main();
Here is the result:
The result is very good! But here's the catch:
- The response is not always consistent (structure-wise). I believe we can solve this by adjusting our prompt (see the example after this list).
- The time taken for this particular image ranges from 10+ to 20+ seconds. (That's just the parsing time, not the scraping time.)
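For instance, a prompt that spells out the exact JSON shape you expect tends to keep the structure stable across runs. This is just an illustrative prompt (the field names are my own assumptions), not a guarantee of consistency:

// A stricter prompt that pins down the expected JSON shape.
// The keys below are illustrative; adjust them to your own data.
const prompt = `Extract every product in this image and respond with ONLY a JSON array,
no extra text. Each item must have exactly these keys:
"title" (string), "price" (string), "store" (string), "rating" (number or null).`;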
Can we use this as a web scraping solution?
As you might know, parsing data is just one part of web scraping. There are other things involved, like proxy rotation, solving CAPTCHAs, and so on. So we can't say that the Vision API is a web scraping solution on its own.
Here is the idea, though, for how to use it as part of a web scraping solution:
- Create a scraping solution, for example using Puppeteer in JavaScript, to take a screenshot (see the sketch after this list).
- Upload the image to a public URL, or get its base64 code.
- Pass this image to the Vision API method parameter, like the one we provided above.
- Return the results in a nicely structured way.
- (Bonus) If you want a consistent data structure, you might want to learn about function calling by OpenAI.
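Here is a minimal sketch of steps 1-3, assuming you have Puppeteer installed (npm install puppeteer) and the openai client from earlier. Instead of uploading to a public URL, it passes the screenshot to the Vision API as a base64 data URL, which the API also accepts:

const puppeteer = require("puppeteer");

// Take a screenshot of a page and return it as a base64-encoded PNG
async function screenshotAsBase64(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const base64 = await page.screenshot({ encoding: "base64" });
  await browser.close();
  return base64;
}

// Send the screenshot to the Vision API as a data URL (no upload needed)
async function parseScreenshot(targetUrl) {
  const base64 = await screenshotAsBase64(targetUrl);
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract each product on this page as structured JSON" },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64}` },
          },
        ],
      },
    ],
    max_tokens: 1000,
  });
  return response.choices[0].message.content;
}

The data URL route skips the upload step entirely, which is handy when your screenshots shouldn't be public.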
Summary
It's very fun to experiment with OpenAI features like the Vision API and see how they can help us with web scraping and parsing.
In the example above, where we try to parse the Google Shopping results page, it's still far from production-ready compared to the Google Shopping API, which takes only 1-3 seconds to scrape and return the Google Shopping page in a consistent, structured format.
FAQ
How much does vision API cost?
The gpt-4-1106-vision-preview model costs $0.01/1K tokens for input and $0.03/1K tokens for output.
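So, for example, a request that consumes 1,000 input tokens and produces 500 output tokens would cost $0.01 + (0.5 × $0.03) = $0.025. Keep in mind that image inputs are billed as tokens too, based on the image's size and detail level.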
Does it support function calling?
Not right now. The gpt-4-1106-vision-preview model doesn't support function calling yet (as of 17th January 2024).
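One possible workaround (my own sketch, not an official pattern) is to chain two calls: let the vision model extract the data as plain text first, then pass that text to a model that does support function calling, such as gpt-4-1106-preview. The save_products function name and its schema below are hypothetical:

// Step 1: a vision call (like parseScreenshot above) turns the image into raw text.
// Step 2: a text model with function calling enforces the structure.
async function extractWithSchema(rawText) {
  const response = await openai.chat.completions.create({
    model: "gpt-4-1106-preview", // this model supports function calling
    messages: [
      { role: "user", content: `Extract the products from this text:\n${rawText}` },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "save_products", // hypothetical function name
          description: "Save the extracted list of products",
          parameters: {
            type: "object",
            properties: {
              products: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    title: { type: "string" },
                    price: { type: "string" },
                    store: { type: "string" },
                  },
                },
              },
            },
          },
        },
      },
    ],
    // Force the model to call our function so we always get structured output
    tool_choice: { type: "function", function: { name: "save_products" } },
  });
  const args = response.choices[0].message.tool_calls[0].function.arguments;
  return JSON.parse(args); // arguments is a JSON string matching our schema
}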
Reference: OpenAI Vision API