AI gets scarily better every day. OpenAI now offers the Vision API, which lets you extract information from an image.
In this post, we'll learn how to use OpenAI's Vision API on a simple image, and then extract data from a complex one.
We experimented with parsing raw HTML with AI before; feel free to read the blog post: Web scraping experiment with AI (Parsing HTML with GPT-4).
Vision API tutorial step-by-step
Let's start by setting up a project to test the Vision API. I'll be using JavaScript (Node.js) in this sample, but feel free to use any language you're comfortable with.
Preparation
Create a new directory and initialize NPM
mkdir openai-vision-api && cd openai-vision-api
npm init -y                      # initialize a new NPM package
npm install openai dotenv --save # install the openai and dotenv packages
Add API Key
Get your API key from the OpenAI dashboard and put it in the .env file. Feel free to create a new .env file if you don't have one yet.
OPENAI_API_KEY=YOUR_API_KEY
Basic code setup
Create a new index.js file, import the related packages, and create a new OpenAI client instance:
require("dotenv").config();
const OpenAI = require('openai');
const { OPENAI_API_KEY } = process.env;
const openai = new OpenAI({
apiKey: OPENAI_API_KEY,
});
Add the Vision API method
Here is how to call the Vision API in your code:
async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });

  console.log(response.choices[0].message.content);
}

main();
Now run the program with
node index.js
Here is the result:
Parsing data from a complex image with the Vision API
We saw that it worked with a simple image. Now let's try a complex one: I'm going to take a screenshot of Google Shopping results.
I'll upload this image somewhere public, so we can pass its URL to the Vision API.
I need to update two things: first, the max_tokens parameter, since the response will be longer; second, the prompt, to tell the AI exactly what I want.
async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Please share the detailed information of each item on this product page in a nice structured JSON" },
          {
            type: "image_url",
            image_url: {
              url: "https://i.ibb.co/F8nGWk5/Clean-Shot-2024-01-17-at-13-46-43.png",
            },
          },
        ],
      },
    ],
    max_tokens: 1000, // allow a longer response
  });

  console.log(response.choices[0].message.content);
}

main();
Here is the result:
The result is very good! But here's the catch:
- The response is not always consistent (structure-wise). I believe we can solve this by adjusting our prompt (see the example after this list).
- The time taken for this particular image ranges from 10+ to 20+ seconds. (That's just the parsing time, not the scraping time.)
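For instance, a prompt that spells out the exact JSON shape you expect tends to keep the structure stable across runs. This is just an illustrative prompt (the field names are my own assumptions), not a guarantee of consistency:

// A stricter prompt that pins down the expected JSON shape.
// The keys below are illustrative; adjust them to your own data.
const prompt = `Extract every product in this image and respond with ONLY a JSON array,
no extra text. Each item must have exactly these keys:
"title" (string), "price" (string), "store" (string), "rating" (number or null).`;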
Can we use this as a web scraping solution?
As you might know, parsing data is just one part of web scraping. There are other things involved, like proxy rotation, solving CAPTCHAs, and so on. So we can't say that the Vision API is a web scraping solution on its own.
Here is the idea, though, for how to use it as part of a web scraping solution:
- Create a scraping solution, for example using Puppeteer in JavaScript, to take a screenshot (see the sketch after this list).
- Upload the image to a public URL, or get its base64 code.
- Pass this image to the Vision API method parameter, like the one we provided above.
- Return the results in a nicely structured way.
- (Bonus) If you want a consistent data structure, you might want to learn about function calling by OpenAI.
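Here is a minimal sketch of steps 1-3, assuming you have Puppeteer installed (npm install puppeteer) and the openai client from earlier. Instead of uploading to a public URL, it passes the screenshot to the Vision API as a base64 data URL, which the API also accepts:

const puppeteer = require("puppeteer");

// Take a screenshot of a page and return it as a base64-encoded PNG
async function screenshotAsBase64(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const base64 = await page.screenshot({ encoding: "base64" });
  await browser.close();
  return base64;
}

// Send the screenshot to the Vision API as a data URL (no upload needed)
async function parseScreenshot(targetUrl) {
  const base64 = await screenshotAsBase64(targetUrl);
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract each product on this page as structured JSON" },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64}` },
          },
        ],
      },
    ],
    max_tokens: 1000,
  });
  return response.choices[0].message.content;
}

The data URL route skips the upload step entirely, which is handy when your screenshots shouldn't be public.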
Summary
It's very fun to experiment with OpenAI features like the Vision API and see how they can help us with web scraping and parsing.
In the example above, where we try to parse the Google Shopping results page, it's still far from production-ready compared to the Google Shopping API, which takes only 1-3 seconds to scrape and return the Google Shopping page in a consistent, structured format.
FAQ
How much does vision API cost?
The gpt-4-1106-vision-preview model costs $0.01/1K tokens for input and $0.03/1K tokens for output.
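So, for example, a request that consumes 1,000 input tokens and produces 500 output tokens would cost $0.01 + (0.5 × $0.03) = $0.025. Keep in mind that image inputs are billed as tokens too, based on the image's size and detail level.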
Does it support function calling?
Not right now. The gpt-4-1106-vision-preview model doesn't support function calling yet (as of 17th January 2024).
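One possible workaround (my own sketch, not an official pattern) is to chain two calls: let the vision model extract the data as plain text first, then pass that text to a model that does support function calling, such as gpt-4-1106-preview. The save_products function name and its schema below are hypothetical:

// Step 1: a vision call (like parseScreenshot above) turns the image into raw text.
// Step 2: a text model with function calling enforces the structure.
async function extractWithSchema(rawText) {
  const response = await openai.chat.completions.create({
    model: "gpt-4-1106-preview", // this model supports function calling
    messages: [
      { role: "user", content: `Extract the products from this text:\n${rawText}` },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "save_products", // hypothetical function name
          description: "Save the extracted list of products",
          parameters: {
            type: "object",
            properties: {
              products: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    title: { type: "string" },
                    price: { type: "string" },
                    store: { type: "string" },
                  },
                },
              },
            },
          },
        },
      },
    ],
    // Force the model to call our function so we always get structured output
    tool_choice: { type: "function", function: { name: "save_products" } },
  });
  const args = response.choices[0].message.tool_calls[0].function.arguments;
  return JSON.parse(args); // arguments is a JSON string matching our schema
}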
Reference: OpenAI Vision API