Image based Function Calling with gemini-1.0-pro-vision

I am excited to introduce a new development in artificial intelligence: triggering actions directly from images! Yes, you read that right. With Java, we can now combine function calling with image inputs.

Imagine a system so advanced that it can:

🚑 Call an ambulance immediately after detecting an image of a car accident.
🍳 Suggest recipes the moment it sees images of vegetables.
👮 Alert the police when it captures an image of a traffic signal violation.
🚒 Contact the fire department immediately if it "sees" fire.

All of these are not just concepts; they are fully functional and implemented in 100% Java.

👀 Why this matters:

✅ Can shorten emergency response times by removing the manual reporting step.
✅ Introduces a new level of interaction between AI and daily life.
✅ Opens up new possibilities for AI applications in various industries.

Stay ahead of the curve in tech innovations. Dive into my latest article to see how image-driven function calling can set new standards in the tech world!


Historically, function calling and tools integration in AI systems were largely contingent on text input, which could limit the immediacy and context of responses, particularly in dynamic or visually-driven scenarios. Tools4AI has introduced a revolutionary feature that extends the functionality of AI beyond text-based interactions to include image-based action triggers.

All the images in this example have been generated by AI and are available here for testing

Innovative Image Recognition Integration:
Tools4AI uses Gemini (gemini-1.0-pro-vision) to enhance AI capabilities by enabling the system to analyze images and automatically execute relevant actions based on the visual data it processes. This development is particularly crucial in emergency management, where speed and accuracy of response can save lives and property. Here's how Tools4AI can change the landscape:
This is the only code you need to process the image and take action; it's available here

package org.example.image;

import com.t4a.processor.AIProcessingException;
import com.t4a.processor.GeminiImageActionProcessor;
import com.t4a.processor.GeminiV2ActionProcessor;

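public class ImageActionExample {
    public static void main(String[] args) throws AIProcessingException {
        // Step 1: ask Gemini to describe the image passed as the first argument
        GeminiImageActionProcessor processor = new GeminiImageActionProcessor();
        String imageDescription = processor.imageToText(args[0]);
        // Step 2: let the action processor pick and run an action for that description
        GeminiV2ActionProcessor actionProcessor = new GeminiV2ActionProcessor();
        Object obj = actionProcessor.processSingleAction(imageDescription);
        // Step 3: summarize the description together with the action result
        String str = actionProcessor.summarize(imageDescription + obj.toString());
        // Print the summary (added here so the result is visible)
        System.out.println(str);
    }
}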

Image description

If you run ImageActionExample with the image above as its source, it correctly identifies that an ambulance needs to be called.

The image depicts a car accident involving a blue car and a red car on a city street. The blue car has front-end damage while the red car has rear-end damage. Debris from the accident is scattered on the street and a police officer is present at the scene. An ambulance has been called and is seen in the background.
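To make the flow easier to follow, here is a minimal, library-free sketch of the same three steps: describe the image, run an action based on the description, and combine the two. Everything in it is hypothetical (the class name ImagePipelineSketch, the two Function fields, and the keyword check that stands in for the model's decision); it is not the Tools4AI API, only an illustration of the shape of the pipeline.

```java
import java.util.function.Function;

// Hypothetical stand-in for the describe -> act -> combine flow; not the Tools4AI API.
class ImagePipelineSketch {
    // Image source -> text description (in the real flow this is the vision model)
    private final Function<String, String> describeImage;
    // Description -> action result (in the real flow this is function calling)
    private final Function<String, String> runAction;

    ImagePipelineSketch(Function<String, String> describeImage,
                        Function<String, String> runAction) {
        this.describeImage = describeImage;
        this.runAction = runAction;
    }

    // Runs both steps and joins the description with the action result.
    String process(String imageSource) {
        String description = describeImage.apply(imageSource);
        String result = runAction.apply(description);
        return description + " -> " + result;
    }

    public static void main(String[] args) {
        // Stubbed steps: a fixed description and a naive keyword check (assumptions for the demo)
        ImagePipelineSketch pipeline = new ImagePipelineSketch(
            source -> "A car accident on a city street",
            description -> description.toLowerCase().contains("accident")
                ? "Ambulance has been called"
                : "No action taken");
        System.out.println(pipeline.process("accident.png"));
    }
}
```

Keeping the two steps behind separate functions means the vision step and the action step can be swapped or tested independently, which is roughly what the two processors in the example do.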

Direct Action from Visual Cues: Whether it's a surveillance image of a car accident or a live feed of a residential fire, Tools4AI can immediately recognize critical situations and initiate appropriate emergency protocols without human input.
A sample action follows; its code is available here

import com.t4a.annotations.Predict;
import com.t4a.annotations.Prompt;
import com.t4a.api.JavaMethodAction;
@Predict(actionName = "callEmergencyServices", description = "This action will be called in case of emergency", groupName = "emergency")
public class EmergencyAction implements JavaMethodAction {
    public String callEmergencyServices(@Prompt(describe = "Ambulance, Fire or Police") String typeOfEmergency) {
        return typeOfEmergency + " has been called";
    }
}
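Since typeOfEmergency arrives as free text chosen by the model, it may be worth validating it before anything with real-world consequences happens. The sketch below is one possible way to do that, assuming the three values named in the prompt description (Ambulance, Fire, Police). The EmergencyType enum and the SafeEmergencyDispatch class are hypothetical helpers written for this article, not part of Tools4AI.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical wrapper, not part of Tools4AI: only acts on recognized emergency types.
class SafeEmergencyDispatch {
    static String callEmergencyServices(String typeOfEmergency) {
        return EmergencyType.parse(typeOfEmergency)
            .map(type -> type + " has been called")
            .orElse("Unrecognized emergency type: " + typeOfEmergency);
    }

    public static void main(String[] args) {
        System.out.println(callEmergencyServices(" fire "));     // FIRE has been called
        System.out.println(callEmergencyServices("Helicopter")); // Unrecognized emergency type: Helicopter
    }
}

// The three values come from the prompt description "Ambulance, Fire or Police".
enum EmergencyType {
    AMBULANCE, FIRE, POLICE;

    // Case-insensitive, whitespace-tolerant parse; empty when the text matches no constant.
    static Optional<EmergencyType> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(valueOf(raw.trim().toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

Returning an Optional instead of throwing keeps the decision about what to do with an unexpected value (log it, ask a human, retry) in the caller's hands.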

Image description

Enhanced Practical Application: The integration of image recognition allows Tools4AI to directly interact with other digital systems and services. For instance, detecting a flat tire from traffic camera footage can trigger a roadside assistance call, while identifying a fire through a security camera can alert fire services instantly. Code for this action is here

package org.example.image.action;

import com.t4a.annotations.Predict;
import com.t4a.annotations.Prompt;
import com.t4a.api.JavaMethodAction;
@Predict(actionName = "carRepairService", description = "This action will be called in case of car servicing", groupName = "car services")
public class CarServiceAction implements JavaMethodAction {
    public String carRepairService(String typeOfProblem) {
        return typeOfProblem + " has been found and will be fixed";
    }
}

Tools4AI correctly identifies the image and calls the car repair action.
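If you are curious how annotated classes like these can be looked up by name and invoked, here is a rough sketch using plain Java reflection. I am not claiming this is how Tools4AI works internally. The @ActionInfo annotation, the ActionRegistrySketch class, and its register and invoke methods are all made up for illustration, and the two nested action classes only mirror the examples in this article.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical annotation-based registry; an illustration, not the Tools4AI implementation.
class ActionRegistrySketch {
    // Stand-in annotation loosely modeled on @Predict (assumption, not the real one)
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ActionInfo {
        String actionName();
        String groupName();
    }

    @ActionInfo(actionName = "callEmergencyServices", groupName = "emergency")
    static class EmergencyAction {
        public String callEmergencyServices(String typeOfEmergency) {
            return typeOfEmergency + " has been called";
        }
    }

    @ActionInfo(actionName = "carRepairService", groupName = "car services")
    static class CarServiceAction {
        public String carRepairService(String typeOfProblem) {
            return typeOfProblem + " has been found and will be fixed";
        }
    }

    private final Map<String, Object> actions = new HashMap<>();

    // Reads the annotation at runtime and indexes the instance by its action name.
    void register(Object action) {
        ActionInfo info = action.getClass().getAnnotation(ActionInfo.class);
        if (info == null) {
            throw new IllegalArgumentException("Missing @ActionInfo on " + action.getClass());
        }
        actions.put(info.actionName(), action);
    }

    // Finds the public method named after the action and calls it with one String argument.
    String invoke(String actionName, String argument) {
        Object action = actions.get(actionName);
        if (action == null) {
            throw new IllegalArgumentException("Unknown action: " + actionName);
        }
        try {
            Method method = action.getClass().getMethod(actionName, String.class);
            return (String) method.invoke(action, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not invoke " + actionName, e);
        }
    }

    public static void main(String[] args) {
        ActionRegistrySketch registry = new ActionRegistrySketch();
        registry.register(new EmergencyAction());
        registry.register(new CarServiceAction());
        // In a real system the model would choose the action name and argument
        System.out.println(registry.invoke("carRepairService", "Flat tire"));
    }
}
```

The point of the sketch is that the action name and description live next to the code they describe, so adding a new action means adding one annotated class rather than editing a central switch statement.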
Documented Effectiveness and Use Cases: Tools4AI's image-based action capability is not theoretical; it is a fully functioning feature with practical implementations:

Reduced Dependency on Textual Reports: By reducing the reliance on text-based alerts, which require human interpretation and subsequent action, Tools4AI allows for a more agile response strategy, directly linking what the camera "sees" to the necessary emergency service.
Scalable and Versatile Applications: The technology is scalable across multiple environments, enhancing security and response mechanisms in both public and private sectors.

Image description
Tools4AI correctly identifies the fire and calls the emergency services action for a fire truck.

Future Directions and Potential:
As Tools4AI continues to evolve, the potential for broader applications is vast. Future developments could include more nuanced understanding and response to a wider range of visual stimuli, further enhancing the system's utility in complex environments. The integration of image recognition into AI not only marks a significant technological leap but also sets a new standard for responsive AI systems across industries.

The potential applications of image recognition combined with function calling are vast and varied. For instance, this technology could help identify traffic violations, such as speeding or running a red light, by automatically processing images from traffic cameras and triggering alerts or fines. In the culinary world, it could suggest recipes based on photos of ingredients that users have on hand, simplifying meal preparation. Other examples could include:

Healthcare: For medical diagnostics, where it could analyze x-rays or MRI scans to automatically identify abnormalities and alert medical professionals for further review.

Retail: In retail environments, image recognition can enhance customer experiences by enabling visual search capabilities—users could snap a photo of an item and instantly find out where it can be purchased or see similar products.
Security: Security systems could use image recognition to detect unauthorized access or identify suspicious activities, automatically notifying security personnel or law enforcement.

Environmental Monitoring: This technology can be applied to monitor changes in landscapes, track wildlife, or detect signs of environmental degradation, such as illegal logging or pollution.

Smart Homes and IoT: In smart home settings, image recognition could identify residents and adjust settings according to individual preferences, or monitor the home for safety hazards, like fires or flooding.

Agriculture: For agricultural applications, such technology could assess crop health from images, predict yields, and detect pest infestations, automating responses such as the application of pesticides or irrigation.

Full code for this article is available here
