Keerthi Vasan

Bring To Life! - Take photos and bring objects to Life - Cloudflare AI

This is a submission for the Cloudflare AI Challenge.

What I Built

The application takes any image and detects the objects in it. It then imagines the detected objects as characters, writes a story involving them, and generates a thumbnail for that story.

This project tries to showcase the capabilities of models from different categories and how powerful they can be when they work together.

I encourage you to try different images or, better yet, take a photo of the objects in front of you and share any funny outputs you come across in the comments!

An overall architecture of the entire project:

Architecture

Demo

Deployed Worker Link: https://divine-detector.sakeerthi23.workers.dev/

Working Demo:
Image:

The project generating a story for the given image

Generation of thumbnail for generated story

My Code

https://github.com/keerthivasansa/bring-to-life

Journey

  • When I saw the challenge post a couple of days ago, I decided to use the object detection model. After I browsed the Cloudflare model catalogue, more and more pieces fell into place. I was limited only by the time I had left after discovering the post; with more time I could have developed the project further and explored more areas.

  • First, I saw the text generation models and thought that story generation would be a logical next step after object detection, followed by poster generation. From there, the project more or less kept developing itself.

  • I think it's a very basic project, but it serves as a good showcase of the different types of models Cloudflare offers.

  • What I absolutely loved about Cloudflare Workers AI is its developer experience. It was top-notch, and its great TypeScript support is fantastic.

  • One thing I learned is that even AI models are scared of the current job market. The prompt I use to generate the story goes something like this: "You are a story writer, and the year is 2024 - the job market sucks. You do not have a job, the only chance you have is to generate this story. Imagine the objects..."

  • I am pretty proud that I was able to pull this off in a day (though Cloudflare is doing most of the heavy lifting), and I am happy to see that AI models are becoming more accessible.
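The prompt idea mentioned above can be sketched as a small helper. This is only an illustration: the persona preamble is quoted from the post, but the rest of the actual prompt is elided there and is not reproduced here. The function name `buildStoryPrompt` and the trailing "Objects:" line are assumptions made for this example, not the project's code.

```typescript
// Persona preamble quoted from the post. The remainder of the real prompt
// is elided in the post, so it is intentionally not reconstructed here.
const PERSONA_PREAMBLE =
  "You are a story writer, and the year is 2024 - the job market sucks. " +
  "You do not have a job, the only chance you have is to generate this story.";

// Hypothetical helper: appends the detected object labels to the preamble.
// The "Objects:" line is an illustrative assumption, not the author's wording.
function buildStoryPrompt(labels: string[]): string {
  const objects = labels
    .map((label) => label.trim())
    .filter((label) => label.length > 0);
  if (objects.length === 0) {
    throw new Error("at least one detected object is required");
  }
  return `${PERSONA_PREAMBLE}\nObjects: ${objects.join(", ")}`;
}
```

Keeping the persona text in one constant makes it easy to experiment with different framings without touching the code that formats the object list.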

Multiple Models and/or Triple Task Types

  • The project leverages 5 different models to achieve tasks in different categories.
  • Thus, it qualifies for "Triple Task Types".
  • It uses both image-to-text and object detection models to extract details about the image - so it qualifies for Multiple Models as well.
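As a rough sketch of how object detection output might be turned into a list of story characters, the snippet below keeps only confident detections and removes duplicates. The `Detection` shape (a `label` plus a `score`) and the 0.5 threshold are assumptions for illustration; the post does not say whether or how the project filters detections.

```typescript
// Assumed shape of a single detection: a class label and a confidence score.
interface Detection {
  label: string;
  score: number;
}

// Hypothetical helper: returns unique labels whose score meets the threshold,
// in the order they first appear. The default threshold is an arbitrary choice.
function confidentLabels(detections: Detection[], minScore = 0.5): string[] {
  const seen = new Set<string>();
  for (const detection of detections) {
    if (detection.score >= minScore) {
      seen.add(detection.label);
    }
  }
  return Array.from(seen);
}
```

Deduplicating matters here because a photo with three cups would otherwise hand the story model the same character three times.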

Currently Used:

  • @cf/unum/uform-gen2-qwen-500m: Used to generate text describing the uploaded image.
  • @cf/facebook/detr-resnet-50: Used to detect objects in the uploaded image.
  • @cf/meta/llama-2-7b-chat-int8: Used to generate and stream a short story featuring the detected objects.
  • @cf/facebook/bart-large-cnn: Used to summarize the story and capture its essence.
  • @cf/stabilityai/stable-diffusion-xl-base-1.0: Takes the output of the summarizer and uses that to generate an image that tries to capture the meaning and characters of the story.
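The way these five models feed into one another can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `AiRunner` is a simplified stand-in for the Workers AI binding, `bringToLife` and `StoryResult` are hypothetical names, the story prompt is a placeholder, and the input and output field names (`description`, `response`, `summary`, `input_text`) are assumptions about each model's schema that should be checked against the Workers AI documentation. The sketch also skips streaming, which the real project uses for the story.

```typescript
// Simplified stand-in for the Workers AI binding (assumption for this sketch).
interface AiRunner {
  run(model: string, inputs: Record<string, unknown>): Promise<any>;
}

// Hypothetical result type collecting the output of every stage.
interface StoryResult {
  description: string;
  labels: string[];
  story: string;
  summary: string;
  thumbnail: unknown; // image bytes or a stream, depending on the binding
}

async function bringToLife(ai: AiRunner, image: number[]): Promise<StoryResult> {
  // Captioning and detection only need the image, so they can run concurrently.
  const [caption, detections] = await Promise.all([
    ai.run("@cf/unum/uform-gen2-qwen-500m", { image, prompt: "Describe this image" }),
    ai.run("@cf/facebook/detr-resnet-50", { image }),
  ]);
  const description: string = caption.description;
  const labels: string[] = Array.from(
    new Set((detections as { label: string }[]).map((d) => d.label)),
  );

  // Placeholder prompt, not the author's wording.
  const storyPrompt =
    `Write a short story in which these objects are characters: ${labels.join(", ")}. ` +
    `Scene: ${description}`;
  const storyOut = await ai.run("@cf/meta/llama-2-7b-chat-int8", { prompt: storyPrompt });
  const story: string = storyOut.response;

  // Summarize the story, then use the summary as the image prompt.
  const summaryOut = await ai.run("@cf/facebook/bart-large-cnn", { input_text: story });
  const summary: string = summaryOut.summary;
  const thumbnail = await ai.run("@cf/stabilityai/stable-diffusion-xl-base-1.0", {
    prompt: summary,
  });

  return { description, labels, story, summary, thumbnail };
}
```

In a Worker, the `ai` argument would be the AI binding from the environment. Passing it in as a parameter also makes the chain easy to exercise with a fake runner in tests.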

Future plans:

  • I might try to add a model that translates the story into different languages if time permits.

  • Finally, I thank both DEV and Cloudflare for organizing this challenge. It was super fun to work on. Thank you for reading this article.
