Hey friends! Welcome to the Season Finale of The Adventures of Blink! Season 2 ends with today's post... if you're just now discovering me, you probably want to go back and start with S2e1 and work your way forward so this will make more sense!
A brief aside before we get to work
Thank you, fellow adventurers!
By the time this episode releases, this blog will pass TEN THOUSAND followers. Now I know for a fact that doesn't mean there are 10k of you actively adventuring with me (view counts are much, much lower than that - metrics are weird! 🤪), but I am grateful to have that level of exposure within the community! If you're a regular reader and you've enjoyed the series, I would love to hear from you - whether that's a comment or a ❤️ on the blog, or a comment or 👍🏻 on the YouTube channel... I started The Adventures of Blink hoping to have some two-way conversation. Come interact and let's be friends!
The Finale: AI Integration
I mean, it's 2024. If you haven't done AI integration, have you even created a product? 🤣 I chose to do this because it's a fun, practical use for an LLM: Hangman is a game where you have to guess the contents of a phrase. But if you just have a database of phrases... your game has finite replayability. At some point you'll get a phrase from the database that you've played before, and the game loses some of its fun value.
Adding AI here increases that replayability - because you can create a non-deterministic expansion of your data. We don't know what the LLM is going to say in response to our query each time - so we can continually add on to our game board collection.
Architecture of our AI add-on
We're going to revisit a tool that we explored back in season 1: Ollama. As it turns out, ollama has been published to DockerHub as a container you can download! This will fit perfectly into our Hangman architecture - we just spin up a separate container to hold our LLM and then write some code to call it when we need it to give us a new game board.
Putting a Llama in a Container
We can't just put the ollama container into our code, though... because ollama by itself doesn't do much. You have to load a model into it, remember?
In the non-Docker version, we run
# at the time of this writing, 3.2 is the
# latest version of the llama model
ollama pull llama3.2
in order to add our model so that Ollama can use it... so we're going to need to tell our container to do some work on startup.
Here's our Dockerfile for Ollama:
# Start with the official Ollama base image
FROM ollama/ollama:latest
# We needed curl available in the container
# for our next step...
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# Copy a startup script into the container
COPY start_model.sh /usr/local/bin/start_model.sh
RUN chmod +x /usr/local/bin/start_model.sh
# Expose the necessary port for Ollama
EXPOSE 11434
# Run the startup script when the container starts
ENTRYPOINT ["/usr/local/bin/start_model.sh"]
Notice that we're starting our container with a script, start_model.sh. Here's what that entails:
#!/bin/bash

echo "Starting Ollama serve in the background..."
ollama serve &
serve_pid=$!

# Make sure the background process is actually running
if ! kill -0 "$serve_pid" 2>/dev/null; then
    echo "Error starting Ollama server"
    exit 1
fi

echo "Waiting for Ollama server to be ready..."
wait_time=5
max_retries=10
while true; do
    if curl -s http://localhost:11434/ > /dev/null; then
        break
    fi
    echo "Ollama server not ready, retrying in $wait_time seconds..."
    sleep $wait_time
    wait_time=$((wait_time * 2))
    ((max_retries--))
    if [ $max_retries -eq 0 ]; then
        echo "Ollama server failed to start"
        exit 1
    fi
done
echo "Ollama server is ready."

# Pull the model if it isn't already available
if ! ollama list | grep -q "llama3.2"; then
    echo "Model llama3.2 not found. Downloading..."
    ollama pull llama3.2
    if [ $? -ne 0 ]; then
        echo "Error downloading llama3.2 model"
        exit 1
    fi
    echo "Model llama3.2 download complete."
else
    echo "Model llama3.2 already downloaded."
fi

# Forward shutdown signals to the Ollama server process
trap 'kill -SIGTERM $serve_pid' SIGINT SIGTERM

echo "Ollama server is now running in the background."

# Keep the container alive until it's stopped
while true; do
    sleep 60
done
When the container starts, the script checks whether the llama3.2 model is already downloaded; if it isn't, it pulls it. Then an infinite loop keeps the container active until it's manually stopped.
This creates a container that we can add to our docker-compose.yml:
llm:
  build:
    context: ./llm
    dockerfile: Dockerfile
  container_name: ollama-container
  restart: unless-stopped
  env_file:
    - .env
  ports:
    - "11434:11434"
  volumes:
    - ./llm:/app
And now you can reach your ollama on its standard port of 11434 to interact with the model: you can request embeddings of input text, or you can simply interact by sending it a prompt and getting the response.
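If you want to poke at the container before wiring up any game code, a quick sanity check might look like this (a minimal sketch, assuming the container is running locally and the llama3.2 pull has finished - the prompt text is just an example):

import json
import requests

# Talk to the Ollama container directly on its standard port.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2",
    "prompt": "Give me a short phrase that would make a good Hangman puzzle."
}

# Ollama streams back newline-delimited JSON chunks; accumulate the
# "response" pieces until the chunk marked "done" arrives.
answer = ""
with requests.post(OLLAMA_URL, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line.decode("utf-8"))
            answer += chunk.get("response", "")
            if chunk.get("done", False):
                break

print(answer)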
APIs, APIs everywhere
Architecturally, we should treat the LLM like we treated the database... it's a backend component that feeds information to the frontend app. Thus we should build an API for it, just as we did with our MongoDB.
Since we used Flask for our other API, I copy-pasted the setup from that one and changed a few things:
from flask import Flask, jsonify
import requests
from prometheus_client import generate_latest, Counter, Histogram
import os, json

app = Flask(__name__)

# Ollama connection configuration
llm_uri = f"{os.getenv('LLM_URI')}/api/generate"
model_id = os.getenv("MODEL_ID")

REQUEST_COUNT = Counter('llm_requests_total', 'Total number of requests to the llm')
REQUEST_LATENCY = Histogram('llm_request_latency_seconds', 'Latency of requests to the llm')

@app.route('/metrics')
def metrics():
    return generate_latest()

# /getpuzzle is our LLM endpoint - we pass in the prompt
# (which in hangman is a constant value for now) and then
# it responds with a puzzle and its associated hint.
@app.route('/getpuzzle', methods=['GET'])
def get_puzzle():
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():
        try:
            prompt = "Suggest a Hangman puzzle I can use to defeat my friend. You may choose from these categories: 'thing', 'place', 'phrase', or 'food and drink'. Your puzzle must be more than three words long and less than ten words long. You are required to respond with only the completed puzzle solution. What puzzle string should I use?"
            payload = {
                "model": model_id,
                "prompt": prompt
            }
            response = requests.post(llm_uri, json=payload)

            # Accumulate chunks for the puzzle solution
            full_response = ""
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line.decode('utf-8'))
                    full_response += chunk.get("response", "")
                    if chunk.get("done", False):
                        break

            # Parse the accumulated response as the puzzle solution
            result = full_response.strip().replace("\"", "").replace("'", "")

            # Prepare a second request for the category hint
            payload_2 = {
                "model": model_id,
                "prompt": f"Given the possible categories of 'thing', 'place', 'phrase', or 'food and drink', What would be the most relevant category for the following hangman puzzle: << {result} >> ? You are required to answer with only the category."
            }
            response_2 = requests.post(llm_uri, json=payload_2)

            full_response_2 = ""
            for line in response_2.iter_lines():
                if line:
                    chunk = json.loads(line.decode('utf-8'))
                    full_response_2 += chunk.get("response", "")
                    if chunk.get("done", False):
                        break

            # Parse the accumulated response as the category hint
            result_2 = full_response_2.strip().replace("\"", "").replace("'", "")

            # Create the final JSON response
            final_response = {
                "hint": result_2,
                "phrase": result
            }
            return jsonify(final_response), 200

        except json.JSONDecodeError as e:
            print("Error decoding JSON:", e)
            return jsonify({"error": "JSON decoding error"}), 500
        except Exception as e:
            print('Unable to connect to ollama port:', e)
            return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5002)
You'll notice that we actually make two Ollama calls in this endpoint. Why? In my original testing, I found that the model struggled with combining the activities of (a) generating the data and (b) formatting it exactly how I wanted it. So I simplified it - I asked the model for the two bits of data I wanted (a puzzle and a hint) and then assembled them manually into the proper response format. This way I can ensure that the data I'm using is well-formatted and usable by the game.
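For a sense of the contract this endpoint exposes, here's a rough sketch of calling it once the container is up (assuming it's reachable on port 5002; the hint and phrase values shown in the comment are purely illustrative):

import requests

# Quick smoke test for the LLM API container.
resp = requests.get("http://localhost:5002/getpuzzle")
resp.raise_for_status()

puzzle = resp.json()
# Expected shape (the actual values change on every call):
# {"hint": "food and drink", "phrase": "a tall glass of cold lemonade"}
print(puzzle["hint"], "->", puzzle["phrase"])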
Also of note: I built this API to run in its own Docker container, just like the database API. I'm not providing the code here because it's almost exactly the same as the database API's container - I changed the port number and what folder it builds from. It's all in the git repository if you'd like to see!
Some Glue
Just like with our database API, we want to have an interface within our code to centralize the use of the API so we loosely couple our app to the LLM. Here's what that looks like:
import requests

class LLM_Integration:
    def __init__(self):
        ROOT_URL = 'http://localhost:5002'
        self.puzzle_url = f"{ROOT_URL}/getpuzzle"

    def getpuzzle(self):
        response = requests.get(self.puzzle_url)
        if response.status_code == 200:
            return response.json()
        else:
            return response
We can create an instance of this class in any place that needs it, and if the API's address ever changes, we only have to update our constant values in one spot.
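As a quick illustration, using the wrapper might look like this (a sketch - the module name llm_integration is just a placeholder for wherever the class lives in the repo, and it assumes the LLM API container is up on port 5002):

from llm_integration import LLM_Integration  # hypothetical module name for the class above

llm = LLM_Integration()
puzzle = llm.getpuzzle()

# On success, puzzle is the parsed JSON from the API,
# e.g. {"hint": "...", "phrase": "..."}
print(puzzle)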
Now we're ready to add AI to our game!
With all of this in place, we're ready to wire the AI feature into our actual game experience!
I elected to add it as a new button on the phrase editor screen. Here is the code we're adding:
# Add this button to the __init__ method of the
# WordEditorApp class, where the other buttons are defined.
self.ai_button = tk.Button(self, text="AI Generate Word", command=self.ai_word_popup)
self.ai_button.grid(row=2, column=0)

# Then add this method to the class to handle the button click.
# (self.llm is assumed to be an LLM_Integration instance created in __init__.)
def ai_word_popup(self):
    """Popup for AI Generation of a new word"""
    response = self.llm.getpuzzle()
    hint = response['hint']
    phrase = response['phrase']
    self.edit_popup("Add Phrase", word=phrase, hint=hint, save_callback=self.add_word)
The structure of our popups (basically everything goes through a single edit_popup method) makes this very easy to plug in - all we do is run the AI call, pull the phrase and hint out of the response, and pass them to edit_popup.
An Aside: Major Changes happened along the way
You might notice that the WordEditorApp class got a bit of an overhaul in between weeks here. When I tried to start up the app, I realized it wasn't working right and had to revisit how windows are defined in Tkinter. All of these changes were pushed up together in the S2E10 branch. Suffice it to say, there's always space for refactoring in your projects!
Wrapping up Season 2
Friends, this season has been an absolute delight for me as a programmer: working through these projects and showing you how I did it. I hope you've had a fun time, but more importantly, I hope you've started to get a feel for how some of these DevOps principles and tools work together to produce software. I'm grateful for each of you who has read or watched along, and I hope you have a holiday season full of warmth and happiness.
As for me, I'll be taking a couple of months off to relax, as well as plan out Season 3. Look for an announcement somewhere around the New Year. Take care, and I can't wait to see what you build!