Jamesb

Posted on Dec 29, 2023

Avoiding Cascading Failure in LLM Prompt Chains

#ai #machinelearning #opensource #typescript

A common problem when building LLM applications composed of chains of prompts is that failures and inaccuracies early in the chain compound into system-wide failures. It resembles the cascading failure problem in networks, where a failure in a small subset of nodes propagates outwards and brings down the entire network.

I noticed this a lot while working on Open Recommender, an open source YouTube video recommender system which takes users' Twitter feeds as input, infers the kinds of topics they are interested in, and searches YouTube for relevant videos and clips to recommend to them.

The Pipeline

The beginning of the data processing pipeline looks like this. The main part I'll discuss in this article is the createQueries step.

const pipeline = new Pipeline(...)
  .addStage(validateArgs)
  .addStage(getTweets)
  // generates YouTube queries based on tweets
  .addStage(createQueries)
  .addStage(searchForVideos)
  // ... more stages
const results = await pipeline.execute();

When I run pipeline.execute, each stage gets executed sequentially and the output of the previous stage is passed as input to the next. The getTweets stage outputs a list of a user's last 30 tweets. These get passed to the createQueries stage which constructs a list of YouTube search queries. Those search queries are then passed to the searchForVideos stage which searches YouTube and returns a list of search results for each query.
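To make the data flow concrete, here is a minimal sketch of how such a pipeline could be written. It is not Open Recommender's actual Pipeline class: the constructor argument, the Stage type and the toy stages below are all assumptions made for illustration.

```typescript
// Hypothetical sketch, not the real Open Recommender implementation.
// A stage maps the previous stage's output to its own output.
type Stage<In, Out> = (input: In) => Promise<Out> | Out;

class Pipeline<Initial, Current = Initial> {
  private readonly initial: Initial;
  private readonly stages: Stage<any, any>[];

  // Assumption: the pipeline is seeded with an initial input value.
  constructor(initial: Initial, stages: Stage<any, any>[] = []) {
    this.initial = initial;
    this.stages = stages;
  }

  // Adding a stage returns a pipeline whose output type is that stage's output.
  addStage<Next>(stage: Stage<Current, Next>): Pipeline<Initial, Next> {
    return new Pipeline<Initial, Next>(this.initial, [...this.stages, stage]);
  }

  // Run the stages in order, feeding each output into the next stage.
  async execute(): Promise<Current> {
    let value: unknown = this.initial;
    for (const stage of this.stages) {
      value = await stage(value);
    }
    return value as Current;
  }
}

// Toy stand-ins for getTweets and createQueries (no network, no LLM calls).
const getTweets = async (_user: string): Promise<string[]> => [
  "rust lifetimes",
  "llm evals",
];
const createQueries = (tweets: string[]): string[] =>
  tweets.map((tweet) => `${tweet} tutorial`);

const demo = new Pipeline("alice").addStage(getTweets).addStage(createQueries);
// demo.execute() resolves to ["rust lifetimes tutorial", "llm evals tutorial"]
```

Because every stage only sees the previous stage's output, anything a stage drops or distorts is invisible to the stages after it, which is exactly how errors end up compounding.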

Chinese Whispers

The problem is that LLM prompt chains are like a game of Chinese whispers: unless you build error recovery mechanisms into your program, errors compound into stranger and stranger outputs.

I kept running into issues where the createQueries function was both a strong determinant of the quality of the final recommendations and very difficult to get working reliably.

Constructing really effective search queries is inherently a difficult problem because it requires a great deal of knowledge about the user - it's not enough to know that the user has tweeted about a particular topic, you also need to infer the user's expertise level and whether it's a passing interest or something they really care about.

Initially, my approach in the createQueries stage was to run the user's tweets through a prompt called inferInterests. The idea was to extract an array of topics (concepts, people, events and problems) the user was interested in and use those to construct search queries. But this felt like quite a one-dimensional compression of the user's interests, and it erased a lot of nuance in what the user was expressing about each topic.

This meant that the quality of the createQueries output could range from great to very poor, and as many as half of the recommended videos presented to the user at the end of the pipeline felt irrelevant.

It was difficult to build in error recovery mechanisms too, because if I added a step to compare the video search results against the queries, they would look reasonable, but comparing them against the tweets made it clear that a lot of results were missing the mark in terms of relevancy.

The Solution

My first realisation was that compressing a user's tweets into a list of topics, people, events and problems was an extremely lossy compression of a user's interests. And strong lossy compression does not allow stages later in the pipeline to effectively recover from errors.

For that reason I removed the intermediate inferInterests step and instead generate queries directly from the user's tweets:

CreateYouTubeSearchQueries.execute(args: {
    user: string;
    tweets: Tweet[];
}): Promise<{
    queries: {
        query: string;
        tweetIDs: number[];
    }[];
}>

Note that in the return type I ask GPT to include the IDs of the tweets that it used to generate each search query. Later in the pipeline I use these tweets to double-check that the outputs of subsequent stages are still relevant. For example, the signature of the filterSearchResults prompt shows that it takes arrays of tweets and search results as input and returns an array of search results with relevancy scores:

FilterSearchResults.execute(args: {
    user: string;
    tweets: Tweet[];
    results: SearchResult[];
}): Promise<{
    result: SearchResult;
    relevance: number;
}[]>

In this way I'm controlling the error compounding by comparing against the "ground truth" of users' tweets.
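One way to carry that ground truth forward is to resolve each query's tweetIDs back into the original tweets before calling later prompts. The sketch below is an assumption about how this could be done, not the project's actual code: the Tweet fields (id, text) and the attachSourceTweets helper are hypothetical.

```typescript
// Hypothetical shapes: the real Tweet type in Open Recommender may differ.
interface Tweet {
  id: number;
  text: string;
}

interface GeneratedQuery {
  query: string;
  tweetIDs: number[];
}

interface GroundedQuery {
  query: string;
  tweets: Tweet[];
}

// Resolve the tweet IDs reported by the model back to the original tweets.
// IDs that don't match any input tweet (for example, hallucinated ones) are dropped.
function attachSourceTweets(
  queries: GeneratedQuery[],
  tweets: Tweet[],
): GroundedQuery[] {
  const byId = new Map(tweets.map((tweet) => [tweet.id, tweet] as const));
  return queries.map(({ query, tweetIDs }) => ({
    query,
    tweets: tweetIDs
      .map((id) => byId.get(id))
      .filter((tweet): tweet is Tweet => tweet !== undefined),
  }));
}

const sourceTweets: Tweet[] = [
  { id: 1, text: "Learning Rust lifetimes" },
  { id: 2, text: "LLM evals are hard" },
];
const grounded = attachSourceTweets(
  [{ query: "rust lifetimes explained", tweetIDs: [1, 99] }],
  sourceTweets,
);
// grounded[0].tweets contains only the tweet with id 1; the unknown id 99 is ignored.
```

A query that ends up with no source tweets at all is a useful warning sign, since there is nothing left to check its results against.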

Additionally, by adding a simple relevancy score to the prompt output schema, I can filter out poor search results by setting a relevancy cutoff value.
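The cutoff itself can be a few lines of plain code once the prompt has returned its scores. This is a minimal sketch under some assumptions: the SearchResult fields, the 0 to 1 score range and the applyRelevanceCutoff helper are hypothetical, since the post doesn't specify them.

```typescript
// Hypothetical shape: the real SearchResult type may carry more fields.
interface SearchResult {
  title: string;
  url: string;
}

interface ScoredResult {
  result: SearchResult;
  relevance: number; // assumed to be in the range 0 to 1
}

// Keep results at or above the cutoff, best first.
// Non-finite scores (e.g. NaN from a malformed model response) are discarded.
function applyRelevanceCutoff(
  scored: ScoredResult[],
  cutoff: number,
): SearchResult[] {
  return scored
    .filter((s) => Number.isFinite(s.relevance) && s.relevance >= cutoff)
    .sort((a, b) => b.relevance - a.relevance)
    .map((s) => s.result);
}

const kept = applyRelevanceCutoff(
  [
    { result: { title: "Cat compilation", url: "https://example.com/b" }, relevance: 0.2 },
    { result: { title: "Prompt chaining", url: "https://example.com/c" }, relevance: 0.7 },
    { result: { title: "Intro to RAG", url: "https://example.com/a" }, relevance: 0.9 },
  ],
  0.6,
);
// kept titles, in order: "Intro to RAG", "Prompt chaining"
```

The right cutoff depends on how the prompt is calibrated, so it is worth tuning it against a handful of results you have judged by hand.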

Finally, at the end of the pipeline, I added a more expensive filtering and ranking step inspired by RankGPT to do a final ordering over the remaining video clips, picking only the top 10-15 to recommend to the user.
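The post doesn't show how this final step is implemented, so the following is only a sketch of the sliding-window idea that RankGPT describes: rank a small window of candidates at a time, moving from the back of the list to the front so that strong candidates bubble upwards. The slidingWindowRerank function and the rankWindow callback (which would wrap the LLM call) are hypothetical names.

```typescript
// rankWindow is a stand-in for an LLM prompt that reorders a small batch of items.
type RankWindow<T> = (window: T[]) => Promise<T[]>;

async function slidingWindowRerank<T>(
  items: T[],
  rankWindow: RankWindow<T>,
  windowSize = 4,
  step = 2,
  topN = 10,
): Promise<T[]> {
  if (windowSize < 1 || step < 1 || step > windowSize) {
    throw new RangeError("need 1 <= step <= windowSize");
  }
  const ranked = [...items]; // never mutate the caller's array
  let end = ranked.length;
  while (end > 0) {
    const start = Math.max(0, end - windowSize);
    const window = ranked.slice(start, end);
    const reordered = await rankWindow(window);
    // If the ranker returns the wrong number of items, keep the old order.
    const safe = reordered.length === window.length ? reordered : window;
    ranked.splice(start, end - start, ...safe);
    if (start === 0) break;
    end -= step; // overlapping windows carry the best items towards the front
  }
  return ranked.slice(0, topN);
}

// Toy ranker: sorts numbers in descending order instead of calling an LLM.
const sortDescending: RankWindow<number> = async (window) =>
  [...window].sort((a, b) => b - a);

const topTwo = slidingWindowRerank([1, 5, 3, 9, 7, 2], sortDescending, 4, 2, 2);
// topTwo resolves to [9, 7]
```

Each window is one model call, so the cost grows with the number of candidates, which is why it makes sense to run a step like this only after cheaper filters have trimmed the list.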

Core Takeaways

It's best to carry the "ground truth" data for your prompt through the pipeline rather than relying on a lossy compressed summary of it.
Employ other mechanisms like filtering and re-ranking to minimise the effect of errors built up earlier in the pipeline.

Next Steps

DM me on Twitter (@experilearning) if you want to try the current version of Open Recommender! I'll run it on your Twitter data and send you the results. Check out the beta roadmap to see what will be available over the next month or so.
