Unlocking Advanced RAG: Citations and Attributions


Often we want LLMs to cite exact quotes from source material we provide in our prompts. This is useful in academic contexts to cite snippets from papers, for law firms that need to cite sections of the legal code for a case, and in business applications where knowing the exact source of a quote can save hours of scrolling through financial documents. But it's not as simple as asking the LLM in the prompt to return exact quotes. We can't trust that the LLM won't just hallucinate a quote or citation that doesn't really exist. So how can we get LLMs to provide exact quotes or citations and verify that they are correct?

I ran into this problem while working on the RemNote Flashcard Copilot. My goal was to allow new users to generate flashcards by highlighting a paragraph from their notes. I wanted to add citation pins linking the AI-generated flashcards to each sentence in their notes so that new users could understand how the AI had used their notes to generate the flashcards.

Naive substring checks don't work because LLMs make small changes to the text they quote: they alter punctuation and wording, and they correct spelling and grammar mistakes.
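To make the failure concrete, here is a small sketch. The note and the quote are made-up strings for illustration, and `isExactQuote` is a hypothetical helper name.

```typescript
// Hypothetical source note, including the author's typo ("mitochondira").
const sourceNote = "The mitochondira is the powerhouse of the cell; it makes ATP.";

// Hypothetical LLM "quote": the model silently fixed the typo and the punctuation.
const llmQuote = "The mitochondria is the powerhouse of the cell. It makes ATP.";

// Exact substring search: a single changed character is enough to make it fail.
function isExactQuote(quote: string, source: string): boolean {
  return source.includes(quote);
}

console.log(isExactQuote(llmQuote, sourceNote)); // false
console.log(isExactQuote("powerhouse of the cell", sourceNote)); // true
```

The quote is a faithful citation from a reader's point of view, yet the exact check rejects it.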

Initially I attempted to split the text by sentence and generate flashcards per sentence. But this approach is not multilingual: different languages use different sentence delimiters, and the same delimiter can mean different things in different contexts (a period also ends abbreviations, for example). Of course I could ask an LLM to chunk a paragraph into sentences, but this would add extra waiting time for the user.
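A quick sketch of why delimiter-based splitting breaks down. `naiveSplitSentences` is a hypothetical illustration, not the splitter from my original attempt, and the sample strings are made up.

```typescript
// Naive splitter: cut the text into chunks that end in '.', '!' or '?'.
function naiveSplitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Two real sentences, but the period in "Dr." is treated as a sentence end.
const english = naiveSplitSentences("Dr. Smith discovered it. It was 1950.");
console.log(english.length); // 3

// Two Japanese sentences ending in '。', which the splitter does not know about.
const japanese = naiveSplitSentences("これは文です。これも文です。");
console.log(japanese.length); // 1
```

Adding more delimiters to the pattern fixes one language and breaks another, which is why I moved away from this approach.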

Then I tried using an LLM to verify that citations are correct. This is still error-prone: you still run into situations where the checker LLM says that the cited sentence is correct and present, but when you go to search for it, you can't find it.

Solution

What we need is a way to verify with a high probability that the sentence or paragraph cited by the LLM is a genuine citation. And beyond that, it would be ideal if we could find the original citation within the source text ourselves rather than trusting the LLM citation. This is useful when we want to add UI elements to text we are rendering in our application.

The solution I came up with can be summarised as follows:

1. The LLM cites a sentence.
2. I fuzzy search for the best match to that sentence in the original text.
3. If the match score is high enough (above 90 out of 100), the citation is considered valid.
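The steps above can be sketched as follows. This is a minimal sketch, not the RemNote code: `verifyCitation`, `FuzzyMatcher`, and `exactMatcher` are hypothetical names, and the exact-match stand-in exists only so the block runs without dependencies. In practice you would pass in a real fuzzy matcher, such as the modified partial_ratio in this post.

```typescript
// Shape of a fuzzy-search result: best matching span, score from 0 to 100,
// and the index where that span starts in the source text.
interface FuzzyMatch {
  bestMatch: string | null;
  bestScore: number;
  bestStartIdx: number;
}

type FuzzyMatcher = (citation: string, source: string) => FuzzyMatch;

// Second and third steps: search the source for the cited sentence and
// accept it only if the score clears the threshold (90 mirrors the rule above).
function verifyCitation(citation: string, source: string, matcher: FuzzyMatcher, threshold = 90) {
  const found = matcher(citation, source);
  const isValid = found.bestMatch !== null && found.bestScore > threshold;
  return { isValid, ...found };
}

// Stand-in matcher (assumption): exact search only, scoring 100 or 0.
const exactMatcher: FuzzyMatcher = (citation, source) => {
  const idx = source.indexOf(citation);
  return idx >= 0
    ? { bestMatch: citation, bestScore: 100, bestStartIdx: idx }
    : { bestMatch: null, bestScore: 0, bestStartIdx: -1 };
};

const notes = "Cells need energy. Mitochondria produce ATP.";
const genuine = verifyCitation("Mitochondria produce ATP.", notes, exactMatcher);
const invented = verifyCitation("Ribosomes produce ATP.", notes, exactMatcher);
console.log(genuine.isValid, genuine.bestStartIdx); // true 19
console.log(invented.isValid); // false
```

Passing the matcher in as a parameter keeps the accept/reject rule separate from the search itself, so the threshold can be tuned without touching the matching code.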

The fuzzy partial ratio is useful when you want to measure the similarity between two strings while focusing only on the best matching substring. For example, searching for 'pie' inside 'apple pie' yields a score of 100 because the shorter string 'pie' is found in full within the longer string 'apple pie'.
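To show the idea without any dependencies, here is a simplified stand-in. It is not fuzzball's implementation: it slides a window the length of the shorter string across the longer one and scores each window with a normalised Levenshtein similarity. The names `levenshtein`, `similarity`, and `simplePartialRatio` are my own for this sketch.

```typescript
// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity from 0 to 100, where 100 means the strings are identical.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  if (maxLen === 0) return 100;
  return Math.round((1 - levenshtein(a, b) / maxLen) * 100);
}

// Best similarity between the shorter string and any same-length window
// of the longer string.
function simplePartialRatio(s1: string, s2: string): number {
  const [shorter, longer] = s1.length <= s2.length ? [s1, s2] : [s2, s1];
  let best = 0;
  for (let start = 0; start + shorter.length <= longer.length; start++) {
    const candidate = longer.substring(start, start + shorter.length);
    best = Math.max(best, similarity(shorter, candidate));
  }
  return best;
}

console.log(simplePartialRatio("pie", "apple pie")); // 100
```

Checking every window is slower than what fuzzball does, but it makes the "best matching substring" behaviour easy to see.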

I modified the original partial_ratio function to return the best scoring match and its start index, because this weirdly wasn't included in the partial_ratio function from fuzzball.

import { ratio } from 'fuzzball';
import { SequenceMatcher } from 'difflib';

// modified from: https://github.com/nol13/fuzzball.js/blob/773b82991f2bcacc950b413615802aa953193423/fuzzball.js#L942
// Returns the best-scoring window of the longer string, its score (0 to 100),
// and the index at which that window starts in the longer string.
function partial_ratio(str1: string, str2: string) {
  // Always compare the shorter string against windows of the longer one.
  const [shorter, longer] = str1.length <= str2.length ? [str1, str2] : [str2, str1];
  const m = new SequenceMatcher(null, shorter, longer);
  // Each block is [indexInShorter, indexInLonger, length].
  const blocks = m.getMatchingBlocks();
  let bestScore = 0;
  let bestMatch: string | null = null;
  let bestStartIdx = -1;
  for (const block of blocks) {
    // Align the start of `shorter` with this block, clamped to the start of `longer`.
    const longStart = Math.max(block[1] - block[0], 0);
    const longEnd = longStart + shorter.length;
    const longSubstr = longer.substring(longStart, longEnd);
    const r = ratio(shorter, longSubstr);
    if (r > bestScore) {
      bestScore = r;
      bestMatch = longSubstr;
      bestStartIdx = longStart;
    }
    // A near-perfect match cannot be improved on, so stop early.
    if (r > 99.5) {
      break;
    }
  }
  return {
    bestMatch,
    bestScore,
    bestStartIdx,
  };
}
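Once you have the start index, attaching a UI element to the matched span is straightforward. The sketch below is an illustration under assumptions: `markCitation` and `CitationMatch` are hypothetical names, the "[[" and "]]" markers are placeholders for whatever your UI renders, and the source text is assumed to be longer than the citation so that the index refers to the source.

```typescript
// Shape matching the object returned by partial_ratio above.
interface CitationMatch {
  bestMatch: string | null;
  bestScore: number;
  bestStartIdx: number;
}

// Wrap the matched span in markers, or return the text untouched when the
// match is missing or scores at or below the threshold.
function markCitation(source: string, match: CitationMatch, minScore = 90): string {
  if (match.bestMatch === null || match.bestScore <= minScore) {
    return source;
  }
  const start = match.bestStartIdx;
  const end = start + match.bestMatch.length;
  return source.slice(0, start) + "[[" + source.slice(start, end) + "]]" + source.slice(end);
}

const noteText = "Cells need energy. Mitochondria produce ATP.";
// Hand-written result standing in for partial_ratio(citation, noteText).
const sampleResult: CitationMatch = {
  bestMatch: "Mitochondria produce ATP.",
  bestScore: 96,
  bestStartIdx: 19,
};
console.log(markCitation(noteText, sampleResult));
// Cells need energy. [[Mitochondria produce ATP.]]
```

Because the span comes from the source text itself and not from the LLM's output, the highlight always covers text that really exists in the user's notes.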

I hope this can be useful to someone!
