DEV Community

Ranjan Dailata

Keyword Extraction with Google Gemini


This blog post walks you through how to effectively extract keywords from a given piece of text using Google Gemini. Before diving into the implementation, let's first understand what keyword extraction is.

Keyword Extraction

Here's a brief overview of keyword extraction, generated with the help of ChatGPT.

Keyword extraction is a natural language processing (NLP) technique that involves identifying and extracting the most relevant words or phrases from a given text. The goal is to capture the essential topics or themes within the content, allowing for a concise representation of the document's key information. This process is valuable in various applications such as information retrieval, document summarization, and content categorization.

Keyword extraction plays a crucial role in improving search engine results, summarizing large documents, organizing content, and facilitating information retrieval in a more efficient and meaningful manner.
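
To make the idea concrete, here is a minimal sketch of a naive, frequency-based keyword extractor. It is not the Gemini approach used in the rest of this post; the `naive_keywords` function, the tiny stop-word list, and the sample sentence are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "from", "in", "is", "it",
              "of", "or", "that", "the", "to", "with"}

def naive_keywords(text, top_n=3):
    """Return the top_n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

sample = "Keyword extraction finds keywords. Extraction ranks keywords. Extraction is fast."
print(naive_keywords(sample, top_n=2))
```

A counting approach like this has no notion of context or meaning, which is the gap an LLM-based prompt tries to fill.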


  1. Head over to Google Colab.
  2. Log in to Google Cloud and note your Project ID and location.
  3. Use the code below to initialize Vertex AI.
import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

# Define project information
PROJECT_ID = "<<project_id>>"  # @param {type:"string"}
LOCATION = "<<location>>"  # @param {type:"string"}

if "google.colab" in sys.modules:
    # Initialize Vertex AI
    import vertexai

    vertexai.init(project=PROJECT_ID, location=LOCATION)

Good keyword extraction depends on a carefully designed prompt. Here's the code snippet that builds the keyword extraction prompt.

def get_keyword_extraction_prompt(content):
    # The text to analyze is interpolated first, followed by the fixed instructions
    prompt = f"""Extract key keywords or phrases from the following text: {content}"""
    prompt = prompt + """
      1. Identify and list the most important keywords or key phrases in the text. These keywords should capture the main topics, concepts, or subjects discussed in the text.
      2. If there are subtopics or secondary themes mentioned in the text, list them as well. Ensure that the extracted keywords accurately represent the content's context.
      3. Include the exact text span or sentence where each keyword or phrase is found in the original text.
      4. If there are any ambiguous keywords or phrases, indicate the uncertainty and provide possible interpretations or context that might clarify the intended meaning.
      5. Consider the context, relevance, and frequency of the keywords when determining their significance.
      6. If the text suggests any actions, decisions, or recommendations related to the extracted keywords, provide a brief summary of these insights.

      Ensure that your keyword extraction results are relevant, concise, and capture the essential topics within the text.

      Here's the output schema:

      {
          "KeywordExtraction": [
              {
                  "Keyword": "",
                  "Context": "",
                  "TextSpan": ""
              }
          ]
      }

      Do not respond with your own suggestions or recommendations or feedback.
    """
    return prompt
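
Hand-typing a JSON schema inside a prompt string makes it easy to drop a brace or bracket. One possible alternative, sketched below, is to keep the schema as a Python dictionary and serialize it with `json.dumps` before embedding it in the prompt. The names `OUTPUT_SCHEMA` and `render_schema` are hypothetical and not part of the original code; only the three field names come from the prompt above.

```python
import json

# Hypothetical constant mirroring the output schema described in the prompt
OUTPUT_SCHEMA = {
    "KeywordExtraction": [
        {"Keyword": "", "Context": "", "TextSpan": ""}
    ]
}

def render_schema(schema=OUTPUT_SCHEMA):
    """Serialize the schema to indented JSON text that can be embedded in a prompt."""
    return json.dumps(schema, indent=2)

print(render_schema())
```

Because the text is produced by the JSON serializer, the schema shown to the model is always syntactically valid JSON.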

Now let's look at a generic function for executing the above keyword extraction prompt with the Google Gemini Pro model. Here's the code snippet.

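import vertexai
from vertexai.preview.generative_models import GenerativeModel

def execute_prompt(prompt, max_output_tokens=8192):
  model = GenerativeModel("gemini-pro")
  # Stream the response so it can be collected chunk by chunk
  responses = model.generate_content(
      prompt,
      generation_config={
          "max_output_tokens": max_output_tokens,
          "temperature": 0,
          "top_p": 1
      },
      stream=True,
  )

  final_response = []

  for response in responses:
      # Collect the text of each streamed chunk
      final_response.append(response.text)

  # Join without a separator so no extra characters are inserted between chunks
  return "".join(final_response)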

Now it's time to execute the prompt and apply some JSON post-processing to extract the keywords.

The following code block extracts the JSON from the LLM response. Note that at the time of writing, Google Gemini Pro has only recently been released to the public and has some known issues producing well-formed structured JSON, so the response needs a bit of tweaking.

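import re
import json

def extract_json(input_string):
    # Extract JSON within ``` blocks (an optional "json" tag after the opening fence is skipped)
    matches = re.findall(r'```(?:json)?\s*(.*?)```', input_string, re.DOTALL)

    if matches:
        # Join the matches into a single string
        json_content = ''.join(matches)

        # Remove periods (note that this also strips periods inside the extracted values)
        json_content = re.sub(r'\.', '', json_content)

        return json_content
    else:
        print("No ``` block found.")
        return None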
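
Since the model may not always wrap its answer in a fenced block, a more forgiving extractor can be useful. The sketch below is one possible approach, not the author's method: it tries fenced blocks first, then falls back to the span between the first `{` and the last `}`, and returns the parsed object rather than a string. The `extract_json_lenient` name and the sample inputs are assumptions for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this listing readable

def extract_json_lenient(text):
    """Return the first parseable JSON value found in text, or None."""
    # Prefer fenced blocks; an optional "json" tag after the opening fence is skipped
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    candidates = re.findall(pattern, text, re.DOTALL)

    # Fall back to the span between the first "{" and the last "}"
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

fenced = "Sure:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE
print(extract_json_lenient(fenced))
print(extract_json_lenient('Result: {"a": 2} done'))
```

Returning the parsed object means a malformed response surfaces as `None` at extraction time instead of failing later in the pipeline.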
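# Run the full pipeline; `summary` holds the input text to extract keywords from
keywords = []
prompt = get_keyword_extraction_prompt(summary)
response = execute_prompt(prompt)
extracted_json = extract_json(response)
if extracted_json is not None:
    # Assumed completion (the body of this branch is missing in the source): collect the result
    keywords.append(extracted_json)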
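
The extracted JSON is still a plain string at this point. As a rough sketch of a possible next step, the helper below parses it and flattens the `KeywordExtraction` entries into tuples. The `parse_keywords` function and the sample payload are hypothetical; only the field names come from the output schema in the prompt.

```python
import json

def parse_keywords(json_text):
    """Parse extracted JSON text into a list of (keyword, context, text_span) tuples."""
    try:
        payload = json.loads(json_text)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a None input yields no keywords
        return []

    items = payload.get("KeywordExtraction") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []

    return [
        (item.get("Keyword", ""), item.get("Context", ""), item.get("TextSpan", ""))
        for item in items
        if isinstance(item, dict)
    ]

sample_json = (
    '{"KeywordExtraction": [{"Keyword": "NLP", '
    '"Context": "field", "TextSpan": "NLP technique"}]}'
)
print(parse_keywords(sample_json))
```

Returning an empty list on malformed input keeps the calling code simple, at the cost of hiding why parsing failed; logging the error would be a reasonable addition.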

