Billy

AI Research Assistant with Semantic Document Search System

This is a submission for the Open Source AI Challenge with pgai and Ollama

What I Built

This is an AI-based research assistant with a semantic document search system that stores documents and retrieves them through natural language queries. Ollama is integrated into the assistant to summarise provided content and to generate sentiment analysis, key points, and related topics for it. Streamlit provides a minimal user interface.

You can search the data stored in the PostgreSQL database using natural language. The system uses pgvector for vector similarity search and pgai, through TimescaleDB, for the AI features behind search. It is most helpful when you have to manage and search large collections of documents by meaning rather than just by keywords.
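To make "search by meaning" concrete, here is a minimal sketch of the ranking that a vector similarity search performs. It assumes cosine distance, which is what pgvector's `<=>` operator computes. The toy three-dimensional embeddings and the `rank_documents` helper are invented for illustration and are not taken from the project.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_documents(query: list[float], docs: dict[str, list[float]], limit: int = 5) -> list[str]:
    """Return document titles ordered from closest to farthest from the query."""
    ordered = sorted(docs, key=lambda title: cosine_distance(query, docs[title]))
    return ordered[:limit]


# Toy embeddings (hypothetical); real ones come from an embedding model.
docs = {
    "Intro to vector databases": [0.9, 0.1, 0.0],
    "Sourdough baking guide": [0.0, 0.2, 0.9],
    "Indexing embeddings in PostgreSQL": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an embedded search query

print(rank_documents(query_embedding, docs, limit=2))
```

In the database the same idea would be an `ORDER BY embedding <=> query LIMIT n` clause, so the sorting happens inside PostgreSQL instead of in Python.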

Key Features:

  • Summarise documents and generate sentiment analysis, key points, and related topics
  • Semantic search & insights using document embeddings
  • Batch document processing (directly upload CSV files)
  • User-friendly interface
  • Rich metadata and insights for categorization
  • Scalable vector search using both IVFFlat and pgvectorscale
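As a rough idea of what batch processing of an uploaded CSV could look like, the sketch below turns CSV text into rows ready for insertion. The column names `title` and `content`, and the choice to fold every other column into a JSON metadata string, are assumptions made for this example, not the project's actual file format.

```python
import csv
import io
import json


def rows_from_csv(csv_text: str) -> list[dict]:
    """Turn CSV text into document rows; extra columns become metadata."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        title = (record.pop("title", "") or "").strip()
        content = (record.pop("content", "") or "").strip()
        if not content:
            continue  # nothing to embed or search, so skip the row
        # Keep only named, non-empty extra columns as metadata.
        metadata = {key: value for key, value in record.items() if key and value}
        rows.append({"title": title, "content": content, "metadata": json.dumps(metadata)})
    return rows


sample = (
    "title,content,author,year\n"
    "Vectors 101,Embeddings map text to points.,Ada,2024\n"
    "Empty row,,Nobody,2023\n"
)
for row in rows_from_csv(sample):
    print(row["title"], row["metadata"])
```

Rows without content are dropped here because there would be nothing to embed. A real importer might prefer to report them to the user.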

Although the initial idea was a semantic document search tool, I later decided to extend it into an AI research assistant that features the same document search system along with Ollama integration.

Demo

Because of problems hosting Ollama alongside the assistant app, only the semantic document search tool demo is hosted. 😅

assistant

document search tool

GitHub: tomlin7 / AI-research-assistant

Semantic document search system with pgvector and PGAI

Tools Used

Ollama + pgvector + pgai + Streamlit

  • Ollama to summarise provided content and generate sentiment analysis, key points, and related topics
  • TimescaleDB (PostgreSQL) as the primary database (a self-hosted PostgreSQL instance can be configured as well)
  • pgvector for efficient vector similarity search
  • pgai through TimescaleDB for AI features
  • Streamlit for the web interface
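The Ollama part of this stack can be sketched with nothing but the standard library. The example below assumes a local Ollama server on its default port and calls its `/api/generate` endpoint. The model name `llama3`, the prompt wording, and the helper names are placeholders chosen for illustration, not the ones used in the project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(text: str, model: str = "llama3") -> dict:
    """Build a non-streaming generate request that asks for JSON output."""
    prompt = (
        "Analyse the text below and reply with a JSON object with the keys "
        '"summary", "sentiment", "key_points" and "related_topics".\n\n'
        f"Text:\n{text}"
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_insights(raw_response: str) -> dict:
    """Parse the model's JSON reply, filling in defaults for missing keys."""
    data = json.loads(raw_response)
    return {
        "summary": data.get("summary", ""),
        "sentiment": data.get("sentiment", "unknown"),
        "key_points": data.get("key_points", []),
        "related_topics": data.get("related_topics", []),
    }


def analyse(text: str, model: str = "llama3") -> dict:
    """Send the request to a running Ollama server (the model must be pulled)."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # With "stream": False the generated text arrives in the "response" field.
    return parse_insights(body["response"])
```

Calling `analyse("some document text")` needs a running Ollama instance. `build_payload` and `parse_insights` can be exercised without one, which keeps the prompt and parsing logic easy to test.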

Key Technologies

  1. Database Layer

    • pgvector extension for vector operations
    • pgai extension for AI features
    • IVFFlat indexing for efficient similarity search
    • JSONB data type for flexible metadata storage
  2. Machine Learning

  3. Backend

    • Python 3.12+
    • psycopg2 for PostgreSQL interaction
  4. Frontend

    • Streamlit for the web interface
    • Pandas for data display
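For the database layer listed above, a hedged sketch of how a vector column, JSONB metadata, and an IVFFlat index might fit together is shown below. The table name `documents`, its columns, and the 768-dimension embedding in the usage lines are assumptions for illustration. The `lists` heuristic follows pgvector's documented starting point (rows / 1000 up to one million rows, the square root of the row count above that).

```python
def ivfflat_lists(row_count: int) -> int:
    """Starting value for IVFFlat `lists`, per pgvector's guidance."""
    if row_count > 1_000_000:
        return max(1, round(row_count ** 0.5))
    return max(1, row_count // 1000)


def schema_statements(dimensions: int, row_count: int) -> list[str]:
    """SQL for a hypothetical documents table with an IVFFlat cosine index."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        "CREATE TABLE IF NOT EXISTS documents ("
        "id BIGSERIAL PRIMARY KEY, "
        "title TEXT, "
        "content TEXT NOT NULL, "
        "metadata JSONB DEFAULT '{}'::jsonb, "
        f"embedding vector({dimensions}));",
        "CREATE INDEX IF NOT EXISTS documents_embedding_idx ON documents "
        "USING ivfflat (embedding vector_cosine_ops) "
        f"WITH (lists = {ivfflat_lists(row_count)});",
    ]


# Parameterised search in psycopg2's named-placeholder style: filter on JSONB
# containment (@>), then order by cosine distance (<=>).
SEARCH_SQL = """
SELECT title, content, metadata,
       embedding <=> %(query)s::vector AS distance
FROM documents
WHERE metadata @> %(filter)s::jsonb
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""

for statement in schema_statements(dimensions=768, row_count=50_000):
    print(statement)
```

With psycopg2, `SEARCH_SQL` could be executed as `cursor.execute(SEARCH_SQL, {"query": "[0.1, 0.2, ...]", "filter": '{"year": "2024"}', "limit": 5})`, where the query vector is passed in pgvector's text form. IVFFlat indexes are best created after the table already holds data, since the lists are built from existing rows.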

Final Thoughts

This project integrates AI-powered vector search with a traditional database (which can be hard to get used to). The same search tool serves as the basis of an AI research assistant with Ollama integration. It is helpful for content management systems where you need to manage and search through large collections of documents, and the combination of pgvector and pgai provides a strong solution for that.

TODO

  • ⏺️ Better visualization of results using charts
  • ✅ Batch document processing (import CSV)
  • ⏺️ Delete and update functionality for documents
  • ⏺️ Filtering based on metadata as well
  • ⏺️ More use cases of pgai
