Siva Teja

Code Archeologist: AI-Powered Git Repository Analysis with PostgreSQL

This is a submission for the Open Source AI Challenge with pgai and Ollama


What I Built

Code Archeologist is an AI-powered application that analyzes Git repository histories to identify patterns in code evolution. It builds a "genetic tree" of your code's ancestry, suggests refactorings, generates a codebase heatmap, commit activity timeline, contributor statistics, dependency graph, and file change frequency report, and integrates with issue tracking. It uses the PostgreSQL extensions pgvector and pgvectorscale, together with Ollama for text embeddings, to run similarity searches and AI-driven analysis inside PostgreSQL, with no external vector database.
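
To make "similarity search" concrete, here is a small sketch of the ranking that a vector search performs. It assumes cosine distance as the metric (this is what pgvector's `<=>` operator computes; the post does not say which metric the project uses), and the three-dimensional vectors and commit messages are made-up toy values, not output from the application.

```javascript
// Cosine distance between two equal-length numeric vectors:
// 1 - (a . b) / (|a| * |b|). Smaller means more similar.
function cosineDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (hypothetical values; real ones have 768 dimensions).
const query = [1, 0, 0];
const commits = [
  { message: "Update README", embedding: [0, 1, 0] },
  { message: "Fix login redirect", embedding: [0.9, 0.1, 0] },
];

// Rank commits by distance to the query, nearest first.
const ranked = commits
  .map((c) => ({ message: c.message, distance: cosineDistance(query, c.embedding) }))
  .sort((x, y) => x.distance - y.distance);

console.log(ranked.map((r) => r.message)); // "Fix login redirect" first, then "Update README"
```

In the application this comparison would run inside PostgreSQL rather than in JavaScript, so only the nearest rows leave the database.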


Demo

MS-Teja / code-archeologist

Code Archeologist analyzes your Git repository history to identify patterns in code evolution.

Code Archeologist

Code Archeologist analyzes your Git repository history to identify patterns in code evolution. It builds a "genetic tree" of your code's ancestry, suggests refactorings, generates a codebase heatmap, commit activity timeline, contributor statistics, dependency graph, and file change frequency report, and integrates with issue tracking.

Features

  • Genetic Tree: Visualize the ancestry of your codebase.
  • Refactoring Suggestions: Receive actionable recommendations to improve your code.
  • Codebase Heatmap: Identify hotspots and areas with high activity.
  • Commit Activity Timeline: Track commit patterns over time.
  • Contributor Statistics: Analyze contributions from different team members.
  • Dependency Graph: Visualize project dependencies.
  • File Change Frequency: Monitor how often files are modified.
  • Issue Integration: Link code changes with issue tracking systems.
  • Semantic Search in Commits: Find similar commits based on semantic meaning using vector embeddings.
  • Question Answering: Ask questions about your codebase and receive AI-generated answers.
  • Summarization: Get concise summaries of commit messages.

Features with AI Integration

  • Vector Embeddings:

Screenshots: Home, Analyzing, Code Evolution Graph, Similar Commits, Commit Questions, Summary, File Change Frequency, Commit Activity Timeline, Contributors, Dependency Graph, Issues.


How the Project Works

Code Archeologist combines PostgreSQL extensions with open-source AI models:

  • Data Ingestion

    • The application connects to a GitHub repository and fetches commit history, contributors, file changes, issues, and dependencies using the GitHub API.
  • Embedding Generation

    • Commit messages and relevant text data are processed using Ollama to generate 768-dimensional vector embeddings.
    • These embeddings are stored in PostgreSQL using the pgvector extension, which allows efficient storage and retrieval of high-dimensional vectors.
  • Indexing and Similarity Search

    • The pgvectorscale extension is employed to create a diskann index on the embedding column, enabling fast approximate nearest neighbor searches.
    • This setup allows the application to perform rapid similarity searches, facilitating features like semantic commit search and question answering.
  • AI-Driven Features

    • Refactoring Suggestions: Analyzes code evolution to recommend potential improvements.
    • Summarization: Generates concise summaries of commit messages using AI models.
    • Question Answering: Allows users to ask natural language questions about the codebase and receive AI-generated answers based on the commit history.
  • Visualization

    • The frontend, built with Vue.js, displays visualizations such as genetic trees, heatmaps, timelines, and dependency graphs.
  • Performance and Scalability

    • Because the embeddings are stored and indexed in PostgreSQL with pgvector and pgvectorscale, storage, querying, and scaling to large datasets are handled without an external vector database.
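
The indexing and similarity-search steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the table name commit_embeddings and the 768 dimensions come from the post, while the column names, the index name, the cosine operator class, and the helper functions are hypothetical. The `db` argument is any object with a node-postgres style `query(text, values)` method, and the exact diskann index syntax may vary between pgvectorscale versions.

```javascript
// Hypothetical schema. Only the table name and the 768 dimensions come from the post.
const SCHEMA_SQL = `
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

  CREATE TABLE IF NOT EXISTS commit_embeddings (
    sha       TEXT PRIMARY KEY,
    message   TEXT NOT NULL,
    embedding VECTOR(768) NOT NULL
  );

  CREATE INDEX IF NOT EXISTS commit_embeddings_diskann_idx
    ON commit_embeddings USING diskann (embedding vector_cosine_ops);
`;

// Nearest-neighbour query: <=> is pgvector's cosine distance operator.
const SEARCH_SQL = `
  SELECT sha, message, embedding <=> $1::vector AS distance
  FROM commit_embeddings
  ORDER BY embedding <=> $1::vector
  LIMIT $2
`;

// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding) {
  if (!Array.isArray(embedding) || !embedding.every(Number.isFinite)) {
    throw new TypeError("embedding must be an array of finite numbers");
  }
  return `[${embedding.join(",")}]`;
}

// Creates the extensions, table, and index if they do not exist yet.
async function initSchema(db) {
  await db.query(SCHEMA_SQL);
}

// Returns the `limit` commits closest to `queryEmbedding`, nearest first.
async function findSimilarCommits(db, queryEmbedding, limit = 5) {
  const { rows } = await db.query(SEARCH_SQL, [toVectorLiteral(queryEmbedding), limit]);
  return rows;
}
```

With a `pg` Pool as `db`, a call like `findSimilarCommits(pool, questionEmbedding, 5)` would return the five nearest commits. Because the `ORDER BY` uses the same operator as the index's operator class, PostgreSQL can answer it with the diskann index instead of scanning every row.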


How the Backend Works

The backend of Code Archeologist is built using Node.js and Express.js, interfacing with a PostgreSQL database enhanced with AI-specific extensions. Here’s an overview of its functionality:

  • Session Management

    • Uses express-session to manage user sessions, so state persists across requests.
  • Database Initialization

    • Connects to the PostgreSQL database.
    • Initializes the database schema, creating tables like code_analysis and commit_embeddings.
    • Ensures necessary extensions (pgvector, pgvectorscale) are installed for vector operations and AI functionalities.
  • Data Processing

    • Fetching Data: Retrieves commits, contributors, issues, and dependencies from GitHub repositories.
    • Embedding Generation: Uses Ollama to generate vector embeddings for commit messages, storing them with pgvector for efficient similarity searches.
    • Indexing: Implements pgvectorscale with diskann indexing to optimize search performance.
  • AI Integration

    • OpenAI & Ollama: Integrates OpenAI for generating completions and Ollama for creating text embeddings, facilitating features like refactoring suggestions and question answering.
  • Error Handling

    • Implements robust error handling across all endpoints, ensuring meaningful responses and logging errors for debugging.
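
The embedding-generation step in the data-processing stage might look something like the sketch below. It assumes a local Ollama server reached over its HTTP embeddings endpoint and the nomic-embed-text model, which produces 768-dimensional vectors; the post names neither, so both are assumptions, and the project may call Ollama differently. The column names, the upsert on sha, and the function names are likewise hypothetical. `db` is any object with a node-postgres style `query(text, values)` method, and `fetchImpl` is injectable so the function can be exercised without a running server.

```javascript
// Assumed defaults: local Ollama server and the nomic-embed-text model.
const OLLAMA_URL = "http://localhost:11434/api/embeddings";
const EMBEDDING_MODEL = "nomic-embed-text";
const EXPECTED_DIMENSIONS = 768; // dimension stated in the post

// Asks Ollama for the embedding of one piece of text.
async function embedText(text, fetchImpl = fetch) {
  const response = await fetchImpl(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: EMBEDDING_MODEL, prompt: text }),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed with status ${response.status}`);
  }
  const { embedding } = await response.json();
  if (!Array.isArray(embedding) || embedding.length !== EXPECTED_DIMENSIONS) {
    throw new Error("unexpected embedding shape returned by Ollama");
  }
  return embedding;
}

// Hypothetical columns; assumes sha is unique so re-analysis overwrites old rows.
const INSERT_SQL = `
  INSERT INTO commit_embeddings (sha, message, embedding)
  VALUES ($1, $2, $3::vector)
  ON CONFLICT (sha) DO UPDATE
    SET message = EXCLUDED.message, embedding = EXCLUDED.embedding
`;

// Embeds a commit message and stores it; returns the number of dimensions stored.
async function storeCommitEmbedding(db, commit, fetchImpl = fetch) {
  const embedding = await embedText(commit.message, fetchImpl);
  // pgvector accepts a text literal like '[0.1,0.2,...]' cast to vector.
  const literal = `[${embedding.join(",")}]`;
  await db.query(INSERT_SQL, [commit.sha, commit.message, literal]);
  return embedding.length;
}
```

Checking the dimension before inserting surfaces a model mismatch as a clear application error instead of a database error on a fixed-width vector column.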

Technologies and Tools Used

  • Frontend:

    • Vue.js: JavaScript framework for building user interfaces.
    • Axios: Promise-based HTTP client for making API requests.
    • Cytoscape.js: Library for graph theory (network) data visualization.
    • Chart.js: Simple yet flexible JavaScript charting library.
    • Highlight.js: Syntax highlighting for code snippets.
    • QTip2: Advanced tooltips for enhanced user interactions.
    • D3.js: Data-driven documents for creating dynamic visualizations.
    • DOMPurify: Sanitizes HTML to prevent XSS attacks.
  • Backend:

    • Node.js: Server-side JavaScript runtime.
    • Express.js: Web framework for building API endpoints.
    • PostgreSQL: Relational database system.
    • pgvector (installed as the vector extension): PostgreSQL extension for storing vector embeddings.
    • pgvectorscale (installed as the vectorscale extension): Extension for optimized vector similarity searches.
    • Octokit: GitHub API client for fetching repository data.
    • dotenv: Loads environment variables from a .env file.
    • express-session: Manages user sessions.
    • Winston: Logging library for capturing application logs.
    • cors: Enables Cross-Origin Resource Sharing.
  • AI & Machine Learning:

    • Ollama: Generates text embeddings using open-source models.
    • OpenAI SDK: Facilitates AI-driven features like completions and question answering.
  • APIs

    • GitHub API: Accesses repository information.
    • OpenAI API: For generating summaries and answering questions.

Final Thoughts

Building Code Archeologist was a great experience that showed me just how powerful combining PostgreSQL extensions with AI tools can be. Using pgvector, pgvectorscale, and Ollama together made it possible to create a strong, scalable app that can handle complex searches and give useful insights. This project really boosted my appreciation for using open-source tools to build smart AI applications.


Thanks for considering my submission!
