Sunil Kumar Dash for Composio

13 open-source tools that will make you 99% more likely to land any AI job 🪄✨

I’ve been in the AI space since the days when the top language models were BERT and T5. Since then, the progress has been insane.

We now have better models, tools, frameworks, and machines.

If you are contemplating entering AI, there has never been a better time, and the ideal approach is to master the tools that will put you ahead of the competition.

So, I have compiled a coveted list of open-source software that covers various aspects of AI development, from AI model training and monitoring to building AI agents.


Comment if anything else needs to be mentioned here. Also, do star and contribute meaningfully to the repositories; it can be one of the best ways to build credibility for your CV.


1. Composio 👑: Automate workflows by integrating popular apps with AI

The age of AI agents is upon us, and many Fortune 500 companies have started adopting agentic workflows. However, automating complex workflows is anything but easy.

To connect AI models with external applications, you need specialized toolsets. For instance, to automate aspects of software development, an AI model needs access to GitHub, Jira, code interpreters, code indexers, the Internet, and so on.

This is where Composio comes into the picture.

It lets you integrate over 100 production-ready toolsets, such as Gmail, Google Sheets, Jira, and Notion, to automate complex real-world workflows.

So, here’s how you can get started with it.

Python

pip install composio-core

Add a GitHub integration.

composio add github

Composio handles user authentication and authorization on your behalf.

Here is how you can use the GitHub integration to Star a repository.

from openai import OpenAI
from composio_openai import ComposioToolSet, Action

openai_client = OpenAI(api_key="******OPENAIKEY******")

# Initialise the Composio toolset
composio_toolset = ComposioToolSet(api_key="******COMPOSIO_API_KEY******")

# Get the pre-configured GitHub tools
actions = composio_toolset.get_actions(actions=[Action.GITHUB_ACTIVITY_STAR_REPO_FOR_AUTHENTICATED_USER])

my_task = "Star a repo ComposioHQ/composio on GitHub"

# Create a chat completion request to decide on the action
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    tools=actions,  # Passing the actions we fetched earlier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": my_task},
    ],
)

# Execute the tool calls chosen by the model
composio_toolset.handle_tool_calls(response)

Run this Python script to execute the given instruction using the agent.

Javascript

You can install it using npm, yarn, or pnpm.

npm install composio-core

Define a method to let the user connect their GitHub account.

import { OpenAI } from "openai";
import { OpenAIToolSet } from "composio-core";

const toolset = new OpenAIToolSet({
  apiKey: process.env.COMPOSIO_API_KEY,
});

async function setupUserConnectionIfNotExists(entityId) {
  const entity = await toolset.client.getEntity(entityId);
  const connection = await entity.getConnection('github');

  if (!connection) {
      // If this entity/user hasn't already connected the account,
      // start a new connection and wait for it to become active
      const newConnection = await entity.initiateConnection('github');
      console.log("Log in via: ", newConnection.redirectUrl);
      return newConnection.waitUntilActive(60);
  }

  return connection;
}

Add the required tools to the OpenAI SDK and pass the entity name on to the executeAgent function.

async function executeAgent(entityName) {
  const entity = await toolset.client.getEntity(entityName)
  await setupUserConnectionIfNotExists(entity.id);

  const tools = await toolset.get_actions({ actions: ["github_activity_star_repo_for_authenticated_user"] }, entity.id);
  const instruction = "Star a repo ComposioHQ/composio on GitHub"

  const client = new OpenAI({ apiKey: process.env.OPEN_AI_API_KEY })
  const response = await client.chat.completions.create({
      model: "gpt-4-turbo",
      messages: [{
          role: "user",
          content: instruction,
      }],
      tools: tools,
      tool_choice: "auto",
  })

  console.log(response.choices[0].message.tool_calls);
  await toolset.handle_tool_call(response, entity.id);
}

executeAgent("joey")

Execute the code and let the agent do the work for you.

Composio works with popular frameworks like LangChain, LlamaIndex, CrewAI, etc.

For more information, visit the official docs; for more complex use cases, see the examples section of the repository.

Composio Gif

Star the Composio repository ⭐


2. TRL by HuggingFace: Train transformer language models with reinforcement learning

You often need LLMs and diffusion models to behave in specific ways, such as respecting guardrails or following human instructions. This is where you need TRL.

TRL, or Transformer Reinforcement Learning, backed by Hugging Face, is a widely used open-source library for easily fine-tuning and aligning language models.

It supports multiple methods for aligning models, such as reinforcement learning with PPO (Proximal Policy Optimization), supervised fine-tuning (SFT), and DPO (Direct Preference Optimization).

Its Pythonic interface makes it easy for beginners to get started quickly.

Install trl using pip.

pip install trl

Let’s quickly go through the SFTTrainer class for supervised fine-tuning of an LLM.

# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()

The code block creates an SFTTrainer instance with facebook/opt-350m. The train() method starts fine-tuning the model on the IMDB dataset.
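TRL also ships trainers for preference alignment. As a taste, here is a minimal DPO sketch modeled on the TRL docs; the exact constructor arguments (e.g., processing_class) vary across trl versions, so treat it as a sketch rather than a drop-in script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# A preference dataset with "prompt", "chosen", and "rejected" columns
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="opt-350m-dpo"),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called `tokenizer` in older trl versions
)
trainer.train()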

Check out the example section for more.

Trl Gif

Star the trl repository ⭐


3. PyTorch Lightning: Build, train, and fine-tune models at scale

AI development is unthinkable without PyTorch, and PyTorch Lightning takes it a step further.

It is a general-purpose framework that helps structure and scale PyTorch-based deep learning projects, providing training, experimentation, and deployment tools across various domains.

Lightning offers several benefits over plain PyTorch:

  • It makes PyTorch code more readable, structured, and user-friendly.
  • It reduces repetitive code with predefined training loops and utilities.
  • It simplifies training, experimentation, and deployment with less boilerplate.

Get started with Lightning using pip.

pip install lightning

Define an auto-encoder using Lightning module.

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L 

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

Load MNIST data.

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

The Lightning Trainer “mixes” any LightningModule with any dataset and abstracts away all the engineering complexity needed for scale.

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

For more on Lightning, check out the official documentation.

LightningAI Gif

Star the Lightning AI repository ⭐


4. Weights & Biases: Monitor all the pieces of your ML pipeline

Suppose you want to fine-tune or train a model. In that case, you must keep track of multiple components, such as model hyperparameters, training and validation metrics, data preprocessing steps, model architecture versions, and experiment configurations.

Knowing if the model you are training is on the right course is essential.

Wandb (Weights & Biases) is one of the best open-source solutions for experiment tracking. It lets you track metrics and collaborate with your team members.

Get started with W&B in four steps:

  1. First, sign up for a W&B account.
  2. Second, install the W&B SDK with pip. Navigate to your terminal and type the following command:

pip install wandb

  3. Third, log in to W&B:

wandb.login()

  4. Finally, use the example code snippet below as a template to integrate W&B into your PyTorch Lightning script:
# This script needs these libraries to be installed:
#   torch, torchvision, pytorch_lightning

import wandb

import os
from torch import optim, nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, lr=1e-3, inp_size=28, optimizer="Adam"):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(inp_size * inp_size, 64), nn.ReLU(), nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, inp_size * inp_size)
        )
        self.lr = lr

        # save hyperparameters to self.hparams, auto-logged by wandb
        self.save_hyperparameters()

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # log metrics to wandb
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(lr=1e-3, inp_size=28)

# setup data
batch_size = 32
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# initialise the wandb logger and name your wandb project
wandb_logger = WandbLogger(project="my-awesome-project")

# add your batch size to the wandb config
wandb_logger.experiment.config["batch_size"] = batch_size

# pass wandb_logger to the Trainer
trainer = pl.Trainer(limit_train_batches=750, max_epochs=5, logger=wandb_logger)

# train the model
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

# [optional] finish the wandb run (necessary when running in a notebook)
wandb.finish()

You can watch the metrics on your W&B dashboard in real time.

For more information, refer to the developer guide.

Wandb Gif

Star the Wandb repository ⭐


5. MLflow: A Machine Learning Lifecycle Platform

MLflow is a comprehensive MLOps framework used across industries.

It lets you track the entire lifecycle of an AI model, from training and fine-tuning to deployment. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc.), wherever you currently run ML code (e.g., in notebooks, standalone applications, or the cloud).

Beyond models, it also lets you track and monitor AI agents built with LangChain, the OpenAI SDK, etc.

It is an essential tool for building a complete end-to-end ML/AI pipeline.
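The tracking API is tiny. Here is a minimal sketch; the experiment name, parameter, and metric values below are placeholders:

import mlflow

# Group runs under a named experiment (placeholder name)
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)  # placeholder hyperparameter
    for epoch in range(3):
        # placeholder metric logged once per epoch
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)

Everything logged this way shows up in the MLflow UI, where you can compare runs side by side.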

Mlflow Gif

Star the MLflow repository ⭐


6. Pgvector: Open-source vector similarity search for Postgres

RAG applications are incomplete without a vector database. Vector databases manage unstructured data as high-dimensional vectors, or embeddings.

Many organizations already use Postgres to store structured data, which makes Pgvector a natural choice: of the many available options, it will make the most sense in the long term.

Install Pgvector on Linux and Mac.

Compile and install the extension (supports Postgres 12+)

cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

See the installation notes if you run into issues.

You can also install it with Docker, Homebrew, PGXN, APT, Yum, pkg, or conda-forge. It comes preinstalled with Postgres.app and many hosted providers. There are also instructions for GitHub Actions.

Enable the extension (do this once in each database where you want to use it)

CREATE EXTENSION vector;

Create a vector column with 3 dimensions

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbours by L2 distance

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

It also supports inner product (<#>), cosine distance (<=>), and L1 distance (<+>, added in 0.7.0).

Note: <#> returns the negative inner product, since Postgres only supports ASC order index scans on operators.

Storing

Create a new table with a vector column

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to an existing table

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Or load vectors in bulk using COPY (example)

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Upsert vectors

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vectors

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vectors

DELETE FROM items WHERE id = 1;

Querying

Get the nearest neighbours to a vector

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

For more on Pgvector, refer to the repository.

Pgvector Gif

Star the PgVector repository ⭐


7. Llama Cpp: LLM inference in C/C++

Many organizations want to self-host open LLMs. This requires a highly optimized and efficient inference engine.

Llama Cpp makes the most sense here. Developed by Georgi Gerganov, it is one of the best open-source solutions for serving LLMs.

As the name suggests, it is written in C/C++, which makes it fast. It also supports almost all the open-weight models, such as Llama 3, Mistral, Gemma, and Nous Hermes.

Check out this guide for instructions on building llama.cpp yourself.
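If you'd rather drive it from Python, the community-maintained llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming you've already downloaded a GGUF model (the path below is a placeholder):

from llama_cpp import Llama

# Load a local GGUF model (placeholder path)
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf")

# Run a single completion
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])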

Llama cpp

Star the Llama Cpp repository ⭐


8. LangGraph: Build resilient language agents as graphs

LangGraph is easily one of the most capable frameworks for building efficient and reliable AI agents. As the name suggests, it models an agent as a graph of nodes and edges, with support for cycles.

It is an extension of LangChain, so it has a massive community of AI developers building on it.

Get started with it using pip.

pip install -U langgraph
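As a taste of the API, here is a minimal one-node graph; a sketch assuming a recent langgraph release:

from typing import TypedDict

from langgraph.graph import END, START, StateGraph

# The shared state passed between nodes
class State(TypedDict):
    message: str

# A node is just a function from state to (partial) state
def greet(state: State) -> State:
    return {"message": f"Hello, {state['message']}!"}

builder = StateGraph(State)
builder.add_node("greet", greet)
builder.add_edge(START, "greet")
builder.add_edge("greet", END)
graph = builder.compile()

print(graph.invoke({"message": "world"}))  # {'message': 'Hello, world!'}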

If you want to build agents/bots with LangGraph, check out our detailed blog on building a Gmail and Calendar assistant.

For more on LangGraph, visit the documentation.


LangGraph Gif

Star the LangGraph repository ⭐


9. Pydantic: Data validation using Python type hints

It is easily one of the best things to happen to the Python ecosystem in a while.

The core value proposition of Pydantic is data validation.

From building resilient APIs to getting structured outputs from LLMs, Pydantic has seen a massive rise in popularity. Many companies use it, and even OpenAI's SDK accepts Pydantic models for defining structured outputs from LLMs.

Install Pydantic using pip.

pip install pydantic

A small example.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

external_data = {'id': '123', 'signup_ts': '2017-06-01 12:22', 'friends': [1, '2', b'3']}
user = User(**external_data)
print(user)
#> User id=123 name='John Doe' signup_ts=datetime.datetime(2017, 6, 1, 12, 22) friends=[1, 2, 3]
print(user.id)
#> 123
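And since structured outputs were mentioned above, here is how a Pydantic model plugs into the OpenAI SDK's structured-output support. A sketch: it requires a recent openai package, and the model name is just an example.

from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # example model name
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,  # the Pydantic model defines the schema
)
print(completion.choices[0].message.parsed)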

Check out the documentation for more.

Pydantic Gif

Star the Pydantic repository ⭐


10. FastAPI: Fast, Simple, and Easy Python Framework

FastAPI has also received a lot of praise for being performant yet simple and easy to learn.

Many AI companies use FastAPI to build APIs, whether to expose inference endpoints for their models or to build full web apps.

Mastering FastAPI will put you in a good position to handle both AI and API development.

It’s built on Starlette, making it one of the fastest Python frameworks.

Get started with FastAPI using pip.

pip install "fastapi[standard]"

Build a simple API.

from typing import Union

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
    return {"item_id": item_id, "q": q}

Run the development server:

fastapi dev main.py
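Then open http://127.0.0.1:8000/items/5?q=somequery in your browser to see the JSON response. FastAPI also serves interactive API docs at http://127.0.0.1:8000/docs out of the box.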

For more information on FastAPI, visit the documentation.

FastAPI Gif

Star the FastAPI repository ⭐


11. Neo4j: Graphs for Everyone

Neo4j occupies a special place when it comes to building knowledge bases for AI apps. It is one of the few open-source tools that combine a graph database with vector search.

Neo4j is pioneering GraphRAG, an effective RAG method that extracts relevant information using a hybrid retrieval approach from Knowledge graphs and vector databases.

This approach has proven more effective than traditional RAG, which relies on vector retrieval alone.

One of the common patterns for using GraphRAG is as follows:

  1. Do a vector or keyword search to find an initial set of nodes.
  2. Traverse the graph to bring back information about related nodes.
  3. Optionally, re-rank documents using a graph-based ranking algorithm such as PageRank.
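Here is a minimal sketch of steps 1 and 2 using the official neo4j Python driver; the vector index name, relationship type, and node properties are hypothetical, and the query embedding is a placeholder:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Step 1: vector search for seed nodes; Step 2: traverse to related nodes.
# 'doc_embeddings' and :MENTIONS are hypothetical names for this sketch.
query = """
CALL db.index.vector.queryNodes('doc_embeddings', 5, $embedding)
YIELD node, score
MATCH (node)-[:MENTIONS]->(related)
RETURN node.text AS text, collect(related.name) AS related_entities, score
"""

with driver.session() as session:
    # placeholder embedding; must match the index's dimensionality in practice
    for record in session.run(query, embedding=[0.1, 0.2, 0.3]):
        print(record["text"], record["related_entities"], record["score"])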

For more information, refer to this article on GraphRAG.

Neo4j Gif

Star the Neo4j repository ⭐


12. Airbyte: Reliable and extensible data pipelines

Data is crucial for building AI applications, especially in production environments where managing large volumes of data from diverse sources is critical. Airbyte is particularly effective at handling this.

With a vast catalogue of over 300 connectors, Airbyte supports integration with various APIs, databases, data warehouses, and data lakes.

Airbyte also includes a Python library, PyAirbyte, which is compatible with popular frameworks like LangChain and LlamaIndex, making it easier to move data from multiple sources into your GenAI applications.
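A minimal PyAirbyte sketch using the built-in source-faker demo connector, mirroring the PyAirbyte quickstart:

import airbyte as ab

# Configure a demo source that generates fake records
source = ab.get_source(
    "source-faker",
    config={"count": 100},
    install_if_missing=True,
)
source.check()               # validate the connector configuration
source.select_all_streams()  # sync every stream the source offers
result = source.read()       # records land in a local cache (DuckDB by default)

for name, records in result.streams.items():
    print(f"Stream {name}: {len(list(records))} records")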

Check out this notebook for a detailed example of using PyAirbyte with LangChain.

For additional information, please refer to the documentation.

AirByte Gif

Star the AirByte repository ⭐


13. DSPy: Programming LLMs

DSPy is another highly underrated framework that I expect to become big in the future.

It tackles a problem few other frameworks are addressing right now.

The stochastic nature of LLMs makes it challenging to integrate them into traditional software systems, which are typically deterministic.

This often leads to the need for extensive prompt engineering and fine-tuning. DSPy bridges this gap by offering a more systematic way of working with LLMs.

DSPy from Stanford simplifies this by doing two key things:

  1. Separating Program Flow from Parameters: This feature keeps your program's flow (the steps you take) separate from the details of how each step is done (the LM prompts and weights). This makes it easier to manage and update your system.
  2. Introducing New Optimizers: DSPy uses advanced algorithms that automatically fine-tune the LM prompts and weights based on your goals, such as improving accuracy or reducing errors.
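Here is a minimal sketch of the programming model, assuming a recent DSPy release; the model name and API key are placeholders:

import dspy

# Point DSPy at an LM (placeholder model name and key)
lm = dspy.LM("openai/gpt-4o-mini", api_key="YOUR_API_KEY")
dspy.configure(lm=lm)

# Declare *what* the program should do; DSPy handles the prompting
qa = dspy.Predict("question -> answer")
print(qa(question="What is the capital of France?").answer)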

Check out this Getting Started notebook for more on how to work with DSPy.

DsPy Gif

Star the DSPy repository ⭐


Thanks for reading! Feel free to share any other essential open-source tools for AI in the comments. ✨

Top comments (9)

Vortico
Hey, great post! We really enjoyed it. You might be interested in knowing how to productionalise ML models with a simple line of code. If so, please have a look at flama for Python. We published an introductory post here a while ago: Introducing Flama for Robust ML APIs. If you have any doubts, or you'd like to learn more about it and how it works in more detail, don't hesitate to give us a shout. And if you like it, please gift us a star ⭐ here.

Andreas
Thank you!

Nevo David
Great list!

Mathew
Amazing, I will try these tools.

Sunil Kumar Dash
Thanks, Mathew.

Daniel
Thanks!

migduroli
I would add flama, which is specifically designed for the productionalisation of ML models via ML APIs. For an actual example of an entire ML pipeline run with flama, you can check this post, which I think contains all the relevant information.

Sacha Wharton
What a delicious list!

Capgo
Very useful, thank you!