Sunil Kumar Dash for Composio

13 open-source tools that will make you 99% more likely to land any AI job 🪄✨

I’ve been in the AI space since the days when the top language models were BERT and T5. Since then, the progress has been insane.

We now have better models, tools, frameworks, and machines.

If you are contemplating entering AI, there has never been a better time, and the ideal approach is to master the tools that will put you ahead of the competition.

So, I have compiled a coveted list of open-source software that covers various aspects of AI development, from AI model training and monitoring to building AI agents.


Comment if anything else needs to be mentioned here. Also, do star and contribute meaningfully to the repositories; it can be one of the best ways to build credibility for your CV.


1. Composio 👑: Automate workflows by integrating popular apps with AI

The age of AI agents is upon us, and many Fortune 500 companies have started adopting agentic workflows. However, automating complex workflows is anything but easy.

To connect AI models with external applications, you need specialized toolsets. For instance, to automate aspects of software development, an AI model needs access to GitHub, Jira, code interpreters, code indexers, the Internet, and so on.

This is where Composio comes into the picture.

It lets you integrate over 100 production-ready toolsets, such as Gmail, Google Sheets, Jira, and Notion, to automate complex real-world workflows.

So, here’s how you can get started with it.

Python

pip install composio-core

Add a GitHub integration.

composio add github

Composio handles user authentication and authorization on your behalf.

Here is how you can use the GitHub integration to Star a repository.

from openai import OpenAI
from composio_openai import ComposioToolSet, Action

openai_client = OpenAI(api_key="******OPENAIKEY******")

# Initialise the Composio toolset
composio_toolset = ComposioToolSet(api_key="******COMPOSIO_API_KEY******")

# Get the pre-configured GitHub tools
actions = composio_toolset.get_actions(actions=[Action.GITHUB_ACTIVITY_STAR_REPO_FOR_AUTHENTICATED_USER])

my_task = "Star a repo ComposioHQ/composio on GitHub"

# Create a chat completion request to decide on the action
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    tools=actions,  # Passing the actions we fetched earlier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": my_task},
    ],
)

# Execute the tool calls chosen by the model
composio_toolset.handle_tool_calls(response)

Run this Python script to execute the given instruction using the agent.

Javascript

You can install it using npm, yarn, or pnpm.

npm install composio-core

Define a method to let the user connect their GitHub account.

import { OpenAI } from "openai";
import { OpenAIToolSet } from "composio-core";

const toolset = new OpenAIToolSet({
  apiKey: process.env.COMPOSIO_API_KEY,
});

async function setupUserConnectionIfNotExists(entityId) {
  const entity = await toolset.client.getEntity(entityId);
  const connection = await entity.getConnection('github');

  if (!connection) {
      // If this entity/user hasn't already connected the account,
      // start a new connection and wait for it to become active
      const newConnection = await entity.initiateConnection('github');
      console.log("Log in via: ", newConnection.redirectUrl);
      return newConnection.waitUntilActive(60);
  }

  return connection;
}

Add the required tools to the OpenAI SDK and pass the entity name on to the executeAgent function.

async function executeAgent(entityName) {
  const entity = await toolset.client.getEntity(entityName)
  await setupUserConnectionIfNotExists(entity.id);

  const tools = await toolset.get_actions({ actions: ["github_activity_star_repo_for_authenticated_user"] }, entity.id);
  const instruction = "Star a repo ComposioHQ/composio on GitHub"

  const client = new OpenAI({ apiKey: process.env.OPEN_AI_API_KEY })
  const response = await client.chat.completions.create({
      model: "gpt-4-turbo",
      messages: [{
          role: "user",
          content: instruction,
      }],
      tools: tools,
      tool_choice: "auto",
  })

  console.log(response.choices[0].message.tool_calls);
  await toolset.handle_tool_call(response, entity.id);
}

executeAgent("joey")

Execute the code and let the agent do the work for you.

Composio works with popular frameworks like LangChain, LlamaIndex, CrewAI, etc.

For more information, visit the official docs; for more complex use cases, see the examples section of the repository.

Composio Gif

Star the Composio repository ⭐


2. TRL by HuggingFace: Train transformer language models with reinforcement learning

You often need LLMs and diffusion models to behave in specific ways, such as respecting guardrails or following human instructions. This is where you need TRL.

TRL, or Transformer Reinforcement Learning, backed by Hugging Face, is a widely used open-source library for easily fine-tuning and aligning language models.

It supports multiple methods for aligning models, such as reinforcement learning with PPO (Proximal Policy Optimization), supervised fine-tuning (SFT), and DPO (Direct Preference Optimization).

Its Pythonic interface makes it easy for beginners to get started quickly.

Install trl using pip.

pip install trl

Let’s quickly go through the SFTTrainer class for supervised fine-tuning of an LLM.

# imports
from datasets import load_dataset
from trl import SFTTrainer

# get dataset
dataset = load_dataset("imdb", split="train")

# get trainer
trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)

# train
trainer.train()

The code block creates an SFTTrainer instance with facebook/opt-350m. The train() method starts fine-tuning the model on the IMDB dataset.
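TRL also ships trainers for preference alignment. As a taste, here is a minimal DPO sketch modeled on the TRL docs; the exact constructor arguments (e.g., processing_class) vary across trl versions, so treat it as a sketch rather than a drop-in script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# A preference dataset with "prompt", "chosen", and "rejected" columns
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="opt-350m-dpo"),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called `tokenizer` in older trl versions
)
trainer.train()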

Check out the example section for more.

Trl Gif

Star the trl repository ⭐


3. PyTorch Lightning: Build, train, and fine-tune models at scale

AI development is unthinkable without PyTorch, and PyTorch Lightning takes it a step further.

It is a general-purpose framework that helps structure and scale PyTorch-based deep learning projects, providing training, experimentation, and deployment tools across various domains.

Lightning offers several benefits over plain PyTorch:

  • It makes PyTorch code more readable, structured, and user-friendly.
  • It reduces repetitive code with predefined training loops and utilities.
  • It simplifies training, experimentation, and deployment with less boilerplate.

Get started with Lightning using pip.

pip install lightning

Define an auto-encoder using Lightning module.

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L 

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

Load MNIST data.

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

The Lightning Trainer “mixes” any LightningModule with any dataset and abstracts away all the engineering complexity needed for scale.

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

For more on Lightning, check out the official documentation.

LightningAI Gif

Star the Lightning AI repository ⭐


4. Weights & Biases: Monitor all the pieces of your ML pipeline

Suppose you want to fine-tune or train a model. In that case, you must keep track of multiple components, such as model hyperparameters, training and validation metrics, data preprocessing steps, model architecture versions, and experiment configurations.

Knowing if the model you are training is on the right course is essential.

Wandb (Weights & Biases) is one of the best open-source solutions for experiment tracking. It lets you track metrics and collaborate with your team members.

Get started with W&B in four steps:

  1. First, sign up for a W&B account.
  2. Second, install the W&B SDK with pip. Navigate to your terminal and type the following command:

pip install wandb

  3. Third, log in to W&B:

wandb.login()

  4. Finally, use the example code snippet below as a template to integrate W&B into your PyTorch Lightning script:
# This script needs these libraries to be installed:
#   torch, torchvision, pytorch_lightning

import wandb

import os
from torch import optim, nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, lr=1e-3, inp_size=28, optimizer="Adam"):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(inp_size * inp_size, 64), nn.ReLU(), nn.Linear(64, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, inp_size * inp_size)
        )
        self.lr = lr

        # save hyperparameters to self.hparams, auto-logged by wandb
        self.save_hyperparameters()

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)

        # log metrics to wandb
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

# init the autoencoder
autoencoder = LitAutoEncoder(lr=1e-3, inp_size=28)

# setup data
batch_size = 32
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# initialise the wandb logger and name your wandb project
wandb_logger = WandbLogger(project="my-awesome-project")

# add your batch size to the wandb config
wandb_logger.experiment.config["batch_size"] = batch_size

# pass wandb_logger to the Trainer
trainer = pl.Trainer(limit_train_batches=750, max_epochs=5, logger=wandb_logger)

# train the model
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

# [optional] finish the wandb run (necessary when running in a notebook)
wandb.finish()

You can watch the metrics on your W&B dashboard in real time.

For more information, refer to the developer guide.

Wandb Gif

Star the Wandb repository ⭐


5. MLflow: A Machine Learning Lifecycle Platform

MLflow is a comprehensive MLOps framework used across industries.

It lets you track the entire lifecycle of an AI model, from training and fine-tuning to deployment. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc.), wherever you currently run ML code (e.g., in notebooks, standalone applications, or the cloud).

Beyond models, it also lets you track and monitor AI agents built with LangChain, the OpenAI SDK, etc.

It is an essential tool for building a complete end-to-end ML/AI pipeline.
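The tracking API is tiny. Here is a minimal sketch; the experiment name, parameter, and metric values below are placeholders:

import mlflow

# Group runs under a named experiment (placeholder name)
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)  # placeholder hyperparameter
    for epoch in range(3):
        # placeholder metric logged once per epoch
        mlflow.log_metric("train_loss", 1.0 / (epoch + 1), step=epoch)

Everything logged this way shows up in the MLflow UI, where you can compare runs side by side.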

Mlflow Gif

Star the MLflow repository ⭐


6. Pgvector: Open-source vector similarity search for Postgres

RAG applications are incomplete without a vector database. Vector databases manage unstructured data as high-dimensional vectors, or embeddings.

Many organizations already use Postgres to store structured data, which makes Pgvector a natural choice: of the many available options, it will make the most sense in the long term.

Install Pgvector on Linux and Mac.

Compile and install the extension (supports Postgres 12+)

cd /tmp
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

See the installation notes if you run into issues.

You can also install it with Docker, Homebrew, PGXN, APT, Yum, pkg, or conda-forge. It comes preinstalled with Postgres.app and many hosted providers. There are also instructions for GitHub Actions.

Enable the extension (do this once in each database where you want to use it)

CREATE EXTENSION vector;

Create a vector column with 3 dimensions

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbours by L2 distance

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

It also supports inner product (<#>), cosine distance (<=>), and L1 distance (<+>, added in 0.7.0).

Note: <#> returns the negative inner product, since Postgres only supports ASC order index scans on operators.

Storing

Create a new table with a vector column

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to an existing table

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Or load vectors in bulk using COPY (example)

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Upsert vectors

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vectors

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vectors

DELETE FROM items WHERE id = 1;

Querying

Get the nearest neighbours to a vector

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

For more on Pgvector, refer to the repository.

Pgvector Gif

Star the PgVector repository ⭐


7. Llama Cpp: LLM inference in C/C++

Many organizations want to self-host open LLMs. This requires a highly optimized and efficient inference engine.

Llama Cpp makes the most sense here. Developed by Georgi Gerganov, it is one of the best open-source solutions for serving LLMs.

As the name suggests, it is written in C/C++, which makes it fast. It also supports almost all the open-weight models, such as Llama 3, Mistral, Gemma, and Nous Hermes.

Check out this guide for instructions on building llama.cpp yourself.
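If you'd rather drive it from Python, the community-maintained llama-cpp-python bindings wrap the same engine. A minimal sketch, assuming you've already downloaded a GGUF model (the path below is a placeholder):

from llama_cpp import Llama

# Load a local GGUF model (placeholder path)
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf")

# Run a single completion
output = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(output["choices"][0]["text"])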

Llama cpp

Star the Llama Cpp repository ⭐


8. LangGraph: Build resilient language agents as graphs

LangGraph is easily one of the most capable frameworks for building efficient and reliable AI agents. As the name suggests, it models an agent as a graph of nodes and edges, with support for cycles.

It is an extension of LangChain, so it has a massive community of AI developers building on it.

Get started with it using pip.

pip install -U langgraph
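As a taste of the API, here is a minimal one-node graph; a sketch assuming a recent langgraph release:

from typing import TypedDict

from langgraph.graph import END, START, StateGraph

# The shared state passed between nodes
class State(TypedDict):
    message: str

# A node is just a function from state to (partial) state
def greet(state: State) -> State:
    return {"message": f"Hello, {state['message']}!"}

builder = StateGraph(State)
builder.add_node("greet", greet)
builder.add_edge(START, "greet")
builder.add_edge("greet", END)
graph = builder.compile()

print(graph.invoke({"message": "world"}))  # {'message': 'Hello, world!'}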

If you want to build agents/bots with LangGraph, check out our detailed blog on building a Gmail and Calendar assistant.

For more on LangGraph, visit the documentation.


LangGraph Gif

Star the LangGraph repository ⭐


9. Pydantic: Data validation using Python type hints

It is easily one of the best things to happen to the Python ecosystem in a while.

The core value proposition of Pydantic is data validation.

From building resilient APIs to getting structured outputs from LLMs, Pydantic has seen a massive rise in popularity. Many companies use it, and even OpenAI's SDK accepts Pydantic models for defining structured outputs from LLMs.

Install Pydantic using pip.

pip install pydantic

A small example.

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

external_data = {'id': '123', 'signup_ts': '2017-06-01 12:22', 'friends': [1, '2', b'3']}
user = User(**external_data)
print(user)
#> User id=123 name='John Doe' signup_ts=datetime.datetime(2017, 6, 1, 12, 22) friends=[1, 2, 3]
print(user.id)
#> 123
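And since structured outputs were mentioned above, here is how a Pydantic model plugs into the OpenAI SDK's structured-output support. A sketch: it requires a recent openai package, and the model name is just an example.

from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # example model name
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,  # the Pydantic model defines the schema
)
print(completion.choices[0].message.parsed)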

Check out the documentation for more.

Pydantic Gif

Star the Pydantic repository ⭐


10. FastAPI: Fast, Simple, and Easy Python Framework

FastAPI has also received a lot of praise for being performant yet simple and easy to learn.

Many AI companies use FastAPI to build APIs, whether to expose inference endpoints for their models or to build full web apps.

Mastering FastAPI will put you in a good position to handle both AI and API development.

It’s built on Starlette, making it one of the fastest Python frameworks.

Get started with FastAPI using pip.

pip install "fastapi[standard]"

Build a simple API.

from typing import Union

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Union[str, None] = None):
    return {"item_id": item_id, "q": q}

Run the development server:

fastapi dev main.py
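Then open http://127.0.0.1:8000/items/5?q=somequery in your browser to see the JSON response. FastAPI also serves interactive API docs at http://127.0.0.1:8000/docs out of the box.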

For more information on FastAPI, visit the documentation.

FastAPI Gif

Star the FastAPI repository ⭐


11. Neo4j: Graphs for Everyone

Neo4j occupies a special place when it comes to building knowledge bases for AI apps. It is one of the few open-source tools that combine a graph database with vector search.

Neo4j is pioneering GraphRAG, an effective RAG method that extracts relevant information using a hybrid retrieval approach from Knowledge graphs and vector databases.

This approach has proven more effective than traditional RAG, which relies on vector retrieval alone.

One of the common patterns for using GraphRAG is as follows:

  1. Do a vector or keyword search to find an initial set of nodes.
  2. Traverse the graph to bring back information about related nodes.
  3. Optionally, re-rank documents using a graph-based ranking algorithm such as PageRank.
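Here is a minimal sketch of steps 1 and 2 using the official neo4j Python driver; the vector index name, relationship type, and node properties are hypothetical, and the query embedding is a placeholder:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Step 1: vector search for seed nodes; Step 2: traverse to related nodes.
# 'doc_embeddings' and :MENTIONS are hypothetical names for this sketch.
query = """
CALL db.index.vector.queryNodes('doc_embeddings', 5, $embedding)
YIELD node, score
MATCH (node)-[:MENTIONS]->(related)
RETURN node.text AS text, collect(related.name) AS related_entities, score
"""

with driver.session() as session:
    # placeholder embedding; must match the index's dimensionality in practice
    for record in session.run(query, embedding=[0.1, 0.2, 0.3]):
        print(record["text"], record["related_entities"], record["score"])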

For more information, refer to this article on GraphRAG.

Neo4j Gif

Star the Neo4j repository ⭐


12. Airbyte: Reliable and extensible data pipelines

Data is crucial for building AI applications, especially in production environments where managing large volumes of data from diverse sources is critical. Airbyte is particularly effective at handling this.

With a vast catalogue of over 300 connectors, Airbyte supports integration with various APIs, databases, data warehouses, and data lakes.

Airbyte also includes a Python library, PyAirbyte, which is compatible with popular frameworks like LangChain and LlamaIndex, making it easier to move data from multiple sources into your GenAI applications.
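A minimal PyAirbyte sketch using the built-in source-faker demo connector, mirroring the PyAirbyte quickstart:

import airbyte as ab

# Configure a demo source that generates fake records
source = ab.get_source(
    "source-faker",
    config={"count": 100},
    install_if_missing=True,
)
source.check()               # validate the connector configuration
source.select_all_streams()  # sync every stream the source offers
result = source.read()       # records land in a local cache (DuckDB by default)

for name, records in result.streams.items():
    print(f"Stream {name}: {len(list(records))} records")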

Check out this notebook for a detailed example of using PyAirbyte with LangChain.

For additional information, please refer to the documentation.

AirByte Gif

Star the AirByte repository ⭐


13. DSPy: Programming LLMs

DSPy is another highly underrated framework that I expect to become big in the future.

It tackles a problem few other frameworks are addressing right now.

The stochastic nature of LLMs makes it challenging to integrate them into traditional software systems, which are typically deterministic.

This often leads to the need for extensive prompt engineering and fine-tuning. DSPy bridges this gap by offering a more systematic way of working with LLMs.

DSPy from Stanford simplifies this by doing two key things:

  1. Separating Program Flow from Parameters: This feature keeps your program's flow (the steps you take) separate from the details of how each step is done (the LM prompts and weights). This makes it easier to manage and update your system.
  2. Introducing New Optimizers: DSPy uses advanced algorithms that automatically fine-tune the LM prompts and weights based on your goals, such as improving accuracy or reducing errors.
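Here is a minimal sketch of the programming model, assuming a recent DSPy release; the model name and API key are placeholders:

import dspy

# Point DSPy at an LM (placeholder model name and key)
lm = dspy.LM("openai/gpt-4o-mini", api_key="YOUR_API_KEY")
dspy.configure(lm=lm)

# Declare *what* the program should do; DSPy handles the prompting
qa = dspy.Predict("question -> answer")
print(qa(question="What is the capital of France?").answer)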

Check out this Getting Started notebook for more on how to work with DSPy.

DsPy Gif

Star the DSPy repository ⭐


Thanks for reading! Feel free to share any other essential open-source tools for AI in the comments. ✨

Top comments (9)

Vortico
Hey, great post! We really enjoyed it. You might be interested in knowing how to productionalise ML models with a simple line of code. If so, please have a look at flama for Python. We published an introductory post here a while ago: Introducing Flama for Robust ML APIs. If you have any doubts, or you'd like to learn more about it and how it works in more detail, don't hesitate to give us a shout. And if you like it, please gift us a star ⭐ here.

Andreas
Thank you!

Nevo David
Great list!

Mathew
Amazing, I will try these tools.

Sunil Kumar Dash
Thanks, Mathew.

Daniel
Thanks!

migduroli
I would add flama, which is specifically designed for the productionalisation of ML models via ML APIs. For an actual example of an entire ML pipeline run with flama, you can check this post, which I think contains all the relevant information.

Sacha Wharton
What a delicious list!

Capgo
Very useful, thank you!