DEV Community

Bap


5 open source tools for effective ML Testing

Hi there 👋

In this week's article, we will look into various Machine Learning testing tools.

The tools below are designed to help you improve your model's accuracy, reliability and overall effectiveness.

Whether you are beginning your journey into ML or you are a seasoned expert, I hope you will find the repos below useful.

Ready to check them out?



Shameless promotion: If you enjoy building ML and GenAI projects, how would you feel about getting rewarded for it? 🙃

To participate, sign up to Quira and check out Quests. The current prize pool is $2048.

Now that you have checked our Quests, let's see how you can leverage the repos below to build an excellent ML/GenAI project. 🚀


Giskard-AI/giskard 🐢

A testing framework going from tabular to LLMs


Why should I consider this repo? Giskard can be seen as a health checkup for your models. It can scan many kinds of models, from simple tabular ones to advanced language models, and it is designed to fit into whatever environment you already work in. That makes it a good choice for checking that your models are fair, accurate, and ready for real-world applications. On a side note, I met their founder at a conference last year, and I was really impressed with their project!
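To make the "health checkup" idea a bit more concrete, here is a minimal sketch in plain Python of one kind of check a model scan could run: comparing accuracy on each data slice against the overall accuracy. This is not Giskard's implementation. The helper names (`accuracy_by_slice`, `underperforming_slices`), the toy rows, the toy model and the 0.1 tolerance are all assumptions made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(rows, predict, slice_key, target_key):
    """Return {slice_value: accuracy} for a list of dict rows."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = row[slice_key]
        totals[group] += 1
        if predict(row) == row[target_key]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def underperforming_slices(rows, predict, slice_key, target_key, tolerance=0.1):
    """Flag slices whose accuracy is more than `tolerance` below the overall accuracy."""
    overall = sum(predict(row) == row[target_key] for row in rows) / len(rows)
    per_slice = accuracy_by_slice(rows, predict, slice_key, target_key)
    return {group: acc for group, acc in per_slice.items() if acc < overall - tolerance}


# Hypothetical toy data and model, for illustration only.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
]


def toy_model(row):
    # Naive rule: predict survival only for women.
    return 1 if row["Sex"] == "female" else 0


flagged = underperforming_slices(rows, toy_model, "Sex", "Survived")
print(flagged)  # only the "male" slice is flagged for this toy model
```

A real scan covers far more than this (robustness, bias, and so on), but the "slice, measure, compare" loop is a handy mental model for reading its report.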

Installation: pip install giskard -U

Getting started:


import giskard

# Replace this with your own data & model creation.
df = giskard.demo.titanic_df()
demo_data_processing_function, demo_sklearn_model = giskard.demo.titanic_pipeline()

# Wrap your Pandas DataFrame with Giskard.Dataset (test set, a golden dataset, etc.).
giskard_dataset = giskard.Dataset(
    df=df,  # A pandas.DataFrame that contains the raw data (before all the pre-processing steps) and the actual ground truth variable (target).
    target="Survived",  # Ground truth variable
    name="Titanic dataset", # Optional
    cat_columns=['Pclass', 'Sex', "SibSp", "Parch", "Embarked"]  # List of categorical columns. Optional, but is a MUST if available. Inferred automatically if not.
)

# Wrap your model with Giskard.Model. Check the dedicated doc page: https://docs.giskard.ai/en/latest/guides/wrap_model/index.html
# you can use any tabular, text or LLM models (PyTorch, HuggingFace, LangChain, etc.),
# for classification, regression & text generation.
def prediction_function(df):
    # The pre-processor can be a pipeline of one-hot encoding, imputer, scaler, etc.
    preprocessed_df = demo_data_processing_function(df)
    return demo_sklearn_model.predict_proba(preprocessed_df)

giskard_model = giskard.Model(
    model=prediction_function,  # A prediction function that encapsulates all the data pre-processing steps and that could be executed with the dataset used by the scan.
    model_type="classification",  # Either regression, classification or text_generation.
    name="Titanic model",  # Optional
    classification_labels=demo_sklearn_model.classes_,  # Their order MUST be identical to the prediction_function's output order
    feature_names=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'],  # Default: all columns of your dataset
)

scan_results = giskard.scan(giskard_model, giskard_dataset)

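display(scan_results)  # Works in a notebook. In a plain script, save a report instead: scan_results.to_html("scan_report.html")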




https://github.com/Giskard-AI/giskard


confident-ai/deepeval ✍️

The Evaluation Framework for LLMs


Why should I consider this repo? DeepEval is a tool for those working with LLMs. It offers a specialised, easy-to-use testing framework akin to Pytest but focused on LLMs. With its capability to run a variety of crucial metrics, it ensures thorough quality checks of LLM applications.
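If the "Pytest but for LLMs" comparison feels abstract, here is a rough sketch of the pattern in plain Python: compute a score for an LLM answer, then fail the test when the score misses a threshold. The word-overlap score is a crude stand-in that I made up for this example, and DeepEval's real metrics are considerably more sophisticated. The names `support_score` and `assert_supported` are hypothetical and not part of DeepEval.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(actual_output, context):
    """Fraction of output tokens that also appear somewhere in the context."""
    out = tokens(actual_output)
    ctx = set()
    for passage in context:
        ctx |= tokens(passage)
    return len(out & ctx) / len(out) if out else 0.0


def assert_supported(actual_output, context, threshold=0.7):
    """Raise AssertionError when the score falls below the threshold."""
    score = support_score(actual_output, context)
    assert score >= threshold, f"score {score:.2f} is below threshold {threshold}"
    return score


def test_refund_answer():
    # Same toy scenario as the shoe-refund example in this section.
    context = ["All customers are eligible for a 30 day full refund at no extra costs."]
    # Replace this with the actual output from your LLM application.
    actual_output = "We offer a 30-day full refund at no extra costs."
    assert_supported(actual_output, context, threshold=0.7)
```

Because `test_refund_answer` is an ordinary test function, a runner such as Pytest would pick it up like any other unit test, which is the workflow this kind of framework builds on.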

Installation: pip install -U deepeval

Getting started:


from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "What if these shoes don't fit?"
context = ["All customers are eligible for a 30 day full refund at no extra costs."]
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra costs."

hallucination_metric = HallucinationMetric(threshold=0.7)
test_case = LLMTestCase(
    input=input,
    actual_output=actual_output,
    context=context
)
evaluate([test_case], [hallucination_metric])




https://github.com/confident-ai/deepeval


promptfoo/promptfoo 🔎

Measure LLM quality and catch regressions


Why should I consider this repo? This tool should be used when you want to evaluate the output of different prompts with various LLMs. It's built to streamline evaluations with side-by-side comparisons, caching, and concurrency. You can leverage promptfoo for various known LLM providers (OpenAI, Gemini) and custom APIs.
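To picture what a side-by-side evaluation with caching involves, here is a small sketch in plain Python. It renders every prompt template with every test case, sends it to every provider, and reuses cached outputs for repeated calls. It is not how promptfoo is implemented, and it skips concurrency entirely. The `render` and `evaluate` helpers and the two stub providers are hypothetical stand-ins for real LLM calls.

```python
import itertools
import re


def render(template, variables):
    """Fill {{name}} placeholders from a dict of variables."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), template)


def evaluate(prompts, providers, tests):
    """Run every prompt x provider x test case, caching repeated calls."""
    cache = {}
    results = []
    for template, (name, call), test in itertools.product(prompts, providers.items(), tests):
        rendered = render(template, test["vars"])
        key = (name, rendered)
        if key not in cache:
            cache[key] = call(rendered)  # only hit the provider on a cache miss
        results.append({"provider": name, "prompt": rendered, "output": cache[key]})
    return results


# Hypothetical stand-in providers; a real run would call an LLM API here.
providers = {
    "upper-stub": lambda prompt: prompt.upper(),
    "length-stub": lambda prompt: f"{len(prompt)} characters",
}
prompts = ["Rephrase this in French: {{body}}", "Rephrase this like a pirate: {{body}}"]
tests = [{"vars": {"body": "Hello world"}}, {"vars": {"body": "I'm hungry"}}]

results = evaluate(prompts, providers, tests)
for row in results:
    print(f'{row["provider"]:<12} | {row["prompt"]:<42} | {row["output"]}')
```

Two prompts, two providers and two test cases give eight rows, which is the grid a side-by-side comparison view lays out for you.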

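Installation: npx promptfoo@latest init (to set up the CLI), or npm install promptfoo to use it as a Node library, as in the snippet below.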

Getting started:


import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
  providers: ['openai:gpt-3.5-turbo'],
  tests: [
    {
      vars: {
        body: 'Hello world',
      },
    },
    {
      vars: {
        body: "I'm hungry",
      },
    },
  ],
});



https://github.com/promptfoo/promptfoo


deepchecks/deepchecks ✅

ML testing for sklearn and pandas


Why should I consider this repo? Deepchecks is a tool you can use at every stage, from research to production. It focuses on providing continuous validation, runs checks on different data types (Tabular, NLP, Vision), and integrates into your CI/CD workflows.
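As a rough illustration of what "continuous validation in CI/CD" can look like, here is a plain-Python sketch of a single data check (do test rows leak from the training set?) whose result is turned into an exit code a CI job could act on. This is only my own toy version, not Deepchecks code. The function names, the sample rows and the 5% limit are assumptions for the example.

```python
def train_test_overlap(train_rows, test_rows):
    """Return the fraction of test rows that also appear verbatim in the training set."""
    seen = {tuple(sorted(row.items())) for row in train_rows}
    leaked = sum(tuple(sorted(row.items())) in seen for row in test_rows)
    return leaked / len(test_rows) if test_rows else 0.0


def run_checks(train_rows, test_rows, max_overlap=0.05):
    """Run every check and return a list of (name, passed, detail) tuples."""
    overlap = train_test_overlap(train_rows, test_rows)
    detail = f"{overlap:.0%} of test rows seen in train"
    return [("train_test_overlap", overlap <= max_overlap, detail)]


# Hypothetical toy data: the second test row is a copy of a training row.
train = [{"age": 22, "fare": 7.25}, {"age": 38, "fare": 71.28}, {"age": 26, "fare": 7.92}]
test = [{"age": 35, "fare": 53.10}, {"age": 22, "fare": 7.25}]

report = run_checks(train, test)
for name, passed, detail in report:
    print(f"{'PASS' if passed else 'FAIL'} {name}: {detail}")

# In a CI job you would end with sys.exit(exit_code) so a failed check fails the build.
exit_code = 0 if all(passed for _, passed, _ in report) else 1
```

A full suite bundles many such checks and renders a report, but the pass/fail contract with the CI system stays this simple.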

Installation: pip install deepchecks -U --user

Getting started:


from deepchecks.tabular.suites import model_evaluation

# Assumes train_dataset and test_dataset are deepchecks.tabular.Dataset objects
# and model is an already-fitted, scikit-learn-compatible estimator.
suite = model_evaluation()

suite_result = suite.run(train_dataset=train_dataset, test_dataset=test_dataset, model=model)

suite_result.save_as_html()  # or suite_result.show() / suite_result.show_in_window() to see results inline or in a window

# or suite_result.results[0].value with the relevant check index to process the check result's values in Python




https://github.com/deepchecks/deepchecks


great-expectations/great_expectations 📊

Always know what to expect from your data


Why should I consider this repo? This tool is like a quality control expert for your data. Great Expectations is not a pipeline execution framework. Instead, it integrates with DAG execution tools like Spark, Airflow, dbt, Prefect, Dagster, Kedro, Flyte, etc. It runs your data quality tests while these tools execute the pipelines.
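If you have never written a data "expectation" before, this plain-Python sketch shows the general idea: small declarative checks that a pipeline step can run against a batch of rows before passing it on. It is loosely inspired by expectations such as GX's `expect_column_values_to_not_be_null`, but `check_not_null`, `check_between`, `validate` and the sample batch are hypothetical helpers written for this post, not the Great Expectations API.

```python
def check_not_null(rows, column):
    """Expect every row to have a non-None value in `column`."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"check": f"{column} is not null", "success": not bad, "failing_rows": bad}


def check_between(rows, column, min_value, max_value):
    """Expect every non-null value in `column` to fall within [min_value, max_value]."""
    bad = [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (min_value <= row[column] <= max_value)
    ]
    name = f"{column} between {min_value} and {max_value}"
    return {"check": name, "success": not bad, "failing_rows": bad}


def validate(rows, checks):
    """Apply each check to the batch and summarise the outcome."""
    results = [check(rows) for check in checks]
    return {"success": all(r["success"] for r in results), "results": results}


# Hypothetical batch, as might be handed over by one step of a pipeline.
batch = [
    {"passenger_id": 1, "age": 22},
    {"passenger_id": 2, "age": None},
    {"passenger_id": 3, "age": 140},
]

report = validate(batch, [
    lambda rows: check_not_null(rows, "passenger_id"),
    lambda rows: check_not_null(rows, "age"),
    lambda rows: check_between(rows, "age", 0, 120),
])
for result in report["results"]:
    print(result)
```

The orchestration tool decides when each step runs, and checks like these decide whether the data coming out of a step is good enough to continue.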

Installation: pip install great_expectations

Getting started:


# It is recommended to deploy within a virtual environment

import great_expectations as gx

context = gx.get_context()


Depending on the tools and systems that you're already using, the setup phase will be quite different.

This is why it is best to read up on their documentation (which I find really well-made)! 🤓

https://github.com/great-expectations/great_expectations


That's it for this one guys. 🌟

I hope these discoveries are valuable and will help you with your ML models! ⚒️

If you want to leverage these tools to build cool ML/GenAI projects and earn rewards, log into Quira and discover Quests! 💰


As usual, please consider supporting these projects by starring them. ⭐️

We are not affiliated with them.

We just think that great projects deserve great recognition.

See you next week,

Your Dev.to buddy 💚

Bap


If you want to join the self-proclaimed "coolest" server in open source 😝, you should join our Discord server. We are here to help you on your journey in open source. 🫶

Top comments (15)

Mahmoud Mabrouk

I would add to these github.com/agenta-ai/agenta
It provides a UI to compare and evaluate prompts and LLM apps. Compared to promptfoo, it also provides an interface for human evaluation/annotation, and takes care of versioning and deploying the applications.

Bap

Thanks for sharing this Mahmoud! Agenta's interface does look very slick 💅

g5f

Really interesting repos, thanks @fernandezbaptiste !

Bap

Thanks a lot! Appreciate your comment :)

uliyahoo

Cool! Haven't heard of some of these, will definitely check them out.

Bap

🙃🌟

Matija Sosic

Good stuff, thank you! Promptfoo looks especially cool.

Bap

🚀

Jack Bridger

awesome!

Bap

Thanks a lot for the kind comment Jack!

Marc Seitz

Perfect timing. Thanks for sharing :)

Bap

woop woop 🙌

Sunglasses Dev

Awesome list ;)

Bap

Appreciate it 🙏

Bap

Please share the ML testing tools you use in the comments below! 👇