yactouat

a subjective evaluation of a few open LLMs

Hey there!

Amazing what you can do with small language models like llama3! I've been using local models a lot, and the best consistently good DevX I've personally had so far has been with llama3 or phi3.

Both never cease to excel at following precise instructions, formatting data, labeling sentiment, and other simple tasks that can easily be implemented in any app.

The best part is that these models are free and can be kept fully private (they can run on-premise); so if you need a more cost-efficient solution than throwing money at OpenAI, let me submit to you this little test I did lately =>

import pytest

from llm_lib import classify_sentiment

@pytest.mark.parametrize("headline_input,expected", [
    (
            {
                'headline_text': ('Asure Partners with Key Benefit Administrators '
                                  'to Offer Proactive Health Management Plan (PHMP) to Clients')},
            {'possible_sentiments': ['bullish', 'neutral', 'slightly bullish', 'very bullish']}
    ),
    (
            {
                'headline_text': ('Everbridge Cancels Fourth Quarter '
                                  'and Full Year 2023 Financial Results Conference Call')},
            {'possible_sentiments': ['bearish', 'neutral', 'slightly bearish', 'uncertain', 'very bearish']}
    ),
    (
            {
                'headline_text': ("This Analyst With 87% Accuracy Rate Sees Around 12% Upside In Masco -",
                                  "Here Are 5 Stock Picks For Last Week From Wall Street's Most Accurate Analysts "
                                  "- Masco (NYSE:MAS)")},
            {'possible_sentiments': ['bullish', 'slightly bullish', 'very bullish']}
    ),
    (
            {'headline_text': 'Tesla leads 11% annual drop in EV prices as demand slowdown continues'},
            {'possible_sentiments': ['bearish', 'slightly bearish', 'very bearish']}
    ),
    (
            {'headline_text': "Elon Musk Dispatches Tesla's 'Fireman' to China Amid Slowing Sales"},
            {'possible_sentiments': ['bearish', 'slightly bearish']}
    ),
    (
            {'headline_text': "OpenAI co-founder Ilya Sutskever says he will leave the startup"},
            {'possible_sentiments': ['bearish', 'neutral', 'slightly bearish', 'uncertain']}
    ),
    (
            {'headline_text': "Hedge funds cut stakes in Magnificent Seven to invest in broader AI boom"},
            {'possible_sentiments': ['bearish', 'bullish', 'neutral', 'slightly bearish', 'slightly bullish']} # the "broader AI boom" part can be seen as bullish
    )
])
def test_classify_sentiment(headline_input, expected):
    assert classify_sentiment(**headline_input) in expected['possible_sentiments']

... the goal here is to get one of the expected sentiments for each headline about the stock market. TL;DR: open LLMs nailed it!

I've become so used to their reliability that I can even put these kinds of tests in a CI pipeline without worrying too much that they would randomly fail 😏
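
If you want to guard such tests in CI so they only run when a local model server is actually reachable, here is a minimal sketch (the OLLAMA_HOST check and the smoke test are my own assumptions, not something llm_lib requires) =>

import os

import pytest

from llm_lib import classify_sentiment

# only run LLM tests when a local Ollama endpoint is configured in the environment
requires_local_llm = pytest.mark.skipif(
    not os.getenv("OLLAMA_HOST"),
    reason="no local LLM endpoint configured",
)

@requires_local_llm
def test_classify_sentiment_smoke():
    sentiment = classify_sentiment("Tesla leads 11% annual drop in EV prices as demand slowdown continues")
    assert sentiment in ("bearish", "slightly bearish", "very bearish")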

Here is the prompt and its wrapping function (I'm using LangChain for this) =>

from langchain_core.prompts import PromptTemplate


# `get_model` and `current_model` are helpers defined elsewhere in llm_lib:
# `current_model` holds the default model name, `get_model` returns the corresponding LangChain chat model
def classify_sentiment(headline_text: str, model_name: str = current_model):
    template = """You are a stocks market professional. Your job is to label a headline with a sentiment IN ENGLISH.

Headlines that mention upside should be considered bullish. 
Any headline that mentions a sales decline, a drop in stock prices, a factory glut, an economic slowdown, increased selling pressure, or other negative economic indicators should be considered bearish instead of neutral. 
Only label a headline as neutral if it does not have any clear positive or negative sentiment or business implication.

You'll prefix a bullish or bearish sentiment with "very" if the headline is particularly positive or negative in its implications.
On the other hand, you'll prefix a slightly bullish or slightly bearish sentiment with "slightly" if the headline is only slightly positive or negative in its implications.

Here is the headline text you need to label, delimited by dashes:

--------------------------------------------------
{headline_text}
--------------------------------------------------

Here is the list of the possible sentiments, delimited by commas:

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
very bullish
bullish
slightly bullish
neutral
slightly bearish
bearish
very bearish
uncertain
volatile
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

You are to output ONLY ONE SENTIMENT WITH THE EXACT WORDING, from the provided list of sentiments.
DO NOT add additional content, punctuation, explanation, characters, or any formatting in your output."""
    sentiment_prompt = PromptTemplate.from_template(template)
    chain = sentiment_prompt | get_model(model_name)
    output = chain.invoke({"headline_text": headline_text})
    return output.content.strip().lower()
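
For completeness, classify_sentiment relies on two helpers that aren't shown here, current_model and get_model. A minimal sketch of what they could look like with LangChain's Ollama integration (the actual implementation in llm_lib may differ) =>

from langchain_community.chat_models import ChatOllama

# default model used when classify_sentiment is called without an explicit model_name
current_model = "llama3"

def get_model(model_name: str) -> ChatOllama:
    # temperature 0 keeps the sentiment labels as deterministic as possible
    return ChatOllama(model=model_name, temperature=0)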

I've spent a little time downloading a handful of models of various sizes from Ollama, focusing on the smallest ones since I want to run recurring scraping jobs and can't afford to wait too long for inference.

Here are the results of running the tests on an Intel® Xeon® Gold 5412U server with 256 GB of DDR5 ECC RAM and no GPU.

| Model              | Status | Time (s) |
|--------------------|--------|----------|
| llama3             | OK     | 17.68    |
| phi3               | OK     | 17.84    |
| aya                | OK     | 21.68    |
| mistral            | OK     | 21.76    |
| mistral-openorca   | OK     | 22.20    |
| gemma2             | OK     | 23.14    |
| phi3:medium-128k   | OK     | 45.87    |
| phi3:14b           | OK     | 47.36    |
| aya:35b            | OK     | 77.99    |
| llama3:70b         | OK     | 144.62   |
| qwen2:72b          | OK     | 148.25   |
| command-r-plus     | OK     | 239.20   |
| qwen2              | OKKO   | 16.11    |

I've set qwen2 to OKKO as it systematically labels "Hedge funds cut stakes in Magnificent Seven to invest in broader AI boom" as very bullish; I didn't discard the model entirely since this one is open to interpretation...

Unsurprisingly, llama3 leads the pack, closely followed by the small phi3. While building up this test, I've also found that Cohere's aya is a really nice pick for data extraction!
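
The Time column is simply how long the whole test file took to run with each model; if you want to do a quick per-model comparison of your own without wiring up pytest, here is a simplified sketch (not the exact harness behind the table above) =>

import time

from llm_lib import classify_sentiment

HEADLINES = [
    "Tesla leads 11% annual drop in EV prices as demand slowdown continues",
    "OpenAI co-founder Ilya Sutskever says he will leave the startup",
]

def time_model(model_name: str) -> float:
    # classify every headline with the given model and return the total wall-clock time
    start = time.perf_counter()
    for headline in HEADLINES:
        classify_sentiment(headline, model_name=model_name)
    return time.perf_counter() - start

for model in ["llama3", "phi3", "mistral"]:
    print(f"{model}: {time_model(model):.2f}s")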

Lastly, I've tried a few larger models; I intend to use those for workloads where more intelligence is required. The future is bright for us developers: with such brilliant models available, we can build all kinds of agentic workflows. It's a great time to be alive ✨ so keep building! ✨


EDIT: I like testing prompts, so let's start a package => https://pypi.org/project/yuseful-prompts/

Feel free to contribute!
