This is a chapter from the PHP and LLMs book you can buy at
Or get a sample at
https://bit.ly/php_llms_sample
Enjoy!
Big Picture
This chapter covers the basics of working with local Large Language Models (LLMs) in a PHP development environment. We'll go through setting up Ollama on your machine and creating a simple PHP client to interact with its API. The chapter demonstrates how to use both completion and chat modes, and includes a practical example of using LLMs for news summarization. Along the way, we'll cover important development practices like mocking API responses and writing effective tests. By the end of this chapter, you should have a foundational understanding of integrating local LLMs into your PHP applications, setting the stage for more advanced topics in later chapters.
What You Will Learn
- How to set up a local LLM environment using Ollama
- The advantages and considerations of working with local LLMs
- How to create a basic PHP client to communicate with the Ollama API
- How to implement and test both completion and chat functionalities in your client
- The practical application of LLMs for news summarization and article filtering
Setting Up Your Local Computer
We are going to use different LLMs and models, but to start this chapter I will show you how to set up Ollama on your machine.
Keep in mind that an M1 or M2 Mac will not be great speed-wise, and you can overwhelm these systems with too many requests, but we'll mock a lot of the requests we make, so this will not be an issue overall. Later in the book, I will cover how to use Laravel queues to help manage that, and how our Mock Driver can mock results so you can work in the UI without being impacted by this.
Ollama
This is the easy part! They have a nice download page: http://ollama.com/. Once installed, you'll have an icon in your toolbar (or taskbar on Windows), and you can basically "Restart" it when needed since they ship a ton of updates.
After you've installed Ollama, we need to pull down two models to start this process.
First, let's use a small model called phi3 for results so we do not tax your machine too much. Some great larger models are llama3.1 and mistral, but we'll stick to the smaller phi3 since it will take up less memory. Explore https://ollama.com/library for more models.
ollama pull phi3
Here are their docs https://github.com/ollama/ollama/blob/main/docs/README.md for more details.
OK, so now you have a chat model. For a simple chat at the command line, you can just open a terminal and run:
ollama run phi3:latest
Pretty cool! But let's do more. Because you are running Ollama on your machine, it is accessible via a port, so you can make HTTP requests using the API: https://github.com/ollama/ollama/blob/main/docs/api.md
curl http://localhost:11434/api/generate -d '{
"model": "phi3:latest",
"prompt": "What is PHP?"
}'
Or without streaming:
curl http://localhost:11434/api/generate -d '{
"model": "phi3:latest",
"prompt": "What is PHP?",
"stream": false //added
}'
I am going to default to non-streaming. I will show how to stream, but I will mostly stick to non-streaming because much of what I am building is not for chat, and, when it is for chat, I want the quality of the results to be multi-shot. Here is an explanation from Google Gemini:
In the pursuit of accurate and reliable results, our approach often employs a "multi-shot" strategy. Think of it as refining your initial request through multiple interactions with the LLM, much like crafting a masterpiece through successive drafts.
Each interaction serves as a "shot," where the LLM processes your prompt and delivers a response. We then evaluate this response and, if necessary, provide additional guidance or context in subsequent prompts, shaping the output toward the desired outcome. This iterative process allows us to fine-tune the results, ensuring they align with your intent and meet the highest standards of accuracy.
While this multi-shot approach may take slightly longer than a single prompt, the emphasis is on delivering the most precise and useful information, even if it means sacrificing a bit of speed.
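To make that concrete, here is a minimal sketch of a multi-shot loop in PHP. It assumes the App\Services\Ollama\Client class we build later in this chapter, and the refinement prompt wording is just an illustration:
<?php
// A minimal multi-shot sketch: ask, then feed the answer back for refinement.
// Assumes the App\Services\Ollama\Client class built later in this chapter.
use App\Services\Ollama\Client;
$client = new Client;
// First "shot": get a draft answer.
$draft = $client->completion('Summarize what PHP is in two sentences.')
    ->json('response');
// Second "shot": hand the draft back and ask for a tighter version.
$refined = $client->completion(
    "Here is a draft answer:\n{$draft}\n\nRewrite it so it is accurate, concise, and free of filler."
)->json('response');
echo $refined;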
OK, back to the fun part! Let's make our first "Client" to talk to this API and return results. From there, we will build up to "abstract" it as a driver we can use on any LLM service.
HTTP Client with Tests
NOTE: There are libraries for these, but I want you to see how easy it is, using HTTP, to talk to these APIs. This pays off later when we do things like pool requests or talk to APIs that have no library yet.
We will start with some simple code on this:
//app/Services/Ollama/Client.php
<?php
namespace App\Services\Ollama;
use Illuminate\Support\Facades\Http;
class Client
{
public function completion(string $prompt) {
return Http::post('http://localhost:11434/api/generate', [
'model' => 'phi3:latest',
'prompt' => $prompt,
"options" => [
"temperature" => 0.1,
],
'stream' => false,
]);
}
}
A note on temperature and the elephant in the room: hallucinations. We can prevent hallucinations almost 100%, and I am confident of that from past experience. Preventing hallucinations, or what I call drifting, depends on a few things like temperature, context building, and prompting skills, all of which I will go over in this book. So let us put that concern aside for a bit; I promise that by the end of this book you will see it as more of a problem with the prompt or the context than with the LLM.
So here (and by default) I set the temperature to 0.1, or 0 in many cases, lowering the amount of "creativity" the LLM will reply with.
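If you want to play with that setting, one option (purely a sketch, not the final shape of the client) is to make the temperature an optional argument with a low default:
<?php
// Sketch only: the same completion call, but with temperature as an argument
// so you can compare a near-deterministic 0 run against a more "creative" one.
public function completion(string $prompt, float $temperature = 0.1)
{
    return Http::post('http://localhost:11434/api/generate', [
        'model' => 'phi3:latest',
        'prompt' => $prompt,
        'options' => [
            'temperature' => $temperature,
        ],
        'stream' => false,
    ]);
}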
Let's make a test. Below, you will see that the first run is there for us to get a fixture from the actual response; then we will update the test to use the fixture with our HTTP mock.
<?php
test('should return response', function () {
$client = new \App\Services\Ollama\Client;
$response = $client->completion('What is PHP?');
put_fixture('simple_ollama_client_results.json', $response->json());
expect($response)->not->toBeNull();
});
OK, I hate when people show a code example and introduce some custom thing that leaves me asking "why?". But I just use this all the time and add it to my app/helpers.php:
// app/helpers.php
<?php
use Illuminate\Support\Facades\File;
if (! function_exists('put_fixture')) {
function put_fixture($file_name, $content = [], $json = true)
{
if (! File::exists(base_path('tests/fixtures'))) {
File::makeDirectory(base_path('tests/fixtures'));
}
if ($json) {
$content = json_encode($content, JSON_PRETTY_PRINT);
}
File::put(
base_path(sprintf('tests/fixtures/%s', $file_name)),
$content
);
return true;
}
}
And then register it in composer.json:
"autoload": {
"psr-4": {
"App\\": "app/",
"Database\\Factories\\": "database/factories/",
"Database\\Seeders\\": "database/seeders/"
},
"files": [
"app/helpers.php"
]
},
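After editing composer.json, remember to rebuild the autoloader so the helpers file gets loaded:
composer dump-autoload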
I know there is the VCR library as well, but I just wanted something simpler in my day-to-day workflow.
With our initial run, we got a file written to tests/fixtures/simple_ollama_client_results.json. And, in that file, we have the payload that the API returned.
{
"model": "phi3:latest",
"created_at": "2024-08-22T15:19:44.208546Z",
"response": "PHP, which stands for \"PHP: Hypertext Preprocessor,\" is a widely-used open source server-side scripting language. It was originally designed in 1994 by Rasmus Lerdorf to manage the creation and maintenance of websites...",
"done": true,
"done_reason": "stop",
"context": [
//lots of numbers
],
"total_duration": 6179024209,
"load_duration": 7809625,
"prompt_eval_count": 14,
"prompt_eval_duration": 760911000,
"eval_count": 215,
"eval_duration": 5409058000
}
There are some details in the above response we can talk about later, but for now we have the context and the done_reason. All these LLMs will have different stop reasons. As we work on abstracting them out, we will define a consistent version that all the results will conform to; for example, when we get to Tools, there will be a different stop reason for that.
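As a small preview of that abstraction (just a sketch of one possible shape, not the version we end up with later), a PHP enum can normalize the provider-specific values:
<?php
// Sketch only: one possible way to normalize provider-specific stop reasons.
// The case names and the mapping below are placeholders for illustration.
enum StopReason: string
{
    case Stop = 'stop';          // normal end of generation
    case Length = 'length';      // ran out of tokens
    case ToolCall = 'tool_call'; // the model wants to call a tool
    public static function fromOllama(string $doneReason): self
    {
        return match ($doneReason) {
            'length' => self::Length,
            'tool_call', 'tool_calls' => self::ToolCall,
            default => self::Stop,
        };
    }
}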
Now, let's update the test, moving the fixture we just generated in place.
<?php
test('should return response', function () {
$data = get_fixture('simple_ollama_client_results.json'); //added
\Illuminate\Support\Facades\Http::fake([
'localhost:11434/*' => $data
]); //added
\Illuminate\Support\Facades\Http::preventStrayRequests();//added
$client = new \App\Services\Ollama\Client;
$response = $client->completion('What is PHP?');
//put_fixture('simple_ollama_client_results.json', $response->json()); removed
expect($response)->not->toBeNull();
});
The get_fixture method is another helper I use to get the fixture from the tests/fixtures folder:
// app/helpers.php
<?php
// ...
if (! function_exists('get_fixture')) {
function get_fixture($file_name, $decode = true)
{
$results = File::get(base_path(sprintf(
'tests/fixtures/%s',
$file_name
)));
if (! $decode) {
return $results;
}
return json_decode($results, true);
}
}
So now, using the generated fixture, we can fake the interaction. This is a good start, and we will flesh it out more in the "Agnostic Driver" chapter.
Using the Chat Method
You can read the Ollama docs here about how chat works. Those docs will help us in this section since we want to have a threaded discussion, not just a prompt completion as seen in the previous section.
We have our first PHP class that can ask the LLM a question. Let's make a new method for "chat". This allows us to add context to the question(s). Keep in mind that for many one-off integrations, the completion we did above is all you need. But you will see, as we get into tools, it becomes more of a thread.
A thread in this case would look like this, simplified:
[
{
"role": "user",
"content": "Find me all the events on this page and create a list of events and save them to the system"
},
{
"role": "assistant",
"content": "<thinking>I will use the event_tool and pass in the data I found on the page.</thinking>"
},
{
"role": "tool",
"content": "",
"tool": {
"name": "event_create_tool",
"parameters": //all the events formmated
}
},
{
"role": "assistant",
"content": "I created the following events in your system..."
}
]
The above is a good example of what comes out of a request to the LLM. In this case, the LLM was asked a question by you, the user role, and then it started to build a chain of responses based on it needing to use tools to answer the question. We will show real results like this in later chapters.
In the next section, we update the client to use chat in addition to completion, then do our first use case.
Update Client for Chat
NOTE: the tests are simple, but you'll see how we build on the code with confidence and the tests over time.
//app/Services/Ollama/Client.php
<?php
public function chat(array $messages) {
return Http::post('http://localhost:11434/api/chat', [
'model' => 'phi3:latest',
'messages' => $messages,
'stream' => false,
"options" => [
"temperature" => 0.1,
]
]);
}
And the test:
<?php
test('should return chat response', function () {
$client = new \App\Services\Ollama\Client;
$response = $client->chat([
[
'role' => 'user',
'content' => 'What is PHP?',
]
]);
put_fixture('simple_ollama_client_chat_results.json', $response->json());
expect($response)->not->toBeNull();
});
Then we take our fixture and use it. Here is what it looks like:
{
"model": "phi3:latest",
"created_at": "2024-08-22T17:11:40.160747Z",
"message": {
"role": "assistant",
"content": "PHP, which stands for Hypertext Preprocessor, is a widely-used open source server-side scripting language. It was originally created by Rasmus Lerdorf in 1994 and...",
},
"done_reason": "stop",
"done": true,
"total_duration": 16670826417,
"load_duration": 5122545583,
"prompt_eval_count": 14,
"prompt_eval_duration": 84989000,
"eval_count": 412,
"eval_duration": 11462114000
}
Here we use it for the test:
<?php
test('should return chat response', function () {
$data = get_fixture('simple_ollama_client_chat_results.json'); // added
\Illuminate\Support\Facades\Http::fake([ // added
'localhost:11434/*' => $data, // added
]); // added
\Illuminate\Support\Facades\Http::preventStrayRequests(); //added
$client = new \App\Services\Ollama\Client;
$response = $client->chat([
[
'role' => 'user',
'content' => 'What is PHP?',
]
]);
//put_fixture('simple_ollama_client_chat_results.json', $response->json()); removed
expect($response)->not->toBeNull();
});
So, just like before, we send in the results and prove it works. Again, this is going to change quite a bit in the Abstraction chapter.
Real Use Case: News Summary
There is a feature in LaraLlama.io that allows you to search the web. I use that to search daily for news and then have it send me a summary of the news for that day or week.
There are three things I can use the LLM for.
- Parse the HTML to get the content of the article
- Reject the article if it is not about the subject I want, e.g., a bad search result.
- Then summarize all the articles that came in that day.
We will talk about this more later on in the Prompt chapter but this is all about Prompt and Context. We are giving the LLM Context or Content to parse with the prompt.
<role>
You are extracting the content from the provided HTML or text that is related to technology news.
<task>
First see if the article provided talks about technology news and if not just return bool false. Else pull out the content of the article as a Title, Summary, URL, and Content formatted as below.
<format>
Title:
Url:
Summary:
Content:
Steps 1 and 2: Filter non-related data, then extract data
In the Prompting chapter, I'll show you how to test your prompt but for now we will go with the prompt above.
Then we use our client in our new NewsFeedParser class:
//app/Domains/NewsFeedParser.php
<?php
namespace App\Domains;
use Facades\App\Services\Ollama\Client;
class NewsFeedParser
{
public function handle(string $context) : bool|string
{
$prompt = <<<PROMPT
<role>
You are extracting the content from the provided HTML or text that is related to technology news.
<task>
First see if the article provided talks about technology news and if not just return bool false. Else pull out the content of the article as a Title, Summary, URL, and Content formatted as below.
<format>
Title:
Url:
Summary:
Content:
<content>
$context
PROMPT;
/**
* @NOTE
* This is going to change quite a bit
* for now keeping it simple
*/
$results = Client::completion($prompt);
$results = data_get($results, 'response', false);
if ($results == 'false') {
return false;
}
return $results;
}
}
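To see how this might be wired up, here is a hypothetical usage sketch - the URL is just an example, and in LaraLlama.io the HTML actually comes from a web search:
<?php
// Hypothetical usage sketch: fetch a page and run it through the parser.
use App\Domains\NewsFeedParser;
use Illuminate\Support\Facades\Http;
$html = Http::get('https://example.com/some-article')->body();
$article = (new NewsFeedParser())->handle($html);
if ($article === false) {
    // Not technology news, so we skip it.
    return;
}
// Otherwise $article holds the Title / Url / Summary / Content text block.
echo $article;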
NOTE: I use Laravel's Real Time Facades - watch how this pays off in the next test.
The first test to see it work will be for false responses, where the LLM weeds out search results that are not about the technology news we want. Below you'll see put_fixture('news_feed_false.json', $results); where I get the results and save them for the mocked test after this.
it('can return false', function () {
$context = 'This is about how to make a hamburger';
$results = (new \App\Domains\NewsFeedParser())->handle($context);
put_fixture('news_feed_false.json', $results);
expect($results)->toBeFalse();
});
Then our mocked test to see it work:
it('can return false', function () {
$context = 'This is about how to make a hamburger';
$data = get_fixture('news_feed_false.json'); //added
Facades\App\Services\Ollama\Client::shouldReceive('completion') //added
->once() //added
->andReturn($data); //added
$results = (new \App\Domains\NewsFeedParser())->handle($context);
//put_fixture('news_feed_false.json', $results); removed
expect($results)->toBeFalse();
});
NOTE: See how we mock now at the Client level, not at Http. I typically test one class at a time, even in my Feature tests, and mock the surrounding classes. You can easily remove this, though, and test them together. By the way, we'll pass Data objects around using the Spatie Data object, which will be tested as well.
OK, so we can handle false. Now, let's do a good response example.
First, we make the test to generate the mock data. We will create a Laravel news article and save it to our tests/fixtures folder:
<!--tests/fixtures/news_feed_good.html-->
<html lang="en">
<head>
<title>Laravel News</title>
</head>
<body>
<article>
<h1>Laravel 11 Released</h1>
<p>Laravel 11 has been released, and it's packed with new features and improvements. Here's what's new in this version:</p>
<ul>
<li>New Blade Directives</li>
<li>Improved Form Validation</li>
<li>Improved File Uploads</li>
</ul>
</article>
</body>
</html>
Then, we create a test to generate the mock data:
<?php
it('can return an article summary', function () {
$context = get_fixture('news_feed_good.html', false);
$results = (new \App\Domains\NewsFeedParser())->handle($context);
expect($results)->not->toBeFalse();
});
Another way we can create a fixture is to temporarily update the tested method to generate a fixture:
// app/Domains/NewsFeedParser.php
<?php
public function handle(string $context) : bool|string {
// ...
$results = Client::completion($prompt);
put_fixture('news_feed_good_response.json', $results->json()); //added
$results = data_get($results, 'response', false);
if ($results == 'false') {
return false;
}
return $results;
}
In this case my prompt gave me mixed results. This is a good example of how we could keep testing to get good results from our prompt, but we will save that for a later chapter 🙂
Now that we have the fixture, we move it to our test:
it('can return an article summary', function () {
$data = get_fixture('news_feed_good_response.json'); //added
Facades\App\Services\Ollama\Client::shouldReceive('completion') //added
->once() //added
->andReturn($data); //added
$context = get_fixture('news_feed_good.html', false);
$results = (new \App\Domains\NewsFeedParser())->handle($context);
//put_fixture('news_feed_good_response.json', $results); removed
expect($results)->not->toBeFalse();
});
That is it for our NewsFeedParser - we now have the class that uses the LLM to verify the content is what we want and then extract the content we want out of the HTML.
Step 3: Summarize all the news
Now that we have the news we want (in this example we can assume it was saved in a database for later use), let's do two examples of creating a "News Summary" experience, one with completion and one with chat.
We will cover both, but I asked Ollama what the advantage of one over the other is, and I liked some of the information it offered:
When using LLMs like Ollama, the choice between sending input as chat or completion depends on how you want to provide context and what type of response you expect.
Chat:
In chat mode, you're essentially having a conversation with the model. You send a message, and the model responds. Then, you can send another message building upon the previous one, and so on. This is useful when:
You want to provide context incrementally, like in a conversation.
You need the model to maintain state or recall previous messages.
In your example, sending 5 news articles as separate chat messages would allow the model to process each article individually, taking into account any relationships between them. This might be useful if you want the model to:
Identify common themes or trends across multiple articles
Recognize patterns or connections between different articles
Completion:
In completion mode, you're providing a prompt and expecting a response that completes or generates text based on that input. You can send a single prompt with multiple components (like your 5 news articles) as long as they're concatenated into a single string.
When sending multiple inputs as a single completion prompt, the model will process them as a whole, without maintaining state between individual messages. This is useful when:
You want to provide a fixed amount of context upfront.
The relationships between different components are not crucial for the response.
In your example, sending all 5 news articles and the summary prompt in a single completion request would allow the model to process them as a whole, generating a summary that takes into account all the articles.
Context window:
When using chat, you get more control over the context window, as each message builds upon previous ones. The model maintains state and can recall previous messages, allowing for more nuanced responses.
In completion mode, the context window is fixed to the input prompt itself. While the model can still recognize patterns or relationships within that prompt, it won't maintain state across individual components like it would in a chat conversation.
So, if you want to provide more context and control over how the model processes your inputs, use chat. If you're comfortable providing all relevant information upfront and don't need incremental processing, completion might be the better choice.
Completion Example
In this example, we get all our articles for this summary; it could be 1 or it could be 10. The concern here is that there is too much content and we break the "context window". This limit will increase over time; with Google Gemini, for example, it is a 2-million-token window!
See the Terminology chapter. But the TLDR is this: tokens represent the space your question takes up, and we need to leave room for the reply of the LLM. Combined, this forms your context window.
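A very rough rule of thumb for English text is about four characters per token. Treat the sketch below as a sanity check only; the numbers in it are assumptions, and a real tokenizer will differ:
<?php
// Rough token estimate: ~4 characters per token of English text.
// This is only a sanity check, not a real tokenizer.
function estimate_tokens(string $text): int
{
    return (int) ceil(mb_strlen($text) / 4);
}
$prompt = 'all the articles plus the digest instructions...'; // placeholder
$contextWindow = 8192; // depends entirely on the model you are running
$replyBudget = 1024;   // leave room for the LLM's answer
if (estimate_tokens($prompt) > $contextWindow - $replyBudget) {
    // Too much content: batch the articles or trim the oldest ones.
}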
In this case, we build up our messages thread from a query to the database to generate "context" for the threaded prompt:
//app/Domains/News/NewsDigestCompletion.php
<?php
namespace App\Domains\News;
use App\Models\News;
use Carbon\Carbon;
use Facades\App\Services\Ollama\Client;
class NewsDigestCompletion
{
public function handle(Carbon $start, Carbon $end) : string
{
$messages = News::whereBetween('created_at', [$start, $end])->get()->map(function ($news) {
return sprintf('News: Title: %s Content: %s', $news->title, $news->body);
})->join("\n");
$prompt = <<<PROMPT
<role>
You are my news digest assistant
<task>
Take the news articles from the <content> section below and create a TLDR followed by a title and summary of each one
If not news is passed in the just say "No News in this
<content>
{$messages}
PROMPT;
$results = Client::completion($prompt);
return data_get($results, 'response', 'No Results');
}
}
So, we get all the news, mash it into a big text blob (the LLM can do some interesting things to understand the text - it does not need to be perfectly formatted), then we add it to our prompt.
From here, we get our results and give them back to the Class making the request.
Chat Example
//app/Domains/News/NewsDigest.php
<?php
namespace App\Domains\News;
use App\Models\News;
use Carbon\Carbon;
use Facades\App\Services\Ollama\Client;
class NewsDigest
{
public function handle(Carbon $start, Carbon $end) : string
{
$messages = News::whereBetween('created_at', [$start, $end])->get()->map(
function ($news) {
return [
'content' => sprintf('News: Title: %s Content: %s', $news->title, $news->body),
'role' => 'user'
];
}
);
$prompt = <<<PROMPT
<role>
You are my news digest assistant
<task>
Take the news articles from this thread and create a TLDR followed by a title and summary of each one
If not news is passed in the just say "No News in this thread"
PROMPT;
$messages = $messages->push([
'role' => 'user',
'content' => $prompt
])->toArray();
$results = Client::chat($messages);
return data_get($results, 'message.content', 'No Results');
}
}
As you can see above, we are sending the array of news articles with our prompt as part of the message thread.
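Following the same pattern as the earlier tests, a test for this class could mock the chat call at the Client level. This is only a sketch: it assumes a News model factory exists and that your test suite already resets the database between tests.
<?php
// Sketch of a test in the style of the earlier ones.
// Assumes a News model factory and a database that resets between tests.
it('can build a news digest', function () {
    \App\Models\News::factory()->count(3)->create();
    Facades\App\Services\Ollama\Client::shouldReceive('chat')
        ->once()
        ->andReturn([
            'message' => [
                'role' => 'assistant',
                'content' => 'TLDR: three articles about Laravel...',
            ],
        ]);
    $results = (new \App\Domains\News\NewsDigest())
        ->handle(now()->subWeek(), now());
    expect($results)->toBeString();
});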
NOTE: We are taking this in small steps. Later, we will get into Tools and how to make a prompt like this below just work:
<role>
You are my assistant helping with news and events
<task>
Get all the news articles from this week, make a digest of them, and email them to info@dailyai.studio
So you can see three "Tools" in that prompt, news_fetch
, news_digest
, and send_email
, that will "just work" by the time we are at the end of this book. The LLM will look at the prompt, the tools you tell it the system has, and know which ones to use and in what order. As for the news feature, this can all be wrapped up into a scheduler to search the web daily and then send off a digest. And once the Tools are part of the natural language of the prompt, a user can easily create these complex workflows.
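As a taste of that, the digest part can already be scheduled today without any Tools. This is only a sketch: it assumes a Laravel 11 style routes/console.php, and the NewsDigestMail mailable is hypothetical.
<?php
// routes/console.php - sketch only.
// Assumes Laravel 11's Schedule facade; App\Mail\NewsDigestMail is hypothetical.
use App\Domains\News\NewsDigest;
use Illuminate\Support\Facades\Mail;
use Illuminate\Support\Facades\Schedule;
Schedule::call(function () {
    $digest = (new NewsDigest())->handle(now()->subDay(), now());
    Mail::to('info@dailyai.studio')->send(new \App\Mail\NewsDigestMail($digest));
})->dailyAt('07:00');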
More on this in the "Tools" chapter!
What You Learned
- The core concepts and benefits of utilizing local LLMs
- How to install and configure Ollama on your local machine
- The basics of interacting with the Ollama API using cURL
- The importance of considering context window limitations
- How to build a simple PHP client to communicate with Ollama
- How to mock API responses for efficient testing
- A practical example of using LLMs for news summarization
- The strategic use of both completion and chat modes for different LLM interactions
- A preview of the exciting possibilities that await as we delve deeper into LLMs