Leonard Püttmann for Kern AI

How we used AI to automate stock sentiment classification

This article is meant to accompany this video: https://www.youtube.com/watch?v=yeML0vX0yLw

In this article, we would like to provide you with a step-by-step tutorial in which we build a Slack bot that sends us a daily message about the sentiment of the news around our stocks. To do this, we need a tool that can:

  • automatically fetch data from news sources
  • call our ML model via an API to get predictions
  • send out our enriched data to Slack

We will build the web scraper in Kern AI workflow, label our news articles in refinery, and then enrich the data with gates. After that, we will use workflow again to send out the predictions and the enriched data via a webhook to Slack. If you'd like to follow along or explore these tools on your own, you can join our waitlist here: https://www.kern.ai/

Let's dive into the project!

Scraping data in workflow

To get started, we first need to get data. Sure, there are some publicly available datasets for stock news. But we are interested in building a sentiment classifier for specific companies only and, ideally, we want news articles that are not too old and therefore irrelevant.

We start our project in workflow. Here we can add a Python node, with which we can execute custom Python code. In our case, we use it to scrape some news articles.


There are many ways to access news articles. We decided to use the Bing News API because it offers up to 1000 free searches per month and is fairly reliable. But of course, you can do this part however you like!

To do this, we use a Python yield node, which takes in one input (the scraping results) but can return multiple outputs (in this case, one record per found article):

def node(record: dict):
    from bs4 import BeautifulSoup
    import requests
    import time
    from datetime import datetime
    from uuid import uuid4

    search_term = "AAPL" # You can make this a list and iterate over it so search multiple companies! 

    subscription_key = "<YOUR_AZURE_COGNITIVE_KEY>"
    search_url = "https://api.bing.microsoft.com/v7.0/news/search"

    headers = {"Ocp-Apim-Subscription-Key" : subscription_key}
    params  = {"q": search_term, "textDecorations": True, "textFormat": "HTML", "mkt": "en-US"}

    response = requests.get(search_url, headers=headers, params=params)
    response.raise_for_status()
    search_results = response.json()

    headers = ["name", "description", "provider", "datePublished", "url"]

    record = {}
    for i in headers:
        if i == "provider": 
            providers = [article[i][0] for article in search_results["value"]]
            names = []
            for index in range(len(providers)):
                for key in providers[index]:
                    if key == "name":
                        names.append(providers[index][key])

            record[i] = names
        else: 
            part_of_response = [article[i] for article in search_results["value"]]
            record[i] = part_of_response

    record["topic"] = [search_term] * len(record["name"])

    # Scrape the collected URLs
    texts = []
    for url in record["url"]:
        try:
            req = requests.get(url)
            text = req.content 
            soup = BeautifulSoup(text, 'html.parser')
            results = soup.find_all('p')
            scraped_text = [tag.get_text() for tag in results]
            scraped_text_joined = " ".join(scraped_text)
            texts.append(scraped_text_joined)
            time.sleep(0.5)
        except Exception:
            texts.append("Text not available.")
    record["text"] = texts

    # Yield one output record per article (the record values are parallel lists)
    for item in range(len(record["name"])):
        yield {
            "id": str(uuid4()),
            "name": record["name"][item],
            "description": record["description"][item],
            "provider": record["provider"][item],
            "datePublished": record["datePublished"][item],
            "url": record["url"][item],
            "topic": record["topic"][item],
            "text": record["text"][item],
        }
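As the comment in the node above suggests, you could also track several companies at once. Below is a minimal sketch of how that could look, with the full-text scraping omitted for brevity; the ticker list is just an example and the key placeholder needs to be replaced with your own:

def node(record: dict):
    import requests
    from uuid import uuid4

    search_terms = ["AAPL", "MSFT", "TSLA"]  # example tickers, adjust to your watchlist
    search_url = "https://api.bing.microsoft.com/v7.0/news/search"
    request_headers = {"Ocp-Apim-Subscription-Key": "<YOUR_AZURE_COGNITIVE_KEY>"}

    for search_term in search_terms:
        params = {"q": search_term, "textDecorations": True, "textFormat": "HTML", "mkt": "en-US"}
        response = requests.get(search_url, headers=request_headers, params=params)
        response.raise_for_status()
        search_results = response.json()

        # Yield one record per article, tagging each with the ticker it belongs to
        for article in search_results["value"]:
            yield {
                "id": str(uuid4()),
                "name": article["name"],
                "description": article["description"],
                "provider": article["provider"][0]["name"],
                "datePublished": article["datePublished"],
                "url": article["url"],
                "topic": search_term,
            }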

After that, we can store our data in a shared store. There are two store nodes: a "Shared Store send" node to send data into a store, and a "Shared Store read" node from which you can access stored data and feed it into other nodes.

We can create a Shared Store in the store section of Workflow. In the store section, you'll also find many other cool stores, such as spreadsheets or LLMs from OpenAI!


Simply click on "add store" and give it a fitting name. Afterward, you'll be able to add the created store in a node in workflow.


Now that we've scraped some data, we can move on to labeling and processing it!

Enriching new incoming data with gates


Once we've run our web scraper and collected some data, we can sync a shared store with refinery. This will load all of our scraped data into a refinery project. Once we run the scraper again, new records will be loaded into the refinery project automatically.

Refinery is our data-centric IDE for text data and we can use it to label and process our articles very quickly and easily.


For example, we can create heuristics or something called an active learner to speed up and semi-automate the labeling process. Click here for a quickstart tutorial on how to label and process data with refinery.
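To give a rough idea, a simple keyword heuristic for this project could look like the sketch below. This assumes a refinery labeling function receives the record and returns one of the label names used later in this article (rather positive, neutral, rather negative), or nothing to abstain; treat the exact signature and attribute access as illustrative:

def keyword_sentiment(record):
    # Hypothetical keyword heuristic: label an article based on simple word lists.
    positive_words = ["beats expectations", "record high", "upgrade", "rally"]
    negative_words = ["misses expectations", "lawsuit", "downgrade", "sell-off"]

    text = str(record["text"]).lower()  # assumes the scraped article text attribute
    if any(word in text for word in positive_words):
        return "rather positive"
    if any(word in text for word in negative_words):
        return "rather negative"
    # Returning nothing means this heuristic abstains on the record.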

Once the results of the project are satisfactory, all heuristics and ML models of a refinery project can be accessed via an API through our second new tool, which is called gates.

Before we can access a refinery project, we have to go to gates first, open our project there, and start our model and/or heuristics in the configuration.


Once we've done so, we will be able to select the model of the running gate in our gates AI node in workflow.


Gates is integrated directly into workflow, so we don't need an API token to do this. Of course, the gates API is also usable outside of workflow, for which we would need an API token. We will cover that in another blog article.

After we've passed the data through gates, we get a dictionary as a response containing all predictions and confidence values for each of our active learners and heuristics. We also get all the input values returned, so if we are only interested in the results, we have to do a little bit of filtering. The Python code below takes in the response from gates and returns only the prediction and the topic. You can use a normal Python node for this.

def node(record: dict):
    return {
        "id": record["id"],
        "prediction": record["prediction"]["results"]["sentiment"]["prediction"],
        "stock": record["topic"],
    }
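For reference, this is roughly the shape of response the snippet above assumes, reconstructed from the keys it accesses; the field names beyond those are illustrative and the exact gates schema may differ:

# Assumed shape of a gates response, inferred from the filtering code above.
example_response = {
    "id": "1b2c3d-example",
    "topic": "AAPL",
    # ...all other input fields are echoed back as well...
    "prediction": {
        "results": {
            "sentiment": {
                "prediction": "rather positive",
                "confidence": 0.87,  # illustrative value
            }
        }
    },
}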

Afterward, we store the filtered and enriched results in a separate store!

Aggregate the sentiments

Now we have all of our news articles enriched with sentiment predictions. The only thing that's left is to aggregate the predictions and send them out. You could do this via email, or you could send the results to a Google Sheet. In our example, we are going to use a webhook to send the aggregated results to a dedicated Slack channel.


In our example we simply count the number of positive, neutral and negative articles, but you could also send out the confidence values or text snippets of the articles. To do this, we use a Python aggregate node, which takes in multiple records but sends out only one output. Here's the code for this node:

def node(records: list[dict]):
    from datetime import datetime
    from uuid import uuid4

    positive_count = 0
    neutral_count = 0
    negative_count = 0

    for item in records:
        if item["prediction"] == "rather positive":
            positive_count += 1
        elif item["prediction"] == "neutral":
            neutral_count += 1
        elif item["prediction"] == "rather negative":
            negative_count += 1

    return {
        "id": str(datetime.today()),
        "text": f"Beep boop. This is the daily stock sentiment bot. There were {positive_count} positive, {neutral_count} neutral and {negative_count} news about Apple today!"
    }

We then create a webhook store, add the webhook URL of our Slack channel, and add the node to our workflow. Afterward, we can run the workflow and it should send a Slack message to us!
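In case you ever want to send the message yourself instead of going through the webhook store, a Slack incoming webhook simply expects a JSON payload with a text field. A minimal sketch, where the webhook URL is a placeholder for your channel's incoming webhook:

import requests

# Minimal sketch: posting the aggregated message to a Slack incoming webhook.
webhook_url = "https://hooks.slack.com/services/<YOUR/WEBHOOK/PATH>"  # placeholder
message = {"text": "Beep boop. This is the daily stock sentiment bot. ..."}

response = requests.post(webhook_url, json=message)
response.raise_for_status()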


This simple use case only scratches the surface of what you can do with the Kern AI platform, and you have a lot of freedom to customize the project and workflow to your needs!

If you have any questions or feedback, feel free to leave it in the comments section down below!
