Adi Polak

Simple Data Ingestion tutorial with Yahoo Finance API and Python

Clone the code ( and give us a star ) ➜➜ Tutorial Github Repo

Data is everywhere. Many companies rely on data to make all kinds of decisions: forecasting the market, planning for the future, understanding customers' needs, retargeting, and more. Whichever data architecture we look at, there is almost always a data ingestion part.

Let's understand what it is and build one!

One of the options for ingesting data into a system is 'pulling' it. Pulling is often used when we want to enrich existing data, and we do that by pulling data at predetermined, known times.

What does pulling data mean?

Pulling data means requesting data from a resource at a scheduled time or when triggered.
For a time-scheduled example, we can decide to query Twitter every 10 seconds. For a trigger example, think of another process in our system that calls our pull process and wakes it up with a request to pull new or updated data. The term can be seen as somewhat philosophical. The main idea is that there is no always-online server that waits for requests.
Instead, the service generates requests and pulls the data it needs. In our case, we call an API to pull financial information.
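To make the time-scheduled flavor concrete, here is a minimal sketch of a pull loop. It is not part of the tutorial code: `pull_on_schedule` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for a real API call.

```python
import time


def pull_on_schedule(fetch, interval_seconds, rounds):
    """Call fetch() `rounds` times, sleeping `interval_seconds` between calls."""
    results = []
    for i in range(rounds):
        results.append(fetch())           # we initiate the request; nobody pushes to us
        if i < rounds - 1:
            time.sleep(interval_seconds)  # wait for the next scheduled pull
    return results


def fake_fetch():
    # Stand-in for a real API call (hypothetical payload, for illustration only).
    return {"symbol": "MSFT", "pulled_at": time.time()}
```

Calling `pull_on_schedule(fake_fetch, interval_seconds=10, rounds=3)` would pull three times, ten seconds apart. A trigger-based pull would call `fetch()` from whatever process wakes the service up instead of from a timer.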

What will we use?

Yahoo Finance with 'python':

Today you will learn how to read finance information using the Yahoo Finance (yFinance) API and write it to Event Hubs.
These are the reasons I chose yFinance API:

  • It's free
  • It is considered a gold standard for stock market APIs
  • It provides access to 5 years of daily OHLC (open, high, low, close) stock price history

I use Azure Event Hubs to store the data because it is a highly scalable publish-subscribe service that can ingest millions of events per second and stream them to multiple consumers. It supports both batch and real-time processing.
Event Hubs Capture enables us to capture the data in Azure Data Lake and other storage services available in the cloud.


Time to drill into our tutorial:

Prerequisites:

  1. Python installed
  2. Anaconda
  3. Basic knowledge of python

Want to learn Python? Here is a free online course for you!

Optional ( if you wish to work with Azure):

  1. Azure free account
  2. Azure CLI installed
  3. Event Hubs - if you wish to work with Event Hubs, you need to create one first. This is how.
  4. Event Hubs connection string - this is how to get it.
  5. Key Vault - for storing keys, secrets, and certificates (used here to store the Event Hubs connection string)

Want to learn Azure Cloud Fundamentals? Here is a free online course for you!

Set your environment with conda:

conda create -n yahoofinance python=3.6 anaconda

conda activate yahoofinance

Download and install libraries:

# Install azure-eventhub:
pip install azure-eventhub

# Install the Key Vault and identity libraries used in the code below:
pip install azure-keyvault-secrets azure-identity

# Install yfinance:
pip install yfinance

YES! You are done setting up the environment.
As for a code editor, I like working with VS Code, but you can use your favorite tool.


Are you ready to write code?

The code itself is short, but it holds many concepts that are important to know and understand, so follow along with attention!

Here is the complete code. I put it at the beginning so it is easier for you to copy it to your text editor and change it while you follow the tutorial.

import asyncio
import json

from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
import yfinance as yf

from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential
from datetime import datetime


async def run(stocksCodesList):
    print('Start Batch for stocks: \n',stocksCodesList)
    # Get connection string from vault

    credential = DefaultAzureCredential()
    keyVaultName = "{your key vault name}"
    keyName = "{your key name}"
    KVUri = "https://" + keyVaultName + ".vault.azure.net"

    # Retrieve the Event Hubs connection string stored as a secret
    client = SecretClient(vault_url=KVUri, credential=credential)
    conn_str_value = client.get_secret(keyName).value

    # Create a producer client to send messages to the event hub.
    producer = EventHubProducerClient.from_connection_string(conn_str_value)

    async with producer:
        # Create a batch.
        event_data_batch =  await producer.create_batch()
        for stockCode in stocksCodesList:
            #Get stock info
            stockInfo = yf.Ticker(stockCode).info
            # Add events to the batch, serialized as JSON (EventData takes str or bytes).
            event_data_batch.add(EventData(json.dumps(stockInfo, default=str)))

        # Send the batch of events to the event hub.
        await producer.send_batch(event_data_batch)
        printSentMessage()


def printSentMessage():
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print('Batch sent - Current Time =', current_time)

loop = asyncio.get_event_loop()
loop.run_until_complete(run(["MSFT"]))
loop.close()

Now, let's break it into parts and demystify them:

import asyncio

async def run(stocksCodesList):
    # Get stock information and write it to Event Hubs
    pass  # placeholder; the real body is shown in the next sections

loop = asyncio.get_event_loop()
loop.run_until_complete(run(["MSFT"]))
loop.close()


This is asynchronous Python code. For that, we use the 'asyncio' library:
loop = asyncio.get_event_loop() - creates an event loop or takes the existing one.
loop.run_until_complete(run(["MSFT"])) - gives the event loop the run function with the list of stocks we would like to query. This wraps the coroutine in a Future object and waits until it completes.
loop.close() - closes the loop.
async def run(stocksCodesList): - async and await are two Python keywords that are used to define and work with coroutines (more on that soon).

To learn more about the event loop, read here.
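To see the event loop at work without any Azure or Yahoo dependency, here is a small self-contained sketch. The `pull` coroutine, the second symbol "AAPL", and the delays are made up for illustration, and `asyncio.sleep` merely stands in for a slow network call. The sketch uses `asyncio.new_event_loop()` so that it also runs when no loop exists yet.

```python
import asyncio


async def pull(symbol, delay):
    # asyncio.sleep stands in for a slow network call (assumption for illustration).
    await asyncio.sleep(delay)   # suspend here; the event loop may run other coroutines
    return symbol + " done"


async def main():
    # Run both coroutines concurrently; results come back in argument order.
    return await asyncio.gather(pull("MSFT", 0.02), pull("AAPL", 0.01))


loop = asyncio.new_event_loop()            # create a fresh event loop
results = loop.run_until_complete(main())  # block until main() finishes
loop.close()                               # release the loop's resources
print(results)  # ['MSFT done', 'AAPL done']
```

Although the "AAPL" coroutine finishes first, `asyncio.gather` returns the results in the order the coroutines were passed in.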

Here is how we call yFinance API to get stock information:

import asyncio
import yfinance as yf

async def run(stocksCodesList):
    for stockCode in stocksCodesList:
        # Get stock info
        stockInfo = yf.Ticker(stockCode).info
        print(stockInfo)

loop = asyncio.get_event_loop()
loop.run_until_complete(run(["MSFT"]))
loop.close()


This code iterates over the list of stock code strings we received and pulls the information for each one with:
yf.Ticker(stockCode).info - this returns a lot of information that you can use to build a beautiful UI or market prediction algorithms.

We queried Yahoo Finance with MSFT, the Microsoft stock symbol, and here is the response, printed as a Python dictionary. Slide to the right to see it all; it is LONG!

{'zip': '98052', 'sector': 'Technology', 'fullTimeEmployees': 144000,
 'longBusinessSummary': "Microsoft Corporation develops, licenses, and supports software, services, devices, and solutions worldwide. The company's Productivity and Business Processes segment offers Office, Exchange, SharePoint, Microsoft Teams, Office 365 Security and Compliance, and Skype for Business, as well as related Client Access Licenses (CAL); and Skype, Outlook.com, and OneDrive. It also provides LinkedIn that includes Talent and marketing solutions, and subscriptions; and Dynamics 365, a set of cloud-based and on-premises business solutions for small and medium businesses, large organizations, and divisions of enterprises. The company's Intelligent Cloud segment licenses SQL and Windows Servers, Visual Studio, System Center, and related CALs; GitHub that provides a collaboration platform and code hosting service for developers; and Azure, a cloud platform. It also provides support services and Microsoft consulting services to assist customers in developing, deploying, and managing Microsoft server and desktop solutions; and training and certification to developers and IT professionals on various Microsoft products. The company's More Personal Computing segment offers Windows OEM licensing and other non-volume licensing of the Windows operating system; Windows Commercial comprising volume licensing of the Windows operating system, Windows cloud services, and other Windows commercial offerings; patent licensing; Windows Internet of Things; and MSN advertising. It also provides Microsoft Surface, PC accessories, and other intelligent devices; Gaming, including Xbox hardware, and Xbox software and services; video games and third-party video game royalties; and Search, including Bing and Microsoft advertising. The company sells its products through distributors and resellers; and directly through digital marketplaces, online stores, and retail stores. It has strategic partnerships with Humana Inc. and Nokia. 
The company was founded in 1975 and is headquartered in Redmond, Washington.", 
'city': 'Redmond', 'phone': '425-882-8080', 'state': 'WA', 'country': 'United States', 'companyOfficers': [], 'website': 
'http://www.microsoft.com', 'maxAge': 1, 'address1': 'One Microsoft Way',
 'fax': '425-706-7329', 'industry': 'Software—Infrastructure', 'previousClose': 149.7, 'regularMarketOpen': 152.44, 
'twoHundredDayAverage': 153.86584, 'trailingAnnualDividendYield': 0.012959252000000001, 'payoutRatio': 0.32930002, 
'volume24Hr': None, 'regularMarketDayHigh': 158.25,
 'navPrice': None, 'averageDailyVolume10Day': 73927350,
 'totalAssets': None, 'regularMarketPreviousClose': 149.7, 'fiftyDayAverage': 162.70589, 'trailingAnnualDividendRate': 1.94,
 'open': 152.44, 'toCurrency': None, 'averageVolume10days': 73927350, 'expireDate': None, 'yield': None, 'algorithm': None, 
'dividendRate': 2.04, 'exDividendDate': 1589932800, 
'beta': 1.091844, 'circulatingSupply': None, 'startDate': None, 'regularMarketDayLow': 150.3, 'priceHint': 2, 'currency': 'USD', 'trailingPE': 27.509142, 'regularMarketVolume': 19627913,
 'lastMarket': None, 'maxSupply': None, 'openInterest': None, 'marketCap': 1201223368704, 'volumeAllCurrencies': None,
 'strikePrice': None, 'averageVolume': 47615782, 'priceToSalesTrailing12Months': 8.947727, 'dayLow': 150.3,
 'ask': 157.71, 'ytdReturn': None, 'askSize': 1800,
 'volume': 19627913, 'fiftyTwoWeekHigh': 190.7, 
'forwardPE': 25.390675, 'fromCurrency': None,
 'fiveYearAvgDividendYield': 2, 'fiftyTwoWeekLow': 118.1, 'bid': 157.69, 'tradeable': True, 'dividendYield': 0.013099999000000001,
 'bidSize': 1100, 'dayHigh': 158.25, 'exchange': 'NMS',
 'shortName': 'Microsoft Corporation', 'longName': 'Microsoft Corporation', 'exchangeTimezoneName': 'America/New_York', 'exchangeTimezoneShortName': 'EDT',
 'isEsgPopulated': False, 'gmtOffSetMilliseconds': '-14400000', 'quoteType': 'EQUITY', 'symbol': 'MSFT',
 'messageBoardId': 'finmb_21835', 'market': 'us_market',
 'annualHoldingsTurnover': None, 'enterpriseToRevenue': 8.131, 'beta3Year': None, 'profitMargins': 0.33016,
 'enterpriseToEbitda': 17.817, '52WeekChange': 0.25777185, 'morningStarRiskRating': None,
 'forwardEps': 6.22, 'revenueQuarterlyGrowth': None,
 'sharesOutstanding': 7606049792, 'fundInceptionDate': None, 'annualReportExpenseRatio': None, 'bookValue': 14.467, 'sharesShort': 55155176, 'sharesPercentSharesOut': 0.0073, 'fundFamily': None, 'lastFiscalYearEnd': 1561852800,
 'heldPercentInstitutions': 0.74407, 'netIncomeToCommon': 44323000320, 'trailingEps': 5.741, 'lastDividendValue': None,
 'SandP52WeekChange': -0.11360252, 'priceToBook': 10.916568, 'heldPercentInsiders': 0.01421, 'nextFiscalYearEnd': 1625011200, 'mostRecentQuarter': 1577750400,
 'shortRatio': 0.91, 'sharesShortPreviousMonthDate': 1581638400, 'floatShares': 7495074784, 'enterpriseValue': 1091541204992,
 'threeYearAverageReturn': None, 'lastSplitDate': 1045526400, 'lastSplitFactor': '2:1', 'legalType': None,
 'morningStarOverallRating': None, 'earningsQuarterlyGrowth': 0.383, 'dateShortInterest': 1584057600, 
'pegRatio': 1.88, 'lastCapGain': None,
 'shortPercentOfFloat': 0.0073, 'sharesShortPriorMonth': 56193866, 'category': None, 'fiveYearAverageReturn': None, 
'regularMarketPrice': 152.44, 'logo_url': 'https://logo.clearbit.com/microsoft.com'}
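Since the response is long, you may want to keep only a handful of fields before sending it anywhere. Here is a minimal sketch, assuming a hypothetical helper named `select_fields`; the sample values are copied from the MSFT response above.

```python
def select_fields(info, wanted):
    """Return only the wanted keys from an info dict; missing keys become None."""
    return {key: info.get(key) for key in wanted}


# A few key/value pairs copied from the MSFT response shown above.
sample_info = {
    "symbol": "MSFT",
    "currency": "USD",
    "regularMarketPrice": 152.44,
    "marketCap": 1201223368704,
    "volume24Hr": None,
}

slim = select_fields(sample_info, ["symbol", "regularMarketPrice", "marketCap"])
print(slim)  # {'symbol': 'MSFT', 'regularMarketPrice': 152.44, 'marketCap': 1201223368704}
```

Using `dict.get` means a field that Yahoo Finance does not return for a given symbol shows up as None instead of raising a KeyError.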

We got the data. Next, save it in Event Hubs 🤩

import asyncio
import json

from azure.eventhub.aio import EventHubProducerClient
from azure.eventhub import EventData
import yfinance as yf


async def run(stocksCodesList):
    print('Start Batch for stocks: \n',stocksCodesList)
    # Connection string 
    conn_str_value = "EVENT HUBS CONNECTION STRING - bad security practice"

    # Create a producer client to send messages to the event hub.
    producer = EventHubProducerClient.from_connection_string(conn_str_value)

    async with producer:
        # Create a batch - notice the await!
        event_data_batch =  await producer.create_batch()
        for stockCode in stocksCodesList:
            #Get stock info
            stockInfo = yf.Ticker(stockCode).info
            # Add events to the batch, serialized as JSON (EventData takes str or bytes).
            event_data_batch.add(EventData(json.dumps(stockInfo, default=str)))

        # Send the batch of events to the Event hubs. - notice the Await!
        await producer.send_batch(event_data_batch)


loop = asyncio.get_event_loop()
loop.run_until_complete(run(["MSFT"]))
loop.close()


Let's break this down. Here we connect to an existing Event Hubs instance and ask it for a producer:
conn_str_value = "EVENT HUBS CONNECTION STRING" - bad security practice!
producer = EventHubProducerClient.from_connection_string(conn_str_value)

NOTICE! Putting connection strings or any other secret in plaintext directly in code is a BAD security practice. The upcoming section demonstrates how to do it with security in mind.

We then call:
async with producer: - this opens the producer as an asynchronous context manager, so it is closed for us when the block ends. Everything inside the block runs within our run coroutine. A coroutine is a subroutine for non-preemptive multitasking. We use coroutines here because sending to Event Hubs returns no data that we need to process, so there is no reason to block the whole program while waiting on the network. Synchronous communication tends to slow down the system.

event_data_batch = await producer.create_batch() - the producer creates a batch, and we use the await keyword. It means that the current coroutine is suspended until the batch is created and only then continues to the next line.
for stockCode in stocksCodesList: - in this loop over the list of stock codes, we pull the information for each stock and add it to the batch.

await producer.send_batch(event_data_batch) - this call is crucial, as it is the one that sends the batch with all the accumulated data to Event Hubs.
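A batch has a size limit, and in the Event Hubs SDK adding an event that no longer fits raises a ValueError. With a long list of stocks you may therefore need more than one batch. The sketch below shows one way to handle that. `InMemoryBatch` and `InMemoryProducer` are hypothetical stand-ins, not SDK classes (they only mimic the create_batch, add, and send_batch calls used above, and count events instead of bytes), so the example runs without an Azure account.

```python
import asyncio


class InMemoryBatch:
    """Stand-in for an Event Hubs batch; holds at most max_events items."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.events = []

    def add(self, event):
        if len(self.events) >= self.max_events:
            raise ValueError("batch is full")  # mimics the SDK's size-limit error
        self.events.append(event)


class InMemoryProducer:
    """Stand-in producer that records every batch it is asked to send."""

    def __init__(self, max_events=2):
        self.max_events = max_events
        self.sent = []

    async def create_batch(self):
        return InMemoryBatch(self.max_events)

    async def send_batch(self, batch):
        self.sent.append(list(batch.events))


async def send_all(producer, events):
    """Send every event, starting a new batch whenever the current one is full."""
    batch = await producer.create_batch()
    for event in events:
        try:
            batch.add(event)
        except ValueError:
            await producer.send_batch(batch)       # flush the full batch
            batch = await producer.create_batch()  # start a fresh one
            batch.add(event)
    if batch.events:
        await producer.send_batch(batch)           # flush the remainder


producer = InMemoryProducer(max_events=2)
loop = asyncio.new_event_loop()
loop.run_until_complete(send_all(producer, ["MSFT", "AAPL", "GOOG"]))
loop.close()
print(producer.sent)  # [['MSFT', 'AAPL'], ['GOOG']]
```

The `events` attribute exists only on the stand-in; with the real producer you would keep your own flag or counter to know whether the last batch still holds unsent events.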

All right! We are almost done.
The last part is all about doing the work while being SECURE.

To do that, we leverage the Key Vault.
Using the Key Vault, we store keys, secrets, and certificates in secure storage that is available to us when we work with Azure.

The first step is to store the connection string in the Key Vault. Here is a quick tutorial with pictures.

After this stage, you have the Key Vault name and the name of the secret that holds your connection string (the code calls it the key name).

The second step is to create a Service Principal.
Service Principal - we use this to give the app an identity (like an identity card) so it can identify itself to the Key Vault. Key Vault (or any other service used) checks the app's credentials and whether it is allowed to access the secret. If yes, it provides the app with the secret in a secure way. If not, the request fails.

HOW?
Create a Service Principal(SP) with the command:

az ad sp create-for-rbac --skip-assignment --name { Event Hubs Sender App SP }

Replace { Event Hubs Sender App SP } with an actual name.

You get back something like this. Save this information.

{
  "appId": "00000000-0000-0000-0000-000000000000",
  "displayName": "{given name}",
  "name": "http://{given name}",
  "password": "00000000-0000-0000-0000-000000000000",
  "tenant": "00000000-0000-0000-0000-000000000000"
}

In the Key Vault screen, go to Access Control ➡️ Role Assignment ➡️ Add ➡️ Pick Role under Role ➡️ Select your SP. Look at the diagram below:

For everything to work, you need to configure the environment with the SP credentials:

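# appId from the output above
export AZURE_CLIENT_ID="00000000-0000-0000-0000-000000000000"
# password from the output above
export AZURE_CLIENT_SECRET="00000000-0000-0000-0000-000000000000"
# tenant from the output above
export AZURE_TENANT_ID="00000000-0000-0000-0000-000000000000"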

Remember not to store those credentials in source control!
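A missing variable usually surfaces later as a confusing authentication error, so it can help to check the environment up front. Here is a minimal sketch, assuming a hypothetical helper named `missing_credentials`; the three variable names are the ones exported above.

```python
import os

REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All service principal variables are set.")
```

The check only reports variable names and never prints their values, so it is safe to leave in startup logs.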

Let's add the code!

This is the secure coding part. Here we handle the call to the Key Vault and securely retrieve the connection string:

from azure.keyvault.secrets import SecretClient
from azure.identity import DefaultAzureCredential

# Get connection string from vault
credential = DefaultAzureCredential()
keyVaultName = "{your key vault name}"
keyName = "{your key name}"
KVUri = "https://" + keyVaultName + ".vault.azure.net"

client = SecretClient(vault_url=KVUri, credential=credential)
conn_str_value = client.get_secret(keyName).value

DefaultAzureCredential searches the machine for a valid credential that identifies the app. Among other sources, it looks for the environment variables we defined earlier (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID). It returns an object that can provide an access token for the vault.

client = SecretClient(vault_url=KVUri, credential=credential) - creates a secret client that we use to request the secret value.
conn_str_value = client.get_secret(keyName).value - uses the client to retrieve the secret. The request travels over HTTPS, so the secret is encrypted in transit. Note that .value is the actual connection string, so avoid printing or logging it.
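If you ever need to log which Event Hub you are connecting to, one option is to redact the key first. Here is a minimal sketch, assuming a hypothetical helper named `redact_connection_string` and a made-up connection string in the usual `Endpoint=...;SharedAccessKeyName=...;SharedAccessKey=...` shape.

```python
def redact_connection_string(conn_str):
    """Replace the SharedAccessKey value with asterisks and keep the other parts."""
    parts = []
    for part in conn_str.split(";"):
        name, sep, _value = part.partition("=")
        if sep and name.strip().lower() == "sharedaccesskey":
            parts.append(name + "=***")  # hide the secret part
        else:
            parts.append(part)           # endpoint, policy name, entity path
    return ";".join(parts)


# Made-up connection string for illustration (not a real key).
example = (
    "Endpoint=sb://example.servicebus.windows.net/;"
    "SharedAccessKeyName=send-policy;"
    "SharedAccessKey=abc123=;"
    "EntityPath=stocks"
)
print(redact_connection_string(example))
```

Only the `SharedAccessKey` part is replaced; the policy name stays visible because it is compared as a whole name, not as a prefix.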

THE END.

In this tutorial, you learned how to call the Yahoo Finance API, set up a conda environment, create an Event Hubs producer, and secure keys in Key Vault. You are secure and ready to ingest data into a publish-subscribe service!

Learn more 💡

Thank you for reading all the way through! I hope you enjoyed and learned from this tutorial.

Do you have any Thoughts? Questions? Concerns? Ideas? Ping me on Twitter.
