Reddit has been at the epicenter of one of the biggest movements in the world of finance. It seemed like an unlikely source of such a movement, but in hindsight it is hardly surprising.
The trading-focused subreddits of Reddit host a huge amount of discussion about what is happening in the markets, so it is only logical to tap into this data source.
When building a data extraction tool like this, one of the first things we need to do is identify what the data we’re extracting is actually about. For that, we will be using named entity recognition (NER).
Once we have our data, we need to process it and extract organization names so that the results of any further analysis are automatically assigned to the correct stocks.
Organizations are mentioned in each subreddit in a variety of formats. Typically we will find two formats:
Organization name, e.g. Tesla or Tesla Motors
Ticker symbol, e.g. TSLA, tsla, or $TSLA
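To make the two formats concrete, here is a minimal sketch that maps both name and ticker mentions to a single canonical ticker. It is a plain lookup rather than NER, and the `ALIASES` table and `find_mentions` function are hypothetical names made up for illustration; a real tool would need a much larger alias table.

```python
import re

# Hypothetical alias table: every surface form maps to one canonical ticker.
ALIASES = {
    "tesla": "TSLA",
    "tesla motors": "TSLA",
    "tsla": "TSLA",
}

# Try the longest aliases first so "tesla motors" wins over "tesla".
# An optional leading "$" covers the $TSLA style; the lookarounds stop
# us from matching inside longer words.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9$])\$?("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def find_mentions(text):
    """Return (matched_text, canonical_ticker) pairs in order of appearance."""
    return [
        (m.group(0), ALIASES[m.group(1).lower()])
        for m in _PATTERN.finditer(text)
    ]
```

Calling `find_mentions("Tesla Motors is up, $TSLA to the moon")` would pair each mention with `TSLA`, whichever format was used.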
We also need to be able to differentiate between tickers and other abbreviations or slang. Some of these are ambiguous: AI, for example, can mean artificial intelligence or refer to the ticker symbol for C3.ai.
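One simple way to see the ambiguity problem is a rule-based filter. The sketch below is only a heuristic, not an NER model: it assumes (our assumption, not something established above) that an ambiguous symbol such as AI should count as a ticker only when written with a leading `$`. The `KNOWN_TICKERS` and `AMBIGUOUS` sets and the `extract_tickers` function are hypothetical.

```python
import re

# Hypothetical reference data for illustration only.
KNOWN_TICKERS = {"TSLA", "AI"}
AMBIGUOUS = {"AI"}  # symbols that are also everyday abbreviations

# A candidate is 1-5 letters, optionally prefixed with "$", standing alone.
_TOKEN = re.compile(r"(?<![A-Za-z0-9$])(\$?)([A-Za-z]{1,5})(?![A-Za-z0-9])")


def extract_tickers(text):
    """Return known tickers mentioned in text, in order of appearance."""
    found = []
    for m in _TOKEN.finditer(text):
        has_dollar = bool(m.group(1))
        symbol = m.group(2).upper()
        if symbol not in KNOWN_TICKERS:
            continue
        # Assumed rule: ambiguous symbols need the explicit "$" prefix.
        if symbol in AMBIGUOUS and not has_dollar:
            continue
        found.append(symbol)
    return found
```

A rule like this misses a bare "AI" that really does mean C3.ai, which is exactly the kind of case where context matters and a fixed rule falls short.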
So, we need a reasonably competent NER process to accurately classify our data.