This was our final-year project. We chose a challenging problem: predicting the stock market purely from news articles, using NLP and machine learning prediction models.
A stock market prediction platform that parses news articles and predicts stock market index prices using machine learning.
Folder Structure Overview
- documentation - Some of our presentation and various notes taken.
- python - All the code.
- website - Contains the code for our website
Requirements.txt will be uploaded soon
Running the project
- Download all the news articles by manually setting the parameters in python/download/news/threads.py. The script is multithreaded and downloads in parallel, which saves time.
- You have to manually uncomment each line and check the dates. The news is downloaded to python/data/news/[newspaper]/lists/[various-file]
- Then run merge.py and specify its parameters to merge all the news files.
- Run the NLP classifier.
- Run python/nlp/classify.py, setting the in_csv variable to the input file. Set the output file location in output_file in the makeKeyWordList function and make sure the write_to_file variable is set to 1.
- This writes multiple files to the output location, which need to be merged again with merge.py.
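The merge steps above can be pictured with a minimal sketch. This is not the project's merge.py; the function name merge_csv_files, its parameters, and the assumption that every input file is a CSV with the same header row are all hypothetical, chosen only to illustrate combining many downloaded files into one.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir, output_path, pattern="*.csv"):
    """Concatenate every CSV matching `pattern` in `input_dir` into one file.

    Assumes (hypothetically) that all inputs share the same header row;
    the header is written once. Returns the number of data rows written.
    """
    output_path = Path(output_path)
    # List the inputs up front and skip the output file if it lives in input_dir.
    files = sorted(
        p for p in Path(input_dir).glob(pattern)
        if p.resolve() != output_path.resolve()
    )
    rows_written = 0
    header = None
    with output_path.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in files:
            with path.open(newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Failing on a header mismatch, rather than merging silently, makes it easier to notice when files from different newspapers were saved in different formats.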
Built using Python, Flask, beautifulsoup4, TensorFlow, scikit-learn, and the VADER NLP library.
Our first step was to build the stock market dataset on which our predictions would be based.
- We decided to build it on the NIFTY 50, the Indian stock market index of 50 stocks, which gave us a large enough dataset.
- We downloaded stock market data for the last 8 years from Alpha Vantage and Yahoo Finance to use in our project.
- Then we wrote HTML parsers in Python with beautifulsoup4 to mine 20 lakh (2 million) news articles, covering those 8 years, from 4 news sources.
- These articles were then passed to VADER, a Python NLP library, which scores how positive or negative each news item is. Since there were some false positives, we had to add extra keywords. The result is a sentiment score from -1.0 (most negative) to 1.0 (most positive).
- For each day, we took each news article and extracted any NIFTY 50 companies mentioned in it. We also took the mean sentiment of that day's articles belonging to the same sector, since stocks in the same sector are affected by that news as well. This gave us a dataset of 78,000 data points.
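The per-day, per-sector averaging described above can be sketched roughly as follows. This is an illustration, not the project's code: the COMPANY_SECTOR mapping, the function names, and the simple substring match for company names are assumptions, and the scores are taken as already computed values in the -1.0 to 1.0 range.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from company name to sector (illustrative entries only).
COMPANY_SECTOR = {
    "Infosys": "IT",
    "TCS": "IT",
    "HDFC Bank": "Banking",
}


def companies_in(text):
    """Return the tracked companies whose names appear in the text.

    Uses a naive case-insensitive substring match, which can produce
    false matches for short names; a real pipeline would need more care.
    """
    lowered = text.lower()
    return [name for name in COMPANY_SECTOR if name.lower() in lowered]


def daily_sector_sentiment(articles):
    """Average article sentiment per (date, sector).

    `articles` is an iterable of (date, headline, score) tuples, where
    score is a sentiment value between -1.0 and 1.0.
    """
    buckets = defaultdict(list)
    for date, headline, score in articles:
        # An article counts once per sector, even if it names several
        # companies from that sector.
        sectors = {COMPANY_SECTOR[c] for c in companies_in(headline)}
        for sector in sectors:
            buckets[(date, sector)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}
```

Articles that mention no tracked company are simply dropped here; whether the project handles them differently is not stated.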
Thus, given just a news headline as input, our project can now predict the chance that a stock will be affected and the percentage by which it will move.
We are still working on the frontend and hope to finish it soon.
It was a very tough project, but it taught us a whole lot: how to download and manage large amounts of data, and how to feed that data to machine learning models to get accurate results.