In recent years, data science has seen an influx of researchers, aspiring professionals, and enthusiasts. While many of the tools and techniques have been around for years, if not decades, the industry has really taken off in the past five years. Good publicity, increased technological capabilities, and high pay prospects have combined to form the perfect recipe for a burgeoning field.
The space is so enticing that even I have decided to test the waters. In the coming months, I will be creating my first data science project using Python as part of my involvement in the Chicago Python user group (ChiPy) mentorship program.
If you're in my position, you may be wondering how to get started. With the help of individuals much smarter than I am, I've crafted a five-step sequence to follow. Without further ado:
It's easy to get sucked into the glitz and glamour of the "data scientist" role. The title itself, however, can mean 100 different things to 100 different employers. If you are an aspiring professional data scientist, know that the title is suffering a bit of an identity crisis. Assuming you've done your research, aren't just in it for the alluring pay, and are still interested, then continue to step two.
Data science is essentially programming and statistics. The two most popular languages for the practice at the moment are Python and R. It would not be a bad idea to take an online data science course and pick up some fundamental statistics before you begin. This will also help you ascertain whether the type of work is even something you would be interested in. I've scanned countless recommendations from online communities over the past year, and these courses seem to receive a lot of good feedback.
- Python for Data Science and Machine Learning Bootcamp
- Stanford Statistical Learning
- Machine Learning from Andrew Ng
I have worked through a decent portion of the Jose Portilla Udemy course in preparation for my project, and I find the instruction to be outstanding.
A word of caution: do not get sucked into the trap of the infinite tutorial loop. Just get the basics down, then try to tackle your own project as soon as you can. While the courses listed above are great resources, you will learn a tremendous amount by solving the problems that arise in your own project. You will also have something cool to show for it.
This one is pretty self-explanatory. If you're struggling to think of a project idea, just consider your own interests. From apples to zoos, if you can think of a subject, there's a good chance data exists for it. The best project idea is the one you actually stick to, so choosing something that excites you is in your best interest.
Got your topic? Great, now it's time to select your dataset(s). Kaggle is an amazing resource, as is Google's dataset search. I like astronomy, so I'm looking at the NASA Exoplanet Archive and beginning to envision the sorts of relationships and models that could be drawn out of it. Side note: web developers, please make data scientists' lives easier by allowing Google to find and publicize your data with their search tool. (Unless you are anti-Google, which is okay too.)
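Once you have a dataset downloaded, a quick first look can tell you whether it is usable. Here is a minimal sketch of what that might look like using only Python's standard library. The column names (`orbital_period_days`, `radius_earth`) and every value in `SAMPLE_CSV` are placeholders I made up for illustration; they are not taken from the NASA Exoplanet Archive or any real dataset, and the helper functions are hypothetical names of my own.

```python
import csv
import io
import statistics

# Placeholder rows, invented for illustration only; not real measurements.
SAMPLE_CSV = """name,orbital_period_days,radius_earth
planet_a,3.5,1.2
planet_b,12.0,
planet_c,365.0,1.0
planet_d,88.0,0.4
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def numeric_column(rows, column):
    """Return the column's values as floats, skipping blank cells."""
    return [float(row[column]) for row in rows if row[column].strip()]


def summarize(rows, column):
    """Count present and missing values and compute mean and median."""
    values = numeric_column(rows, column)
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


rows = load_rows(SAMPLE_CSV)
summary = summarize(rows, "radius_earth")
print(summary)
```

For a real file, you could swap `io.StringIO(text)` for an opened file handle. Checking how many values are missing before computing anything else is a cheap way to find out how much cleaning lies ahead.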
At this point, you've got the tools, techniques, and data to get started. Get stuck? There are plenty of online communities willing to help out. Good luck!
Author's Note | This has been the first in a series of blog posts for the ChiPy mentorship program. My next steps are to extract and clean my data, so keep an eye out for my next post if you are interested in my progress.