9 months of Machine Learning and beyond: before I started

For this post, I want to look back at my earlier attempts to learn machine learning and data science, and discuss my current learning strategy.

After university, I worked as a graphics programmer, developing tools for data visualization. However, I was also curious about data analysis. I admired people who could tell stories with data and wanted to acquire that skill. I eventually decided to pursue the Data Analysis with R Specialization on Coursera.

I considered many options for learning data analysis. Most courses on the topic were aimed at people without much computer science background, covering the basics of Python and R, data visualization, relational databases and SQL, with a bit of statistics-based data analysis at the end. Given my background, I didn't need those basics, so I looked for a program focused on foundational knowledge, especially statistics.

The Data Analysis Specialization fit my needs perfectly. In hindsight, my only regret is not choosing a course that covered a similar curriculum in Python, which I find to be a more versatile language. Originally, the specialization consisted of five courses but has since been shortened to three.

Data Analysis with R Specialization on Coursera

The program included four theory-oriented courses. Each consisted of lecture videos, interviews with industry experts, quizzes, and peer-reviewed data analysis tasks. The tasks involved writing reports in R Markdown, an R counterpart to Jupyter Notebooks, and submitting them for peer review. Reviewing other students' reports was a crucial part of the learning process, as it offered different perspectives on the same dataset and introduced useful R tricks.

The specialization also concluded with a practical course built around a large data analysis project, similar to the peer-reviewed reports but on a larger scale.

Introduction to Probability and Data with R

The goal of the first course was to build an understanding of the fundamentals of probability and data analysis. It covered key concepts such as the difference between observational studies and experiments, which is crucial for designing studies and interpreting data accurately. It also emphasized the importance of eliminating bias, an essential skill for anyone working with data, as bias can skew results and lead to incorrect conclusions.

The course also provided an introduction to probability theory, helping students understand the likelihood of various events occurring. It explored different types of probability distributions, such as the Normal and Binomial, which are fundamental to many statistical analyses and real-world phenomena. The course taught how to describe a dataset quantitatively using measures like the mean, median, and standard deviation, which are essential for summarizing data. It also introduced robust statistics, measures that are more tolerant of outliers and keep analyses from being unduly influenced by anomalies.
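To make these ideas concrete, here is a small base R sketch of my own (not course material) using simulated data: summary measures of a Normal sample, a Binomial probability, and how a single outlier affects the mean versus the median.

```r
# Not from the course: a small base R sketch of the descriptive ideas above.
set.seed(42)
heights <- rnorm(1000, mean = 170, sd = 8)   # simulate a Normal sample

mean(heights)    # ~170, the average
median(heights)  # ~170, the middle value
sd(heights)      # ~8, typical spread around the mean

# Binomial example: probability of exactly 7 heads in 10 fair coin flips
dbinom(7, size = 10, prob = 0.5)

# Robust statistics: a single extreme outlier drags the mean far more than the median
with_outlier <- c(heights, 10000)
mean(with_outlier)    # noticeably inflated
median(with_outlier)  # barely changes
```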

Inferential Statistics

The second course taught how to make informed decisions based on data. This course introduced tools for estimating the probability of a given dataset occurring by chance, which is vital for making evidence-based decisions in various fields, from business to scientific research. The course covered hypothesis testing, a key technique for determining whether a result is statistically significant, and introduced the concepts of false positive and false negative errors. Understanding these errors is crucial, as they can affect decision-making. For example, in medical testing, a false negative could miss a diagnosis, while a false positive could lead to unnecessary treatment. This course highlighted the importance of balancing the risk of each error type based on the specific context of a problem.
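As a rough illustration (my own sketch, not from the course), a simple t-test on simulated data shows how the p-value and the chosen significance level relate to false positives and false negatives:

```r
# Not from the course: an illustrative hypothesis test on simulated data.
set.seed(1)
control   <- rnorm(50, mean = 100, sd = 15)  # outcomes without treatment
treatment <- rnorm(50, mean = 108, sd = 15)  # outcomes with a real +8 effect

result <- t.test(treatment, control)
result$p.value  # probability of a difference at least this large if there were no real effect

# Choosing alpha = 0.05 means accepting roughly a 5% false positive rate;
# lowering alpha reduces false positives but raises the chance of a false negative.
result$p.value < 0.05
```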

Linear Regression and Modeling

The third course delved into linear regression, one of the most fundamental and widely used statistical and machine learning models. The goal of linear regression is to fit a linear function to a set of data points, establishing a relationship between a dependent variable and one or more independent variables. The method is used extensively to predict outcomes from input variables and has applications across numerous industries, from forecasting sales to understanding the relationship between factors. The course provided an in-depth introduction to linear regression, including techniques for handling outliers to improve model accuracy, methods for evaluating model performance, and strategies for selecting predictors in multiple linear regression. This knowledge is invaluable for anyone looking to build predictive models or understand the relationships between variables.
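For a flavour of how this looks in practice, here is a short base R sketch of my own (not from the course) using the built-in mtcars dataset:

```r
# Not from the course: fitting a simple linear regression with base R's lm().
data(mtcars)
model <- lm(mpg ~ wt, data = mtcars)  # fuel efficiency as a linear function of car weight

summary(model)$r.squared     # how much of the variation in mpg the model explains
coef(model)                  # intercept and slope of the fitted line

# Predict mpg for a hypothetical 3,000 lb car (wt is in units of 1,000 lb)
predict(model, newdata = data.frame(wt = 3.0))

# Multiple linear regression: add horsepower as a second predictor
model2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model2)$adj.r.squared  # adjusted R-squared helps compare models with different predictors
```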

Bayesian Statistics

The last theoretical course introduced a different approach to probability. It explained the difference between the Frequentist and Bayesian approaches, with the Bayesian interpretation offering a flexible framework for applying statistical methods in a wide range of applications. The Bayesian approach treats probability as a degree of belief rather than a fixed frequency, allowing for continuous updating of beliefs based on new evidence according to Bayes’ Rule. This flexibility makes Bayesian statistics useful in dynamic environments where conditions change, such as finance or real-time decision-making. The course demonstrated how this approach can lead to more nuanced and adaptable models.
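As a tiny illustration of my own (not from the course, and with entirely made-up numbers), Bayes' Rule is just arithmetic over a prior belief and new evidence:

```r
# Not from the course: Bayes' Rule as plain arithmetic, with made-up numbers.
# Prior belief: 1% of patients have the condition.
prior       <- 0.01
sensitivity <- 0.95   # P(positive test | condition)
false_pos   <- 0.05   # P(positive test | no condition)

# P(positive test) by the law of total probability
p_positive <- sensitivity * prior + false_pos * (1 - prior)

# Posterior: updated belief after seeing one positive test
posterior <- sensitivity * prior / p_positive
posterior  # ~0.16 -- the evidence raises the belief, but the low prior still dominates

# The posterior can serve as the prior for the next piece of evidence,
# which is the continuous updating the Bayesian approach is built on.
```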

Outcome of the study

While the knowledge wasn't directly applicable to my day-to-day tasks, it improved my decision-making and helped me understand research papers and the news. However, since it was not immediately useful at work, I lacked the motivation to dive deeper into the topic and focused instead on other skills such as databases, cloud computing, algorithms, and web development.

Back to ML

As I mentioned in my previous post, I was impressed by ChatGPT and Midjourney, which reignited my interest in machine learning. I had a basic idea of artificial neural networks but struggled to understand modern models. Having experience with both Udemy and Coursera, I found Coursera too intense for my needs: I wasn't planning to switch to a machine learning job, I simply wanted to understand the field. I didn't want to spend time on the quizzes and programming tasks typical of Coursera's computer science courses, and I found the platform a bit pricey.

In contrast, I had already purchased several machine learning courses on Udemy, and the platform offers a wide variety of courses from different instructors. From my self-study experience, I have always been concerned about long-term retention of knowledge, particularly when it is not in active use, and I find interval-based training helpful for retention.

My Learning Approach

In interval-based training, I buy several courses on the same topic by different instructors, mixing courses on different computer science areas to space out similar topics. This approach has two benefits:

  • First, the initial exposure to a new topic serves as an introduction, forming a "skeleton" of knowledge. After a break, I revisit the topic through another course. Repetition over time enhances long-term retention, supported by research on memory mechanics.

  • Second, different instructors present the material from different angles, enabling a more comprehensive understanding of the topic.

Wrapping Up

Starting with the next post, I'll review the courses I've taken, beginning with foundational machine learning courses I completed early on.
