Machine Learning Tech Stories

What are the practical aspects of Data Science?

( This is a placeholder for my learnings in the context of the 'Practical Data Science Specialization', updated almost daily, until I complete the course )

What is Data Science?

It is an intersection of various domains/toolsets used to solve problems dealing with data, namely:

  1. Artificial Intelligence
  2. Machine Learning
  3. Deep Learning
  4. Domain knowledge of the business
  5. Knowledge of the mathematics behind the techniques used to deal with data
  6. Statistics
  7. Visualization - to explore and present data visually
  8. Programming - e.g., with Python, NumPy, etc.

(Image courtesy: Andrew Ng)

Doing machine learning or data science projects on a laptop vs. in the cloud

When we do our projects on a local laptop, we are limited by the resources that laptop provides: the maximum amount of memory, the processing power, and whether it has a CPU or a GPU. All of this affects how efficiently you can run your machine learning project in every phase, for example:

  1. Data ingestion
  2. Data exploration
  3. Data analysis
  4. Data visualization

All the above steps are key steps that come before building the machine learning model, which will then be used to make inferences.
Each of these steps needs to be done efficiently, and the cloud provides resources at scale and on demand: you can increase the amount of storage, RAM, or processing power as needed.

( Which brings me to a question: suppose some project is undergoing training in the cloud, and I, as an ML engineer/researcher, realize that it is running very slowly ( yes, it depends on the mathematical equation/parameters being fitted to the data ), but it is slow. How do I move this running project onto another machine with greater capacity, without downtime for my training? For example, some computations would already have been performed and stored in my old machine's RAM. How would these computations be pushed to the new machine? Basically, how does scaling work while machine learning training happens, or, for that matter, while inference happens? I don't have answers, but the parallel I am looking at comes from software projects, where we have application servers mapped to Kubernetes pods and you can instantaneously scale the pods. Effectively, it becomes a distributed machine learning problem: the training/inference happens over a distributed system, so the data required for training/inference gets distributed across systems. When a new machine is added, which is what brings the scalability, we are effectively giving some portion of the existing data to this new machine to process, whether in the training phase or the inference phase. I am still curious as to how this happens - I have heard of tools like Kubeflow, and I need to explore them. )
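To make the "data gets distributed across systems" idea concrete, here is a minimal, single-process sketch of data parallelism in plain Python/NumPy. It only illustrates the general technique - shard the data, compute per-worker gradients, average them - and is not how Kubeflow or any specific framework actually implements it; all names and numbers below are made up.

```python
# A toy sketch of data-parallel training: the dataset is sharded across
# N "workers", each computes a gradient on its own shard, and the gradients
# are averaged (an "all-reduce") before updating the shared parameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # toy features
y = X @ np.array([2.0, -1.0, 0.5])      # toy targets with known true weights

n_workers = 4                            # "scaling out" = raising this number
shards_X = np.array_split(X, n_workers)  # each worker holds a slice of the data
shards_y = np.array_split(y, n_workers)

w = np.zeros(3)                          # model parameters, replicated on every worker
for step in range(100):
    # Each worker computes the mean-squared-error gradient on its own shard.
    grads = [
        2 * Xs.T @ (Xs @ w - ys) / len(ys)
        for Xs, ys in zip(shards_X, shards_y)
    ]
    # Average the per-worker gradients, then update the replicated parameters.
    # Adding a machine just means re-sharding the data and averaging one more gradient.
    w -= 0.1 * np.mean(grads, axis=0)

print(w)  # approaches [2.0, -1.0, 0.5]
```

In a real distributed setup the averaging step happens over the network, which is why moving a running job to a bigger machine is non-trivial: the parameter state and the data shards both have to be redistributed.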

Companies like Google, Microsoft, and OpenAI definitely use distributed machine learning. For example, when they say a model has 1.75 trillion parameters, the architecture would involve either a giant machine with huge processing power and RAM, or scores of distributed systems that participate in the overall processing.

Data slicing or Data transformations in parallel

While doing our machine learning projects, we might come across scenarios where we need to slice data (slicing & dicing), meaning that we want to reduce a complex dataset to a small, meaningful, or focused set of columns. For example, the initial dataset might have hundreds of columns, but we might be interested in only a few of them, so we can limit the data. We can also rename columns to suit our needs, as sketched below.
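A minimal pandas sketch of this column slicing and renaming; the column names here are made up for illustration.

```python
import pandas as pd

# A toy dataset standing in for one with hundreds of columns.
df = pd.DataFrame({
    "review_id": [1, 2, 3],
    "review_body": ["great", "bad", "okay"],
    "star_rating": [5, 1, 3],
    "marketplace": ["US", "US", "DE"],   # one of many columns we do not need
})

# Keep only the columns we care about, then rename them to our needs.
df_small = df[["review_body", "star_rating"]].rename(
    columns={"review_body": "text", "star_rating": "rating"}
)
print(df_small.head())
```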

General steps in any data science process:
  1. We deal with massive data sets
  2. We need to extract relevant features from the dataset
  3. We then gain knowledge/insight from this set of relevant features

(Image courtesy: AWS / DeepLearning.AI)

Advantages of doing data science projects in the cloud

With our laptop, we are limited by the hardware: sometimes training the model might consume all of our RAM, and even the CPU might get hogged. In the cloud, we could switch from a CPU to a GPU compute instance and also choose a sizeable amount of RAM to continue our task.

(Image courtesy: AWS / DeepLearning.AI)

What is the machine learning workflow to be worked on?

(Image courtesy: AWS / DeepLearning.AI)

As we see above, in the first phase - Ingest & Analyze:

  1. Initially, we need to ingest the data; this will be done using Amazon S3
  2. Data exploration will be done with SQL queries, using Amazon Athena (a sketch follows this list)
  3. We need to perform statistical bias detection on the input data, which will be done using Amazon SageMaker Clarify (at this point, I am clueless about what bias detection is and why we need it)
  4. AWS Glue will be used to catalogue the data
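A hedged sketch of the Athena exploration step from Python, using the awswrangler library. The database and table names are placeholders I made up, not from the course material, and the query assumes the data has already been catalogued in Glue.

```python
import awswrangler as wr

# Run plain SQL against data sitting in S3, via Athena; the result comes
# back as a pandas DataFrame. "dsoaws" and "reviews" are assumed names.
df = wr.athena.read_sql_query(
    sql="SELECT product_category, COUNT(*) AS n FROM reviews GROUP BY product_category",
    database="dsoaws",   # assumed Glue database name
)
print(df.head())
```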

In the next phase - Prepare & Transform:

  1. We need to extract relevant features from the input dataset; this involves feature engineering, which will be done using SageMaker Data Wrangler and Processing Jobs (a sketch follows this list)
  2. The extracted features need to be stored, for which we will use SageMaker Feature Store
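A hedged sketch of launching a SageMaker Processing job for the feature-engineering step. The script name, S3 paths, and IAM role ARN are all placeholders; the actual feature-engineering logic would live in your own `preprocess.py`.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Run a scikit-learn-based processing script on a managed instance.
processor = SKLearnProcessor(
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
processor.run(
    code="preprocess.py",  # your feature-engineering script (assumed name)
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/features/")],
)
```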

In the next phase - Train & Tune:

  1. We now go into the phase of model development. Using Autopilot, we will get a set of model candidates trained on the data, from which the candidate with the best score/accuracy is chosen.
  2. SageMaker Training and Debugger can be used to detect issues during training, for improving the model accuracy.
  3. Hyperparameter Tuning - this is required because hyperparameters (learning rate, batch size, etc.) are not learned from the data itself; we have to search for the values that give the best validation performance (a sketch follows this list).
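A hedged sketch of SageMaker automatic hyperparameter tuning: it launches several training jobs with different learning rates and keeps the best one. Here `estimator` is assumed to be a previously configured SageMaker Estimator, and the metric name, regex, and S3 path are placeholders.

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,                        # assumed: an existing Estimator
    objective_metric_name="validation:accuracy",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-5, 1e-1)},
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "val_acc: ([0-9\\.]+)"}],  # parsed from training logs
    max_jobs=8,            # total training jobs to try
    max_parallel_jobs=2,   # how many run at once
)
tuner.fit({"train": "s3://my-bucket/train/"})  # assumed S3 input channel
```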

In the next phase - Deploy & Manage:

  1. This is when the model has been built and is ready for use, so we need to deploy it.
  2. We will create an automated pipeline so that, once the model is built, it is automatically deployed as well.
  3. SageMaker endpoints will serve the model, which can then be used for inference (a sketch follows this list).
  4. Batch Transform and Pipelines will also be used in this stage.
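A hedged sketch of deploying a trained model to a SageMaker endpoint and invoking it. Again, `estimator` is assumed to be an already-trained Estimator, and the exact payload format accepted by `predict` depends on the model's serializer.

```python
# Deploy the trained model behind a real-time HTTPS endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

# Invoke the endpoint for inference (payload format depends on the model).
result = predictor.predict("I really love this product!")
print(result)

predictor.delete_endpoint()  # clean up, so we don't pay for an idle endpoint
```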
Popular machine learning tasks

(Image courtesy: AWS / DeepLearning.AI)

  1. Supervised - the machine learns from labeled examples. Classification could involve categorizing the sentiment of text as positive, neutral, or negative. Regression could involve predicting a continuous value given a set of parameters, for example predicting a house price.
  2. Unsupervised - this involves determining patterns and clustering/grouping the data points, without labels.
  3. Image processing / CV - here we need to determine whether an image contains a dog or a cat, or, for self-driving cars, to differentiate between speed signs and trees.
  4. NLP / NLU - here we do sentiment analysis, machine translation, transfer learning, or question answering. (A sketch contrasting the first two task types follows this list.)
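A tiny scikit-learn sketch contrasting supervised classification (learning from labeled examples) with unsupervised clustering (finding groups without labels), on made-up toy data.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 200 points in 3 blobs, with labels y.
X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: the classifier uses the labels y during training.
clf = LogisticRegression().fit(X, y)
print("classifier accuracy:", clf.score(X, y))

# Unsupervised: KMeans ignores y and finds groups on its own.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster assignments:", km.labels_[:10])
```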
Multi-class classification for sentiment analysis of product reviews

We have a set of product reviews, for example from amazon.com.
For each product review, we have to classify the sentiment into classes such as positive, neutral, and negative.

(Image courtesy: AWS / DeepLearning.AI)

We need to do training; this is a supervised machine learning problem, so we need to provide labels, as shown below.

(Image courtesy: AWS / DeepLearning.AI)
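A hedged sketch of how such labels could be created in pandas. One common convention (not necessarily the exact one in the course) maps star ratings to three classes: -1 (negative), 0 (neutral), 1 (positive); the data below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "review_body": ["Loved it", "It is okay", "Terrible quality"],
    "star_rating": [5, 3, 1],
})

def to_sentiment(stars: int) -> int:
    """Map a 1-5 star rating to a sentiment class (assumed convention)."""
    if stars >= 4:
        return 1    # positive
    if stars == 3:
        return 0    # neutral
    return -1       # negative

df["sentiment"] = df["star_rating"].map(to_sentiment)
print(df)
```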

Learning continues....
