DEV Community

Vignesh Subramanian


Data Science Project Lifecycle

  1. Understand Business Requirement: Define the problem statement and understand the customer's business use case. The primary goal should be to add value to the customer's business by helping them make better decisions.

  2. Data Acquisition: Data acquisition is mostly Extract, Transform, and Load (ETL). The major sources of data are databases, data warehouses, log files, and Hadoop/Spark systems. Most of the time, SQL is used to load and process the data.

  3. Data Preparation: This step involves data cleaning and pre-processing.

  4. Exploratory Data Analysis (EDA): After data preparation, we plot and visualize the data using various visualization techniques. We examine each feature (column) in the dataset to understand which features contribute most towards achieving our goal.

  5. Modeling, Evaluation and Interpretation: Modeling deals with building models (regression, classification, clustering, etc.) that suit the given business problem. Evaluation defines the KPIs, that is, the performance metrics used to judge the model(s) and pick the one that gives the best predictions. Model interpretation, at heart, is about finding ways to better understand how the model makes its decisions, which gives more transparency to the users of the application.

  6. Communicate/Document/Publish Results: Present the results to stakeholders and senior management to get a go-ahead for model deployment.

  7. Deployment: This step is usually done by Software Development Engineers, Machine Learning Engineers, and sometimes Distributed Systems Engineers.

  8. Real-World Testing (A/B Testing): Real-world testing verifies that the data analysis results and the models we have obtained are actually meaningful in the production environment. In this phase we measure the true business impact.

  9. Customer/Business: In this step, we present the experimental results to the customer/business and show how the solution we have developed addresses their use case and what value it adds to their business.

  10. Operations: This phase deals with retraining the models and handling model failures, and with defining an end-to-end process for doing both. This is called operationalization of models.

  11. Optimization: This step is a continuous improvement phase. Here we try to improve the model (the same model or a different one), acquire new or more data, add more features, and improve/optimize the production code.
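
To make steps 2 to 5 a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the `sales` table, the `ad_spend` and `revenue` columns, the toy numbers, and the choice of a one-variable linear regression scored with RMSE. A real project would pull from its own data sources and would likely use dedicated data and modeling libraries.

```python
import math
import sqlite3

# --- Step 2: data acquisition (SQL against a toy in-memory database) ---
# The table name, column names, and values are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
rows = [(1.0, 3.1), (2.0, 4.9), (3.0, None), (4.0, 9.2),
        (5.0, 10.8), (6.0, 13.1), (None, 7.0), (8.0, 17.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
raw = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()
conn.close()

# --- Step 3: data preparation (drop rows with missing values) ---
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# --- Step 4: EDA (a summary statistic: Pearson correlation) ---
def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [x for x, _ in clean]
ys = [y for _, y in clean]
print("rows kept:", len(clean), "correlation:", round(pearson(xs, ys), 3))

# --- Step 5: modeling and evaluation ---
# Simple hold-out split: fit on the first four rows, evaluate on the rest.
train, test = clean[:4], clean[4:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    px = [x for x, _ in points]
    py = [y for _, y in points]
    mx, my = mean(px), mean(py)
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x in px))
    return slope, my - slope * mx

def rmse(points, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(mean(errors))

slope, intercept = fit_line(train)
test_rmse = rmse(test, slope, intercept)
print("slope:", round(slope, 3), "test RMSE:", round(test_rmse, 3))
```

The metric chosen here (RMSE on held-out rows) plays the role of the KPI from step 5; a different business problem would call for a different metric.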

P.S.: The steps above may vary from project to project. One or more steps may be added or omitted depending on the specific business use case, timelines, business processes, etc.
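
As an illustration of step 8, one common way to check whether a difference seen in an A/B test is more than noise is a two-proportion z-test. The sketch below is only one possible approach and is not prescribed by the lifecycle itself; the function name `two_proportion_z_test` and the conversion counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution via erfc
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant A (current model) vs. variant B (new model)
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print("z =", round(z, 3), "p =", round(p, 4))
```

A small p-value suggests the new variant's conversion rate really differs from the old one, but whether that difference matters to the business is still a judgment for the stakeholders in step 9.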
