Building my first Data Science project
I discovered soon enough that deciding to work towards a future in data science, by taking classes and finishing specialisations, was the easy part of my journey.
One of the things you must do if you want to work in data science is build your portfolio. At the beginning, I struggled for days trying to find a good topic for a project, thinking about how to approach it and what I should look into; many times I came up with an idea and soon dropped it because it was not exactly what I wanted to do. I felt I could not get started. But it was then that I realised that building a project starts much earlier than the moment you type the first lines of code. In fact, thinking about how to design it is one of the essential parts of any data science project.
Because of my PhD and postdoctoral research work, I knew that liver disease has become one of the most common causes of death around the world. Because its end stages can differ from patient to patient, establishing a method to assess the prognosis of a particular patient with liver disease remains a challenge. So, I decided that the purpose of my project would be to analyse a dataset containing information about liver disease patients and to build a model to predict their survival.
I chose a dataset from the UCI Machine Learning Repository, whose CSV file was downloaded from the OpenML website. It comes from an observational study in which 19 different features, plus a class label (DIE or LIVE), were collected from 155 patients with chronic liver disease.
I decided to use Python 3 in a Jupyter Notebook. Python has a set of packages, such as NumPy, pandas, Matplotlib and scikit-learn, that are very powerful for data analysis and visualisation.
First of all, it is important to use the command %matplotlib notebook in order to plot the figures interactively. We need to load the modules into our Python environment using the import statement. Because the dataset was downloaded as a CSV file, we will use the pandas function read_csv, which automatically reads the file into a DataFrame. We can check the shape of our DataFrame against the specifications provided for our dataset: 155 patients (rows), 19 features + 1 class (columns).
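As a minimal sketch, the setup could look like this (the file name hepatitis_csv.csv is an assumption; use whatever your downloaded file is called):

```python
%matplotlib notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File name is hypothetical; adjust it to match your download
df = pd.read_csv('hepatitis_csv.csv')

# Should print (155, 20): 155 patients, 19 features + 1 class
print(df.shape)
```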
Exploratory Data Analysis
An important part of making predictions with machine learning techniques is to perform Exploratory Data Analysis (EDA). This is useful for getting to know your data, looking at it from different perspectives, and describing and summarising it without making any assumptions, in order to detect potential problems.
First, we can inspect our data to see whether we need to clean it. We will start with the head function, which shows us the first 5 rows of our DataFrame. As we can see below, there are missing values identified with the ? symbol. Knowing the data types of the variables included in our dataset is also a good piece of information. We can check this by using the dtypes attribute.
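Following the steps just described:

```python
# First five rows; missing values appear as the '?' symbol
print(df.head())

# Data type of each column
print(df.dtypes)
```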
As we can see above, 19 of our 20 variables appear as object type. Some of these variables are categorical (with 'no'/'yes' levels) and some of them should be numerical, with int or float type.
Because machine learning algorithms require numerical data, we need to convert the categorical values 'no' and 'yes' to 0 and 1, respectively. Another important point is to convert the binary survival variable (Class), currently encoded with the levels 'DIE' and 'LIVE', to the numerical categories 0 and 1, respectively. For this task, we will use the function replace. Lastly, we will convert all of the columns in the dataset to float type.
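A sketch of this cleaning step, assuming the missing values are stored as the string '?' and the categorical levels as the strings shown by head:

```python
# '?' marks missing values; turn them into NaN so the columns can be cast to float
df = df.replace('?', np.nan)

# Encode the categorical levels numerically
df = df.replace({'no': 0, 'yes': 1})
df['Class'] = df['Class'].replace({'DIE': 0, 'LIVE': 1})

# Convert every column to float
df = df.astype(float)
```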
Machine learning algorithms perform well when the number of observations in each class is similar, but when there is a high class imbalance, problems arise that lead to misclassification. Class imbalance occurs when the total number of observations in one class is significantly lower than in the other class. Typically, if a dataset contains 90% of its observations in one class and 10% in the other, it suffers from class imbalance. To check this point, we can calculate what percentage of the data belongs to each category, as shown below.
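One way to compute this, using value_counts with normalised counts:

```python
# Percentage of patients in each class (after the encoding above: 1 = LIVE, 0 = DIE)
print(df['Class'].value_counts(normalize=True) * 100)
```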
We can observe above that even though our dataset is not perfectly balanced (79.35% of the patients are in the LIVE class while only 20.65% are in the DIE class), it does not suffer from high class imbalance, allowing us to continue with our analysis.
The first step in EDA is to generate descriptive statistics summarising the central tendency, dispersion and shape of our dataset's distribution. We will do that using the pandas function describe. It is important to highlight that this function excludes the NaN values present in our dataset. Because many of our variables are discrete, it does not make sense to compute central tendency parameters for them, so we will include only the numerical variables in this case. For the discrete variables, we will instead use the functions apply and value_counts to get the counts at each level (0 or 1, corresponding to 'no' and 'yes').
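A sketch of both summaries, where the split between continuous and discrete columns is an inference from the dataset description:

```python
# Continuous variables (assumed from the dataset description)
numerical = ['AGE', 'BILIRUBIN', 'ALK_PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME']

# Summary statistics for the continuous variables; NaNs are excluded
print(df[numerical].describe())

# Counts per level (0 = 'no', 1 = 'yes') for each discrete variable
categorical = [col for col in df.columns if col not in numerical + ['Class']]
print(df[categorical].apply(pd.Series.value_counts))
```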
We can observe in the first table that the patients span an age bracket of 7–78 years, with a mean of 41.2 and a median of 39. There are missing values in most of the variables, particularly in PROTIME, where we have only 88 observations. If we pay attention to the means of the different variables, it is interesting to note that they vary considerably, ranging from 1.42 (BILIRUBIN) to 105.35 (ALK_PHOSPHATE). Also, the variables SGOT and ALK_PHOSPHATE show a high standard deviation, and their distributions could be right-skewed given that the mean is higher than the median. The rest of the variables appear to be normally distributed (mean ≈ median). The distribution of our variables is important to consider because it could later affect our machine learning algorithm: many algorithms make assumptions about the shape of the data, particularly about how the residuals are distributed. So we could consider applying a transformation to fix the observed skewness.
In the case of the categorical variables, there is a marked predominance of observations belonging to level 0 in the variable SEX, which means that the dataset includes more female than male patients. Likewise, there are more observations in class 0 than in class 1 for the variables ANOREXIA, ASCITES and VARICES. This could suggest that these features are differentially present in the patients and might be interesting variables influencing their survival.
The next step is to create some visuals to understand our dataset further by exploring the relationships in it. For this task, it is very useful to use the seaborn library, which makes it easy to produce attractive statistical graphics with little code.
We will take a moment here to evaluate the variables included in our dataset, in particular some that are interesting with regard to liver disease. Abnormal levels of alkaline phosphatase (ALK_PHOSPHATE), aspartate aminotransferase (SGOT), bilirubin (BILIRUBIN) and albumin (ALBUMIN), as well as a prolonged prothrombin time (PROTIME), indicate a malfunctioning liver. Anorexia (ANOREXIA) and ascites (ASCITES) appear later in patients with liver disease and normally indicate a poor prognosis. Because all of these variables are indicators of more or less severe liver damage, we will evaluate them to see their relationships and explore whether they could be important for our predictive model.
As we have already observed, our dataset contains a lot of NaN values. How to handle missing values is an extensive topic that we will not address here, but it is important to note that there are several ways to overcome this issue, and the best one has to be evaluated for each situation. In our case, we are going to drop them using the pandas function .dropna(). We will create a new DataFrame by selecting only the interesting variables that we mentioned above.
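A minimal sketch, where both the column selection and the name df_clean are assumptions based on the variables discussed above:

```python
# Columns of interest, inferred from the discussion above
cols = ['Class', 'BILIRUBIN', 'ALK_PHOSPHATE', 'SGOT', 'ALBUMIN',
        'PROTIME', 'ANOREXIA', 'ASCITES']

# Drop every row that still contains a NaN in the selected columns
df_clean = df[cols].dropna()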
We will continue by plotting histograms of our numerical variables to visualise and confirm their distributions. In the seaborn library, the function distplot allows us to plot a univariate distribution of observations. It is possible to plot two histograms side by side by using the matplotlib function subplot.
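For example (note that distplot has since been deprecated in recent seaborn versions in favour of histplot; this sketch follows the version used here):

```python
# Two histograms side by side: one subplot per variable
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sns.distplot(df_clean['ALK_PHOSPHATE'])

plt.subplot(1, 2, 2)
sns.distplot(df_clean['SGOT'])

plt.tight_layout()
```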
We can observe in the histograms that, in fact, several of our variables show a degree of skewness, including ALK_PHOSPHATE and SGOT, which we had already detected in the summary statistics. There are several transformations that can be applied to fix that. We will use the pandas function applymap and the NumPy function np.log to log-transform the columns corresponding to those skewed variables in our DataFrame.
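A sketch, where the list of skewed columns is an assumption based on the histograms and summary statistics:

```python
# Columns assumed to be right-skewed
skewed = ['BILIRUBIN', 'ALK_PHOSPHATE', 'SGOT']

# Apply the natural log element-wise to those columns
df_clean[skewed] = df_clean[skewed].applymap(np.log)
```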
Then, we can make use of the pairplot function to visualise the relationships between the different numerical variables. One nice feature of seaborn is that we can use the parameter hue to show the different levels of a categorical variable in different colours in the plot. In our case, we are interested in identifying the patients in Class 0 and Class 1.
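For instance:

```python
# Pairwise scatterplots and histograms, coloured by survival class
sns.pairplot(df_clean, hue='Class',
             vars=['BILIRUBIN', 'ALK_PHOSPHATE', 'SGOT', 'ALBUMIN', 'PROTIME'])
```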
Observing the plots, we can highlight several things:
- From the histograms, we learn that the skewness present in our data was largely fixed.
- Patients tend to separate according to whether they belong to Class 0 or Class 1 in some of our variables; however, this distinction is not completely clear.
- There does not appear to be a perfect linear relationship between the variables plotted, though in some of them we can observe a trend towards an interaction (SGOT and ALK_PHOSPHATE, SGOT and BILIRUBIN, PROTIME and ALBUMIN, BILIRUBIN and ALK_PHOSPHATE).
Then, we can analyse the relationship between our categorical variables and our numerical variables. For this part, we will take advantage of seaborn's PairGrid, which gives us more freedom to choose the x and y variables. In this case, we will use swarmplot, a particular type of scatterplot that does not overlap the points.
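A sketch of this grid, where the exact choice of x and y variables is an assumption:

```python
# Categorical variables on x, numerical variables on y; swarmplot spreads
# the points so they do not overlap
g = sns.PairGrid(df_clean,
                 x_vars=['ANOREXIA', 'ASCITES'],
                 y_vars=['BILIRUBIN', 'ALK_PHOSPHATE', 'SGOT',
                         'ALBUMIN', 'PROTIME'],
                 hue='Class')
g.map(sns.swarmplot)
g.add_legend()
```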
It is possible to observe that there is no difference in the variables plotted with regard to ANOREXIA status: not only are patients from both levels of Class distributed homogeneously, but there is also no difference in the values of the analysed variables across the levels of ANOREXIA. On the other hand, we can see a trend for patients in Class 0 to have ascites. However, there are no differences in the values of the variables with regard to ASCITES status.
The last thing we will do is deepen our analysis to see whether there is any strong correlation between our parameters. The importance of performing a correlation analysis on our dataset lies in the fact that highly correlated variables can hurt some models or, in other cases, provide little extra information, and including them can be computationally expensive without any real benefit. Also, knowing whether our variables display a linear relationship can help us choose which machine learning algorithm is most suitable for our data.
For this task, we will use the Pearson correlation coefficient because it is a good parameter for evaluating the strength of the linear relationship between two variables. In order to perform the correlation analysis with all our variables, we first need to apply the function factorize to the columns containing categorical variables in order to obtain a numeric representation of them. We will then make use of the function corr and plot the resulting array using heatmap, which allows us to visualise the correlation coefficients by the intensity of the colours.
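A sketch of this step; in our flow the columns were already converted to float, so the factorize loop is shown for the general case where some columns are still object-typed:

```python
# Obtain a numeric representation of any remaining object-typed columns
df_corr = df.copy()
for col in df_corr.select_dtypes(include='object').columns:
    df_corr[col] = pd.factorize(df_corr[col])[0]

# Pearson correlation matrix, visualised as a heatmap
corr = df_corr.corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', center=0)
```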
We can observe in the heatmap that some of the variables show a coefficient of ~0.6 or -0.4, but most of them display a very low correlation coefficient. So we can conclude that there is no strong linear correlation between our variables.
We have finished the EDA of our dataset. We got to know our data and now have a feel for it that will become very valuable when choosing the right machine learning algorithm for our case.
If you want to keep going and see how to perform predictions for this project, you can read my next post.
References
- Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.