DEV Community

Cover image for Exploratory Data Analysis in JavaScript
sk
sk

Posted on

Exploratory Data Analysis in JavaScript

EDA

looking for patterns that addresses our question(s), this entire article and future ones are heavily informed by Allen downey's thinkstats2, it actually an excellent book both in terms of content and structure, a cherry on top it is free, and written in python.

data and practical methods can answer questions and guide decisions under uncertainty
> Allen Downey

That is exactly what we will attempt to do in this article and the subsequent one's, answer questions given data,

For all the beginners or if this is your first time attempting data analysis, it can be a little hard to wrap your head around it at first, I'll admit it took a long time for me to even have a basic understanding of what I am doing and even today feel like I have no idea what I am doing, and reason really is data science thou being a science involves a lot of different fields: statistics,some form of art, critical thinking, on top of that you need to have some form of domain knowledge for whatever data you are working with and the list goes, but worry not a simple structure or system is usually useful to handle such fields, which is what thinkstats2 attempts to do, successfully at that.

Statistical analysis and EDA are the most fundamental and important parts of actually doing data science, building and training models is actually the last step, if you have been around in the data science world there's a rather famous statement garbage in garbage out, simply meaning feed your model garbage data expect a garbage model/output. Sadly this is where most beginners get stuck because they dive straight into building models, without understanding the most fundamental and crucial step and EDA is the first step towards success meaning you are in the right path.

Question

before we can even do any analysis we need a question to actually inform and guide our analysis, we will be using the automobile dataset.

brand/make - BMW, Audi, Volvo etc

possible questions:

what is the most expensive car?
what is the least expensive car?

what is the most affordable car by make/brand?
what is the least affordable car by make/brand?

which is the most profitable make/brand to invest in?

given a car with certain features can you classify it's brand/make
given a car with certain features including make/brand can you estimate it's price

The first two questions are probably the simplest to answer we don't care about the make of the car, this is basically getting the min and max of the dataset in terms of price

the second two are little involved we want to isolate a group from the rest of the data by affordability, secondly we need to know what affordable means in this context(data), does it mean more people are buying from that brand, or that brand has many models and variation of cars(we will come back to this later), what determines affordability?, after answering that question we need to identify that variable in the data and isolate it to answer the questions.

the third may seem like the second but in some form it's not, and I will admit we cannot "really"answer it with the data we have, because profitability is more than how many people are buying those cars, or how many models does the brand have, it encompasses stocks, how much does it cost to produce the car(the entire production chain) and what not, this is where data collection comes in.

the last two inform models, this is where we look at linear separability of data, correlation, variance etc.

the gist of the matter is question(s) inform approach and methods you will use to negotiate the data, and how you handle it, usually you will have a single question about the data which of course can branch to multiple questions with an aim to answer The question, we will attempt to answer the above questions to some extent because we are learning.

Simple Analysis and Data Cleaning

before we can answer the questions we need to get familiar with the data, clean it and check for errors, usual data comes with a codebook which describes the data, since we do not have it here we will use our intuition(never do this a codebook is very important).

we will be using bokke.js as per the previous article I published here: if you have not read it yet please do or you may not understand what is going on here, I cannot cover bokke here, there are also new features in bokke that are not covered in the previous article but will be covered here.

npm i bokke.js

Enter fullscreen mode Exit fullscreen mode

or if you were following the last article update bokke to


npm i bokke.js@1.0.1

Enter fullscreen mode Exit fullscreen mode

setup:

index.html:


<!DOCTYPE html>

<html lang="en">

<head>

 <meta charset="UTF-8">

 <meta http-equiv="X-UA-Compatible" content="IE=edge">

 <meta name="viewport" content="width=device-width, initial-scale=1.0">

 <title>Document</title>





 <style>

 table {

 border-collapse: collapse;

 margin-top: 10px;

 width: 100%;

 }



 table th {

 text-align: center;

 background-color: #3e6070;

 color: #fff;

 padding: 4px 30px 4px 2px;

 }



 table td {

 border: 1px solid #e3e3e3;

 padding: 4px 8px;

 }



 table tr:nth-child(odd) td {

 background-color: #e7edf0;

 }

 </style>

</head>

<body>

 <input type="file" id="csv">




 <div id="table">




 </div>



 <div id="har">



 </div>



 <script src="main.js"></script>

</body>

</html>

Enter fullscreen mode Exit fullscreen mode

main.js




import dataFrame, {Hist, Plot} from "bokke.js";

let file = document.getElementById("csv");

file.onchange = e => {



 dataFrame(file, (frame)=> {




     document.getElementById("table").innerHTML = frame.table

     window.frame_ = frame // to use in the dev console

     window.Hist = Hist

     window.Plot = Plot




 },)

 }



Enter fullscreen mode Exit fullscreen mode

devConsole:


const df = window.frame_  // dataframe
const tbl = document.getElementById("table") // div to append table
let df2     // to assign manipulated/changed dataframe
function callback(frame){
      tbl.innerHTML = frame.table;
      df2 = frame
}





Enter fullscreen mode Exit fullscreen mode

bokke has a dataframe module which produces an html table, and has many methods for data cleaning and exploration, you can find an overview and tutorial on bokke.js here, prerequisite for this one

drop row 205 and the duplicate index column:

df.drop(true, callback,205, " ")  // true means inplace


Enter fullscreen mode Exit fullscreen mode

because we do not have a codebook we will clean the data in parallel with answering a question, meaning a question will determine which features(columns) we are to clean, to avoid cleaning "unnecessary" features(one of the few reason the codebook is important), in essence we are doing multiple things developing a codebook whilst analyzing data.

what is the most expensive car? && what is the least expensive car?

immediately this tells us to look at the price. Looking at unique values especially in our case(no code book) is super useful, for we expect the price column to be numbers and we need to validate that, if there are no missing values or strings, if there are we handle them.

 df.unique("price") // returns an array of unique values

 //  we have a  '?', not a number


Enter fullscreen mode Exit fullscreen mode

first discovery the automobile dataset uses '?' to denote missing values,

Handling missing values

there are multiple ways to handle missing values ranging from easy to advanced which are beyond the scope of this article, for our case we can take the easy way and use descriptive statistics(mean), I guess it's not that easy, dropping the row is way easier but we won't.

the mean is sometimes criticized as a replacement(imputation) technique, but in our case it is highly unlikely that it will produce noise in the price column, if you think about the mean for each make will be between the min and max of the make, and it is way easier to get, anyway the focus is not the mean but knowing that in your EDA step you need to handle missing values, you can easily google more advanced methods for doing such.

the cars are classified by their makes/brands, let's assume they are representative of the population of cars in each brand(which I doubt is the case) but for our purpose will work, the normal intuition will be to take the entire price column mean and fill the missing value(s) with it, but that is inaccurate, the mean of the entire column represents the distribution of the entire column(all makes(bmw, audi etc) combined), we need to fill the missing value with a mean from it's make

  1. we need to find out which make has the missing data
   // will group all rows that return true
  function predicate(val){
    if(val === "?")
        return true
    else
        return false
}

Enter fullscreen mode Exit fullscreen mode

 df.group_by("price", predicate, callback)  // return all rows with '?' in the price column

Enter fullscreen mode Exit fullscreen mode

which return's three makes


isuzu
audi 
porsche

Enter fullscreen mode Exit fullscreen mode
  1. group by each make and compute basic stats

    but first we need to replace "?" with 0 in the price column, which is safe for we know that no car cost's 0 dollars, we are doing this to avoid getting "NaN" when calculating a mean, this is a minor limitation on bokke's side and will be fixed soon.

    
    df.replace("?","NaN", callback, true) // replace all "?" with NaN
    
    df.map({'NaN': "0"}, "price", callback, true) // replacing all NaN's in the price column   with "0"
    

    we can check if it worked by:

    
       function predicate(val){
        if(val === '0')
            return true
        else
            return false
       }
    
       df.group_by("price", predicate, callback)
    
    

you can check if there are no string's in the price column by computing basic stats

df.basicStat()

// if the last column has no NaN's we are good to go

Enter fullscreen mode Exit fullscreen mode

I know we replaced NaN with a string "0" but this actually is a number, it is coerced during calculation, always use strings with map to avoid problems

2.1 Isuzu

function predicate(val){
    if(val === 'isuzu')
        return true
    else
        return false
}

df.group_by("make", predicate, callback) 

Enter fullscreen mode Exit fullscreen mode

df2 now will have car's who make are Isuzu

df2.basicStat()  

// mean 4458.25
// actual mean 8916.5 (will become clear what this means later ignore for now)

// row 44 and 45
Enter fullscreen mode Exit fullscreen mode

const q = df.query()  // returns a proxy object to emulate indexing q[1] will return row 1 , q[[1, 2, 3]] will return row 1 2 3


let isuzu = q[[44, 1,45]]  // i ran into a problem with q[[44, 45]] so i added row 1 in between, i think query has a problem fetching some rows next to each other(under investigation)

isuzu.splice(1, 1) // remove row 1 not a suzu

isuzu.forEach((row)=> {
    row[row.length -1] = Math.round(4458.25)  // fill all missing values with mean
    df.swap(row[0], row, callback)     // swap the old row with a new row
})


/*

  swap(rowIndex, row(array), callback)

  rowindex - the index in the dataframe 
  row - Array with same columns as df

*/

Enter fullscreen mode Exit fullscreen mode

2.2 audi

function predicate(val){
    if(val === 'audi')
        return true
    else
        return false
}

df.group_by("make", predicate, callback) 


// row 9
Enter fullscreen mode Exit fullscreen mode
df2.basicStat() 

// mean 15307.857142857143

let audi = q[9] // get row 9
audi[audi.length - 1] = Math.round(17859.166666666668) // change price(last col) to mean
df.swap(9, audi, callback)

Enter fullscreen mode Exit fullscreen mode

2.3 porsche

function predicate(val){
    if(val === 'porsche')
        return true
    else
        return false
}

df.group_by("make", predicate, callback) 



// row 129

Enter fullscreen mode Exit fullscreen mode
df2.basicStat()

// mean 25120.4


Enter fullscreen mode Exit fullscreen mode
let porsche = q[129]
porsche[porsche.length - 1] = Math.round(25120.4)
df.swap(129, porsche, callback)
df.group_by("make", predicate, callback)


Enter fullscreen mode Exit fullscreen mode

2.4 analysis



const Hist = window.Hist   // histogram object
let car = new Hist(1, df.getCol("price").data)



/*

note on histograms - they calculate frequency of values in an array given a bin number 

bin - is an interval (bin values that are closer to each other and calculate their frequency)

for example 


[1, 2, 3, 4, 5, 6]

bin of 1, will calculate each value 

bin of 2 will bin values in intervals of two etc

new Hist(bin, array) and returns a hist object with some functionality

hist.smallest(10) will get the lowest 10 values 
hist.largest()  vice versa of smallest



*/
Enter fullscreen mode Exit fullscreen mode

smallest and largest

car.smallest(1)
['4458']

car.largest(1)
['45400']

// getting the least and most expensive car
function predicate(val){
    if(val === 4458 || val === 45400)
        return true
    else
        return false
}

df.group_by("price", predicate, callback)

Enter fullscreen mode Exit fullscreen mode

Oooh shocker I honestly thought porsche was gonna take it, actually did not see that coming(I am exploring this data the first time too, pardon my excitement)., data can shape perceptions :) and prove us wrong, if you allow it to talk, remember we are pretending this is representative of all cars

look at fuel type we at least can tell yet not conclusive, that it is not a good classifier, most and least expensive car both use gas, while the horseporwer though say's something different, you see a simple question is already building up to answer more complex ones and giving us a glimpse into the data(this is impressive considering we do not have a code book)

but wait something is wrong, the mean for isuzu is below the minimum price for it, that is impossible.

what is a mean: sum of all elements/ length

all our means are wrong(we computed even the missing cars as part of the length while they are not contributing with a price they are missing remember) - always question your result's don't try and force what you want them to be, let the data talk

recomputing means

isuzu:




function predicate(val){
    if(val === 'isuzu')
        return true
    else
        return false
}

df.group_by("make", predicate, callback) // getting all suzu's assigning them to df2


function predicate2(val){
    if(val !== 4458)   // removing the previously filled row, 4458 is the wrong isuzu mean
        return true
    else
        return false
}

df2.group_by("price", predicate2, callback) // removing the missing data
df2.basicStat() //  actual correct mean : 8916.5


let isuzu = q[[44, 1,45]]
isuzu.splice(1, 1)

isuzu.forEach((row)=> {
    row[row.length -1] = Math.round(8916.5)
    df.swap(row[0], row, callback)
})



Enter fullscreen mode Exit fullscreen mode

audi


function predicate(val){
    if(val === 'audi')
        return true
    else
        return false
}

df.group_by("make", predicate, callback)

function predicate2(val){
    if(val !== 15308)
        return true
    else
        return false
}

df2.group_by("price", predicate2, callback)

df2.basicStat() // correct mean: 17859.166666666668




let audi = q[9]
audi[audi.length - 1] = Math.round(17859.166666666668)
df.swap(9, audi, callback)
Enter fullscreen mode Exit fullscreen mode

porsche:






function predicate(val){
    if(val === 'porsche')
        return true
    else
        return false
}
df.group_by("make", predicate, callback)

function predicate2(val){
    if(val !== 25120)
        return true
    else
        return false
}

df2.basicStat() // 31400.5



let porsche = q[129]
porsche[porsche.length - 1] = Math.round(31400.5)
df.swap(129, porsche, callback)
Enter fullscreen mode Exit fullscreen mode

correct analysis


const Hist = window.Hist
let car = new Hist(1, df.getCol("price").data)
car.smallest(1) // ['5118']  // least expensive car is a subaru 

car.largest(1) // ['45400']



Enter fullscreen mode Exit fullscreen mode

looking beyond doesn't hurt

car.smallest(3) // ['5118', '5151', '5195']
car.largest(3) // ['40960', '41315', '45400']


function predicate(val){
    if(val === 5118 || val===5151 || val === 5195 || val===40960 || val === 41315 || val === 45400)
        return true
    else
        return false
}

df.group_by("price", predicate, callback)

df2.writeCSV("leastandMostexpensive") // save the answer could becoming handy
Enter fullscreen mode Exit fullscreen mode

findings


most 

mercedes-benz 
bmw 

least

subaru 
chev
mazda



Enter fullscreen mode Exit fullscreen mode

always save progress towards clean data

df.drop(true, callback,"")
df2.writeCSV("clean-automobile")

Enter fullscreen mode Exit fullscreen mode

we can probably use this data to analyze for linear separability

looking at this data we can already classify between a least and most expensive car using horsepower alone(but letting the data speak is probably the best way).

affordability and profitability

probably the most difficult questions to answer given the data we have

for affordability we need to know what that means in the first place, what people regard as affordable, and we do not have such data, secondly automobile dataset does not seem like a frequency of people buying cars, rather comparison of different makes given different models, if we were to calculate the mean of the entire price column we would be calculating the average price of a car, which tells us nothing unless we have data depicting affordability

for profitability too, we need to know more than the price of cars, also the number of people buying those cars, the production cost for each car in each make etc, we can probably try and maybe say given this data the frequency of each make depicts profitability meaning if a make has more cars that means most people tend to buy from that make hence a need for variance in the cars, it could be totally the opposite, let's do a simple histogram and see where it takes us

Hist

let h = new Hist(500, df.getCol('price').data) // bin of 500
Plot.Hist(["blue"], h)

// 6500 - 18500  most frequent makes/cars
// mode between 6500 - 8000
Enter fullscreen mode Exit fullscreen mode

base on our definition(assuming it is correct) the most profitable makes are makes with cars priced between 6500 and 18500, and to specifically zoom in, the mode is between 6500 and 8000, you can already see how unsatisfactory this answer is, what if the high end cars being pricey makes them more profitable even if less people buy them. let's take it further and compare the two groups


function predicate(val){
    if(val >= 20500 && val <= 45000)
        return true
    else
        return false
}

df.group_by("price", predicate, callback)
df2.basicStat()

// mean: 30025.16


function predicate2(val){
    if(val >= 6500 && val <= 18500)
        return true
    else
        return false
}
df.group_by("price", predicate2, callback)
df2.basicStat()


// mean: 11060


Enter fullscreen mode Exit fullscreen mode

although this might look like something it does not tell us anything, the mean 30025 does not tell us how many people actually buy these cars, it might be that the number of people who buys from the other brands cancels the other, because the lower the price more people buy vice versa, the profit margin could be nearly proportional if you think about it,

a simple thing we could have just done is take the cars below and above the mean for the entire price column and compare the two group's mean or calculate the difference, the steps above omitted data, which you should never do, it was a quick example, always use the entire data set if possible, but I don't think it made that much of a difference

so how do we solve this: by data collection and research, we need a definite answer on what makes a car affordable according to buyers given a country for this set and what makes a car profitable according to sellers(this is known as data collection) given a country for this set and secondly we need a codebook, without a codebook we are just guessing.

The last two

classification and regression - these deserve a separate article, where we will look at distributions, Probability mass functions, correlations, variance etc. we answered the first one because it was easy and demonstrated some form of data cleaning and EDA.

the next article will tackle one of the two, where stuff get's interesting.

Summary

EDA and Statistical analysis are fundamental and a prerequisite to building robust models, and answer questions

A codebook is very important.

ask simple questions beside the main question to help guide you in exploring the data and cleaning it in the process(especially if you have no codebook which is bad), exploring the main question having answered few basic one's, is somewhat helpful as we have some clue of what the data is(my opinion)

Conclusion

we did not really do much in this article, upcoming articles will build upon this one.

Top comments (0)