DEV Community

Cover image for LOADING DATA FOR PROCESSING IN PYTHON...
Naftal Rainer
Naftal Rainer

Posted on

LOADING DATA FOR PROCESSING IN PYTHON...

According to Wikipedia, Data mining is an important process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
It is, therefore, necessary in any machine learning convention to get the acquired data into our system. The data might be a log file, dataset file, or database which may be obtained from different multiple sources.
In this article, we’ll look at the different methods of loading different formatted data from a variety of sources, including CSV files and SQL databases.
We’ll majorly focus on python pandas extensive library methods for loading external data as well as loading scikit – learn and seaborn’s inbuilt datasets.

Loading a sample dataset.

This involves loading data from preexisting datasets either in the scikit learn package or the seaborn package. The packages embeds some small toy datasets. The datasets are refered to as “toy” datasets because they are smaller and cleaner than the datasets we come across in the real world.

The sklearn.datasets comes with a few small standard datasets that do not require downloading any file from some external website.

Sample datasets from sklearn include:
load_boston() - Load and return the Boston house-prices dataset. It contains 503 observations on Boston housing prices and a good dataset for exploring regression algorithms.

load_iris() - Load and return the iris dataset.It contains 150 observations on the measurements of Iris flowers. It is a good data‐ set for exploring classification algorithms.

load_diabetes() - Load and return the diabetes dataset (regression).

load_digits([n_class]) - Load and return the digits dataset (classification). load_digits Contains 1,797 observations from images of handwritten digits. It is a good data‐ set for teaching image classification.

load_linnerud() - Load and return the linnerud dataset (multivariate regression).

# Load scikit-learn's datasets 
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()
Enter fullscreen mode Exit fullscreen mode

To load Iris dataset use:

from sklearn.datasets import load_iris
data = load_iris()
data.target[[10, 25, 50]]
Enter fullscreen mode Exit fullscreen mode

Seaborn on the other hand comes with a few important datasets in the library. When Seaborn is installed, the datasets download automatically.Some of the datasets have a small amount of preprocessing applied to define a proper ordering for categorical variables.

To see a list of available datasets use get_dataset_names().

# Import the seaborn module
import seaborn as sns
sns.get_dataset_names()
Enter fullscreen mode Exit fullscreen mode

The above line of code will generate the following output −

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 
'diamonds', 'dots', 'exercise', 
'flights', 'fmri', 'gammas', 'geyser', 'iris', 
'mpg', 'penguins', 'planets', 'tips', 'titanic']
Enter fullscreen mode Exit fullscreen mode

Seaborn contains relational (table-format) datasets.
With the help of the load_dataset() function you can load the required dataset.

import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
print df.head()
Enter fullscreen mode Exit fullscreen mode

The above line of code will generate the following output −

   total_bill  tip   sex    smoker day  time   size
0    16.99    1.01   Female  No    Sun  Dinner  2
1    10.34    1.66   Male    No    Sun  Dinner  3
2    21.01    3.50   Male    No    Sun  Dinner  3
3    23.68    3.31   Male    No    Sun  Dinner  2
4    24.59    3.61   Female  No    Sun  Dinner  4
Enter fullscreen mode Exit fullscreen mode

Loading a File Using the Pandas Library.

Pandas is a diverse data analysis library that offers various functions for reading files of different formats into our script for data manipulation.
The various reader functions include:

  • read_csv
  • read_excel
  • read_hdf
  • read_sql
  • read_json
  • read_html
  • read_stata
  • read_clipboard
  • read_pickle
  • read_msgpack
  • read_gbq

CSV and Textual Files
CSV (comma-separated values) is data that is generally reported in tabular form and has values in a row separated by a comma.
Text files also contains tabular data separated by spaces or tabs and stored in a file with a .txt extension.
types (generally with the extension .txt).
For this,pandas provides a set of functions specific for this to load a local or hosted CSV.

  • read_csv
  • read_table

CSV files may use other characters as separators, like tabs. Panda's sep = ' ' parameter allows us to define the delimiter used in the file.

# Load library
import pandas as pd

# Create URL
url = 'https://tinyurl.com/titanic-csv'

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)
Enter fullscreen mode Exit fullscreen mode

The above line of code will generate the following output −

    PassengerId Pclass  Name    Sex Age SibSp   Parch   Ticket  Fare    Cabin   Embarked
0   892 3   Kelly, Mr. James    male    34.5    0   0   330911  7.8292  NaN Q
1   893 3   Wilkes, Mrs. James (Ellen Needs)    female  47.0    1   0   363272  7.0000  NaN S
2   894 2   Myles, Mr. Thomas Francis   male    62.0    0   0   240276  9.6875  NaN Q
3   895 3   Wirz, Mr. Albert    male    27.0    0   0   315154  8.6625  NaN S
4   896 3   Hirvonen, Mrs. Alexander (Helga E Lindqvist)    female  22.0    1   1   3101298 12.2875 NaN S
Enter fullscreen mode Exit fullscreen mode

N/B : For a locally stored csv file, the url is the referenced location of the file in your working directory. For example if the csv file (named sample_1) is located in 'c:\users\rainer\desktop' location, then the url to be used will be:

# Create url
url = 'c:\\users\\rainer\\desktop\\sample_1.csv'

# Load dataset
dataframe = pd.read_csv(url)
Enter fullscreen mode Exit fullscreen mode

CSV files are tabulated data in
which the values on the same column are separated by commas. But since CSV files are considered text files,
you can also use the read_table() function while specifying the delimiter.

# Create url
url = 'c:\\users\\rainer\\desktop\\sample_1.csv'

# Load dataset
dataframe = pd.read_table(url, sep = ',')
Enter fullscreen mode Exit fullscreen mode
Example

Create a small csv file in your working directory and save it as sample_1.csv

sample_1.csv

item, white,blue,red,yellow
ruler,1,3,5,2
cup,2,5,9,2
pen,6,1,4,0
book,4,1,2,1

# Load dataset
dataframe = pd.read_csv('sample_1.csv')
print(dataframe)
Enter fullscreen mode Exit fullscreen mode

This will generate:

  item   white  blue  red  yellow
0  ruler       1     3    5       2
1    cup       2     5    9       2
2    pen       6     1    4       0
3   book       4     1    2       1
Enter fullscreen mode Exit fullscreen mode

Loading Data from Microsoft Excel Files
The excel spreadsheet contains data in tabular form. Pandas provides the read_excel() function to load data of this format (.xls and .xlsx).
This can be achieved through pd.read_excel('data.xls') which returns the DataFrame composed of the data tabulated in the spreadsheets. However, if you need to load the data in a spreadsheet, then specify the name of the specific sheet or the number of the sheet (index) just as the second argument as shown
pd.read_excel('data.xls','Sheet2') or pd.read_excel('data.xls',1)
To do it practically, open an Excel file and enter the data shown in the figure below:

Sheet1

Book1.xlsx Sheet1

Sheet2

Book1.xlsx Sheet2

After entering the data into a spreadsheet, save and load onto jupyter notebook or preferred python manipulation tool using:

import pandas as pd

# Load file into Data variable
Data = pd.read_excel("Book1.xlsx","Sheet1")

# Print 
Print(Data.head())
Enter fullscreen mode Exit fullscreen mode

The output generated is

NAME     AGE BEST COLOR  BOOK NO.  PAGE
0     Edwin   5yrs      Green         7  2404
1  Muuo Ian   6yrs     Purple         8  2405
2      Joel   5yrs     Orange         9  2406
3      Mush   5yrs      Red          10  2407
4  Nyandeng  10yrs      Green        11  2408
Enter fullscreen mode Exit fullscreen mode

To view other sheets available, just specify the sheet number.

import pandas as pd

# Load file into Data variable
Data = pd.read_excel("Book1.xlsx","Sheet2")

# Print 
Print(Data.head())
Enter fullscreen mode Exit fullscreen mode

The output generated is

NAME    AGE BEST COLOR BOOK NO.  PAGE
0       Fred  10yrs     Yellow       25  2410
1     Bianca  13yrs       Blue       26  2412
2     Martin   6yrs      Green       27  2413
3  Genevieve   3yrs     Violet       28  2414
4      Karen  12yrs     Yellow       29  2415
Enter fullscreen mode Exit fullscreen mode

CSV and Excel files are the most common data file formats used though there are so many formats available. Stay put for the next article which will cover the loading of data in the JSON format and from a database using Pandas Library.

Untill then, goodbye 👋👋👋. Happy Coding Week Ahead!!!

Top comments (0)