DEV Community

Cover image for How working/install Pig with Notebooks?
Lucas M. Ríos
Lucas M. Ríos

Posted on

How working/install Pig with Notebooks?

🐷📝 Basic commands to work with Pig in Notebooks


🔗Related content

You can find post related in:

📀Google Colab

You can find repo related in:

🐱‍🏍GitHub

You can connect with me in:

🧬LinkedIn


Resume 🧾

I will install Hadoop with Pig program and will use a library of Python to write a job that answer the question, how many row exists by each rating?

First I install Hadoop using same commands that I have used before but without put a number of step.


Install Hadoop 🐘

I use following command but you can change to get current last version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

You would can get other version if you need in: https://downloads.apache.org/hadoop/common/ and later replace it in the before command.


Unzip and copy 🔓

I use following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/


Set up Hadoop's Java ☕

I use following command:

#To find the default Java path and add export in hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
java_home_text_command = f"$ {JAVA_HOME[0]} "
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
Enter fullscreen mode Exit fullscreen mode

Set Hadoop home variables 🏡

I use following command:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text
Enter fullscreen mode Exit fullscreen mode

1st - Install Pig 🐷

I use following command but you can change to get current last version:

!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

You would can get other version if you need in: https://downloads.apache.org/pig/ and later replace it in the before command.


2nd - Unzip and copy 🔓

I use following command:

!tar -xzvf pig-0.17.0.tar.gz


3rd - Set Pig home variables 🏡

I use following command:

# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.1/conf"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"
Enter fullscreen mode Exit fullscreen mode

We can validate installation with command:

!pig -version


4th - Create a folder with HDFS 🌎📂

I use following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig


4.1 - Remove folder with HDFS ♻

Maybe, later you need remove it. To do that you must apply following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig


5th - Getting a dataset to anlyze with Pig 💾

I use a dataset from grouplens. You can get other in:
http://files.grouplens.org/datasets/

This time I use movieslens and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use data extract files. I extract files in path later of -d in command:

!unzip "/content/ml-100k.zip" -d "file:///content/data_pig"

For list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k


6th - Creating process to use Pig with Pig Syntax 🐖

To create job in Pig, you must see structure of dataset to configure jobs.
In this case we print dataset with following command:

!head /content/data_pig/ml-100k/u.data

I can get following information of dataset:

  • First column reference to userID.
  • Second column reference to movieID.
  • Third column reference to rating.
  • Fourth column reference to timestamp.
# Create pig script
%%writefile id.pig
/* id.pig */

student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage(' ')
   as (userId:int, movieId:int, rating:int, timestamp:int);

student_order = ORDER student BY rating DESC;

Dump student_order;
Enter fullscreen mode Exit fullscreen mode

7th - Running the process 🙈

Here we run the process specifing some parameters:

  • Pig file program is id.pig
  • Dataset is in file:///content/data_pig/ml-100k/u.data

When run process, maybe take a few minutes...

You can run script with:

!pig -x local id.pig

But we run script and save results in a file .txt:

!pig -x local id.pig > results.txt


8th - Advancing in the logic of the scripts 😎

Now we will advance in logic of the script to get answer to next questions:

  • What are the oldest 5 star movies?
  • What are the worst movies?

8.1 - Find oldest 5 star movies start ⭐

%%writefile fiveStarMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;
Enter fullscreen mode Exit fullscreen mode

Run script and save results in a file .txt:

!pig -x local fiveStarMovies.pig > fiveStarMovies.txt


8.2 - Find most rated bad movies ⭐

%%writefile BadPopularMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
  AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle;

groupedRating = GROUP ratings by movieID;

avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;  

badMovies = FILTER avgRatings BY avgRating < 2.0;

namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;

results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
          badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;

finalResults = ORDER results BY numRatings DESC;

DUMP finalResults;
Enter fullscreen mode Exit fullscreen mode

Run script and save results in a file .txt:

!pig -x local BadPopularMovies.pig > BadPopularMovies.txt


9th - Say thanks, give like and share if this has been of help/interest 😁🖖


Top comments (0)