Hadoop and Pig Latin

Note that you will do this assignment (Assignment 4) in the groups that you created for your project.
Setup Instructions for your Pig Cluster
Follow the instructions in the MapReduce Instructions document provided in MyCourses.

Useful Links
Hadoop's Pig Latin Documentation

How to get some of the diagnostic info asked for in the questions

You are obsessed, you want more of Pig

Information about the data used in this assignment
The data used for this assignment is from the MovieLens data sets, specifically the ml-latest-small data set. We did some minor transformation and cleanup on the data set so that it fits properly into the schema provided in the assignment. The data set is already loaded for you into HDFS and consists of three files:

/data/movies.csv consists of the following fields:
  movieid, used to uniquely identify a movie
  title (name) of the movie
  year in which the movie was released
/data/moviegenres.csv consists of the following fields:
  movieid
  name of a genre to which movieid belongs. A movie can belong to multiple genres.
/data/ratings.csv consists of the following fields:
  userid, used to uniquely identify a user
  movieid
  rating, a double value
  timestamp, when the rating was done
When the data is loaded, you should define the schema so that the columns can be referred to by name. The datatypes are also defined here to avoid unnecessary casts every time a value is used.
movies = LOAD '/data/movies.csv' USING PigStorage(',') AS (movieid:INT, title:CHARARRAY, year:INT);
ratings = LOAD '/data/ratings.csv' USING PigStorage(',') AS (userid:INT, movieid:INT, rating:DOUBLE, TIMESTAMP);
moviegenres = LOAD '/data/moviegenres.csv' USING PigStorage(',') AS (movieid:INT, genre:CHARARRAY);
(The TIMESTAMP field is left unconverted: no type is declared for it, so Pig treats it as a bytearray.)

Unless explicitly stated otherwise for a particular question, you will let Pig use the default number of maps and reduces that it finds fitting for the script and will not override them. Not conforming to this can result in a point deduction!

Your output should be flattened like the format below.
(Drama,2015,71)
(Drama,2016,18)
(Fantasy,2015,13)
(Fantasy,2016,7)
Below is an example of non-flattened output, which is not acceptable:

((Drama,2015),71)
((Drama,2016),18)
((Fantasy,2015),13)
((Fantasy,2016),7)
Remember that some data values contain '(' in the data itself. For example, '(no genres listed)' is a data value that appears in genre. Do not confuse this with a lack of flattening; such values can be displayed as they are.
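
The difference between the two output shapes above comes from how the grouping key is projected. The following is a minimal sketch on the ratings relation, not a solution to any question; the relation names grouped, nested and flat and the field name numratings are made up for illustration.

```pig
-- Same LOAD statement as given earlier in this handout.
ratings = LOAD '/data/ratings.csv' USING PigStorage(',') AS (userid:INT, movieid:INT, rating:DOUBLE, TIMESTAMP);

-- Grouping on two keys makes the 'group' field a tuple: (movieid, rating).
grouped = GROUP ratings BY (movieid, rating);

-- Nested: projecting 'group' as-is produces ((movieid,rating),count).
nested = FOREACH grouped GENERATE group, COUNT(ratings) AS numratings;

-- Flat: FLATTEN unpacks the key tuple and produces (movieid,rating,count).
flat = FOREACH grouped GENERATE FLATTEN(group) AS (movieid, rating), COUNT(ratings) AS numratings;

-- Compare the two schemas without running a full job.
DESCRIBE nested;
DESCRIBE flat;
```

Only the shape of flat matches the acceptable format shown above.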

TURN IN INSTRUCTIONS: Turn in 11 files
What to turn in is marked in red

Question 0: Name and year of the movies released before 1920? (0 Points)
The goal of this task is to successfully set up and run the Pig cluster. After going through the setup instructions, you will run the file example.pig (provided in the a4 module on MyCourses). This script loads the data from HDFS, selects only the movies that were released before 1920, and outputs their names and years, ordered by year of release. You will run the file by pasting it into the interactive grunt shell.
The script will take a few minutes to run.
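
For orientation, a script matching that description might look roughly like the sketch below. This is an assumption about the general shape, not the contents of the provided example.pig; the intermediate names oldmovies and titles are hypothetical, and only ordertitles is taken from this handout.

```pig
-- Load the movies with the schema given earlier in this handout.
movies = LOAD '/data/movies.csv' USING PigStorage(',') AS (movieid:INT, title:CHARARRAY, year:INT);

-- Keep only the movies released before 1920.
oldmovies = FILTER movies BY year < 1920;

-- Project the two columns of interest.
titles = FOREACH oldmovies GENERATE title, year;

-- Sort by release year and print to the screen.
ordertitles = ORDER titles BY year;
DUMP ordertitles;
```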

At the end, you should see totals for how many records were read in and written out, and some tuples right above the prompt that look like this:

(Trip to the Moon A (Voyage dans la lune Le),1902)
(Birth of a Nation The,1915)
(20 000 Leagues Under the Sea,1916)
(Intolerance: Loves Struggle Throughout the Ages,1916)
(Immigrant The,1917)
(Dogs Life A,1918)
(Billy Blazes Esq.,1919)

Run the following line to see how the results relation was generated:

illustrate ordertitles;

The illustrate command can be a useful tool for you to debug/analyze/verify some of the problems you might encounter as you develop solutions.

You do not need to submit anything for this question.
Question 1: How many movies were released in each year? (10 Points)
Make a copy of the example.pig file to answer this question.

First, group the movies by year and assign it to moviesperyear.

Now write a foreach directive that projects the group attribute of moviesperyear as year and uses the COUNT function to count the number of movies associated with each year, calling the latter nummovies. Assign this to yearcount.

Last, order the tuples in yearcount by year. Store the output in your home directory as 'q1'.

Try to run the script using ILLUSTRATE (see the link on diagnostic info for how to do this). Copy and paste the tables output by illustrate into Q1_illustrate.txt
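
The group, count, order and store steps above follow a common Pig pattern. The sketch below applies the same pattern to a different relation (ratings per user) purely as an illustration; the relation names and the output directory 'example_out' are hypothetical and are not part of what you turn in.

```pig
ratings = LOAD '/data/ratings.csv' USING PigStorage(',') AS (userid:INT, movieid:INT, rating:DOUBLE, TIMESTAMP);

-- Each output tuple has the key in 'group' and a bag named 'ratings'.
ratingsperuser = GROUP ratings BY userid;

-- Rename the key and count the tuples in each bag.
usercount = FOREACH ratingsperuser GENERATE group AS userid, COUNT(ratings) AS numratings;

-- Sort by the key and write the result to a (hypothetical) relative path.
orderedusers = ORDER usercount BY userid;
STORE orderedusers INTO 'example_out';

-- Show sample rows flowing through each step.
ILLUSTRATE orderedusers;
```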

What you need to turn in:
Submit your script as Q1_script.pig and Q1_illustrate.txt .

Question 2: Find the title of all 'Comedy' and 'Sci-Fi' movies from 2015
Start by selecting only those movies that were released in 2015.
Next, find the movieids from moviegenres of movies that belong to either 'Comedy' or 'Sci-Fi'. (Remember that a movie may belong to either or both.)

Join these two on movieid to find the movies of interest to our question.

Project only the names (title) of the movies from this join.

Remove the duplicate titles and sort the output on title.

Dump the results to the screen.

Answer the following questions:

(i)
(a) How many Maps and Reduces are generated in each job?
(b) What does the schema look like just after the join?

(c) How long did the query run?

(ii) Now modify this script to have your join step run with 4 reduce tasks.
(a) How many Maps and Reduces are generated in each job?
(b) How long did the query run?

(c) Is the difference in query execution time what you were expecting? Describe what you were expecting to see and (if that is not what happened in the end) why you think it did not happen.
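
As a hedged illustration of the two mechanics these questions touch on, the sketch below inspects a schema after a join with DESCRIBE and requests a number of reduce tasks for one operator with the PARALLEL clause. It joins movies with ratings rather than the relations used in this question, and the name rated is made up for the example.

```pig
movies = LOAD '/data/movies.csv' USING PigStorage(',') AS (movieid:INT, title:CHARARRAY, year:INT);
ratings = LOAD '/data/ratings.csv' USING PigStorage(',') AS (userid:INT, movieid:INT, rating:DOUBLE, TIMESTAMP);

-- PARALLEL applies only to the reduce phase of this one operator.
rated = JOIN movies BY movieid, ratings BY movieid PARALLEL 4;

-- After a join, fields are disambiguated as relation::field,
-- for example movies::movieid and ratings::movieid.
DESCRIBE rated;
```

The job statistics printed at the end of a run are one place to look for the resulting map and reduce counts and the elapsed time.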

Submit your final script as Q2_script.pig and answers in: pig_answers2.txt.
Question 3: For each genre, how many movies were produced in the years 2015 and 2016? (15 Points)
Output the genre, the year and the number of movies. Order the output by genre and then by year. You can ignore genres that have no movies. Some movies belong to multiple genres; it is OK to count them once in each genre. Print the results to the screen.
Answer the following questions:

(a) What does the schema look like immediately after the group by? Is it nested or flat?
(b) How long did your query run?

Submit your script as Q3_script.pig and answers in: pig_answers3.txt.
Question 4: Find years in which the number of movies produced was less than in the previous year. (15 Points)
Output should contain the year, the number of movies produced that year and the number of movies produced the previous year. You need not consider years for which no previous year's data is available. Use the PigStorage function to generate a CSV file where each line has the year, the number of movies produced that year and the number of movies produced the previous year. The file need not be sorted. Write the query in any way you please. Store the results in HDFS under your home directory as 'q4'.
Submit your script as Q4_script.pig.
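
Writing comma-separated output with PigStorage can be sketched as follows. This is only an illustration of the STORE syntax on an unrelated projection; the relation name titleyear and the output path 'example_csv' are hypothetical.

```pig
movies = LOAD '/data/movies.csv' USING PigStorage(',') AS (movieid:INT, title:CHARARRAY, year:INT);

-- Any relation can be stored; here, a simple two-column projection.
titleyear = FOREACH movies GENERATE year, title;

-- PigStorage(',') writes one comma-separated line per tuple.
-- The target is a directory of part files, not a single file.
STORE titleyear INTO 'example_csv' USING PigStorage(',');
```

A relative path such as the one above is normally resolved against your HDFS home directory, and the STORE fails if the target directory already exists.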

Question 5: Find the 10 movies with the maximum number of user ratings. (15 Points)
Output to the screen the movie title and the number of ratings that it has. Order the output by the number of ratings, with the movies with the maximum number of ratings first.
Submit your script as Q5_script.pig.
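
A "top N by count" result is commonly built from a descending ORDER followed by LIMIT. The sketch below shows that pattern on a different question (the three genres with the most movies); it is an illustration only, and the relation names bygenre, genrecount, ordered and top3 are made up.

```pig
moviegenres = LOAD '/data/moviegenres.csv' USING PigStorage(',') AS (movieid:INT, genre:CHARARRAY);

-- Count how many movies are listed under each genre.
bygenre = GROUP moviegenres BY genre;
genrecount = FOREACH bygenre GENERATE group AS genre, COUNT(moviegenres) AS nummovies;

-- Largest counts first, then keep only the first three tuples.
ordered = ORDER genrecount BY nummovies DESC;
top3 = LIMIT ordered 3;
DUMP top3;
```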

Question 6: Find the 5 Sci-Fi movies from 2015 with the maximum number of user ratings. (15 Points)
Output to the screen the movie title and the number of ratings it has, order them by the number of ratings with movies with maximum number of ratings first. Do not output more than 5 movies.
Submit your script as Q6_script.pig.

Question 7: For all the movies released in 2016, output the movieid, title, number of genres to which the movie belongs and the number of user ratings it has received. (15 Points)
You can ignore movies without any genres and ratings explicitly recorded in the corresponding data sets. Store the results in HDFS under your home directory as 'q7', in CSV format, similar to Q4.
Next, generate the explain output (you do not have to execute the script to do this) for the entire Q7 script and save it locally as Q7_explain.txt. How to do this is described in the link provided for diagnostic info.

Submit your script as Q7_script.pig and Q7_explain.txt .