How to Use Apache Pig Commands in Hadoop

Pig Latin Commands

Commands      Data Types
Load          int
Describe      long
Filter        float
Group         double
Order By      chararray
Distinct      bytearray
Split         boolean
Join          biginteger
Dump          bigdecimal


$ pig -version
sample.txt
1,Alen,50000,devloper
2,Tom,45000,jr.devloper
3,Harry,20000,sales
4,Vicky,50000,devloper
4,Vicky,50000,devloper
5,Ajay,20000,sales
6,Vishal,25000,sales
LOAD
$ pig -x local
grunt> emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:int, role:chararray);
DESCRIBE
grunt> DESCRIBE emp_data;
emp_data: {id: int,name: chararray,salary: int,role: chararray}
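LOAD also works without an AS clause. The following is a minimal sketch, assuming the same sample.txt path as above; the alias names raw and id_name are illustrative and not part of the original walkthrough. Without a schema, fields are referenced by position ($0, $1, ...) and default to bytearray, so the sketch casts them explicitly with FOREACH ... GENERATE.

```pig
-- Load the same file without declaring a schema.
raw = LOAD '/home/user/sample.txt' USING PigStorage(',');

-- Refer to fields by position and cast them to the types we want.
id_name = FOREACH raw GENERATE (int)$0 AS id, (chararray)$1 AS name;

-- Should report id as int and name as chararray.
DESCRIBE id_name;
```

Declaring the schema in LOAD, as in the walkthrough, is usually easier to read because later statements can use field names instead of positions.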
DUMP (Show records)
grunt> DUMP emp_data;
(1,Alen,50000,devloper)
(2,Tom,45000,jr.devloper)
(3,Harry,20000,sales)
(4,Vicky,50000,devloper)
(4,Vicky,50000,devloper)
(5,Ajay,20000,sales)
(6,Vishal,25000,sales)
FILTER
grunt> filter_data = FILTER emp_data BY role == 'devloper';
grunt> DUMP filter_data;
(1,Alen,50000,devloper)
(4,Vicky,50000,devloper)
(4,Vicky,50000,devloper)
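FILTER conditions can be combined with AND, OR, and NOT. As a small sketch on the same sample.txt (the alias sales_high is an illustrative name), this keeps only sales staff earning more than 20000:

```pig
emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',')
           AS (id:int, name:chararray, salary:int, role:chararray);

-- Both conditions must hold for a record to be kept.
sales_high = FILTER emp_data BY role == 'sales' AND salary > 20000;

DUMP sales_high;
-- Expected for the sample data:
-- (6,Vishal,25000,sales)
```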
GROUP BY
grunt> grp_data = GROUP emp_data BY role;
grunt> DUMP grp_data;
(sales,{(6,Vishal,25000,sales),(5,Ajay,20000,sales),(3,Harry,20000,sales)})
(devloper,{(4,Vicky,50000,devloper),(4,Vicky,50000,devloper),(1,Alen,50000,devloper)})
(jr.devloper,{(2,Tom,45000,jr.devloper)})
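GROUP only collects records into bags; it does not aggregate them. A common follow-up, sketched here under the assumption that the same sample.txt is loaded, is to run FOREACH over the grouped relation and apply built-in functions such as COUNT and MAX. The alias role_stats and the output field names are illustrative.

```pig
emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',')
           AS (id:int, name:chararray, salary:int, role:chararray);

grp_data = GROUP emp_data BY role;

-- 'group' holds the grouping key; the bag is named after the input relation.
role_stats = FOREACH grp_data GENERATE group AS role,
                                       COUNT(emp_data) AS employees,
                                       MAX(emp_data.salary) AS max_salary;

DUMP role_stats;
-- Expected for the sample data (row order may vary):
-- (sales,3,25000)
-- (devloper,3,50000)
-- (jr.devloper,1,45000)
```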
ORDER BY (Sort by salary)
grunt> sorted_data = ORDER emp_data BY salary ASC;
grunt> DUMP sorted_data;
(5,Ajay,20000,sales)
(3,Harry,20000,sales)
(6,Vishal,25000,sales)
(2,Tom,45000,jr.devloper)
(4,Vicky,50000,devloper)
(4,Vicky,50000,devloper)
(1,Alen,50000,devloper)
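Records with equal salaries can appear in any order relative to each other in the output above. One way to make the result predictable, sketched here with an illustrative alias by_salary_desc, is to sort in descending order and add a second sort key:

```pig
emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',')
           AS (id:int, name:chararray, salary:int, role:chararray);

-- Highest salary first; ties are broken alphabetically by name.
by_salary_desc = ORDER emp_data BY salary DESC, name ASC;

DUMP by_salary_desc;
-- Expected for the sample data:
-- (1,Alen,50000,devloper)
-- (4,Vicky,50000,devloper)
-- (4,Vicky,50000,devloper)
-- (2,Tom,45000,jr.devloper)
-- (6,Vishal,25000,sales)
-- (5,Ajay,20000,sales)
-- (3,Harry,20000,sales)
```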
DISTINCT (Remove duplicate records)
grunt> distinct_data = DISTINCT emp_data;
grunt> DUMP distinct_data;
(1,Alen,50000,devloper)
(2,Tom,45000,jr.devloper)
(3,Harry,20000,sales)
(4,Vicky,50000,devloper)
(5,Ajay,20000,sales)
(6,Vishal,25000,sales)
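DISTINCT compares whole records, not individual fields. To list the unique values of a single field, one approach is to project that field first and then apply DISTINCT. The sketch below assumes the same sample.txt; the aliases roles and unique_roles are illustrative.

```pig
emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',')
           AS (id:int, name:chararray, salary:int, role:chararray);

-- Keep only the role field, then drop repeated values.
roles = FOREACH emp_data GENERATE role;
unique_roles = DISTINCT roles;

DUMP unique_roles;
-- Expected for the sample data (row order may vary):
-- (devloper)
-- (jr.devloper)
-- (sales)
```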
SPLIT
grunt> SPLIT emp_data INTO emp1 IF salary > 25000, emp2 IF salary < 25000;
grunt> DUMP emp1;
(1,Alen,50000,devloper)
(2,Tom,45000,jr.devloper)
(4,Vicky,50000,devloper)
(4,Vicky,50000,devloper)
grunt> DUMP emp2;
(3,Harry,20000,sales)
(5,Ajay,20000,sales)
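Note that the record (6,Vishal,25000,sales) appears in neither emp1 nor emp2, because a salary of exactly 25000 satisfies neither condition. SPLIT silently drops records that match no condition. A minimal sketch that covers every salary, using illustrative aliases high_paid and others:

```pig
emp_data = LOAD '/home/user/sample.txt' USING PigStorage(',')
           AS (id:int, name:chararray, salary:int, role:chararray);

-- Using <= on the second branch means every record lands in one relation.
SPLIT emp_data INTO high_paid IF salary > 25000, others IF salary <= 25000;

DUMP others;
-- Expected for the sample data (row order may vary):
-- (3,Harry,20000,sales)
-- (5,Ajay,20000,sales)
-- (6,Vishal,25000,sales)
```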
JOIN
Create two tables: customer and orders
customer.txt
1,Ajay,Mumbai
2,Harry,Banglore
3,Aakash,Pune
4,Rahul,Delhi
Columns: customer id, customer name, customer city
grunt> customer = LOAD '/home/user/customer.txt' USING PigStorage(',') AS (cid:int, name:chararray, city:chararray);
grunt> DUMP customer;
(1,Ajay,Mumbai)
(2,Harry,Banglore)
(3,Aakash,Pune)
(4,Rahul,Delhi)
orders.txt
001,1,4000
002,4,5000
003,5,2000
004,2,8000
005,6,9000
Columns: order id, customer id, amount
grunt> orders = LOAD '/home/user/orders.txt' USING PigStorage(',') AS (oid:int, cid:int, amt:int);
grunt> DUMP orders;
(1,1,4000)
(2,4,5000)
(3,5,2000)
(4,2,8000)
(5,6,9000)
grunt> join_data = JOIN customer BY cid, orders BY cid;
grunt> DUMP join_data;
(1,Ajay,Mumbai,1,1,4000)
(2,Harry,Banglore,4,2,8000)
(4,Rahul,Delhi,2,4,5000)
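The joined relation contains every field from both inputs, so cid appears twice. To keep only selected fields, the field names can be qualified with the relation name and the :: operator. The following sketch assumes the customer.txt and orders.txt files above; the alias customer_amounts is illustrative.

```pig
customer = LOAD '/home/user/customer.txt' USING PigStorage(',')
           AS (cid:int, name:chararray, city:chararray);
orders = LOAD '/home/user/orders.txt' USING PigStorage(',')
         AS (oid:int, cid:int, amt:int);

join_data = JOIN customer BY cid, orders BY cid;

-- cid exists in both inputs, so qualify it as relation::field.
customer_amounts = FOREACH join_data GENERATE customer::cid AS cid,
                                              customer::name AS name,
                                              orders::amt AS amt;

DUMP customer_amounts;
-- Expected for the sample data (row order may vary):
-- (1,Ajay,4000)
-- (2,Harry,8000)
-- (4,Rahul,5000)
```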
LEFT OUTER JOIN
grunt> left_outer = JOIN customer BY cid LEFT OUTER, orders BY cid;
grunt> DUMP left_outer;
(1,Ajay,Mumbai,1,1,4000)
(2,Harry,Banglore,4,2,8000)
(3,Aakash,Pune,,,)
(4,Rahul,Delhi,2,4,5000)
All records from the left relation (customer) are kept. Where no matching order exists, the fields from the right relation (orders) are null, which DUMP shows as empty values.
RIGHT OUTER JOIN
grunt> right_outer = JOIN customer BY cid RIGHT OUTER, orders BY cid;
grunt> DUMP right_outer;
(1,Ajay,Mumbai,1,1,4000)
(2,Harry,Banglore,4,2,8000)
(4,Rahul,Delhi,2,4,5000)
(,,,3,5,2000)
(,,,5,6,9000)
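Pig also supports FULL OUTER, which keeps unmatched records from both sides. It is not part of the original walkthrough; the sketch below assumes the same two files and uses the illustrative alias full_outer.

```pig
customer = LOAD '/home/user/customer.txt' USING PigStorage(',')
           AS (cid:int, name:chararray, city:chararray);
orders = LOAD '/home/user/orders.txt' USING PigStorage(',')
         AS (oid:int, cid:int, amt:int);

-- Unmatched customers and unmatched orders are both kept, padded with nulls.
full_outer = JOIN customer BY cid FULL OUTER, orders BY cid;

DUMP full_outer;
-- Expected for the sample data (row order may vary):
-- (1,Ajay,Mumbai,1,1,4000)
-- (2,Harry,Banglore,4,2,8000)
-- (3,Aakash,Pune,,,)
-- (4,Rahul,Delhi,2,4,5000)
-- (,,,3,5,2000)
-- (,,,5,6,9000)
```

The result is the union of the LEFT OUTER and RIGHT OUTER outputs shown above.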
grunt> QUIT;

Questions
1. What command is used to load a dataset into Pig from a text file?
2. How can you display the structure (schema) of a Pig relation?
3. Which command is used to display all records stored in a relation?
4. What is the purpose of the FILTER command in Pig?
5. How does the GROUP BY command work in Pig?
6. Which command helps in sorting data based on a specific field like salary?
7. What does the DISTINCT command do in Apache Pig?
8. How can you divide a dataset into two or more parts based on a condition?
9. What is the difference between JOIN and LEFT OUTER JOIN in Pig?
10. What happens when you perform a RIGHT OUTER JOIN in Pig?

Multiple-Choice Questions (MCQ) with Answers
1. Which command is used to view all the records in a Pig relation?
a) DESCRIBE
b) DUMP
c) FILTER
d) LOAD
✅ Answer: b) DUMP
2. Which command defines how the input data is read into Pig?
a) LOAD
b) DESCRIBE
c) GROUP
d) SPLIT
✅ Answer: a) LOAD
3. What does the DESCRIBE command display?
a) Data output
b) Schema of the relation
c) Only column names
d) Record count
✅ Answer: b) Schema of the relation
4. Which command is used to remove duplicate records?
a) GROUP
b) FILTER
c) DISTINCT
d) ORDER
✅ Answer: c) DISTINCT
5. What is the correct Pig Latin command to group data by a field?
a) group_data = GROUP emp_data BY role;
b) group_data = FILTER emp_data BY role;
c) group_data = SORT emp_data BY role;
d) group_data = DISTINCT emp_data BY role;
✅ Answer: a) group_data = GROUP emp_data BY role;
6. Which command splits data into two relations based on a condition?
a) FILTER
b) SPLIT
c) DISTINCT
d) JOIN
✅ Answer: b) SPLIT
7. Which command sorts data in ascending or descending order?
a) SORT
b) ORDER
c) FILTER
d) GROUP
✅ Answer: b) ORDER
8. What happens in a LEFT OUTER JOIN?
a) Only matching records from both datasets are displayed
b) Only unmatched records from both datasets are displayed
c) All records from the left dataset are shown, even if unmatched
d) All records from the right dataset are shown, even if unmatched
✅ Answer: c) All records from the left dataset are shown, even if unmatched
9. In Pig, which of the following joins includes all records from the right table even if unmatched?
a) INNER JOIN
b) LEFT OUTER JOIN
c) RIGHT OUTER JOIN
d) FULL JOIN
✅ Answer: c) RIGHT OUTER JOIN
10. Which of the following data types is supported in Pig?
a) int, chararray, float, double
b) integer, string, real, boolean
c) number, varchar, long, decimal
d) real, string, int, array
✅ Answer: a) int, chararray, float, double
