1. Install Python
sudo apt-get update
sudo apt-get install python3 -y
2. Hadoop Streaming Jar
Hadoop runs Python programs through the Hadoop Streaming API, which passes data to the mapper and reducer over standard input and output.
The streaming jar ships with Hadoop, at a path like this:
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
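Because the jar file name includes the Hadoop version, it can help to confirm which file the wildcard matches. The following is a minimal sketch, assuming the directory layout shown above; the function name find_streaming_jars is made up for illustration, and /usr/local/hadoop is only a fallback for when HADOOP_HOME is not set.

```python
import glob
import os


def find_streaming_jars(hadoop_home):
    """Return the sorted paths that match the streaming jar pattern under hadoop_home."""
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Assumption: fall back to /usr/local/hadoop when HADOOP_HOME is unset.
    home = os.environ.get("HADOOP_HOME", "/usr/local/hadoop")
    jars = find_streaming_jars(home)
    if jars:
        for jar in jars:
            print(jar)
    else:
        print(f"No streaming jar found under {home}")
```

If nothing is printed except the "not found" message, the Hadoop installation directory is probably different from the one assumed here.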
3. Environment Variables
Make sure Hadoop and Python resolve to the correct paths. Open .bashrc in an editor:
nano ~/.bashrc
Add the following lines, adjusting the paths to match your installation:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
export PYTHON_BIN=/usr/bin/python3
Save the file, then reload it so the changes take effect:
source ~/.bashrc
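After reloading the shell, a quick way to confirm that the exported variables are visible to new processes can be sketched in Python. This is only an illustrative check; the names REQUIRED_VARS and missing_variables are hypothetical and not part of Hadoop.

```python
import os

# The three variables exported in .bashrc above.
REQUIRED_VARS = ("HADOOP_HOME", "STREAMING_JAR", "PYTHON_BIN")


def missing_variables(env):
    """Return the names from REQUIRED_VARS that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        for name in REQUIRED_VARS:
            print(f"{name}={os.environ[name]}")
```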
Example (Word Count with Python in Hadoop)
1. Input Data
echo "hello hadoop hadoop streaming with python" > input.txt
echo "hello world from python mapreduce" >> input.txt
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put input.txt /user/hadoop/input/
2. Mapper Script (mapper.py)
#!/usr/bin/env python3
import sys

# Emit one "word<TAB>1" record for every word read from standard input.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
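To see what the mapper emits for a single line, the same logic can be wrapped in a function and called directly. This is a sketch for experimentation only; map_line is a hypothetical name, and the real mapper.py reads from standard input as shown above.

```python
def map_line(line):
    """Yield one 'word<TAB>1' record per whitespace-separated word in line."""
    for word in line.strip().split():
        yield f"{word}\t1"


if __name__ == "__main__":
    # Print the records for the first line of the sample input.
    for record in map_line("hello hadoop hadoop streaming with python"):
        print(record)
```

Note that the mapper does no counting itself: a repeated word such as hadoop simply produces two separate records, and adding them up is left to the reducer.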
3. Reducer Script (reducer.py)
#!/usr/bin/env python3
import sys

current_word = None
count = 0

# Input arrives sorted by word, so identical words are adjacent.
for line in sys.stdin:
    word, value = line.strip().split("\t")
    value = int(value)
    if word == current_word:
        count += value
    else:
        # A new word begins: emit the total for the previous one.
        if current_word:
            print(f"{current_word}\t{count}")
        current_word = word
        count = value

# Emit the total for the last word.
if current_word:
    print(f"{current_word}\t{count}")
Make both scripts executable:
chmod +x mapper.py reducer.py
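The reducer relies on its input being sorted by word. As an alternative sketch (not the script above), the same grouping can be written with itertools.groupby from the standard library; reduce_sorted is a hypothetical name used only for this illustration.

```python
from itertools import groupby


def reduce_sorted(records):
    """Sum the counts of consecutive 'word<TAB>count' records that share a word.

    The records must already be sorted by word, which is the order in which
    the reducer receives them.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    sample = ["hadoop\t1", "hadoop\t1", "hello\t1"]
    for word, total in reduce_sorted(sample):
        print(f"{word}\t{total}")
```

If the records are not sorted, the same word shows up in several separate groups and is reported more than once, which is why the sort step matters.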
4. Job Run (Hadoop Streaming)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /user/hadoop/input \
-output /user/hadoop/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
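When the job is launched from a script rather than typed by hand, the same command can be assembled as an argument list. The sketch below only builds and prints the command, using the options shown above; build_streaming_command and the jar path are placeholders made up for illustration.

```python
import shlex


def build_streaming_command(jar, input_dir, output_dir,
                            mapper="mapper.py", reducer="reducer.py"):
    """Return the argument list for the streaming job, using the options shown above."""
    return [
        "hadoop", "jar", jar,
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
        "-file", mapper,
        "-file", reducer,
    ]


if __name__ == "__main__":
    # Placeholder jar path: without a shell the * wildcard is not expanded,
    # so a real run needs the full file name of the streaming jar.
    cmd = build_streaming_command(
        "/path/to/hadoop-streaming.jar", "/user/hadoop/input", "/user/hadoop/output"
    )
    print(shlex.join(cmd))
    # To actually launch the job: import subprocess; subprocess.run(cmd, check=True)
```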
5. Output Check
hdfs dfs -cat /user/hadoop/output/part-00000
Output
from 1
hadoop 2
hello 2
mapreduce 1
python 2
streaming 1
with 1
world 1
Example (Word Count without Hadoop)
1. Create the Input File
echo "hello hadoop hadoop streaming with python" > input.txt
echo "hello world from python mapreduce" >> input.txt
2. Mapper (mapper.py)
#!/usr/bin/env python3
import sys

# Emit one "word<TAB>1" record for every word read from standard input.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
3. Reducer (reducer.py)
#!/usr/bin/env python3
import sys

current_word = None
count = 0

# Input arrives sorted by word, so identical words are adjacent.
for line in sys.stdin:
    word, value = line.strip().split("\t")
    value = int(value)
    if word == current_word:
        count += value
    else:
        # A new word begins: emit the total for the previous one.
        if current_word:
            print(f"{current_word}\t{count}")
        current_word = word
        count = value

# Emit the total for the last word.
if current_word:
    print(f"{current_word}\t{count}")
4. Local Execution (Simulating Hadoop with a Shell Pipe)
cat input.txt | python3 mapper.py | sort | python3 reducer.py
5. Expected Output
from 1
hadoop 2
hello 2
mapreduce 1
python 2
streaming 1
with 1
world 1
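The map, sort, and reduce stages of the local pipeline can also be reproduced inside a single Python process, which is handy for checking the logic without any shell tools. This is a minimal sketch on the two sample lines; word_count and LINES are names chosen for this illustration.

```python
from itertools import groupby

# The two lines written to input.txt in step 1.
LINES = [
    "hello hadoop hadoop streaming with python",
    "hello world from python mapreduce",
]


def word_count(lines):
    """Run map, sort, and reduce in memory and return (word, count) pairs."""
    # Map: one (word, 1) pair per word, as mapper.py emits.
    pairs = [(word, 1) for line in lines for word in line.strip().split()]
    # Sort: stands in for the sort command between mapper and reducer.
    pairs.sort(key=lambda pair: pair[0])
    # Reduce: sum the counts of adjacent pairs that share a word.
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    for word, total in word_count(LINES):
        print(f"{word}\t{total}")
```

For the sample input this yields the same eight words and counts as the expected output listed above.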
