Implementation and Example
Implementing Hadoop Streaming involves setting up a Hadoop cluster and running MapReduce jobs using custom scripts. Here’s a step-by-step example using Python for word count, a classic MapReduce task.
Step 1: Setting Up Hadoop
Ensure Hadoop is installed and configured on your cluster. You can run Hadoop in pseudo-distributed mode for testing or in fully distributed mode for production.
Step 2: Writing the Mapper and Reducer Scripts
Create two Python scripts: one for the mapper and one for the reducer. Make both scripts executable (chmod +x mapper.py reducer.py) so Hadoop can launch them directly.
mapper.py:
#!/usr/bin/env python3
import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit a count of 1 for each word
    for word in words:
        # Write the results to standard output (stdout)
        print(f'{word}\t1')
reducer.py:
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # Convert count (currently a string) to an integer
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently skip this line
        continue
    # Hadoop sorts mapper output by key before it reaches the reducer,
    # so all counts for the same word arrive together
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f'{current_word}\t{current_count}')
        current_count = count
        current_word = word

# Emit the count for the last word (if the input was not empty)
if current_word:
    print(f'{current_word}\t{current_count}')
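Before submitting anything to the cluster, you can sanity-check the two scripts locally with shell pipes. This is only a local approximation, assuming both scripts are in the current directory: sort -k1,1 stands in for the shuffle-and-sort step Hadoop performs between the map and reduce phases.

```shell
# Simulate the map -> shuffle/sort -> reduce pipeline locally.
echo "foo foo quux labs foo bar quux" | python3 mapper.py | sort -k1,1 | python3 reducer.py
```

For the sample line above you should see one tab-separated line per distinct word: bar 1, foo 3, labs 1, quux 2.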
Step 3: Running the Hadoop Streaming Job
Upload the input data to the Hadoop Distributed File System (HDFS):
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put input.txt /user/hduser/input
Run the Hadoop Streaming job:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /user/hduser/input \
-output /user/hduser/output \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py
The -file options ship the local scripts to every node in the cluster; on newer Hadoop releases the equivalent generic option is -files mapper.py,reducer.py.
Step 4: Retrieving the Results
After the job completes, retrieve the output from HDFS:
hadoop fs -get /user/hduser/output/part-00000 /home/hduser
This output file contains the word count results generated by the MapReduce job.
What is the Purpose of Hadoop Streaming?
In the world of big data, processing vast amounts of data efficiently is a crucial task. Hadoop, an open-source framework, has been a cornerstone in managing and processing large data sets across distributed computing environments. Among its various components, Hadoop Streaming stands out as a versatile tool, enabling users to process data using non-Java programming languages. This article delves into the purpose of Hadoop Streaming, its usage scenarios, implementation details, and provides a comprehensive understanding of this powerful tool.
Hadoop Streaming is a utility that allows users to create and run MapReduce jobs using any executable or script as the mapper and/or reducer, instead of Java. It enables the use of various programming languages like Python, Ruby, and Perl for processing large datasets. This flexibility makes it easier for non-Java developers to leverage Hadoop’s distributed computing power for tasks such as log analysis, text processing, and data transformation.
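The contract underlying all of this is deliberately simple: Hadoop feeds each mapper and reducer its input on stdin and collects tab-separated key/value lines from stdout, sorting by key between the two phases. The whole model can be sketched in a few lines of plain Python; this is a local illustration of the data flow, not part of Hadoop itself:

```python
import itertools

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word, like mapper.py does
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: sorting groups identical keys together, which is
    # the guarantee Hadoop provides between the map and reduce phases
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["to be or not to be"])))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Because the contract is just "lines in, lines out", the mapper and reducer can be any executable at all, which is exactly what makes Python, Ruby, or Perl scripts usable in place of Java classes.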