Big Data on a Shoestring
Author: Nicholas Bessmer
hadoop dfs -cat /test_dir/myfile
A few words to test
And Hadoop is running! Remember the Linux tips:
» cd means change to a directory
» Linux uses forward slashes rather than backslashes
» Unless you set your path, you will need to change (cd) to this directory: /home/ec2-user/hadoop-0.20.2/bin or wherever you copied Hadoop to. You will need to run the commands as follows:
./hadoop dfs -ls
Pig is described by the Apache foundation as:
Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. A Pig Latin program consists of a directed acyclic graph where each node represents an operation that transforms data. Operations are of two flavors: (1) relational-algebra style operations such as join, filter, project; (2) functional-programming style operators such as map, reduce.
Pig compiles these dataflow programs into (sequences of) map-reduce jobs and executes them using Hadoop. It is also possible to execute Pig Latin programs in a "local" mode (without Hadoop cluster), in which case all processing takes place in a single local JVM.
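To make the compilation target concrete, here is a toy Python sketch (not Pig itself) of the map, group/shuffle, reduce pattern that Pig Latin programs are compiled down to. The data and names are invented for illustration only.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration of the map -> group -> reduce pattern
# that a Pig Latin dataflow compiles down to.
records = ["cat hat", "cat", "hat hat"]

# Map: emit (word, 1) pairs.
mapped = [(w, 1) for line in records for w in line.split()]

# Shuffle/group: sort so equal keys are adjacent, then group.
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the counts for each key.
counts = {key: sum(v for _, v in pairs) for key, pairs in grouped}
print(counts)  # {'cat': 2, 'hat': 3}
```

In local mode Pig performs all of these phases inside a single JVM; on a cluster the map and reduce phases run as distributed Hadoop tasks.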
This script queries the Excite search engine log file and uses Hadoop to check the frequency of search phrases. Please be aware this will take some time to run!
The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.
The output file will report the following fields and basic statistics: hour, ngram, score, count, mean.
Run the following command:
cd /home/ec2-user/pig-0.10.1/tutorial/pigtmp
And this command:
pig ../scripts/script1-hadoop.pig
You will see a lot of processing information dumped to the screen like:
2013-02-09 00:07:11,446 [Thread-5] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2013-02-09 00:07:11,454 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-02-09 00:07:11,454 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-02-09 00:07:11,456 [Thread-5] INFO
2013-02-09 00:07:16,614 [Thread-14] INFO org.apache.hadoop.mapred.MapTask - kvstart = 0; kvend = 262144; length = 327680
2013-02-09 00:07:20,808 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2013-02-09 00:07:25,772 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
2013-02-09 00:07:30,276 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner -
These informational messages can be filtered out. At the end of the processing you should see something like:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.10.1 ec2-user 2013-02-09 00:07:06 2013-02-09 00:17:09 HASH_JOIN,GROUP_BY,DISTINCT,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_local_0001 1 1 n/a n/a n/a n/a n/a n/a clean1,clean2,houred,ngramed1,raw DISTINCT
job_local_0002 2 1 n/a n/a n/a n/a n/a n/a hour_frequency1,hour_frequency2 GROUP_BY,COMBINER
job_local_0003 2 1 n/a n/a n/a n/a n/a n/a hour00,hour12,hour_frequency3,same,same1 HASH_JOIN file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/script2-hadoop-results,
Input(s):
Successfully read 0 records from: "file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/excite.log.bz2"
Output(s):
Successfully stored 0 records in: "file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/script2-hadoop-results"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003
Look at the Results
It is necessary to run some final commands to look at the results of the Pig Latin script above:
» cd /home/ec2-user/pig-0.10.1/tutorial/pigtmp
» hadoop fs -ls script1-hadoop-results
» hadoop fs -cat 'script1-hadoop-results/*' | less
This will show the results in a very simple format, basically raw data points. Using tools such as the open source Pentaho Community Edition BI Suite, you can create reports against Big Data, including data produced with Pig Latin. Please see:
We reviewed the differences between Hadoop and Cassandra and went through an exercise of setting up and using our very first Hadoop Big Data Analytics implementation. Now that you are familiar with the steps (and costs), you can move forward with implementing Big Data within your own business or organization.
One final note: Cassandra and Hadoop can co-exist! This site will tell you more:
http://wiki.apache.org/cassandra/HadoopSupport
See http://pig.apache.org/docs/r0.7.0/tutorial.html
The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.
The script is shown here:
Register the tutorial JAR file so that the included UDFs can be called in the script.
REGISTER ./tutorial.jar;
Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields user, time, and query.
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
Call the NonURLDetector UDF to remove records if the query field is empty or a URL.
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
Call the ToLower UDF to change the query field to lowercase.
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field.
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
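Since the timestamp format is fixed-width YYMMDDHHMMSS, the hour is simply the two characters at positions 6 and 7. A minimal Python stand-in for what the ExtractHour UDF does (the real UDF lives in tutorial.jar):

```python
# Stand-in for the ExtractHour UDF: the Excite log timestamp
# format is YYMMDDHHMMSS, so the hour (HH) is characters 6-7.
def extract_hour(timestamp: str) -> str:
    return timestamp[6:8]

print(extract_hour("970916083452"))  # 08
```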
Call the NGramGenerator UDF to compose the n-grams of the query.
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
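As a rough illustration, here is a word-level n-gram generator in Python. The real NGramGenerator UDF is part of tutorial.jar; the maximum n-gram length used here (2) is an assumption for the example.

```python
# Illustrative word-level n-gram generator (the real NGramGenerator
# UDF is in tutorial.jar; max_n = 2 is an assumption).
def ngrams(query: str, max_n: int = 2) -> list[str]:
    words = query.split()
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out

print(ngrams("cheap flights paris"))
# ['cheap', 'flights', 'paris', 'cheap flights', 'flights paris']
```

In the Pig script, flatten turns the bag of n-grams produced for each query into one output record per n-gram.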
Use the DISTINCT operator to get the unique n-grams for all records.
ngramed2 = DISTINCT ngramed1;
Use the GROUP operator to group records by n-gram and hour.
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
Use the COUNT function to get the count (occurrences) of each n-gram.
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
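The GROUP-then-COUNT pair behaves like a tally over (ngram, hour) keys. A small Python analogue with made-up records:

```python
from collections import Counter

# Rough analogue of GROUP ngramed2 BY (ngram, hour) followed by
# COUNT: tally occurrences of each (ngram, hour) pair.
# Records are (user, hour, ngram), mirroring ngramed1's schema.
rows = [
    ("user1", "07", "cheap"),
    ("user2", "07", "cheap"),
    ("user3", "12", "cheap"),
]
hour_frequency = Counter((ngram, hour) for _, hour, ngram in rows)
print(hour_frequency[("cheap", "07")])  # 2
```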
Use the GROUP operator to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
For each group, identify the hour in which this n-gram is used with a particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
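A sketch of a "popularity" score in the spirit of ScoreGenerator: for a single n-gram, score each hour by how far its count sits above the mean hourly count, in standard deviations (a z-score). The UDF's exact formula is an assumption here; see tutorial.jar for the real implementation.

```python
from statistics import mean, pstdev

# Hypothetical popularity score: z-score of each hour's count
# against that n-gram's mean hourly count.
def hour_scores(hour_counts: dict[str, int]) -> dict[str, float]:
    counts = list(hour_counts.values())
    mu, sigma = mean(counts), pstdev(counts)
    return {h: (c - mu) / sigma for h, c in hour_counts.items()}

s = hour_scores({"07": 10, "08": 2, "09": 3})
```

An hour with a strongly positive score is one where the n-gram was searched far more often than is typical for that n-gram.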
Use the FOREACH-GENERATE operator to assign names to the fields.
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
Use the FILTER operator to remove all records with a score less than or equal to 2.0.
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
Use the ORDER operator to sort the remaining records by hour and score.
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
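In plain Python terms, these two steps are just a filter followed by a multi-key sort. The records below are invented (hour, ngram, score) tuples for illustration:

```python
# Analogue of the FILTER and ORDER steps: keep records with
# score > 2.0, then sort by (hour, score).
records = [
    ("07", "cheap flights", 2.5),
    ("12", "weather", 1.1),
    ("07", "mp3", 3.4),
]
filtered = [r for r in records if r[2] > 2.0]
ordered = sorted(filtered, key=lambda r: (r[0], r[2]))
print(ordered)  # [('07', 'cheap flights', 2.5), ('07', 'mp3', 3.4)]
```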
Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: hour, ngram, score, count, mean.
STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();