Big Data on a Shoestring


Authors: Nicholas Bessmer


hadoop dfs -cat /test_dir/myfile

A few words to test

 

And Hadoop is running! Remember the Linux tips:

 

» cd means change to a directory

» Linux uses forward slashes rather than backslashes

» Unless you set your path, you will need to change (cd) to this directory:

 

/home/ec2-user/hadoop-0.20.2/bin

or wherever you copied Hadoop to. From that directory, run the commands as follows:

 

./hadoop dfs -ls

 

Let’s Use Pig

 

The Apache Software Foundation describes Pig as follows:

 

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin. A Pig Latin program consists of a directed acyclic graph where each node represents an operation that transforms data. Operations are of two flavors: (1) relational-algebra style operations such as join, filter, project; (2) functional-programming style operators such as map, reduce. Pig compiles these dataflow programs into (sequences of) map-reduce jobs and executes them using Hadoop. It is also possible to execute Pig Latin programs in a "local" mode (without Hadoop cluster), in which case all processing takes place in a single local JVM.
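To make the dataflow idea concrete, here is a minimal sketch in plain Python (not Pig). It chains a filter, a projection, and a group-and-count step, with each step consuming the output of the previous one. The sample records and function names are invented for illustration only.

```python
from collections import Counter

# Hypothetical sample records: (user, query). Not taken from the Excite log.
records = [
    ("u1", "Hadoop Tutorial"),
    ("u2", ""),
    ("u3", "hadoop tutorial"),
    ("u4", "pig latin"),
]

def filter_nonempty(rows):
    """Relational-style filter: keep rows whose query is not empty."""
    return [row for row in rows if row[1]]

def project_query(rows):
    """Relational-style projection: keep only the lowercased query field."""
    return [query.lower() for _user, query in rows]

def count_by_query(queries):
    """Group-and-count, the kind of step that maps naturally onto a reduce."""
    return Counter(queries)

# Each step feeds the next, forming a simple linear dataflow graph.
result = count_by_query(project_query(filter_nonempty(records)))
print(result.most_common())
```

Pig does the same kind of chaining, but it plans the steps as map-reduce jobs so they can run over files far too large for one machine's memory.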

Change to Pig Directory and Run Sample Script From The Tutorial

 

This script queries a search log file from the Excite search engine. It checks the frequency of search phrases and runs on Hadoop. Please be aware that it will take some time to run!

 

The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.

The output file reports the following fields for each search phrase, including some basic statistics:

hour, ngram, score, count, mean

 

Run the following command:

 

cd /home/ec2-user/pig-0.10.1/tutorial/pigtmp

And this command:

 

pig ../scripts/script1-hadoop.pig

 

You will see a lot of processing information dumped to the screen like:

 

2013-02-09 00:07:11,446 [Thread-5] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2013-02-09 00:07:11,454 [Thread-5] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-02-09 00:07:11,454 [Thread-5] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2013-02-09 00:07:11,456 [Thread-5] INFO 
2013-02-09 00:07:16,614 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - kvstart = 0; kvend = 262144; length = 327680
2013-02-09 00:07:20,808 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-02-09 00:07:25,772 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner -
2013-02-09 00:07:30,276 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner -

These informational messages can be filtered out. At the end of the processing you should see something like:

 

HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
0.20.2  0.10.1  ec2-user        2013-02-09 00:07:06     2013-02-09 00:17:09     HASH_JOIN,GROUP_BY,DISTINCT,FILTER
Success!
Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      MaxReduceTime   MinReduceTime   AvgReduceTime   Alias   Feature Outputs
job_local_0001  1       1       n/a     n/a     n/a     n/a     n/a     n/a     clean1,clean2,houred,ngramed1,raw       DISTINCT
job_local_0002  2       1       n/a     n/a     n/a     n/a     n/a     n/a     hour_frequency1,hour_frequency2 GROUP_BY,COMBINER
job_local_0003  2       1       n/a     n/a     n/a     n/a     n/a     n/a     hour00,hour12,hour_frequency3,same,same1        HASH_JOIN       file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/script2-hadoop-results,
Input(s):
Successfully read 0 records from: "file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/excite.log.bz2"
Output(s):
Successfully stored 0 records in: "file:///home/ec2-user/pig-0.10.1/tutorial/pigtmp/script2-hadoop-results"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001  ->      job_local_0002,
job_local_0002  ->      job_local_0003,
job_local_0003

 

Look at the Results

 

Run a few final commands to look at the results of the Pig Latin script above:

 

» cd /home/ec2-user/pig-0.10.1/tutorial/pigtmp

» hadoop fs -ls script1-hadoop-results

» hadoop fs -cat 'script1-hadoop-results/*' | less
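If you would rather work with the results in a program than page through them with less, a small parser is enough. The sketch below assumes each output line holds the five fields hour, ngram, score, count, and mean separated by tabs (tab is the default PigStorage delimiter). The sample line is made up for illustration and is not real output from the run above.

```python
from typing import NamedTuple

class ResultRow(NamedTuple):
    hour: str      # two-digit hour, kept as text (for example "07")
    ngram: str
    score: float
    count: int
    mean: float

def parse_result_line(line: str) -> ResultRow:
    """Split one tab-separated result line into typed fields."""
    hour, ngram, score, count, mean = line.rstrip("\n").split("\t")
    return ResultRow(hour, ngram, float(score), int(count), float(mean))

# Hypothetical sample line, invented for this example.
sample = "07\thadoop tutorial\t2.5\t4\t1.25\n"
row = parse_result_line(sample)
print(row.ngram, row.score)
```

You could feed it lines saved from the hadoop fs -cat command, for example by redirecting that command's output to a local file first.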

 

This will show the results in a very simple format, basically raw data points. Using tools such as the open source Pentaho Community Edition BI Suite, you can create reports against Big Data, including data processed with Pig Latin. Please see:

 

http://community.pentaho.com/

 

Conclusion

 

We reviewed the differences between Hadoop and Cassandra and went through an exercise of setting up and using our very first Hadoop Big Data analytics implementation. Now that you are familiar with the steps (and costs), you can move forward with implementing Big Data within your own business or organization.

 

One final note: Cassandra and Hadoop can co-exist! This site will tell you more:

 

http://wiki.apache.org/cassandra/HadoopSupport

 

Appendix – Pig Script

 

See http://pig.apache.org/docs/r0.7.0/tutorial.html

The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.

The script is shown here:


        
Register the tutorial JAR file so that the included UDFs can be called in the script.

REGISTER ./tutorial.jar;


        
Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields user, time, and query.

raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);


        
Call the NonURLDetector UDF to remove records if the query field is empty or a URL.

clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);


        
Call the ToLower UDF to change the query field to lowercase.

clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;


        
Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field.

houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
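As a rough illustration of what the hour extraction amounts to, here is a minimal Python sketch that slices the HH part out of a YYMMDDHHMMSS timestamp. It is not the tutorial's Java UDF; the function name and the sample timestamp are invented for this example.

```python
def extract_hour(timestamp: str) -> str:
    """Return the two-digit hour (HH) from a YYMMDDHHMMSS timestamp."""
    if len(timestamp) < 8:
        raise ValueError("timestamp too short: %r" % timestamp)
    # Characters 0-5 are YYMMDD, so HH sits at positions 6 and 7.
    return timestamp[6:8]

# Hypothetical timestamp: 16 September 1997, 07:21:34.
print(extract_hour("970916072134"))
```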


        
Call the NGramGenerator UDF to compose the n-grams of the query.

ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
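If n-grams are new to you, the following Python sketch shows the general idea: break a query into words and emit every run of neighboring words up to a chosen length. The size limit of 2 is an assumption made for this example, and the function is a stand-in, not the tutorial's NGramGenerator code. The last loop mimics flatten, which turns each n-gram into its own record.

```python
from typing import List

def ngrams(query: str, max_size: int = 2) -> List[str]:
    """Return all word n-grams of length 1..max_size, shortest first."""
    words = query.split()
    result = []
    for size in range(1, max_size + 1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result

# Hypothetical record: (user, hour, query).
user, hour, query = "u1", "07", "big data tools"

# One output record per n-gram, like flatten() in the Pig statement.
flattened = [(user, hour, gram) for gram in ngrams(query)]
for record in flattened:
    print(record)
```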


        
Use the DISTINCT operator to get the unique n-grams for all records.

ngramed2 = DISTINCT ngramed1;


        
Use the GROUP operator to group records by n-gram and hour.

hour_frequency1 = GROUP ngramed2 BY (ngram, hour);


        
Use the COUNT function to get the count (occurrences) of each n-gram.

hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
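The DISTINCT, GROUP, and COUNT steps above can be pictured with a few lines of Python. This is only a sketch on invented in-memory records of the form (user, hour, ngram); Pig performs the same logic as map-reduce jobs over the full log.

```python
from collections import Counter

# Hypothetical records: (user, hour, ngram). The first row appears twice.
rows = [
    ("u1", "07", "hadoop"),
    ("u1", "07", "hadoop"),
    ("u2", "07", "hadoop"),
    ("u2", "08", "pig"),
]

# DISTINCT: drop exact duplicate records.
distinct_rows = set(rows)

# GROUP BY (ngram, hour) followed by COUNT: one counter entry per group.
hour_frequency = Counter((ngram, hour) for _user, hour, ngram in distinct_rows)

for (ngram, hour), count in sorted(hour_frequency.items()):
    print(ngram, hour, count)
```

Because duplicates are removed first, the count reflects how many different users used an n-gram in a given hour, not how many times one user repeated it.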


        
Use the GROUP operator to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.

uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;


        
For each group, identify the hour in which this n-gram is used with a particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.

uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
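The text does not spell out how the "popularity" score is computed. One plausible way to measure "particularly high frequency" is to ask how many standard deviations an hour's count sits above the n-gram's mean count across hours. The Python sketch below uses that formula purely as an assumption; it is not taken from the ScoreGenerator source, and the counts are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

def score_hours(hour_counts: Dict[str, int]) -> List[Tuple[str, float, int, float]]:
    """Return (hour, score, count, mean) for each hour of one n-gram.

    ASSUMPTION: score = (count - mean) / standard deviation of the counts.
    """
    counts = list(hour_counts.values())
    avg = float(mean(counts))
    spread = pstdev(counts)
    rows = []
    for hour, count in sorted(hour_counts.items()):
        # With no spread, every hour is equally typical, so score it 0.0.
        score = (count - avg) / spread if spread > 0 else 0.0
        rows.append((hour, score, count, avg))
    return rows

# Hypothetical per-hour counts for a single n-gram.
for row in score_hours({"07": 1, "08": 1, "09": 4}):
    print(row)
```

Under this assumption, hours where an n-gram is used about as often as usual score near zero, and only the unusual spikes score high.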


        
Use the FOREACH-GENERATE operator to assign names to the fields.

uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;


        
Use the FILTER operator to remove all records with a score less than or equal to 2.0.

filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;


        
Use the ORDER operator to sort the remaining records by hour and score.

ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
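The FILTER and ORDER steps above behave much like a list filter followed by a sort on two keys. Here is a minimal Python sketch of that idea; the records (hour, ngram, score, count, mean) are invented for illustration.

```python
# Hypothetical records: (hour, ngram, score, count, mean).
records = [
    ("08", "pig", 2.5, 4, 1.0),
    ("07", "hadoop", 3.1, 5, 1.2),
    ("07", "data", 1.0, 2, 1.5),
    ("07", "big data", 2.2, 3, 1.1),
]

# FILTER: keep only records whose score is greater than 2.0.
filtered = [rec for rec in records if rec[2] > 2.0]

# ORDER: sort by hour first, then by score within each hour.
ordered = sorted(filtered, key=lambda rec: (rec[0], rec[2]))

for rec in ordered:
    print(rec)
```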


        
Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: hour, ngram, score, count, mean.

STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();

 
