Big Data on a Shoestring
Author: Nicholas Bessmer
© 2013 Nicholas Bessmer
All Rights Reserved
Table of Contents
3 - Our Big Data Analytics Example Using Pig Latin Sample Script
Getting Our Tools Running on Our New Big Data Server
Getting The Linux Environment Set Up – Basic Steps
Editing Our Hadoop Configuration Files
Edit /conf/core-site.xml. I have used localhost in the value of fs.default.name
Format the name node (one per install): $ bin/hadoop namenode -format
Start all Hadoop components $ bin/hadoop-daemon.sh start namenode
Use the hadoop command-line tool to test the file system: $ hadoop dfs -ls /
Change to Pig Directory and Run Sample Script From The Tutorial
The companion volume Big Data for Small Business discusses how businesses can gain a competitive advantage by using Big Data techniques to filter out noise and determine trends in very large, unstructured data sets. Big Data is a toolbox for performing analysis on large (petabyte-scale) sets of unstructured data (Twitter, chats, web logs) that change in near real-time.
But … there are two forks in the road in terms of how businesses can use Big Data:
» To process operational data that changes in real-time.
» To analyze trends in massive volumes of structured and unstructured data that are set aside and processed by batch jobs.
A business may benefit from both flavors of Big Data tools. We want to avoid getting too immersed in buzz words and stay focused on how to realize the greatest Big Data benefits for the least cost.
“In Greek mythology, Cassandra (Greek: Κασσάνδρα, also Κασάνδρα) was the daughter of King Priam and Queen Hecuba of Troy. Her beauty caused Apollo to grant her the gift of prophecy.”
– Wikipedia
We are all familiar with ATM machines, where each check that is deposited is considered a transaction – a discrete set of steps with a beginning and an end. Transactional systems in the database world have ways to make sure changes are saved properly, including discarding partial information. Cassandra is a distributed database that is not transactional – rather, it is much more fluid and suited for operational data.
Imagine an airplane with thousands of measurements occurring in real-time. Everything from speed, altitude, thrust, and navigation to the health of the airplane systems needs to be checked almost instantaneously. It is wasted effort to spend a lot of time making sure each data-point is saved somewhere. Rather, the operational data of the plane needs to be fed to the command center (the pilots) in real-time with as little overhead as possible. Cassandra is really good at:
» Fault-tolerant, peer-to-peer architecture
» Performance that can be easily tuned
» Session storage (imagine sites like Netflix with millions of people streaming videos)
» User data storage
» Scalable, low-latency storage for mobile apps
» Critical data storage
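Cassandra's peer-to-peer design comes from spreading rows across nodes by hashing each key onto a ring, so no single node is a master or a bottleneck. Here is a toy sketch of that idea in Python – purely illustrative, since real Cassandra uses the Murmur3 partitioner and virtual nodes:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy hash ring: each node owns the keys whose hashes fall on
    the arc between its predecessor's token and its own token.
    (Illustrative only; not Cassandra's actual partitioner.)"""

    def __init__(self, nodes):
        # One token per node, derived from the node's name.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First node whose token is past the key's hash, wrapping around.
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, self._token(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic placement, no coordinator
```

Because placement is a pure function of the key, any node can compute where a row lives without asking a central server – the property that makes the peer-to-peer architecture fault tolerant.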
As discussed in the companion volume Big Data for Small Business, Hadoop is really good at the following:
» Reporting on large amounts of unstructured data
» Sorting and performing simple calculations on large amounts of unstructured and structured data:
  o Counting words – this is the standard MapReduce example
  o High-volume analysis – gathering and analyzing large-scale ad network data
  o Recommendation engines – analyzing browsing and purchasing patterns to recommend a product
  o Social graphs – determining relationships between individuals
For the purpose of this guide, we will work through setting up a Hadoop Big Data Analytics example and run a simple Pig Latin script from the Pig tutorial. This script performs some analysis on a log of Excite search engine queries.
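Roughly speaking, the tutorial script loads the query log, groups the queries by time of day, and counts them. The same kind of aggregation, sketched in Python over a few made-up log records (the field layout here is illustrative, not the tutorial's exact schema):

```python
from collections import Counter

# Each record: (user id, timestamp "YYMMDDHHMMSS", query) -- an
# illustrative stand-in for an Excite-style search query log.
records = [
    ("2A9EABFB", "970916083509", "weather"),
    ("2A9EABFB", "970916083522", "weather radar"),
    ("BED75271", "970916211259", "movies"),
]

# Group by hour of day (characters 6-8 of the timestamp) and count,
# mirroring Pig's GROUP ... BY / COUNT(...) pattern.
queries_per_hour = Counter(ts[6:8] for _, ts, _ in records)
# queries_per_hour == {"08": 2, "21": 1}
```

Pig Latin expresses this grouping declaratively and compiles it into MapReduce jobs, which is why a few lines of script can process a query log of any size.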
For future reference, you can find huge data sets to test Big Data with at the following site:
http://aws.amazon.com/publicdatasets/
Some examples that can be useful to businesses:
» US and foreign census data
» Labor statistics
» Federal Reserve data
» Federal contracts
Here are examples that are useful to scientists:
» Daily global weather measurements
» Genome databases
We may want to use census data from our local metropolitan area to identify trends such as disposable income or demographics like where elderly or young people reside. This type of marketing savvy requires not only computer power but also the framework that Hadoop provides. Think of Hadoop as a toolbox that allows people to approach managing huge volumes of unstructured and structured data.
In Amazon’s example, these sample big data sets are accessible by signing up for their EC2 service. This is a metered service that allows businesses and institutions to run applications and services in the cloud. Amazon acts as a central utility, like the electrical company, from which customers rent services – in this case computing power and data storage. EC2 is Amazon’s Elastic Compute Cloud, and what follows are the steps to set up an EC2 account through Amazon.
Here is the sign up screen to “rent” Amazon Web Services to run your application and database in the cloud. This is a metered service that fluctuates based on your demand. It will not break the bank.
Better yet, let’s sign up for the micro version.
Free Tier
As part of AWS’s Free Usage Tier, new AWS customers can get started with Amazon EC2 for free. Upon sign-up, new AWS customers receive the following EC2 services each month for one year:
750 hours of EC2 running Linux/Unix Micro instance usage
750 hours of EC2 running Microsoft Windows Server Micro instance usage
750 hours of Elastic Load Balancing plus 15 GB data processing
30 GB of Amazon EBS Standard volume storage plus 2 million IOs and 1 GB snapshot storage
15 GB of bandwidth out aggregated across all AWS services
1 GB of Regional Data Transfer
Not bad to test drive this service. You need to provide your credit card number and do a phone validation (to make sure you are a real person). Remember – we want the micro service to start off with. You will receive a confirmation email, and you should select MANAGE YOUR ACCOUNT. Sign in with the credentials that you created and: