Campus Technology | October 2012 | Data Management

THE NEW TOOLS OF BIG DATA
THE GROWTH IN THE VOLUME of the world's data is
currently outpacing Moore's Law, which posits that the number of
transistors on integrated circuits doubles approximately every two
years. In other words, notes Charles Zedlewski, vice president of the
products group at Cloudera, the pace of microprocessor innovation
is not keeping up with the rate at which data is being created.
"Keep in mind that an ever-higher fraction of that data cannot
be readily organized into the traditional rows and columns of a
database," adds Zedlewski. "These two phenomena are basically
starting to break the traditional architectures and technologies
people have used for the past 20 to 30 years to manage data."
Enter Apache Hadoop, an open source platform for data-intensive,
distributed computing that has become synonymous with Big
Data. The Hadoop project was originally developed at Yahoo by
Doug Cutting, now an architect at Cloudera. (The project was
named for his daughter's stuffed elephant.)
At its core, Hadoop combines an implementation of Google's
MapReduce with the Hadoop Distributed File System (HDFS).
MapReduce is a programming model for processing and generating
large data sets; it supports parallel computation across clusters
of commodity hardware, tolerating the failure of individual
machines. HDFS is designed to scale to petabytes of storage and
to run on top of the file systems of the underlying operating
system. In 2009, Yahoo released the source code for its internal
Hadoop distribution to developers.
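The model described above can be sketched in miniature. The following Python example simulates the three MapReduce phases (map, shuffle, reduce) as an in-memory word count; it is illustrative only and uses none of Hadoop's actual APIs, since real jobs would be written against Hadoop's Java interfaces or its streaming layer.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result
    return key, sum(values)

documents = ["big data needs big tools", "data tools for big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["big"] == 3, counts["data"] == 3
```

In a real cluster the map and reduce calls run in parallel on many machines, and the shuffle moves data between them over the network; the logic per phase, however, is no more complicated than this.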
"It was essentially a storage engine and a data-processing
engine combined," explains Zedlewski. "But Hadoop today is
really a constellation of about 16 to 17 open source projects, all
building on top of that original project, extending its usefulness
in all kinds of different directions."
Cloudera is a provider of Hadoop system-management tools and
support services. Its Hadoop distribution, dubbed the Cloudera
Distribution Including Apache Hadoop (CDH), is a data-management
platform that combines a number of components, including
support for the Hive and Pig languages; the HBase database for
random, real-time read/write access; the Apache ZooKeeper
coordination service; the Flume service for collecting and
aggregating log and event data; Sqoop for relational database
integration; the Mahout library of machine learning algorithms;
and the Oozie server-based workflow engine, among others.
The sheer volume of data is not why most customers turn to
Hadoop. Instead, it's the flexibility the platform provides. "It's the
idea that you can hold on to lots and lots of data without having
to predetermine how you're going to use it, and still make
productive use of it later," says Zedlewski.
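Zedlewski's point, often called "schema on read," can be illustrated with a small Python sketch: raw records are stored untouched, and each new question imposes its own structure at query time. The event records below are invented for illustration.

```python
import json

# Raw events are kept exactly as collected; no schema is decided
# up front (these records are hypothetical).
raw_log = [
    '{"user": "ann", "action": "login", "ms": 120}',
    '{"user": "bob", "action": "search", "ms": 340, "query": "hadoop"}',
    '{"user": "ann", "action": "search", "ms": 200, "query": "big data"}',
]

events = [json.loads(line) for line in raw_log]

# One question, asked months after collection: actions per user
per_user = {}
for event in events:
    per_user[event["user"]] = per_user.get(event["user"], 0) + 1

# A different question, asked later still: average search latency
searches = [e for e in events if e["action"] == "search"]
avg_ms = sum(e["ms"] for e in searches) / len(searches)
```

Neither question had to be anticipated when the data was written, and a third question next year would require no migration, only another pass over the raw records.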
Make no mistake, Hadoop can handle the big stuff. Speaking at the
annual Hadoop Summit in California this summer, Facebook engineer
Andrew Ryan talked about his company's record-setting reliance on
HDFS clusters to store more than 100 petabytes of data.
Hadoop is just one of the technologies emerging to support Big
Data analytics, according to James Kobielus, IBM's Big Data
evangelist. NoSQL, a class of non-relational database-management
systems, encompasses key-value stores and other approaches to
analytics, much of it focused on unstructured content. New social
graph analysis tools are applied to event-based data sources to
analyze relationships and enable customer segmentation by degrees
of influence. And so-called semantic web analysis (which leverages
the Resource Description Framework specification) is critical for
many text analytics applications.
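To make the graph and RDF ideas concrete, here is a minimal, hypothetical sketch: facts are held as subject-predicate-object triples (the shape the Resource Description Framework standardizes), and "influence" is crudely approximated by counting incoming "follows" edges. A real deployment would use a triple store or graph engine, not Python lists.

```python
# Facts as (subject, predicate, object) triples; the tiny social
# graph below is invented for illustration.
triples = [
    ("ann", "follows", "bob"),
    ("carol", "follows", "bob"),
    ("bob", "follows", "carol"),
]

def match(subject=None, predicate=None, obj=None):
    # Return triples fitting the pattern; None acts as a wildcard.
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]

# A crude influence measure: count incoming "follows" edges.
influence = {}
for _, _, followed in match(predicate="follows"):
    influence[followed] = influence.get(followed, 0) + 1
```

The same triple pattern answers many queries (who follows ann? what did bob do?), which is why the representation suits data whose future uses are unknown.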
slice and dice the data. "Most of the data in higher education
is dramatically more structured," Thornburgh notes, citing
student information systems (SIS) in particular. "This is precise
information about all these students: exactly who they are,
which classes they took, what grades they received in those
classes. Each of those events is not just the click of a
keystroke; it's the manifestation of months of work on both
the faculty member's and the student's behalf."
Unfortunately, the very structure that makes it easy to
analyze aspects of student data stands at odds with
another underlying concept behind Big Data: flexibility.
While relational databases are great at serving up data for
preconfigured purposes, it can be a bear to set them up
to generate different results, even when the amount of
data involved is of moderate size.
"That's one of the key limitations of traditional designs
today," insists Charles Zedlewski, vice president of the
products group at Cloudera, one of the commercial supporters
of Hadoop (see "The New Tools of Big Data") and provider of
a range of Big Data solutions and services. "Typically, once
you set up a database, it's difficult and expensive to change
later. In Big Data, the whole point is that you're acquiring
so much data that it's not realistic to assume up front all
the different ways that you're going to use it. You can't
possibly predict that. So how do you make it possible to
experiment and change a lot at very little cost?"
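The rigidity Zedlewski describes shows up even in a toy relational example. The SQLite sketch below (hypothetical table and data) demonstrates that a question outside the original schema forces a schema change plus a backfill of every existing row, exactly the kind of up-front commitment that schema-flexible storage avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is fixed up front: only the questions it anticipates
# are cheap to ask. (Table and rows are invented for illustration.)
conn.execute("CREATE TABLE grades (student TEXT, course TEXT, grade TEXT)")
conn.execute("INSERT INTO grades VALUES ('ann', 'CS101', 'A')")

# The preconfigured question works well:
rows = conn.execute(
    "SELECT grade FROM grades WHERE student = 'ann'").fetchall()

# A new question (who taught each course?) requires changing the
# schema first, and every existing row must then be backfilled:
conn.execute("ALTER TABLE grades ADD COLUMN instructor TEXT")
conn.execute("UPDATE grades SET instructor = 'unknown'")
```

On one small table the migration is trivial; across hundreds of tables and terabytes of rows, it is the "difficult and expensive" change Zedlewski has in mind.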