Lesson 1: Building a Big Data Infrastructure Part 1
Unstructured Storage & Hadoop
Unstructured Data
- Log Files
- Text
- Unknown Formats
Hadoop
- Open source
- HDFS: Distributed file system modeled after GFS
- MapReduce: Distributed batch processing modeled after Google's MapReduce
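To make the MapReduce model concrete, here is a minimal word-count sketch that simulates the map, shuffle/sort, and reduce phases in plain Python. The function names and the in-memory shuffle are illustrative assumptions; on a real Hadoop cluster the framework performs the shuffle across machines, and a Streaming job would read from stdin instead.

```python
# Illustrative simulation of MapReduce word count (not the Hadoop API).
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return (key, sum(values))

def word_count(lines):
    pairs = [p for line in lines for p in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["the cat sat", "the cat"])` returns `{"the": 2, "cat": 2, "sat": 1}`.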
Hadoop's Wider Ecosystem
HBase - A column-oriented database modeled after Google's BigTable
ZooKeeper - A service for maintaining configuration and distributed synchronization
Hive - Provides a SQL-like interface for querying data in Hadoop
Cascading - A framework for creating data processing workflows in Hadoop
Pig - A high-level language for creating MapReduce programs
Flume - Useful for moving log data into Hadoop
Batch Processing
- Like cron
- Run once or frequently
- Ship code to data
What Hadoop & Batch Processing Are Good For
- Storing copies of all data
- Storing and grepping through log files
- Joining data from disparate sources
- Building indexes
- Building models
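The "grepping through log files" use case above can be sketched as a map-only job: each mapper filters the lines of its own data block, which is the "ship code to data" idea in practice. This is a hedged simulation in plain Python; the pattern and function name are assumptions, and a real job would run one such filter per HDFS block.

```python
# Illustrative map-only "grep" over log lines (not the Hadoop API).
import re

def grep_mapper(lines, pattern):
    # Emit only the lines matching the pattern. On a cluster, each
    # mapper scans the block stored on its own node, so the code
    # moves to the data rather than the data to the code.
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]
```

For example, filtering for `"ERROR"` keeps only the error lines of a log.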