Lesson 1: Building a Big Data Infrastructure Part 1

Unstructured Storage & Hadoop

Unstructured Data

  1. Log files

  2. Text

  3. Unknown formats

Hadoop

  1. Open source

  2. HDFS: a distributed file system modeled after the Google File System (GFS)

  3. MapReduce: a distributed batch-processing framework modeled after Google's MapReduce (a minimal sketch follows this list)
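
To make the map and reduce phases concrete, below is a minimal word-count sketch against the org.apache.hadoop.mapreduce Java API; the class names are illustrative, and it assumes a standard Hadoop client library on the classpath.

```java
// Minimal word-count sketch: the canonical introductory MapReduce example.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```

The mapper emits (word, 1) pairs, Hadoop groups the pairs by key, and the reducer sums each word's counts. A sketch of the driver that submits this job appears in the Batch Processing section below.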

Hadoop's Wider Ecosystem

  1. HBase - A column-oriented database modeled after Google's Bigtable (a client sketch follows this list)

  2. ZooKeeper - A coordination service for maintaining configuration information and providing distributed synchronization

  3. Hive - Provides a SQL-like interface for querying data in Hadoop

  4. Cascading - A framework for creating data-processing workflows in Hadoop

  5. Pig - A high-level language for creating MapReduce programs

  6. Flume - A service for collecting log data and moving it into Hadoop
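
As a sketch of what a column-oriented store looks like from client code, here is a hedged example of one write and one read through the HBase Java client (HBase 1.0+ API). The table name "logs", the column family "raw", and the row-key scheme are assumptions for illustration only.

```java
// Hedged HBase client sketch: one Put and one Get against an assumed
// "logs" table with an assumed "raw" column family.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("logs"))) {

      // Write one cell: row key "2013-01-01#host1", family "raw", qualifier "line".
      Put put = new Put(Bytes.toBytes("2013-01-01#host1"));
      put.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("line"),
                    Bytes.toBytes("GET /index.html 200"));
      table.put(put);

      // Read the same cell back.
      Get get = new Get(Bytes.toBytes("2013-01-01#host1"));
      Result result = table.get(get);
      byte[] line = result.getValue(Bytes.toBytes("raw"), Bytes.toBytes("line"));
      System.out.println(Bytes.toString(line));
    }
  }
}
```

Packing the date and host into the row key keeps related log lines physically adjacent, which is the kind of access-pattern-driven schema design Bigtable-style stores reward.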

Batch Processing

  1. Scheduled, like cron

  2. Run once or on a recurring schedule

  3. Ship code to the data, not data to the code (a driver sketch follows this list)
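
Here is a hedged sketch of the driver that would submit the word-count job from above; the job name and input/output paths are illustrative. The setJarByClass() call is what makes "ship code to data" literal: Hadoop distributes this jar to the nodes holding the input blocks rather than pulling the blocks to one machine.

```java
// Hedged driver sketch for the WordCount job defined earlier.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    job.setJarByClass(WordCountDriver.class);            // ship this jar to the cluster
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation on map nodes
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```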

What Hadoop & Batch Processing are Good For

  1. Storing copies of all data

  2. Storing and grepping through log files (a map-only grep sketch follows this list)

  3. Joining data from disparate sources

  4. Building indexes

  5. Building models
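
As an example of the log-grepping use case, here is a hedged sketch of a map-only grep job: with the reducer count set to zero (job.setNumReduceTasks(0) in the driver), each mapper simply writes out the input lines that match a regular expression. The configuration key "grep.pattern" is an assumption for illustration.

```java
// Hedged map-only grep sketch: emit only the lines matching a regex
// read from the job configuration (run with zero reducers).
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // Compile the search pattern once per map task.
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(value.toString()).find()) {
      context.write(NullWritable.get(), value); // pass matching lines through
    }
  }
}
```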
