Working with Big Data: Infrastructure, Algorithms, and Visualizations

Slides, code, and supplemental materials for the LiveLessons

View the Project on GitHub pauldix/working-with-big-data

Intro

This site accompanies the Addison-Wesley LiveLessons Working with Big Data: Infrastructure, Algorithms, and Visualizations, available on Safari Online.

Table of Contents

The goal of these LiveLessons is to cover the main aspects of big data at a high level. For instance, instead of going into detail on the many nuances of Hadoop, we'll just get it set up and use it in conjunction with other tools like Cassandra. From there, we'll look at how to work with and integrate algorithms, and close out with tools for visualization. Throughout, we'll work through the full stack to see how the different pieces of a big data system fit together.

  1. Lesson 1: Unstructured storage and Hadoop
    1. Set up a basic Hadoop installation
    2. Write data into the Hadoop File System
    3. Write a Hadoop streaming job to process text files
  2. Lesson 2: Structured storage and Cassandra
    1. Set up a basic Cassandra installation
    2. Create a Cassandra schema for storing data
    3. Store and retrieve data from Cassandra using Ruby
    4. Write data into Cassandra from a Hadoop streaming job
    5. Use Hadoop to parallelize writes
  3. Lesson 3: Real-time processing and messaging
    1. Set up the Kafka messaging system
    2. Publish and consume data from Kafka using Ruby
    3. Aggregate log files into Hadoop with Kafka and a Ruby consumer
    4. Create horizontally scalable message consumers
    5. Sample messages using Kafka's partitioning
    6. Create redundant message consumers for high availability
  4. Lesson 4: Running machine learning algorithms in a big data architecture
  5. Lesson 5: Experimentation and running algorithms in production
  6. Lesson 6: Basic visualizations

Authors and Contributors

Paul Dix (@pauldix)

Discussion

Ask questions about the lectures, or about anything related to big data, in the Working with Big Data Group.