EXPOSITION ON CYBER INFRASTRUCTURE AND BIG DATA

Programming for Big Data

Summer 2015: July 11 - August 13

DEPARTMENT OF COMPUTER SCIENCE SUMMER CAMP

Course Overview

This course is designed for students who need to create applications to analyze Big Data stored in Apache Hadoop using Pig and Hive. Topics include Map-Reduce, HDFS, HBase, Hadoop, data ingestion, workflow definition, and using Pig and Hive to perform data analytics on Big Data. Labs are executed on a Hortonworks Sandbox and in the cloud on AWS or Windows Azure. During the program, students will attend multiple field trips to local companies to see how Big Data is used.

Course Description

This course is an introduction to Big Data programming paradigms for senior undergraduate students. There will be five lectures (two hours each) and five labs (two hours each), with quizzes and lightweight midterm and final exams. First, students will explore Map-Reduce, the fundamental programming paradigm supported by Hadoop for processing large data sets. Although standard Map-Reduce gives fine-grained control over how data is processed, even a small program can take hundreds of lines to write, which is time-consuming, error-prone, and demands experienced programmers. Students will therefore learn Pig and Hive, high-level platforms developed at Yahoo and Facebook, respectively, that wrap Map-Reduce programs on Hadoop to reduce development time and increase coding flexibility. Second, students will learn Hadoop's data storage layers: HDFS (a distributed file system that spreads data across a cluster of machines while taking care of redundancy) and HBase (a column-oriented database). Finally, students will write small programs on Hadoop (the Hortonworks Sandbox, and in the cloud using either Amazon Web Services (AWS) or Windows Azure) to analyze, refine, and visualize real data.
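The Map-Reduce paradigm discussed above can be sketched in plain Python. This is a hypothetical illustration of the programming model, not Hadoop API code: the mapper emits (word, 1) pairs, a shuffle step groups pairs by key (as the Hadoop framework does between the map and reduce phases), and the reducer sums the counts.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for a single word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle phase: group mapper output by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["big data is big", "hadoop stores big data"]))
# → {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'stores': 1}
```

Even this toy version shows why Pig and Hive exist: the equivalent word count in each is only a few lines, while a real Hadoop Map-Reduce job adds job configuration, typed keys/values, and driver boilerplate on top of this logic.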

Objectives

Upon completing this course, students will be able to design and implement Map-Reduce programs for various large-data-set processing tasks using native Map-Reduce, Pig Latin scripts, or HiveQL, and will be able to design data schemas using HBase and HDFS. In addition, students will be able to use cloud computing platforms such as Azure and AWS.

Target Audience

Undergraduate students who need to understand and develop applications for Hadoop.

Daily Schedule

First Day: Lecture (1) Time (2:00 pm- 4:00 pm) Room (228)
  • Introduction: What is Big Data? Attributes, sources, and characteristics of Big Data; why Big Data now; Big Data projects that could impact your life; the challenges of processing Big Data; Apache Hadoop and Hadoop-related Apache projects; RDBMS vs. Hadoop; when to use (and not use) Hadoop; analyzing Big Data; and the home of the U.S. Government's open data.
Second Day: Lecture(2) Time (2:00 pm- 4:00 pm) Room (228)
  • Big Data Storage (HDFS): Data storage in HDFS (Blocks and data replication), Accessing HDFS (CLI, Java based), Fault tolerance, and HDFS Federation.
  • Big Data Storage (HBase): Architecture and schema design, HMaster and Region Servers, Column Families and Regions, write pipeline, read pipeline, HBase commands.
  • Processing (Map-Reduce and RDD): Map-Reduce Story, Architecture, How Map-Reduce works, Developing Map-Reduce, Map-Reduce Programming model.
  • Quiz
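HDFS, covered above, stores a file as fixed-size blocks, each replicated on several datanodes for fault tolerance. The sketch below is a toy model of that idea in Python, with made-up node names; the real HDFS placement policy is rack-aware and more sophisticated than this round-robin assignment.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Toy model of HDFS storage: split a file into fixed-size blocks
    and assign each block to `replication` distinct datanodes."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Rotate through the datanodes so replicas spread across the cluster.
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 350 MB file with 128 MB blocks needs 3 blocks, each stored on 3 nodes.
print(place_blocks(350, 128, ["node1", "node2", "node3", "node4"]))
```

Losing any single node leaves at least two replicas of every block, which is why HDFS tolerates machine failures without losing data.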
Third Day: Lecture (3) Time (2:00 pm- 4:00 pm ) Room(228)
  • Pig: Introduction to Pig, Map-Reduce vs. Pig, data types in Pig, Pig execution modes, the Grunt shell, loading data, exploring Pig Latin commands, using HCatLoader and HCatStorer, splitting and joining a dataset using Pig.
  • Quiz
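Pig Latin expresses an analysis as a pipeline of relational operators (LOAD, GROUP, FOREACH, ORDER, DUMP). The sketch below mimics that dataflow in plain Python on a made-up dataset, with a comment naming the Pig operator each step corresponds to; it illustrates the model, not Pig itself.

```python
# Each step mirrors a Pig Latin operator (shown in the comments).
records = [("alice", 3), ("bob", 5), ("alice", 2), ("carol", 4)]  # LOAD

# GROUP records BY name;
grouped = {}
for name, score in records:
    grouped.setdefault(name, []).append(score)

# FOREACH grouped GENERATE group, SUM(scores);
totals = {name: sum(scores) for name, scores in grouped.items()}

# ORDER totals BY total DESC; DUMP;
for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, total)
```

In Pig, each of these steps is a single line of script, and the platform compiles the whole pipeline into Map-Reduce jobs automatically.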
Fourth Day: Lecture (4) Time(2:00 pm- 4:00 pm) Room(228)
  • Hive: Introduction to Hive, Hive architecture, Hive vs. RDBMS, HiveQL and the Hive shell, managing tables (external vs. managed), data types and schemas, partitions and buckets, performing a join of two datasets, using Hive analytics functions.
  • Quiz
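Because HiveQL is SQL-like, the join of two datasets covered in this lecture can be previewed with any SQL engine. The sketch below uses Python's built-in sqlite3 module with made-up tables (`users`, `orders`); the SELECT statement would run essentially unchanged as HiveQL, though Hive-specific features such as partitions and buckets have no direct sqlite equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two small, made-up datasets standing in for Hive tables.
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "alice"), (2, "bob")])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (1, 5.5), (2, 7.25)])

# Join the two datasets and aggregate, HiveQL-style.
cur.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON u.id = o.user_id
    GROUP BY u.name
    ORDER BY u.name
""")
print(cur.fetchall())  # → [('alice', 15.5), ('bob', 7.25)]
```

The point of Hive is that this familiar declarative style is compiled into Map-Reduce jobs over data in HDFS, so analysts can work at terabyte scale without writing Java.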
Fifth Day: Lecture (5) Time(2:00 pm- 4:00 pm) Room(228)
  • AWS tutorial: What is cloud computing; private and public clouds; IaaS, PaaS, and SaaS. Students will learn the easiest way to get started with Enterprise Hadoop: create an AWS account, create an Amazon S3 bucket for cluster logs and output data, launch an Amazon EMR cluster, and run a Hive script on EMR.
  • Mid-Exam
Sixth Day: Lab (1) Time (2:00 pm- 4:00 pm) Lab(119)
  • Refine and Visualize Sentiment Data: understand how the public feels about a product launch. This lab takes students through the steps of extracting sentiment data from Twitter and analyzing the performance of a recent movie release.
  • Quiz
Seventh Day: Lab (2) Time(2:00 pm- 4:00 pm) Lab(119)
  • Refine and Visualize Server Log Data: respond quickly to enterprise security breaches. In this lab, students will learn how an enterprise security breach analysis and response might be performed.
  • Quiz
Eighth Day: Lab(3) Time(2:00 pm- 4:00 pm) Lab(119)
  • Analyze Clickstream Data: increase online conversions and revenue. In this lab, students learn how an online retailer can optimize buying paths to reduce bounce rate and improve conversion.
  • Quiz
Ninth Day: Lab(4) Time(2:00 pm- 4:00 pm) Lab(119)
  • Analyze Machine and Sensor Data: maintain a comfortable building temperature. Students will see how Hadoop can be used to analyze heating, ventilation, and air conditioning (HVAC) data to maintain ideal office temperatures and minimize expenses.
  • Quiz
Tenth Day: Lab(5) Time(2:00 pm- 4:00 pm) Lab(119)
  • Analyze Geolocation Data: reduce fuel costs and improve driver safety. In this lab, students will be shown how a trucking company can analyze geolocation data to achieve both goals.
  • Final-Exam

Course materials

The course lectures and labs will be posted online on the Schoology website or KSU Blackboard. All students are therefore required to have accounts on these sites prior to the start of classes.

Prerequisites

Students should be familiar with programming principles and have some background in data structures and databases. SQL knowledge is also helpful. No prior Hadoop knowledge is required.

Textbooks and websites

  1. White, Tom. Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-449-38973-4.
  2. Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive. O'Reilly Media, 2012. ISBN 978-1-449-31933-5.
  3. Gates, Alan. Programming Pig. O'Reilly Media, 2011. ISBN 978-1-449-30264-1.
  4. Pig Latin Basics (website)
  5. Apache Hive (website)
  6. van Vliet, Jurg, and Flavia Paganelli. Programming Amazon EC2. O'Reilly Media.
Requirement: students must download the free books prior to classes; they are available at no cost. Students should also visit the two websites to get some idea of Apache Pig and Hive.

Instructor

Name: Salem Othman
Website: www.sothman.com
Email: sothman@kent.edu

Cheating and Plagiarism

Plagiarism of any kind will not be tolerated. It will be dealt with in accordance with Kent State University's policy on cheating and plagiarism described in the student handbook.

Field Trip

Students will visit one of the following places each week:

  1. Global NOC in Indiana
  2. Progressive Insurance and Hyland Software (use of big data) - Cleveland
  3. Chicago Big Data Tour
  4. Ohio Super Computer Center and OARnet - Columbus
  5. Niagara Falls Power Plant (overnight)

[Pictures demonstrating some Big Data components and the places to be visited]

[Pictures of some of the places visited]