Cloud Computing (Spring 2020)

Instructor Lei Deng, Ph.D., Professor
Location: Rm.404, No.2 Comprehensive Laboratory Building
Office hours: 9:00-17:00
Email: leideng@csu.edu.cn
Time and location Weeks 2-9
Mixed online and offline teaching
Course description What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services run atop multiple geographically distributed data centers: Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc. Services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers - forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course, aimed at sophomores with exposure to basic programming on a single machine, focuses on the issues and programming models behind such cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.

Topics covered Google cloud computing, the MapReduce programming model, Hadoop, Spark, Amazon cloud, ...
Format The format is a 4-hour lecture per week, plus assigned readings. There will be regular homework assignments and a term project.
Prerequisites Java/C++ Programming
Discrete Mathematics
Data Structures
Databases
Texts and readings Hadoop: The Definitive Guide, Fourth Edition, by Tom White (O'Reilly)
Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer (Morgan & Claypool)
Cloud Computing (3rd edition), by Peng Liu (in Chinese, Tsinghua University Press)
Additional materials will be provided as handouts or in the form of light technical papers.
Grading Homework/Participation/Presentation 50%, Final Project/Paper 50%
Policies You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.
Final project/paper Option 1: Build a small Facebook-like application using Amazon's SimpleDB. Based on network analysis, the application should make friend recommendations; it should also visualize the social network. A report of at least 6 pages, written in English, is required.
Option 2: Design and run an experiment that uses Hadoop MapReduce or Spark to process a large dataset, and then write a paper of at least 6 pages in English. The paper should include introduction, methods, results, conclusion, and references sections.
Schedule Below is the tentative schedule for the course:

Week 2
Topic: Introduction; Google Cloud, Part 1
Details:
  Course overview
  Google File System (GFS): a distributed storage system
  MapReduce: a programming model and associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster (see the code sketch after this entry)
  Chubby: a lock service for loosely-coupled distributed systems
Reading:
  • Case Studies: NY Times article
  • Edge Computing: Vision and Challenges (Presentation 1)
  • Armbrust et al.: A View of Cloud Computing (Presentation 2)
  • Ghemawat et al.: The Google File System
  • Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
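To give a concrete preview of the MapReduce programming model listed above, below is a minimal word-count sketch against the Hadoop Java MapReduce API. It follows the canonical word-count example; the class names and command-line argument layout are illustrative only and are not part of the course handouts.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in an input line
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sum the counts emitted for each word
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Driver: configure the job; input and output paths come from the command line
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged as a jar, a job like this is typically launched with the hadoop jar command, passing an input and an output directory on HDFS.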
Week 3
Topic: Google Cloud, Part 2
Details:
  Presentation and discussion
  Bigtable: a distributed storage system for structured data
  Megastore: scalable, highly available storage for interactive services
Reading:
  • Chang et al.: Bigtable: A Distributed Storage System for Structured Data (Presentation 3)
Week 4
Topic: Google Cloud, Part 3
Details:
  Presentation and discussion
  Dapper: a large-scale distributed systems tracing infrastructure
  Google App Engine: a platform for building web apps and mobile backends that scale automatically
Reading:
  • Sigelman et al.: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Presentation 4)
Week 5
Topic: Amazon Cloud
Details:
  Presentation and discussion
  Experiment 1: set up Hadoop
  Dynamo: a highly available key-value storage system
  EC2, S3, and SQS
  SimpleDB: a simple database storage solution that lets developers store and query data items via web service requests
Reading:
  • Muralidhar et al.: f4: Facebook's Warm BLOB Storage System (Presentation 5)
Week 6
Topic: Hadoop
Details:
  Presentation and discussion
  Experiment 2: HDFS commands & APIs
  Basics: data types, drivers, mappers, reducers
  HDFS; dataflow in Hadoop
  Fault tolerance in Hadoop
  Programming model of Spark (see the code sketch after this entry)
Reading:
  • Zaharia et al.: Spark: Cluster Computing with Working Sets (Presentation 6)
  • Corbett et al.: Spanner: Google's Globally-Distributed Database, OSDI 2012 (Presentation 7)
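The "Programming model of Spark" item above can be previewed with the sketch below: the same word count as before, expressed with Spark's RDD API in Java. It assumes Spark 2.x with Java 8 lambdas; the class name, master setting, and input/output arguments are illustrative only.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        // Local mode is enough to experiment with the RDD programming model
        SparkConf conf = new SparkConf().setAppName("spark word count").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations (flatMap, mapToPair, reduceByKey) are lazy
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey(Integer::sum);

        // The action below triggers the actual distributed computation
        counts.saveAsTextFile(args[1]);
        sc.stop();
      }
    }

The contrast with the Hadoop driver/mapper/reducer structure above is the point: transformations are composed lazily and only an action such as saveAsTextFile runs the job, which is the core idea of the RDD programming model discussed in the Zaharia et al. paper.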
Week 7
Topic: Spark
Details:
  Presentation and discussion
  Experiment 3: process data with Hadoop/MapReduce
  Set up Hadoop; learn how to use HDFS
Reading:
  • Kim et al.: Database High Availability Using SHADOW Systems (Presentation 8)
Homework: HW1
Week 8
Topic: Virtualization
Details:
  Presentation and discussion
  Experiment 4: Spark
  Learn how to use Spark
Reading:
  • Sanaei et al.: Heterogeneity in Mobile Cloud Computing: Taxonomy and Open Challenges (Presentation 9)
Homework: HW2
Week 9
Topic: OpenStack
Details:
  Presentation and discussion
  Virtualization: concepts and technologies
  Introduction to OpenStack
Reading:
  • Wei et al.: Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics (Presentation 10)
  • Alsheikh et al.: Mobile Big Data Analytics Using Deep Learning and Apache Spark (Presentation 11)
  • Moritz et al.: SparkNet: Training Deep Networks in Spark (Presentation 12)
  • Uddin et al.: Human Action Recognition Using Adaptive Local Motion Descriptor in Spark (Presentation 13)

    Experiment materials Virtual machine (VM) accounts
    Client software
    Linux Fundamentals
    Unix Tutorial
    Unix/Linux Command Reference
    Hadoop tutorial


    Experiment 1: set up Hadoop
    Hadoop brief installation manual
    Hadoop
    JDK 1.7


    Experiment 2: HDFS commands & APIs
    HDFS Java manual (by Nassor)
    Frequently used HDFS shell commands
    HDFS Java API (see the sketch below)
    Apache Ant
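As a taste of the HDFS Java API used in Experiment 2, the sketch below writes a small file to HDFS and reads it back. The NameNode URI (hdfs://localhost:9000) and the file path are assumed example values; in the lab they come from your cluster's core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
      public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml; this URI is an assumed example
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (overwrite if it already exists)
        Path file = new Path("/user/student/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);
          }
        }

        fs.close();
      }
    }

The equivalent shell operations (hdfs dfs -put, -cat, -ls) are covered by the "Frequently used HDFS shell commands" reference above.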


    Experiment 3: Process data with Hadoop/MapReduce
    Assignment: download
    Data: download


    Experiment 4: Spark
    Spark manual

    Lecture slides (download)