NANODEGREE PROGRAM SYLLABUS

Data Streaming

Need Help? Speak with an Advisor: www.udacity.com/advisor


Overview
The ultimate goal of the Data Streaming Nanodegree program is to provide students with the latest skills to process data in real-time by building fluency in modern data engineering tools, such as Apache Spark, Kafka, Spark Streaming, and Kafka Streaming. A graduate of this program will be able to:

• Understand the components of data streaming systems. Ingest data in real-time using Apache Kafka and Spark, and run analysis.
• Use the Faust Stream Processing Python library to build a real-time stream-based application. Compile real-time data and run live analytics, as well as draw insights from reports generated by the streaming console.
• Learn about the Kafka ecosystem, and the types of problems each solution is designed to solve. Use the Confluent Kafka Python library for simple topic management, production, and consumption.
• Explain the components of Spark Streaming (architecture and API), integrate Apache Spark Structured Streaming and Apache Kafka, manipulate data using Spark, and understand the statistical report generated by the Structured Streaming console.

This program comprises 2 courses and 2 projects. Each project you build will be an opportunity to demonstrate what you've learned in the course, and will demonstrate to potential employers that you have skills in these areas.

Prerequisite Knowledge: Intermediate SQL, Python, and experience with ETL. Basic familiarity with
traditional batch processing and traditional service architectures is desired, but not required.

Estimated Time: 2 months at 5-10 hrs/week

Prerequisites: Intermediate SQL, Python, and experience with ETL

Flexible Learning: Self-paced, so you can learn on the schedule that works best for you.

Need Help? Discuss this program with an enrollment advisor at udacity.com/advisor.



Course 1: Foundations of Data Streaming,
and SQL & Data Modeling for the Web
The goal of this course is to build working knowledge of the tools taught throughout, including Kafka consumers, producers, and topics; Kafka Connect sources and sinks; Kafka REST Proxy for producing data over REST; data schemas with JSON and Apache Avro/Schema Registry; stream processing with the Faust Python library; and stream processing with KSQL.

Course Project: Optimize Chicago Bus and Train Availability Using Kafka

For your first project, you'll be streaming public transit status using Kafka and the Kafka ecosystem to build a stream processing application that shows the status of trains in real-time. Based on the skills you learn, you will be able to optimize the availability of buses and trains in Chicago based on streaming data. You will learn how to have your own Python code produce events, use REST Proxy to send events over HTTP, and use Kafka Connect to collect data from a Postgres database to produce streaming data from a number of sources into Kafka. Then, you will use KSQL to combine related data models into a single topic ready for consumption by the downstream Python applications, and complete a simple Python application that ingests data from the Kafka topics for analysis. Finally, you will use the Faust Python Stream Processing library to further transform train station data into a more streamlined representation: using stateful processing, this library will show whether passenger volume is increasing, decreasing, or staying steady.
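
To give a concrete flavor of the project's first step, here is a minimal, illustrative sketch (not the project starter code) of producing a JSON-encoded arrival event with the confluent-kafka Python library. The broker address, topic name, and event fields are assumptions made for this example.

    import json
    from confluent_kafka import Producer

    # Assumed local broker and illustrative topic name; the project defines its own.
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    event = {"station": "clark_and_lake", "train_line": "blue", "status": "arrived"}

    # produce() is asynchronous; flush() blocks until outstanding messages are delivered.
    producer.produce("chicago.transit.arrivals", value=json.dumps(event))
    producer.flush()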

LEARNING OUTCOMES

LESSON ONE: Introduction to Stream Processing
• Describe and explain streaming data stores and stream processing
• Describe and explain real-world usages of stream processing
• Describe and explain append-only logs, events, and how stream processing differs from batch processing
• Utilize Kafka CLI tools and the Confluent Kafka Python library for topic management, production, and consumption (a consumption sketch follows this list)
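
For instance, a bare-bones consumption loop with the Confluent Kafka Python library might look like the following sketch; the broker address, consumer group, and topic name are placeholders.

    from confluent_kafka import Consumer

    # Placeholder broker, consumer group, and topic.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-group",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["example-topic"])

    try:
        while True:
            message = consumer.poll(1.0)  # wait up to one second for a message
            if message is None:
                continue
            if message.error():
                print(f"consumer error: {message.error()}")
                continue
            print(message.value().decode("utf-8"))
    finally:
        consumer.close()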

LESSON TWO: Apache Kafka
• Understand Kafka architecture, topics, and configuration
• Utilize the Confluent Kafka Python library to create and configure topics
• Understand Kafka producers, consumers, and configuration
• Utilize the Confluent Kafka Python library to create and configure producers
• Utilize the Confluent Kafka Python library to create topics, manage configuration, and manage offsets (see the topic-creation sketch below)
• Describe and explain user privacy considerations
• Describe and explain performance monitoring for consumers, producers, and the cluster itself
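
As a taste of programmatic topic management, here is a hedged sketch that creates a topic with the library's AdminClient; the broker address, topic name, and configuration values are illustrative only.

    from confluent_kafka.admin import AdminClient, NewTopic

    # Placeholder broker and topic settings.
    client = AdminClient({"bootstrap.servers": "localhost:9092"})

    topic = NewTopic(
        "example-topic",
        num_partitions=3,
        replication_factor=1,
        config={"cleanup.policy": "delete", "retention.ms": "86400000"},
    )

    # create_topics() returns a dict of topic name -> future; result() raises on failure.
    for name, future in client.create_topics([topic]).items():
        future.result()
        print(f"created topic {name}")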

LESSON THREE: Data Schemas and Apache Avro
• Understand what a data schema is and the value it provides
• Understand what Apache Avro is and what value it provides
• Utilize AvroProducer and AvroConsumer in Confluent Kafka Python (see the sketch below)
• Describe and explain schema evolution and data compatibility types
• Utilize Schema Registry components in Confluent Kafka Python to manage compatibility
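
As an illustration of the AvroProducer outcome above, the sketch below registers an assumed value schema with the Schema Registry and produces a conforming record; the schema, broker address, registry URL, and topic are placeholders.

    from confluent_kafka import avro
    from confluent_kafka.avro import AvroProducer

    # Illustrative Avro value schema for a purchase event.
    value_schema = avro.loads("""
    {
      "type": "record",
      "name": "Purchase",
      "fields": [
        {"name": "username", "type": "string"},
        {"name": "amount",   "type": "int"}
      ]
    }
    """)

    producer = AvroProducer(
        {
            "bootstrap.servers": "localhost:9092",
            "schema.registry.url": "http://localhost:8081",
        },
        default_value_schema=value_schema,
    )

    # The value dict must conform to the schema registered above.
    producer.produce(topic="example.purchases", value={"username": "alice", "amount": 1200})
    producer.flush()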

LESSON FOUR: Kafka Connect and REST Proxy
• Describe and explain what problem Kafka Connect solves, and where it would be more appropriate than a traditional consumer
• Describe and explain common connectors and how they work
• Utilize the Kafka Connect FileStream and JDBC source and sink connectors
• Describe and explain what problem Kafka REST Proxy solves, and where it would be more appropriate than alternatives
• Describe, explain, and utilize the REST Proxy metadata and administrative APIs
• Describe and explain the REST Proxy consumer APIs
• Utilize the REST Proxy consumer, subscription, and offset APIs
• Describe, explain, and utilize the REST Proxy producer APIs (see the sketch below)
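
For the REST Proxy producer APIs, a minimal sketch using the requests library is shown below; the proxy address and topic name are assumptions, and the v2 JSON content type is the one covered in the lesson.

    import requests

    # Assumed local REST Proxy address and illustrative topic name.
    REST_PROXY_URL = "http://localhost:8082"

    payload = {"records": [{"value": {"station": "clark_and_lake", "status": "arrived"}}]}

    # POSTing to /topics/<name> produces the records in the request body.
    response = requests.post(
        f"{REST_PROXY_URL}/topics/example-topic",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        json=payload,
    )
    response.raise_for_status()
    print(response.json())  # reports the partition and offset assigned to each record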

LESSON FIVE: Stream Processing Fundamentals
• Describe and explain common scenarios for stream processing, and where you would use stream versus batch processing
• Describe and explain common stream processing strategies
• Describe and explain how time and windowing work in stream processing
• Describe and explain what a stream versus a table is in stream processing, and where you would use one over the other
• Describe and explain how data storage works in stream processing applications and why it is needed

LESSON SIX: Stream Processing with Faust
• Describe and explain the Faust Stream Processing Python library, and how it fits into the ecosystem relative to solutions like Kafka Streams
• Describe and explain Faust stream-based processing
• Utilize Faust to create a stream-based application
• Describe and explain how Faust table-based processing works
• Utilize Faust to create a table-based application (see the sketch below)
• Describe and explain Faust processors and function usage
• Utilize Faust processors and functions
• Describe and explain Faust serialization and deserialization
• Utilize Faust serialization and deserialization
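
The sketch below shows the stream-and-table pattern in Faust under assumed names: the app id, broker, topic, and record fields are illustrative, and the table keeps a running rider count per station using stateful processing.

    import faust

    # Illustrative app id, broker, and topic names.
    app = faust.App("station-aggregator", broker="kafka://localhost:9092")


    class TurnstileEvent(faust.Record):
        station: str
        riders: int


    events_topic = app.topic("example.turnstile.events", value_type=TurnstileEvent)

    # A changelog-backed table holding a running rider count per station.
    rider_counts = app.Table("rider-counts", default=int)


    @app.agent(events_topic)
    async def count_riders(events):
        async for event in events:
            rider_counts[event.station] += event.riders


    if __name__ == "__main__":
        app.main()

Run as a worker (for example, python app.py worker) and the agent consumes the topic while the table maintains its state in a changelog topic.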

LESSON SEVEN: KSQL
• Describe and explain how KSQL fits into the Kafka ecosystem, and why you would choose it over a stream processing application built from scratch
• Describe and explain KSQL architecture
• Describe and explain how to create KSQL streams and tables from topics, and understand the importance of KEY and schema transformations
• Utilize KSQL to create tables and streams (see the sketch below)
• Describe and explain KSQL selection syntax
• Utilize KSQL syntax to query tables and streams
• Describe and explain KSQL windowing
• Utilize KSQL windowing within the context of table analysis
• Describe and explain KSQL grouping and aggregates
• Utilize KSQL grouping and aggregates within queries
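
As one way to exercise these outcomes programmatically, the sketch below submits assumed KSQL statements to a local KSQL server over its REST interface; the server address, topic, and stream/table names are illustrative.

    import requests

    # Assumed local KSQL server address.
    KSQL_URL = "http://localhost:8088"

    statements = """
    CREATE STREAM turnstile (station_id BIGINT, station_name VARCHAR)
        WITH (KAFKA_TOPIC='example.turnstile', VALUE_FORMAT='JSON');

    CREATE TABLE turnstile_summary AS
        SELECT station_id, COUNT(*) AS rides
        FROM turnstile
        GROUP BY station_id;
    """

    response = requests.post(
        f"{KSQL_URL}/ksql",
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        json={
            "ksql": statements,
            "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
        },
    )
    response.raise_for_status()
    print(response.json())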



Course 2: Streaming API Development and
Documentation
The goal of this course is to grow your expertise in the components of streaming data systems, and to build a real-time analytics application. Specifically, you will be able to: explain the components of Spark Streaming (architecture and API), ingest streaming data into Apache Spark Structured Streaming and perform analysis, integrate Apache Spark Structured Streaming and Apache Kafka, and understand the statistical report generated by the Structured Streaming console.

Course Project: Analyze San Francisco Crime Rate with Apache Spark Streaming

In this project, you will analyze a real-world dataset of the San Francisco crime rate, extracted from Kaggle, to provide statistical analysis using Apache Spark Structured Streaming. You will be provided with a dataset, and will use a local Kafka server to produce and ingest data through Spark Structured Streaming. Then, you will use various APIs to create and execute processing logic. You will create an ETL pipeline that produces Kafka data and ingests the data through Spark. Finally, you will generate a meaningful statistical report from the data.
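
A hedged sketch of the core of such a pipeline is shown below: it reads an assumed Kafka topic as a streaming DataFrame, parses a simplified two-field JSON schema, aggregates by crime type, and writes each micro-batch to the console. The topic name and schema are illustrative, and running it requires the Spark-Kafka integration package (spark-sql-kafka) on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    # Simplified, illustrative schema; the real dataset has more columns.
    schema = StructType([
        StructField("original_crime_type_name", StringType(), True),
        StructField("disposition", StringType(), True),
    ])

    spark = SparkSession.builder.appName("sf-crime-stats").getOrCreate()

    # Read the (assumed) Kafka topic as a streaming DataFrame and parse the JSON payload.
    raw_stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "sf.crime.calls")
        .option("startingOffsets", "earliest")
        .load()
    )

    calls = (
        raw_stream.selectExpr("CAST(value AS STRING) AS value")
        .select(from_json(col("value"), schema).alias("call"))
        .select("call.*")
    )

    # Count calls by crime type and print each micro-batch to the console.
    counts = calls.groupBy("original_crime_type_name").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()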

LEARNING OUTCOMES

LESSON ONE: The Power of Spark
• Describe and explain the big data ecosystem
• Describe and explain the hardware behind big data
• Describe and explain distributed systems
• Understand when to use Spark and when not to use it

LESSON TWO: Data Wrangling with Spark
• Manipulate data using functional programming
• Manipulate data using maps and lambda functions
• Read and write data with Spark SQL and Spark DataFrames
• Manipulate data using Spark for ETL purposes

LESSON THREE: Debugging and Optimization
• Set up a Spark cluster on AWS (transition from local to distributed mode)
• Upload and retrieve data on the AWS Cloud using Jupyter Notebook
• Submit data using a Python notebook
• Read and write data using distributed data storage, Amazon S3, and HDFS
• Diagnose and correct errors, and optimize code using the Spark Web UI and accumulators

LESSON FOUR: Introduction to Spark Streaming
• Learn Apache Spark's fundamental building blocks (RDD/DataFrame/Dataset)
• Review action/transformation functions and learn how these concepts apply in streaming

LESSON FIVE: Structured Streaming APIs
• Understand the concept of lazy evaluation
• Describe different join types between streaming and static DataFrames (see the sketch below)
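
As a toy illustration of joining a streaming DataFrame with a static one, the sketch below uses Spark's built-in rate source so it runs without Kafka; in the project you would instead join the parsed Kafka stream against a static lookup file. All names here are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

    # Static DataFrame: a small in-memory lookup table.
    labels = spark.createDataFrame([(0, "even"), (1, "odd")], ["parity", "label"])

    # Streaming DataFrame: the built-in rate source emits (timestamp, value) rows.
    rates = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Inner join between the streaming and static DataFrames on a derived key.
    joined = rates.withColumn("parity", expr("value % 2")).join(labels, on="parity")

    query = joined.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()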

LESSON SIX: Integration of Spark Streaming and Kafka
• Describe the Kafka source provider
• Describe Kafka offset management
• Describe triggers in Spark Streaming (see the sketch below)
• Describe the progress report in the Spark console and use it to analyze batches from Kafka
• Understand sample business architectures and learn how to tune them for best performance from examples
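
A brief, hedged sketch of how triggers and Kafka offsets surface in code follows; the broker, topic, rate cap, and checkpoint path are placeholders, and the Spark-Kafka integration package is assumed to be available.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-trigger-example").getOrCreate()

    stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "sf.crime.calls")
        .option("startingOffsets", "earliest")   # where to start when no checkpoint exists
        .option("maxOffsetsPerTrigger", 200)     # cap the records pulled per micro-batch
        .load()
    )

    # A processing-time trigger fires a micro-batch every 10 seconds; the checkpoint
    # location is where Spark persists consumed offsets between runs.
    query = (
        stream.selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .format("console")
        .trigger(processingTime="10 seconds")
        .option("checkpointLocation", "/tmp/sf-crime-checkpoint")
        .start()
    )
    query.awaitTermination()

While the query runs, query.lastProgress exposes the most recent per-batch progress report (input rates, batch duration, and Kafka offsets).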



Our Classroom Experience
REAL-WORLD PROJECTS
Build your skills through industry-relevant projects. Get
personalized feedback from our network of 900+ project
reviewers. Our simple interface makes it easy to submit
your projects as often as you need and receive unlimited
feedback on your work.

KNOWLEDGE
Find answers to your questions with Knowledge, our
proprietary wiki. Search questions asked by other students
and discover in real-time how to solve the challenges that
you encounter.

STUDENT HUB
Leverage the power of community through a simple, yet
powerful chat interface built within the classroom. Use
Student Hub to connect with your technical mentor and
fellow students in your Nanodegree program.

WORKSPACES
See your code in action. Check the output and quality of your code by running it on workspaces that are a part of our classroom.

QUIZZES
Check your understanding of concepts learned in the
program by answering simple and auto-graded quizzes.
Easily go back to the lessons to brush up on concepts
anytime you get an answer wrong.

CUSTOM STUDY PLANS


Work with a mentor to create a custom study plan to suit
your personal needs. Use this plan to keep track of your
progress toward your goal.

PROGRESS TRACKER
Stay on track to complete your Nanodegree program with
useful milestone reminders.



Learn with the Best

Ben Goldberg
STAFF ENGINEER AT SPOTHERO
In his career as an engineer, Ben Goldberg has worked in fields ranging from Computer Vision to Natural Language Processing. At SpotHero, he founded and built out their Data Engineering team, using Airflow as one of the key technologies.

Judit Lantos
SENIOR DATA ENGINEER AT NETFLIX
Currently, Judit is a Senior Data Engineer at Netflix. Formerly a Data Engineer at Split, where she worked on the statistical engine of their full-stack experimentation platform, she has also been an instructor at Insight Data Science, helping software engineers and academic coders transition to DE roles.

David Drummond
VP OF ENGINEERING AT INSIGHT
David is VP of Engineering at Insight, where he enjoys breaking down difficult concepts and helping others learn data engineering. David has a PhD in Physics from UC Riverside.

Jillian Kim
SENIOR DATA ENGINEER AT CHANGE HEALTHCARE
Jillian has worked in roles from building data analytics platforms to machine learning pipelines. Previously, she was a research engineer at Samsung focused on data analytics and ML, and now leads building pipelines at scale as a Senior Data Engineer at Change Healthcare.



All Our Nanodegree Programs Include:

EXPERIENCED PROJECT REVIEWERS


REVIEWER SERVICES

• Personalized feedback
• Unlimited submissions and feedback loops
• Practical tips and industry best practices
• Additional suggested resources to improve

INDIVIDUAL 1-ON-1 MENTORSHIP


MENTORSHIP SERVICES

• 6+ hrs of mentor support per month


• Weekly 1-on-1 personal mentor calls
• 1-on-1 mentor chats anytime
• Custom weekly learning plan focused on your
progress, goals and availability
• Daily progress tracking
• Proactive check-ins with you
• Mentors are compensated based on your
progress and success

PERSONAL CAREER SERVICES


CAREER COACHING

• Personal assistance in your job search


• Monthly 1-on-1 calls
• Personalized feedback and career guidance
• Access Udacity Talent Program used by our
network of employers to source candidates
• Advice on negotiating job offers
• Interview preparation
• Resume services
• GitHub portfolio review
• LinkedIn profile optimization



Frequently Asked Questions
PROGRAM OVERVIEW

WHY SHOULD I ENROLL?


As businesses increasingly rely on applications that produce and process data in
real-time, data streaming is an increasingly in-demand skill for data engineers.
The Data Streaming Nanodegree program will prepare you for the cutting edge of
data engineering as more and more companies look to derive live insights from
data at scale.

Students will learn how to process data in real-time by building fluency in modern
data engineering tools, such as Apache Spark, Kafka, Spark Streaming, and Kafka
Streaming.

You’ll start by understanding the components of data streaming systems. You’ll


then build a real-time analytics application. You will also compile data and run
analytics, as well as draw insights from reports generated by the streaming
console.

WHAT JOBS WILL THIS PROGRAM PREPARE ME FOR?


This program is designed to upskill experienced Software Engineers and Data
Engineers to learn the latest advancements in data processing, sending data
records continuously to support live updating.

The projects in the Data Streaming Nanodegree program will prepare you to
develop systems and applications capable of interpreting data in real-time,
and position you for roles in all industries that require live data processing
for functions including big data, cloud computing, web personalization, fraud
detection, sensor monitoring, anomaly detection, supply chain maintenance,
location-based services, and much more.

HOW DO I KNOW IF THIS PROGRAM IS RIGHT FOR ME?


This program is intended for software engineers looking to build real-time data
processing proficiency, as well as data engineers looking to enhance their existing
skill set with the next advancement in data engineering.

ENROLLMENT AND ADMISSION

DO I NEED TO APPLY? WHAT ARE THE ADMISSION CRITERIA?


There is no application. This Nanodegree program accepts everyone,
regardless of experience and specific background.

WHAT ARE THE PREREQUISITES FOR ENROLLMENT?


The Data Streaming Nanodegree program is designed for students with
intermediate Python and SQL skills, as well as experience with ETL.

Basic familiarity with traditional batch processing and basic conceptual
familiarity with traditional service architectures is desired, but not required.

IF I DO NOT MEET THE REQUIREMENTS TO ENROLL, WHAT SHOULD I DO?


Udacity’s Programming for Data Science with Python Nanodegree program is
great preparation for the Data Engineer Nanodegree program. You’ll learn to
code with Python and SQL.

Similarly, the Data Engineering Nanodegree program is great preparation for the
Data Streaming Nanodegree program.

TUITION AND TERM OF PROGRAM

HOW IS THIS NANODEGREE PROGRAM STRUCTURED?


The Data Streaming Nanodegree program comprises content and curriculum to support two projects. We estimate that students can complete the program in two months, working five to ten hours per week.

Each project will be reviewed by the Udacity reviewer network. Feedback will be
provided, and if you do not pass the project, you will be asked to resubmit the
project until it passes.

HOW LONG IS THIS NANODEGREE PROGRAM?


Access to this Nanodegree program runs for the length of time specified in
the payment card on the Nanodegree program overview page. If you do not
graduate within that time period, you will continue learning with month to
month payments. See the Terms of Use for other policies around the terms of
access to our Nanodegree programs.

CAN I SWITCH MY START DATE? CAN I GET A REFUND?


Please see the Udacity Nanodegree program FAQs for policies on enrollment in
our programs.

SOFTWARE AND HARDWARE

WHAT SOFTWARE AND VERSIONS WILL I NEED IN THIS PROGRAM?


There are no software and version requirements to complete this Nanodegree
program. All coursework and projects can be completed via Student Workspaces
in the Udacity online classroom.
