SCI-241 | HDP Data Science - Updated November 2018

SCI-241 Data Science Theory and Application

About this course

This course provides instruction on the theory and practice of data science, including machine learning and natural language processing. It introduces many of the core concepts behind today’s most commonly used algorithms and applies them in practical settings. We’ll discuss concepts and key algorithms in all of the major areas (Classification, Regression, Clustering, and Dimensionality Reduction) and include a primer on Neural Networks. We’ll focus on both single-server tools and frameworks (Python, NumPy, pandas, SciPy, scikit-learn, NLTK, TensorFlow, Jupyter) and large-scale tools and frameworks (Spark MLlib, Stanford CoreNLP, TensorFlowOnSpark/Horovod/MLeap, Apache Zeppelin).

PREREQUISITES

Students must have experience with Python, Scala, and Spark, prior exposure to statistics and probability, and a basic understanding of big data and Hadoop principles. While brief reviews of these topics are offered, students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials (HDP-123) and HDP Developer: Apache Spark 2.3 (DEV-343) courses, as well as the language-specific introduction courses.


TARGET AUDIENCE

Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Spark/Hadoop.

FORMAT

50% Lecture/Discussion

50% Hands-On Labs

DAY 1 OBJECTIVES

  • Discuss aspects of Data Science, the team members, and the various roles in the team
  • Discuss use cases for Data Science
  • Discuss the current State of the Art and its future direction
  • Review HDFS, Spark, Jupyter, and Zeppelin
  • Work with scikit-learn, pandas, NumPy, Matplotlib, and Seaborn
  • Review and use Spark DataFrames and NoSQL in ETL
  • Review and use Apache NiFi to create and manage data flows
  • Review and use Spark Structured Streaming with Kafka
  • Review essential Mathematics, Statistics, and Probability used in ML with Zeppelin
DAY 1 LABS AND DEMONSTRATIONS
  • Lab: Hello, ML with scikit-learn (30 min, using Jupyter, with visualizations in Matplotlib and Seaborn)
  • Lab: Spark REPLs, Spark Submit, and Zeppelin Review (30 min, pre-built apps executed all three ways; reviews the DataFrame functional paradigm)
  • Lab: HDFS Review (15-20 min, moving data to/from HDFS)
  • Lab: Spark DataFrames and Files (20-30 min, JSON, CSV, Parquet, ORC, Avro files)
  • Lab: Spark DataFrames and NoSQL (MariaDB, MongoDB, and possibly GraphX/GraphFrames)
  • Lab: NiFi Review (30 min, essentials of moving data to/from HDFS with NiFi)
  • Lab: Kafka and Structured Streaming Review (30 min, reviewing an app that streams data from Kafka)
  • Lab: Essential Math Review (30 min, graphing, plotting, probability, lead-in to gradient descent)

DAY 2 OBJECTIVES

  • Discuss categories and use cases of the various ML Algorithms
  • Understand the similarities and differences between the classification and regression categories
  • Understand Linear Regression, Logistic Regression, and Support Vector Machines
  • Understand Decision Trees and their limitations
  • Understand Random Forests and Gradient Boosted Trees
  • Understand Nearest-Neighbors
  • Discuss and demonstrate a Spam Classifier
DAY 2 LABS AND DEMONSTRATIONS
  • Lab: Linear Regression as a Projection (30 min, includes visualization)
  • Lab: Logistic Regression (30 min, includes visualization)
  • Lab: Support Vector Machines (30 min, possibly includes visualization)
  • Lab: Decision Trees (30 min, possibly includes visualization)
  • Lab: Random Forests (30 min, possibly includes visualization)
  • Lab: Linear Regression as a Classifier (30 min, includes visualization)
  • Lab: KNN (30 min, includes visualization)
  • Demo: Creating a Spam Classifier with MLlib (30 min)

DAY 3 OBJECTIVES

  • Discuss and understand Clustering Algorithms
  • Discuss and understand Neural Networks, particularly Convolutional and Recurrent networks, including LSTMs
  • Work with TensorFlow to create a basic neural network
  • Discuss Natural Language Processing
  • Compare and contrast NLTK and Stanford CoreNLP
  • Discuss and demonstrate Sentiment Analysis
  • Discuss Dimensionality Reduction Algorithms
DAY 3 LABS AND DEMONSTRATIONS
  • Lab: K-Means Clustering (30 min, includes visualization)
  • Lab: GMM Clustering (30 min, includes visualization)
  • Lab: Essential TensorFlow (30 min)
  • Lab: NLTK
  • Lab: Stanford CoreNLP
  • Demo: Sentiment Analysis
  • Lab: Dimensionality Reduction with PCA (30 min)

DAY 4 OBJECTIVES

  • Discuss Hyper-Parameter Tuning and K-Fold Validation
  • Understand Ensemble Models
  • Discuss ML Pipelines in Spark MLlib
  • Discuss ML in production and real-world issues
  • Demonstrate TensorFlowOnSpark
  • Describe real-world use cases of ML
DAY 4 LABS AND DEMONSTRATIONS
  • Lab: Hyper-Parameter Tuning (30 min, includes visualization)
  • Lab: K-Fold Validation (30 min)
  • Lab: Ensemble Methods (30 min)
  • Lab: ML Pipelines in SparkML (30 min)
  • Demo: TensorFlowOnSpark (20-30 min)
  • Demo: Use Cases

Live Training Event

Event: Live Virtual Class
Date: Dec. 5, 2018, 10 a.m. - Dec. 7, 2018, 5 p.m. EST
Spaces left: 20

Curriculum

  • Course Logistics
      • Live Training Event
  • Course Presentation
      • SCI-241_DataScience-Presentation.pdf
  • Lab Guides
      • SCI-241_DataScience-LabGuide.pdf
      • 04-Lab-FirstStepsWithPythonInTheREPL.pdf
      • 09-Lab-UsingHDFSCommands.pdf
      • 10-Lab-IntroductionToSparkREPLsAndZeppelin.pdf
      • 14-Lab-LinearRegression.pdf
      • 15-Lab-LinearRegressionForClassification.pdf
      • 16-Lab-LogisticRegression.pdf
      • 17-Lab-SupportVectorMachines.pdf
      • 19-Lab-DecisionTrees.pdf
      • 20-Lab-RandomForests.pdf
      • 21-Lab-K-NearestNeighbors.pdf
      • 24-Lab-GaussianMixtureModel.pdf
      • 25-Lab-PrincipalComponentAnalysis.pdf
      • 32-Lab-NLP.pdf
      • 34-Lab-CrossValidation.pdf
      • 35-Lab-GridSearch.pdf
      • 37-Lab-EnsembleMethods.pdf
  • Lesson Slides
      • 01-Lesson-DataScienceLogistics.pdf
      • 02-Lesson-WhatIsDataScience.pdf
      • 03-Lesson-PythonPrimer.pdf
      • 05-Lesson-ZeppelinOverview-Part#1.pdf
      • 06-Lesson-ZeppelinOverview-Part#2.pdf
      • 07-Lesson-ThePythonScientificEcosystem.pdf
      • 08-Lesson-TheBigDataScienceStack.pdf
      • 11-Lesson-IntroductionToMachineLearning.pdf
      • 12-Lesson-EvaluatingMachineLearningModels.pdf
      • 13-Lesson-SupervisedLearning.pdf
      • 18-Lesson-DecisionTrees-RandomForests.pdf
      • 22-Lesson-SpamClassification.pdf
      • 23-Lesson-Clustering-DimensionalityReduction.pdf
      • 26-Lesson-IntroductionToNeuralNetworks.pdf
      • 27-Lesson-ConvolutionalNeuralNetworks.pdf
      • 28-Lesson-TensorFlow.pdf
      • 29-Lesson-TensorFlowOnSpark.pdf
      • 30-Lesson-NaturalLanguageProcessing.pdf
      • 31-Lesson-NLPLibraries.pdf
      • 33-Lesson-CrossValidation-HyperParameterTuning.pdf
      • 36-Lesson-EnsembleMethods.pdf
      • 38-Lesson-MachineLearningUseCases.pdf
      • 39-Lesson-MachineLearningInProduction.pdf
      • 40-Lesson-ThankYou.pdf
  • Wrapping Up
      • Course & Instructor Survey
