MPP On-Demand: Implementing Predictive Solutions with Spark in HDInsight (DAT202.3x)

MPP On-Demand: Implementing Predictive Solutions with Spark in HDInsight (DAT202.3x) Course Outline

Special Note to New Hampshire Residents
This course has not yet been approved by the New Hampshire Department of Education. Please contact us for an update on when the class will be available in New Hampshire.

*** Note: This is an On-Demand Self Study Class ***
You can take this class at any time; there are no set dates. It features hands-on labs so you can practice new skills at your workstation. Other parts of the course include video lectures that you can view on the go from your phone or tablet. In all cases, customers must call us directly at 800-288-8221 to register for this class.

MPP On-Demand Series
This course is part of two Microsoft Professional Program (MPP) series: the MPP Data Science Track and the MPP Big Data Track. It can be taken individually or as part of either 9-course series, plus a capstone project. You will need to purchase a validated certificate to get credit for this class as part of the MPP. Request the certificate at the time of registration.

On-Demand Learner Profiles
MPP On-Demand is a self-study training solution designed for two types of learners. First, it is a great fit for experienced IT professionals who don't need traditional 5-day classes to upgrade their existing skills. They can pick and choose topics to make the most effective use of their time. Second, it is well suited to highly motivated individuals who are new to a technology and need to space their learning over a period of weeks or months. These learners can take their time and repeat sections as needed until they master the new concepts.

Overview
This course is part of the Microsoft Professional Program Certificate in Big Data.

The open-source programming language R has long been popular, particularly in academia, for data processing and statistical analysis. Among R's strengths are that it is a succinct programming language and that it has an extensive repository of third-party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to go very quickly from raw data to summaries, charts, and even full-blown reports. However, one deficiency of R is that it traditionally uses a lot of memory, both because it needs to load a copy of the data in its entirety as a data.frame object and because processing the data often involves making further copies (sometimes referred to as copy-on-modify). This is one of the reasons R has been adopted more reluctantly in industry than in academia.

The main component of Microsoft R Server (MRS) is the RevoScaleR package, an R library that offers a set of functions for processing large datasets without having to load them into memory all at once. RevoScaleR also offers a rich set of distributed statistical and machine learning algorithms, and new ones are added over time. Finally, RevoScaleR offers a mechanism by which we can take code that we developed on our laptop and deploy it, with minimal effort, on a remote server such as SQL Server or Spark, where the infrastructure is very different under the hood.
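
The idea of processing a dataset without loading it all at once can be sketched as follows. This is only a conceptual illustration in plain Python, not RevoScaleR code and not the course's method: the function name chunked_mean, the column name fare, and the tiny in-memory sample are all hypothetical. The sketch reads a flat file a few rows at a time and keeps only running totals between chunks.

```python
import csv
import io
from itertools import islice


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of one numeric column, holding at most chunk_size rows in memory.

    `rows` is any iterable of CSV lines (an open file, for example).
    """
    reader = csv.DictReader(rows)
    total = 0.0
    count = 0
    while True:
        # Pull the next chunk; earlier chunks are no longer referenced.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical stand-in for a large flat file on disk.
sample = io.StringIO("fare\n10.0\n20.0\n30.0\n40.0\n")
result = chunked_mean(sample, "fare", chunk_size=2)
print(result)
```

Only the running total and row count survive from one chunk to the next, so memory use depends on the chunk size rather than on the size of the file.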

In this course, we will show you how to use MRS to run an analysis on a large dataset and provide some examples of how to deploy it on a Spark cluster or a SQL Server database. Upon completion, you will know how to use R for big-data problems.

Since RevoScaleR is an R package, we assume that course participants are familiar with R. A solid understanding of R data structures (vectors, matrices, lists, data frames, environments) is required. For example, students should be able to confidently tell the difference between a list and a data frame, know what kind of data each is generally a good representation for, and know how to subset it. Students should be familiar with basic programming concepts such as control flow, loops, functions, and scope, and should have a good understanding of how to write and debug R functions. Finally, students are expected to have a good understanding of data manipulation and data processing in R (e.g., functions such as merge, transform, subset, cbind, rbind, lapply, apply). Familiarity with third-party packages such as dplyr is also helpful.

Duration
Approximately 4 weeks, assuming 2 to 4 hours of effort per week.

What you'll learn
You will learn how to use MRS to read, process, and analyze large datasets including:
• Read data from flat files into R’s data frame object, investigate the structure of the dataset and make corrections, and store prepared datasets for later use
• Prepare and transform the data
• Calculate essential summary statistics, do cross-tabulation, write your own summary functions, and visualize data with the ggplot2 package
• Build predictive models, evaluate and compare models, and generate predictions on new data

Attend hands-on, instructor-led MPP On-Demand: Implementing Predictive Solutions with Spark in HDInsight (DAT202.3x) training classes at ONLC's more than 300 locations. Not near one of our locations? Attend these same live classes from your home/office PC via our Remote Classroom Instruction (RCI) technology.

For additional training options, check out our list of Courses and select the one that's right for you.
