Machine Learning Using Spark.ML and Python

Learn how to use Spark.ML and Python for machine learning.

About the Class

You've heard how practically useful classification is for learning about your users. You are keen to try it for yourself. However, you've been unable to find any realistic user classification datasets. Such data is very difficult to find in published data sources.

You're familiar with Python and Apache Spark. You may have done some machine learning using Spark.ML. You have done parts of an end-to-end machine learning task, but there are other parts that you have yet to learn. 

You are good at some aspects of machine learning but less confident at others. You would like to work through a user classification task from beginning to end, starting from raw transactional user log data.

You will use a realistic user classification dataset that closely emulates the type of data one might encounter in a production setting. You will create a machine learning feature set from raw user log data and use it to create a classification model. You won't stop there: you will then improve this model through hyperparameter tuning and data selection.

Spark combines the power of distributed computing with the ease of use of Python and SQL. Level up today.

What will you learn in this course?

  • How to train a classification model on high-dimensional data.
  • How to do sensitivity analysis on a trained model.
  • How to assess the accuracy of a trained model.
  • How to improve model accuracy by tuning model hyperparameters.
  • How to improve model accuracy by judiciously selecting the data.
  • How to automate model tuning.
  • How to determine whether a dataset lends itself to classification. This entails analyzing the data and determining whether a classifier has any chance of attaining a useful level of prediction accuracy. You will learn what power-law analysis is and how to apply it to a classification dataset.
  • How to analyze raw user log data. In practice, data is not served up in an easy-to-use form. It is typically stored in a format chosen for efficient storage, because storing the data efficiently often takes priority over making it easy to access later.
  • How to do machine learning from end to end. This has several stages: (a) loading a raw file, (b) putting the data into a format suitable for analysis and for feature set creation, (c) creating the feature data, for example vectorizing the data, properly handling missing values, and reducing the dimensionality, (d) training the machine learning model, (e) assessing the performance of the model, (f) tuning the model.
  • Extracting, Transforming, and Selecting (ETS). How to convert raw user log data to a tabular format that is most useful for training a classifier. The data may not be provided in a tabular format. A machine learning engineer often needs to create a table to represent the data. They will also be responsible for identifying and resolving issues with the data that would otherwise hamper the accuracy of the model fit to the data.
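
To make stages (a) through (c) concrete, here is a minimal plain-Python sketch. It is not the course's Spark code: the log layout (user id, label, then a ragged list of item tokens), the sample lines, and the helper names `parse_log`, `build_vocabulary`, and `vectorize` are all assumptions made for illustration. Capping the vocabulary size stands in for dimensionality reduction.

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical raw log: user_id, label, then a variable number of item tokens.
RAW_LOG = """\
u1,rabbit,carrot,lettuce,carrot
u2,duck,pond,bread,,pond
u3,rabbit,carrot
"""

def parse_log(text):
    """Turn ragged comma-delimited lines into (user_id, label, tokens) rows."""
    rows = []
    for fields in csv.reader(StringIO(text)):
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        user_id, label, *tokens = fields
        # Handle missing values: drop empty tokens instead of counting them.
        tokens = [t.strip() for t in tokens if t.strip()]
        rows.append((user_id, label, tokens))
    return rows

def build_vocabulary(rows, max_size=None):
    """Map the most frequent tokens to column indices (a crude size cap)."""
    counts = Counter(t for _, _, tokens in rows for t in tokens)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}

def vectorize(tokens, vocab):
    """Count how often each vocabulary token occurs; ignore unknown tokens."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

rows = parse_log(RAW_LOG)
vocab = build_vocabulary(rows, max_size=3)
features = [vectorize(tokens, vocab) for _, _, tokens in rows]
```

In Spark.ML, the vectorizing step would typically be handled by a transformer such as CountVectorizer rather than hand-written code.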

What are the learning objectives?

The learner will complete two machine learning projects from beginning to end.

  • For the first project, the learner will be given a dataset that initially may seem inscrutable. However, they will soon be able to extract useful information from the data, using both interactive analyses and automated methods.
  • For the second project, the learner will predict the last item in a sequence. This technique is generally applicable to several types of recommendation systems.

The learner will know how to improve the accuracy of a seemingly marginally useful model. The learner will do a pre-deployment analysis. A machine learning task does not end when the model is trained. Calculating the overall model accuracy is not always sufficient. Assessing the model accuracy under different constraints on the input data is important to a successful deployment. This is because a model can be more or less accurate in some regions of the input space compared to other regions of the input space.

Here are some of the specific concepts that are covered:

  • What to do first with a messy comma-delimited log file.
  • How to load raw data from a file and convert it into a tabular format.
  • What power-law analysis is, and how to use it to determine whether a dataset is suitable for classification.
  • How to create feature data for training a logistic regression classifier.
  • How to train a logistic regression classifier on text data. 
  • How data selection can be used to optimize model accuracy, and the tradeoff between coverage and accuracy.
  • How to use sensitivity analysis to determine how well the model performs in different regions of the input space.
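
The list above does not spell out what the power-law analysis involves. One common reading, assumed here, is a rank-frequency check: sort items by how often they occur and see whether frequency falls off roughly as a straight line on log-log axes. The sketch below uses synthetic data and hypothetical helper names (`rank_frequency`, `loglog_slope`); it is an illustration of that reading, not the course's procedure.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs, with rank 1 for the most frequent token."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(points):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(count) for _, count in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data (an assumption): item k occurs about 1000 / k times,
# so the fitted slope should come out close to -1.
tokens = [f"item{k}" for k in range(1, 51) for _ in range(round(1000 / k))]
slope = loglog_slope(rank_frequency(tokens))
```

One might compute such a curve separately for each class and compare them; whether the course does exactly that is not stated in this description.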

What technologies, packages, and functions will students use?

  • Bash
  • Python
  • Matplotlib
  • Pyplot
  • Apache Spark
  • Spark SQL
  • Spark.ML

All libraries used in this course are contained in the pyspark package. This means that we will not need to install additional packages. All modules are either part of the Python standard library or imported from the pyspark module.

What terms and jargon will be defined?

Here is a list of technical terms, jargon, and acronyms that will be used in the course:

Feature engineering, machine learning, model fitting, classification, logistic regression, logistic classification, power law, Extract Transform and Select (ETS), pipeline, cross-validation, hyperparameters, feature sets, vectorizer, vocabulary, area under the curve (AUC), data selection, grid search, automated model tuning, DataFrame, Spark.ML, StructType, StructField, CountVectorizer.

What concepts will be taught?

Active versus Casual is a common way to characterize users. Active users can have behavioral characteristics that differ markedly from those of casual users. How can you ascertain this from the data?

Sanity checks and sensitivity analysis are pro tips familiar to every seasoned machine learning engineer.  How can you quickly determine whether your model is unbiased? How can you quickly determine how your model will perform on various segments of your user base? 

Prediction Accuracy vs Coverage. Students are usually taught to calculate prediction accuracy by averaging over the entire sample space. One can surface opportunities to improve prediction accuracy by segmenting the space. One can also determine the effective coverage of a model. The coverage is the portion of novel data for which the prediction accuracy exceeds a desired threshold. A model might do extremely well for certain data, but be no better than a random guess on other data.

When presented with a large amount of raw data that belongs to two different classes, the analyst might initially believe that there is no discernible statistical difference between the two classes of data. However, if you know how to look more carefully, you can identify the differences. Moreover, you can determine whether these effects are strong enough to automate a classification task using machine learning.
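
The distinction between overall accuracy and coverage can be sketched in a few lines of plain Python. The segment names, the toy predictions, and the helper names `accuracy_by_segment` and `coverage` are assumptions made for illustration. Coverage is computed here as the share of records that fall in segments whose accuracy exceeds the threshold.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, true_label, predicted_label).

    Returns {segment: (accuracy, record_count)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, truth, pred in records:
        totals[segment] += 1
        hits[segment] += int(truth == pred)
    return {s: (hits[s] / totals[s], totals[s]) for s in totals}

def coverage(records, threshold):
    """Fraction of records in segments whose accuracy exceeds the threshold."""
    per_segment = accuracy_by_segment(records)
    total = sum(n for _, n in per_segment.values())
    covered = sum(n for acc, n in per_segment.values() if acc > threshold)
    return covered / total

# Toy predictions (hypothetical): the model is strong on active users
# and no better than a coin flip on casual users.
records = (
    [("active", "rabbit", "rabbit")] * 9
    + [("active", "rabbit", "duck")] * 1
    + [("casual", "duck", "duck")] * 5
    + [("casual", "duck", "rabbit")] * 5
)

overall = sum(t == p for _, t, p in records) / len(records)
per_segment = accuracy_by_segment(records)
covered_share = coverage(records, threshold=0.8)
```

With this toy data the overall accuracy hides the split: one segment is well above the threshold, the other is at chance, and only half of the records are covered.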

What pro tips are taught?

  • Learners may be inclined to use too much data when training a machine learning model. They might give short shrift to dimensionality reduction.
  • Learners may be inclined to impose their own preconceptions on the data, rather than automating analysis and letting the data speak for itself.
  • Learners might be aware of how to tune hyperparameters, but they might not be aware of how data selection can be used to optimize model performance.
  • Learners might not know how to use simple forms of sanity checks and sensitivity analysis to more quickly ascertain whether their model is valid.

What datasets will be used?

Two datasets are used. One is for guessing user demographic class membership from usage log data. The second is a text corpus. 

The usage log dataset is special. It closely emulates data commonly found in production settings but difficult to find in published data sources. The dataset has been carefully created to emulate real demographic data. Such data is rarely if ever published. One reason is to protect proprietary secrets. Another is to avoid privacy leaks.

This dataset looks and behaves very much like real data. This allows practicing real techniques that are easily transferable to proprietary data you might encounter in a production setting.

The data will be transformed several times along the way for the various stages of the task. So, while the user classification project is based on a single dataset, the data will be manipulated such that it can seem like a completely different dataset at various stages along the way.

Although the dataset closely emulates the statistical qualities of a real-world demographic dataset, rather than use real-world labels we instead use a comical hypothetical scenario. However, make no mistake, the data is very real in that it captures very realistic qualities of data encountered in a production setting.

This approach has some benefits.  Firstly, it prevents us from being biased by unwarranted assumptions. You may encounter a dataset and make assumptions about it based on your own experience. Instead, you should be data-driven.

Using this hypothetical scenario also emphasizes the generality of this technique. This approach can be used to guess gender, political affiliation, retiree vs teenager, homeowner vs renter, cancerous vs healthy, or hotdog vs non-hotdog, just to name a few.

Additionally, this allows us to have a bit of levity along the way.

The primary dataset labels its two classes as "rabbit" and "duck".  At first, they look alike, but when the learner knows what to look for, they are easy to tell apart.

The text corpus dataset is a standard dataset commonly used to teach or demonstrate text processing. In our case, we are going to use it to demonstrate sequence prediction. The text corpus is large enough that it poses realistic constraints on our algorithms. It is small enough that training can still be done quickly. In a production situation, the actual text corpus may be orders of magnitude larger. However, the underlying concepts taught here still apply.

We use the text corpus as a stand-in for user session data. Examples of session data include sequences of song identifiers, topic ids, hashtags, URLs, and any other type of identifier that is consumed in a sequential manner. The task is predicting the last item in a sequence.
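
As a rough illustration of the last-item task, the sketch below predicts the final token of a sequence from the token just before it, using simple bigram counts. The course description does not say which model is used, so this first-order approach, the toy sentences, and the names `train_bigrams`, `predict_last`, and `last_item_accuracy` are assumptions.

```python
from collections import Counter, defaultdict

def train_bigrams(sequences):
    """Count, for each token, which tokens follow it and how often."""
    following = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            following[prev][nxt] += 1
    return following

def predict_last(prefix, following, default=None):
    """Guess the next item as the most frequent follower of the prefix's last token."""
    if not prefix or prefix[-1] not in following:
        return default
    return following[prefix[-1]].most_common(1)[0][0]

def last_item_accuracy(sequences, following):
    """Hide the last item of each sequence and check whether it is predicted."""
    usable = [s for s in sequences if len(s) >= 2]
    hits = sum(predict_last(s[:-1], following) == s[-1] for s in usable)
    return hits / len(usable) if usable else 0.0

# Toy corpus (hypothetical); each sentence plays the role of a user session.
train = [s.split() for s in (
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
)]
held_out = [["a", "cat", "sat", "on"], ["the", "dog", "ate"]]

model = train_bigrams(train)
guess = predict_last(["the", "dog", "sat"], model)
accuracy = last_item_accuracy(held_out, model)
```

The same functions work unchanged if the tokens are song identifiers or URLs instead of words, which is the point of using text as a stand-in.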

We could have instead created a dataset that closely mimics the statistical characteristics of a production session log dataset.  However, using a text corpus has some advantages. Working with text makes this more intuitive. It is easier to understand what is going on when working with sequences of tokens that correspond to sentences of words rather than opaque identifiers.

Author

Mark Plutowski

Data Professional
Has 20 years of experience in data-driven analysis and machine learning development. Mark has a Ph.D. in Computer Science from UCSD and a Master of Science in Electrical and Computer Engineering from USC. Mark worked at IBM, Sony, and Netflix, generating 29 patents. Mark is an experienced educator, having published courseware on multiple online...

School

Mark Plutowski's School

Requirements

  • You should have SQL installed on your PC/Mac
  • You should have knowledge of Python lambdas
  • You should have PySpark and Python installed on your PC/Mac

One-time Fee: $69.99 (List Price: $99.99; You save: $30)

What's Included

Language: English
Level: Intermediate
Skills: Data Analysis, Python, Pyplot, Apache Spark, Matplotlib, Machine Learning, Spark SQL, Spark.ML
Age groups: All ages
Duration: 3 hours 12 minutes
17 Videos
4 Documents