Topic #19 – Machine Learning (Actually)¶
How did…¶
… Nate Silver predict the results of the 2012 US Presidential election?
… Amazon know what products heavy computer gamers like to purchase? (NSFW)
… Target know that this girl was pregnant before her parents did?
… Google build a machine capable of teaching itself to recognize cats on YouTube?
This list can go on forever.
Warning
It might be safer to use Colab, IPython, or Spyder for this. By all means, try PyCharm, if it doesn’t work there, try Spyder.
Machine Learning¶
We’re about to jump about threeish years ahead in your CS education.
There is a very rich, very old (by CS standards) field of computer science called Artificial Intelligence
One small corner of this vast field is an area called Machine Learning
Normally, you’d learn a whole bunch of basic CS. Both theoretical and applied.
Then you’d take a couple of general AI course.
Then you’d take a specialized course in machine learning.
If we wanted to do this right, we’d need to learn about:
AI (of course)
The theory of computation
Complexity theory
Advanced algorithms & Data structures
Linear Algebra
Multivariable calculus
Multivariate statistics (lots of stats, actually)
Even more stats
Think you’ve got enough stats? NO! MOAR STATS!
Signal Processing
Information Theory
…
But that’d take too long, so…
We’re going to skip straight to the last step.
SRSLY?
Yes. Machine learning is now too important for me not to show it to you.
What you can expect:
A very superficial introduction to ML
You’ll have some ideas about how to apply specific ML techniques and what they can tell you about data.
You should feel comfortable to begin exploring
scikitlearn
after working through this class.Everything is pretty much going to be tiny wizards and magic.
Hopefully you get excited enough about what these techniques can do to take the time to learn the details properly.
In order to avoid getting bogged down in detail, I’m going to play fast and loose with some definitions and concepts. Sorry (or not, depending on your perspective).
You’ll be able to turn your life up to an 11!
scikitlearn¶
Lucky for us, Python has a whole whack of ML libraries (including many specialized for particular fields).
We’re going to use scikitlearn as it is relatively fullfeatured and easy to use.
Requires Supervision¶
 Very broadly speaking, there are two types of ML (there are more, but we’re keeping it simple):
Supervised learning – you have a bunch of labeled training data and you want to build a program that will learn to generalize the training data so that it can classify new inputs
Unsupervised learning – you have a bunch of unlabeled data and you want to answer the question: “Does any of this stuff look like any of the other stuff?”. You want a program that will divide your dataset into clusters where all of the data items in the same cluster are similar to each other in some way.
There are many algorithms for both types of learning and new ones being described every day. We’re just going to barely scratch the surface here.
Activity
With your neighbours, come up with some situations in which you think you’d use supervised learning and some more in which you’d use unsupervised learning.
 To speed things up, we’re going to work with a dataset built in to
scikitlearn
. Toy data to play around with.
 To speed things up, we’re going to work with a dataset built in to
 If you want to use your own data, you just load it into a 2D array.
Each row is a data point
 Each column is a feature
In ML terminology, a single observation of a property (like petal length) is called a
feature
This data set records 4 features (sepal and petal length and width) for 150 Irises of three different types (Setosa, Versicolour, and Virginica).
# Imports
import numpy
import sklearn
import sklearn.datasets
# Loads a *dataset*
iris = sklearn.datasets.load_iris()
# Get's the recorded observations
# These are 150 observations of
# flowers' sepal and petal
# length and width
X = iris.data
# Print out the shape of the data
print(X.shape)
The dataset we loaded came with labels already classifying the Irises:
# Gets the labels for the flowers
# This is like, which classification
# the observation is
y = iris.target
print(y.shape)
 So
X
now contains feature vectors for 150 irises andy
contains the known truth about what type each iris is. The 0th thing in X corresponds to the 0th label in y
The 1st thing in X corresponds to the 1st thing in y
…
The nth thing in X corresponds to the nth thing in y
 So
 By the way, X and y are common labels:
X means independent variables
y means dependent variable.
Activity+++
Write a function called what_type_is_this(x)
that will take one observation of a iris (sepal and petal length and width) and it will return which classification of iris it is.
Also, don’t actually do this. Just ask yourself how you would do it. Imagine the number of if
statements.
Good effin’ luck writing that function.
Here’s the thing though. Wouldn’t it be awesome if we could have python come up with that function for us?
Supervised: kNearest Neighbours¶
 Imagine we do this:
For each row in our training set
data
, plot the 4 features (lengths) in a 4D space.When we get a new iris, we also plot it in the 4D space.
Find the
k
closest points to the new point we just plotted.Whatever iris type the majority of those points came from… that’s our guess for the new iris.
Let’s go through it on the board, with a 2D feature space.
Now let’s automate this with scikit, where we aren’t limited to 2D (and by our own growing boredom at plotting points).
# Import the thing we need
# to make a Knn classifier
import sklearn.neighbors
# Make the Knn OBJECT
knn = sklearn.neighbors.KNeighborsClassifier()
 We’re all set to train this KNN classifier on our data with the labels we have.
 We’re going to basically say:
Hey, algo, look at this data and learn what the rules are to correctly identify all these things
# This trains the knn on the
# data (X) and labels (y)
knn.fit(X, y)
That’s literally it. That’s how easy
scikitlearn
makes ML for you.knn
is now a knearest neighbours classifier for irises.Let’s try it. When we get a new iris for which we want to predict the class, we use:
# new_iris_matrix will be like a
# list of things we want to predict.
# So, ultimately like a list of lists
knn.predict(new_iris_matrix)
Activity
Pick some random irises from your data
set and attempt to classify them.
Check the answer using your known labels in y
.
For example:
# We need to wrap our one
# question (X[50]) into a list
# because predict wants a list
# of lists to predict
pred = knn.predict([X[50]])
Are they the same?
pred[0] == y[50]
What if we don’t want just a classification. What if we want to quantify how accurate our model is?
# This will see what percent
# accuracy the model gets
# with known data X and labels y
acc = knn.score(X,y)
print(acc)
Hey, that’s pretty good! Or maybe not.
What atrocity have we committed in our analysis of the classifier?
import sklearn.model_selection
# Split the X and y into a
# training and testing set
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.25)
Let’s try again, the right way
knn = sklearn.neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(acc)
Although simple, kNN is a pretty decent estimator… for datasets with small feature vectors.
In general, as the size of your feature vector grows linearly, the size of the training set required to make a good estimator grows exponentially.
Intuitively, is it easier to “fill in”: a line, a plane, or a cube? .
Supervised: Support Vector Machines (SVM)¶
Let’s go back and look at a simple plotting of our data (reduced to 2D for convenience).
 Maybe I could do this:
draw lines that separate regions of the plane that all contain the same type of iris.
treat those lines as absolute partitions of the plane.
when I get a new iris, plot it on the plane, and label it according to whatever partition it falls in.
Let’s try on the board again.
(In general, of course, our feature vectors will be higherdimensional… in which case just substitute the word ‘line’ with ‘hyperplane’. The idea is exactly the same: partition the space).
This idea leads to the Linear Support Vector Machine.
This is a bit more complex than the kNN classifier but, fortunately for us, it’s just as easy to use:
# Import and make a support vector
# classifier object
import sklearn.svm
svc = sklearn.svm.SVC(kernel='linear')
# Train it with the training data
# we already split up
svc.fit(X_train, y_train)
# test it with the testing data
acc = svc.score(X_test, y_test)
print(acc)
Activity+
For fun, go find some of your own data. Like online, or some data from another one of your classes.
Load it up into Python and see if you can build a classifier for it.
Activity+
I showed you 2 types of supervised machine learning for classification. Go check out the zoo of other methods out there.
Now, see if you can switch out the KNN or SVC classifiers we used for one of these.
Unsupervised: Kmeans clustering¶
Supervised machine learning is good and all, but what if you don’t really have any labels.
We obviously can’t make a classifier…
… but we can still look for structure in our data.
Let’s try this. Let’s pretend for a min that we have the iris data, but no labels. Pretend we’re the first humans to ever come across a field of irises. We record the data. We load it into python. We ask ourselves… hmmm, I wonder if there’s any patterns in this data that might tell me something?
Plot all of our datapoints on the plane.
 Guess the number of clusters we’re looking for. Let’s use the fact that we know there are 3 types of iris and pick 3 clusters.
This is totally cheating in this case.
Randomly place 3 “means” on the plane.
 Repeat the following until convergence:
Associate each data point to the nearest “mean”.
Compute the centroid of all of the points attached to each “mean”.
Move the position of the “mean” to this centroid.
There, that’s basically Kmeans clustering.
import sklearn.cluster
k_means = sklearn.cluster.KMeans(3)
k_means.fit(X)
We can now ask
k_means
what group each data point belongs to.
labels = k_means.labels_
print(labels)
 Two big questions for you:
Do the actual label numbers here matter?
Did we break the whole training/testing rule here?
Activity+
Pretending you don’t have access to labels
, what, if anything, does this result tell you?
Try visualizing your results.
Activity+
Quantify how good of a job kmeans clustering did of grouping together irises of the same type. To do this, you’ll need to bring in your “ground truth” labels
.
CrossValidation¶
 One of the things you learned above was the importance of proper crossvalidation of machine learning results.
training/testing
Because this is so important, scikitlearn has several built in crossvalidation generators that will slice your data into test and training sets for you… and then do the testing and training.




Split it K folds, train on K1, test on leftout 
Make sure that all classes are even across the folds 
Leave one observation out 
Takes a label array to group observations 
More generally, there is a whole set of tools to help with Model Selection .
The Zoo¶
This has been a (very) meagre taste of ML.
There is a whole zoo of Supervised and Unsupervised learning methods, with new ones being published every day.
Although the techniques we just looked at are ‘simple’, they are by no means insignificant!
scikitlearn has a pretty decent collection of the major algorithms, and a unified interface that makes it easy to try different options with minimum effort.
(And, like any good Python package, has a nice gallery ).
It is, however, by no means complete.
ML is a very powerful tool, especially in an age where we produce more data than is possible to analyze by hand.
Like any powerful tool, it’s also really easy to misuse.
 If you want to use ML in your research, you owe it to yourself to learn more. A couple of pointers to start you off:
Andrew Ng offers a ML course on Coursera. It’s awesome. If you want to use ML, take this course and do all the assignments.
If you really want to learn ML, get Chris Bishop’s Book . It starts from basic probability theory and goes from there. It is comprehensive, it is rigorous… it is not easy to read.
Activity
 Break into small groups. Identify a problem that you think could be solved well with machine learning. Specifically, you should be able to answer:
What is the data source?
What do you hope to learn from the data?
What ML approach(es) will allow you to do so?
How would you gather your data? Store it? Implement the ML step?
What approach would you take to analyzing your results?
What impact would your results have?