Algorithms of the Intelligent Web, Second Edition
Douglas McIlwraith, Haralambos Marmanis, and Dmitry Babenko
Manning Publications


For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]
© 2016 by Manning Publications Co. All rights reserved.
This book is dedicated to Elly,
contents

foreword
preface
acknowledgments
about this book

1  Building applications for the intelligent web
   1.1  An intelligent algorithm in action: Google Now
   1.2  The intelligent-algorithm lifecycle
   1.3  Further examples of intelligent algorithms
   1.4  Things that intelligent applications are not
        Intelligent algorithms are not all-purpose thinking machines · Intelligent algorithms are not a drop-in replacement for humans · Intelligent algorithms are not discovered by accident
   1.5  Classes of intelligent algorithm
        Artificial intelligence · Machine learning · Predictive analytics
   1.6  Evaluating the performance of intelligent algorithms
        Evaluating intelligence · Evaluating predictions
   1.7  Important notes about intelligent algorithms
        Your data is not reliable · Inference does not happen instantaneously · Size matters! · Different algorithms have different scaling characteristics · Everything is not a nail! · Data isn't everything · … variable · Generalization is the goal · Human intuition is problematic · Think about engineering new features · Learn many different models · Correlation is not the same as causation
   1.8  Summary

2  Extracting structure from data: clustering and transforming your data
   2.1  Data, structure, bias, and noise
   2.2  The curse of dimensionality
   2.3  K-means
        K-means in action
   2.4  The Gaussian mixture model
        What is the Gaussian distribution? · Expectation maximization and the Gaussian distribution · The Gaussian mixture model · An example of learning using a Gaussian mixture model

4  Classification: placing things where they belong
   4.1  The need for classification
   4.2  An overview of classifiers
        Structural classification algorithms · Statistical classification algorithms · The lifecycle of a classifier

   6.4  Multilayer perceptrons
        Training using backpropagation · Activation functions · … backpropagation · Backpropagation theory · MLNN in scikit-learn · A learned MLP
   6.5  Going deeper: from mult…
foreword

The World Wide Web is the underlying infrastructure … share my enjoyment when you read it. More important, I hope that after you finish reading, you find yourself equipped with the skill and knowledge to make the web more intelligent.

Professor
Director, Data Science Institute
Imperial College London
preface

We consider ourselves lucky to be working in one of the most exciting fields of technology that our time has to offer. In a few short years, we've moved from a nascent … technique over a mathematics-heavy approach. This is a tricky line to tread, and we hope that we've done it well.

As you'll see, the book is arranged into eight core chapters, each covering an important area of the intelligent web from an algorithmic standpoint. The book ends with an appendix that looks at the intelligent web from a data-processing vantage. We've included it so that, as a practitioner in the space, you can appreciate how crucial (and difficult) it is to move high-velocity data around a system effectively.
acknowledgments

Thanks to everyone at Manning who made this book possible: publisher Marjan Bace …

First and foremost, I want to thank my fiancée, Elly. Darling, your love, patience, and support are much-appreciated constants in my life, and I'm certain that without them, this book would never have been written. I love you.

Second, I'd like to thank my parents and family, who have always fostered my curiosity and supported me when I've foundered. I hope you enjoy this book and realize how grateful I am for your nurture and care.

Third, I'd like to mention my friends and colleagues, who are too numerous to name. I've been lucky enough to work and play alongside some extraordinary people, and this has made every day a pleasure. Thank you all.

I'd also like to thank both of my editors, Jeff Bleiel and Jennifer Stout. You've been …

I wholeheartedly thank my cherished wife, Aurora, and our three sons: Nikos, Lukas, and Albert, the greatest pride and joy of my life. I'll always be grateful for their love, patience, and understanding. The incessant curiosity of my children has been a continuous inspiration for my studies on learning. A huge acknowledgment is due to my parents-in-law, Cuchi and Jose; my sisters, Maria and Katerina; and my best friends, Michael and Antonio, for their continuous encouragement and unconditional support.

I'd be remiss if I didn't acknowledge the manifold support of Drs. Amilcar Avendaño and Maria Balerdi, who taught me a lot about cardiology and funded my early work on learning. My thanks also are due to Professor Leon Cooper, and many other amazing people at Brown University, whose zeal for studying the way that our brain works trickled down to folks like me and instigated my work on intelligent applications.

To my past and present colleagues, Ajay Bhadari, Kavita Kantkar, Alexander Petrov, Kishore Kirdat, and many others, who encouraged and supported all the intelligence-related initiatives at work: there are only a few lines that I can write here, but my gratitude is much larger than that.

Haralambos Marmanis

First and foremost, I want to thank my beloved wife Elena.

I'd also like to thank all of my past and present colleagues, who influenced my professional life and served as an inspiration: Konstandin Bobovich, Paul A. Dennis, Keith Lawless, and Kevin Bedell.

Dmitry Babenko
about this book

Algorithms of the Intelligent Web was written to provide you with a roadmap to the design and creation of intelligent algorithms. It draws on many areas of computer science, including machine learning and artificial intelligence, but has been written with the practitioner in mind. It's a cookbook of sorts and provides the relative newcomer in this field with several real-world, worked examples that can be modified for your own purpose.

This book is firmly aimed at those who are new to writing intelligent algorithms but who have a firm grasp of programming and basic math and statistics. Wherever possible, we've tried to write the book so the mathematical rigor can be glossed over, leaving you with an overall impression of the applicability of an approach. Of course, if you're more mathematically inclined, we encourage you to follow the detail. Ideally, readers of this book should have had at least some exposure to a programming language and have attended a first-year undergraduate mathematics course.

This book is organized into eight chapters and one appendix:

Chapter 1 provides an overview of intelligent algorithms and outlines some key characteristics. It also provides you with guiding principles for the rest of the book.

Chapter 2 discusses the concept of structure within your data. In particular, it introduces the concept of the feature space. In this chapter, we introduce expectation maximization and eigenvectors.

Chapter 3 introduces recommender systems. We introduce collaborative filter…
Math conventions

In the text, you'll find many equations that support the code and concepts introduced. We follow a standard convention for the notation throughout. Matrices are represented by upright, capital, bold characters; upright, lowercase, bold characters represent vectors; and scalar values are expressed using lowercase italic.

About the authors

Douglas McIlwraith earned his first degree at Cambridge in computer science …
About the cover illustration

The illustration on the cover of Algorithms of the Intelligent Web, Second Edition, is taken from a French book of dress customs, Encyclopédie des Voyages by J. G. St. Sauveur, published in 1796. Travel for pleasure was a relatively new phenomenon at the time, and illustrated guides such as this one were popular, introducing both the tourist and the armchair traveler to the inhabitants of other far-off regions of the world, as well as to the more familiar regional costumes of France and Europe.

The diversity of the drawings in the Encyclopédie des Voyages speaks vividly of the uniqueness and individuality of the world's countries and peoples just 200 years ago. This was a time when the dress codes of two regions separated by a few dozen miles identified people uniquely as belonging to one or the other, and when members of a social class or a trade or a tribe could be easily distinguished by what they were wearing. This was also a time when people were fascinated by foreign lands and faraway places, even though they couldn't travel to these exotic destinations themselves.

Dress codes have changed since then, and the diversity by region, so rich at the time, has faded away. It's now often hard to tell the inhabitants of one continent from another. Perhaps, trying to view it optimistically, we've traded a world of cultural and visual diversity for a more varied personal life, or a more varied and interesting intellectual and technical life.

We at Manning celebrate the inventiveness, the initiative, and the fun of the computer business with book covers based on native and tribal costumes from two centuries ago, brought back to life by pictures like those from this travel guide.
1  Building applications for the intelligent web

This chapter covers
- Recognizing intelligence on the web
- Types of intelligent algorithms

The intelligent web means different things to different people. To some it represents the evolution of the web into a more responsive and useful entity that can learn from and react to its users. To others it represents the inclusion of the web into many more aspects of our lives. To me, far from being the first iteration of Skynet, in which computers take over in a dystopian future, the intelligent web is about designing and implementing more naturally … applications that make our online experiences better in some quantifiable way. There's a good chance that every reader has encountered machine intelligence on many separate occasions, and this chapter will highlight some examples so that you'll be better equipped to recognize these in the future. This will, in turn, help you understand what's really happening under the hood when you interact with intelligent applications again.
Now that you know this book isn't about writing entities that will try to take over the world, we should perhaps discuss some other things that you won't find within these pages! First, this is very much a back-end book. In these pages, you won't learn about beautiful interactive visualizations or platforms. For this, we refer you to excellent publications by Scott Murray, David McCandless, and Edward Tufte. Suffice to say that we don't have space here to do this topic justice along with what we're about to cover. Also, this book won't teach you statistics; but to gain the most from the book, we'll assume you have a 101 level of knowledge or higher, in that you should have at least taken a course in statistics at some point in the past.

This is also not a book about data science. …

Scott Murray, Interactive Data Visualization for the Web (O'Reilly, 2013).
David McCandless, Information Is Beautiful (HarperCollins, 2010).
Edward Tufte, The Visual Display of Quantitative Information (Graphics Press USA, 2001).
Joel Grus, Data Science from Scratch: First Principles with Python (O'Reilly, 2015).
Where possible, we reference known examples of these in the wild, so that you can test your knowledge against the behavior of such systems, and no doubt impress your friends!

Figure 1.1 Overview of an intelligent algorithm. Such algorithms display intelligence because the decisions they make change depending on the data they've received.

… granularity. In the context of figure 1.1, this is one aspect of the data that's being used to change the behavior of the algorithm. From here it's a short step to determine home and work locations. This is performed through the use of prior knowledge: that is, knowledge that has somehow been distilled into the algorithm before it started learning from the data. In this case, the prior knowledge could take the form of the following rules:

- The location most usually occupied overnight is home.
- The location most usually occupied during the day is work.
- People, in general, travel to their workplace and then home again almost every day.

Although this is an imperfect example, it does illustrate our point well: that a concept of work, home, and commuting exists within …
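To make the idea of prior knowledge concrete, here is a minimal sketch, written for this discussion rather than taken from Google Now (whose implementation is proprietary). It encodes the three rules above as a simple function; the visit records and the hour thresholds are hypothetical.

from collections import Counter

def label_locations(records):
    """records: list of (location, hour_of_day) visit observations.
    Applies the prior-knowledge rules: the location most often occupied
    overnight is home; the location most often occupied during the day is work."""
    overnight = Counter(loc for loc, hour in records if hour < 6 or hour >= 22)
    daytime = Counter(loc for loc, hour in records if 9 <= hour < 17)
    return {
        "home": overnight.most_common(1)[0][0] if overnight else None,
        "work": daytime.most_common(1)[0][0] if daytime else None,
    }

# Hypothetical visit data: (location id, hour of day)
visits = [("loc_a", 23), ("loc_a", 2), ("loc_a", 6), ("loc_b", 10),
          ("loc_b", 14), ("loc_b", 16), ("loc_a", 22)]
print(label_locations(visits))   # {'home': 'loc_a', 'work': 'loc_b'}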
… algorithm. Each solution that you create should be grounded in and built on work in the field, much of which you'll find in this book. We've introduced several key terms here in italics, and we'll refer to these in the coming chapters as we tackle individual algorithms in more depth.

1.2 The intelligent-algorithm lifecycle

In the previous section, we introduced you to the concept of an intelligent algorithm comprising a black box, taking data and making predictions from events. We also …

Ben Fry, PhD thesis, Computational Information Design (MIT, 2004).

Figure 1.2 Graphical overview of one aspect of the Google Now project. In order for Google Now to make predictions about your future locations, it uses a model about past locations, along with your current position. Priors are a way to distill initially known information into the system.

Figure 1.3 The intelligent-algorithm lifecycle

When designing intelligent algorithms, you first must acquire data (the focus of the appendix) and then parse and clean it, because it's often not in the format you require. You must then understand that data, which you can achieve through data exploration and visualization. Subsequently, you can represent that data in more appropriate formats (the focus of chapter 2). At this point, you're ready to train a model and evaluate the predictive power of your solution. Chapters 3 through 7 cover various models that you might use. At the output of any stage, …
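Purely as an illustration of the lifecycle's shape, the stages can be sketched as a handful of placeholder functions. The function names and the toy data below are ours, not part of any library, and every stage would be far more involved in a real project.

# A minimal, hypothetical sketch of the intelligent-algorithm lifecycle.
def acquire_data():            # acquire (see the appendix)
    return [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]]

def parse_and_clean(raw):      # parse and clean
    return [row for row in raw if all(v is not None for v in row)]

def explore(data):             # explore and visualize
    print("n =", len(data), "first row:", data[0])

def represent(data):           # represent in a more appropriate format (chapter 2)
    return [[x / 10.0 for x in row] for row in data]   # toy rescaling

def train(features):           # train a model (chapters 3 through 7)
    return {"mean": [sum(col) / len(col) for col in zip(*features)]}

def evaluate(model, features): # evaluate predictive power
    return len(features)       # placeholder score

data = parse_and_clean(acquire_data())
explore(data)
features = represent(data)
model = train(features)
print("model:", model, "score:", evaluate(model, features))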
… that you should withdraw your savings and start trading based on this company's …

… algorithms aren't well suited to such domains without process simplification and formalization.

Let's draw a simple but effective parallel with the automation of the motor-vehicle assembly line. In contrast to the early days of assembly-line automation pioneered by …

Figure 1.4 Taxonomy of intelligent algorithms (artificial intelligence, machine learning, and predictive analytics)
1.5.1 Artificial intelligence

Artificial intelligence, widely known by its acronym AI, began as a computational field around 1950. Initially, the goals of AI were ambitious and aimed at developing machines that could think like humans. Over time, the goals became more practical and concrete as the full extent of the work required to emulate intelligence was uncovered. To date, there are many definitions of AI. For example, Stuart Russell and Peter Norvig describe AI as "the study of agents that receive percepts from the environment and perform actions," whereas John McCarthy states, "It is the science and engineering of making intelligent machines, especially intelligent computer programs." Importantly, he also states that "intelligence is the computational part of the ability to achieve goals in the world."

In most discussions, AI is about the study of agents (software and machines) that have a series of options to choose from at a given point and must achieve a particular goal. Studies have focused on particular problem domains (Go, chess, and the game show Jeopardy!), and in such constrained environments, results are often excellent. For example, IBM's Deep Blue beat Garry Kasparov at chess in 1997, and in 2011 IBM's Watson received the first-place prize of $1 million on the U.S. television show Jeopardy! Unfortunately, few algorithms have come close to faring well under Alan Turing's imitation game (otherwise known as the Turing test), which is considered by most the gold standard for intelligence. Under this game (without any visual or audio cues, that is, typewritten answers only), an interrogator must be fooled by a machine attempting to act as a human. This situation is considerably harder, because the machine must possess a wide range of knowledge on many topics; the interrogator isn't restricted in the questions they may ask.
1.5.2 Machine learning

Machine learning (ML) refers to the capability of a software system to generalize based on past experience. Importantly, ML is about using these generalizations to provide answers to questions relating to previously collected data as well as data that hasn't been encountered before. Some approaches to learning create explainable models: a layperson can follow the reasoning behind the generalization. Examples of explainable models are decision trees and, more generally, any rule-based learning method. Other algorithms, though, aren't as transparent to humans; neural networks and support vector machines (SVMs) fall in this category.

Herbert Simon, The Shape of Automation for Men and Management (Harper & Row, 1965).
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1994).
John McCarthy, "What Is Artificial Intelligence?" (Stanford University, 2007), http://www-formal.stanford.edu/jmc/whatisai.
Bruno Bouzy and Tristan Cazenave, "Computer Go: An AI Oriented Survey," Artificial Intelligence (Elsevier) 132, no. 1 (2001): 39-103.
Murray Campbell, A. J. Hoane, and Feng-hsiung Hsu, "Deep Blue," Artificial Intelligence (Elsevier) 134, no. 1 (2002): 57-83.
You might say that the scope of machine learning is very different than that of AI. AI looks at the world through the lens of agents striving to achieve goals (akin to how a human agent may operate at a high level in their environment), whereas ML focuses on learning and generalization (more akin to how a human operates internally). For example, ML deals with such problems as classification (how to recognize data labels given the data), clustering (how to group …

Figure 1.5 The machine-learning data flow. Data is used to train a model, which can then be applied to previously unseen data. In this case, the diagram illustrates classification, but the diagrams for clustering and regression are conceptually similar.
You may ask, "How is this different from ML, which is also about prediction?" This is a good question! In general, ML techniques focus on understanding and generalizing the underlying structure and relationships of …

… card, mobile phone, or mortgage, retailers are exposing themselves to an element of risk (and also an element of reward!). In order to balance this, retailers want to know that they're mostly providing credit to credit-worthy individuals, while rejecting those who are more likely to default on their payments. In practice, this decision is referred to specialist credit agencies that will, …
1.6 Evaluating the performance of intelligent algorithms

Thus far, we've talked about the high-level classes of intelligent algorithms and provided several motivating examples. But how does the intelligent-algorithm practitioner evaluate their algorithm? This is of great importance for several reasons. First, without an objective evaluation, it's not possible to track performance, and so it's impossible to know if modifications have improved operation. Second, if you can't measure performance, then justification becomes difficult. From a business perspective, managers and technologists will always try to measure value against cost, and being able to solidly evaluate your solution will help keep it in production!

Let's now return to each class of intelligent algorithm and discuss strategies for performance measurement. Although we'll touch on how to evaluate intelligence, this section is largely about evaluating predictions (that is, for machine learning and predictive analytics). Evaluating (and defining) intelligence is a big subject and could fill a book like this on its own! Rather than delve into this topic, we refer you to Linda Gottfredson and Alan Turing.

1.6.1 Evaluating intelligence

In the preceding text, we mentioned Turing's imitation game, but there may be a call to evaluate systems that are less grand in their approach to intelligence: for example, intelligent systems that play chess or comp…

Linda S. Gottfredson, "Mainstream Science on Intelligence: An Editorial with 52 Signatories, History, and Bibliography," Intelligence 24, no. 1 (1997).
Table 1.1 Ground truth vs. prediction

      ...   Ground truth   Prediction
104   ...   True           False
207   ...   True           True
56    ...   False          False
12    ...   False          True
Table 1.2 Performance metrics used in the evaluation of intelligent algorithms

Metric                       Calculation
True positive rate (TPR)     Predicted true positives / ground truth true positives
True negative rate (TNR)     Predicted true negatives / ground truth true negatives
False positive rate (FPR)    1 - true negative rate
False negative rate (FNR)    1 - true positive rate
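As a quick illustration of table 1.2, the sketch below computes the four rates directly from paired lists of ground-truth and predicted labels. The labels are made up for this example; with scikit-learn you could equally derive the counts from sklearn.metrics.confusion_matrix.

ground_truth = [True, True, False, False, True, False, False, True]
predicted    = [False, True, False, True, True, False, False, False]

tp = sum(1 for g, p in zip(ground_truth, predicted) if g and p)
tn = sum(1 for g, p in zip(ground_truth, predicted) if not g and not p)
fp = sum(1 for g, p in zip(ground_truth, predicted) if not g and p)
fn = sum(1 for g, p in zip(ground_truth, predicted) if g and not p)

tpr = tp / (tp + fn)   # predicted true positives / ground-truth positives
tnr = tn / (tn + fp)   # predicted true negatives / ground-truth negatives
fpr = 1 - tnr          # false positive rate
fnr = 1 - tpr          # false negative rate
print(tpr, tnr, fpr, fnr)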
So far, so sensible! Be aware, however, that this may not always be so easy. Take, for example, the issue of credit scoring. When you provide a score, you may not always be able to follow up on the behavior of that user in the future. Also, in these cases you're usually not attempting to provide the answer but rather some indicator that's correlated with a target variable (such as creditworthiness). Although the ROC curve can be used in these cases, there can be a few more hoops to jump through to get there.
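If your model produces a score (such as a credit score) rather than a hard yes/no answer, scikit-learn can trace the ROC curve for you. The labels and scores below are invented purely for illustration.

from sklearn.metrics import roc_curve, auc

# Hypothetical ground-truth labels (1 = creditworthy) and model scores.
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("Area under the ROC curve:", auc(fpr, tpr))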
1.7 Important notes about intelligent algorithms

Thus far we've covered a lot of introductory material. By now, you should have a fairly good, albeit high-level, understanding of intelligent algorithms and how you're going to use them. You're probably sufficiently motivated and anxious to dive into the details. We won't disappoint you. Every following chapter is loaded with new and valuable code. But before we embark on our journey into the exciting and (for the more cynical among us) financially rewarding world of intelligent applications, we'll present some useful things to know, many of which have been adapted from Pedro Domingos's excellent and easy-to-read paper on the subject. These should help guide you as you progress through this book and, later on, the field of intelligent algorithms.

1.7.1 Your data is not reliable

Your data may be unreliable for many reasons. That's why you should always examine whether the data you'll work with can be trusted before you start considering specific intelligent-algorithmic solutions to your problem. Even intelligent people who use very bad data will typically arrive at erroneous conclusions. The following is an indicative …

Pedro Domingos, "A Few Useful Things to Know About Machine Learning," Communications of the ACM 55, no. 10 (2012): 78-87.
Your data may not be normalized. Let's say that you're looking at the weight of …

… versions, but you should investigate this from …

Judea Pearl, Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann, 1988).
1.7.9 Human intuition is problematic

As the feature space grows, there's a combinatorial explosion in the number of values the input may take. This makes it extremely unlikely that for any moderate-sized fea…

Domingos, "A Few Useful Things to Know About Machine Learning."
1.8 Summary

- We provided the 50,000-foot view of intelligent algorithms, providing many specific examples based on real-world problems.
- An intelligent algorithm is any that can modify its behavior based on incoming data.
- We provided some intelligent-algorithm design anti-patterns that will hopefully serve as a warning flag for practitioners in the space.
- You've seen that intelligent algorithms can be broadly divided into three categories: artificial intelligence, machine learning, and predictive analytics. We'll spend the most time on the latter two in this book, so if you glanced over these sections, it would be worth rereading them to ensure good foundational knowledge.
2  Extracting structure from data: clustering and transforming your data

This chapter covers
- Features and the feature space
- Expectation maximization, a way of training algorithms
- Transforming your data axes to …

… ages correlate with higher salaries? Is a larger percentage of the wealth present in a smaller percentage of the population? If found, these generalizations can be extracted directly either to provide evidence of a pattern …

Figure 2.1 Visualizing structure in data. x- and y-axis values are arbitrary and unimportant. This data shows three different classes of data, all colored differently. You see that the x and y values appear to cluster around certain areas of the x, y plot dependent on their class. You also see that higher values of x correlate to higher values of y regardless of the class. Each of these properties relates to the structure of the data, and we'll cover each in turn in the remainder of this chapter.
… expectation maximization). This key concept will be revisited again in this chapter when we discuss our second clustering method using the Gaussian mixture model (GMM). Using a GMM to cluster can be seen as an extension of the k-means algorithm. It's possible to achieve exactly the same results with a GMM and k-means if you provide certain limitations to your …

>>> import numpy as np
>>> from sklearn import datasets            # Imports scikit-learn datasets
>>> iris = datasets.load_iris()             # Loads in the Iris dataset
>>> data = np.c_[iris.data, iris.target]    # Creates an array of both the features and the target labels
>>> data[0]                                 # Contents of the first row of the array displayed

… that has been recorded about the Iris dataset. You can observe that, for each row of iris.data, the features consist of the sepal length (cm), the sepal width (cm), the …
… individual species, there are a few other issues to watch out for. For example, what if the data was collected early in the flower-growing season? What if the flowers were immature and hadn't grown to their full size? This would mean the flowers are systematically different from the overall population; this is known as bias. Clusters gener…

… many measurements, there are two fundamental problems when collecting data with high dimensionality that are particularly important when finding structure. The first problem is that the large number of dimensions increases the amount of space available for spreading data points. That is, if you keep the number of data points fixed and increase the number of attributes you want to use to describe them, the density of the points in your space decreases exponentially! So, you can wander around in your data space for a long time without being able to identify a formation that's preferable to another one.

The second fundamental problem has a frightening name. It's called the curse of dimensionality. In simple terms, it means if you have any …

Figure 2.2 The curse of dimensionality: every point tends to have the same distance from any other point.
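You can see this effect numerically with a few lines of NumPy: as the number of dimensions grows, the ratio between the farthest and nearest pairwise distances of randomly scattered points shrinks toward 1, so every point ends up roughly the same distance from every other point. The point count and dimension choices below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_points = 100
for d in (1, 2, 10, 100, 1000):
    x = rng.random((n_points, d))
    # Pairwise Euclidean distances between all points.
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(n_points, k=1)]   # unique pairs only
    print(d, "dims: max/min distance ratio =", round(dists.max() / dists.min(), 2))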
If you look at figure 2.2 from left to right, you'll see that the dimensionality increases by 1 for each drawing. We start with eight points in one dimension (x-axis) distributed …

Stuart P. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory 28 (1982): 129-137.
… mentioning that this simple one-dimensional example demonstrates the concept without loss of generalization. In reality, nontrivial patterns and structure would be found from higher-dimensional data, subject, of course, to the curse of dimensionality.

Lloyd's algorithm works by iteratively trying to determine the means of the clusters …

Figure 2.3 A number line containing two possible clusters in a single dimension. An entry indicates the existence of that number in the dataset. Data points represented by the same shape indicate one possible intuitive clustering of the data.
… represent this, new means are calculated based on the average values in that cluster: that is, the new mean for cluster k1 is given by the mean of the green data points, and the new mean for cluster k2 is given by the mean of the red data points. Thus, the new means of k1 and k2 are 2.5 and 11.2, respectively. You can see the updated means in figure 2.6. Essentially, both cluster means have been dragged up by the data points assigned to them.

Figure 2.4 Initial assignment of the means of the k clusters. The mean of cluster k1 is denoted by a triangle, and a circle denotes the mean of cluster k2.

Figure 2.5 Initial assignment of data points to clusters. Those data points assigned to cluster k1 are shaded green (diamonds), and those assigned to cluster k2 are shaded red (squares).

We now proceed with the second iteration. As before, for each data point, we assign it to the cluster with the closest centroid. All of our points below approximately 6.8 are now assigned to cluster k1, and all of our points above this number are assigned to cluster k2. If we continue as before and calculate the new means of the clusters, this gives centroids of 3.5 and 12.8. At this point, no number of additional iterations will change the assignment of the data points to their clusters, and thus the cluster means won't change. Another way of saying this is that the algorithm has converged.

The same nomenclature is used when learning the parameters for any algorithm. When a cost function is minimized, practitioners say that the training algorithm has converged. In this book, we'll mostly rely on scikit-learn libraries to reach convergence for us during training.

Figure 2.6 Updated means given the new cluster assignment. Cluster centroids have been updated.

Figure 2.7 Final assignment of the data points to the clusters. Final cluster centroids are given by the triangle and circle, respectively. Data points below 6.8 have been assigned to cluster k1, and data points above this have been assigned to cluster k2. Additional iterations of the k-means algorithm won't change this clustering.
Note that this final assignment is exactly the means we would have chosen if we had manually performed clustering (that is, we …

Leon Bottou and Yoshua Bengio, "Convergence Properties of the K-Means Algorithms," Advances in Neural Information Processing Systems 7 (1994).

Listing 2.3 K-means pseudo-code: expectation maximization

The body of this listing alternates an expectation step (assign each point to its closest cluster mean) with a maximization step (recompute each cluster mean from the points assigned to it) until the assignments stop changing.
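As a rough sketch of the same idea, here is a tiny, self-contained implementation of Lloyd's algorithm on one-dimensional data. The data points and starting means are invented and are not the exact values behind figures 2.3 through 2.7.

def kmeans_1d(points, means, max_iter=100):
    for _ in range(max_iter):
        # Expectation step: assign each point to its closest mean.
        clusters = [[] for _ in means]
        for p in points:
            closest = min(range(len(means)), key=lambda k: abs(p - means[k]))
            clusters[closest].append(p)
        # Maximization step: recompute each mean from its assigned points.
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:          # converged: assignments are stable
            break
        means = new_means
    return means, clusters

# Hypothetical one-dimensional data with two loose groups.
points = [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]
print(kmeans_1d(points, means=[0.0, 6.0]))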
2.3.1 K-means in action

Listing 2.4 The k-means algorithm in action

Listing 2.5 Visualizing the output of k-means

The code for these listings passes in only the data (iris.data, not the labels) and the value of k, which in this case is equal to 3.
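The full code for listings 2.4 and 2.5 isn't reproduced above, but a minimal equivalent using scikit-learn's KMeans (with our own variable names) looks roughly like this:

from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data                      # only the features, not the labels

kmeans = KMeans(n_clusters=3)      # k = 3, one cluster per iris species
kmeans.fit(X)

print(kmeans.cluster_centers_)     # learned centroids
print(kmeans.labels_[:10])         # cluster assignment of the first 10 flowers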
A quick glance over the data seems to suggest that plotting petal length against …

2.4.1 What is the Gaussian distribution?

You may have heard of the Gaussian distribution …

Figure 2.9 Histogram of a normal distribution, in this case the height of 334 fictitious people. The modal (most frequently occurring) bin is centered at 180 cm.

The Gaussian (normal) distribution with mean $\mu$ and standard deviation $\sigma$ has the probability density function

$$ f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$
… deviations from the mean, we might guess … results of the cumulative density function (CDF) for the values that define the range. This works because the CDF for a given value of …

Figure 2.10 Fictional data on the height of 334 people. The sample probability of a given user occurring in a given height range is shown in red (the bar on the left in each pair). The probability from …

A brief look at figure 2.10 shows that our initial guess of 180 for the mean and 28 for the standard deviation isn't bad. The standard deviation looks reasonable … could potentially be decreased slightly. Of course, we could continue to change the …
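A sketch of that calculation with SciPy: subtracting the CDF at the lower end of a height range from the CDF at the upper end gives the probability that a randomly chosen person falls inside the range, under our guessed parameters (mean 180 cm, standard deviation 28 cm). The range itself is chosen arbitrarily here.

from scipy.stats import norm

mu, sigma = 180, 28                      # our initial guess for the height data
low, high = 170, 190                     # an arbitrary height range

p = norm.cdf(high, loc=mu, scale=sigma) - norm.cdf(low, loc=mu, scale=sigma)
print("P(170 cm <= height <= 190 cm) =", round(p, 3))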
If we assume that a user can be either male or female, then our Gaussian distribution from earlier is the result of sampling from two separate and overlapping Gaussian distributions. Rather than modeling the space using a single Gaussian, we can now use two (or more!) such distributions:

$$ p(x) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k), \qquad \sum_{k=1}^{K} \phi_k = 1 $$

This equation is extremely similar to the one earlier in the chapter, save for a few details. First, note that instead of a single Gaussian, the probability is made up from the sum of K separate Gaussian distributions, each weighted by a mixture weight $\phi_k$.
Figure 2.11 Probability distributions of height for men and women (distribution of women's heights, distribution of men's heights). Note that these probabilities are conditioned on the fact that gender is known: so, for example, given that we know a particular person is a woman, her probability of having a height …

The probabilities given on the y-axis in figure 2.11 are conditioned on the fact that we know the gender of the person. In general, for the purposes of this example, we wouldn't necessarily have that data at hand (maybe it wasn't recorded at measurement time). Thus we'd need to learn not only the parameters of each distribution but also the gender split in the sample population (…
# From listing 2.6's make_ellipses helper: scale the variance vector and draw
# an ellipse for component n, oriented along its directions of maximal variance.
v *= 9
ell = mpl.patches.Ellipse(gmm.means_[n, [x, y]], v[0], v[1],
                          180 + angle, color=color)

Ron Weiss and Gael Varoquaux, "GMM classification," http://mng.bz/uKPu.
Figure 2.12 Output from listing 2.6. Each panel shows …

… and k-means if additional constraints are placed on the learned covariance matrices. In particular, if covariance matrices for each cluster are tied together (that is, they all must be the same), and covariances across the diagonal are restricted to being equal, with all other entries set to zero, then you'll obtain spherical clusters of the same size and shape for each cluster. Under such circumstances, each point will always belong to the cluster with the closest mean. Try it and see!
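If you want to try it, the sketch below fits scikit-learn's GaussianMixture with spherical covariances alongside k-means on the Iris data and compares the resulting assignments. Note that scikit-learn's covariance_type options ('spherical', 'tied', 'diag', 'full') don't expose exactly the "tied and equal along the diagonal" constraint described above, so the two clusterings will usually be close but aren't guaranteed to be identical.

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
import numpy as np

X = datasets.load_iris().data

km = KMeans(n_clusters=3, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, covariance_type='spherical',
                      random_state=0).fit(X)

km_labels = km.labels_
gmm_labels = gmm.predict(X)

# Label ids are arbitrary, so report how often pairs of points are grouped
# together (or apart) the same way under both clusterings.
same = sum(np.sum((km_labels == km_labels[i]) == (gmm_labels == gmm_labels[i]))
           for i in range(len(X))) / (len(X) ** 2)
print("pairwise agreement between k-means and spherical GMM:", round(same, 3))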
As with k-means, training a Gaussian mixture model with EM can be sensitive to initial starting conditions. If you compare and contrast a GMM to k-means, you'll find a few more initial starting conditions in the former than in the latter. In particular, not only must the initial centroids be specified, but the initial covariance matrices and mixture weights must be specified also. Among other strategies, one approach is to run k-means and use the resultant centroids to …

A word about make_ellipses

The method make_ellipses is derived from plot_gmm_classifier, written by Ron Weiss and Gael Varoquaux for scikit-learn. By extracting the covariance matrix in the two dimensions you're plotting, you can find the first and second directions of maximal variance, along with their respective magnitudes. You then use these to orient and draw ellipses relating to the shape of the learned Gaussians. These directions and magnitudes are known as eigenvectors and eigenvalues, respectively, and will be cov…
2.6 Transforming the data axis

So far in this chapter, we've concentrated on clustering data as it's presented in its original feature space. But what if you were to change this feature space into a more appropriate one, perhaps one with fewer features that's more descriptive of the underlying structure?

K. F. Riley, M. P. Hobson, and S. J. Bence, Mathematical Methods for Physics and Engineering (Cambridge University Press, 2006).

$$ A\mathbf{v} = \lambda\mathbf{v} $$

… represent? It represents how much the shear operates in a given direction: that is, the magnitude of that operation. As such, the eigenvalues are a measure of importance of that direction. Thus, you could describe most of the operation of the shear matrix by recording only the eigenvectors with the largest eigenvalues.
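NumPy will compute the eigenvectors and eigenvalues of such a matrix directly. The matrix below is a made-up example combining a shear with a stretch, not the one used in the book's figures.

import numpy as np

A = np.array([[1.0, 0.8],     # a hypothetical matrix combining a shear with a stretch
              [0.0, 1.5]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print("eigenvalues:", eigenvalues)        # how strongly A acts along each direction
print("eigenvectors (columns):")
print(eigenvectors)

# Check the defining property A v = lambda v for the first pair.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))        # True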
2.6.2 Principal component analysis

Principal component analysis (PCA) uses the eigenvector/eigenvalue decomposition in a specific way in order to obtain the principal components …

Jonathon Shlens, "A Tutorial on Principal Component Analysis," eprint arXiv:1404.1100, 2014.
Claude Brezinski, "The Life and Work of André Cholesky," Numer. Algorithms 43 (2006): 279-288.
2.6.3 An example of principal component analysis

Now that you understand eigenvectors, eigenvalues, and …

Figure 2.13 The two principal components of a …

The code for this example initializes a PCA decomposition with two components, fits the data (solving for the eigenvectors and eigenvalues), transforms the data into the new basis (which has two dimensions, because we chose to keep two components), and then, for each iris category, plots the coordinates in the new coordinate axes in a different color.
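The book's listing isn't reproduced in full above; a minimal sketch of the same steps using scikit-learn's PCA (with our own variable names) would look roughly like this:

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

iris = datasets.load_iris()

pca = PCA(n_components=2)        # initialize a PCA decomposition with two components
pca.fit(iris.data)               # fit: solve for the eigenvectors/eigenvalues
X_new = pca.transform(iris.data) # transform into the new two-dimensional basis

for target in set(iris.target):  # plot each iris category in a different color
    mask = iris.target == target
    plt.scatter(X_new[mask, 0], X_new[mask, 1],
                label=iris.target_names[target])
plt.legend()
plt.show()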
… components. At this point, no decomposition has been performed (for a start, no data has been passed in); you merely prepare the Python objects for the next stage. The next stage is the … formed into the new data axis. The original data is now expressed as a combination of the two eigenvectors with the largest eigenvalues (because you restrict the number of components to two when initializing the PCA object), which you can now plot as per figure 2.14. It should now be fairly clear that the x-axis (y=0) and the y-axis (x=0) are the directions (in four dimensions in Iris) of the highest variance and second-highest variance, respectively. The values plotted correspond to the distance along that line of the original data. Thus, you've transformed a 4-D dataset do…

Figure 2.14 Eigenvector decomposition of the Iris …
Principal component analysis is a powerful tool that can help you to calculate a more efficient representation of the data while minimizing loss to that data's discriminatory power. Although we used this on the Iris dataset here, it can be, and often is, applied to real-world data as a first pass, as a preparatory step prior to the classification and regression problems that we'll spend more time on later in the book. Note, however, that complexity grows faster as the number of features in your dataset grows, so removing redundant and irrelevant features as a step before PCA may pay dividends.
2.7 Summary

…

3  Recommending relevant content

This chapter covers
- Understanding recommendation engines based on users, …
- Finding recommendations about friends, articles, and news stories

In today's world, we're overwhelmed with choices …

… like … or services like …, they'll be wasting their time, and you'll be annoyed. The broadcasting approach of traditional advertising methods (such as billboards, TV ads, and radio ads) suffers from that problem. The goal of broadcasting is to alter your preferences by incessantly repeating the same message. An alternative, more pleasant, …
… characteristics of genres. For a more creative twist, you could help people establish a …

Table 3.1 Ratings for the users show that Frank and Constantine … Frank and Catherine. This data was modified, with permission, from the MovieLens database.(a)

Movie                               0: Frank
0: Toy Story                        5
1: Jumanji                          4
2: Grumpier Old Men                 5
3: Waiting to Exhale                4
4: Father of the Bride Part II      5
5: Heat                             4
6: Sabrina                          5

a. GroupLens Research, MovieLens dataset, http://grouplens.org/datasets/movielens/

Listing 3.1 Calculating the similarity of users based on their ratings of movies

The annotations on this listing point out where the code reads in the data, where it prints the similarity of Frank and Constantine under both measures of similarity, and where it prints the similarity of Frank with himself under both measures of similarity.
Figure 3.1 Ratings plotted against items for pairs of users.
All distances are greater than or equal to zero. In most cases, we constrain the similarities to be nonnegative like distances. In fact, we constrain the similarities within the interval [0,1].

William James, Principles of Psychology (Holt, 1890).
Classification and Cognition (Oxford University Press, 1994).
The plots of the ratings in figure 3.1 clearly display the somewhat reciprocal nature of distance and similarity. The greater the distance between the two curves, the …
This equation states the definition of … watched three movies, and among these three their ratings differed by … (for each movie), their similarity would also be 0.25, according to the naive similarity metric. Intuitively, we expect these two users to be more similar than those who watched a single movie and whose opinions differed by three units (out of five).

The naive similarity squeezes the similarity values for small distances (because we add 1) while leaving large distances (values of the distance much larger than 1.0) unaffected. What if we add another value? The general form of the naive similarity is y = beta / (beta + x), where … tangent from 1.0, so that the final value of similarity ranges between 0.0 and 1.0, with 0.0 implying dissimilarity and 1.0 implying the highest similarity. Voilà! We've arrived at our first definition of similarity of users based on their ratings.
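A sketch of this first similarity measure: compute the Euclidean distance between the two users' ratings on their commonly rated items and squash it with beta / (beta + distance). Frank's ratings come from table 3.1; the second user's ratings here are hypothetical.

import math

def naive_similarity(ratings_a, ratings_b, beta=1.0):
    """ratings_*: dicts mapping movie id -> rating. Only commonly rated
    movies contribute to the distance."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    distance = math.sqrt(sum((ratings_a[m] - ratings_b[m]) ** 2 for m in common))
    return beta / (beta + distance)

frank = {0: 5, 1: 4, 2: 5, 3: 4, 4: 5, 5: 4, 6: 5}   # ratings from table 3.1
other = {0: 5, 1: 5, 2: 5, 3: 4, 5: 4}                # hypothetical second user
print(naive_similarity(frank, other))                 # 0.5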
The second similarity definition that we present in listing 3.2, in the case where sim_type=1, improves on the first similarity by taking into account the ratio of the common items versus the number of all possible common items. That's a heuristic that intuitively makes sense. If I've watched 30 movies and you've watched 20, we …

… formula. In other words, the extent to which we watch the same movies should somehow affect the degree of our similarity as movie enthusiasts.
3.2.2 Which is the best similarity formula?

It may be clear by now that you can use many formulas to establish the similarity between two users, or between two items, for that matter. In addition to the two similarities introduced earlier, we could've used a metric formula known as the Jaccard similarity between users, which is defined by the ratio of the intersection over the union of their item sets (or, in the case of item similarity, the ratio of the intersection over the union of the user sets). In other words, the Jaccard similari…

Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten, "Evaluating Similarity Measures: A Large-Scale Study …"

Figure 3.2 Naive similarity curves, such as y = 1 / (1 + x), as functions of the Euclidean distance

$$ \mathrm{Jaccard}(A, B) = \frac{\mathrm{intersection}(A, B)}{\mathrm{union}(A, B)} $$
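The Jaccard similarity only cares about which items the two users rated, not about the scores themselves. A minimal sketch, with hypothetical sets of movie ids:

def jaccard_similarity(items_a, items_b):
    """Ratio of the intersection over the union of two users' item sets."""
    items_a, items_b = set(items_a), set(items_b)
    union = items_a | items_b
    if not union:
        return 0.0
    return len(items_a & items_b) / len(union)

# Hypothetical sets of movie ids rated by two users.
print(jaccard_similarity({0, 1, 2, 3, 4, 5, 6}, {0, 2, 3, 5, 9}))   # 0.5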
We don't advise that you randomly choose your similarity metric, but if you're in a hurry, consider the Euclidean or the Jaccard similarity first. It should give you decent results. You should try to understand the nature of your data and what it means for two users or two items to be similar. If you don't understand the reasons why a particu…

In item-based CF, we're most interested in the similarity of items. For example, in the case of our movie recommendation example, we'd typically build a matrix of movies whereby matrix entries express some degree of similarity. Movies can then be recommended that are most similar to those that users have already seen or liked. Conversely, with user-based CF, the similarity of users is considered important. In this scenario, users may be recommended content if they're similar to those who have seen or liked a particular item. In model-based CF, neither item-item similarity nor user-user similarity is calculated explicitly, but a model is introduced to capture the …

Item-based CF vs. user-based CF

Depending on the data, there can be a distinct performance profile in the use of each …
… user similarity. We use the same data as we did to demonstrate different measures of similarity: the data given in table 3.1.

Listing 3.3 User-based CF recommendation

data = Data()
format = {'col':0, 'row':1, 'value':2, 'ids': 'int'}
data.load(dat_file, sep='::', format=format)
similarity_matrix = SimilarityMatrix()
recommend(0, 10)    # Provides the top recommendations for user 0 (Frank)
recommend(1, 10)    # Provides the top recommendations for user 1 (Constantine)
recommend(2, 10)    # Provides the top recommendations for user 2 (Catherine)

Listing 3.3 provides the high-level code to perform user-based CF. We read in the data using the load …
Listing 3.4

def recommend(user_id, top_n):
    # [(item, value), (item1, value1), ...]
    recommendations = []
    for i in itemdict.keys():
        if (int(i) not in items_reviewed(int(user_id), userdict)):
            recommendations.append((i, predict_rating(user_id, i)))
    recommendations.sort(key=lambda t: t[1], reverse=True)
    return recommendations[:top_n]

Listing 3.5 Predicting the ratings of a user
… score that they input, along with a calculation of how similar this user is to the one for whom we wish to make a prediction. We then calculate a weighted rating, which is the similarity of the two users multiplied by the rating given by this new user. In order to calculate a value that considers all the users in the system, we calculate the sum of weighted_ratings along with the sum of the similarities. The latter is used to normalize the values back to the range [0,5]. Provided that there are some users in …
Listing 3.6 The SimilarityMatrix class

The annotations on this listing note that the build method is called on instantiation, that it calculates only values of v that are larger than the current value of u (upper triangular form), where the similarity is computed, and that, due to the upper triangular form, we …

The SimilarityMatrix class encapsulates the storage, calculation, and access to the matrix that defines how similar each user is with respect to another. When it's first instantiated, it calls the build …

Listing 3.7 The RatingCountMatrix class
… those ratings. As you can see, it's not enough for two users to have both rated a movie; they also must broadly agree on its quality. RatingCountMatrix encapsulates this through the use of a matrix that captures the covariance of their ratings. For users who always agree on the ratings for the movies they've seen, we'd expect the diagonal of the matrix to be populated with higher numbers: that is, where the two ratings are equal. For those who never agree or who have opposing views, we'd expect those entries farthest away from the diagonal to be populated: that is, where one user's rating is min(rating) and the other's is max(rating).

The calculation of similarity falls out of the matrix intuitively, as you saw in listing 3.6. We can use the total number of ratings that the users have agreed on, divided by the total number of items that both users have provided a rating for. Note again that our computational performance is likely far below optimal, because we recalculate the RatingCountMatrix …
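That calculation (agreed ratings divided by commonly rated items) can be sketched without the full RatingCountMatrix class; the rating dictionaries below are hypothetical.

def agreement_similarity(ratings_a, ratings_b):
    """Number of commonly rated items on which the two users gave the same
    rating, divided by the number of commonly rated items."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    agreed = sum(1 for item in common if ratings_a[item] == ratings_b[item])
    return agreed / len(common)

a = {0: 5, 1: 4, 2: 5, 3: 4}
b = {0: 5, 1: 3, 2: 5, 9: 1}
print(agreement_similarity(a, b))   # 2 agreements out of 3 common items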
Model-based recommendation using singular value decomposition

"Singular Value Decomposition," Wolfram MathWorld, 2015, http://mng.bz/TxyY.
Jan J. Gerbrands, "On the Relationships between SVD, KLT and PCA," Conference on Pattern Recognition (1980).

$$ A = U D V^{T} $$
… that it makes sense to select features that are far away from each other in user space. If sci-fi fans never watch romantic comedies and vice versa, then such a feature would be useful in providing a recommendation. Simultaneously, we know that for any two columns in …, or equivalently any two rows in …, they're also orthogonal. That is, latent features are selected such that the features are as far away as possible from each other …

ACM RecSys Wiki, www.recsyswiki.com/wiki/Python-recsys.
Listing 3.8 SVD for recommendation

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data
import recsys.algorithm

svd = SVD()
recsys.algorithm.VERBOSE = True            # Sends back progress to the screen

dat_file = './ml-1m/ratings.dat'
item_file = './ml-1m/movies.dat'           # Movie info
data = Data()
data.load(dat_file, sep='::',
          format={'col':0, 'row':1, 'value':2, 'ids': int})
items_full = read_item_data(item_file)               # helper defined elsewhere in the chapter's code
user_full = read_user_data_from_ratings(dat_file)    # helper defined elsewhere in the chapter's code

Listing 3.9 Factorizing a matrix using SVD for recommendation

Oscar Celma, "Python Recsys v1.0 documentation," http://mng.bz/UBiX.
Option          Description
rows            Rescale the rows of the input matrix so that they all have unit Euclidean magnitude.
cols            Rescale the columns of the input matrix so that they all have unit Euclidean magnitude.
all             Rescale the rows and columns of the input matrix by dividing both the rows and the columns by the square root of their Euclidean norm.
mean_center     Center the input matrix (aka mean subtraction).
post_normalize  Normalize every row of … to be a unit vector. Thus, row similarity (using cosine dist…
C. Lanczos, "An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators," Journal of Research of the National Bureau of Standards 45, no. 4 (October 1950), http://mng.bz/cB2d.
Doug Rohde, SVDLIBC, http://tedlab.mit.edu/~dr/SVDLIBC/.
Listing 3.10 Output from model-based recommendation

Listing 3.11 Films rated by user 10

… higher than 3, and figure 3.4 provides the genres for which the user has rated lower than 3. We don't report on films that have been rated with the average score of 3. This should now start to make more sense. Figure 3.3 shows that this user is clearly interested in comedy and drama films, because they account for the overwhelming majority of positive ratings …

Figure 3.3 The number of films that have been rated by user 10 as 4 or higher. These are broken out by genre, so user 10 has rated more than 40 comedy films with a 4 or a 5. Only the top 19 genres are displayed here.
… received two or more low rankings, but we …

Arkadiusz Paterek, "Improving Regularized Singular Value Decomposition for Collaborative Filtering," Proceedings of the ACM SIGKDD (2007), http://mng.bz/TQnD.

Figure 3.4 The number of films that have been ranked as lower than 3 by user 10, broken out by genre. In this case, the user has ranked three comedy films and two action/sci-fi films below 3 and has watched several other categories of films and ranked them with a score of 1.
Listing 3.12 Investigating user recommendations in model-based CF

Listing 3.13 Analyzing the ratings of users recommended Star Wars: TPM

['Comedy|Sci-Fi\n', 8],
['Drama|Sci-Fi\n', 7]

Figure 3.5 Number of ratings ≥ 3 aggregated by genre, for users who recommended Star Wars: The Phantom Menace.
… comedy categories. At first glance this may seem erroneous, but remember that there are many films in these categories. Also remember that films in these genres may receive high ratings from all users in the …

Although this was a truly impressive achievement, Netflix didn't actually implement the final algorithm and instead chose to continue with a simpler, albeit less-accurate algorithm (yielding only an 8.43% improvement versus the winning 10.06%). This algorithm won the first progress prize and was implemented even …

Martin Piotte and Martin Chabbert, "The Pragmatic …"
… rental slots. With the focus shifting to online streaming, this became less relevant because the delivery mechanism was almost instantaneous. A user can watch whatever they want, whenever and wherever they want. Herein lies our second important lesson: a reasonably intelligent algorithm delivered quickly can often be worth more than an extremely accurate one delivered at a later date.
3.7 Evaluating your recommender

Commercial recommendation systems operate under demanding conditions. The number of users is typically on the order of millions and the number of items on the order of hundreds of thousands. An additional requirement is the capability to provide recommendations in real time (typically, sub-second response times) without sacrificing the quality of the recommendations. This is typically performed with some layer of caching, and the time taken to use SVD for matrix factorization is usually not real-time and depends on many other factors: for example, the amount of memory you have and, more important, the implementation of SVD. It also might depend on how long you're willing to wait for the answer!

As you've seen, by accumulating ratings from each user, it's possible to enhance the accuracy of predictions over time. But in real life, it's imperative that we give excellent recommendations to new users for whom, by definition, we don't have a lot of ratings. Another stringent requirement for state-of-the-art recommendation systems is the ability to update predictions based on incoming ratings. In large commercial sites, thousands of ratings may take place in a few hours and perhaps tens of thousands (or more) in the course of a single day. The ability to update the recommendation system with that additional information is important and must happen online, without downtime.

Let's say that you wrote a recommender and you're satisfied with its speed and the amount of data it can handle. Is this a good recommender? It's not useful to have a fast and scalable recommender that pro…

Jonathan L. Herlocker, et al., "Evaluating Collaborative Filtering Recommender Systems," ACM Transactions on Information Systems 22, no. 1 (January 2004).
We can argue that the RMSE is probably too naive. Let's consider two cases. In the first case, we recommend to a user a movie with four stars, and he really doesn't like it (he'd rate it two stars); in the second case, we recommend a movie with three stars, but the user loves it (he'd rate it five stars). In both cases, the contribution to the RMSE is the same, but it's likely that the user's dissatisfaction would probably be larger in the first case than in the second case.

You can find the code that calculates the RMSE for our model-based solution in the following listing. As we've mentioned, …

Listing 3.14 Calculating the root-mean-square error for a recommender
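Listing 3.14 itself isn't reproduced above, but the quantity it computes is simply the root of the mean squared difference between predicted and actual ratings. A self-contained sketch with made-up numbers:

import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

predicted = [4.0, 3.0, 5.0, 2.5]   # hypothetical model output
actual    = [2.0, 5.0, 4.5, 3.0]   # what the users really rated
print(rmse(predicted, actual))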
In the final pages of this chapter, we looked at the Netflix Prize. Although this …

4  Classification: placing things where they belong

This chapter covers
- Understanding classification techniques

"What is this?" is the question children perhaps ask most frequently. The popularity of that question among children, whose inquisitive nature is as wonderful as it is persistent, shouldn't be surprising. In order to understand the world around us, we organize our perceptions into groups and categories. In the previous chapter, …
… classification of the item as a boat narrows … for us. In a simi…

… there are classification systems for spinal cord injuries; for coma, concussion, and traumatic brain injuries; and so on.

You must have heard the term Homo sapiens, of which Homo is our genus and sapiens is our species. This classification can be, and typically is, extended to include other attributes such as family, order, phylum, and so forth.
Generally speaking, the more attributes you use, the finer the degree of classification is going to be. Having a large number of attributes is usually a good thing, but there are caveats to this general principle. One notorious symptom of dealing with many attributes is the curse of dimensionality, which was discussed in chapter 2. As you may recall, the curse of dimensionality refers to the fact that a space becomes more and more homogeneous as the number of attributes increases. In other words, ... this analogy to obtain some insight into your structure.
We could obviously go on with more classification systems; they're everywhere. The point is that classifying data is equivalent to organizing it. Classification systems improve communication by reducing errors due to ambiguities. They also help us organize our thoughts and plan our actions. The reference structure, which is used for organizing our data, can be as simple ...
D. Gasevic, D. Djuric, and V. Devedzic, Model Driven Architecture and Ontology Development (Springer, 2006).
Figure 4.1 The basic elements of a reference structure (a rudimentary ontology). Ellipses denote concepts, whereas rounded rectangles denote attributes. Rectangles denote instances (for example, NAME: Toyota Prius). Concepts inherit attributes from above.
application. When you think about your data and the ways it could be organized, you'll realize the value of a classification system for your application. Starting with section 4.3, we'll introduce classification mechanisms in web applications in order to demonstrate the use of classification algorithms and the issues that may arise. But first, here's an overview of classification systems. If you want to jump into action quickly, you can skip the next section.
4.2 An overview of classifiers

... separated into functional and nearest-neighbor schemes.

Figure 4.2 An overview of classification algorithms based on their design (including the perceptron, nearest-neighbor methods, decision trees, and Bayesian classifiers)
will eliminate as many candidates as possible based on the provided information. Decision-tree algorithms have several advantages, such as ease of use and computational efficiency. Their disadvantages are usually prominent when we deal with continuous variables, because we're forced to perform a discretization: the continuum of the values must be divided into a finite number of bins in order to construct the tree. In general, decision-tree algorithms don't have good generalization properties, and thus they don't perform well with unseen data. A commonly used algorithm in this category is C5.0 (on Unix machines) or See5 (on Microsoft Windows machines). It can be found in a number of commercial products, such as Clementine (now SPSS Modeler) and RuleQuest.
The second branch of structural algorithms is composed of distance-based algorithms. In the previous chapters, we introduced and extensively used the notions of similarity measure and generalized distance. These algorithms are fairly intuitive, but it's easy to misuse them and end up with bad classification results, because many of the data-point attributes aren't directly related to each other. A single similarity measure can't expediently capture the differences in the way the attributes should be measured; careful normalization and analysis of the attribute space are crucial to the success of distance-based algorithms. Nevertheless, in many low-dimensional cases with low complexity, these algorithms perform well and are fairly simple to implement. We can further divide distance-based algorithms into functional and nearest-neighbor type algorithms.
Functional classifiers approximate the data by a function, as the name suggests. This is similar to regression, but we differentiate ...
Tom Khabaza, The Story of Clementine (Internal Report, Integral Solutions Limited, 1999).
Ross Quinlan, "RuleQuest Research: Data Mining Tools," www.rulequest.com.
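As an illustration of the nearest-neighbor flavor of distance-based classification, here is a minimal scikit-learn sketch; the toy points and labels are made up purely for this example:

from sklearn.neighbors import KNeighborsClassifier

# Toy two-dimensional points and their class labels (illustrative only)
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]

# A 3-nearest-neighbor classifier using Euclidean distance
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[0.15, 0.15], [0.85, 0.85]]))  # expect [0, 1]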
flavor of a statistical approach in the case of so-called logistic regression. In this case, the model (the logistic function) predicts values between 0 and 1, which can be interpreted as the probability of class membership; this works well in the case of binary classification if we threshold the probability. This model is widely used in industry and will be the focus of much of this chapter. Here we'll address its use to classify online fraud, an especially difficult task because these events occur so rarely in comparison to non-fraud events.
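To make the thresholding idea concrete, here is a minimal scikit-learn sketch; the synthetic data and the 0.5 cutoff are assumptions made for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]      # probability of class 1
labels = (probs >= 0.5).astype(int)       # a hard threshold turns probabilities into a classifier
print(labels[:10], y[:10])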
Many of the techniques in the statistical algorithms category use a probability theorem known as the Bayes rule or Bayes theorem. In this kind of statistical classification algorithm, the least common denominator is the assumption that the attributes of the problem are independent of each other, in a fairly quantitatively explicit form. The fascinating aspect of Bayesian algorithms is that they seem to work well even when that independence assumption is violated.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. (Springer-Verlag, 2009).
Kevin Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).
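A minimal sketch of this idea using scikit-learn's Gaussian naive Bayes, which treats each attribute as conditionally independent given the class; the data here is synthetic and purely illustrative:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(1)
# Two Gaussian blobs standing in for two classes
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.2, 0.1], [2.9, 3.2]]))  # expect [0, 1]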
Figure 4.3 The lifecycle of a classifier: training, validation (or testing), and use in production
combining intelligent algorithms was also discussed in the context of recommender systems in chapter 3.
In the production stage, we're using our classifier in a live system to produce classifications on the fly. Typically, the parameters of the classifier don't change during the production stage. But it's possible to enhance the results of the classifier by embedding (into the production stage) a short-lived training stage that's based on human-in-the-loop feedback. These three steps are repeated as we get more data from our production system and as we form better ideas for our classifier. You've now seen the big picture about classification algorithms. In the literature, you can find many different overviews of classification algorithms.
In order to provide you with a sufficiently in-depth and applicable example, for the remainder of this chapter we'll concentrate on statistical algorithms and, in particular, logistic regression.
L. Holmström, P. Koistinen, J. Laaksonen, and E. Oja, "Neural and Statistical Classifiers: Taxonomy and Two Case Studies."

Figure 4.4 Graphical illustration of a linear model (y = mx + c). The data points provide a clear linear relationship.
Figure 4.5 Calculating the residuals, or the ...
above or below the model line. This is known as model training. In general, our straight-line model isn't restricted to only a single variable, $x$, and can be generalized to $n + 1$ dimensions by using the following form:

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

In this case, our approach to modeling is exactly the same; all that has changed is the number of weights we need to learn.
Figure 4.6 The response of $y$ with $x$ for a logistic curve. This particular curve is centered on $x = 10$ and has been plotted using 20 samples uniformly distributed on $[0, 20]$. Note the larger differences in $y$ for values of $x$ around 10 and much smaller differences in $y$ around $x = 0$ and $x = 20$.
The next step is somewhat a leap of faith. We know that $y$ is equivalent to the probability of the event occurring given the features in the exponent, so now we rearrange the equation to express $y/(1-y)$: the probability of the event occurring divided by the probability of the event not occurring. This is known as the odds of the event. Let's first calculate $1 - y$ and substitute this back into our equation afterward:

$y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}, \qquad 1 - y = \frac{e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$

Now that we have the equation for $1 - y$, the modeled probability of the negative event, let's express the odds using our initial definition of $y$:

$\frac{y}{1-y} = e^{w_0 + w_1 x_1 + \dots + w_n x_n}$

Substituting back in for $y$ and taking the natural logarithm on both sides, we end up with our final equation:

$\ln\left(\frac{y}{1-y}\right) = w_0 + w_1 x_1 + \dots + w_n x_n$

Compare this with the original equation from section 4.3.1, and what do you notice? That's correct; they're both linear models, but where our original model expressed probability as a linear combination of the input variables, our new logistic regression model expresses the log-odds as a linear combination of input variables. Training can occur in a way much like the linear model: we update the values of $w$ in order to best fit the data (to minimize loss). To define and minimize error, we can use maximum likelihood estimation: that ...
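As a quick numerical check of the relationship above (the weights and inputs here are made up for illustration):

import numpy as np

w = np.array([-1.0, 0.8, 0.5])        # made-up weights: intercept, w1, w2
x = np.array([1.0, 2.0, 1.0])         # feature vector with a leading 1 for the intercept

log_odds = w @ x                       # linear in the log-odds
y = 1.0 / (1.0 + np.exp(-log_odds))    # the logistic function recovers the probability
print(log_odds, np.log(y / (1 - y)))   # the two values agree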
In figure 4.7, the columns ... the first line of data ...

Listing 4.2 Masking categorical features
Listing 4.3 Splitting out and encoding categorical features

Kevin Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).
(Listing annotations: creates a mask object full of True; fills the columns with 0s where the data is of categorical type; extracts the non-categorical data; extracts the categorical data; encodes only the categorical data.)
The code in listing 4.3 splits out the features into two arrays: data_non_categoricals, which contains the continuous features, and data_categoricals, which contains the categorical features. We then encode the latter using one-hot encoding to ensure that each value a categorical variable can take is encoded independently and that the ordering of any feature encoding doesn't impact the end results.
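The code of listings 4.2 and 4.3 didn't survive extraction; a minimal sketch of the masking-and-encoding step, with a made-up toy array standing in for the fraud data, might look like this:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the fraud data: two continuous columns and one categorical column
data = np.array([[12.5, 1.0, 2],
                 [99.0, 0.0, 0],
                 [47.3, 1.0, 1]])
categorical_idx = [2]                       # assumed index of the categorical feature

mask = np.ones(data.shape[1], dtype=bool)   # a mask object full of True
mask[categorical_idx] = False               # set to 0 (False) where the data is categorical

data_non_categoricals = data[:, mask]       # extract the continuous features
data_categoricals = data[:, ~mask]          # extract the categorical features

# Encode only the categorical data
encoded = OneHotEncoder().fit_transform(data_categoricals).toarray()
print(data_non_categoricals.shape, encoded.shape)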
Let's first look at the performance of a simple model using only the continuous features, shown in the next listing.

Listing 4.4 Simple model using continuous features only

new_data = data_non_categoricals
new_data = new_data.astype(np.float64)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(new_data, target, test_size=0.2)  # `target` (name assumed) holds the fraud labels
One-hot encoding
In the example, we discuss the use of one-hot encoding for categorical features. But what is this, and why should we bother with it? For logistic regression, all features are eventually encoded as a number, and this is where we can experience problems if we're not careful to deal with categorical variables ...
(Listing annotations: creates a train and test split; trains the logistic regression model; obtains the output of the model.)
Odds: [[ 1.00137002  0.47029208]]
Odds intercept [ 0.50889666]
Likelihood Intercept: [ 0.33726409]

This tells us several things. The first line indicates that, holding all other features constant, a unit increase in the transaction amount increases the odds of fraud by 1.00137002. Looking at the data, this intuitively makes sense. We do see that in our ...
Listing 4.5 Using both categorical and non-categorical variables with LR

Figure 4.8 Output of the model plotted against the test data ground truth (left: binary ground-truth labels vs. model log-probability; right: binary ground-truth labels vs. model probability; the x-axes show test instances). Values for the data labels may only take values of 1 or 0, indicating fraud and not fraud, respectively. Log-probability may take ...
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
Odds: [[ 1.00142244 0.48843238 1. 0.97071389
0.77855355 0.95091843 0.85321151 0.98791223
0.99385787 0.95983606 0.81406157 1.
0.869761 1. 0.94819493 1.
0.96913702 0.91475145 0.81320387 0.99837556
0.97071389 0.869761 0.9778946 1.
0.81320387 0.99385787 0.94499953 0.82707014
0.98791223 0.98257044]]
Odds intercept[ 0.61316832]
Likelihood Intercept:[ 0.38010188]
Given the two graphs returned from each version of our experiment, we might conclude that these additional features have had very little impact on the classifier's performance. We'll discuss more about performance in the next section, but before we do that, you should understand why the model parameters have such a different form in this experiment compared to the last.
You'll note that there are a total of 30 model parameters in this version. Why is this, given that there are only six features? This is a consequence of our encoding ... features would almost certainly be different than those we've obtained here.
Let's now take this opportunity to summarize what we've accomplished in this section. We started out with the most humble linear model, that is, drawing a straight line through our data points, and we extended this to a model that is linear in the log-odds of the target variable. This powerful technique allows for a change in an attribute to provide a nonlinear change in the odds (although linear in the log-odds). We showed how this can then be applied to a ...
things that range from exaggerations to outright nonsense. The goal of this section is to help you evaluate your own classifier, if you're a developer, and help you understand the legitimacy (or otherwise) of the ...
Table 4.1 A typical confusion matrix for a simple binary classification problem

        Positive              Negative
True    True positive (TP)    True negative (TN)
False   False positive (FP)   False negative (FN)

prefer to release 100 guilty people than convict 1 innocent person; that sensitivity remains in the European courts. The moral of this anecdote is that decisions have consequences, and the degree of the consequences isn't uniform. This is particularly true in the case of multiclass classification. Consider the following definitions, some of which were introduced in section 1.6:
FP rate = FP / N, where N = TN + FP
Specificity = 1 - FP rate = TN / (TN + FP)
Recall = TP / P, where P = TP + FN
Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (P + N)
F-score = 2 x (Precision x Recall) / (Precision + Recall)
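A quick sketch of these definitions in Python, computed from made-up confusion-matrix counts (scikit-learn's metrics module would give the same numbers):

tp, tn, fp, fn = 40, 30, 10, 20           # made-up counts for illustration
p = tp + fn                               # all actual positives
n = tn + fp                               # all actual negatives

fp_rate = fp / n
specificity = 1 - fp_rate                 # equivalently tn / n
recall = tp / p
precision = tp / (tp + fp)
accuracy = (tp + tn) / (p + n)
f_score = 2 * precision * recall / (precision + recall)
print(fp_rate, specificity, recall, precision, accuracy, f_score)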
Suppose we find out about a classifier whose accuracy, as defined earlier, is 75%. How close to the true accuracy of the classifier is our estimate? In other words, if we repeat the classification task with different data, how likely is it that our accuracy will be 75%? To answer that question, we'll resort to ...
Ian Witten and Eibe Frank, Data Mining (Morgan Kaufmann, 2005).
... a large number of tests, on ... statistical information about your data, such as minimum and maximum values, mean values, median values, valid outliers, percentage of missing data in attribute values, and so on. You can use that information to sample previously unseen data from your ...
Pay attention to the idiosyncrasies of your classification system. If you use a rule-based system, you may encounter what's known as the utility problem. The learning process, the accumulation of rules, can result in the overall slowdown of the system in production. There are ways to avoid or at least mitigate the utility problem, but you need to be aware of them and ensure that your implementation is compatible with these techniques. Of course, the degradation of performance isn't the only problem in that case. You'd also need to provide ways to manage and organize these rules, which is an engineering problem with a solution that'll depend strongly on the specific domain of your application. In general, the more complicated the classifier implementation, the more careful you should be to understand the performance characteristics (both speed and quality) of your classifier.
4.6 Summary
We provided a taxonomy of this broad field covering both statistical and structural algorithms. Statistical algorithms focus on the fit of a particular function to some data, often using algorithmic approaches to find the maximum likelihood of the data given the model. Conversely, structural algorithms look at fea...
R. B. Doorenbos, Production Matching for Large Learning Systems, PhD thesis, Carnegie Mellon University, 1995.
Case study: click prediction for online advertising
Online click prediction is a specific problem in the world of advertising and is an excellent example of a high-velocity, high-throughput problem where decisions must be made with a very low latency. In general, this class of problems has a large number of applications. Online trading, website optimization, social media, the
This chapter covers
History and background
From an advertiser's standpoint, we can only proxy a positive impact through interaction with an ad. So, we try to predict the likelihood of interaction with an ad on a web page, based on all of the information surrounding a user and the context. Due to the volume and velocity of data generated on the web and the requirement for a rapid answer, this isn't a simple problem.
The solutions presented here should be equally applicable to different applications. For example, if we consider a smart city, where all cars are enabled with sensors ...
"Total US Ad Spending to See Largest Increase Since 2004."
... exchange and demand-side platform. We'll further develop these concepts in this chapter, and you'll learn how you can use the information collected about a user to increase the odds of interaction with an ad.
Figure 5.1 provides a graphical overview of the ecosystem. You can see that the exchange is the medium through which publishers (websites such as huffingtonpost.com, and more) can sell advertising space (slots and spaces of a predefined size) to individual advertisers (think companies like Nike, Adidas, O2, and Vodafone). Advertisers often buy through what's known in the industry as a demand-side platform (DSP), which acts on behalf of multiple advertisers and combines buying power to reach the very best publishers on the web. There are many examples of DSPs on the web, such as Turn (www.turn.com), MediaMath (www.mediamath.com), and DataXu (www.dataxu.com). These platforms are a conduit through which it's possible to buy advertising. We can easily draw parallels with the world of equity trading. Not everybody is allowed on the trading floor, so representatives act on behalf of many individuals who wish to buy.
In this parallel, the DSP is equivalent to the trader, where the advertising exchange is equivalent to the stock exchange. In the first case, the DSP aggregates demand, just as an individual trader does. In the second case, both ...
Figure 5.1 The online advertising ecosystem. Publishers make their content (advertising slots) available through an exchange that allows access programmatically, that is, through an intelligent algorithm. A demand-side platform aggregates demand for content from advertisers and buys on their behalf.
... because advertising space is bought through the use of a computer program (an intelligent algorithm) as opposed to in person.

Figure 5.2 A graphical overview of the steps involved in buying ad placements on an exchange (cookie match, bid, ad placement, ad monitoring). Events should be read from top to bottom. First, a cookie match occurs. This can be performed either exchange side or DSP side. Once the DSP has a unique identifier that it can use, it uses this to look up additional information about the user and construct a bid price (the subject of section 5.4.5 onward). If the bid is successful, bid notification ...
5.2.1 Cookie matching
This is the first step that occurs every time a page that contains ads is loaded. A cookie is a unique identifier for a user for each domain (or website/web service) that's held in the browser. It's guaranteed to be unique ... identifier for the user that can be used to look up further information. The DSP also obtains specific information about the ad placement that's available. What the ...
Vijay Krishna, Auction Theory, 2nd ed. (Academic Press, 2009).
Internet Advertising Bureau, "IAB Standard Ad Unit Portfolio," www.iab.com/guidelines/iab-standard-ad-unit-portfolio.

5.3 What is a bidder?
Understanding the process of bidding is the first step toward understanding the context in which an intelligent algorithm must operate. So far, we've only provided a high-level overview of how we buy ads on an exchange. All we know is that we're given some information by the exchange and that we must return a bid price in order to secure this user and this placement. In practice, a bidder that follows loosely the design presented in figure 5.3 achieves this.
The bidder is made up of an outer shell and an inner core. The outer shell is responsible for performing the crucial tasks associated with bidding, such as retrieving user information and responding to the exchange in a timely manner. At the inner core of the bidder is the decision engine, responsible for taking into account all user, placement, ...
Figure 5.3 The bidder consists of an outer shell that's responsible for performing the crucial tasks associated with bidding and the decision engine that's responsible for taking into account all data associated with the bid.
Simplicity: One way to achieve speed is through adopting a simple design. The bidder should perform only one function and should be easy to debug and maintain. Remember, once your bidder is live and participating in auctions, every moment it's offline represents a loss of revenue for both you and ...
Exchanges contain a lot of inventory. Part of the difficulty of participating in an exchange is that it's easy to buy lots of poor inventory and to buy it very quickly! For this reason it's extremely important that your bidder be accurate. Over time, you should be able to determine a positive correlation between the price that you bid and the performance of these ads.
Extensibility: There are many exchanges available through which inventory can be bought. Some inventory is available through only one exchange, whereas others are available on many. In order to maximize your reach as a DSP, you'll want to integrate with as many exchanges as possible. For this reason, extending your bidder should be easy. Don't be tempted to tightly couple your bidder with the first exchange that you integrate with, but keep your solution as generic as ...
Once your bidder is in place and performing reasonably well, you'll want to start improving your decisioning to try to obtain the best users at the lowest cost. Every change to your bidder and its associated decisioning must be monitored to ensure that changes have the right effect. Often the results of a change can only be seen over days and weeks, not minutes.
5.4 What is a decisioning engine?
In the previous sections, we introduced the basics of the problem of buying ad placements online through an exchange and how this is performed in practice through the use of a bidder. In this section, we're going to concentrate on a specific aspect of the bidder. We'll uncover how to take the raw data provided by the exchange and convert this into a value you're willing to pay to deliver an ad. Let's look at the types of information you can use to generate these probabilities.
5.4.1 Information about the user
In reality, the only information provided about a user is a cookie identifier. It's up to the DSP to record and store such information. When someone is seen for the first time, no information is available about them whatsoever! If they interact with an ad they're shown, this will be stored, as will any other interaction that occurs in the scope of a DSP's advertising: for example, if the user starts a video or cancels a full-screen advertisement. These small interactions are stored away and looked up each time this user is seen in the wild. This allows a DSP to build up a picture of user behaviors that are synonymous with the likelihood to click, as well as those that aren't.
5.4.2 Information about the placement
Quite separate from information about the user, information about the ad placement can be predictive of a click. Independent of the user, certain sizes of ads tend to perform well, as do certain publishers. For example, ads on premium news sites such as theguardian.co.uk may have a higher click-through rate (CTR) than those on small ...
BBC News, "Black Friday: Online Spending Surge in UK and US," November 27, 2015, www.bbc.co.uk/news/business-34931837.
BBC News, "Cyber Monday Adds to Online Sales Surge," November 30, 2015, www.bbc.co.uk/news/business-34961959.
... to calculate the probability of a user interacting with an ad. In the previous chapter, we placed a hard threshold on the output of the logistic regression model in order to create a classifier; but in this chapter, we'll use the output of the logistic regression model as is, taking the output probability as the likelihood of a click. Recall that the general form of the logistic regression model is given as follows:

$y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$

where $y$ is the probability of the event under investigation, $1 - y$ the inverse probability, $w_i$ for $i > 0$ represents the feature coefficients, and $w_0$ represents the threshold. Thus, in ...

$\text{Bid CPM} = \text{CPC} \times \text{eCTR} \times 1000$
$\text{Bid CPM} = \text{CPC} \times y \times 1000$
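As a quick illustration of the bid formula above (the CPC and eCTR values here are made up):

def bid_cpm(cpc, ectr):
    # CPM bids are priced per thousand impressions, hence the factor of 1000
    return cpc * ectr * 1000

print(bid_cpm(cpc=0.50, ectr=0.002))  # e.g. $0.50 per click at a 0.2% eCTR gives a $1.00 CPM bid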
ad placements on a site-by-site basis. Many DSPs and algorithm designers in this space ...
John Langford, "Vowpal Wabbit command-line arguments," http://mng.bz/YTlo.
John Langford, "Vowpal Wabbit tutorial," http://mng.bz/Gl40.
Criteo, "Kaggle Display Advertising Challenge Dataset," http://mng.bz/75iS.
John Langford, "Vowpal Wabbit wiki," http://mng.bz/xt7H.
some of its features here. As ever, the GitHub page and mailing list will have the most detailed information, and I strongly urge you to keep checking there if you're going to develop further applications with VW.
5.5.1 Vowpal Wabbit data format
Vowpal Wabbit uses a specific file format that is very flexible but can be difficult to ...

Figure 5.4 Overview of the raw data provided by Criteo. The first field relates to the event we're trying to predict. This relates to the user performing some event, such as clicking the ad. The next 13 columns are integer properties with count semantics; a higher number means some event has been performed more times than if a lower number is seen. The next 26 features are categorical and most likely represent some property of the bid request: for example, the site from which the bid request was made.
won't use the count semantics just mentioned. Each unique count value will be treated as a label, and the model will lose the ability to understand that all the count values are related through integer semantics: for example, that 2 is greater than 1, 3 is greater than 2, and so on. If you look in the accompanying folder for this chapter, you'll find that we've already written this code for you!
In order to proceed with the example in this chapter, you'll need to download dac.tar.gz from Criteo, untar/gz the file, and extract the train.txt data file from the dac subdirectory. Next, run the included Python script against the training file using the following command:

python ch5-criteo-process.py /path/to/train.txt

This will create two files called train_vw_file and test_vw_file. These are created in the current directory of the aforementioned script. Note that if you wish, you can change the output files from this script by providing two additional parameters:

python ch5-criteo-process.py /path/to/train.txt /path/to/train_vw_file /path/to/test_vw_file

These files contain data in the following format:
-1 |i1 c:1 |i2 c:1 |i3 c:5 |i4 c:0 |i5 c:1382 |i6 c:4 |i7 c:15 |i8 c:2 |i9
c:181 |i10 c:1 |i11 c:2 |i13 c:2 |c1 68fd1e64 |c2 80e26c9b |c3 fb936136
|c4 7b4723c4 |c5 25c83c98 |c6 7e0ccccf |c7 de7995b8 |c8 1f89b562 |c9
a73ee510 |c10 a8cd5504 |c11 b2cb9c98 |c12 37c9c164 |c13 2824a5f6 |c14
1adce6ef |c15 8ba8b39a |c16 891b62e7 |c17 e5ba7672 |c18 f54016b9 |c19
21ddcdc9 |c20 b1252a9d |c21 07b5194c |c23 3a171ecb |c24 c5c50484 |c25
e8b83407 |c26 9727dd16
Criteo, http://mng.bz/75iS.
The c1 namespace in the previous example contains the feature 68fd1e64. Implicit to this is a value, 1, that denotes that feature 68fd1e64 is present; thus, it's categorical. We could equivalently write this as 68fd1e64:1. In the example, the i1 namespace has a feature, c, with an associated continuous value, 1, denoting the count of the number of events (of name c) relating to i1 that have occurred.
In practice, this distinction is important because it means the difference between training a single coefficient for a continuous variable (the correct way to proceed for integer semantics) or a coefficient for every seen value of the variable (incorrect!). We won't go further into this here, but I encourage you to read the associated documentation on namespaces on the project wiki.
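The conversion script itself ships with the book's code; purely as an illustration of the conventions above (and not the book's ch5-criteo-process.py), a sketch that turns one Criteo TSV row into a VW line might look like this:

def criteo_row_to_vw(row):
    # One tab-separated Criteo row: label, 13 integer counts, 26 categorical hashes
    fields = row.rstrip("\n").split("\t")
    label = "1" if fields[0] == "1" else "-1"        # VW expects -1/1 for logistic loss
    ints = fields[1:14]
    cats = fields[14:40]
    parts = [label]
    for i, value in enumerate(ints, start=1):
        if value != "":
            parts.append("|i%d c:%s" % (i, value))    # integer count feature in its own namespace
    for i, value in enumerate(cats, start=1):
        if value != "":
            parts.append("|c%d %s" % (i, value))      # categorical feature: presence only
    return " ".join(parts)

example = "0\t1\t1\t5\t0\t1382\t4\t15\t2\t181\t1\t2\t\t2" + "\t68fd1e64" * 26
print(criteo_row_to_vw(example))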
We mentioned earlier that you need to create two files. You create these files directly from the Criteo training data (train.txt). Because the Criteo data was originally ...
John Langford, "Vowpal Wabbit input format," http://mng.bz/2v6f.
How features are represented in VW
VW uses a bit vector of a specified size called a feature table, and each feature is represented by hashing into this vector. Weights are learned, not per feature, but per entry within the vector.
This has a few benefits, including a reduced fixed-size learning space, but it means collisions can occur (where two different features trigger the same bit). In general, provided an appropriate vector size is chosen, the impact of this on the accuracy of the trained classifier is negligible. It also means that to uncover the impact of a feature (analyze its learned weights), we need to hash into the bit vector and perform the dot product of the vector and the weights. We'll encounter this later.
samples. Put another way, the negative samples will have more importance than the positive ones. This isn't ideal and can best be tackled by forcing the data to be balanced before training.
We can ensure that this is true with a bit of command-line manipulation. A word to the wise: these files are large, and editing them in your favorite text editor may cause your memory to fill up and your application to crash! Take a look at the following listing for an appropriately safe method to do this.
Listing 5.1 Determining how balanced the data is

wc -l train_vw_file
32090601 train_vw_file
grep -c '^-1' train_vw_file
23869264
grep -c '^1' train_vw_file
8221337

Here we use some simple command-line utilities to determine the number of positive and negative examples: wc -l counts the lines in the file, grep -c '^-1' counts only those lines starting with -1 (negative examples), and grep -c '^1' counts only those lines starting with 1 (positive examples).
Leon Bottou, "Online Learning and Stochastic Approximations," in ...
grep '^-1' train_vw_file | sort -R > negative_examples.dat
grep '^1' train_vw_file | sort -R > positive_examples.dat
awk 'NR % 3 == 0' negative_examples.dat > negative_examples_downsampled.dat
cat negative_examples_downsampled.dat > all_examples.dat
cat positive_examples.dat >> all_examples.dat
cat all_examples.dat | sort -R > all_examples_shuffled.dat
awk 'NR % 10 == 0' all_examples_shuffled.dat > all_examples_shuffled_down.dat

This listing uses several more command-line tools to down-sample and shuffle our data. We extract the negative samples, shuffle them, and place them in negative_examples.dat; do the same for the positive samples; use awk to take every third example from negative_examples.dat; write the down-sampled negatives into all_examples.dat and append the positive examples to them so the file is balanced; shuffle the combined file; and finally down-sample the shuffled file by a factor of 10.
Listing 5.3 Training a logistic regression model using VW

vw all_examples_shuffled_down.dat \
  --loss_function=logistic \
  -c \
  -b 22 \
  --passes=3 \
  -f model.vw

vw all_examples_shuffled_down.dat \
  -t \
  -i model.vw \
  --invert_hash readable.model

cat readable.model | awk 'NR > 9 {print}' | \
  sort -r -g -k 3 -t : | head -1000 > readable_model_sorted_top

The first command trains our model using VW: a logistic loss function, a cache (-c), a feature table of 2^22 entries (-b 22), three passes over the data, and an output model written to model.vw. The second runs vw in test-only mode (-t) over the same data using the trained model; the purpose of this isn't to test on the same data but to run through the data once and create the inverted hash. The third sorts the readable model by the learned weights. Several command-line switches perform specific functions, and more information regarding these can be found in Langford, "Vowpal Wabbit command-line arguments," cited previously. In order to perform logistic regression, we need to tell VW to use the logistic loss function to calculate the difference ...
human-readable features so we can gain some intuition as to what features the model thinks are important. This is the focus of the second line of listing 5.3. We call vw again in test mode (-t). In general, you should never test on the same data you train on, but in this case we're not interested in the results! By running through the training data in test mode a single time with the invert_hash option set, we build a table that links the human-readable feature name to the effective impact on the classifier, through the weights associated with the hashed feature value.
Because there are so many learned values, and they're written out to the readable.model file in no particular order, before we can analyze the coefficients we must sort this file. We do this using the command-line tools awk, sort, and head. Essentially, we throw away the first nine rows (which include header information and preamble) and then sort the remaining rows on the third column, telling sort that the columns are denoted by the colon (:). The final output is sent to readable_model_sorted_top, and we reproduce the top 20 coefficients in the listing that follows. The features you see here are those that, if set (holding all other features constant), cause the greatest increase to the log-odds.
c24^e5c2573e:395946:3.420166
c3^6dd570f4:3211833:1.561701
c12^d29f3d52:2216996:1.523759
c12^977d7916:2216996:1.523759
c13^61537f27:2291896:1.373842
c12^d27ed0ed:2291896:1.373842
c12^b4ae013a:2291896:1.373842
c16^9c91bbc5:1155379:1.297896
c3^f25ae70a:64455:1.258160
c26^ee268cbb:64455:1.258160
c21^cfa6d1ae:64455:1.258160
c26^06016427:3795516:1.243104
c15^2d65361c:506605:1.240656
c12^13972841:506605:1.240656
c3^e2f2a6c7:3316053:1.234188
c18^eafb2187:3316053:1.234188
c18^fe74f288:3396459:1.218989
c3^8d935e77:2404744:1.214586
c4^f9a7e394:966724:1.212140
c3^811ddf6f:966724:1.212140
Listing 5.4 Human-readable output from VW (most important features)
from our test data.
5.5.3 Testing the model
The following listing provides the code to evaluate a test set against a trained model and to generate both the probabilities and ground truth that will be used in the next step to create a ROC curve.

Listing 5.5 Testing against a trained model in VW

vw -d test_vw_file \
  -t \
  -i model.vw \
  --loss_function=logistic \
  -r predictions.out

~/dev/vowpal_wabbit/utl/logistic -0 predictions.out > probabilities.out
cut -d ' ' -f 1 test_vw_file | sed -e 's/^-1/0/' > ground_truth.dat
The first command uses the -d option to specify the data file, along with the -t option for testing (no learning). The -i option loads the previously trained model into memory, and we specify the -r switch to output the raw predictions. The next line uses a helper function to transform these raw predictions (which are of the form of a raw linear score) into their associated probability, as specified by the logistic function in section 5.4.5. Note that you'll most likely need to change the path to the helper function on your system. The logistic script can be found under the utl subdirectory of the VW repository (http://mng.bz/pu5U) originally checked out as per the book's prerequisites/requirements file.
More information about feature hashing and collision avoidance can be found at http://mng.bz/WQi7.
These resulting probabilities are stored in the probabilities.out file. The final line of listing 5.5 separates the columns of the test file by splitting on white space and selecting the first column, which contains the ground-truth class data. This is then piped to sed, which replaces any occurrences of -1 (how VW deals with negative classes in logistic regression) with 0. The output is sent to the ground_truth.dat file.
We're now finally at the position where we can evaluate the efficacy of our model against the test data. The next listing shows the scikit-learn code that's required to generate the ROC curve, shown in figure 5.5.

Listing 5.6 Generating the ROC curve with scikit-learn

import numpy as np
import pylab as pl

scikit-learn, "Receiver Operating Characteristic (ROC)," http://mng.bz/9jBY.
(Listing annotation: computes the false positive and true positive rates, and their associated thresholds, using scikit-learn.)
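Only the imports of listing 5.6 survived extraction; a minimal sketch of the rest, assuming the probabilities.out and ground_truth.dat files produced by listing 5.5, might look like this:

import numpy as np
import pylab as pl
from sklearn.metrics import roc_curve, auc

# Load the model probabilities and the 0/1 ground-truth labels created by listing 5.5
probs = np.loadtxt("probabilities.out")
truth = np.loadtxt("ground_truth.dat")

# False positive and true positive rates (and their associated thresholds)
fpr, tpr, thresholds = roc_curve(truth, probs)
roc_auc = auc(fpr, tpr)

pl.plot(fpr, tpr, label="ROC curve (area = %0.2f)" % roc_auc)
pl.plot([0, 1], [0, 1], "k--")              # the chance diagonal for reference
pl.xlabel("False positive rate")
pl.ylabel("True positive rate")
pl.legend(loc="lower right")
pl.show()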
The results show a value of 0.76 for the area under the curve based on our trained model and held-out test data. To translate this into a more intuitive number, if we chose a threshold corresponding to the point on the ROC curve closest to the top-left corner of figure 5.5 and used this threshold as a hard classifier to choose between clickers and non-clickers, we'd have a true positive rate of about 0.7 (that is, about 70% of actual clickers were correctly identified) and a false positive rate of about 0.3 (that is, 30% of non-clickers were incorrectly identified as clickers). To obtain these numbers, find the point closest to the upper-left corner on the upper curve in the figure, and read off the values on the x- and y-axes, respectively. In reality, however, models such as these are used not to classify impressions that will be clicked but to determine how much money should be paid to show an ad, as per the methodology in section 5.4.6.
Figure 5.5 The receiver operating characteristic for our model based on the supplied test data. The area under the curve is given as 0.76.

5.5.4 Model calibration
In the previous section, we alluded to the fact that we don't use the output of our trained model to create a classifier per se; rather, we use the output of the logistic regression model to estimate the probability of a click occurring. But if you remember, to achieve efficient model training, we changed the frequency of the positive events in the data such that the positive and negative events represented an even proportion of the training data. Thus, in order to obtain outputs from the model that are representative of the underlying probabilities of a click, we must calibrate the model.
model that reflects the probability of the positive event accurately, you must complete a further step. As you can see, the trained model will have an intercept that's ln( ) lower than that which would be obtained if you didn't use down-sampling.
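The derivation itself was lost in extraction. As a general illustration (not necessarily the book's exact formulation), a common correction when negatives have been down-sampled by a factor r is to map the model's output p' back to the original scale:

import numpy as np

def calibrate(p_model, r):
    # Map a probability learned on negative-down-sampled data back to the original scale.
    # r is the factor by which negatives were down-sampled (r = 3 if you kept every third one).
    p_model = np.asarray(p_model, dtype=float)
    return p_model / (p_model + r * (1.0 - p_model))

# Equivalent view in log-odds: the trained intercept is shifted by ln(r)
print(calibrate([0.5, 0.9], r=3.0))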
The future of real-time prediction
need to capture, direct, store, preprocess, and process continuously, generating train-
5.8 Summary
We've built an intelligent algorithm for online click prediction.
We provided a complete introduction to a very important application on the intelligent web. Very few applications have received such attention from the ML community and continue to do so at this level.
You learned about the use of an out-of-core machine-learning library called Vowpal Wabbit that allows training to occur without reading all the data into memory.
We discussed the future of real-time click prediction on the web.
Deep learning and neural networks
This chapter covers
Startup.ML, "Deep Learning News," June 30, 2015, http://news.startup.ml.
Jeubin Huang, "Overview of Cerebral Function," Merck Manual, September 1, 2015, http://mng.bz/128W.
Warren S. McCulloch and Walter H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity."

Figure 6.2 On the left, we provide a schematic of a human biological neuron. To the right, we show a human-...
6.3 The perceptron
In the previous section, we introduced the MCP neuron. With this basic approach, it turns out that it's possible to learn and to generalize training data, but in a very limited ...
Frank Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton (Cornell Aeronautical Laboratory, 1957).
Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review 65, no. 6 (November 1958): 386-408.
Figure 6.3 MCP as a 2-D linear model without inhibitory input (threshold output vs. summation output). The weights of the model correspond to the coefficients of a linear model. In this case, our neuron supports only a single input, and the weight has been set to 1 for illustration. Given a threshold of 0, all inputs with a value less than or equal to 0 would inhibit the neuron from firing, whereas all inputs with a value greater than 0 would cause the neuron to fire.
final result, ...

6.3.1 Training
Now that you know that a neuron ...

Figure 6.4 The perceptron. Inputs are received and multiplied by their associated weights, with the perceptron bias being added in afterward. This output, given by a, is then passed through a threshold function to obtain the output.

$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$
x1  x2  Threshold  Weighted sum  Sign of weighted sum  Output
1   0   1.5        -0.5          Negative              0
0   1   1.5        -0.5          Negative              0
0   0   1.5        -1.5          Negative              0
1   1   1.5        0.5           Positive              1

Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain."
Rosenblatt, Principles of Neurodynamics; Perceptrons and the Theory of Brain Mechanisms (Spartan Books, 1962).
Listing 6.1 The perceptron learning algorithm
Listing 6.2 The perceptron learning algorithm (2)
Brian Ripley, ...
Listing 6.3 Creating data for perceptron learning
(Listing annotations: calculates the output of the perceptron given the current weights at time t; updates the weights,

$w_i(t+1) = w_i(t) + \alpha\,(d - y(t))\,x_i$

noting that this is often done as a vector operation rather than a for loop. Weights are updated only if the expected output d and the actual output y(t) differ. If they do, we move the weight in the sign of the correct answer but with a magnitude given by the corresponding input and scaled by the learning rate $\alpha$, which represents the speed at which the weights are updated in the algorithm.)
... of these data points. In this example ... point x = ...
(Listing 6.3's annotation: creates four data points in a NumPy array.)
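The code of listings 6.1-6.3 didn't survive extraction; a minimal from-scratch sketch of the update rule above on four made-up AND-style data points (the learning rate and iteration count are arbitrary) might look like this:

import numpy as np

# Four data points in a NumPy array (the AND function) and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
alpha = 0.1                                        # learning rate: how fast the weights move

for _ in range(100):
    for x_i, d_i in zip(X, d):
        y_i = 1 if np.dot(w, x_i) + b > 0 else 0   # output given the current weights
        # Weights change only when expected and actual output differ
        w += alpha * (d_i - y_i) * x_i
        b += alpha * (d_i - y_i)
print(w, b)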
Listing 6.4 Training a single perceptron

p = perceptron.Perceptron(n_iter=100)

(Listing annotations: creates a single perceptron, where n_iter specifies the number of times the training data should be iterated through when training; trains the perceptron using the data and the associated labels; prints out the coefficients and the bias of the perceptron.)

Figure 6.5 Graphical overview of the data for a single perceptron. Data with class label 1 is represented by a round dot, whereas data with class label 0 is represented by a star. It's the aim of the perceptron to separate these points.
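A sketch of the rest of listing 6.4 under those annotations, using the current scikit-learn API (the data points and variable names are assumed here, standing in for the four points from listing 6.3):

import numpy as np
from sklearn.linear_model import Perceptron

# Four linearly separable toy points and their labels (made up for this sketch)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

p = Perceptron(max_iter=100)   # max_iter replaces the older n_iter argument
p.fit(X, y)                    # train the perceptron on the data and labels
print(p.coef_, p.intercept_)   # the learned coefficients and bias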
scikit-learn, "Perceptron," http://mng.bz/U20J.
Listing 6.5 Plotting the output of the perceptron

(Listing annotation: sets up a color array, and plots the data points: black star for class zero, red dot for class one.)
grad = -p.coef_[0][0]/p.coef_[0][1]
intercept = -p.intercept_/p.coef_[0][1]
x_vals = np.linspace(0,1)
y_vals = grad*x_vals + intercept
plt.plot(x_vals,y_vals)
plt.show()
We plot our four data points along with the projection of the separating plane learned by the perceptron onto the viewing plane. This provides us with a separation as per figure 6.6. In general, for a larger number of input variables, you can think of these as existing in n-dimensional space, with the perceptron separating them using a hyperplane in n + 1 dimensions. You should now be able to see that the basic linear form of the perceptron, that is, with a threshold activation, is equivalent to separation using a hyperplane. Consequently, such models are of use only where data is linearly separable and will be unable to separate positive and negative classes otherwise.
Creates the data points, and
plots the line of intersection
Figure 6.6 The projection of our separating plane onto the viewing plane. All points to the top right of the figure satisfy the constraint w · x > 0, whereas points to the bottom left of the line satisfy the constraint w · x < 0.
The XOR function isn't linearly separable in two-dimensional space; there exists no single hyperplane that can separate positive and negative classes perfectly. Try to draw a line anywhere on this graph with all positive classes on one side of the line and all negative classes on the opposite side of the line, and you'll fail! We could, however, separate these data points if we had ...
Marvin Minsky and Seymour Papert, ...

Table 6.2 Input and output values for the XOR function. It outputs a 1 if either x1 or x2 is 1, but not both.
x1  x2  Output
0   0   0
0   1   1
1   0   1
1   1   0
Figure 6.7 A multilayer perceptron. Read from top to bottom, it consists of an input layer ...
Figure 6.8 Graphical representation of the XOR function. Positive classes are specified with circles, whereas negative classes are specified with stars. No single hyperplane can separate ...
Figure 6.9 A two-layer perceptron can separate a nonlinearly separable function (XOR). Values on the connections demonstrate the weight of those connections. The introduction of the bias term ensures correct operation with an activation threshold of 0. Conceptually, the two hidden neurons correspond to two hyperplanes. The final combining perceptron is equivalent to a logical AND on the output of the two hidden neurons. This can be thought of as picking out the areas of the (x1, x2) plane for which the two hidden ...
6.4.1 Training using backpropagation
In the previous examples, we used the step function as our neuron-activation function, that is, a single threshold value over which the neuron is able to fire. Unfortunately, coming up with an automated method to train such a network is difficult. This is because the step function doesn't allow us to encode uncertainty in any form: the thresholds are hard.
It would be more appropriate if we could use a function that approximates the step function but is more gradual. In this way, a small change to one of the weights within ...
Figure 6.10 Separating a nonlinearly separable dataset ...
Figure 6.11 Output profiles for several activation functions over the same range as provided in figure 6.3. Activation profiles demonstrated are square root, logistic, negative exponential, and hyperbolic.
6.4.3 Intuition behind backpropagation
To provide the intuition behind backpropagation, we're going to work with the same example as previously, the XOR function, but we'll try to learn the weights rather than specify them by hand. Note also that from now on, the activation function will be assumed to be sigmoid (logistic). Figure 6.12 provides an overview of our new ...
Logistic regression, the perceptron, and the generalized linear model
Think back to chapter 4, where we introduced the concept of logistic regression. There we identified that a linear response model would be unsuitable for probability estimation and instead modified the response to the logistic curve to create a more appropriate response. From this starting point, we derived that the log-odds are linear in the combination of weights and input variables and applied this to a classification problem.
In this chapter, we started from a basic biological concept and built a computational formalism that captures this. We haven't discussed probability at all but started from a standpoint of activation within a neuron. Intuitively, we've extended this into a more general concept and reached the same equation:

$y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$

In fact, what we've encountered here is a more general class of problem known as generalized linear models (GLM). In this class of models, a linear model is related to the output variable, y, by a link function, which in this case is the logistic function.
This equivalence of algorithms and concepts is common in machine learning, and you've seen it already in this book. Just think back to section 2.5, where we discussed the equivalence of expectation maximization (EM) with a Gaussian mixture model with tied and coupled covariances and the vanilla k-means algorithm. The usual reason for this is that multiple researchers have started from different points within the research and discovered equivalence through the extension of basic building blocks.
a. Peter McCullagh and John A. Nelder, Generalized Linear Models (Chapman and Hall/CRC, 1989).
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning Representations by Back-Propagating Errors," Nature 323 (October 1986): 533-36.
Figure 6.12 Overview of the backpropagation example. Given a set of inputs ...
The first thing to note about training is that we're interested in how the error at the output changes with respect to the change in a single weight value. Only through this value can we move the weight in a direction that minimizes the error:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}}$

That is, the rate of change of error given a change in weight factors into three terms: the error given the output of the node, the output of the activation function given the weighted sum of its inputs, and the weighted sum of the inputs given the weight itself. For a squared-error loss, the first term is proportional to the difference between the expected and actual outputs.
Rumelhart et al., "Learning Representations by Back-Propagating Errors," 533-36.

Listing 6.6 Building a multilayer perceptron using PyBrain
Listing 6.7 Building a multilayer perceptron using PyBrain (2)

# inl, hidl, outl, and b are the input, hidden, output, and bias layers assumed to be created in listing 6.6
in_to_h = FullConnection(inl, hidl)
h_to_out = FullConnection(hidl, outl)
bias_to_h = FullConnection(b, hidl)
bias_to_out = FullConnection(b, outl)

PyBrain, "Quickstart," http://pybrain.org/docs/index.html#quickstart.
(Listing annotations: builds a PyBrain SupervisedDataSet; randomly samples the four data points ...,000 times and ...)
(Listing annotations: creates a new backpropagation trainer object; performs backpropagation 50 times using the same ...,000 data points, printing the error after every iteration.)
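The full PyBrain listings didn't survive extraction; purely as an illustration of the same workflow (using PyBrain's buildNetwork shortcut rather than the hand-wired connections above), a sketch might be:

from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork

# XOR dataset: two inputs, one target
ds = SupervisedDataSet(2, 1)
for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    ds.addSample(x, (target,))

# A 2-3-1 network with a bias unit; the hidden-layer size is an assumption for this sketch
net = buildNetwork(2, 3, 1, bias=True)

trainer = BackpropTrainer(net, ds)      # backpropagation trainer object
for i in range(50):
    print(i, trainer.train())           # one epoch of backprop; train() returns the error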
scikit-learn, "Restricted Boltzmann Machine Features for Digit Classification," http://mng.bz/3N42.
Geoffrey Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," in ...

Figure 6.13 Graphical overview of a Restricted Boltzmann Machine. An RBM is a ...
$P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \qquad E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$

Yann LeCun et al., "A Tutorial on Energy-Based Learning," in Predicting Structured Data.
Listing 6.12 Generating artificial data
scikit-learn, "Restricted Boltzmann Machine Features for Digit Classification," http://mng.bz/3N42.
Listing 6.14 Evaluating the RBM/LR pipeline
Listing 6.15 Representing the hidden units graphically
Figure 6.14 A graphical representation of the weights
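The pipeline code itself is missing; a minimal sketch in the spirit of the scikit-learn RBM example the chapter cites, with 100 hidden components feeding a logistic regression over 8x8 digit images (the hyperparameters here are illustrative), might look like this:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# 8x8 digit images: 64 visible units, scaled to [0, 1] for the Bernoulli RBM
digits = load_digits()
X = digits.data / 16.0
y = digits.target

# 100 hidden components; logistic regression runs on their activation probabilities
rbm = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=20, random_state=0)
clf = Pipeline([("rbm", rbm), ("logistic", LogisticRegression(max_iter=1000))])
clf.fit(X, y)
print(clf.score(X, y))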
In our RBM, we have 100 hidden nodes and 64 visible units, because this is the size of the images being used. Each square in figure 6.14 is a grayscale interpretation of the weights between that hidden component and each visible unit. In a sense, each hidden component can be thought of as recognizing the image as given previously. In the pipeline, the logistic regression model then uses the 100 activation probabilities (P(h=1|image) for each h) as its input; thus, instead of doing logistic regression over the 64 raw pixels, it's performed over 100 inputs, each having a high value when the input looks close to the ones provided in figure 6.14. Going back to the first section in this chapter, you should now be able to see that we've created a network that has automatically learned some intermediate representation of the numbers, using a ...
Making the right choice
Given a number of options, how do we choose the one that maximizes our reward (or, equivalently, minimizes the cost)? For example, if we have two possible routes to work, how might we choose the one that minimizes the time spent traveling? In this example, our reward is based on the ...
This chapter covers
Complexities when making the right choice
Multi-armed bandits
In this figure, we assume that the actor is exposed many times to a situation that requires a decision to be made. Each time a decision is made, an outcome is reached, and this outcome has an associated reward. In the case of our commuter, assuming they go to work every day, they have to choose their route every time they go to work. The outcome is the combination of all the factors on that commute, and the reward is formulated from this.
Although our toy example fits the problem perfectly, the intelligent web offers many other problems with the same format. It offers many applications where we're presented with several choices and making the correct one is crucial:
Landing-page optimization: With the advent of the intelligent web, there are ...
Figure 7.1 Formalizing the problem definition. In this situation, our actor is faced with a decision, and, depending on their choice, they reach an outcome. This outcome has an associated reward ... the maximum reward (or, equivalently, the minimum cost)?
When performing A/B testing, we do so with two groups: group A and group B. The first is a control group, and the second has some factor changed. For example, in the case of landing-page optimization, group A may be shown the existing landing page and group B a new landing page whose content or layout has changed in some way. The purpose of our A/B test is to un... estimate of the population conversion rate. This is true of both groups A and B. We can therefore come up with a new, also normal, random variable that's the combination of the random variables from groups A and B. This is the distribution of differences. Let us write $X$ to refer to this new random variable, defined as

$X = X_{\text{experiment}} - X_{\text{control}}$

Stuart Wilson and Rory MacLean, Research Methods and Data Analysis for Psychology (McGraw-Hill Education Europe, 2011).
Student, "The Probable Error of a Mean."
where $X_{\text{experiment}}$ is the random variable of conversion rates of our experimental group, and $X_{\text{control}}$ is the random variable of conversion rates of our control group. Given this new random variable, we can now write the null and alternative hypotheses. Our null hypothesis can be stated as follows:

$X = 0$

That is, the experimental group is no different than the control group. Both random variables $X_{\text{experiment}}$ and $X_{\text{control}}$ are distributed around the same population mean, so our new random variable $X$ should be distributed around 0. Our alternative hypothesis can be stated as follows:

$X > 0$

That is, the expectation of the random variable of the experimental group is larger than that of the control group; the population mean of this group is higher.
We can perform a one-tailed z-test on the distribution of $X$ ... experimental one. From the definition of the standard score, we may now write

$z = \frac{p_{\text{experiment}} - p_{\text{control}}}{\mathrm{SE}}$

where $p_{\text{experiment}}$ is the conversion rate of the experiment, $p_{\text{control}}$ is the conversion rate of the control, and $\mathrm{SE}$ is the standard error for the difference in conversion rates.

Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta, "Confidence Intervals for a Binomial Proportion and Asymptotic Expansions," The Annals of Statistics 30, no. 1 (2002): 160-201.
Sean Wallis, "Binomial Confidence Intervals and Contingency Tests: Mathematical Fundamentals and the ..."
equivalent to the conversion rate, and because the standard error of two variables can be combined by addition, we note the following:

$\mathrm{SE}^2 = \mathrm{SE}_{\text{experiment}}^2 + \mathrm{SE}_{\text{control}}^2, \qquad \mathrm{SE}_{\text{group}}^2 = \frac{p_{\text{group}}(1 - p_{\text{group}})}{n_{\text{group}}}$

By substitution, we can now write our z-test as follows. This is a formulation of the Wald (or normal) interval for the binomial distribution:

$z = \frac{p_{\text{experiment}} - p_{\text{control}}}{\sqrt{\dfrac{p_{\text{experiment}}(1 - p_{\text{experiment}})}{n_{\text{experiment}}} + \dfrac{p_{\text{control}}(1 - p_{\text{control}})}{n_{\text{control}}}}}$
The greater the value of z, the more evidence against the null hypothesis. To obtain a 90% confidence interval for a one-tailed test, our value of z would need to be greater than 1.28. What this actually says is that, under the assumption of the null hypothesis (the population mean for groups A and B is the same), the probability of this difference in conversion rates, or one larger than this, occurring by chance is less than 10%. Put another way, assuming the control and experiment conversion rates are drawn from a distribution with the same mean, if we run this same experiment 100 times, only 10 of these would have at least such an extreme value. We can provide even tighter bounds and more evidence against the null hypothesis by using a 95% confidence interval. This increases the z value required to 1.65.
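As a quick check (not from the book), these thresholds are simply quantiles of the standard normal distribution, which you can confirm with scipy:

from scipy.stats import norm

# One-tailed critical values of the standard normal distribution
print(norm.ppf(0.90))   # ~1.28, the 90% threshold quoted above
print(norm.ppf(0.95))   # ~1.64, rounded to 1.65 in the text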
It might be useful for you to think about the factors that will impact the size of z. Obviously, if we draw two conversion rates at a given point in time from an experimental set and a control set, a larger difference in these will lead to a larger z score and therefore more evidence that they're drawn from different populations with different means. But the number of samples is also important. As you've seen, a larger number of samples will lead to an overall smaller standard error. This captures the fact that our estimate of the conversion rate is more accurate the longer we run the experiment. In the following section, we'll provide an illustration of this and a worked example of A/B testing in Python using this methodology.
7.1.2 The code

Suppose the conversion rates for the two landing pages have been measured over the course of the experiment
and found to be 0.002 for the new landing page and 0.001 for the original landing page. You need to know whether this increase is significant enough to warrant the change, so you calculate the z value for the experiment. Given these values, we obtain a z value of 1.827. This exceeds the 92% confidence interval but not the 95% interval. We can say that there is a probability of less than 0.08 that the data was drawn from the control distribution. Consequently, at this interval the data is significant. We should reject the null hypothesis that there is no difference at all between the groups and accept the alternative hypothesis, that the second group has a higher conversion rate. If we've controlled all other aspects of the user groups, this should imply that the website redesign has had a positive effect.

You should be able to see from this code that the standard error of the distributions from which the conversion rate is drawn has a direct effect on the z value obtained.
Listing 7.1 Calculating the standard (z) value for your experiment
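The body of listing 7.1 didn't survive in this copy, so the following is only a minimal sketch of the same calculation. It assumes we know each group's conversion rate and number of observations; the variable names and the sample size of 10,000 per group are illustrative (that sample size happens to reproduce the z of roughly 1.83 discussed above, but the book's actual figures aren't shown here).

from math import sqrt

def z_value(p_experiment, n_experiment, p_control, n_control):
    # Standard error of each binomial proportion, combined by adding variances
    se = sqrt(p_experiment * (1 - p_experiment) / n_experiment
              + p_control * (1 - p_control) / n_control)
    return (p_experiment - p_control) / se

print(z_value(0.002, 10000, 0.001, 10000))   # roughly 1.83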
7.1.3 Suitability of A/B testing

In this section, we've given a basic introduction to a statistical method for testing called the z-test. We discussed it in the context of an A/B test, where we want to test the impact of a change in operations.

Figure 7.2 We're given a fixed value for the conversion rates of the A/B groups; the plot spans sample sizes from 0 to 20,000.
advantage. What if we could exploit new solutions in a quicker fashion, without waiting for the test to finish completely? Well, we can. Enter the multi-armed bandit (MAB).
7.2 Multi-armed bandits

The name multi-armed bandit (MAB) comes from the popular casino game, the one-armed bandit. For those of you who have never been to a casino, this is a machine whereby you pull on a lever (the arm); depending on the values displayed by the machine, you're either paid out or (more likely) not. As you'd expect, the odds of these machines are in favor of the house, so the likelihood of being paid out is typically very small!
Figure 7.3 The multi-armed bandit problem. A user is faced with a number of one-armed bandits, each of which has a different probability of paying out. The user doesn't know what these probabilities are and may only uncover them through playing the machines. What strategy should the user employ in order to maximize their reward?
A strategy is typically judged by its regret: the difference between the payout we would have received by always playing the best bandit and the payout we actually received. We can write this as

    regret = T * mu_opt - sum_{t=1..T} r_t

where T is the number of steps we've carried out so far, r_t is the reward that was received at step t, and mu_opt is the mean payout returned per play from the optimal bandit.
Listing 7.2 Pseudo-code for the epsilon-first strategy

The original listing's callouts note an initial phase that explores the solution space, followed by a phase that exploits your knowledge of the best bandit.
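The pseudo-code itself is missing from this copy; the sketch below is consistent with those callouts and with the epsilon-greedy and epsilon-decreasing listings that follow (rand, play, and update_best_bandit are assumed to behave as they do there).

epsilon = 0.1
best_bandit                    # index of the best bandit in the array
bandit_array                   # array containing bandit objects
total_reward = 0
number_trials
current_trial = 0
while((number_trials - current_trial) > 0):
    if(current_trial < epsilon * number_trials):
        # Explores the solution space for the first epsilon*number_trials plays
        random_bandit = rand(0, len(bandit_array))
        total_reward += play(bandit_array[random_bandit])
        update_best_bandit()
    else:
        # Exploits your knowledge of the best bandit for the remaining plays
        total_reward += play(bandit_array[best_bandit])
    current_trial += 1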
EPSILON-GREEDY

With the epsilon-greedy approach, epsilon acts as a probability that we'll explore the space, as opposed to exploiting the best-known arm. More formally, the following pseudo-code outlines the approach.
epsilon = 0.1
best_bandit                    # index of the best bandit in the array
bandit_array                   # array containing bandit objects
total_reward = 0
number_trials
current_trial = 0
while((number_trials - current_trial) > 0):
    random_float = rand(0, 1)
    if(random_float < epsilon):
        # Explores the space
        random_bandit = rand(0, len(bandit_array))
        total_reward += play(bandit_array[random_bandit])
        update_best_bandit()   # updates the best-known bandit
    else:
        # Exploits your knowledge of the best bandit
        total_reward += play(bandit_array[best_bandit])
    current_trial += 1
The advantage of this approach is that we don't need to wait for the explore phase to complete before we can start exploiting our knowledge of the bandits' performance, but be careful! This algorithm doesn't take into account the statistical significance of our performance data. It's possible that a peak in positive payouts for a particular bandit will result in all plays being shifted to this bandit erroneously. More on this shortly.

You can see that epsilon controls the probability that we explore rather than exploit. Low values of epsilon make it less likely that we'll explore the space, whereas higher values make it more likely. There's a clear trade-off here, and the selection of our value of epsilon depends on many factors. Both the number of bandits and the probabilities of payout play a part.
Listing 7.3 Pseudo-code for the epsilon-greedy strategy
Listing 7.4 Pseudo-code for the epsilon-decreasing strategy

epsilon = 1                    # starts by exploring all the time
best_bandit
bandit_array
total_reward = 0
number_trials
current_trial = 0
while((number_trials - current_trial) > 0):
    random_float = rand(0, 1)
    if(random_float < epsilon):
        random_bandit = rand(0, len(bandit_array))
        total_reward += play(bandit_array[random_bandit])
        update_best_bandit()   # updates the best-known bandit
    else:
        total_reward += play(bandit_array[best_bandit])
    current_trial += 1
    epsilon = update_epsilon(epsilon)   # decreases epsilon as the trials progress
Note that there are several methods to choose an optimal rate at which to update epsilon; it depends on the number of bandits, n, and the respective weights at which they pay out.
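As one concrete choice (an illustration, not the book's), a simple geometric decay of epsilon already works reasonably well:

def update_epsilon(epsilon, decay=0.99):
    # Illustrative schedule only: keeps early plays exploratory and
    # gradually shifts toward exploiting the best-known bandit.
    return epsilon * decay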
BAYESIAN BANDITS

As mentioned, one of the limitations of this algorithm is that we don't take into account the significance of the performance data as we explore the space. Although you might be able to exploit your knowledge sooner, you may be making erroneous decisions. That's not to say that these techniques aren't useful! But they require a…
Listing 7.5 Pseudo-code for the Bayesian bandit strategy

bandit_distribution_array      # initializes with suitable priors over each bandit's payout rate
total_reward = 0
number_trials
current_trial = 0
while((number_trials - current_trial) > 0):
    sample_array = sample(bandit_distribution_array)
    best_bandit = index_of(max(sample_array))
    reward = play(bandit_array[best_bandit])
    total_reward += reward
    current_trial += 1
    update_distribution_array(best_bandit, reward)
You can see the elegance of this solution. Although it's simple to implement, this approach both models the uncertainty of our estimates and provides an excellent balance between exploring the options and exploiting what we already know.
Figure 7.4 Modeling the knowledge about the payout rate of three bandits using the Bayesian bandit method. The mean payout rates for arms 1, 2, and 3 are 0.1, 0.3, and 0.4, respectively. Bandit 1 has a lower mean but a much wider variance. Bandit 2 has a higher mean and smaller variance. Bandit 3 has an even higher mean and a smaller variance. In order to choose the bandit to play, each distribution is sampled, and the bandit corresponding to the distribution from which the highest sample was drawn will be picked. After selection, the bandit is played and the respective distribution is updated. This has the effect that even bandits with a low payout rate have a chance to redeem themselves if their mean rate is uncertain (high variance).
7.3 Bayesian bandits in the wild

Listing 7.6 A class encoding a one-armed bandit
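The body of this listing (and of the helpers it is used with) is missing from this copy. The following is a minimal stand-in consistent with the surrounding text: a bandit with a hidden payout probability and a Beta-Bernoulli update for our belief about it. The method and function names other than Bandit and bandit_params are illustrative, not the book's.

import numpy as np

class Bandit:
    """A one-armed bandit with a fixed, hidden probability of paying out."""
    def __init__(self, probability):
        self.probability = probability

    def pull_handle(self):
        # Returns 1 on a payout, 0 otherwise
        return 1 if np.random.random() < self.probability else 0

def update_params(bandit_params, index, reward):
    # Beta-Bernoulli update: the first parameter counts payouts, the second counts misses
    a, b = bandit_params[index]
    bandit_params[index] = (a + reward, b + 1 - reward)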
def sample_distributions_and_choose(bandit_params):
    sample_array = \
        [np.random.beta(a, b) for (a, b) in bandit_params]   # one draw from each bandit's Beta belief
    return int(np.argmax(sample_array))                      # reconstructed: index of the highest sample
Listing 7.8 Running the Bayes bandit
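Again, the original body is missing; here is a minimal sketch of the run itself, reusing the Bandit class and helpers sketched above (the number of trials is an assumption):

bandit_list = [Bandit(0.1), Bandit(0.3), Bandit(0.8)]
bandit_params = [(1, 1), (1, 1), (1, 1)]      # uniform Beta(1,1) priors
total_reward = 0

for trial in range(1000):
    choice = sample_distributions_and_choose(bandit_params)
    reward = bandit_list[choice].pull_handle()
    update_params(bandit_params, choice, reward)
    total_reward += reward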
Listing 7.9 Plotting the initial priors for the Bayes bandit

import matplotlib.pyplot as plt               # reconstructed imports
from scipy.stats import beta
bandit_list = [Bandit(0.1), Bandit(0.3), Bandit(0.8)]
bandit_params = [(1, 1), (1, 1), (1, 1)]      # uniform Beta(1,1) priors
x = np.linspace(0, 1, 100)
plt.plot(x, beta.pdf(x, *bandit_params[0]))   # reconstructed call; repeated for each bandit
Listing 7.10 Plotting a single execution of a Bayes bandit strategy

Here we use the same approach as in listing 7.9 to plot each bandit's belief distribution, this time with the updated (posterior) parameters.

Listing 7.11 Plotting the posterior probability density functions
Figure 7.5 Probability density function over the beliefs in payout probabilities for each bandit. Note that the three plots are identical and that the area under the curve is equal to 1. This illustrates our belief that all payout probabilities are equally likely for each bandit. This will change as we observe the bandits in action.
plt.plot(x, beta.pdf(x, *bandit_params[0]),
         alpha=0.6, label='Bandit 1')          # reconstructed plot call; repeated for bandits 2 and 3
plt.legend()
If you consider figure 7.6 and figure 7.7, you may be able to understand a little better what's happening. During the very early plays, because the distributions of beliefs are almost identical, the algorithm is equally likely to choose any of the bandits to play. This leads to a cumulative regret that accrues during these early plays.
Figure 7.7 Posterior probability distributions after a single run of 1,000 plays. This represents our belief over the possible true distributions of the bandits, which have an unobservable true probability of payout equal to 0.1, 0.3, and 0.8, respectively.
Here you can see that the outcome from an identically initialized Bayesian bandit can be dramatically different for independent executions. The cumulative regret ranges from about 4 (the expected difference between the outcome and the optimal outcome is five payouts) to 18. In most cases, the…
this scenario. But this is an especially good… about the bandit. Second, this has the knock-on effect that the overall regret will be higher than when working with bandits with higher probabilities. The reason for the inversion of the ordering is that this is such an early stage of the experiment.

Figures in this section compare runs over different bandit configurations, with payout probabilities such as (0.2, 0.5, 0.8), (0.3, 0.5, 0.7), (0.01, 0.05, 0.09), (0.02, 0.05, 0.08), and (0.04, 0.05, 0.06), and over different numbers of bandits (2, 4, 5, 8, 9, and 10).
7.4 A/B vs. the Bayesian bandit

So far in this chapter we've discussed several approaches to making the right choice. Much has been said in the community about multi-armed bandits, and there has been a lot of hype about them, with many seeing them as a no-brainer alternative to A/B testing. After all, if you can test your choices while performing optimization, doesn't that make more sense than waiting for statistical significance with an A/B test?

Well, yes and no. As with all machine-learning approaches, there are trade-offs, summarized in the following comparison.
Bayesian bandit                                     A/B test
Continuous online optimization                      Single-shot optimization after the run
Multiple variables                                  Single test variable
Confidence not explicitly measured                  Statistically significant result
Convergence related to the number of options,       Convergence related to the payout rate and
payout rates, and differences                       difference
John Myles White, Bandit Algorithms for Website Optimization (O'Reilly, 2012).
Richard Weber, "On the Gittins Index for Multiarmed Bandits," Annals of Applied Probability (1992): 1024-1033.
J. C. Gittins, Multi-armed Bandit Allocation Indices (John Wiley and Sons, 1989).
John Langford, "Contextual Bandits," Machine Learning (Theory), October 24, 2007.
Figure 7.13 A contextual bandit. A vector of attributes assigned to a particular user changes the payout probabilities. The payout probability isn't fixed and is dependent on the context or situation. One way to think about this is that the player has assigned to them a vector of attributes at a given time, and, depending on those attributes, the payout probabilities of the machine will change. The best strategy must minimize the cumulative regret.
7.5.2 Adversarial bandits

If you recall from previous sections, up until now we've made the implicit assumption that the payout distribution is always static. That is, these distributions don't change as we play the bandits. Solutions to the adversarial bandits problem don't make this assumption. In this formalism, at each step:
- The adversary selects a reward vector the size of the number of bandits. This contains the rewards for each bandit at that step.
- The player, without knowledge of the adversary's selection, chooses a bandit to play based on their strategy.
- In the full-information game, the player then may see the entire reward vector. In the partial-information game, the player sees only the reward associated with their chosen bandit.
Play continues in this way over a fixed number of steps, and the player must maximize their total reward; a minimal sketch of this game follows.
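This sketch is purely illustrative (it isn't the book's code, and the random adversary and player below are placeholders for real strategies); it just shows the shape of the partial-information game:

import random

def adversarial_play(adversary, strategy, n_bandits, n_steps):
    total_reward = 0
    for t in range(n_steps):
        rewards = adversary(t, n_bandits)   # adversary fixes this step's reward vector first
        choice = strategy(t)                # player chooses without seeing that vector
        total_reward += rewards[choice]     # partial information: only this entry is revealed
    return total_reward

adversary = lambda t, n: [random.random() for _ in range(n)]   # placeholder adversary
strategy = lambda t: random.randrange(3)                       # placeholder player
print(adversarial_play(adversary, strategy, n_bandits=3, n_steps=1000))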
P. Auer, N. Cesa-Bianchi, Y. Freund, and Robert E. Schapire, "Gambling in a Rigged Casino: The Adversarial Multi-armed Bandit Problem," Proceedings of the 36th Annual Symposium on Foundations of Computer Science (IEEE, 1995): 322-331.
Figure 7.14 The adversarial bandit problem. A vector of attributes chosen by the adversary changes the payout probabilities. In this variant of the MAB problem, solutions make no assumptions about the underlying distributions of the rewards. Instead, the problem is modeled as a game against an adversary.
7.6 Summary

The future of the intelligent web

This chapter covers
- Summary and review
- Future applications of the intelligent web
- Social implications of the intelligent web

In this book, we've presented to you the current state of the intelligent web and walked you through the basics, teaching you what an intelligent algorithm is and how to evaluate one. Your eyes should now…

Pedro Domingos, "A Few Useful Things to Know about Machine Learning," Communications of the ACM 55, no. 10 (October 2012): 78-87.
seem removed from what many consider an intelligent algorithm, we can't stress enough the importance to the enthusiast of having at least a passing interest in this area. Future practitioners in this space will absolutely need to know the magnitude, velocity, and availability of their real-time data.

We've covered what we consider some of the most pervasive topics in intelligent algorithms: extracting structure; providing recommendations; classification, click prediction, and deep learning; as well as selection and testing. These topics aren't com…

Marc Weiser, "The Computer for the 21st Century," Scientific American, September 1, 1991: 66-75.
Nest Thermostat, 01/01/2015.
makes appointments, interacts with your smartphone, starts dinner, and activates the washing machine? This appears to be some…

Douglas McIlwraith, "Wearable and Ambient Sensor Fusion for the Characterisation of Human Motion," International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2010): 505-510.
Julien Pansiot, Danail Stoyanov, Douglas McIlwraith, Benny Lo, and Guang-Zhong Yang, "Ambient and Wearable Sensor Fusion for Activity Recognition in Healthcare Monitoring Systems," IFMBE proceedings of the 4th Inter…
may have already seen an illustration of this in the film Minority Report, where personalized ads are presented physically as people interact with their environment. Far-…

Andrew Orlowski, "Facebook Brings Creepy Minority Report-Style Ads One Step Closer," The Register, http://mng.bz/ML01.
Siraj Datoo, "This Recycling Bin Is Following You," Quartz, August 8, 2013, http://mng.bz/UUbt.
Joe Miller, "City of London Calls Halt to Smartphone Tracking Bins," BBC News, August 12, 2013, http://mng.bz/k56c.
Although as we have seen, this is changing rapidly; consider Google Now and Apple's Siri (www.apple.com/uk/ios/siri/).
Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," Scientific American, May 1, 2001: 34-43.
We might be some way from the fluid and natural user experience with the web as described in the opening example of the aforementioned paper, but much research toward this is finding application in industry. What we can be sure of is that any attempt to provide richer semantics to the vast amount of information available on the web can only be a good thing for designers of intelligent algorithms, allowing us access to knowledge as…
8.2 Social implications of the intelligent web

In the previous sections, we've provided several potential areas for the development of…

Capturing data on the web (appendix)

As you've learned in this book, intelligent applications are those that can change their behavior based on information. It follows, then, that we must have a mechanism for the capture and access of data. Because we're talking about web-scale processing, it stands to reason that we may need a system designed with the following in mind:
- Volume: Our system should be capable of dealing with web-scale data.
- Scalability: Our system should be configurable with changing load.
- Durability: …
A motivating example: showing ads online
paid for on a cost per mille (CPM) basis; that is, for every thousand ads shown, an amount is paid from the advertiser to the publisher. The latter is charged on a cost per click (CPC) basis, where payment is exchanged every time a consumer sees and subsequently clicks an ad.

The complexity arises in the process by which this all occurs. Rather than advertisers working with publishers directly, these…
Figure A.1 Overview of a demand-side platform. DSPs use their knowledge…
a user ID known to the DSP, or the exchange will pass its own ID and the DSP will have to look up its own ID accordingly. The reasons for this have to do with the security of the page and the scope of the identifiers; an exact explanation lies outside the scope of this example. Suffice to say that the DSP can obtain an ID in the correct format to look up information regarding user behavior.

What information does the DSP store? Essentially, any interaction with associated users, sites, and ads. To il…
Figure A.2 Overview of the log-processing architecture of a typical DSP. Web page interactions are recorded by the browser and sent back to the DSP's server(s) via HTTP. Depending on the interaction type, these will be written to different files and stored in different locations. Over time, these files are collated and imported to a database for downstream processing.
will have to be configured to deal with a high connection load. Once interactions are received, a number of things need to happen. First, as you can see, depending on the interaction type, the event is logged into different folders. The reason has to do with priority. Click logs equate to money, so minimizing the latency of this data makes good sense.

For the purposes of this example, where log files are sent on to the database, it makes sense to periodically finalize files and start new ones so that we can send the data as soon as possible. Consequently, files need to be continually created, finalized, and recycled while data is entering and flowing through the system.

At this stage, finalized data is shipped to the database for import. Here we must deal gracefully with the several possible failure cases. First, corrupt files must be ring-fenced for manual intervention. Processed files must be separated for archival, and transmission failures must be retried. It's imperative that data be received by the database in a consistent state, because without this, billing and reporting can't happen effectively.
Managing data collection at scale

What we've just described is an extremely naïve approach…
Latency in this system is closely related to several aspects of its design. In order for an event to be shipped to the database, it must first have reached the file writer and been written to a file, and that file must be closed. When this happens will depend on the load and queuing capability of the receiving server as well as the maximum file size for a log file. After shipping, the file must again wait to be imported by the database.

In terms of flexibility, there's only a single consumer of the data: the database. Multiple consumers would result in an increase in transmission bandwidth, because log files would have to be sent to all consumers. This would have an associated impact on the logic required to keep the data consistent. If consumers were allowed to be at different states of progress, enabling logic would have to be borne by the log-shipping code.
Hopefully, this simple example is illustrative enough to convince you that data capture at web scale isn't easy! This, of course, all depends on your application and its requirements, but striving to achieve the properties at the start of this appendix will lay the groundwork for the rest of your application. Each property has an associated knock-on effect on your intelligent algorithm and its deployment lifecycle. For example, if your training data reaches your algorithm with low latency, you can train and deploy more-relevant models more quickly. If access is flexible, multiple algorithms can consume the data in different ways in parallel. Wouldn't it be nice if there were a system that was designed and built for this purpose?

Introducing Kafka

As it happens, there is a system built for this purpose! Apache Kafka is a distributed log-processing platform for dealing with high volumes at low latency. At its core, Kafka is a distributed cluster of brokers. Producers of data can publish messages to a topic (a stream of messages of a given type) that are then read by consumers subscribed to that topic.
bound to port 2181. This is a centralized service that's used to share Kafka configuration; more on this later. To communicate with Kafka, you'll need bindings in your particular language of choice. Because we're using Python throughout this book, we'll use David Arthur's kafka-python to connect to our Kafka instance (v0.9.5). In order to run the examples in this appendix, we're going to assume that you've made this available to your favorite Python environment already.

David Arthur, kafka-python, https://github.com/dpkp/kafka-python.git.
Listing A.1 Kafka simple publish

The original listing imports the relevant definitions from the kafka-python module, creates a KafkaClient object that points to your local broker, creates a SimpleProducer instance that will be used to publish messages, and sends messages to Kafka under the "test" topic.
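The code itself is missing from this copy; a minimal sketch that matches those callouts, assuming the kafka-python 0.9.x API used throughout this appendix and a broker listening on localhost:9092, looks like this:

from kafka import KafkaClient, SimpleProducer   # imports the relevant definitions

kafka = KafkaClient("localhost:9092")            # points at your local broker (address assumed)
producer = SimpleProducer(kafka)                 # used to publish messages

producer.send_messages("test", "Hello Kafka!")   # sends a message under the "test" topic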
Topics comprise multiple partitions. Each partition must be held on the same broker, but the partitions for a single topic can be spread over multiple brokers. This serves several purposes. First, it allows topics to scale beyond the size of a single server. Although each partition must physically fit on the machine that hosts it, the logical unit, the topic, can grow to be much larger than a single machine. It also provides a level of redundancy and parallelism. Duplicate partitions can be held, and multiple partitions can be read from and serviced by different machines. The number con…

Apache Kafka Documentation, Apache Projects, http://kafka.apache.org/documentation.html.
Listing A.2 Kafka simple subscribe

Figure A.3 Anatomy of a topic, taken from the Kafka documentation. Topics consist of a partitioned log, which can only be appended to. The partitions are immutable: they can't be changed once committed.
The original listing creates a KafkaClient object, instantiates a SimpleConsumer with the KafkaClient object, and prints each message it receives.
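As with listing A.1, here is a minimal stand-in consistent with the callouts and with the consumer-group discussion that follows (again assuming kafka-python 0.9.x and a local broker):

from kafka import KafkaClient, SimpleConsumer

kafka = KafkaClient("localhost:9092")                 # creates a KafkaClient object
consumer = SimpleConsumer(kafka, "mygroup", "test")   # consumer group "mygroup", topic "test"

for message in consumer:                              # prints each message as it arrives
    print(message)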
Here we create a SimpleConsumer, which takes as arguments a KafkaClient as well as two string arguments. The first is known as a consumer group, and the second relates to the topic the consumer is to consume from. The concept of a consumer group allows multiple consumers to coordinate, such that a single message published to a topic is only ever delivered to a single member of a consumer group. In this example, there's only a single consumer in our group, titled mygroup, and it will receive all messages destined for this group.
Listing A.3 Configuration changes in server.properties for broker 2
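The configuration lines are missing from this copy. The standard changes for an additional broker on the same machine, per the Kafka quickstart, look like the following; the exact broker id, port, and log directory are assumptions (the describe output later shows brokers 0, 1, and 2):

broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1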
Note that because we haven't modified Zookeeper details in server.properties, they will all communicate with the same Zookeeper instance. No further configuration is required to ensure that they work as part of the same cluster. You'll require some configuration to create a topic with a new replication level, however. Navigate to your Kafka folder, and input the following:

bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 3 --partitions 1 --topic test-replicated-topic

This will create a new topic called test-replicated-topic. This topic has one partition and a replication level of 3. To confirm this, issue the following command:

./bin/kafka-topics.sh --describe --zookeeper localhost:2181

All being well, you should see some output that looks remarkably like the following listing.
Listing A.4 Kafka topic summary

Topic:test  PartitionCount:1  ReplicationFactor:1
    Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
Topic:test-replicated-topic  PartitionCount:1  ReplicationFactor:3
    Topic: test-replicated-topic  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2
Listing A.5 Kafka publish to cluster

kafka = KafkaClient("localhost:9092")
producer = SimpleProducer(kafka, async=False,              # SimpleProducer with synchronous sends
    req_acks=SimpleProducer.ACK_AFTER_CLUSTER_COMMIT)      # reconstructed options; the prose below describes the intent
producer.send_messages("test-replicated-topic", "Hello Kafka Cluster!")
producer.send_messages("test-replicated-topic", "Message to be replicated.")
producer.send_messages("test-replicated-topic", "As is this!")
We provide an alternative construction of SimpleProducer. Here we specify synchronous communication, blocking until the entire cluster has acknowledged a new message before SimpleProducer will continue.
Listing A.6 Kafka topic summary (dead replica)
http://kafka-python.readthedocs.org/en/latest/usage.html
have no effect on the attempts made by the cluster to achieve consistency. They do, however, allow the producer to block on different events within the replication protocol and modify downstream behavior dependent on certain outcomes. How replica synchronization is achieved, even in the case…

Jun Rao, "Intra-cluster Replication in Apache Kafka," February 2, 2013, http://mng.bz/9Z99.
Figure A.4 The different levels of acknowledgments in Kafka. From left to right, they offer increasing degrees of availability at the cost of acknowledgment latency.
Figure A.5 Kafka replication, reproduced from Rao. In this example, we have four brokers and one topic with two partitions. Broker 1 leads partition 1, and broker 4 leads partition 2.
In figure A.5, we present a Kafka cluster of four brokers. This cluster manages a single topic with two partitions. The replication… synchronization. Under normal operation,…
Consumer groups, balancing, and ordering

In listing A.2, we introduced the concept of Kafka consumer groups. This abstraction allows Kafka to be used either as a queue, whereby multiple consumers process items from the head, with each message going to a single consumer, or as a publish-subscribe system, in which each message is broadcast to all consumers.

Recall that a single message published to a topic is guaranteed to be delivered only to a single consumer of a consumer group. So where we have one topic and one group, this emulates the behavior of a queue. Where each consumer is a member of its own group, this provides the same behavior as a publish-subscribe system: each consumer receives every message, as the short sketch that follows illustrates.
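The following isn't from the book; it just illustrates the two configurations with the same SimpleConsumer API as listing A.2 (topic and group names are arbitrary):

from kafka import KafkaClient, SimpleConsumer

kafka = KafkaClient("localhost:9092")

# Queue behavior: both consumers share one group, so each message goes to only one of them
worker_a = SimpleConsumer(kafka, "workers", "click-streams")
worker_b = SimpleConsumer(kafka, "workers", "click-streams")

# Publish-subscribe behavior: each consumer has its own group, so both see every message
subscriber_a = SimpleConsumer(kafka, "group-a", "click-streams")
subscriber_b = SimpleConsumer(kafka, "group-b", "click-streams")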
concept, we'll introduce a schema of data as illustrated in the following listing. This can also be found in the additional resources available at this book's website, should you wish to follow the example programmatically.
381094c0-9e45-424f-90b6-ad64b06cc184 2014-01-02 17:33:07 click
6615087e-ea19-492c-869c-28fc1fa77588 2014-01-02 17:33:20 view
d4889090-942b-4790-9831-362051b0847b 2014-01-02 17:34:01 adview
e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:35:06 adview
e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:35:30 click
e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:37:01 click
e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:39:00 convert
ae6ae5c9-acb2-479b-adcb-0c29623d921b 2014-01-02 17:40:00 adview
1ac384c1-1b2d-4ed0-b467-7c90b7ac42d8 2014-01-02 17:40:01 adview
280bfa16-07ac-49ed-a1a5-9ab50a754027 2014-01-02 17:40:03 click
dda0e95d-9c30-4f60-bb6a-febf05526b83 2014-01-02 17:40:05 adview
8a1204f1-5076-4d4c-8b23-84c77ad541d8 2014-01-02 17:40:10 adview
3bdb8f17-11cc-49cb-94cf-75676be909b7 2014-01-02 17:40:11 adview
69b61156-6c31-4317-aec5-bd48908b4973 2014-01-02 17:40:13 adview
69722471-0532-4f29-b2b4-2f9007604e4f 2014-01-02 17:40:14 adview
00e5edf6-a483-48fa-82ed-fbfac8a6b1e6 2014-01-02 17:40:15 adview
Listing A.7 Click user data (note the adview, click, and convert events above, all from the same user and within a short period of time)

Figure A.6 Our final Kafka cluster. In this example, we use a low-level producer to perform our own partitioning of the stream of data. This cluster is operating with three brokers, three partitions, and level-3 replication. Two consumer groups exist, one with a single consumer called group-1 and the other with three consumers, known as group-2.
9398d369-6382-4be0-97bc-182b3713745f 2014-01-02 17:40:17 convert
f40c1588-e4e1-4f7d-8ef5-5f76046886fb 2014-01-02 17:40:18 adview
54823527-fe62-4a81-8551-6282309b0a3f 2014-01-02 17:40:20 click
46d6f178-7c11-48c1-a1d7-f7152e7b2f1c 2014-01-02 17:40:26 adview
4c4e545b-d194-4531-962f-66e9d3b6116d 2014-01-02 17:41:00 convert
42b311f5-ba84-4666-a901-03063f7504a9 2014-01-02 17:41:01 adview
bfa28923-c358-4741-bcbf-ff99b640ee14 2014-01-02 17:42:06 adview
54c29b39-5640-49b8-b610-6f2e6dc6bd1b 2014-01-02 17:42:10 convert
edf6c5d2-1373-4dbb-8528-925d525b4a42 2014-01-02 17:43:03 click
f7f6752f-03bf-43f1-927c-8acdafd235e2 2014-01-02 17:43:11 adview
f4b7c0a6-b209-4cc4-b4e7-395489e0e724 2014-01-02 17:43:19 click
This is a fictitious and simplified dataset based on page interaction. Note that the schema has three distinct columns. The first is a uniquely identifying code, or globally unique identifier (GUID); the second is the time of the interaction; and the third is the interaction type. There are only three possible types of interaction: adview, which corresponds to the action that an ad has been seen by a user; click, which denotes a click by a given user; and convert, which denotes a conversion by that user.
Listing A.8 Custom partitioning with a producer

The original listing opens the data file that contains the stream data, iterates over it, splits each line by tab and pulls out the first column (the GUID of the user), uses the GUID to obtain a partition number, creates a ProduceRequest object that specifies the click-streams topic and that partition, and sends the data, with the responses stored in the resps object.
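A sketch consistent with those callouts, using the low-level kafka-python 0.9.x API; the file name is an assumption, and reducing the GUID to a number with Python's built-in hash is just one of several ways to obtain a partition index:

from kafka import KafkaClient, create_message
from kafka.common import ProduceRequest

kafka = KafkaClient("localhost:9092")

with open("user_interactions.tsv") as f:            # opens the data file with the stream data
    for line in f:                                   # iterates over the data
        guid = line.split("\t")[0]                   # first column contains the user's GUID
        partition = hash(guid) % 3                   # derive a partition number from the GUID
        req = ProduceRequest("click-streams",        # ProduceRequest for the click-streams topic
                             partition,
                             [create_message(line)])
        resps = kafka.send_produce_request([req])    # send, keeping the responses in resps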
This toy producer partitions by user. In a production implementation of such a system, the click stream would come directly from a web browser or associated code, but for the purposes of illustration this data comes directly from a file.

If your broker from listing A.6 is still dead, restart it now. If you wish to run this example, you'll need to create the click-streams topic with replication level 3. Make sure you do this before you run the code in listing A.8; otherwise, you'll end up automatically creating a topic with too few partitions:

bin/kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 3 \
    --partitions 3 \
    --topic click-streams
Recall that we wish to ensure that all click streams from the same user end up at the same partition. Working through listing A.8, you can see that each GUID has the modulo operator applied to it (modulo 3), and the resulting number is used to determine the partition to publish to. Thus, all entries for the same user result in the same output (modulo 3) and will end up at the same partition! This partition isn't exclusive for that user; other users' data will also end up there, but this isn't a requirement for the system, because the number of partitions is much less than the number of users.

Consider how figure A.6's consumers receive the data for a given user. In group-1, consumer A is the only member and thus will receive all the data for the topic click-streams. It does this by periodic communication with the leaders of all partitions of the topic. Because this consumer acquires data from all partitions, for any given user, all data will end up with a single consumer.
Perhaps more interesting is group-2, containing consumers B, C, and D. Recall that the consumer group guarantee is such that a given message is to be received by only a single consumer within a consumer group and that this is achieved through the assignment of a partition to a consumer. In this case, the leader of each partition communicates with a single consumer. Broker 1 (leading partition 1) is assigned to consumer B, broker 2 to consumer C, and broker 3 to consumer D. Note that this is the maximum number of consumers possible in this example because the number of consumers in a group can't exceed the number of partitions in the topic it's consuming from.

Because of our custom partitioning strategy, it's now guaranteed that a single consumer consumes the messages from a given user. It would be possible to do things like click de-duplication (to remove erroneous clicks and work out how much the DSP should bill its client) at the consumer, because all relevant data is processed by that
consumer and also because the ordering guarantees provided by Kafka ensure that events that are published in order at a partition will be retrieved in that same order.

There's also an intermediate example, which we haven't illustrated here. In the case where we have two consumers and three partitions, one of the consumers will be responsible for two of the partitions. This assignment will be random, and it does impact the load balance on the consumers; but it again ensures that a single consumer obtains all the data for a given user.
Evaluating Kafka: data collection at scale

Thus far, we've presented Kafka along with the kafka-python client, which can be used to communicate with your cluster via Python. We present this as an alternative to a bespoke log-shipping service, which, although simple to understand, has been demonstrated to be non-optimal for our purposes…
copies. Again, although not impossible, this adds a significant amount of complexity to a file-management service.

In our naïve approach, latency isn't a tunable parameter. When an event occurs, we must wait until it has been logged, that file has been closed, and the data eventually reaches its destination before we can consider that event to be processed. In contrast, if we look at Kafka, both read…
Kafka design patterns
Kafka plus Storm

Kafka alone can't perform any transformations to the data it ingests. If we wish to perform in-stream transformations, we must couple Kafka with an additional technology, such as Storm. Storm is a distributed, real-time system for processing stream data. This allows us to process data in arbitrary and complex ways through the use of several simple literals, two of which are the spout and the bolt. The spout acts as a source of data, whereas the bolt is a processor of data that can consume a number of streams and do some processing, possibly emitting a new stream. These literals, when combined with several ways to group and relay data between stages (shuffle randomly to the next stage, partition by a particular field, send to all, and so on), provide a powerful way to process data in flight. Such combinations are known as topologies.
For example, using the combination of Kafka and Storm, it would be possible to remove duplicate clicks from our click-streams topic, as mentioned earlier. This would be achieved using a Kafka spout, such as storm-kafka, which extracts data from Kafka and acts as a data source for a topology. Figure A.7 shows an example.
In this configuration, we're using the power of Kafka plus Storm to deduplicate clicks in our click stream that occur within a short time of each other.

Figure A.7 In this example, a consumer group of spouts is consuming from the Kafka cluster that has been partitioned by GUID. This ensures that each consumer spout receives all the data for a given user. Through Storm's field grouping, we can also partition data on GUID as we send this data to one of several deduplication bolts that operate on the incoming clicks. Field grouping again on GUID, we can send these to one of several producers that will write these back into a new, deduplicated topic, again partitioned by GUID.

The consumer
group illustrated consists of Storm consumer spouts, which pull data from Kafka in the same way as any other consumer but forward information in a way that's configurable to Storm. In this case, we use the Storm field grouping to send the data on to three bolts that will perform the deduplication. Much in the same way that Kafka works, this grouping ensures that all data for a given user will end up at a single bolt. This is necessary to enable us to perform the deduplication. The output from each of the bolts is a stream minus the duplicate data, and in this case we choose to use the field grouping again to send the data to the producers. This keeps per-user event ordering within a given producer. The producers then send messages back into the cluster, partitioning by GUID into a new topic, which will contain the deduplicated data. Because the data is processed in the order in which it arrives, per-user deduplicated data should arrive at its particular partition in more or less the same order as the duplicate data.
You'll notice that we say more or less. This doesn't sound very strict! Storm allows for several levels of guarantees with respect to ordering. In its simplest mode, messages may be dropped, or even replayed if the message times out, creating multiple entries in the output. For exactly-once semantics, Storm Trident must be used, which is an abstraction to provide enhanced guarantees on topology execution. Using this abstraction, we could guarantee that our new deduplicated topic would be in exactly the same order as the original, minus the duplicate clicks.

Note the scalability here. Each of the four…
we need to present the data in Kafka in a queryable manner, and to illustrate this we've chosen Hadoop (with some other enabling technologies).

Apache Hadoop (https://hadoop.apache.org) is a software framework that allows large datasets to be processed in parallel across many machines. It uses a simple processing paradigm known as MapReduce to perform computation in a distributed fashion much more quickly than could be achieved on a single machine. Hadoop is an extremely powerful framework, and we urge you to refer to the associated literature to…

At the time of writing, Camus is being phased out and replaced by a next-generation project called Gobblin (https://github.com/linkedin/gobblin). This extends Camus and unifies it with several other LinkedIn data-ingestion projects.
Figure A.8 Overview of Kafka plus Hadoop integration. Camus is used to generate MapReduce jobs, which extract data from Kafka in parallel, placing the data in HDFS. Camus jobs must be executed periodically; hence the scheduler. High-level languages such as Hive and Pig can then be used to query the imported data.
With data residing in HDFS, it will now be possible to run MapReduce jobs or register the data with a higher-level query language, such as Apache Hive (https://hive.apache.org) or Pig (https://pig.apache.org). Such languages provide a familiar programming paradigm for analysts running ad hoc queries over the imported data.
Algorithms of the Intelligent Web
DMITRY BABENKO
FOREWORD

Valuable insights are buried in the tracks web users leave as they navigate pages and applications. You can uncover them by using intelligent algorithms like the ones that have earned Facebook, Google, and Twitter a place among the giants of web data pattern extraction. Algorithms of the Intelligent Web teaches you how to create machine learning applications that crunch and wrangle data collected from users, web applications, and website logs. In this totally revised edition, you'll look at intelligent algorithms that extract real value from data. Key machine learning concepts are explained with code examples in Python's scikit-learn. You'll learn algorithms to capture, store, and structure data streams coming from the web. You'll explore recommendation engines and dive into classification via statistical algorithms, neural…

- Introduction to machine learning
- Extracting structure from data

…is a machine learning expert and data… field of online advertising… machine learning techniques for industrial solutions.

WEB DEVELOPMENT/PYTHON

To download their free eBook in PDF, ePub, and Kindle formats, owners of this book should visit www.manning.com/books/algorithms-of-the-intelligent-web-second-edition.

"…sample code in Python… brings fresh life to an…" (Marius Butuc, Shopify)
"Covers the most essential areas of machine learning application in the real world. A great hands-on approach." (Radha Ranjan Madhav, Amazon)
(Dike E. Kalu, Fara Frica)