
For online information and ordering of this and other Manning books, please visit

www.manning.com. The publisher offers discounts on this book when ordered in quantity.

For more information, please contact

Special Sales Department

Manning Publications Co.

20 Baldwin Road

PO Box 761

Shelter Island, NY 11964

Email: [email protected]

© 2016 by Manning Publications Co. All rights reserved.

This book is dedicated to Elly,

contents

foreword
preface
acknowledgments
about this book

1  Building applications for the intelligent web
   1.1  An intelligent algorithm in action: Google Now
   1.2  The intelligent algorithm lifecycle
   1.3  Further examples of intelligent algorithms
   1.4  Things that intelligent applications are not
        Intelligent algorithms are not all-purpose thinking machines · Intelligent algorithms are not a drop-in replacement for humans · Intelligent algorithms are not discovered by accident
   1.5  Classes of intelligent algorithm
        Artificial intelligence · Machine learning · Predictive analytics
   1.6  Evaluating the performance of intelligent algorithms
        Evaluating intelligence · Evaluating predictions
   1.7  Important notes about intelligent algorithms
        Your data is not reliable · Inference does not happen instantaneously · Size matters! · Different algorithms have different scaling characteristics · Everything is not a nail! · Data isn't everything · … variable · Generalization is the goal · Human intuition is problematic · Think about engineering new features · Learn many different models · Correlation is not the same as causation
   1.8  Summary

2  Extracting structure from data: clustering and transforming your data
   2.1  Data, structure, bias, and noise
   2.2  The curse of dimensionality
   2.3  K-means
        K-means in action
   2.4  The Gaussian mixture model
        What is the Gaussian distribution? · Expectation maximization and the Gaussian distribution · The Gaussian mixture model · An example of learning using a Gaussian mixture model

4  Classification: placing things where they belong
   4.1  The need for classification
   4.2  An overview of classifiers
        Structural classification algorithms · Statistical classification algorithms · The lifecycle of a classifier

   6.4  Multilayer perceptrons
        Training using backpropagation · Activation functions · … backpropagation · Backpropagation theory · MLNN in scikit-learn · A learned MLP
   6.5  Going deeper: from mult…

foreword

The World Wide Web is the underlying infrastructure … share my enjoyment when you read it. More important, I hope that after you finish reading, you find yourself equipped with the skill and knowledge to make the web more intelligent.

PROFESSOR, DIRECTOR, DATA SCIENCE INSTITUTE, IMPERIAL COLLEGE LONDON

preface

We consider ourselves lucky to be working in one of the most exciting fields of technology that our time has to offer. In a few short years, we've moved from a nascent …

… technique over a mathematics-heavy approach. This is a tricky line to tread, and we hope that we've done it well.

As you'll see, the book is arranged into eight core chapters, each covering an important area of the intelligent web from an algorithmic standpoint. The book ends with an appendix that looks at the intelligent web from a data-processing vantage. We've included it so that, as a practitioner in the space, you can appreciate how crucial (and difficult) it is to move high-velocity data around a system effectively.

acknowledgments

Thanks to everyone at Manning who made this book possible: publisher Marjan Bace …

First and foremost, I want to thank my fiancée, Elly. Darling, your love, patience, and support are much-appreciated constants in my life, and I'm certain that without them, this book would never have been written. I love you.

Second, I'd like to thank my parents and family, who have always fostered my curiosity and supported me when I've foundered. I hope you enjoy this book and realize how grateful I am for your nurture and care.

Third, I'd like to mention my friends and colleagues, who are too numerous to name. I've been lucky enough to work and play alongside some extraordinary people, and this has made every day a pleasure. Thank you all.

I'd also like to thank both of my editors, Jeff Bleiel and Jennifer Stout. You've been …

I wholeheartedly thank my cherished wife, Aurora, and our three sons: Nikos, Lukas, and Albert, the greatest pride and joy of my life. I'll always be grateful for their love, patience, and understanding. The incessant curiosity of my children has been a continuous inspiration for my studies on learning. A huge acknowledgment is due to my parents-in-law, Cuchi and Jose; my sisters, Maria and Katerina; and my best friends, Michael and Antonio, for their continuous encouragement and unconditional support.

I'd be remiss if I didn't acknowledge the manifold support of Drs. Amilcar Avendaño and Maria Balerdi, who taught me a lot about cardiology and funded my early work on learning. My thanks also are due to Professor Leon Cooper and many other amazing people at Brown University, whose zeal for studying the way that our brain works trickled down to folks like me and instigated my work on intelligent applications.

To my past and present colleagues, Ajay Bhadari, Kavita Kantkar, Alexander Petrov, Kishore Kirdat, and many others, who encouraged and supported all the intelligence-related initiatives at work: there are only a few lines that I can write here, but my gratitude is much larger than that.

HARALAMBOS

First and foremost, I want to thank my beloved wife Elena. I'd also like to thank all of my past and present colleagues, who influenced my professional life and served as an inspiration: Konstandin Bobovich, Paul A. Dennis, Keith Lawless, and Kevin Bedell.

DMITRY BABENKO

about this book

Algorithms of the Intelligent Web was written to provide you with a roadmap to the design and creation of intelligent algorithms. It draws on many areas of computer science, including machine learning and artificial intelligence, but has been written with the practitioner in mind. It's a cookbook of sorts and provides the relative newcomer in this field with several real-world, worked examples that can be modified for your own purpose.

This book is firmly aimed at those who are new to writing intelligent algorithms but who have a firm grasp of programming and basic math and statistics. Wherever possible, we've tried to write the book so the mathematical rigor can be glossed over, leaving you with an overall impression of the applicability of an approach. Of course, if you're more mathematically inclined, we encourage you to follow the detail. Ideally, readers of this book should have had at least some exposure to a programming language and have attended a first-year undergraduate mathematics course.

This book is organized into eight chapters and one appendix:

Chapter 1 provides an overview of intelligent algorithms and outlines some key characteristics. It also provides you with guiding principles for the rest of the book.

Chapter 2 discusses the concept of structure within your data. In particular, it introduces the concept of the feature space. In this chapter, we introduce expectation maximization and eigenvectors.

Chapter 3 introduces recommender systems. We introduce collaborative filtering …

Math conventions

In the text, you'll find many equations that support the code and concepts introduced. We follow a standard convention for the notation throughout. Matrices are represented by upright, capital, bold characters; upright, lowercase, bold characters represent vectors; and scalar values are expressed using lowercase italic.

About the authors

DOUGLAS McILWRAITH earned his first degree at Cambridge in computer science …

About the cover illustration

The illustration on the cover of Algorithms of the Intelligent Web, Second Edition, is taken from a French book of dress customs, Encyclopédie des Voyages by J. G. St. Sauveur, published in 1796. Travel for pleasure was a relatively new phenomenon at the time, and illustrated guides such as this one were popular, introducing both the tourist and the armchair traveler to the inhabitants of other far-off regions of the world, as well as to the more familiar regional costumes of France and Europe.

The diversity of the drawings in the Encyclopédie des Voyages speaks vividly of the uniqueness and individuality of the world's countries and peoples just 200 years ago. This was a time when the dress codes of two regions separated by a few dozen miles identified people uniquely as belonging to one or the other, and when members of a social class or a trade or a tribe could be easily distinguished by what they were wearing. This was also a time when people were fascinated by foreign lands and faraway places, even though they couldn't travel to these exotic destinations themselves.

Dress codes have changed since then, and the diversity by region, so rich at the time, has faded away. It's now often hard to tell the inhabitants of one continent from another. Perhaps, trying to view it optimistically, we've traded a world of cultural and visual diversity for a more varied personal life, or a more varied and interesting intellectual and technical life.

We at Manning celebrate the inventiveness, the initiative, and the fun of the computer business with book covers based on native and tribal costumes from two centuries ago, brought back to life by pictures like those from this travel guide.

1  Building applications for the intelligent web

This chapter covers
- Recognizing intelligence on the web
- Types of intelligent algorithms

The intelligent web means different things to different people. To some it represents the evolution of the web into a more responsive and useful entity that can learn from and react to its users. To others it represents the inclusion of the web into many more aspects of our lives. To me, far from being the first iteration of Skynet, in which computers take over in a dystopian future, the intelligent web is about designing and implementing more naturally responsive applications that make our online experiences better in some quantifiable way. There's a good chance that every reader has encountered machine intelligence on many separate occasions, and this chapter will highlight some examples so that you'll be better equipped to recognize these in the future. This will, in turn, help you understand what's really happening under the hood when you interact with intelligent applications again.

Now that you know this book isn't about writing entities that will try to take over the world, we should perhaps discuss some other things that you won't find within these pages! First, this is very much a back-end book. In these pages, you won't learn about beautiful interactive visualizations or platforms. For this, we refer you to excellent publications by Scott Murray, David McCandless, and Edward Tufte. Suffice to say that we don't have space here to do this topic justice along with what we're about to cover. Also, this book won't teach you statistics; but to gain the most from the book we'll assume you have a 101 level of knowledge or higher, in that you should have at least taken a course in statistics at some point in the past. This is also not a book about data science.

Scott Murray, Interactive Data Visualization for the Web (O'Reilly, 2013).
David McCandless, Information Is Beautiful (HarperCollins, 2010).
Edward Tufte, The Visual Display of Quantitative Information (Graphics Press USA, 2001).
Joel Grus, Data Science from Scratch: First Principles with Python (O'Reilly, 2015).

1.1  An intelligent algorithm in action: Google Now

Where possible, we reference known examples of these in the wild, so that you can test your knowledge against the behavior of such systems, and no doubt impress your friends!

Figure 1.1  Overview of an intelligent algorithm. Such algorithms display intelligence because the decisions they make change depending on the data they've received.


… granularity. In the context of figure 1.1, this is one aspect of the data that's being used to change the behavior of the algorithm. From here it's a short step to determine home and work locations. This is performed through the use of priors: that is, knowledge that has somehow been distilled into the algorithm before it started learning from the data. In this case, the prior knowledge could take the form of the following rules:

- The location most usually occupied overnight is home.
- The location most usually occupied during the day is work.
- People, in general, travel to their workplace and then home again almost every day.

Although this is an imperfect example, it does illustrate our point well: that a concept of work, home, and commuting exists wi…

… algorithm. Each solution that you create should be grounded in and built on work in the field, much of which you'll find in this book. We've introduced several key terms here in italics, and we'll refer to these in the coming chapters as we tackle individual algorithms in more depth.
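To make the rules above concrete, here is a minimal sketch of how such priors could be applied to a log of timestamped locations. It is an illustration only: the location_log structure, the hour boundaries, and the example data are assumptions, not anything taken from Google Now.

    from collections import Counter

    def infer_home_and_work(location_log):
        """location_log: list of (hour_of_day, place) tuples, e.g. (14, 'Soho').
        Applies the priors: the place most often occupied overnight is 'home',
        and the place most often occupied during the working day is 'work'."""
        night = Counter(place for hour, place in location_log if hour < 7 or hour >= 22)
        day = Counter(place for hour, place in location_log if 9 <= hour < 18)
        home = night.most_common(1)[0][0] if night else None
        work = day.most_common(1)[0][0] if day else None
        return home, work

    log = [(2, 'Flat A'), (23, 'Flat A'), (10, 'Office B'), (14, 'Office B'), (15, 'Office B')]
    print(infer_home_and_work(log))   # ('Flat A', 'Office B')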

1.2  The intelligent-algorithm lifecycle

In the previous section, we introduced you to the concept of an intelligent algorithm comprising a black box, taking data and making predictions from events. We also more …

Ben Fry, PhD thesis, Computational Information Design (MIT, 2004).

Figure 1.2  Graphical overview of one aspect of the Google Now project. In order for Google Now to make predictions about your future locations, it uses a model about past locations, along with your current position. Priors are a way to distill initially known information into the system.

Figure 1.3  The intelligent-algorithm lifecycle


When designing intelligent algorithms, you first must acquire data (the focus of the appendix) and then parse and clean it, because it's often not in the format you require. You must then understand that data, which you can achieve through data exploration and visualization. Subsequently, you can represent that data in more appropriate formats (the focus of chapter 2). At this point, you're ready to train a model and evaluate the predictive power of your solution. Chapters 3 through 7 cover various models that you might use. At the output of any stage, …
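As a rough sketch of these lifecycle stages in code (using scikit-learn and the Iris data purely for illustration; none of this is specific to the systems discussed above):

    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    iris = datasets.load_iris()                        # acquire data
    X, y = iris.data, iris.target                      # parse/clean (already tidy here)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_train)             # represent the data appropriately
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)   # train a model
    preds = model.predict(scaler.transform(X_test))    # evaluate predictive power
    print(accuracy_score(y_test, preds))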

1.4  Things that intelligent applications are not

… that you should withdraw your savings and start trading based on this company's …

… algorithms aren't well suited to such domains without process simplification and formalization.

Let's draw a simple but effective parallel with the automation of the motor-vehicle assembly line. In contrast to the early days of assembly-line automation pioneered by …

Figure 1.4  Taxonomy of intelligent algorithms

1.5  Classes of intelligent algorithm

1.5.1  Artificial intelligence

Artificial intelligence, widely known by its acronym AI, began as a computational field around 1950. Initially, the goals of AI were ambitious and aimed at developing machines that could think like humans. Over time, the goals became more practical and concrete as the full extent of the work required to emulate intelligence was uncovered. To date, there are many definitions of AI. For example, Stuart Russell and Peter Norvig describe AI as the study of agents that receive percepts from the environment and perform actions, whereas John McCarthy states, "It is the science and engineering of making intelligent machines, especially intelligent computer programs." Importantly, he also states that intelligence is the computational part of the ability to achieve goals in the world.

In most discussions, AI is about the study of agents (software and machines) that have a series of options to choose from at a given point and must achieve a particular goal. Studies have focused on particular problem domains (Go, chess, and the game show Jeopardy!), and in such constrained environments, results are often excellent. For example, IBM's Deep Blue beat Garry Kasparov at chess in 1997, and in 2011 IBM's Watson received the first-place prize of $1 million on the U.S. television show Jeopardy! Unfortunately, few algorithms have come close to faring well under Alan Turing's imitation game (otherwise known as the Turing test), which is considered by most the gold standard for intelligence. Under this game (without any visual or audio cues, that is, typewritten answers only), an interrogator must be fooled by a machine attempting to act as a human. This situation is considerably harder, because the machine must possess a wide range of knowledge on many topics; the interrogator isn't restricted in the questions they may ask.

1.5.2  Machine learning

Machine learning (ML) refers to the capability of a software system to generalize based on past experience. Importantly, ML is about using these generalizations to provide answers to questions relating to previously collected data as well as data that hasn't been encountered before. Some approaches to learning create explainable models: a layperson can follow the reasoning behind the generalization. Examples of explainable models are decision trees and, more generally, any rule-based learning method. Other algorithms, though, aren't as transparent to humans; neural networks and support vector machines (SVMs) fall in this category.

Herbert Simon, The Shape of Automation for Men and Management (Harper & Row, 1965).
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall, 1994).
John McCarthy, "What Is Artificial Intelligence?" (Stanford University, 2007), http://www-formal.stanford.edu/jmc/whatisai.
Bruno Bouzy and Tristan Cazenave, "Computer Go: An AI Oriented Survey," Artificial Intelligence (Elsevier) 132, no. 1 (2001): 39–103.
Murray Campbell, A. J. Hoane, and Feng-hsiung Hsu, "Deep Blue," Artificial Intelligence (Elsevier) 134, no. 1 (2002): 57–83.

You might say that the scope of machine learning is very different than that of AI. AI looks at the world through the lens of agents striving to achieve goals (akin to how a human agent may operate at a high level in their environment), whereas ML focuses on learning and generalization (more akin to how a human operates internally). For example, ML deals with such problems as classification (how to recognize data labels given the data), clustering (how to group …

Figure 1.5  The machine-learning data flow. Data is used to train a model, which can then be applied to previously unseen data. In this case, the diagram illustrates classification, but the diagrams for clustering and regression are conceptually similar.
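A minimal sketch of this train-then-apply flow, using a decision tree (one of the explainable models mentioned earlier) on invented toy data:

    from sklearn.tree import DecisionTreeClassifier

    # Train: label 1 if a user clicked an ad, 0 otherwise (toy data).
    X_train = [[23, 0], [31, 1], [45, 1], [19, 0], [52, 1]]   # [age, owns_account]
    y_train = [0, 1, 1, 0, 1]
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Apply: predict for previously unseen data.
    print(model.predict([[27, 1], [60, 0]]))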

You may ask, "How is this different from ML, which is also about prediction?" This is a good question! In general, predictive analytics techniques focus on understanding and generalizing the underlying structure and relationships of …

… card, mobile phone, or mortgage, retailers are exposing themselves to an element of risk (and also an element of reward!). In order to balance this, retailers want to know that they're mostly providing credit to credit-worthy individuals, while rejecting those who are more likely to default on their payments. In practice, this decision is referred to specialist credit agencies that will …


1.6  Evaluating the performance of intelligent algorithms

Thus far, we've talked about the high-level classes of intelligent algorithms and provided several motivating examples. But how does the intelligent-algorithm practitioner evaluate their algorithm? This is of great importance for several reasons. First, without an objective evaluation, it's not possible to track performance, and so it's impossible to know if modifications have improved operation. Second, if you can't measure performance, then justification becomes difficult. From a business perspective, managers and technologists will always try to measure value against cost, and being able to solidly evaluate your solution will help keep it in production!

Let's now return to each class of intelligent algorithm and discuss strategies for performance measurement. Although we'll touch on how to evaluate intelligence, this section is largely about evaluating predictions (that is, for machine learning and predictive analytics). Evaluating (and defining) intelligence is a big subject and could fill a book like this on its own! Rather than delve into this topic, we refer you to Linda Gottfredson and Alan Turing.

1.6.1  Evaluating intelligence

In the preceding text, we mentioned Turing's imitation game, but there may be a call to evaluate systems that are less grand in their approach to intelligence: for example, intelligent systems that play chess or comp…

Linda S. Gottfredson, "Mainstream Science on Intelligence: An Editorial with 52 Signatories, History, and Bibliography," …

Table 1.1  …

    …     …    Ground truth    Prediction
    104   …    True            False
    207   …    True            True
    56    …    False           False
    12    …    False           True

Table 1.2  Performance metrics used in the evaluation of intelligent algorithms

    Metric                      Calculation
    True positive rate (TPR)    Predicted true positives / ground-truth true positives
    True negative rate (TNR)    Predicted true negatives / ground-truth true negatives
    False positive rate (FPR)   1 − true negative rate
    False negative rate (FNR)   1 − true positive rate
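Using the four example rows above, the rates in table 1.2 can be computed as in this short sketch (my own illustration, not a listing from the book):

    ground_truth = [True, True, False, False]
    prediction   = [False, True, False, True]

    tp = sum(gt and p for gt, p in zip(ground_truth, prediction))
    tn = sum((not gt) and (not p) for gt, p in zip(ground_truth, prediction))
    tpr = tp / sum(ground_truth)                    # predicted true positives / ground-truth positives
    tnr = tn / sum(not gt for gt in ground_truth)   # predicted true negatives / ground-truth negatives
    fpr = 1 - tnr
    fnr = 1 - tpr
    print(tpr, tnr, fpr, fnr)                       # 0.5 0.5 0.5 0.5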


So far, so sensible! Be aware, however, that this may not always be so easy. Take, for example, the issue of credit scoring. When you provide a score, you may not always be able to follow up on the behavior of that user in the future. Also, in these cases you're usually not attempting to provide the answer but rather some indicator that's correlated with a target variable (such as creditworthiness). Although the ROC curve can be used in these cases, there can be a few more hoops to jump through to get there.

1.7  Important notes about intelligent algorithms

Thus far we've covered a lot of introductory material. By now, you should have a fairly good, albeit high-level, understanding of intelligent algorithms and how you're going to use them. You're probably sufficiently motivated and anxious to dive into the details. We won't disappoint you. Every following chapter is loaded with new and valuable code. But before we embark on our journey into the exciting and (for the more cynical among us) financially rewarding world of intelligent applications, we'll present some useful things to know, many of which have been adapted from Pedro Domingos's excellent and easy-to-read paper on the subject. These should help guide you as you progress through this book and, later on, the field of intelligent algorithms.

1.7.1  Your data is not reliable

Your data may be unreliable for many reasons. That's why you should always examine whether the data you'll work with can be trusted before you start considering specific intelligent-algorithmic solutions to your problem. Even intelligent people who use very bad data will typically arrive at erroneous conclusions. The following is an indicative …

Your data may not be normalized. Let's say that you're looking at the weight of …

Pedro Domingos, "A Few Useful Things to Know About Machine Learning," Communications of the ACM 55, no. 10 (2012): 78–87.
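As a small illustration of the kind of normalization issue meant here (a sketch only; the weights and the pounds-to-kilograms assumption are invented for the example):

    import numpy as np

    weights = np.array([70.0, 82.5, 154.0, 176.0])        # suppose the last two were entered in pounds
    weights[2:] *= 0.4536                                  # bring everything to kilograms first
    z_scores = (weights - weights.mean()) / weights.std()  # then standardize to zero mean, unit variance
    print(z_scores)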

… versions, but you should investigate this from …

Judea Pearl, Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann Publishers, 1988).

1.7.9  Human intuition is problematic

As the feature space grows, there's a combinatorial explosion in the number of values the input may take. This makes it extremely unlikely that for any moderate-sized fea…

Domingos, "A Few Useful Things to Know About Machine Learning."

1.8  Summary

- We provided the 50,000-foot view of intelligent algorithms, providing many specific examples based on real-world problems.
- An intelligent algorithm is any algorithm that can modify its behavior based on incoming data.
- We provided some intelligent-algorithm design anti-patterns that will hopefully serve as a warning flag for practitioners in the space.
- You've seen that intelligent algorithms can be broadly divided into three categories: artificial intelligence, machine learning, and predictive analytics. We'll spend the most time on the latter two in this book, so if you glanced over these sections, it would be worth rereading them to ensure good foundational knowledge.

2  Extracting structure from data: clustering and transforming your data

This chapter covers
- Features and the feature space
- Expectation maximization, a way of training algorithms
- Transforming your data axes to …

… ages correlate with higher salaries? Is a larger percentage of the wealth present in a smaller percentage of the population? If found, these generalizations can be extracted directly either to provide evidence of a pattern …

Figure 2.1  Visualizing structure in data. x- and y-axis values are arbitrary and unimportant. This data shows three different classes of data, all colored differently. You see that the x and y values appear to cluster around certain areas of the x, y plot dependent on their class. You also see that higher values of x correlate to higher values of y regardless of the class. Each of these properties relates to the structure of the data, and we'll cover each in turn in the remainder of this chapter.


… expectation maximization (EM). This key concept will be revisited again in this chapter when we discuss our second clustering method using the Gaussian mixture model (GMM). Using a GMM to cluster can be seen as an extension of the k-means algorithm. It's possible to achieve exactly the same results with a GMM and k-means if you provide certain limitations to your …

2.1  Data, structure, bias, and noise

    >>> import numpy as np
    >>> from sklearn import datasets                      # imports scikit-learn datasets
    >>> iris = datasets.load_iris()                       # loads in the Iris dataset
    >>> data = np.column_stack((iris.data, iris.target))  # creates an array of both the features and the labels
    >>> data[0]                                           # contents of the first row of the array displayed


… that has been recorded about the Iris dataset. You can observe that, for each row of iris.data, the features consist of the sepal length (cm), the sepal width (cm), the petal length (cm), and the petal width (cm).

… individual species, there are a few other issues to watch out for. For example, what if the data was collected early in the flower-growing season? What if the flowers were immature and hadn't grown to their full size? This would mean the flowers are systematically different from the overall population; this is known as bias. Clusters gener…

2.2  The curse of dimensionality

… many measurements, there are two fundamental problems when collecting data with high dimensionality that are particularly important when finding structure. The first problem is that the large number of dimensions increases the amount of space available for spreading data points. That is, if you keep the number of data points fixed and increase the number of attributes you want to use to describe them, the density of the points in your space decreases exponentially! So, you can wander around in your data space for a long time without being able to identify a formation that's preferable to another one.

The second fundamental problem has a frightening name. It's called the curse of dimensionality. In simple terms, it means if you have any …

Figure 2.2  The curse of dimensionality: every point tends to have the same distance from any other point.
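The effect in figure 2.2 is easy to reproduce numerically. This sketch (mine, not the book's) draws random points and shows that the ratio between the smallest and largest pairwise distances creeps toward 1 as the number of dimensions grows:

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.RandomState(0)
    for d in (1, 2, 10, 100, 1000):
        points = rng.rand(8, d)                 # eight random points in d dimensions
        dists = pdist(points)                   # all pairwise Euclidean distances
        print(d, dists.min() / dists.max())     # ratio approaches 1.0 as d grows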


If you look at figure 2.2 from left to right, you'll see that the dimensionality increases by 1 for each drawing. We start with eight points in one dimension (x-axis) distributed …

Stuart P. Lloyd, "Least Squares Quantization in PCM," IEEE Transactions on Information Theory 28 (1982): 129–137.

2.3  K-means

… mentioning that this simple one-dimensional example demonstrates the concept without loss of generalization. In reality, nontrivial patterns and structure would be found from higher-dimensional data, subject, of course, to the curse of dimensionality. Lloyd's algorithm works by iteratively trying to determine the means of the clusters …

Figure 2.3  A number line containing two possible clusters in a single dimension. An entry indicates the existence of that number in the dataset. Data points represented by the same shape indicate one possible intuitive clustering of the data.


… represent this, new means are calculated based on the average values in that cluster: that is, the new mean for cluster k1 is given by the mean of the green data points, and the new mean for cluster k2 is given by the mean of the red data points. Thus, the new k1 and k2 are 2.5 and 11.2, respectively. You can see the updated means in figure 2.6. Essentially, both cluster means have been dragged up by the data points assigned to them.

Figure 2.4  Initial assignment of the means of the k clusters. The mean of cluster k1 is denoted by a triangle, and a circle denotes the mean of cluster k2.

Figure 2.5  Initial assignment of data points to clusters. Those data points assigned to cluster k1 are shaded green (diamonds), and those assigned to cluster k2 are shaded red (squares).

We now proceed with the second iteration. As before, for each data point, we assign it to the cluster with the closest centroid. All of our points below approximately 6.8 are now assigned to cluster k1, and all of our points above this number are

assigned to cluster k2. If we continue as before and calculate the new means of the clusters, this gives centroids of 3.5 and 12.8. At this point, no number of additional iterations will change the assignment of the data points to their clusters, and thus the cluster means won't change. Another way of saying this is that the algorithm has converged.

The same nomenclature is used when learning the parameters for any algorithm. When a cost function is minimized, practitioners say that the training algorithm has converged. In this book, we'll mostly rely on scikit-learn libraries to reach convergence for us during training.

Figure 2.6  Updated means given the new cluster assignment. Cluster centroids have been updated …

Figure 2.7  Final assignment of the data points to the clusters. Final cluster centroids are given by the triangle and circle, respectively. Data points below 6.8 have been assigned to cluster k1, and data points above this have been assigned to cluster k2. Additional iterations of the k-means algorithm won't change this clustering.


Note that this final assignment is exactly the means we would have chosen if we had manually performed clustering (that is, we …

Leon Bottou and Yoshua Bengio, "Convergence Properties of the K-Means Algorithms," Advances in Neural Information Processing Systems 7 (1994).

Listing 2.3  K-means pseudo-code: expectation maximization

    initialize the k cluster means
    repeat until the cluster assignments no longer change:
        assign each data point to the cluster with the closest mean       # Expectation step
        recalculate each cluster mean from the points assigned to it      # Maximization step

2.3.1  K-means in action

Listing 2.4  The k-means algorithm in action
(Contains only the data, not the labels. Passes in the value of k, which in this case is equal to 3.)

Listing 2.5  Visualizing the output of k-means
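The code for these listings isn't reproduced above. A minimal sketch of what listing 2.4 amounts to, assuming scikit-learn's KMeans applied to the Iris features with k equal to 3:

    from sklearn import datasets
    from sklearn.cluster import KMeans

    iris = datasets.load_iris()
    km = KMeans(n_clusters=3)         # passes in the value of k, in this case 3
    km.fit(iris.data)                 # uses only the data, not the labels
    print(km.cluster_centers_)
    print(km.labels_[:10])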


A quick glance over the data seems to suggest that plotting petal length against …

2.4  The Gaussian mixture model

2.4.1  What is the Gaussian distribution?

You may have heard of the Gaussian distribu…

Figure 2.9  Histogram of a normal distribution, in this case the height of 334 fictitious people. The modal (most frequently occurring) bin is centered at 180 cm.

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

… deviations from the mean, we might guess … results of the cumulative density function (CDF) for the values that define the range. This works because the CDF for a given value of …

Figure 2.10  Fictional data on the height of 334 people. The sample probability of a given user occurring in a given height range is shown in red (the bar on the left in each pair). The probability from …


A brief look at figure 2.10 shows that our initial guess of 180 for the mean and 28 for the standard deviation isn't bad. The standard deviation looks reasonable but could potentially be decreased slightly. Of course, we could continue to change the …
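For example, with the guessed parameters (mean 180 cm, standard deviation 28 cm), the probability of a height falling in a given range can be read off the CDF. A quick sketch using SciPy (my own, not the book's listing):

    from scipy.stats import norm

    mu, sigma = 180.0, 28.0
    p = norm.cdf(190, mu, sigma) - norm.cdf(180, mu, sigma)   # P(180 cm <= height < 190 cm)
    print(round(p, 3))                                        # roughly 0.14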

If we assume that a user can be either male or female, then our Gaussian distribution from earlier is the result of sampling from two separate and overlapping Gaussian distributions. Rather than modeling the space using a single Gaussian, we can now use two (or more!) such distributions:

p(x) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)

This equation is extremely similar to the one earlier in the chapter, save for a few details. First, note that instead of a single Gaussian, the probability is made up from the sum of K separate Gaussian distributions.

Figure 2.11  Probability distributions of height for men and women. Note that these probabilities are conditioned on the fact that gender is known: so, for example, given that we know a particular person is a woman, her probability of having a height …

Extracting structure from data: clustering and transforming your data

The probabilities given on the y-axis in figu

re 2.11 are conditioned on the fact that we

know the gender of the person. In general, for the purposes of this example, we

wouldnt necessarily have that data at hand

(maybe it wasnt recorded at measurement

time). Thus wed need to learn not only the parameters of each distribution but also

the gender split in the sample population (
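A minimal sketch of learning exactly that, assuming scikit-learn's GaussianMixture and some synthetic heights (this is an illustration, not listing 2.6):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    heights = np.concatenate([rng.normal(163, 7, 150),     # one component (labels unknown to the model)
                              rng.normal(177, 7, 184)])    # the other component
    gmm = GaussianMixture(n_components=2).fit(heights.reshape(-1, 1))
    print(gmm.means_.ravel())     # recovered component means
    print(gmm.weights_)           # recovered split between the two groups in the sample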

    v *= 9
    ell = mpl.patches.Ellipse(gmm.means_[n, [x, y]], v[0], v[1],
                              180 + angle, color=color)


Ron Weiss and Gael Varoquaux, "GMM classification," http://mng.bz/uKPu.

Figure 2.12  Output from listing 2.6. Each panel shows …

It's possible to obtain equivalent results from a GMM and k-means if additional constraints are placed on the learned covariance matrices. In particular, if covariance matrices for each cluster are tied together (that is, they all must be the same), and covariances across the diagonal are restricted to being equal, with all other entries set to zero, then you'll obtain spherical clusters of the same size and shape for each cluster. Under such circumstances, each point will always belong to the cluster with the closest mean. Try it and see!

As with k-means, training a Gaussian mixture model with EM can be sensitive to initial starting conditions. If you compare and contrast a GMM to k-means, you'll find a few more initial starting conditions in the former than in the latter. In particular, not only must the initial centroids be specified, but the initial covariance matrices and mixture weights must be specified also. Among other strategies, one approach is to run k-means and use the resultant centroids to …

A word about make_ellipses

The method make_ellipses is derived from plot_gmm_classifier, written by Ron Weiss and Gael Varoquaux for scikit-learn. By extracting the covariance matrix in the two dimensions you're plotting, you can find the first and second directions of maximal variance, along with their respective magnitudes. You then use these to orient and draw ellipses relating to the shape of the learned Gaussians. These directions and magnitudes are known as eigenvectors and eigenvalues, respectively, and will be cov…


2.6  Transforming the data axis

So far in this chapter, we've concentrated on clustering data as it's presented in its original feature space. But what if you were to change this feature space into a more appropriate one, perhaps one with fewer features that's more descriptive of the underlying structure?

K. F. Riley, M. P. Hobson, and S. J. Bence, Mathematical Methods for Physics and Engineering (Cambridge University Press, 2006).

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}

… represent? It represents how much the shear operates in a given direction: that is, the magnitude of that operation. As such, the eigenvalues are a measure of importance of that direction. Thus, you could describe most of the operation of the shear matrix by recording only the eigenvectors with the largest eigenvalues.
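As a quick numerical check of this idea (my own sketch), NumPy can return the eigenvalues and eigenvectors of a small shear matrix directly:

    import numpy as np

    A = np.array([[1.0, 0.5],      # a simple shear matrix
                  [0.0, 1.0]])
    values, vectors = np.linalg.eig(A)
    print(values)                  # the eigenvalues (for this shear, both equal 1)
    print(vectors[:, 0])           # an eigenvector: the x direction, which the shear leaves unchanged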

2.6.2  Principal component analysis

Principal component analysis (PCA) uses the eigenvector/eigenvalue decomposition in a specific way in order to obtain the principal components …

Jonathon Shlens, "A Tutorial on Principal Component Analysis," arXiv:1404.1100, 2014.
Claude Brezinski, "The Life and Work of André Cholesky," Numerical Algorithms 43 (2006): 279–288.


2.6.3  An example of principal component analysis

Now that you understand eigenvectors, eigenvalues, and …

Figure 2.13  The two principal components of a …

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)          # initializes a PCA decomposition with two components
    pca.fit(iris.data)                 # fits the data: solves for the eigenvectors and eigenvalues
    data_r = pca.transform(iris.data)  # transforms the data into the new basis; because we chose two
                                       # components, this has two dimensions
    # for each iris category, plot (in a different color) the coordinates in the new coordinate axes


… components. At this point, no decomposition has been performed (for a start, no data has been passed in); you merely prepare the Python objects for the next stage. The next stage is the … formed into the new data axis. The original data is now expressed as a combination of the two eigenvectors with the largest eigenvalues (because you restrict the number of components to two in …), which you can now plot as per figure 2.14. It should now be fairly clear that the x-axis (y=0) and the y-axis (x=0) are the directions (in four dimensions in Iris) of the highest variance and second-highest variance, respectively. The values plotted correspond to the distance along that line of the original data. Thus, you've transformed a 4-D dataset do…

Figure 2.14  Eigenvector decomposition of the Iris …


Principal component analysis is a powerful tool that can help you to calculate a more efficient representation of the data while minimizing loss to that data's discriminatory power. Although we used this on a … here, it can be, and often is, applied to real-world data as a first pass, as a preparatory step, to the classification and regression problems that we'll spend more time on later in the book. Note, however, that complexity grows faster as the number of features in your dataset grows, so removing redundant and irrelevant features as a step before PCA may pay dividends.

2.7  Summary

…

3  Recommending relevant content

This chapter covers
- Understanding recommendation engines based on users, …
- Finding recommendations about friends, articles, and news stories

In today's world, we're overwhelmed with ch…

… like … or services like …, they'll be wasting their time, and you'll be annoyed. The broadcasting approach of traditional advertising methods (such as billboards, TV ads, and radio ads) suffers from that problem. The goal of broadcasting is to alter your preferences by incessantly repeating the same message. An alternative, more pleasant, …

… characteristics of genres. For a more creative twist, you could help people establish a …

Table 3.1  Ratings for the users show that Frank and Constantine … Frank and Catherine. This data was modified, with permission, from the MovieLens database.(a)

    Movie                             0: Frank
    0: Toy Story                      5
    1: Jumanji                        4
    2: Grumpier Old Men               5
    3: Waiting to Exhale              4
    4: Father of the Bride Part II    5
    5: Heat                           4
    6: Sabrina                        5

a. GroupLens Research, MovieLens dataset, http://grouplens.org/datasets/movielens/

Listing 3.1  Calculating the similarity of users based on their ratings of movies
(The listing reads in the data using …, then prints the similarity of Frank and Constantine under both measures of similarity, and the similarity of Frank with himself under both measures of similarity.)


All distances are greater than or equal to zero. In most cases, we constrain the similarities to be nonnegative like distances. In fact, we constrain the similarities within the interval [0,1].

William James, Principles of Psychology (Holt, 1890).
Classification and Cognition (Oxford University Press, 1994).

The plots of the ratings in figure 3.1 clearly display the somewhat reciprocal nature of distance and similarity. The greater the distance between the two curves, the …

If no common item rating exists, the similarity is 0; otherwise, similarity = 1 / (1 + distance).

This equation states the definition of the naive similarity. … watched three movies, and among these three differed by … (for each movie), their similarity would also be 0.25, according to the naive similarity metric. Intuitively, we expect these two users to be more similar than those who watched a single movie and whose opinions differed by three units (out of five).

The naive similarity squeezes the similarity values for small distances (because we add 1) while leaving large distances (values of the distance much larger than 1.0) unaffected. What if we add another value? The general form of the naive similarity is y = beta / (beta + x), where … tangent from 1.0, so that the final value of similarity ranges between 0.0 and 1.0, with 0.0 implying dissimilarity and 1.0 implying the highest similarity. Voilà! We've arrived at our first definition of similarity of users based on their ratings.
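A small sketch of that first definition (mine, not listing 3.2); Constantine's ratings here are invented for illustration:

    from math import sqrt

    def naive_similarity(ratings_a, ratings_b):
        """ratings_*: dicts mapping movie id -> rating. Returns a value in [0, 1]."""
        common = set(ratings_a) & set(ratings_b)
        if not common:
            return 0.0
        distance = sqrt(sum((ratings_a[m] - ratings_b[m]) ** 2 for m in common))
        return 1.0 / (1.0 + distance)            # y = 1 / (1 + x)

    frank = {0: 5, 1: 4, 2: 5, 3: 4, 4: 5, 5: 4, 6: 5}
    constantine = {0: 5, 1: 5, 2: 5, 5: 4, 6: 5}
    print(naive_similarity(frank, constantine))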

The second similarity definition that we present in listing 3.2, in the case where sim_type=1, improves on the first similarity by taking into account the ratio of the common items versus the number of all possible common items. That's a heuristic that intuitively makes sense. If I've watched 30 movies and you've watched 20, we …

… formula. In other words, the extent to which we watch the same movies should somehow affect the degree of our similarity as movie enthusiasts.

3.2.2  Which is the best similarity formula?

It may be clear by now that you can use many formulas to establish the similarity between two users, or between two items, for that matter. In addition to the two similarities introduced earlier, we could've used a metric formula known as the Jaccard similarity between users, which is defined by the ratio of the intersection over the union of their item sets, or, in the case of item similarity, the ratio of the intersection over the union of the user sets. In other words, the Jaccard similari…

Ellen Spertus, Mehran Sahami, and Orkut Buyukkokten, "Evaluating Similarity Measures: A Large-Scale Study …"

Figure 3.2  Naive similarity curves as functions of the Euclidean distance, y = 1 / (1 + x)

J(A, B) = intersection(A, B) / union(A, B)


We don't advise that you randomly choose your similarity metric, but if you're in a hurry, consider the Euclidean or the Jaccard similarity first. It should give you decent results. You should try to understand the nature of your data and what it means for two users or two items to be similar. If you don't understand the reasons why a particu…

User-based collaborative filtering

In item-based CF, we're most interested in the similarity of items. For example, in the case of our movie recommendation example, we'd typically build a matrix of movies whereby matrix entries express some degree of similarity. Movies can then be recommended that are most similar to those that users have already seen or liked. Conversely, with user-based CF, the similarity of users is considered important. In this scenario, users may be recommended content if they're similar to those who have seen or liked a particular item. In model-based CF, neither item-item similarity nor user-user similarity is calculated explicitly, but a model is introduced to capture the …

Item-based CF vs. user-based CF
Depending on the data, there can be a distinct performance profile in the use of each …

… user similarity. We use the same data as we did to demonstrate different measures of similarity: the data given in table 3.1.

Listing 3.3  User-based CF recommendation

    data = Data()
    format = {'col':0, 'row':1, 'value':2, 'ids': 'int'}
    data.load(dat_file, sep='::', format=format)
    similarity_matrix = SimilarityMatrix()
    recommend(0, 10)    # provides the top recommendations for user 0 (Frank)
    recommend(1, 10)    # provides the top recommendations for user 1 (Constantine)
    recommend(2, 10)    # provides the top recommendations for user 2 (Catherine)

Listing 3.3 provides the high-level code to perform user-based CF. We read in the data using the load …

Listing 3.4  …

    def recommend(user_id, top_n):
        # [(item, value), (item1, value1), ...]
        recommendations = []
        for i in itemdict.keys():
            if int(i) not in items_reviewed(int(user_id), userdict):
                recommendations.append((i, predict_rating(user_id, i)))
        recommendations.sort(key=lambda t: t[1], reverse=True)
        return recommendations[:top_n]

Listing 3.5  Predicting the ratings of a user

HAPTER

Recommending relevant content

score that they input, along wi

th a calculation of how similar this user is to the one for

whom we wish to make a prediction. We then

calculate a weighted rating, which is the

similarity of the two users mu

ltiplied by the rating given by this new user. In order to

calculate a value that considers all the us

ers in the system, we calculate the sum of

weighted_ratings

along with the sum of the similarities. The latter is used to

normalize the values back to the range [0,5

]. Provided that there are some users in
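Listing 3.5 itself isn't reproduced above. Based on this description, a minimal sketch of the prediction step might look like the following; unlike the book's code, the ratings dictionary and similarity function are passed in explicitly:

    def predict_rating(user_id, item_id, userdict, similarity):
        """userdict: {user_id: {item_id: rating}}; similarity: function of two user ids."""
        weighted_ratings, similarity_sum = 0.0, 0.0
        for other_id, ratings in userdict.items():
            if other_id == user_id or item_id not in ratings:
                continue
            sim = similarity(user_id, other_id)          # how similar the two users are
            weighted_ratings += sim * ratings[item_id]   # similarity-weighted rating
            similarity_sum += sim
        return weighted_ratings / similarity_sum if similarity_sum else 0.0   # back to the [0, 5] range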

Listing 3.6  …
(The listing calculates only values of v that are larger than the current value of u, exploiting the upper triangular form of the similarity matrix.)

The SimilarityMatrix class encapsulates the storage, calculation, and access to the matrix that defines how similar each user is with respect to another. When it's first instantiated, it calls the build …

Listing 3.7  …


… those ratings. As you can see, it's not enough for two users to have both rated a movie; they also must broadly agree on its quality. RatingCountMatrix encapsulates this through the use of a matrix that captures the covariance of their ratings. For users who always agree on the ratings for the movies they've seen, we'd expect the diagonal of the matrix to be populated with higher numbers: that is, entries where both users gave the same rating. For those who never agree or who have opposing views, we'd expect those entries farthest away from the diagonal to be populated: that is, entries where one user's rating is the minimum of the scale and the other's is the maximum.

The calculation of similarity falls out of the matrix intuitively, as you saw in listing 3.6. We can use the total number of ratings that the users have agreed on, divided by the total number of items that both users have provided a rating for. Note again that our computational performance is likely far below optimal, because we recalculate the RatingCountMatrix …
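In code, the idea reduces to something like this sketch (my own; the book's RatingCountMatrix builds a full matrix of rating pairs, whereas this shortcut counts agreements directly):

    def agreement_similarity(ratings_a, ratings_b):
        """Ratio of exact rating agreements to items both users have rated."""
        common = set(ratings_a) & set(ratings_b)
        if not common:
            return 0.0
        agreed = sum(1 for m in common if ratings_a[m] == ratings_b[m])
        return agreed / len(common)

    print(agreement_similarity({1: 5, 2: 3, 3: 4}, {1: 5, 2: 2, 3: 4}))   # 2 of 3 agree -> 0.666...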

Model-based recommendation using singular value decomposition

"Singular Value Decomposition," Wolfram MathWorld, 2015, http://mng.bz/TxyY.
Jan J. Gerbrands, "On the Relationships between SVD, KLT and PCA," Conference on Pattern Recognition (1980).

A = U D V^T


… that it makes sense to select features that are far away from each other in user space. If sci-fi fans never watch romantic comedies and vice versa, then such a feature would be useful in providing a recommendation. Simultaneously, we know that for any two columns in …, or equivalently any two rows in …, they're also orthogonal. That is, latent features are selected such that the features are as far away as possible from each other …

ACM RecSys Wiki, www.recsyswiki.com/wiki/Python-recsys.
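To see the decomposition itself at work, independent of the recsys library used in the listings below, here is a small NumPy sketch on a tiny invented ratings matrix:

    import numpy as np

    A = np.array([[5.0, 4.0, 0.0],       # a tiny user-by-movie ratings matrix
                  [4.0, 5.0, 1.0],
                  [0.0, 1.0, 5.0]])
    U, d, Vt = np.linalg.svd(A)
    D = np.diag(d)
    print(np.allclose(A, U @ D @ Vt))    # True: A = U D V^T
    print(U.T @ U)                       # close to the identity: the columns of U are orthogonal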

Listing 3.8  SVD for recommendation

    import recsys.algorithm
    from recsys.algorithm.factorize import SVD
    from recsys.datamodel.data import Data

    recsys.algorithm.VERBOSE = True       # sends back progress to the screen

    svd = SVD()
    dat_file = './ml-1m/ratings.dat'
    item_file = './ml-1m/movies.dat'      # movie info
    data = Data()
    data.load(dat_file, sep='::',
              format={'col':0, 'row':1, 'value':2, 'ids': int})
    items_full = read_item_data(item_file)              # helper functions defined elsewhere
    user_full = read_user_data_from_ratings(dat_file)   # in the chapter's accompanying code

Listing 3.9  Factorizing a matrix using SVD for recommendation

Oscar Celma, Python Recsys v1.0 documentation, http://mng.bz/UBiX.


    rows            Rescale the rows of the input matrix so that they all have unit Euclidean magnitude.
    cols            Rescale the columns of the input matrix so that they all have unit Euclidean magnitude.
    …               Rescale the rows and columns of the input matrix by dividing both the rows and the columns by the square root of their Euclidean norm.
    mean_center     Center the input matrix (aka mean subtraction).
    post_normalize  Normalize every row of … to be a unit vector. Thus, row similarity (using cosine distance) …

C. Lanczos, "An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators," Journal of Research of the National Bureau of Standards 45, no. 4 (October 1950), http://mng.bz/cB2d.
Doug Rohde, SVDLIBC, http://tedlab.mit.edu/~dr/SVDLIBC/.

Listing 3.10  Output from model-based recommendation

Listing 3.11  Films rated by user 10

… higher than 3, and figure 3.4 provides the genres for which the user has rated lower than 3. We don't report on films that have been rated with the average score of 3. This should now start to make more sense. Figure 3.3 shows that this user is clearly interested in comedy and drama films, because they account for the overwhelming majority of positive ratings.

Figure 3.3  The number of films that have been rated by user 10 as 4 or higher. These are broken out by genre, so user 10 has rated more than 40 comedy films with a 4 or a 5. Only the top 19 genres are displayed here.

… received two or more low rankings, but we …

Arkadiusz Paterek, "Improving Regularized Singular Value Decomposition for Collaborative Filtering," Proceedings of the ACM SIGKDD (2007), http://mng.bz/TQnD.

Figure 3.4  The number of films that have been ranked as lower than 3 by user 10, broken out by genre. In this case, the user has ranked three comedy films and two action/sci-fi films below 3 and has watched several other categories of films and ranked them with a score of 1.

Listing 3.12  Investigating user recommendations in model-based CF

Listing 3.13  Analyzing the ratings of users recommended Star Wars: TPM

    ['Comedy|Sci-Fi\n', 8],
    ['Drama|Sci-Fi\n', 7]

Figure 3.5  Number of ratings ≥ 3, aggregated by genre, for users who recommended Star Wars: The Phantom Menace


… comedy categories. At first glance this may seem erroneous, but remember that there are many films in these categories. Also remember that films in these genres may receive high ratings from all users in the …

Although this was a truly impressive achievement, Netflix didn't actually implement the final algorithm and instead chose to continue with a simpler, albeit less-accurate, algorithm (yielding only an 8.43% improvement versus the winning 10.06%). This algorithm won the first progress prize and was implemented even …

Martin Piotte and Martin Chabbert, "The Pragmatic …"

… rental slots. With the focus shifting to online streaming, this became less relevant because the delivery mechanism was almost instantaneous. A user can watch whatever they want, wherever and whenever they want. Herein lies our second important lesson: a reasonably intelligent algorithm delivered quickly can often be worth more than an extremely accurate one delivered at a later date.

3.7  Evaluating your recommender

Commercial recommendation systems operate under demanding conditions. The number of users is typically on the order of millions and the number of items on the order of hundreds of thousands. An additional requirement is the capability to provide recommendations in real time (typically, sub-second response times) without sacrificing the quality of the recommendations. This is typically performed with some layer of caching, and the time taken to use SVD for matrix factorization is usually not real-time and depends on many other factors: for example, the amount of memory you have and, more important, the implementation of SVD. It also might depend on how long you're willing to wait for the answer!

As you've seen, by accumulating ratings from each user, it's possible to enhance the accuracy of predictions over time. But in real life, it's imperative that we give excellent recommendations to new users for whom, by definition, we don't have a lot of ratings. Another stringent requirement for state-of-the-art recommendation systems is the ability to update predictions based on incoming ratings. In large commercial sites, thousands of ratings may take place in a few hours and perhaps tens of thousands (or more) in the course of a single day. The ability to update the recommendation system with that additional information is important and must happen online, without downtime.

Let's say that you wrote a recommender and you're satisfied with its speed and the amount of data it can handle. Is this a good recommender? It's not useful to have a fast and scalable recommender that pro…

Jonathan L. Herlocker et al., "Evaluating Collaborative Filtering Recommender Systems," ACM Transactions on Information Systems 22, no. 1 (January 2004).

We can argue that the RMSE is probably too naïve. Let's consider two cases. In the first case, we recommend to a user a movie with four stars, and he really doesn't like it (he'd rate it two stars); in the second case, we recommend a movie with three stars, but the user loves it (he'd rate it five stars). In both cases, the contribution to the RMSE is the same, but it's likely that the user's dissatisfaction would be larger in the first case than in the second. You can find the code that calculates the RMSE for our model-based solution in the following listing.

Listing 3.14 Calculating the root-mean-square error for a recommender
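The body of the listing isn't reproduced here, so the following is a minimal sketch of the calculation it describes, assuming predicted and actual are arrays of predicted and true ratings for the same user-item pairs (both names are illustrative, not from the original code).

import numpy as np

def rmse(predicted, actual):
    # Root-mean-square error between predicted and true ratings
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Example: two predictions that are each off by two stars
print(rmse([4.0, 3.0], [2.0, 5.0]))  # 2.0 -- both errors contribute equally

Note that the example echoes the point above: both mistakes contribute equally to the RMSE even though the user's dissatisfaction differs.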


In the final pages of this chapter, we looked at the Netflix Prize. Although this

Classification: placing things where they belong

"What is this?" is the question children perhaps ask most frequently. The popularity of that question among children, whose inquisitive nature is as wonderful as it is persistent, shouldn't be surprising. In order to understand the world around us, we organize our perceptions into groups and categories. In the previous chapter,

This chapter covers
- Understanding classification techniques


classification of the item as a boat narrows things for us. In a similar...

there are classification systems for spinal cord injuries; for coma, concussion, and traumatic brain injuries; and so on.

You must have heard the term Homo sapiens, of which Homo is our genus and sapiens is our species. This classification can be, and typically is, extended to include other attributes such as family, order, phylum, and so forth.

Generally speaking, the more attributes you use, the finer the degree of classification is going to be. Having a large number of attributes is usually a good thing, but there are caveats to this general principle. One notorious symptom of dealing with many attributes is the curse of dimensionality, which was discussed in chapter 2. As you may recall, the curse of dimensionality refers to the fact that a space becomes more and more homogenous as the number of attributes increases. In other

this analogy to obtain some insight into your structure.


We could obviously go on with more classification systems; they're everywhere. The point is that classifying data is equivalent to organizing it. Classification systems improve communication by reducing errors due to ambiguities. They also help us organize our thoughts and plan our actions. The reference structure, which is used for organizing our data, can be as simple

D. Gasevic, D. Djuric, and V. Devedzic, Model Driven Architecture and Ontology Development (Springer, 2006).

Figure 4.1 The basic elements of a reference structure (a rudimentary ontology). Ellipses denote concepts, rounded rectangles denote attributes, and rectangles denote instances (for example, NAME: Toyota Prius). Concepts inherit attributes from above.

application. When you think about your data and the ways it could be organized, you'll realize the value of a classification system for your application. Starting with section 4.3, we'll introduce classification mechanisms in web applications in order to demonstrate the use of classification algorithms and the issues that may arise. But first, here's an overview of classification systems. If you want to jump into action quickly, you can skip the next section.

4.2 An overview of classifiers

Figure 4.2 An overview of classification algorithms based on their design. (The figure groups algorithms such as decision trees, distance-based schemes separated into functional and nearest-neighbor types, the perceptron, and Bayesian classifiers.)

will eliminate as many candidates as possible based on the provided information. Decision-tree algorithms have several advantages, such as ease of use and computational efficiency. Their disadvantages are usually prominent when we deal with continuous variables, because we're forced to perform a discretization: the continuum of the values must be divided into a finite number of bins in order to construct the tree. In general, decision-tree algorithms don't have good generalization properties, and thus they don't perform well with unseen data. A commonly used algorithm in this category is C5.0 (on Unix machines) or See5 (on Microsoft Windows machines). It can be found in a number of commercial products, such as Clementine (now SPSS Modeler) and RuleQuest.
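As a minimal illustration (not from the original text), a decision-tree classifier can be trained with scikit-learn as follows; the toy feature matrix and labels are made up, and the tree depth is an arbitrary choice.

from sklearn.tree import DecisionTreeClassifier

# Toy data: each row is [age, income]; labels are 0/1 (illustrative only)
X = [[25, 30000], [42, 80000], [35, 52000], [23, 20000], [51, 95000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2)   # a shallow tree to limit overfitting
tree.fit(X, y)
print(tree.predict([[30, 60000]]))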

The second branch of structural algorithms is composed of distance-based algorithms. In the previous chapters, we introduced and extensively used the notions of similarity measure and generalized distance. These algorithms are fairly intuitive, but it's easy to misuse them and end up with bad classification results, because many of the data-point attributes aren't directly related to each other. A single similarity measure can't expediently capture the differences in the way the attributes should be measured; careful normalization and analysis of the attribute space are crucial to the success of distance-based algorithms. Nevertheless, in many low-dimensional cases with low complexity, these algorithms perform well and are fairly simple to implement. We can further divide distance-based algorithms into functional and nearest-neighbor type algorithms.
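As a hedged sketch of the nearest-neighbor idea and the normalization point just made (not from the original text), the following standardizes wildly different attribute scales before computing distances; the data and neighbor count are illustrative.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Toy data: [height in cm, salary in dollars] -- very different scales,
# so we standardize before distances are computed (illustrative values only)
X = [[170, 30000], [180, 80000], [165, 52000], [175, 20000]]
y = [0, 1, 1, 0]

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[172, 60000]]))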

Functional classifiers approximate the data by function, as the name suggests. This is similar to regression, but we differentiate...

Tom Khabaza, The Story of Clementine (Internal Report, Integral Solutions Limited, 1999).
Ross Quinlan, RuleQuest Research: Data Mining Tools, www.rulequest.com.


flavor of a statistical approach in the case of so-called logistic regression. In this case, the model (the logistic function) predicts values between 0 and 1, which can be interpreted as the probability of class membership; this works well in the case of binary classification if we threshold the probability. This model is widely used in industry and will be the focus of a later section. Here we'll address its use to classify online fraud, an especially difficult task because these events occur so rarely in comparison to non-fraud events.

Many of the techniques in the statistical algorithms category use a probability theorem known as the Bayes rule or Bayes theorem. In this kind of statistical classification algorithm, the least common denominator is the assumption that the attributes of the problem are independent of each other, in a fairly quantitatively explicit form. The fascinating aspect of Bayesian algorithms is that they seem to work well even when that

Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. (Springer-Verlag, 2009).
Kevin Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).

R. E. Neapolitan.

Figure 4.3 The lifecycle of a classifier: training, validation (or testing), and use in production.


combining intelligent algorithms was also discussed in the context of recommender systems in chapter 3.

In the production stage, we're using our classifier in a live system to produce classifications on the fly. Typically, the parameters of the classifier don't change during the production stage. But it's possible to enhance the results of the classifier by embedding (into the production stage) a short-lived training stage that's based on human-in-the-loop feedback. These three steps are repeated as we get more data from our production system and as we come up with better ideas for our classifier. You've now seen the big picture about classification algorithms. In the literature, you can find many different overviews of classification algorithms.

In order to provide you with a sufficiently in-depth and applicable example, for the remainder of this chapter we'll concentrate on statistical algorithms and, in particular,

L. Holmström, P. Koistinen, J. Laaksonen, and E. Oja, "Neural and Statistical Classifiers: Taxonomy and Two Case Studies."

y = mx + c

Figure 4.4 Graphical illustration of a linear model. The data points provide a clear linear relationship.

Figure 4.5 Calculating the residuals, or the...

above or below the model line. This is known as model training. In general, our straight-line model isn't restricted to only a single variable, x, and can be generalized into n + 1 dimensions by using the following form:

y = β0 + β1x1 + β2x2 + ... + βnxn

In this case, our approach to modeling is exactly the same; all that has changed is the

Figure 4.6 The response of y with x for a logistic curve. This particular curve is centered on x = 10 and has been plotted using 20 samples uniformly distributed on [0, 20]. Note the larger differences in y for values of x around 10 and much smaller differences in y around x = 0 and x = 20.

The next step is somewhat a leap of faith. We know that y is equivalent to the probability of the event occurring given the features in the exponent of e, so now we rearrange the equation to express y / (1 - y): the probability of the event occurring divided by the probability of the event not occurring. This is known as the odds of the event. Let's first calculate 1 - y and substitute this back into our equation afterward:

1 - y = e^-(β0 + β1x1 + ... + βnxn) / (1 + e^-(β0 + β1x1 + ... + βnxn))

Now that we have the equation for 1 - y, the modeled probability of the negative event, let's express the odds using our initial definition of y:

y / (1 - y) = e^(β0 + β1x1 + ... + βnxn)

Substituting back in for y and taking the natural logarithm on both sides, we end up with our final equation:

ln(y / (1 - y)) = β0 + β1x1 + ... + βnxn

Compare this with the original equation from section 4.3.1, and what do you notice? That's correct; they're both linear models, but where our original model expressed probability as a linear combination of the input variables, our new logistic regression model expresses the log-odds as a linear combination of input variables. Training can occur in a way much similar to the linear model: we update the values of β in order to best fit the data (to minimize loss). To define and minimize error, we can use maximum likelihood estimation: that
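As a quick numerical check (not from the original text), the following sketch evaluates the logistic model for an arbitrary coefficient vector and confirms that the log-odds of the output are linear in the inputs; the coefficient values are made up.

import numpy as np

beta0, beta1 = -2.0, 0.5           # illustrative coefficients
x = np.array([0.0, 2.0, 4.0, 6.0])

linear = beta0 + beta1 * x         # linear combination of the inputs
y = 1.0 / (1.0 + np.exp(-linear))  # logistic response: probabilities in (0, 1)
log_odds = np.log(y / (1.0 - y))   # recovers the linear combination exactly

print(np.allclose(log_odds, linear))  # True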


In figure 4.7, the columns ... the first line of data ...

Listing 4.2 Masking categorical features
Listing 4.3 Splitting out and encoding categorical features

Kevin Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).

The listings perform the following steps: create a mask object full of True values; fill the mask with 0s where the data is of categorical type; extract the non-categorical data; extract the categorical data; and encode only the categorical data.


The code in listing 4.3 splits out the features into two arrays: data_non_categoricals, which contains the continuous features, and data_categoricals, which contains the categorical features. We then encode the latter using one-hot encoding to ensure that the value each categorical variable can take is encoded independently and that the ordering of any feature encoding doesn't impact the end results. Let's first look at the performance of a simple model using only the continuous features, shown in the next listing.

Listing 4.4 Simple model using continuous features only

new_data = data_non_categoricals
new_data = new_data.astype(np.float)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(new_data, ...)

One-hot encoding

In the example, we discuss the use of one-hot encoding for categorical features. But what is this, and why should we bother with it? For logistic regression, all features are eventually encoded as a number, and this is where we can experience problems if we're not careful to deal with categorical vari...

The listing creates a train and test split, trains the logistic regression model, and obtains the output of the model.
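The full code of listings 4.4 and 4.5 isn't reproduced above, so here is a minimal sketch of the same idea using the modern scikit-learn API; the tiny stand-in arrays and the split size are illustrative assumptions, not the original data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for the arrays built in listings 4.2 and 4.3
data_non_categoricals = np.array([[120.0], [15.0], [300.0], [42.0], [510.0], [9.0]])  # e.g., transaction amount
data_categoricals = np.array([['uk'], ['us'], ['uk'], ['de'], ['us'], ['de']])        # e.g., country code
target = np.array([0, 0, 1, 0, 1, 1])                                                 # fraud labels (made up)

# Continuous features only (in the spirit of listing 4.4)
X_train, X_test, y_train, y_test = train_test_split(
    data_non_categoricals.astype(float), target, test_size=0.5, stratify=target)
model = LogisticRegression()
model.fit(X_train, y_train)
print(np.exp(model.coef_), np.exp(model.intercept_))   # odds and odds intercept

# Adding one-hot-encoded categorical features (in the spirit of listing 4.5)
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(data_categoricals).toarray()
combined = np.hstack([data_non_categoricals, encoded])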


Odds: [[ 1.00137002 0.47029208]]
Odds intercept: [ 0.50889666]
Likelihood Intercept: [ 0.33726409]

This tells us several things. The first line indicates that, holding all other variables constant, a unit increase in the transaction amount increases the odds of fraud by a factor of 1.00137002. Looking at the data, this intuitively makes sense. We do see that in our

Listing 4.5 Using both categorical and non-categorical variables with LR

Figure 4.8 Output of the model plotted against the test data ground truth (binary ground-truth labels versus the model's probability and log-probability, over the test instances). Values for the data labels may only take values of 1 or 0, indicating fraud and not fraud, respectively. Log-probability may take

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)


Odds: [[ 1.00142244 0.48843238 1. 0.97071389

0.77855355 0.95091843 0.85321151 0.98791223

0.99385787 0.95983606 0.81406157 1.

0.869761 1. 0.94819493 1.

0.96913702 0.91475145 0.81320387 0.99837556

0.97071389 0.869761 0.9778946 1.

0.81320387 0.99385787 0.94499953 0.82707014

0.98791223 0.98257044]]

Odds intercept: [ 0.61316832]
Likelihood Intercept: [ 0.38010188]

Given the two graphs returned from each version of our experiment, we might conclude that these additional features have had very little impact on the classifier's performance. We'll discuss more about performance in the next section, but before we do that, you should understand why the model parameters have such a different form in this experiment compared to the last.

You'll note that there are a total of 30 model parameters in this version. Why is this, given that there are only six features? This is a consequence of our encoding... features would almost certainly be different than those we've obtained here.

Let's now take this opportunity to summarize what we've accomplished in this section. We started out with the most humble linear model, that is, drawing a straight line through our data points, and we extended this to a model that is linear in the log-odds of the target variable. This powerful technique allows for a change in an attribute to provide a nonlinear change in the odds (although linear in the log odds). We showed how this can then be applied to a

things that range from exaggerations to outright nonsense. The goal of this section is to help you evaluate your own classifier, if you're a developer, and help you understand the legitimacy (or otherwise) of th...

Table 4.1 A typical confusion matrix for a simple binary classification problem

          Positive                Negative
True      True positive (TP)      True negative (TN)
False     False positive (FP)     False negative (FN)

prefer to release 100 guilty people than convict 1 innocent person; that sensitivity remains in the European courts. The moral of this anecdote is that decisions have consequences, and the degree of the consequences isn't uniform. This is particularly true in the case of multiclass classification. Consider the following definitions, some of which were introduced in section 1.6:

- FP rate = FP / N, where N = FP + TN
- Specificity = 1 - FP rate = TN / (FP + TN)
- Recall = TP / P, where P = TP + FN
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / (P + N)
- F-score = 2 x Precision x Recall / (Precision + Recall)
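As a small illustration (not part of the original text), these definitions can be computed directly from the four confusion-matrix counts; the counts below are made up.

TP, FP, TN, FN = 60, 10, 80, 20   # illustrative counts

P = TP + FN          # all positive examples
N = FP + TN          # all negative examples

fp_rate = FP / N
specificity = 1 - fp_rate
recall = TP / P
precision = TP / (TP + FP)
accuracy = (TP + TN) / (P + N)
f_score = 2 * precision * recall / (precision + recall)

print(recall, precision, accuracy, f_score)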

Suppose we find out about a classifier whose accuracy, as defined earlier, is 75%. How close to the true accuracy of the classifier is our estimate? In other words, if we repeat the classification task with different data, how likely is it that our accuracy will be 75%? To answer that question, we'll resort to

Ian Witten and Eibe Frank, Data Mining (Morgan Kaufmann, 2005).

a large number of tests, on...

statistical information about your data, such as minimum and maximum values, mean values, median values, valid outliers, percentage of missing data in attribute values, and so on. You can use that information to sample previously unseen data from your


Pay attention to the idiosyncrasies of your classification system. If you use a rule-based system, you may encounter what's known as the utility problem. The learning process (the accumulation of rules) can result in the overall slowdown of the system in production. There are ways to avoid or at least mitigate the utility problem, but you need to be aware of them and ensure that your implementation is compatible with these techniques. Of course, the degradation of performance isn't the only problem in that case. You'd also need to provide ways to manage and organize these rules, which is an engineering problem with a solution that'll depend strongly on the specific domain of your application. In general, the more complicated the classifier implementation, the more careful you should be to understand the performance characteristics (both speed and quality) of your classifier.

4.6 Summary

We provided a taxonomy of this broad field covering both statistical and structural algorithms. Statistical algorithms focus on the fit of a particular function to some data, often using algorithmic approaches to find the maximum likelihood of the data given the model. Conversely, structural algorithms look at fea...

R. B. Doorenbos, Production Matching for Large Learning Systems, PhD thesis, Carnegie Mellon (CMU, 1995).

Case study: click prediction for online advertising

Online click prediction is a specific problem in the world of advertising and is an excellent example of a high-velocity, high-throughput problem where decisions must be made with a very low latency. In general, this class of problems has a large number of applications. Online trading, website optimization, social media, the

This chapter covers

History and background

From an advertiser's standpoint, we can only proxy a positive impact through interaction with an ad. So, we try to predict the likelihood of interaction with an ad on a web page, based on all of the information surrounding a user and the context. Due to the volume and velocity of data generated on the web and the requirement for a rapid answer, this isn't a simple problem.

The solutions presented here should be equally applicable to different applications. For example, if we consider a smart city, where all cars are enabled with sensors...

"Total US Ad Spending to See Largest Increase Since 2004."


exchange and demand-side platform. We'll further develop these concepts in this chapter, and you'll learn how you can use the information collected about a user to increase the odds of interaction with an ad.

Figure 5.1 provides a graphical overview of the ecosystem. You can see that the exchange is the medium through which publishers (websites such as huffingtonpost.com, and more) can sell advertising space (slots and spaces of a predefined size) to individual advertisers (think companies like Nike, Adidas, O2, and Vodafone). Advertisers often buy through what's known in the industry as a demand-side platform (DSP), which acts on behalf of multiple advertisers and combines buying power to reach the very best publishers on the web. There are many examples of DSPs on the web, such as Turn (www.turn.com), MediaMath (www.mediamath.com), and DataXu (www.dataxu.com). These platforms are a conduit through which it's possible to buy advertising. We can easily draw parallels with the world of equity trading. Not everybody is allowed on the trading floor, so representatives act on behalf of many individuals who wish to buy.

In this parallel, the DSP is equivalent to the trader, whereas the advertising exchange is equivalent to the stock exchange. In the first case, the DSP aggregates demand, just as an individual trader does. In the second case, both

Figure 5.1 The online advertising ecosystem. Publishers make their content (advertising slots) available through an exchange that allows access programmatically, that is, through an intelligent algorithm. A demand-side platform aggregates demand for content from advertisers and buys on their behalf.

because advertising space is bought through the use of a computer program (an intelligent algorithm) as opposed to in person.

Figure 5.2 A graphical overview of the steps involved in buying ad placements on an exchange (cookie match, bid, ad placement, ad monitoring). Events should be read from top to bottom. First, a cookie match occurs. This can be performed either exchange side or DSP side. Once the DSP has a unique identifier that it can use, it uses this to look up additional information about the user and construct a bid price (the subject of section 5.4.5 onward). If the bid is successful, bid notification


5.2.1 Cookie matching

This is the first step that occurs every time a page that contains ads is loaded. A cookie is a unique identifier for a user for each domain (or website/web service) that's held in the browser. It's guaranteed to be unique, ... identifier for the user that can be used to look up further information. The DSP also obtains specific information about the ad placement that's available. What the

Vijay Krishna, Auction Theory, 2nd ed. (Academic Press, 2009).
Internet Advertising Bureau, "IAB Standard Ad Unit Portfolio," www.iab.com/guidelines/iab-standard-ad-unit-portfolio.

5.3 What is a bidder?

Understanding the process of bidding is the first step toward understanding the context in which an intelligent algorithm must operate. So far, we've only provided a high-level overview of how we buy ads on an exchange. All we know is that we're given some information by the exchange and that we must return a bid price in order to secure this user and this placement. In practice, a bidder that follows loosely the design presented in figure 5.3 achieves this.

The bidder is made up of an outer shell and an inner core. The outer shell is responsible for performing the crucial tasks associated with bidding, such as retrieving user information and responding to the exchange in a timely manner. At the inner core of the bidder is the decision engine, responsible for taking into account all user, placement, ...

Figure 5.3 The bidder consists of an outer shell that's responsible for performing the crucial tasks associated with bidding and the decision engine that's responsible for taking into account all data associated with the bid.

Simplicity: One way to achieve speed is through adopting a simple design. The bidder should perform only one function and should be easy to debug and maintain. Remember, once your bidder is live and participating in auctions, then every moment it's offline represents a loss of revenue for both you and...

Exchanges contain a lot of inventory. Part of the difficulty of participating in an exchange is that it's easy to buy lots of poor inventory and to buy it very quickly! For this reason it's extremely important that your bidder be accurate. Over time, you should be able to determine a positive correlation between the price that you bid and the performance of these ads.

Extensibility: There are many exchanges available through which inventory can be bought. Some inventory is available through only one exchange, whereas others are available on many. In order to maximize your reach as a DSP, you'll want to integrate with as many exchanges as possible. For this reason, extending your bidder should be easy. Don't be tempted to tightly couple your bidder with the first exchange that you integrate with, but keep your solution as generic as...

Once your bidder is in place and performing reasonably well, you'll want to start improving your decisioning to try to obtain the best users at the lowest cost. Every change to your bidder and its associated decisioning must be monitored to ensure that changes have the right effect. Often the results of a change can only be seen over days and weeks, not minutes.

5.4 What is a decisioning engine?

In the previous sections, we introduced the basics of the problem of buying ad placements online through an exchange and how this is performed in practice through the use of a bidder. In this section, we're going to concentrate on a specific aspect of the bidder. We'll uncover how to take the raw data provided by the exchange and convert this into a value you're willing to pay to deliver an ad. Let's look at the types of information you can use to generate these probabilities.

5.4.1 Information about the user

In reality, the only information provided about a user is a cookie identifier. It's up to the DSP to record and store such information. When someone is seen for the first time, no information is available about them whatsoever! If they interact with an ad they're shown, this will be stored, as will any other interaction that occurs in the scope of a DSP's advertising: for example, if the user starts a video or cancels a full-screen advertisement. These small interactions are stored away and looked up each time this user is seen in the wild. This allows a DSP to build up a picture of user behaviors that are synonymous with the likelihood to click, as well as those that aren't.


5.4.2 Information about the placement

Quite separate from information about the user, information about the ad placement can be predictive of a click. Independent of the user, certain sizes of ads tend to perform well, as do certain publishers. For example, ads on premium news sites such as theguardian.co.uk may have a higher click-through rate (CTR) than those on small

BBC News, "Black Friday: Online Spending Surge in UK and US," November 27, 2015, www.bbc.co.uk/news/business-34931837.
BBC News, "Cyber Monday Adds to Online Sales Surge," November 30, 2015, www.bbc.co.uk/news/business-34961959.

s to calculate the probability of a user interacting with an ad. In the previous chapter, we placed a hard threshold on the output of the logistic regression model in order to create a classifier; but in this chapter, we'll use the output of the logistic regression model as is, taking the output probability as the likelihood of a click. Recall that the general form of the logistic regression model is given as follows:

y = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn))

where y is the probability of the event under investigation, 1 - y the inverse probability, βi for i > 0 represents the feature coefficients, and β0 represents the threshold. Thus, in

Bid CPM = CPC x ECTR x 1000
Bid CPM = CPC x y x 1000
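As a tiny illustration (not from the original text), the bid price per thousand impressions follows directly from the advertiser's cost-per-click and the predicted click probability; the numbers below are made up.

cpc = 0.50          # what the advertiser pays per click, in dollars (illustrative)
p_click = 0.002     # predicted probability of a click (the model output y)

bid_cpm = cpc * p_click * 1000   # expected value of 1,000 impressions
print(bid_cpm)                   # 1.0 dollar per thousand impressions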

ad placements on a site-by-site basis. Many DSPs and algorithm designers in this space...

John Langford, "Vowpal Wabbit command-line arguments," http://mng.bz/YTlo.
John Langford, "Vowpal Wabbit tutorial," http://mng.bz/Gl40.
Criteo, "Kaggle Display Advertising Challenge Dataset," http://mng.bz/75iS.
John Langford, "Vowpal Wabbit wiki," http://mng.bz/xt7H.

some of its features here. As ever, the GitHub page and mailing list will have the most detailed information, and I strongly urge you to keep checking there if you're going to develop further applications with VW.

5.5.1 Vowpal Wabbit data format

Vowpal Wabbit uses a specific file format that is very flexible but can be difficult to...


Figure 5.4 Overview of the raw data provided by Criteo. The first field relates to the event we're trying to predict. This relates to the user performing some event, such as clicking the ad. The next 13 columns are integer properties with count semantics; a higher number means some event has been performed more times than if a lower number is seen. The next 26 features are categorical and most likely represent some property of the bid request: for example, the site from which the bid request was made.


won't use the count semantics just mentioned. Each unique count value will be treated as a label, and the model will lose the ability to understand that all the count values are related through integer semantics: for example, that 2 is greater than 1, 3 is greater than 2, and so on. If you look in the accompanying folder for this chapter, you'll find that we've already written this code for you!

In order to proceed with the example in this chapter, you'll need to download dac.tar.gz from Criteo, untar/gz the file, and extract the train.txt data file from the dac subdirectory. Next, run the included Python script against the training file using the following command:

python ch5-criteo-process.py /path/to/train.txt

This will create two files called train_vw_file and test_vw_file. These are created in the current directory of the aforementioned script. Note that if you wish, you can change the output files from this script by providing two additional parameters:

python ch5-criteo-process.py /path/to/train.txt /path/to/train_vw_file /path/to/test_vw_file

Running the script yields data in the following format:

-1 |i1 c:1 |i2 c:1 |i3 c:5 |i4 c:0 |i5 c:1382 |i6 c:4 |i7 c:15 |i8 c:2 |i9

c:181 |i10 c:1 |i11 c:2 |i13 c:2 |c1 68fd1e64 |c2 80e26c9b |c3 fb936136

|c4 7b4723c4 |c5 25c83c98 |c6 7e0ccccf |c7 de7995b8 |c8 1f89b562 |c9

a73ee510 |c10 a8cd5504 |c11 b2cb9c98 |c12 37c9c164 |c13 2824a5f6 |c14

1adce6ef |c15 8ba8b39a |c16 891b62e7 |c17 e5ba7672 |c18 f54016b9 |c19

21ddcdc9 |c20 b1252a9d |c21 07b5194c |c23 3a171ecb |c24 c5c50484 |c25

e8b83407 |c26 9727dd16

Criteo, http://mng.bz/75iS.

The c1 namespace in the previous example contains the feature 68fd1e64. Implicit to this is a value, 1, that denotes that feature 68fd1e64 is present; thus, it's categorical. We could equivalently write this as 68fd1e64:1. In the example of the i1 namespace, there is a feature, c. This has associated with it a continuous variable, 1, denoting the count of the number of events (of name c) relating to i1 that have occurred. In practice, this distinction is important because it means the difference between training a single coefficient for a continuous variable (the correct way to proceed for integer semantics) or a coefficient for every seen value of the variable (incorrect!). We won't go further into this here, but I encourage you to read the associated documentation on namespaces on the project wiki.

We mentioned earlier that you need to create two files. You create these files directly from the Criteo training data (train.txt). Because the Criteo data was originally com...

John Langford, "Vowpal Wabbit input format," http://mng.bz/2v6f.

How features are represented in VW

VW uses a bit vector of a specified size called a feature table, and each feature is represented by hashing into this vector. Weights are learned, not per feature, but per entry within the vector. This has a few benefits, including a reduced fixed-size learning space, but it means collisions can occur (where two different features trigger the same bit). In general, provided an appropriate vector size is chosen, the impact of this on the accuracy of the trained classifier is negligible. It also means that to uncover the impact of a feature (analyze its learned weights), we need to hash into the bit vector and perform the dot product of the vector and the weights. We'll encounter this later.


samples. Put another way, the negative samples will have more importance than the positive ones. This isn't ideal and can be best tackled by forcing the data to be balanced before training.

We can ensure that this is true with a bit of command-line manipulation. A word to the wise: these files are large, and editing them in your favorite text editor may cause your memory to fill up and your application to crash! Take a look at the following listing for an appropriately safe method to do this.

Listing 5.1 Determining how balanced the data is

wc -l train_vw_file
# 32090601 train_vw_file       <- total number of lines (word count, -l counts lines)
grep -c '^-1' train_vw_file
# 23869264                     <- lines starting with -1 (negative examples)
grep -c '^1' train_vw_file
# 8221337                      <- lines starting with 1 (positive examples)

Here we use some simple command-line utilities to determine the number of positive

Léon Bottou, "Online Learning and Stochastic Approximations," in ...


grep '^-1' train_vw_file | sort -R > negative_examples.dat    # extract negative samples, shuffle, write to file
grep '^1' train_vw_file | sort -R > positive_examples.dat     # extract positive samples, shuffle, write to file
awk 'NR % 3 == 0' negative_examples.dat \
    > negative_examples_downsampled.dat                        # take every third example from the negatives
cat negative_examples_downsampled.dat > all_examples.dat       # start the combined file with the down-sampled negatives
cat positive_examples.dat >> all_examples.dat                  # append the positive examples to the negative ones
cat all_examples.dat | sort -R > all_examples_shuffled.dat     # shuffle the balanced data
awk 'NR % 10 == 0' all_examples_shuffled.dat \
    > all_examples_shuffled_down.dat                            # keep every tenth example

This listing uses several more command-line tools to down-sample and shuffle our


Listing 5.3 Training a logistic regression model using VW

vw all_examples_shuffled_down.dat \
  --loss_function=logistic \
  -c \
  -b 22 \
  --passes=3 \
  -f model.vw

vw all_examples_shuffled_down.dat \
  -t \
  -i model.vw \
  --invert_hash readable.model

cat readable.model | awk 'NR > 9 {print}' | \
  sort -r -g -k 3 -t : | head -1000 > readable_model_sorted_top

The first command trains our model using VW. Several command-line switches perform specific functions, and more information regarding these can be found in Langford, "Vowpal Wabbit command-line arguments," cited previously. In order to perform logistic regression, we need to tell VW to use the logistic loss function to calculate the difference between the model output and the ground truth; we train using a cache (-c), a feature table of 22 bits (-b 22), three passes over the data (--passes=3), and write the resulting model to model.vw (-f). The second command runs vw in test-only mode (-t) over the same data using the trained model. The purpose of this isn't to test on the same data but to run through the data once and create the inverted hash (--invert_hash), a readable model. The final command sorts the readable model by the learned


human-readable features so we can gain some intuition as to what features the model thinks are important. This is the focus of the second command of listing 5.3. We call vw again in test mode (-t) with the --invert_hash option set. In general, you should never test on the same data you train on, but in this case we're not interested in the results! By running through the train data in test mode a single time with the invert_hash option set, we build a table that links the human-readable feature name to the effective impact on the classifier, through the weights associated with the hashed feature value.

Because there are so many learned values, and they're written out to the readable.model file in no particular order, before we can analyze the coefficients we must sort this file. We do this using the command-line tools awk, sort, and head. Essentially, we throw away the first nine rows (which include header information and preamble) and then sort the remaining rows on the third column, telling sort that the columns are denoted by the colon (:). The final output is sent to readable_model_sorted_top, and we reproduce the top 20 coefficients in the listing that follows. The features you see here are those that, if set (holding all other features constant), cause the greatest increase to log odds.

c24^e5c2573e:395946:3.420166

c3^6dd570f4:3211833:1.561701

c12^d29f3d52:2216996:1.523759

c12^977d7916:2216996:1.523759

c13^61537f27:2291896:1.373842

c12^d27ed0ed:2291896:1.373842

c12^b4ae013a:2291896:1.373842

c16^9c91bbc5:1155379:1.297896

c3^f25ae70a:64455:1.258160

c26^ee268cbb:64455:1.258160

c21^cfa6d1ae:64455:1.258160

c26^06016427:3795516:1.243104

c15^2d65361c:506605:1.240656

c12^13972841:506605:1.240656

c3^e2f2a6c7:3316053:1.234188

c18^eafb2187:3316053:1.234188

c18^fe74f288:3396459:1.218989

c3^8d935e77:2404744:1.214586

c4^f9a7e394:966724:1.212140

c3^811ddf6f:966724:1.212140

Listing 5.4 Human-readable output from VW (most important features)

from our test data.

5.5.3 Testing the model

The following listing provides the code to evaluate a test set against a trained model and to generate both the probabilities and ground truth that will be used in the next step to create a ROC curve.

Listing 5.5 Testing against a trained model in VW

vw -d test_vw_file \
  -t \
  -i model.vw \
  --loss_function=logistic \
  -r predictions.out

~/dev/vowpal_wabbit/utl/logistic -0 predictions.out > probabilities.out

cut -d ' ' -f 1 test_vw_file | sed -e 's/^-1/0/' > ground_truth.dat

The first command uses the -d option to specify the data file, along with the -t option for testing (no learning). The -i option loads the previously trained model into memory, and we specify the -r switch to output the raw predictions. The next line uses a helper function to transform these raw predictions (which are of the form of the raw linear score) into their associated probability, as specified by the logistic function in section 5.4.5. Note that you'll most likely need to change the path to the helper function on your system. The logistic script can be found under the utl subdirectory of the VW repository (http://mng.bz/pu5U) originally checked out as per the book's prerequisites/requirements file.

More information about feature hashing and collision avoidance can be found at http://mng.bz/WQi7.


These resulting probabilities are stored in the probabilities.out file. The final line of listing 5.5 separates the columns of the training file by splitting on white space and selecting the first column, which contains the ground-truth class data. This is then piped to sed, which replaces any occurrences of -1 (how VW deals with negative classes in logistic regression) with 0. The output is sent to the ground_truth.dat file.

We're now finally at the position where we can evaluate the efficacy of our model against the test data. The next listing shows the scikit-learn code that's required to generate the ROC curve, shown in figure 5.5.

Listing 5.6 Generating the ROC curve with scikit-learn

import numpy as np
import pylab as pl

The rest of the listing computes the false positive and true positive rates (and their associated thresholds) using scikit-learn.

scikit-learn, "Receiver Operating Characteristic (ROC)," http://mng.bz/9jBY.
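The remainder of the original listing isn't shown above; the following is a minimal sketch of how the curve in figure 5.5 can be produced with scikit-learn, assuming probabilities.out and ground_truth.dat were created as described.

import numpy as np
import pylab as pl
from sklearn.metrics import roc_curve, auc

probabilities = np.loadtxt('probabilities.out')   # model output, one probability per line
ground_truth = np.loadtxt('ground_truth.dat')     # 0/1 labels, one per line

# False positive rate, true positive rate, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(ground_truth, probabilities)
roc_auc = auc(fpr, tpr)

pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')      # the diagonal corresponds to random guessing
pl.xlabel('False positive rate')
pl.ylabel('True positive rate')
pl.legend(loc='lower right')
pl.show()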


The results show a value of 0.76 for the area under the curve based on our trained model and held-out test data. To translate this into a more intuitive number, if we chose a threshold corresponding to the point on the ROC curve closest to the top-left corner of figure 5.5 and used this threshold as a hard classifier to choose between clickers and non-clickers, we'd have a true positive rate of about 0.7 (that is, 70% of actual clickers were correctly identified) and a false positive rate of about 0.3 (that is, 30% of non-clickers were incorrectly identified as clickers). To obtain these numbers, find the point closest to the upper-left corner on the upper curve in the figure, and read off the values on the x- and y-axes, respectively. In reality, however, models such as these are used not to classify impressions that will be clicked but to determine how much money should be paid to show an ad, as per the methodology in section 5.4.6.

5.5.4 Model calibration

Figure 5.5 The receiver operating characteristic for our model based on the supplied test data. The area under the curve is given as 0.76.

In the previous section, we alluded to the fact that we don't use the output of our trained model to create a classifier per se; rather, we use the output of the logistic regression model to estimate the probability of the click occurring. But if you remember, to achieve efficient model training, we changed the frequency of the positive events in the data such that the positive and negative events represented an even proportion of the


training data. Thus, in order to obtain outputs from the model that are representative of the underlying probabilities of a click, we must calibrate the model. Recall that the model takes the form Pr(y = 1 | x) = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn)); down-sampling the negative examples changes the proportion of positive events seen during training, and so it shifts the intercept of this model relative to the one the unsampled data would produce.

model that reflects the probability of the positive event accurately, you must complete a further step. As you can see, the trained model will have an intercept that's ln(r) lower (where r denotes the negative down-sampling factor) than the one that would be obtained if you didn't use down-sampling.

The future of real-time prediction

need to capture, direct, store, preprocess, and process continuously, generating train...

5.8 Summary

- We've built an intelligent algorithm for online click prediction.
- We provided a complete introduction to a very important application on the intelligent web. Very few applications have received such attention from the ML community and continue to do so at this level.
- You learned about the use of an out-of-core machine-learning library called Vowpal Wabbit that allows training to occur without reading all the data into memory.
- We discussed the future of real-time click prediction on the web.

Deep learning and ...

This chapter covers

Startup.ML, "Deep Learning News," June 30, 2015, http://news.startup.ml.
Jeubin Huang, "Overview of Cerebral Function," Merck Manual, September 1, 2015, http://mng.bz/128W.
Warren S. McCulloch and Walter H. Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity."

Figure 6.2 On the left, we provide a schematic of a human biological neuron. To the right, we show a human-...

6.3 The perceptron

In the previous section, we introduced the neuron. With this basic approach, it turns out that it's possible to learn and to generalize training data, but in a very lim...

Frank Rosenblatt, The Perceptron: A Perceiving and Recognizing Automaton (Cornell Aeronautical Laboratory, 1957).
Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review 65, no. 6 (November 1958): 386-408.

Figure 6.3 MCP as a 2-D linear model without inhibitory input (threshold output plotted against summation output). The weights of the model correspond to the coefficients of a linear model. In this case, our neuron supports only a single input, and the weight has been set to 1 for illustration. Given a threshold of 0, all inputs with a value less than or equal to 0 would inhibit the neuron from firing, whereas all inputs with a value greater than 0 would cause the neuron to fire.

final result,

6.3.1 Training

Now that you know that a neur...

Figure 6.4 The perceptron. Inputs x are received and multiplied by their associated weights, with the perceptron bias being added in afterward. This output, given by a, is then passed through a threshold function to obtain the output:

f(x, w) = 1 if a > 0, and 0 otherwise

x1   x2   Threshold   Weighted sum   Sign of weighted sum   Output
1    0    1.5         -0.5           Negative               0
0    1    1.5         -0.5           Negative               0
0    0    1.5         -1.5           Negative               0
1    1    1.5         0.5            Positive               1

Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain."
Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, 1962).

Listing 6.1 The perceptron learning algorithm

f(x, w) = 1 if w · x > 0, and 0 otherwise

Listing 6.2 The perceptron learning algorithm (2)

Brian Ripley.

Listing 6.3 Creating data for perceptron learning

The learning algorithm calculates the output of the perceptron given the current weights (at time t) and then updates the weights:

w(t + 1) = w(t) + α (d − y(t)) x

Note this is often done as a vector operation rather than a for loop. Weights are updated only if the expected output d and the actual output y(t) differ. If they do, we move each weight in the sign of the correct answer but by a magnitude given by the corresponding input and scaled by α, where α represents the speed at which the weights are updated in the algorithm. Listing 6.3 creates four data points in a NumPy array; in this example, the point x = ...
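The bodies of listings 6.1-6.3 aren't reproduced above, so here is a minimal sketch of the perceptron learning rule just described, applied to a small NumPy dataset; the data, learning rate, and iteration count are illustrative.

import numpy as np

# Four 2-D data points and their target labels (an AND-like problem)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(2)       # weights
b = 0.0               # bias
alpha = 0.1           # learning rate: how fast the weights move

for _ in range(100):                                   # iterate over the data repeatedly
    for x, target in zip(X, d):
        y = 1.0 if np.dot(w, x) + b > 0 else 0.0       # threshold activation
        w += alpha * (target - y) * x                  # perceptron update rule
        b += alpha * (target - y)

print(w, b)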

Listing 6.4 Training a single perceptron

p = perceptron.Perceptron(n_iter=100)

The listing creates a single perceptron (n_iter specifies the number of times the training data should be iterated through when training), trains the perceptron using the data and the associated labels, and prints out the coefficients and the bias of the perceptron.

Figure 6.5 Graphical overview of the data for a single perceptron. Data with class label 1 is represented by a round dot, whereas data with class label 0 is represented by a star. It's the aim of the perceptron to separate these points.
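As a hedged completion (the remaining lines of the original listing aren't shown), the same steps look roughly like this with scikit-learn; note that newer scikit-learn versions expose Perceptron under sklearn.linear_model and use max_iter rather than n_iter.

import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

p = Perceptron(max_iter=100)
p.fit(X, y)                      # train on the four data points and their labels
print(p.coef_, p.intercept_)     # learned coefficients and bias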

scikit-learn, "Perceptron," http://mng.bz/U20J.

Listing 6.5 Plotting the output of the perceptron

The listing sets up a color array and plots the data points: a black star for class zero, a red dot for class one.

grad = -p.coef_[0][0]/p.coef_[0][1]        # gradient of the separating line
intercept = -p.intercept_/p.coef_[0][1]    # intercept of the separating line
x_vals = np.linspace(0, 1)
y_vals = grad*x_vals + intercept
plt.plot(x_vals, y_vals)
plt.show()

We plot our four data points along with the projection of the separating plane learned by the perceptron onto the viewing plane. This provides us with a separation as per figure 6.6. In general, for a larger number of input variables, you can think of these as existing in n-dimensional space, with the perceptron separating these using a hyperplane in n + 1 dimensions. You should now be able to see that the basic linear form of the perceptron, that is, with a threshold activation, is equivalent to separation using a hyperplane. Consequently, such models are of use only where data is linearly separable and will be unable to separate positive and negative classes otherwise.

Figure 6.6 The projection of our separating plane onto the viewing plane. All points to the top right of the figure satisfy the constraint w · x ≥ 0, whereas points to the bottom left of the line satisfy the constraint w · x < 0.

The XOR function isn't linearly separable in two-dimensional space; there exists no single hyperplane that can separate positive and negative classes perfectly. Try to draw a line anywhere on this graph with all positive classes on one side of the line and all negative classes on the opposite side of the line, and you'll fail! We could, however, separate these data points if we had

Marvin Minsky and Seymour Papert, Perceptrons (MIT Press, 1969).

Table 6.2 Input and output values for the XOR function. It outputs a 1 if either input is 1, but not both.

Figure 6.7 A multilayer perceptron. Read from top to bottom, it consists of an input layer, a hidden layer, and an output layer.

Figure 6.8 Graphical representation of the XOR function. Positive classes are specified with circles, whereas negative classes are specified with stars. No single hyperplane can separate

Figure 6.9 A two-layer perceptron can separate a nonlinearly separable function (XOR). Values on the connections demonstrate the weight of those connections. The introduction of the bias term ensures correct operation with an activation threshold of 0. Conceptually, the two hidden neurons correspond to two hyperplanes. The final combining perceptron is equivalent to a logical AND on the output of the two hidden neurons. This can be thought of as picking out the areas of the (x1, x2) plane for which the two hidden

6.4.1 Training using backpropagation

In the previous examples, we used the step function as our neuron-activation function, that is, a single threshold value over which the neuron is able to fire. Unfortunately, coming up with an automated method to train such a network is difficult. This is because the step function doesn't allow us to encode uncertainty in any form: the thresholds are hard. It would be more appropriate if we could use a function that approximates the step function but is more gradual. In this way, a small change to one of the weights within

Figure 6.10 Separating a nonlinearly separable dataset with two hyperplanes (with offsets 0.5 and 1.5); the region between the two lines corresponds to y = 1, and the regions outside to y = 0.

Figure 6.11 Output profiles for several activation functions over the same range as provided in figure 6.3. Activation profiles demonstrated are square root, logistic, negative exponential, and hyperbolic.

6.4.3 Intuition behind backpropagation

To provide the intuition behind backpropagation, we're going to work with the same example as previously, the XOR function, but we'll try to learn the weights rather than specify them by hand. Note also that from now on, the activation function will be assumed to be sigmoid (logistic). Figure 6.12 provides an overview of our new network.

Logistic regression, the perceptron, and the generalized linear model

Think back to chapter 4, where we introduced the concept of logistic regression. There we identified that a linear response model would be unsuitable for probability estimation and instead used the logistic response curve to create a more appropriate response. From this starting point, we derived that the log-odds are linear in the combination of weights and input variables and applied this to a classification problem.

In this chapter, we started from a basic biological concept and built a computational formalism that captures this. We haven't discussed probability at all but started from a standpoint of activation within a neuron. Intuitively, we've extended this into a more general concept and reached the same equation:

y = 1 / (1 + e^-(β0 + β1x1 + ... + βnxn))

In fact, what we've encountered here is a more general class of problem known as generalized linear models (GLM). In this class of models, a linear model is related to the output variable, y, by a link function, which in this case is the logistic function.

This equivalence of algorithms and concepts is common in machine learning, and you've seen it already in this book. Just think back to section 2.5, where we discussed the equivalence of expectation maximization (EM) with a Gaussian mixture model with tied and coupled covariances and the vanilla k-means algorithm. The usual reason for this is that multiple researchers have started from different points within the research and discovered equivalence through the extension of basic building blocks.

a. Peter McCullagh and John A. Nelder, Generalized Linear Models (Chapman and Hall/CRC, 1989).

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning Representations by Back-Propagating Errors," Nature 323 (October 1986): 533-536.

Figure 6.12 Overview of the backpropagation example. Given a set of inputs...

The first thing to note about training is that we're interested in how the error at the output changes with respect to the change in a single weight value. Only through this value can we move the weight in a direction that minimizes the error. This rate of change of the error given a change in weight decomposes, by the chain rule, into three factors: the change in the error given the output of the neuron, the change in the output of the activation function given the weighted sum of its inputs, and the change in that weighted sum given the weight itself:

∂E/∂w_ij = (∂E/∂o_j) × (∂o_j/∂net_j) × (∂net_j/∂w_ij)

where E is the error (based on the difference between the expected and actual outputs), o_j is the output of the activation function of neuron j, and net_j is the weighted sum of that neuron's inputs.
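To make the chain rule concrete, here is a small sketch (not from the original text) computing this gradient for a single sigmoid neuron with a squared-error loss; the input, weight, and target values are made up.

import numpy as np

x = np.array([1.0, 0.5])        # inputs to the neuron
w = np.array([0.2, -0.4])       # current weights
target = 1.0                    # expected output

net = np.dot(w, x)              # weighted sum of inputs
o = 1.0 / (1.0 + np.exp(-net))  # sigmoid activation

# Chain rule: dE/dw = dE/do * do/dnet * dnet/dw
dE_do = o - target              # derivative of 0.5*(o - target)^2 with respect to o
do_dnet = o * (1.0 - o)         # derivative of the sigmoid
dnet_dw = x                     # derivative of the weighted sum with respect to each weight

grad = dE_do * do_dnet * dnet_dw
w_new = w - 0.1 * grad          # one gradient-descent step with learning rate 0.1
print(grad, w_new)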

Rumelhart et al., "Learning Representations by Back-Propagating Errors," 533-536.

Listing 6.6 Building a multilayer perceptron using PyBrain

in_to_h = FullConnection(inl, hidl)
h_to_out = FullConnection(hidl, outl)
bias_to_h = FullConnection(b, hidl)
bias_to_out = FullConnection(b, outl)

Listing 6.7 Building a multilayer perceptron using PyBrain (2)

"PyBrain Quickstart," http://pybrain.org/docs/index.html#quickstart.

The PyBrain SupervisedDataSet randomly samples the four data points thousands of times; a new backpropagation trainer object is then created, backpropagation is performed 50 times using the same data points, and the error is printed after every iteration.
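Since only fragments of these listings survive above, here is a minimal sketch of the same idea using PyBrain's high-level helpers (buildNetwork, SupervisedDataSet, and BackpropTrainer); the layer sizes and iteration counts are illustrative choices, not the original settings.

from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# A 2-input, 3-hidden, 1-output network with a bias unit
net = buildNetwork(2, 3, 1, bias=True)

# The four XOR patterns, added repeatedly to the training set
ds = SupervisedDataSet(2, 1)
for _ in range(1000):
    ds.addSample((0, 0), (0,))
    ds.addSample((0, 1), (1,))
    ds.addSample((1, 0), (1,))
    ds.addSample((1, 1), (0,))

trainer = BackpropTrainer(net, ds)
for i in range(50):
    print(trainer.train())     # train() runs one epoch and returns the error

print(net.activate((1, 0)))    # should be close to 1 after training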

scikit-learn, "Restricted Boltzmann Machine Features for Digit Classification," http://mng.bz/3N42.
Geoffrey Hinton, "A Practical Guide to Training Restricted Boltzmann Machines."

Figure 6.13 Graphical overview of a Restricted Boltzmann Machine (visible units and hidden units). An RBM is a...

The joint configuration of visible units v and hidden units h is scored by an energy of the form E(v, h) = -Σ_{i,j} v_i W_{ij} h_j (plus bias terms), and the probability of a configuration is proportional to e^-E(v, h).

Yann LeCun et al., "A Tutorial on Energy-Based Learning," in Predicting Structured Data.

scikit-learn, "Restricted Boltzmann Machine Features for Digit Classification," http://mng.bz/3N42.

Listing 6.12 Generating artificial data

Listing 6.14 Evaluating the RBM/LR pipeline
Listing 6.15 Representing the hidden units graphically

Figure 6.14 A graphical representation of the weights
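The code for these listings isn't reproduced above; the following is a minimal sketch of an RBM-plus-logistic-regression pipeline in scikit-learn, in the spirit of the scikit-learn digits example cited earlier (the component counts and learning parameters are illustrative).

from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

digits = load_digits()
X = digits.data / 16.0          # scale pixel values to [0, 1] for the RBM
y = digits.target

rbm = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=10)
logistic = LogisticRegression(max_iter=1000)

# The RBM learns 100 hidden features from the 64 raw pixels;
# logistic regression is then trained on those activation probabilities.
pipeline = Pipeline([('rbm', rbm), ('logistic', logistic)])
pipeline.fit(X, y)
print(pipeline.score(X, y))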

In our RBM, we have 100 hidden nodes and 64 visible units, because this is the size of the images being used. Each square in figure 6.14 is a grayscale interpretation of the weights between that hidden component and each visible unit. In a sense, each hidden component can be thought of as recognizing the image as given previously. In the pipeline, the logistic regression model then uses the 100 activation probabilities (the probability that each hidden unit is on, given the image) as its input; thus, instead of doing logistic regression over 64 raw pixels, it's performed over 100 inputs, each having a high value when the input looks close to the ones provided in figure 6.14. Going back to the first section in this chapter, you should now be able to see that we've created a network that has automatically learned some intermediate representation of the numbers, using a

Making the right choice

Given a number of options, how do we choose the one that maximizes our reward (or, equivalently, minimizes the cost)? For example, if we have two possible routes to work, how might we choose the one that minimizes the time spent traveling? In this example, our reward is based on the

This chapter covers
- Complexities when making the right choice
- Multi-armed bandits

In this figure, we assume that the actor is exposed many times to a situation that requires a decision to be made. Each time a decision is made, an outcome is reached, and this outcome has an associated reward. In the case of our commuter, assuming they go to work every day, they have to choose their route every time they go to work. The outcome is the combination of all the factors on that commute, and the reward is formulated from this.

Although our toy example fits the problem perfectly, the intelligent web offers many other problems with the same format. It offers many applications where we're presented with several choices and making the correct one is crucial:

- Landing-page optimization. With the advent of the intelligent web, there are

Figure 7.1 Formalizing the problem definition. In this situation, our actor is faced with a decision (leading, say, to reward A or reward B) and, depending on their choice, they reach an outcome. This outcome has an associated reward. How should the actor choose so as to receive the maximum reward (or, equivalently, the minimum cost)?

When performing A/B testing, we do so with two groups: group A and group B. The first is a control group, and the second has some factor changed. For example, in the case of landing-page optimization, group A may be shown the existing landing page and group B a new landing page whose content or layout has changed in some way. The purpose of our A/B test is to understand whether the change makes a measurable difference to the conversion rate. For a large enough number of samples, the observed conversion rate behaves as a normally distributed estimate of the population conversion rate. This is true

of both groups A and B. We can therefore come up with a new, also normal, random variable that's the combination of the random variables from groups A and B. This is the distribution of differences. Let us write $X$ to refer to this new random variable, defined as

$$X = X_{\mathrm{experiment}} - X_{\mathrm{control}}$$

where $X_{\mathrm{experiment}}$ is the random variable of conversion rates of our experimental group, and $X_{\mathrm{control}}$ is the random variable of conversion rates of our control group. Given this new random variable, we can now write the null and alternative hypotheses. Our null hypothesis can be stated as follows:

$$H_0: E[X] = 0$$

That is, the experimental group is no different from the control group. Both $X_{\mathrm{experiment}}$ and $X_{\mathrm{control}}$ are distributed around the same population mean, so our new random variable $X$ should be distributed around 0. Our alternative hypothesis can be stated as follows:

$$H_1: E[X] > 0$$

That is, the expectation of the random variable of the experimental group is larger than that of the control group; the population mean of this group is higher.

Stuart Wilson and Rory MacLean, Research Methods and Data Analysis for Psychology (McGraw-Hill Education Europe, 2011).
Student, "The Probable Error of a Mean," Biometrika 6, no. 1 (1908): 1-25.

We can perform a one-tailed z-test on the distribution of differences to determine whether the control distribution really does differ from the experimental one. From the definition of the standard score, we may now write

$$z = \frac{p_{\mathrm{experiment}} - p_{\mathrm{control}}}{SE}$$

where $p_{\mathrm{experiment}}$ is the conversion rate of the experiment, $p_{\mathrm{control}}$ is the conversion rate of the control, and $SE$ is the standard error for the difference in conversion rates.

Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta, "Confidence Intervals for a Binomial Proportion and Asymptotic Expansions," The Annals of Statistics 30, no. 1 (2002): 160-201.
Sean Wallis, "Binomial Confidence Intervals and Contingency Tests: Mathematical Fundamentals and the Evaluation of Alternative Methods."

equivalent to the conversion rate, and because the standard errors of two independent variables can be combined by adding their squares, we note the following:

$$SE^2 = \frac{p_{\mathrm{experiment}}(1 - p_{\mathrm{experiment}})}{n_{\mathrm{experiment}}} + \frac{p_{\mathrm{control}}(1 - p_{\mathrm{control}})}{n_{\mathrm{control}}}$$

By substitution, we can now write our z-test as follows. This is a formulation of the Wald (or normal) interval for the binomial distribution:

$$z = \frac{p_{\mathrm{experiment}} - p_{\mathrm{control}}}{\sqrt{\dfrac{p_{\mathrm{experiment}}(1 - p_{\mathrm{experiment}})}{n_{\mathrm{experiment}}} + \dfrac{p_{\mathrm{control}}(1 - p_{\mathrm{control}})}{n_{\mathrm{control}}}}}$$

The greater the value of $z$, the more evidence against the null hypothesis. To obtain a 90% confidence interval for a one-tailed test, our value of $z$ would need to be greater than 1.28. What this actually says is that, under the assumption of the null hypothesis (the population mean for groups A and B is the same), the probability of this difference in conversion rates, or one larger than this, occurring by chance is less than 10%. Put another way, assuming the control and experiment conversion rates are drawn from a distribution with the same mean, if we ran this same experiment 100 times, only 10 of these would have at least such an extreme value. We can provide even tighter bounds and more evidence against the null hypothesis by using a 95% confidence interval. This increases the $z$ value required to 1.65.
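You can verify these one-tailed thresholds from the standard normal quantile function; the snippet below uses scipy, which is an assumption about your environment.

from scipy.stats import norm
print(norm.ppf(0.90))   # ~1.28, the 90% one-tailed threshold
print(norm.ppf(0.95))   # ~1.65, the 95% one-tailed threshold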

It might be useful for you to think about the factors that will impact the size of $z$. Obviously, if we draw two conversion rates at a given point in time from an experimental set and a control set, a larger difference between these will lead to a larger $z$ score, and therefore more evidence that they're drawn from different populations with different means. But the number of samples is also important. As you've seen, a larger number of samples will lead to an overall smaller standard error. This captures the fact that our estimate of the conversion rate is more accurate the longer we run the experiment. In the following section, we'll provide an illustration of this and a worked example of A/B testing in Python using this methodology.

7.1.2 The code
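As a minimal sketch of this calculation (the function name, variable names, and sample size below are ours, not the book's):

from math import sqrt

def z_value(p_experiment, n_experiment, p_control, n_control):
    # Standard error of each conversion rate (binomial), combined by adding variances
    se_experiment_sq = p_experiment * (1 - p_experiment) / n_experiment
    se_control_sq = p_control * (1 - p_control) / n_control
    return (p_experiment - p_control) / sqrt(se_experiment_sq + se_control_sq)

# With the conversion rates discussed below (0.002 new page, 0.001 original) and an
# assumed 10,000 samples per group, this gives a z value of roughly 1.83.
print(z_value(0.002, 10000, 0.001, 10000))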


Suppose the conversion rates have been measured and found to be 0.002 for the new landing page and 0.001 for the original landing page. You need to know whether this increase is significant enough to warrant the change, so you calculate the z value for the experiment. Given these values, we obtain a z value of 1.827. This exceeds the 92% confidence interval but not the 95% interval. We can say that there is a probability of less than 0.08 that the data was drawn from the control distribution. Consequently, at this interval the data is significant. We should reject the null hypothesis that there is no difference at all between the groups and accept the alternative hypothesis, that the second group has a higher conversion rate. If we've controlled all other aspects of the user groups, this should imply that the website redesign has had a positive effect.

You should be able to see from this code that the standard error of the distributions from which the conversion rate is drawn has a direct effect on the resulting z value.

Listing 7.1 Calculating the standard (z) value for your experiment

7.1.3 Suitability of A/B

In this section, we've given a basic introduction to a statistical method for testing called the z-test. We discussed it in the context of an A/B test, where we want to test the impact of a change in operations so as to decide whether the change should be adopted.

Figure 7.2 We're given a fixed value for the conversion rates of the A/B groups; the horizontal axis shows the number of samples, from 0 to 20,000.

advantage. What if we could exploit new solutions in a quicker fashion, without waiting for the test to finish completely? Well, we can. Enter the multi-armed bandit (MAB).

7.2 Multi-armed bandits

The name multi-armed bandit (MAB) comes from the popular casino game, the one-armed bandit. For those of you who have never been to a casino, this is a machine whereby you pull on a lever (the arm); depending on the values displayed by the machine, you're either paid out or (more likely) not. As you'd expect, the odds of these machines are in favor of the house, so the likelihood of being paid out is typically very small!

Figure 7.3 The multi-armed bandit problem. A user is faced with a number of one-armed bandits, each of which has a different probability of paying out. The user doesn't know what these probabilities are and may only uncover them through playing the machines. What strategy should the user employ in order to maximize their overall return?

t

1

=

T

=

HAPTER

Making the right choice

where

is the number of steps weve carried out so far,

is the reward that was

received at step

, and

is the mean payout returned per play from the optimal ban-
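Translated directly into Python (the function name is ours), the regret of a run is:

def regret(rewards, mu_opt):
    # rewards: the list of rewards actually received, one per play
    # mu_opt: mean payout per play of the optimal bandit
    T = len(rewards)
    return T * mu_opt - sum(rewards)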

Listing 7.2 Pseudo-code for the epsilon-first strategy
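A minimal runnable sketch of the epsilon-first idea follows: spend the first epsilon fraction of the trials purely exploring the solution space, then exploit your knowledge by playing only the best-observed arm. It assumes bandit objects exposing a pull() method that returns 1 or 0, as sketched for listing 7.6 later; the structure here is ours rather than the book's exact listing.

import random

def epsilon_first(bandits, number_trials, epsilon=0.1):
    rewards = [0.0] * len(bandits)
    plays = [0] * len(bandits)
    total_reward = 0
    for trial in range(number_trials):
        if trial < epsilon * number_trials:
            choice = random.randrange(len(bandits))          # explores the solution space
        else:
            means = [r / p if p else 0.0 for r, p in zip(rewards, plays)]
            choice = means.index(max(means))                  # exploits your knowledge
        reward = bandits[choice].pull()
        rewards[choice] += reward
        plays[choice] += 1
        total_reward += reward
    return total_reward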


EPSILON-GREEDY

With the epsilon-greedy approach, epsilon acts as a probability that we'll explore the space, as opposed to exploiting the best-known arm. More formally, the following pseudo-code outlines the approach.

epsilon = 0.1
best_bandit                  # index of the best bandit in the array
bandit_array                 # array containing bandit objects
total_reward = 0
number_trials
current_trial = 0
while (number_trials - current_trial) > 0:
    random_float = rand(0, 1)
    if random_float < epsilon:
        random_bandit = rand(0, len(bandit_array))
        total_reward += play(bandit_array[random_bandit])   # explores the solution space
        update_best_bandit()
    else:
        total_reward += play(bandit_array[best_bandit])     # exploits your knowledge of the best bandit
    current_trial += 1

The advantage of this approach is that we don't need to wait for the explore phase to complete before we can start exploiting our knowledge of the bandits' performance, but be careful! This algorithm doesn't take into account the statistical significance of our performance data. It's possible that a peak in positive payouts for a particular bandit will result in all plays being shifted to this bandit erroneously. More on this shortly.

You can see that epsilon controls the probability that we explore rather than exploit. Low values of epsilon make it less likely that we'll explore the space, whereas higher values make it more likely. There's a clear trade-off here, and the selection of our value of epsilon depends on many factors. Both the number of bandits and the probabilities at which they pay out

Listing 7.3 Pseudocode for the epsilon-greedy strategy

epsilon = 1                  # starts by exploring all the time
best_bandit                  # index of the best bandit in the array
bandit_array                 # array containing bandit objects
total_reward = 0
number_trials
current_trial = 0
while (number_trials - current_trial) > 0:
    random_float = rand(0, 1)
    if random_float < epsilon:
        random_bandit = rand(0, len(bandit_array))
        total_reward += play(bandit_array[random_bandit])
        update_best_bandit()
    else:
        total_reward += play(bandit_array[best_bandit])
    current_trial += 1
    epsilon = update_epsilon(epsilon)

Note that there are several methods to choose an optimal rate at which to update epsilon; the right choice depends on the number of bandits and the respective rates at which they pay out.
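As one simple (assumed) example of such an update rule, a multiplicative decay keeps early plays exploratory and later plays increasingly greedy:

def update_epsilon(epsilon, decay=0.99):
    # Shrink epsilon a little after every trial
    return epsilon * decay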

BAYESIAN BANDITS

As mentioned, one of the limitations of this algorithm is that we don't take into account the significance of the performance data as we explore the space. Although you might be able to exploit your knowledge sooner, you may be making erroneous decisions. That's not to say that these techniques aren't useful! But they require a degree of care in their application.

Listing 7.4 Pseudo-code for the epsilon-decreasing strategy

bandit_distribution_array    # initialized with suitable priors for each bandit
total_reward = 0
number_trials
current_trial = 0
while (number_trials - current_trial) > 0:
    sample_array = sample(bandit_distribution_array)
    best_bandit = index_of(max(sample_array))
    reward = play(bandit_array[best_bandit])
    total_reward += reward
    current_trial += 1
    update_distribution_array(best_bandit, reward)
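The pseudo-code above leaves sample and update_distribution_array undefined. A minimal sketch, assuming each bandit's belief is a Beta distribution stored as an (alpha, beta) pair and that the array is passed in explicitly (the pseudo-code treats it as ambient state), might look like this:

import numpy as np

def sample(bandit_distribution_array):
    # Draw one sample from each bandit's current Beta belief
    return [np.random.beta(a, b) for (a, b) in bandit_distribution_array]

def update_distribution_array(bandit_distribution_array, best_bandit, reward):
    # Conjugate Beta update: a payout (1) increments alpha, a miss (0) increments beta
    a, b = bandit_distribution_array[best_bandit]
    bandit_distribution_array[best_bandit] = (a + reward, b + (1 - reward))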

You can see the elegance of this solution. Although it's simple to implement, this approach both models the uncertainty of our estimates and provides an excellent balance between exploring new options and exploiting what we already know.

Listing 7.5 Pseudo-code for the Bayesian bandit strategy

Figure 7.4 Modeling the knowledge about the payout rate of three bandits using the Bayesian bandit method. The mean payout rates for arms 1, 2, and 3 are 0.1, 0.3, and 0.4, respectively. Bandit 1 has a lower mean but a much wider variance. Bandit 2 has a higher mean and smaller variance. Bandit 3 has an even higher mean and a smaller variance. In order to choose the bandit to play, each distribution is sampled, and the bandit corresponding to the distribution from which the highest sample was drawn will be picked. After selection, the bandit is played and the respective distribution is updated. This has the effect that even bandits with a low payout rate have a chance to redeem themselves if their mean rate is uncertain (high variance).

7.3 Bayesian bandits in the wild

Listing 7.6 A class encoding a one-armed bandit
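A minimal sketch of such a class, consistent with how it's used later (Bandit(0.1), and so on), follows; the method name pull and the use of the random module are assumptions rather than the book's exact code.

import random

class Bandit:
    def __init__(self, probability):
        self.probability = probability   # the true, unobservable payout probability

    def pull(self):
        # Returns 1 if this play pays out, 0 otherwise
        return 1 if random.random() < self.probability else 0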

def sample_distributions_and_choose(bandit_params):
    # Draw a sample from each bandit's Beta(alpha, beta) belief and pick the best
    sample_array = [np.random.beta(a, b) for (a, b) in bandit_params]   # numpy assumed imported as np
    return sample_array.index(max(sample_array))

Listing 7.8 Running the Bayes bandit

bandit_list = [Bandit(0.1), Bandit(0.3), Bandit(0.8)]
bandit_params = [(1, 1), (1, 1), (1, 1)]
x = np.linspace(0, 1, 100)
plt.plot(x, beta.pdf(x, 1, 1), label='Bandit 1')   # plots the uniform Beta(1,1) prior; scipy.stats.beta assumed imported

Listing 7.9 Plotting the initial priors for the Bayes bandit

Listing 7.10 Plotting a single execution of a Bayes bandit strategy

Here we use the

Listing 7.11 Plotting the posterior probability density functions

Figure 7.5 Probability density function over the beliefs in payout probabilities for each bandit. Note that the three plots are identical and that the area under the curve is equal to 1. This illustrates our belief that all payout probabilities are equally likely for each bandit. This will change as we observe the bandits in action.

plt.plot(x, beta.pdf(x, *bandit_params[0]), alpha=0.6, label='Bandit 1')   # assumed reconstruction: plots the updated Beta pdf for bandit 1

If you consider figure 7.6 and figure 7.7, you may be able to understand a little better what's happening. During the very early plays, because the distributions of beliefs are almost identical, the algorithm is equally likely to choose any of the bandits to play. This leads to a cumulative regret that accrues during these early plays.

Figure 7.7 Posterior probability distributions after a single run of 1,000 plays (bandits 1, 2, and 3). This represents our belief over the possible true distributions of the bandits, which have an unobservable true probability of payout equal to 0.1, 0.3, and 0.8, respectively.

Here you can see that the outcome from an identically initialized Bayesian bandit can be dramatically different for independent executions. The cumulative regret ranges from about 4 (the expected difference between the outcome and the optimal outcome is five payouts) to 18. In most cases, the

this scenario. But this is an especially good

about the bandit. Second, this has the knock-on effect that the overall regret will be higher than when working with bandits with higher probabilities. The reason for the inversion of the ordering is that this is such an early stage of the experiment.

Figure legends: bandit payout probabilities of (0.2, 0.5, 0.8), (0.3, 0.5, 0.7), (0.01, 0.05, 0.09), (0.02, 0.05, 0.08), and (0.04, 0.05, 0.06); number of bandits equal to 2, 4, 5, 8, 9, and 10.

7.4 A/B vs. the Bayesian bandit

So far in this chapter we've discussed several approaches to making the right choice. Much has been said in the community about multi-armed bandits, and there has been a lot of hype about them, with many seeing them as a no-brainer alternative to A/B testing. After all, if you can test your choices while performing optimization, doesn't that make more sense than waiting for statistical significance with an A/B test? Well, yes and no. As with all machine-learning techniques, there are trade-offs, summarized in the following comparison.

Bayesian bandit                                          A/B test
Continuous online optimization                           Single-shot optimization after run
Multiple variables                                       Single test variable
Confidence not explicitly measured                       Statistically significant result
Convergence related to the number of options,            Convergence related to the payout rate
payout rates, and difference                             and difference

John Myles White, Bandit Algorithms for Website Optimization (O'Reilly, 2012).
Richard Weber, "On the Gittins Index for Multiarmed Bandits," Annals of Applied Probability (1992): 1024-1033.
J. C. Gittins, Multi-armed Bandit Allocation Indices (John Wiley and Sons, 1989).
John Langford, "Contextual Bandits," Machine Learning (Theory), October 24, 2007.

Figure 7.13 A contextual bandit. The payout probability isn't fixed and is dependent on the context or situation. One way to think about this is that the player has assigned to them a vector of attributes at a given time, and, depending on those attributes, the payout probabilities of the machine will change: the vector of attributes assigned to a particular user changes the payout probabilities. The best strategy must minimize the cumulative regret.

7.5.2 Adversarial bandits

If you recall from previous sections, up until now we've made the implicit assumption that the payout distribution is always static. That is, these distributions don't change as we play the bandits. Solutions to the adversarial bandits problem don't make this assumption. In this formalism, at each step:

- The adversary selects a reward vector whose size is the number of bandits. This contains the rewards for each bandit at that step.
- The player, without knowledge of the adversary's selection, chooses a bandit to play based on their strategy.
- In the full-information game, the player then may see the entire reward vector. In the partial-information game, the player sees only the reward associated with their chosen bandit.

Play continues in this way over a fixed number of steps, and the player must maximize their total reward.

P. Auer, N. Cesa-Bianchi, Y. Freund, and Robert E. Schapire, "Gambling in a Rigged Casino: The Adversarial Multi-armed Bandit Problem," Proceedings of the 36th Annual Symposium on Foundations of Computer Science (IEEE, 1995): 322-31.

Figure 7.14 The adversarial bandit problem. In this variant of the MAB problem, solutions make no assumptions about the underlying distributions of the rewards. Instead, the problem is modeled as a game in which a vector of attributes chosen by the adversary changes the payout probabilities.

7.6 Summary

The future of the intelligent web

This chapter covers
- Summary and review
- Future applications of the intelligent web
- Social implications of the intelligent web

In this book, we've presented to you the current state of the intelligent web and walked you through the basics, teaching you what an intelligent algorithm is and how to evaluate one. Your eyes should now

Pedro Domingos, "A Few Useful Things to Know about Machine Learning," Communications of the ACM 55, no. 10 (October 2012): 78-87.

seem removed from what many consider an intelligent algorithm, we can't stress enough the importance to the enthusiast of having at least a passing interest in this area. Future practitioners in this space will absolutely need to know the magnitude, velocity, and availability of their real-time data.

We've covered what we consider some of the most pervasive topics in intelligent algorithms: extracting structure; providing recommendations; classification, click prediction, and deep learning; as well as selection and testing. These topics aren't comprehensive

Marc Weiser, "The Computer for the 21st Century," Scientific American, September 1, 1991: 66-75.
Nest Thermostat, January 1, 2015.

makes appointments, interacts with your smartphone, starts dinner, and activates the washing machine? This appears to be some

Douglas McIlwraith, "Wearable and Ambient Sensor Fusion for the Characterisation of Human Motion," International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2010): 505-510.
Julien Pansiot, Danail Stoyanov, Douglas McIlwraith, Benny Lo, and Guang-Zhong Yang, "Ambient and Wearable Sensor Fusion for Activity Recognition in Healthcare Monitoring Systems," IFMBE Proceedings of the 4th International Conference.

may have already seen an illustration of this in the film Minority Report, in which personalized ads are presented physically as people interact with their environment.

Andrew Orlowski, "Facebook Brings Creepy Minority Report-Style Ads One Step Closer," The Register, http://mng.bz/ML01.
Siraj Datoo, "This Recycling Bin Is Following You," Quartz, August 8, 2013, http://mng.bz/UUbt.
Joe Miller, "City of London Calls Halt to Smartphone Tracking Bins," BBC News, August 12, 2013, http://mng.bz/k56c.
Although, as we have seen, this is changing rapidly; consider Google Now and Apple's Siri, www.apple.com/uk/ios/siri/.

Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," Scientific American, May 1, 2001: 34-43.

We might be some way from the fluid and natural user experience with the web as described in the opening example of the aforementioned paper, but much research toward this is finding application in industry. What we can be sure of is that any attempt to provide richer semantics to the vast amount of information available on the web can only be a good thing for designers of intelligent algorithms, allowing us access to knowledge as

8.2 Social implications of the intelligent web

In the previous sections, we've provided several potential areas for the development of intelligent applications.

As you've learned in this book, intelligent applications are those that can change their behavior based on information. It follows, then, that we must have a mechanism for the capture and access of data. Because we're talking about web-scale processing, it stands to reason that we may need a system designed with the following in mind:

- Volume: Our system should be capable of dealing with web-scale data.
- Scalability: Our system should be configurable with changing load.
- Durability

A motivating example: showing ads online

paid for on a cost per mille (CPM) basis; that is, for every thousand ads shown, an amount is paid from the advertiser to the publisher. The latter is charged on a cost per click (CPC) basis, where payment is exchanged every time a consumer sees and subsequently clicks an ad.

The complexity arises in the process by which this all occurs. Rather than advertisers working with publishers directly, these

Figure A.1 Overview of a demand-side platform (DSP). DSPs use their knowledge

a user known to the DSP, or the exchange will pass its own identifier and the DSP will have to look up its own identifier accordingly. The reasons for this have to do with the security of the page and the scope of the identifiers; an exact explanation lies outside the scope of this example. Suffice it to say that the DSP can obtain an identifier in the required format to look up information regarding user behavior.

What information does the DSP store? Essentially, any interaction with associated users, sites, and ads. To illustrate, consider figure A.2.

Figure A.2 Overview of the log-processing architecture of a typical DSP. Web page interactions are recorded by the browser and sent back to the DSP's server(s) via HTTP. Depending on the interaction type, these will be written to different files and stored in different locations. Over time, these files are collated and imported to a database for downstream processing.

The server receiving these interactions will have to be configured to deal with a high connection load. Once interactions are received, a number of things need to happen. First, as you can see, depending on the interaction type, the event is logged into different folders. The reason has to do with priority: click logs equate to money, so minimizing the latency of this data makes good sense.

For the final step, where log files are sent on to the database, it makes sense to periodically finalize files and start new ones so that we can send the data as soon as possible. Consequently, files need to be continually created, finalized, and recycled while data is entering and flowing through the system.

At this stage, finalized data is shipped to the database for import. Here we must deal gracefully with the several possible failure cases. First, corrupt files must be ring-fenced for manual intervention. Processed files must be separated for archival, and transmission failures must be retried. It's imperative that data be received by the database in a consistent state, because without this, billing and reporting can't happen effectively.

Managing data collection at scale

What we've just described is an extremely naive approach to collecting data at scale.

Latency in this system is closely related to several aspects of its design. In order for an event to be shipped to the database, it must first have reached the file writer and been written to a file, and that file must be closed. When this happens will depend on the load and queuing capability of the receiving server as well as the maximum file size for a log file. After shipping, the file must again wait, to be imported by the database.

In terms of flexibility, there's only a single consumer of the data: the database. Multiple consumers would result in an increase in transmission bandwidth, because log files would have to be sent to all consumers. This would have an associated impact on the logic required to keep the data consistent. If consumers were allowed to be at different states of progress, enabling logic would have to be borne by the log-shipping code.

Hopefully, this simple example is illustrative enough to convince you that data capture at web scale isn't easy! This, of course, all depends on your application and its requirements, but striving to achieve the properties at the start of this appendix will lay the groundwork for the rest of your application. Each property has an associated knock-on effect on your intelligent algorithm and its deployment lifecycle. For example, if your training data reaches your algorithm with low latency, you can train and deploy more-relevant models more quickly. If access is flexible, multiple algorithms can consume the data in different ways in parallel. Wouldn't it be nice if there were a system that was designed and built for this purpose?

As it happens, there is a system built for this purpose! Apache Kafka is a distributed log-processing platform for dealing with high volumes at low latency. At its core, Kafka is a distributed cluster of brokers. Producers of data can publish messages to a topic (a stream of messages of a given type) that are then stored by the cluster's brokers and read by consumers.

Kafka's bundled Zookeeper instance is bound to port 2181. This is a centralized service that's used to share Kafka configuration (more on this later). To communicate with Kafka, you'll need bindings in your particular language of choice. Because we're using Python throughout this book, we'll use David Arthur's kafka-python to connect to our Kafka instance (v0.9.5). In order to run the examples in this appendix, we're going to assume that you've made this available to your favorite Python environment already.

David Arthur, kafka-python, https://github.com/dpkp/kafka-python.git.

Listing A.1 Kafka simple publish
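A minimal sketch of what this listing might contain, based on its annotations and the cluster example later in this appendix; it uses the old kafka-python (0.9.x) API named in the text, and the broker address localhost:9092 is an assumption.

from kafka import SimpleProducer, KafkaClient   # imports the relevant definitions from kafka-python

kafka = KafkaClient("localhost:9092")            # creates a KafkaClient object pointing to your local broker
producer = SimpleProducer(kafka)                 # creates a SimpleProducer used to publish messages

producer.send_messages("test", "Hello Kafka!")   # sends messages under the "test" topic
producer.send_messages("test", "Another message")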


Topics comprise multiple partitions. Each partition must be held in full on a single broker, but the partitions for a single topic can be spread over multiple brokers. This serves several purposes. First, it allows topics to scale beyond the size of a single server. Although each partition must physically fit on the machine that hosts it, the logical unit (the topic) can grow to be much larger than a single machine. It also provides a level of redundancy and parallelism. Duplicate partitions can be held, and multiple partitions can be read from and serviced by different machines. The number of partitions is configured per topic.

Apache Kafka Documentation, Apache Projects, http://kafka.apache.org/documentation.html.

Listing A.2 Kafka simple subscribe
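As with listing A.1, here is a minimal sketch consistent with the annotations and the discussion below (consumer group mygroup, topic test); the broker address is again an assumption.

from kafka import SimpleConsumer, KafkaClient         # imports from kafka-python

kafka = KafkaClient("localhost:9092")                  # creates a KafkaClient object
consumer = SimpleConsumer(kafka, "mygroup", "test")    # instantiates a SimpleConsumer for group "mygroup", topic "test"

for message in consumer:                               # prints each message as it arrives
    print(message)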

Figure A.3 Anatomy of a topic, taken from the Kafka documentation. Topics consist of a partitioned log, to which writes can only be appended (old entries at one end, new entries at the other). The partitions are immutable: they can't be changed once committed.

Here we create a SimpleConsumer, which takes as arguments a KafkaClient as well as two string arguments. The first is known as a consumer group, and the second relates to the topic the consumer is to consume from. The concept of a consumer group allows multiple consumers to coordinate, such that a single message published to a topic is only ever delivered to a single member of a consumer group. In this example, there's only a single consumer in our group, titled mygroup, and it will receive all messages destined for this group.

Listing A.3 Configuration changes in server.properties for broker 2

Note that because we haven't modified the Zookeeper details in server.properties, they will all communicate with the same Zookeeper instance. No further configuration is required to ensure that they work as part of the same cluster. You'll require some configuration to create a topic with a new replication level, however. Navigate to your Kafka folder, and input the following:

bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 3 --partitions 1 --topic test-replicated-topic

This will create a new topic called test-replicated-topic. This topic has one partition and a replication level of 3. To confirm this, issue the following command:

./bin/kafka-topics.sh --describe --zookeeper localhost:2181

All being well, you should see some output that looks remarkably like the following listing.

Listing A.4 Kafka topic summary

Topic:test                     PartitionCount:1    ReplicationFactor:1
    Topic: test                    Partition: 0    Leader: 0    Replicas: 0      Isr: 0
Topic:test-replicated-topic    PartitionCount:1    ReplicationFactor:3
    Topic: test-replicated-topic   Partition: 0    Leader: 0    Replicas: 0,1,2  Isr: 0,1,2

Listing A.5 Kafka publish to cluster
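Based on the discussion below (synchronous sends that block until the whole cluster acknowledges each message), the producer construction for this listing would have looked roughly like the following sketch; the exact constructor arguments are an assumption about the old kafka-python API.

from kafka import SimpleProducer, KafkaClient

kafka = KafkaClient("localhost:9092")
# Synchronous producer that blocks until every in-sync replica has acknowledged the write
producer = SimpleProducer(kafka, req_acks=SimpleProducer.ACK_AFTER_CLUSTER_COMMIT)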


producer.send_messages("test-replicated-topic","Hello Kafka Cluster!")

producer.send_messages("test-replicated-topic","Message to be replicated.")

producer.send_messages("test-replicated-topic","As is this!")

We provide an alternative construction of SimpleProducer. Here we specify synchronous communication, blocking until the entire cluster has acknowledged a new message before SimpleProducer returns.

Listing A.6 Kafka topic summary (dead replica)

kafka-python usage documentation, http://kafka-python.readthedocs.org/en/latest/usage.html.

These acknowledgment settings have no effect on the attempts made by the cluster to achieve consistency. They do, however, allow the producer to block on different events within the replication protocol and modify downstream behavior dependent on certain outcomes. How replica synchronization is achieved, even in the case

Jun Rao, "Intra-cluster Replication in Apache Kafka," February 2, 2013, http://mng.bz/9Z99.

Figure A.4 The different levels of acknowledgments in Kafka. From left to right, they offer increasing degrees of availability at the cost of acknowledgement latency.

Figure A.5 Kafka replication, reproduced from Rao. In this example, we have four brokers and one topic with two partitions. Broker 1 leads partition 1, and broker 4 leads partition 2; the remaining brokers hold follower copies of the logs.

In figure A.5, we present a Kafka cluster of four brokers. This cluster manages a single topic with two partitions. The replication protocol determines how the followers of each partition stay synchronized with their leader under normal operation.

Consumer groups, balancing, and ordering

In listing A.2, we introduced the concept of Kafka consumer groups. This abstraction allows Kafka to be used either as a queue, whereby multiple consumers process items from the head, with each message going to a single consumer, or as a publish-subscribe system, in which each message is broadcast to all consumers.

Recall that a single message published to a topic is guaranteed to be delivered only to a single consumer of a consumer group. So where we have one topic and one group, this emulates the behavior of a queue. Where each consumer is a member of its own group, this provides the same behavior as a publish-subscribe system: each

To illustrate this concept, we'll introduce a schema of data as illustrated in the following listing. This can also be found in the additional resources available at this book's website, should you wish to follow the example programmatically.

381094c0-9e45-424f-90b6-ad64b06cc184 2014-01-02 17:33:07 click

6615087e-ea19-492c-869c-28fc1fa77588 2014-01-02 17:33:20 view

d4889090-942b-4790-9831-362051b0847b 2014-01-02 17:34:01 adview

e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:35:06 adview

e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:35:30 click

e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:37:01 click

e6688db5-d31f-4ede-985a-115fb51836fb 2014-01-02 17:39:00 convert

ae6ae5c9-acb2-479b-adcb-0c29623d921b 2014-01-02 17:40:00 adview

1ac384c1-1b2d-4ed0-b467-7c90b7ac42d8 2014-01-02 17:40:01 adview

280bfa16-07ac-49ed-a1a5-9ab50a754027 2014-01-02 17:40:03 click

dda0e95d-9c30-4f60-bb6a-febf05526b83 2014-01-02 17:40:05 adview

8a1204f1-5076-4d4c-8b23-84c77ad541d8 2014-01-02 17:40:10 adview

3bdb8f17-11cc-49cb-94cf-75676be909b7 2014-01-02 17:40:11 adview

69b61156-6c31-4317-aec5-bd48908b4973 2014-01-02 17:40:13 adview

69722471-0532-4f29-b2b4-2f9007604e4f 2014-01-02 17:40:14 adview

00e5edf6-a483-48fa-82ed-fbfac8a6b1e6 2014-01-02 17:40:15 adview

Listing A.7 Click user data

Figure A.6 Our final Kafka cluster. In this example, we use a low-level producer to perform our own partitioning of the stream of data. This cluster is operating with three brokers, three partitions, and level-3 replication. Two consumer groups exist, one with a single consumer called group-1 and the other with three consumers, known as group-2.

Note in the data the adview, click, and convert events, all from the same user and within a short period of time.

9398d369-6382-4be0-97bc-182b3713745f 2014-01-02 17:40:17 convert

f40c1588-e4e1-4f7d-8ef5-5f76046886fb 2014-01-02 17:40:18 adview

54823527-fe62-4a81-8551-6282309b0a3f 2014-01-02 17:40:20 click

46d6f178-7c11-48c1-a1d7-f7152e7b2f1c 2014-01-02 17:40:26 adview

4c4e545b-d194-4531-962f-66e9d3b6116d 2014-01-02 17:41:00 convert

42b311f5-ba84-4666-a901-03063f7504a9 2014-01-02 17:41:01 adview

bfa28923-c358-4741-bcbf-ff99b640ee14 2014-01-02 17:42:06 adview

54c29b39-5640-49b8-b610-6f2e6dc6bd1b 2014-01-02 17:42:10 convert

edf6c5d2-1373-4dbb-8528-925d525b4a42 2014-01-02 17:43:03 click

f7f6752f-03bf-43f1-927c-8acdafd235e2 2014-01-02 17:43:11 adview

f4b7c0a6-b209-4cc4-b4e7-395489e0e724 2014-01-02 17:43:19 click

This is a fictitious and simplified dataset based on page interaction. Note that the schema has three distinct columns. The first is a uniquely identifying code, or globally unique identifier (GUID); the second is the time of the interaction; and the third is the interaction type. There are only three possible types of interaction: adview, which corresponds to the action that an ad has been seen by a user; click, which denotes a click by a given user; and convert, which records a conversion.

Listing A.8 Custom partitioning with a producer
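The sketch below is reconstructed from this listing's annotations and from the description of the partitioning scheme that follows (GUID modulo 3). The file name and the use of hash() to turn a GUID into a number are assumptions, and the low-level ProduceRequest calls follow the old kafka-python usage documentation cited earlier rather than the book's exact code.

from kafka import KafkaClient
from kafka.common import ProduceRequest
from kafka.protocol import create_message

kafka = KafkaClient("localhost:9092")

with open("user_interactions.tsv") as f:       # opens the data file that contains the stream data
    for line in f:                              # iterates over the data
        guid = line.split("\t")[0]              # splits by tab; the first column is the user's GUID
        partition = hash(guid) % 3              # turns the GUID into a number; modulo 3 picks the partition
        req = ProduceRequest(topic="click-streams",
                             partition=partition,
                             messages=[create_message(line.strip())])
        # Sends this data, with responses stored in the resps object
        resps = kafka.send_produce_request(payloads=[req], fail_on_error=True)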


This toy producer partitions by user GUID. In a production implementation of such a system, the click stream would come directly from a web browser or associated code, but for the purposes of illustration this data comes directly from a file.

If your broker from listing A.6 is still dead, restart it now. If you wish to run this example, you'll need to create the click-streams topic with replication level 3. Make sure you do this before you run the code in listing A.8; otherwise, you'll end up automatically creating a topic with too few partitions:

bin/kafka-topics.sh --create \
    --zookeeper localhost:2181 \
    --replication-factor 3 \
    --partitions 3 \
    --topic click-streams

Recall that we wish to ensure that all click streams from the same user end up at the same partition. Working through listing A.8, you can see that each GUID has the modulo operator applied to it (modulo 3), and the resulting number is used to determine the partition to publish to. Thus, all entries for the same user result in the same output (modulo 3) and will end up at the same partition! This partition isn't exclusive for that user, as other users' data will also end up there, but this isn't a requirement for the system, because the number of partitions is much less than the number of users.

In group-1, consumer A is the only member and thus will receive all the data for the topic click-streams. It does this by periodic communication with the leaders of all partitions of the topic. Because this consumer acquires data from all partitions, for any given user, all data will end up with a single consumer.

Perhaps more interesting is group-2, containing consumers B, C, and D. Recall that the consumer group guarantee is such that a given message is to be received by only a single consumer within a consumer group and that this is achieved through the assignment of a partition to a consumer. In this case, the leader of each partition communicates with a single consumer. Broker 1 (leading partition 1) is assigned to consumer B, broker 2 to consumer C, and broker 3 to consumer D. Note that this is the maximum number of consumers possible in this example because the number of consumers in a group can't exceed the number of partitions in the topic it's consuming from.

Because of our custom partitioning strategy, it's now guaranteed that a single consumer consumes the messages from a given user. It would be possible to do things like click de-duplication (to remove erroneous clicks and work out how much the DSP should bill its client) at the consumer, because all relevant data is processed by that consumer and also because the ordering guarantees provided by Kafka ensure that events that are published in order at a partition will be retrieved in that same order.

There's also an intermediate example, which we haven't illustrated here. In the case where we have two consumers and three partitions, one of the consumers will be responsible for two of the partitions. This assignment will be random, and it does impact the load-balance on the consumers; but it again ensures that a single consumer obtains all the data for a given user.

Evaluating Kafka: data collection at scale

Thus far, we've presented Kafka along with the kafka-python client, which can be used to communicate with your cluster via Python. We present this as an alternative to a bespoke log-shipping service, which, although simple to understand, has been demonstrated to be non-optimal for our purposes.

copies. Again, although not impossible, this adds a significant amount of complexity to a file-management service.

In our naive approach, latency isn't a tunable parameter. When an event occurs, we must wait until it has been logged, that file has been closed, and the data eventually reaches its destination before we can consider that event to be processed. In contrast, if we look at Kafka, both read and write latency can be tuned.

Kafka design patterns

Kafka plus Storm

Kafka alone can't perform any transformations to the data it ingests. If we wish to perform in-stream transformations, we must couple Kafka with an additional technology, such as Storm. Storm is a distributed, real-time system for processing stream data. This allows us to process data in arbitrary and complex ways through the use of several simple literals, two of which are the spout and the bolt. The spout acts as a source of data, whereas the bolt is a processor of data that can consume a number of streams and do some processing, possibly emitting a new stream. These literals, when combined with several ways to group and relay data between stages (shuffle randomly to the next stage, partition by a particular field, send to all, and so on), provide a powerful way to process data in flight. Such combinations are known as topologies.

For example, using the combination of Kafka and Storm, it would be possible to remove duplicate clicks from our click-streams topic, as mentioned earlier. This would be achieved using a Kafka spout, such as storm-kafka, which extracts data from Kafka and acts as a data source for a topology. Figure A.7 shows an example.

In this configuration, we're using the power of Kafka plus Storm to deduplicate clicks in our click stream that occur within a short time of each other.

Field grouping (GUID)

bolt

bolt

spout

spout

Figure A.7In this example, a consumer group of spouts is consuming from the Kafka

cluster that has been partitioned by GUID.

This ensures that each consumer spout

receives all the data for a given user. Thr

ough Storms field grouping, we can also

partition data on GUID as we send this data to one of several deduplication bolts that

operate on the incoming clicks. Field groupi

ng again on GUID, we can send these to

one of several producers that will write these back into a new, deduplicated topic,

again partitioned by GUID.

PPENDIX

Capturing data on the web

The consumer group illustrated consists of Storm consumer spouts, which pull data from Kafka in the same way as any other consumer but forward information in a way that's configurable to Storm. In this case, we use the Storm field grouping on GUID to pass the data on to three bolts that will perform the deduplication. Much in the same way that Kafka works, this grouping ensures that all the data for a given user will end up at a single bolt. This is necessary to enable us to perform the deduplication. The output from each of the bolts is a stream minus the duplicate data, and in this case we choose to use the field grouping again to send the data to the producers. This keeps per-user event ordering within a given producer. The producers then send messages back into the cluster, partitioning by GUID into a new topic, which will contain the deduplicated data. Because the data is processed in the order in which it arrives, per-user deduplicated data should arrive at its particular partition in more or less the same order as the duplicate data.

You'll notice that we say "more or less." This doesn't sound very strict! Storm allows for several levels of guarantees with respect to ordering. In its simplest mode, messages may be dropped, or even replayed if the message times out, creating multiple entries in the output. For exactly-once semantics, Storm Trident must be used, which is an abstraction that provides enhanced guarantees on topology execution. Using this abstraction, we could guarantee that our new deduplicated topic would be in exactly the same order as the original, minus the duplicate clicks.

Note the scalability here. Each of the four

we need to present the data held in Kafka in a queryable manner, and to illustrate this we've chosen Hadoop (with some other enabling technologies). Apache Hadoop (https://hadoop.apache.org) is a software framework that allows large datasets to be processed in parallel across many machines. It uses a simple processing paradigm known as MapReduce to perform computation in a parallel fashion much more quickly than could be achieved on a single machine. Hadoop is an extremely powerful framework, and we urge you to refer to the associated literature to learn more.

At the time of writing, Camus is being phased out and replaced by a next-generation project called Gobblin (https://github.com/linkedin/gobblin). This extends Camus and unifies it with several other LinkedIn data-ingestion projects.

Figure A.8 Overview of Kafka plus Hadoop integration. Camus is used to generate MapReduce jobs, which extract data from Kafka in parallel, placing the data in HDFS. Camus jobs must be executed periodically; hence the scheduler. High-level languages such as Hive and Pig can then be used to query the imported data.

With data residing in HDFS, it will now be possible to run MapReduce jobs or register the data with a higher-level query language, such as Apache Hive (https://hive.apache.org) or Pig (https://pig.apache.org). Such languages provide a familiar programming paradigm for analysts running ad hoc queries over the imported data.


Algorithms of the Intelligent Web, Second Edition
Douglas G. McIlwraith, Haralambos Marmanis, Dmitry Babenko

Valuable insights are buried in the tracks web users leave as they navigate pages and applications. You can uncover them by using intelligent algorithms like the ones that have earned Facebook, Google, and Twitter a place among the giants of web data pattern extraction. Algorithms of the Intelligent Web teaches you how to create machine learning applications that crunch and wrangle data collected from users, web applications, and website logs. In this totally revised edition, you'll look at intelligent algorithms that extract real value from data. Key machine learning concepts are explained with code examples in Python's scikit-learn. The book walks you through algorithms to capture, store, and structure data streams coming from the web. You'll explore recommendation engines and dive into classification via statistical algorithms and neural networks.

- Introduction to machine learning
- Extracting structure from data
- Sample code in Python

The authors are machine learning experts and practitioners working in areas such as online advertising and the application of machine learning techniques to industrial solutions.

"Covers the most essential areas of machine learning application in the real world. A great hands-on approach." (Radha Ranjan Madhav, Amazon)
With further praise from reviewers including Marius Butuc (Shopify) and Dike E. Kalu (Fara Frica).

WEB DEVELOPMENT / PYTHON

To download their free eBook in PDF, ePub, and Kindle formats, owners of this book should visit www.manning.com/books/algorithms-of-the-intelligent-web-second-edition.