Continuing from my previous post on *Analytics in a Big Data World*!

**Chapter 3: Predictive analytics**

Here is an example of why I am less than enthusiastic about this book. Chapter 3 begins with: “Two types of predictive analytics can be distinguished: regression and classification. In regression, the target variable is continuous.” Well, no. That is true of linear regression but not logistic regression. But the author clearly knows this, since he talks about logistic regression later. This is just sloppy writing at its most obvious.

In response modeling, the author makes the distinction between gross response (customers who purchase after receiving the marketing message) and net response (customers who purchase because they received the message). He provides a simple definition of Customer Lifetime Value and goes back to his favorite application of credit risk by discussing loss given default and the need to “check the robustness and stability of the target definition” by doing, for instance, roll rate analysis. He then quickly defines linear and logistic regression and drops names like the Bernoulli distribution (“the errors/target are not normally distributed but follow a Bernoulli distribution.”)
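Since the book shows no code, here is a minimal sketch (my own illustration, not the author's) of what "the target follows a Bernoulli distribution" means in practice: logistic regression fit by gradient descent on the Bernoulli negative log-likelihood, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, binary target drawn from a Bernoulli distribution
X = rng.normal(size=200)
true_w, true_b = 2.0, -0.5
p = 1.0 / (1.0 + np.exp(-(true_w * X + true_b)))
y = rng.binomial(1, p)

# Fit w, b by gradient descent on the Bernoulli negative log-likelihood
w, b, lr = 0.0, 0.0, 0.5
for _ in range(5000):
    pred = 1.0 / (1.0 + np.exp(-(w * X + b)))
    w -= lr * np.mean((pred - y) * X)
    b -= lr * np.mean(pred - y)

print(round(w, 2), round(b, 2))  # recovers values near the true 2.0 and -0.5
```

The point is exactly the one the author name-drops: the loss being minimized comes from the Bernoulli likelihood, not from normally distributed errors as in linear regression.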

Moving on to decision trees, he has some good graphs and examples for this topic, which lends itself well to visual aids. Splitting of the tree is determined using criteria such as Gini’s criterion or information gain for discrete variables, or the mean squared error for continuous variables. Another way to measure the quality of splits is to calculate an **F-statistic**, where good splits have high F-values.
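To make the splitting criteria concrete, here is a small sketch (mine, not the book's) of how a candidate split is scored by the decrease in Gini impurity:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def split_gain(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
print(split_gain(parent, [1, 1, 1], [0, 0, 0]))  # perfect split: gain = 0.5
print(split_gain(parent, [1, 0], [1, 1, 0, 0]))  # children mirror parent mix: gain = 0.0
```

The tree picks, at each node, the split with the largest gain; information gain works the same way with entropy in place of Gini impurity.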

Then, neural networks, where I found the author’s description likely unintelligible for the everyday scientific reader this book supposedly targets, though perhaps I am underestimating his readers. There are many superior introductory treatments of the topic, including for instance this excellent post on the blog towardsdatascience.com. The two treatments (the book’s and the blog post’s) are so vastly different in quality, with the blog post being far better, that it is almost as if they come from two different planets. The blog post has examples in Python too. And if you have a lot of time to read, you can read this book too, or that one, although they are a bit old.

Because neural network modeling is known as a black-box technique, meaning the manager doesn’t know what happens inside the neural network to motivate the final prediction, it is good to extract rules from the model to make it more understandable for the manager. Rules should be evaluated in terms of **accuracy, conciseness and fidelity** (fidelity being the ratio of correct predictions to incorrect predictions, i.e., the ratio of the sum of the diagonal elements to the sum of the off-diagonal elements of the confusion matrix).
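Taking that parenthetical definition literally, the computation from a confusion matrix looks like this (the matrix values are hypothetical, comparing the extracted rules' predictions against the network's on the same cases):

```python
import numpy as np

# Hypothetical confusion matrix: rows = extracted-rule predictions,
# columns = neural network predictions, counted over the same cases.
M = np.array([[40,  5],
              [ 3, 52]])

# Fidelity as described in the text: diagonal (agreements) over
# off-diagonal (disagreements).
diag = np.trace(M)
off_diag = M.sum() - diag
print(diag / off_diag)  # 92 / 8 = 11.5
```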

Neural networks have two shortcomings: the objective function is nonconvex and the computational effort needed to tune the number of hidden neurons can be substantial. The author goes on to describe **support vector machines** as a way to address those shortcomings. SVM problems are about finding a hyperplane separating one class of points from the other (for instance, the people who pay back their loans vs. the people who don’t). Of course, perfect separation between the payers and the defaulters may not be possible, and we need to account for misclassification. The algorithm uses some transformation of the data that doesn’t need to be known exactly; instead, what needs to be known is the inner product of the transformed data points, given by the kernel function. For a good introduction to SVM, I recommend this tutorial.
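To illustrate the kernel idea, here is a minimal RBF (Gaussian) kernel computation in NumPy. The RBF kernel is my choice of example, not one the book singles out; the point is that the kernel matrix is computed from inner products and distances in the original space, without ever forming the transformation itself:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2), computed without forming phi(x)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0],
              [1.0, 0.0]])
K = rbf_kernel(X, X)
print(K)  # diagonal is 1 (distance 0); off-diagonal is exp(-1), about 0.368
```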

The author then moves on to **ensemble methods**, that is, collections of models. Predictions are then made using, for instance, majority voting. The author goes very quickly over bagging, boosting and random forests. The paragraph about bagging starts with: “Bagging (bootstrap aggregating) starts by taking B bootstraps from the underlying sample.” Well, that explains everything then. In the next sentence, the author bothers to add “Note that a bootstrap is a sample with replacement (see section on evaluating predictive models)” [that section is later in the book]. The whole book is like that. More linear writing would have served the author well.
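For what the quoted sentence leaves unexplained, here is a small sketch (my illustration) of taking B bootstraps and combining predictions by majority vote:

```python
import random

random.seed(42)

data = list(range(10))  # stand-in for the underlying training sample

# "Taking B bootstraps" = drawing B samples of the same size, with replacement
B = 3
bootstraps = [[random.choice(data) for _ in data] for _ in range(B)]
for b in bootstraps:
    print(sorted(b))  # duplicates appear; some original points are absent

# One model is fit per bootstrap; prediction by majority voting across the
# B models (the votes below are hypothetical)
votes = [1, 0, 1]
prediction = max(set(votes), key=votes.count)
print(prediction)  # 1
```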

At least the boosting section is well explained. “Boosting works by estimating multiple models using a weighted sample of the data. Starting from uniform weights, boosting will iteratively reweight the data according to the classification error, whereby misclassified cases get higher weights. The idea here is that difficult observations should get more attention.” One of the first boosting algorithms is known as AdaBoost (short for adaptive boosting).
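A single reweighting round can be sketched as follows; this is my own toy example using the standard AdaBoost update, which the book doesn't spell out:

```python
import numpy as np

# One boosting round: reweight cases by classification error (AdaBoost-style).
y    = np.array([ 1,  1, -1, -1])   # true labels
pred = np.array([ 1, -1, -1, -1])   # a weak model misclassifies case 1
w = np.full(4, 0.25)                # start from uniform weights

err = w[pred != y].sum()                 # weighted error = 0.25
alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak model in the ensemble
w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct cases
w = w / w.sum()                          # renormalize
print(np.round(w, 3))  # the misclassified case now carries weight 0.5
```

The next weak model is then fit on this reweighted sample, so the "difficult observation" gets the extra attention the quote describes.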

Another ensemble method is **random forests**, i.e., a collection of decision trees. This was recently extended into rotation forests, which combine random forests with principal component analysis (not something the author bothers to define, but it is a transformation that extracts uncorrelated components from a set of possibly correlated variables).
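Since the author doesn't define principal component analysis, here is a minimal sketch of what "extracting uncorrelated components" means, on two synthetic, strongly correlated variables:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strongly correlated variables
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])
print(round(np.corrcoef(x1, x2)[0, 1], 2))  # close to 1: highly correlated

# PCA: eigendecomposition of the covariance matrix of the centered data
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
components = Xc @ eigvecs  # project onto the eigenvectors

corr = np.corrcoef(components, rowvar=False)
print(round(corr[0, 1], 6))  # ~0: the components are uncorrelated
```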

Then the author goes over multiclass classification techniques: multiclass logistic regression, multiclass decision trees, multiclass neural networks and multiclass support vector machines.

Mind you, by that point in the book, we are only at p.71, so that gives you an idea of how cursory the treatment of everything is. But if you want to learn names of data science techniques you can learn about later, it is not a bad book. Maybe a little expensive but not bad.

Then we move on to evaluating predictive models. This is where the author talks about splitting the data into **training and testing sets**. I think the author’s statement that “note that in case of decision trees or neural networks, the validation sample should be part of the training sample because it is actively being used during model development” is not completely true: you create a model using the training sample, and if there are two models you are hesitating between (say, with and without a particular independent variable), you take a piece of the testing set, call it the **validation set**, try your two models on it, pick the better one, and then evaluate it on the remaining part of the testing set (the “true testing set”). You didn’t know ahead of time that you would hesitate between two models, so you had your testing set ready, but you then had to split it between a validation set and a testing set to be able to decide which model to keep. So in that sense, the part of the initial testing set that became the validation set is part of the training set, because it helps design the model, but I am not sure the reader will pick up on that. Of course, if you know in advance you will have multiple models to evaluate, you can split into training/validation/testing sets before you even start. The author also explains cross-validation, although the explanation in the *Analytics Edge* edX course is better.
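The three-way split described above can be sketched as follows (the sizes and the 60/20/20 proportions are my arbitrary choices):

```python
import random

random.seed(0)

indices = list(range(1000))
random.shuffle(indices)

# If you know up front you'll compare several models, split three ways:
train = indices[:600]           # fit each candidate model
validation = indices[600:800]   # compare candidates and pick the best one
test = indices[800:]            # estimate the chosen model's performance, once

print(len(train), len(validation), len(test))  # 600 200 200
```

The key discipline is that the test set is touched exactly once, after all model choices have been made; anything used to choose between models belongs, conceptually, with the training data.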

Performance measures for classification models include classification **accuracy, sensitivity and specificity** (just like I tell my students in class). Classification error is one minus the accuracy. The author also discusses the Receiver Operating Characteristic (**ROC**) curve and the Area under the Curve (**AUC**), again better explained in the *Analytics Edge* edX course. A good model generally yields an AUC of 0.8 or higher.

Another important performance metric is the **lift curve**. The data set is sorted into deciles, from low score to high score of the dependent variable (what we are trying to predict). The lift curve computes, for each decile, the ratio of the response rate in that decile according to our model to the response rate of the dependent variable in the entire population. If the model isn’t better than random guessing, the lift is 1. The **cumulative accuracy profile** (CAP) computes the cumulative percentage of what we are trying to predict (the “1”s in a binary dependent variable) for each decile. The CAP curve can then be summarized by an accuracy ratio or Gini coefficient.
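A quick sketch of a decile lift computation on synthetic scores (my illustration; I sort from high score to low, the usual convention when reporting lift for the top deciles):

```python
import numpy as np

rng = np.random.default_rng(7)

n = 1000
score = rng.uniform(size=n)
# Synthetic target: higher-scored cases are more likely to be a "1"
y = rng.binomial(1, score)

order = np.argsort(score)[::-1]          # sort from high score to low score
deciles = np.array_split(y[order], 10)   # ten groups of 100 cases each

base_rate = y.mean()
lift = [d.mean() / base_rate for d in deciles]
print([round(v, 2) for v in lift])  # well above 1 at the top, below 1 at the bottom
```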

Other performance metrics are the Kolmogorov-Smirnov distance (maximum distance between the cumulative score distribution of the good vs the bad customers in credit scoring), the Mahalanobis distance between score distributions and the divergence metric.
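The Kolmogorov-Smirnov distance is just the largest gap between two empirical CDFs; here is a sketch on synthetic good/bad score distributions (the distributions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic credit scores: bad customers score lower on average than good ones
good = rng.normal(loc=0.7, scale=0.15, size=500)
bad = rng.normal(loc=0.4, scale=0.15, size=500)

# KS distance: maximum gap between the two empirical CDFs over a score grid
grid = np.sort(np.concatenate([good, bad]))
cdf_good = np.searchsorted(np.sort(good), grid, side="right") / len(good)
cdf_bad = np.searchsorted(np.sort(bad), grid, side="right") / len(bad)
ks = np.max(np.abs(cdf_good - cdf_bad))
print(round(ks, 2))  # large gap: the scores separate goods from bads well
```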

Performance of regression models is evaluated using R^2, Mean Squared Error (MSE) and Mean Absolute Deviation (MAD).
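All three are one-liners; a quick sketch on made-up numbers:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # actual values
y_hat = np.array([2.5, 5.5, 6.5, 9.5])   # model predictions

mse = np.mean((y - y_hat) ** 2)           # Mean Squared Error
mad = np.mean(np.abs(y - y_hat))          # Mean Absolute Deviation
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                  # R^2: variance explained by the model

print(mse, mad, r2)  # 0.25 0.5 0.95
```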

Coming up: Chapter 4 on Descriptive Analytics in my next post.
