I bought this book to see how practitioners implement analytics in real life, where the data sets are usually very large, but the book does not cover Big Data at all. It is just something in the title to draw sales. The book, in fact, aims at being some sort of reference book (no exercises, no code, no data sets), but the concepts are presented so quickly and defined so perfunctorily as to make the book almost useless, except to name-drop key concepts that the interested reader can then go and look up online. Based on my reading of the beginning of the book, I expected the author to be a self-made consultant who had written it to market his consulting services, and I was surprised to find out he is a faculty member. This is definitely not geared toward teaching analytics to his audience: when you teach something, you make sure to define all the terms clearly when you first introduce them and increase the complexity of the material as you go along. This is instead a reference textbook, but it is too thin to do a good job of it.

However, there are many nuggets of wisdom in the text, so it is still useful. In my opinion, the intended audience for this book consists of business managers of data scientists who want to know what names to throw around when they talk with their subordinates and have a one-line idea of what each topic is about. (And I don't always agree with his one-line descriptions of concepts. For instance, he says clustering is part of descriptive analytics, whereas for me descriptive analytics is strictly visualizations and dashboards, that sort of thing. But the equations he provides for the quantities he defines are correct.) Although the book is too vague to be used on its own and doesn't involve any computer code, the author clearly knows his topic. There is value in knowing those topics, so here is a list of ideas and concepts I found useful and some quick explanations of them (not necessarily from the book).

**Chapter 2 Data collection, sampling and preprocessing**

I liked that he talked about checking the signs of the coefficients in a regression model as an example of expert-based validation ("do the signs make sense?") because that's something I always tell my students to do. He also has a great example of sample bias in credit scoring: we don't know whether the people who were denied credit would have paid the loan back, and it is important to remember that the people who were refused credit before the analytical approach was implemented were refused under some company policy, although we don't know what it was.

For outliers the author makes the distinction between valid and invalid observations and then goes on to quantify outliers with box plots and z-scores (computed by subtracting the mean of the observations from a specific observation and then dividing by the standard deviation). He talks about truncation/capping/winsorizing valid outliers to bring them back into a more reasonable range, and then about standardizing data. (He has a tendency to drop names without defining them, or to define them only later.) Coarse categorization, i.e., placing instances into buckets, can for instance be done with equal-interval binning or equal-frequency binning, or through more sophisticated techniques such as the **chi-squared method**. (Wikipedia has a far clearer example of what that method does for categorical data than what the author writes in his book.)
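Since the book has no code, here is a minimal NumPy sketch (on made-up data; none of it is from the book) of the three techniques he names: z-scores for flagging outliers, winsorizing by percentile capping, and the two simple binning schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(40, 10, 200)
ages[:3] = [120, -5, 95]  # inject a few extreme values by hand

# z-score: (x - mean) / std; |z| > 3 is a common outlier flag
z = (ages - ages.mean()) / ages.std()
outliers = np.abs(z) > 3

# Winsorize/cap: clip values to the 1st and 99th percentiles
lo, hi = np.percentile(ages, [1, 99])
capped = np.clip(ages, lo, hi)

# Equal-interval binning: 5 buckets of equal width
width_edges = np.linspace(capped.min(), capped.max(), 6)
width_labels = np.digitize(capped, width_edges[1:-1])

# Equal-frequency binning: 5 buckets with roughly equal counts
freq_edges = np.percentile(capped, [20, 40, 60, 80])
freq_labels = np.digitize(capped, freq_edges)
```

The chi-squared (ChiMerge-style) method would instead merge adjacent bins whose class distributions are statistically indistinguishable, which takes more machinery than fits here.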

In linear regression with categorical variables, the computer must estimate a coefficient for each of a factor's levels minus one, which can introduce a lot of new variables. It would be more manageable to have only one coefficient per feature, with some number associated with each level, when possible (it will not always be, but it can be when a clear ordering exists, as with age groups). **Weight-of-evidence coding** is a transformation that achieves that goal and is widely used in credit scoring (the application most familiar to the book's author). For it to work well you have to bin the levels into a manageable number of categories. The WOE for a bin is then computed as ln(Event%/Non-Event%), to use the definition on a more helpful Medium.com page.
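A quick sketch of that WOE definition on toy credit-scoring data (the numbers are mine, not the book's; note that some sources flip the ratio to ln(Non-Event%/Event%), which only changes the sign):

```python
import numpy as np
import pandas as pd

# Toy data: age bins vs. default events (1 = default)
df = pd.DataFrame({
    "age_bin": ["18-25"] * 50 + ["26-40"] * 100 + ["41-65"] * 80,
    "default": [1] * 15 + [0] * 35 + [1] * 10 + [0] * 90 + [1] * 5 + [0] * 75,
})

events = df.groupby("age_bin")["default"].sum()
non_events = df.groupby("age_bin")["default"].count() - events

# WOE per bin: ln(share of all events in bin / share of all non-events in bin)
woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
```

Each level is then replaced by its bin's single WOE value, so the regression estimates one coefficient for the whole feature.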

To perform variable selection for an analytical model, we can consider measures of association between each variable and the target, such as the **Pearson correlation** (the correlation most engineers think of first), the **Fisher score** (which generalizes into analysis of variance, or ANOVA, which he assumes the reader knows about), the **Information Value** or IV (which uses the WOE weights mentioned earlier; a variable is considered a medium predictor for IV > 0.1 and a strong one for IV > 0.3) and **Cramer's V**, which is the square root of "the chi-square divided by the number of observations" and should be higher than 0.1.
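The last two are easy to compute by hand. A NumPy sketch on a made-up 3-bin contingency table (my numbers, not the book's); note the book's "chi-square over n" formula for Cramer's V holds for a binary target, since the general denominator n·min(rows−1, cols−1) reduces to n:

```python
import numpy as np

# Rows = bins of a candidate feature, columns = (non-event, event)
table = np.array([[35, 15], [90, 10], [75, 5]])
n = table.sum()

# Information Value: sum over bins of (Event% - Non-Event%) * WOE
events = table[:, 1] / table[:, 1].sum()
non_events = table[:, 0] / table[:, 0].sum()
woe = np.log(events / non_events)
iv = np.sum((events - non_events) * woe)

# Chi-square statistic from observed vs. expected counts
expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
chi2 = ((table - expected) ** 2 / expected).sum()

# Cramer's V for a binary target: sqrt(chi2 / n)
v = np.sqrt(chi2 / n)
```

On this toy table both measures clear the thresholds he quotes (IV > 0.3, V > 0.1), so the feature would be kept.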

Chapter 3 will be discussed in my next post!
