Here comes Chapter 4 of this book: Descriptive Analytics. In the analytical process, descriptive analytics usually comes before the predictive analytics we saw in the previous chapter, because we want to get a sense of the data we have before making predictions with it. As one of the many peculiarities of this book, though, the author covers descriptive analytics after predictive analytics. Perhaps this is because predictive analytics is more advanced and more interesting, and many readers don’t read books in full, so it may make sense to put more of the interesting stuff upfront. Anyway, this chapter focuses on association rules and the key metrics of support and confidence. Association rules are basically “if X happens then Y happens in this % of the cases”. The support of a rule is the percentage of total transactions in the data set that contain both X and Y. The rule “X implies Y” has confidence c if 100c% of the transactions in the data set D that contain X also contain Y (meaning Y indeed happens when X happens).
Then association rule mining is about:
- identifying all item sets that have support above a prespecified threshold called minsup (called “frequent item sets”)
- discovering all derived association rules having confidence above a prespecified threshold minconf.
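To make the two steps concrete, here is a minimal brute-force sketch in Python. The toy transactions and the function names are my own assumptions, not taken from the book, and the support of an item set is counted as the fraction of transactions that contain every item in it. Real implementations (Apriori, for instance) prune the search much more aggressively.

```python
from itertools import combinations

# Toy market-basket data, made up for illustration (not from the book).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among the transactions that contain x, the fraction that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def frequent_itemsets(transactions, minsup):
    """Step 1: every item set whose support is at least minsup (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= minsup:
                result[frozenset(combo)] = s
                found = True
        if not found:
            # No frequent set of this size, so no larger set can be frequent.
            break
    return result

def rules(transactions, minsup, minconf):
    """Step 2: rules X -> Y built from frequent item sets, kept if confidence >= minconf."""
    out = []
    for itemset in frequent_itemsets(transactions, minsup):
        for k in range(1, len(itemset)):
            for x in combinations(sorted(itemset), k):
                y = itemset - set(x)
                c = confidence(x, y, transactions)
                if c >= minconf:
                    out.append((frozenset(x), frozenset(y), c))
    return out

for x, y, c in rules(transactions, minsup=0.6, minconf=0.7):
    print(sorted(x), "->", sorted(y), "confidence", round(c, 2))
```

The thresholds 0.6 and 0.7 are arbitrary choices for this toy data; in practice minsup and minconf are tuned to the size of the data set and the number of rules one is willing to inspect.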
The lift, measured as the support of X and Y together divided by the product of the support of X and the support of Y, quantifies whether an association rule predicts what happened better than a baseline in which X and Y are unrelated. Wikipedia has a good example of lift for market basket analysis. It also explains that the lift is the ratio of the observed support to that expected if X and Y were independent. If the lift is greater than 1, X and Y occur together more often than independence would predict; if it is less than 1, the items tend to be substitutes for each other.
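A quick sketch of the lift calculation in Python, on toy transactions I made up (they are not from the book or from Wikipedia). Here the joint support is the fraction of transactions containing both X and Y, and the function names are my own.

```python
# Toy transactions, made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y, transactions):
    """Observed joint support divided by the support expected under independence."""
    joint = support(set(x) | set(y), transactions)
    expected = support(x, transactions) * support(y, transactions)
    return joint / expected

# Above 1: bought together more often than independence would suggest.
print("beer & diapers:", round(lift({"beer"}, {"diapers"}, transactions), 3))
# Below 1: bought together less often than independence would suggest.
print("bread & beer:", round(lift({"bread"}, {"beer"}, transactions), 3))
```

Note that lift is symmetric in X and Y, unlike confidence, so it says nothing about the direction of a rule.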
The author then moves on to discussing sequence rules. Association rules simply investigate which items appear together at the same time, whereas sequence rules study which items appear one after another over time. We create sequences from transaction data and then calculate support, although in the “X implies Y” rule we have to decide whether Y should happen immediately after X or at any later point in the sequence.
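The difference between the two readings of “Y after X” is easy to see in code. Below is a small sketch under my own assumptions: each sequence is a time-ordered list of single items (say, pages visited in a session), the data is made up, and the function names are hypothetical.

```python
# Toy time-ordered sequences, made up for illustration.
sequences = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A"],
    ["A", "B"],
]

def contains_rule(sequence, x, y, immediate):
    """True if y occurs after some occurrence of x in the sequence.

    immediate=True requires y to be the very next item after x;
    immediate=False accepts y at any later position.
    """
    for i, item in enumerate(sequence):
        if item != x:
            continue
        if immediate:
            if i + 1 < len(sequence) and sequence[i + 1] == y:
                return True
        elif y in sequence[i + 1:]:
            return True
    return False

def sequence_support(sequences, x, y, immediate=False):
    """Fraction of sequences in which the rule x -> y is observed."""
    hits = sum(contains_rule(s, x, y, immediate) for s in sequences)
    return hits / len(sequences)

print("A -> B, any later position:", sequence_support(sequences, "A", "B"))
print("A -> B, immediately after: ", sequence_support(sequences, "A", "B", immediate=True))
```

The stricter definition can only lower the support, so the same minsup threshold will let fewer rules through.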
We are then treated to a discussion of hierarchical clustering and various distance definitions (Euclidean and Manhattan for the distance between points, and the single, complete, average and centroid methods for the distance between clusters). Hierarchical clustering leads to a sequence of clustering schemes, starting with every instance being its own cluster and instances being progressively grouped together until there is only one big cluster. The next page is about k-means clustering, and that one page includes two big graphs, only one of which is about k-means clustering. I would have wanted to see some guidelines on how to select the number of clusters and fine-tune the parameters, since the outcome of k-means clustering can depend on the initialization, which is random.
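Here is a naive sketch of the bottom-up (agglomerative) version of hierarchical clustering, to show what “a sequence of clustering schemes” looks like. The points and function names are my own assumptions; I left out the centroid method for brevity, and ties between equally close pairs are broken by taking the first pair found.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cluster_distance(c1, c2, points, metric, linkage):
    """Distance between two clusters, given as lists of point indices."""
    d = [metric(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(d)
    if linkage == "complete":    # farthest pair of members
        return max(d)
    if linkage == "average":     # mean over all pairs of members
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, metric=euclidean, linkage="single"):
    """Return every clustering scheme, from n singletons down to one cluster."""
    clusters = [[i] for i in range(len(points))]
    schemes = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, metric, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        schemes.append([list(c) for c in clusters])
    return schemes

# Toy 2-D points, made up for illustration.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for scheme in agglomerate(points, metric=euclidean, linkage="single"):
    print(len(scheme), "clusters:", sorted(scheme))
```

Swapping in `manhattan` or another linkage changes which clusters get merged first, which is exactly why the choice of distance definition matters.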
Finally, the author discusses self-organizing maps (SOMs) in three pages. This was quite interesting, because SOMs are about visualizing high-dimensional data on a low-dimensional grid of neurons using a feedforward neural network with two layers. SOMs are then visualized using a U-matrix, which is the map of the grid of neurons color-coded by the average distance between each neuron and its neighbors, and component planes, which visualize the weights between each input variable and the output neurons. I enjoyed reading the Wikipedia page about SOMs. The blog towardsdatascience.com also has some interesting posts about them. These are written by data science students and thus may be more accessible to other students, but there is always the possibility that the authors misunderstood something or coded it wrong.
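To illustrate the two visualizations, here is a small Python sketch that computes a U-matrix and a component plane from a grid of neuron weights. It does not train a SOM: the weights are hypothetical values I picked by hand, the function names are mine, and I assume a rectangular grid where each neuron's neighbors are the cells directly above, below, left and right.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def u_matrix(weights):
    """Average distance between each neuron's weight vector and its grid neighbors."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(euclidean(weights[r][c], weights[nr][nc]))
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u

def component_plane(weights, k):
    """Weight of input variable k at every neuron of the grid."""
    return [[w[k] for w in row] for row in weights]

# Hypothetical 2x3 grid of 2-dimensional weight vectors (not a trained SOM).
weights = [
    [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)],
]

for row in u_matrix(weights):
    print([round(v, 2) for v in row])
print(component_plane(weights, 0))
```

High values in the U-matrix mark neurons that sit on a boundary between dissimilar regions of the map, which is what the color coding is meant to reveal.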
The following chapters are on survival analysis, social network analytics, “analytics: putting it all to work” and then example applications (although no code), but I will let you read those for yourself.
So again, the main interest of this book is to provide names of techniques and quick descriptions so that the interested reader can go and read more detailed descriptions of them online. It's like an appetizer to data science. The author has enough real-life analytics experience that the reader can trust he knows which techniques matter more than others. I do really wish his descriptions of the techniques were more extensive, but in data science it is difficult to hit the sweet spot between too short and too long. If you want to give more than cursory descriptions, you have to introduce a lot of notation and mathematical equations and give more detailed small examples throughout, which perhaps the author didn't have time for. It also occurs to me that the book may have had a different audience in mind. Also, I know authors don't necessarily pick their book title, but there shouldn't be "Big Data" in a title if you are never going to mention the challenges specific to Big Data. That's not his fault, though. All in all, maybe not a book I will consult frequently but not a book to send to the recycling bin either. It is a good introduction for MBAs who have to manage data scientists; those MBAs will definitely want this book on their shelves.