Data Science from Scratch: First Principles with Python, Joel Grus, O’Reilly, 2015

A clear explanation of some of the concepts central to data science

The book begins with the basics of the Python language in a chapter entitled “A Crash Course in Python.”  Grus recommends the Anaconda distribution of Python 2.7, as do I.  It is free, it includes Python, NumPy, SciPy, matplotlib, and IPython, all of which are used in the book, and it includes pandas, which we will use to handle financial data.  This is not the book I would recommend to a person learning Python for the first time, but the chapter establishes the style and notation used for the remainder of the book.

Chapters 4, 5, and 6 are quick reviews of linear algebra and the Python data structures used, frequentist statistics, and probability, respectively.  Chapter 7 discusses hypothesis and inference, and has a nice discussion of the beta distribution and its use in describing the “prior” distribution for Bayesian analysis.
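
To make the role of the beta distribution as a prior concrete, here is a minimal sketch in Python 3.  It is not the book's code; the function names and the coin-flip counts are hypothetical, chosen only to show how a Beta prior is updated by binomial data.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of the Beta(alpha, beta) distribution at x."""
    if x <= 0 or x >= 1:
        return 0.0
    # Normalizing constant B(alpha, beta), written with gamma functions.
    norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / norm

def update_prior(alpha, beta, successes, failures):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures

# Hypothetical data: start from a uniform prior, Beta(1, 1),
# then observe 7 successes and 3 failures.
post_alpha, post_beta = update_prior(1, 1, 7, 3)
posterior_mean = post_alpha / (post_alpha + post_beta)
```

The posterior is again a beta distribution, which is why the beta family is a convenient way to describe a prior belief about a probability.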

Chapter 8 gets into data science proper with a description of gradient descent, a method for finding the set of parameter values that maximizes (or minimizes) an objective function.  The “from scratch” approach shows all the details.
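
As an illustration of the idea, and not a reproduction of the book's code, the following sketch minimizes a sum of squares by repeatedly stepping against the gradient.  The function names, the learning rate, and the stopping tolerance are my own assumptions.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=1000, tolerance=1e-9):
    """Minimize a function, given a function that returns its gradient."""
    x = list(start)
    for _ in range(steps):
        grad = gradient(x)
        # Step each coordinate a small distance against the gradient.
        new_x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
        # Stop once the step has become negligibly small.
        if sum((a - b) ** 2 for a, b in zip(new_x, x)) < tolerance ** 2:
            return new_x
        x = new_x
    return x

def sum_of_squares_gradient(v):
    """Gradient of f(v) = sum of v_i squared, which has its minimum at the origin."""
    return [2 * vi for vi in v]

# Hypothetical starting point; the result should be very close to [0, 0].
minimum = gradient_descent(sum_of_squares_gradient, [3.0, -4.0])
```

To maximize instead of minimize, step in the direction of the gradient, or equivalently minimize the negative of the objective.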

Chapter 10, Working with Data, begins with methods for exploring the data: examining the distribution, plotting one-dimensional data, comparing multiple data series, normalizing, rescaling, and dimensionality reduction.

Chapter 11 introduces machine learning: models, overfitting, underfitting, the bias-variance tradeoff, and feature extraction.

Chapter 12 continues with k-nearest neighbors and the curse of dimensionality.
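
A k-nearest neighbors classifier is short enough to sketch here.  This is my own minimal Python 3 version, not the book's; the function names and the sample points are hypothetical.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(k, labeled_points, new_point):
    """Label new_point by majority vote among its k nearest labeled points."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical training data: (point, label) pairs in two clusters.
points = [((0, 0), "a"), ((0, 1), "a"),
          ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label_near_origin = knn_classify(3, points, (1, 0))
label_near_five = knn_classify(3, points, (5, 5))
```

The curse of dimensionality shows up in the distance function: as the number of coordinates grows, points tend to become far from one another, and the "nearest" neighbors are no longer very near.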

Chapter 13 illustrates naive Bayes to implement a spam filter.

Chapters 14 and 15 treat linear regression and multiple regression, fitting a model to data, and regularization to limit the tendency to overfit.

Chapter 16 explains the logistic function and logistic regression.  Examples look at measures of goodness of fit.  The concept of the support vector machine is explained, although the mathematics is beyond what can be built from scratch.
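
For readers who have not met the logistic function, here is a minimal sketch of it and of how a logistic regression model turns a weighted sum into a probability.  The names are hypothetical and the code is mine, not the book's.

```python
import math

def logistic(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_probability(features, weights):
    """Probability of the positive class for one observation."""
    score = sum(f * w for f, w in zip(features, weights))
    return logistic(score)

# With all weights at zero the model is undecided and returns 0.5.
undecided = predict_probability([1.0, 2.0], [0.0, 0.0])
```

The function maps any real-valued score into the interval from 0 to 1, which is what allows the output to be read as a probability.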

Chapter 17 has a nice explanation of decision trees (the models that result from rule-based trading system development on platforms such as AmiBroker).  Entropy, as it applies to information content, is well explained and used to partition data as the rules are created.  Random forests, one of the ensemble techniques for machine learning, are described in surprisingly concise code.
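
The entropy calculation that drives the partitioning can be sketched in a few lines of Python 3.  Again, this is my own illustration with hypothetical names and labels, not code from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

def partition_entropy(subsets):
    """Weighted average entropy of the non-empty subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * entropy(s) for s in subsets if s)

# Hypothetical labels: an evenly mixed set, and a split that separates it cleanly.
mixed = ["win", "win", "loss", "loss"]
before_split = entropy(mixed)
after_split = partition_entropy([["win", "win"], ["loss", "loss"]])
```

A decision tree chooses, at each node, the rule whose partition has the lowest entropy, that is, the rule that leaves the resulting subsets as pure as possible.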

Neural networks are described in Chapter 18, including code for a feed-forward network, trained by backpropagation, that identifies digits.  The interpretation of the weights of each of the nodes gives insight into the workings of neural networks.

The book continues with discussions of clustering, natural language processing, network analysis, recommender systems, and databases.

While this is not the best book from which to learn Python, machine learning, or model development, it is valuable in explaining each of these topics with fully disclosed logic and computer code.

This book gets five stars based on meeting its objectives — to clearly illustrate some of the central concepts of data science.