Chapter 2. Inference

"I can see nothing," said I, handing it back to my friend.

"On the contrary, Watson, you can see everything. You fail, however, to reason from what you see. You are too timid in drawing your inferences."

  --Sir Arthur Conan Doyle, The Adventure of the Blue Carbuncle

In the previous chapter, we introduced a variety of numerical and visual approaches to understanding the normal distribution. We discussed descriptive statistics, such as the mean and standard deviation, and how they can be used to summarize large amounts of data succinctly.

A dataset is usually a sample of some larger population. Sometimes, this population is too large to be measured in its entirety. Sometimes, it is intrinsically unmeasurable, either because it is infinite in size or because it cannot otherwise be accessed directly. In either case, we are forced to generalize from the data that we have.

In this chapter, we consider statistical inference: how we can go beyond simply describing the samples of data and instead describe the population from which they were sampled. We'll look in detail at how confident we can be about the inferences we make from the samples of data. We'll cover hypothesis testing: a robust approach to data analysis that puts the science in data science. We'll also implement an interactive web page with ClojureScript to simulate the relationship between samples and the population from which they are taken.

To help illustrate the principles, we'll invent a fictional company, AcmeContent, that has recently hired us as data scientists.