# Sampling Distributions

Populations follow **distributions**, which describe how likely each value is to occur. Not to be confused with an individual *sample*, a **sampling distribution** is the *distribution* of a particular statistic (mean, median, etc.) across many repeated *samples of the same size from the same population*.
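To make that concrete, here's a minimal sketch of building a sampling distribution of the mean. The uniform "population" and the variable names here are purely illustrative, not part of the coin example that follows:

```
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical "population": 100,000 uniform values between 0 and 10.
population = rng.uniform(0, 10, size=100_000)

# Draw 1,000 repeated samples of the same size (30) and record each sample mean.
sample_means = np.array([
    rng.choice(population, size=30).mean() for _ in range(1_000)
])

# The sample means cluster around the population mean (~5),
# and their spread is much narrower than the population's.
print(population.mean(), sample_means.mean(), sample_means.std())
```

The histogram of `sample_means` is the sampling distribution of the mean for samples of size 30.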

But what effect does sampling have on what you can infer about your population?

Does the ratio of “4 out of 5 dentists” who recommend something extrapolate to *all dentists*, or would it be more accurate to say “4 out of the 5 dentists that we asked”?

## Ex: Coin Flipping

There’s no real “population” of coin flips: you can perform the flip an infinite number of times, so it’d be impossible to gather all values and calculate the true population parameters. Nevertheless, we know that the distribution of a fair coin is 50/50.

But, for the sake of argument, say you only flipped one coin and recorded that it landed on heads. That heads result obviously doesn’t extrapolate to every other coin. Or to put it differently, we can’t then infer that the “probability of getting heads is 100%,” based on the result of our one coin flip.

### Simulating a Lot of Coin Flips

Here, let’s assign the value `1` to any coin that lands on heads, and `0` for tails.

```
import numpy as np
%pylab inline
```

```
Populating the interactive namespace from numpy and matplotlib
```

So we have a function that will give the sample proportion of a coin that was flipped `n` times:

```
def coinflip_prop(n):
    # Flip n fair coins (0 = tails, 1 = heads) and return the proportion of heads
    return np.random.randint(2, size=n).mean()
```

`coinflip_prop(10)`

```
0.40000000000000002
```

And build a way to repeat that sample multiple times:

```
def samples_of_coinflips(samples, nFlips):
    # Run the nFlips experiment `samples` times, collecting each sample proportion
    return np.array([coinflip_prop(nFlips) for x in range(samples)])
```

`samples_of_coinflips(5, 10)`

```
array([ 0.5, 0.6, 0.3, 0.5, 0.4])
```

### A Few Samples

We can take a look at what happens **to the distribution of results when we track the sample proportion**.

For 25 samples of “a fair coin flipped 10 times”, we get

```
tenFlips = samples_of_coinflips(25, 10)
_ = plt.hist(tenFlips, range=(0, 1))
tenFlips.mean(), tenFlips.std()
```

```
(0.436, 0.18736061485808589)
```

And again

```
tenFlips = samples_of_coinflips(25, 10)
_ = plt.hist(tenFlips, range=(0, 1))
tenFlips.mean(), tenFlips.std()
```

```
(0.47599999999999992, 0.110562199688682)
```

### Big Sample Size

Whereas if we did 1000 flips, 25 times, we see that the average value across all samples lands almost exactly on the value we expect, 0.5

```
thousandFlips = samples_of_coinflips(25, 1000)
_ = plt.hist(thousandFlips, range=(0, 1))
thousandFlips.mean(), thousandFlips.std()
```

```
(0.50087999999999999, 0.013865987162838432)
```

Thus, **the larger your sample size, the less variability in your sampling distribution**.
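That shrinking variability is quantifiable: the standard error of a sample proportion is √(p(1 − p)/n), which for a fair coin (p = 0.5) is 0.5/√n. Here's a quick check, sketched with numpy's `default_rng` API rather than the `np.random.randint` call used above:

```
import numpy as np

rng = np.random.default_rng(1)

results = {}
for n in (10, 100, 1000):
    # 5,000 repeated samples of n flips each; record every sample proportion
    props = rng.integers(2, size=(5_000, n)).mean(axis=1)
    theoretical = 0.5 / np.sqrt(n)  # sqrt(p * (1 - p) / n) with p = 0.5
    results[n] = (props.std(), theoretical)
    print(n, props.std(), theoretical)
```

The empirical spread tracks the 0.5/√n prediction closely: multiplying the sample size by 100 cuts the variability by a factor of 10.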

### A Lot of Samples

Conversely, if we only ever look at 10 flips at a time, but repeat that sample hundreds of thousands of times

```
tenFlips = samples_of_coinflips(250000, 10)
_ = plt.hist(tenFlips, range=(0, 1))
tenFlips.mean(), tenFlips.std()
```

```
(0.4997975999999999, 0.15809452563020646)
```

we find that the sampling distribution starts to look like a normal curve.

### A Lot of Both

Indeed, when we take a ton of samples, each sampling a lot of coin flips, we can see that **the distribution of the sample mean follows a normal distribution**.

```
import pandas as pd
hundredFlips = samples_of_coinflips(250000, 100)
pd.Series(hundredFlips).plot(kind='density')
hundredFlips.mean(), hundredFlips.std()
```

```
(0.50011791999999988, 0.049848230609256333)
```