Exploring Statistics and Probability.

Statistics and Probability are related fields in math:

- Statistics analyzes the frequency of **past events**, while Probability predicts the likelihood of **future events**.
- Statistics is **applied math**, while Probability is **theoretical math**.
- Statistics is inductive (given specific observations, here are general conclusions), while Probability is deductive (given general premises, here are specific conclusions).

The **data set**, \(X\), is usually an ordered set of \(n\) discrete numbers. The source of the data set is sometimes a continuous distribution of numbers.

$$ X = \{x_1, x_2, \dots, x_n\} $$

The **mode** is the most frequently occurring value in \(X\). Mode has the same root as *modus operandi*, and is about the method of operating, hence the "preferred" number. EG:

If \(X = \{5, 6, 8, 9, 9\}\), then 9 is the mode.

The **median** is the middle number of \(X\) when \(X\) is sorted, i.e. \(x_{(n+1)/2}\) when \(n\) is odd. Median has the root word *medius*, which means "middle". If a set has an even number of items, then the median is midway between the 2 middle numbers, \(x_{n/2}\) and \(x_{n/2+1}\). Use of the median is key for **robust statistics**, which are resistant to outliers in the data. EG:

If \(X = \{5, 6, 8, 9, 9\}\), then 8 is the median.
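As a quick check of the mode and median examples, here is a minimal sketch using Python's standard `statistics` module. The variable names and the extra even-length list are my own, chosen only for illustration.

```python
import statistics

data = [5, 6, 8, 9, 9]

mode = statistics.mode(data)      # most frequently occurring value
median = statistics.median(data)  # middle value of the sorted data

print(mode, median)  # 9 8

# With an even count, the median is midway between the 2 middle values.
even_median = statistics.median([5, 6, 8, 9])
print(even_median)  # 7.0
```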

The **arithmetic mean** (aka **average**), **\(\bar{x}\)**, is the sum of the data set divided by the count of numbers in the data set. When most people talk about the "mean" they are usually talking about the arithmetic mean (as opposed to other means like the harmonic mean or geometric mean). The connotation of the word "mean" is that it typifies or denotes the set. EG:

$$ \bar{x} = \frac{x_1 + \dots + x_n}{n} $$

If you have data set \(X\):

$$ X = \{5, 6, 8, 9, 9\} $$

then 7.4 is the arithmetic mean.

$$ \bar{x} = (5 + 6 + 8 + 9 + 9) / 5 $$

$$ \bar{x} = 37 / 5 $$

$$ \bar{x} = 7.4 $$
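The same arithmetic can be sketched in Python, once by hand and once with the standard `statistics` module. The variable names are hypothetical.

```python
import statistics

data = [5, 6, 8, 9, 9]

# Sum of the data set divided by the count of numbers in it.
mean_by_hand = sum(data) / len(data)

# The standard library gives the same answer.
mean_from_library = statistics.mean(data)

print(mean_by_hand, mean_from_library)  # 7.4 7.4
```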

The **Expected Value** (EV) is the long-run average value of a random variable if you could repeat the process an infinite number of times. This is akin to finding the "center of mass" of a physical object, in that the weight as a function of length differs between a ruler (EV equivalent to the mode, median, or mean, i.e. the middle) and a sword (EV closer to the hilt than the tip). Note that EV is sometimes denoted as μ or called the "mean" or "median", but sometimes those terms actually mean something else. On my site, when I say "mean" in a statistics or probability setting, I mean the arithmetic mean.
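For a discrete random variable, the EV can be sketched as a probability-weighted sum. The two dice below are my own illustrative assumptions (they are not from any data set on this page): the fair die plays the role of the ruler, and the loaded die plays the role of the sword, with its "weight" shifted toward one end.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)


# Hypothetical fair die: every face has probability 1/6.
fair_die = [(face, Fraction(1, 6)) for face in range(1, 7)]

# Hypothetical loaded die: 6 comes up half the time,
# and the other five faces share the remaining half equally.
loaded_die = [(face, Fraction(1, 10)) for face in range(1, 6)]
loaded_die.append((6, Fraction(1, 2)))

print(expected_value(fair_die))    # 7/2
print(expected_value(loaded_die))  # 9/2
```

Using `Fraction` keeps the arithmetic exact, so the results are not blurred by floating-point rounding.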

The **absolute deviation**, \(D_i\), of the ith number in \(X\) is the absolute difference between \(x_i\) and a selected point, \(m(X)\). The selected point is an EV (usually the median or mean).

$$ D_i = |x_i - m(X)| $$
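A minimal sketch of the formula above, applied to the example data set with both common choices of \(m(X)\). The helper name `absolute_deviations` is hypothetical.

```python
import statistics

data = [5, 6, 8, 9, 9]


def absolute_deviations(values, center):
    """Return |x_i - center| for each x_i in values."""
    return [abs(x - center) for x in values]


from_median = absolute_deviations(data, statistics.median(data))
from_mean = absolute_deviations(data, statistics.mean(data))

print(from_median)                       # [3, 2, 0, 1, 1]
print([round(d, 1) for d in from_mean])  # [2.4, 1.4, 0.6, 1.6, 1.6]
```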

The **standard deviation**, **σ**, is the "average" (strictly, the root mean square) of the differences of each number in the data set from the EV. The standard deviation has the same units as the data set. EG: Suppose that American men have an EV of 70"; if the σ were 3", then a lot of American men would be 67-73"; if the σ were 0", then all American men would be 70".

The **uncorrected sample standard deviation**, \(σ_n\), is "uncorrected" for bias, but becomes more accurate as the sample size grows.

$$ σ_n = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2} $$

If you have data set \(X\):

$$ X = \{5, 6, 8, 9, 9\} $$

then 1.62 is the uncorrected sample standard deviation.

Find the deviation from the arithmetic mean (7.4) for each number in the set:

$$ \{5-7.4, 6-7.4, 8-7.4, 9-7.4, 9-7.4\} $$

$$ \{-2.4, -1.4, 0.6, 1.6, 1.6\} $$

Square each deviation in the set:

$$ \{(-2.4)^2, (-1.4)^2, 0.6^2, 1.6^2, 1.6^2\} $$

$$ \{5.76, 1.96, 0.36, 2.56, 2.56\} $$

Sum the squared deviations in the set:

$$ 5.76 + 1.96 + 0.36 + 2.56 + 2.56 = 13.2 $$

Divide that sum by the count of the set:

$$ 13.2 / 5 = 2.64 $$

Find the square root of that quotient:

$$ \sqrt{2.64} \approx 1.62 $$
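The steps above can be sketched in Python. The hand-rolled version follows the formula term by term; `statistics.pstdev` from the standard library (which divides by \(n\)) is used only as a cross-check. The variable names are my own.

```python
import math
import statistics

data = [5, 6, 8, 9, 9]

mean = sum(data) / len(data)

# Deviation of each number from the mean, then squared.
squared_deviations = [(x - mean) ** 2 for x in data]

# Divide the sum by n (uncorrected), then take the square root.
uncorrected_sd = math.sqrt(sum(squared_deviations) / len(data))

print(round(uncorrected_sd, 4))            # 1.6248
print(round(statistics.pstdev(data), 4))   # 1.6248
```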

The **corrected sample standard deviation**, \(σ\), (aka sample standard deviation) removes some of the bias from the uncorrected sample standard deviation.

$$ σ = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} $$

If you have this data set:

$$ X = \{5, 6, 8, 9, 9\} $$

Sum the squared deviations from the arithmetic mean (7.4):

$$ 5.76 + 1.96 + 0.36 + 2.56 + 2.56 = 13.2 $$

Divide that sum by one less than the count of the set:

$$ 13.2 / 4 = 3.3 $$

Find the square root of that quotient:

$$ \sqrt{3.3} \approx 1.82 $$
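Here is the corrected version as a short Python sketch. The only change from the uncorrected formula is the \(n - 1\) divisor; `statistics.stdev` from the standard library uses that same divisor and serves as a cross-check. The variable names are hypothetical.

```python
import math
import statistics

data = [5, 6, 8, 9, 9]

mean = sum(data) / len(data)
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Divide by n - 1 instead of n, then take the square root.
corrected_sd = math.sqrt(sum_of_squares / (len(data) - 1))

print(round(corrected_sd, 4))            # 1.8166
print(round(statistics.stdev(data), 4))  # 1.8166
```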

The **variance**, \(σ^2\), is simply the square of the standard deviation.
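A small sketch of that relationship, using the standard `statistics` module: `pvariance` pairs with the uncorrected standard deviation (divide by \(n\)), and `variance` pairs with the corrected one (divide by \(n - 1\)). The variable names are my own.

```python
import statistics

data = [5, 6, 8, 9, 9]

uncorrected_variance = statistics.pvariance(data)  # divides by n
corrected_variance = statistics.variance(data)     # divides by n - 1

print(round(uncorrected_variance, 2), round(corrected_variance, 2))  # 2.64 3.3

# Each variance is the square of the matching standard deviation.
print(round(statistics.pstdev(data) ** 2, 2))  # 2.64
print(round(statistics.stdev(data) ** 2, 2))   # 3.3
```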

Large naturally occurring sets of **random numbers** tend to have a **normal distribution** (aka **Gaussian distribution**; the famous **bell curve**).

For a normal distribution, the numbers in a set will be distributed as follows:

- ~68.27% will fall within 1 standard deviation of the mean.
- ~95.45% will fall within 2 standard deviations of the mean.
- ~99.73% will fall within 3 standard deviations of the mean.
- ~99.99% will fall within 4 standard deviations of the mean.
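These percentages can be recomputed from the error function: for a normal distribution, the fraction within \(k\) standard deviations of the mean is \(\operatorname{erf}(k / \sqrt{2})\). A minimal sketch, with a hypothetical helper name:

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in range(1, 5):
    print(k, f"{fraction_within(k):.4%}")
# 1 68.2689%
# 2 95.4500%
# 3 99.7300%
# 4 99.9937%
```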

Probability distribution [W]. Has basic terms.

Unbiased estimation of standard deviation [W]

Multivariate normal distribution [W]

See also this nice chart from Statistics [W]. I want to do a variation that also shows a sideways box plot and confidence intervals from the EV.

The **standard error** of the mean relates the standard deviation to the sample size. That is, the greater the sample size, the smaller the standard error.

$$ \text{standard error} = \frac{σ}{\sqrt{n}} $$

If \(X = \{5, 6, 8, 9, 9\}\), then the standard error can be found using the corrected sample standard deviation, \(\sqrt{3.3} \approx 1.81659\):

$$ 1.81659 / \sqrt{5} = 1.81659 / 2.23607 \approx 0.8124 $$
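A short sketch of the standard error of the mean in Python. It assumes σ is the corrected sample standard deviation (\(\sqrt{3.3}\), i.e. the sum of squared deviations 13.2 divided by \(n - 1\) before taking the root), which is the usual choice when working from a sample. The variable names are my own.

```python
import math
import statistics

data = [5, 6, 8, 9, 9]

# Assumption: use the corrected sample standard deviation (n - 1 divisor).
sample_sd = statistics.stdev(data)

# Standard error = standard deviation / sqrt(sample size).
standard_error = sample_sd / math.sqrt(len(data))

print(round(standard_error, 4))  # 0.8124
```

Quadrupling the sample size (with the same spread) would halve the standard error, since the divisor is the square root of the sample size.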
