Exploring Statistics and Probability.

Intro

Statistics and probability are related fields in math.

Key Components

The data set, \(X\), is usually an ordered set of n discrete numbers. The data are sometimes sampled from a continuous distribution of numbers.

$$ X = \{x_1, x_2, \dots, x_n\} $$

The mode is the most frequently occurring value in \(X\). Mode has the same root as modus operandi, and is about the method of operating, hence the "preferred" number. EG:

If \(X = \{5, 6, 8, 9, 9\}\), then 9 is the mode.
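
As a quick check of the mode, here is a minimal Python sketch using the standard library's statistics module. The second data set (with a tie) is my own assumption, added only to show what happens when two values are equally frequent.

```python
from statistics import mode, multimode

X = [5, 6, 8, 9, 9]

# mode() returns the single most frequently occurring value.
most_common = mode(X)

# multimode() returns every value tied for most frequent.
# This tied data set is a made-up illustration, not from the text above.
all_modes = multimode([5, 5, 8, 9, 9])

print(most_common)  # 9
print(all_modes)    # [5, 9]
```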

The median is the middle number of \(X\) once it is sorted; when n is odd, that is \(x_{(n + 1) / 2}\). Median comes from the root word medius, which means "middle". If a set has an even number of items, then the median is midway between the 2 middle numbers. The median is key to robust statistics because it is resistant to outliers in the data. EG:

If \(X = \{5, 6, 8, 9, 9\}\), then 8 is the median.
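
The median can be sketched the same way in Python. The even-sized set and the set with an outlier are assumptions of mine, included only to show the "midway" rule and the resistance to outliers.

```python
from statistics import median

X = [5, 6, 8, 9, 9]

# Odd count: the middle value of the sorted data.
odd_median = median(X)

# Even count (made-up set): midway between the 2 middle numbers, 6 and 8.
even_median = median([5, 6, 8, 9])

# Outlier (made-up set): replacing one 9 with 900 does not move the median.
outlier_median = median([5, 6, 8, 9, 900])

print(odd_median)      # 8
print(even_median)     # 7.0
print(outlier_median)  # 8
```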

The arithmetic mean (aka average), \(\bar{x}\), is the sum of the data set divided by the count of numbers in the data set. When most people talk about the "mean" they are usually talking about the arithmetic mean (as opposed to other means like the harmonic mean or geometric mean). The connotation of the word "mean" is that it typifies or denotes the set. EG:

$$ \bar{x} = \frac{x_1 + \dots + x_n}{n} $$

If you have data set \(X\):

$$ X = \{5, 6, 8, 9, 9\} $$

then 7.4 is the arithmetic mean.

$$ \bar{x} = (5 + 6 + 8 + 9 + 9) / 5 $$

$$ \bar{x} = 37 / 5 $$

$$ \bar{x} = 7.4 $$
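
The same arithmetic in Python, once by hand and once with the standard library, as a small sketch:

```python
from statistics import mean

X = [5, 6, 8, 9, 9]

# Sum of the data set divided by the count of numbers in it.
by_hand = sum(X) / len(X)

# The standard library gives the same result.
from_library = mean(X)

print(by_hand)       # 7.4
print(from_library)  # 7.4
```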

The Expected Value (EV) is the long-run average value of a random variable if you could repeat the process an infinite number of times. This is akin to finding the "center of mass" of a physical object, in that the weight as a function of length differs between a ruler (EV equivalent to mode, median, or mean, i.e. the middle) and a sword (EV closer to the hilt than the tip). Note that EV is sometimes denoted as μ or called the "mean" (or, loosely, the "median"), but sometimes those terms actually mean something else. On my site, when I say "mean" in a statistics or probability setting, I mean the arithmetic mean.
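
To make the "center of mass" idea concrete, here is a rough Python sketch for a discrete random variable, where the EV is each outcome weighted by its probability. Both dice are hypothetical examples of mine: the fair die plays the role of the ruler and the loaded die plays the role of the sword.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of outcome * probability over a {outcome: probability} dict."""
    return sum(outcome * p for outcome, p in distribution.items())

# Fair die (the "ruler"): every face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Loaded die (the "sword"): face 1 comes up half the time,
# and the other five faces share the remaining half equally.
loaded_die = {1: Fraction(1, 2)}
loaded_die.update({face: Fraction(1, 10) for face in range(2, 7)})

fair_ev = expected_value(fair_die)
loaded_ev = expected_value(loaded_die)

print(fair_ev)    # 7/2
print(loaded_ev)  # 5/2
```

The loaded die's EV is pulled toward the heavy end, just as a sword balances closer to the hilt.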

The absolute deviation, \(D_i\), of the ith number in \(X\) is the absolute difference between \(x_i\) and a selected point, \(m(X)\). The selected point is a measure of central tendency (usually the median or mean).

$$ D_i = |x_i - m(X)| $$
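
A small Python sketch of the formula above. The helper name absolute_deviations is my own, and the results from the mean are rounded to one decimal place to hide floating point noise.

```python
from statistics import mean, median

X = [5, 6, 8, 9, 9]

def absolute_deviations(data, center):
    """Return |x_i - center| for each x_i in data (hypothetical helper)."""
    return [abs(x - center) for x in data]

# Selected point m(X) = median (8).
from_median = absolute_deviations(X, median(X))

# Selected point m(X) = arithmetic mean (7.4), rounded for display.
from_mean = [round(d, 1) for d in absolute_deviations(X, mean(X))]

print(from_median)  # [3, 2, 0, 1, 1]
print(from_mean)    # [2.4, 1.4, 0.6, 1.6, 1.6]
```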

The standard deviation, σ, is a kind of "average" of the differences of each number in the data set from the EV (strictly, the square root of the average squared difference). The standard deviation has the same units as the data set. EG: Suppose that the height of American men has an EV of 70"; if the σ was 3", then a lot of American men would be 67-73"; if the σ was 0", then all American men would be 70".

The uncorrected sample standard deviation, \(σ_n\), divides by n. It tends to underestimate the spread of the population the sample came from, but it becomes more accurate as the sample size grows.

$$ σ_n = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2} $$

If you have data set \(X\):

$$ X = \{5, 6, 8, 9, 9\} $$

then 1.62 is the uncorrected sample standard deviation.

Find the deviation (using the arithmetic mean 7.4) for each number in the set:

$$ \{5-7.4, 6-7.4, 8-7.4, 9-7.4, 9-7.4\} $$

$$ \{-2.4, -1.4, 0.6, 1.6, 1.6\} $$

Square each deviation in the set:

$$ \{(-2.4)^2, (-1.4)^2, 0.6^2, 1.6^2, 1.6^2\} $$

$$ \{5.76, 1.96, 0.36, 2.56, 2.56\} $$

Sum the squared deviations in the set:

$$ 5.76 + 1.96 + 0.36 + 2.56 + 2.56 = 13.2 $$

Divide that sum by the count of the set:

$$ 13.2 / 5 = 2.64 $$

Find the square root of that quotient:

$$ \sqrt{2.64} \approx 1.62 $$
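
The steps above can be sketched in Python. The variable names are mine; statistics.pstdev is the standard library function that divides by n, so it should agree with the hand calculation.

```python
import math
from statistics import pstdev

X = [5, 6, 8, 9, 9]
n = len(X)
x_bar = sum(X) / n  # arithmetic mean, 7.4

# Square each deviation from the mean, then sum the squares.
sum_of_squares = sum((x - x_bar) ** 2 for x in X)

# Divide by the count, then take the square root.
sigma_n = math.sqrt(sum_of_squares / n)

print(round(sum_of_squares, 2))  # 13.2
print(round(sigma_n, 4))         # 1.6248
print(round(pstdev(X), 4))       # 1.6248
```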

The corrected sample standard deviation, \(σ\), (aka sample standard deviation) removes some of the bias from the uncorrected sample standard deviation.

$$ σ = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} $$

If you have this data set:

$$ X = \{5, 6, 8, 9, 9\} $$

Sum the squared deviations in the set (the same squares as in the uncorrected example):

$$ 5.76 + 1.96 + 0.36 + 2.56 + 2.56 = 13.2 $$

Divide that sum by one less than the count of the set:

$$ 13.2 / 4 = 3.3 $$

Find the square root of that quotient:

$$ \sqrt{3.3} \approx 1.82 $$
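
The corrected version differs from the uncorrected one only in the divisor. A minimal Python sketch, with variable names of my own choosing; statistics.stdev is the standard library function that divides by n - 1.

```python
import math
from statistics import stdev

X = [5, 6, 8, 9, 9]
n = len(X)
x_bar = sum(X) / n  # arithmetic mean, 7.4

# Same sum of squared deviations as in the uncorrected calculation.
sum_of_squares = sum((x - x_bar) ** 2 for x in X)

# Divide by one less than the count, then take the square root.
sigma = math.sqrt(sum_of_squares / (n - 1))

print(round(sigma, 4))     # 1.8166
print(round(stdev(X), 4))  # 1.8166
```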

Stuff to Parse

The variance [W], \(σ^2\), is simply the square of the standard deviation.
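
A short Python check of that relationship, using the running data set. Both flavors are shown: pvariance and pstdev divide by n, while variance and stdev divide by n - 1.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

X = [5, 6, 8, 9, 9]

uncorrected_variance = pvariance(X)  # divides by n
corrected_variance = variance(X)     # divides by n - 1

print(round(uncorrected_variance, 2))  # 2.64
print(round(corrected_variance, 2))    # 3.3

# Each variance is the square of the matching standard deviation.
print(math.isclose(pstdev(X) ** 2, uncorrected_variance))  # True
print(math.isclose(stdev(X) ** 2, corrected_variance))     # True
```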

Large naturally occurring sets of random numbers tend to have a normal distribution (aka Gaussian distribution; the famous bell curve).

For a normal distribution, the numbers in a set will be distributed as follows: about 68% fall within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
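
The share of a normal distribution that lies within k standard deviations of the mean can be computed with the standard library's NormalDist (Python 3.8+). This is only a sketch; the helper name share_within is my own.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of the distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```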

Statistics

Probability

Probability distribution [W]. Has basic terms.

Unbiased estimation of standard deviation [W]

Normal distribution [W]

Level of measurement [W]

Standard score [W]

Multivariate normal distribution [W]

See also this nice chart from Statistics [W]. I want to do a variation that also shows a sideways box plot and confidence intervals from the EV.

The normal distribution

The standard error of the mean is the standard deviation divided by the square root of the sample size. That is, the larger the sample, the smaller the standard error.

standard error = σ / sqrt(samples)
If X = {5, 6, 8, 9, 9}, then the standard error can be found using the corrected sample standard deviation (σ ≈ 1.816590):
1.816590 / sqrt(5) =
1.816590 / 2.236068 ≈
0.812404
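
A minimal Python sketch of the standard error of the mean, assuming that σ in the formula is the corrected sample standard deviation (the one that divides by n - 1), which statistics.stdev computes:

```python
import math
from statistics import stdev

X = [5, 6, 8, 9, 9]

# Assumption: use the corrected sample standard deviation for σ.
sigma = stdev(X)

# Standard error = σ / sqrt(number of samples).
standard_error = sigma / math.sqrt(len(X))

print(round(sigma, 4))           # 1.8166
print(round(standard_error, 4))  # 0.8124
```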
