3.8. Chapter Summary
In this chapter, we started working with real data and introduced Pandas and dataframes, partitions, summary statistics, binary hypothesis testing via bootstrap resampling, and two-dimensional statistics.
3.8.1. Key Take-aways:
Introduction to Pandas
Pandas stores tabular data in dataframes, which resemble spreadsheets or database tables.
The community standard is to import Pandas into the pd namespace.
Pandas can import data from standard file formats, including comma-separated value (CSV) files and Excel files.
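As a minimal sketch of these points, the following imports Pandas under the pd alias and loads CSV data into a dataframe. The column names and values are made up for illustration, and the CSV text is read from an in-memory string so that the example is self-contained; in practice, pd.read_csv() would usually be given a file path or URL (and pd.read_excel() plays the same role for Excel files).

```python
import io

import pandas as pd  # community-standard alias

# Hypothetical CSV content (made up for this example)
csv_text = """state,population,area
A,1000,50
B,2500,80
C,400,20
"""

# read_csv accepts a path, URL, or file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the dataframe
print(df.shape)    # (3, 3): three rows, three columns
```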
Visualizing Multiple Data Sets – Part 1: Scatter Plots
Matplotlib is one of the most common modules used for creating plots in Python.
The community standard is to import the matplotlib.pyplot module into the plt namespace.
Outliers are values in a data set that are not reasonable.
Outliers may be due to measurement errors, unit-conversion errors, or data-entry errors.
Scatter plots are useful for initial data exploration and identifying outliers.
When plotting two variables that represent different types of data on the same axis, it can be helpful to create a second \(y\)-axis to deal with differences in the scales for the two variables.
Before plots are communicated to others, they should be properly labeled, including \(x\)- and \(y\)-axis labels, a title (or caption), and a legend (if needed).
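The plotting points above can be sketched as follows. The data values and variable names are invented for illustration (the last temperature is a deliberately unreasonable value, such as a unit-conversion error might produce), and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt  # community-standard alias

# Made-up data: two variables on very different scales
day = [1, 2, 3, 4, 5]
temperature = [20.1, 21.4, 19.8, 22.0, 95.0]  # last value is a suspicious outlier
rainfall = [0.2, 0.0, 1.1, 0.4, 0.3]

fig, ax1 = plt.subplots()
ax1.scatter(day, temperature, color="C0", label="Temperature")
ax1.set_xlabel("Day")
ax1.set_ylabel("Temperature (deg C)")

# Second y-axis that shares the same x-axis
ax2 = ax1.twinx()
ax2.scatter(day, rainfall, color="C1", marker="x", label="Rainfall")
ax2.set_ylabel("Rainfall (cm)")

ax1.set_title("Temperature and rainfall by day")

# Combine the legend entries from both axes into one legend
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc="upper left")
```

In an interactive session, plt.show() would display the figure; the outlier stands out immediately on the temperature axis.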
Partitions
Sets are disjoint if they have no members in common; their intersection is the null set.
Partitions divide data into disjoint subsets.
If a group of sets \(A_0, A_1, \ldots\) partitions a set \(S\), then \(A_i \cap A_j = \emptyset\) for all \(i \ne j\) and \(A_0 \cup A_1 \cup \ldots = S\).
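The two conditions for a partition can be checked directly with Python's built-in sets. This is a small sketch; is_partition() is a hypothetical helper written for this example, not a library function.

```python
from itertools import combinations


def is_partition(subsets, S):
    """Return True if the sets in `subsets` are pairwise disjoint and their union is S."""
    pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(subsets, 2))
    covers_S = set().union(*subsets) == S
    return pairwise_disjoint and covers_S


S = {0, 1, 2, 3, 4, 5}
evens_odds = [{0, 2, 4}, {1, 3, 5}]       # disjoint, and the union is S
overlapping = [{0, 1, 2, 3}, {3, 4, 5}]   # 3 appears in both sets

print(is_partition(evens_odds, S))    # True
print(is_partition(overlapping, S))   # False
```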
Summary Statistics
Summary statistics represent a group of data by a single number.
Common summary statistics minimize a measure of error from the summary statistic to the data:
The mode minimizes the error count.
The median minimizes the sum of absolute errors.
The average, or sample mean, minimizes the sum of squared errors.
A data set’s mode, median, and average can all be different values.
The median is often considered a more robust statistic than the average because it is less sensitive to outliers.
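The error-minimizing properties listed above can be checked numerically. The sketch below uses a small made-up data set and a brute-force search over a grid of candidate values; the grid search is only for illustration and is not how these statistics are normally computed.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 10])  # made-up data with one large value

# Candidate summary values 0.00, 0.01, ..., 10.00
candidates = np.arange(0, 1001) / 100

# Three error measures, each evaluated at every candidate value
error_count = np.array([np.sum(data != c) for c in candidates])
sum_abs_error = np.array([np.sum(np.abs(data - c)) for c in candidates])
sum_sq_error = np.array([np.sum((data - c) ** 2) for c in candidates])

best_count = candidates[np.argmin(error_count)]
best_abs = candidates[np.argmin(sum_abs_error)]
best_sq = candidates[np.argmin(sum_sq_error)]

# The mode is the most frequently occurring value
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]

print("mode:   ", mode, " minimizer of error count:     ", best_count)
print("median: ", np.median(data), " minimizer of absolute error:", best_abs)
print("average:", np.mean(data), " minimizer of squared error:  ", best_sq)
```

For this data set the mode and median are both 2, while the average is 3.6: the single large value pulls the average upward but leaves the median unchanged.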
Visualizing Multiple Data Sets – Part 2:
When histograms of two variables are plotted on the same axes, it is important to use the same bins.
Transparency can be used to allow histograms for two variables to be shown on the same plot and still be able to see the values for each variable in every bin.
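A minimal sketch of both points, assuming two made-up groups of randomly generated data: the bin edges are computed once from the combined data and passed to both histograms, and the alpha parameter sets the transparency.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for two groups
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.5, size=200)

# One shared set of bin edges, computed from the combined data
bin_edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=20)

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(group_a, bins=bin_edges, alpha=0.5, label="Group A")
counts_b, _, _ = ax.hist(group_b, bins=bin_edges, alpha=0.5, label="Group B")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title("Histograms of two groups with shared bins")
ax.legend()
```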
Null Hypothesis Testing with Real Data
A statistical hypothesis is a hypothesis that is testable using data.
A test statistic is a single numerical value that can be used in a statistical test.
A null hypothesis for multiple groups is that there are no underlying differences among the groups (in the statistic being measured).
An alternative hypothesis is that the observed differences or effects represent real differences among the groups or data.
The null hypothesis and alternative hypothesis are usually denoted by \(H_0\) and \(H_a\), respectively.
A binary hypothesis test is a statistical test that decides between two competing statistical hypotheses.
A null hypothesis significance test (NHST) estimates the probability that the observed effect could have occurred under a null hypothesis.
In a NHST:
A statistical test is used to determine how often we would see such a large value of the test statistic under the null hypothesis. The probability of seeing a value at least that large under \(H_0\) is called the \(p\)-value.
Test statistics may be used in one-sided or two-sided tests. This will be discussed more in Section 5.3.
The typical threshold for determining statistical significance is \(p < 0.05\).
If \(p<0.05\), we say that we reject the null hypothesis. We cannot say that the alternative hypothesis is true.
If \(p \ge 0.05\), we say that we fail to reject the null hypothesis. We cannot say that the null hypothesis is false, but neither can we conclude that it is true. Often we may fail to reject the null hypothesis because we do not have enough data.
Statistical tests can be characterized as model-based or model-free.
In model-based techniques, the data is assumed to come from a known statistical distribution. Such techniques allow the use of analytical methods.
In model-free techniques, no assumption is made about the data fitting to some statistical model. Analytical methods are not possible.
Resampling is a model-free technique that draws new samples from the data for use in statistical testing.
When resampling is used with a NHST for data from two groups, the data is pooled (collected into one group) before resampling.
Bootstrap resampling is when data is drawn from the pooled data with replacement.
When performing bootstrap resampling to get samples to represent multiple groups, the sizes of the samples must match the sizes of the groups in the original data set.
To conduct a NHST with resampling:
Each simulation iteration draws new groups of sample data from the pooled data.
The sample test statistic is computed.
The sample test statistic is compared to the observed value of that test statistic, and a counter is incremented if the sample value is at least as large as the observed value.
After all the iterations, the relative frequency of seeing a test-statistic value at least as extreme as that observed in the data, under \(H_0\), is computed. That is the \(p\)-value, and it is compared to the predetermined \(p\)-value threshold.
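The resampling procedure above can be sketched as follows. This is one possible implementation under stated assumptions, not the chapter's own code: the test statistic is taken to be the absolute difference between the group means (a two-sided choice), the function name bootstrap_pvalue() and the data values are invented for this example, and the random seed is fixed only to make the result repeatable.

```python
import numpy as np


def bootstrap_pvalue(group1, group2, num_sims=10_000, seed=0):
    """Estimate a p-value for the difference in means via bootstrap resampling."""
    rng = np.random.default_rng(seed)
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)

    # Observed test statistic (assumption: absolute difference of means)
    observed = abs(group1.mean() - group2.mean())

    # Under H0 the groups do not differ, so pool the data
    pooled = np.concatenate([group1, group2])

    count = 0
    for _ in range(num_sims):
        # Draw with replacement; sample sizes match the original groups
        sample1 = rng.choice(pooled, size=len(group1), replace=True)
        sample2 = rng.choice(pooled, size=len(group2), replace=True)
        sample_stat = abs(sample1.mean() - sample2.mean())

        # Count results at least as large as the observed value
        if sample_stat >= observed:
            count += 1

    # Relative frequency is the estimated p-value
    return count / num_sims


# Made-up measurements for two groups
a = [10.1, 11.3, 9.8, 12.0, 10.7, 11.1, 10.4, 11.8]
b = [5.2, 6.1, 4.8, 5.9, 5.5, 6.3, 5.0, 5.7]

p = bootstrap_pvalue(a, b, num_sims=2000)
print("estimated p-value:", p)
print("reject H0 at 0.05:", p < 0.05)
```

Because the estimate is a relative frequency over a finite number of simulations, it varies slightly with the seed and becomes more precise as num_sims grows.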
A Quick Preview of Two-Dimensional Statistical Methods
Two-dimensional statistical methods can be used to work directly with pairs of data through techniques such as curve fitting.
How well a curve (equation) fits the data can be assessed by comparing the predicted values from the curve to the actual values and computing the total squared error.
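As a small illustration, the sketch below fits a straight line to made-up paired data and computes the total squared error between the predicted and actual values. Using np.polyfit() for a least-squares line is an assumption made for this example; the comparison line is an arbitrary alternative chosen to show a worse fit.

```python
import numpy as np

# Made-up paired data, roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Least-squares straight-line fit (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# Total squared error between the actual and predicted values
total_sq_error = np.sum((y - predicted) ** 2)

# An arbitrary alternative line, for comparison
alt_predicted = 1.0 * x + 3.0
alt_sq_error = np.sum((y - alt_predicted) ** 2)

print("fitted line: y =", slope, "* x +", intercept)
print("total squared error (fitted line):     ", total_sq_error)
print("total squared error (alternative line):", alt_sq_error)
```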
Self-Assessment Questions
Use the following questions to review the material from this chapter. Because there are many questions for this chapter, a random sample of 10 questions is shown. Reload this page to get a new random sample of 10 questions.
Terminology Review
3.8.2. Spaced Repetition Review
Answer these questions to check your retention of knowledge from Chapters 1 and 2: