11.4. Summary

11.4.1. Terminology Review

Use the flashcards below to help you review the terminology introduced in this chapter.

11.4.2. Key Take-Aways

Introduction

  • Categorical data is not numerical but instead takes on one of a set of categories.

  • Categorical data is either ordinal or nominal.

  • In ordinal data, the categories have a natural ordering. Examples include Likert scale data and income-range data.

  • In nominal data, the categories do not have a natural ordering. Examples include handedness and country of citizenship. (Both kinds are illustrated in the short sketch after this list.)
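
As a brief illustration (with made-up example values), the sketch below shows how ordinal and nominal data might be represented in Pandas; the ordered=True flag is what makes a categorical ordinal.

```python
import pandas as pd

# Hypothetical Likert-scale responses: ordinal, because the categories have a natural order
responses = pd.Categorical(
    ["Agree", "Neutral", "Strongly agree", "Disagree"],
    categories=["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"],
    ordered=True,
)

# Hypothetical handedness data: nominal, because the categories have no natural order
handedness = pd.Categorical(["right", "left", "right", "right"])

print(responses.min(), responses.max())  # ordering makes min/max meaningful
print(handedness.categories)             # nominal categories carry no ordering
```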

Tabulating Categorical Data and Creating a Test Statistic

  • Two variables of categorical data can be summarized and compared using a (two-way) contingency table.

  • In a two-way contingency table, one variable is tabulated across the rows, while the other variable is tabulated across the columns.

  • Each cell contains the number of data points that match the corresponding row and column categories.

  • Contingency tables often include marginal counts, which are row and column sums.

  • Pandas has a crosstab() function to generate a contingency table given two Series or two columns of a DataFrame. If passed the keyword argument margins=True, it will include the marginal counts.

  • The expected contingency table gives the expected value for each cell if there is no dependence between the two variables. The expected value of a cell is its marginal row sum times its marginal column sum, divided by the total number of entries in the table.

  • Scipy.stats has a function stats.contingency.expected_freq() that returns the expected table when its argument is a contingency table (without marginal sums).

  • The normalized squared differences between the observed and expected values in the table cells are of the form \begin{equation*} \frac{\left(O_i - E_i\right)^2}{E_i}. \end{equation*}

  • In the above, \(O_i\) and \(E_i\) are the observed value and expected value, respectively, for cell \(i\).

  • The chi-squared statistic is the sum of the normalized squared differences. (These steps are pulled together in the sketch after this list.)
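
The sketch below pulls these steps together on a small, hypothetical handedness-versus-study-format dataset: building the contingency table with pd.crosstab(), computing the expected table with stats.contingency.expected_freq(), and summing the normalized squared differences to obtain the chi-squared statistic.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: handedness vs. preferred study format
df = pd.DataFrame({
    "handedness": ["right", "left", "right", "right", "left", "right", "left", "right"],
    "format":     ["video", "text", "text", "video", "video", "text", "text", "video"],
})

# Two-way contingency table; margins=True adds the row and column sums
print(pd.crosstab(df["handedness"], df["format"], margins=True))

# Expected table under independence (pass the table WITHOUT marginal sums)
observed = pd.crosstab(df["handedness"], df["format"])
expected = stats.contingency.expected_freq(observed)

# Chi-squared statistic: sum of the normalized squared differences
chi2_stat = ((observed.to_numpy() - expected) ** 2 / expected).sum()
print(chi2_stat)
```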

Null Hypothesis Significance Testing for Contingency Tables

  • An NHST can be conducted via resampling or analysis.

  • For resampling, the values of one of the variables should be shuffled at random to break up any dependence between the two variables. The relative frequency of seeing a value of the chi-squared statistic at least as large as the one observed is estimated over many such shufflings.

  • For analysis, the chi-squared statistic is modeled as a chi-squared random variable.

  • The number of degrees of freedom of the chi-squared random variable is the number of degrees of freedom for the table. If the table has \(r\) rows and \(c\) columns, the number of dofs is \begin{equation*} n_{dof} = (r-1)(c-1). \end{equation*}

  • Then the probability of seeing a value of the chi-squared statistic at least as large as that observed in the data is equal to the survival function of the chi-squared random variable evaluated at the observed value of the chi-squared statistic.

  • Using Scipy.stats, we can create a chi-squared distribution with a specified number of degrees of freedom dof as: chi_rv = stats.chi2(dof).

  • Be careful to use stats.chi2() because stats.chi() is a different distribution (the chi distribution, not chi-squared).

  • Given the chi_rv object and the observed value of the chi-squared statistic C, the analytical value for the \(p\)-value is given by chi_rv.sf(C).

  • An even better approach is to pass the contingency table (without marginal sums) to stats.chi2_contingency(), which will compute the \(p\)-value and, for tables with a single degree of freedom, apply an adjustment called Yates’ continuity correction to give a better estimate of the true \(p\)-value. (Both the resampling and analytical approaches are sketched below.)
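
Continuing the hypothetical handedness-versus-study-format example, the sketch below illustrates both approaches: a resampling test that shuffles one variable many times, and the analytical test using stats.chi2() and stats.chi2_contingency().

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data (same as the previous sketch)
df = pd.DataFrame({
    "handedness": ["right", "left", "right", "right", "left", "right", "left", "right"],
    "format":     ["video", "text", "text", "video", "video", "text", "text", "video"],
})

def chi2_statistic(col1, col2):
    """Chi-squared statistic for two categorical Series."""
    observed = pd.crosstab(col1, col2)
    expected = stats.contingency.expected_freq(observed)
    return ((observed.to_numpy() - expected) ** 2 / expected).sum()

C = chi2_statistic(df["handedness"], df["format"])

# Resampling: shuffle one variable to break any dependence, then estimate how often
# the shuffled data produces a chi-squared statistic at least as large as C
rng = np.random.default_rng(0)
num_sims = 10_000
count = 0
for _ in range(num_sims):
    shuffled = pd.Series(rng.permutation(df["format"].to_numpy()))
    if chi2_statistic(df["handedness"], shuffled) >= C:
        count += 1
print("resampled p-value ~", count / num_sims)

# Analytical approach: model the statistic as a chi-squared random variable
r, c = pd.crosstab(df["handedness"], df["format"]).shape
dof = (r - 1) * (c - 1)
chi_rv = stats.chi2(dof)          # note: stats.chi2(), not stats.chi()
print("analytical p-value =", chi_rv.sf(C))

# Or let SciPy do the work (Yates' correction is applied when dof == 1)
chi2_val, p_value, _, _ = stats.chi2_contingency(pd.crosstab(df["handedness"], df["format"]))
print("chi2_contingency p-value =", p_value)
```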

Chi-Square Goodness-of-Fit Test

  • In a chi-square goodness-of-fit test, discrete data values are tabulated in a one-way contingency table, where each cell represents a particular observed value, and the cell’s content is the number of times that value was observed.

  • The observed values in the one-way contingency table are compared to a reference distribution for the data, and a chi-squared statistic is computed.

  • In an NHST, the data is compared to some default distribution to see whether observed differences from the default distribution could be attributed to randomness. A \(p\)-value is calculated based on the chi-squared statistic, and the null hypothesis is either rejected or not rejected.

  • For determining whether a particular distribution is a reasonable model for observed data, the parameters of the distribution are estimated from the data. This reduces the number of degrees of freedom by the number of estimated parameters.

  • A chi-squared statistic is computed from the data, and a \(p\)-value is computed based on how often a chi-squared statistic at least that large would be seen under the model distribution. If the \(p\)-value is much larger than the significance threshold, the model is consistent with the data. (A minimal goodness-of-fit example follows this list.)
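
As a minimal, hypothetical example, the sketch below tests whether 120 made-up die-roll counts are consistent with a fair die using stats.chisquare(); the final comment notes how the ddof argument accounts for parameters estimated from the data.

```python
import numpy as np
from scipy import stats

# Hypothetical one-way table: counts of each face from 120 rolls of a die
observed = np.array([14, 21, 22, 18, 26, 19])

# Reference distribution under the null hypothesis: a fair die
expected = np.full(6, observed.sum() / 6)

# Chi-squared statistic and p-value; dof = (number of cells) - 1 = 5
chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2_stat, p_value)

# If the reference distribution's parameters were estimated from the data
# (e.g., a fitted Poisson rate), reduce the degrees of freedom with ddof:
# stats.chisquare(observed, f_exp=expected_from_fit, ddof=1)
```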