12.5. Summary
12.5.1. Terminology Review
Use the flashcards below to help you review the terminology introduced in Chapter 12: Multidimensional Data: Vector Moments and Linear Regression of Foundations of Data Science with Python.
12.5.2. Key take-aways
Introduction to Vectors
Vectors are one-dimensional lists of components (or elements).
In this book, vectors are denoted by bold, lowercase letters. Another notation used in some domains is to indicate vectors with this type of arrow: \(\overset{\rightharpoonup}{a}\).
Components can be numbers or variables representing numbers.
Vectors usually refer to column vectors, in which the components form a single column.
Scalars are single numerical values.
Scalars and vectors both have lengths, but only vectors have direction.
The size of a vector is the number of components.
Standard unit vectors consist of all zeros, except for one component that takes on the value 1.
Two-dimensional vectors are often visualized as arrows indicating a displacement, which is often referenced from the origin.
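A minimal NumPy sketch of these basics; the example vector and values are arbitrary:

```python
import numpy as np

# A vector stored as a one-dimensional NumPy array
a = np.array([3.0, 4.0])
print(a.size)    # the size is the number of components -> 2

# A standard unit vector: all zeros except one component equal to 1
e0 = np.zeros(2)
e0[0] = 1.0
print(e0)        # [1. 0.]
```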
Vector Operations
Vector addition and vector subtraction of equal-length vectors are defined component-wise and inherit the standard properties of scalar addition: these operations are commutative and associative, and the zero vector is the additive identity.
Scalar-vector multiplication is carried out via broadcasting: the scalar multiplies every element of the vector. Multiplication by a scalar changes the length of a vector, and a negative scalar can reverse the direction of the vector. Multiplication cannot otherwise change the direction of a vector.
The Hadamard product is the component-wise multiplication of two vectors and is denoted using the symbol \(\odot\). In NumPy, the Hadamard product is carried out using the standard multiplication operator `*`.
The inner product or dot product of two vectors is the sum of the elements in their Hadamard product. The dot product is denoted by \(\mathbf{a} \cdot \mathbf{b}\) or \(\mathbf{a}^T \mathbf{b}\).
The length or magnitude of a vector is denoted \(\left \Vert \mathbf{a} \right \Vert\) and defined by \begin{equation*} \left \Vert \mathbf{a} \right \Vert = \sqrt{ \sum_{i} a_{i}^{2} } \end{equation*}
The length of a vector can be generalized to the norm of any object with an inner product operator, in which case the norm is defined as \begin{equation*} \left \Vert x \right \Vert = \sqrt{ \langle x,~x \rangle} \end{equation*}
The distance between vectors is \begin{equation*} d(\mathbf{a}, \mathbf{b}) = \left \Vert \mathbf{a} - \mathbf{b} \right \Vert = \left \Vert \mathbf{b} - \mathbf{a} \right \Vert . \end{equation*}
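These operations map directly onto NumPy. A minimal sketch, using arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)                  # component-wise vector addition
print(a - b)                  # component-wise vector subtraction
print(2 * a)                  # scalar multiplication via broadcasting
print(a * b)                  # Hadamard (component-wise) product
print(a @ b)                  # dot product: sum of the Hadamard product
print(np.linalg.norm(a))      # length (magnitude) of a
print(np.linalg.norm(a - b))  # distance between a and b
```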
Summary Statistics for Vector Data
Vectors of data can be stored in Pandas dataframes or NumPy arrays.
For a Pandas dataframe, the mean, median, and variance of all of the columns can be computed using `df.mean()`, `df.median()`, and `df.var()`, respectively.
The mean and variance of the columns of a NumPy array `my_array` can be computed using `my_array.mean(axis=0)` and `my_array.var(axis=0)`, respectively. The medians can be computed using `np.median(my_array, axis=0)`.
The variance estimate computed by Pandas is the unbiased estimate \(s_{n-1}^{2}\). The variance estimate computed using NumPy is the biased estimate \(s_{n}^{2}\). To get the unbiased estimate in NumPy, use the keyword argument `ddof=1`.
The covariance for random variables \(X\) and \(Y\) is \(E\left[\left(X- \mu_X\right)\left(Y-\mu_Y \right)\right]\).
Positive covariance indicates that the mean-centered values of the random variables tend to have the same sign.
The covariance for data vectors \(\mathbf{x}\) and \(\mathbf{y}\) is \begin{equation*} \operatorname{Cov}( \mathbf{x}, \mathbf{y}) = \frac{1}{n-1} \sum_{i=0}^{n-1} \left(x_i - \overline{x}\right) \left(y_i - \overline{y}\right) . \end{equation*}
The correlation coefficient is a normalized covariance that takes values between -1 and 1.
The correlation coefficient for random variables is \begin{equation*} \rho = \frac{ \operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}. \end{equation*}
The correlation coefficient (also called Pearson’s correlation coefficient) for data vectors \(\mathbf{x}\) and \(\mathbf{y}\) is \begin{equation*} r = \frac{ \operatorname{Cov}(\mathbf{x}, \mathbf{y})}{\sigma_x \sigma_y}. \end{equation*}
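A short sketch tying these functions together; the data here is synthetic and the column names are arbitrary:

```python
import numpy as np
import pandas as pd

# Synthetic data: y is linearly related to x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": y})

print(df.mean())    # column means
print(df.median())  # column medians
print(df.var())     # unbiased variance estimates (divide by n-1)

my_array = df.to_numpy()
print(my_array.var(axis=0))           # biased estimates (divide by n)
print(my_array.var(axis=0, ddof=1))   # unbiased, matching Pandas

print(np.cov(x, y))       # covariance matrix (unbiased by default)
print(np.corrcoef(x, y))  # correlation coefficients, between -1 and 1
```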
Linear Regression
Variables are often classified as either explanatory or response:
Explanatory variables are used to predict another variable. Sometimes these are called independent variables, especially if they are under the experimenter’s control.
Response variables are to be predicted or explained using another variable.
Linear regression is the process of creating a predictor for a response variable that is a linear function of one or more explanatory variables.
Simple linear regression is linear regression with one explanatory variable and one response variable.
In ordinary least squares (OLS), the coefficients of linear regression are chosen to minimize the mean-square error.
For simple linear regression, the OLS solution can be found using calculus.
The SciPy function `stats.linregress()` performs simple linear regression when passed two vectors (see the sketch after this list).
The coefficient of determination in simple linear regression is the value \(r^2\).
The explained variance is \(r^2 \sigma_{y}^{2}\), where \(\sigma_{y}^2\) is the total variance (the original variance in the response variable \(\mathbf{y}\)).
If two variables are correlated, they can be used to predict each other. We say they are associated. We cannot say anything about why they are associated or about any causal relationship between the variables. A common phrase is “Correlation is not causation.”
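A minimal sketch of simple linear regression with `stats.linregress()`; the data is synthetic, with a slope and intercept chosen for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic data with a roughly linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 1 + rng.normal(scale=2.0, size=50)

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # OLS coefficients

r_squared = result.rvalue ** 2         # coefficient of determination
print(r_squared)

# Explained variance: r^2 times the total variance of y
print(r_squared * np.var(y, ddof=1))
```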
Null Hypothesis Tests for Correlation
We can conduct a null hypothesis significance test (NHST) for correlation between two vectors using either a resampling test or an analytical test.
The null hypothesis, \(H_0\), is that the two variables are not correlated.
In resampling, the values from the two vectors are drawn independently to break any potential correlation between the variables.
The analytical \(p\)-value can be found using `stats.linregress()`.
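A sketch of both tests on the same synthetic data; here the resampling test shuffles one vector so that the pairings are drawn independently:

```python
import numpy as np
from scipy import stats

# Synthetic correlated data
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = x + rng.normal(size=30)

r_obs = stats.linregress(x, y).rvalue

# Resampling test: permute y to break any correlation with x,
# then count how often the resampled |r| reaches the observed |r|
num_sims = 10_000
count = 0
for _ in range(num_sims):
    r_sim = stats.linregress(x, rng.permutation(y)).rvalue
    if abs(r_sim) >= abs(r_obs):
        count += 1
print("resampling p-value:", count / num_sims)

# Analytical p-value from the same function
print("analytical p-value:", stats.linregress(x, y).pvalue)
```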
Nonlinear Regression
In nonlinear regression, the response variable is predicted using a nonlinear function of the explanatory variable.
Nonlinear regression can often be performed by transforming either the explanatory variable or the response variable using a nonlinear function and then finding the linear regression solution for the transformed variables. Mapping that solution back through the transformation yields a nonlinear regression solution.
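For instance, if the data follows an exponential trend \(y \approx a e^{bx}\), then \(\ln y\) is a linear function of \(x\). A sketch with synthetic data (the model parameters here are arbitrary):

```python
import numpy as np
from scipy import stats

# Synthetic data following an exponential trend y ~ a * exp(b*x)
rng = np.random.default_rng(3)
x = np.linspace(1, 5, 40)
y = 0.5 * np.exp(1.2 * x) * rng.lognormal(sigma=0.1, size=40)

# Transform the response so the relationship is linear:
# ln y = ln a + b * x
result = stats.linregress(x, np.log(y))

# Map the linear solution back to the nonlinear model
b_hat = result.slope
a_hat = np.exp(result.intercept)
print(a_hat, b_hat)   # should be close to 0.5 and 1.2
```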