
Friday, September 1, 2023

Correlation Between Variables using Python


The level of interdependence between the variables in your dataset should be identified and quantified. Knowing these interdependencies helps you prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance degrades when strongly interdependent variables are present.

Correlation is a statistical summary of the relationship between variables. In this tutorial, you will learn how to calculate correlation for different types of variables and different types of relationships.

There are 4 sections in this tutorial; they are as follows:

  1. Correlation: What Is It?
  2. Covariance of the Test Dataset
  3. Pearson's Correlation
  4. Spearman's Correlation

Correlation: What Is It?

There are many different ways that variables in a dataset might be related or connected.
For instance:
  • The values of one variable may influence or be dependent upon the values of another.
  • There may be a very slight correlation between two variables.
  • Two variables may both depend on a third, unknown variable.
The ability to better grasp the relationships between variables can be helpful in data analysis and modeling. The term "correlation" refers to the statistical association between two variables.

A correlation can be positive, meaning that both variables change in the same direction, or negative, meaning that when one variable's value increases, the other's decreases. A correlation can also be neutral, or zero, meaning that the variables are not related.
    1. Both variables fluctuate in the same direction when there is a positive correlation.
    2. No correlation exists between the changes in the variables in a neutral correlation.
    3. In a negative correlation, the variables shift against each other.

Impact of Correlations on ML Models:

  • Multicollinearity, a close relationship between two or more input variables, can degrade the performance of some algorithms. In linear regression, for instance, one of the offending correlated variables should be removed to improve the skill of the model (see the sketch after this list).
  • We may also be interested in the correlation between the input variables and the output variable, to gain insight into which variables may or may not be relevant as inputs for building a model.
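
For instance, a correlation matrix over the candidate input features can be used to flag multicollinear pairs before modeling. The following is a minimal sketch of this idea; the three synthetic features and the 0.8 threshold are illustrative assumptions, not part of this tutorial's dataset.

# flag multicollinear input features with a correlation matrix
from numpy import corrcoef
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# three candidate input features; the first two are deliberately related
x1 = 20 * randn(1000) + 100
x2 = x1 + (10 * randn(1000) + 50)
x3 = randn(1000)
# pairwise Pearson correlation between every pair of features
matrix = corrcoef([x1, x2, x3])
# report any pair whose absolute correlation exceeds the threshold
threshold = 0.8
for i in range(3):
    for j in range(i + 1, 3):
        if abs(matrix[i, j]) > threshold:
            print('features %d and %d are highly correlated (r=%.3f)' % (i, j, matrix[i, j]))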

Generating Data for the Correlation Analysis:


# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()

Running the example prints the mean and standard deviation of each variable and creates a scatter plot of the two variables.

data1: mean=100.776 stdv=19.620
data2: mean=151.050 stdv=22.358

Covariance of the Test Dataset

Variables can be related by a linear relationship, that is, a relationship that is consistently additive across the two data samples.

This relationship can be summarized by the covariance between the two variables: the average of the products of the values from each sample, after those values have been centered (had their mean subtracted).

The formula for computing the sample covariance is outlined as follows:

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) / (n - 1)
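
As a quick check, the formula can be applied directly and compared against NumPy's cov() function (introduced below); this is a minimal sketch reusing the test dataset generated above.

# calculate the sample covariance manually from the formula
from numpy import cov
from numpy import mean
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# sum the products of the mean-centered values, then divide by n - 1
n = len(data1)
manual = ((data1 - mean(data1)) * (data2 - mean(data2))).sum() / (n - 1)
print('manual covariance: %.3f' % manual)
print('numpy cov: %.3f' % cov(data1, data2)[0, 1])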


Using the mean in the calculation suggests that each data sample should have a Gaussian or Gaussian-like distribution. The sign of the covariance indicates whether the two variables change together (positive) or in opposite directions (negative). The magnitude of the covariance is hard to interpret because it depends on the scales of the variables. A covariance of zero indicates that there is no linear relationship between the variables, though they may still be related in a nonlinear way.

NumPy's cov() function computes a covariance matrix for multiple variables.
covariance = cov(data1, data2)

The diagonal of the matrix contains the covariance of each variable with itself, which is its variance. The other entries contain the covariance between the two paired variables; since only two variables are considered here, those two entries are identical.

We can derive the covariance matrix for the given pair of variables in our test scenario. Here's the complete illustration:


from numpy.random import randn
from numpy.random import seed
from numpy import cov

# Set random seed
seed(1)

# Prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)

[[385.33297729 389.7545618 ]
 [389.7545618  500.38006058]]

The covariance and the covariance matrix are used widely in statistics and multivariate analysis to characterize the relationships between variables.

Running the example calculates and prints the covariance matrix.

Since the dataset contains variables drawn from a Gaussian distribution and the variables are linearly related, covariance is a suitable way to characterize the relationship.

The covariance between the two variables is 389.75. The positive sign tells us that the variables change in the same direction, as we expected.

A problem with covariance as a standalone statistical tool is that its magnitude is hard to interpret, which motivates Pearson's correlation coefficient, covered next.


Pearson’s Correlation:

The Pearson correlation coefficient, named after Karl Pearson, summarizes the strength of the linear relationship between two data samples.

It's calculated by dividing the covariance of variables by the product of their standard deviations, normalizing the covariance for an interpretable score.

 

Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))

The use of the mean and standard deviation in the calculation implies that the data samples should have a Gaussian or Gaussian-like distribution.

The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship.

The coefficient ranges from -1 to 1, indicating the strength of the correlation; a value of 0 means no correlation. Typically, a value below -0.5 or above 0.5 indicates a notable correlation, and values between those thresholds suggest a weaker correlation.

To compute the Pearson’s correlation coefficient for data samples of equal length, one can utilize the pearsonr() function from SciPy.

# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.888

The variables are strongly and positively correlated, with a coefficient of 0.888; we know the association is high because the value is close to 1.0.
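
The coefficient can also be reproduced by applying the normalization above by hand; a minimal sketch, using ddof=1 so that the standard deviations match the sample covariance computed by cov():

# reproduce Pearson's correlation from covariance and standard deviations
from numpy import cov
from numpy import std
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# covariance divided by the product of the standard deviations
manual = cov(data1, data2)[0, 1] / (std(data1, ddof=1) * std(data2, ddof=1))
corr, _ = pearsonr(data1, data2)
print('manual Pearson: %.3f' % manual)
print('scipy pearsonr: %.3f' % corr)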

Pearson's correlation coefficient can also be used to summarize the relationships among more than two variables. This involves building a correlation matrix containing the coefficient for each pair of variables. The matrix is symmetrical, with 1.0 on the diagonal because each variable correlates perfectly with itself.
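
For the two test variables, such a matrix can be produced with NumPy's corrcoef() function; a minimal sketch:

# correlation matrix for the two test variables
from numpy import corrcoef
from numpy.random import randn
from numpy.random import seed
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# symmetric matrix with 1.0 on the diagonal
matrix = corrcoef(data1, data2)
print(matrix)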

Spearman’s Correlation

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. Further, the two variables being considered may have a non-Gaussian distribution.

In such cases, Spearman's correlation coefficient, named after Charles Spearman, can be used to summarize the strength of the association between the two data samples. It can be used for both nonlinear and linear relationships, though with slightly lower sensitivity (lower coefficients) when the relationship is linear.

Similar to the Pearson correlation, scores range between -1 and 1 for perfect negative and positive correlations, respectively. However, Spearman's coefficient utilizes rank-based statistics rather than sample values, making it suitable for non-parametric analysis, where data distribution assumptions like Gaussian aren't made.
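
To see this robustness in action, the two coefficients can be compared on a monotonic but nonlinear relationship. This is a minimal sketch; the exponential transform is an illustrative choice, not part of the tutorial's dataset.

# compare Pearson's and Spearman's correlation on a nonlinear relationship
from numpy import exp
from numpy.random import rand
from numpy.random import seed
from scipy.stats import pearsonr
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# y is a monotonic but strongly nonlinear function of x, plus noise
x = rand(1000)
y = exp(8 * x) + rand(1000)
# Spearman's coefficient stays near 1.0 while Pearson's is substantially lower
print('Pearson: %.3f' % pearsonr(x, y)[0])
print('Spearman: %.3f' % spearmanr(x, y)[0])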

Spearman's correlation coefficient = covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))
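
To make the rank-based definition concrete, the coefficient can be reproduced by ranking each sample and then applying Pearson's correlation to the ranks; a minimal sketch using scipy.stats.rankdata:

# reproduce Spearman's correlation by correlating the ranks
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
from scipy.stats import rankdata
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# rank each sample, then compute Pearson's correlation on the ranks
corr_ranks, _ = pearsonr(rankdata(data1), rankdata(data2))
corr, _ = spearmanr(data1, data2)
print('Pearson on ranks: %.3f' % corr_ranks)
print('spearmanr: %.3f' % corr)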

The spearmanr() SciPy function calculates Spearman's coefficient directly for two data samples of equal length, as in the complete example below.

# calculate Spearman's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate spearman's correlation
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)


Spearmans correlation: 0.872

Even though we know the data is Gaussian and the relationship between the variables is linear (conditions under which Pearson's correlation would normally be preferred), the nonparametric rank-based method still reveals a strong correlation of 0.872 between the variables.
