Association
Exploring relationships between variables is crucial when building machine learning models. It deepens our understanding of the data, helps validate model assumptions (e.g., the independence assumption in linear regression), and supports the removal of unnecessary variables to improve model efficiency.
Association is a common technique for examining data relationships. It applies to numerical, categorical, and mixed variable types, revealing how these variables interact or vary together.
About the Data

Our garden bank is renowned for its exceptional management of customers' funds. Many customers from the outside world choose to save their money here, attracted by the bank's reputation for security and appealing investment opportunities.
Each month, our bank develops engaging investment offers and sends advertisements to potential customers likely to be interested. This process of sending advertisements is referred to as a campaign.
The data presented here comes from one such campaign, which aimed to promote a term deposit product. This product requires customers to maintain their funds in the bank for several years, during which they earn interest.
In the campaign data, the label deposit indicates whether a customer has acquired this term deposit product.

💻 The code to get campaign data >>
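For reference, here is a minimal sketch of loading such campaign data, assuming it sits in a local CSV file (the file name bank_campaign.csv is hypothetical; deposit is the label described above):

```python
import pandas as pd

# Hypothetical file name; the actual campaign data source may differ.
df = pd.read_csv("bank_campaign.csv")

# Quick sanity checks: dimensions, column types, and the label distribution.
print(df.shape)
print(df.dtypes)
print(df["deposit"].value_counts())
```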
Association between Numerical Variables
Correlation between 2 Numerical Variables
Correlation is a type of association often applied between two numerical variables. It measures the strength and direction of the relationship between the two variables.
Strength: How closely the variables are related. Strong correlation means the variables move together closely.
Direction: Whether the relationship is positive (both variables increase together) or negative (one variable increases as the other decreases).
We have 3 common methods to check the correlation between each pair of variables:
Pearson: a measure of the strength and direction of a linear relationship between two variables.
Spearman: the Pearson correlation computed on the rank values of the two variables; it assesses a monotonic relationship.
Kendall: similar to Spearman in that it measures a monotonic relationship using the rank values of the 2 variables, but it is more robust (smaller gross error sensitivity) and more efficient (smaller asymptotic variance) than Spearman.

Both Spearman and Kendall use rank values, so they can be applied to both continuous and ordinal variables. They are both non-parametric methods, so the input data is not required to follow a bell curve, as Pearson assumes.
Using all the numerical variables in our campaign data, let's look at the correlation triangle first. In the code below, you can choose one of the correlation methods through corr_method and decide whether to show absolute correlation values through abs_corr.
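As a rough sketch of what that code might look like (corr_method and abs_corr follow the parameters described above; the function name plot_corr_triangle and the seaborn/matplotlib plotting choices are assumptions):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr_triangle(df: pd.DataFrame, corr_method: str = "spearman",
                       abs_corr: bool = False) -> pd.DataFrame:
    """Heatmap of the lower correlation triangle over all numerical columns."""
    num_df = df.select_dtypes(include="number")
    corr = num_df.corr(method=corr_method)  # 'pearson', 'spearman', or 'kendall'
    if abs_corr:
        corr = corr.abs()
    mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper half
    sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()
    return corr
```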

The correlation triangle is displayed as a heatmap, allowing you to quickly identify highly correlated pairs by color. In this example, we observe a strong Spearman correlation between previous (the number of contacts made with each client before this campaign) and pdays (the number of days since the client was last contacted).

The code below utilizes this correlation triangle to list all pairs with correlations exceeding the specified threshold.
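A minimal sketch of that step, reusing the corr matrix from the sketch above (the helper name high_corr_pairs and the keep-first/drop-second rule are assumptions):

```python
import pandas as pd

def high_corr_pairs(corr: pd.DataFrame, threshold: float = 0.8):
    """Return pairs whose absolute correlation exceeds the threshold,
    plus a drop list with one variable from each pair."""
    pairs, drop_list = [], []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
                drop_list.append(cols[j])  # keep the first variable, drop the second
    return pairs, sorted(set(drop_list))
```

The returned drop list can then be passed to df.drop(columns=...) to remove the redundant features.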

The output aligns with the visualization above. By using the output drop list, we can directly remove unnecessary features from the data.

💻 Look into code details here >>
Multicollinearity
Sometimes, no strong correlation shows up between any pair of variables, yet it exists across a combination of more than 2 variables. This type of "correlation" is multicollinearity.
VIF (Variance Inflation Factor) is often used to measure multicollinearity. It provides an index of how much the variance of an estimated regression coefficient is inflated because of collinearity.
Typically, a VIF score above 10 indicates high multicollinearity, suggesting that variables with VIFs exceeding this threshold can be removed. The code below creates a dataframe of VIF scores for each numerical variable in the dataset.
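A minimal sketch of building that dataframe, assuming statsmodels is available (the helper name compute_vif is hypothetical; add_constant adds an intercept column, the standard practice when computing VIFs):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def compute_vif(df: pd.DataFrame) -> pd.DataFrame:
    """One VIF score per numerical variable in the dataset."""
    num_df = add_constant(df.select_dtypes(include="number").dropna())
    vif_df = pd.DataFrame({
        "variable": num_df.columns,
        "vif": [variance_inflation_factor(num_df.values, i)
                for i in range(num_df.shape[1])],
    })
    return vif_df[vif_df["variable"] != "const"]  # drop the intercept row
```

Since each score requires fitting a regression behind the scenes, writing vif_df out with to_csv is a cheap way to cache the result, as suggested below.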

In the example below, the threshold is set to 5. Calculating the VIF for each variable can sometimes take a while, so saving the output vif_df as a file can save significant time if you plan to test different VIF thresholds.

💻 Look into code details here >>
Association between Categorical Variables
To measure the association between categorical variables, there are 2 popular choices:
Pearson's Chi2 Test: used to determine whether there is a statistically significant difference between the observed and expected frequencies across one or more categories of a contingency table. A contingency table is a matrix that shows the frequency distribution of the variables.
Cramer's V: measures the association between 2 categorical variables, based on Pearson's Chi2 Test.
You can consider them essentially the same method.
But in the Python implementation, there are some differences. The code below allows you to choose either Chi2 or Cramer's V. For Chi2, the output mainly relies on the p-value to determine association. A significance level, often set at 0.05, is used as a threshold. If the p-value is lower than this threshold, we can conclude that an association exists.
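A minimal sketch covering both options, assuming scipy (the helper name cat_association is hypothetical; Cramer's V is derived here from the Chi2 statistic with the usual formula):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cat_association(df: pd.DataFrame, col1: str, col2: str, method: str = "chi2"):
    """Association between two categorical columns via Chi2 p-value or Cramer's V."""
    table = pd.crosstab(df[col1], df[col2])  # contingency table of frequencies
    chi2, p_value, dof, expected = chi2_contingency(table)
    if method == "chi2":
        return p_value  # association exists if below the significance level (e.g., 0.05)
    # Cramer's V = sqrt(chi2 / (n * (min(table dimensions) - 1))), ranging from 0 to 1
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```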

In contrast, Cramer's V is more straightforward, as it produces a value representing the strength of the association. A higher value indicates a stronger association, making its output easier to interpret. Look at this output:

💻 Look into code details here >>
Association between Categorical & Numerical Variables
ANOVA (Analysis of Variance) examines the differences among averages. Its null hypothesis assumes there is no difference among the averages.
When assessing the association between a numerical variable and a categorical variable, ANOVA groups the numerical values by category and applies the f_oneway test to check whether the group averages are equal. If the resulting p-value is below the significance threshold (often set at 0.05), the null hypothesis is rejected, indicating a difference among the averages and suggesting an association between the categorical and numerical variables.
The code below demonstrates how to use ANOVA to assess the association between a categorical and a numerical variable.
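A minimal sketch following the reasoning above, using scipy's f_oneway (the helper name anova_association is hypothetical):

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_association(df: pd.DataFrame, cat_col: str, num_col: str,
                      alpha: float = 0.05):
    """Test whether num_col's averages differ across the categories of cat_col."""
    groups = [g[num_col].dropna().to_numpy() for _, g in df.groupby(cat_col)]
    f_stat, p_value = f_oneway(*groups)
    return f_stat, p_value, p_value < alpha  # True suggests an association
```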

Using our campaign data, the output looks like:
