By Graham Chalfant

Regression Refresh

Updated: Mar 24

In this article, I revisit regression analysis to refresh my statistical skills, providing a step-by-step walkthrough of the entire process, from data preprocessing to model evaluation, and exploring the impact of various factors on insurance costs using a real-world dataset.


Dataset:


Medical Costs Personal Data Set: Kaggle Dataset


Columns Description:
  • age: The age of the primary beneficiary.

  • sex: The gender of the insurance contractor (female, male).

  • bmi: Body Mass Index, an objective measure of body weight relative to height, calculated as weight in kilograms divided by height in meters squared (kg/m²); values from 18.5 to 24.9 are generally considered ideal.

  • children: The number of children/dependents covered by the health insurance plan.

  • smoker: Whether the beneficiary smokes (yes, no).

  • region: The beneficiary's residential area in the US (northeast, southeast, southwest, northwest).

  • charges: Individual medical costs billed by health insurance.


Exploratory Data Analysis (EDA):


Upon importing the dataset, I reviewed the summary statistics, data types, and checked for missing values. The data appeared to be in good order with no missing values. However, the data types required conversion, and string variables needed encoding to be used in regression analysis, as regression models require numerical input.
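
As a sketch of this first pass with pandas (assuming the Kaggle CSV has been saved locally as insurance.csv, a hypothetical filename):

import pandas as pd

# Load the dataset (hypothetical local path for the Kaggle CSV)
df = pd.read_csv("insurance.csv")

print(df.describe())       # summary statistics
print(df.dtypes)           # data types of each column
print(df.isnull().sum())   # missing values per column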


The variables sex, children, smoker, and region were converted from object types to categorical types to denote that these features contain a finite number of categories and to ensure compatibility with certain encoding techniques, which often require categorical data. After conversion, encoding was applied. For instance, the smoker variable, initially containing "yes" or "no", was transformed to contain 1 for "yes" and 0 for "no".
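
The exact encoding used isn't shown in the post, but a minimal sketch with pandas category codes looks like this (categories are ordered alphabetically, so smoker's codes work out to no = 0, yes = 1):

# Convert the string columns to categorical, then replace each
# category with its integer code; alphabetical ordering gives
# smoker: no -> 0, yes -> 1 and sex: female -> 0, male -> 1
for col in ["sex", "smoker", "region"]:
    df[col] = df[col].astype("category").cat.codes

# children is already an integer count; the post also marks it as
# categorical, but it can stay numeric for the regression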


Further visualizations were conducted to understand the data structure and its relationship with charges. Outliers were assessed using boxplots, which showed values for charges and BMI falling outside the whiskers. The regression was initially fit with the outliers included; they were later removed to address the skewness and high kurtosis observed in the regression results.
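
The post doesn't show the exact removal rule, but boxplot whiskers conventionally sit at 1.5 times the interquartile range, so a plausible sketch is:

# Keep only rows whose charges and bmi fall within the boxplot whiskers
def within_whiskers(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df = df[within_whiskers(df["charges"]) & within_whiskers(df["bmi"])]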




Age and smoker status appeared to segment the data in the relationship plots, suggesting a potential association with charges. The impact of the number of children and region on costs was less clear.

Histograms of continuous variables showed a skew in charges, indicating a transformation was needed to satisfy the regression assumption of normality in the dependent variable. A log transformation was applied, successfully normalizing the data, as seen below.
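
A minimal sketch of the transformation with numpy (charges are strictly positive, so the log is safe):

import numpy as np

# Log-transform the skewed target; skewness near 0 indicates success
df["charges_log"] = np.log(df["charges"])
print(df["charges_log"].skew())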






I then tested for linear relationships using the Pearson correlation coefficient and for multicollinearity using the Variance Inflation Factor (VIF). As visualized earlier, age and smoker status exhibited the strongest correlations with charges, with the number of children showing a weaker positive correlation.


Pearson Correlation Coefficients (with charges_log):

age          0.614647
sex         -0.049231
bmi         -0.023021
children     0.176349
smoker       0.489556
region      -0.086837
charges_log  1.000000
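
One way to reproduce a table like this with pandas:

# Pearson correlations of every numeric column with the log target
print(df.corr(numeric_only=True)["charges_log"])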


All predictors displayed acceptable VIF scores (below 5), suggesting an absence of significant multicollinearity, which, if present, could obscure the interpretation of predictive relationships within the model. (The large value for the constant term is expected and can be ignored.)


Variance Inflation Factor (VIF):

const        306.653619
age            2.431835
sex            1.010290
bmi            1.108682
children       1.088026
smoker         2.108660
region         1.045385
charges_log    3.383937
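
A sketch of the VIF computation with statsmodels; a constant is added to the design matrix, which is why the table includes a const row:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

cols = ["age", "sex", "bmi", "children", "smoker", "region", "charges_log"]
X = sm.add_constant(df[cols])

# One VIF per column of the design matrix
vif = pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif)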


Modeling:


The regression model was constructed using the statsmodels library. The dataset was divided into training and test sets with a 70:30 split, using random sampling via sklearn's train_test_split.
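
A minimal sketch of the split and fit (the random seed and exact model specification are assumptions, since they aren't shown in the post):

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

features = ["age", "sex", "bmi", "children", "smoker", "region"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["charges_log"], test_size=0.3, random_state=42
)

# Ordinary least squares with an intercept term
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())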



Interpretation of Regression Results:
  • Dependent Variable (charges_log): The value the model attempts to predict.

  • R-squared (0.707): 70.7% of the variability in charges_log is accounted for by the model.

  • Adj. R-squared (0.705): Like R-squared, but adjusted for the number of predictors in the model. Its closeness to R-squared suggests the number of predictors is well suited to the model's predictive capacity.

  • Df Residuals (832): The number of observations minus the number of estimated parameters, i.e., the degrees of freedom left for estimating the error.

  • Coefficients: The estimated change in charges_log for a one-unit change in each predictor, while holding other variables constant.

  • P>|t|: The p-value for each coefficient; here all predictors are statistically significant, with p-values substantially below 0.05.

  • Omnibus (391.153): This high value suggests that the residuals may not be normally distributed.

  • Prob(Omnibus) (0.000): The low probability here supports the Omnibus test in indicating the residuals are not normally distributed.

  • Skew (2.160): Positive skew indicates that the residuals are skewed to the right, not symmetrically distributed.

  • Kurtosis (8.960): This is higher than the kurtosis of a normal distribution (which is 3), indicating that the residuals have heavier tails than the normal distribution.

  • Durbin-Watson (2.047): A value close to 2 indicates no major concern with autocorrelation in the residuals.

  • Jarque-Bera (JB) (1893.969): This high value indicates that the residuals do not follow a normal distribution.

  • Prob(JB) (0.00): The low probability confirms that the residuals significantly deviate from normality.

  • Cond. No. (312): This suggests a moderate level of multicollinearity, but not so high as to be a major concern (which would typically be a condition number in the thousands).


Residual Plots:


The high values for Omnibus and Jarque-Bera, along with the pronounced skewness and kurtosis, suggest that the residuals do not conform to a normal distribution, which could affect the reliability of the coefficient estimates. These findings were confirmed visually using the standard residual plots below.




Residual Analysis:
  • Q-Q Plot: Deviations from the line in the tails confirm that the residuals exhibit non-normality with heavy tails.

  • Residuals vs. Predicted Values Plot: The pattern observed suggests heteroscedasticity and potential non-linear relationships.
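
A sketch of how these two diagnostic plots can be produced with matplotlib and statsmodels, assuming the fitted model object from the modeling step:

import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot: residual quantiles against the standard normal
sm.qqplot(model.resid, line="45", fit=True, ax=ax1)
ax1.set_title("Q-Q plot of residuals")

# Residuals vs. predicted values: look for funnels or curvature
ax2.scatter(model.fittedvalues, model.resid, alpha=0.5)
ax2.axhline(0, color="red", linestyle="--")
ax2.set_xlabel("Predicted charges_log")
ax2.set_ylabel("Residuals")
ax2.set_title("Residuals vs. predicted values")

plt.tight_layout()
plt.show()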


Conclusion and Next Steps:


Based on these diagnostics, next steps include further investigation of outliers (already partially addressed), transformations of the independent variables, and more robust regression approaches that can accommodate the identified deviations from normality and homoscedasticity. If you'd like to see the Python code used for this project, you can find it on my GitHub using the link below.


