2 Exploratory Data Analysis

2.1 Univariate Analysis

Code
import plotly.express as px  # needed for the interactive plots below

# Select the numeric columns for univariate analysis
numeric_cols = df_clean.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Histogram with a marginal box plot for each numeric variable
for col in numeric_cols:
    fig = px.histogram(df_clean, x=col, marginal="box", title=f'Distribution of {col}')
    fig.show()

# Wide-form box plot comparing all numeric variables side by side
fig = px.box(df_clean[numeric_cols], title="Boxplots of Numerical Variables")
fig.show()

Distribution Insights

The univariate analysis confirms that our primary continuous variables, most notably Hours_Studied and Exam_Score, approximate a normal distribution. This is convenient for parametric models like Linear Regression, as a roughly symmetric target reduces the risk of skewed residuals. Attendance is heavily left-skewed (clustering between 80% and 100%), which aligns with expected academic behavior but suggests we should monitor this feature for ceiling effects during modeling.
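As a quick numeric check of these visual impressions, the short sketch below computes per-column skewness with pandas' built-in skew() (reusing df_clean and numeric_cols from above); values near zero suggest approximate symmetry, while negative values indicate a left tail.

Code
# Skewness per numeric variable: values near 0 support the normality claim for
# Hours_Studied and Exam_Score; a clearly negative value is expected for Attendance
skewness = df_clean[numeric_cols].skew().sort_values()
print(skewness.round(2))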

2.2 Outlier Analysis (IQR Method)

Code
def detect_outliers_iqr(data, columns):
    """Count values outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for each column."""
    outliers_counts = {}
    for col in columns:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers_counts[col] = len(data[(data[col] < lower_bound) | (data[col] > upper_bound)])
    return outliers_counts

outliers_dict = detect_outliers_iqr(df_clean, numeric_cols)
print("Outliers detected per variable:", outliers_dict)
Outliers detected per variable: {'Hours_Studied': 43, 'Attendance': 0, 'Sleep_Hours': 0, 'Previous_Scores': 0, 'Tutoring_Sessions': 429, 'Physical_Activity': 0, 'Exam_Score': 103}

Outlier Treatment Strategy

Using the Interquartile Range (IQR) method, we detected outliers in Hours_Studied (43), Exam_Score (103), and, most prominently, Tutoring_Sessions (429). Because these represent genuine, plausible extreme values (e.g., exceptional students or those requiring intensive tutoring) rather than measurement errors, we will retain them. Capping or removing these valid edge cases could strip the model of its ability to predict high-variance scenarios. The large count for Tutoring_Sessions is also typical of a discrete, right-skewed variable, where the IQR fences likely flag students with unusually many sessions rather than true anomalies.
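To put these raw counts in perspective, a minimal sketch (reusing outliers_dict and df_clean from above) expresses them as a share of all observations:

Code
# Convert outlier counts into percentages of the dataset to judge their weight
n_rows = len(df_clean)
outlier_share = {col: round(100 * count / n_rows, 2) for col, count in outliers_dict.items()}
print("Outlier share per variable (%):", outlier_share)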

2.3 Bivariate Analysis (Correlations)

Code
# Categorical features to compare against the target
categorical_cols = ['Parental_Involvement', 'Access_to_Resources', 'Motivation_Level', 'Internet_Access',
                    'Family_Income', 'Teacher_Quality', 'School_Type', 'Peer_Influence',
                    'Learning_Disabilities', 'Parental_Education_Level', 'Distance_from_Home', 'Gender']

# Box plot of Exam_Score across the levels of each categorical feature
for col in categorical_cols:
    fig = px.box(df_clean, x=col, y='Exam_Score', title=f'{col} vs Exam_Score')
    fig.show()

# Correlation matrix of the numeric variables
corr = df_clean[numeric_cols].corr()
fig = px.imshow(corr, text_auto=".2f", color_continuous_scale='RdBu_r', title="Correlation Matrix")
fig.show()

Correlation and Multicollinearity Analysis

The correlation matrix provides our first strong indicator of predictive power. We observe a robust positive correlation between Attendance and Exam_Score, as well as between Hours_Studied and Exam_Score. Fortunately, the predictor variables themselves do not exhibit dangerously high inter-correlations (multicollinearity), so models that are sensitive to correlated features (like Linear Regression) should yield stable and interpretable coefficients.
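Because pairwise correlations can miss multicollinearity that only emerges from combinations of predictors, a complementary check is the variance inflation factor (VIF). The sketch below is an illustrative addition, not part of the original pipeline, and assumes statsmodels is available; a common rule of thumb treats VIF values above roughly 5-10 as a warning sign.

Code
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF per numeric predictor (target Exam_Score excluded); the intercept column
# added by add_constant is needed for the computation and dropped from the output
predictors = add_constant(df_clean[[c for c in numeric_cols if c != 'Exam_Score']])
vif = pd.Series(
    [variance_inflation_factor(predictors.values, i) for i in range(predictors.shape[1])],
    index=predictors.columns,
)
print(vif.drop('const').round(2))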