numeric_cols = df_clean.select_dtypes(include=['int64', 'float64']).columns.tolist()

for col in numeric_cols:
    fig = px.histogram(df_clean, x=col, marginal="box", title=f'Distribution of {col}')
    fig.show()

fig = px.box(df_clean[numeric_cols], title="Boxplots of Numerical Variables")
fig.show()
Distribution Insights

The univariate analysis indicates that our primary continuous variables, most notably Hours_Studied and Exam_Score, approximate a normal distribution. This is a favorable starting point for parametric models like Linear Regression: the model's normality assumption applies to the residuals rather than to the variables themselves, but a roughly symmetric target makes heavily skewed residuals less likely. Attendance is heavily left-skewed (clustering between 80% and 100%), which aligns with expected academic behavior but suggests we should monitor this feature for ceiling effects during modeling.
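The visual impression of symmetry or skew can be cross-checked numerically. The following is a minimal sketch, not the analysis actually run here: the helper name `skew_report`, the 0.5 cutoff, and the toy values are all assumptions made for illustration. It relies only on pandas' `skew()`, which returns the sample skewness per column.

```python
import pandas as pd


def skew_report(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return sample skewness per numeric column with a rough shape label.

    The 0.5 threshold is an illustrative rule of thumb, not a fixed standard.
    """
    skew = df.select_dtypes(include="number").skew()

    def label(value: float) -> str:
        if value > threshold:
            return "right-skewed"
        if value < -threshold:
            return "left-skewed"
        return "roughly symmetric"

    return pd.DataFrame({"skewness": skew, "shape": skew.map(label)})


# Toy values for illustration only; these are NOT rows from the study dataset.
toy = pd.DataFrame({
    "Attendance": [60, 85, 90, 92, 95, 96, 98, 99, 100, 100],
    "Hours_Studied": [10, 14, 17, 19, 20, 20, 21, 23, 26, 30],
})

report = skew_report(toy)
print(report)
```

On the real data, the same call would be `skew_report(df_clean)`. A negative skewness for Attendance would be consistent with the left skew seen in the histogram.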
Using the Interquartile Range (IQR) method, we identified a small subset of outliers, primarily in Tutoring_Sessions and Exam_Score. Because these represent genuine, plausible extreme values (e.g., exceptional students or those requiring intensive help) rather than measurement errors, we will retain them. Capping or removing these valid edge cases could weaken the model's ability to predict scores at the extremes of the range.
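For reference, the IQR rule can be sketched as follows. This assumes the conventional multiplier of 1.5, which the analysis above does not state explicitly. The function name `iqr_outlier_mask` and the toy series are hypothetical and serve only to show the mechanics.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional choice and is an assumption here.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy values for illustration only; these are NOT rows from the study dataset.
toy_sessions = pd.Series([0, 1, 1, 1, 2, 2, 2, 3, 3, 8], name="Tutoring_Sessions")

mask = iqr_outlier_mask(toy_sessions)
flagged = toy_sessions[mask]  # rows flagged as outliers; inspected, not dropped
print(flagged)
```

Returning a boolean mask rather than a filtered frame keeps the decision to retain or remove separate from detection, which matches the choice here to keep the flagged rows.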
The correlation matrix provides our first strong indicator of predictive power. We observe a robust positive correlation between Attendance and Exam_Score, as well as between Hours_Studied and Exam_Score. The predictor variables themselves do not exhibit dangerously high inter-correlations (multicollinearity), so models that are sensitive to correlated features (like Linear Regression) should yield stable and interpretable coefficients.
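Both checks described here, correlation with the target and correlation among predictors, can be read off a single Pearson matrix. The sketch below is illustrative only: `correlation_summary` is a hypothetical helper, the 0.8 cutoff for a "dangerously high" pair is an assumed rule of thumb, and the toy frame is synthetic.

```python
import pandas as pd


def correlation_summary(df: pd.DataFrame, target: str, threshold: float = 0.8):
    """Return (correlations with the target, highly correlated predictor pairs).

    threshold = 0.8 is an assumed rule of thumb for flagging multicollinearity.
    """
    corr = df.select_dtypes(include="number").corr()  # Pearson by default

    # Correlation of each predictor with the target, strongest first.
    target_corr = corr[target].drop(target).sort_values(ascending=False)

    # Predictor-vs-predictor block; scan each unordered pair once.
    predictors = corr.drop(index=target, columns=target)
    cols = list(predictors.columns)
    high_pairs = [
        (a, b, float(predictors.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if abs(predictors.loc[a, b]) >= threshold
    ]
    return target_corr, high_pairs


# Synthetic toy values for illustration only; NOT rows from the study dataset.
toy = pd.DataFrame({
    "Hours_Studied": [10, 10, 20, 20],
    "Attendance": [70, 90, 70, 90],
    "Exam_Score": [69, 73, 74, 78],
})

target_corr, high_pairs = correlation_summary(toy, target="Exam_Score")
print(target_corr)
print(high_pairs)
```

An empty `high_pairs` list corresponds to the situation described above: predictors that correlate with the target but not strongly with each other. Pairwise correlation is only a first screen. It can miss collinearity involving three or more features, for which variance inflation factors are a common follow-up.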