To ensure a fair comparison across diverse algorithms, especially scale-sensitive models such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN), we must scale our data. We use a Pipeline combined with a ColumnTransformer to apply a StandardScaler to numerical features (giving each a mean of 0 and a variance of 1) and a OneHotEncoder to categorical features.
We evaluated five distinct algorithms using R², Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Linear Regression was the best-performing model, achieving an R² of ~0.825 on the test set. The fact that an unregularized linear model outperformed more complex ensemble methods (such as Random Forest and Gradient Boosting) suggests that the relationship between our engineered features and exam scores is predominantly linear. The MAE indicates that the best model's predictions deviate from the actual exam score by less than half a point on average, which is accurate enough for practical use.
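The three metrics can be computed as in the sketch below. It runs on synthetic, deliberately linear data and compares only two of the model families, so the numbers it prints say nothing about the results reported above; the data, the `models` and `results` dictionaries, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 60 + X @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical model dictionary standing in for the report's pipelines.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "R2": r2_score(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred),
        # RMSE is the square root of the mean squared error.
        "RMSE": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

RMSE penalizes large errors more heavily than MAE, so RMSE is always at least as large as MAE; a wide gap between the two points to a few large misses.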
3.2.1.1 Analysis of the Best Model (Linear Regression)
# Imports needed by this cell (skip if already imported earlier)
import json

import joblib
import pandas as pd
import plotly.express as px

best_model_name = 'Linear Regression'
best_pipeline = reg_pipelines[best_model_name]
lr_model = best_pipeline.named_steps['model']
y_pred = best_pipeline.predict(X_test)

# Predicted vs. actual values; the dashed red diagonal marks perfect predictions
fig = px.scatter(x=y_test, y=y_pred, labels={'x': 'Real Values', 'y': 'Predictions'}, title=f'Predictions vs Real Values - {best_model_name}')
fig.add_shape(type="line", x0=y_test.min(), y0=y_test.min(), x1=y_test.max(), y1=y_test.max(), line=dict(color="Red", dash="dash"))
fig.show()

# Coefficients
feature_names = preprocessor.get_feature_names_out()
coef_df = pd.DataFrame({'Feature': feature_names, 'Coef': lr_model.coef_})
# Strip the ColumnTransformer prefixes and underscores for readable axis labels
coef_df['Feature'] = coef_df['Feature'].str.replace('cat__', '').str.replace('num__', '').str.replace('_', ' ')
coef_df = coef_df.sort_values(by='Coef', ascending=False)
fig = px.bar(coef_df, x='Coef', y='Feature', orientation='h', color='Coef', color_continuous_scale='RdBu_r', title='Impact of Variables on Final Grade')
fig.show()

### Exporting Model Weights
# We export the coefficients and intercept for usage in index.qmd
# Also exporting the full pipeline for easier use in Python code

# Prepare weights and intercept
feature_names_raw = preprocessor.get_feature_names_out()
weights_data = {
    'intercept': float(lr_model.intercept_),
    'coefficients': dict(zip(feature_names_raw, lr_model.coef_.tolist()))
}

# Save to JSON for potential Javascript/OJS use
with open('model_weights.json', 'w') as f:
    json.dump(weights_data, f, indent=4)

# Save the full pipeline as a joblib file for Python usage
joblib.dump(best_pipeline, 'model_pipeline.joblib')

# Save to CSV for easy inspection
coef_df_export = pd.DataFrame({'Feature': feature_names_raw, 'Weight': lr_model.coef_})
coef_df_export = pd.concat([pd.DataFrame({'Feature': ['Intercept'], 'Weight': [lr_model.intercept_]}), coef_df_export])
coef_df_export.to_csv('model_weights.csv', index=False)
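As a rough illustration of how exported weights of this shape could be consumed, the sketch below rebuilds a prediction by hand from an intercept and a name-to-coefficient mapping. The toy model, the feature names `num__a` and `num__b`, and the helper `predict_from_weights` are hypothetical. The inputs are assumed to be already transformed (scaled and one-hot encoded), since the coefficients apply to the preprocessed features, not the raw columns.

```python
import json

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on two already-transformed features (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1]
lr = LinearRegression().fit(X, y)

feature_names = ["num__a", "num__b"]  # hypothetical names
weights = {
    "intercept": float(lr.intercept_),
    "coefficients": dict(zip(feature_names, lr.coef_.tolist())),
}

# Round-trip through JSON in memory, mimicking a save followed by a load.
payload = json.loads(json.dumps(weights))


def predict_from_weights(weights, row):
    """Linear prediction: intercept + sum(coefficient * transformed value)."""
    return weights["intercept"] + sum(
        weights["coefficients"][name] * value for name, value in row.items()
    )


row = dict(zip(feature_names, X[0]))
manual = predict_from_weights(payload, row)
reference = lr.predict(X[:1])[0]
print(manual, reference)
```

For raw, untransformed inputs, loading the saved pipeline with `joblib.load` and calling `predict` is the simpler route, because the pipeline applies the preprocessing itself.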
Unsupervised learning lets us move beyond score prediction and segment the student body into behavioral archetypes. We used the Elbow Method and the Silhouette Score to choose the number of clusters (\(k\)). The PCA visualization projects our high-dimensional feature space onto two principal components; in that projection, the K-Means clusters appear as distinct, largely non-overlapping student profiles based on study habits and physical activity. These clusters can be translated into targeted academic intervention strategies.
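The selection of \(k\) and the PCA projection could be sketched as follows. The data here comes from `make_blobs` with hand-picked centers, so it is synthetic and purely illustrative; the range of candidate values for \(k\) and the K-Means settings are likewise assumptions, not the configuration used in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 5-dimensional data with three groups (illustrative only).
centers = np.array([
    [0, 0, 0, 0, 0],
    [8, 8, 8, 8, 8],
    [-8, 8, -8, 8, -8],
])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method uses inertia; the Silhouette Score measures cluster separation.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# Project onto two principal components for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
print(best_k, pca.explained_variance_ratio_.sum())
```

The share of variance explained by the two components is worth checking before reading too much into the plot: clusters that overlap in two dimensions may still be separated in the full feature space, and the reverse can also happen.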