Data Preparation

1 Data Preparation

1.1 Initial Setup

Loading Libraries

Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

1.2 Data Preparation

1.2.1 Data Loading and Quick View

Code
# =========================== Dataset Loading ===========================
# Note: Using relative path for portability
path = 'data/StudentPerformanceFactors.csv'
df = pd.read_csv(path)
print(f" Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")

# === Show first rows of the dataset ===
print('First 5 rows')
display(df.head())
 Rows: 6,607 | Columns: 20
First 5 rows
Hours_Studied Attendance Parental_Involvement Access_to_Resources Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type Peer_Influence Physical_Activity Learning_Disabilities Parental_Education_Level Distance_from_Home Gender Exam_Score
0 23 84 Low High No 7 73 Low Yes 0 Low Medium Public Positive 3 No High School Near Male 67
1 19 64 Low Medium No 8 59 Low Yes 2 Medium Medium Public Negative 4 No College Moderate Female 61
2 24 98 Medium Medium Yes 7 91 Medium Yes 2 Medium Medium Public Neutral 4 No Postgraduate Near Male 74
3 29 89 Low Medium Yes 8 98 Medium Yes 1 Medium Medium Public Negative 4 No High School Moderate Male 71
4 19 92 Medium Medium Yes 6 65 Medium Yes 3 Medium High Public Neutral 4 No College Near Female 70

1.2.2 Variable Description

  • Gender: Student’s gender (Male/Female).
  • Hours_Studied: Average number of study hours per week.
  • Attendance: Percentage of attendance during the academic period.
  • Parental_Involvement: Level of parental involvement (Low, Medium, High).
  • Access_to_Resources: Availability of educational resources (Low, Medium, High).
  • Extracurricular_Activities: Participation in extracurricular activities (Yes/No).
  • Sleep_Hours: Average number of sleep hours per night.
  • Previous_Scores: Academic score obtained in previous evaluations.
  • Motivation_Level: Student’s motivation level (Low, Medium, High).
  • Internet_Access: Availability of internet access (Yes/No).
  • Tutoring_Sessions: Number of tutoring sessions per month.
  • Family_Income: Family income level (Low, Medium, High).
  • Teacher_Quality: Quality of the teacher (Low, Medium, High).
  • School_Type: Type of school (Public/Private).
  • Peer_Influence: Influence of the peer group (Positive, Neutral, Negative).
  • Physical_Activity: Average hours of physical activity per week.
  • Learning_Disabilities: Presence of learning difficulties (Yes/No).
  • Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate).
  • Distance_from_Home: Distance between home and school (Near, Moderate, Far).
  • Exam_Score: Final score obtained (Target variable).
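Several of these features (Parental_Involvement, Access_to_Resources, Motivation_Level, Family_Income, Teacher_Quality) share an ordered Low/Medium/High scale. As a minimal sketch, and assuming we later want to preserve that order rather than one-hot encode it away, such columns can be mapped to integer codes with an explicit category order (the `encode_ordinal` helper below is hypothetical, not part of the dataset's tooling):

```python
import pandas as pd

# The Low/Medium/High columns carry an inherent order; an explicit ordinal
# mapping preserves it, whereas plain one-hot encoding would discard it.
ORDINAL_LEVELS = ["Low", "Medium", "High"]

def encode_ordinal(series: pd.Series, levels=ORDINAL_LEVELS) -> pd.Series:
    """Map an ordered categorical to integer codes (Low=0, Medium=1, High=2)."""
    cat = pd.Categorical(series, categories=levels, ordered=True)
    return pd.Series(cat.codes, index=series.index)

# Toy example (not the real dataset)
s = pd.Series(["Low", "High", "Medium", "High"])
print(encode_ordinal(s).tolist())  # [0, 2, 1, 2]
```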

1.2.3 Data Summary

Code
df.info()
display(df.describe())
<class 'pandas.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Hours_Studied               6607 non-null   int64
 1   Attendance                  6607 non-null   int64
 2   Parental_Involvement        6607 non-null   str  
 3   Access_to_Resources         6607 non-null   str  
 4   Extracurricular_Activities  6607 non-null   str  
 5   Sleep_Hours                 6607 non-null   int64
 6   Previous_Scores             6607 non-null   int64
 7   Motivation_Level            6607 non-null   str  
 8   Internet_Access             6607 non-null   str  
 9   Tutoring_Sessions           6607 non-null   int64
 10  Family_Income               6607 non-null   str  
 11  Teacher_Quality             6529 non-null   str  
 12  School_Type                 6607 non-null   str  
 13  Peer_Influence              6607 non-null   str  
 14  Physical_Activity           6607 non-null   int64
 15  Learning_Disabilities       6607 non-null   str  
 16  Parental_Education_Level    6517 non-null   str  
 17  Distance_from_Home          6540 non-null   str  
 18  Gender                      6607 non-null   str  
 19  Exam_Score                  6607 non-null   int64
dtypes: int64(7), str(13)
memory usage: 1.4 MB
Hours_Studied Attendance Sleep_Hours Previous_Scores Tutoring_Sessions Physical_Activity Exam_Score
count 6607.000000 6607.000000 6607.00000 6607.000000 6607.000000 6607.000000 6607.000000
mean 19.975329 79.977448 7.02906 75.070531 1.493719 2.967610 67.235659
std 5.990594 11.547475 1.46812 14.399784 1.230570 1.031231 3.890456
min 1.000000 60.000000 4.00000 50.000000 0.000000 0.000000 55.000000
25% 16.000000 70.000000 6.00000 63.000000 1.000000 2.000000 65.000000
50% 20.000000 80.000000 7.00000 75.000000 1.000000 3.000000 67.000000
75% 24.000000 90.000000 8.00000 88.000000 2.000000 4.000000 69.000000
max 44.000000 100.000000 10.00000 100.000000 8.000000 6.000000 101.000000

1.2.4 Data Cleaning

1.2.4.1 Nulls and Duplicates
Code
null_counts = df.isnull().sum()
print("Null count per column:\n")
print(null_counts[null_counts > 0])
print("\nNumber of duplicate rows:", df.duplicated().sum())
Null count per column:

Teacher_Quality             78
Parental_Education_Level    90
Distance_from_Home          67
dtype: int64

Number of duplicate rows: 0
1.2.4.2 Null Diagnosis
Code
# Identify columns with missing values and compare the target distribution
# for rows that are null vs. non-null in each of them
nan_cols = [col for col in df.columns if df[col].isna().any()]
df_nan = df[nan_cols + ['Exam_Score']].copy()

for col in nan_cols:
    # Flag each row as null / non-null for this column, then plot Exam_Score
    df_nan['Status'] = df_nan[col].isna().map({True: 'Is Null', False: 'Has Data'})
    fig = px.box(df_nan, x='Status', y='Exam_Score', title=f'Impact of Nulls in {col}')
    fig.show()
1.2.4.3 Null Value Diagnosis and Strategy

Before treating missing data, it is crucial to understand its mechanism. By plotting the distribution of the target variable (Exam_Score) for rows with and without data in Teacher_Quality, Parental_Education_Level, and Distance_from_Home, we can assess whether the data is Missing Completely at Random (MCAR). The box plots reveal no meaningful divergence in exam scores between the null and non-null groups. Because these missing values amount to less than 2% of the dataset, dropping the affected rows or reaching for heavier imputation methods (such as KNN imputation) is unnecessary. We will proceed with mode imputation, which preserves the most frequent category of each feature without materially distorting its distribution.
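The box plots above are a visual check; the same MCAR intuition can be expressed numerically. As a sketch (the `null_impact` helper and the toy frame below are illustrative, not part of the pipeline), one can report, for each column with nulls, the missing fraction and the target mean in the null vs. non-null groups:

```python
import numpy as np
import pandas as pd

# For each column containing nulls, compare the target mean between the
# null and non-null groups and report the missing percentage.
def null_impact(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(target):
        mask = df[col].isna()
        if mask.any():
            rows.append({
                "column": col,
                "pct_missing": round(100 * mask.mean(), 2),
                "target_mean_null": df.loc[mask, target].mean(),
                "target_mean_not_null": df.loc[~mask, target].mean(),
            })
    return pd.DataFrame(rows)

# Toy data standing in for the real df; on the real data, call
# null_impact(df, 'Exam_Score')
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Teacher_Quality": rng.choice(["Low", "Medium", "High", None], size=200),
    "Exam_Score": rng.integers(55, 100, size=200),
})
print(null_impact(toy, "Exam_Score"))
```

Similar group means across the null/non-null split are consistent with (though not proof of) MCAR.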

Action: We impute these values using the mode (most frequent value).

Code
df_clean = df.copy()
cols_to_fix = ['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home']

for col in cols_to_fix:
    if col in df_clean.columns:
        moda = df_clean[col].mode()[0]
        df_clean[col] = df_clean[col].fillna(moda)

print(f"Remaining nulls in df_clean: {df_clean.isnull().sum().sum()}")
Remaining nulls in df_clean: 0
1.2.4.4 Removing Inconsistencies
Code
# Detecting outliers in Exam_Score (> 100)
outliers = df_clean[df_clean['Exam_Score'] > 100]
print(f"Rows with Exam_Score > 100: {len(outliers)}")

# Remove values outside theoretical range (0-100)
df_clean = df_clean[df_clean["Exam_Score"].between(0, 100)].copy()
df_clean.reset_index(drop=True, inplace=True)
Rows with Exam_Score > 100: 1

1.3 Feature Engineering

Feature Engineering Rationale

To capture non-linear relationships and deeper behavioral patterns, we construct composite features:

Study Intensity: Raw study hours can be misleading if a student frequently misses class. By multiplying Hours_Studied by the Attendance percentage, we create a proxy for effective academic engagement.

Academic Effort: This combines independent study hours with Tutoring_Sessions to quantify total time invested in learning.

Study-Sleep Balance: Cognitive performance relies heavily on rest. This metric subtracts estimated weekly sleep hours (nightly hours × 7) from weekly study hours. A strongly negative balance can signal low study effort, while a balance near or above zero can help models flag potential burnout.

Code
df_fe = df_clean.copy()

# Study Intensity
df_fe["Study_Intensity"] = df_fe["Hours_Studied"] * (df_fe["Attendance"] / 100)

# Total Academic Effort
df_fe["Academic_Effort"] = df_fe["Hours_Studied"] + df_fe["Tutoring_Sessions"]

# Study-Sleep Balance
df_fe["Study_Sleep_Balance"] = (df_fe["Hours_Studied"] - (df_fe["Sleep_Hours"] * 7))

display(df_fe[["Study_Intensity", "Academic_Effort", "Study_Sleep_Balance"]].describe())
Study_Intensity Academic_Effort Study_Sleep_Balance
count 6606.000000 6606.000000 6606.000000
mean 15.967318 21.467454 -29.230245
std 5.339210 6.097523 11.838062
min 0.690000 2.000000 -67.000000
25% 12.240000 17.000000 -37.000000
50% 15.750000 21.000000 -29.000000
75% 19.400000 26.000000 -21.000000
max 37.830000 47.000000 10.000000