Data Preparation

1 Data Preparation

1.1 Initial Setup

Loading Libraries

Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

1.2 Data Preparation

1.2.1 Data Loading and Quick View

Code
# =========================== Dataset Loading ===========================
# Note: Using relative path for portability
path = 'data/StudentPerformanceFactors.csv'
df = pd.read_csv(path)
print(f" Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")

# === Show first rows of the dataset ===
print('First 5 rows')
display(df.head())
 Rows: 6,607 | Columns: 20
First 5 rows
Hours_Studied Attendance Parental_Involvement Access_to_Resources Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type Peer_Influence Physical_Activity Learning_Disabilities Parental_Education_Level Distance_from_Home Gender Exam_Score
0 23 84 Low High No 7 73 Low Yes 0 Low Medium Public Positive 3 No High School Near Male 67
1 19 64 Low Medium No 8 59 Low Yes 2 Medium Medium Public Negative 4 No College Moderate Female 61
2 24 98 Medium Medium Yes 7 91 Medium Yes 2 Medium Medium Public Neutral 4 No Postgraduate Near Male 74
3 29 89 Low Medium Yes 8 98 Medium Yes 1 Medium Medium Public Negative 4 No High School Moderate Male 71
4 19 92 Medium Medium Yes 6 65 Medium Yes 3 Medium High Public Neutral 4 No College Near Female 70

1.2.2 Variable Description

  • Gender: Student’s gender (Male/Female).
  • Hours_Studied: Average number of study hours per week.
  • Attendance: Percentage of attendance during the academic period.
  • Parental_Involvement: Level of parental involvement (Low, Medium, High).
  • Access_to_Resources: Availability of educational resources (Low, Medium, High).
  • Extracurricular_Activities: Participation in extracurricular activities (Yes/No).
  • Sleep_Hours: Average number of sleep hours per night.
  • Previous_Scores: Academic score obtained in previous evaluations.
  • Motivation_Level: Student’s motivation level (Low, Medium, High).
  • Internet_Access: Availability of internet access (Yes/No).
  • Tutoring_Sessions: Number of tutoring sessions per month.
  • Family_Income: Family income level (Low, Medium, High).
  • Teacher_Quality: Quality of the teacher (Low, Medium, High).
  • School_Type: Type of school (Public/Private).
  • Peer_Influence: Influence of the peer group (Positive, Neutral, Negative).
  • Physical_Activity: Average hours of physical activity per week.
  • Learning_Disabilities: Presence of learning difficulties (Yes/No).
  • Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate).
  • Distance_from_Home: Distance between home and school (Near, Moderate, Far).
  • Exam_Score: Final score obtained (Target variable).
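Several of these features (Parental_Involvement, Access_to_Resources, Motivation_Level, Family_Income, Teacher_Quality) share an ordered Low/Medium/High scale. As a minimal sketch, and assuming we later want to preserve that order rather than one-hot encode it away, such columns can be mapped to integer codes with an explicit category order (the `encode_ordinal` helper below is hypothetical, not part of the dataset's tooling):

```python
import pandas as pd

# The Low/Medium/High columns carry an inherent order; an explicit ordinal
# mapping preserves it, whereas plain one-hot encoding would discard it.
ORDINAL_LEVELS = ["Low", "Medium", "High"]

def encode_ordinal(series: pd.Series, levels=ORDINAL_LEVELS) -> pd.Series:
    """Map an ordered categorical to integer codes (Low=0, Medium=1, High=2)."""
    cat = pd.Categorical(series, categories=levels, ordered=True)
    return pd.Series(cat.codes, index=series.index)

# Toy example (not the real dataset)
s = pd.Series(["Low", "High", "Medium", "High"])
print(encode_ordinal(s).tolist())  # [0, 2, 1, 2]
```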

1.2.3 Data Summary

Code
df.info()
display(df.describe())
<class 'pandas.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Hours_Studied               6607 non-null   int64
 1   Attendance                  6607 non-null   int64
 2   Parental_Involvement        6607 non-null   str  
 3   Access_to_Resources         6607 non-null   str  
 4   Extracurricular_Activities  6607 non-null   str  
 5   Sleep_Hours                 6607 non-null   int64
 6   Previous_Scores             6607 non-null   int64
 7   Motivation_Level            6607 non-null   str  
 8   Internet_Access             6607 non-null   str  
 9   Tutoring_Sessions           6607 non-null   int64
 10  Family_Income               6607 non-null   str  
 11  Teacher_Quality             6529 non-null   str  
 12  School_Type                 6607 non-null   str  
 13  Peer_Influence              6607 non-null   str  
 14  Physical_Activity           6607 non-null   int64
 15  Learning_Disabilities       6607 non-null   str  
 16  Parental_Education_Level    6517 non-null   str  
 17  Distance_from_Home          6540 non-null   str  
 18  Gender                      6607 non-null   str  
 19  Exam_Score                  6607 non-null   int64
dtypes: int64(7), str(13)
memory usage: 1.4 MB
Hours_Studied Attendance Sleep_Hours Previous_Scores Tutoring_Sessions Physical_Activity Exam_Score
count 6607.000000 6607.000000 6607.00000 6607.000000 6607.000000 6607.000000 6607.000000
mean 19.975329 79.977448 7.02906 75.070531 1.493719 2.967610 67.235659
std 5.990594 11.547475 1.46812 14.399784 1.230570 1.031231 3.890456
min 1.000000 60.000000 4.00000 50.000000 0.000000 0.000000 55.000000
25% 16.000000 70.000000 6.00000 63.000000 1.000000 2.000000 65.000000
50% 20.000000 80.000000 7.00000 75.000000 1.000000 3.000000 67.000000
75% 24.000000 90.000000 8.00000 88.000000 2.000000 4.000000 69.000000
max 44.000000 100.000000 10.00000 100.000000 8.000000 6.000000 101.000000

1.2.4 Data Cleaning

1.2.4.1 Nulls and Duplicates
Code
null_counts = df.isnull().sum()
print("Null count per column:\n")
print(null_counts[null_counts > 0])
print("\nNumber of duplicate rows:", df.duplicated().sum())
Null count per column:

Teacher_Quality             78
Parental_Education_Level    90
Distance_from_Home          67
dtype: int64

Number of duplicate rows: 0
1.2.4.2 Null Diagnosis
Code
# Identify columns with missing values and compare the target distribution
# for rows that are null vs. non-null in each of them
nan_cols = [col for col in df.columns if df[col].isna().any()]
df_nan = df[nan_cols + ['Exam_Score']].copy()

for col in nan_cols:
    # Flag each row as null / non-null for this column, then plot Exam_Score
    df_nan['Status'] = df_nan[col].isna().map({True: 'Is Null', False: 'Has Data'})
    fig = px.box(df_nan, x='Status', y='Exam_Score', title=f'Impact of Nulls in {col}')
    fig.show()
1.2.4.3 Null Value Diagnosis and Strategy

Before treating missing data, it is crucial to understand its mechanism. By plotting the distribution of the target variable (Exam_Score) for rows with and without data in Teacher_Quality, Parental_Education_Level, and Distance_from_Home, we can assess whether the data is Missing Completely at Random (MCAR). The box plots reveal no meaningful divergence in exam scores between the null and non-null groups. Because these missing values amount to less than 2% of the dataset, dropping the affected rows or reaching for heavier imputation methods (such as KNN imputation) is unnecessary. We will proceed with mode imputation, which preserves the most frequent category of each feature without materially distorting its distribution.
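The box plots above are a visual check; the same MCAR intuition can be expressed numerically. As a sketch (the `null_impact` helper and the toy frame below are illustrative, not part of the pipeline), one can report, for each column with nulls, the missing fraction and the target mean in the null vs. non-null groups:

```python
import numpy as np
import pandas as pd

# For each column containing nulls, compare the target mean between the
# null and non-null groups and report the missing percentage.
def null_impact(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(target):
        mask = df[col].isna()
        if mask.any():
            rows.append({
                "column": col,
                "pct_missing": round(100 * mask.mean(), 2),
                "target_mean_null": df.loc[mask, target].mean(),
                "target_mean_not_null": df.loc[~mask, target].mean(),
            })
    return pd.DataFrame(rows)

# Toy data standing in for the real df; on the real data, call
# null_impact(df, 'Exam_Score')
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Teacher_Quality": rng.choice(["Low", "Medium", "High", None], size=200),
    "Exam_Score": rng.integers(55, 100, size=200),
})
print(null_impact(toy, "Exam_Score"))
```

Similar group means across the null/non-null split are consistent with (though not proof of) MCAR.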

Action: We impute these values using the mode (most frequent value).

Code
df_clean = df.copy()
cols_to_fix = ['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home']

for col in cols_to_fix:
    if col in df_clean.columns:
        moda = df_clean[col].mode()[0]
        df_clean[col] = df_clean[col].fillna(moda)

print(f"Remaining nulls in df_clean: {df_clean.isnull().sum().sum()}")
Remaining nulls in df_clean: 0
1.2.4.4 Removing Inconsistencies
Code
# Detecting outliers in Exam_Score (> 100)
outliers = df_clean[df_clean['Exam_Score'] > 100]
print(f"Rows with Exam_Score > 100: {len(outliers)}")

# Remove values outside theoretical range (0-100)
df_clean = df_clean[df_clean["Exam_Score"].between(0, 100)].copy()
df_clean.reset_index(drop=True, inplace=True)
Rows with Exam_Score > 100: 1

1.3 Feature Engineering

Feature Engineering Rationale

To capture non-linear relationships and deeper behavioral patterns, we construct composite features:

Study Intensity: Raw study hours can be misleading if a student frequently misses class. By multiplying Hours_Studied by the Attendance percentage, we create a proxy for effective academic engagement.

Academic Effort: This combines independent study hours with Tutoring_Sessions to quantify total time invested in learning.

Study-Sleep Balance: Cognitive performance relies heavily on rest. This metric subtracts estimated weekly sleep hours (nightly hours × 7) from weekly study hours. A strongly negative balance can signal low study effort, while a balance near or above zero can help models flag potential burnout.

Code
df_fe = df_clean.copy()

# Study Intensity
df_fe["Study_Intensity"] = df_fe["Hours_Studied"] * (df_fe["Attendance"] / 100)

# Total Academic Effort
df_fe["Academic_Effort"] = df_fe["Hours_Studied"] + df_fe["Tutoring_Sessions"]

# Study-Sleep Balance
df_fe["Study_Sleep_Balance"] = (df_fe["Hours_Studied"] - (df_fe["Sleep_Hours"] * 7))

display(df_fe[["Study_Intensity", "Academic_Effort", "Study_Sleep_Balance"]].describe())
Study_Intensity Academic_Effort Study_Sleep_Balance
count 6606.000000 6606.000000 6606.000000
mean 15.967318 21.467454 -29.230245
std 5.339210 6.097523 11.838062
min 0.690000 2.000000 -67.000000
25% 12.240000 17.000000 -37.000000
50% 15.750000 21.000000 -29.000000
75% 19.400000 26.000000 -21.000000
max 37.830000 47.000000 10.000000