# =========================== Dataset Loading ===========================
# Note: Using relative path for portability
path = 'data/StudentPerformanceFactors.csv'
df = pd.read_csv(path)
print(f" Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")

# === Show first rows of the dataset ===
print('First 5 rows')
display(df.head())
Rows: 6,607 | Columns: 20
First 5 rows
| | Hours_Studied | Attendance | Parental_Involvement | Access_to_Resources | Extracurricular_Activities | Sleep_Hours | Previous_Scores | Motivation_Level | Internet_Access | Tutoring_Sessions | Family_Income | Teacher_Quality | School_Type | Peer_Influence | Physical_Activity | Learning_Disabilities | Parental_Education_Level | Distance_from_Home | Gender | Exam_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 84 | Low | High | No | 7 | 73 | Low | Yes | 0 | Low | Medium | Public | Positive | 3 | No | High School | Near | Male | 67 |
| 1 | 19 | 64 | Low | Medium | No | 8 | 59 | Low | Yes | 2 | Medium | Medium | Public | Negative | 4 | No | College | Moderate | Female | 61 |
| 2 | 24 | 98 | Medium | Medium | Yes | 7 | 91 | Medium | Yes | 2 | Medium | Medium | Public | Neutral | 4 | No | Postgraduate | Near | Male | 74 |
| 3 | 29 | 89 | Low | Medium | Yes | 8 | 98 | Medium | Yes | 1 | Medium | Medium | Public | Negative | 4 | No | High School | Moderate | Male | 71 |
| 4 | 19 | 92 | Medium | Medium | Yes | 6 | 65 | Medium | Yes | 3 | Medium | High | Public | Neutral | 4 | No | College | Near | Female | 70 |
1.2.2 Variable Description
Gender: Student’s gender (Male/Female).
Hours_Studied: Average number of study hours per week.
Attendance: Percentage of attendance during the academic period.
Parental_Involvement: Level of parental involvement (Low, Medium, High).
Access_to_Resources: Availability of educational resources (Low, Medium, High).
Extracurricular_Activities: Participation in extracurricular activities (Yes/No).
Sleep_Hours: Average number of sleep hours per night.
Previous_Scores: Academic score obtained in previous evaluations.
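The documented value sets can double as a quick validation pass over the categorical columns. The sketch below is illustrative only: the `EXPECTED_LEVELS` mapping and the `unexpected_levels` helper are hypothetical names that are not part of the original notebook, and the toy frame stands in for `df`.

```python
import pandas as pd

# Hypothetical mapping built from the variable descriptions (an assumption, not original code).
EXPECTED_LEVELS = {
    "Gender": {"Male", "Female"},
    "Parental_Involvement": {"Low", "Medium", "High"},
    "Access_to_Resources": {"Low", "Medium", "High"},
    "Extracurricular_Activities": {"Yes", "No"},
}

def unexpected_levels(frame: pd.DataFrame, expected: dict) -> dict:
    """Return, per column, the observed non-null values that are not in the documented set."""
    problems = {}
    for col, allowed in expected.items():
        if col not in frame.columns:
            continue  # skip columns that this frame does not contain
        observed = set(frame[col].dropna().unique())
        extra = observed - allowed
        if extra:
            problems[col] = extra
    return problems

# Toy frame standing in for df; "Hgh" is a deliberate typo that the check should flag.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Parental_Involvement": ["Low", "Hgh", "Medium"],
    "Access_to_Resources": ["High", "Medium", None],
    "Extracurricular_Activities": ["No", "Yes", "Yes"],
})
report = unexpected_levels(toy, EXPECTED_LEVELS)
print(report)
```

An empty dictionary would mean every observed value matches its description; nulls are ignored here because they are handled separately.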
null_counts = df.isnull().sum()
print("Null count per column:\n")
print(null_counts[null_counts > 0])
print("\nNumber of duplicate rows:", df.duplicated().sum())
Null count per column:
Teacher_Quality 78
Parental_Education_Level 90
Distance_from_Home 67
dtype: int64
Number of duplicate rows: 0
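To put these counts in proportion, they can be expressed as a share of the 6,607 rows. The sketch below re-enters the reported counts by hand rather than reading the CSV, so the variable names are illustrative.

```python
import pandas as pd

# Counts and row total copied from the outputs above.
n_rows = 6607
null_counts = pd.Series(
    {"Teacher_Quality": 78, "Parental_Education_Level": 90, "Distance_from_Home": 67}
)

# Share of missing values per column, in percent.
null_pct = (null_counts / n_rows * 100).round(2)
print(null_pct)

# Upper bound on the share of rows with at least one null
# (reached only if no row is missing more than one of these columns).
max_rows_affected_pct = round(null_counts.sum() / n_rows * 100, 2)
print(max_rows_affected_pct)
```

With the real frame loaded, `df.isna().mean() * 100` gives the per-column shares directly, and `df.isna().any(axis=1).mean() * 100` gives the exact share of affected rows.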
1.2.4.2 Null Diagnosis
nan_cols = [col for col in df.columns if df[col].isna().any()]
df_nan = df[nan_cols + ['Exam_Score']].copy()

for col in nan_cols:
    df_nan['Status'] = df_nan[col].isna().map({True: 'Is Null', False: 'Has Data'})
    fig = px.box(df_nan, x="Status", y='Exam_Score', title=f'Impact of Nulls in {col}')
    fig.show()
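A numeric summary can complement the box plots. The following is a minimal sketch on toy data: `score_by_missingness` is a hypothetical helper, not part of the original notebook. It is a descriptive comparison only, not a formal test of the missingness mechanism.

```python
import numpy as np
import pandas as pd

def score_by_missingness(frame: pd.DataFrame, col: str, target: str = "Exam_Score") -> pd.DataFrame:
    """Summarize the target for rows where `col` is null versus rows where it is present."""
    status = frame[col].isna().map({True: "Is Null", False: "Has Data"})
    return frame.groupby(status)[target].agg(["count", "median", "mean"])

# Toy data standing in for df (values are made up for illustration).
toy = pd.DataFrame({
    "Teacher_Quality": ["High", np.nan, "Low", "Medium", np.nan, "High"],
    "Exam_Score": [70, 68, 66, 72, 70, 64],
})
summary = score_by_missingness(toy, "Teacher_Quality")
print(summary)
```

Similar medians and means across the two groups point the same way as overlapping boxes: missingness in that column does not appear to track the exam score.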
1.2.4.3 Null Value Diagnosis and Strategy
Before treating missing data, it helps to understand the mechanism behind it. Plotting the distribution of the target variable (Exam_Score) for rows with and without data in Teacher_Quality, Parental_Education_Level, and Distance_from_Home shows whether missingness is related to the outcome, which would argue against the values being Missing Completely at Random (MCAR). The box plots show no notable difference in exam scores between the null and non-null groups. This is consistent with MCAR, although a visual check of this kind cannot prove it. Each affected column is missing less than 2% of its 6,607 values (78, 90, and 67 entries respectively), so dropping rows or using more complex methods such as KNN imputation is unnecessary. We proceed with mode imputation, which keeps every row and assigns the most common category; at this rate of missingness, its effect on the category distributions is small.
Action: We impute these values using the mode (most frequent value).
df_clean = df.copy()
cols_to_fix = ['Teacher_Quality', 'Parental_Education_Level', 'Distance_from_Home']

for col in cols_to_fix:
    if col in df_clean.columns:
        moda = df_clean[col].mode()[0]
        df_clean[col] = df_clean[col].fillna(moda)

print(f"Remaining nulls in df_clean: {df_clean.isnull().sum().sum()}")
Remaining nulls in df_clean: 0
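Mode imputation always adds to the most frequent category, so comparing category shares before and after shows how large that shift is. The sketch below uses a made-up ten-value column with 10% missing, far more than in the real data, to make the effect visible.

```python
import numpy as np
import pandas as pd

# Toy column standing in for a categorical feature with one null (values are illustrative).
s = pd.Series(["Medium"] * 5 + ["High"] * 3 + ["Low"] * 1 + [np.nan] * 1, name="Teacher_Quality")

before = s.value_counts(normalize=True)   # shares among non-null values
filled = s.fillna(s.mode()[0])            # mode imputation, same calls as the step above
after = filled.value_counts(normalize=True)

# Change in each category's share caused by the imputation.
shift = (after - before).round(3)
print(shift)
```

The modal category gains share and the others lose a little. With under 2% of values missing per column, the corresponding shift in the real data would be much smaller.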
1.2.4.4 Removing Inconsistencies
# Detecting outliers in Exam_Score (> 100)
outliers = df_clean[df_clean['Exam_Score'] > 100]
print(f"Rows with Exam_Score > 100: {len(outliers)}")

# Remove values outside theoretical range (0-100)
df_clean = df_clean[df_clean["Exam_Score"].between(0, 100)].copy()
df_clean.reset_index(drop=True, inplace=True)
Rows with Exam_Score > 100: 1
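The range filter drops more than the rows counted above: `between(0, 100)` also excludes negative scores and missing scores. A small sketch on toy data (values are illustrative, not from the dataset) shows how each case could be counted before filtering.

```python
import numpy as np
import pandas as pd

# Toy scores covering each case: valid, above range, below range, and missing.
toy = pd.DataFrame({"Exam_Score": [67, 101, 74, -3, np.nan, 100]})

# between() is inclusive on both ends by default; NaN compares as False.
in_range = toy["Exam_Score"].between(0, 100)

n_above = int((toy["Exam_Score"] > 100).sum())
n_below = int((toy["Exam_Score"] < 0).sum())
n_missing = int(toy["Exam_Score"].isna().sum())
print(f"above 100: {n_above} | below 0: {n_below} | missing: {n_missing}")

cleaned = toy[in_range].reset_index(drop=True)
print(len(cleaned))
```

Reporting all three counts makes it clear exactly which rows the filter removes.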
1.3 Feature Engineering
Feature Engineering Rationale
To capture non-linear relationships and deeper behavioral patterns, we construct composite features:
Study Intensity: Raw study hours can be misleading if a student frequently misses class. By multiplying Hours_Studied by the Attendance percentage, we create a proxy for effective academic engagement.
Academic Effort: This combines independent study hours with Tutoring_Sessions to quantify total time invested in learning.
Study-Sleep Balance: Cognitive performance relies heavily on rest. This metric subtracts weekly sleep hours from weekly study hours. A strongly negative value (little study relative to sleep) or a strongly positive one (study displacing sleep) can help models detect lack of effort or burnout, respectively.
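The three composites described above could be computed as follows. This is a sketch under stated assumptions, not the notebook's own implementation: the output column names and the `add_composite_features` helper are hypothetical, Attendance is assumed to be rescaled to a 0-1 fraction, and nightly Sleep_Hours is assumed to be multiplied by 7 to obtain weekly hours.

```python
import pandas as pd

# First five rows of the relevant columns, copied from the dataset preview.
sample = pd.DataFrame({
    "Hours_Studied": [23, 19, 24, 29, 19],
    "Attendance": [84, 64, 98, 89, 92],
    "Tutoring_Sessions": [0, 2, 2, 1, 3],
    "Sleep_Hours": [7, 8, 7, 8, 6],
})

def add_composite_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with three illustrative composite columns added."""
    out = frame.copy()
    # Assumption: attendance (in percent) is scaled to a 0-1 fraction before multiplying.
    out["Study_Intensity"] = out["Hours_Studied"] * out["Attendance"] / 100
    # Assumption: weekly study hours and the tutoring session count are summed directly.
    out["Academic_Effort"] = out["Hours_Studied"] + out["Tutoring_Sessions"]
    # Assumption: nightly sleep is converted to weekly hours (x7) before subtracting.
    out["Study_Sleep_Balance"] = out["Hours_Studied"] - out["Sleep_Hours"] * 7
    return out

features = add_composite_features(sample)
print(features[["Study_Intensity", "Academic_Effort", "Study_Sleep_Balance"]])
```

Under these assumptions every balance value in the sample is negative, because weekly sleep exceeds weekly study for all five students. Strongly positive values would therefore be rare and worth inspecting.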