Item-Item Collaborative Filtering (K-NN)

Author

Miguel R.

Published

March 25, 2026


This notebook demonstrates the mathematical intuition and practical implementation behind Item-Item Collaborative Filtering using a K-Nearest Neighbors (K-NN) approach.

Instead of relying on user-to-user similarity, we will predict how much a user will like a specific movie by analyzing how similar that movie is to other movies the user has already rated.

1. The Dataset

Let’s start by creating a 2D matrix of movie reviews.

  • Rows: 6 movies
  • Columns: 12 users
  • Values: ratings from 1 to 5

Note: A NaN value indicates that the user has not interacted with or rated that specific movie.

Code
import pandas as pd
import numpy as np

nan = np.nan
item_item_df = pd.DataFrame({
    1: np.array([
        1, nan, 3, nan, nan, 5, nan, nan, 5, nan, 4, nan]),
    2: np.array([
        nan, nan, 5, 4, nan, nan, 4, nan, nan, 2, 1, 3]),

    3: np.array([
        2, 4, nan, 1, 2, nan, 3, nan, 4, 3, 5, nan]),

    4: np.array([
        nan, 2, 4, nan, 5, nan, nan, 4, nan, nan, 2, nan]),
    5: np.array([
        nan, nan, 4, 3, 4, 2, nan, nan, nan, nan, 2, 5
    ]),
    6: np.array([
        1, nan, 3, nan, 3, nan, nan, 2, nan, nan, 4, nan
    ])
}, index=[
    'user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6',
    'user_7', 'user_8', 'user_9', 'user_10', 'user_11', 'user_12'
]).transpose()
item_item_df
user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9 user_10 user_11 user_12
1 1.0 NaN 3.0 NaN NaN 5.0 NaN NaN 5.0 NaN 4.0 NaN
2 NaN NaN 5.0 4.0 NaN NaN 4.0 NaN NaN 2.0 1.0 3.0
3 2.0 4.0 NaN 1.0 2.0 NaN 3.0 NaN 4.0 3.0 5.0 NaN
4 NaN 2.0 4.0 NaN 5.0 NaN NaN 4.0 NaN NaN 2.0 NaN
5 NaN NaN 4.0 3.0 4.0 2.0 NaN NaN NaN NaN 2.0 5.0
6 1.0 NaN 3.0 NaN 3.0 NaN NaN 2.0 NaN NaN 4.0 NaN
Code
import plotly.express as px

fig_data = px.imshow(
    item_item_df,
    text_auto=True,
    aspect="auto",
    color_continuous_scale='YlGnBu',
    title="Raw User-Item Ratings Matrix (Notice the missing values)",
    labels=dict(x="Users", y="Movies", color="Rating")
)

fig_data.update_layout(width=800, height=400)
fig_data.show()
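The heatmap makes the sparsity visible; it can also be quantified directly. A minimal self-contained sketch using a toy 2×3 slice (illustrative values standing in for the full `item_item_df`):

```python
import numpy as np
import pandas as pd

# Toy 2x3 slice standing in for `item_item_df` (illustrative values only).
ratings = pd.DataFrame(
    [[1.0, np.nan, 3.0],
     [np.nan, 5.0, 4.0]],
    index=[1, 2],
    columns=['user_1', 'user_2', 'user_3'],
)

# Sparsity = share of cells that hold no rating at all.
sparsity = ratings.isna().sum().sum() / ratings.size
print(f"{sparsity:.0%} of the matrix is missing")  # 33% for this toy slice
```

Real-world rating matrices are typically far sparser than this example, which is exactly why we predict missing entries instead of reading them off.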

2. Calculating the Similarity Matrix

Our Goal: Predict the rating that user_5 would give to movie 1.

To achieve this, we first need to build an Item-Item Similarity Matrix. Before computing cosine similarity, we center the data by subtracting each movie’s mean rating from its row, so that movies which are rated generally high or low across the board don’t dominate the comparison; this centered cosine is equivalent to the Pearson correlation between items. (Correcting for generally “harsh” or “generous” users would instead subtract each user’s mean — the adjusted-cosine variant — which we don’t use here.)

Code
def similarity_matrix(df):
    from sklearn.metrics.pairwise import cosine_similarity
    # Center each movie (row) by subtracting its own mean rating; NaNs are
    # ignored by mean() and survive the subtraction, so we then fill them
    # with 0 (the new row mean) before computing cosine similarity.
    centered = df.sub(df.mean(axis=1), axis=0).fillna(0)
    return pd.DataFrame(
        cosine_similarity(centered),
        index=df.index,
        columns=df.index,
    )


similarity_df = similarity_matrix(item_item_df)

fig_sim = px.imshow(
    similarity_df,
    text_auto='.2f',
    aspect="auto",
    color_continuous_scale='RdBu',
    zmin=-1, zmax=1,
    title="Item-Item Cosine Similarity Matrix",
    labels=dict(x="Movie", y="Movie", color="Similarity")
)

fig_sim.update_layout(width=700, height=600)
fig_sim.show()

Understanding the Cosine Similarity Score

The similarity function returns a value between \(-1\) and \(1\), representing the cosine of the angle between two item vectors:

  • \(1.0\) (\(0^\circ\) separation): Perfect positive correlation. The items are highly similar in the eyes of the users.
  • \(0.0\) (\(90^\circ\) separation): No correlation. The items’ rating patterns are orthogonal and tell us nothing about each other.
  • \(-1.0\) (\(180^\circ\) separation): Perfect negative correlation. The items represent completely opposite user preferences.
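The two extremes can be verified by hand on toy vectors. The helper below (`centered_cosine` is a name introduced here for illustration, not part of the notebook) centers each vector and computes the cosine, which is exactly what the similarity matrix does per pair of movies:

```python
import numpy as np

def centered_cosine(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    a = a - a.mean()
    b = b - b.mean()
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Three toy items rated by the same three users.
item_a = np.array([5.0, 3.0, 1.0])
item_b = np.array([4.0, 2.0, 0.0])   # same pattern, shifted down by 1
item_c = np.array([1.0, 3.0, 5.0])   # opposite preference pattern

print(centered_cosine(item_a, item_b))  # ≈ 1.0
print(centered_cosine(item_a, item_c))  # ≈ -1.0
```

Note that `item_b` is a constant shift of `item_a`, and centering removes that shift entirely — which is why the similarity comes out as a perfect 1.0 rather than something slightly lower.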

3. Generating the Recommendation

Step 1: Isolate the User’s Profile

First, we identify the specific movies that our target user (user_5) has already watched and rated.

Code
target_movie = 1
target_user = 'user_5'

user_profile = item_item_df[target_user]
watched_movies = user_profile.dropna().index

Step 2: Extract Relevant Similarities

Next, we pull the similarity scores between our target movie (movie 1) and the movies the user has already watched.

Crucial Filtering: We strictly filter for positive similarities (\(>0\)). Including items with negative similarities (opposite preferences) would mathematically drag down our weighted average and distort the prediction.

Code
similarity = similarity_df.loc[target_movie, watched_movies]

# keep only the neighbors with positive similarity
positive_sim_mask = similarity > 0
similarity_positive = similarity[positive_sim_mask]
Code
fig_weights = px.bar(
    x=[f"Movie {idx}" for idx in similarity_positive.index],
    y=similarity_positive.values,
    text_auto='.2f',
    title="K-NN Applied Weights (Positive Similarities for Movie 1)",
    labels=dict(x="Neighbor Movies", y="Similarity Weight"),
    color=similarity_positive.values,
    color_continuous_scale='Blues'
)

fig_weights.update_layout(
    xaxis_title="Movies Rated by user_5",
    yaxis_title="Weight (Similarity Score)",
    showlegend=False,
    width=600, height=400
)
fig_weights.show()

Step 3: Compute the Weighted Average

Finally, we calculate the predicted rating. We use the similarities as “weights” to multiply the user’s past ratings, and then normalize by dividing by the sum of those absolute weights.

The formula is: \[\hat{r}_{u,i} = \frac{\sum_{j} \text{sim}(i, j) \cdot r_{u,j}}{\sum_{j} |\text{sim}(i, j)|}\]

Code
classifications_pos = user_profile[watched_movies][positive_sim_mask]
recommendation = (
        np.dot(similarity_positive, classifications_pos)
        / np.sum(np.abs(similarity_positive))
)
recommendation
np.float64(2.5864068669348175)

Conclusion

Based on the behavior of similar items, our K-NN model predicts that user_5 will give movie 1 a rating of \(2.59\). Since this falls below the neutral midpoint of 3 on the 1–5 scale, this movie is likely not a strong candidate for a recommendation.
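The three steps above generalize naturally into a single prediction function. A minimal sketch, assuming the same rows-are-movies / columns-are-users layout as `item_item_df` and a precomputed similarity matrix (`predict_rating` is a name introduced here, not from the notebook; the toy data below is illustrative):

```python
import numpy as np
import pandas as pd

def predict_rating(ratings, sim, user, movie):
    """Weighted-average K-NN prediction for a single (user, movie) pair."""
    profile = ratings[user].dropna()                  # movies this user rated
    neighbors = profile.index[profile.index != movie]
    weights = sim.loc[movie, neighbors]
    weights = weights[weights > 0]                    # positive similarities only
    if weights.empty:
        return np.nan                                 # no usable neighbors
    return np.dot(weights, profile[weights.index]) / weights.abs().sum()

# Toy 3-movie x 3-user matrix (illustrative values, not the notebook's data).
ratings = pd.DataFrame(
    [[5.0, 3.0, 4.0],
     [4.0, 2.0, 5.0],
     [np.nan, 4.0, 5.0]],
    index=[1, 2, 3],
    columns=['u1', 'u2', 'u3'],
)

# Centered cosine similarity, mirroring the sklearn computation above.
centered = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)
norms = np.linalg.norm(centered.values, axis=1)
sim = pd.DataFrame(centered.values @ centered.values.T / np.outer(norms, norms),
                   index=ratings.index, columns=ratings.index)

print(predict_rating(ratings, sim, 'u1', 3))  # ≈ 4.34
```

Looping `predict_rating` over every movie a user has not yet rated and sorting the results descending yields a simple top-N recommendation list.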