commit 47725aa4b4

README.md (56 lines changed to 146)
# Supervised Learning - TV-Show Recommender

## Table of Contents

1. [How to Run the Program](#how-to-run-the-program)
2. [Project Overview](#project-overview)
3. [Dataset](#dataset)
4. [Model and Algorithm](#model-and-algorithm)
5. [Features](#features)
6. [Requirements](#requirements)
7. [Libraries](#libraries)
8. [Classes](#classes)
9. [References](#references)

## How to Run the Program

### Prerequisites

1. **Download and Extract the Dataset:**
   - Download the dataset from [TMDB TV Dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).
   - Extract `TMDB_tv_dataset_v3.zip` into the `dataset/` folder, so that it contains the file `TMDB_tv_dataset_v3.csv`.

2. **Install Dependencies:**
   - Install the necessary libraries listed in `requirements.txt` (see below).

### Running the Program

There are two ways to run the program, depending on whether you prefer the web-based interface or the command-line interface (CLI).

#### Web Interface (Flask)

To run the web-based interface (Flask application):

```bash
python app.py
```

- This starts a local web server; you can access the app through your browser (usually at http://127.0.0.1:5000/).
- The program loads the dataset, prompts you to enter a TV show title, and asks how many recommendations you want.
#### Command-Line Interface (CLI)

To run the command-line version of the program:

```bash
python main.py
```

- The program runs in the terminal, asking you to enter the title of a TV show you like and how many recommendations you want.

> [!NOTE]
> The first time the program is run, it generates **Sentence-BERT embeddings**. This can take up to 5 minutes due to the large size of the dataset.

---

## Project Overview

The **TV-Show Recommender** is a machine-learning-based program that suggests TV shows to users based on their preferences. The system uses the **Nearest Neighbors (NN)** and **K-Nearest Neighbors (K-NN)** algorithms with **cosine distance** to recommend TV shows. Users provide the title of a TV show they like, and the system returns recommendations based on similarity to other TV shows in the dataset.

---

## Dataset

The dataset used in this project is sourced from **TMDB** (The Movie Database). It contains over 150,000 TV shows and includes information such as:

- Title of TV shows
- Genres
- First/Last air date
- Vote count and average rating
- Director/Creator information
- Overview/Description
- Networks
- Spoken languages
- Number of seasons/episodes

Download the dataset from [here](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).

---

## Model and Algorithm

The recommender system is based on **Supervised Learning** using the **NearestNeighbors** and **K-NearestNeighbors** algorithms. Here's a breakdown of the process:

1. **Data Preprocessing:**
   - The TV show descriptions are vectorized using **Sentence-BERT embeddings** to create dense vector representations of each show's description.

2. **Model Training:**
   - The **NearestNeighbors (NN)** algorithm is used with **cosine distance** to compute similarity between TV shows. The algorithm finds the shows most similar to a user-provided title.

3. **Recommendation Generation:**
   - The model generates a list of recommended TV shows by finding the nearest neighbors of the input title using cosine similarity.

---
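The three steps above can be sketched in a few lines. This is a toy illustration only: it swaps the Sentence-BERT embeddings for TF-IDF vectors so it runs without downloading a model, and the show descriptions are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for the real TMDB overviews
shows = [
    "A chemistry teacher starts cooking drugs with a former student",  # 0
    "A high school chemistry teacher cooks drugs with an ex student",  # 1
    "Knights and dragons fight over a medieval throne",                # 2
]

# Step 1: vectorize descriptions (the real program uses Sentence-BERT here)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(shows)

# Step 2: fit NearestNeighbors with cosine distance
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

# Step 3: query with a show the user likes; the nearest hit is the show itself
distances, indices = nn.kneighbors(X[0])
print(indices[0])  # the query itself first, then the most similar other show
```

The same pattern scales to the full dataset; only the embedding step changes.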
## Features

1. **Data Loading & Preprocessing:**
   - Loads the TV show data from a CSV file and preprocesses it for model training.

2. **Model Training with K-NN:**
   - Trains a K-NN model using the **NearestNeighbors** algorithm for generating recommendations.

3. **User Input for Recommendations:**
   - Accepts user input for the TV show title and the number of recommendations.

4. **TV Show Recommendations:**
   - Returns a list of recommended TV shows based on similarity to the input TV show.

---

## Requirements

### Data Requirements:

The dataset should contain the following columns for each TV show:

- **Title**
- **Genres**
- **First/Last air date**
- **Vote count/average**
- **Director**
- **Overview**
- **Networks**
- **Spoken languages**
- **Number of seasons/episodes**

### User Input Requirements:

- **TV Show Title**: The name of the TV show you like.
- **Number of Recommendations**: The number of recommendations you want to receive (default is 10).

---

## Libraries

The following libraries are required to run the program:

- **pandas**: For data manipulation and analysis.
- **scikit-learn**: For machine learning algorithms and preprocessing.
- **scipy**: For scientific computing (e.g., sparse matrices).
- **time**: For working with time-related functions.
- **os**: For interacting with the operating system.
- **re**: For regular expression support.
- **textwrap**: For text wrapping and formatting.
- **flask**: For creating the web interface.

To install the dependencies, run:

```bash
pip install -r requirements.txt
```
app.py (new file, 84 lines)
```python
from flask import Flask, render_template, request
from readdata import LoadData
from recommendations import RecommendationLoader
from training import TrainModel

app = Flask(__name__)

# Load the data and train the model once at startup
data_loader = LoadData()
title_data = data_loader.load_data()

model = TrainModel(title_data)
model.train()

recommender = RecommendationLoader(model, title_data)


@app.route('/')
def home():
    return render_template('index.html')


@app.route('/recommend', methods=['POST'])
def recommend():
    # Get and validate user input
    title = (request.form.get('title') or '').strip()
    if not title:
        return render_template('index.html', message="Please enter a valid TV show title.")

    try:
        n_recommendations = int(request.form.get('n_recommendations', 10))
        if n_recommendations < 1 or n_recommendations > 50:
            raise ValueError("Number of recommendations must be between 1 and 50.")
    except ValueError as e:
        return render_template('index.html', message=str(e))

    # Find the row in the dataset that matches the requested title
    target_row = title_data[title_data['name'].str.lower() == title.lower()]

    # Check if a match was found
    if target_row.empty:
        return render_template('index.html', message=f"No match found for '{title}'. Try again.")

    # Get recommendations
    target_row = target_row.iloc[0]
    user_data = {'title': title, 'n_rec': n_recommendations}
    recommendations = recommender.get_recommendations("flask", target_row, user_data)

    # Check if recommendations were found
    if recommendations is None or recommendations.empty:
        return render_template('index.html', message=f"Sorry, no recommendations available for {title}.")

    # Prepare data for display on the webpage
    recommendations_data = []
    for _, row in recommendations.iterrows():
        # Extract the first and last air years ("Unknown" means no end date)
        first_air_date = recommender.extract_years(row['first_air_date'])
        last_air_date = recommender.extract_years(row['last_air_date'])
        if last_air_date != "Unknown":
            years = f"{first_air_date} - {last_air_date}"
        else:
            years = f"{first_air_date}"

        recommendations_data.append({
            'title': row['name'],
            'genres': ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres'],
            'overview': row['overview'],
            'rating': row['vote_average'],
            'seasons': row['number_of_seasons'],
            'episodes': row['number_of_episodes'],
            'networks': ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A',
            'years': years,
        })

    return render_template('index.html', recommendations=recommendations_data, original_title=title)


if __name__ == '__main__':
    app.run(debug=True)
```
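The year-range formatting above relies on `extract_years` (defined in recommendations.py), which maps a `YYYY-MM-DD` string to its year and a NaN end date to `"Unknown"`. A standalone sketch of that helper's behavior:

```python
import pandas as pd

def extract_years(air_date):
    # NaN means the show has no known end date (e.g. still airing)
    if pd.isna(air_date):
        return "Unknown"
    # A bare year may arrive as a float after CSV round-tripping
    if isinstance(air_date, float):
        return str(int(air_date))
    return air_date.split('-')[0]

print(extract_years('2008-01-20'))   # '2008'
print(extract_years(float('nan')))   # 'Unknown'
```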
dataset/TMDB_tv_dataset_v3.zip (new binary file; binary file not shown)

dataset/dataset_tmdb.csv (216911 lines; file diff suppressed because it is too large)
import_data.py (new file, 76 lines)
```python
import re
import os
import pandas as pd


###############################################################
#### Class: ImportData
###############################################################
class ImportData:

    def __init__(self):
        self.data = None
        self.loaded_datasets = []

    ###########################################################
    #### Function: load_dataset
    ###########################################################
    def load_dataset(self, dataset_path):
        # Load data from a dataset CSV file
        try:
            df = pd.read_csv(os.path.join('dataset', dataset_path))
            return df
        except FileNotFoundError:
            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')
            return None

    ###########################################################
    #### Function: create_data
    ###########################################################
    def create_data(self, filename):
        # load_dataset already handles a missing file by returning None
        self.data = self.load_dataset(filename)
        if self.data is None:
            print("No data imported, missing dataset.")
        else:
            print("Imported data successfully.")

    ###########################################################
    #### Function: clean_data
    ###########################################################
    def clean_data(self):
        if self.data is not None:
            # Drop unnecessary columns
            df_cleaned = self.data.drop(columns=['adult', 'poster_path', 'production_companies',
                'in_production', 'backdrop_path', 'production_countries', 'status', 'episode_run_time',
                'original_name', 'popularity', 'tagline', 'homepage'], errors='ignore')

            # Keep only rows whose text columns are pure ASCII
            text_columns = ['name', 'overview', 'spoken_languages']
            masks = [df_cleaned[col].apply(lambda x: isinstance(x, str) and bool(re.match(r'^[\x00-\x7F]*$', x)))
                     for col in text_columns]
            combined_mask = pd.concat(masks, axis=1).all(axis=1)

            self.data = df_cleaned[combined_mask]

            print(f'Data cleaned. {self.data.shape[0]} records remaining.')
        else:
            print("No data to clean. Please load the dataset first.")

    ###########################################################
    #### Function: save_data
    ###########################################################
    def save_data(self):
        if self.data is not None:
            try:
                # Save the dataframe to CSV
                self.data.to_csv('data.csv', index=False)
                print('Data saved to data.csv.')
            except Exception as e:
                print(f'Error saving data: {e}')
        else:
            print("No data to save. Please clean the data first.")
```
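The row-filtering trick in `clean_data` — one boolean mask per text column, combined with `all(axis=1)` — is easiest to see on a toy frame (the show names here are made up):

```python
import re
import pandas as pd

df = pd.DataFrame({'name': ['Lost', 'Café Müller'],
                   'overview': ['plane crash drama', 'dance piece']})

text_columns = ['name', 'overview']
masks = [df[col].apply(lambda x: isinstance(x, str) and bool(re.match(r'^[\x00-\x7F]*$', x)))
         for col in text_columns]
combined_mask = pd.concat(masks, axis=1).all(axis=1)

# A row is dropped if ANY of its text columns contains a non-ASCII character
print(df[combined_mask]['name'].tolist())  # ['Lost']
```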
main.py (rewritten: 177 lines replaced by 23; the old single-file implementation moved into readdata.py, training.py, and recommendations.py)
```python
from readdata import LoadData
from training import TrainModel
from recommendations import RecommendationLoader


#########################################################################
#### function: main
#########################################################################
def main():
    # Load data from CSV file
    data_loader = LoadData()
    title_data = data_loader.load_data()

    # Train model
    model = TrainModel(title_data)
    model.train()

    # Run recommendation loader
    recommendations = RecommendationLoader(model, title_data)
    recommendations.run()
```
readdata.py (new file, 79 lines)
```python
import pandas as pd
from import_data import ImportData


#########################################################################
#### Class: LoadData
#########################################################################
class LoadData:
    def __init__(self):
        self.data = None
        self.filename = 'TMDB_tv_dataset_v3.csv'

    ###########################################################
    #### Function: load_data
    ###########################################################
    def load_data(self):
        self.read_data()
        self.clean_data()
        print(f'{self.data.shape[0]} titles loaded successfully.')
        return self.data

    ###########################################################
    #### Function: read_data
    ###########################################################
    def read_data(self):
        print("Starting to read data ...")
        try:
            # Try to read the preprocessed CSV file
            self.data = pd.read_csv('data.csv')
            print(f'{self.data.shape[0]} rows read successfully.')
        except FileNotFoundError:
            print("No data.csv file found. Attempting to import data...")
            # If the CSV file is not found, import the raw dataset instead
            try:
                data_importer = ImportData()
                data_importer.create_data(self.filename)
                data_importer.clean_data()
                data_importer.save_data()
                self.data = pd.read_csv('data.csv')
                print(f'{self.data.shape[0]} rows imported successfully.')
            except Exception as e:
                print(f"Error during data import process: {e}")

    ###########################################################
    #### Function: clean_data
    ###########################################################
    def clean_data(self):
        # Split a comma-separated string into a list; use an empty list for invalid data
        def split_to_list(value):
            if isinstance(value, str):
                # Strip and split the string, and remove any empty items
                return [item.strip() for item in value.split(',') if item.strip()]
            return []

        data_start = self.data.shape[0]

        # Split genres, spoken_languages, networks, and created_by into lists
        self.data['genres'] = self.data['genres'].apply(split_to_list)
        self.data['spoken_languages'] = self.data['spoken_languages'].apply(split_to_list)
        self.data['networks'] = self.data['networks'].apply(split_to_list)
        self.data['created_by'] = self.data['created_by'].apply(split_to_list)

        # Drop rows that are not in English
        self.data = self.data[self.data['original_language'] == 'en']

        # Drop rows with empty lists in genres, spoken_languages, or networks
        self.data = self.data[
            self.data['genres'].map(lambda x: len(x) > 0) &
            self.data['spoken_languages'].map(lambda x: len(x) > 0) &
            self.data['networks'].map(lambda x: len(x) > 0)
        ]

        # Count the rows that were dropped
        rows_dropped = data_start - len(self.data)
        print(f'Data cleaned successfully, dropped {rows_dropped} rows.')
```
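A quick check of `split_to_list`'s behavior on the two kinds of values these columns actually contain — a comma-separated string, and a NaN that pandas inserts for missing cells:

```python
def split_to_list(value):
    # Same helper as in LoadData.clean_data
    if isinstance(value, str):
        return [item.strip() for item in value.split(',') if item.strip()]
    return []

print(split_to_list('Drama, Crime, '))  # ['Drama', 'Crime'] — trailing comma is dropped
print(split_to_list(float('nan')))      # [] — NaN is a float, not a str
```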
recommendations.py (new file, 155 lines)
|
||||
from user import UserData
|
||||
import pandas as pd
|
||||
import textwrap
|
||||
|
||||
|
||||
###############################################################
|
||||
#### Class: RecommendationLoader
|
||||
###############################################################
|
||||
class RecommendationLoader:
|
||||
def __init__(self, model, title_data):
|
||||
self.model = model
|
||||
self.title_data = title_data
|
||||
|
||||
|
||||
###########################################################
|
||||
#### Function: run
|
||||
###########################################################
|
||||
def run(self):
|
||||
while True:
|
||||
user_data = UserData()
|
||||
user_data.title()
|
||||
user_data.n_recommendations()
|
||||
|
||||
# Exit the program if writing exit or quit.
|
||||
if user_data.user_data['title'] in ['exit', 'quit']:
|
||||
print("Program will exit now. Thanks for using!")
|
||||
break
|
||||
|
||||
# Find a row in dataset to use as referens.
|
||||
target_row = self.title_data[self.title_data['name'].str.lower() == user_data.user_data['title']]
|
||||
|
||||
# If no match found, loop and try again.
|
||||
if target_row.empty:
|
||||
print(f"No match found for '{user_data.user_data['title']}'. Try again.")
|
||||
continue
|
||||
|
||||
# If match found, get recommendations.
|
||||
target_row = target_row.iloc[0]
|
||||
self.get_recommendations(target_row, user_data.user_data)
|
||||
print("#" * 100)
|
||||
print("\nWrite 'exit' or 'quit' to end the program.")
|
||||
|
||||
|
||||
###########################################################
|
||||
#### Function: get_recommendations
|
||||
###########################################################
|
||||
def get_recommendations(self, type, target_row, user_data):
|
||||
recommendations = pd.DataFrame()
|
||||
n_recommendations = user_data['n_rec']
|
||||
|
||||
# Get more recommendations and filter untill n_recommendations is reached
|
||||
while len(recommendations) < n_recommendations:
|
||||
additional_recommendations = self.model.recommend(target_row, num_recommendations=20)
|
||||
additional_recommendations = additional_recommendations[~additional_recommendations.index.isin(recommendations.index)]
|
||||
additional_recommendations = self.filter_genres(additional_recommendations, target_row)
|
||||
recommendations = pd.concat([recommendations, additional_recommendations])
|
||||
|
||||
# Make sure we give n_recommendations recommendations
|
||||
recommendations = recommendations.head(n_recommendations)
|
||||
|
||||
if type == 'flask':
|
||||
return recommendations
|
||||
else:
|
||||
self.display_recommendations(user_data, recommendations, n_recommendations, target_row)
|
||||
|
||||
|
||||
###########################################################
|
||||
#### Function: display_recommendations
|
||||
###########################################################
|
||||
def display_recommendations(self, user_data, recommendations, n_recommendations, target_row):
|
||||
print(f'\n{n_recommendations} recommendations based on "{user_data["title"]}":\n')
|
||||
|
||||
# Width on printed recommendations
|
||||
width = 100
|
||||
|
||||
# Print recommendations if there are any
|
||||
if not recommendations.empty:
|
||||
# print(f"{'Title':<40} {'Genres':<60} {'Networks':<30}")
|
||||
print("#" * width)
|
||||
|
||||
for index, row in recommendations.iterrows():
|
||||
title = row['name']
|
||||
genres = ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres']
|
||||
networks = ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A'
|
||||
created_by = ', '.join(row['created_by']) if isinstance(row['created_by'], list) and row['created_by'] else 'N/A'
|
||||
rating = row['vote_average']
|
||||
vote_count = row['vote_count']
|
||||
seasons = row['number_of_seasons'] if isinstance(row['number_of_seasons'], int) else 'N/A'
|
||||
episodes = row['number_of_episodes'] if isinstance(row['number_of_episodes'], int) else 'N/A'
|
||||
overview = textwrap.fill(row["overview"], width=width)
|
||||
|
||||
# Extract years fir first_air_date and last_air_date
|
||||
first_year = self.extract_years(row["first_air_date"])
|
||||
last_year = self.extract_years(row["last_air_date"])
|
||||
|
||||
# Construct title with the year range
|
||||
title_raw = f"{title} ({first_year}-{last_year})"
|
||||
title = textwrap.fill(title_raw, width=width)
|
||||
|
||||
# Print recommendation
|
||||
print(f"\nTitle: {title}")
|
||||
print(f"Genres: {genres}")
|
||||
if not created_by == 'N/A':
|
||||
print(f"Director: {created_by}")
|
||||
if not networks == 'N/A':
|
||||
print(f'Networks: {networks}')
|
||||
print(f"Rating: {rating:.1f} ({vote_count:.0f} votes)")
|
||||
if not seasons == 'N/A' and not episodes == 'N/A':
|
||||
print(f"Seasons: {seasons} ({episodes} episodes)")
|
||||
print(f'\n{overview}\n')
|
||||
|
||||
print("-" * width)
|
||||
|
||||
print("\nEnd of recommendations.")
|
||||
else:
|
||||
print("No recommendations found.")
|
||||
|
||||
###########################################################
|
||||
#### Function: extract_years
|
||||
###########################################################
|
||||
def extract_years(self, air_date):
|
||||
# Make sure air_date is not null
|
||||
if pd.isna(air_date):
|
||||
return "Unknown"
|
||||
# Convert float to int if needed
|
||||
if isinstance(air_date, float):
|
||||
return str(int(air_date))
|
||||
return air_date.split('-')[0]
|
||||
|
||||
|
||||
    ###########################################################
    #### Function: filter_genres
    ###########################################################
    def filter_genres(self, recommendations, target_row):
        # Get genres from the target row (lower-cased for comparison)
        reference_genres = [genre.lower() for genre in target_row['genres']]

        # Drop recommendations tagged with genres the reference title
        # does not share (kids, animation, reality, documentary)
        for genre in ('kids', 'animation', 'reality', 'documentary'):
            if genre not in reference_genres:
                recommendations = recommendations[
                    ~recommendations['genres'].apply(
                        lambda x, g=genre: g in [item.lower() for item in x]
                    )
                ]

        return recommendations
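A minimal sketch of this genre filtering on a toy DataFrame (the column names match the ones used above, but the rows are invented for illustration):

```python
import pandas as pd

recs = pd.DataFrame({
    'name': ['Show A', 'Show B', 'Show C'],
    'genres': [['Drama'], ['Kids', 'Animation'], ['Documentary']],
})
reference_genres = ['drama', 'crime']  # the target title's genres, lower-cased

# Drop rows tagged with genres the reference title does not share
for genre in ('kids', 'animation', 'reality', 'documentary'):
    if genre not in reference_genres:
        recs = recs[~recs['genres'].apply(
            lambda x, g=genre: g in [item.lower() for item in x])]

print(recs['name'].tolist())  # ['Show A']
```

Only the drama title survives: the kids/animation and documentary rows are filtered out because the reference title carries none of those genres.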
requirements.txt (new file, 7 lines)
@@ -0,0 +1,7 @@
Flask==3.0.1
numpy==1.26.4
pandas==2.2.0
scikit-learn==1.4.1.post1
scipy==1.12.0
sentence-transformers==3.2.1
textwrap3==0.9.0
templates/index.html (new file, 92 lines)
@@ -0,0 +1,92 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Recommendation System</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 0;
            padding: 20px;
            background-color: #f4f4f9;
        }
        .container {
            max-width: 1200px;
            margin: 0 auto;
            padding: 20px;
        }
        h1 {
            text-align: center;
            color: #333;
        }
        form {
            text-align: center;
            margin-bottom: 20px;
        }
        .recommendation {
            background: #fff;
            padding: 15px;
            border-radius: 8px;
            margin-bottom: 20px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
        }
        .recommendation h3 {
            margin-top: 0;
        }
        .recommendation p {
            margin: 5px 0;
        }
        .recommendation .overview {
            font-size: 14px;
            color: #555;
        }
        .error-message {
            color: red;
            text-align: center;
            font-weight: bold;
        }
    </style>
</head>
<body>

    <div class="container">
        <h1>TV-Show Recommendations</h1>

        <!-- Recommendation form -->
        <form method="POST" action="/recommend">
            <label for="title">Enter a Title (TV Show):</label><br><br>
            <input type="text" id="title" name="title" required><br><br>

            <label for="n_recommendations">Number of Recommendations:</label><br><br>
            <input type="number" id="n_recommendations" name="n_recommendations" value="10" min="1" max="50"><br><br>

            <input type="submit" value="Get Recommendations">
        </form>

        <!-- Display an error message, if any -->
        {% if message %}
        <div class="error-message">{{ message }}</div>
        {% endif %}

        <!-- Display recommendations -->
        {% if recommendations %}
        <h2>Recommendations based on "{{ original_title }}":</h2>
        <div class="recommendations">
            {% for rec in recommendations %}
            <div class="recommendation">
                <h3>{{ rec.title }} ({{ rec.years }})</h3>
                <p><strong>Genres:</strong> {{ rec.genres }}</p>
                <p><strong>Networks:</strong> {{ rec.networks }}</p>
                <p><strong>Rating:</strong> {{ rec.rating }}</p>
                <p><strong>Seasons:</strong> {{ rec.seasons }} ({{ rec.episodes }} episodes)</p>
                <p class="overview"><strong>Overview:</strong> {{ rec.overview }}</p>
            </div>
            {% endfor %}
        </div>
        {% endif %}

    </div>

</body>
</html>
training.py (new file, 146 lines)
@@ -0,0 +1,146 @@
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack, csr_matrix
import pickle
import time

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn')


#########################################################################
#### Class: TrainModel
#########################################################################
class TrainModel:
    def __init__(self, title_data):
        self.title_data = title_data

        # Initialize Sentence-BERT model for embeddings
        self.bert_model = SentenceTransformer('all-MiniLM-L12-v2')

        # TF-IDF vectorization settings
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=0.01, max_df=0.5)

        # Nearest Neighbors settings
        self.nearest_neighbors = NearestNeighbors(metric='cosine')

        # Scaler for numerical features
        self.scaler = StandardScaler()

        # SVD for dimensionality reduction
        self.svd = TruncatedSVD(n_components=300)
    ###########################################################
    #### Function: train
    ###########################################################
    def train(self):
        print("Starting to train model...")

        start = time.time()

        # Preprocess title data, including the text embeddings
        preprocessed_data = self.preprocess_title_data()

        # Fit Nearest Neighbors on the combined feature set
        self.nearest_neighbors.fit(preprocessed_data)

        print(f'Trained model successfully in {time.time() - start:.2f} seconds.')
    ###########################################################
    #### Function: recommend
    ###########################################################
    def recommend(self, target_row, num_recommendations=40):
        # Preprocess target data
        target_vector = self.preprocess_target_data(target_row)

        # Use Nearest Neighbors to get candidate recommendations
        distances, indices = self.nearest_neighbors.kneighbors(target_vector, n_neighbors=num_recommendations)
        recommendations = self.title_data.iloc[indices[0]].copy()
        recommendations['distance'] = distances[0]

        # Drop the query title itself and anything too dissimilar
        recommendations = recommendations[
            (recommendations['name'].str.lower() != target_row['name'].lower()) &
            (recommendations['distance'] < 0.5)
        ]
        return recommendations.head(num_recommendations)
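The cosine-distance cut-off used here (`distance < 0.5`) can be seen in isolation with a tiny `NearestNeighbors` example; the vectors are toy stand-ins, not the real feature matrix:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 0.0],   # the query itself
              [0.9, 0.1],   # nearly parallel -> tiny cosine distance
              [0.0, 1.0]])  # orthogonal -> cosine distance 1.0
nn = NearestNeighbors(metric='cosine').fit(X)

distances, indices = nn.kneighbors([[1.0, 0.0]], n_neighbors=3)
# Keep only neighbors closer than 0.5 in cosine distance
close = indices[0][distances[0] < 0.5]
print(close)  # [0 1]
```

The orthogonal vector is rejected by the threshold, which is exactly how dissimilar titles are pruned above.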
    ###########################################################
    #### Function: preprocess_title_data
    ###########################################################
    def preprocess_title_data(self):
        # Combine text fields for TF-IDF and BERT
        self.title_data['combined_text'] = (
            self.title_data['overview'].fillna('').apply(str) + ' ' +
            self.title_data['genres'].fillna('').apply(str) + ' ' +
            self.title_data['created_by'].fillna('').apply(str)
        )

        # TF-IDF + SVD
        text_features = self.vectorizer.fit_transform(self.title_data['combined_text'])
        text_features = self.svd.fit_transform(text_features)

        # Sentence-BERT embeddings (cached on disk after the first run)
        bert_embeddings = self.load_pickle('bert_embeddings.pkl', self.title_data['combined_text'])

        # Numerical features
        self.numerical_data = self.title_data.select_dtypes(include=['number'])
        numerical_features = self.scaler.fit_transform(self.numerical_data)
        numerical_features_sparse = csr_matrix(numerical_features)

        # Combine all features into one sparse matrix
        combined_features = hstack([csr_matrix(text_features), csr_matrix(bert_embeddings),
                                    numerical_features_sparse])
        return combined_features
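The feature combination above (dense SVD output, dense sentence embeddings, and scaled numeric columns stacked side by side into one sparse matrix) can be sketched with toy arrays; the shapes here are invented, only the `hstack` pattern matches the code:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix

text_features = np.random.rand(4, 3)    # stands in for the SVD output
bert_embeddings = np.random.rand(4, 5)  # stands in for Sentence-BERT vectors
numerical = csr_matrix(np.random.rand(4, 2))  # scaled numeric columns

# Column-wise concatenation: every row keeps its features from all sources
combined = hstack([csr_matrix(text_features),
                   csr_matrix(bert_embeddings),
                   numerical])
print(combined.shape)  # (4, 10)
```

Wrapping the dense blocks in `csr_matrix` keeps the result sparse, so the numeric block's sparsity is preserved even though the text blocks are dense.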
    ###########################################################
    #### Function: preprocess_target_data
    ###########################################################
    def preprocess_target_data(self, target_row):
        # TF-IDF + SVD
        target_text_vector = self.vectorizer.transform([target_row['combined_text']])
        target_text_vector = self.svd.transform(target_text_vector)

        # Sentence-BERT embedding
        target_bert_embedding = self.embed_text(target_row['combined_text']).reshape(1, -1)

        # Numerical features
        target_numerical = target_row[self.numerical_data.columns].values.reshape(1, -1)
        target_numerical_scaled = self.scaler.transform(target_numerical)

        # Combine all features in the same order as training
        target_vector = hstack([csr_matrix(target_text_vector), csr_matrix(target_bert_embedding),
                                csr_matrix(target_numerical_scaled)])
        return target_vector
    ###########################################################
    #### Function: embed_text
    ###########################################################
    def embed_text(self, text):
        # Use Sentence-BERT to create an embedding for a single text
        return self.bert_model.encode(text, convert_to_numpy=True)
    ###########################################################
    #### Function: load_pickle
    ###########################################################
    def load_pickle(self, filename, title_data):
        # Load cached embeddings if they exist; otherwise compute and cache them
        try:
            with open(filename, 'rb') as f:
                bert_embeddings = pickle.load(f)
        except FileNotFoundError:
            print("Generating Sentence-BERT embeddings...")
            bert_embeddings = self.bert_model.encode(title_data.tolist(), batch_size=64, convert_to_numpy=True)
            with open(filename, 'wb') as f:
                pickle.dump(bert_embeddings, f)
        return bert_embeddings
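The cache-or-compute pattern in `load_pickle` is generic. A self-contained sketch, with a hypothetical `load_or_compute` helper and a cheap list standing in for the BERT encoding step:

```python
import os
import pickle
import tempfile

def load_or_compute(filename, compute):
    """Load a pickled result, or compute and cache it on a cache miss."""
    try:
        with open(filename, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = compute()
        with open(filename, 'wb') as f:
            pickle.dump(result, f)
        return result

path = os.path.join(tempfile.gettempdir(), 'demo_cache.pkl')
if os.path.exists(path):
    os.remove(path)  # start from a cold cache for the demo

first = load_or_compute(path, lambda: [1, 2, 3])   # computes and caches
second = load_or_compute(path, lambda: [9, 9, 9])  # served from the cache
print(first, second)  # [1, 2, 3] [1, 2, 3]
```

The second call ignores its compute function entirely, which is why the cached `bert_embeddings.pkl` must be deleted whenever the underlying dataset changes.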
user.py (new file, 34 lines)
@@ -0,0 +1,34 @@
###############################################################
#### Class: UserData
###############################################################
class UserData:
    def __init__(self):
        self.user_data = {}
        self.n_rec = 10

    ###########################################################
    #### Function: title
    ###########################################################
    def title(self):
        # Ask for user input
        print("#" * 100)
        title = input("\nPlease enter the title of the TV show you like: ")
        self.user_data['title'] = title.strip().lower()
        return self.user_data

    ###########################################################
    #### Function: n_recommendations
    ###########################################################
    def n_recommendations(self):
        # Ask for the number of recommendations, retrying until valid
        while True:
            n_rec = input("How many recommendations do you want (minimum 5): ")
            try:
                n_rec = int(n_rec.strip())
                if n_rec < 5:
                    print("Please enter a number greater than or equal to 5.")
                else:
                    self.user_data['n_rec'] = n_rec
                    break
            except ValueError:
                print("Please enter a valid number.")