Merge pull request #2 from jwradhe/v1

V1
This commit is contained in:
Jimmy Wrådhe 2024-11-21 01:14:43 +01:00 committed by GitHub
commit 47725aa4b4
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
16 changed files with 821 additions and 243103 deletions

184
README.md

@ -1,56 +1,146 @@
# Supervised Learning - Movie/TV-Show recommender
# Supervised Learning - TV-Show Recommender
## Specification
Movie/TV-Show recommender
## Table of Contents
1. [How to Run the Program](#how-to-run-the-program)
2. [Project Overview](#project-overview)
3. [Dataset](#dataset)
4. [Model and Algorithm](#model-and-algorithm)
5. [Features](#features)
6. [Requirements](#requirements)
7. [Libraries](#libraries)
8. [Classes](#classes)
9. [References](#references)
This program recommends a movie or TV show to watch based on a Movie/TV-Show you like.
You should be able to search for recommendations by Movie/TV-Show title, cast, director,
release year and description, and get back recommendations with an explanation of why each
title might suit you. It returns about 25 movie recommendations ranked by similarity to your search,
and the same number of TV-Show recommendations matched the same way.
## How to Run the Program
### Data Source:
I will use 4 datasets from Kaggle: 3 datasets from the streaming sites Netflix,
Amazon Prime and Disney Plus, and 1 from an IMDB dataset.
### Prerequisites
### Model:
I will use the NearestNeighbors (NN) algorithm to find other titles based on features
like Title, Release year, Description, Cast, Director and Genres.
1. **Download and Extract the Dataset:**
- Download the dataset from [TMDB TV Dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).
- Extract `TMDB_tv_dataset_v3.zip` into the `dataset/` folder, so it contains the file `TMDB_tv_dataset_v3.csv`.
### Features:
1. Load and preprocess data from several data files.
2. Train a model with the k-NN algorithm.
3. Search with explanations.
2. **Install Dependencies:**
- Install the necessary libraries listed in `requirements.txt` (see below).
### Requirements:
1. Title data:
* Title
* Genres
* Release year
* Cast
* Director
* Description
2. User data:
* What Movie / TV-Show
* What genre
* Director
### Running the Program
### Libraries
* pandas: Data manipulation and analysis
* scikit-learn: machine learning algorithms and preprocessing
* beautifulsoup4: web scraping (if necessary)
There are two ways to run the program, depending on whether you prefer to use the web-based interface or the command-line interface (CLI).
### Classes
1. LoadData
* load_data
* clean_text
* clean_data
* load_dataset
* create_data
* save_data
* load_data
2. UserData
* input
3. Recommendations
* get_recommendations
#### Web Interface (Flask)
To run the web-based interface (Flask application):
```bash
python app.py
```
- This will start a local web server, and you can access the app through your browser (usually at http://127.0.0.1:5000/).
- The program will load the dataset, prompt you to enter a TV show title, and ask how many recommendations you want.
#### Command-Line Interface (Python-GUI)
To run the command-line version of the program:
```bash
python main.py
```
- The program will work in the terminal, asking you to enter the title of a TV show you like and how many recommendations you want.
> [!NOTE]
> The first time the program is run, it will generate **Sentence-BERT embeddings**. This can take up to 5 minutes due to the large size of the dataset.
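The first-run behavior can be sketched as a simple cache-or-compute pattern (a minimal illustration with a hypothetical `embed` stand-in for the Sentence-BERT encoder; the real project caches to `bert_embeddings.pkl` in `training.py`):

```python
import os
import pickle

def embed(texts):
    # Hypothetical stand-in for SentenceTransformer.encode:
    # returns one small feature vector per text.
    return [[float(len(t)), float(t.count(' '))] for t in texts]

def load_or_build_embeddings(texts, path='bert_embeddings.pkl'):
    # Reuse cached embeddings when present; otherwise compute once and cache.
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    vectors = embed(texts)
    with open(path, 'wb') as f:
        pickle.dump(vectors, f)
    return vectors
```

Only the first call pays the embedding cost; later runs load the pickle, which is why only the first run of the program is slow.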
---
## Project Overview
The **TV-Show Recommender** is a machine learning-based program that suggests TV shows to users based on their preferences. The system uses **Nearest Neighbors (NN)** and **K-Nearest Neighbors (K-NN)** algorithms with **cosine distance** to recommend TV shows. Users provide a title of a TV show they like, and the system returns recommendations based on similarity to other TV shows in the dataset.
---
## Dataset
The dataset used in this project is sourced from **TMDB** (The Movie Database). It contains over 150,000 TV shows and includes information such as:
- Title of TV shows
- Genres
- First/Last air date
- Vote count and average rating
- Director/Creator information
- Overview/Description
- Networks
- Spoken languages
- Number of seasons/episodes
Download the dataset from [here](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).
---
## Model and Algorithm
The recommender system is based on **Supervised Learning** using the **NearestNeighbors** and **K-NearestNeighbors** algorithms. Here's a breakdown of the process:
1. **Data Preprocessing:**
- The TV show descriptions are vectorized using **Sentence-BERT embeddings** to create dense vector representations of each show's description.
2. **Model Training:**
- The **NearestNeighbors (NN)** algorithm is used with **cosine distance** to compute similarity between TV shows. The algorithm finds the most similar shows to a user-provided title.
3. **Recommendation Generation:**
- The model generates a list of recommended TV shows by finding the nearest neighbors of the input title using cosine similarity.
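The core lookup can be sketched without scikit-learn: compute the cosine distance from the input vector to every row vector and keep the smallest distances (a simplified stand-in for `NearestNeighbors(metric='cosine')`, assuming vectors are plain Python lists):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def nearest_neighbors(query, vectors, k=3):
    # Return the indices of the k vectors closest to the query.
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]
```

In the real program the vectors are the combined TF-IDF/SVD, Sentence-BERT, and scaled numerical features, and scikit-learn handles the neighbor search efficiently.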
---
## Features
1. **Data Loading & Preprocessing:**
- Loads the TV show data from a CSV file and preprocesses it for model training.
2. **Model Training with K-NN:**
- Trains a K-NN model using the **NearestNeighbors** algorithm for generating recommendations.
3. **User Input for Recommendations:**
- Accepts user input for the TV show title and the number of recommendations.
4. **TV Show Recommendations:**
- Returns a list of recommended TV shows based on similarity to the input TV show.
---
## Requirements
### Data Requirements:
The dataset should contain the following columns for each TV show:
- **Title**
- **Genres**
- **First/Last air date**
- **Vote count/average**
- **Director**
- **Overview**
- **Networks**
- **Spoken languages**
- **Number of seasons/episodes**
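One way to check a freshly downloaded CSV against these requirements before training (a small sketch using only the standard library; the column names follow those referenced in `readdata.py` and `recommendations.py`):

```python
import csv

# Columns the recommender reads; derived from the code, not an official schema.
REQUIRED_COLUMNS = {
    'name', 'genres', 'first_air_date', 'last_air_date',
    'vote_count', 'vote_average', 'created_by', 'overview',
    'networks', 'spoken_languages', 'original_language',
    'number_of_seasons', 'number_of_episodes',
}

def missing_columns(csv_path):
    # Report required columns absent from the CSV header row.
    with open(csv_path, newline='', encoding='utf-8') as f:
        header = next(csv.reader(f))
    return sorted(REQUIRED_COLUMNS - set(header))
```

Running this on `dataset/TMDB_tv_dataset_v3.csv` and getting an empty list back suggests the download extracted correctly.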
### User Input Requirements:
- **TV Show Title**: The name of the TV show you like.
- **Number of Recommendations**: The number of recommendations you want to receive (default is 10).
---
## Libraries
The following libraries are required to run the program:
- **pandas**: Data manipulation and analysis.
- **scikit-learn**: Machine learning algorithms and preprocessing.
- **scipy**: Scientific computing (e.g., sparse matrices).
- **sentence-transformers**: Sentence-BERT embeddings of show descriptions.
- **flask**: The web interface.

The standard-library modules `time`, `os`, `re`, and `textwrap` are also used, but need no separate installation.
To install the dependencies, run:
```bash
pip install -r requirements.txt
```
84
app.py Normal file

@ -0,0 +1,84 @@
from flask import Flask, render_template, request
from readdata import LoadData
from recommendations import RecommendationLoader
from training import TrainModel

app = Flask(__name__)

data_loader = LoadData()
title_data = data_loader.load_data()
model = TrainModel(title_data)
model.train()
recommender = RecommendationLoader(model, title_data)


@app.route('/')
def home():
    return render_template('index.html')


@app.route('/recommend', methods=['POST'])
def recommend():
    # Get user input
    title = request.form.get('title', '').strip()

    # Validate user input
    if not title:
        return render_template('index.html', message="Please enter a valid TV show title.")
    try:
        # Convert inside the try so a non-numeric value is caught as well
        n_recommendations = int(request.form.get('n_recommendations', 10))
        if n_recommendations < 1 or n_recommendations > 50:
            raise ValueError("Number of recommendations must be between 1 and 50.")
    except ValueError as e:
        return render_template('index.html', message=str(e))

    # Find the dataset row matching the requested title
    target_row = title_data[title_data['name'].str.lower() == title.lower()]

    # Check if a match was found
    if target_row.empty:
        return render_template('index.html', message=f"No match found for '{title}'. Try again.")

    # Get recommendations
    target_row = target_row.iloc[0]
    user_data = {'title': title, 'n_rec': n_recommendations}
    recommendations = recommender.get_recommendations("flask", target_row, user_data)

    # Check if recommendations were found
    if recommendations is None or recommendations.empty:
        return render_template('index.html', message=f"Sorry, no recommendations available for {title}.")

    # Prepare data for display on the webpage
    recommendations_data = []
    for _, row in recommendations.iterrows():
        # Extract the first and last air dates
        first_air_date = recommender.extract_years(row['first_air_date'])
        last_air_date = recommender.extract_years(row['last_air_date'])
        if last_air_date != "Ongoing" and last_air_date:
            years = f"{first_air_date} - {last_air_date}"
        else:
            years = f"{first_air_date}"
        recommendations_data.append({
            'title': row['name'],
            'genres': ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres'],
            'overview': row['overview'],
            'rating': row['vote_average'],
            'seasons': row['number_of_seasons'],
            'episodes': row['number_of_episodes'],
            'networks': ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A',
            'years': years,
        })
    return render_template('index.html', recommendations=recommendations_data, original_title=title)


if __name__ == '__main__':
    app.run(debug=True)

Binary file not shown.

File diff suppressed because it is too large (5 files)

76
import_data.py Normal file

@ -0,0 +1,76 @@
import re
import os
import pandas as pd


###############################################################
#### Class: ImportData
###############################################################
class ImportData:
    def __init__(self):
        self.data = None
        self.loaded_datasets = []

    ###########################################################
    #### Function: load_dataset
    ###########################################################
    def load_dataset(self, dataset_path):
        # Load data from dataset CSV file
        try:
            df = pd.read_csv(os.path.join('dataset', dataset_path))
            return df
        except FileNotFoundError:
            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')
            return None

    ###########################################################
    #### Function: create_data
    ###########################################################
    def create_data(self, filename):
        # load_dataset returns None instead of raising, so check the result
        self.data = self.load_dataset(filename)
        if self.data is None:
            print("No data imported, missing dataset")
            return None
        print('Imported data successfully.')

    ###########################################################
    #### Function: clean_data
    ###########################################################
    def clean_data(self):
        if self.data is not None:
            # Drop unnecessary columns
            df_cleaned = self.data.drop(columns=['adult', 'poster_path', 'production_companies',
                'in_production', 'backdrop_path', 'production_countries', 'status', 'episode_run_time',
                'original_name', 'popularity', 'tagline', 'homepage'], errors='ignore')
            # Keep only rows whose text columns are pure ASCII
            text_columns = ['name', 'overview', 'spoken_languages']
            masks = [df_cleaned[col].apply(lambda x: isinstance(x, str) and bool(re.match(r'^[\x00-\x7F]*$', x)))
                     for col in text_columns]
            combined_mask = pd.concat(masks, axis=1).all(axis=1)
            self.data = df_cleaned[combined_mask]
            print(f'Data cleaned. {self.data.shape[0]} records remaining.')
        else:
            print("No data to clean. Please load the dataset first.")

    ###########################################################
    #### Function: save_data
    ###########################################################
    def save_data(self):
        if self.data is not None:
            try:
                # Save dataframe to CSV
                self.data.to_csv('data.csv', index=False)
                print('Data saved to data.csv.')
            except Exception as e:
                print(f'Error saving data: {e}')
        else:
            print("No data to save. Please clean the data first.")

174
main.py

@ -1,177 +1,23 @@
import pandas as pd
import re
import os
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from textwrap import dedent


class LoadData:
    def __init__(self):
        self.data = None
        self.loaded_datasets = []

    def load_data(self):
        self.create_data()
        self.clean_data()
        num_rows = self.data.shape[0]
        print(f'{num_rows} titles loaded successfully.')
        return self.data

    def clean_text(self, text):
        if isinstance(text, str):
            cleaned = re.sub(r'[^\x00-\x7F]+', '', text)
            cleaned = cleaned.replace('#', '').replace('"', '')
            return cleaned.strip()
        return ''

    def load_dataset(self, dataset_path, stream):
        try:
            df = pd.read_csv(f'dataset/{dataset_path}')
            df['stream'] = stream
            if stream != 'IMDB':
                df = df.drop(columns=['show_id', 'date_added', 'duration', 'rating'], errors='ignore')
                df = df.rename(columns={'listed_in': 'genres'})
            else:
                df = df.rename(columns={'releaseYear': 'release_year'})
                df = df.drop(columns=['numVotes', 'id', 'avaverageRating'], errors='ignore')
            self.loaded_datasets.append(stream)
            return df
        except FileNotFoundError:
            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')

    def create_data(self):
        print('Starting to read data ...')
        df_netflix = self.load_dataset('data_netflix.csv', 'Netflix')
        df_amazon = self.load_dataset('data_amazon.csv', 'Amazon')
        df_disney = self.load_dataset('data_disney.csv', 'Disney')
        df_imdb = self.load_dataset('data_imdb.csv', 'IMDB')
        dataframes = [df for df in [df_imdb, df_netflix, df_amazon, df_disney] if df is not None]
        if not dataframes:
            print("Error: No datasets loaded. Cannot create combined data.")
            return
        df_all = pd.concat(dataframes, ignore_index=True, sort=False)
        df_all = df_all.infer_objects(copy=False)
        self.data = df_all
        print(f'Data from {", ".join(self.loaded_datasets)} imported.')

    def clean_data(self):
        self.data.dropna(subset=['title', 'genres', 'description'], inplace=True)
        string_columns = self.data.select_dtypes(include=['object'])
        self.data[string_columns.columns] = string_columns.apply(lambda col: col.map(self.clean_text, na_action='ignore'))
        self.data = self.data[~self.data['title'].str.strip().isin(['', ':'])]
        self.data['genres'] = self.data['genres'].str.split(', ').apply(lambda x: [genre.strip() for genre in x])
        self.data = self.data[self.data['genres'].map(lambda x: len(x) > 0)]
        print(f'Data cleaned. {self.data.shape[0]} records remaining.')

from readdata import LoadData
from training import TrainModel
from recommendations import RecommendationLoader


class UserData:
    def __init__(self):
        self.user_data = None

    def input(self):
        self.user_data = input("Which Movie or TV Series do you prefer: ")
        return self.user_data.strip().lower()


class TrainModel:
    def __init__(self, title_data):
        self.recommendation_model = None
        self.title_data = title_data
        self.title_vectors = None
        self.vectorizer = TfidfVectorizer()
        self.preprocess_data()

    def preprocess_data(self):
        self.title_data['genres'] = self.title_data['genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) else '')
        self.title_data['combined_text'] = (
            self.title_data['title'].fillna('') + ' ' +
            self.title_data['director'].fillna('') + ' ' +
            self.title_data['cast'].fillna('') + ' ' +
            self.title_data['genres'] + ' ' +
            self.title_data['description'].fillna('')
        )
        self.title_data['combined_text'] = self.title_data['combined_text'].str.lower()
        self.title_data['combined_text'] = self.title_data['combined_text'].str.replace(r'[^a-z\s]', '', regex=True)
        self.title_vectors = self.vectorizer.fit_transform(self.title_data['combined_text'])

    def preprocess_user_input(self, user_input):
        user_vector = self.vectorizer.transform([user_input])
        return user_vector

    def train(self):
        self.recommendation_model = NearestNeighbors(n_neighbors=10, metric='cosine')
        self.recommendation_model.fit(self.title_vectors)


class RecommendationLoader:
    def __init__(self, model, title_data):
        self.model = model
        self.title_data = title_data

    def run(self):
        while True:
            user_data = UserData()
            user_input = user_data.input()
            if user_input in ['exit', 'quit']:
                print("Program will exit now. Thanks for using!")
                break
            self.get_recommendations(user_input)
            print("\nWrite 'exit' or 'quit' to end the program.")

    def get_recommendations(self, user_data):
        user_vector = self.model.preprocess_user_input(user_data)
        distances, indices = self.model.recommendation_model.kneighbors(user_vector, n_neighbors=10)
        recommendations = self.title_data.iloc[indices[0]]
        self.display_recommendations(user_data, recommendations)

    def display_recommendations(self, user_data, recommendations):
        print(f'\nRecommendations based on "{user_data}":\n')
        if not recommendations.empty:
            movie_recommendations = recommendations[recommendations['type'] == 'Movie']
            tv_show_recommendations = recommendations[recommendations['type'] == 'TV Show']
            if not movie_recommendations.empty:
                print("\n#################### Recommended Movies: ####################")
                for i, (_, row) in enumerate(movie_recommendations.iterrows(), start=1):
                    print(dedent(f"""
                        {i}. {row['title']} ({row['release_year']}) ({row['genres']})
                        Description: {row['description']}
                        Director: {row['director']}
                        Cast: {row['cast']}
                        ===============================================================
                    """))
            if not tv_show_recommendations.empty:
                print("\n#################### Recommended TV Shows: ####################")
                for i, (_, row) in enumerate(tv_show_recommendations.iterrows(), start=1):
                    print(dedent(f"""
                        {i}. {row['title']} ({row['release_year']}) ({row['genres']})
                        Description: {row['description']}
                        Director: {row['director']}
                        Cast: {row['cast']}
                        ===============================================================
                    """))
        else:
            print("No recommendations found.")


#########################################################################
#### function: main
#########################################################################
def main():
    # Load data from CSV file
    data_loader = LoadData()
    title_data = data_loader.load_data()
    # Train model
    model = TrainModel(title_data)
    model.train()
    # Run recommendation loader
    recommendations = RecommendationLoader(model, title_data)
    recommendations.run()

79
readdata.py Normal file

@ -0,0 +1,79 @@
import pandas as pd
from import_data import ImportData


#########################################################################
#### Class: LoadData
#########################################################################
class LoadData:
    def __init__(self):
        self.data = None
        self.filename = 'TMDB_tv_dataset_v3.csv'

    ###########################################################
    #### Function: load_data
    ###########################################################
    def load_data(self):
        self.read_data()
        self.clean_data()
        print(f'{self.data.shape[0]} titles loaded successfully.')
        return self.data

    ###########################################################
    #### Function: read_data
    ###########################################################
    def read_data(self):
        print("Starting to read data ...")
        try:
            # Try to read the preprocessed CSV file
            self.data = pd.read_csv('data.csv')
            print(f'{self.data.shape[0]} rows read successfully.')
        except FileNotFoundError:
            print("No data.csv file found. Attempting to import data...")
            # If the CSV file is not found, import data from the raw dataset instead
            try:
                data_importer = ImportData()
                data_importer.create_data(self.filename)
                data_importer.clean_data()
                data_importer.save_data()
                self.data = pd.read_csv('data.csv')
                print(f'{self.data.shape[0]} rows imported successfully.')
            except Exception as e:
                print(f"Error during data import process: {e}")

    ###########################################################
    #### Function: clean_data
    ###########################################################
    def clean_data(self):
        # Split a comma-separated string into a list, or use an empty list if no valid data
        def split_to_list(value):
            if isinstance(value, str):
                # Strip and split the string, and remove any empty items
                return [item.strip() for item in value.split(',') if item.strip()]
            return []

        data_start = self.data.shape[0]
        # Split genres, spoken_languages, networks, and created_by
        self.data['genres'] = self.data['genres'].apply(split_to_list)
        self.data['spoken_languages'] = self.data['spoken_languages'].apply(split_to_list)
        self.data['networks'] = self.data['networks'].apply(split_to_list)
        self.data['created_by'] = self.data['created_by'].apply(split_to_list)
        # Keep only shows whose original language is English
        self.data = self.data[self.data['original_language'] == 'en']
        # Drop rows with empty genres, spoken_languages, or networks lists
        self.data = self.data[
            self.data['genres'].map(lambda x: len(x) > 0) &
            self.data['spoken_languages'].map(lambda x: len(x) > 0) &
            self.data['networks'].map(lambda x: len(x) > 0)
        ]
        # Count rows that were dropped
        rows_dropped = data_start - len(self.data)
        print(f'Data cleaned successfully, dropped {rows_dropped} rows.')

155
recommendations.py Normal file

@ -0,0 +1,155 @@
from user import UserData
import pandas as pd
import textwrap


###############################################################
#### Class: RecommendationLoader
###############################################################
class RecommendationLoader:
    def __init__(self, model, title_data):
        self.model = model
        self.title_data = title_data

    ###########################################################
    #### Function: run
    ###########################################################
    def run(self):
        while True:
            user_data = UserData()
            user_data.title()
            user_data.n_recommendations()
            # Exit the program on 'exit' or 'quit'.
            if user_data.user_data['title'] in ['exit', 'quit']:
                print("Program will exit now. Thanks for using!")
                break
            # Find a row in the dataset to use as reference.
            target_row = self.title_data[self.title_data['name'].str.lower() == user_data.user_data['title']]
            # If no match found, loop and try again.
            if target_row.empty:
                print(f"No match found for '{user_data.user_data['title']}'. Try again.")
                continue
            # If a match is found, get recommendations.
            target_row = target_row.iloc[0]
            self.get_recommendations("cli", target_row, user_data.user_data)
            print("#" * 100)
            print("\nWrite 'exit' or 'quit' to end the program.")

    ###########################################################
    #### Function: get_recommendations
    ###########################################################
    def get_recommendations(self, mode, target_row, user_data):
        recommendations = pd.DataFrame()
        n_recommendations = user_data['n_rec']
        # Fetch and filter recommendations until n_recommendations is reached
        while len(recommendations) < n_recommendations:
            additional_recommendations = self.model.recommend(target_row, num_recommendations=20)
            additional_recommendations = additional_recommendations[~additional_recommendations.index.isin(recommendations.index)]
            additional_recommendations = self.filter_genres(additional_recommendations, target_row)
            # Stop if no new recommendations are available, to avoid an infinite loop
            if additional_recommendations.empty:
                break
            recommendations = pd.concat([recommendations, additional_recommendations])
        # Make sure we return at most n_recommendations
        recommendations = recommendations.head(n_recommendations)
        if mode == 'flask':
            return recommendations
        else:
            self.display_recommendations(user_data, recommendations, n_recommendations, target_row)

    ###########################################################
    #### Function: display_recommendations
    ###########################################################
    def display_recommendations(self, user_data, recommendations, n_recommendations, target_row):
        print(f'\n{n_recommendations} recommendations based on "{user_data["title"]}":\n')
        # Width of printed recommendations
        width = 100
        # Print recommendations if there are any
        if not recommendations.empty:
            print("#" * width)
            for index, row in recommendations.iterrows():
                title = row['name']
                genres = ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres']
                networks = ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A'
                created_by = ', '.join(row['created_by']) if isinstance(row['created_by'], list) and row['created_by'] else 'N/A'
                rating = row['vote_average']
                vote_count = row['vote_count']
                seasons = row['number_of_seasons'] if isinstance(row['number_of_seasons'], int) else 'N/A'
                episodes = row['number_of_episodes'] if isinstance(row['number_of_episodes'], int) else 'N/A'
                overview = textwrap.fill(row["overview"], width=width)
                # Extract years for first_air_date and last_air_date
                first_year = self.extract_years(row["first_air_date"])
                last_year = self.extract_years(row["last_air_date"])
                # Construct title with the year range
                title_raw = f"{title} ({first_year}-{last_year})"
                title = textwrap.fill(title_raw, width=width)
                # Print recommendation
                print(f"\nTitle: {title}")
                print(f"Genres: {genres}")
                if created_by != 'N/A':
                    print(f"Director: {created_by}")
                if networks != 'N/A':
                    print(f'Networks: {networks}')
                print(f"Rating: {rating:.1f} ({vote_count:.0f} votes)")
                if seasons != 'N/A' and episodes != 'N/A':
                    print(f"Seasons: {seasons} ({episodes} episodes)")
                print(f'\n{overview}\n')
                print("-" * width)
            print("\nEnd of recommendations.")
        else:
            print("No recommendations found.")

    ###########################################################
    #### Function: extract_years
    ###########################################################
    def extract_years(self, air_date):
        # Make sure air_date is not null
        if pd.isna(air_date):
            return "Unknown"
        # Convert float to int if needed
        if isinstance(air_date, float):
            return str(int(air_date))
        return air_date.split('-')[0]

    ###########################################################
    #### Function: filter_genres
    ###########################################################
    def filter_genres(self, recommendations, target_row):
        # Get genres from the target row
        reference_genres = [genre.lower() for genre in target_row['genres']]
        # Check if the reference includes specific genres
        is_kids_reference = 'kids' in reference_genres
        is_animated_reference = 'animation' in reference_genres
        is_reality_reference = 'reality' in reference_genres
        is_documentary_reference = 'documentary' in reference_genres
        # Drop niche genres the reference show does not share
        if not is_kids_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'kids' in [g.lower() for g in x])]
        if not is_animated_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'animation' in [g.lower() for g in x])]
        if not is_reality_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'reality' in [g.lower() for g in x])]
        if not is_documentary_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'documentary' in [g.lower() for g in x])]
        return recommendations

7
requirements.txt Normal file

@ -0,0 +1,7 @@
Flask==3.0.1
numpy==1.26.4
pandas==2.2.0
scikit-learn==1.4.1.post1
scipy==1.12.0
sentence-transformers==3.2.1
textwrap3==0.9.0

92
templates/index.html Normal file

@ -0,0 +1,92 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Recommendation System</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 0;
padding: 20px;
background-color: #f4f4f9;
}
.container {
max-width: 1200px;
margin: 0 auto;
padding: 20px;
}
h1 {
text-align: center;
color: #333;
}
form {
text-align: center;
margin-bottom: 20px;
}
.recommendation {
background: #fff;
padding: 15px;
border-radius: 8px;
margin-bottom: 20px;
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
}
.recommendation h3 {
margin-top: 0;
}
.recommendation p {
margin: 5px 0;
}
.recommendation .overview {
font-size: 14px;
color: #555;
}
.error-message {
color: red;
text-align: center;
font-weight: bold;
}
</style>
</head>
<body>
<div class="container">
<h1>TV-Show Recommendations</h1>
<!-- Recommendation Form -->
<form method="POST" action="/recommend">
<label for="title">Enter a Title (TV Show):</label><br><br>
<input type="text" id="title" name="title" required><br><br>
<label for="n_recommendations">Number of Recommendations:</label><br><br>
<input type="number" id="n_recommendations" name="n_recommendations" value="10" min="1" max="50"><br><br>
<input type="submit" value="Get Recommendations">
</form>
<!-- Display Error Message if any -->
{% if message %}
<div class="error-message">{{ message }}</div>
{% endif %}
<!-- Display Recommendations -->
{% if recommendations %}
<h2>Recommendations based on "{{ original_title }}":</h2>
<div class="recommendations">
{% for rec in recommendations %}
<div class="recommendation">
<h3>{{ rec.title }} ({{ rec.years }})</h3>
<p><strong>Genres:</strong> {{ rec.genres }}</p>
<p><strong>Networks:</strong> {{ rec.networks }}</p>
<p><strong>Rating:</strong> {{ rec.rating }}</p>
<p><strong>Seasons:</strong> {{ rec.seasons }} ({{ rec.episodes }} episodes)</p>
<p class="overview"><strong>Overview:</strong> {{ rec.overview }}</p>
</div>
{% endfor %}
</div>
{% endif %}
</div>
</body>
</html>

146
training.py Normal file

@ -0,0 +1,146 @@
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack, csr_matrix
import pickle
import time
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='sklearn')


#########################################################################
#### Class: TrainModel
#########################################################################
class TrainModel:
    def __init__(self, title_data):
        self.title_data = title_data
        # Initialize Sentence-BERT model for embeddings
        self.bert_model = SentenceTransformer('all-MiniLM-L12-v2')
        # TF-IDF Vectorization settings
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=0.01, max_df=0.5)
        # Nearest Neighbors settings
        self.nearest_neighbors = NearestNeighbors(metric='cosine')
        # Scaler for numerical features
        self.scaler = StandardScaler()
        # SVD for dimensionality reduction
        self.svd = TruncatedSVD(n_components=300)

    ###########################################################
    #### Function: train
    ###########################################################
    def train(self):
        print("Starting to train model...")
        start = time.time()
        # Preprocess title data with advanced embeddings included
        preprocessed_data = self.preprocess_title_data()
        # Fit Nearest Neighbors on the combined feature set
        self.nearest_neighbors.fit(preprocessed_data)
        print(f'Trained model successfully in {time.time() - start:.2f} seconds.')

    ###########################################################
    #### Function: recommend
    ###########################################################
    def recommend(self, target_row, num_recommendations=40):
        # Preprocess target data
        target_vector = self.preprocess_target_data(target_row)
        # Use Nearest Neighbors to get recommendations
        distances, indices = self.nearest_neighbors.kneighbors(target_vector, n_neighbors=num_recommendations)
        recommendations = self.title_data.iloc[indices[0]].copy()
        recommendations['distance'] = distances[0]
        # Drop the query title itself and distant matches
        recommendations = recommendations[
            (recommendations['name'].str.lower() != target_row['name'].lower()) &
            (recommendations['distance'] < 0.5)
        ]
        return recommendations.head(num_recommendations)

    ###########################################################
    #### Function: preprocess_title_data
    ###########################################################
    def preprocess_title_data(self):
        # Combine text fields for TF-IDF and BERT
        self.title_data['combined_text'] = (
            self.title_data['overview'].fillna('').apply(str) + ' ' +
            self.title_data['genres'].fillna('').apply(str) + ' ' +
            self.title_data['created_by'].fillna('').apply(str)
        )
        # TF-IDF + SVD
        text_features = self.vectorizer.fit_transform(self.title_data['combined_text'])
        text_features = self.svd.fit_transform(text_features)
        # Sentence-BERT embeddings (cached on disk after the first run)
        bert_embeddings = self.load_pickle('bert_embeddings.pkl', self.title_data['combined_text'])
        # Numerical features
        self.numerical_data = self.title_data.select_dtypes(include=['number'])
        numerical_features = self.scaler.fit_transform(self.numerical_data)
        numerical_features_sparse = csr_matrix(numerical_features)
        # Combine all features
        combined_features = hstack([csr_matrix(text_features), csr_matrix(bert_embeddings),
                                    numerical_features_sparse])
        return combined_features

    ###########################################################
    #### Function: preprocess_target_data
    ###########################################################
    def preprocess_target_data(self, target_row):
        # TF-IDF + SVD
        target_text_vector = self.vectorizer.transform([target_row['combined_text']])
        target_text_vector = self.svd.transform(target_text_vector)
        # Sentence-BERT embedding
        target_bert_embedding = self.embed_text(target_row['combined_text']).reshape(1, -1)
        # Numerical features
        target_numerical = target_row[self.numerical_data.columns].values.reshape(1, -1)
        target_numerical_scaled = self.scaler.transform(target_numerical)
        # Combine all features
        target_vector = hstack([csr_matrix(target_text_vector), csr_matrix(target_bert_embedding),
                                csr_matrix(target_numerical_scaled)])
        return target_vector

    ###########################################################
    #### Function: embed_text
    ###########################################################
    def embed_text(self, text):
        # Use Sentence-BERT to create embeddings
        return self.bert_model.encode(text, convert_to_numpy=True)

    ###########################################################
    #### Function: load_pickle
    ###########################################################
    def load_pickle(self, filename, texts):
        # Load cached embeddings, or generate and cache them on the first run
        try:
            with open(filename, 'rb') as f:
                bert_embeddings = pickle.load(f)
        except FileNotFoundError:
            print("Generating Sentence-BERT embeddings...")
            bert_embeddings = self.bert_model.encode(texts.tolist(), batch_size=64, convert_to_numpy=True)
            with open(filename, 'wb') as f:
                pickle.dump(bert_embeddings, f)
        return bert_embeddings

34
user.py Normal file

@ -0,0 +1,34 @@
###############################################################
#### Class: UserData
###############################################################
class UserData:
    def __init__(self):
        self.user_data = {}
        self.n_rec = 10

    ###########################################################
    #### Function: title
    ###########################################################
    def title(self):
        # Ask for user input
        print("#" * 100)
        title = input("\nPlease enter the title of the TV series you prefer: ")
        self.user_data['title'] = title.strip().lower()
        return self.user_data

    ###########################################################
    #### Function: n_recommendations
    ###########################################################
    def n_recommendations(self):
        # Ask for number of recommendations
        while True:
            n_rec = input("How many recommendations do you want (minimum 5): ")
            try:
                n_rec = int(n_rec.strip())
                if n_rec < 5:
                    print("Please enter a number greater than or equal to 5.")
                else:
                    self.user_data['n_rec'] = n_rec
                    break
            except ValueError:
                print("Please enter a valid number.")