Update files to a working V1
This commit is contained in:
parent 0e8a994162
commit aa0f23c41b

88  README.md
@@ -1,56 +1,92 @@
-# Supervised Learning - Movie/TV-Show recommender
+# Supervised Learning - TV-Show recommender
 
 ## Specification
 
-Movie/TV-Show recommender
+TV-Show recommender
 
-This program will recommend you what movie or th-show to view based on what Movie/TV-Show you like.
-You should be able to search for recommendations from your Movie/TV-Show title, cast, director,
-release year and also Description, and get back a recommendations with a explanation on what just this
-title might suit you. It will get you about 25 recommendations of movies in order of rank from your search,
-and the same from TV-Shows that will match the same way as Movies.
+This program recommends TV shows to watch based on one you already like.
+You tell it which TV show you like and how many recommendations you want, and you get that
+number of recommendations, ranked by similarity to your search.
 
 ### Data Source:
 
-I will use 4 datasets from kaggle, 3 datasets from streaming-sites Netflix,
-Amazon Prime and Disney Plus, also 1 from a IMDB dataset.
+I will use a dataset from TMDB:
+https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows
 
 ### Model:
 
-I will use NearestNeighbors (NN) alhorithm that can help me find other titles based on features
-like Title, Release year, Description, Cast, Director and genres.
+I will use the NearestNeighbors (NN) algorithm together with the k-NearestNeighbors (k-NN) algorithm.
 
 ### Features:
 
-1. Load data from several data-files and preprocessing.
-2. Model training with k-NN algorithm.
-3. Search with explanation
+1. Load data from the dataset and preprocess it.
+2. Model training with the NN & k-NN algorithms.
+3. User input
+4. Recommendations
 
 ### Requirements:
 
 1. Title data:
    * Title
    * Genres
-   * Release year
-   * Cast
+   * First/last air date
+   * Vote count/average
    * Director
    * Description
+   * Networks
+   * Spoken languages
+   * Number of seasons/episodes
-3. User data:
-   * What Movie / TV-Show
-   * What genre
-   * Director
+2. User data:
+   * Which TV show is preferred
+   * Number of recommendations wanted
 
 ### Libraries
 
 * pandas: Data manipulation and analysis
 * scikit-learn: machine learning algorithms and preprocessing
-* beatifulsoup4: web scraping (if necessary)
+* scipy: scientific computing package for Python
+* time: functions for working with time
+* os: functions for interacting with the operating system
+* re: regular expression support
+* textwrap: text wrapping and filling
 
 ### Classes
 
 1. LoadData
    * load_data
-   * clean_text
+   * read_data
    * clean_data
+2. ImportData
    * load_dataset
    * create_data
+   * clean_data
    * save_data
-   * load_data
-2. UserData
+3. TrainModel
+   * train
+   * recommend
+   * preprocess_title_data
+   * preprocess_target_data
+4. UserData
    * input
-3. Recommendations
-   * get_recommendations
+   * n_recommendations
+5. RecommendationLoader
+   * run
+   * get_recommendations
+   * display_recommendations
+   * get_explanation
+   * check_genre_overlap
+   * check_created_by_overlap
+   * extract_years
+   * filter_genres
+
+### References
+
+* https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.NearestNeighbors.html
+* https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
+* https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html
+* https://scikit-learn.org/0.16/modules/generated/sklearn.decomposition.TruncatedSVD.html
+* https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
+* https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
+* https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html
+
+## How to run program
+
+### Before running program
+
+First, extract TMDB_tv_dataset_v3.zip in the dataset folder so that it contains TMDB_tv_dataset_v3.csv.
+
+### Running program
+
+Start main.py. It will load the dataset, ask for a title to base recommendations on, and ask how many recommendations you want. Press Enter and the recommendations are presented on screen.
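The Model section above (TF-IDF text features queried with cosine NearestNeighbors) can be sketched minimally. The show names and overviews below are made-up examples, not entries from the TMDB dataset:

```python
# Minimal sketch of the approach described above: vectorize show overviews
# with TF-IDF and query cosine NearestNeighbors for the most similar titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

shows = {
    "space drama": "astronauts explore deep space and alien worlds",
    "cop comedy": "two detectives solve crimes in a funny way",
    "space comedy": "a funny crew explores space and meets aliens",
}
names = list(shows)

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(shows.values())

model = NearestNeighbors(metric="cosine")
model.fit(vectors)

# Query the neighbors of "space drama"; the nearest neighbor is itself,
# and shows sharing vocabulary ("space") rank before unrelated ones.
distances, indices = model.kneighbors(vectors[0], n_neighbors=3)
ranked = [names[i] for i in indices[0]]
print(ranked)
```

The real project does the same thing at scale, with SVD-reduced vectors and extra numeric features stacked on.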
BIN  dataset/TMDB_tv_dataset_v3.zip (new file)
Binary file not shown.

216911  dataset/dataset_tmdb.csv
File diff suppressed because it is too large.
66  import_data.py (new file)
@@ -0,0 +1,66 @@
import re
import os

import pandas as pd


############################## Import data ##############################
class ImportData:

    def __init__(self):
        self.data = None
        self.loaded_datasets = []

    # ---------------------- Function: load_dataset ----------------------
    def load_dataset(self, dataset_path):
        # Load data from the dataset CSV file
        try:
            df = pd.read_csv(os.path.join('dataset', dataset_path))
            return df
        except FileNotFoundError:
            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')
            return None

    # ---------------------- Function: create_data ----------------------
    def create_data(self, filename):
        # load_dataset returns None (rather than raising) on a missing file
        self.data = self.load_dataset(filename)
        if self.data is not None:
            print('Imported data successfully.')
        else:
            print('No data imported, missing dataset.')

    # ---------------------- Function: clean_data ----------------------
    def clean_data(self):
        if self.data is not None:
            # Drop unnecessary columns
            df_cleaned = self.data.drop(columns=['adult', 'poster_path', 'production_companies',
                'in_production', 'backdrop_path', 'production_countries', 'status', 'episode_run_time',
                'original_name', 'popularity', 'tagline', 'homepage'], errors='ignore')

            # Keep only rows whose text columns contain pure ASCII strings
            text_columns = ['name', 'overview', 'spoken_languages']
            masks = [df_cleaned[col].apply(lambda x: isinstance(x, str) and bool(re.match(r'^[\x00-\x7F]*$', x)))
                     for col in text_columns]
            combined_mask = pd.concat(masks, axis=1).all(axis=1)

            self.data = df_cleaned[combined_mask]

            print(f'Data cleaned. {self.data.shape[0]} records remaining.')
        else:
            print("No data to clean. Please load the dataset first.")

    # ---------------------- Function: save_data ----------------------
    def save_data(self):
        if self.data is not None:
            try:
                # Save dataframe to CSV
                self.data.to_csv('data.csv', index=False)
                print('Data saved to data.csv.')
            except Exception as e:
                print(f'Error saving data: {e}')
        else:
            print("No data to save. Please clean the data first.")
172  main.py
@@ -1,177 +1,21 @@
-import pandas as pd
-import re
-import os
-from sklearn.neighbors import NearestNeighbors
-from sklearn.feature_extraction.text import TfidfVectorizer
-from textwrap import dedent
+from read_data import LoadData
+from trainmodel import TrainModel
+from recommendations import RecommendationLoader
 
 
-class LoadData:
-    def __init__(self):
-        self.data = None
-        self.loaded_datasets = []
-
-    def load_data(self):
-        self.create_data()
-        self.clean_data()
-        num_rows = self.data.shape[0]
-        print(f'{num_rows} titles loaded successfully.')
-        return self.data
-
-    def clean_text(self, text):
-        if isinstance(text, str):
-            cleaned = re.sub(r'[^\x00-\x7F]+', '', text)
-            cleaned = cleaned.replace('#', '').replace('"', '')
-            return cleaned.strip()
-        return ''
-
-    def load_dataset(self, dataset_path, stream):
-        try:
-            df = pd.read_csv(f'dataset/{dataset_path}')
-            df['stream'] = stream
-            if stream != 'IMDB':
-                df = df.drop(columns=['show_id', 'date_added', 'duration', 'rating'], errors='ignore')
-                df = df.rename(columns={'listed_in': 'genres'})
-            else:
-                df = df.rename(columns={'releaseYear': 'release_year'})
-                df = df.drop(columns=['numVotes', 'id', 'avaverageRating'], errors='ignore')
-            self.loaded_datasets.append(stream)
-            return df
-        except FileNotFoundError:
-            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')
-
-    def create_data(self):
-        print(f'Starting to read data ...')
-
-        df_netflix = self.load_dataset('data_netflix.csv', 'Netflix')
-        df_amazon = self.load_dataset('data_amazon.csv', 'Amazon')
-        df_disney = self.load_dataset('data_disney.csv', 'Disney')
-        df_imdb = self.load_dataset('data_imdb.csv', 'IMDB')
-
-        dataframes = [df for df in [df_imdb, df_netflix, df_amazon, df_disney] if df is not None]
-        if not dataframes:
-            print("Error: No datasets loaded. Cannot create combined data.")
-            return
-
-        df_all = pd.concat(dataframes, ignore_index=True, sort=False)
-        df_all = df_all.infer_objects(copy=False)
-        self.data = df_all
-
-        print(f'Data from {", ".join(self.loaded_datasets)} imported.')
-
-    def clean_data(self):
-        self.data.dropna(subset=['title', 'genres', 'description'], inplace=True)
-        string_columns = self.data.select_dtypes(include=['object'])
-        self.data[string_columns.columns] = string_columns.apply(lambda col: col.map(self.clean_text, na_action='ignore'))
-        self.data = self.data[~self.data['title'].str.strip().isin(['', ':'])]
-        self.data['genres'] = self.data['genres'].str.split(', ').apply(lambda x: [genre.strip() for genre in x])
-        self.data = self.data[self.data['genres'].map(lambda x: len(x) > 0)]
-        print(f'Data cleaned. {self.data.shape[0]} records remaining.')
-
-
-class UserData:
-    def __init__(self):
-        self.user_data = None
-
-    def input(self):
-        self.user_data = input("Which Movie or TV Series do you prefer: ")
-        return self.user_data.strip().lower()
-
-
-class TrainModel:
-    def __init__(self, title_data):
-        self.recommendation_model = None
-        self.title_data = title_data
-        self.title_vectors = None
-        self.vectorizer = TfidfVectorizer()
-        self.preprocess_data()
-
-    def preprocess_data(self):
-        self.title_data['genres'] = self.title_data['genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) else '')
-        self.title_data['combined_text'] = (
-            self.title_data['title'].fillna('') + ' ' +
-            self.title_data['director'].fillna('') + ' ' +
-            self.title_data['cast'].fillna('') + ' ' +
-            self.title_data['genres'] + ' ' +
-            self.title_data['description'].fillna('')
-        )
-        self.title_data['combined_text'] = self.title_data['combined_text'].str.lower()
-        self.title_data['combined_text'] = self.title_data['combined_text'].str.replace(r'[^a-z\s]', '', regex=True)
-        self.title_vectors = self.vectorizer.fit_transform(self.title_data['combined_text'])
-
-    def preprocess_user_input(self, user_input):
-        user_vector = self.vectorizer.transform([user_input])
-        return user_vector
-
-    def train(self):
-        self.recommendation_model = NearestNeighbors(n_neighbors=10, metric='cosine')
-        self.recommendation_model.fit(self.title_vectors)
-
-
-class RecommendationLoader:
-    def __init__(self, model, title_data):
-        self.model = model
-        self.title_data = title_data
-
-    def run(self):
-        while True:
-            user_data = UserData()
-            user_input = user_data.input()
-
-            if user_input in ['exit', 'quit']:
-                print("Program will exit now. Thanks for using!")
-                break
-
-            self.get_recommendations(user_input)
-            print("\nWrite 'exit' or 'quit' to end the program.")
-
-    def get_recommendations(self, user_data):
-        user_vector = self.model.preprocess_user_input(user_data)
-        distances, indices = self.model.recommendation_model.kneighbors(user_vector, n_neighbors=10)
-        recommendations = self.title_data.iloc[indices[0]]
-
-        self.display_recommendations(user_data, recommendations)
-
-    def display_recommendations(self, user_data, recommendations):
-        print(f'\nRecommendations based on "{user_data}":\n')
-
-        if not recommendations.empty:
-            movie_recommendations = recommendations[recommendations['type'] == 'Movie']
-            tv_show_recommendations = recommendations[recommendations['type'] == 'TV Show']
-
-            if not movie_recommendations.empty:
-                print("\n#################### Recommended Movies: ####################")
-                for i, (_, row) in enumerate(movie_recommendations.iterrows(), start=1):
-                    print(dedent(f"""
-                    {i}. {row['title']} ({row['release_year']}) ({row['genres']})
-                    Description: {row['description']}
-                    Director: {row['director']}
-                    Cast: {row['cast']}
-
-                    ===============================================================
-                    """))
-
-            if not tv_show_recommendations.empty:
-                print("\n#################### Recommended TV Shows: ####################")
-                for i, (_, row) in enumerate(tv_show_recommendations.iterrows(), start=1):
-                    print(dedent(f"""
-                    {i}. {row['title']} ({row['release_year']}) ({row['genres']})
-                    Description: {row['description']}
-                    Director: {row['director']}
-                    Cast: {row['cast']}
-
-                    ===============================================================
-                    """))
-        else:
-            print("No recommendations found.")
+############################## Main ############################################
 
 
 def main():
+    # Load data from CSV file
     data_loader = LoadData()
     title_data = data_loader.load_data()
 
+    # Train model
     model = TrainModel(title_data)
     model.train()
 
+    # Run recommendation loader
     recommendations = RecommendationLoader(model, title_data)
     recommendations.run()
71  read_data.py (new file)
@@ -0,0 +1,71 @@
import pandas as pd

from import_data import ImportData


############################## Load data ##############################
class LoadData:

    def __init__(self):
        self.data = None
        self.filename = 'TMDB_tv_dataset_v3.csv'

    # ---------------------- Function: load_data ----------------------
    def load_data(self):
        self.read_data()
        self.clean_data()
        print(f'{self.data.shape[0]} titles loaded successfully.')
        return self.data

    # ---------------------- Function: read_data ----------------------
    def read_data(self):
        print("Starting to read data ...")
        try:
            # Try to read the prepared CSV file
            self.data = pd.read_csv('data.csv')
            print(f'{self.data.shape[0]} rows read successfully.')
        except FileNotFoundError:
            print("No data.csv file found. Attempting to import data...")
            # If the CSV file is not found, import data from the raw dataset instead
            try:
                data_importer = ImportData()
                data_importer.create_data(self.filename)
                data_importer.clean_data()
                data_importer.save_data()
                self.data = pd.read_csv('data.csv')
                print(f'{self.data.shape[0]} rows imported successfully.')
            except Exception as e:
                print(f"Error during data import process: {e}")

    # ---------------------- Function: clean_data ----------------------
    def clean_data(self):
        # Split a comma-separated string into a list, or return an empty list if no valid data
        def split_to_list(value):
            if isinstance(value, str):
                # Strip and split the string, and remove any empty items
                return [item.strip() for item in value.split(',') if item.strip()]
            return []

        data_start = self.data.shape[0]

        # Split genres, spoken_languages, networks, and created_by
        self.data['genres'] = self.data['genres'].apply(split_to_list)
        self.data['spoken_languages'] = self.data['spoken_languages'].apply(split_to_list)
        self.data['networks'] = self.data['networks'].apply(split_to_list)
        self.data['created_by'] = self.data['created_by'].apply(split_to_list)

        # Drop rows that are not in English
        self.data = self.data[self.data['original_language'] == 'en']

        # Drop rows with empty lists in genres, spoken_languages, or networks
        self.data = self.data[
            self.data['genres'].map(lambda x: len(x) > 0) &
            self.data['spoken_languages'].map(lambda x: len(x) > 0) &
            self.data['networks'].map(lambda x: len(x) > 0)
        ]

        # Count rows that were dropped
        rows_dropped = data_start - len(self.data)

        print(f'Data cleaned successfully, dropped {rows_dropped} rows.')
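The split_to_list helper inside clean_data above is pure Python and can be exercised on its own:

```python
# Standalone copy of the split_to_list helper from clean_data above:
# it turns a comma-separated string into a list of trimmed, non-empty items,
# and anything that is not a string (e.g. NaN) into an empty list.
def split_to_list(value):
    if isinstance(value, str):
        return [item.strip() for item in value.split(',') if item.strip()]
    return []

print(split_to_list("Drama, Comedy , "))  # → ['Drama', 'Comedy']
print(split_to_list(float('nan')))        # → []
```

The isinstance check is what makes the function safe to apply over a pandas column that contains NaN values.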
197  recommendations.py (new file)
@@ -0,0 +1,197 @@
from user_data import UserData

import pandas as pd
import textwrap


############################## Recommendation loader ##############################
class RecommendationLoader:

    def __init__(self, model, title_data):
        self.model = model
        self.title_data = title_data

    # ------------------------ Function: run ------------------------
    def run(self):
        while True:
            user_data = UserData()
            user_data.title()
            user_data.n_recommendations()

            # Exit the program on 'exit' or 'quit'.
            if user_data.user_data['title'] in ['exit', 'quit']:
                print("Program will exit now. Thanks for using!")
                break

            # Find a row in the dataset to use as reference.
            target_row = self.title_data[self.title_data['name'].str.lower() == user_data.user_data['title']]

            # If no match is found, loop and try again.
            if target_row.empty:
                print(f"No match found for '{user_data.user_data['title']}'. Try again.")
                continue

            # If a match is found, get recommendations.
            target_row = target_row.iloc[0]
            self.get_recommendations(target_row, user_data.user_data)
            print("#" * 100)
            print("\nWrite 'exit' or 'quit' to end the program.")

    # ------------------------ Function: get_recommendations ------------------------
    def get_recommendations(self, target_row, user_data):
        n_recommendations = user_data['n_rec']
        recommendations = self.model.recommend(target_row, user_data['n_rec'])

        # Don't recommend e.g. a Reality title if the reference doesn't have that genre
        recommendations = self.filter_genres(recommendations, target_row)

        # Get more recommendations and filter until n_recommendations is reached
        while len(recommendations) < n_recommendations:
            additional_recommendations = self.model.recommend(target_row, num_recommendations=20)
            additional_recommendations = additional_recommendations[~additional_recommendations.index.isin(recommendations.index)]
            additional_recommendations = self.filter_genres(additional_recommendations, target_row)
            # Stop if no new recommendations remain, to avoid looping forever
            if additional_recommendations.empty:
                break
            recommendations = pd.concat([recommendations, additional_recommendations])

        # Make sure we give at most n_recommendations recommendations
        recommendations = recommendations.head(n_recommendations)

        self.display_recommendations(user_data, recommendations, n_recommendations, target_row)

    # ------------------------ Function: display_recommendations ------------------------
    def display_recommendations(self, user_data, recommendations, n_recommendations, target_row):
        print(f'\n{n_recommendations} recommendations based on "{user_data["title"]}":\n')

        # Width of printed recommendations
        width = 100

        # Print recommendations if there are any
        if not recommendations.empty:
            print("#" * width)

            for index, row in recommendations.iterrows():
                title = row['name']
                genres = ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres']
                networks = ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A'
                created_by = ', '.join(row['created_by']) if isinstance(row['created_by'], list) and row['created_by'] else 'N/A'
                rating = row['vote_average']
                vote_count = row['vote_count']
                seasons = int(row['number_of_seasons']) if pd.notna(row['number_of_seasons']) else 'N/A'
                episodes = int(row['number_of_episodes']) if pd.notna(row['number_of_episodes']) else 'N/A'
                overview = textwrap.fill(row["overview"], width=width)

                # Extract years from first_air_date and last_air_date
                first_year = self.extract_years(row["first_air_date"])
                last_year = self.extract_years(row["last_air_date"])

                # Construct the title with the year range
                title_raw = f"{title} ({first_year}-{last_year})"
                title = textwrap.fill(title_raw, width=width)

                # Print recommendation
                print(f"\nTitle: {title}")
                print(f"Genres: {genres}")
                if created_by != 'N/A':
                    print(f"Director: {created_by}")
                if networks != 'N/A':
                    print(f'Networks: {networks}')
                print(f"Rating: {rating:.1f} ({vote_count:.0f} votes)")
                if seasons != 'N/A' and episodes != 'N/A':
                    print(f"Seasons: {seasons} ({episodes} episodes)")
                print(f'\n{overview}\n')

                # Get explanation for recommendation
                explanation = self.get_explanation(row, target_row)
                print(f"{explanation}\n")

                print("-" * width)

            print("\nEnd of recommendations.")
        else:
            print("No recommendations found.")

    # ------------------------ Function: get_explanation ------------------------
    def get_explanation(self, row, target_row):
        explanation = []
        title = row['name']

        explanation.append(f"The title '{title}' was recommended because: \n")

        # Explain genre overlap
        genre_overlap = self.check_genre_overlap(target_row, row)
        if genre_overlap:
            overlapping_genres = ', '.join(genre_overlap)
            explanation.append(f"It shares the following genres with your preferences: {overlapping_genres}.\n")

        # Explain created_by overlap
        created_by_overlap = self.check_created_by_overlap(target_row, row)
        if created_by_overlap:
            overlapping_created_by = ', '.join(created_by_overlap)
            explanation.append(f"It shares the following director with your preferences: {overlapping_created_by}.\n")

        # Explain the distance metric
        explanation.append(f"The distance metric of {round(row['distance'], 2)} indicates that it is quite similar to your preferences.")
        return ' '.join(explanation)

    # ------------------------ Function: check_genre_overlap ------------------------
    def check_genre_overlap(self, target_row, row):
        # Get genres from the target row
        target_genres = set(genre.lower() for genre in target_row['genres'])
        # Get genres from the recommended row
        recommended_genres = set(genre.lower() for genre in row['genres'])

        # Find the intersection of the target genres and recommended genres
        overlap = target_genres.intersection(recommended_genres)

        return overlap

    # ------------------------ Function: check_created_by_overlap ------------------------
    def check_created_by_overlap(self, target_row, row):
        # Get created_by from the target row
        target_creators = set(creator.lower() for creator in target_row['created_by'])
        # Get created_by from the recommended row
        recommended_creators = set(creator.lower() for creator in row['created_by'])

        # Find the intersection of the target creators and recommended creators
        overlap = target_creators.intersection(recommended_creators)

        return overlap

    # ------------------------ Function: extract_years ------------------------
    def extract_years(self, air_date):
        # Make sure air_date is not null
        if pd.isna(air_date):
            return "Unknown"
        # Convert float to int if needed
        if isinstance(air_date, float):
            return str(int(air_date))
        return air_date.split('-')[0]

    # ------------------------ Function: filter_genres ------------------------
    def filter_genres(self, recommendations, target_row):
        # Get genres from the target row
        reference_genres = [genre.lower() for genre in target_row['genres']]

        # Check if the reference includes specific genres
        is_kids_reference = 'kids' in reference_genres
        is_animated_reference = 'animation' in reference_genres
        is_reality_reference = 'reality' in reference_genres
        is_documentary_reference = 'documentary' in reference_genres

        # Filter out recommendations in these genres unless the reference shares them
        if not is_kids_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'kids' in [g.lower() for g in x])]
        if not is_animated_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'animation' in [g.lower() for g in x])]
        if not is_reality_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'reality' in [g.lower() for g in x])]
        if not is_documentary_reference:
            recommendations = recommendations[~recommendations['genres'].apply(lambda x: 'documentary' in [g.lower() for g in x])]

        return recommendations
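check_genre_overlap and check_created_by_overlap above both reduce to a case-insensitive set intersection, which can be sketched standalone:

```python
# Case-insensitive overlap between two genre lists, mirroring
# check_genre_overlap above (the genre values here are made-up examples).
def genre_overlap(target_genres, recommended_genres):
    return {g.lower() for g in target_genres} & {g.lower() for g in recommended_genres}

print(genre_overlap(['Drama', 'Crime'], ['crime', 'Mystery']))  # → {'crime'}
```

Lower-casing both sides first is what makes 'Crime' and 'crime' count as the same genre.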
107  trainmodel.py (new file)
@@ -0,0 +1,107 @@
|
|||||||
|
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack, csr_matrix
import time

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='sklearn')


############################## Train model ##############################
class TrainModel:
    def __init__(self, title_data):
        self.title_data = title_data

        # Settings for TF-IDF vectorization
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=0.01, max_df=0.5)

        # Settings for nearest neighbors
        self.model = NearestNeighbors(metric='cosine')
        self.scaler = StandardScaler()

        # Settings for SVD dimensionality reduction
        self.svd = TruncatedSVD(n_components=300)

    # ---------------------- Function: train ----------------------
    def train(self):
        print("Starting to train model ...")

        start = time.time()

        # Preprocess title data
        preprocessed_data = self.preprocess_title_data()

        # Fit the NN model
        self.model.fit(preprocessed_data)

        stop = time.time()

        # Measure training time
        elapsed_time = stop - start
        print(f'Trained model successfully in {elapsed_time:.2f} seconds.')

    # ------------------------ Function: recommend ------------------------
    def recommend(self, target_row, num_recommendations=40):
        # Preprocess target data
        target_vector = self.preprocess_target_data(target_row)

        # Get nearest neighbors
        distances, indices = self.model.kneighbors(target_vector, n_neighbors=num_recommendations)
        recommendations = self.title_data.iloc[indices[0]].copy()
        recommendations['distance'] = distances[0]

        # Filter out the searched title itself and weak matches
        recommendations = recommendations[
            (recommendations['name'].str.lower() != target_row['name'].lower()) &
            (recommendations['distance'] < 0.5)
        ]
        return recommendations.head(num_recommendations)

    # ---------------------- Function: preprocess_title_data ----------------------
    def preprocess_title_data(self):
        # Combine text fields into a new column for vectorization
        self.title_data['combined_text'] = (
            self.title_data['overview'].fillna('').apply(str) + ' ' +
            self.title_data['genres'].fillna('').apply(str) + ' ' +
            self.title_data['created_by'].fillna('').apply(str)
        )

        # Vectorize the combined_text column and reduce its dimensionality with SVD
        text_features = self.vectorizer.fit_transform(self.title_data['combined_text'])
        text_features = self.svd.fit_transform(text_features)

        # Select numerical features in the DataFrame
        self.numerical_data = self.title_data.select_dtypes(include=['number'])

        # Keep only the rating as a numerical feature
        if 'vote_average' in self.numerical_data.columns:
            self.numerical_data = self.numerical_data[['vote_average']]

        # Scale numerical features
        numerical_features = self.scaler.fit_transform(self.numerical_data)
        numerical_features_sparse = csr_matrix(numerical_features)

        # Combine text and numerical features
        combined_features = hstack([csr_matrix(text_features), numerical_features_sparse])

        return combined_features

    # ---------------------- Function: preprocess_target_data ----------------------
    def preprocess_target_data(self, target_row):
        # Create the feature vector for the target row
        target_text_vector = self.vectorizer.transform([target_row['combined_text']])
        target_text_vector = self.svd.transform(target_text_vector)

        # Process numerical features of the reference target
        target_numerical = target_row[self.numerical_data.columns].values.reshape(1, -1)
        target_vector = hstack([csr_matrix(target_text_vector), csr_matrix(self.scaler.transform(target_numerical))])

        return target_vector
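The `TrainModel` pipeline above (TF-IDF on combined text, SVD reduction, scaled rating column, cosine nearest neighbours) can be exercised end-to-end on a toy DataFrame. This is a minimal self-contained sketch of the same steps, not the repo's code: the show names and overviews are invented for illustration, and the SVD is shrunk to 2 components to fit the tiny vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Toy stand-in for the TMDB data (titles invented for illustration)
shows = pd.DataFrame({
    'name': ['Space Saga', 'Star Quest', 'Bake Off', 'Cake Wars'],
    'overview': ['epic space adventure', 'space crew explores stars',
                 'amateur bakers compete', 'bakers battle with cakes'],
    'vote_average': [8.1, 7.9, 7.0, 6.8],
})

# Text features: TF-IDF followed by SVD, as in preprocess_title_data
tfidf = TfidfVectorizer()
text = TruncatedSVD(n_components=2).fit_transform(tfidf.fit_transform(shows['overview']))

# Numerical feature: scaled rating
nums = StandardScaler().fit_transform(shows[['vote_average']])

# Combine both feature blocks and fit a cosine-distance NN model
features = hstack([csr_matrix(text), csr_matrix(nums)]).tocsr()
nn = NearestNeighbors(metric='cosine').fit(features)

# Neighbours of the first show; index 0 is the show itself (distance 0)
distances, indices = nn.kneighbors(features[0], n_neighbors=2)
print(shows.iloc[indices[0]]['name'].tolist())
```

Note the same fitted `tfidf`, SVD, and scaler must be reused to transform a query row, which is exactly why `TrainModel` stores them on `self`.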
28 user_data.py Normal file
@ -0,0 +1,28 @@
############################## User input ##############################
class UserData:
    def __init__(self):
        self.user_data = {}
        self.n_rec = 10

    # ---------------------- Function: title ----------------------
    def title(self):
        # Ask for user input
        print("#" * 100)
        title = input("\nPlease enter the title of the TV-Series you prefer: ")
        self.user_data['title'] = title.strip().lower()
        return self.user_data

    # ---------------------- Function: n_recommendations ----------------------
    def n_recommendations(self):
        # Ask for the number of recommendations, retrying until input is valid
        while True:
            n_rec = input("How many recommendations do you want (minimum 5): ")
            try:
                n_rec = int(n_rec.strip())
                if n_rec < 5:
                    print("Please enter a number greater than or equal to 5.")
                else:
                    self.user_data['n_rec'] = n_rec
                    break
            except ValueError:
                print("Please enter a valid number.")
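The validation rule in the loop above (strip, parse as integer, require at least 5) can be factored into a pure helper so it is testable without mocking `input()`. This is a sketch only; `parse_n_rec` is a hypothetical helper, not part of the repo:

```python
def parse_n_rec(raw, minimum=5):
    """Return the requested recommendation count, or None if the input is invalid.

    Mirrors the checks in UserData.n_recommendations: whitespace is stripped,
    non-integers are rejected, and values below `minimum` are rejected.
    """
    try:
        n = int(raw.strip())
    except ValueError:
        return None
    return n if n >= minimum else None

print(parse_n_rec(' 12 '))  # valid input
print(parse_n_rec('abc'))   # not a number
print(parse_n_rec('3'))     # below the minimum
```

The interactive loop then only has to re-prompt whenever the helper returns `None`.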