Update files to a working V1

This commit is contained in:
jwradhe 2024-11-06 22:37:54 +01:00
parent 0e8a994162
commit aa0f23c41b
13 changed files with 539 additions and 243081 deletions


@@ -1,56 +1,92 @@
# Supervised Learning - TV-Show recommender
## Specification
TV-Show recommender
This program recommends which TV-show to watch, based on a show you already like.
You enter a TV-show you like and how many recommendations you want, and you get that
number of TV-show recommendations, ranked by similarity to your search.
### Data Source:
I will use a dataset from TMDB (The Movie Database):
https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows
### Model:
I will use the NearestNeighbors (NN) algorithm together with the k-Nearest Neighbors (k-NN) algorithm.
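As a minimal sketch of the idea (toy overviews and illustrative values only, not the project's real dataset or settings), fitting NearestNeighbors with cosine distance on TF-IDF vectors looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for TV-show overviews (illustrative only)
overviews = [
    "detectives solve crimes in the city",
    "city detectives investigate crimes",
    "chefs compete in a cooking competition",
    "amateur chefs and a baking competition",
]

# TF-IDF turns each overview into a sparse vector
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(overviews)

# Cosine distance pairs shows that use the most similar vocabulary
model = NearestNeighbors(n_neighbors=2, metric="cosine")
model.fit(vectors)

# Query with the first show: after itself, its nearest neighbour
# is the other crime show
distances, indices = model.kneighbors(vectors[0], n_neighbors=2)
```

Here the query title itself comes back as the closest match (distance ~0), so a real recommender skips the first hit.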
### Features:
1. Load data from dataset and preprocess it.
2. Model training with the NN & k-NN algorithms.
3. User input
4. Recommendations
### Requirements:
1. Title data:
* Title
* Genres
* First/last air date
* Vote count/average
* Networks
* Spoken languages
* Number of seasons/episodes
2. User data:
* Preferred TV-show
* Number of recommendations wanted
### Libraries
* pandas: Data manipulation and analysis
* scikit-learn: machine learning algorithms and preprocessing
* beautifulsoup4: web scraping (if necessary)
* scipy: A scientific computing package for Python
* time: provides various functions for working with time
* os: functions for interacting with the operating system
* re: provides regular expression support
* textwrap: Text wrapping and filling
### Classes
1. LoadData
* load_data
* read_data
* clean_data
2. ImportData
* load_dataset
* create_data
* clean_data
* save_data
3. TrainModel
* train
* recommend
* preprocess_title_data
* preprocess_target_data
4. UserData
* title
* n_recommendations
5. RecommendationLoader
* run
* get_recommendations
* display_recommendations
* get_explanation
* check_genre_overlap
* check_created_by_overlap
* extract_years
* filter_genres
### References
* https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.NearestNeighbors.html
* https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
* https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/0.16/modules/generated/sklearn.decomposition.TruncatedSVD.html
* https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
* https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
* https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html
## How to run program
### Before running program
First, extract TMDB_tv_dataset_v3.zip in the dataset folder so that it contains TMDB_tv_dataset_v3.csv.
### Running program
Run main.py. It loads the dataset and asks for a title to base recommendations on, and for how many recommendations you want. Press Enter and the recommendations are printed on screen.

Binary file not shown.

5 file diffs suppressed because they are too large.

import_data.py (new file, 66 lines)
import re
import os
import pandas as pd

############################## Import data ##############################
class ImportData:
    def __init__(self):
        self.data = None
        self.loaded_datasets = []

    # ---------------------- Function: load_dataset ----------------------
    def load_dataset(self, dataset_path):
        # Load data from dataset CSV file
        try:
            df = pd.read_csv(os.path.join('dataset', dataset_path))
            return df
        except FileNotFoundError:
            print(f'Warning: "{dataset_path}" not found. Skipping this dataset.')
            return None

    # ---------------------- Function: create_data ----------------------
    def create_data(self, filename):
        # load_dataset already catches FileNotFoundError and returns None,
        # so check the result instead of wrapping this in another try/except
        self.data = self.load_dataset(filename)
        if self.data is not None:
            print('Imported data successfully.')
        else:
            print('No data imported, missing dataset.')

    # ---------------------- Function: clean_data ----------------------
    def clean_data(self):
        if self.data is not None:
            # Drop unnecessary columns
            df_cleaned = self.data.drop(columns=['adult', 'poster_path', 'production_companies',
                'in_production', 'backdrop_path', 'production_countries', 'status', 'episode_run_time',
                'original_name', 'popularity', 'tagline', 'homepage'], errors='ignore')
            # Keep only rows whose text columns contain pure ASCII
            text_columns = ['name', 'overview', 'spoken_languages']
            masks = [df_cleaned[col].apply(lambda x: isinstance(x, str) and bool(re.match(r'^[\x00-\x7F]*$', x)))
                     for col in text_columns]
            combined_mask = pd.concat(masks, axis=1).all(axis=1)
            self.data = df_cleaned[combined_mask]
            print(f'Data cleaned. {self.data.shape[0]} records remaining.')
        else:
            print("No data to clean. Please load the dataset first.")

    # ---------------------- Function: save_data ----------------------
    def save_data(self):
        if self.data is not None:
            try:
                # Save dataframe to CSV
                self.data.to_csv('data.csv', index=False)
                print('Data saved to data.csv.')
            except Exception as e:
                print(f'Error saving data: {e}')
        else:
            print("No data to save. Please clean the data first.")
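The non-ASCII row filter used in clean_data above can be sketched in isolation (toy titles, illustrative only):

```python
import re

def is_ascii(text):
    # Same check as in clean_data: keep only strings made entirely of ASCII characters
    return isinstance(text, str) and bool(re.match(r'^[\x00-\x7F]*$', text))

titles = ["Breaking News", "Caf\u00e9 Stories", None, "Plain Title"]
# Non-string and non-ASCII entries are dropped
kept = [t for t in titles if is_ascii(t)]
```

Note that `re.match` with `^[\x00-\x7F]*$` also accepts the empty string, so empty titles would pass this particular check.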

main.py

@@ -1,177 +1,21 @@
from read_data import LoadData
from trainmodel import TrainModel
from recommendations import RecommendationLoader

############################## Main ##############################
def main():
    # Load data from CSV file
    data_loader = LoadData()
    title_data = data_loader.load_data()
    # Train model
    model = TrainModel(title_data)
    model.train()
    # Run recommendation loader
    recommendations = RecommendationLoader(model, title_data)
    recommendations.run()

if __name__ == '__main__':
    main()

read_data.py (new file, 71 lines)
import pandas as pd
from import_data import ImportData

############################## Load data ##############################
class LoadData:
    def __init__(self):
        self.data = None
        self.filename = 'TMDB_tv_dataset_v3.csv'

    # ---------------------- Function: load_data ----------------------
    def load_data(self):
        self.read_data()
        self.clean_data()
        print(f'{self.data.shape[0]} titles loaded successfully.')
        return self.data

    # ---------------------- Function: read_data ----------------------
    def read_data(self):
        print("Starting to read data ...")
        try:
            # Try to read the CSV file
            self.data = pd.read_csv('data.csv')
            print(f'{self.data.shape[0]} rows read successfully.')
        except FileNotFoundError:
            print("No data.csv file found. Attempting to import data...")
            # If the CSV file is not found, import data from the dataset instead
            try:
                data_importer = ImportData()
                data_importer.create_data(self.filename)
                data_importer.clean_data()
                data_importer.save_data()
                self.data = pd.read_csv('data.csv')
                print(f'{self.data.shape[0]} rows imported successfully.')
            except Exception as e:
                print(f"Error during data import process: {e}")

    # ---------------------- Function: clean_data ----------------------
    def clean_data(self):
        # Split a comma-separated string into a list, or use an empty list if no valid data
        def split_to_list(value):
            if isinstance(value, str):
                # Strip and split the string, and remove any empty items
                return [item.strip() for item in value.split(',') if item.strip()]
            return []

        data_start = self.data.shape[0]
        # Split genres, spoken_languages, networks, and created_by into lists
        self.data['genres'] = self.data['genres'].apply(split_to_list)
        self.data['spoken_languages'] = self.data['spoken_languages'].apply(split_to_list)
        self.data['networks'] = self.data['networks'].apply(split_to_list)
        self.data['created_by'] = self.data['created_by'].apply(split_to_list)
        # Drop rows that are not in English
        self.data = self.data[self.data['original_language'] == 'en']
        # Drop rows with empty lists in genres, spoken_languages, or networks
        self.data = self.data[
            self.data['genres'].map(lambda x: len(x) > 0) &
            self.data['spoken_languages'].map(lambda x: len(x) > 0) &
            self.data['networks'].map(lambda x: len(x) > 0)
        ]
        # Count rows that were dropped
        rows_dropped = data_start - len(self.data)
        print(f'Data cleaned successfully, dropped {rows_dropped} rows.')
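A minimal sketch of the list-splitting and empty-row filtering done in clean_data above, on a toy DataFrame (made-up values):

```python
import pandas as pd

def split_to_list(value):
    # As in LoadData.clean_data: comma-separated string -> stripped list, anything else -> []
    if isinstance(value, str):
        return [item.strip() for item in value.split(',') if item.strip()]
    return []

df = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C"],
    "genres": ["Drama, Crime", None, "  "],
})
df["genres"] = df["genres"].apply(split_to_list)
# Rows whose genre list came out empty are dropped
df = df[df["genres"].map(lambda x: len(x) > 0)]
```

Both the `None` and the whitespace-only value end up as `[]`, so only "Show A" survives the filter.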

recommendations.py (new file, 197 lines)
from user_data import UserData
import pandas as pd
import textwrap

############################## Recommendation loader ##############################
class RecommendationLoader:
    def __init__(self, model, title_data):
        self.model = model
        self.title_data = title_data

    # ------------------------ Function: run ------------------------
    def run(self):
        while True:
            user_data = UserData()
            user_data.title()
            # Exit the program if the user writes exit or quit
            if user_data.user_data['title'] in ['exit', 'quit']:
                print("Program will exit now. Thanks for using!")
                break
            user_data.n_recommendations()
            # Find a row in the dataset to use as reference
            target_row = self.title_data[self.title_data['name'].str.lower() == user_data.user_data['title']]
            # If no match found, loop and try again
            if target_row.empty:
                print(f"No match found for '{user_data.user_data['title']}'. Try again.")
                continue
            # If a match is found, get recommendations
            target_row = target_row.iloc[0]
            self.get_recommendations(target_row, user_data.user_data)
            print("#" * 100)
            print("\nWrite 'exit' or 'quit' to end the program.")

    # ------------------------ Function: get_recommendations ------------------------
    def get_recommendations(self, target_row, user_data):
        n_recommendations = user_data['n_rec']
        recommendations = self.model.recommend(target_row, user_data['n_rec'])
        # Don't recommend e.g. a Reality title if the reference does not have that genre
        recommendations = self.filter_genres(recommendations, target_row)
        # Fetch and filter more recommendations until n_recommendations is reached
        while len(recommendations) < n_recommendations:
            additional_recommendations = self.model.recommend(target_row, num_recommendations=20)
            additional_recommendations = additional_recommendations[~additional_recommendations.index.isin(recommendations.index)]
            additional_recommendations = self.filter_genres(additional_recommendations, target_row)
            if additional_recommendations.empty:
                # No new candidates left; stop instead of looping forever
                break
            recommendations = pd.concat([recommendations, additional_recommendations])
        # Make sure we return at most n_recommendations recommendations
        recommendations = recommendations.head(n_recommendations)
        self.display_recommendations(user_data, recommendations, n_recommendations, target_row)

    # ------------------------ Function: display_recommendations ------------------------
    def display_recommendations(self, user_data, recommendations, n_recommendations, target_row):
        print(f'\n{n_recommendations} recommendations based on "{user_data["title"]}":\n')
        # Width of printed recommendations
        width = 100
        # Print recommendations if there are any
        if not recommendations.empty:
            print("#" * width)
            for _, row in recommendations.iterrows():
                title = row['name']
                genres = ', '.join(row['genres']) if isinstance(row['genres'], list) else row['genres']
                networks = ', '.join(row['networks']) if isinstance(row['networks'], list) and row['networks'] else 'N/A'
                created_by = ', '.join(row['created_by']) if isinstance(row['created_by'], list) and row['created_by'] else 'N/A'
                rating = row['vote_average']
                vote_count = row['vote_count']
                seasons = row['number_of_seasons'] if isinstance(row['number_of_seasons'], int) else 'N/A'
                episodes = row['number_of_episodes'] if isinstance(row['number_of_episodes'], int) else 'N/A'
                overview = textwrap.fill(row["overview"], width=width)
                # Extract years for first_air_date and last_air_date
                first_year = self.extract_years(row["first_air_date"])
                last_year = self.extract_years(row["last_air_date"])
                # Construct title with the year range
                title = textwrap.fill(f"{title} ({first_year}-{last_year})", width=width)
                # Print recommendation
                print(f"\nTitle: {title}")
                print(f"Genres: {genres}")
                if created_by != 'N/A':
                    print(f"Director: {created_by}")
                if networks != 'N/A':
                    print(f'Networks: {networks}')
                print(f"Rating: {rating:.1f} ({vote_count:.0f} votes)")
                if seasons != 'N/A' and episodes != 'N/A':
                    print(f"Seasons: {seasons} ({episodes} episodes)")
                print(f'\n{overview}\n')
                # Get explanation for recommendation
                explanation = self.get_explanation(row, target_row)
                print(f"{explanation}\n")
                print("-" * width)
            print("\nEnd of recommendations.")
        else:
            print("No recommendations found.")

    # ------------------------ Function: get_explanation ------------------------
    def get_explanation(self, row, target_row):
        explanation = []
        title = row['name']
        explanation.append(f"The title '{title}' was recommended because: \n")
        # Explain genre overlap
        genre_overlap = self.check_genre_overlap(target_row, row)
        if genre_overlap:
            overlapping_genres = ', '.join(genre_overlap)
            explanation.append(f"It shares the following genres with your preferences: {overlapping_genres}.\n")
        # Explain created_by overlap
        created_by_overlap = self.check_created_by_overlap(target_row, row)
        if created_by_overlap:
            overlapping_created_by = ', '.join(created_by_overlap)
            explanation.append(f"It shares the following director with your preferences: {overlapping_created_by}.\n")
        # Explain the distance metric
        explanation.append(f"The distance metric of {round(row['distance'], 2)} indicates that it is quite similar to your preferences.")
        return ' '.join(explanation)

    # ------------------------ Function: check_genre_overlap ------------------------
    def check_genre_overlap(self, target_row, row):
        # Get genres from the target row
        target_genres = set(genre.lower() for genre in target_row['genres'])
        # Get genres from the recommended row
        recommended_genres = set(genre.lower() for genre in row['genres'])
        # Find the intersection of the target genres and recommended genres
        return target_genres.intersection(recommended_genres)

    # ------------------------ Function: check_created_by_overlap ------------------------
    def check_created_by_overlap(self, target_row, row):
        # Get created_by from the target row
        target_creators = set(creator.lower() for creator in target_row['created_by'])
        # Get created_by from the recommended row
        recommended_creators = set(creator.lower() for creator in row['created_by'])
        # Find the intersection of the target creators and recommended creators
        return target_creators.intersection(recommended_creators)

    # ------------------------ Function: extract_years ------------------------
    def extract_years(self, air_date):
        # Make sure air_date is not null
        if pd.isna(air_date):
            return "Unknown"
        # Convert float to int if needed
        if isinstance(air_date, float):
            return str(int(air_date))
        return air_date.split('-')[0]

    # ------------------------ Function: filter_genres ------------------------
    def filter_genres(self, recommendations, target_row):
        # Get genres from the target row
        reference_genres = [genre.lower() for genre in target_row['genres']]
        # Drop recommendations with a genre the reference does not itself have
        for genre in ['kids', 'animation', 'reality', 'documentary']:
            if genre not in reference_genres:
                recommendations = recommendations[
                    ~recommendations['genres'].apply(lambda x, g=genre: g in [item.lower() for item in x])
                ]
        return recommendations
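The genre part of get_explanation boils down to a case-insensitive set intersection; a standalone sketch with made-up genre lists:

```python
def check_genre_overlap(target_genres, recommended_genres):
    # Case-insensitive intersection, mirroring RecommendationLoader.check_genre_overlap
    return set(g.lower() for g in target_genres) & set(g.lower() for g in recommended_genres)

# "Crime" and "crime" match once both sides are lowercased
overlap = check_genre_overlap(["Drama", "Crime"], ["crime", "Mystery"])
```

An empty intersection simply means that part of the explanation is omitted.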

trainmodel.py (new file, 107 lines)
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import hstack, csr_matrix
import time
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='sklearn')

############################## Train model ##############################
class TrainModel:
    def __init__(self, title_data):
        self.title_data = title_data
        # Settings for vectorization
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=0.01, max_df=0.5)
        # Settings for nearest neighbors
        self.model = NearestNeighbors(metric='cosine')
        self.scaler = StandardScaler()
        # Settings for SVD
        self.svd = TruncatedSVD(n_components=300)

    # ---------------------- Function: train ----------------------
    def train(self):
        print("Starting to train model ...")
        start = time.time()
        # Preprocess title data
        preprocessed_data = self.preprocess_title_data()
        # Train the NN model
        self.model.fit(preprocessed_data)
        stop = time.time()
        # Time the training
        elapsed_time = stop - start
        print(f'Trained model successfully in {elapsed_time:.2f} seconds.')

    # ------------------------ Function: recommend ------------------------
    def recommend(self, target_row, num_recommendations=40):
        # Preprocess target data
        target_vector = self.preprocess_target_data(target_row)
        # Get nearest neighbors
        distances, indices = self.model.kneighbors(target_vector, n_neighbors=num_recommendations)
        recommendations = self.title_data.iloc[indices[0]].copy()
        recommendations['distance'] = distances[0]
        # Drop the reference title itself and anything too dissimilar
        recommendations = recommendations[
            (recommendations['name'].str.lower() != target_row['name'].lower()) &
            (recommendations['distance'] < 0.5)
        ]
        return recommendations.head(num_recommendations)

    # ---------------------- Function: preprocess_title_data ----------------------
    def preprocess_title_data(self):
        # Combine text fields in a new column for vectorization
        self.title_data['combined_text'] = (
            self.title_data['overview'].fillna('').apply(str) + ' ' +
            self.title_data['genres'].fillna('').apply(str) + ' ' +
            self.title_data['created_by'].fillna('').apply(str)
        )
        # Vectorize the combined_text column and reduce its dimensionality with SVD
        text_features = self.vectorizer.fit_transform(self.title_data['combined_text'])
        text_features = self.svd.fit_transform(text_features)
        # Select numerical features in the DataFrame
        self.numerical_data = self.title_data.select_dtypes(include=['number'])
        # Use only the rating as numerical feature
        if 'vote_average' in self.numerical_data.columns:
            self.numerical_data = self.numerical_data[['vote_average']]
        # Scale numerical features
        numerical_features = self.scaler.fit_transform(self.numerical_data)
        numerical_features_sparse = csr_matrix(numerical_features)
        # Combine text and numerical features
        combined_features = hstack([csr_matrix(text_features), numerical_features_sparse])
        return combined_features

    # ---------------------- Function: preprocess_target_data ----------------------
    def preprocess_target_data(self, target_row):
        # Create feature vector for the target row
        target_text_vector = self.vectorizer.transform([target_row['combined_text']])
        target_text_vector = self.svd.transform(target_text_vector)
        # Process numerical features of the reference target
        target_numerical = target_row[self.numerical_data.columns].values.reshape(1, -1)
        target_vector = hstack([csr_matrix(target_text_vector), csr_matrix(self.scaler.transform(target_numerical))])
        return target_vector
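The feature construction in preprocess_title_data (TF-IDF text features, SVD reduction, a scaled numeric rating, then a sparse hstack) can be sketched on toy data; the dimensions and values here are illustrative, not the model's real settings:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

texts = [
    "crime drama detectives city",
    "cooking competition chefs kitchen",
    "space opera crew starship",
]
ratings = np.array([[8.1], [7.2], [8.7]])  # stand-in for vote_average

text_features = TfidfVectorizer().fit_transform(texts)                     # sparse (3, vocab)
text_features = TruncatedSVD(n_components=2).fit_transform(text_features)  # dense (3, 2)
numeric = StandardScaler().fit_transform(ratings)                          # (3, 1)

# One row per title: reduced text features next to the scaled rating
combined = hstack([csr_matrix(text_features), csr_matrix(numeric)])
```

Stacking the scaled rating alongside the SVD components lets a single cosine-distance model weigh text similarity and rating together.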

user_data.py (new file, 28 lines)
############################## User input ##############################
class UserData:
    def __init__(self):
        self.user_data = {}
        self.n_rec = 10

    # ---------------------- Function: title ----------------------
    def title(self):
        # Ask for user input
        print("#" * 100)
        title = input("\nPlease enter the title of the TV-series you prefer: ")
        self.user_data['title'] = title.strip().lower()
        return self.user_data

    # ---------------------- Function: n_recommendations ----------------------
    def n_recommendations(self):
        # Ask for number of recommendations
        while True:
            n_rec = input("How many recommendations do you want (minimum 5): ")
            try:
                n_rec = int(n_rec.strip())
                if n_rec < 5:
                    print("Please enter a number greater than or equal to 5.")
                else:
                    self.user_data['n_rec'] = n_rec
                    break
            except ValueError:
                print("Please enter a valid number.")
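The numeric validation in n_recommendations can be factored into a pure helper for testing (the function name here is hypothetical, not part of the project):

```python
def parse_n_recommendations(raw, minimum=5):
    # Mirrors the validation loop: returns an int >= minimum, or None for invalid input
    try:
        value = int(raw.strip())
    except ValueError:
        return None
    return value if value >= minimum else None
```

Separating parsing from the `input()` loop makes the minimum-value rule testable without simulating a terminal session.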