# Supervised Learning - TV-Show Recommender
## Table of Contents
1. [How to Run the Program](#how-to-run-the-program)
2. [Project Overview](#project-overview)
3. [Dataset](#dataset)
4. [Model and Algorithm](#model-and-algorithm)
5. [Features](#features)
6. [Requirements](#requirements)
7. [Libraries](#libraries)
8. [Classes](#classes)
9. [References](#references)
## How to Run the Program
### Prerequisites
1. **Download and Extract the Dataset:**
   - Download the dataset from [TMDB TV Dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).
   - Extract `TMDB_tv_dataset_v3.zip` into the `dataset/` folder, so it contains the file `TMDB_tv_dataset_v3.csv`.
2. **Install Dependencies:**
   - Install the necessary libraries listed in `requirements.txt` (see below).
3. **Run the Program:**
   - Start the program by running the following command:
     ```bash
     python main.py
     ```
   - The program will load the dataset, ask for a TV show title to base recommendations on, and prompt for the number of recommendations.
   - **Note:** The first time the program is run, it will generate **Sentence-BERT embeddings**. This can take up to 5 minutes due to the large size of the dataset.
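
Since the embeddings are generated only on the first run, they are presumably cached on disk and reused afterwards. A minimal sketch of that pattern, assuming NumPy is available; the function name `load_or_compute_embeddings`, the cache path, and the `compute` callback are hypothetical and not taken from the project's code:

```python
import os

import numpy as np


def load_or_compute_embeddings(cache_path, compute):
    """Return cached embeddings if present; otherwise compute and cache them.

    `compute` is a zero-argument callable standing in for the slow
    Sentence-BERT encoding step (hypothetical, for illustration only).
    """
    if os.path.exists(cache_path):
        # Fast path: later runs reuse the saved array.
        return np.load(cache_path)
    embeddings = np.asarray(compute(), dtype=np.float32)
    # cache_path should end in ".npy"; np.save appends it otherwise.
    np.save(cache_path, embeddings)
    return embeddings
```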
---
## Project Overview
The **TV-Show Recommender** is a machine-learning program that suggests TV shows similar to one the user already likes. The system uses **Nearest Neighbors (NN)** and **K-Nearest Neighbors (K-NN)** algorithms with **cosine distance**: the user provides the title of a TV show, and the system returns the shows in the dataset that are most similar to it.

---
## Dataset
The dataset used in this project is sourced from **TMDB** (The Movie Database). It contains over 150,000 TV shows and includes information such as:
- Title of TV shows
- Genres
- First/Last air date
- Vote count and average rating
- Director/Creator information
- Overview/Description
- Networks
- Spoken languages
- Number of seasons/episodes

Download the dataset from [here](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows).

---
## Model and Algorithm
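The recommender system is based on nearest-neighbor search with the **NearestNeighbors** and **K-NearestNeighbors** algorithms. Strictly speaking, this is an unsupervised use of the technique, since no labels are learned: shows are compared only by the similarity of their description embeddings. Here's a breakdown of the process: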
1. **Data Preprocessing:**
   - The TV show descriptions are vectorized using **Sentence-BERT embeddings** to create dense vector representations of each show's description.
2. **Model Training:**
   - The **NearestNeighbors (NN)** algorithm is used with **cosine distance** to compute similarity between TV shows. The algorithm finds the most similar shows to a user-provided title.
3. **Recommendation Generation:**
   - The model generates a list of recommended TV shows by finding the nearest neighbors of the input title using cosine similarity.
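
The three steps above can be sketched with scikit-learn's `NearestNeighbors`. This is an illustration under stated assumptions, not the project's code: the tiny hand-written vectors stand in for Sentence-BERT embeddings, and the titles and the `recommend` helper are hypothetical. Cosine distance is 1 minus cosine similarity, so the smallest distances correspond to the most similar shows.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for Sentence-BERT embeddings (one row per show).
titles = ["Show A", "Show B", "Show C", "Show D"]
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# "Training" only indexes the vectors for cosine-distance lookups.
model = NearestNeighbors(metric="cosine").fit(embeddings)


def recommend(title, n=10):
    """Return up to n titles closest to `title`, excluding the show itself."""
    idx = titles.index(title)
    # Ask for one extra neighbor: the query show is its own nearest neighbor.
    k = min(n + 1, len(titles))
    _, neighbors = model.kneighbors(embeddings[idx:idx + 1], n_neighbors=k)
    return [titles[i] for i in neighbors[0] if i != idx][:n]


print(recommend("Show A", n=1))  # ['Show B']
```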
---
## Features
1. **Data Loading & Preprocessing:**
   - Loads the TV show data from a CSV file and preprocesses it for model training.
2. **Model Training with K-NN:**
   - Trains a K-NN model using the **NearestNeighbors** algorithm for generating recommendations.
3. **User Input for Recommendations:**
   - Accepts user input for the TV show title and the number of recommendations.
4. **TV Show Recommendations:**
   - Returns a list of recommended TV shows based on similarity to the input TV show.
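
A sketch of what the loading and preprocessing step could look like with pandas. The column names `name` and `overview` are assumptions and should be checked against the actual header of `TMDB_tv_dataset_v3.csv`; the `load_shows` helper is hypothetical, and an in-memory CSV replaces the real file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_shows(source):
    """Load shows and keep only rows usable for description-based similarity."""
    df = pd.read_csv(source)
    # Rows without a title or description cannot be embedded or looked up.
    df = df.dropna(subset=["name", "overview"])
    # Collapse runs of whitespace so the embedding step sees clean text.
    cleaned = df["overview"].str.replace(r"\s+", " ", regex=True).str.strip()
    df = df.assign(overview=cleaned)
    df = df[df["overview"] != ""]
    # Keep the first entry when a title appears more than once.
    return df.drop_duplicates(subset="name").reset_index(drop=True)


sample = io.StringIO(
    "name,overview\n"
    "Show A,A  detective   story.\n"
    "Show B,\n"
    "Show A,Duplicate entry.\n"
)
shows = load_shows(sample)
```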
---
## Requirements
### Data Requirements:
The dataset should contain the following columns for each TV show:
- **Title**
- **Genres**
- **First/Last air date**
- **Vote count/average**
- **Director**
- **Overview**
- **Networks**
- **Spoken languages**
- **Number of seasons/episodes**
### User Input Requirements:
- **TV Show Title**: The name of the TV show you like.
- **Number of Recommendations**: The number of recommendations you want to receive (default is 10).
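
One way the two inputs might be validated, sketched with the standard library only. The helper names `parse_count` and `find_title` are hypothetical; only the default of 10 comes from the text above, and case-insensitive title matching is an assumption rather than a documented feature.

```python
def parse_count(raw, default=10):
    """Parse the requested number of recommendations, falling back to the default."""
    raw = raw.strip()
    if not raw:
        return default  # the user just pressed Enter
    try:
        count = int(raw)
    except ValueError:
        return default  # not a number
    return count if count > 0 else default


def find_title(query, titles):
    """Return the stored title matching `query` case-insensitively, or None."""
    wanted = query.strip().casefold()
    for title in titles:
        if title.casefold() == wanted:
            return title
    return None
```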
---
## Libraries
The program uses the following third-party libraries:
- **pandas**: For data manipulation and analysis.
- **scikit-learn**: For machine learning algorithms and preprocessing.
- **scipy**: For scientific computing (e.g., sparse matrices).
- **flask**: For creating the web interface.

It also uses these modules from the Python standard library, which need no installation:
- **time**: For working with time-related functions.
- **os**: For interacting with the operating system.
- **re**: For regular expression support.
- **textwrap**: For text wrapping and formatting.

To install the third-party dependencies, run:
```bash
pip install -r requirements.txt