Big Data: Movie Recommendation System

Bess Yang (qy561@nyu.edu), Iris Lu (hl5679@nyu.edu), Chloe Kwon (ekk294@nyu.edu)

Project Overview

This capstone project built a movie recommendation system on top of Apache Spark. We explored popularity baseline models and collaborative filtering with latent factor models, including hyperparameter tuning. The goal was to compare the recommendation models by Mean Average Precision (MAP) and Root Mean Squared Error (RMSE) on datasets processed with Apache Spark.
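As a point of reference for the two metrics, here is a minimal plain-Python sketch. It is an illustration only, not the project's Spark evaluation code: the function names are hypothetical, and the average-precision normalization (dividing by the number of relevant items) is an assumption, since conventions vary between libraries.

```python
import math


def average_precision(recommended, relevant):
    """AP for one user: sum of precision@k at each rank k holding a relevant
    item, divided by the number of relevant items (assumed convention)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(recs_by_user, relevant_by_user):
    """MAP: average of per-user AP over every user that has ground truth."""
    users = list(relevant_by_user)
    if not users:
        return 0.0
    total = sum(
        average_precision(recs_by_user.get(user, []), relevant_by_user[user])
        for user in users
    )
    return total / len(users)


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))
```

MAP rewards placing relevant movies near the top of a ranked list, while RMSE measures how closely predicted ratings match observed ones, so the two can rank models differently.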

Report

GitHub

Languages, Platforms, and Tools

Languages: Python, SQL
Tools: Apache Spark, Parquet, Dataproc, Jupyter Notebooks
Platforms: NYU HPC (High-Performance Computing), Google Cloud, Greene (for Spark Standalone Cluster)

My Contributions

I contributed to the entire project, focusing on the implementation of collaborative filtering models (ALS), popularity baseline models, and hyperparameter tuning. Specifically:

For Q3, I partitioned the dataset into training, validation, and test sets (7:1.5:1.5 ratio) for cross-validation, to improve model performance and reduce overfitting.
For Q4, I implemented the weighted composite and genre-based popularity models, which achieved the highest MAP scores and improved recommendation accuracy by 15% over the popularity-based baseline model.
For Q5, I fine-tuned the ALS model using both MAP and RMSE as evaluation metrics. I also attempted to incorporate genome relevance into the model but ran into memory issues on large datasets.
Additionally, I reduced data processing time by 25% by optimizing our use of Apache Spark, Dataproc, and Parquet on the NYU HPC and Google Cloud platforms, which significantly improved the efficiency of large-scale data handling.
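The splitting and popularity steps above can be sketched in plain Python. This is a hedged illustration rather than the project's Spark code: `split_ratings` and `damped_popularity` are hypothetical helper names, the random row-level split is only one way to get a 7:1.5:1.5 partition, and the damped-mean score is a common popularity formula that is assumed here, not the project's actual weighted composite.

```python
import random
from collections import defaultdict


def split_ratings(ratings, train=0.70, val=0.15, seed=42):
    """Shuffle (user, movie, rating) rows and cut them 70/15/15.

    Whatever remains after the train and validation cuts becomes the test set.
    """
    rows = list(ratings)
    random.Random(seed).shuffle(rows)  # seeded for a reproducible split
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )


def damped_popularity(train_rows, damping=5.0, top_k=100):
    """Rank movies by sum(ratings) / (count + damping).

    The damping term (an assumed hyperparameter) pulls down movies that have
    only a handful of ratings, so one 5-star vote cannot top the list.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, movie, rating in train_rows:
        totals[movie] += rating
        counts[movie] += 1
    scores = {movie: totals[movie] / (counts[movie] + damping) for movie in totals}
    # Sort by score descending, breaking ties by movie id for determinism.
    return sorted(scores, key=lambda movie: (-scores[movie], movie))[:top_k]
```

The same top-k list is recommended to every user, so a baseline like this can be scored with MAP against each user's held-out ratings; the damping value would be tuned on the validation set.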