
MapReduce & Hadoop for Basket Analysis and Movie Genre Trends

Bess Yang (qy561@nyu.edu), Iris Lu (hl5679@nyu.edu), Chloe Kwon (ekk294@nyu.edu)

Project Overview

This project focused on using the MapReduce framework to perform two distinct tasks: analyzing shopping basket data and counting distinct movies per genre over time. We implemented custom MapReduce jobs using Python and the mrjob library to efficiently process large datasets on the Hadoop platform. The first task involved finding item co-occurrences in shopping baskets, where we counted how often pairs of items appeared together in the same transaction. The second task was a movie genre analysis, where we tracked the number of distinct movies in two specific genres (Western and Sci-Fi) over time using a movie dataset.

The project was run on NYU’s HPC platform using a Hadoop cluster for large-scale data processing. We encountered challenges in optimizing the number of MapReduce steps in the basket analysis so that the job ran efficiently without generating excessive intermediate data. We ensured that both solutions scaled well in time and space by minimizing the amount of data shuffled between mappers and reducers.

Languages, Platforms, and Tools

  • Languages: Python

  • Tools: Hadoop, MRJob, NYU's High-Performance Computing (HPC) environment, Google Dataproc

  • Platforms: Local machine, GitHub (for version control), Hadoop Cluster on Google Cloud

Results

  • Basket Analysis (mr_basket.py)

    • We developed a MapReduce job that processed shopping session data to compute the co-occurrence of items within a single basket. The mapper extracts the purchased items from each session, and the reducer counts how many times each pair of items appears together. The output is in the form item, [co-occurring item, co-occurrence count]. This approach efficiently handles large datasets by distributing the workload across multiple machines.

  • Movie Genre Analysis (mr_sql.py)

    • This MapReduce job analyzed a movie dataset, focusing on two genres: Western and Sci-Fi. The mapper processed the movie records to extract the year and genre, and the reducer aggregated the number of distinct movies per genre per year. This job helped us identify trends in movie production for these genres over time.
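The basket-analysis pipeline described above can be sketched in plain Python, with a small in-memory shuffle helper standing in for mrjob's runner and Hadoop's shuffle phase. The function names, the two-step structure, and the sample data are illustrative, not the project's actual code:

```python
from collections import defaultdict
from itertools import combinations


def shuffle(pairs):
    """Group (key, value) pairs by key, like Hadoop's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


# Step 1: count how often each unordered item pair co-occurs.
def pair_mapper(basket):
    # Deduplicate and sort so each pair is emitted in one canonical order.
    for a, b in combinations(sorted(set(basket)), 2):
        yield (a, b), 1


def pair_reducer(pair, counts):
    yield pair, sum(counts)


# Step 2: re-key by the first item so each item lists its co-occurring items.
def rekey_mapper(pair, count):
    a, b = pair
    yield a, (b, count)


def collect_reducer(item, co_items):
    yield item, sorted(co_items)


def run(baskets):
    """Simulate the two chained MapReduce steps in memory."""
    step1 = shuffle(kv for basket in baskets for kv in pair_mapper(basket))
    counted = (kv for pair, counts in step1 for kv in pair_reducer(pair, counts))
    step2 = shuffle(kv for pair, n in counted for kv in rekey_mapper(pair, n))
    return dict(kv for item, cos in step2 for kv in collect_reducer(item, cos))
```

The second step exists only to reshape the pair counts into the item, [co-occurring item, count] output format; in the real job, each step would be an MRStep in the mrjob class.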
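The genre-analysis job can be sketched the same way. This version assumes a MovieLens-style tab-separated record, movie_id, then title with the year in parentheses, then a pipe-separated genre list; the regular expression, function names, and sample records are assumptions for illustration, and the actual dataset format may differ:

```python
import re
from collections import defaultdict

# Genres of interest and an assumed record format:
#   movie_id<TAB>title (year)<TAB>genre1|genre2|...
GENRES = {"Western", "Sci-Fi"}
LINE_RE = re.compile(r"^(\d+)\t.+\((\d{4})\)\t(.+)$")


def genre_mapper(line):
    """Emit ((genre, year), movie_id) for each genre of interest."""
    m = LINE_RE.match(line)
    if not m:
        return  # skip malformed records
    movie_id, year, genres = m.groups()
    for genre in genres.split("|"):
        if genre in GENRES:
            yield (genre, year), movie_id


def genre_reducer(key, movie_ids):
    """Count distinct movies for one (genre, year) group."""
    yield key, len(set(movie_ids))


def count_distinct(lines):
    """Simulate the map, shuffle, and reduce phases in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, movie_id in genre_mapper(line):
            groups[key].append(movie_id)
    return {k: next(genre_reducer(k, ids))[1] for k, ids in groups.items()}
```

Keying on the (genre, year) pair means the shuffle phase delivers all records for one genre and year to a single reducer call, which can then deduplicate movie IDs with a set.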

My Contributions

In this project, we did not split or distribute the work among team members. Instead, each of us completed the entire homework independently and then compared our results. We discussed our findings and approaches, ultimately creating a final version that we all agreed upon.

Just like my team members, I developed both MapReduce jobs. I worked on designing the logic for extracting relevant data in the mapper functions and performing aggregations in the reducers. For the basket analysis, I used Python's itertools.combinations function to identify co-occurring items within each session and implemented multiple MapReduce steps to handle intermediate outputs efficiently. For the movie genre analysis, I used regular expressions to parse movie records and a composite key of year and genre to ensure proper grouping.
