Helpfulness Prediction

Project information

  • Title: From Raw Data to Informed Decisions: Analyzing Amazon Book Reviews
  • Category: Big Data
  • Authors: Andrea Alberti, Davide Ligari, Cristian Andreoli
  • Project date: 26 September, 2023
  • Project URL: Github-Helpfulness_Prediction

Amazon Book Reviews Analysis

This repository contains code and resources for analyzing Amazon book reviews. The project aims to develop scalable solutions for several analyses, including sentiment analysis, review helpfulness prediction, and topic modeling.

Contents: Tools · Data · Hypotheses · Workflow · Repository Structure · Trained models

Tools

  • Big Data: Hadoop, Spark, MongoDB
  • Data Analysis: Pandas, Scikit-learn, Seaborn, Matplotlib, Jupyter Notebook

Data

We're using the Amazon Books Reviews dataset, containing 142.8 million reviews. The dataset comprises two tables: Books Ratings and Books Info.
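To analyze reviews together with book metadata, the two tables can be joined on the book title. The sketch below is a minimal PySpark example; the HDFS paths, CSV file names, and the shared Title column are assumptions about the layout, not the project's exact code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("amazon-book-reviews").getOrCreate()

# Hypothetical HDFS paths; replace with the actual locations of the two tables.
ratings = spark.read.csv("hdfs:///data/Books_rating.csv", header=True, inferSchema=True)
books = spark.read.csv("hdfs:///data/books_data.csv", header=True, inferSchema=True)

# Attach book metadata to every review, assuming both tables expose a Title column.
reviews = ratings.join(books, on="Title", how="inner")
reviews.printSchema()
```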

Hypotheses

We've formulated several hypotheses, including:

  1. Reviews with longer text have higher helpfulness ratings (a quick check of this is sketched after the list).
  2. Reviews with more positive sentiment words receive higher helpfulness ratings.
  3. Reviews with higher average book ratings have higher helpfulness ratings.
  4. The rating score is influenced by individual users.
  5. The review/score is influenced by the category of the book.
  6. The number of books published in a category affects the review score.
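As an illustration, the first hypothesis can be checked on a sample with pandas. The column names (review/text, review/helpfulness) and the "helpful/total" string format are assumptions about the Books Ratings schema, so adjust them to the actual data.

```python
import pandas as pd

# Hypothetical sample exported from the Books Ratings table.
df = pd.read_csv("reviews_sample.csv")

# Assumed format: helpfulness is stored as a "helpful/total" string, e.g. "7/9".
parts = df["review/helpfulness"].str.split("/", expand=True).astype(float)
df["helpfulness_ratio"] = parts[0] / parts[1].clip(lower=1)  # guard against zero votes
df["review_length"] = df["review/text"].str.len()

# A positive rank correlation supports hypothesis 1.
print(df[["review_length", "helpfulness_ratio"]].corr(method="spearman"))
```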

Workflow

  1. Data Preparation:
    • Load data into HDFS.
    • Clean data with Spark.
    • Perform data aggregation and transformations.
    • Load the transformed data into MongoDB (this step is sketched after the list).
  2. Modeling:
    • Choose appropriate models (classification, regression, clustering, dimensionality reduction).
  3. Evaluation:
    • Evaluate models using relevant metrics.
  4. Reporting:
    • Generate reports and insights.
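A minimal sketch of the data preparation step is shown below, assuming the joined reviews table from the Data section and the MongoDB Spark connector (option names follow the 10.x connector and the connection URI is expected in spark.mongodb.write.connection.uri; the key columns, database, and collection names are hypothetical).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("amazon-book-reviews-prep").getOrCreate()

# Joined reviews produced earlier (hypothetical HDFS path).
reviews = spark.read.parquet("hdfs:///data/joined_reviews.parquet")

# Basic cleaning: drop rows missing text or score, deduplicate on assumed key columns.
clean = (
    reviews
    .dropna(subset=["review/text", "review/score"])
    .dropDuplicates(["Id", "User_id"])
)

# Example aggregation: average score and number of reviews per book.
per_book = clean.groupBy("Title").agg(
    F.avg("review/score").alias("avg_score"),
    F.count("*").alias("n_reviews"),
)

# Write the aggregated table to MongoDB via the Spark connector.
(per_book.write.format("mongodb")
    .mode("overwrite")
    .option("database", "amazon_books")
    .option("collection", "book_stats")
    .save())
```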

Repository Structure

This repository is organized into the following folders, each serving a specific purpose:

  • Documents: Contains general documents used for synchronizing activities among team members. You can find meeting notes, project timelines, and any other relevant files here.
  • MapReduce Join: Contains the mapper and reducer scripts used for MapReduce data-processing tasks within the project (a minimal join sketch appears after this list).
  • Notebooks: This folder is further organized into subfolders, each dedicated to a specific aspect of the project:
    • Hypotheses Testing: Contains Jupyter notebooks with the code and documentation for testing and analyzing the project's hypotheses.
    • Model: Contains Jupyter notebooks for feature extraction, model training, and evaluation of our models' predictive capabilities; this is where the core data analysis and machine learning work happens (a minimal training sketch appears after this list).
    • MongoDB: Contains Jupyter notebooks for exporting a subset of the data to MongoDB, including data migration and integration tasks.
    • Spark: Contains Jupyter notebooks for preliminary data analysis, data cleaning, and testing the hypotheses on the complete dataset with Apache Spark.
  • Report: Contains LaTeX files for creating the project report. This is where you can find the documentation and presentation materials summarizing our project's goals, methodology, findings, and conclusions.
  • Presentation: Contains the template and images used in the PowerPoint presentation.
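The MapReduce join can be pictured as a reduce-side join in Hadoop Streaming style; the sketch below is an illustration only, and the tab-separated input, first-field join key, and file-name based tagging are assumptions rather than the repository's actual scripts.

```python
#!/usr/bin/env python3
# mapper.py -- tag each record with its source table and emit the join key first.
import os
import sys

# Hadoop Streaming exposes the current input file path in this environment variable.
source = "B" if "books_data" in os.environ.get("mapreduce_map_input_file", "") else "R"

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    title = fields[0]  # assumed: the join key (Title) is the first field
    print(f"{title}\t{source}\t{line.rstrip()}")
```

```python
#!/usr/bin/env python3
# reducer.py -- for each title, pair every rating record with its book record.
import sys

current_title, book_record, ratings = None, None, []

def flush(title, book, rating_records):
    # Emit joined rows only when the book metadata for this title was seen.
    if book is not None:
        for record in rating_records:
            print(f"{title}\t{book}\t{record}")

for line in sys.stdin:
    title, tag, record = line.rstrip("\n").split("\t", 2)
    if title != current_title:
        flush(current_title, book_record, ratings)
        current_title, book_record, ratings = title, None, []
    if tag == "B":
        book_record = record
    else:
        ratings.append(record)

flush(current_title, book_record, ratings)
```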
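For the Model notebooks, the overall pattern can be illustrated with a minimal scikit-learn sketch; the input file, feature choice (TF-IDF over the review text), regressor, and helpfulness_ratio target are assumptions, not the notebooks' exact pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical feature table exported from MongoDB (review text plus helpfulness ratio).
df = pd.read_csv("review_features.csv").dropna(subset=["review/text", "helpfulness_ratio"])

X_train, X_test, y_train, y_test = train_test_split(
    df["review/text"], df["helpfulness_ratio"], test_size=0.2, random_state=42
)

# TF-IDF text features feeding a random forest regressor.
model = make_pipeline(
    TfidfVectorizer(max_features=5000, stop_words="english"),
    RandomForestRegressor(n_estimators=200, random_state=42),
)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```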

Trained models

The models trained during the project are available in the following Google Drive folder.