Document Classification

02 February 2018

Document Classification of the Reuters dataset using Apache Spark and GCP

Area: NLP, Big Data; Technologies: Spark, GCP.

The Reuters corpus, with more than a million records, needs to be classified into four labels according to document type. This labeled dataset is freely available, so machine learning enthusiasts can build models on it and evaluate them against a held-out test set.

To tackle a Natural Language Processing (NLP) problem at this scale, this project used Apache Spark, a popular distributed computing framework, to train the machine learning models. Classical techniques, including Naive Bayes and a data-dictionary model, were explored in search of the best solution. Google Cloud Platform (GCP)'s Dataproc service was used to provision the cluster and run the distributed training. The project was completed in a team of three.
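To illustrate the core idea behind the Naive Bayes classifier, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. The toy documents, labels, and class name below are invented for illustration; the project's actual training ran distributed on Spark over a Dataproc cluster, not on code like this.

```python
import math
from collections import Counter

class ToyNaiveBayes:
    """Multinomial Naive Bayes over whitespace-tokenized text (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)              # class document counts
        self.word_counts = {c: Counter() for c in self.classes}
        self.total_words = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for word in doc.split():
                self.word_counts[label][word] += 1
                self.total_words[label] += 1
                self.vocab.add(word)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.classes:
            # log prior + sum of per-word log likelihoods, Laplace-smoothed
            score = math.log(self.prior[c] / self.n_docs)
            for word in doc.split():
                score += math.log(
                    (self.word_counts[c][word] + 1) / (self.total_words[c] + v)
                )
            if score > best_score:
                best, best_score = c, score
        return best

# Made-up training data, just to show the fit/predict flow.
train_docs = ["oil prices rise", "crude oil market",
              "stocks rally today", "stock market gains"]
train_labels = ["energy", "energy", "markets", "markets"]
model = ToyNaiveBayes().fit(train_docs, train_labels)
print(model.predict("oil prices climb"))  # -> energy
```

In the actual project, the same probabilistic scoring would be computed over Spark RDDs/DataFrames so that word counting and scoring parallelize across the Dataproc cluster.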