
Design of Large-scale Content-based Recommender System using Hadoop MapReduce Framework

S. Saravanan
CSE, Amrita School of Engineering
Amrita Vishwa Vidyapeetham University
Bangalore, India
saranpons3@gmail.com

Abstract—Nowadays, providing relevant product recommendations to customers plays an important role in retaining customers and improving their shopping experience. Recommender systems can be applied in industries such as e-commerce, music, online radio, television, hospitality, finance and many more. Experience has shown over the years that a simple algorithm with a lot of data can often provide better results than a complex algorithm with an inadequate amount of data. To provide better product recommendations, retail businesses have to analyze huge amounts of data; because of this, a recommender system is considered a data-intensive application. The Hadoop distributed cluster platform was developed by the Apache Software Foundation to address the issues involved in designing data-intensive applications. In this paper, improved MapReduce-based data preprocessing and content-based recommendation algorithms are proposed and implemented using the Hadoop framework. Graphical user interfaces are also developed to interact with the recommender system. Experimental results on Amazon product co-purchasing network metadata show that a Hadoop distributed cluster environment is an efficient and scalable platform for implementing a large-scale recommender system.

Keywords—Hadoop, MapReduce, Recommender System

I. INTRODUCTION

A recommender system is software that analyzes available data to make suggestions to customers about products that might be of interest to them. In a world where the number of choices can be huge, recommender systems enable users to find the best products for their tastes. Content-based recommendation, collaborative filtering and hybrid recommendation are three approaches for implementing a recommender system.

The idea behind content-based recommendation is that a user is likely to have a similar level of interest in similar products. First, similar products are identified using product features, such as the categories to which each product belongs. After identifying the similar products, content-based recommendation aims to recommend products that are similar to those the user has already bought in the past.

Collaborative filtering, on the other hand, tries to recommend products by collecting and analyzing a large amount of information on users' behaviors and activities, and predicting what a user will like based on their similarity to other users. There are two types of collaborative filtering: user-based and item-based. In user-based collaborative filtering, the recommendation decision is made based on similarity measures between users. Item-based collaborative filtering uses product ratings to measure the similarity between products and then makes the recommendation decision for the customer.
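To make the notion of product-to-product similarity concrete, the following minimal Java sketch computes a Jaccard similarity between two products represented by their category sets. This is an illustrative measure under assumed inputs (the category names are invented), not the mechanism used later in this paper, which relies on the pre-computed Similar field of the Amazon metadata.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ItemSimilarity {

    // Jaccard similarity |A intersect B| / |A union B| between two category sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> productA = new HashSet<String>(
                Arrays.asList("Books", "Programming", "Databases"));
        Set<String> productB = new HashSet<String>(
                Arrays.asList("Books", "Programming", "Web"));
        // Shared: {Books, Programming}; the union has 4 categories, so 2/4 = 0.5.
        System.out.println(jaccard(productA, productB));
    }
}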

In the era of Big Data, a recommender system has to analyze large data sets in order to provide better results. The data set to be analyzed is considered big data because its volume is extremely large, and analyzing big data with conventional databases, statistical software and visualization tools is difficult. A parallel distributed cluster environment such as Hadoop is therefore required to store and analyze the data. As current recommender systems have to analyze huge data sets in order to provide better recommendations, they are considered data-intensive applications, and the MapReduce parallel programming model is well suited to analyzing large data sets in parallel over a cluster of computers [1].

The remainder of the paper is organized as follows: Section 2 briefly reviews content-based recommender systems implemented on the Hadoop framework. Section 3 presents the system architecture. Section 4 explains the implementation of the large-scale content-based recommender system. Section 5 presents the experimental setup and results. Section 6 shows the performance evaluation. Section 7 draws conclusions with pointers to future work.

II. RELATED WORK

Toon De Pessemier et al. [2] propose a content-based recommendation algorithm for analyzing Wikipedia articles and making recommendation decisions using the Hadoop framework; they design MapReduce algorithms for keyword extraction and for generating content-based suggestions for the end user. Simon Dooms et al. [3] propose an in-memory, distributed content-based recommender system that uses the MapReduce paradigm. One drawback of the MapReduce model is that mid-computation values are generally stored on hard disk; the content-based recommendation algorithm developed in [3] keeps mid-computation values completely in RAM to reduce hard disk accesses and improve the efficiency of the MapReduce model. Jure Leskovec et al. [4] explain the basic principles of content-based recommender systems that have to analyze massive data sets to make relevant recommendations. Srinath Perera et al. [5] propose MapReduce-based algorithms for data preprocessing and content-based recommendation.

The preprocessing algorithm designed in [5] does not remove the title, group, salesrank and categories fields, nor the products whose similar field is empty, from the input data set, even though these fields are not required for making recommendation decisions. In this paper, the data preprocessing algorithm is therefore redesigned to remove them. The content-based recommendation algorithm designed in [5] does not prevent the same product from being added to the recommendation list more than once, nor does it take the same number of products from each similar-products list when building the recommendation list. In this paper, two variations of the content-based recommendation algorithm are therefore designed. The first variation removes repeated products from the recommendation list, takes the same number of products from each similar-products list while adding products to the recommendation list, and accepts from the user of the system the number of recommendations required in the list. The second variation generates the single best recommendation for each customer. This paper also provides graphical user interfaces to interact with the recommender system and to show the output.

III. ARCHITECTURE DESIGN OF THE SYSTEM

This paper implements a 2-node Hadoop distributed cluster architecture. Hadoop is a distributed platform that can store large data sets and enables us to process them across clusters of computers using the MapReduce parallel programming model. There are two nodes in the cluster: a master node and a slave node. The master node runs the Namenode, Datanode, JobTracker and TaskTracker processes; the slave node runs the Datanode and TaskTracker processes. The Namenode keeps track of how the input data set is broken down into blocks and which nodes store those blocks. The secondary Namenode periodically reads the HDFS file system changes log and applies it to the fsimage file. The Datanodes store the replicas of the input data set. The JobTracker determines the execution plan by deciding which blocks to process, assigns nodes to different tasks, and keeps track of all tasks as they run. The TaskTracker is responsible for executing individual tasks on each slave node.

As Fig. 1 shows, there are two layers in the Hadoop distributed cluster architecture: the HDFS layer and the MapReduce layer. HDFS (Hadoop Distributed File System) is a scalable and reliable file system that provides data storage and can span large clusters of commodity servers. The MapReduce layer reads data from and writes data to HDFS storage and processes the data in parallel [6].

Fig. 1. Large-scale Recommender System Architecture: a 2-node Hadoop cluster, with the MapReduce layer (JobTracker, TaskTrackers) and the HDFS layer (Namenode, Datanodes) spanning the master and slave nodes, driven by the recommender system user interface (load data set, data preprocessing, content-based recommendation)

The recommender system developed in this paper is made up of three components: load data set, data set preprocessing, and content-based recommendation. The load data set component loads the data set into HDFS; after loading, the data set can be preprocessed; the user can then choose a content-based recommendation algorithm for generating recommendations. A sketch of how these stages can be chained on Hadoop is given below.
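The following is a minimal, hedged sketch of a two-job driver on the Hadoop 1.x Java API, showing how the preprocessing and recommendation stages can be chained through HDFS. The class names PreprocessMapper, PreprocessReducer, RecommendMapper and RecommendReducer are placeholders for the jobs sketched in Section IV, and the argument paths are assumptions, not the paper's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecommenderDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: preprocess the raw Amazon metadata into
        // (customer ID, purchased products with similar-products lists).
        Job preprocess = new Job(conf, "data preprocessing");
        preprocess.setJarByClass(RecommenderDriver.class);
        preprocess.setMapperClass(PreprocessMapper.class);    // placeholder, see Section IV.B
        preprocess.setReducerClass(PreprocessReducer.class);  // placeholder, see Section IV.B
        preprocess.setOutputKeyClass(Text.class);
        preprocess.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(preprocess, new Path(args[0]));
        FileOutputFormat.setOutputPath(preprocess, new Path(args[1]));
        if (!preprocess.waitForCompletion(true)) System.exit(1);

        // Job 2: generate recommendations from the preprocessed data.
        Job recommend = new Job(conf, "content-based recommendation");
        recommend.setJarByClass(RecommenderDriver.class);
        recommend.setMapperClass(RecommendMapper.class);      // placeholder, see Section IV.C
        recommend.setReducerClass(RecommendReducer.class);    // placeholder, see Section IV.C
        recommend.setOutputKeyClass(Text.class);
        recommend.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(recommend, new Path(args[1]));
        FileOutputFormat.setOutputPath(recommend, new Path(args[2]));
        System.exit(recommend.waitForCompletion(true) ? 0 : 1);
    }
}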

IV. IMPLEMENTATION

This section presents the implementation of the MapReduce-based large-scale content-based recommender system in the Hadoop distributed cluster environment.

A. Loading data set

This component loads the input data set into the HDFS of the Hadoop distributed cluster environment. The command to load the data set into HDFS is

bin/hadoop dfs -put <input file path> <path in HDFS>
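For example, with the hypothetical local file /home/hduser/amazon-meta.txt and HDFS destination directory /user/hduser/input (both paths are assumptions, not taken from the paper), the call would be

bin/hadoop dfs -put /home/hduser/amazon-meta.txt /user/hduser/input

and bin/hadoop dfs -ls /user/hduser/input can be used to verify the upload. On Hadoop 1.x, bin/hadoop fs -put is an equivalent form of the same command.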

B. Data set preprocessing

This component is written using the MapReduce parallel programming model, as the input data set to be processed is huge. The data set used in this paper is the Amazon co-purchasing network metadata taken from the Stanford University website [7]. The input is a text file whose format is shown in Fig. 2; there are 8 fields for each product. In preprocessing, the products purchased by each customer, together with their similar products, are extracted from the input data set by running a MapReduce algorithm over it. The procedure of the data preprocessing algorithm is given in Fig. 3. In the preprocessing step, the review subfields other than the customer id, along with the sales rank, title, category and group fields, are removed, as they are not required for making recommendation decisions.


The data format of the preprocessed data set is shown in Fig. 4, and the output of the preprocessing step is shown in the results section.

Fig. 2. Amazon Data set Format

Id:         Product id (number 0, ..., 548551)
ASIN:       Amazon Standard Identification Number
Title:      Name/title of the product
Group:      Product group (Book, DVD, Video or Music)
Salesrank:  Amazon Salesrank
Similar:    ASINs of co-purchased products
Categories: Location in the product category hierarchy to which the product belongs
Reviews:    Product review information: time, customer id, rating, total number of votes on the review, total number of helpfulness votes (how many people found the review to be helpful)

Fig. 3. Data preprocessing algorithm

Input: Amazon data set
Output: Preprocessed data set

1. Map
   1.1. Scan the input data set for each product
   1.2. Call the Remove() procedure
   1.3. Emit the customer ID and the product information (ASIN field and similarProducts field) as a key-value pair for each customer who has purchased that particular product
2. Group and Sort
   2.1. Group and sort all the key-value pairs from the map step based on the customer ID
3. Reduce
   3.1. Process the key-value pairs received from the previous step and generate, for each customer, the list of purchased products with their similar products

Remove() procedure:
1. Extract the customer id subfield from the reviews field
2. Remove the title, sales rank, group, category and reviews fields
3. Remove the products whose similar field is empty
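A hedged Java sketch of the Map and Reduce steps of Fig. 3 follows. The real amazon-meta.txt stores each product as a multi-line record; for brevity this sketch assumes the records have already been flattened to one line per product in a hypothetical "ASIN=...; similar=...; customers=..." layout, so it illustrates the shape of the algorithm rather than reproducing the paper's parsing code. The "group and sort" step of Fig. 3 corresponds to Hadoop's built-in shuffle, so no explicit code is needed for it, and the reducer approximates the Fig. 4 output format.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed flattened input layout, one product per line:
//   ASIN=0316779059; similar=B00004X|B00005Y; customers=A1001URKW36W4Z,A2WL...
public class PreprocessMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("; ");
        if (fields.length < 3) return;
        String asin = fields[0].substring("ASIN=".length());
        String similar = fields[1].substring("similar=".length());
        if (similar.isEmpty()) return; // drop products with no similar products (Remove(), step 3)

        // Emit (customer ID, product info) for every purchasing customer (Map, step 1.3);
        // title, sales rank, group and category were never emitted (Remove(), step 2).
        for (String customer : fields[2].substring("customers=".length()).split(",")) {
            context.write(new Text(customer),
                          new Text("ASIN=" + asin + ", similarProducts=" + similar));
        }
    }
}

class PreprocessReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text customer, Iterable<Text> products, Context context)
            throws IOException, InterruptedException {
        // Hadoop's shuffle has already grouped the pairs by customer ID
        // (the "group and sort" step); concatenate them here (Reduce, step 3.1).
        StringBuilder sb = new StringBuilder();
        int count = 0;
        for (Text p : products) {
            sb.append(" <Product: ").append(p.toString()).append(">");
            count++;
        }
        context.write(new Text(count + " <Customer ID: " + customer + ">"),
                      new Text(sb.toString()));
    }
}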

Fig. 4. Preprocessed data set format

<Number of products purchased by the customer><space>
<Customer ID: A1001URKW36W4Z><space>
<Product: ASIN=0316779059, similarProducts=list of similar products, each separated by | ><space>
<Product: ASIN=A316B79059, similarProducts=list of similar products, each separated by | > ...

C. Content-based Recommendation

In the content-based recommendation algorithm, recommendations are made to a customer by looking at the products that are similar to the products the customer has already bought. The algorithm is applied to the preprocessed data, in which each entry contains the similar-products list for every product the customer has purchased, and it is implemented using the MapReduce parallel programming model. Two variations of the content-based recommendation algorithm are implemented in this paper. The first variation accepts from the user of the system the number of recommendations to be generated for each customer and produces the recommendation list; its procedure is shown in Fig. 5. The second variation generates the best recommendation for each customer, taking as the best recommendation the product that is similar to the largest number of products the customer has bought; its procedure is shown in Fig. 6. A hedged Java sketch of the map step follows each figure.

Fig. 5. Content-based recommendation algorithm (Variation I)

Input: Preprocessed data set
Output: Recommendation list for each customer

1. Map
   1.1. Scan the preprocessed data for each customer
   1.2. Remove the "number of products purchased by the customer" field
   1.3. Create a list of recommendations by adding only two products from the similarProducts field of each product the customer has purchased
   1.4. Remove a product from the recommendation list if the customer has already purchased it, and also remove the products that are repeated in the recommendation list
2. Reduce
   2.1. Receive the output from the map step and print the results to the output file in HDFS
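The sketch below illustrates the map step of Variation I. It assumes the preprocessed records have been simplified to the tab-separated layout shown in the comment (an assumption for readability; Fig. 4 shows the actual format), and the configuration key carrying the requested number of recommendations is likewise an assumption. The reduce step, as in Fig. 5, would simply write the results to HDFS.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed simplified record layout, one customer per line:
//   <customerId>\t<asin1>=<sim1|sim2|...> <asin2>=<sim1|...> ...
public class RecommendMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int PER_PRODUCT = 2; // two per similar-products list (Fig. 5, step 1.3)

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Number of recommendations requested by the user of the system.
        int wanted = context.getConfiguration().getInt("recommendations.count", 5);

        String[] parts = value.toString().split("\t");
        if (parts.length < 2) return;
        String customer = parts[0];
        String[] products = parts[1].split(" ");

        Set<String> purchased = new LinkedHashSet<String>();
        for (String p : products) purchased.add(p.split("=")[0]);

        // A LinkedHashSet drops repeated products while preserving order (step 1.4).
        Set<String> recommendations = new LinkedHashSet<String>();
        for (String p : products) {
            int taken = 0;
            for (String s : p.split("=")[1].split("\\|")) {
                if (taken == PER_PRODUCT || recommendations.size() == wanted) break;
                if (purchased.contains(s)) continue; // already bought (step 1.4)
                if (recommendations.add(s)) taken++;
            }
        }

        StringBuilder out = new StringBuilder();
        for (String r : recommendations) {
            if (out.length() > 0) out.append('|');
            out.append(r);
        }
        context.write(new Text(customer), new Text(out.toString()));
    }
}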

Fig. 6. Content-based recommendation algorithm (Variation II)

Input: Preprocessed data set
Output: Best recommended product for each customer

1. Map
   1.1. Scan the preprocessed data for each customer
   1.2. Remove the "number of products purchased by the customer" field
   1.3. Create a list of recommendations by adding all products from the similarProducts field of each product the customer has purchased
   1.4. Find the product that is repeated the largest number of times in the recommendation list and add it to the new recommendation list
2. Reduce
   2.1. Receive the output from the map step and print the best recommendation for each customer to the output file in HDFS
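A corresponding sketch of the map step of Variation II, under the same simplified record layout as the Variation I sketch: it counts how many similar-products lists each candidate appears in and emits the most frequent candidate as the best recommendation.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BestRecommendationMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length < 2) return;
        String customer = parts[0];

        // Count how often each candidate appears across all
        // similar-products lists (Fig. 6, step 1.3).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String p : parts[1].split(" ")) {
            for (String s : p.split("=")[1].split("\\|")) {
                Integer c = counts.get(s);
                counts.put(s, c == null ? 1 : c + 1);
            }
        }

        // The most frequently repeated candidate is the best recommendation (step 1.4).
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        if (best != null) context.write(new Text(customer), new Text(best));
    }
}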


V. EXPERIMENTAL SETUP AND RESULTS

A. System Configuration

The experimental setup consists of a Hadoop cluster with 2 nodes connected by a Cat 5e LAN cable. The IP address of the master node is 192.168.2.1 and that of the slave node is 192.168.2.2. The master node has 3 GB of RAM and an Intel Pentium Dual-Core processor running at 2.30 GHz; the slave node has 2 GB of RAM and an Intel Core i3 processor running at 3.07 GHz. At the time of the experiment, the available hard disk space is around 33 GB on the master node and around 128 GB on the slave node. Ubuntu 14.04 LTS is installed on both nodes, and the version of Hadoop used is 1.2.1. The data set used to carry out the experiment is the Amazon co-purchasing network metadata taken from the Stanford University website [7]. The size of the data set is 977.5 MB, and it contains product and review information for about 548,522 products.

B. Results

First, the graphical user interface of the recommender system is shown in Fig. 7a, Fig. 7b and Fig. 7c. The interface has options for loading the data set into HDFS, preprocessing the data set, and running the content-based recommendation. Fig. 8a shows the HDFS setup, Fig. 8b shows the directories created in HDFS, and Fig. 8c shows the input data set in HDFS. The results of data preprocessing and content-based recommendation are shown in Fig. 9, Fig. 10 and Fig. 11.

Fig. 7a. Initial Graphical User Interface

Fig. 7b. Graphical User Interface (Data set Loading)

Fig. 7c. Graphical User Interface (Content-based Recommendation)

Fig. 8a. HDFS Setup

Fig. 8b. Directories in HDFS

Fig. 8c. Input Data set in HDFS

Fig. 9. Data set after preprocessing

Fig. 10. Result of Content-based recommendation with 5 recommendations in the recommendation list (Variation I)

Fig. 11. Result of Content-based recommendation with best recommendation (Variation II)

VI. PERFORMANCE EVALUATION

Table I shows the time taken by each MapReduce algorithm implemented in this paper. The algorithms are executed on the 2-node Hadoop distributed cluster.

Table I. Execution Time

MapReduce Algorithm                                 | Input Data Size | Output Data Size | Time Taken
Data preprocessing                                  | 977.5 MB        | 266 MB           | 89 min 51 s
Content-based recommendation (N recommendations)    | 266 MB          | 5.55 MB          | 15 min 14 s
Content-based recommendation (best recommendation)  | 266 MB          | 2.08 MB          | 22 min 57 s
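As a point of reference, these timings correspond to a preprocessing throughput of roughly 977.5 MB / 89.85 min, or about 10.9 MB per minute, on the 2-node cluster; this derived figure is not reported in the paper and is given only to make the table easier to interpret.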

VII. CONCLUSION AND FUTURE ENHANCEMENTS

This paper focuses on designing a large-scale content-based recommender system using the Hadoop MapReduce framework and on providing graphical user interfaces to interact with the system. The amount of data that a modern recommender system has to analyze in order to produce relevant product recommendations is massive. As the input data set is big data, the recommender system is implemented in a Hadoop distributed cluster environment. Improved MapReduce algorithms are designed for data set preprocessing and recommendation and are executed in the Hadoop distributed cluster environment. This recommender system analyzes batches of stored data; that is, it analyzes a user's past history and makes recommendations. However, in many situations modern enterprises need to generate recommendations based on user activity in real time. In future work, capabilities to analyze data in real time will be added to the recommender system.

REFERENCES

[1] Sol Ji Kang, Sang Yeon Lee, Keon Myung Lee, "Performance Comparison of OpenMP, MPI, and MapReduce in Practical Problems", Advances in Multimedia, Article ID 575687, Hindawi Publishing Corporation, 2014.
[2] Toon De Pessemier, Kris Vanhecke, Simon Dooms and Luc Martens, "Content-based recommendation algorithms on the Hadoop MapReduce framework", 7th International Conference on Web Information Systems and Technologies, pp. 237-240, 2011.
[3] Simon Dooms, Pieter Audenaert, Jan Fostier, Toon De Pessemier, Luc Martens, "In-memory, distributed content-based recommender system", Journal of Intelligent Information Systems, Vol. 42, Issue 3, pp. 645-669, 2014.
[4] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, "Mining of Massive Datasets", pp. 322-331, 2014.
[5] Srinath Perera, Thilina Gunarathne, "Hadoop MapReduce Cookbook", pp. 192-197, 2013.
[6] Apache Hadoop, http://hadoop.apache.org.
[7] https://snap.stanford.edu/data/web-Amazon.html
