Design of Large-scale Content-based
Recommender System using Hadoop MapReduce
Framework
S. Saravanan
CSE, Amrita School of Engineering
Amrita Vishwa Vidyapeetham University
Bangalore, India
saranpons3@gmail.com
Abstract— Nowadays, providing relevant product recommendations to customers plays an important role in retaining customers and improving their shopping experience. Recommender systems can be applied in industries such as e-commerce, music, online radio, television, hospitality, finance and many more. It has been shown over the years that a simple algorithm with a lot of data can provide better results than a complex algorithm with an inadequate amount of data. To provide better product recommendations, retail businesses have to analyze huge amounts of data; because it must analyze so much data, a recommendation system is considered a data-intensive application. The Hadoop distributed cluster platform was developed by the Apache Software Foundation to address the issues involved in designing data-intensive applications. In this paper, improved MapReduce-based data preprocessing and content-based recommendation algorithms are proposed and implemented using the Hadoop framework. In addition, graphical user interfaces are developed to interact with the recommender system. Experimental results on Amazon product co-purchasing network metadata show that the Hadoop distributed cluster environment is an efficient and scalable platform for implementing a large-scale recommender system.
Keywords—Hadoop, MapReduce, Recommender System
I. INTRODUCTION
A recommender system is software that analyzes available data to make suggestions to customers about products that might be of interest to them. In a world where the number of choices can be huge, recommender systems enable users to find the best products for their tastes. Content-based recommendation, collaborative filtering and hybrid recommendation are the three approaches to implementing a recommender system. The idea behind content-based recommendation is that a user is likely to have a similar level of interest in similar products. First, similar products are identified using product features, such as the categories to which each product belongs. Content-based recommendation then aims to recommend products that are similar to those the user has already bought. Collaborative filtering, on the other hand, recommends products by collecting and analyzing a large amount of information on users' behaviors and activities, and predicting what a user will like based on that user's similarity to other users. There are two types of collaborative filtering: user-based and item-based. In user-based collaborative filtering, the recommendation decision is made based on similarity measures between users. Item-based collaborative filtering uses product ratings to measure the similarity between products and then makes the recommendation decision for the customer. In the era of big data, a recommender system has to analyze large data sets in order to provide better results. These data sets are considered big data because their volume is extremely large, and analyzing them with conventional databases, statistical software and visualization tools is difficult. So, a parallel distributed cluster environment such as Hadoop is required to store and analyze the data. As current recommender systems have to analyze huge data sets in order to provide better recommendations, they are considered data-intensive applications. The MapReduce parallel programming model is well suited to analyzing large data sets in parallel over a cluster of computers [1]. The remainder of the paper is organized as follows: Section 2 briefly reviews content-based recommender systems implemented on the Hadoop framework, and Section 3 presents the system architecture. Section 4 explains the implementation of the large-scale content-based recommender system. Section 5 presents the experimental setup and results. Section 6 shows the performance evaluation. Section 7 draws conclusions with pointers to future work.
II. RELATED WORK
Toon De Pessemier et al. [2] propose content-based recommendation algorithms for analyzing Wikipedia articles and making recommendation decisions using the Hadoop framework. They present MapReduce algorithms for keyword extraction and for generating content-based suggestions for the end user. Simon Dooms et al. [3] propose an in-memory, distributed content-based recommendation system that uses the MapReduce paradigm. Ordinarily, the MapReduce programming model stores mid-computation values on hard disk, which is one of its drawbacks; the algorithm developed in [3] keeps mid-computation values entirely in RAM to reduce hard disk accesses and improve the efficiency of the MapReduce model. Jure Leskovec et al. [4] explain the basic principles of content-based recommendation systems that must analyze massive data sets to make relevant recommendations. Srinath Perera et al. [5] propose MapReduce-based algorithms for data preprocessing and content-based recommendation. The preprocessing algorithm designed in [5] does not remove the title, group, salesrank and categories fields, nor the products whose similar field is empty, from the input data set. These fields can be removed because they are not required for making recommendation decisions. So, in this paper, the data preprocessing algorithm is redesigned to remove them. The content-based recommendation algorithm designed in [5] does not prevent the same product from being added to the recommendation list more than once, nor does it take the same number of products from each similar-products list when building the recommendation list. So, in this paper, two variations of the content-based recommendation algorithm are designed. The first variation removes repeated products from the recommendation list and takes the same number of products from each similar-products list while adding products to it; it also accepts from the user of the system the number of recommendations required in the list. The second variation generates the single best recommendation for each customer. This paper also provides graphical user interfaces to interact with the recommender system and to show the output.
III. ARCHITECTURE DESIGN OF THE SYSTEM
This paper implements a two-node Hadoop distributed cluster architecture. Hadoop is a distributed platform that stores and processes large data sets across clusters of computers using the MapReduce parallel programming model. There are two nodes in the cluster: a master node and a slave node. The master node runs the NameNode, DataNode, JobTracker and TaskTracker processes; the slave node runs the DataNode and TaskTracker processes. The NameNode keeps track of how the input data set is broken down into blocks and which nodes store those blocks. The secondary NameNode periodically reads the HDFS file system changes log and applies it to the fsimage file. The DataNodes store the replicas of the input data set. The JobTracker determines the execution plan by deciding which blocks to process, assigns nodes to tasks, and keeps track of all tasks as they run. The TaskTracker is responsible for executing individual tasks on each node. As Fig. 1 shows, there are two layers in the Hadoop distributed cluster architecture: the HDFS layer and the MapReduce layer. HDFS (Hadoop Distributed File System) is a scalable and reliable file system that provides data storage and can span large clusters of commodity servers. The MapReduce layer reads data from and writes data to HDFS storage, and processes the data in parallel [6].
Fig. 1. Large-scale Recommender System Architecture
The recommender system developed in this paper is made up of three components: load data set, data set preprocessing and content-based recommendation. The load data set component loads the data set into HDFS. After the data set is loaded, preprocessing can be performed. The user can then choose a content-based recommendation algorithm to generate recommendations.
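For concreteness, a minimal sketch of a matching two-node Hadoop 1.x configuration follows. It assumes the hosts are addressable as master and slave and uses the standard Hadoop 1.x property names; the host names and port numbers are conventional choices, not values reported in this paper. Each property sits inside the file's <configuration> element.

    conf/slaves (DataNode/TaskTracker hosts; both nodes run them here):
        master
        slave

    conf/core-site.xml:
        <property>
          <name>fs.default.name</name>
          <value>hdfs://master:9000</value>
        </property>

    conf/mapred-site.xml:
        <property>
          <name>mapred.job.tracker</name>
          <value>master:9001</value>
        </property>

    conf/hdfs-site.xml (two DataNodes, so at most two replicas):
        <property>
          <name>dfs.replication</name>
          <value>2</value>
        </property>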
IV. IMPLEMENTATION
This section presents the implementation of the MapReduce-based large-scale content-based recommender system in the Hadoop distributed cluster environment.
A. Loading data set
This component loads the input data set into the HDFS of the Hadoop distributed cluster environment. The command to load the data set into HDFS is:
bin/hadoop dfs -put <input file path> <path in HDFS>
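The load can then be verified by listing the target directory, using the same Hadoop 1.x dfs command family as above:

bin/hadoop dfs -ls <path in HDFS>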
B. Data set preprocessing
This component is written using the MapReduce parallel programming model because the input data set is huge. The data set used in this paper is the Amazon co-purchasing network metadata, taken from the Stanford University website [7]. The input is a plain text file, and its data format is shown in Fig. 2. There are eight fields for each product in the input data set. In preprocessing, the products purchased by each customer, together with their similar products, are extracted from the input data set by running a MapReduce algorithm over it. The procedure of the data preprocessing algorithm is given in Fig. 3. In the preprocessing step, all review subfields except the customer
id, along with the salesrank, title, categories and group fields, are removed, as they are not required for making recommendation decisions. The format of the preprocessed data set is shown in Fig. 4. The output of the preprocessing step is shown in the results section.
Fig. 2. Amazon Data set Format
Fig. 3. Data preprocessing algorithm
Fig. 4. Preprocessed data set format
C. Content based Recommendation
In the content-based recommendation algorithm, recommendations are made to a customer by looking at products that are similar to the products the customer has already bought. The algorithm is applied to the preprocessed data, in which each entry contains the similar-products list for every product the customer has purchased. It is implemented using the MapReduce parallel programming model. Two variations of the content-based recommendation algorithm are implemented in this paper. The first variation accepts from the user of the system the number of recommendations to be generated for each customer and produces the recommendation list; its procedure is shown in Fig. 5. The second variation generates the single best recommendation for each customer, treating the product that is similar to the largest number of products the customer has bought as the best recommendation; its procedure is shown in Fig. 6.
Fig. 5. Content based recommendation algorithm (Variation I)
Fig. 6. Content based recommendation algorithm (Variation II)
Id: Product id (number 0, …., 548551)
ASIN: Amazon Standard Identification Number
Title: Name/title of the product
Group: Product group (Book, DVD, Video or
Music)
Salesrank: Amazon Salesrank
Similar: ASINs of co-purchased products
Categories: Location in product category
hierarchy to which the product belongs
Reviews: Product review information: time,
customer id, rating, total number of votes on the
review, total number of helpfulness votes (how
many people found the review to be helpful)
Input: Preprocessed data set
Output: Recommendation list for each customer
1. Map
1.1. Scan the preprocessed data for each
customer
1.2. Remove the number-of-products-purchased field
1.3. Create a list of recommendations by adding only two products from the similarProducts field for each product the customer has purchased
1.4. Remove a product from the recommendation list if the customer has already purchased it, and also remove products that are repeated in the list
2. Reduce
2.1. Receive the output from Map step and print
the results in output file in HDFS
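To make the Fig. 5 listing concrete, a minimal Java sketch of the map step follows. It assumes each preprocessed record has been simplified to the tab-separated layout customerId <TAB> ASIN=sim|sim|... <TAB> ASIN=sim|..., and that the requested list size is passed through a hypothetical reco.count job property; neither is the paper's actual format. Because the reduce step of Fig. 5 only writes the map output to HDFS, running this as a map-only job (job.setNumReduceTasks(0)) behaves equivalently.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Variation I map step: an N-item recommendation list per customer.
    public class TopNRecommendMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private int n; // number of recommendations requested by the user

      @Override
      protected void setup(Context ctx) {
        // "reco.count" is a hypothetical configuration property name.
        n = ctx.getConfiguration().getInt("reco.count", 5);
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        // Assumed simplified record: customerId \t ASIN=sim|sim|... \t ASIN=sim|...
        String[] f = line.toString().split("\t");
        String customerId = f[0];

        // Products the customer already bought must never be recommended (step 1.4).
        Set<String> purchased = new HashSet<String>();
        for (int i = 1; i < f.length; i++) {
          purchased.add(f[i].split("=", 2)[0]);
        }

        // Take at most two products from each similarProducts list (step 1.3),
        // skipping purchases and duplicates (step 1.4), up to N in total.
        Set<String> recs = new LinkedHashSet<String>();
        for (int i = 1; i < f.length && recs.size() < n; i++) {
          String[] parts = f[i].split("=", 2);
          if (parts.length < 2 || parts[1].isEmpty()) continue;
          int taken = 0;
          for (String sim : parts[1].split("\\|")) {
            if (taken == 2 || recs.size() == n) break;
            if (!purchased.contains(sim) && recs.add(sim)) taken++;
          }
        }
        ctx.write(new Text(customerId), new Text(recs.toString()));
      }
    }

The LinkedHashSet both removes repeated products and preserves the order in which recommendations were drawn from the similar-products lists.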
Input: Amazon data set
Output: Preprocessed data set
1. Map
1.1. Scan the input data set for each product
1.2. Call Remove() procedure
1.3. Generate the customer ID and the product information (ASIN and similarProducts fields) as a key-value pair for each customer who has purchased that particular product
2. Group and Sort function
2.1. Group and sort all the key-value pairs from
map step based on the customer ID
3. Reduce
3.1. Process the key-value pairs received from
previous step and generate the list of
products with similar products bought by
each customer.
Remove() procedure:
1. Extract customer id subfield from reviews field
2. Remove title, sales rank, group, category and
reviews fields
3. Remove the products whose similar field is
empty
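To make the Fig. 3 listing concrete, a minimal Java MapReduce sketch follows. It assumes the raw multi-line Amazon records have first been flattened to one tab-separated line per product (ASIN, the |-separated similar list, and the comma-separated reviewer customer ids); this flattened layout, the class names and the job wiring are illustrative assumptions, not the paper's actual code. The group-and-sort step of Fig. 3 is performed by the MapReduce shuffle itself, so it needs no code of its own.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PreprocessSketch {

      // Map: emit one (customerId, product info) pair per reviewing customer.
      public static class ProductMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Assumed flattened record: ASIN \t sim|sim|... \t custId,custId,...
          String[] f = line.toString().split("\t", -1);
          if (f.length < 3 || f[1].isEmpty()) {
            return; // Remove() step 3: drop products with an empty similar field
          }
          // Title, group, salesrank and categories are simply never emitted
          // (Remove() step 2); only ASIN and similarProducts survive.
          Text info = new Text("ASIN=" + f[0] + ", similarProducts=" + f[1]);
          for (String customerId : f[2].split(",")) {
            ctx.write(new Text(customerId), info);
          }
        }
      }

      // Reduce: collect each customer's products with their similar lists.
      public static class CustomerReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text customerId, Iterable<Text> products, Context ctx)
            throws IOException, InterruptedException {
          StringBuilder list = new StringBuilder();
          int count = 0;
          for (Text p : products) {
            list.append(" <Product: ").append(p.toString()).append(">");
            count++;
          }
          // Fig. 4 layout: purchase count, customer id, then the product list.
          ctx.write(new Text(count + " Customer ID: " + customerId.toString()),
                    new Text(list.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "preprocess"); // Hadoop 1.x constructor
        job.setJarByClass(PreprocessSketch.class);
        job.setMapperClass(ProductMapper.class);
        job.setReducerClass(CustomerReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }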
<Number of products purchased by the customer><space>
<Customer ID: A1001URKW36W4Z><space>
<Product: ASIN=0316779059, similarProducts=list of similar products, each separated by | ><space>
<Product: ASIN=A316B79059, similarProducts=list of similar products, each separated by | > ………………
Input: Preprocessed data set
Output: Best Recommended product
1. Map
1.1. Scan the preprocessed data for each
customer
1.2. Remove the number-of-products-purchased field
1.3. Create a list of recommendations by adding all products from the similarProducts field for each product the customer has purchased
1.4. Find the product that is repeated the most times in the recommendation list and add it to the new recommendation list
2. Reduce
2.1. Receive the output from Map step and print
the best recommendation for each customer in
the output file in HDFS
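A matching Java sketch of the Fig. 6 map step follows, under the same assumed simplified line layout as the Variation I sketch. It counts how often each candidate appears across the customer's similarProducts lists and keeps the most frequent one; skipping already-purchased products is carried over from Variation I as an assumption, since Fig. 6 does not state it.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Variation II map step: the single best recommendation per customer.
    public class BestRecommendMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        // Assumed simplified record: customerId \t ASIN=sim|sim|... \t ASIN=sim|...
        String[] f = line.toString().split("\t");
        String customerId = f[0];

        Set<String> purchased = new HashSet<String>();
        for (int i = 1; i < f.length; i++) {
          purchased.add(f[i].split("=", 2)[0]);
        }

        // Count every occurrence of each candidate across all similarProducts
        // lists (step 1.3) and track the most repeated one (step 1.4).
        Map<String, Integer> freq = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (int i = 1; i < f.length; i++) {
          String[] parts = f[i].split("=", 2);
          if (parts.length < 2 || parts[1].isEmpty()) continue;
          for (String sim : parts[1].split("\\|")) {
            if (purchased.contains(sim)) continue; // assumption carried from Variation I
            Integer c = freq.get(sim);
            int count = (c == null) ? 1 : c.intValue() + 1;
            freq.put(sim, Integer.valueOf(count));
            if (count > bestCount) {
              bestCount = count;
              best = sim;
            }
          }
        }
        if (best != null) {
          ctx.write(new Text(customerId), new Text(best));
        }
      }
    }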
V. EXPERIMENTAL SETUP AND RESULTS
A. System Configuration
The experimental setup consists of a Hadoop cluster with two nodes connected by a Cat 5e LAN cable. The IP address of the master node is 192.168.2.1 and that of the slave node is 192.168.2.2. The master node has 3 GB of RAM and the slave node has 2 GB. The master node's processor is an Intel Pentium Dual-Core running at 2.30 GHz; the slave node's is an Intel Core i3 running at 3.07 GHz. At the time of the experiment, the available hard disk space is around 33 GB on the master node and around 128 GB on the slave node. Ubuntu 14.04 LTS is installed on both nodes, and the Hadoop version used is 1.2.1. The data set used to carry out the experiment is the Amazon co-purchasing network metadata, taken from the Stanford University website [7]. The size of the data set is 977.5 MB, and it contains product and review information for about 548,552 products.
B. Results
First, the graphical user interface of the recommender system is shown in Fig. 7a, Fig. 7b and Fig. 7c. The interface provides options for loading the data set into HDFS, preprocessing the data set, and content-based recommendation. Fig. 8a shows the HDFS setup, Fig. 8b shows the directories created in HDFS, and Fig. 8c shows the input data set in HDFS. The results of data preprocessing and content-based recommendation are shown in Fig. 9, Fig. 10 and Fig. 11.
Fig. 7a. Initial Graphical User Interface
Fig. 7b. Graphical User Interface (Data set Loading)
Fig. 7c. Graphical User Interface (Content-based Recommendation)
Fig. 8a. HDFS Setup
Fig. 8b. Directories in HDFS
Fig. 8c. Input Data set in HDFS
Fig. 9. Data set after preprocessing
Fig. 10. Result of Content-based recommendation with 5
recommendations in the recommendation list (Variation I)
Fig. 11. Result of Content-based recommendation with best
recommendation (Variation II)
VI. PERFORMANCE EVALUATION
Table I shows the time taken by each MapReduce algorithm implemented in this paper. The algorithms are executed on the two-node Hadoop distributed cluster.
Table I. Execution Time

MapReduce Algorithm                                  Input Data Size   Output Data Size   Time Taken
Data preprocessing                                   977.5 MB          266 MB             89 min 51 s
Content-based recommendation (N recommendations)     266 MB            5.55 MB            15 min 14 s
Content-based recommendation (best recommendation)   266 MB            2.08 MB            22 min 57 s
VII. CONCLUSION AND FUTURE ENHANCEMENTS
This paper focuses on designing a large-scale content-based recommender system using the Hadoop MapReduce framework and on providing graphical user interfaces to interact with the system. The amount of data that a modern recommender system has to analyze in order to produce relevant product recommendations is massive. As the input data set is big data, the recommender system is implemented in a Hadoop distributed cluster environment. Improved MapReduce algorithms are designed for data set preprocessing and recommendation, and are executed in the Hadoop distributed cluster environment. This recommender system analyzes a batch of stored data; that is, it analyzes a user's past history and makes recommendations from it. However, in the modern era, enterprises often need to generate recommendations based on user activity in real time. So, in future work, capabilities to analyze data in real time will be added to the recommender system.
REFERENCES
[1] Sol Ji Kang, Sang Yeon Lee, Keon Myung Lee, "Performance Comparison of OpenMP, MPI, and MapReduce in Practical Problems", Advances in Multimedia, Article ID 575687, Hindawi Publishing Corporation, 2014.
[2] Toon De Pessemier, Kris Vanhecke, Simon Dooms, Luc Martens, "Content-based recommendation algorithms on the Hadoop MapReduce framework", 7th International Conference on Web Information Systems and Technologies, pp. 237-240, 2011.
[3] Simon Dooms, Pieter Audenaert, Jan Fostier, Toon De Pessemier, Luc Martens, "In-memory, distributed content-based recommender system", Journal of Intelligent Information Systems, Vol. 42, Issue 3, pp. 645-669, 2014.
[4] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, "Mining of Massive Datasets", pp. 322-331, 2014.
[5] Srinath Perera, Thilina Gunarathne, "Hadoop MapReduce Cookbook", pp. 192-197, 2013.
[6] Apache Hadoop, http://hadoop.apache.org.
[7] https://snap.stanford.edu/data/web-Amazon.html