It's an old design pattern that is more relevant now. You can have different parts of a calculation run on different machines that can talk to each other, and hence run the calculation faster.
Google has its own implementation, which is similar to the Cloudera implementation. Then Amazon has "Hadoop Lite", and CouchDB and MongoDB have MapReduce implementations running inside them. Each one is solving specific problems.
MapReduce is basically made up of MAP and REDUCE. MAP is stateless and independent, while for REDUCE we need to think about how it fits into the overall flow, so there is a separation. Think of a sorting problem: each node needs to sort its own data and feed it into the Reducer.
Suppose a cluster of N nodes, each holding data and running a piece of code F(x) (like sum, avg, etc.). There is no separation of the data by content: it's not the case that one node has September data and another has October data, or that data is separated by product type. The data is evenly distributed, AND each record is independent.
Each of the nodes will output key/value pairs, for example date as the key and dollar amount (sales) as the value.
You can have more than one reducer. All the key/value pairs for a given key come to one Reducer, which gives you an aggregated key/value pair, i.e. the sales on any given day.
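A minimal sketch of that flow, written CouchDB-style in JavaScript (which these notes turn to later); the doc.date and doc.amount field names are assumptions:
// Map: runs on each node/document independently and statelessly.
// Emits (date, dollarAmount) pairs.
function (doc) {
  emit(doc.date, doc.amount);
}
// Reduce: gets all the values emitted for one key (one date)
// and returns the total sales for that day.
function (keys, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}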
You can do similar stuff in SQL, so what's the advantage here? First, this is not running on enterprise architecture; think in terms of clustered architecture. Second, the advantage is scalability: independent portions don't have to wait for anyone else to finish.
Example: suppose 6 nodes and 2 reducers, one for the first quarter (Jan-Mar) and another for the second quarter, so each reducer gets data from all of the nodes.
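A rough sketch of the map side of that routing (illustrative only; the date format and field names are assumptions):
function (doc) {
  // "2010-02-17" -> month 2 -> quarter 1 (Jan-Mar); month 5 -> quarter 2
  var month = parseInt(doc.date.split("-")[1], 10);
  var quarter = Math.ceil(month / 3);
  emit(quarter, doc.amount);
  // The framework sends every pair with quarter 1 to one reducer
  // and every pair with quarter 2 to the other, no matter which node emitted it.
}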
Map is like data-warehouse code that extracts, transforms, and loads (ETL).
It's not always necessary to write a reduce; 6 out of 10 times you can skip the reduce step.
You can also chain MapReduce jobs.
Incremental MapReduce: CouchDB does this. Once you ask it a question (e.g. average sale this quarter), it keeps computing this for you as new data arrives. This makes it different from Google: CouchDB is interested in the 1-to-100 DB-server range, while Google is interested in petabyte-plus information, like how to scale the web. CouchDB is a document-based database where documents are represented as JSON when you interact with it, but are stored as binary otherwise.
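A sketch of such a standing question as a CouchDB view (the design-document name, field names, and date format are assumptions); CouchDB keeps the view index up to date incrementally as documents change:
// Map (stored in a design document, e.g. _design/sales): emit ([year, month], amount).
function (doc) {
  var d = new Date(doc.date);
  emit([d.getFullYear(), d.getMonth() + 1], doc.amount);
}
// Reduce: the built-in "_stats", which keeps sum, count, min and max per key.
// Average sale for a quarter = sum / count, computed on the client after querying
// the view with a startkey/endkey range covering that quarter's three months.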
Suppose I want to find the biggest sales; that means sorting on each node. The sort is done according to the key, so you need to put the right thing in the key. Basically, ask yourself how you want to query this thing later on... that defines the key.
You can do a lot of different things with emit(amount, null): if the date is in June then emit, or if the date is in the last 30 days then emit, otherwise don't.
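For instance, a map that only emits June sales, sorted by amount (a sketch; doc.date and doc.amount are assumed field names):
function (doc) {
  // Emit only June sales; every other document emits nothing.
  if (new Date(doc.date).getMonth() === 5) {   // getMonth() is 0-based, so 5 = June
    emit(doc.amount, null);                    // amount as the key => the view is sorted by amount
  }
}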
Things like count, sum, min, max, avg, std dev, etc. are written to the DB (stored with the view), so they don't have to be recomputed from scratch on every query.
Example: we want to know the total number of sales above a certain amount. In the mapper we return (k, v) where k is the amount; in the mapper code we can have logic like "if greater than x, then return (k, v)". Then in the reducer we count the values, hence we are returning a count.
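A sketch of that example (the threshold of 1000 and the field names are assumptions):
// Map: only pass through sales greater than the threshold.
function (doc) {
  if (doc.amount > 1000) {
    emit(doc.amount, null);
  }
}
// Reduce: the built-in "_count" counts the emitted rows,
// i.e. the number of sales above the threshold.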
Time series aggregation
The key can be complex, not necessarily a single string or int or date:
emit([year, month, day, location], dollarAmount)
Then in reduce we can use the built-in _sum (again, it is applied to the value that is emitted).
You can order the key parts from most significant to least significant, e.g. [location, year, month, day].
This way we can STRUCTURE the aggregation.
Try NOT to write your own reducer; most of the time you can use a standard built-in reducer.
We can group by location only, or by year only... there is group_level, a number between 0 and the length of the key array. For example, with rows
[AZ,2010,6,29], s1
[AZ,2010,6,30], s2
if I say group_level=3 THEN I get one aggregated row for [AZ,2010,6].
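Putting it together as a sketch (the database, design document, and view names are assumptions):
// Map: complex key ordered from most significant to least significant.
function (doc) {
  var d = new Date(doc.date);
  emit([doc.location, d.getFullYear(), d.getMonth() + 1, d.getDate()], doc.amount);
}
// Reduce: the built-in "_sum".
//
// Query sketches against the view (e.g. GET /sales/_design/report/_view/by_loc_date):
//   ?group_level=1  -> one total per location,            keys like ["AZ"]
//   ?group_level=3  -> one total per location/year/month, keys like ["AZ", 2010, 6]
//   ?group_level=4  -> one total per individual day,      keys like ["AZ", 2010, 6, 29]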
Note that the output of map & reduce is simple text, which keeps communication fast.
Also note that all of this happens in real time; you don't need a system that moves data from one DB to another overnight.
Parts of a MapReduce Job
JobConf
This represents a job that will be completed via the MapReduce paradigm:
JobConf conf = new JobConf(MapReduceIntro.class);
Now that you have a JobConf object, conf, you need to set the required parameters for the job. These include the input and output directory locations, the format of the input and output, and the mapper and reducer classes.
It is good practice to pass in a class that is contained in the JAR file that has your map and reduce functions. This ensures that the framework will make the JAR available to the map and reduce tasks run for your job.
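A sketch of setting those parameters on the conf object above, assuming the classic org.apache.hadoop.mapred API; the paths and the identity mapper/reducer are placeholders:
// Classes come from org.apache.hadoop.mapred(.lib) and org.apache.hadoop.fs.
FileInputFormat.setInputPaths(conf, new Path("/input"));    // input directory (placeholder path)
FileOutputFormat.setOutputPath(conf, new Path("/output"));  // output directory (placeholder path)
conf.setInputFormat(TextInputFormat.class);                 // input format
conf.setOutputFormat(TextOutputFormat.class);               // output format
conf.setMapperClass(IdentityMapper.class);                  // replace with your mapper
conf.setReducerClass(IdentityReducer.class);                // replace with your reducer
JobClient.runJob(conf);                                     // submit the job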
Mapper
Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
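A minimal mapper sketch along those lines, again assuming the classic mapred API; the input layout of "date,location,amount" lines and the class name are assumptions:
// Turns an input line like "2010-06-29,AZ,19.99" into an intermediate (date, amount) pair.
// (imports: org.apache.hadoop.io.*, org.apache.hadoop.mapred.*, java.io.IOException)
public class SalesMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, DoubleWritable> output, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split(",");
    if (fields.length == 3) {  // zero output pairs for malformed lines, one otherwise
      output.collect(new Text(fields[0]),
                     new DoubleWritable(Double.parseDouble(fields[2])));
    }
  }
}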
Configuring the Reduce Phase
To configure the reduce phase, the user must supply the framework with five pieces of information (see the sketch after this list):
• The number of reduce tasks; if zero, no reduce phase is
run
• The class supplying the reduce method
• The input key and value types for the reduce task; by
default, the same as the reduce output
• The output key and value types for the reduce task
• The output file type for the reduce task output
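A sketch of supplying those five pieces on the same conf object (the reducer class and the key/value types here are placeholders/assumptions):
conf.setNumReduceTasks(1);                          // number of reduce tasks; 0 skips the reduce phase
conf.setReducerClass(SalesReducer.class);           // class supplying the reduce method (placeholder name)
conf.setMapOutputKeyClass(Text.class);              // reduce input key type (defaults to the output key type)
conf.setMapOutputValueClass(DoubleWritable.class);  // reduce input value type (defaults to the output value type)
conf.setOutputKeyClass(Text.class);                 // reduce output key type
conf.setOutputValueClass(DoubleWritable.class);     // reduce output value type
conf.setOutputFormat(TextOutputFormat.class);       // output file type for the reduce task output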