The input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain why there is a difference in the results? Because here value of y is nothing but only 1.
Default number of partitions : 4
scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
Because here value of y is nothing but only 1.
That is simply not true. reduce consists of three stages (not in a strict Spark meaning of the word):
Distributed reduce on each partition.
Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
Local driver reduction.
In your case the results of the first and second stage will be the same, but the first approach will simply ignore the partial results. In other words, no matter what was the result for the partition, it will always add only 1.
Such an approach would work only with a non-parallel, strictly sequential reduce implementation, where acc is the running total and every y is a single original element.
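The three stages can be imitated in plain Python, without Spark, to see where the 8 comes from. This is only a sketch under an assumption: the 20 records are split evenly into 4 partitions of 5 (Spark splits text files by size, so the real layout may differ). The helper name distributed_reduce is made up for this illustration.

```python
from functools import reduce

# Assumption: 20 records spread evenly over 4 partitions, already mapped to 1.
partitions = [[1] * 5 for _ in range(4)]

def distributed_reduce(f, parts):
    # Stage 1: reduce each partition independently.
    partials = [reduce(f, part) for part in parts]
    # Stages 2 and 3: collect the partial results and reduce them on the driver.
    return reduce(f, partials)

# Ignores y: each partition still yields 5, but the driver then adds 1 per
# partial result instead of adding the partial result itself.
ignores_y = distributed_reduce(lambda acc, y: acc + 1, partitions)

# Uses y: partial results are summed properly.
uses_y = distributed_reduce(lambda acc, y: acc + y, partitions)

print(ignores_y, uses_y)  # 8 20
```

On the driver, y is a partial count (5 here), not 1, which is why acc + 1 loses most of the records.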
I have an RDD as follows:
rdd
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.toDS()
.cache()
It basically filters down an RDD after applying a predicate.
The issue I have is this... Some of my data set RDDs are massive and predicates may be empty meaning that we attempt to cache an entire data set.
Instead what I'd like to do is always limit the size of the data set before I cache it.
I've tried placing a limit as follows:
dataSet
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.limit(10000)
.toDS()
.cache()
but I get OOM errors. It looks to me like the partitions are being overloaded before the limit is applied.
Therefore I'm wondering if there is some way for the limit to be applied to the partitions. So effectively filtering would be paused once we reach the limit.
Scaling out further isn't an option as these data sets are too big
You should likely look into sampling the RDD. If you provide a consistent seed you will get a consistent result. You likely don't want withReplacement. This will run faster than using limit. Sample does work on the entire data set, but it filters as it goes, reducing the data set.
RDD.sample(withReplacement, fraction, seed=None)
Parameters:
withReplacement - bool; can elements be sampled multiple times (replaced when sampled out)
fraction - float; expected size of the sample as a fraction of this RDD’s size. Without replacement: probability that each element is chosen; fraction must be [0, 1]. With replacement: expected number of times each element is chosen; fraction must be >= 0
seed - int, optional; seed for the random number generation
Relevant code links: rdd.sample, and the subclass that does the actual work.
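To make the "filters as it goes" point concrete, here is a minimal plain-Python sketch of sampling without replacement: each record is kept independently with probability fraction. It is a stand-in for the idea only, not Spark's sampler, and the function name sample_without_replacement is hypothetical.

```python
import random

def sample_without_replacement(records, fraction, seed=None):
    # Each record is kept independently with probability `fraction`,
    # so the data is thinned in a single pass while it is being read.
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

records = list(range(100_000))
first = sample_without_replacement(records, 0.1, seed=42)
second = sample_without_replacement(records, 0.1, seed=42)

# Same seed, same sample; the size is close to, but not exactly, 10%.
print(first == second, len(first))
```

Note that the sample size is only an expectation, so this bounds the cached data approximately rather than giving a hard limit.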
Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation on these, but couldn't understand the exact differences.
An explanation with examples would be great.
groupByKey:
Syntax:
sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" ") )
.map(word => (word,1))
.groupByKey()
.map { case (word, counts) => (word, counts.sum) }
groupByKey can cause out-of-disk problems, as all the data is sent over the network and collected on the reduce workers.
reduceByKey:
Syntax:
sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey((x,y)=> (x+y))
Data is combined at each partition, with only one output per key at each partition to send over the network. reduceByKey requires combining all your values into another value of the exact same type.
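The difference in shuffled data can be illustrated without Spark. The sketch below uses two made-up partitions of words and counts how many pairs would have to cross the network in each style; it only models the idea of combining before the shuffle, not Spark's actual shuffle machinery.

```python
from collections import Counter

# Hypothetical data: two partitions of already-split words.
partitions = [["a", "b", "a", "a"], ["b", "b", "a", "c"]]

# groupByKey style: every (word, 1) pair is shuffled unchanged.
group_shuffled = [(word, 1) for part in partitions for word in part]

# reduceByKey style: combine inside each partition first, then shuffle
# one (word, partial_count) pair per key per partition.
reduce_shuffled = [pair for part in partitions for pair in Counter(part).items()]

def merge(pairs):
    # Final reduction after the shuffle: sum the counts per word.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

print(len(group_shuffled), len(reduce_shuffled))  # 8 5
print(merge(group_shuffled) == merge(reduce_shuffled))  # True
```

Both routes give the same word counts, but the pre-combined route moves fewer pairs, and the gap grows with the number of repeated keys per partition.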
aggregateByKey:
Same as reduceByKey, but it takes an initial value.
3 parameters as input
initial value
sequence op logic (merges a value into the accumulator within a partition)
combiner logic (merges accumulators across partitions)
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
//Create key value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0;
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
output:
Aggregate By Key sum Results
bar -> 3
foo -> 5
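The roles of the three parameters can be traced with a small plain-Python imitation of the same example. The two-partition split and the helper name aggregate_by_key are assumptions for this sketch; it mirrors the semantics described above, not Spark's implementation.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # Within each partition: fold every value into a per-key accumulator
    # that starts from `zero` (an immutable zero is assumed here).
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)
    # Across partitions: merge the per-key accumulators with comb_op.
    result = {}
    for acc in partials:
        for key, partial in acc.items():
            result[key] = comb_op(result[key], partial) if key in result else partial
    return result

raw = ["foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C", "bar=D", "bar=D"]
pairs = [tuple(s.split("=")) for s in raw]
partitions = [pairs[:4], pairs[4:]]  # assumed split into two partitions

counts = aggregate_by_key(
    partitions,
    0,                        # initial value
    lambda n, v: n + 1,       # sequence op: String value -> Int count
    lambda p1, p2: p1 + p2,   # combiner: Int + Int
)
print(counts)  # {'foo': 5, 'bar': 3}
```

The value type (a string) and the result type (an integer count) differ, which is exactly what reduceByKey cannot express.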
combineByKey:
3 parameters as input
Initial value: unlike aggregateByKey, you need not always pass a constant; you can pass a function that returns a new value.
merging function
combine function
Example:
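// Per-key average; assumes rdd is an RDD[(K, Int)]
val result = rdd.combineByKey(
(v: Int) => (v, 1), // create a (sum, count) combiner from the first value
(acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1), // merge a value within a partition
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // merge across partitions
).map { case (k, v) => (k, v._1 / v._2.toDouble) }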
result.collect.foreach(println)
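For readers who want to step through the three functions by hand, here is a plain-Python imitation of the per-key average. The data, the two-partition split and the helper name combine_by_key are made up for the sketch.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                # Key already seen in this partition: merge the value in.
                acc[key] = merge_value(acc[key], value)
            else:
                # First value for this key in this partition.
                acc[key] = create_combiner(value)
        partials.append(acc)
    result = {}
    for acc in partials:
        for key, comb in acc.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result

# Hypothetical (key, value) pairs spread over two partitions.
partitions = [[("a", 1), ("b", 4), ("a", 3)], [("a", 5), ("b", 6)]]

sums_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),                         # initial (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merging function
    lambda x, y: (x[0] + y[0], x[1] + y[1]),  # combine function
)
averages = {k: s / n for k, (s, n) in sums_counts.items()}
print(averages)  # {'a': 3.0, 'b': 5.0}
```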
reduceByKey, aggregateByKey and combineByKey are preferred over groupByKey.
Reference:
Avoid groupByKey
groupByKey() is just to group your dataset based on a key. It will result in data shuffling when RDD is not already partitioned.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...). Unlike groupByKey(), it shuffles less data.
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y. For example (1,2),(1,4) as input and (1,"six") as output. It also takes a zero value that is applied at the beginning for each key.
Note: One similarity is they all are wide operations.
While both reduceByKey and groupByKey will produce the same answer, the
reduceByKey example works much better on a large dataset. That's
because Spark knows it can combine output with a common key on each
partition before shuffling the data.
On the other hand, when calling groupByKey, all the key-value pairs
are shuffled around. This is a lot of unnecessary data being
transferred over the network.
For more detail, check the link below:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Although both of them will fetch the same results, there is a significant difference in the performance of both the functions. reduceByKey() works better with larger datasets when compared to groupByKey().
In reduceByKey(), pairs on the same machine with the same key are combined (by using the function passed into reduceByKey()) before the data is shuffled. Then the function is called again to reduce all the values from each partition to produce one final result.
In groupByKey(), all the key-value pairs are shuffled around. This is a lot of unnecessary data being transferred over the network.
ReduceByKey - reduceByKey(func, [numTasks])
Data is combined so that each partition holds at most one value for each key.
Then the shuffle happens and the combined values are sent over the network to a particular executor for the final reduce.
GroupByKey - groupByKey([numTasks])
It doesn't merge the values for a key before the shuffle; the shuffle happens directly,
so a lot of data gets sent to each partition, almost the same as the initial data.
The merging of values for each key is done after the shuffle.
A lot of data is therefore stored on the final worker node, which can result in out-of-memory issues.
AggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
It is similar to reduceByKey, but you can provide an initial value when performing the aggregation.
Use of reduceByKey
reduceByKey can be used when we run on a large data set.
Prefer reduceByKey over aggregateByKey when the input and output value types are the same.
Moreover, it is recommended not to use groupByKey and to prefer reduceByKey. For details you can refer here.
You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey work.
Apart from these 4, we also have
foldByKey, which is the same as reduceByKey but with a user-defined zero value.
aggregateByKey takes 3 parameters as input and uses 2 functions for merging (one for merging within the same partition and another for merging values across partitions). The first parameter is the zero value,
whereas
reduceByKey takes only 1 parameter, which is a function for merging.
combineByKey takes 3 parameters and all 3 are functions. It is similar to aggregateByKey except that it takes a function in place of the zero value.
groupByKey takes no parameters and groups everything. It also adds overhead for data transfer across partitions.
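One way to keep the parameter lists straight is to express the other operations in terms of the three combineByKey functions. The plain-Python sketch below does that on made-up data; all function names are illustrative, and it does not model the fact that Spark's groupByKey skips the combine step before the shuffle.

```python
from operator import add

def combine_by_key(parts, create, merge_value, merge_combiners):
    result = {}
    for part in parts:
        local = {}
        for k, v in part:  # within one partition
            local[k] = merge_value(local[k], v) if k in local else create(v)
        for k, c in local.items():  # across partitions
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

def reduce_by_key(parts, f):
    # One function, same input and output type.
    return combine_by_key(parts, lambda v: v, f, f)

def fold_by_key(parts, zero, f):
    # reduce_by_key plus a zero value.
    return combine_by_key(parts, lambda v: f(zero, v), f, f)

def aggregate_by_key(parts, zero, seq_op, comb_op):
    # Zero value plus two functions; the result type may differ from the input.
    return combine_by_key(parts, lambda v: seq_op(zero, v), seq_op, comb_op)

def group_by_key(parts):
    # No function at all: every value is kept.
    return combine_by_key(parts, lambda v: [v], lambda acc, v: acc + [v], lambda x, y: x + y)

parts = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
print(reduce_by_key(parts, add))                            # {'a': 8, 'b': 7}
print(fold_by_key(parts, 0, add))                           # {'a': 8, 'b': 7}
print(aggregate_by_key(parts, 0, lambda n, v: n + 1, add))  # {'a': 3, 'b': 2}
print(group_by_key(parts))                                  # {'a': [1, 3, 4], 'b': [2, 5]}
```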
I have a use case where I want to count types of elements in an RDD matching some filter.
e.g. RDD.filter(F1) and RDD.filter(!F1)
I have 2 options
Use accumulators: e.g.
LongAccumulator l1 = sparkContext.longAccumulator("Count1");
LongAccumulator l2 = sparkContext.longAccumulator("Count2");
RDD.foreach(x -> {
if (F1(x)) l1.add(1);
else l2.add(1);
});
Use Count
RDD.filter(F1).count(); RDD.filter(!F1).count()
One benefit of the first approach is that we only need to iterate data once (useful since my data set is 10s of TB)
What is the use of count if the same effect can be achieved by using accumulators?
The major difference is that if your code fails or is re-executed inside a transformation, accumulator updates may already have been applied (possibly more than once), whereas count() returns no partial result.
Other option is to use pure map-reduce:
val counts = rdd.map(x => (F1(x), 1)).reduceByKey(_ + _).collectAsMap()
Network cost should also be low, as only a few numbers will be sent. It creates pairs of (F1(x) is true/false, 1) and then sums all the ones, which gives you the number of items for both F1(x) and !F1(x) in the counts map.
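The same single-pass idea, stripped of Spark, looks roughly like this in plain Python. The predicate and data are placeholders, and count_by_predicate is a hypothetical name.

```python
from collections import Counter

def count_by_predicate(records, predicate):
    # Map each record to its predicate outcome (True/False) and
    # sum the occurrences of each outcome in one pass over the data.
    return Counter(predicate(x) for x in records)

# Placeholder predicate standing in for F1: "is a multiple of 3".
counts = count_by_predicate(range(10), lambda x: x % 3 == 0)
print(counts[True], counts[False])  # 4 6
```

The data is read once and only two numbers come back, which is the property that matters for a data set of tens of TB.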
Given 1 Billion records containing following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn’s k-NN kneighbors() method distributedly
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of k-Nearest Neighbor algorithm such as the one provided by scikit-learn then broadcast the resulting arrays of indices and distances and go further.
Steps in this case would be:
1- vectorize the features as Bryce suggested and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn nn to your data:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(vectorized_data)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for spark I guess we would have to stick to these workarounds. If you'd rather like to try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark and I generally do this type of work using scala, but some pseudo code for each step might look like:
import math

# 1. vectorize the features
def vectorize_raw_data(record):
    # (ID, features) tuple standing in for a LabeledPoint; record[1:] is x1..x100
    return (record[0], list(record[1:]))

# 2, 3 + 4. map over each record for comparison
# all_records: list of vectorized records, e.g. [vectorize_raw_data(r) for r in raw_records]
broadcast_var = []  # stands in for the central store of already computed pairs

def calc_distance(record, comparison):
    # Euclidean distance: subtract the feature values pairwise, square each
    # difference, keep a running sum of those squares and take the square root
    total = sum((a - b) ** 2 for a, b in zip(record[1], comparison[1]))
    return {"id_pair": [record[0], comparison[0]], "distance": math.sqrt(total)}

for i, record in enumerate(all_records):
    # cutoff: only compare with later records, so (1, 5) is not repeated as (5, 1)
    for comparison in all_records[i + 1:]:
        broadcast_var.append(calc_distance(record, comparison))

# 5. map for 10 closest neighbors
def closest_neighbors(record, n=10):
    pairs = [d for d in broadcast_var if record[0] in d["id_pair"]]
    return sorted(pairs, key=lambda d: d["distance"])[:n]
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here as you are comparing all records with all other records. IMHO, you want to store the key pair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of Euclidean distance calculations you perform.
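As a small, self-contained illustration of the compare-each-pair-once idea, the following plain-Python sketch computes every distance a single time and records it for both IDs. It is brute force (quadratic in the number of records), so it only shows the logic on toy data and is not a workable approach for a billion rows. The function name and the sample vectors are made up; math.dist needs Python 3.8 or later.

```python
import heapq
import math

def top_n_neighbors(records, n=10):
    # records: dict mapping id -> feature list.
    # Returns a dict mapping id -> list of (distance, other_id), closest first.
    ids = list(records)
    candidates = {i: [] for i in ids}
    for pos, a in enumerate(ids):
        # Only look at later IDs, so each pair is computed exactly once.
        for b in ids[pos + 1:]:
            d = math.dist(records[a], records[b])
            candidates[a].append((d, b))
            candidates[b].append((d, a))
    return {i: heapq.nsmallest(n, cands) for i, cands in candidates.items()}

# Toy data with 2 features instead of 100.
records = {1: [0.0, 0.0], 2: [3.0, 4.0], 3: [0.0, 1.0], 4: [10.0, 10.0]}
neighbors = top_n_neighbors(records, n=2)
print([other for _, other in neighbors[1]])  # [3, 2]
```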
Let's assume I have a stream of Double values and I want to compute the average every ten seconds. How can I have a sliding window that doesn't need to recompute the average but instead updates it by, let's say, removing the part from the oldest ten seconds and adding only the new ten seconds of values?
TL;DR : use reduceByWindow with both of its function arguments (jump to the last paragraph for the code snippet)
There are two interpretations of your question: the specific one (how do I get a running mean for one hour, updated every 2 seconds), and the general one (how do I get a computation that updates state in a sparse way). Here's the answer for the general one.
First, notice there is a way to represent your data such that your average-with-updates is easy to compute, based on a windowed DStream: this represents your data as an incremental construction of the stream, with maximal sharing. But it is less efficient, computationally, to recompute the mean on each batch – as you noted.
If you do want to do an update of a complex stateful computation which is invertible, but don't want to touch the stream's construction, there is updateStateByKey – but there Spark doesn't help you in reflecting the incremental aspect of your computation in the stream, you have to manage it yourself.
Here, you do have something simple and invertible, and you don't have a notion of keys. You can use reduceByWindow with its inverse reduction argument, using the usual functions that would let you compute an incremental mean.
val myInitialDStream: DStream[Float]
val myDStreamWithCount: DStream[(Float, Long)] =
myInitialDStream.map((x) => (x, 1L))
def addOneBatchToMean(previousMean: (Float, Long), newBatch: (Float, Long)): (Float, Long) =
(previousMean._1 + newBatch._1, previousMean._2 + newBatch._2)
def removeOneBatchToMean(previousMean: (Float, Long), oldBatch: (Float, Long)): (Float, Long) =
(previousMean._1 - oldBatch._1, previousMean._2 - oldBatch._2)
val runningMeans = myDStreamWithCount.reduceByWindow(addOneBatchToMean, removeOneBatchToMean, Durations.seconds(3600), Durations.seconds(2))
You get a stream of one-element RDDs, each of which contains a pair (m, n) where m is your running sum over the 1h-window and n the number of elements in the 1h-window. Just return (or map to) m/n to get the mean.
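The add/remove bookkeeping can be checked outside Spark with a few lines of plain Python. This sketch assumes a window measured in whole batches and uses made-up batch data; windowed_means is a hypothetical helper, not part of any streaming API.

```python
from collections import deque

def windowed_means(batches, window):
    # batches: one list of values per batch interval.
    # window: number of most recent batches that make up the window.
    total, count = 0.0, 0
    in_window = deque()
    means = []
    for batch in batches:
        s, c = sum(batch), len(batch)
        in_window.append((s, c))
        # "Reduce" step: add the new batch's (sum, count).
        total, count = total + s, count + c
        if len(in_window) > window:
            # "Inverse reduce" step: subtract the batch leaving the window.
            old_s, old_c = in_window.popleft()
            total, count = total - old_s, count - old_c
        means.append(total / count if count else None)
    return means

batches = [[1.0, 3.0], [5.0], [7.0, 9.0], [11.0]]
means = windowed_means(batches, window=2)
print(means)  # [2.0, 3.0, 7.0, 9.0]
```

Each step touches only the newest and the oldest batch, so the cost per update does not depend on the window length.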