Grouping related keys of an RDD together - apache-spark

I have a generated RDD with a set of key value pairs. Assume that the keys are [10, 20, 25,30, 40, 50]. The real keys are close by Geographic bins of size X.X meters that need to be aggregated to size 2*X.2*X size.
So in this RDD set I need to aggregate keys that are having a relation between them. Example a key that is twice that of the current key - say 10 and 20. Then these will be added together to give 30. The values will also be added together Similarly the result set would be [30,25,70,50].
I am assuming that since map and reduce work on the current key of an element in an RDD , there is no way to do it using map or groupbyKey or aggregatebyKey; as the grouping I want needs the state of the previous key
I was thinking the only way to do this is to iterate through the elements in the RDD using foreach and for each element pass in also the entire RDD to it.
def group_rdds_together(rdd,rdd_list):
key,val = rdd
xbin,ybin = key
rdd_list.foreach(group_similar_keys,xbin,ybin)
bin_rdd.map(lambda x : group_rdds_together(rdd,bin_rdd))
For that I have to pass in rdd to the map lambda as well as custom parameters to the foreach function
What I am doing is horribly wrong; just wanted to illustrate where i am going with this. There should be a simpler and better way than this

Related

Spark HashPartitioner Collision Mechanism?

does anyone know if Spark HashPartitioner has an automatic collision mechanism to assign key to a new partition? I.e. If I have very skewed data where a single key holds many records, and by
partition = hash(key) % num_partitions
I will land many records in the same partition which memory won’t hold. In this case, does the HashPartitioner have something like probing to assign records to a new partition, or does it not? If it does not, do I need to implement a customized partitioner to deal with the skewed key? Thanks very much.
I don't think the HashPartitioner is going to put records with the same key to two different partitions in any situation. The javadoc for partitioner clearly says the following:
An object that defines how the elements in a key-value pair RDD are
partitioned by key. Maps each key to a partition ID, from 0 to
numPartitions - 1.
Note that, partitioner must be deterministic, i.e. it must return the
same partition id given the same partition key.
If putting the records with same keys into the same partition is not a requirement for you, maybe you can try the following without implementing a custom partitioner.
Let's say you want to write the dataframe into 1000 files.
Add a new column to your dataframe with random integers between 0 to 999.
_num_output_files = 1000
df = df.withColumn('rand', round(rand() * (_num_output_files-1), 0).astype(IntegerType()))
WLG, let's Assume the rand column is your i-th column in the dataframe. We need to use that column as key for the rdd, and then partition by that key. This will ensure almost uniform distribution of data across all partitions. Following code snippet will achieve that.
tmp_rdd = df.rdd.keyBy(lambda x: x[i-1])
tmp_rdd = tmp_rdd.partitionBy(_num_output_files, lambda x: x)
df_rdd = spark.createDataFrame(tmp_rdd.map(lambda x: x[1]))
Note: This is a handy code snippet to check the current distribution of records across partitions in Pyspark: print('partition distrib: ' + str(df_rdd.rdd.glom().map(len).collect())). After calling the previous set of methods you should see roughly the same numbers in each of the partition.

How to create RDD inside map function

I have RDD of key/value pair and for each key i need to call some function which accept RDD. So I tried RDD.Map and inside map created RDD using sc.parallelize(value) method and send this rdd to my function but as Spark does not support to create RDD within RDD this is not working.
Can you please suggest me any solution for this situation ?
I am looking for solution as suggest in below thread but problem i am having is my keys are not fixed and i can have any number of keys.
How to create RDD from within Task?
Thanks
It doesn't sound quite right. If the function needs to process the key value pair, it should receive the pair as the parameter, not RDD.
But if you really want to send the RDD as a parameter, instead of inside the chain operation, you may create a reference after preprocessing and send that reference to the method.
No, you shouldn't create RDD inside RDD.
Depends on the size of your data, there could be two solutions:
1) If there are many keys and each key has not too much values. Turn the function which accepts RDD to a function which accepts Iterable. Then you can do some thing like
// rdd: RDD[(keyType, valueType)]
rdd.groupByKey()
.map { case (key, values) =>
func(values)
}
2) If there are few keys and each key has many values. Then you should not do a group as it would collect all values for a key to an executor, which may cause OutOfMemory. Instead, run a job for each key like
rdd.keys.distinct().collect()
.foreach { key =>
func(rdd.filter(_._1 == key))
}

SparkSQL DataFrame order by across partitions

I'm using spark sql to run a query over my dataset. The result of the query is pretty small but still partitioned.
I would like to coalesce the resulting DataFrame and order the rows by a column. I tried
DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
result.toJSON().saveAsTextFile("output")
I also tried
DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1")
result.toJSON().saveAsTextFile("output")
the output file is ordered in chunks (i.e. the partitions are ordered, but the data frame is not ordered as a whole). For example, instead of
1, value
2, value
4, value
4, value
5, value
5, value
...
I get
2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
What is the correct way to get an absolute ordering of my query result?
Why isn't the data frame being coalesced into a single partition?
I want to mention couple of things here .
1- the source code shows that the orderBy statement internally calls the sorting api with global ordering set to true .So the lack of ordering at the level of the output suggests that the ordering was lost while writing into the target. My point is that a call to orderBy always requires global order.
2- Using a drastic coalesce , as in forcing a single partition in your case , can be really dangerous. I would recommend you do not do that. The source code suggests that calling coalesce(1) can potentially cause upstream transformations to use a single partition . This would be brutal performance wise.
3- You seem to expect the orderBy statement to be executed with a single partition. I do not think that i agree with that statement. That would make Spark a really silly distributed framework.
Community please let me know if you agree or disagree with statements.
how are you collecting data from the output anyway?
maybe the output actually contains sorted data , but the transformations /actions that you performed in order to read from the output is responsible for the order lost.
The orderBy will produce new partitions after your coalesce. To have a single output partition, reorder the operations...
DataFrame result = spark.sql("my sql").orderBy("col1").coalesce(1)
result.write.json("results.json")
As #JavaPlanet mentioned, for really big data you don't want to coalesce into a single partition. It will drastically reduce your level of parallelism.

create unique values for each key in a spark RDD

I would like to create an RDD of key, value pairs where each key would have a unique value. The purpose is to "remember" key indices for later use since keys might be shuffled around the partitions, and basically create a lookup table of sorts. I am vectorizing some text and need to create feature vectors so I have to have a unique value for each key.
I tried this with zipping a second RDD to my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt is to use a hash generator like the one used in scikit-learn but I'm wondering if there is some other "spark-native" way of doing this? I'm using PySpark, not Scala...
zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
If you're using an older version of Spark, you should be able cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.
As mentioned by #aaronman this is a simple operation that for some reason hasn't made it into the pyspark api yet. Going off the Java implementation, here's what seems to work (but gives indices with consecutive ordering on each partition):
def count_partitions(id, iterator):
c = sum(1 for _ in iterator)
yield (id,c)
def zipindex(l, indices, k) :
start_index = indices[k]
for i,item in enumerate(l) :
yield (item,start_ind+i)
> parts = rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()
> indices = parts.values()
> indices.append(0,0)
> rdd_index = rdd.mapPartitionsWithIndex(lambda k,l: zipindex(l,indices,k))

Caching in Spark

A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection with the Scala class which contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g. : val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).asMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
Otherwise (if this map might be too large) - you'll need to use join if you want keep data in the cluster, but to still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
.keys()
.distinct()
.map(k => (k, fetchValue(k))) // this time fetchValue is done in cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)

Resources