Spark parallel processing of grouped data - apache-spark

Initially, I had a lot of data, but using Spark SQL, and especially groupBy, it could be trimmed down to a manageable size (one that fits in the RAM of a single node).
How can I perform functions (in parallel) on all the groups (distributed among my nodes)?
How can I make sure that the data for a single group is collected to a single node? E.g. I will probably want to use a local matrix for computation, but do not want to run into errors regarding data locality.

Let's say you have x executors (in your case, probably one executor per node), and you want to partition the data on your keys so that each key falls into a unique bucket, which would be something like a perfect partitioner. There is no generic way of doing that, but it may be possible if there is some inherent distribution or logic specific to your data.
I dealt with a specific case where I found that Spark's built-in hash partitioner was not doing a good job of distributing the keys uniformly, so I wrote a custom partitioner using Guava, like this:
import com.google.common.hash.Hashing

class FooPartitioner(partitions: Int) extends org.apache.spark.HashPartitioner(partitions) {
  override def getPartition(key: Any): Int = {
    val hasher = Hashing.murmur3_32().newHasher()
    Hashing.consistentHash(
      key match {
        case i: Int => hasher.putInt(i).hash.asInt()
        case _      => key.hashCode
      },
      partitions)
  }
}
Then I added this partitioner instance as an argument to the combineByKey that I was using, so that the resulting RDD is partitioned in this fashion.
This does a good job of distributing data across x buckets, but I guess there is no guarantee that each bucket will contain only one key.
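For illustration, here is a hedged sketch of plugging such a partitioner in (the combiner functions and the RDD below are placeholders, not the original code):
import org.apache.spark.rdd.RDD

val numExecutors = 8 // x, i.e. the number of buckets you want
val rdd: RDD[(Int, Double)] = ??? // your keyed data

// Any of the *ByKey operations accepts a Partitioner; combineByKey was used above.
val combined = rdd.combineByKey(
  (v: Double) => List(v),                       // createCombiner
  (acc: List[Double], v: Double) => v :: acc,   // mergeValue
  (a: List[Double], b: List[Double]) => a ++ b, // mergeCombiners
  new FooPartitioner(numExecutors))

// Alternatively, repartition explicitly and keep that layout for later stages:
val partitioned = rdd.partitionBy(new FooPartitioner(numExecutors))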
If you are on Spark 1.6 and using DataFrames, you can define a UDF like this
val hasher = udf((i: Int) => Hashing.consistentHash(Hashing.murmur3_32().newHasher().putInt(i).hash.asInt(), PARTITION_SIZE))
and do dataframe.repartition(hasher(keyThatYouAreUsing))
Hopefully this provides some hint to get started.

I found a solution in this blog post: Efficient UD(A)Fs with PySpark. The approach is:
use mapPartitions to split the data;
the udaf converts each Spark dataframe partition to a pandas dataframe;
do your ETL logic in the udaf and return a pandas dataframe;
the udaf converts the pandas dataframe back to Spark rows;
toDF() merges the result into a Spark dataframe, which you can then persist, e.g. with saveAsTable;
df = df.repartition('guestid').rdd.mapPartitions(udf_calc).toDF()

Related

Cost of transforming a dataframe to rdd in spark

I'm trying to fetch the number of partitions of a dataframe using this:
df.rdd.getNumPartitions.toString
But when I monitor the Spark log, I see that it spins up many stages and is a costly operation.
As per my understanding, a DataFrame adds a structural layer on top of an RDD via metadata. So how come stripping that away while converting to an RDD takes this much time?
A DataFrame is an optimized distributed tabular collection. Since it keeps a tabular format (similar to a SQL table), it can maintain metadata that allows Spark to perform optimizations under the hood.
These optimizations are performed by side projects such as Catalyst and Tungsten.
An RDD does not maintain any schema; you are required to provide one if needed. So an RDD is not as highly optimized as a DataFrame (Catalyst is not involved at all).
Converting a DataFrame to an RDD forces Spark to loop over all the elements, converting them from the highly optimized Catalyst space to the Scala one.
Check the code of .rdd:
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  rddQueryExecution.toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}

@transient private lazy val rddQueryExecution: QueryExecution = {
  val deserialized = CatalystSerde.deserialize[T](logicalPlan)
  sparkSession.sessionState.executePlan(deserialized)
}
So first it executes the plan and retrieves the output as an RDD[InternalRow], which, as the name implies, is only for internal use and needs to be converted to RDD[Row].
Then it loops over all the rows, converting them. As you can see, it's not just removing the schema.
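To see the difference in practice, here is a minimal sketch (the data and the filter are illustrative, not from the question):
import spark.implicits._

val df = spark.range(0, 1000000).toDF("id")

// Stays in the optimized Catalyst/Tungsten representation; the whole query is planned by Catalyst.
val dfCount = df.filter($"id" % 2 === 0).count()

// Forces every InternalRow to be deserialized into an org.apache.spark.sql.Row first,
// then filters with plain Scala code, outside of Catalyst.
val rddCount = df.rdd.filter(row => row.getLong(0) % 2 == 0).count()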
Hope that answers your question.

What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' statements.
I have also tried using Spark. In this technique I've created a dataframe that has the time ranges in question and loaded that into spark with:
spark_session.createDataFrame()
and
df.registerTempTable()
With this, I'm joining the newly created timestamp dataframe with the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, you (or Spark) will need to scan it all no matter what.
I would define a predicate given the set of time ranges:
import scala.collection.immutable.Range
import org.apache.spark.rdd.RDD

val ranges: List[Range] = ??? // load your ranges here

def matches(timestamp: Int): Boolean = {
  // This is not efficient; a better data structure than a List
  // should be used, but this is just an example
  ranges.exists(_.contains(timestamp))
}

val data: RDD[(Int, T)] = ??? // load the data in an RDD, keyed by timestamp
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is provided in the driver. If instead it comes from a table, like the 100 GB data, first collect it back to the driver, if it is not too big.
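A minimal sketch of the DataFrame variant (the DataFrame df and the column name "ts" are assumptions, not from the question):
import org.apache.spark.sql.functions.udf
import spark.implicits._

val ranges: List[Range] = ??? // same ranges, kept on the driver and captured by the UDF
val inRange = udf((ts: Int) => ranges.exists(_.contains(ts)))

// df is the large timestamped DataFrame; "ts" is the assumed timestamp column
val filteredDf = df.filter(inRange($"ts"))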
Your Spark job goes through the 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL or the DataFrame API, as under the hood the full scan happens anyway.
I would consider restructuring your data so it is optimised for specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you search using the same field that was used for partitioning, the Spark job will only look into the relevant partitions.
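A hedged sketch of that restructuring (the paths, column names, and dates below are assumptions): write the time series once, partitioned by a derived date column, so that later filters on that column only touch the matching partitions.
import org.apache.spark.sql.functions.to_date
import spark.implicits._

// One-off restructuring: write the data partitioned by date.
timeSeriesDf
  .withColumn("dt", to_date($"ts"))
  .write
  .partitionBy("dt")
  .parquet("/data/timeseries_partitioned")

// Queries filtering on the partition column benefit from partition pruning:
val slice = spark.read.parquet("/data/timeseries_partitioned")
  .filter($"dt" >= "2017-01-01" && $"dt" <= "2017-01-07")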

use spark to scan multiple cassandra tables using spark-cassandra-connector

I have a problem with how to use Spark to manipulate/iterate/scan multiple Cassandra tables. Our project uses Spark and the spark-cassandra-connector to connect to Cassandra, scan multiple tables, try to match related values in different tables and, if they match, take an extra action such as inserting into a table. The use case is like below:
sc.cassandraTable(KEYSPACE, "table1").foreach(row => {
  val company_url = row.getString("company_url")
  sc.cassandraTable(KEYSPACE, "table2").foreach(row2 => {
    val url = row2.getString("url")
    val value = row2.getString("value")
    if (company_url == url) {
      sc.saveToCassandra(KEYSPACE, "target", SomeColumns(url, value))
    }
  })
})
The problems are:
As a Spark RDD is not serializable, the nested search will fail because sc.cassandraTable returns an RDD. The only workaround I know is to use sc.broadcast(sometable.collect()). But if sometable is huge, the collect will consume all the memory. Also, if several tables in the use case are broadcast, it will drain the memory.
Instead of broadcast, can RDD.persist handle the case? In my case, I would use sc.cassandraTable to read all tables into RDDs and persist them back to disk, then retrieve the data for processing. If that works, how can I guarantee that the RDD read is done in chunks?
Other than Spark, is there any other tool (like Hadoop, etc.) which can handle this case gracefully?
It looks like you are actually trying to do a series of inner joins. See the
joinWithCassandraTable method.
This allows you to use the elements of one RDD to do a direct query on a Cassandra table. Depending on the fraction of the data you are reading from Cassandra, this may be your best bet. If the fraction is too large, though, you are better off reading the two tables separately and then using the RDD.join method to line up rows.
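For example, a hedged sketch with joinWithCassandraTable (it assumes "url" is the partition key of table2; adjust to your schema):
import com.datastax.spark.connector._

// Key the rows of table1 by the value we want to look up in table2.
val companyUrls = sc.cassandraTable(KEYSPACE, "table1")
  .select("company_url")
  .map(row => Tuple1(row.getString("company_url")))

// For each element, query only the matching partition of table2 instead of scanning it.
val joined = companyUrls.joinWithCassandraTable(KEYSPACE, "table2")

joined
  .map { case (Tuple1(url), row) => (url, row.getString("value")) }
  .saveToCassandra(KEYSPACE, "target", SomeColumns("url", "value"))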
If all else fails you can always manually use the CassandraConnector Object to directly access the Java Driver and do raw requests with that from a distributed context.

Can I put back a partitioner to a PairRDD after transformations?

It seems that the "partitioner" of a pairRDD is reset to None after most transformations (e.g. values() or toDF()). However, my understanding is that the partitioning may not always be changed by these transformations.
Since cogroup (and maybe other operations) performs more efficiently when the RDDs are known to be co-partitioned, I'm wondering if there's a way to tell Spark that the RDDs are still co-partitioned.
See the simple example below where I create two co-partitioned RDDs, then cast them to DFs and perform cogroup on the resulting RDDs. A similar example could be done with values(), and then adding the right pairs back on.
Although this example is simple, in my real case I might load two Parquet DataFrames with the same partitioning.
Is this possible, and would it result in a performance benefit in this case?
data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the Scala API most transformations include a
preservesPartitioning=true
option. Some of the Python RDD APIs retain that capability, but
groupBy
for example is a significant exception. As far as the DataFrame API goes, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely then that you would have to:
restrict yourself to using RDDs, i.e. refrain from the DataFrame/Dataset approach;
be choosy about which RDD transformations you use: look at the ones that allow either retaining the parent's partitioning scheme or using preservesPartitioning=true. A small sketch follows below.
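A minimal Scala sketch of the difference (the keys and values are illustrative):
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(2))

// mapValues cannot change the keys, so Spark keeps the partitioner.
println(pairs.mapValues(_.toUpperCase).partitioner)                   // Some(...)

// map could change the keys, so the partitioner is dropped,
// even if the function actually leaves the keys alone.
println(pairs.map { case (k, v) => (k, v.toUpperCase) }.partitioner)  // None

// mapPartitions lets you assert that you did not move data between keys.
val kept = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v.toUpperCase) },
  preservesPartitioning = true)
println(kept.partitioner)                                             // Some(...)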

Spark: Mapping an RDD of HBase row keys to an RDD of values

I have an RDD that contains HBase row keys. The RDD is too large to fit in memory. I need to get an RDD of values for each of the provided keys. Is there a way to do something like this:
keys.map(key => table.get(new Get(key)))
So the question is: how can I obtain an instance of HTable inside a map task? Should I instantiate an HConnection for every partition and then obtain an HTable instance from it, or is there a better way?
There are a few options, but first consider the fact that Spark does not allow you to create RDDs of RDDs. So really that leaves you with two options:
a list of RDDs
a key/value RDD
I would highly recommend the second one, as a list of RDDs could end with you needing to perform a lot of reduces, which could massively increase the number of shuffles. With that in mind, I would recommend you use a flatMap.
So here is some basic skeleton code that could get you that result
val input: RDD[String] = ???
// skeleton: table.get stands in for your HBase lookup returning the values for key a
val completedRequests: RDD[(String, List[String])] = input.map(a => (a, table.get(new Get(a))))
val flattenedRequests: RDD[(String, String)] = completedRequests.flatMap { case (k, v) => v.map(b => (k, b)) }
You can now handle the RDD as one object, reduceByKey if there is a particular piece of information you need from it, and Spark will be able to access the data with optimal parallelism.
Hope that helps!
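To the original question of where the HTable handle comes from: one common pattern (a hedged sketch using the newer Connection API; the table, column family, and qualifier names are assumptions) is to open one connection per partition inside mapPartitions, rather than one per record:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// keys is the RDD[String] of HBase row keys from the question
val values = keys.mapPartitions { iter =>
  // One connection per partition, created on the executor (never serialized from the driver).
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  val table = connection.getTable(TableName.valueOf("my_table")) // assumed table name

  val results = iter.map { key =>
    val result = table.get(new Get(Bytes.toBytes(key)))
    // assumed column family "cf" and qualifier "col"
    (key, Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
  }.toList // materialize before closing the connection

  table.close()
  connection.close()
  results.iterator
}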
