reduceByKey and aggregateByKey in Spark DataFrame

I am using Spark 2.0 to read data from a Parquet file.
import org.apache.spark.sql.functions.sum
val Df = sqlContext.read.parquet("c:/data/parquet1")
val dfSelect = Df.select("id", "Currency", "balance")
val dfSumForeachId = dfSelect.groupBy("id").sum("balance")
val total = dfSumForeachId.agg(sum("sum(balance)")).first().getDouble(0)
Is calling the action first() on the DataFrame the best way to get the total balance?
In Spark 2.0, is it fine to use groupBy on a DataFrame? Does it have the same performance problem as groupByKey on an RDD, which shuffles all the data over the network before aggregating, or is the aggregation performed locally first, as with reduceByKey in earlier versions of Spark?
Thanks

Using first() is a perfectly valid way of getting the value. That said, doing:
val total = dfSelect.agg(sum("balance")).first().getDouble(0)
would probably give you better performance for getting the total.
On RDDs, groupByKey and reduceByKey work exactly as they did in previous versions, for the same reasons. groupByKey makes no assumption about what you will do with the grouped values, so it cannot perform partial aggregation the way reduceByKey does.
When you do a DataFrame groupBy followed by sum, you are effectively doing a reduceByKey with +, and the second aggregation you did is a reduce with +. That said, the DataFrame version is more efficient: because Spark knows exactly which operation is being performed, it can apply many optimizations, such as whole-stage code generation.
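To make the difference concrete, here is a plain-Python model of the two strategies. It is not Spark code: the function names, the two-partition layout, and the sample rows are made up for illustration, and the sketch only counts how many records would have to cross the network in each case.

```python
from collections import defaultdict

# Two hypothetical input partitions of (id, balance) records.
partitions = [
    [("a", 10.0), ("b", 5.0), ("a", 2.5)],
    [("a", 1.0), ("b", 4.0), ("b", 0.5)],
]

def group_then_sum(parts):
    """Model of groupByKey: every record is shuffled, then summed."""
    shuffled = [rec for part in parts for rec in part]
    totals = defaultdict(float)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def combine_then_sum(parts):
    """Model of reduceByKey: sum inside each partition, shuffle only partial sums."""
    shuffled = []
    for part in parts:
        partial = defaultdict(float)
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())
    totals = defaultdict(float)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)
```

Both functions return the same totals, but the second one shuffles one record per key per partition instead of one record per input row, which is the saving that partial aggregation buys.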

Related

GroupByKey vs Join performance in Spark

I have an RDD like (id, (val1, val2)). I want to normalize the val2 values for each id by dividing by the sum of all val2 for that id, so my output should look like (id, (val1, val2normalized)).
There are two ways of doing this:
Do a groupByKey on id, followed by normalizing the values using mapValues.
Do a reduceByKey to get an RDD like (id, val2sum), then join it with the original RDD to get (id, ((val1, val2), val2sum)), followed by mapValues to normalize.
Which one should be chosen?
If you limit yourself to:
RDD API.
groupByKey + mapValues vs. reduceByKey + join
the former is preferable. Since RDD.join is implemented using cogroup, the cost of the latter strategy can only be higher than that of groupByKey (cogroup on the unreduced RDD is equivalent to groupByKey, and you additionally need a full shuffle for reduceByKey). Please keep in mind that if the groups are too large, neither solution will be feasible.
This, however, might not be the optimal choice. Depending on the size of each group and the total number of groups, you might achieve much better performance using a broadcast join.
At the same time, the DataFrame API comes with significantly improved shuffle internals and can automatically apply some optimizations, including broadcast joins.
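The broadcast variant can be sketched in plain Python as follows. This is only a model, not Spark code: the sample records and variable names are assumptions, and the small dictionary stands in for a broadcast variable that every executor would hold a copy of.

```python
from collections import defaultdict

# Hypothetical records shaped like the question's RDD: (id, (val1, val2)).
records = [
    ("x", ("p", 1.0)),
    ("x", ("q", 3.0)),
    ("y", ("r", 2.0)),
]

# Pass 1: per-id sum of val2 (the role reduceByKey plays).
val2_sums = defaultdict(float)
for key, (_, val2) in records:
    val2_sums[key] += val2

# Pass 2: the small dict plays the role of a broadcast variable, so each
# record is normalized with a local lookup rather than a shuffle join.
normalized = [
    (key, (val1, val2 / val2_sums[key]))
    for key, (val1, val2) in records
]
```

This only pays off when the number of distinct ids is small enough for the table of sums to fit in memory on every executor.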

Efficient pyspark join

I've read a lot about how to do efficient joins in PySpark. The ways to achieve efficient joins that I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large.)
Consider using a very large cluster. (I'd rather not because of $$$.)
Use the same partitioner.
The last one is the one I'd rather try, but I can't find a way to do it in PySpark. I've tried:
df.repartition(numberOfPartitions,['parition_col1','partition_col2'])
but it doesn't help: the job still takes far too long and I end up stopping it, because Spark gets stuck in the last few jobs.
So, how can I use the same partitioner in PySpark to speed up my joins, or even get rid of the shuffles that take forever? Which code do I need to use?
PS: I've checked other articles, even on Stack Overflow, but I still can't find any code.
You can also use a two-pass approach, if it suits your requirements. First, repartition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
It was nicely explained by Sim; see the link below:
two pass approach to join big dataframes in pyspark
Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a Hive table.
Here is the code.
from pyspark.sql.functions import col
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This redistributes the load across the Spark partitions, and records whose keys fall into the same bucket are grouped together and reside in the same partition.
You can then read each sub-partition in turn, join the two dataframes, and persist the results together.
partition_count = 5  # par_id takes the values 0..4 because of emp_id % 5
for counter in range(partition_count):
    # Read the matching sub-partition from each table.
    query1 = "SELECT * FROM UDB.temptable_1 where par_id={}".format(counter)
    query2 = "SELECT * FROM UDB.temptable_2 where par_id={}".format(counter)
    EMP_DF1 = spark.sql(query1)
    EMP_DF2 = spark.sql(query2)
    df1 = EMP_DF1.alias('df1')
    df2 = EMP_DF2.alias('df2')
    # Join the two sub-partitions and append the result to the final table.
    innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
    innerjoin_EMP.show()
    innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
I have tried this and it works fine. This is just an example to demonstrate the two-pass approach; your join conditions may vary, and so may the number of partitions, depending on your data size.
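The reason the bucket-by-bucket join gives the same result as one big join can be seen in a small plain-Python model. Nothing here is Spark code; the sample tables and helper names are invented for illustration, and only the emp_id % 5 bucketing is taken from the code above.

```python
from collections import defaultdict

NUM_BUCKETS = 5  # matches the emp_id % 5 used above

# Hypothetical employee tables as (emp_id, payload) rows.
emp_1 = [(1, "Ann"), (2, "Bob"), (6, "Cy"), (7, "Di"), (10, "Ed")]
emp_2 = [(1, "HR"), (6, "IT"), (10, "Ops"), (11, "Lab")]

def bucketize(rows):
    """Pass 1: split rows into buckets keyed by emp_id modulo NUM_BUCKETS."""
    buckets = defaultdict(list)
    for emp_id, payload in rows:
        buckets[emp_id % NUM_BUCKETS].append((emp_id, payload))
    return buckets

def join_bucket(left, right):
    """Inner join of two small buckets on emp_id."""
    right_by_id = defaultdict(list)
    for emp_id, payload in right:
        right_by_id[emp_id].append(payload)
    return [
        (emp_id, left_payload, right_payload)
        for emp_id, left_payload in left
        for right_payload in right_by_id.get(emp_id, [])
    ]

buckets_1, buckets_2 = bucketize(emp_1), bucketize(emp_2)

# Pass 2: join matching buckets one at a time, appending to one result.
result = []
for bucket_id in range(NUM_BUCKETS):
    result.extend(
        join_bucket(buckets_1.get(bucket_id, []), buckets_2.get(bucket_id, []))
    )
```

Two rows with equal emp_id always land in the same bucket, so no match is lost by never comparing rows across buckets, and each individual join only has to handle a fraction of the data.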
Thank you @vikrantrana for your answer; I will try it if I ever need it. I say this because I found out that the problem wasn't the 'big' joins; it was the amount of computation before the join. Imagine this scenario:
I read a table and store it in a dataframe called df1. I read another table and store it in df2. Then I perform a huge number of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; it was that Spark's execution plan was huge and Spark couldn't keep all the intermediate tables in memory, so it started writing to disk, which took a very long time.
The solution that worked for me was to persist df1 and df2 to disk before the join (I also persisted other intermediate dataframes that were the result of big, complex calculations).

Can I put back a partitioner to a PairRDD after transformations?

It seems that the partitioner of a pair RDD is reset to None after most transformations (e.g. values() or toDF()). However, my understanding is that these transformations do not necessarily change the partitioning.
Since cogroup, and maybe other operations, perform more efficiently when the inputs are known to be co-partitioned, I'm wondering if there's a way to tell Spark that the RDDs are still co-partitioned.
See the simple example below, where I create two co-partitioned RDDs, then convert them to DataFrames and perform cogroup on the resulting RDDs. A similar example could be done with values(), followed by adding the right keys back on.
Although this example is simple, my real case might be loading two Parquet dataframes with the same partitioning.
Is this possible and would it result in a performance benefit in this case?
from pyspark.sql import Row
data1 = [Row(a=1,b=2),Row(a=2,b=3)]
data2 = [Row(a=1,c=4),Row(a=2,c=5)]
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)
rdd1 = rdd1.map(lambda x: (x.a,x)).partitionBy(2)
rdd2 = rdd2.map(lambda x: (x.a,x)).partitionBy(2)
print(rdd1.cogroup(rdd2).getNumPartitions()) #2 partitions
rdd3 = rdd1.toDF(["a","b"]).rdd
rdd4 = rdd2.toDF(["a","c"]).rdd
print(rdd3.cogroup(rdd4).getNumPartitions()) #4 partitions (2 empty)
In the Scala API, some transformations (mapPartitions, for example) accept a preservesPartitioning=true option. Some of the Python RDD APIs retain that capability, but groupBy, for example, is a significant exception. As for the DataFrame APIs, the partitioning scheme seems to be mostly outside of end-user control, even on the Scala side.
It is likely, then, that you would have to:
restrict yourself to using RDDs, i.e. refrain from the DataFrame/Dataset approach
be choosy about which RDD transformations you use: look for the ones that allow either
retaining the parent's partitioning scheme
using preservesPartitioning=true
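Why the partitioner is dropped unless you promise the keys are unchanged can be shown with a plain-Python model. This is not Spark code: the modulo partitioner, the helper names, and the sample pairs are assumptions chosen to keep the example deterministic.

```python
NUM_PARTITIONS = 2

def partition_of(key):
    """Hypothetical hash partitioner; integer keys keep the example deterministic."""
    return key % NUM_PARTITIONS

def partition_by(pairs):
    """Place every (key, value) pair in the partition its key maps to."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        parts[partition_of(key)].append((key, value))
    return parts

def is_partitioned(parts):
    """True if every record sits in the partition its key maps to."""
    return all(
        partition_of(key) == index
        for index, part in enumerate(parts)
        for key, _ in part
    )

parts = partition_by([(1, 2), (2, 3), (3, 4)])

# A values-only transformation leaves keys alone, so placement stays valid.
values_mapped = [[(k, v * 10) for k, v in part] for part in parts]

# A transformation that rewrites keys can leave records in the wrong partition.
keys_mapped = [[(k + 1, v) for k, v in part] for part in parts]
```

A general map cannot be distinguished from the second case without inspecting the function, which is why the partitioner is only kept when the transformation cannot touch keys or when you state that it does not.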

How to make a Cassandra partition feel like a wide-row in Spark?

Cassandra exposes its partitions as multiple rows; internally, however, they are stored as wide rows, and that is the way I would like to work on my data with Spark.
To be more specific, I will, one way or another, get an RDD of Cassandra partitions, or a dataframe of them.
Then I would like to do a map operation and, in the closure, express something like this:
row['parameter1']['value'] / len(row['parameter2']['vector_value'])
This is pseudocode just to give an idea: a simple division, taking the length of a vector.
My table would be
create table(
dataset_name text,
parameter text,
value real,
vector_value list<real>,
primary key(dataset_name, parameter));
How can I do that efficiently, using PySpark?
I think I need something like pandas set_index.
Logically, RDD groupBy seems to me to be what you want.
RDD groupBy is said to be bad for large groups, but here we are grouping on a Cassandra partition, so each group is supposed to stay within one Spark partition, and the grouping is supposed to be local, as all rows of one Cassandra partition are on the same node.
I use Scala with Spark more than Python, so let's try. I have not tested this, though.
I would suggest
rdd = sc.cassandraTable('keyspace', 'table').map(lambda x: (x.dataset_name, (x.parameter, x.value, x.vector_value)))  # create the key to group on
rdd2 = rdd.groupByKey().mapValues(sorted)  # groupByKey returns (key, iterable), hence sorted to get a list
Look at the groupBy / groupByKey functions:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
You will get one row per partition and, inside each partition, a list of clustering rows, so you should be able to access 'parameter1' with [0] for the first occurrence, then 'parameter2' with [1].
EDIT: A colleague told me that spark-cassandra-connector provides RDD methods that do what you want, i.e. preserve clustering-column grouping and ordering. They are called spanBy / spanByKey: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
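The idea behind spanBy, as I understand it, is to group rows that are already adjacent instead of shuffling them. A plain-Python analogue of that idea is itertools.groupby, sketched below. This is not connector code: the sample rows and the to_wide_rows helper are made up, and only the column names come from the table in the question.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows in the order Cassandra would return them: all rows of a
# partition are adjacent. Columns: dataset_name, parameter, value, vector_value.
rows = [
    ("ds1", "parameter1", 6.0, None),
    ("ds1", "parameter2", None, [1.0, 2.0, 3.0]),
    ("ds2", "parameter1", 8.0, None),
    ("ds2", "parameter2", None, [1.0, 2.0]),
]

def to_wide_rows(ordered_rows):
    """Fold consecutive rows sharing a dataset_name into one dict per partition."""
    for dataset_name, group in groupby(ordered_rows, key=itemgetter(0)):
        yield dataset_name, {
            parameter: {"value": value, "vector_value": vector_value}
            for _, parameter, value, vector_value in group
        }

# The expression from the question, applied to each wide row.
ratios = {
    name: row["parameter1"]["value"] / len(row["parameter2"]["vector_value"])
    for name, row in to_wide_rows(rows)
}
```

Keying each wide row by parameter name gives the row['parameter1']['value'] style of access from the question, rather than relying on positional [0] and [1] indexes.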

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like the following:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope)).
    where(s"TIMESTAMP > ?", timestampStart).
    where(s"VALID_TIMESTAMP <= ?", timestampEnd)
  ......do the aggregation work....
}
The issue with this code is that the aggregation for each time range does not run in parallel. How can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can't we use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the java driver to execute a single
query for every partition required by the source RDD so no un-needed
data will be requested or serialized. This means a join between any
RDD and a Cassandra Table can be performed without doing a full table
scan. When performed between two Cassandra Tables which share the
same partition key this will not require movement of data between
machines. In all cases this method will use the source RDD's
partitioning and placement for data locality.
Finally, we used union to combine the per-range RDDs, which lets them be processed in parallel.

Resources