Dask.groupby turns multiple partitions into one - python-3.x

I have a dask.dataframe
df2 = dd.read_csv(path, dtype=dtypes, sep=',', error_bad_lines=False)
which is split into 220 partitions by dask itself
print(df2.npartitions)
>>220
I'd like to use groupby twice and save two dataframes into files
coccurrence_df = df2.groupby(['h1_h2', 'hashtag1','hashtag2','user_id']).count().reset_index()\
.groupby(['h1_h2', 'hashtag1','hashtag2']).message_id.count().reset_index()\
.rename(columns={"message_id":"coccurrence"})
strong_edges_df = coccurrence_df[coccurrence_df['coccurrence']>1].to_csv(path1, compute=False)
weak_edges_df = coccurrence_df[coccurrence_df['coccurrence']==1].to_csv(path2, compute=False)
dask.compute(strong_edges_df,weak_edges_df)
Why is coccurrence_df split into 1 partition when the dataframe it is created from is split into 220 partitions?
print(coccurrence_df.npartitions)
>>1
I believe I'm losing parallelism because of this; am I right?
Thank you in advance

Groupby aggregations do the computation in parallel, but the result ends up in a single-partition output. If you have many groups and want a multi-partition output, consider using the split_out= parameter of the groupby aggregation.
I don't recommend doing this, though, if things work OK. I recommend just using the defaults until something is obviously performing poorly.
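For example, a minimal sketch of what that could look like for the pipeline above (split_out=8 is an arbitrary illustrative value, to be tuned to the number of groups):
# Sketch only: same aggregation as in the question, but asking Dask to spread
# the groupby output over several partitions via split_out.
coccurrence_df = (
    df2.groupby(['h1_h2', 'hashtag1', 'hashtag2', 'user_id'])
       .count(split_out=8)
       .reset_index()
       .groupby(['h1_h2', 'hashtag1', 'hashtag2'])
       .message_id.count(split_out=8)
       .reset_index()
       .rename(columns={"message_id": "coccurrence"})
)
print(coccurrence_df.npartitions)  # now greater than 1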

Related

Why does Spark crossJoin take so long for a tiny dataframe?

I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and takes about 30 seconds to complete. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
You call .distinct() before the join; that requires a shuffle, so the data is repartitioned based on the spark.sql.shuffle.partitions property value (200 by default). Thus df.select('a').distinct() and df.select('b').distinct() each result in a new DataFrame with 200 partitions, and 200 x 200 = 40000 tasks.
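A quick way to check this (a hedged sketch, assuming a running SparkSession named spark and the df defined above):
print(spark.conf.get("spark.sql.shuffle.partitions"))     # '200' by default
print(df.select('a').distinct().rdd.getNumPartitions())   # 200 after the shuffle
print(df.select('b').distinct().rdd.getNumPartitions())   # 200 after the shuffle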
Two things: it looks like you cannot directly control the number of partitions a DataFrame is created with, so you can first create an RDD instead (where you can specify the number of partitions) and convert it to a DataFrame. You can also set the shuffle partitions to 1. Together these ensure you have just 1 partition during the whole execution, which should speed things up.
Just note that this shouldn't be an issue at all for larger datasets, which is what Spark is designed for (it would be faster to achieve the same result on a dataset of this size without using Spark at all). So in the general case you won't really need to do this; instead, tune the number of partitions to your resources/data.
spark.conf.set("spark.default.parallelism", "1")
spark.conf.set("spark.sql.shuffle.partitions", "1")
df = sc.parallelize([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']], 1).toDF(['a','b'])
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
spark.conf.set sets the configuration for the current session only; if you want more permanent changes, make them in the actual Spark conf file.
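As a sketch of one more permanent alternative to editing the conf file (assuming you control how the session is built; the app name is made up), the same options can be passed to the SparkSession builder:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tiny-crossjoin")                      # hypothetical app name
         .config("spark.sql.shuffle.partitions", "1")
         .config("spark.default.parallelism", "1")
         .getOrCreate())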

Spark non-deterministic results after repartition

Is there some way to get deterministic results from dataframe repartition without sorting? In the below code, I get different results while doing the same operation.
from pyspark.sql.functions import rand, randn
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 100000)
# repartition dataframe to 5 partitions
df2 = df.repartition(5).persist()
df2.head(5)
Out[1]: [Row(id=5324), Row(id=5389), Row(id=6209), Row(id=7640), Row(id=8090)]
df2.unpersist()
df3 = df.repartition(5).persist()
df3.head(5)
Out[2]: [Row(id=1019), Row(id=652), Row(id=2287), Row(id=470), Row(id=1348)]
Spark Version - 2.4.5
This non-deterministic behaviour is expected. Here's how...
.repartition(num) does a round-robin repartitioning when no columns are passed to the function. This does not guarantee that a specific row will always end up in a particular partition.
.head(n) returns the first n rows of the first partition of the dataframe.
If you want an order, you need to use orderBy!
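For example, a small sketch using the dataframe from the question; imposing an order makes head() reproducible:
df2 = df.repartition(5).persist()
df2.orderBy("id").head(5)   # always [Row(id=0), ..., Row(id=4)] for this range dataframe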
According to this JIRA, repartitioning (by default) involves a local sort, and is fully deterministic. From the PR notes:
In this PR, we propose ... performing a local sort before partitioning, after we make the input row ordering deterministic, the function from rows to partitions is fully deterministic too.
The downside of the approach is that with extra local sort inserted, the performance of repartition() will go down, so we add a new config named spark.sql.execution.sortBeforeRepartition to control whether this patch is applied. The patch is default enabled to be safe-by-default, but user may choose to manually turn it off to avoid performance regression.
head(n), on the other hand, is not deterministic (unless you apply orderBy, which again repartitions the dataset to one partition), but that isn't your concern, right?

Efficient pyspark join

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large)
Consider using a very large cluster. (I'd rather not because of $$$).
Use the same partitioner.
The last one is the one I'd rather try, but I can't find a way to do it in pyspark. I've tried:
df.repartition(numberOfPartitions,['parition_col1','partition_col2'])
but it doesn't help; it still takes way too long until I stop it, because Spark gets stuck in the last few jobs.
So, how can I use the same partitioner in pyspark and speed up my joins, or even get rid of the shuffles that take forever? What code do I need to use?
PS: I've checked other articles, even on Stack Overflow, but I still can't see code.
You can also use a two-pass approach, in case it suits your requirement. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
It was nicely explained by Sim. See the link below:
two pass approach to join big dataframes in pyspark
Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a Hive table.
Here is the code.
from pyspark.sql.functions import *
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This way you redistribute the load across the Spark partitions, and records with similar keys will be grouped together and reside in the same partition.
You can then read and loop through the data of each sub-partition, join both dataframes, and persist them together.
counter = 0
partition_count = 4
while counter <= partition_count:
    query1 = "SELECT * FROM UDB.temptable_1 where par_id={}".format(counter)
    query2 = "SELECT * FROM UDB.temptable_2 where par_id={}".format(counter)
    EMP_DF1 = spark.sql(query1)
    EMP_DF2 = spark.sql(query2)
    df1 = EMP_DF1.alias('df1')
    df2 = EMP_DF2.alias('df2')
    innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
    innerjoin_EMP.show()
    innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
    counter = counter + 1
I have tried this and it is working fine. This is just an example to demo the two-pass approach; your join conditions may vary, and the number of partitions will also depend on your data size.
Thank you #vikrantrana for your answer, I will try it if I ever need it. I say this because I found out the problem wasn't with the 'big' joins; the problem was the amount of calculation prior to the join. Imagine this scenario:
I read a table and store it in a dataframe called df1. I read another table and store it in df2. Then I perform a huge amount of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; the problem was that Spark's execution plan was huge and it couldn't maintain all the intermediate tables in memory, so it started to write to disk and it took a lot of time.
The solution that worked for me was to persist df1 and df2 on disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).
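A rough sketch of that workaround (df1, df2 and the join key emp_id are placeholders for the scenario described above):
from pyspark import StorageLevel

# Persist the heavy intermediate results to disk and materialize them,
# so the join reads the persisted data instead of replaying the full lineage.
df1 = df1.persist(StorageLevel.DISK_ONLY)
df2 = df2.persist(StorageLevel.DISK_ONLY)
df1.count()
df2.count()

result = df1.join(df2, on="emp_id", how="inner")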

Spark: Apply multiple transformations without recalculating or caching

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting....):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (or planned in dev) to tell Spark that, after TF1, it must reuse the same results (at partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() at each partition, with an automatic unpersist() once the partition has been consumed by the further transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot guess that it could reuse the cached col1. So apparently it isn't possible to ask Spark to cache only a few columns. That would already alleviate the problem.
Any ideas?
I don't think it is possible to cache just some of the columns, but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")
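A self-contained sketch of the same idea, with invented column names and an invented output path, just to make the shape concrete:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumn("group", F.col("id") % 100)

# Heavy first transformation (TF1)
DF1 = df.groupBy("group").agg(F.sum("id").alias("col1"), F.max("id").alias("other"))

# Cache only the narrow projection that both downstream steps need
DF3 = DF1.select("col1", (F.col("other") * 2).alias("derived")).cache()
DF3.write.mode("overwrite").parquet("/tmp/output")   # hypothetical path; materializes DF3
DF2 = DF3.select("col1")                             # reuses the cached DF3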

How to reduce Spark task counts & avoid group by

All, I am using PySpark and need to join two RDDs, but to join them I need to group all elements of each RDD by the joining key and later perform a join function. This causes additional overhead and I am not sure what a workaround could be. It is also creating a high number of tasks, which in turn increases the number of files to write to HDFS and slows down the overall process by a lot. Here is an example:
RDD1 = [join_col, {All_Elements of RDD1}]  # derived by using groupBy(join_col)
RDD2 = [join_col, {All_Elements of RDD2}]  # derived by using groupBy(join_col)
RDD3 = RDD1.join(RDD2)
If the desired output is grouped and both RDDs are too large to be broadcast, there is not much you can do at the code level. It could be cleaner to simply apply cogroup:
rdd1.cogroup(rdd2)
but there should be no significant difference performance-wise. If you suspect there can be a large data / hash skew, you can try a different partitioning, for example by using sortByKey, but it is unlikely to help in the general case.
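For reference, a tiny sketch of what cogroup returns (made-up keys and values, assuming an existing SparkContext sc):
rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
rdd2 = sc.parallelize([("a", "x"), ("c", "y")])

# cogroup yields (key, (values_from_rdd1, values_from_rdd2)) as iterables
grouped = rdd1.cogroup(rdd2)
print(grouped.mapValues(lambda v: (list(v[0]), list(v[1]))).collect())
# e.g. [('a', ([1, 2], ['x'])), ('b', ([3], [])), ('c', ([], ['y']))]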
