Pyspark dataframe.limit is slow - apache-spark

I am trying to work with a large dataset, but just play around with a small part of it. Each operation takes a long time, and I want to look at the head or limit of the dataframe.
So, for example, I call a UDF (user defined function) to add a column, but I only care to do so on the first, say, 10 rows.
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
sum_cols = F.udf(lambda x: x[0] + x[1], IntegerType())
df_with_sum = df.limit(10).withColumn('C', sum_cols(F.array('A', 'B')))
However, this still takes as long as it would without limit.

If you only want to work with 10 rows, I think it is better to create a new df from the limit and cache it:
df2 = df.limit(10).cache()
df_with_sum = df2.withColumn('C',sum_cols(F.array('A','B')))

limit will first try to get the required rows from a single partition. If it does not find all of them in one partition, it will get the remaining rows from the next partition.
So please check how many partitions you have by using df.rdd.getNumPartitions().
To prove this, I would suggest first coalescing your df to one partition and then doing a limit. You should see that limit is faster this time, as it is filtering data from one partition.

Related

How to check Spark DataFrame difference?

I need to check my solution for idempotency and see how much it differs from the previous solution.
I tried the following:
spark.sql('''
select * from t1
except
select * from t2
''').count()
This tells me how many rows differ between the tables (t1 is my solution, t2 is the original data). If many rows differ, I want to find out where they differ.
So I tried this:
diff = {}
columns = t1.columns
for col in columns:
    # f-string so that {col} is replaced by the column name
    cntr = spark.sql(f'''
        select {col} from t1
        except
        select {col} from t2
    ''').count()
    diff[col] = cntr
print(diff)
This is not good enough for me, because it takes about 1-2 hours (both tables have 30 columns and 30 million rows).
Do you guys have an idea how to calculate this quickly?
Except is a kind of join on all columns at the same time. Does your data have a primary key? It could even be a composite key consisting of multiple columns, but that is still much better than taking all 30 columns into account.
Once you figure out the primary key you can do the FULL OUTER JOIN and:
check NULLs on the left
check NULLs on the right
check other columns of matching rows (it's much cheaper to compare the values after the join)
Given that your resources remain unchanged, I think there are three ways you can optimize:
Join the two dataframes once instead of looping over except: I assume your dataset has a key / index; otherwise there is no ordering in either dataframe and you can't use except to check the difference. Unless your resources are limited, do a single join to combine the two dataframes rather than running multiple except queries.
Check your data partitioning: Whether you use point 1 or your current method, make sure the data is evenly distributed across an appropriate number of partitions. Data skew is one of the most common causes of poor performance. If your key is a string, use repartition. If you're using a sequence number, use repartitionByRange.
Use a when-otherwise pair to count the differences: once you have joined the two dataframes, you can use a when-otherwise condition to compare the columns, for example: df.select(func.sum(func.when(func.col('df1.col_a') != func.col('df2.col_a'), func.lit(1)).otherwise(func.lit(0))).alias('diff_in_col_a_count')). That way you can calculate all the differences within one action instead of multiple actions.
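The logic of the join-based comparison suggested in the answers can be modelled on a tiny in-memory example. This is only a plain-Python sketch of the idea, not Spark code: the dictionaries `t1` and `t2`, the key values, the column names `a` and `b`, and the helper `diff_by_key` are all hypothetical stand-ins for the real tables and their primary key.

```python
def diff_by_key(left, right, columns):
    """Compare two keyed tables in one pass, like a FULL OUTER JOIN on the key.

    left / right: dict mapping primary key -> dict of column values.
    Returns (keys only in left, keys only in right, mismatch count per column).
    """
    only_left = sorted(left.keys() - right.keys())    # right side would be NULL
    only_right = sorted(right.keys() - left.keys())   # left side would be NULL
    mismatches = {c: 0 for c in columns}
    for key in left.keys() & right.keys():            # rows matched by the join
        for c in columns:
            if left[key][c] != right[key][c]:
                mismatches[c] += 1
    return only_left, only_right, mismatches

# Hypothetical data: key 1 exists only in t1, key 4 only in t2.
t1 = {1: {"a": 10, "b": "x"}, 2: {"a": 20, "b": "y"}, 3: {"a": 30, "b": "z"}}
t2 = {2: {"a": 20, "b": "Y"}, 3: {"a": 31, "b": "z"}, 4: {"a": 40, "b": "w"}}

only_left, only_right, per_column = diff_by_key(t1, t2, ["a", "b"])
print(only_left, only_right, per_column)
```

The point of the design is that every column is compared while the matched rows are visited once, instead of scanning both tables once per column.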

Spark dataframe distinct write is increasing the output size by almost 10 fold

I have a case where I am trying to write some results to S3 using a dataframe write with the query below. input_table_1 is 13 GB and input_table_2 is 1 MB.
input_table_1 has columns account, membership and
input_table_2 has columns role, id , membership_id, quantity, start_date
SELECT
/*+ BROADCASTJOIN(input_table_2) */
account,
role,
id,
quantity,
cast(start_date AS string) AS start_date
FROM
input_table_1
INNER JOIN
input_table_2
ON array_contains(input_table_1.membership, input_table_2.membership_id)
where the membership array contains a list of member_ids.
Writing this dataset using the Spark dataframe generates around 1.1 TiB of data in S3 with around 700 billion records.
We identified that there are duplicates and used dataframe.distinct.write.parquet("s3path") to remove them. The record count dropped to almost a third of the previous total, around 200 billion rows, but we observed that the output size in S3 is now 17.2 TiB.
I am very confused how this can happen.
I have used the following spark conf settings
spark.sql.shuffle.partitions=20000
I have tried to do a coalesce and write to s3 but it did not work.
Please suggest whether this is expected and what can be done.
There are two sides to this:
1) Physical translation of distinct in Spark
The Spark catalyst optimiser turns a distinct operation into an aggregation by means of the ReplaceDeduplicateWithAggregate rule (Note: in the execution plan distinct is named Deduplicate).
This basically means df.distinct() on all columns is translated into a groupBy on all columns with an empty aggregation:
df.groupBy(df.columns:_*).agg(Map.empty).
Spark uses a HashPartitioner when shuffling data for a groupBy on respective columns. Since the groupBy clause in your case contains all columns (well, implicitly, but it does), you're more or less randomly shuffling data to different nodes in the cluster.
Increasing spark.sql.shuffle.partitions in this case is not going to help.
Now on to the 2nd side, why does this affect the size of your parquet files so much?
2) Compression in parquet files
Parquet is a columnar format, which is to say your data is organised in columns rather than row by row. This allows for powerful compression if the data is adequately laid out and ordered. E.g. if a column contains the same value for a number of consecutive rows, it is enough to write that value just once and make a note of the number of repetitions (a strategy called run length encoding). But Parquet also uses various other compression strategies.
Unfortunately, data ends up pretty randomly in your case after shuffling to remove duplicates. The original partitioning of input_table_1 was much better fitted.
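To make the effect of row order on run length encoding concrete, here is a small plain-Python sketch. It is not how Parquet is implemented; the column values and the helper `run_length_encode` are hypothetical, and the only thing it shows is how many runs the same values produce when ordered versus shuffled.

```python
import random
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]

# Hypothetical column with 5 distinct values, 200 rows each.
column = [account for account in ["a", "b", "c", "d", "e"] for _ in range(200)]

shuffled = column[:]
random.Random(0).shuffle(shuffled)  # fixed seed; mimics rows arriving in random order

runs_sorted = run_length_encode(sorted(column))
runs_shuffled = run_length_encode(shuffled)

# Fewer runs means fewer (value, count) pairs to store.
print(len(runs_sorted), len(runs_shuffled))
```

The ordered column collapses to one run per distinct value, while the shuffled column needs far more runs to describe exactly the same rows.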
Solutions
There's no single answer to this, but here are a few pointers on what I'd look at next:
What's causing the duplicates? Could these be removed upstream? Or is there a problem with the join condition causing duplicates?
A simple solution is to just repartition the dataset after distinct to match the partitioning of your input data. Adding a secondary sort (sortWithinPartitions) is likely going to give you even better compression. However, this comes at the cost of an additional shuffle!
As @matt-andruff pointed out below, you can also achieve this in SQL using cluster by. Obviously, that also requires you to move the distinct keyword into your SQL statement.
Write your own deduplication algorithm as Spark Aggregator and group / shuffle the data just once in a meaningful way.

pyspark: Efficiently have partitionBy write to same number of total partitions as original table

I had a question that is related to pyspark's partitionBy() function which I originally posted in a comment on this question. I was asked to post it as a separate question, so here it is:
I understand that df.write.partitionBy(COL) will write all the rows with each value of COL to their own folder, and that each folder will (assuming the rows were previously distributed across all the partitions by some other key) have roughly the same number of files as were previously in the entire table. I find this behavior annoying. If I have a large table with 500 partitions, and I use partitionBy(COL) on some attribute columns, I now have for example 100 folders which each contain 500 (now very small) files.
What I would like is the partitionBy(COL) behavior, but with roughly the same file size and number of files as I had originally.
As demonstration, the previous question shares a toy example where you have a table with 10 partitions and do partitionBy(dayOfWeek) and now you have 70 files because there are 10 in each folder. I would want ~10 files, one for each day, and maybe 2 or 3 for days that have more data.
Can this be easily accomplished? Something like df.write().repartition(COL).partitionBy(COL) seems like it might work, but I worry that (in the case of a very large table which is about to be partitioned into many folders) having to first combine it to some small number of partitions before doing the partitionBy(COL) seems like a bad idea.
Any suggestions are greatly appreciated!
You've got several options. In my code below I'll assume you want to write in parquet, but of course you can change that.
(1) df.repartition(numPartitions, *cols).write.partitionBy(*cols).parquet(writePath)
This will first use hash-based partitioning to ensure that a limited number of values from COL make their way into each partition. Depending on the value you choose for numPartitions, some partitions may be empty while others may be crowded with values -- for anyone not sure why, read this. Then, when you call partitionBy on the DataFrameWriter, each unique value in each partition will be placed in its own individual file.
Warning: this approach can lead to lopsided partition sizes and lopsided task execution times. This happens when values in your column are associated with many rows (e.g., a city column -- the file for New York City might have lots of rows), whereas other values are less numerous (e.g., values for small towns).
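The lopsidedness described in the warning can be illustrated with a plain-Python simulation of hash partitioning. This is a sketch under stated assumptions: CRC32 is used only as a deterministic stand-in for Spark's own hash function, and the city names, row counts and the helper `hash_partition` are made up for illustration.

```python
import zlib
from collections import Counter

def hash_partition(value, num_partitions):
    """Assign a value to a partition by hashing it (CRC32 here, as a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions

# Hypothetical skewed column: one big city and a few small towns.
rows = ["New York City"] * 1000 + ["Smallville"] * 10 + ["Tinytown"] * 5 + ["Hamlet"] * 3
num_partitions = 8

rows_per_partition = Counter(hash_partition(city, num_partitions) for city in rows)
for partition in range(num_partitions):
    print(partition, rows_per_partition.get(partition, 0))
```

With only four distinct values spread over eight partitions, at least four partitions stay empty, and whichever partition receives the big city holds almost all of the rows.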
(2) df.sort(sortCols).write.parquet(writePath)
This option works great when you want (1) the files you write to be of nearly equal size and (2) exact control over the number of files written. This approach first globally sorts your data and then finds splits that break up the data into k evenly-sized partitions, where k is specified in the Spark config spark.sql.shuffle.partitions. This means that all rows with the same value of your sort key are adjacent to each other, but sometimes they'll span a split and end up in different files. Thus, if your use-case requires all rows with the same key to be in the same partition, don't use this approach.
There are two extra bonuses: (1) by sorting your data, its size on disk can often be reduced (e.g., sorting all events by user_id and then by time will lead to lots of repetition in column values, which aids compression) and (2) if you write to a file format that supports it (like Parquet), then subsequent readers can read the data efficiently by using predicate push-down, because the Parquet writer will write the MAX and MIN values of each column in the metadata, allowing the reader to skip rows if the query specifies values outside of the partition's (min, max) range.
Note that sorting in Spark is more expensive than just repartitioning and requires an extra stage. Behind the scenes Spark will first determine the splits in one stage, and then shuffle the data into those splits in another stage.
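The "sort, then split into k even parts" behaviour of option (2) can be sketched in plain Python. This is a simplified model rather than Spark's algorithm (Spark estimates the split points instead of cutting an exact sorted list), and the user ids and the helper `range_split` are hypothetical.

```python
def range_split(sorted_rows, k):
    """Cut an already sorted list into k contiguous chunks of near-equal size."""
    n = len(sorted_rows)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [sorted_rows[bounds[i]:bounds[i + 1]] for i in range(k)]

# Hypothetical sort key: user ids with uneven row counts (7, 2 and 3 rows).
rows = sorted(["u1"] * 7 + ["u2"] * 2 + ["u3"] * 3)
chunks = range_split(rows, 3)

for chunk in chunks:
    print(len(chunk), sorted(set(chunk)))
```

All three chunks have the same size, but the rows for `u1` are spread over the first two chunks, which is exactly the case where a key spans a split and lands in more than one file.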
(3) df.rdd.partitionBy(customPartitioner).toDF().write.parquet(writePath)
If you're using Spark on Scala, then you can write a custom partitioner, which can get around the annoying gotchas of the hash-based partitioner. Not an option in pySpark, unfortunately. If you really want to write a custom partitioner in pySpark, I've found this is possible, albeit a bit awkward, by using rdd.repartitionAndSortWithinPartitions:
(
    df.rdd
    .keyBy(sort_key_function)  # convert to key-value pairs
    .repartitionAndSortWithinPartitions(
        numPartitions=N_WRITE_PARTITIONS,
        partitionFunc=part_func,
    )
    .values()  # get rid of the keys again
    .toDF()
    .write.parquet(writePath)
)
Maybe someone else knows an easier way to use a custom partitioner on a dataframe in pyspark?
df.repartition(COL).write().partitionBy(COL)
will write out one file per partition. This will not work well if one of your partitions contains a lot of data, e.g. if one partition contains 100GB of data, Spark will try to write out a 100GB file and your job will probably blow up.
df.repartition(2, COL).write().partitionBy(COL)
will write out a maximum of two files per partition, as described in this answer. This approach works well for datasets that are not very skewed (because the optimal number of files per partition is roughly the same for all partitions).
This answer explains how to write out more files for the partitions that have a lot of data and fewer files for the small partitions.

More efficient way to Iterate & compute over columns [duplicate]

I have a very wide dataframe > 10,000 columns and I need to compute the percent of nulls in each. Right now I am doing:
threshold = 0.9
for c in df_a.columns[:]:
    if df_a[df_a[c].isNull()].count() >= (df_a.count() * threshold):
        # print(c)
        df_a = df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are a few strategies you can take depending on the size of the dataframe. The code looks good to me. You need to go through each column and count the number of null values.
One strategy is to cache the input dataframe. That will enable faster filtering. However, this only works if the dataframe is not huge.
Also, I am a little skeptical about
df_a=df_a.drop(c)
as this changes the dataframe inside the loop. It is better to collect the names of the null columns and drop them from the dataframe afterwards in a separate step.
If the dataframe is huge and you can't cache it completely, you can split it into manageable groups of columns. For example, take 100 columns at a time, cache that smaller dataframe, and run the analysis 100 times in a loop.
Now you might want to keep track of the analyzed column list separate from the yet to be analyzed columns in this case. That way even if the job fails you can start the analysis from the rest of the columns.
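You should avoid iterating over columns in PySpark: every count inside the loop launches a separate Spark job over the whole dataframe, whereas a single aggregation computes all the counts in one pass.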
Using count on a column will compute the count of non-null elements.
threshold = 0.9
import pyspark.sql.functions as psf
count_df = df_a\
.agg(*[psf.count("*").alias("count")]+ [psf.count(c).alias(c) for c in df_a.columns])\
.toPandas().transpose()
The first element is the number of rows in the dataframe:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold)*total_count].iloc[1:,:]
df_a.select(list(kept_cols.index))
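The threshold arithmetic used to pick the columns can be checked on made-up numbers. The sketch below is plain Python and assumes the single aggregation has already returned the total row count and the non-null count of each column; the column names and counts are hypothetical.

```python
threshold = 0.9

# Hypothetical result of the single aggregation: total rows and non-null counts.
total_count = 1000
non_null_counts = {"id": 1000, "mostly_null": 50, "half_null": 500, "sparse": 120}

kept_cols = []
for column, non_null in non_null_counts.items():
    null_fraction = (total_count - non_null) / total_count
    # Drop a column once at least `threshold` of its values are null.
    if null_fraction < threshold:
        kept_cols.append(column)

print(kept_cols)
```

Here only `mostly_null` is dropped, because 95% of its values are null; `sparse` is kept at 88% nulls because it is still below the 90% threshold.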

Spark Delete Rows

I have a DataFrame containing roughly 20k rows.
I want to delete 186 rows randomly in the dataset.
To understand the context - I am testing a classification model on missing data, and each row has a unix timestamp. 186 rows correspond to 3 seconds (there are 62 rows of data per second).
My aim for this is, when data is streaming, it is likely that data will be missing for a number of seconds. I am extracting features from a time window, so I want to see how missing data affects model performance.
I think the best approach to this would be to convert to an rdd and use the filter function, something like this, and put the logic inside the filter function.
dataFrame.rdd.zipWithIndex().filter(lambda x: )
But I am stuck with the logic - how do I implement this? (using PySpark)
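Try this:
import random
n_drop = 186  # 3 seconds at 62 rows per second
startVal = random.randint(0, dataFrame.count() - n_drop)
# zipWithIndex() yields (row, index) pairs, so the index is x[1]
dataFrame.rdd.zipWithIndex()\
.filter(lambda x: not (startVal <= x[1] < startVal + n_drop))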
This should work!
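The filtering logic can be tried out without a cluster. The following is a minimal plain-Python sketch, assuming the goal from the question (drop one contiguous block of 186 rows, i.e. 3 seconds at 62 rows per second). The list `indexed_rows` is a stand-in for what `dataFrame.rdd.zipWithIndex()` produces, and the helpers `pick_window_start` and `keep` are hypothetical names.

```python
import random

ROWS_PER_SECOND = 62
N_DROP = 3 * ROWS_PER_SECOND  # 186 rows, i.e. 3 seconds of data

def pick_window_start(n_rows, n_drop, rng=random):
    """Pick a random start so the whole window fits inside the dataset."""
    return rng.randint(0, n_rows - n_drop)  # randint is inclusive on both ends

def keep(indexed_row, start, n_drop):
    """Predicate for (row, index) pairs, the shape zipWithIndex() produces."""
    _, index = indexed_row
    return not (start <= index < start + n_drop)

# Stand-in for dataFrame.rdd.zipWithIndex(): 20,000 dummy rows paired with indices.
indexed_rows = [(f"row-{i}", i) for i in range(20_000)]

start = pick_window_start(len(indexed_rows), N_DROP, random.Random(42))
remaining = [row for row, _ in filter(lambda x: keep(x, start, N_DROP), indexed_rows)]
print(len(remaining))
```

The same `keep` predicate could be passed to the RDD `filter` call, with `start` computed once on the driver so that every partition removes the same window.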
