How to efficiently sum multiple columns in PySpark? - apache-spark

Recently I've started to use PySpark and its DataFrames. I have a situation where I have around 18 million records and around 50 columns. I'd like to get the sum of every column, so I use:
from pyspark.sql import functions as f
df_final = df.select([f.sum(c) for c in df.columns])
df_final.collect()
But my problem is that when I do this, the whole job collapses to a single partition, and I run into efficiency problems and run out of memory when collecting.
I've read that it behaves this way because it needs to put every key of a groupBy on a single executor. Since I'm summing whole columns I don't actually need a groupBy, but I don't know how to achieve it otherwise.
Is there any more efficient/faster way to do it?

It is advisable to call collect() only on small volumes of data, because it uses too much memory. You can read this article.
Instead of collect(), write the output to a file:
df.write.csv('mycsv.csv')
EDIT:
Parquet will provide better performance with Spark.
To explore all the file formats supported by Spark, read the official documentation.

Related

HashPartitioning dataframes to achieve co-partitioning during join in PySpark

I am trying to figure out the best way to achieve co-partitioning on my two datasets to eliminate join-related shuffles. I'm working with 2 dataframes A and B, where A contains minimal user data including a field for the event IDs they interacted with, and B contains detailed information about the events. I am trying to join on 3 fields: day, event_type, and event_id. A and B need to be read from disk as they will be written to and read from by external clients on an ongoing basis.
The main goal of the project I'm working on is to make it possible to quickly:
1. Filter by event_type
2. Join raw event details to user IDs
I understand that in order to achieve #1 I probably need to partition my parquet files on event_type so that the directory structure achieves easier filtering. In order to achieve #2 I should try to minimize shuffles as much as possible by means of co-partitioning keys from the two dataframes.
The data I'm working with consists of 3 days of event data (~12M rows per event type) and the goal is to get this working efficiently for 1-3 years of data.
In order to improve my join I first begin by filtering on the event_type I am interested in to narrow down the data on both dataframes. I then do the actual join on day and event_id. This naturally will result in shuffles since there is no co-partitioning so I've tried to address that using hash partitioning.
I read that repartition implements hash partitioning on the specified columns. I save my dataframes to disk and also include a partitionBy('day', 'event_type') in order to achieve better performance on filtering/grouping operations.
A\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/A')
B\
.repartition('day', 'event_id')\
.write\
.partitionBy('day', 'event_type')\
.mode('overwrite')\
.parquet('/path/to/B')
...
...
from pyspark.sql.functions import col
A = spark.read.parquet('/path/to/A')
B = spark.read.parquet('/path/to/B')
A.filter(col('event_type') == 'X')\
.join(B.filter(col('event_type') == 'X'), on=['day', 'event_id'], how='inner')\
.show()
When I execute this I still see a shuffle exchange in the plan as well as shuffle writes which take up around 5-10GB each. I also see longer executor compute times of around 21-41s which might not seem much on 3 days of data but might blow up for yearly data.
I am wondering what's a better way I can go about doing this - or if it is even possible to eliminate shuffles when working with dataframes? Answers to this question seem to suggest that it might be possible but not a great idea?
I am not even sure that doing both a repartition and a partitionBy is the correct approach. Is the initial partitioning using repartition() preserved at all when I re-read the parquet files from disk? I have read that this might not be the case - overall the information available seems either conflicting or without explicit sources attached.
Thank you for taking the time to help.

Running out of memory when processing CSV files with PySpark and Python

I don't know which part of the code to share, since what I do is basically the following (I will share a simplified version of the code for reference):
Task: I need to read file A and match its values against a column in file B (file B is actually more than 100 CSV files, each containing more than 1 million rows), then combine the matched rows into a single CSV.
1. Extract the column values from file A into a list.
2. Load file B in PySpark and use .isin to match against the list of values from file A.
3. Concatenate the results into a single CSV file.
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"):  # here will loop at least 100 times.
    second = spark.read.csv("fileB")
    second = second["columnB"].isin(list_values)  # More than hundreds of thousands of rows are expected to match.
    combine.append(second)
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a better way to perform such a task? Currently the code runs for more than 30 hours and then fails with the error above. Is there something like parallel programming that could speed up the process or get rid of the error?
Also, when I test it with only 2 CSV files it takes less than a minute to complete, but when I loop over the whole folder of 100 files it takes more than 30 hours.
There are several things you can try, assuming your configuration and resources stay unchanged:
Repartition when you read your CSV. I haven't studied the source code for how Spark reads CSV, but based on my experience and cases on SO, when you use Spark to read a CSV all the data can end up in a single partition, which might cause a Java OOM error and also does not fully utilize your resources. Check the partitioning of the data and make sure there is no data skew before you do any transformation or action.
Rethink how you filter based on another dataframe's column values. Your current approach collects the reference values into a Python list and then uses .isin() to check whether the main dataframe's column contains a value from that list. If the reference list is very long, checking EACH ROW against the whole list is costly. Instead, you can use a leftsemi .join() to achieve the same goal. If the reference dataset is small and you want to avoid shuffling data, you can use broadcast to copy the reference dataframe to every node.
If you can do it in Spark SQL, don't do it in pandas. In your last step you concatenate all the data after filtering. You can achieve the same with .unionAll() or .unionByName(). Even if you call pd.concat() inside the Spark session, all pandas operations run on the driver node and are not distributed, so it might cause a Java OOM error and degrade performance too.

Write spark dataframe to single parquet file

I am trying to do something very simple and I'm having some very stupid struggles. I think it must have to do with a fundamental misunderstanding of what spark is doing. I would greatly appreciate any help or explanation.
I have a very large (~3 TB, ~300MM rows, 25k partitions) table, saved as parquet in s3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.coalesce(1).write.saveAsTable("db.tiny_table")
and then when that didn't work I tried this, which I thought should be the same, but I wasn't sure. (I added the prints in an effort to debug.)
tiny = spark.table("db.big_table").limit(500).coalesce(1)
print(tiny.count())
print(tiny.show(10))
tiny.write.saveAsTable("db.tiny_table")
When I watch the Yarn UI, both print statements and the write are using 25k mappers. The count took 3 mins, the show took 25 mins, and the write took ~40 mins, although it finally did write the single file table I was looking for.
It seems to me like the first line should take the top 500 rows and coalesce them to a single partition, and then the other lines should happen extremely fast (on a single mapper/reducer). Can anyone see what I'm doing wrong here? I've been told maybe I should use sample instead of limit but as I understand it limit should be much faster. Is that right?
Thanks in advance for any thoughts!
I’ll approach the print functions issue first, as it’s something fundamental to understanding spark. Then limit vs sample. Then repartition vs coalesce.
The reason the print calls take so long is that coalesce is a lazy transformation. Most transformations in Spark are lazy and are not evaluated until an action is called.
Actions are things that do stuff and (mostly) don’t return a new dataframe as a result, like count and show. They return a number, or some data, whereas coalesce returns a dataframe with 1 partition (sort of, see below).
What is happening is that you are rerunning the SQL query and the coalesce call each time you call an action on the tiny dataframe. That’s why each call uses the 25k mappers.
To save time, add the .cache() method to the first line (for your print code anyway).
cache() is itself lazy: the transformations are executed when the first action runs, and the result is then persisted in memory on your Spark nodes.
This won’t have any impact on the query time for that first action, but at least you’re not running the query 2 more times, because the later actions can use the cached result.
To remove it from memory, use the .unpersist() method.
Now for the actual query you’re trying to do...
It really depends on how your data is partitioned, i.e. whether it is partitioned on specific fields, etc.
You mentioned it in your question, but sample might be the right way to go.
Why is this?
limit has to search for 500 of the first rows. Unless your data is partitioned by row number (or some sort of incrementing id) then the first 500 rows could be stored in any of the 25k partitions.
So Spark may have to search through many of them until it has collected enough rows. If the query also has an ORDER BY, it has to perform the additional step of sorting the data to get the correct order.
sample just grabs random rows (you give it a fraction rather than an exact row count). That is much easier to do, as there’s no ordering of the data involved and it doesn’t have to search through specific partitions for specific rows.
While limit can be faster, it also has its, erm, limits. I usually only use it for very small subsets like 10/20 rows.
Now for partitioning....
The problem I think with coalesce is it virtually changes the partitioning. Now I’m not sure about this, so pinch of salt.
According to the pyspark docs:
this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
So your 500 rows will actually still sit across your 25k physical partitions that are considered by spark to be 1 virtual partition.
Causing a shuffle (usually bad) and persisting in spark memory with .repartition(1).cache() is possibly a good idea here. Because instead of having the 25k mappers looking at the physical partitions when you write, it should only result in 1 mapper looking at what is in spark memory. Then write becomes easy. You’re also dealing with a small subset, so any shuffling should (hopefully) be manageable.
Obviously this is usually bad practice, and doesn’t change the fact spark will probably want to run 25k mappers when it performs the original sql query. Hopefully sample takes care of that.
edit to clarify shuffling, repartition and coalesce
You have 2 datasets in 16 partitions on a 4 node cluster. You want to join them and write as a new dataset in 16 partitions.
Row 1 for data 1 might be on node 1, and row 1 for data 2 on node 4.
In order to join these rows together, spark has to physically move one, or both of them, then write to a new partition.
That’s a shuffle, physically moving data around a cluster.
It doesn’t matter that everything is partitioned by 16; what matters is where the data is sitting on the cluster.
data.repartition(4) will physically move data from the 4 partitions on each node into 1 partition per node.
Spark might move all 4 partitions from node 1 over to the 3 other nodes, in a new single partition on those nodes, and vice versa.
I wouldn’t think it’d do this, but it’s an extreme case that demonstrates the point.
A coalesce(4) call though, doesn’t move the data, it’s much more clever. Instead, it recognises “I already have 4 partitions per node & 4 nodes in total... I’m just going to call all 4 of those partitions per node a single partition and then I’ll have 4 total partitions!”
So it doesn’t need to move any data because it just combines existing partitions into a joined partition.
Try this; in my experience repartition works better for this kind of problem:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.saveAsTable("db.tiny_table")
Even better, if you are only interested in the Parquet output, you don't need to save it as a table:
tiny = spark.sql("SELECT * FROM db.big_table LIMIT 500")
tiny.repartition(1).write.parquet(your_hdfs_path+"db.tiny_table")

Spark: How to collect a large amount of data without running out of memory

I have the following issue:
I do a sql query over a set of parquet files on HDFS and then I do a collect in order to get the result.
The problem is that when there are many rows I get an out of memory error.
This query requires shuffling so I can not do the query on each file.
One solution could be to iterate over the values of a column and save the result on disk:
df <- sql('original query goes here')
# data <- collect(df)  # <- out of memory
createOrReplaceTempView(df, 't')
for (c in cities) {
  x <- collect(sql(paste0("select * from t where city = '", c, "'")))
  # append x to file
}
As far as I know it will result in the program taking too much time because the query will be executed for each city.
What is the best way of doing this?
If it's running out of memory, the output data must be very large,
so you can write the results to a file instead, for example a Parquet file.
If you want to perform further operations on this data, you can read it back from that file.
For large datasets we should not use collect(); instead you can use take(100) or take(some_integer) to check that some values are correct.
As @cricket_007 said, I would not collect() your data from Spark to append it to a file in R.
Additionally, it doesn't make sense to iterate over a list of SparkR::distinct() cities and then select everything from those tables just to append them to some output dataset. The only time you would want to do that is if you are trying to do another operation within each group based upon some sort of conditional logic or apply an operation to each group using a function that is NOT available in SparkR.
I think you are trying to get a data frame (either Spark or R) with observations grouped in a way so that when you look at them, everything is pretty. To do that, add a GROUP BY city clause to your first SQL query. From there, just write the data back out to HDFS or some other output directory. From what I understand about your question, maybe doing something like this will help:
sdf <- SparkR::sql('SELECT SOME GREAT QUERY FROM TABLE GROUP BY city')
SparkR::write.parquet(sdf, path="path/to/desired/output/location", mode="append")
This will give you all your data in one file, and it should be grouped by city, which is what I think you are trying to get with your second query in your question.
You can confirm the output is what you want via:
newsdf <- SparkR::read.parquet(x="path/to/desired/output/location")
View(head(newsdf, num=200))
Good luck, hopefully this helps.
Since your data is huge, it is no longer possible to collect() it. Instead, you can sample the data and learn from the sample.
import numpy as np
arr = np.array(sdf.select("col_name").sample(False, 0.5, seed=42).collect())
Here you are sampling 50% of the data and just a single column.

Splitting a huge dataframe into smaller dataframes and writing to files using SPARK(python)

I am loading a 5 GB compressed file into memory (on AWS), creating a dataframe in Spark, and trying to split it into smaller dataframes based on the values of 2 columns. Eventually I want to write all these subsets into their respective files.
I just started experimenting in spark and just getting used to the data structures. The approach I was trying to follow was something like this.
read the file
sort it by the 2 columns (still not familiar with repartitioning and do not know if it will help)
identify unique list of all values of those 2 columns
iterate through this list
-- create smaller dataframes by filtering using the values in list
-- writing to files
df.sort("DEVICE_TYPE", "PARTNER_POS")
df.registerTempTable("temp")
grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
print(grp_col)
I believe there are cleaner and more efficient ways of doing this. I need to write the output to files because there are ETLs that get kicked off in parallel based on it. Any recommendations?
If it's okay that the subsets are nested in a directory hierarchy, then you should consider using spark's builtin partitioning:
df.write.partitionBy("device_type","partner_pos")\
.json("/path/to/root/output/dir")

Resources