Writing a dataframe to disk taking an unrealistically long time in Pyspark (Spark 2.1.1)

I'm running Pyspark on a single server with multiple CPUs. All other operations (reading, joining, filtering, custom UDFs) execute quickly, except for writing to disk. The dataframe I'm trying to save is around 400 GB in size, with 200 partitions.
sc.getConf().getAll()
The driver memory is 16g, and the working directory has enough space (> 10 TB).
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
Wondering if anyone has run into the same issue. Also, would changing any of the config parameters before pyspark is invoked help solve the issue?
Edits (a few clarifications):
When I say the other operations were executed quickly, there was always an action after each transformation, in my case row counts, so all of those operations really did complete very fast. I still haven't figured out why writing takes such a ridiculous amount of time.
One of my colleagues pointed out that the disks in our server might have a limit on concurrent writes, which might be slowing things down; I'm still investigating this. I'm also interested in knowing whether others are seeing slow write times on a Spark cluster. I have confirmation from one user who sees this on an AWS cluster.

All other operations (reading, joining, filtering, custom UDFs)
Those are fast because they are transformations - they are lazy and don't actually do anything until the data has to be saved.
The dataframe I'm trying to save is of size around ~400 gb
(...)
I'm trying to save using the following command:
df.repartition(1).write.csv("out.csv")
That just cannot work well. Even ignoring the part where you use a single machine, saving 400 GB with a single thread (!) is simply hopeless. Even if it succeeds, it is no better than using a plain bash script.
Spark aside, sequential writes of 400 GB will take a substantial amount of time, even on an average disk. And given the multiple shuffles (join, repartition), the data will be written to disk multiple times.
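Based on that, a minimal sketch of the alternative: keep the write parallel instead of forcing it through a single partition (the output path is a placeholder).
# let each of the existing 200 partitions write its own part file in parallel,
# instead of collapsing everything into one partition first
df.write.csv("out_dir", header=True)
# if a single file is really needed, merge the part files afterwards outside Spark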

After a lot of trial and error, I realized that the issue was due to the method I used to read the file from disk. I was using the built-in read.csv function, and when I switched over to the read function in the databricks-csv package the problem went away. I'm now able to write files to disk in a reasonable time. It's really strange; maybe it's a bug in 2.1.1, or the databricks-csv package is just really well optimized.
1. Using the built-in read.csv method
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("model") \
    .config("spark.worker.dir", "xxxx") \
    .getOrCreate()
df = spark.read.load("file.csv", format="csv", header=True)
df.write.csv("file_after_processing.csv")
2. Using the databricks-csv package
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')

Related

Transferring a large table with a small amount of memory using pyspark

I am trying to transfer multiple tables' data using pyspark (one table at a time). The problem is that two of my tables are a lot larger than my memory (Table 1 - 30GB, Table 2 - 12GB).
Unfortunately, I only have 6GB of memory (for driver + executor). All of my attempts to optimize the transfer process have failed. Here's my SparkSession Configuration:
spark = SparkSession.builder \
    .config('spark.sql.shuffle.partitions', '300') \
    .config('spark.driver.maxResultSize', '0') \
    .config('spark.driver.memoryOverhead', '0') \
    .config('spark.memory.offHeap.enabled', 'false') \
    .config('spark.memory.fraction', '300') \
    .master('local[*]') \
    .appName('stackoverflow') \
    .getOrCreate()
For reading and writing I'm using the fetchsize and batchsize parameters and a plain connection to a PostgreSQL DB. Parameters like numPartitions are not an option in this case - the script has to stay generic for about 70 tables.
I ran tons of tests and tuned all the parameters, but none of them worked. Besides that, I noticed that there are memory spills, but I can't understand why they happen or how to disable them. Spark should be holding some rows at a time, writing them to my destination table, and then dropping them from memory.
I'd be happy to get any tips from anyone who faced a similar challenge.
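For reference, a hedged sketch of the kind of JDBC read and write the question describes, relying on fetchsize and batchsize so rows are moved in chunks rather than held in memory all at once (the URL, credentials, and table names are placeholders, not from the original post):
# read from PostgreSQL; fetchsize controls how many rows the JDBC driver
# pulls per round trip, so the whole table never has to sit in memory at once
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "source_table") \
    .option("user", "user") \
    .option("password", "password") \
    .option("fetchsize", "10000") \
    .load()
# write to the destination table; batchsize controls how many rows
# are sent per insert batch
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "target_table") \
    .option("user", "user") \
    .option("password", "password") \
    .option("batchsize", "10000") \
    .mode("append") \
    .save()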

Spark-redis: dataframe writing times too slow

I am an Apache Spark/Redis user, and recently I tried spark-redis for a project. The program generates PySpark dataframes with approximately 3 million rows, which I am writing to a Redis database using the command
df.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()
as suggested on the project's GitHub dataframe page.
However, I am getting inconsistent writing times for the same Spark cluster configuration (same number of EC2 instances and instance types). Sometimes it is very fast, sometimes far too slow. Is there any way to speed up this process and get consistent writing times? I wonder if it happens slowly when there are already a lot of keys inside, but that shouldn't be an issue for a hash table, should it?
This could be a problem with your partitioning strategy.
Check the number of partitions of "df" before writing and see whether there is a relation between the number of partitions and the execution time.
If so, partitioning your "df" with a suitable partitioning strategy (repartitioning to a fixed number of partitions, or repartitioning based on a column value) should resolve the problem.
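A minimal sketch of that check (the target partition count of 100 is an assumption, not a recommendation):
# inspect the current partition count before writing
print(df.rdd.getNumPartitions())
# re-partition to a fixed number of partitions before the Redis write
df.repartition(100).write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()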
Hope this helps.

spark repartition / executor inconsistencies commandline vs jupyter

I wasn't really sure what to title this question -- happy for a suggested better summary
I'm beating my head trying to figure out why a dead simple spark job works fine from Jupyter, but from the command line is left with insufficient executors to progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save as parquet.
Problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given it is small. The same exact code works just fine in Jupyter, on the same cluster, where it spreads >10k tasks across my entire cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)
data.filter(field.isin(subset)).coalesce(1).write.parquet("output")
** edit: original version mistakenly had repartition(1) instead of coalesce.
In this case, run from the command line, my process will get one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried to manually force the number of executors using spark-submit runtime settings. In that case, it will start with my initial settings and then immediately start bringing them down until there is only one and nothing progresses.
Any ideas? thanks.
I ended up phoning a friend on this one...
the code that was running fine in JupyterHub, but not via the command line, was essentially:
read parquet,
filter on some small field,
coalesce(1)
write parquet
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, Spark sometimes optimizes coalesce(1) to a single task, which was the behavior I saw. By changing it to repartition(1), everything works fine.
I still have no idea why it works fine in JupyterHub -- having done >20 experiments -- and never on the command line -- also >20 experiments.
But, if you want to take your data lake to a data puddle this way, use repartition(1) or repartition(n), where n is small, instead of coalesce.
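A minimal sketch of the change, reusing the placeholder names from the question:
# repartition(1) adds a shuffle, so the read and filter still run in parallel
# across the cluster and only the final stage writes a single partition
data.filter(field.isin(subset)).repartition(1).write.parquet("output")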

How can you quickly save a dataframe/RDD from PySpark to disk as a CSV/Parquet file?

I have a Google Dataproc Cluster running and am submitting a PySpark job to it that reads in a file from Google Cloud Storage (945MB CSV file with 4 million rows --> takes 48 seconds in total to read in) to a PySpark Dataframe and applies a function to that dataframe (parsed_dataframe = raw_dataframe.rdd.map(parse_user_agents).toDF() --> takes about 4 or 5 seconds).
I then have to save those modified results back to Google Cloud Storage as a GZIP'd CSV or Parquet file. I can also save those modified results locally, and then copy them to a GCS bucket.
I repartition the dataframe via parsed_dataframe = parsed_dataframe.repartition(15) and then try saving that new dataframe via
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")
parsed_dataframe.write.format("com.databricks.spark.csv").save("gs://somefolder/", header="true")
parsed_dataframe.write.format("com.databricks.spark.csv").options(codec="org.apache.hadoop.io.compress.GzipCodec").save("gs://nyt_regi_usage/2017/max_0722/regi_usage/", header="true")
Each of those methods (and their different variants with lower/higher partitions and saving locally vs. on GCS) takes over 60 minutes for the 4 million rows (945 MB) which is a pretty long time.
How can I optimize this/make saving the data faster?
It's worth noting that both the Dataproc cluster and the GCS bucket are in the same region/zone, and that the cluster has an n1-highmem-8 (8 CPUs, 52 GB memory) master node with 15+ worker nodes (just variables I'm still testing out).
Some red flags here.
1) Reading as a DF, then converting to an RDD to process and converting back to a DF, is by itself very inefficient. You lose the Catalyst and Tungsten optimizations by reverting to RDDs. Try changing your function to work within the DataFrame API (see the sketch after this list).
2) Repartitioning forces a shuffle, but more importantly it means that the computation will now be limited to the executors holding those 15 partitions. If your executors are large (7 cores, ~40 GB RAM), this likely isn't a problem.
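As a sketch of point 1, here is one way the parsing could stay inside the DataFrame API via a UDF. This assumes parse_user_agents (or a variant of it) can take a single user-agent string and return a string; the column name "user_agent" is also an assumption:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# run the existing parsing logic as a UDF inside the DataFrame plan instead of
# a separate RDD map (column name and return type are assumptions)
parse_udf = udf(parse_user_agents, StringType())
parsed_dataframe = raw_dataframe.withColumn("parsed_ua", parse_udf(raw_dataframe["user_agent"]))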
What happens if you write the output before being repartitioned?
Please provide more code and ideally spark UI output to show how long each step in the job takes.
Try this, it should take a few minutes:
your_dataframe.write.csv("export_location", mode="overwrite", header=True, sep="|")
Make sure you add mode="overwrite" if you want to overwrite an old version.
Are you calling an action on parsed_dataframe?
As you wrote it above, Spark will not compute your function until you call write. If you're not calling an action, see how long parsed_dataframe.cache().count() takes. I suspect it will take an hour and that then running parsed_dataframe.write(...) will be much faster.
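A minimal sketch of that check, reusing the parquet path from the question:
# cache and count to force the computation once; time this step separately
parsed_dataframe.cache()
parsed_dataframe.count()
# the write itself should now be much faster
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")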

Spark on localhost

For testing purposes, while I don't have a production cluster, I am using Spark locally:
from pyspark import SparkConf, SparkContext
print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very, very small dataset, consisting of only 20 rows in a PostgreSQL database (~2 KB).
Also(!), my code is quite simple as well, only grouping 20 rows by a key and applying a trivial map operation
params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
    .groupByKey() \
    .mapValues(lambda value: self.__data_interpolation(value, params))

def __data_interpolation(self, data, params):
    # TODO: only for testing
    return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info:
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 seconds, to process less than 1 KB of data?
Could it be a network issue, e.g. trying to resolve the IP address of localhost?
I don't know... Any clues?
It appears the main reason for the slower performance in your code snippet is the use of groupByKey(). The issue with groupByKey is that it ends up shuffling all of the key-value pairs, resulting in a lot of data being transferred unnecessarily. A good reference explaining this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey which should be faster (more info is also included in the above Avoid GroupByKey link).
Use DataFrames (instead of RDDs), as DFs include performance optimizations (and the DF GroupBy statement is faster than the RDD version); a sketch follows below. Also, since you're using Python, working with DataFrames lets you avoid the Python-to-JVM issues that come with PySpark RDDs. More information on this can be found in PySpark Internals.
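A hedged sketch of the DataFrame version, grouping by the same three key columns as the keyBy in the question (collect_list is used only as a stand-in aggregation, since __data_interpolation is just a stub here):
from pyspark.sql import functions as F
# group with the DataFrame API instead of RDD groupByKey; collecting the full
# rows per group stands in for whatever __data_interpolation would do with them
grouped = df.groupBy("a", "b", "c") \
    .agg(F.collect_list(F.struct(*df.columns)).alias("rows"))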
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
HTH!
I suppose the "postgresql" is the key to solve that puzzle.
keyBy is probably the first operation that really uses the data so it's execution time is bigger as it needs to get the data from external database. You can verify it by adding at the beginning:
df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
If I am right, you need to optimize the database. It may be:
A network issue (slow network to the DB server)
A complicated (and slow) SQL query on this database (try running it directly in the psql shell)
Some authorization difficulties on the DB server
A problem with the JDBC driver you use
From what I have seen happening on my system while running Spark:
When we run a Spark job, it internally creates map and reduce tasks and runs them. In your case, to process the data you have, it created 22 such tasks. The bigger the data, the larger this number may be.
Hope this helps.
