I am developing PySpark code in VS Code, with the directory structure below. Currently, even if I only need to display a small print statement, PySpark loads everything and then gives me the output, which takes 10-15 seconds. Is there a way to keep the Spark session always on so that it returns results faster (not through Jupyter)?
I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call to toPandas() on a PySpark DataFrame, and afterwards I have various manipulations of the DataFrame, finishing in a call to to_csv() to write the results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, but only the log entries before the call show up in the spark-submit console output. I thought that maybe this is because the rest of the code is run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true, but this seems to show only internal events, not my actual application log statements.
Whether or not the assumption above about executor logs is true, I don't see the CSV file written to the path that I expect (/tmp). Further, the history server says No completed applications found! when I start it and configure it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.
When you use toPandas() to convert your Spark DataFrame to a pandas DataFrame, it is actually a heavy action, because it pulls all the records to the driver.
Remember that Spark is a distributed computing engine that does parallel computing, so your data is distributed across different nodes. This is completely different from a pandas DataFrame: pandas works on a single machine, while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post: it actually covers two questions:
Why there are no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only saves the job details that appear in the Spark computation DAG. Non-Spark logs are not saved in the Spark event log; if you really need those logs, you need to use a library such as Python's logging module to collect them on the driver.
Why there is no CSV saved in the /tmp dir: Since you mention that the event log shows an incomplete application rather than a failed application, I believe your DataFrame is so huge that the collection has not finished, so your pandas transformations have not even started. You can try collecting just a few records, say df.limit(20).toPandas(), to see whether it works (a rough sketch of this check is shown below). If it does, that means the DataFrame you are converting to pandas is simply very large and the collection takes time. If it doesn't work, perhaps you can share the error traceback.
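A minimal sketch of both points together (driver-side logging plus the limit(20).toPandas() sanity check); the app name, input path, and output path here are placeholders, not taken from the original post:

import logging
from pyspark.sql import SparkSession

# Driver-side logging: these messages appear on the spark-submit console,
# not in the Spark event log.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("my_app")

spark = SparkSession.builder.appName("toPandas_check").getOrCreate()
df = spark.read.parquet("/path/to/input")    # hypothetical input path

log.info("Collecting a small sample to the driver")
sample_pdf = df.limit(20).toPandas()         # only 20 rows reach the driver
log.info("Collected %d rows", len(sample_pdf))

# to_csv() runs in the driver process, so the file lands on the driver's
# local filesystem.
sample_pdf.to_csv("/tmp/sample.csv", index=False)
log.info("Wrote /tmp/sample.csv")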
I'm using Pyspark on Spark 3.0.1 on Windows 10 locally for testing and developing, and regardless of what I try the number of processes spawned is always 200 which is way too many for my small test cases.
I'm creating my Spark-SQL context like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_test").master("local")\
.config('spark.shuffle.partitions', '16')\
.config('spark.adaptive.enabled', 'True')\
.config("spark.adaptive.coalescePartitions.enabled", "True").getOrCreate()
Doing print(spark.sparkContext._conf.getAll()) later shows that the parameters have been correctly set (host censored by me):
[('spark.master', 'local'),
('spark.driver.host', '**************'),
('spark.app.name', 'pyspark_test'),
('spark.adaptive.enabled', 'True'),
('spark.rdd.compress', 'True'),
('spark.adaptive.coalescePartitions.enabled', 'True'),
('spark.driver.port', '58352'),
('spark.serializer.objectStreamReset', '100'),
('spark.submit.pyFiles', ''),
('spark.shuffle.partitions', '16'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1602571079244')]
I'm executing the task using spark-submit in the console, so each SparkSession should be created new with the given config.
My code contains a groupBy, an inner join, and a write.csv at the end. The csv output is the main issue here.
When I do a coalesce(1) before writing the csv, it takes 3 minutes to collect the 200 pieces of data into one, and the output csv is 338 KB. In the Stages Overview I can see that it only runs 2 tasks in parallel while going through the 200 pieces. Without the coalesce it just writes 200 separate csv files of 2 KB each, which also takes around 3 minutes.
My input data is two csv files with the sizes 3.8MB and 826KB.
I tried this with and without enabling adaptive optimization, but it feels like my settings are being ignored anyway.
I am aware of this related question but that was three and a half years ago on V1.6.
Also I did experiment with first creating a SparkContext, setting and getting a conf, stopping the SparkContext and using the conf for my SparkSession, but that didn't help either.
So my simple question is: Why is my setting of spark.shuffle.partitions being ignored and how do I fix this?
I do feel a bit stupid now.
I need to set spark.sql.shuffle.partitions and not spark.shuffle.partitions.
I was expecting Spark to throw an error when given a setting that doesn't exist, and when that didn't happen I assumed everything was okay.
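For reference, a corrected version of the session setup from the question might look like the sketch below; note that the adaptive options also live under the spark.sql.* prefix (this is an illustration, not the exact code from the post):

from pyspark.sql import SparkSession

# The DataFrame/SQL shuffle setting is spark.sql.shuffle.partitions (default
# 200); spark.shuffle.partitions is accepted silently but never read.
spark = (SparkSession.builder
         .appName("pyspark_test")
         .master("local")
         .config("spark.sql.shuffle.partitions", "16")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 16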
I am working on a project where I have to read S3 files (each about 3MB zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data, which are written back to S3. The pyspark script uses the 'xmltodict' python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive, but it is not my main concern right now.
Questions:
1. How do I know 'IF' I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning - number of rows, columns, data types, actions taken in the script, etc. in the source data file? I read the source file into an RDD, convert it to a DF, and perform various operations by adding columns, grouping data, counting data, etc. How does spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the Standalone cluster manager. The 'xmltodict' library is installed on this node, but is not installed on the Core node. It doesn't seem like it needs to be installed, or even python3 configured, on the Core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the python libraries via a shell file as a bootstrap action when I created the cluster, but it failed, and quite frankly, after trying it a few times, I gave up.
3. Based on partitioning, I think I am slightly confused about whether to use coalesce() or collect(). Again, the question is when to use each and when not to?
Sorry, too many questions. Now that I have the pyspark script written, I am trying to work on its efficiency.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and multiple tasks are then run, each processing one piece of the data. As you can see, this is the core of parallelism, and without it there is no significant benefit to using Spark (or any big-data processing framework). Most file formats are splittable, and some remain splittable when compressed, such as Avro, Parquet, and ORC. Some file formats are not splittable when compressed, such as zip and gzip. Based on the size of the files being processed and their ability to be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, since the data is zipped, one file will be one partition, and no more than one CPU can work on it at once. If the zip is small then this is fine, but if it is big then its processing will be slow.
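To see this in practice, a quick check along these lines (the bucket path and target partition count are placeholders) shows how a compressed, non-splittable file arrives as a single partition and how an explicit repartition spreads the rows out for later wide operations such as groupBy and count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_check").getOrCreate()

# A gzip-compressed CSV is not splittable, so it is read as one partition
# regardless of its size.
df = spark.read.csv("s3://my-bucket/input/data.csv.gz", header=True)  # hypothetical path
print(df.rdd.getNumPartitions())   # typically 1 for a non-splittable file

# Repartitioning after the read spreads the rows across the cluster so that
# subsequent wide operations can run in parallel.
df = df.repartition(8)
print(df.rdd.getNumPartitions())   # 8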
I wasn't really sure what to title this question -- happy for a suggested better summary
I'm beating my head trying to figure out why a dead simple spark job works fine from Jupyter, but from the command line is left with insufficient executors to progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save as parquet.
Problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given how small the output is. The exact same code works just fine in Jupyter on the same cluster, where it fans out >10k tasks across the whole cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting the lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)
data.filter(field.isin(subset)).coalesce(1).write.parquet("output")
** edit: original version mistakenly had repartition(1) instead of coalesce.
In this case, run from the command line, my process will get one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried to manually force the number of executors using spark-submit runtime settings. In that case, it will start with my initial settings and then immediately start bringing them down until there is only one and nothing progresses.
Any ideas? thanks.
I ended up phoning a friend on this one...
The code that was running fine in JupyterHub, but not via the command line, was essentially:
read parquet,
filter on some small field,
coalesce(1)
write parquet
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, Spark sometimes optimizes coalesce(1) by collapsing the whole upstream computation into a single task (coalesce does not introduce a shuffle, so the reduced partition count propagates back through the stage), which was the behavior I saw. After changing it to repartition(1), everything works fine.
I still have no idea why it works fine in JupyterHub -- having done >20 experiments -- and never on the command line -- also >20 experiments.
But, if you want to take your data lake to a data puddle this way, use repartition(1) or repartition(n), where n is small, instead of coalesce.
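As a sketch of the working version (the paths, column name, and filter values below are placeholders, not the original code), with the stage-collapsing behaviour noted in the comments:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

subset = ["a", "b", "c"]   # hypothetical filter values

# coalesce(1) adds no shuffle, so Spark can fold the read and the filter into
# the same single-partition stage: one task ends up doing all the work.
# repartition(1) inserts a shuffle boundary: the read and filter still run in
# parallel across the cluster, and only the final write uses one partition.
(spark.read.parquet("/path/to/big/input")     # hypothetical input path
      .select("field")                        # placeholder column name
      .filter(col("field").isin(subset))
      .repartition(1)
      .write.mode("overwrite")
      .parquet("/path/to/output"))            # hypothetical output path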
I use pyspark version 1.5.0 with Cloudera 5.5.0. All scripts are running fine except when I use sc.wholeTextFiles. Using this command gives an error:
Kryo Serialization failed: Buffer overflow. Available:0, required: 23205706. To avoid this, increase spark.kryoserializer.buffer.max
However, I don't find the property spark.kryoserializer.buffer.max under the Environment tab in the Spark web UI. The only "kryo" on this page is the value org.apache.spark.serializer.KryoSerializer for the name spark.serializer.
Why can't I see this property? And how to fix the problem?
EDIT
Turns out that the Kryo error was caused by printing to the shell. Without the printing, the error is actually java.io.IOException: Filesystem closed!
The script now works correctly for a small portion of the data, but running it on all of the data (about 500GB, 10,000 files) returns this error.
I tried to pass in the property --conf "spark.yarn.executor.memoryOverhead=2000", and it seems that it allows a slightly larger portion of the data to be read, but it still ultimately fails on the full data. It takes 10-15 minutes of running before the error appears.
The RDD is big, but the error is produced even when only doing .count() on it.
You should pass such a property when submitting the job; that is why it is not in the Cloudera UI.
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html
In your case: --conf spark.kryoserializer.buffer.max=64m (for example; note there should be no spaces around the =).
Also, I'm not sure, but if you increase the Kryo buffer you might also want to increase the Akka frame size.
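Equivalently, the property can be set programmatically before the SparkContext is created; this is just a sketch, assuming the script builds its own context, and the app name, buffer size, and input path are placeholders:

from pyspark import SparkConf, SparkContext

# Alternative to passing --conf on the command line: set the Kryo buffer limit
# on the SparkConf before the context exists. The value is only an example.
conf = (SparkConf()
        .setAppName("wholeTextFiles_job")
        .set("spark.kryoserializer.buffer.max", "256m"))
sc = SparkContext(conf=conf)

rdd = sc.wholeTextFiles("hdfs:///path/to/files")  # hypothetical input path
print(rdd.count())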