PySpark job only partly running - apache-spark

I have a PySpark script that I am running locally with spark-submit in Docker. In my script I have a call toPandas() on a PySpark DataFrame and afterwards I have various manipulations of the DataFrame, finishing in a call to to_csv() to write results to a local CSV file.
When I run this script, the code after the call to toPandas() does not appear to run. I have log statements before this method call and afterwards, however only the log entries before the call show up on the spark-submit console output. I have thought that maybe this is due to the rest of the code being run in a separate executor process by Spark, so the logs don't show on the console. If this is true, how can I see my application logs for the executor? I have enabled the event log with spark.eventLog.enabled=true but this seems to show only internal events, not my actual application log statements.
Whether or not the assumption about executor logs above is true, I don't see the CSV file written to the path that I expect (/tmp). Further, the history server says No completed applications found! when I start it, configuring it to read the event log (spark.history.fs.logDirectory for the history server, spark.eventLog.dir for Spark). It does show an incomplete application, and the only completed job listed there is for my toPandas() call.
How am I supposed to figure out what's happening? Spark shows no errors at any point, and I can't seem to view my own application logs.

When you call toPandas() to convert a Spark DataFrame to a pandas DataFrame, you trigger a heavy action, because it pulls all the records to the driver.
Remember that Spark is a distributed computing engine that computes in parallel. Your data is distributed across different nodes, which is completely different from a pandas DataFrame: pandas works on a single machine, while Spark works on a cluster. You can check this post: why does python dataFrames' are localted only in the same machine?
Back to your post, it actually covers 2 questions:
Why there are no logs after toPandas(): As mentioned above, Spark is a distributed computing engine. The event log only records the job details that appear in the Spark computation DAG. The code after toPandas() is plain Python running in the driver, not in an executor, so its log statements are not Spark events and are not saved in the Spark event log. If you really need those logs, use a library such as logging to collect them in the driver.
Why there is no CSV saved in /tmp: You mentioned that the event log shows an incomplete application rather than a failed one, so I believe your DataFrame is so large that the collection has not finished and the pandas transformations have not even started. Try collecting only a few records, say df.limit(20).toPandas(), to see whether it works. If it does, the DataFrame you are converting to pandas is simply large and the conversion takes time. If it does not, please share the error traceback.
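Both suggestions can be combined in a small driver-side helper. This is only a sketch: the logger name my_app, the function names and the log path are hypothetical, and df is assumed to be the PySpark DataFrame from the question.

```python
import logging
import sys


def configure_driver_logging(log_path):
    """Send driver-side log records to stderr and to a local file."""
    logger = logging.getLogger("my_app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(sys.stderr),
                    logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def to_pandas_sample(df, logger, n=20):
    """Collect only the first n rows, logging before and after the action."""
    logger.info("collecting %d rows to the driver", n)
    pdf = df.limit(n).toPandas()
    logger.info("collected %d rows", len(pdf))
    return pdf
```

If the second message shows up for a small n but never for the full DataFrame, that would point at the size of the collection rather than at missing logs.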

Related

PySpark SQL error - ExternalAppendOnlyUnsafeRowArray

I am trying to perform a complex query through Spark. When I try to visualize the results (by calling the .show() method), the application gets stuck.
I tried to see what was going on by looking at the Spark logs, and I noticed that the Spark job related to the show method had a huge number of tasks and did not seem to be running at all.
By looking at the logs, I noticed the following:
INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I tried to modify the spill threshold by injecting the following config in the spark session:
("spark.shuffle.sort.bypassMergeThreshold", "8192")
But I got the same log message and outcome.
I am running the application on Spark 3.2.1.

Unable to change number of partitions in Pyspark with Spark 3.0.1

I'm using Pyspark on Spark 3.0.1 on Windows 10 locally for testing and developing, and regardless of what I try the number of processes spawned is always 200 which is way too many for my small test cases.
I'm creating my Spark-SQL context like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_test").master("local")\
    .config('spark.shuffle.partitions', '16')\
    .config('spark.adaptive.enabled', 'True')\
    .config("spark.adaptive.coalescePartitions.enabled", "True").getOrCreate()
Doing print(spark.sparkContext._conf.getAll()) later shows that the parameters have been correctly set (host censored by me):
[('spark.master', 'local'),
('spark.driver.host', '**************'),
('spark.app.name', 'pyspark_test'),
('spark.adaptive.enabled', 'True'),
('spark.rdd.compress', 'True'),
('spark.adaptive.coalescePartitions.enabled', 'True'),
('spark.driver.port', '58352'),
('spark.serializer.objectStreamReset', '100'),
('spark.submit.pyFiles', ''),
('spark.shuffle.partitions', '16'),
('spark.executor.id', 'driver'),
('spark.submit.deployMode', 'client'),
('spark.app.id', 'local-1602571079244')]
I'm executing the task using spark-submit in the console, so each SparkSession should be created new with the given config.
My code contains a groupBy, an inner join, and a write.csv at the end. The csv output is the main issue here.
When I do a coalesce(1) before writing csv it takes 3 minutes to collect 200 pieces of data into one, the output csv has 338KB. In the Stages Overview I can see that it only runs 2 tasks in parallel while going through the 200 pieces. Without that it just writes 200 separate csv files with 2KB each which also takes around 3 minutes.
My input data is two csv files with the sizes 3.8MB and 826KB.
I tried this with and without enabling adaptive optimization, but it feels like my settings are being ignored anyway.
I am aware of this related question but that was three and a half years ago on V1.6.
Also I did experiment with first creating a SparkContext, setting and getting a conf, stopping the SparkContext and using the conf for my SparkSession, but that didn't help either.
So my simple question is: Why is my setting of spark.shuffle.partitions being ignored and how do I fix this?
I do feel a bit stupid now.
I need to set spark.sql.shuffle.partitions and not spark.shuffle.partitions.
I was expecting Spark to throw an error on getting a setting that doesn't exist and when that didn't happen I thought it was okay.
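For reference, a sketch of the session setup with the SQL-prefixed keys. The adaptive execution settings in the question are also missing the sql prefix; the Spark SQL keys are spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled. The names SPARK_SQL_CONF and build_session are hypothetical helpers, not part of PySpark.

```python
# Settings using the SQL-prefixed keys that Spark SQL reads.
SPARK_SQL_CONF = {
    "spark.sql.shuffle.partitions": "16",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def build_session(app_name="pyspark_test", master="local", conf=SPARK_SQL_CONF):
    """Create a SparkSession with the given settings applied."""
    # Imported here so the dict above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

After spark = build_session(), calling spark.conf.get("spark.sql.shuffle.partitions") is one way to check which value the SQL engine actually uses.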

Dilemma about Spark partitions

I am working on a project where I have to read S3 files (each about 3MB zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data which is written back to S3. The pyspark script uses 'xmltodict' python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive but is not my main concern right now.
Questions:
1. How do I know 'IF' I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What are the criteria that drive partitioning: the number of rows or columns, the data types, the actions taken in the script, or something else in the source data file? I read the source file into an RDD, convert it to a DF, and perform various operations such as adding columns, grouping data, and counting data. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node as I have decided to stick with Standalone CM. The 'xmltodict' library is installed on this node, but not on the Core node. It doesn't seem like it needs to be installed, or even python3 configured, on the Core node since I am not seeing any errors. Is that correct, and can somebody shed some light on this confusion? I tried to install the python libraries via a shell file as a bootstrap when I created the cluster, but it failed and, quite frankly, after trying it a few times I gave up.
3. Based on partitioning I think I am slightly confused on whether or not to use coalesce() or collect(). Again, the question is when to use and when not to?
Sorry for so many questions. Now that I have the pyspark script written, I am trying to work on the efficiencies.
Thanks
Partitioning is the mechanism by which data is divided into suitably sized chunks; multiple tasks then run, each processing one chunk. As you can see, this is the core of parallelism, and without it there is little point in using Spark (or any big data processing framework). Most file formats are splittable, and some, like Avro, Parquet and ORC, remain splittable when compressed. Others, such as zip and gzip files, are not splittable when compressed. Based on the size of the files being processed and whether they can be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case the data is zipped, so one file will be one partition and no more than 1 CPU can work on it at once. If the zip is small, that's OK, but if it is big, processing it will be slow.
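To see how many partitions Spark created, and to spread the rows of a non-splittable input over more tasks for the later steps, something like the following can be used. It is a sketch under the assumption that the data ends up in a DataFrame; the function name ensure_min_partitions and the threshold are hypothetical.

```python
def ensure_min_partitions(df, min_partitions):
    """Repartition df if it has fewer partitions than min_partitions."""
    current = df.rdd.getNumPartitions()
    if current < min_partitions:
        # repartition() shuffles the rows into the requested number of partitions
        return df.repartition(min_partitions)
    return df
```

For a file of a few MB this is usually unnecessary; the check mainly matters when a single compressed file is large.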

spark repartition / executor inconsistencies commandline vs jupyter

I wasn't really sure what to title this question -- happy for a suggested better summary
I'm beating my head trying to figure out why a dead simple spark job works fine from Jupyter, but from the command line is left with insufficient executors to progress.
What I'm trying to do: I have a large amount of data (<1TB) from which I need to extract a small amount of data (~1GB) and save as parquet.
Problem I have: when my dead-simple code is run from the command line, I only get as many executors as I have final partitions, which is ideally one given that the output is small. The exact same code works just fine in Jupyter on the same cluster, where it spreads >10k tasks across my entire cluster. The command-line version never progresses. Since it doesn't produce any logs beyond reporting lack of progress, I'm not sure where else to dig.
I have tried both python3 mycode.py and spark-submit mycode.py with lots of variations to no avail. My cluster has dynamicAllocation configured.
import findspark
findspark.init('/usr/lib/spark/')
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
data = spark.read.parquet(<datapath>).select(<fields>)
subset = [<list of items>]
spark.sparkContext.broadcast(subset)
data.filter(field.isin(subset)).coalesce(1).write.parquet("output")
** edit: original version mistakenly had repartition(1) instead of coalesce.
In this case, run from the command line, my process will get one executor.
In my logs, the only real hint I get is
WARN TaskSetManager: Stage 1 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
which makes sense given the lack of resources being allocated.
I have tried to manually force the number of executors using spark-submit runtime settings. In that case, it will start with my initial settings and then immediately start bringing them down until there is only one and nothing progresses.
Any ideas? thanks.
I ended up phoning a friend on this one...
the code that was running fine in JupyterHub, but not via the commandline was essentially a:
read parquet,
filter on some small field,
coalesce(1)
write parquet
I had assumed that coalesce(1) and repartition(1) should have the same results -- even though coalesce(N) and repartition(N) do not -- given that they all go to one partition.
According to my friend, Spark sometimes optimizes coalesce(1) to a single task, which was the behavior I saw. By changing it to repartition(1), everything works fine.
I still have no idea why it works fine in JupyterHub (after >20 experiments) and never on the command line (also after >20 experiments).
But, if you want to take your data lake to a data puddle this way, use repartition(1) or repartition(n), where n is small, instead of coalesce.
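A minimal sketch of that workaround, assuming the filtered DataFrame is small enough to write as one file. The function name write_small_result and its parameters are hypothetical. The likely reason for the difference is that coalesce() does not add a shuffle, so reducing to one partition can pull the upstream read and filter into a single task, whereas repartition() inserts a shuffle and leaves the upstream stage parallel; the Spark API documentation for coalesce describes this caveat.

```python
def write_small_result(df, output_path, num_files=1):
    """Write df as num_files parquet files while keeping upstream work parallel."""
    # repartition() adds a shuffle boundary, so the read and filter stages
    # before it still run with their original number of tasks.
    df.repartition(num_files).write.parquet(output_path)
```

In the code from the question this would replace the .coalesce(1).write.parquet("output") tail, for example write_small_result(data.filter(...), "output").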

PySpark: pull data to driver and then upload to dataframe

I am trying to create a pyspark dataframe from data stored in an external database. I use the pyodbc module to connect to the database and pull the required data, after which I use spark.createDataFrame to send my data to the cluster for analysis.
I run the script using --deploy-mode client, so the driver runs on the master node, but the executors can be distributed to other machines. The problem is pyodbc is not installed on any of the worker nodes (this is fine since I don't want them all querying the database anyway), so when I try to import this module in my scripts, I get an import error (unless all the executors happen to be on the master node).
My question is how can I specify that I want a certain portion of my code (in this case, importing pyodbc and querying the database) to run on the driver only? I am thinking something along the lines of
if __name__ == '__driver__':
    <do stuff>
else:
    <wait until stuff is done>
The imports in your Python driver script DO only run on the driver. The only time you will see errors on your executors about missing imports is if you reference some object or function from one of those imports inside a function that runs on the executors, for example one passed to an RDD/DataFrame operation. I would look carefully at any Python code you are running in RDD/DataFrame calls for unintended references. If you post your code, we can give you more specific guidance.
Also, routing data through your driver is usually not a great idea because it will not scale well. If you have lots of data you are going to try and force all through a single point which defeats the purpose of distributed processing!
Depending on which database you are using, there is probably a Spark connector that loads it directly into a DataFrame. If you are using ODBC, then maybe you are using SQL Server? In that case you should be able to use JDBC drivers, as in this post:
https://stephanefrechette.com/connect-sql-server-using-apache-spark/#.Wy1S7WNKjmE
This is not how Spark is supposed to work. Spark collections (RDDs or DataFrames) are inherently distributed. What you're describing is creating a dataset locally, by reading the whole dataset into the driver's memory, and then sending it over to the executors for further processing by creating an RDD or DataFrame out of it. That does not make much sense.
If you want to make sure that there is only one connection from spark to your database, then set the parallelism to 1. You can then increase the parallelism in further transformation steps.
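The JDBC route and the single-connection point can be sketched together. This assumes a JDBC driver jar for your database is available to Spark; the function names are hypothetical. The option keys url, dbtable, user, password and driver are the standard ones for Spark's JDBC data source, and when no partitioning options are given the table is read as a single partition.

```python
def jdbc_options(url, table, user, password, driver=None):
    """Build the option dict for Spark's JDBC data source."""
    options = {"url": url, "dbtable": table, "user": user, "password": password}
    if driver is not None:
        options["driver"] = driver  # JDBC driver class name
    return options


def read_table(spark, options):
    """Load a table through JDBC; with no partitioning options it is one partition."""
    return spark.read.format("jdbc").options(**options).load()
```

The result can then be spread out for later steps with df.repartition(n), which increases the parallelism without opening more database connections.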
