PySpark does not release memory even after its operation has completed - apache-spark

As per my current requirement, I am using PySpark with Flask APIs (a Python framework), and I create the Spark session when the Flask API server starts up. I then use that Spark session for heavyweight computing on each API call. What happens is that on each API request, memory usage increases even after the operation is done.
I have tried the following after each API call:
1: spark.catalog.clearCache()
2: df.unpersist()
Even so, my memory usage gradually increases.
Can anyone help me get out of this issue? I have tried several different configurations, but my memory usage still does not go down.

unpersist() is by default unpersist(blocking=False), which means it only sets a flag on your DataFrame saying that Spark can delete the cached data whenever possible. unpersist(blocking=True), on the other hand, blocks your process until the data is actually removed.
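As a minimal sketch of the cleanup the question describes (the Flask route, input path, and aggregation below are illustrative, not from the original post), releasing the cache synchronously after each request looks roughly like this:

from flask import Flask, jsonify
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.appName("flask-api").getOrCreate()

@app.route("/heavy")
def heavy():
    # Hypothetical heavy computation; input path and aggregation are placeholders.
    df = spark.read.parquet("/data/events").cache()
    result = df.groupBy("key").count().collect()
    # blocking=True waits until the cached blocks are actually removed,
    # instead of only flagging them for eventual cleanup.
    df.unpersist(blocking=True)
    spark.catalog.clearCache()
    return jsonify(count=len(result))

With blocking=True the request only returns once the cached blocks are gone, which makes memory behaviour easier to reason about between API calls.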

Related

Spark BroadcastHashJoin operator and Dynamic Allocation enabled

In my company, we have the following scenario:
we have dynamic allocation enabled by default for every data pipeline we write, so we can save some costs and enable resource sharing among the different executions
also, most of the queries we run perform joins, and Spark has some interesting optimizations around them, like the join strategy change that occurs when Spark identifies that one side of the join is small enough to be broadcast. This is what we call a BroadcastHashJoin, and we have lots of queries with this operator in their respective query plans
last but not least, our pipelines run on EMR clusters in client mode.
We are having a problem that happens when the YARN queue (the RM on EMR) where a job was submitted is full and there are not enough resources to allocate new executors for the application. Since the driver process runs on the machine that submitted the application (client mode), the broadcast job starts and, after 300s, fails with the broadcast timeout error.
Running the same job on a different schedule (at a time when queue usage is not as high), it was able to run successfully.
My questions are all related to how these three different things work together (dynamic allocation enabled, BHJ, client mode). If you haven't enabled dynamic allocation, it's easy to see that the broadcast operation will occur for every executor that was requested initially through the spark-submit command. But if we enable dynamic allocation, how will the broadcast operation occur for the executors that are dynamically allocated later? Will the driver have to send the broadcast again to every new executor? Will they be subject to the same 300-second timeout? Is there a way to prevent the driver (in client mode) from starting the broadcast operation until it has enough executors?
Source: BroadcastExchangeExec source code here
PS: we have already tried setting the spark.dynamicAllocation.minExecutors property to 1, but with no success. The job still started with only the driver allocated and errored out after the 300s.
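For reference, these are the configuration knobs in play; a minimal sketch, assuming they are set when the session is built (values are illustrative, and depending on your deployment they may belong in spark-submit --conf instead):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bhj-with-dynamic-allocation")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")   # the property mentioned above
    .config("spark.sql.broadcastTimeout", "600")            # the 300s default behind the error
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")   # or avoid broadcast joins entirely
    .getOrCreate()
)

Raising spark.sql.broadcastTimeout only buys time while executors are being allocated; setting spark.sql.autoBroadcastJoinThreshold to -1 avoids the broadcast strategy altogether, at the cost of a sort-merge join.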

Can I run multiple Spark History Servers in parallel?

We use Spark History Server to analyze our Spark runs, and we have a lot of them. This causes a high load on the server, since logs are analyzed lazily (i.e. on the first request).
I was wondering if we can scale out this service, or just scale it up? Is it version-dependent?
Since it's a concurrency issue, I'd rather get a trustworthy answer than just run it and hope for the best.
Thanks!

Do I need to/how to clean up Spark Sessions?

I am launching Spark 2.4.6 in a Python Flask web service. I'm running a single Spark Context and I have also enabled FAIR scheduling.
Each time a user makes a request to one of the REST endpoints, I call spark = sparkSession.newSession() and then execute various operations using Spark SQL in this somewhat isolated environment.
My concern is, after 100 or 10,000 or a million requests with an equal number of new sessions, at some point am I going to run into issues? Is there a way to let my SparkContext know that I don't need an old session anymore and that it can be cleared?
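For illustration, a minimal sketch of the setup described above (route name and query are hypothetical); the open question is whether the per-request sessions ever need explicit cleanup:

from flask import Flask, jsonify
from pyspark.sql import SparkSession

app = Flask(__name__)
root_spark = SparkSession.builder.appName("flask-service").getOrCreate()  # one context for the whole service

@app.route("/query")
def query():
    session = root_spark.newSession()   # isolated SQL conf and temp views, shared SparkContext
    rows = session.sql("SELECT 1 AS answer").collect()
    # The concern above is exactly this point: `session` simply goes out of scope here;
    # there is no obvious per-session cleanup call short of stopping the whole context.
    return jsonify(answer=rows[0]["answer"])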

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record, or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
These seem like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming answer and referencing other material based on my own reading/research, a limited response:
On logging and progress checking of stages, tasks, and jobs:
Global logging via log4j, tailored using the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining logging requirements for your own purposes, at the Spark level.
Programmatically, by using a Logger: import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs; see this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api (a short sketch follows this list).
There is also a SparkListener that can be used.
The Web UI at http://<host>:8080 to see progress.
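To make the REST API item above concrete, here is a hedged sketch against Spark's monitoring REST API, which a running application exposes on the driver UI port (4040 by default; the host and port below are assumptions about your deployment). Note this is a different endpoint from the submission API described in the linked blog.

import requests

DRIVER_UI = "http://localhost:4040"  # assumed driver host and UI port

apps = requests.get(f"{DRIVER_UI}/api/v1/applications").json()
app_id = apps[0]["id"]

# Print per-job status and task progress for the running application.
for job in requests.get(f"{DRIVER_UI}/api/v1/applications/{app_id}/jobs").json():
    print(job["jobId"], job["status"],
          f"{job['numCompletedTasks']}/{job['numTasks']} tasks")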
Recovery depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects or memory usage issues, and things like serious database duplicate-key errors, depending on the API used.
See "How does Apache Spark handle system failure when deployed in YARN?". Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done on your side.
Things outside of Spark's domain and control mean the job is over, e.g. memory issues that may result from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store that hits a duplicate-key error, or JDBC connection outages. These cases mean re-execution.
As an aside, some failures are not logged as such, e.g. duplicate key inserts on some Hadoop storage managers.
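On the caching/checkpointing part of the question, a minimal sketch (paths, table name, and connection details are illustrative): checkpointing truncates the lineage so that a downstream failure does not force Spark to re-read everything from MySQL, but it is not automatic job-level recovery.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-checkpoint").getOrCreate()
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints")  # assumed location

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/warehouse")  # assumed connection details
    .option("dbtable", "events")                           # driver/credentials omitted
    .load()
)

# checkpoint() materializes the result to the checkpoint dir and truncates the lineage,
# so later failures replay from here rather than from the MySQL read.
aggregated = df.groupBy("account_id").count().checkpoint()
aggregated.write.mode("overwrite").json("s3://my-bucket/output/")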

Spark: Writing to DynamoDB, limited write capacity

My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost implications, how can I limit the Spark application to write at a regulated speed?
Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?
I already have auto-scaling enabled but don't want to increase it any further.
Please suggest other ways of handling this.
EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.
Bucket scheduler
The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design for ensuring an application does not breach API limits. I have used this design successfully in very similar situations. You may find that someone has written a library you can use for this purpose.
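As a hedged sketch of that idea inside foreachPartition (the table name, item schema, and per-partition rate below are illustrative, not from the question):

import time
import boto3

WRITES_PER_SECOND = 5  # per-partition budget; tune to your provisioned capacity

def write_partition(rows):
    table = boto3.resource("dynamodb").Table("my_table")  # assumed table name
    tokens, last = 0.0, time.monotonic()
    for row in rows:
        # Refill the bucket in proportion to elapsed time, capped at the budget.
        now = time.monotonic()
        tokens = min(WRITES_PER_SECOND, tokens + (now - last) * WRITES_PER_SECOND)
        last = now
        if tokens < 1.0:                                   # out of tokens: wait for one
            time.sleep((1.0 - tokens) / WRITES_PER_SECOND)
            last, tokens = time.monotonic(), 1.0
        tokens -= 1.0
        table.put_item(Item=row.asDict())                  # assumes rows map directly to items

# usage: df.rdd.foreachPartition(write_partition), where df is the DataFrame to write

With a single partition, as suggested in the question, this caps the overall write rate; with more partitions, the per-partition budget has to be divided accordingly.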
DynamoDB retry
Another (less attractive) option would be to increase the retry behaviour on your DynamoDB connection. When your write does not succeed because provisioned throughput is exceeded, you can essentially instruct your DynamoDB SDK to keep retrying for as long as you like. Details in this answer. This option may appeal if you want a 'quick and dirty' solution.
