Fix PlayFramework Slick Error - java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend rejected from slick.util.AsyncExecutor - apache-spark

UPDATE 2
After more investigation, some code refactoring and many trials, we have more insight. The issue seems related to concurrent reads and writes, i.e. trying to read from the DB while a write (update) occurs at the same time. We are not sure whether this is managed by Postgres or by Slick.
We are facing a random error with Slick 3.
Here is the brief history.
We are building an ETL pipeline with Spark, MinIO as S3 storage (MinIO is an open-source alternative to AWS S3) and Delta tables. The pipeline has a web interface built with Play Framework (Scala).
The cluster consists of:
* 7 worker nodes with 16 cores and 64 GB RAM each, configured in client mode
* 1 storage node
* spark.default.parallelism and spark.sql.shuffle.partitions are both set to 600
* spark.dynamicAllocation is disabled
App data (session data, user data, and some other records) is saved in PostgreSQL using the Slick 3 mapper.
The size of the processed data is growing exponentially and is now around 50 GB. (In production, we aim to process terabytes of data.)
The data processing flow consists essentially of aggregating data with group-by and saving the result to S3 storage, following these steps:
1. Read CSV data from storage and create the read_df dataframe
2. Read main_db from storage and create main_df
3. Merge read_df with main_df
4. GroupBy a specific key (let's say user_id)
5. Save records to storage to replace main_db. To guarantee data integrity, this stage is split into three phases:
   * Write records to a temp object referenced by date and time
   * Back up the existing database object main_db (copy to another object)
   * Rename the temp object to main_db (copy and delete)
Then update the PostgreSQL history table with processed job information such as time_started, time_ended, number_of_rows_processed, size, etc. That is where the issue occurs.
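The three phases of step 5 can be sketched as follows. This is only an illustration in Python on a local directory standing in for the object store; the function name `publish`, the `.tmp-` and `.bak` suffixes and the file layout are assumptions made for the example, not the pipeline's actual code.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def publish(records: str, root: Path, name: str = "main_db") -> Path:
    """Replace root/name with `records` using write-temp, backup, rename."""
    main = root / name
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    temp = root / f"{name}.tmp-{stamp}"   # hypothetical naming scheme
    backup = root / f"{name}.bak"         # hypothetical naming scheme

    # Phase 1: write the new records to a temp object named by date and time.
    temp.write_text(records)

    # Phase 2: back up the existing object, if there is one.
    if main.exists():
        shutil.copy2(main, backup)

    # Phase 3: rename temp to main_db. On a local filesystem this is one
    # step; on an object store it is a copy followed by a delete.
    temp.replace(main)
    return main


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    publish("v1", root)
    publish("v2", root)
    print((root / "main_db").read_text(), (root / "main_db.bak").read_text())
```

If a failure happens during phase 1 or 2, main_db is still the previous version; only phase 3 changes what readers see.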
We are facing a random error, and we noticed it happens when a shuffle occurs after the groupBy. Sometimes we end up with 1000+ partitions. In those cases, step 5 does not complete and throws the following exception:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#291fad07 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#7345bd2c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 26]
The completed tasks value is sometimes lower and sometimes reaches hundreds.
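One thing the message itself states: the pool is in state Terminated with pool size 0, which for a java.util.concurrent.ThreadPoolExecutor means the executor had already been shut down when the task was submitted. It does not say what shut it down. The same pattern can be reproduced with Python's standard library, purely as an analogy (this is not Slick code, and the function name is made up for the example):

```python
from concurrent.futures import ThreadPoolExecutor


def submit_after_shutdown() -> str:
    pool = ThreadPoolExecutor(max_workers=2)

    # Work submitted while the pool is alive completes normally.
    assert pool.submit(lambda: "ok").result() == "ok"

    # Something else (a stop hook, a reload, a finally block) shuts it down...
    pool.shutdown(wait=True)

    # ...and a late caller that still holds a reference tries to use it.
    try:
        pool.submit(lambda: "late")
    except RuntimeError as exc:
        return f"rejected: {exc}"
    return "accepted"


print(submit_after_shutdown())
```

In this situation a larger queue or more threads changes nothing, because the rejection comes from the pool being closed rather than from it being full.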
Below is the code that is executed in step 5:
Await.result(mainDbSvc.update("main.delta", mainDb), 600.seconds)
Googling the exception (and we did a ton of research on this), we found that it can happen because connections are closed before the code is executed when using transactionally. Note that we don't use transactionally in our code. Below is the code executed when calling update():
val updateQuery = this.mainDbTable.filter(_.id === id).update(db)
dbConfig.db.run(updateQuery)
This is the current Slick configuration:
connectionPool = "HikariCP"
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
numThreads = 100
Initially, before the errors started, it was:
numThreads = 20
maxConnections = 20
We tried queueSize = 2000, but that did not fix it.
Does anyone have a solution for us?
Furthermore, we suspect step 5 to be responsible for the closed-connection issue, because it did not happen when that step was turned off. What is the link between the threads that read/write from S3 storage (on another server) and the Hikari (Slick) processes that are killed?
And is there a better way to guarantee data integrity (in case of failure while writing data) without this time-consuming copy-restore-and-delete process?
Note:
After aggregation, we repartition() to reduce the number of partitions and avoid skewed data before saving results. coalesce() made the driver JVM crash with OOM.
main_df and read_df do not have the same schema, so overwriting with Delta's built-in method is not possible.
The Await timeout of the update() function was 10s; after the issue appeared we increased it, but that did not fix it.
UPDATE
This is the full trace of exception.
An error has occurred: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#313a2647 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#5c6d1059 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at slick.util.AsyncExecutor$$anon$1$$anon$4.execute(AsyncExecutor.scala:161)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
at modules.load.daos.slick3.JobsDao.update(JobsDao.scala:180)
at modules.load.services.JobService.update(JobService.scala:50)
at modules.load.models.JobSnippet$.updateJob(Job.scala:113)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6(BuildController.scala:169)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6$adapted(BuildController.scala:151)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Related

Spark Web UI showing Job SUCCEEDED but Tasks Succeeded Less than Total

In the "Details for Job n" section, the UI shows "Status: SUCCEEDED", however one of the Stages shows 22029 succeeded Tasks out of 59400 total tasks. I'm running this through a Python Jupyter notebook running Spark 3.0.1, and I haven't stopped the Spark context yet so the application is still running. In fact, the Stages Tab shows the stage in question as still active. I don't understand how the stage could still be active, yet the Job is listed as Completed and Successful in the UI.
The relevant code (I think) is below, where I try to parallelize as much as possible many SQL queries and then union the result dataframes together. Lastly, I'm writing them to cloud storage in parquet.
EDIT: I also can see the same information from the REST API using the endpoints documented here in the docs, and those values are the same as I see in the Web UI.
There are no jobs appearing in the Jobs tab as failed, and I believe that ultimately the data is successfully written and correct.
I have seen in the logs many instances of Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. Because of that, I am experimenting with the parameter spark.scheduler.listenerbus.eventqueue.capacity to increase it and see if that results in a difference in the reporting of Succeeded and Total tasks for that stage.
Upon increasing spark.scheduler.listenerbus.eventqueue.capacity from default of 10000 to 65000, there seems to be a corresponding decrease in events dropped, as well as an increase in Succeeded Tasks reported for that stage, improving to ~ 47K from ~ 22K. I have also noticed that the difference in Succeeded and Total tasks for that stage is on the order of the number of dropped events in the log so I will see if limiting the dropped events can resolve the discrepancy.
def make_df(query: str):
    df = spark.sql(query)
    return df

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df_list = list(map(make_df, queries))
df = functools.reduce(lambda x, y: x.union(y), df_list)
df.repartition("col1", "col2")\
    .write.partitionBy("col1", "col2")\
    .mode("overwrite")\
    .parquet(path)
Why would my job be reporting as Successful when there are still tasks remaining that aren't successful?
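The dropped-event observation above can be illustrated with a small model: a bounded queue that drops on overflow, filled faster than a listener drains it. This is a sketch of the general mechanism, not Spark's listener bus; the function `replay` and the per-tick rates are invented for the example, while 59400, 10000 and 65000 are the figures quoted in the question.

```python
import queue


def replay(task_events: int, capacity: int, drained_per_tick: int, posted_per_tick: int):
    """Post task-end events into a bounded queue that a slower listener drains."""
    bus = queue.Queue(maxsize=capacity)
    posted = counted = dropped = 0
    while posted < task_events or not bus.empty():
        # Scheduler side: post events, dropping them when the queue is full.
        for _ in range(min(posted_per_tick, task_events - posted)):
            posted += 1
            try:
                bus.put_nowait("task_end")
            except queue.Full:
                dropped += 1
        # Listener side: drain at its own, slower rate (assumed rates).
        for _ in range(drained_per_tick):
            if bus.empty():
                break
            bus.get_nowait()
            counted += 1
    return counted, dropped


for capacity in (10000, 65000):
    counted, dropped = replay(59400, capacity, drained_per_tick=50, posted_per_tick=100)
    print(capacity, counted, dropped)
```

In this model the listener's count falls short of the total by exactly the number of dropped events, which matches the pattern described: the gap between Succeeded and Total is on the order of the dropped events, and it shrinks as the capacity grows.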

AWS Glue Job fails at create_dynamic_frame_from_options when reading from s3 bucket with lot of files

The data inside my s3 bucket looks like this...
s3://bucketName/prefix/userId/XYZ.gz
There are around 20 million users, and within each user's subfolder, there will be 1 - 10 files.
My glue job starts like this...
datasource0 = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://bucketname/prefix/"], 'useS3ListImplementation':True, 'recurse':True, 'groupFiles': 'inPartition', 'groupSize': 100 * 1024 * 1024}, format="json", transformation_ctx = "datasource0")
I have attempted a bunch of optimizations such as groupFiles, groupSize and useS3ListImplementation, as shown above.
I'm using G.2X worker instances to provide the maximum memory for the jobs.
However, this job fails consistently on that first line with 'SDKClientException, Unable to execute HTTP request: Unsupported record version Unknown-0.0', and with 'Unable to execute HTTP request: Received close_notify during handshake' when useS3ListImplementation is enabled.
From the monitoring, I observe that this job is using only one executor, though I have allocated 10 (or 20 in some runs), and driver memory is growing to 100%, and CPU hovers around 50%.
I understand my s3 folders are not organized the best way. Given this structure, is there a way to make this glue job work?
My objective is to transform the json data inside those historical folders to parquet in one go. Any better way of achieving this is also welcome.

Redis memory usage continues to climb when using task.forget()

I have a mysql database which stores thousands of stock OHLC data for 2 years. Data is read from MySQL in the form of pandas dataframes and then submitted to celery in large batch jobs which eventually lead to "OOM command not allowed when used memory > 'maxmemory'".
I have added the following celery config options. These options have allowed my script to run longer however redis inevitably reaches 2gb memory and celery throws OOM errors.
result_expires = 30
ignore_result = True
worker_max_tasks_per_child = 1000
On the redis side I have tried playing with the maxmemory policy, using both allkeys-lru and volatile-lru. Neither seems to make a difference.
When celery hits the OOM error, the redis cli shows max memory usage and no keys?
# Memory
used_memory:2144982784
used_memory_human:2.00G
used_memory_rss:1630146560
used_memory_rss_human:1.52G
used_memory_peak:2149023792
used_memory_peak_human:2.00G
used_memory_peak_perc:99.81%
used_memory_overhead:2144785284
used_memory_startup:987472
used_memory_dataset:197500
used_memory_dataset_perc:0.01%
allocator_allocated:2144944880
allocator_active:1630108672
allocator_resident:1630108672
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:0.76
allocator_frag_bytes:18446744073194715408
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:37888
mem_fragmentation_ratio:0.76
mem_fragmentation_bytes:-514798320
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:2143797684
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
And there are zero keys?
127.0.0.1:6379[1]> keys *
(empty list or set)
When I run this same code in subsets of 200*5 requests (then terminate) everything runs successfully. Redis memory usage caps around 100mb and when the python process terminates all the memory usage drops as expected. This leads me to believe I could probably implement a handler to do 200*5 requests at a time however I suspect that the python process (my script) terminating is what is actually freeing memory in celery/redis...
I would like to avoid subsetting this and process everything in MySQL in one shot. About 5000 pandas dataframes * 5 tasks total.
I do not understand why the memory usage in redis continues to grow when I am forgetting all results immediately following retrieving them?
Here is an example for how this is done in my code:
def getTaskResults(self, caller, task):
    # Wait for the results and then call back
    # Attach this Results object in the callback along with the data
    ready = False
    while not ready:
        if task.ready():
            ready = True
            data = pd.read_json(task.get())
            data.sort_values(by=['Date'], inplace=True)
            task.forget()
    return caller.resultsCB(data, self)
This is probably my ignorance with redis but if there are no keys how is it consuming all that memory, or how can I validate what is actually consuming that memory in redis?
Since I store the taskID of every call to celery in an object I have confirmed that trying to do a task.get after adding in task.forget throws an error.
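One way to see where the memory sits is to compare the fields of the INFO output above against used_memory. The sketch below parses a subset of those exact values and ranks them; `parse_info` and `breakdown` are helper names invented for this example, and the choice of fields to compare is an assumption.

```python
# A subset of the INFO memory section quoted above.
INFO_MEMORY = """\
used_memory:2144982784
used_memory_overhead:2144785284
used_memory_startup:987472
used_memory_dataset:197500
mem_clients_normal:2143797684
mem_clients_slaves:0
mem_replication_backlog:0
mem_aof_buffer:0
"""


def parse_info(text: str) -> dict:
    """Turn `key:value` lines from INFO into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        if value.lstrip("-").isdigit():
            fields[key] = int(value)
    return fields


def breakdown(fields: dict) -> list:
    """Share of used_memory held by keys and by each overhead component."""
    total = fields["used_memory"]
    parts = ["used_memory_dataset", "mem_clients_normal", "mem_clients_slaves",
             "mem_replication_backlog", "mem_aof_buffer", "used_memory_startup"]
    shares = [(name, fields.get(name, 0) / total) for name in parts]
    return sorted(shares, key=lambda item: item[1], reverse=True)


for name, share in breakdown(parse_info(INFO_MEMORY)):
    print(f"{name:26s} {share:8.3%}")
```

With the numbers posted, almost all of used_memory is reported under mem_clients_normal (memory attributed to client connections and their buffers) and almost none under used_memory_dataset, which is consistent with keys * returning nothing.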

Spark and 100000 sequential HTTP calls: driver vs workers

I have to do 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50KB of data, and I have to wait 1 second between requests so as not to exceed the API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarounds
1. Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items) and save each dataset to S3 with the help of Spark. The dataset does not need to be kept after it is saved.
2. Create a dataset from all 100000 URLs (moving all further computations to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential because of the 1 second delay. But:
* Is it bad to make 100_000 HTTP calls from the driver/master node?
* Is it more efficient to create/save one dataset of 100_000 * 5_000 items than to create/save 100_000 small datasets of 5_000 items each?
* Each time I create a dataset from an HTTP response, I move the response to a worker and then save it to S3, right? That is double shuffling then...
The second option
Actually it won't benefit from parallel processing, since I have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't heavy) off the driver. But:
* Is it worth moving the computations to the workers?
* Is it a good idea to make API calls inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 sizes matter is in the bills from AWS and in follow-up Spark queries: with anything you use (Spark, PySpark, SQL, etc.), many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. sleeping in the worker
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
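The incremental-save idea can be sketched on a single machine like this. It is a minimal illustration with assumptions called out: `run`, the `done.jsonl` progress file and the `response-NNNNNN.json` naming are invented here, `fetch` is a caller-supplied function (no real HTTP client is used), and it checkpoints after every request rather than every 15-20 minutes.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable


def run(urls: Iterable[str], fetch: Callable[[str], str], out_dir: Path,
        interval_s: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> int:
    """Fetch urls one at a time, skipping any already recorded in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done_file = out_dir / "done.jsonl"  # hypothetical progress log
    done = set()
    if done_file.exists():
        lines = done_file.read_text().splitlines()
        done = {json.loads(line)["url"] for line in lines if line}

    fetched = 0
    with done_file.open("a") as log:
        for index, url in enumerate(urls):
            if url in done:
                continue  # saved by an earlier run
            body = fetch(url)
            (out_dir / f"response-{index:06d}.json").write_text(body)
            # Record progress only after the response is on disk.
            log.write(json.dumps({"url": url, "index": index}) + "\n")
            log.flush()
            fetched += 1
            sleep(interval_s)  # stay under the rate limit
    return fetched
```

Calling run again with the same URL list after a crash skips everything listed in the progress file and carries on from the first unfinished URL. Passing a no-op `sleep` makes it easy to test recovery with a small window before the full run.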

Elastic search could not write all entries: May be es was overloaded

I have an application where I read csv files and do some transformations and then push them to elastic search from spark itself. Like this
input.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{date}/" + type).save()
I have several nodes, and on each node I run 5-6 spark-submit commands that push to Elasticsearch.
I frequently get errors like this:
Could not write all entries [13/128] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.transport.TransportService$7#32e6f8f8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#4448a084[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 451515]]
My Elasticsearch cluster has the following stats:
Nodes: 9 (1 TB disk space, RAM >= 15 GB), more than 8 cores per node
I have modified the following parameters for Elasticsearch:
spark.es.batch.size.bytes=5000000
spark.es.batch.size.entries=5000
spark.es.batch.write.refresh=false
Could anyone suggest what I can fix to get rid of these errors?
This occurs because bulk requests arrive at a rate greater than the Elasticsearch cluster can process, so the bulk request queue fills up.
The default bulk queue size is 200.
Ideally, you should handle this on the client side:
1) by reducing the number of spark-submit commands running concurrently
2) by retrying on rejections, tuning es.batch.write.retry.count and es.batch.write.retry.wait
Example:
es.batch.write.retry.wait = "60s"
es.batch.write.retry.count = 6
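What a retry count and a retry wait mean on the client side can be sketched as follows. This is a generic illustration, not the connector's implementation: `write_with_retry` and the `send` callable (which returns the entries the server rejected) are hypothetical names introduced for the example.

```python
import time
from typing import Callable, List, Sequence


def write_with_retry(send: Callable[[Sequence[dict]], List[dict]],
                     batch: Sequence[dict],
                     retry_count: int = 6,
                     retry_wait_s: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Send a batch; resend only rejected entries, up to retry_count times.

    Returns whatever is still rejected after the last attempt.
    """
    pending = list(batch)
    for attempt in range(retry_count + 1):
        pending = send(pending)  # hypothetical: returns the rejected entries
        if not pending:
            return []
        if attempt < retry_count:
            sleep(retry_wait_s)  # give the bulk queue time to drain
    return pending
```

Waiting between attempts only helps if the total load drops in the meantime, which is why it is paired with running fewer concurrent writers.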
On the Elasticsearch cluster side:
1) check whether there are too many shards per index and try reducing the number.
This blog has a good discussion on criteria for tuning the number of shards.
2) as a last resort, increase the queue size of the bulk thread pool (thread_pool.bulk.queue_size on versions where the pool is named bulk).
Check this blog with an extensive discussion on bulk rejections.
The bulk queue in your ES cluster is hitting its capacity (200). Try increasing it. See this page for how to change the bulk queue capacity.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
Also check this other SO answer, where the OP had a very similar issue that was fixed by increasing the bulk pool size.
Rejected Execution of org.elasticsearch.transport.TransportService Error

Resources