I have a mysql database which stores thousands of stock OHLC data for 2 years. Data is read from MySQL in the form of pandas dataframes and then submitted to celery in large batch jobs which eventually lead to "OOM command not allowed when used memory > 'maxmemory'".
I have added the following celery config options. These options have allowed my script to run longer however redis inevitably reaches 2gb memory and celery throws OOM errors.
result_expires = 30
ignore_result = True
worker_max_tasks_per_child = 1000
From the redis side I have tried playing with the maxmemory policy using both allkeys-lru and volatile-lru. Neither seem to make a difference.
When celery hits the OOM error the redis cli shows max memory usage and no keys?
# Memory
used_memory:2144982784
used_memory_human:2.00G
used_memory_rss:1630146560
used_memory_rss_human:1.52G
used_memory_peak:2149023792
used_memory_peak_human:2.00G
used_memory_peak_perc:99.81%
used_memory_overhead:2144785284
used_memory_startup:987472
used_memory_dataset:197500
used_memory_dataset_perc:0.01%
allocator_allocated:2144944880
allocator_active:1630108672
allocator_resident:1630108672
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:2147483648
maxmemory_human:2.00G
maxmemory_policy:allkeys-lru
allocator_frag_ratio:0.76
allocator_frag_bytes:18446744073194715408
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:37888
mem_fragmentation_ratio:0.76
mem_fragmentation_bytes:-514798320
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:2143797684
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
And there are zero keys?
127.0.0.1:6379[1]> keys *
(empty list or set)
When I run this same code in subsets of 200*5 requests (then terminate) everything runs successfully. Redis memory usage caps around 100mb and when the python process terminates all the memory usage drops as expected. This leads me to believe I could probably implement a handler to do 200*5 requests at a time however I suspect that the python process (my script) terminating is what is actually freeing memory in celery/redis...
I would like to avoid subsetting this and process everything in MySQL in one shot. About 5000 pandas dataframes * 5 tasks total.
I do not understand why the memory usage in redis continues to grow when I am forgetting all results immediately following retrieving them?
Here is an example for how this is done in my code:
def getTaskResults(self, caller, task):
#Wait for the results and then call back
#Attach this Results object in the callback along with the data
ready = False
while not ready:
if task.ready():
ready = True
data = pd.read_json(task.get())
data.sort_values(by=['Date'], inplace=True)
task.forget()
return caller.resultsCB(data, self)
This is probably my ignorance with redis but if there are no keys how is it consuming all that memory, or how can I validate what is actually consuming that memory in redis?
Since I store the taskID of every call to celery in an object I have confirmed that trying to do a task.get after adding in task.forget throws an error.
Related
UPDATE 2
More investigation, some code refactoring and many trials later, We have more insight. The issue seems relative to concurrent read-write. IE When trying to read from DB while in the same time write (update) occurs. Not sure if this is managed by Postgres or Slick.
We are facing a random error with Slick 3.
That is the brief history.
We are building an ETL pipeline Spark, Minio as S3 Storage (Minio is an open-source alternative to AWS) and Delta tables. The pipeline has a web interface created using Play Framework (Scala).
Cluster is consisted of:
7 workers nodes of 16 cores and 64GB RAM each configured in client
mode.
1 Storage node
spark.defaut.parallelism and spark.sql.shuffle.partitions are both set to 600
spark.dynamicAllocation is disabled
App data (session data, users data, and some other records in) is saved in PostgreSQL using Slick 3 mapper.
Data processed size is exponentially growing, and now it is around 50 GB. (In production, we aim to process Terabytes of data)
Data processing flow consists essentially in data Aggregation using group-by and saving data into S3 Storage following these steps
Read CSV data from Storage and create read_df dataframe
Read main_db from dtorag and create main_df
Merge read_df with main_df
GroupBy a specfic Key (let’s say user_id)
Save records to Storage to replace main_db. To guarantee data integrity, this stage is split into three phases:
Write records to a temp object referenced by date time
Backup Existing database object main_db (copy to another object)
Rename temp object to main_db (copy and delete)
Then Update PostgreSQL history table with processed job information such as:
time_started, time_ended, number_of_rows_processed, size, etc. And that is where issue occurs.
We are facing a random error, and we noticed it happens when shuffle occurs after groupby. Sometimes, we end up with 1000+ partitions. In those cases, Step 5 is not completed and gives the following Exception:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#291fad07 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#7345bd2c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 26]
completed tasks value sometimes is lower, sometime reaches hundreds.
Below is code that is executed in step 5
Await.result(mainDbSvc.update("main.delta", mainDb), 600.seconds)
Googling the exception (And we did a ton of research about this), we found that it could be because connections are closed before code is executed when using transactionally. Notice, we don’t use transactionnally in our code. Below is code executed when calling update()
val updateQuery = this.mainDbTable.filter(_.id === id).update(db)
dbConfig.db.run(updateQuery)
That is the actual slick configuration:
connectionPool = "HikariCP"
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
numThreads = 100
Initially, before errors starting, it was
numThreads = 20
maxConnections = 20
We tried queueSize = 2000 but not fixed.
Can someone have a solution for us?
Furthermore, we suspect the step5 to be responsible for that connection closed issue because that did not happen when it is turned off. What is the link between threads that read/write from S3 Storage (on another server) and hikari (slick) processes that are killed?
And is there a better way to guarantee data integrity (in case of failure while writing data) without this time-consuming copy-restore-and-delete process ?
Note:
After Aggregation, we repartition() to reduce partitions and avoid skew data before saving results. Coalesce() made driver JVM craches with OOM.
main_df and read_df do not have the same schema so, overwritting using delta in built-in method is not possible.
Update() functions Await time was 10s but following issue, we increased it, but that did not fix the issue.
UPDATE
This is the full trace of exception.
An error has occurred: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#313a2647 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#5c6d1059 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at slick.util.AsyncExecutor$$anon$1$$anon$4.execute(AsyncExecutor.scala:161)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
at modules.load.daos.slick3.JobsDao.update(JobsDao.scala:180)
at modules.load.services.JobService.update(JobService.scala:50)
at modules.load.models.JobSnippet$.updateJob(Job.scala:113)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6(BuildController.scala:169)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6$adapted(BuildController.scala:151)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
I am working with a large CSV (~60GB; ~250M rows) with Dask in Jupyter.
The first thing I want to do with the DF after loading it is to concatenate two string columns. I can do so successfully, but I noticed that cell execution time does not seem to decrease with higher workers counts (I tried 5, 10, and 20 on a machine with 64 logical cores). If anything, every five or so workers seem to add an extra minute to execution time.
Meanwhile, the progress bar of Dask's dashboard suggests that the task scales well with worker count. At 5 workers the task finishes (ac. to the dashboard) in about 10-15 min. At 20 workers the stream visualisation suggests task completion in roughly 3-5 min. But cell execution time remains around 25 min, i.e. in the 5-worker case the cell will appear to be hanging for an extra 10-15 min. after the stream has finished; in the 20-worker case -- for 20-22 more min., with no evidence of worker activity as far as I can see.
This is the code that I'm running:
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=20)
client = Client(cluster)
df = dd.read_csv('df_name.csv', dtype={'col1': 'object', 'col2': 'object'})
with ProgressBar():
df["col_merged"] = df["col3"]+df["col4"]
df = df.compute()
Python version: 3.9.1
Dask version: 2021.06.2
What am I missing? Could this simply be overhead from having Dask to coordinate several workers?
To add to #SultanOrazbayev 's answer, the specific thing that's taking time after the tasks have all been done, is copying data from the workers into your client process to assemble the single in-memory dataframe that you have asked for. This is not a "task", as all the computing has already happened, and does not parallelise well, because the client is a single thread pulling data from the workers.
As with the comment above: if you want to achieve parallelism, you need to load the data in workers (which dd.read_csv does) and act on them in workers o get your result. You should on .compute() relatively small things. Conversely, if your data first comfortably into memory, there was probably nothing to be gained by having dask involved at all, just use pandas.
Running
df = df.compute()
will attempt to load all the 250M rows into memory. If this is feasible with your machine, you will still spend a lot of time because each worker is going to send their chunk, so there will be a lot of data transfer...
The core idea is to bring into memory only the results of the reduced calculations, and distribute the workload among the workers until then.
I have a wordlist of 11 character which I want to append in a url. After some modification in request.js,I am able to run 5 million size wordlist in requestlist array.It start throwing JavaScript heap memory error after going higher.I have billion of size of wordlist to process. I can able to generate my wordlist with js code.5 million entry finishes up in an hour,due to higher server capacityR I possess. Requestlist is a static variable so I cant add again in it.How can I run it infinitely for billions of combination.If any cron script can help then I am open to this also.
It would be better to use RequestQueue for such a high amount of Requests. The queue is persisted to disk as an SQLite database so memory usage is not an issue.
I suggest adding let's say 1000 requests into the queue and immediately start crawling, while pushing more requests to the queue. Enqueueing tens of millions or billions of requests might take long, but you don't need to wait for that.
For best performance, use apify version 1.0.0 or higher.
With AWS-XRay tracing enabled on my lambda function i've found that as the number of parallel requests increases to dynamodb the performance of the read's decreases.
Here is an example of the XRay Traces:
Above you can see that the first set of GetItem requests execute in under 300ms. This set only has 6 async read requests running in parallel. The next set of read requests all execute in on average atleast 1.5 seconds - with 57 async read requests running in parallel.
Thoughts on what this could be due to:
this may be due to a "cold start" feature as dynamodb adds capacity to deal with parallel reads? (This dyanmodb instance is pay-by-request, not provisioned)
Additionally, i recognize that this may not be related parallel requests at all, but it may be a good place to start asking questions. Wondering if anyone knows what could be causing such a dramatic performance decrease.
I have to do 100000 sequential HTTP requests with Spark. I have to store responses into S3. I said sequential, because each request returns around 50KB of data, and I have to keep 1 second in order to not exceed API rate limits.
Where to make HTTP calls: from Spark Job's code (executed on driver/master node) or from dataset transformation (executed on worker node)?
Workarrounds
Make HTTP request from my Spark job (on Driver/Master node), create dataset of each HTTP response (each contains 5000 json items) and save each dataset to S3 with help of spark. You do not need to keep dataset after you saved it
Create dataset from all 100000 URLs (move all further computations to workers), make HTTP requests inside map or mapPartition, save single dataset to S3.
The first option
It's simpler and it represents a nature of my compurations - they're sequential, because of 1 second delay. But:
Is it bad to make 100_000 HTTP calls from Driver/Master node?
*Is it more efficient to create/save one 100_000 * 5_000 dataset than creating/saving 100_000 small datasets of size 5_000*
Each time I creating dataset from HTTP response - I'll move response to worker and then save it to S3, right? Double shuffling than...
Second option
Actually it won't benefit from parallel processing, since you have to keep interval of 1 second because request. The only bonus is to moving computations (even if they aren't too hard) from driver. But:
Is it worth of moving computations to workers?
Is it a good idea to make API call inside transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 sizes matter is in the bills from AWS and followup spark queries: anything you use in spark, pyspark, SQL etc. many small files are slower: Theres a high cost in listing files in S3, and every task pushed out to a spark worker has some setup/commit/complete costs.
regarding doing HTTP API calls inside a worker, well, you can do fun things there. If the result isn't replicable then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or other object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration length info, if someone ever wanted to try and aggregate the stats (though there you'd probably need timestamp for some time series view)
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. if each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could do this operation such that every 15-20 minutes of work was saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. Sleeping in the worker
1. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune sleep window for subsequent tasks.