Elasticsearch could not write all entries: Maybe ES was overloaded - apache-spark

I have an application that reads CSV files, applies some transformations, and then pushes the results to Elasticsearch from Spark itself, like this:
.option("es.resource", "{date}/" + type).save()
I have several nodes, and on each node I run 5-6 spark-submit commands that push to Elasticsearch.
I frequently get errors like this:
Could not write all entries [13/128] (Maybe ES was overloaded?). Error sample (first [5] error messages):
rejected execution of org.elasticsearch.transport.TransportService$7#32e6f8f8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#4448a084[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 451515]]
My Elasticsearch cluster has the following specs:
Nodes: 9 (1 TB disk space, RAM >= 15 GB), more than 8 cores per node
I have modified the following parameters for Elasticsearch
Could anyone suggest what I can change to get rid of these errors?

This occurs because bulk requests arrive faster than the Elasticsearch cluster can process them, so the bulk request queue fills up.
The default bulk queue size is 200.
Ideally, you should handle this on the client side:
1) Reduce the number of spark-submit commands running concurrently.
2) Retry on rejections by tuning es.batch.write.retry.count and es.batch.write.retry.wait, for example:
es.batch.write.retry.wait = "60s"
es.batch.write.retry.count = 6
On the Elasticsearch cluster side:
1) Check whether there are too many shards per index and try reducing the number.
This blog has a good discussion on criteria for tuning the number of shards.
2) As a last resort, increase thread_pool.bulk.queue_size (named thread_pool.write.queue_size in newer Elasticsearch versions).
Check this blog for an extensive discussion of bulk rejections.
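To make the retry settings above concrete: with a count of 6 and a wait of 60s, a batch that keeps being rejected waits up to 6 x 60 s = 6 minutes before the write gives up. The connector does this retrying for you, so the following is only a rough, self-contained sketch of the same idea in plain Java, not the connector's actual implementation. `BulkRetry`, `writeWithRetry` and the fake write in `main` are hypothetical names made up for illustration.

```java
import java.util.function.BooleanSupplier;

public class BulkRetry {
    /**
     * Calls attempt until it reports success, sleeping waitMillis between tries.
     * Returns the number of attempts used, or -1 if the initial try and all
     * retryCount retries were rejected (or the thread was interrupted).
     */
    public static int writeWithRetry(BooleanSupplier attempt, int retryCount, long waitMillis) {
        for (int tries = 1; tries <= retryCount + 1; tries++) {
            if (attempt.getAsBoolean()) {
                return tries;
            }
            if (tries <= retryCount) {
                try {
                    Thread.sleep(waitMillis); // fixed wait, like a retry.wait setting
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical bulk write that is rejected twice and then accepted.
        int used = writeWithRetry(() -> ++calls[0] >= 3, 6, 10L);
        System.out.println("succeeded after " + used + " attempts");
    }
}
```

A longer wait gives the bulk queue more time to drain, at the cost of Spark tasks sitting idle, so it is worth combining with fewer concurrent writers rather than relying on retries alone.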

The bulk queue in your ES cluster is hitting its capacity (200). Try increasing it. See this page for how to change the bulk queue capacity.
Also check this other SO answer, where the OP had a very similar issue that was fixed by increasing the bulk pool size.
Fix PlayFramework Slick Error - java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend rejected from slick.util.AsyncExecutor

More investigation, some code refactoring and many trials later, we have more insight. The issue seems related to concurrent reads and writes, i.e. trying to read from the DB while a write (update) occurs at the same time. We are not sure whether this is managed by Postgres or Slick.
We are facing a random error with Slick 3.
Here is a brief history.
We are building an ETL pipeline with Spark, Minio as S3 storage (Minio is an open-source alternative to AWS S3), and Delta tables. The pipeline has a web interface built with Play Framework (Scala).
The cluster consists of:
7 worker nodes with 16 cores and 64 GB RAM each, configured in client
1 storage node
spark.default.parallelism and spark.sql.shuffle.partitions are both set to 600
spark.dynamicAllocation is disabled
App data (session data, user data, and some other records) is saved in PostgreSQL using the Slick 3 mapper.
The size of the processed data is growing exponentially and is now around 50 GB. (In production, we aim to process terabytes of data.)
The data processing flow consists essentially of aggregating data with group-by and saving it to S3 storage, following these steps:
1. Read CSV data from Storage and create the read_df dataframe
2. Read main_db from Storage and create main_df
3. Merge read_df with main_df
4. Group by a specific key (let’s say user_id)
5. Save records to Storage to replace main_db. To guarantee data integrity, this stage is split into three phases:
   * Write records to a temp object referenced by date and time
   * Back up the existing database object main_db (copy to another object)
   * Rename the temp object to main_db (copy and delete)
Then update the PostgreSQL history table with information about the processed job, such as:
time_started, time_ended, number_of_rows_processed, size, etc. That is where the issue occurs.
We are facing a random error, and we noticed it happens when a shuffle occurs after the group-by. Sometimes we end up with 1000+ partitions. In those cases, Step 5 does not complete and throws the following exception:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#291fad07 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#7345bd2c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 26]
The completed tasks value is sometimes lower and sometimes reaches hundreds.
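For reference, the `[Terminated, pool size = 0, ...]` part of that message is what a `java.util.concurrent.ThreadPoolExecutor` prints once it has been shut down, so the task is being handed to an executor that no longer exists rather than to one that is merely full. A minimal sketch with the plain JDK (no Slick involved; `TerminatedPoolDemo` is a made-up name) that produces a message of the same shape:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

public class TerminatedPoolDemo {
    /** Submits a task to a pool that was already shut down and returns the rejection message. */
    public static String submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> { });          // one task completes normally
        pool.shutdown();                  // no new tasks are accepted from here on
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            pool.execute(() -> { });      // rejected by the default AbortPolicy
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown());
    }
}
```

This only illustrates what the message means; it does not say which part of the application shuts the Slick executor down.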
Below is the code that is executed in step 5:
Await.result(mainDbSvc.update("main.delta", mainDb), 600.seconds)
Googling the exception (and we did a ton of research on this), we found that it can happen when connections are closed before the code is executed when using transactionally. Note that we don’t use transactionally in our code. Below is the code executed when calling update():
val updateQuery = this.mainDbTable.filter(_.id === id).update(db)
This is the current Slick configuration:
connectionPool = "HikariCP"
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
numThreads = 100
Initially, before the errors started, it was:
numThreads = 20
maxConnections = 20
We tried queueSize = 2000, but that did not fix it.
Does anyone have a solution for us?
Furthermore, we suspect step 5 is responsible for the closed-connection issue, because it did not happen when that step was turned off. What is the link between the threads that read/write from S3 storage (on another server) and the Hikari (Slick) processes that are killed?
And is there a better way to guarantee data integrity (in case of failure while writing data) without this time-consuming copy-restore-and-delete process?
After aggregation, we repartition() to reduce the number of partitions and avoid skewed data before saving results. coalesce() made the driver JVM crash with OOM.
main_df and read_df do not have the same schema, so overwriting using Delta's built-in method is not possible.
The update() function's Await timeout was 10s; after the issue appeared we increased it, but that did not fix it.
This is the full trace of the exception:
An error has occurred: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#313a2647 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#5c6d1059 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at slick.util.AsyncExecutor$$anon$1$$anon$4.execute(AsyncExecutor.scala:161)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
at modules.load.daos.slick3.JobsDao.update(JobsDao.scala:180)
at modules.load.services.JobService.update(JobService.scala:50)
at modules.load.models.JobSnippet$.updateJob(Job.scala:113)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6(BuildController.scala:169)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6$adapted(BuildController.scala:151)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Issue with dsbulk unload

I am getting the messages below while unloading with dsbulk, and I cannot figure out what they mean:
[s0|347101951|0] Error sending cancel request. This is not critical (the request will eventually time out server-side). (HeartbeatException: null)
Not sending heartbeat because a previous one is still in progress. Check that advanced.heartbeat.interval is not lower than advanced.heartbeat.timeout.
"Error sending cancel request" is typical of continuous paging queries. It seems the coordinator is in trouble for some reason, which is why you are also seeing heartbeat failures. Dsbulk may be putting too much load on the cluster.
You didn't mention which version of dsbulk exactly, but assuming 1.4+ I would recommend trying the following actions (individually or combined):
* Disable continuous paging with dsbulk.executor.continuousPaging.enabled = false (this is likely to slow down dsbulk).
* Use smaller page sizes, e.g. 1000 rows:
  * If not using continuous paging: datastax-java-driver.basic.request.page-size = 1000.
  * If using continuous paging: datastax-java-driver.advanced.continuous-paging.page-size = 1000.
* Throttle dsbulk to reduce the load on the cluster:
  * Either "soft" throttle by limiting the number of concurrent requests, e.g. 128:
    * DSBulk < 1.6: dsbulk.executor.maxInFlight = 128.
    * DSBulk >= 1.6: dsbulk.engine.maxConcurrentQueries = 128.
  * Or "hard" throttle by limiting the number of requests per second, e.g. 500: dsbulk.executor.maxPerSecond = 500.

Spark and 100000 sequential HTTP calls: driver vs workers

I have to make 100000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50 KB of data, and I have to wait 1 second between requests in order not to exceed the API rate limits.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on a worker node)?
1. Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5000 JSON items), and save each dataset to S3 with the help of Spark. I do not need to keep a dataset after it has been saved.
2. Create a dataset from all 100000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computations: they're sequential because of the 1-second delay. But:
Is it bad to make 100_000 HTTP calls from the driver/master node?
Is it more efficient to create/save one 100_000 * 5_000 dataset than to create/save 100_000 small datasets of size 5_000?
Each time I create a dataset from an HTTP response, I'll move the response to a worker and then save it to S3, right? Double shuffling then...
Second option
Actually, it won't benefit from parallel processing, since I have to keep an interval of 1 second between requests. The only bonus is moving the computations (even if they aren't too heavy) off the driver. But:
Is it worth moving the computations to the workers?
Is it a good idea to make API calls inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, a POST to initiate multipart upload after that first block, one POST per 32 MB (of 32MB, obviously) and a final POST of a JSON file to complete. So: slightly more efficient
Where small S3 object sizes matter is in the bills from AWS and in follow-up Spark queries: with anything you use in Spark, PySpark, SQL, etc., many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete costs.
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable then task failures & retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store in workers; first the RDD of the copy src/dest operations is built up, then they are pushed out to workers. The result of the worker code includes upload duration info, in case someone ever wants to try to aggregate the stats (though there you'd probably need a timestamp for some time series view).
Given you have to serialize the work to one request/second, 100K requests is going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally, so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be
* first RDD takes list of queries and some summary info about any existing checkpointed data, calculates the next 15 minutes of work,
* building up a list of GET calls to delegate to 1+ worker. Either 1 URL/row, or have multiple URLs in a single row
* run that job, save the results
* test recovery works with a smaller window and killing things.
* once happy: do the full run
Maybe also: recognise & react to any throttle events coming off the far end by
1. Sleeping in the worker
1. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune sleep window for subsequent tasks.
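As a rough illustration of the "save incrementally and resume" idea: at one request per second, 100,000 requests take 100,000 s, or about 27.8 hours, so a crash without checkpoints can cost most of a day. The sketch below is plain single-machine Java with no Spark and no real HTTP client. `CheckpointedFetcher`, the `fetch` function and the file layout are all assumptions made up for illustration; a real run would plug in an actual HTTP call and write to S3 instead of a local directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

public class CheckpointedFetcher {
    /** Index of the next URL to fetch; 0 when no checkpoint file exists yet. */
    public static int readCheckpoint(Path checkpoint) {
        try {
            if (!Files.exists(checkpoint)) {
                return 0;
            }
            return Integer.parseInt(Files.readString(checkpoint).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Fetches at most maxRequests URLs starting at the checkpoint, one at a time.
     * Each response is saved before the checkpoint advances, so a restart
     * repeats at most one request. Returns the index of the next URL to fetch.
     */
    public static int run(List<String> urls, Path checkpoint, Path outDir,
                          Function<String, String> fetch, long delayMillis, int maxRequests) {
        int start = readCheckpoint(checkpoint);
        int end = Math.min(urls.size(), start + maxRequests);
        try {
            Files.createDirectories(outDir);
            for (int i = start; i < end; i++) {
                String body = fetch.apply(urls.get(i));
                Files.writeString(outDir.resolve("response-" + i + ".json"), body);
                Files.writeString(checkpoint, Integer.toString(i + 1));
                Thread.sleep(delayMillis); // crude throttle: pause after every request
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while throttling", e);
        }
        return end;
    }
}
```

Calling `run` repeatedly with `maxRequests` of around 900 to 1200 and a 1000 ms delay would give the 15-20 minute windows mentioned above; killing the process between or during calls and restarting it is an easy way to test recovery with a smaller window first.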

Elasticsearch cluster size/architecture

I've been trying to set up an Elasticsearch cluster for processing log data from some 3D printers.
We have more than 850K documents generated each day for 20 machines. Each machine has its own index.
Right now we have 16 months of data, which makes about 410M records to index in each Elasticsearch index.
We process the data from CSV files with Spark and write to an Elasticsearch cluster of 3 machines, each with 16 GB of RAM and 16 CPU cores.
But each time we reach about 10-14M docs/index, we get a network error:
Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]
I'm sure this is not a network error; Elasticsearch just cannot handle more indexing requests.
To solve this, I've tried tweaking many Elasticsearch parameters, such as refresh_interval, to speed up indexing and get rid of the error, but nothing worked. After monitoring the cluster, we think we should scale it up.
We also tried tuning the Elasticsearch Spark connector, but with no result.
So I'm looking for the right way to choose the cluster size. Are there any guidelines on how to choose a cluster size? Any pointers would be helpful.
NB: we are mainly interested in indexing data, since we have only one or two queries to run on the data to get some metrics.
I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
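A small sketch of how the per-month index name could be derived from a record's date, using the naming from the example above. `MonthlyIndexName` and `indexFor` are made-up names, and how the resulting name is handed to the Spark connector is left out here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class MonthlyIndexName {
    // Year and zero-padded month separated by a dot, e.g. 2018.01
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("uuuu.MM");

    /** Builds an index name such as sensor_data_2018.01 from a prefix and a date. */
    public static String indexFor(String prefix, LocalDate day) {
        return prefix + "_" + day.format(MONTH);
    }

    public static void main(String[] args) {
        // prints sensor_data_2018.01
        System.out.println(indexFor("sensor_data", LocalDate.of(2018, 1, 15)));
    }
}
```

Smaller time-based indices also make it cheap to drop or re-index a single month without touching the rest of the data.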
Some things which will impact what cluster size you need will be:
How many documents
Average size of each document
How many messages/second are being indexed
Disk IO speed
I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB data and have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then only the indexing of a few documents will fail and can be successfully retried after a short delay. Our implementation is also very indexing heavy with minimal actual searching.
I would check the Elasticsearch server logs and see if you can find a more detailed error message. Possibly look for RejectedExecutionExceptions. Also check the cluster health and node stats when you start to receive the failures, which might shed some more light on what's occurring. If possible, implement retry and backoff when failures start to occur, to give ES time to catch up with the load.
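One common way to do the backoff part is to double the wait after each failed attempt, up to a cap. The sketch below only computes the delays; `Backoff` and `delayMillis` are made-up names, the base and cap values are arbitrary, and jitter is left out for brevity.

```java
public class Backoff {
    /**
     * Delay before retry number attempt (0-based): baseMillis * 2^attempt,
     * capped at capMillis. Assumes baseMillis is small enough (well below 2^32)
     * that the multiplication cannot overflow a long.
     */
    public static long delayMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(attempt, 0), 30);
        return Math.min(capMillis, baseMillis * (1L << shift));
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait "
                    + delayMillis(attempt, 100L, 5000L) + " ms");
        }
    }
}
```

With a base of 100 ms and a cap of 5000 ms the waits grow 100, 200, 400, ... and then stay at 5000 ms, which gives the cluster progressively more room to drain its queue.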
Hope that helps a bit!
This is a network error, saying the data node is ... lost. Maybe it crashed; you can check the Elasticsearch logs to see what's going on.
The most important thing to understand with elasticsearch-hadoop is how work is parallelized:
1 Spark partition per Elasticsearch shard
The important thing is sharding; this is how you load-balance the work with Elasticsearch. Also, refresh_interval must be > 30 seconds, and you should disable replication while indexing. This is very basic configuration tuning; I am sure you can find plenty of advice about it in the documentation.
With Spark, you can check in the web UI (port 4040) how the work is split into tasks and partitions; this helps a lot. You can also monitor the network bandwidth between Spark and ES, and the ES node stats.

Cassandra throwing NoHostAvailableException after 5 minutes of high IOPS run

I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my sessions, and I use separate sessions for reads and writes, each of which connects to a different node in the cluster as its contact point.
This works fine for, say, 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: / (com.datastax.driver.core.TransportException: [/] Connection has been closed), / (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me figure out what the problem could be?
The exception asks me to increase the number of connections per host, but how high can I set this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2; it throws an exception saying 2 is the max.
This is how I'm creating each read/write session:
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
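// contactPoint: address of the node used as the contact point (assumed; not shown in the original snippet)
cluster = Cluster.builder()
.addContactPoint( contactPoint )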
.withPoolingOptions( poolingOpts )
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or the way you are connecting. If the problem happens after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingestion of data and cannot keep up. The typical sign of this is JVM garbage collection ("GC") messages in the Cassandra system.log file: too many small ones batched together, or large ones on their own, can mean that incoming clients are not responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start looking at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per real cluster. As per the article I've linked below (apologies if you have already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
As you are doing a high number of reads, I'd most definitely recommend using setFetchSize as well, if it's applicable to your code.
For reference, here are the connection options in case you find them useful.
Hope this helps.
