I have to load around 2.5 TB of data into an HBase database. Since it would take a long time for a single process to write it all, I tried creating multiple processes to do the job. But I am not sure whether that is safe, because I am calling batch.commit(finalize=True) from multiple processes simultaneously. I am using HBase version 1.1.2.2.6.1.0-129.
I am doing all the write operations through the HBase REST API. With multiple processes writing at once, I am getting the following errors on my server:
2017-08-07 11:01:18,311 INFO [hconnection-0x5e834655-shared--pool3-t52497] client.AsyncProcess: #7542, table=clueweb12, attempt=12/35 failed=124ops, last exception: org.apache.hadoop.hbase.RegionTooBusyException: org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, regionName=clueweb12,http://www.innovativeusers.org/list/archives/2006/msg058,1502034291085.f923654226ace7491d61f8b67764c46f., server=node9.local,16020,1499079576247, memstoreSize=539000216, blockingMemStoreSize=536870912
at org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:3824)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2977)
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2928)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:748)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:708)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2124)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2141)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
on node9.local,16020,1499079576247, tracking started null, retrying after=20140ms, replay=124ops
Does this error lead to loss of data?
UPDATE 2
After more investigation, some code refactoring, and many trials, we have more insight. The issue seems related to concurrent read-write access, i.e. reading from the DB at the same time as a write (update) occurs. We are not sure whether this is managed by Postgres or by Slick.
We are facing a random error with Slick 3.
Here is the background.
We are building an ETL pipeline with Spark, Minio as S3 storage (Minio is an open-source, S3-compatible alternative to AWS S3), and Delta tables. The pipeline has a web interface built with the Play Framework (Scala).
The cluster consists of:
7 worker nodes with 16 cores and 64 GB RAM each, configured in client mode
1 storage node
spark.default.parallelism and spark.sql.shuffle.partitions are both set to 600
spark.dynamicAllocation is disabled
App data (session data, user data, and some other records) is saved in PostgreSQL using the Slick 3 mapper.
The size of the processed data is growing fast and is now around 50 GB. (In production, we aim to process terabytes of data.)
The data processing flow essentially consists of aggregating data with a group-by and saving the result to S3 storage, following these steps:
Read CSV data from storage and create a read_df dataframe
Read main_db from storage and create main_df
Merge read_df with main_df
Group by a specific key (let's say user_id)
Save the records to storage to replace main_db. To guarantee data integrity, this stage is split into three phases (a rough sketch of these phases is shown after the steps):
Write the records to a temp object referenced by date time
Back up the existing database object main_db (copy to another object)
Rename the temp object to main_db (copy and delete)
Then update the PostgreSQL history table with processed job information such as time_started, time_ended, number_of_rows_processed, size, etc. This is where the issue occurs.
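For illustration, here is a rough sketch of that three-phase replace (the bucket name, paths and the Delta write call are placeholders, not our production code); note that on S3 a "rename" is itself a copy plus a delete:

import java.net.URI
import java.time.LocalDateTime
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def replaceMainDb(spark: SparkSession, result: DataFrame, bucket: String): Unit = {
  val conf   = spark.sparkContext.hadoopConfiguration
  val fs     = FileSystem.get(new URI(s"s3a://$bucket"), conf)
  val stamp  = LocalDateTime.now().toString.replace(":", "-")
  val tmp    = new Path(s"s3a://$bucket/main_db_tmp_$stamp")
  val mainDb = new Path(s"s3a://$bucket/main_db")
  val backup = new Path(s"s3a://$bucket/main_db_backup")

  // Phase 1: write the new records to a temp object referenced by date time
  result.write.format("delta").mode("overwrite").save(tmp.toString)

  // Phase 2: back up the existing main_db object (copy to another object)
  if (fs.exists(mainDb)) {
    FileUtil.copy(fs, mainDb, fs, backup, false, true, conf)
    fs.delete(mainDb, true)
  }

  // Phase 3: "rename" the temp object to main_db (on S3A this is copy and delete)
  fs.rename(tmp, mainDb)
}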
We are facing a random error, and we noticed it happens when a shuffle occurs after the group-by. Sometimes we end up with 1000+ partitions. In those cases, step 5 does not complete and throws the following exception:
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#291fad07 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#7345bd2c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 26]
The completed tasks value is sometimes lower and sometimes reaches hundreds.
Below is the code that is executed in step 5:
Await.result(mainDbSvc.update("main.delta", mainDb), 600.seconds)
Googling the exception (and we did a ton of research on this), we found that it can happen because connections are closed before the code is executed when using transactionally. Note that we do not use transactionally in our code. Below is the code executed when calling update():
val updateQuery = this.mainDbTable.filter(_.id === id).update(db)
dbConfig.db.run(updateQuery)
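For context on the error itself: the [Terminated, pool size = 0, ...] part of the message is the state of Slick's AsyncExecutor, meaning the action was submitted after the executor's thread pool had already been shut down; as far as we understand, that normally only happens when db.close() has been called or when a Database/DatabaseConfig instance is created and closed per request. The sketch below (with a hypothetical jobs table, not our real schema) shows the pattern we believe we are following: one long-lived Database injected by Play and never closed by application code.

import javax.inject.{Inject, Singleton}
import play.api.db.slick.DatabaseConfigProvider
import slick.jdbc.PostgresProfile.api._

import scala.concurrent.Future

// Hypothetical table mapping, for illustration only.
class Jobs(tag: Tag) extends Table[(Long, String)](tag, "jobs") {
  def id     = column[Long]("id", O.PrimaryKey)
  def status = column[String]("status")
  def *      = (id, status)
}

@Singleton
class JobsDao @Inject()(dbConfigProvider: DatabaseConfigProvider) {
  // One Database instance for the whole application lifetime; Play closes it
  // on shutdown, application code never calls db.close().
  private val db   = dbConfigProvider.get[slick.jdbc.PostgresProfile].db
  private val jobs = TableQuery[Jobs]

  def update(id: Long, status: String): Future[Int] =
    db.run(jobs.filter(_.id === id).map(_.status).update(status))
}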
This is the current Slick configuration:
connectionPool = "HikariCP"
dataSourceClass = "org.postgresql.ds.PGSimpleDataSource"
numThreads = 100
Initially, before the errors started, it was:
numThreads = 20
maxConnections = 20
We tried queueSize = 2000, but that did not fix it.
Does anyone have a solution for us?
Furthermore, we suspect step 5 to be responsible for the connection-closed issue, because the error does not happen when that step is turned off. What is the link between the threads that read/write from S3 storage (on another server) and the Hikari (Slick) threads that are killed?
And is there a better way to guarantee data integrity (in case of failure while writing data) without this time-consuming copy-restore-and-delete process?
Note:
After the aggregation, we call repartition() to reduce the number of partitions and avoid skewed data before saving the results; coalesce() made the driver JVM crash with an OOM (see the sketch after these notes).
main_df and read_df do not have the same schema, so overwriting using Delta's built-in method is not possible.
The update() function's Await timeout was 10 s; following the issue we increased it, but that did not fix it.
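For clarity, the repartition step after the aggregation looks roughly like this (the key column and partition count are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// repartition() performs a full shuffle and spreads skewed keys across a fixed
// number of partitions; coalesce() only merges existing partitions without a
// shuffle, which is why we avoid it here.
def prepareForWrite(aggregated: DataFrame): DataFrame =
  aggregated.repartition(600, col("user_id"))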
UPDATE
This is the full stack trace of the exception:
An error has occurred: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#313a2647 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
java.util.concurrent.RejectedExecutionException: Task slick.basic.BasicBackend$DatabaseDef$$anon$3#5c6d1059 rejected from slick.util.AsyncExecutor$$anon$1$$anon$2#77884590[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21]
at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at slick.util.AsyncExecutor$$anon$1$$anon$4.execute(AsyncExecutor.scala:161)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction(BasicBackend.scala:265)
at slick.basic.BasicBackend$DatabaseDef.runSynchronousDatabaseAction$(BasicBackend.scala:263)
at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.slick$basic$BasicBackend$DatabaseDef$$runInContextInline(BasicBackend.scala:242)
at slick.basic.BasicBackend$DatabaseDef.runInContextSafe(BasicBackend.scala:148)
at slick.basic.BasicBackend$DatabaseDef.runInContext(BasicBackend.scala:142)
at slick.basic.BasicBackend$DatabaseDef.runInContext$(BasicBackend.scala:141)
at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.runInternal(BasicBackend.scala:77)
at slick.basic.BasicBackend$DatabaseDef.runInternal$(BasicBackend.scala:76)
at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:37)
at slick.basic.BasicBackend$DatabaseDef.run(BasicBackend.scala:74)
at slick.basic.BasicBackend$DatabaseDef.run$(BasicBackend.scala:74)
at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:37)
at modules.load.daos.slick3.JobsDao.update(JobsDao.scala:180)
at modules.load.services.JobService.update(JobService.scala:50)
at modules.load.models.JobSnippet$.updateJob(Job.scala:113)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6(BuildController.scala:169)
at modules.load.controllers.Ops.BuildController.$anonfun$jsProcess$6$adapted(BuildController.scala:151)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
I'm working with DocumentDB in AWS, and I've been having trouble when I try to read from the same collection simultaneously with different aggregation queries.
The issue is not that I cannot read from the database, but rather that it takes a lot of time to complete the queries. It doesn't matter if I trigger the queries simultaneously or one after the other.
I'm using a Lambda function with Node.js to run my code, and Mongoose to handle the connection to the database.
Here's some sample code that I put together to illustrate my problem:
query1() {
return Collection.aggregate([...])
}
query2() {
return Collection.aggregate([...])
}
query3() {
return Collection.aggregate([...])
}
It takes the same amount of time if I run them using Promise.all
Promise.all([ query1(), query2(), query3() ])
as it does if I run them sequentially, waiting for the previous one to finish
query1().then(result1 => query2().then(result2 => query3()))
However, if I run each query in a separate Lambda execution, each individual query takes significantly less time to finish (between 1 and 2 seconds).
So if they were running in parallel, the whole execution should finish in roughly the time of the slowest query (about 2 seconds), not the 7 seconds it takes now.
My guess is that the DocumentDB instance is running the queries sequentially no matter how I send them. The collection holds around 19,000 documents with a total size of almost 25 MB.
When I check the instance metrics, CPUUtilization is barely over 8% and the available RAM only drops by about 20 MB, so I don't think the delay has to do with the size of the instance.
Do you know why DocumentDB is behaving like this? Is there a configuration that I can change to run the aggregations in parallel?
I found that a query (just a simple SELECT) takes a long time while a new node is being added to the cluster.
My execution time log:
17:49:40.008 [ThreadPoolTaskScheduler-14] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 8 ms
17:50:00.010 [ThreadPoolTaskScheduler-3] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 15010 ms
17:50:15.008 [ThreadPoolTaskScheduler-4] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 10008 ms
17:50:20.008 [ThreadPoolTaskScheduler-16] INFO task.DiskCounting - void task.DiskCounting.runJob() executed in 7 ms
Normally it takes about 10 ms, but it suddenly takes 15000 ms while a node is being added.
And I found it gets stuck waiting for the new node to initialize its data.
Cassandra log (the new node):
INFO [HANDSHAKE-/194.187.1.52] 2019-05-31 17:49:36,056 OutboundTcpConnection.java:560 - Handshaking version with /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,059 Gossiper.java:1055 - Node /194.187.1.52 is now part of the cluster
INFO [RequestResponseStage-1] 2019-05-31 17:49:36,069 Gossiper.java:1019 - InetAddress /194.187.1.52 is now UP
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [GossipStage:1] 2019-05-31 17:49:36,109 TokenMetadata.java:479 - Updating topology for /194.187.1.52
INFO [MigrationStage:1] 2019-05-31 17:49:39,347 ViewManager.java:137 - Not submitting build tasks for views in keyspace system_traces as storage service is not initialized
INFO [MigrationStage:1] 2019-05-31 17:49:39,352 ColumnFamilyStore.java:411 - Initializing system_traces.events
INFO [MigrationStage:1] 2019-05-31 17:49:39,382 ColumnFamilyStore.java:411 - Initializing system_traces.sessions
It gets stuck at: Node /194.187.1.52 is now part of the cluster
and the client then waits for the new node to initialize all of its data.
What I have tried:
1. Using consistency level ONE or QUORUM; no difference.
2. Setting the replication factor to 1, 2, or 3; still no difference.
Why does the new node become part of the cluster before it has finished initializing its data?
Is there a way to solve this?
I expect that when I query an old node, performance is not affected by waiting for the new node to initialize its data.
I resolved this problem.
I had a wrong config: I let all nodes be seeds, even before they had joined the cluster, and this caused read time-outs while adding a new node to the cluster.
After fixing this, all reads were normal, but I still found that insert queries timed out while adding a node.
Finally, I tuned this to avoid the insert time-outs:
/sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
and also changed the configuration to limit streaming throughput:
stream_throughput_outbound_megabits_per_sec : 100
Thanks a lot for the help.
This is just a theory, but one possible cause is that the new node is being chosen as a coordinator by your driver client; in that case, consistency level and replication aren't the main contributing factors to the delay in servicing your query.
If the new node is performing slowly at first for whatever reason and the driver is sending requests to it, the coordinator's behavior can impact the servicing of your request.
What exactly is runJob doing? You suggested it is making a single query, but is it possible that it's a range query?
If it's a single query and it's taking as long as 10 seconds, that seems odd as the default read_request_timeout is 5 seconds. If it's a range query (a read involving multiple partitions), the default is 10 seconds. Are you adjusting those timeouts?
When you see responses taking that long for a single query, it could mean the coordinator is what is impeding responsiveness; otherwise, if the coordinator were responsive and only the replicas were slow, you'd see a ReadTimeoutException served to the client.
To better react to these cases, a number of client drivers implement a strategy called 'speculative execution'. As described in the documentation for the DataStax Java Driver for Apache Cassandra:
Sometimes a Cassandra node might be experiencing difficulties (ex: long GC pause) and take longer than usual to reply. Queries sent to that node will experience bad latency.
One thing we can do to improve that is pre-emptively start a second execution of the query against another node, before the first node has replied or errored out. If that second node replies faster, we can send the response back to the client (we also cancel the first execution – note that “cancelling” in this context simply means discarding the response when it arrives later, Cassandra does not support cancellation of in flight requests at this stage)
You could configure your driver to speculatively execute with a constant threshold for idempotent requests (such as reads). In the 3.x Java driver, it's done this way:
Cluster cluster = Cluster.builder()
.addContactPoint("127.0.0.1")
.withSpeculativeExecutionPolicy(
new ConstantSpeculativeExecutionPolicy(
500, // delay before a new execution is launched
2 // maximum number of executions
))
.build();
In this case, if the coordinator is slow to respond, after 500 ms the driver chooses another coordinator and submits a second request, and the first coordinator to respond wins.
Note that this might cause an amplification of requests sent to your cluster overall, so you want to tune that delay in such a way that it only kicks in when response time is highly anomalous. In your case, if requests normally take less than 10ms, 500ms is probably a reasonable number depending on what your higher percentile latencies look like.
All that being said, if you are able to identify that the problem is the new node behaving poorly as a coordinator, it's worth understanding why. Adding speculative execution could be a nice way of working around the problem, but it's probably better to try to understand why the new node is performing so slowly. Having monitoring in place to observe Cassandra's metrics would likely give good visibility into the problem.
That is a behaviour you can see with too high a consistency level, or with not enough copies of the data (replication factor). When a new node is added to the cluster, a rearrangement of token ownership occurs; once it is determined which data the new node will own, it starts streaming that data, which may saturate the network.
In your question you don't mention the network setup or whether you are using cloud instances, which has a direct impact on these constraints; for example, an AWS m3.large instance is more restricted in network capabilities than an i3.4xlarge.
Another variable to consider is the disk configuration. If you are using your own hardware, look for the I/O cap in your drive settings; if you are in the cloud, using instance storage, when available, will give better performance than external volumes (such as AWS EBS; in that case, make sure you enable the "EBS optimized" option if your instance supports it).
Usually an RF of 3 with a consistency level of QUORUM should also help prevent the issue.
I have to make 100,000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50 KB of data, and I have to wait 1 second between requests in order not to exceed the API rate limit.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on worker nodes)?
Workarounds
Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5,000 JSON items), and save each dataset to S3 with Spark. The dataset does not need to be kept after it has been saved.
Create a dataset from all 100,000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computation: it is sequential because of the 1-second delay. But (a rough sketch of this option follows the questions below):
Is it bad to make 100,000 HTTP calls from the driver/master node?
Is it more efficient to create/save one 100,000 × 5,000 dataset than to create/save 100,000 small datasets of 5,000 items each?
Each time I create a dataset from an HTTP response, the response is moved to a worker and then saved to S3, right? That is double shuffling then...
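For concreteness, a spark-shell style sketch of this first option (the endpoint, bucket and response handling are placeholders; each real response holds 5,000 JSON items):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("driver-side-fetch").getOrCreate()
import spark.implicits._

val client = HttpClient.newHttpClient()
val urls   = (1 to 100000).map(i => s"https://api.example.com/page/$i") // placeholder URLs

urls.zipWithIndex.foreach { case (url, i) =>
  val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
  val body    = client.send(request, HttpResponse.BodyHandlers.ofString()).body()

  // One small dataset per response (~5,000 JSON items), saved and then discarded.
  Seq(body).toDS().write.mode("overwrite").text(f"s3a://my-bucket/responses/part-$i%06d")

  Thread.sleep(1000) // respect the 1 request/second rate limit
}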
The second option
Actually it won't benefit from parallel processing, since I have to keep a 1-second interval between requests. The only bonus is moving the computation (even if it isn't heavy) off the driver. But:
Is it worth moving the computation to the workers?
Is it a good idea to make API calls inside a transformation?
Saving a file <32MB (or whatever fs.s3a.block.size is) to S3 is ~2xGET, 1xLIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files: a POST to initiate the multipart upload after that first block, one POST per 32 MB block (of 32 MB, obviously), and a final POST of a JSON file to complete it. So: slightly more efficient.
Where small S3 object sizes matter is in the AWS bill and in follow-up Spark queries: for anything you use in Spark, PySpark, SQL etc., many small files are slower. There's a high cost to listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete cost.
Regarding making HTTP API calls inside a worker: well, you can do fun things there. If the result isn't reproducible, then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first the RDD of the copy src/dest operations is built up, then they are pushed out to the workers. The result of the worker code includes upload duration info, in case someone ever wanted to try to aggregate the stats (though you'd probably need a timestamp for a time-series view).
Given that you have to serialize the work to one request per second, 100K requests are going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally, so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how to do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there.
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be (a rough sketch follows this list):
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work,
* build up a list of GET calls to delegate to one or more workers, either 1 URL per row or multiple URLs in a single row,
* run that job and save the results,
* test that recovery works with a smaller window and by killing things,
* once happy: do the full run.
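A rough sketch of that worker-side piece, assuming one URL per row and Java 11's built-in HttpClient; the window of URLs and the output path are placeholders for whatever the checkpoint bookkeeping produces:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.SparkSession

object RateLimitedFetch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rate-limited-fetch").getOrCreate()
    import spark.implicits._

    // Placeholder: in the real job this window of URLs would come from the
    // checkpoint summary (the "next 15 minutes of work").
    val urls = Seq("https://api.example.com/page/1", "https://api.example.com/page/2")

    val responses = spark.createDataset(urls)
      .repartition(1)                    // one partition keeps the calls sequential
      .mapPartitions { iter =>
        val client = HttpClient.newHttpClient()
        iter.map { url =>
          val request  = HttpRequest.newBuilder(URI.create(url)).GET().build()
          val response = client.send(request, HttpResponse.BodyHandlers.ofString())
          Thread.sleep(1000)             // stay under the 1 request/second limit
          (url, response.statusCode(), response.body())
        }
      }
      .toDF("url", "status", "body")

    // Save this window before computing the next one, so a restart can carry
    // on from the last saved window (path is a placeholder).
    responses.write.mode("append").json("s3a://my-bucket/responses/window-0001")

    spark.stop()
  }
}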
Maybe also: recognise and react to any throttle events coming off the far end by
1. sleeping in the worker, and
2. returning a count of throttle events in the results, so that the driver can initially collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
I'm using the DataStax Cassandra 2.1 driver and performing read/write operations at a rate of ~8000 IOPS. I've used pooling options to configure my sessions, and I use a separate session for reads and for writes, each of which connects to a different node in the cluster as its contact point.
This works fine for, say, 5 minutes, but after that I get a lot of exceptions like:
Failed with: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.1.123:9042 (com.datastax.driver.core.TransportException: [/10.0.1.123:9042] Connection has been closed), /10.0.1.56:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout while trying to acquire available connection (you may want to increase the driver number of per-host connections)))
Can anyone help me out here on what could be the problem?
The exception asks me to increase the number of connections per host, but how high a value can I set for this parameter?
Also, I'm not able to set CoreConnectionsPerHost beyond 2, as it throws an exception saying 2 is the max.
This is how I'm creating each read/write session:
PoolingOptions poolingOpts = new PoolingOptions();
poolingOpts.setCoreConnectionsPerHost(HostDistance.REMOTE, 2);
poolingOpts.setMaxConnectionsPerHost(HostDistance.REMOTE, 200);
poolingOpts.setMaxSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 128);
poolingOpts.setMinSimultaneousRequestsPerConnectionThreshold(HostDistance.REMOTE, 2);
cluster = Cluster
.builder()
.withPoolingOptions( poolingOpts )
.addContactPoint(ip)
.withRetryPolicy( DowngradingConsistencyRetryPolicy.INSTANCE )
.withReconnectionPolicy( new ConstantReconnectionPolicy( 100L ) ).build();
Session s = cluster.connect(keySpace);
Your problem might not actually be in your code or in the way you are connecting. If the problem happens after a few minutes, it could simply be that your cluster is becoming overloaded trying to process the ingested data and cannot keep up. The typical sign of this is when you start seeing JVM garbage collection "GC" messages in the Cassandra system.log file; too many small ones close together, or large ones on their own, can mean that incoming clients are not being responded to, causing this kind of scenario. Verify that you do not have too many of these events showing up in your logs before you start looking at your code. Here's a good example of a large GC event:
INFO [ScheduledTasks:1] 2014-05-15 23:19:49,678 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2896 ms for 2 collections, 310563800 used; max is 8375238656
When connecting to a cluster there are some recommendations, one of which is to have only one Cluster object per physical cluster. As per the article I've linked below (apologies if you have already studied this):
Use one cluster instance per (physical) cluster (per application lifetime)
Use at most one session instance per keyspace, or use a single Session and explicitly specify the keyspace in your queries
If you execute a statement more than once, consider using a prepared statement
You can reduce the number of network roundtrips and also have atomic operations by using batches
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/fourSimpleRules.html
As you are doing a high number of reads, I'd definitely recommend also using setFetchSize, if it's applicable to your code:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/cqlStatements.html
http://www.datastax.com/documentation/developer/java-driver/2.1/java-driver/reference/queryBuilderOverview.html
For reference, here are the connection options in case you find them useful:
http://www.datastax.com/documentation/developer/java-driver/2.1/common/drivers/reference/connectionsOptions_c.html
Hope this helps.