Is the Spark JDBC sink transactionally safe at node level? - apache-spark

I have a question about opening a transaction at the partition level. If I use the JDBC connector to write to a database (Postgres), will the partition-specific writes on a worker node be transactionally safe? That is,
if a worker node goes down while writing the data, will the rows related to this partition/worker node be rolled back?

There is a transaction boundary on the partition: when the database supports transactions, each partition is written inside its own transaction (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L588).
However, if a failure occurs after the commit but before the task is marked as SUCCESS, for example because of a network issue or timeout, the task is retried and you might still get the same rows written more than once.
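One common way to make a retried partition write harmless is to make the write itself idempotent, for example with a Postgres upsert keyed on the primary key. The following is only a sketch under assumptions: the table and column names are hypothetical, the built-in Spark JDBC writer does not issue this statement for you, and you would have to run it yourself (for example from a per-partition JDBC loop).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: builds a Postgres "INSERT ... ON CONFLICT" statement with JDBC placeholders.
// Identifiers are concatenated as-is, so they must come from trusted code, not user input.
final class UpsertSql {

    static String build(String table, List<String> keyCols, List<String> valueCols) {
        List<String> all = new ArrayList<>(keyCols);
        all.addAll(valueCols);

        String cols = String.join(", ", all);
        String marks = all.stream().map(c -> "?").collect(Collectors.joining(", "));
        String conflict = " ON CONFLICT (" + String.join(", ", keyCols) + ")";

        // With no non-key columns there is nothing to update, so a replayed row is skipped.
        String action = valueCols.isEmpty()
                ? " DO NOTHING"
                : " DO UPDATE SET " + valueCols.stream()
                        .map(c -> c + " = EXCLUDED." + c)
                        .collect(Collectors.joining(", "));

        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")"
                + conflict + action;
    }

    public static void main(String[] args) {
        // Hypothetical table "events" with primary key "id".
        List<String> keys = new ArrayList<>();
        keys.add("id");
        List<String> values = new ArrayList<>();
        values.add("payload");
        System.out.println(build("events", keys, values));
    }
}
```

With a statement like this, a task that is retried after its first attempt already committed overwrites the same rows instead of inserting duplicates.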

Related

Cassandra write query timed out after PT2S

I have a monolithic Cassandra application that needs to write at a high rate while reading payloads from a queue. The Cassandra cluster has 3 nodes. When I start processing a large number of messages in parallel (by spawning threads), I get the exception below:
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
I am creating the CqlSession as a bean:
return CqlSession.builder().addContactPoints(contactPoints)
/*.addContactPoint(new InetSocketAddress("localhost", 9042))*/
.withConfigLoader(new DefaultDriverConfigLoader()).withLocalDatacenter("datacenter1")
.addTypeCodecs(new CustomDateCodec())
.withKeyspace("dev").build();
I am injecting this CqlSession into my mapper and other classes to run queries.
In my DataStax driver configuration I have given the IPs of the 3 nodes as contact points.
Is there any tuning I need to do in the CqlSession creation or on my Cassandra nodes so that they can take writes at high concurrency?
Also, how many writes can I do in parallel?
All are update statements on the primary key only, without any IF condition.
The timeout you're seeing is a result of your app overloading the cluster, effectively doing a DDoS attack on it.
PT2S is the 2-second write timeout (written as an ISO-8601 duration). The commitlog disks can only absorb so much write I/O. If you're seeing dropped mutations in the logs or in nodetool tpstats, that confirms the commitlog can't keep up with the writes.
If your cluster can sustain 10K writes/sec but your app is doing 20K writes then you need to double the size of your cluster (add more nodes) to support the throughput requirements. Cheers!
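Until the cluster is resized, the application can at least avoid flooding it by capping the number of writes in flight. This is not part of the answer above, only a minimal sketch under assumptions: `BoundedWriter` is a hypothetical helper, and the `asyncWrite` function stands in for whatever issues the real write (with the 4.x driver it could wrap `session.executeAsync(...)`, which returns a `CompletionStage`).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch: caps the number of asynchronous writes that are in flight at once.
final class BoundedWriter<T> {
    private final Semaphore permits;
    private final Function<T, CompletableFuture<Void>> asyncWrite;

    BoundedWriter(int maxInFlight, Function<T, CompletableFuture<Void>> asyncWrite) {
        this.permits = new Semaphore(maxInFlight);
        this.asyncWrite = asyncWrite;
    }

    // Blocks the calling (producer) thread while maxInFlight writes are pending,
    // which applies back-pressure to whatever is reading from the queue.
    CompletableFuture<Void> submit(T item) {
        permits.acquireUninterruptibly();
        CompletableFuture<Void> write;
        try {
            write = asyncWrite.apply(item);
        } catch (RuntimeException e) {
            permits.release(); // the write never started, so give the permit back
            throw e;
        }
        // Release the permit when the write finishes, whether it succeeded or failed.
        return write.whenComplete((result, error) -> permits.release());
    }
}
```

The right value for `maxInFlight` depends on the cluster and has to be found by load testing; the sketch only guarantees that the application never exceeds it.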

Timeout to read from Alluxio

I encountered this error while performing a Presto query on Alluxio. What does this timeout mean, and how can I fix it?
com.facebook.presto.spi.PrestoException: Error opening Hive split alluxio://xxxxx:19998/s3/data/m-00020 (offset=134217728,
length=67108864) using org.apache.hadoop.mapred.TextInputFormat:
Timeout to read 39963328512 from [id: 0x23615709, L:/xxxxx:34740 -
R:xxxxx/xxxxx:29999]
You will receive this error when the Alluxio worker takes too long (configurable through alluxio.user.network.netty.timeout) to provide data to the client.
One simple workaround is to increase the timeout.
However, this is generally a symptom of the worker being overloaded in some way. Common things to check in your setup:
Alluxio worker load: this can be a problem if your compute is co-located with the workers and there is no resource management.
Load/bandwidth between the Alluxio worker and the under file system: this is often a bottleneck for remote storage such as object stores.
If these are bottlenecks, you can try reducing the concurrency or increasing the number of nodes in your cluster.

Hazelcast - OperationTimeoutException

I am using Hazelcast version 3.3.1.
I have a 9 node cluster running on aws using c3.2xlarge servers.
I am using a distributed executor service and a distributed map.
Distributed executor service uses a single thread.
Distributed map is configured with no replication and no near-cache and stores about 1 million objects of size 1-2kb using Kryo serializer.
My use case is as follows:
All 9 nodes constantly execute a synchronous remote operation on the distributed executor service and generate about 20k hits per second (about ~2k per node).
Invocations are executed using Hazelcast API: com.hazelcast.core.IExecutorService#executeOnKeyOwner.
Each operation accesses the distributed map on the node owning the partition, does some calculation using the stored object and stores the object back into the map (for that I use the get and set API of the IMap object).
Every once in a while Hazelcast encounters timeout exceptions such as:
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:mapService', op=GetOperation{}, partitionId=212, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[172.31.44.2]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received! backups-expected:0 backups-completed: 0
In some cases I see map partitions start to migrate, which makes things even worse: nodes constantly leave and re-join the cluster, and the only way I can overcome the problem is by restarting the entire cluster.
I am wondering what may cause Hazelcast to block a map-get operation for 120 seconds?
I am pretty sure it's not network related since other services on the same servers operate just fine.
Also note that the servers are mostly idle (~70%).
Any feedback on my use case will be highly appreciated.
Why don't you make use of an entry processor? It is also sent to the machine owning the partition, and the load, modify, store sequence is done automatically and atomically, so there are no race problems. It will probably outperform the current approach significantly since there is less remoting involved.
The fact that the map.get is not returning for 120 seconds is indeed very confusing. If you switch to Hazelcast 3.5, we added some logging/debugging support for this: the slow operation detector (executing side) and the slow invocation detector (caller side) should give you some insight into what is happening.
Do you see any Health monitor logs being printed?
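The benefit of the entry processor suggestion above is that the read, the calculation and the write happen as one atomic step next to the data, instead of a separate get and set. The sketch below is not Hazelcast code; it is a single-JVM analogy using `ConcurrentHashMap.compute` from the standard library, with a hypothetical key and a counter as the "calculation". In Hazelcast 3.x the corresponding call would be `IMap.executeOnKey` with an `EntryProcessor`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

// Single-JVM analogy for an entry processor: the whole read-modify-write
// runs atomically for the key, so concurrent updates cannot be lost.
final class AtomicUpdateDemo {

    static int countWithCompute(int updates) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        IntStream.range(0, updates).parallel().forEach(i ->
                // compute() applies the function atomically for this key,
                // unlike a separate get() followed by put().
                map.compute("counter", (key, old) -> old == null ? 1 : old + 1));
        return map.get("counter");
    }

    public static void main(String[] args) {
        System.out.println(countWithCompute(10_000));
    }
}
```

A separate get followed by set from several threads could lose updates in the same scenario, which is the race the entry processor avoids.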

How will data stay consistent on a Cassandra cluster?

I have a question after reading the DataStax documentation about Cassandra write consistency. How will Cassandra maintain a consistent state in the following scenario?
Write consistency level = Quorum
replication factor = 3
As per the docs, when a write occurs, the coordinator node sends the write request to all replicas in the cluster. If one replica succeeds and the others fail, the coordinator node sends an error response back to the client, but node-1 has successfully written the data and that write will not be rolled back.
In this case:
Will read repair (or hinted handoff, or nodetool repair) replicate the inconsistent data from node-1 to node-2 and node-3?
If not, how does Cassandra avoid replicating the inconsistent data to the other replicas?
Can you please clarify this?
You are completely right: read repair or the other mechanisms will update node-2 and node-3.
This means that even a failed write will eventually reach the other nodes (if at least one replica succeeded). Cassandra doesn't have anything like the rollback that relational databases have.
I don't see anything wrong here: the system does what you tell it to, i.e., two replicas override one, and since the error sent back to the client says "fail", the final state after read repair should also be "fail".
The Cassandra coordinator node keeps the data for the failed replica in its own storage and retries periodically (3 times or so). If a retry succeeds, it sends the latest data; otherwise it truncates the data in its storage.
For a read query, the coordinator node sends requests to all the replica nodes and compares their results. If one of the replica nodes does not return the latest data, the coordinator sends a read repair command to that node to keep the nodes in sync.
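The scenario in the question can be checked with a little arithmetic. Cassandra computes a quorum as floor(RF / 2) + 1, so with a replication factor of 3 a QUORUM write needs 2 acknowledgements, and a single successful replica is reported to the client as a failure. The sketch below only illustrates that arithmetic; the class and method names are hypothetical.

```java
// Sketch: quorum arithmetic for the scenario in the question.
final class Quorum {

    // Number of replicas that must respond for QUORUM at a given replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1; // integer division acts as floor
    }

    // A QUORUM write is reported as successful only if enough replicas acknowledged it.
    static boolean writeSucceeds(int acks, int replicationFactor) {
        return acks >= quorum(replicationFactor);
    }

    // Reads are guaranteed to overlap the latest successful write when R + W > RF.
    static boolean readOverlapsWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));           // 2
        System.out.println(writeSucceeds(1, 3)); // false: only node-1 acknowledged
        System.out.println(writeSucceeds(2, 3)); // true
    }
}
```

The failed write on node-1 still exists on disk, which is why read repair or hinted handoff can later propagate it, as described above.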

Cassandra - reading with consistency level ONE

How is reading with CL ONE implemented by Cassandra?
Does the coordinator query all replicas and wait for the first to answer?
According to the documentation, the coordinator should query the single closest replica. What happens if a timeout occurs during this query: does it try another replica, or does it return an error to the client?
Does the coordinator query all replicas and wait for the first to answer?
As you mentioned, it queries the closest node, as determined by the snitch.
What happens if a timeout occurs during this query?
There is additional documentation on the Dynamic Snitch, which states that:
By default, all snitches also use a dynamic snitch layer that monitors
read latency and, when possible, routes requests away from
poorly-performing nodes.
By that definition, if the node chosen by the snitch should fail, the snitch should route the transaction to the [next] closest node.
Note that as of 2.0.2, Cassandra has a feature called Rapid Read Protection, which:
[A]llows Cassandra to tolerate node failure without dropping a single request
