I have a table which has around 5,00,000 rows.
I am willing to fetch data from this table using multithreading, in a batch of 50,000 each,
as in each thread should 50,000 rows. Each thread's should be unique.
I was able to create a thread :
#Async
public CompletableFuture<List<Employee>> findAllFromEmployee() {
final List<Employee> employees = employeeRepository.findAll();
return CompletableFuture.completedFuture(employees);
}
I defined task executor like this :
#Bean(name = "taskExecutor")
public Executor taskExecutor() {
System.out.println("Creating Async Task Executor");
final ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(2);
executor.setMaxPoolSize(2);
executor.setQueueCapacity(100);
executor.setThreadNamePrefix("EmployeeThread-");
executor.initialize();
return executor;
}
I am not able understand how should I make sure that each threads reads only 50,000 fields.
Thanks!!
Order your results from the table in a consistent manner (eg order by primary key). Assuming your primary key is a field called "id", you can use this:
employeeRepository.findAllOrderById()
Then from row 0 to 50,000, process them with Thread #1, 50,001 to 100,000 goes to Thread #2 and so on.
Related
I have a got a problem when I use JdbcCursorItemReader its fetching same records in the threads and skip those nos of equal records.
Suppose there are 2000 records, with chunk size of 500 and below taskexecutor, we getting duplicate records in the all three threads and skip few records with verifyCursorPosition(false)
But, with verifyCursorPosition(true) the o/p that I received is proper but at the end it throws an error org.springframework.dao.InvalidDataAccessResourceUsageException: Unexpected cursor position change. resulting into FAILED state of the JOB.
Can some one has face same problem or have any alternate solution for the same which includes multiple threads and chunks
`#Bean(name = "SteptaskPool")
public TaskExecutor threadPoolTaskExecutor(){
ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setCorePoolSize(3);
taskExecutor.setMaxPoolSize(3);
taskExecutor.setQueueCapacity(8);
taskExecutor.setThreadNamePrefix("Stepprocess-");
taskExecutor.setWaitForTasksToCompleteOnShutdown(Boolean.TRUE);
return taskExecutor;
}
`
I am trying to insert data into Cassandra local cluster using async execution and version 4 of the driver (as same as my Cassandra instance)
I have instantiated the cql session in this way:
CqlSession cqlSession = CqlSession.builder()
.addContactEndPoint(new DefaultEndPoint(
InetSocketAddress.createUnresolved("localhost",9042))).build();
Then I create a statement in an async way:
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
CompletableFuture<AsyncResultSet>[] result = data.stream().map(
(d) -> session.executeAsync(
ps.bind(d.p1,d.p2,d.p3,d.p4)
)
).toCompletableFuture()
).toArray(CompletableFuture[]::new);
return CompletableFuture.allOf(result);
}
);
data is a dynamic list filled with user data.
When I exec the code I get the following exception:
Caused by: com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was available to execute the query
at com.datastax.oss.driver.api.core.AllNodesFailedException.fromErrors(AllNodesFailedException.java:53)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.sendRequest(CqlPrepareHandler.java:210)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.onThrottleReady(CqlPrepareHandler.java:167)
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.<init>(CqlPrepareHandler.java:153)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:66)
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.process(CqlPrepareAsyncProcessor.java:33)
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:210)
at com.datastax.oss.driver.api.core.cql.AsyncCqlSession.prepareAsync(AsyncCqlSession.java:90)
The node is active and some data are inserted before the exception rise. I have also tried to set up a data center name on the session builder without any result.
Why this exception rise if the node is up and running? Actually I have only one local node and that could be a problem?
The biggest thing that I don't see, is a way to limit the current number of active async threads.
Basically, if that (mapped) data stream gets hit hard enough, it'll basically create all of these new threads that it's awaiting. If the number of writes coming in from those threads creates enough back-pressure that node can't keep up or catch up to, the node will become overwhelmed and not accept requests.
Take a look at this post by Ryan Svihla of DataStax:
Cassandra: Batch Loading Without the Batch — The Nuanced Edition
Its code is from the 3.x version of the driver, but the concepts are the same. Basically, provide some way to throttle-down the writes, or limit the number of "in flight threads" running at any given time, and that should help greatly.
Finally, I have found a solution using BatchStatement and a little custom code to create a chucked list.
int chunks = 0;
if (data.size() % 100 == 0) {
chunks = data.size() / 100;
} else {
chunks = (data.size() / 100) + 1;
}
final int finalChunks = chunks;
return session.prepareAsync(
"insert into table (p1,p2,p3, p4) values (?, ?,?, ?)")
.thenComposeAsync(
(ps) -> {
AtomicInteger counter = new AtomicInteger();
final List<CompletionStage<AsyncResultSet>> batchInsert = data.stream()
.map(
(d) -> ps.bind(d.p1,d.p2,d.p3,d.p4)
)
.collect(Collectors.groupingBy(it -> counter.getAndIncrement() / finalChunks))
.values().stream().map(
boundedStatements -> BatchStatement.newInstance(BatchType.LOGGED, boundedStatements.toArray(new BatchableStatement[0]))
).map(
session::executeAsync
).collect(Collectors.toList());
return CompletableFutures.allSuccessful(batchInsert);
}
);
I have a multiple streams (N) which should update the same cache. So, assume, that there is at least N threads. Each thread may process values with similar keys. The problem is that if i do update as following:
1. Read old value from cache (multiple threads get the same old value)
2. Merge new value with old value (each thread update old value)
3. Save updated value back to the cache (only the last update was saved, another one is lost)
i can lost some updates if multiple threads will simultaneously try to update the same record. At first glance, there is a solution to make all updates atomic: for example, use Increment mutation in hbase or add in aerospike (currently, i'm considering these caches for my case). If value consists only of numeric primitive types, then it is ok, because both cache implementations support atomic inc/dec.
1. Inc/dec each value (cache will resolve sequence of this ops by it's self)
But what if value consists not only of primitives? Then i have to read value and update it in my code. In this case i still can lose some updates.
As i wrote, currently i'm considering hbase and aerospike, but both not fully fit for my case. In hbase, as i know, there is no way to lock row from client side (> ~0.98), so i have to use checkAndPut operation for each complex type. In aerospike i can achieve something like row-based lock using lua udfs, but i want to avoid them. Redis allow to watch record and if there is was update from another thread the transaction will fail and i can catch this error and try again.
So, my question is how to achieve something like row-based lock for such updates and is row-based lock will be a correct way? Maybe there is another approach?
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("sample")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Duration(500))
val source = Source()
val stream = source.stream(ssc)
stream.foreachRDD(rdd => {
if (!rdd.isEmpty()) {
rdd.foreachPartition(partition => {
if (partition.nonEmpty) {
val cache = Cache()
partition.foreach(entity=> {
// in this block if 2 distributed workers (in case of apache spark, for example)
//will process entities with the same keys i can lose one of this update
// worker1 and worker2 will get the same value
val value = cache.get(entity.key)
// both workers will update this value but may get different results
val updatedValue = ??? // some non-trivial update depends on entity
// for example, worker1 put new value, then worker2 put new value. In this case only updates from worker2 are visible and updates from worker1 are lost
cache.put(entity.key, updatedValue)
})
}
})
}
})
ssc.start()
ssc.awaitTermination()
}
So, in case if i use kafka as source i can workaround this if messages are partitioned by keys. In this case i can rely on the fact that only 1 worker will process particular record at any point of time. But how to handle the same situation when messages partitioned randomly (key is inside message body)?
In my Java application accessing Cassandra, it can insert 500 rows per second, but only update 50 rows per second(actually the updated rows didn't exist).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
if (prepared == null)
prepared = getSession().prepare(insertQuery);
count += data.size();
for (int i = 0; i < data.size(); i++) {
List<Object> objs = getFiledValues(data.get(i));
BoundStatement bs = prepared.bind(objs.toArray());
getSession().execute(bs);
}
}
public void UpdateOneField(Data data) {
String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
BoundStatement bs = prepared.bind(data.getC(), data.getE(),
data.getD(), data.getA(), data.getS());
getSession().execute(bs);
}
public void UpdateOne(Data data) {
String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
if (prepared == null)
prepared = getSession().prepare(updateQuery);
......
BoundStatement bs = prepared.bind(objs2.toArray());
getSession().execute(bs);
}
Schema:
Create Table Data (
E,
D,
A,
S,
D,
C,
U,
S,
...
PRIMARY KEY ((E
D),
A,
S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another cassandra cluster. The result was different. UPDATE was as fast as INSERT. But it only INSERT/UPDATE 5 rows per second. This cassandra cluster is the DataStax Enterprise running on GCE(I used the default DataStax Enterprise on Google Cloud Launcher)
So I think it's probably that some configurations are the reasons. But I don't know what they are.
Conceptually UPDATE and INSERT are the same so I would expect similar performance. UPDATE doesn't check to see if the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepare a statement if it is not null. Is it possible the statement is being reprepared each time? That would add for a roundtrip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, where UpdateOne / UpdateOneField execute one statement. So if the statement were prepared every time, thats an invocation per update, where it's only done once per insert for a list.
Cassandra uses log-structured merge trees for an on-disk format, meaning all writes are done sequentially (the database is the append-only log). That implies a lower write latency.
At the cluster level, Cassandra is also able to achieve greater write scalability by partitioning the key space such that each machine is only responsible for a portion of the keys. That implies a higher write throughput, as more writes can be done in parallel.
I've huge tables in cassandra, more than 2 billions rows and increasing. The rows have a date field and it is following date bucket pattern so as to limit each row.
Even then, I've more than a million entries for a particular date.
I want to read and process rows for each day as fast as possible. What I am doing is that getting instance of com.datastax.driver.core.ResultSet and obtain iterator from it and share that iterator across multiple threads.
So, essentially I want to increase the read throughput. Is this the correct way? If not, please suggest a better way.
Unfortunately you cannot do this as is. The reason why is that a ResultSet provides an internal paging state that is used to retrieve rows 1 page at a time.
You do have options however. Since I imagine you are doing range queries (queries across multiple partitions), you can use a strategy where you submit multiple queries across token ranges at a time using the token directive. A good example of this is documented in Paging through unordered partitioner results.
java-driver 2.0.10 and 2.1.5 each provide a mechanism for retrieving token ranges from Hosts and splitting them. There is an example of how to do this in the java-driver's integration tests in TokenRangeIntegrationTest.java#should_expose_token_ranges():
PreparedStatement rangeStmt = session.prepare("SELECT i FROM foo WHERE token(i) > ? and token(i) <= ?");
TokenRange foundRange = null;
for (TokenRange range : metadata.getTokenRanges()) {
List<Row> rows = rangeQuery(rangeStmt, range);
for (Row row : rows) {
if (row.getInt("i") == testKey) {
// We should find our test key exactly once
assertThat(foundRange)
.describedAs("found the same key in two ranges: " + foundRange + " and " + range)
.isNull();
foundRange = range;
// That range should be managed by the replica
assertThat(metadata.getReplicas("test", range)).contains(replica);
}
}
}
assertThat(foundRange).isNotNull();
}
...
private List<Row> rangeQuery(PreparedStatement rangeStmt, TokenRange range) {
List<Row> rows = Lists.newArrayList();
for (TokenRange subRange : range.unwrap()) {
Statement statement = rangeStmt.bind(subRange.getStart(), subRange.getEnd());
rows.addAll(session.execute(statement).all());
}
return rows;
}
You could basically generate your statements and submit them in async fashion, the example above just iterates through the statements one at a time.
Another option is to use the spark-cassandra-connector, which essentially does this under the covers and in a very efficient way. I find it very easy to use and you don't even need to set up a spark cluster to use it. See this document for how to use the Java API.