Is BoundStatement really more efficient than SimpleStatement in Cassandra?

I used SimpleStatement and BoundStatement, respectively, to select data from Cassandra 100 times.
I found that BoundStatement didn't improve performance much.
However, this link says:
SimpleStatement: a simple implementation built directly from a character string. Typically used for queries that are executed only once or a few times.
BoundStatement: obtained by binding values to a prepared statement. Typically used for queries that are executed often, with different values.
In my case, why is the efficiency of SimpleStatement and BoundStatement nearly the same?
long start0 = System.currentTimeMillis();
long start;
DataEntity mc = null;
ResultSet results = null;
BoundStatement bs = null;
PK pk = null;
CassandraData dao = new CassandraData();
long end;
Session session = dao.getSession();
//SimpleStatement ss = new SimpleStatement("select * from test where E='" + pk.E + "' and D=" + pk.D + " and M=" + pk.M);
PreparedStatement prepared = session.prepare(
        "select * from test where E=? and D=? and M=?");
for (int i = 0; i < 100; i++) {
    start = System.currentTimeMillis();
    pk = ValidData.getOnePk();
    //results = session.execute(ss);
    bs = prepared.bind(pk.E, pk.D, pk.M);
    results = session.execute(bs);
    end = System.currentTimeMillis();
    logger.info("Show One:" + (end - start) / 1000.0 + "s elapsed.");
}
long end0 = System.currentTimeMillis();
logger.info("Show All:" + (end0 - start0) / 1000.0 + "s elapsed.");

It's because you are doing too few iterations. 100 isn't enough to see the difference.
But in the end they both do much the same job, except that the simple statement:
Makes only one trip to the Cassandra node.
Does not leave any prepared statement stored in the node's memory.
What makes the BoundStatement different is that you prepare it once, then bind and execute it multiple times.
Once you have prepared the BoundStatement, you are only sending the binding variables over the connection, not the whole statement. That is where the difference in efficiency should come from.
What else you could do is enable tracing and check how much time the statement actually spends and where.
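A minimal sketch of what that could look like with the DataStax Java driver, reusing the bs and session from the question's code:
// Enable tracing on the statement before executing it.
bs.enableTracing();
ResultSet rs = session.execute(bs);
// The trace is fetched from the system_traces tables on first access.
QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
logger.info("Server-side duration: " + trace.getDurationMicros() + " us");
for (QueryTrace.Event event : trace.getEvents()) {
    logger.info(event.getSourceElapsedMicros() + " us: " + event.getDescription());
}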
And for testing I would recommend doing some warm-up before actually starting to measure times, as there are a lot of factors that can affect the results.
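For example, a minimal warm-up pass (a sketch reusing the question's own prepared statement) could run before the timed loop:
// Warm-up: execute the same query a few hundred times before measuring,
// so connection setup, JIT compilation, and server caches settle first.
for (int i = 0; i < 500; i++) {
    PK warm = ValidData.getOnePk();
    session.execute(prepared.bind(warm.E, warm.D, warm.M));
}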
PS: I assume that ValidData.getOnePk() returns random data and you aren't hitting the Cassandra row cache.

Related

Why Is BuiltStatement more efficient than BoundStatement in Cassandra?

This link says:
BoundStatement: obtained by binding values to a prepared statement. Typically used for queries that are executed often, with different values.
BuiltStatement: a statement built with the QueryBuilder DSL. It can be executed directly like a simple statement, or prepared.
So in my opinion, BuiltStatement is equivalent to BoundStatement.
However, in my case I found that BuiltStatement was clearly more efficient than BoundStatement. Why did this happen?
public static void main(String[] args) {
    Data mc = null;
    ResultSet results = null;
    PK pk = null;
    CassandraData dao = new CassandraData();
    Session session = dao.getSession();
    long start, end;
    long start0 = System.currentTimeMillis();
    // PreparedStatement prepared = session.prepare(
    //         "select * from test where E=? and D=? and M=?");
    Statement statement = null;
    logger.info("Start:");
    for (int i = 0; i < 100; i++) {
        pk = ValidData.getOnePk();
        start = System.currentTimeMillis();
        // statement = prepared.bind(pk.E, pk.D, pk.M);
        // statement.setReadTimeoutMillis(100000);
        statement = getSelect(pk);
        results = session.execute(statement);
        end = System.currentTimeMillis();
        logger.info("Show OneKb:" + (end - start) / 1000.0 + "s.");
    }
    long end0 = System.currentTimeMillis();
    logger.info("Show OneKb Average:" + (end0 - start0) / 1000.0 / 100 + "s/OneKb.");
}

private static Statement getSelect(PK pk) {
    Select ss = QueryBuilder.select().from("test");
    ss.setConsistencyLevel(com.datastax.driver.core.ConsistencyLevel.ONE);
    ss.where(QueryBuilder.eq("E", pk.E))
      .and(QueryBuilder.eq("D", pk.D))
      .and(QueryBuilder.eq("M", pk.M)).limit(1)
      .setReadTimeoutMillis(100 * 1000);
    return ss;
}
I ran this case 100 times and the average time of BoundStatement was 1.316s and the average time of BuiltStatement was 0.199s.
I found where I was wrong.
When using BuiltStatement, I appended the limit(1) method to fetch only one record, but when using BoundStatement I didn't add LIMIT 1 to restrict the number of rows returned. It actually returned about 100 records on average, which is why it was slower.
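For an apples-to-apples comparison, the prepared statement needs the same LIMIT 1 clause. A minimal sketch of the corrected version (table and column names as in the question):
// Prepare once, with the same LIMIT 1 the QueryBuilder version uses.
PreparedStatement prepared = session.prepare(
        "select * from test where E=? and D=? and M=? limit 1");
// Bind and execute many times; only the bound values cross the wire.
BoundStatement bs = prepared.bind(pk.E, pk.D, pk.M);
ResultSet results = session.execute(bs);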

Spark - operate a non serialized method on every element in a JavaPairRDD

Is it possible to iterate over the elements in the JavaPairRDD but perform some calculation f() on one machine only?
Each element (x, y) in the JavaPairRDD is a record (features and label), and I would like to call some calculation f(x, y) which can't be serialized.
f() is a self-implemented algorithm that updates some weights $w_i$.
The weights $w_i$ should be the same object for the data from all partitions (rather than being broken up into tasks, each of which is operated on by an executor).
Update:
What I'm trying to do is run the "algo" f() 100 times, in such a way that every iteration of the algo is single-threaded but the whole 100 iterations run in parallel on different nodes. At a high level, the code is something like this:
JavaPairRDD<String, String> dataPairs = ... // Load the data
boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);
dataPairs.cache();
for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I would like algo.fit() to update the same weights $w_i$ instead of fitting separate weights for every partition.

Is the number of Parameters in the IN-Operator in Cassandra limited?

I have a pretty simple question which I can't find an answer to on the Internet or on Stack Overflow:
Is the number of parameters in the IN operator in Cassandra limited?
I have run some tests with a simple table with integer keys from 1 to 100000. If I put the keys from 0 to 1000 in my IN operator (like SELECT * FROM test.numbers WHERE id IN (0,..,1000)) I get the correct number of rows back. But for 0 to 100000, for example, I always get only 34464 rows back, and for 0 to 75000 it's 9464.
I am using the DataStax Java Driver 2.0, and the relevant code parts look like the following:
String query = "SELECT * FROM test.numbers WHERE id IN ?;";
PreparedStatement ps = iot.getSession().prepare(query);
BoundStatement bs = new BoundStatement(ps);
List<Integer> ints = new ArrayList<Integer>();
for (int i = 0; i < 100000; i++) {
    ints.add(i);
}
bs.bind(ints);
ResultSet rs = iot.getSession().execute(bs);
int rowCount = 0;
for (Row row : rs) {
    rowCount++;
}
System.out.println("counted rows: " + rowCount);
It's also possible that I'm binding the list of integers in the wrong way. If that's the case, I would appreciate any hints too.
I am using Cassandra 2.0.7 with CQL 3.1.1.
This is not a real limitation but a PreparedStatement one: in the native protocol version used by driver 2.0, the number of elements in a bound collection is serialized as an unsigned 16-bit integer, so a bound list wraps around past 65535 elements. 100000 mod 65536 = 34464 and 75000 mod 65536 = 9464, which matches exactly the row counts you are seeing.
Using a BuiltStatement and QueryBuilder I didn't have any of these problems.
Try it yourself:
// in() is statically imported from com.datastax.driver.core.querybuilder.QueryBuilder
List<Integer> l = new ArrayList<>();
for (int i = 0; i < 100000; i++) {
    l.add(i);
}
BuiltStatement bs = QueryBuilder.select().column("id").from("test.numbers")
        .where(in("id", l.toArray()));
ResultSet rs = Cassandra.DB.getSession().execute(bs);
System.out.println("counted rows: " + rs.all().size());
HTH,
Carlo
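If you would rather keep the prepared statement, one possible workaround (my own sketch, assuming the 16-bit collection-size limit described above is the cause) is to bind the IN list in chunks that stay below 65535 elements:
// Sketch: execute the prepared "IN ?" query in chunks so that no single
// bound list exceeds the 16-bit element-count limit.
int chunkSize = 60000; // comfortably below 65535
int rowCount = 0;
for (int from = 0; from < ints.size(); from += chunkSize) {
    List<Integer> chunk = ints.subList(from, Math.min(from + chunkSize, ints.size()));
    ResultSet rs = iot.getSession().execute(ps.bind(chunk));
    for (Row row : rs) {
        rowCount++;
    }
}
System.out.println("counted rows: " + rowCount);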

Replace While() loop with Timer to prevent the GUI from freezing [Multithreading?]

How can I use the Timer class and timer events to turn this loop into one that executes in chunks?
My current method of just running the loop keeps freezing up the Flash/AIR UI.
I'm trying to achieve pseudo-multithreading. Yes, this is from wavwriter.as:
// Write to file in chunks of converted data.
while (dataInput.bytesAvailable > 0)
{
    tempData.clear();
    // Resampling logic variables
    var minSamples:int = Math.min(dataInput.bytesAvailable / 4, 8192);
    var readSampleLength:int = minSamples; //Math.floor(minSamples/soundRate);
    var resampleFrequency:int = 100; // Every X frames drop or add frames
    var resampleFrequencyCheck:int = (soundRate - Math.floor(soundRate)) * resampleFrequency;
    var soundRateCeil:int = Math.ceil(soundRate);
    var soundRateFloor:int = Math.floor(soundRate);
    var jlen:int = 0;
    var channelCount:int = (numOfChannels - inputNumChannels);
    /*
    trace("resampleFrequency: " + resampleFrequency + " resampleFrequencyCheck: " + resampleFrequencyCheck
        + " soundRateCeil: " + soundRateCeil + " soundRateFloor: " + soundRateFloor);
    */
    var value:Number = 0;
    // Assumes data is in samples of float value
    for (var i:int = 0; i < readSampleLength; i += 4)
    {
        value = dataInput.readFloat();
        // Check for sanity of float value
        if (value > 1 || value < -1)
            throw new Error("Audio samples not in float format");
        // Special case with 8bit WAV files
        if (sampleBitRate == 8)
            value = (bitResolution * value) + bitResolution;
        else
            value = bitResolution * value;
        // Resampling logic for non-integer sampling rate conversions
        jlen = (resampleFrequencyCheck > 0 && i % resampleFrequency < resampleFrequencyCheck) ? soundRateCeil : soundRateFloor;
        for (var j:int = 0; j < jlen; j++)
        {
            writeCorrectBits(tempData, value, channelCount);
        }
    }
    dataOutput.writeBytes(tempData);
}
I once implemented pseudo-multithreading in AS3 by splitting the task into chunks, instead of splitting the data into chunks.
My solution might not be optimal, but it worked nicely for me in the context of performing a large Depth-First Search while allowing the Flash game to flow nicely.
Use a variable ticks to count computation "ticks", similar to CPU clock cycles. Every time you perform some operation, you increment this counter by 1. Increment it even more after a heavier operation is performed.
In specific parts of your code, insert checkpoints where you check if ticks > threshold, where threshold is a parameter you want to tune after you have this pseudo multithreading working.
If ticks > threshold at the checkpoint, you save the current state of your task, set ticks to zero, then exit the function.
The method has to be retried later, so here you employ a Timer with an interval parameter that should also be tuned later.
When restarting the method, use the saved state of your paused task to detect where your task should be resumed.
For your specific situation, I would suggest splitting the tasks of the for loops, instead of thinking about the while loop. The idea is to interrupt the for loops, remember their state, then continue from there after the resting interval.
To simplify, imagine that we have only the outermost for loop. A sketch of the new method is:
WhileLoop: while (dataInput.bytesAvailable > 0 && ticks < threshold)
{
    if (!didSubTaskA) {
        // do subtask A...
        ticks += 2;
        didSubTaskA = true;
    }
    if (ticks > threshold) {
        ticks = 0;
        restTimer.reset();
        restTimer.start(); // This dispatches an event that should trigger this method
        break WhileLoop;
    }
    for (var i:int = next_unused_i; i < readSampleLength; i += 4) {
        next_unused_i = i + 1;
        // do subtask B...
        ticks += 1;
        if (ticks > threshold) {
            ticks = 0;
            restTimer.reset();
            restTimer.start();
            break WhileLoop;
        }
    }
    next_unused_i = 0;
    didSubTaskA = false;
}
if (ticks > threshold) {
    ticks = 0;
    restTimer.reset();
    restTimer.start();
}
The variables ticks, threshold, restTimer, next_unused_i, and didSubTaskA are important and can't be local method variables. They could be static or class variables. Subtask A is the part where you set up the "Resampling logic variables"; the variables used there can't be local either, so make them class variables as well, so their values persist when you leave and re-enter the method.
You can make it look nicer by creating your own Task class and storing the whole interrupted state of your "threaded" algorithm there. Also, you could maybe make the checkpoint a local function.
I didn't test the code above so I can't guarantee it works, but the idea is basically that. I hope it helps.

Divide up money evenly in C# using a functional approach

I have these 2 values:
decimal totalAmountDue = 1332.29m;
short installmentCount = 3;
I want to create 3 installments with an even amount based on totalAmountDue (extra pennies are applied starting with the lowest installment number and going to the highest) using this class:
public class Installment
{
    public Installment( short installmentNumber, decimal amount )
    {
        InstallmentNumber = installmentNumber;
        Amount = amount;
    }

    public short InstallmentNumber { get; private set; }
    public decimal Amount { get; private set; }
}
The installments should be as follows:
{ InstallmentNumber = 1, Amount = 444.10m }
{ InstallmentNumber = 2, Amount = 444.10m }
{ InstallmentNumber = 3, Amount = 444.09m }
I am looking for an interesting way to create my 3 installments. Using a simple LINQ to objects method would be nice. I have been trying to understand more about functional programming lately and this seems like it could be a fairly good exercise in recursion. The only decent way I can think of doing this is with a traditional while or for loop at the moment...
There's not a whole lot here that is "functional". I would approach the problem like this:
var pennies = (totalAmountDue * 100) % installmentCount;
var monthlyPayment = totalAmountDue / installmentCount;
var installments = from installment in Enumerable.Range(1, installmentCount)
                   let amount = monthlyPayment + (Math.Max(pennies--, 0m) / 100)
                   select new Installment((short)installment, amount);
You might be able to work something out where you constantly subtract the previous payment from the total amount and do the division rounding up to the nearest penny. In F# (C# is too wordy for this) it might be something like:
let calculatePayments totalAmountDue installmentCount =
    let rec getPayments l (amountLeft:decimal) = function
        | 0 -> l
        | count ->
            let paymentAmount =
                (truncate (amountLeft / (decimal)count * 100m)) / 100m
            getPayments (new Installment(count, paymentAmount) :: l)
                        (amountLeft - paymentAmount)
                        (count - 1)
    getPayments [] totalAmountDue installmentCount
For those unfamiliar with F#, that code sets up a recursive function (getPayments) and bootstraps it with some initial values (an empty list and the starting amount and count). Using match expressions it defines a terminator (when the count reaches 0) that returns the list built so far. Otherwise it calculates the payment amount and calls itself recursively, adding the new installment to the front of the list, subtracting the payment amount from the amount left, and decrementing the count.
This actually builds the list in reverse (adding onto the front each time), so we throw away the extra pennies (the truncate) and eventually they catch up with us, so the penny rounding works as expected. This is obviously more math-intensive than the add/subtract code above, since we divide and multiply in every iteration, but it is fully recursive and takes advantage of tail recursion, so we'll never run out of stack.
The trouble with C# here is that you want a sequence of installments plus recursion, and there's no idiomatic built-in structure for doing that in C#. Here I'm using F#'s list, which is immutable and has O(1) prepend.
You could possibly build something using the Scan() method in the Reactive Extensions to pass state from one instance to another.
Talljoe,
I think you are pushing me in the right direction. The code below seems to work. I had to switch out how the penny math was working, but this looks pretty good (I think):
decimal totalAmountDue = 1332.29m;
short installmentCount = 8;

var pennies = (totalAmountDue * 100) % installmentCount;
var monthlyPayment = Math.Floor(totalAmountDue / installmentCount * 100);
var installments = from installmentNumber in Enumerable.Range(1, installmentCount)
                   let extraPenny = pennies-- > 0 ? 1 : 0
                   let amount = (monthlyPayment + extraPenny) / 100
                   select new Installment((short)installmentNumber, amount);
