cassandra 1.1.x get by composite key

Is it possible, using Hector or Astyanax, to get rows by composite keys (keys made of multiple columns, not the ones serialized into a single column)?
In cqlsh I created a simple column family:
CREATE COLUMNFAMILY kkvv (x int, y int, val1 varchar, val2 varchar, PRIMARY KEY (x,y));
According to the Cassandra Developer Center, the rows are stored with x as the key and the rest is stored in columns.
I can't figure out how to get a column slice for a given x and y.
Executing this CQL in Hector
cqlQuery.setQuery("select * from kkvv")
gives me rows:
Row(2,ColumnSlice([HColumn(x=2)]))
Row(10,ColumnSlice([HColumn(x=10)]))
and the cqlsh console gives:
x | y | val1 | val2
----+-----+-------+-----------
2 | 1 | v1_1 | v2_1
10 | 27 | v1_4b | v2_4b
10 | 91 | v1_4a | v2_4a
Has anyone managed to do that in any Cassandra client for Java?
Can I use Thrift for that, or is it a CQL-only feature?

There are two somewhat-different syntaxes at work here: CQL 2 and CQL 3. By default, a Cassandra connection expects CQL 2. CQL 2, though, doesn't understand composite key columnfamilies of the sort you've made here.
So you are apparently correctly using CQL 3 with cqlsh, since it's displaying your columns in a sane way, but you're not using it with Hector. I'm not sure whether Hector or Astyanax even support that yet. The latest release of the cassandra-jdbc driver does, so, if Hector and/or Astyanax use that, then they should work too.
There isn't (and probably won't be) any support in Thrift for treating composite-comparator columnfamilies as tables with multi-component primary keys, the way CQL 3 does it. Use CQL 3 if you want this.

Did you try the CompositeQuery.java example provided in the cassandra-tutorial project?
Also, have you read Introduction to Composite Columns by DataStax?

A good explanation of how rows with composite keys are stored in Cassandra is here.
In Astyanax and Hector I noticed a funny thing: when I first tried to connect, they used CQL 2. After I connected to Cassandra with CQL 3 through the raw Cassandra API (code in the example below), that setting was stored somewhere, and from then on Astyanax and Hector used CQL 3 instead of CQL 2. The connections were made as separate executions, so it couldn't have been stored on the client side... Does anyone have any thoughts on this?
The CQL version can be set on org.apache.cassandra.thrift.Cassandra.Client with the set_cql_version method.
If someone is interested, here is a working example using the pure Cassandra API:
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.Compression;
import org.apache.cassandra.thrift.CqlResult;
import org.apache.cassandra.thrift.CqlRow;
import org.apache.cassandra.thrift.InvalidRequestException;
import org.apache.cassandra.thrift.SchemaDisagreementException;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
public class KKVVGetter {
    private static Cassandra.Client client;
    private static TTransport transport;

    public static void main(String[] args) throws UnsupportedEncodingException, InvalidRequestException,
            UnavailableException, TimedOutException, SchemaDisagreementException, TException {
        transport = new TFramedTransport(new TSocket("localhost", 9160));
        TProtocol protocol = new TBinaryProtocol(transport);
        client = new Cassandra.Client(protocol);
        transport.open();
        client.set_cql_version("3.0.0");

        executeQuery("USE ks_test3");
        show("select x,y,val1,val2 from kkvv where x > 1 and x < 11 and y < 100 and y > 2");
        System.out.println("\n\n*****************************\n\n");
        show("select x,y,val1,val2 from kkvv");

        transport.close();
    }

    // Decodes a big-endian 4-byte integer column value.
    private static int toInt(byte[] bytes) {
        int result = 0;
        for (int i = 0; i < 4; i++) {
            result = (result << 8) + (bytes[i] & 0xFF);
        }
        return result;
    }

    private static CqlResult executeQuery(String query) throws UnsupportedEncodingException, InvalidRequestException,
            UnavailableException, TimedOutException, SchemaDisagreementException, TException {
        return client.execute_cql_query(ByteBuffer.wrap(query.getBytes("UTF-8")), Compression.NONE);
    }

    private static void show(String query) throws UnsupportedEncodingException, InvalidRequestException,
            UnavailableException, TimedOutException, SchemaDisagreementException, TException {
        CqlResult result = executeQuery(query);
        List<CqlRow> rows = result.getRows();
        System.out.println("rows: " + rows.size());
        for (CqlRow row : rows) {
            System.out.println("columns: " + row.getColumnsSize());
            for (Column c : row.getColumns()) {
                System.out.print(" " + new String(c.getName()));
                switch (new String(c.getName())) {
                    case "x":
                    case "y":
                        System.out.print(" " + toInt(c.getValue()));
                        break;
                    case "val1":
                    case "val2":
                        System.out.print(" " + new String(c.getValue()));
                        break;
                    default:
                        break;
                }
                System.out.println();
            }
        }
    }
}
This example uses the schema from the question.

Related

Insert is 10 times faster than Update in Cassandra. Is it normal?

My Java application accessing Cassandra can insert 500 rows per second, but only update 50 rows per second (actually, the rows being updated didn't exist yet).
Updating one hundred fields is as fast as updating one field.
I just use CQL statements in the Java application.
Is this situation normal? How can I improve my application?
public void InsertSome(List<Data> data) {
    String insertQuery = "INSERT INTO Data (E,D,A,S,C,......) values(?,?,?,?,?,.............); ";
    if (prepared == null)
        prepared = getSession().prepare(insertQuery);
    count += data.size();
    for (int i = 0; i < data.size(); i++) {
        List<Object> objs = getFiledValues(data.get(i));
        BoundStatement bs = prepared.bind(objs.toArray());
        getSession().execute(bs);
    }
}

public void UpdateOneField(Data data) {
    String updateQuery = "UPDATE Data set C=? where E=? and D=? and A=? and S=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    BoundStatement bs = prepared.bind(data.getC(), data.getE(),
            data.getD(), data.getA(), data.getS());
    getSession().execute(bs);
}

public void UpdateOne(Data data) {
    String updateQuery = "UPDATE Data set C=?,U=?,F........where E=? and D=? and A=? and S=? and D=?; ";
    if (prepared == null)
        prepared = getSession().prepare(updateQuery);
    ......
    BoundStatement bs = prepared.bind(objs2.toArray());
    getSession().execute(bs);
}
Schema:
CREATE TABLE Data (
    E,
    D,
    A,
    S,
    C,
    U,
    ...
    PRIMARY KEY ((E, D), A, S)
) WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 }
  AND compaction = { 'class' : 'LeveledCompactionStrategy' };
Another scenario:
I used the same application to access another Cassandra cluster. The result was different: UPDATE was as fast as INSERT, but it could only INSERT/UPDATE 5 rows per second. That Cassandra cluster is DataStax Enterprise running on GCE (I used the default DataStax Enterprise on Google Cloud Launcher).
So I think some configuration is probably the reason, but I don't know what it is.
Conceptually UPDATE and INSERT are the same so I would expect similar performance. UPDATE doesn't check to see if the data already exists (unless you are doing a lightweight transaction with IF EXISTS).
I noticed that each of your methods prepares the statement if it is null. Is it possible the statement is being re-prepared each time? That would add a round trip for every method invocation. I also noticed that InsertSome does multiple inserts per invocation, whereas UpdateOne / UpdateOneField execute one statement each. So if the statement were prepared every time, that's one extra round trip per update, whereas it's only done once per insert for a whole list.
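If the statements really are being re-prepared, one fix is to prepare them once and reuse the PreparedStatement for every bind. Here is a minimal sketch of that idea, assuming the DataStax Java driver and reusing the Data class and column names from the question (the DataDao wrapper itself is hypothetical):
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch only: DataDao is hypothetical; Data and its getters come from the question.
public class DataDao {
    private final Session session;
    private final PreparedStatement updateCStmt;

    public DataDao(Session session) {
        this.session = session;
        // Prepare once, up front, instead of checking and preparing inside every call.
        this.updateCStmt = session.prepare(
                "UPDATE Data SET C = ? WHERE E = ? AND D = ? AND A = ? AND S = ?");
    }

    public void updateOneField(Data data) {
        // Binding is local and cheap; only execute() costs a round trip.
        BoundStatement bs = updateCStmt.bind(
                data.getC(), data.getE(), data.getD(), data.getA(), data.getS());
        session.execute(bs);
    }
}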
Cassandra uses log-structured merge trees as its on-disk format, meaning all writes are done sequentially (the database is an append-only log). That implies lower write latency.
At the cluster level, Cassandra is also able to achieve greater write scalability by partitioning the key space such that each machine is only responsible for a portion of the keys. That implies a higher write throughput, as more writes can be done in parallel.

Why am I getting an InvalidTypeException whenever I use row.getToken("fieldname")?

For the following piece of code, I am getting an InvalidTypeException whenever I use row.getToken("fieldname"):
Record RowToRecord(Row rw) {
    ColumnDefinitions cd = rw.getColumnDefinitions();
    Record rec = new Record();
    int i;
    for (i = 0; i < cd.size(); i++) {
        rec.fields.add(cd.getName(i));
        System.out.println(cd.getName(i));
        //System.out.println((rw.getToken(cd.getName(i))).getValue());
        Token tk = rw.getToken(cd.getName(i)); //// InvalidTypeException on this line.
        //System.out.println(tk.getValue()+" "+tk.getType().toString());
        rec.values.add(tk.getValue());
        rec.types.add(tk.getType().toString());
        //Token tk = new Token();
    }
    return rec;
}
getToken is meant to be called on a column that contains a Cassandra token. In 99% of cases, that will be the result of a call to the token() CQL function, for example the first column in this query:
select token(id), col1 from my_table where id = ...
Your code is calling it for all columns, which will fail as soon as you have a column that doesn't match the CQL type for tokens.
That CQL type depends on the partitioner used in your cluster:
murmur3 partitioner (the default): token(...) will return a BIGINT
random partitioner: VARINT
ordered partitioner: BLOB
In theory you can call getToken on any column with this type (although in practice it probably only makes sense for columns that are the result of a token() call, as explained above).
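For illustration, here is a minimal sketch of the difference, assuming the Murmur3 partitioner and a hypothetical my_table (id uuid PRIMARY KEY, col1 text) matching the query above, with an assumed keyspace name. Regular columns are read with their typed getters; getToken is only called on the token(id) column:
import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Token;

// Sketch only: "my_keyspace" and my_table (id uuid PRIMARY KEY, col1 text) are assumptions.
public class TokenExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        Row row = session.execute(
                "SELECT token(id), id, col1 FROM my_table LIMIT 1").one();

        // Column 0 is the result of token(), so getToken is valid here.
        Token token = row.getToken(0);
        System.out.println("token: " + token.getValue());

        // Regular columns must use the getter matching their CQL type,
        // not getToken -- calling getToken on them triggers the InvalidTypeException.
        UUID id = row.getUUID("id");
        String col1 = row.getString("col1");
        System.out.println(id + " " + col1);

        cluster.close();
    }
}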

Spark connector: Partition usage and performance issue

I am trying to run a Spark job (which talks to Cassandra) to read data, do some aggregation, and then write the aggregates back to Cassandra.
I have 2 tables (monthly_active_users (MAU), daily_user_metric_aggregates (DUMA)).
For every record in MAU, there will be one or more records in DUMA.
Get every record in MAU, fetch its user_id, then find the records in DUMA for that user (with server-side filters applied, like metric_name in ('ms', 'md')).
If there are one or more records in DUMA for the specified where clause, then I need to increment the count in the appMauAggregate map (app-wise MAU counts).
I tested this algorithm and it works as expected, but I wanted to find out:
1) Is this an optimized algorithm, or is there a better way to do it? I have a sense that something is not right, and I am not seeing speedups. It looks like a Cassandra client is being created and shut down for each Spark action (collect), and it takes a long time to process a small dataset.
2) The Spark workers are not co-located with Cassandra, meaning a Spark worker runs in a different node (container) than the C* nodes (we may move the Spark workers to the C* nodes for data locality).
3) I see a Spark job being created/submitted for every Spark action (collect), and I believe that is expected behavior from Spark. Is there any way to cut down the reads from C* and use joins so that data retrieval is fast?
4) What are the downsides of this algorithm? Can you recommend a better design approach, i.e. w.r.t. partition strategy, loading C* partitions onto Spark partitions, and executor/driver memory requirements?
5) As long as the algorithm and design approach are fine, I can play around with Spark tuning. I am using 5 workers (each with 16 CPUs and 64 GB RAM).
C* Schema :
MAU:
CREATE TABLE analytics.monthly_active_users (
    month text,
    app_id uuid,
    user_id uuid,
    PRIMARY KEY (month, app_id, user_id)
) WITH CLUSTERING ORDER BY (app_id ASC, user_id ASC)
data:
cqlsh:analytics> select * from monthly_active_users limit 2;
month | app_id | user_id
--------+--------------------------------------+--------------------------------------
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1
2015-2 | 108eeeb3-7ff1-492c-9dcd-491b68492bf2 | 2c70a31a-031c-4dbf-8dbd-e2ce7bdc2bc7
DUMA:
CREATE TABLE analytics.daily_user_metric_aggregates (
    metric_date timestamp,
    user_id uuid,
    metric_name text,
    "count" counter,
    PRIMARY KEY (metric_date, user_id, metric_name)
) WITH CLUSTERING ORDER BY (user_id ASC, metric_name ASC)
data:
cqlsh:analytics> select * from daily_user_metric_aggregates where metric_date='2015-02-08' and user_id=199c0a31-8e74-46d9-9b3c-04f67d58b4d1;
metric_date | user_id | metric_name | count
--------------------------+--------------------------------------+-------------------+-------
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | md | 1
2015-02-08 | 199c0a31-8e74-46d9-9b3c-04f67d58b4d1 | ms | 1
Spark Job :
import java.net.InetAddress
import java.util.concurrent.atomic.AtomicLong
import java.util.{Date, UUID}
import com.datastax.spark.connector.util.Logging
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.{DateTime, DateTimeZone}
import scala.collection.mutable.ListBuffer
object MonthlyActiveUserAggregate extends App with Logging {
  val KeySpace: String = "analytics"
  val MauTable: String = "mau"

  val CassandraHostProperty = "CASSANDRA_HOST"
  val CassandraDefaultHost = "127.0.0.1"
  val CassandraHost = InetAddress.getByName(sys.env.getOrElse(CassandraHostProperty, CassandraDefaultHost))

  val conf = new SparkConf().setAppName(getClass.getSimpleName)
    .set("spark.cassandra.connection.host", CassandraHost.getHostAddress)

  lazy val sc = new SparkContext(conf)

  import com.datastax.spark.connector._

  def now = new DateTime(DateTimeZone.UTC)

  val metricMonth = now.getYear + "-" + now.getMonthOfYear

  private val mauMonthSB: StringBuilder = new StringBuilder
  mauMonthSB.append(now.getYear).append("-")
  if (now.getMonthOfYear < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getMonthOfYear).append("-")
  if (now.getDayOfMonth < 10) mauMonthSB.append("0")
  mauMonthSB.append(now.getDayOfMonth)
  private val mauMonth: String = mauMonthSB.toString()

  val dates = ListBuffer[String]()
  for (day <- 1 to now.dayOfMonth().getMaximumValue) {
    val metricDate: StringBuilder = new StringBuilder
    metricDate.append(now.getYear).append("-")
    if (now.getMonthOfYear < 10) metricDate.append("0")
    metricDate.append(now.getMonthOfYear).append("-")
    if (day < 10) metricDate.append("0")
    metricDate.append(day)
    dates += metricDate.toString()
  }

  private val metricName: List[String] = List("ms", "md")

  val appMauAggregate = scala.collection.mutable.Map[String, scala.collection.mutable.Map[UUID, AtomicLong]]()

  case class MAURecord(month: String, appId: UUID, userId: UUID) extends Serializable
  case class DUMARecord(metricDate: Date, userId: UUID, metricName: String) extends Serializable
  case class MAUAggregate(month: String, appId: UUID, total: Long) extends Serializable

  private val mau = sc.cassandraTable[MAURecord]("analytics", "monthly_active_users")
    .where("month = ?", metricMonth)
    .collect()

  mau.foreach { monthlyActiveUser =>
    val duma = sc.cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
      .where("metric_date in ? and user_id = ? and metric_name in ?", dates, monthlyActiveUser.userId, metricName)
      //.map(_.userId).distinct().collect()
      .collect()
    if (duma.length > 0) { // if user has `ms` for the given month
      if (!appMauAggregate.isDefinedAt(mauMonth)) {
        appMauAggregate += (mauMonth -> scala.collection.mutable.Map[UUID, AtomicLong]())
      }
      val monthMap: scala.collection.mutable.Map[UUID, AtomicLong] = appMauAggregate(mauMonth)
      if (!monthMap.isDefinedAt(monthlyActiveUser.appId)) {
        monthMap += (monthlyActiveUser.appId -> new AtomicLong(0))
      }
      monthMap(monthlyActiveUser.appId).incrementAndGet()
    } else {
      println(s"No message_sent in daily_user_metric_aggregates for user: $monthlyActiveUser")
    }
  }

  for ((metricMonth: String, appMauCounts: scala.collection.mutable.Map[UUID, AtomicLong]) <- appMauAggregate) {
    for ((appId: UUID, total: AtomicLong) <- appMauCounts) {
      println(s"month: $metricMonth, app_id: $appId, total: $total")
      val collection = sc.parallelize(Seq(MAUAggregate(metricMonth.substring(0, 7), appId, total.get())))
      collection.saveToCassandra(KeySpace, MauTable, SomeColumns("month", "app_id", "total"))
    }
  }

  sc.stop()
}
Thanks.
Your solution is the least efficient possible. You are performing a join by looking up each key one-by-one, avoiding any possible parallelization.
I've never used the Cassandra connector, but I understand it returns RDDs. So you could do this:
val mau: RDD[(UUID, MAURecord)] = sc
  .cassandraTable[MAURecord]("analytics", "monthly_active_users")
  .where("month = ?", metricMonth)
  .map(u => u.userId -> u) // Key by user ID.

val duma: RDD[(UUID, DUMARecord)] = sc
  .cassandraTable[DUMARecord]("analytics", "daily_user_metric_aggregates")
  .where("metric_date in ? and metric_name in ?", dates, metricName)
  .map(a => a.userId -> a) // Key by user ID.

// Count "duma" records per user ID, staying distributed
// (countByKey would pull the counts back to the driver as a Map).
val dumaCounts: RDD[(UUID, Long)] = duma.mapValues(_ => 1L).reduceByKey(_ + _)

// Join to "mau". This drops "mau" entries that have no count
// and "duma" entries that are not present in "mau".
val joined: RDD[(UUID, (MAURecord, Long))] = mau.join(dumaCounts)

// Get per-application counts.
val appCounts: RDD[(UUID, Long)] = joined
  .map { case (_, (mauRecord, _)) => mauRecord.appId -> 1L }
  .reduceByKey(_ + _)
There is a parameter, spark.cassandra.connection.keep_alive_ms, which controls how long the connection is kept open. Take a look at the documentation page.
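For illustration only, here is a minimal sketch of setting that property (shown with the Java Spark API; the contact point and the 60000 ms value are placeholder assumptions, not recommended values):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch only: host address and keep-alive value are placeholders.
public class KeepAliveConfigExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("MonthlyActiveUserAggregate")
                .set("spark.cassandra.connection.host", "127.0.0.1")
                // Keep the Cassandra connection open between tasks instead of
                // re-creating it for every Spark action.
                .set("spark.cassandra.connection.keep_alive_ms", "60000");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run jobs ...
        sc.stop();
    }
}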
If you colocate the Spark workers with the Cassandra nodes, the connector will take advantage of this and create partitions appropriately, so that an executor always fetches data from its local node.
I can see some design improvements you can make in the DUMA table: metric_date does not seem to be the best choice for the partition key. Consider making (user_id, metric_name) the partition key, because in that case you will not have to generate dates for the query; you will just need to put user_id and metric_name in the where clause. Moreover, you can add a month identifier to the primary key; then each partition will include only the information related to what you want to fetch with each query.
Anyway, join functionality in the Spark Cassandra Connector is currently being implemented (see this ticket).

Example about how to use map cql type with DataStax java driver

I am trying to use the DataStax Java driver to update and query a column family that has a map field. Does anyone have an example of how to use CQL collections with the DataStax Java Driver?
Thanks
I will add some examples of using the CQL collections with both simple and prepared statements to the current Java driver doc.
You can use CQL collections with prepared statements. There's an example in the Java driver doc in the quick start section.
http://docs.datastax.com/en/developer/java-driver/2.0/java-driver/quick_start/qsSimpleClientBoundStatements_t.html
Step 4 binds a Java HashSet object to a CQL set column in the songs table.
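For illustration, here is a minimal sketch of that pattern; the songs table layout (id uuid PRIMARY KEY, title text, tags set<text>) and the keyspace name are assumptions rather than the exact schema from the linked doc:
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch only: assumes songs (id uuid PRIMARY KEY, title text, tags set<text>) in keyspace "simplex".
public class SetBindExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("simplex");

        PreparedStatement insert = session.prepare(
                "INSERT INTO songs (id, title, tags) VALUES (?, ?, ?)");

        // A Java HashSet is bound directly to the CQL set<text> column.
        Set<String> tags = new HashSet<String>();
        tags.add("jazz");
        tags.add("2013");

        BoundStatement bound = insert.bind(UUID.randomUUID(), "La Petite Tonkinoise", tags);
        session.execute(bound);

        cluster.close();
    }
}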
Normally I'd ask what you've tried, but I know that this isn't in the DataStax documentation for the Java Driver. I'll go through what worked for me.
A couple of things to note:
The DataStax Cassandra Java class directed me to put my variables directly into the CQL text string (instead of binding the map). I'm guessing that binding collections wasn't working at the time of production (for the class videos).
Collections can't be queried using DevCenter, so you'll need to check their values via the command line with cqlsh if you want to see what they are outside your app.
To update an existing row (in the "users" table which has a Map<varchar,varchar> phone_numbers), try something like this:
String cqlQuery = "UPDATE users SET phone_numbers = phone_numbers + ";
cqlQuery += "{'" + phoneType + "':'" + phoneNumber+ "'} ";
cqlQuery += "WHERE username = ?";
PreparedStatement preparedStatement = getSession().prepare(cqlQuery);
BoundStatement boundStatement = preparedStatement.bind(user);
getSession().execute(boundStatement);
The better way to do this (assuming a Map<String,String> phoneNumbers), is to bind the collection to the prepared statement, like this:
String cqlQuery = "UPDATE users SET phone_numbers = ? ";
cqlQuery += "WHERE username = ?";
PreparedStatement preparedStatement = getSession().prepare(cqlQuery);
BoundStatement boundStatement = preparedStatement.bind(phoneNumbers,user);
getSession().execute(boundStatement);
Likewise, to read it back out:
String cqlQuery2 = "SELECT phone_numbers FROM users WHERE username = ?";
PreparedStatement preparedStatement2 = getSession().prepare(cqlQuery2);
BoundStatement boundStatement2 = preparedStatement2.bind(user);
ResultSet result2 = getSession().execute(boundStatement2);
Map<String,String> newMap = result2.one().getMap("phone_numbers", String.class, String.class);
They just covered this today in the (free) CAS101J class on DataStax Academy.
Here is how I did it. This shows the mapping for a tuple column in Cassandra as well as a map column. I was using Scala with the DataStax Cassandra Java Driver.
Needed imports
import java.lang
import java.text.SimpleDateFormat
import com.datastax.driver.core._
import com.datastax.driver.core.querybuilder.QueryBuilder
import scala.collection.Map
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.rdd.RDD
import scala.collection.JavaConversions._
Code snippet
val simpleDateFormat: SimpleDateFormat = new SimpleDateFormat("dd-MM-yyyy H:mm:ss")
val start_date: java.util.Date = simpleDateFormat.parse(val("StartTime").replaceAll(":(\\s+)", ":"))
val b_tuple= session.getCluster().getMetadata().newTupleType(DataType.cint(), DataType.cint(), DataType.text())
val time_tuple = session.getCluster().getMetadata().newTupleType(DataType.timestamp(), DataType.timestamp())
val time_tuple_value = time_tuple.newValue(start_date, end_date)
val b_tuple_value = b_tuple.newValue(b._1: lang.Integer, b._2: lang.Integer, val("xxx"))
val statement_2: Statement = QueryBuilder.insertInto("keyspace", "table_name")
.value("b_key", b_tuple_value)
.value("time_key", time_tuple_value)
.value("some_map", mapAsJavaMap(my_scala_map))
session.executeAsync(statement_2)

need a counter query which give all counter for a primary key using hector API

I am using the Hector API for Cassandra.
I created a counter table as follows:
private void addColumnFamilyCounter(ThriftCluster cluster, String cfName, int rowCacheKeysToSave) {
    String cassandraKeyspace = this.env.getProperty("cassandra.keyspace");
    ThriftCfDef cfd =
            new ThriftCfDef(cassandraKeyspace, cfName, ComparatorType.UTF8TYPE);
    cfd.setRowCacheKeysToSave(rowCacheKeysToSave);
    cfd.setDefaultValidationClass(ComparatorType.COUNTERTYPE.getClassName());
    cluster.addColumnFamily(cfd);
}
and call the above method as follows:
addColumnFamilyCounter(cluster, COUNTER_CF, 0);
The format of the table is as follows:
Primary key | Columns
Munich      | jingle : 1
            | mingle : 2
            | tingle : 1
            | pingle : 5
Now I want to execute a query to get all the columns and their values under Munich. Is there any way I can get all the columns?
What I know so far is the following query, but it gives me the value for only one combination of primary key and column key.
@Override
public long getTagCounter(String domain, String tag) {
    CounterQuery<String, String> counter =
            new ThriftCounterColumnQuery<String, String>(keyspaceOperator,
                    StringSerializer.get(),
                    StringSerializer.get());
    counter.setColumnFamily(TAG_COUNTER_CF).setKey("p_key").setName("name");
    return counter.execute().get().getValue();
}
Okay, I found the answer myself. I hope it will be helpful to others.
CounterSlice<String> query = HFactory.createCounterSliceQuery(keyspaceOperator, StringSerializer.get(), StringSerializer.get())
        .setColumnFamily("CF")
        .setKey("PK")
        .setRange(null, null, false, Integer.MAX_VALUE)
        .execute()
        .get();

for (HCounterColumn<String> col : query.getColumns()) {
    log.info(col.getName());
    log.info(col.getValue());
}
