Get Cassandra partitioner [duplicate]

I'm developing a mechanism for Cassandra using Hector.
What I need at this moment is to know the hash values of the keys, so I can look up which node stores each key (by comparing against each node's tokens) and ask that node for the value directly. As I understand it, where values are stored depends on the partitioner Cassandra uses, and differs from one partitioner to another. So, are the hash values of all keys stored in any table? If not, how could I implement a generic class such that, once I read from the system keyspace which partitioner Cassandra is using, the class could instantiate that partitioner without my having to modify the code for each partitioner? I would then call its getToken method to compute the hash value for a given key.

Hector's CqlQuery is poorly supported and buggy. You should use the native Java CQL driver instead: https://github.com/datastax/java-driver

You could just reuse the partitioners defined in Cassandra: https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/dht and then use the token ranges to do the routing.
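For illustration, here is a minimal sketch of that idea, assuming the cassandra-all jar is on the classpath. FBUtilities.newPartitioner is the helper Cassandra itself uses to instantiate a partitioner from its class name, so the value read from the system keyspace can be plugged in directly:

import java.nio.ByteBuffer;

import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.dht.Token;
import org.apache.cassandra.utils.FBUtilities;

public class TokenCalculator {
    private final IPartitioner partitioner;

    // partitionerClassName is the value read from the system keyspace,
    // e.g. "org.apache.cassandra.dht.Murmur3Partitioner"
    public TokenCalculator(String partitionerClassName) throws Exception {
        this.partitioner = FBUtilities.newPartitioner(partitionerClassName);
    }

    public Token getToken(byte[] rawKey) {
        return partitioner.getToken(ByteBuffer.wrap(rawKey));
    }
}

Comparing the returned token against each node's token range then tells you which replica to ask.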

The CQL driver offers token-aware routing out of the box. I would use that instead of trying to reinvent the wheel in Hector, especially since Hector uses the legacy Thrift API instead of CQL.
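For example, enabling it is just a matter of wrapping the load balancing policy when building the Cluster (a sketch; the contact point is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareExample {
    public static void main(String[] args) {
        // TokenAwarePolicy routes each statement to a replica that owns
        // the statement's partition key, falling back to the child policy
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .build();
        Session session = cluster.connect();
        // ... run queries; the driver handles the routing ...
        cluster.close();
    }
}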

Finally, after testing different implementations, I found a way to get the partitioner with the following code:
// ksp is a Hector Keyspace; it must point at the "system" keyspace for
// "select partitioner from local" to resolve
CqlQuery<String, String, String> cqlQuery = new CqlQuery<String, String, String>(
        ksp, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
cqlQuery.setQuery("select partitioner from local");
QueryResult<CqlRows<String, String, String>> result = cqlQuery.execute();
CqlRows<String, String, String> rows = result.get();
for (int i = 0; i < rows.getCount(); i++) {
    RowImpl<String, String, String> row = (RowImpl<String, String, String>) rows
            .getList().get(i);
    List<HColumn<String, String>> columns = row.getColumnSlice().getColumns();
    for (HColumn<String, String> c : columns) {
        System.out.println(c.getValue());
    }
}
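For comparison, with the native driver recommended above the same lookup is a one-liner (a sketch assuming an open com.datastax.driver.core.Session):

// `session` is an open Session connected to the cluster
String partitioner = session.execute("SELECT partitioner FROM system.local")
        .one().getString("partitioner");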

Related

How to create an update statement where a UDT value need to be updated using QueryBuilder

I have the following UDT:
CREATE TYPE tag_partitions(
    year bigint,
    month bigint);
and the following table
CREATE TABLE ${tableName} (
    tag text,
    partition_info set<FROZEN<tag_partitions>>,
    PRIMARY KEY ((tag))
)
The table schema is mapped using the following model:
case class TagPartitionsInfo(year: Long, month: Long)
case class TagPartitions(tag: String, partition_info: Set[TagPartitionsInfo])
I have written a function which should create an Update.IfExists query, but I don't know how to update the UDT value. I tried to use set but it isn't working.
def updateValues(tableName: String, model: TagPartitions, id: TagPartitionKeys): Update.IfExists = {
  val partitionInfoType: UserType = session.getCluster().getMetadata
    .getKeyspace("codingjedi").getUserType("tag_partitions")
  // create value
  // the logic below assumes that there is only one element in the set
  val partitionsInfoSet: Set[UDTValue] = model.partition_info.map((partitionInfo: TagPartitionsInfo) => {
    partitionInfoType.newValue()
      .setLong("year", partitionInfo.year)
      .setLong("month", partitionInfo.month)
  })
  println("partition info converted to UDTValue: " + partitionsInfoSet)
  QueryBuilder.update(tableName)
    .`with`(QueryBuilder.WHAT_TO_DO_HERE_TO_UPDATE_UDT("partition_info", partitionsInfoSet))
    .where(QueryBuilder.eq("tag", id.tag)).ifExists()
}
The mistake was that I was adding partitionsInfoSet to the table, but it is a Scala Set. I needed to convert it into a Java Set using setAsJavaSet (available via import scala.collection.JavaConversions.setAsJavaSet):
  QueryBuilder.update(tableName)
    .`with`(QueryBuilder.set("partition_info", setAsJavaSet(partitionsInfoSet)))
    .where(QueryBuilder.eq("tag", id.tag))
    .ifExists()
}
Although it doesn't answer your exact question, wouldn't it be easier to use the Object Mapper for this? Something like this (I haven't modified it heavily to match your code):
#UDT(name = "scala_udt")
case class UdtCaseClass(id: Integer, #(Field #field)(name = "t") text: String) {
def this() {
this(0, "")
}
}
#Table(name = "scala_test_udt")
case class TableObjectCaseClassWithUDT(#(PartitionKey #field) id: Integer,
udts: java.util.Set[UdtCaseClass]) {
def this() {
this(0, new java.util.HashSet[UdtCaseClass]())
}
}
and then just create an instance of the case class and call mapper.save on it. (Also note that you need to use Java collections unless you've imported Scala codecs.)
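For reference, a minimal sketch of the save path (shown in Java syntax; in Scala use classOf[TableObjectCaseClassWithUDT] instead of .class, and session is assumed to be an open com.datastax.driver.core.Session):

import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;

MappingManager manager = new MappingManager(session);
Mapper<TableObjectCaseClassWithUDT> mapper = manager.mapper(TableObjectCaseClassWithUDT.class);
mapper.save(row); // row is an instance of the annotated case class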
The primary reason for using the Object Mapper could be ease of use, and also better performance, because it uses prepared statements under the hood instead of built statements, which are much less efficient.
You can find more information about Object Mapper + Scala in an article that I wrote recently.

Cassandra datastax driver ResultSet sharing in multiple threads for fast reading

I have huge tables in Cassandra, more than 2 billion rows and growing. The rows have a date field and follow a date-bucket pattern so as to limit the size of each row.
Even then, I have more than a million entries for a particular date.
I want to read and process rows for each day as fast as possible. What I am doing is getting an instance of com.datastax.driver.core.ResultSet, obtaining an iterator from it, and sharing that iterator across multiple threads.
So, essentially I want to increase the read throughput. Is this the correct way? If not, please suggest a better way.
Unfortunately you cannot do this as is. The reason is that a ResultSet maintains internal paging state that is used to retrieve rows one page at a time.
You do have options however. Since I imagine you are doing range queries (queries across multiple partitions), you can use a strategy where you submit multiple queries across token ranges at a time using the token directive. A good example of this is documented in Paging through unordered partitioner results.
java-driver 2.0.10 and 2.1.5 each provide a mechanism for retrieving token ranges from Hosts and splitting them. There is an example of how to do this in the java-driver's integration tests in TokenRangeIntegrationTest.java#should_expose_token_ranges():
PreparedStatement rangeStmt = session.prepare(
        "SELECT i FROM foo WHERE token(i) > ? and token(i) <= ?");
TokenRange foundRange = null;
for (TokenRange range : metadata.getTokenRanges()) {
    List<Row> rows = rangeQuery(rangeStmt, range);
    for (Row row : rows) {
        if (row.getInt("i") == testKey) {
            // We should find our test key exactly once
            assertThat(foundRange)
                    .describedAs("found the same key in two ranges: " + foundRange + " and " + range)
                    .isNull();
            foundRange = range;
            // That range should be managed by the replica
            assertThat(metadata.getReplicas("test", range)).contains(replica);
        }
    }
}
assertThat(foundRange).isNotNull();
...
private List<Row> rangeQuery(PreparedStatement rangeStmt, TokenRange range) {
    List<Row> rows = Lists.newArrayList();
    for (TokenRange subRange : range.unwrap()) {
        Statement statement = rangeStmt.bind(subRange.getStart(), subRange.getEnd());
        rows.addAll(session.execute(statement).all());
    }
    return rows;
}
You could basically generate your statements and submit them asynchronously; the example above just executes the statements one at a time. A rough sketch of the async version follows.
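Something like this (same 2.1-era API; rangeStmt, metadata, and session are as in the snippets above):

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.TokenRange;

// bind one statement per unwrapped token range and submit them all at once
List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
for (TokenRange range : metadata.getTokenRanges()) {
    for (TokenRange subRange : range.unwrap()) {
        futures.add(session.executeAsync(
                rangeStmt.bind(subRange.getStart(), subRange.getEnd())));
    }
}
// collect the results as they complete
for (ResultSetFuture future : futures) {
    for (Row row : future.getUninterruptibly()) {
        // process row
    }
}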
Another option is to use the spark-cassandra-connector, which essentially does this under the covers and in a very efficient way. I find it very easy to use and you don't even need to set up a spark cluster to use it. See this document for how to use the Java API.
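To give an idea of how little code that takes, a minimal read through the connector's Java API might look like this (a sketch; host, keyspace, and table names are placeholders):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ConnectorReadExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-read")
                .setMaster("local[*]") // no spark cluster needed, as noted above
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // the connector splits the table by token range under the covers
        JavaRDD<CassandraRow> rows = javaFunctions(sc).cassandraTable("my_ks", "my_table");
        System.out.println(rows.count());
        sc.stop();
    }
}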

Spark Cassandra Connector keyBy and shuffling

I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
It is important to note that all the elements I need to group are on the same Cassandra node (they share the same row key), so I expect the connector to keep the data local.
The problem is that using groupBy or groupByKey causes shuffling. I would rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
        .cassandraTable(ks, "Profile")
        .groupBy(new Function<ColumnFamilyModel, String>() {
            @Override
            public String call(ColumnFamilyModel arg0) throws Exception {
                return arg0.getKey();
            }
        });
My questions are:
Will using keyBy on the RDD cause shuffling, or will it keep the data local?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand how to use it.
Thanks,
Shai
I think you are looking for spanByKey, a cassandra-connector-specific operation that takes advantage of the ordering provided by Cassandra to allow grouping of elements without incurring a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
.keyBy(row => (row.getString("key")))
.spanByKey
Read more in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
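Since your snippet uses the Java API: the same operation is exposed there as spanBy, so a rough equivalent (assuming a connector version whose japi exposes spanBy, and the same "key" column) would be:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

JavaPairRDD<String, Iterable<CassandraRow>> grouped = javaFunctions(context)
        .cassandraTable(ks, "Profile")
        .spanBy(new Function<CassandraRow, String>() {
            @Override
            public String call(CassandraRow row) {
                return row.getString("key");
            }
        }, String.class);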

How to update multiple rows using Hector

Is there a way I can update multiple rows in a Cassandra database using a column family template, e.g. by supplying a list of keys?
Currently I am using an updater with ColumnFamilyTemplate to loop through a list of keys and do an update for each row. I have seen queries like MultigetSliceQuery, but I don't know their equivalent for updates.
There is no utility method in ColumnFamilyTemplate that allows you to just pass a list of keys with a list of mutations in one call.
You can implement your own using mutators.
This is the basic code for how to do it in Hector:
Set<String> keys = MY_KEYS;
Map<String, String> pairsOfNameValues = MY_MUTATION_BY_NAME_AND_VALUE;

Set<HColumn<String, String>> columns = new HashSet<HColumn<String, String>>();
for (Entry<String, String> pair : pairsOfNameValues.entrySet()) {
    columns.add(HFactory.createStringColumn(pair.getKey(), pair.getValue()));
}

Mutator<String> mutator = template.createMutator();
String columnFamilyName = template.getColumnFamily();
for (String key : keys) {
    for (HColumn<String, String> column : columns) {
        mutator.addInsertion(key, columnFamilyName, column);
    }
}
mutator.execute();
Well, it should look something like that. This is an example for insertion; be sure to use the following methods for batch mutations:
mutator.addInsertion
mutator.addDeletion
mutator.addCounter
mutator.addCounterDeletion
since these ones execute right away, without waiting for mutator.execute():
mutator.incrementCounter
mutator.deleteCounter
mutator.insert
mutator.delete
As a last note: a mutator allows you to batch mutations on multiple rows in multiple column families at once, which is why I generally prefer them over CF templates. I have a lot of denormalization for functionality that uses the "push-on-write" pattern of NoSQL.
You can use a batch mutation to insert as much as you want (within thrift_max_message_length_in_mb). See http://hector-client.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/model/MutatorImpl.html.

Cassandra Hector: how to insert null as a column value?

A common use case with Cassandra is storing data in the column names of a dynamically created column family. In this situation the column values themselves are not needed, and a usual practice is to store nulls there.
However, when dealing with Hector, it seems there is no way to insert a null value, because Hector's HColumnImpl does an explicit null check in the column's constructor:
public HColumnImpl(N name, V value, long clock, Serializer<N> nameSerializer,
                   Serializer<V> valueSerializer) {
    this(nameSerializer, valueSerializer);
    notNull(name, "name is null");
    notNull(value, "value is null");
    this.column = new Column(nameSerializer.toByteBuffer(name));
    this.column.setValue(valueSerializer.toByteBuffer(value));
    this.column.setTimestamp(clock);
}
Are there any ways to insert nulls via Hector? If not, what is the best practice in the situation when you don't care about column values and need only their names?
Try using an empty byte[], i.e. new byte[0];
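For example, a minimal sketch with Hector's BytesArraySerializer (the column family, row key, and column name are placeholders, and keyspace is assumed to be an open Keyspace):

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
// store an empty byte[] as the value, keeping only the column name
mutator.insert("row-key", "my_cf",
        HFactory.createColumn("column-name", new byte[0],
                StringSerializer.get(), BytesArraySerializer.get()));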
