Astyanax key range query - Cassandra

I am trying to write a query that will paginate through all rows in a column family using the Astyanax client and RowSliceQuery:
keyspace.prepareQuery(COLUMN_FAMILY).getKeyRange(null, null, null, null, 100);
I have done this successfully with Hector, where the first call is made with null start and end keys. After retrieving the first page I use the last key from the result to query for the second page, and so on. This is the code for the first page using Hector:
HFactory.createRangeSlicesQuery(keyspace,
        LongSerializer.get(), new CompositeSerializer(),
        BytesArraySerializer.get())
    .setColumnFamily(COLUMN_FAMILY)
    .setRange(null, null, false, 100)
    .setRowCount(100);
Now, when I try to do this with Astyanax, I get errors about null and non-null keys and tokens, and I am not sure what the tokens do in this query. I am also able to use getAllRows(), but I would like to do this with a key range query because it gives me more flexibility.
Does anybody have an example of a key range query using Astyanax? I cannot find one in the "getting started" documentation or anywhere else on the net.
Thanks!
Anton

What you are referring to is the getRowRange method:
keyspace.prepareQuery(CF_STANDARD1)
    .getRowRange(startKey, endKey, startToken, endToken, count)
Note, however, that this works only when the ByteOrderedPartitioner is used. Since Cassandra uses the Murmur3Partitioner by default, this will usually not work. Using an index instead is recommended. Astyanax also provides the reverse index search recipe, which takes advantage of a second column family that stores your keys as columns, enabling efficient range searches on the original data.
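If you are on the ByteOrderedPartitioner, a paging loop over getRowRange might look like the sketch below. This is an untested illustration, not a definitive implementation: COLUMN_FAMILY is assumed to be a ColumnFamily<Long, Composite> definition, and process(...) is a placeholder for your own row handling. Exception handling for execute() is omitted.
import com.netflix.astyanax.model.Composite;
import com.netflix.astyanax.model.Row;
import com.netflix.astyanax.model.Rows;

Long startKey = null;
int pageSize = 100;
while (true) {
    Rows<Long, Composite> rows = keyspace.prepareQuery(COLUMN_FAMILY)
            .getRowRange(startKey, null, null, null, pageSize)
            .execute()
            .getResult();
    for (Row<Long, Composite> row : rows) {
        process(row); // placeholder; note the range start is inclusive, so
                      // each page after the first repeats the previous last key
    }
    if (rows.size() < pageSize) {
        break; // last page reached
    }
    // Use the last key of this page as the start of the next request.
    startKey = rows.getRowByIndex(rows.size() - 1).getKey();
}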

Check this sample code. I hope it will help you with the paging.
IndexQuery<String, String> query = keyspace
.prepareQuery(CF_STANDARD1).searchWithIndex()
.setRowLimit(10).autoPaginateRows(true).addExpression()
.whereColumn("Index2").equals().value(42);
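For what it's worth, my understanding of autoPaginateRows(true) is that each subsequent execute() on the same query object returns the next page until an empty result comes back, so a paging loop could look roughly like this (untested sketch):
Rows<String, String> page;
while (!(page = query.execute().getResult()).isEmpty()) {
    for (Row<String, String> row : page) {
        System.out.println(row.getKey()); // replace with your own row handling
    }
}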
Best,

Related

Accumulo: Equivalent of MongoDB createIndex()

I am new to Accumulo and am trying to retrieve all row IDs corresponding to a column family/qualifier. In MongoDB, this could be done by creating an index on the field using createIndex(). Is there any way of doing the same in Accumulo?
Row IDs in Accumulo can be retrieved if you know the column family or qualifier, as shown below; for better performance in this case it is recommended to put the families you scan into locality groups. For searching on values, I recommend building a reverse index (see the sketch after the shell example).
Java code:
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

Scanner scan = conn.createScanner("tableName", new Authorizations());
// To scan on only a column family:
scan.fetchColumnFamily(new Text("CF"));
// To scan on both family and qualifier:
scan.fetchColumn(new Text("CF"), new Text("CQ"));
for (Entry<Key, Value> entry : scan) {
    System.out.println(entry.getKey().getRow());
}
Accumulo Shell
After selecting the specific table, use the command:
scan -c CF:CQ
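For the reverse index mentioned above, one possible sketch is to maintain a second table whose row IDs are the values, with the original row ID stored in the column qualifier. The table name tableName_index and the layout here are illustrative assumptions, not a fixed convention:
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// Write one index entry per (value, original row) pair.
BatchWriter writer = conn.createBatchWriter("tableName_index", new BatchWriterConfig());
Mutation m = new Mutation(new Text("someValue"));  // indexed value becomes the row ID
m.put(new Text("CF"), new Text("originalRowId"), new Value(new byte[0]));
writer.addMutation(m);
writer.close();
Scanning tableName_index with a row range of someValue then returns the original row IDs as column qualifiers.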

Paging through Cassandra using QueryBuilder

The DataStax documentation says that to page through all data, the following CQL query is useful:
SELECT * FROM test WHERE token(k) > token(42);
Is it possible to build this query using the QueryBuilder? It provides a token method, but that seems to work only on column names, not on values.
Ideally, the value (in the example: 42) is of type Object, just like in the eq/gte/lte functions.
Try using automatic paging with the .setFetchSize method. It uses token under the hood:
Automatic paging was introduced in Cassandra 2.0. It allows the developer to iterate over an entire ResultSet without having to care about its size: extra rows are fetched as the client code iterates over the results, while old ones are dropped. The number of rows to retrieve can be parameterized at query time. In the Java driver this looks like:
Statement stmt = new SimpleStatement("SELECT * FROM images");
stmt.setFetchSize(100);
ResultSet rs = session.execute(stmt);
Source: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
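To make that concrete, iterating the ResultSet is all it takes; a minimal sketch (assuming the 2.x Java driver):
import com.datastax.driver.core.Row;

for (Row row : rs) {
    // The driver transparently fetches the next 100 rows
    // whenever the current page is exhausted.
    System.out.println(row);
}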
QueryBuilder.fcall("token", value)
can solve the problem!
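Putting the pieces together, a sketch of building the original token query with the QueryBuilder might look like this (assuming the 2.x Java driver, where QueryBuilder.token(...) produces the token(k) column reference and fcall wraps the value side; untested):
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.querybuilder.QueryBuilder;

Statement stmt = QueryBuilder.select().all()
        .from("test")
        .where(QueryBuilder.gt(QueryBuilder.token("k"),
                               QueryBuilder.fcall("token", 42)));
// Should render roughly: SELECT * FROM test WHERE token(k) > token(42);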

Composite key in Cassandra with Pig and where_clause for part of the key in the where clause

I basically have the same problem as in Composite key in Cassandra with Pig. The only difference is that I try to query for a part of the composite key within the where_clause of Pig.
The data structure is similar to the one in the issue mentioned above; I'll copy some code/context to spare you reading that issue.
We have a CQL table that looks something like this:
CREATE TABLE data (
    occurday text,
    seqnumber int,
    occurtimems bigint,
    unique bigint,
    fields map<text, text>,
    PRIMARY KEY ((occurday, seqnumber), occurtimems, unique)
)
Instead of querying for both the seqnumber and the occurday (as was done in the issue mentioned above), I try to query on just one part of the partition key.
If I execute this query as part of a LOAD from within Pig, however, things don't work.
-- Need to URL encode the query
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlStorage();
gives
java.lang.RuntimeException
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:665)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.<init>(CqlPagingRecordReader.java:301)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:167)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: InvalidRequestException(why:occurday cannot be restricted by more than one relation if it includes an Equal)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:51017)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result$prepare_cql3_query_resultStandardScheme.read(Cassandra.java:50994)
at org.apache.cassandra.thrift.Cassandra$prepare_cql3_query_result.read(Cassandra.java:50933)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_prepare_cql3_query(Cassandra.java:1756)
at org.apache.cassandra.thrift.Cassandra$Client.prepare_cql3_query(Cassandra.java:1742)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.prepareQuery(CqlPagingRecordReader.java:605)
at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader$RowIterator.executeQuery(CqlPagingRecordReader.java:635)
... 7 more
Basically my question is: what am I doing wrong, or what don't I understand?
As I understand from
CqlPagingRecordReader Used when Partition Key Is Explicitly Stated
I should be able to query with just part of the partition key?
Also, while reading
Add CqlRecordReader to take advantage of native CQL pagination
I get the impression this should be possible, but I am swimming around with (in my opinion) no clear direction on how to accomplish this.
Any help is very welcome at this point.
Regards,
Lennart Weijl
PS.
I am running on Cassandra 2.0.9 with Pig 0.13.0.
According to CASSANDRA-6311, I believe you need to apply the 6331-v2-2.0-branch.txt patch, recompile Pig, and then update your LOAD statement to:
data = LOAD 'cql://ks/data?where_clause=occurday%3D%272013-10-01%27' USING CqlInputFormat();
The key change is USING CqlInputFormat(), which triggers the use of the new CqlRecordReader released in Cassandra 2.0.7.
Edit: Note that the exception is thrown from CqlPagingRecordReader, which means you are still using the old record reader.

Cassandra Searching for a RowKey

I am very new to Cassandra and have not yet read much about the architecture. I have a simple question for which I am not getting an answer.
This is sample data from a list abcColumnFamily:
RowKey:Message_1
=> (column=word, value=Message_1, timestamp=1373976339934001)
RowKey:Message_2
=> (column=word, value=Message_2, timestamp=1373976339934001)
How can I search for the row key having, say, Message_1?
In the SQL world it would be: SELECT * FROM table WHERE Rowkey = 'Message_1' (= or LIKE). I simply want to search on the full string.
My intention is just to check whether data of interest to me exists under a particular row key or not.
For CQL, try:
select * from abcColumnFamily where KEY = 'Message_1'
If you want to query that data using the CLI, try the following:
assume abcColumnFamily keys as utf8;
get abcColumnFamily['Message_1'];

Querying Azure table storage for null values

Does anyone know the proper way to query Azure table storage for a null value? From what I've read, it's possible (although there is a bug which prevents it on development storage). However, I keep getting the following error when I do so against the live cloud storage:
One of the request inputs is not valid.
This is a dumbed down version of the LINQ query that I've put together.
var query = from fooBar in fooBarSVC.CreateQuery<FooBar>("FooBars")
            where fooBar.PartitionKey == kPartitionID
               && fooBar.Code == kfooBarCode
               && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
               && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime()
                   || fooBar.Termination_Date == null)
            select fooBar;
If I run the query without checking for null, it works fine. I know a possible solution would be to run a second query on the collection that this query brings back. I don't mind doing that if I need to, but would like to know if I can get this approach to work first.
Anyone see anything obvious I'm doing wrong?
The problem is that because Azure table storage does not have a schema, the null column doesn't actually exist: there is no such thing as a null column in table storage, which is why your query is not valid. You could store an empty string if you really have to.
Really, though, the fundamental issue is that Azure table storage is not built to be queried by any column other than partition key and row key. Every query on one of these non-key columns is a table scan, and once you have a lot of data you will see a very high rate of query timeouts. I would suggest setting up a manual index for these types of queries; for example, you could store the same data in the same table but with different values for the row key. Ultimately, if your app is not getting crazy high usage, I would just use SQL Azure, as it will be much more flexible for the types of queries you are doing.
Update: Azure has a great guide on table storage design that I would recommend reading: http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/
I just had this problem and found a nice little ninja-trick to actually test for nulls. Although I'm using the Azure Storage interface directly, I'm 90% sure it will work for LINQ too if you do the same.
Here's what I did to check if Price (Int32?) is null:
not (Price lt 0 or Price gt 0)
I'm guessing that in your case you can do the same in LINQ by testing whether fooBar.Termination_Date is less than or greater than DateTime.UtcNow, for example. Something like this:
where fooBar.PartitionKey == kPartitionID
   && fooBar.Code == kfooBarCode
   && fooBar.Effective_Date <= kFooBarDate.ToUniversalTime()
   && (fooBar.Termination_Date > kFooBarDate.ToUniversalTime()
       || !(fooBar.Termination_Date < DateTime.UtcNow
            || fooBar.Termination_Date > DateTime.UtcNow))
select fooBar;
For a string column called MyColumn I was able to type: not(MyColumn gt '')
Mike S's answer above put me on the right path.
For strings, we can compare to the empty string:
IsNotBlank(value)
can be expressed as:
(Value gt '')
Using the Azure Tables client library for .NET to query for null Guid values:
In the sample code, the property's name is MyColumn.
var filter = Azure.Data.Tables.TableClient
.CreateQueryFilter($"not(MyColumn gt {Guid.Empty})");
The TableClient.CreateQueryFilter method will create the filter:
not(MyColumn gt guid'00000000-0000-0000-0000-000000000000')
