I am using hector-core 1.0-5 from prettyprint to connect to Cassandra. With this API I can create a keyspace, but I cannot find a method that configures the "caching" property of a column family, so every column family I create gets the default "caching" value, "KEYS_ONLY". I want to change this value to "ALL" so that I can use both the key cache and the row cache. My Cassandra version is 1.2.0. Can anyone help me find a way to set the "caching" property when the keyspace is created?
The ColumnFamilyDefinition interface has no support for getting or setting caching. The Hector community would have to patch the code.
I have no experience with Hector as such, but we use Playorm for Cassandra, which uses caches much like Hibernate does. Read more at http://buffalosw.com/wiki/Caching-in-Playorm/.
I am looking into the Spring Data provider for Cassandra and don't see a way to specify a column as static when a table includes clustering keys. Am I missing something?
There's no CQL generation support for static columns. Do you want to file an issue at https://jira.spring.io/browse/DATACASS?
For now, create the CQL of this table yourself.
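As a rough illustration of hand-written CQL with a static column, here is a minimal sketch that keeps the statement in a Python string so it can be handed to whatever driver or session you already use. The keyspace, table, and column names are made up for the example, and it assumes a Cassandra version that supports static columns (2.0.6 or later).

```python
# Hypothetical keyspace, table, and column names, for illustration only.
# "location" is declared STATIC: one value shared by all rows of a partition
# (here, all readings of the same sensor_id).
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    location     text STATIC,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
)
"""

if __name__ == "__main__":
    # Print the statement; in a real application it would be passed to
    # the execute call of your driver or template instead.
    print(CREATE_TABLE_CQL.strip())
```

A static column is only allowed when the table has at least one clustering column, which is why `reading_time` is part of the primary key here.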
We are going to create a new project on Cassandra with PHP or Java.
We estimate about 20K requests/sec to the Cassandra cluster.
The wide column feature is especially important for this project, but one thing is not clear to me: should I prefer the Thrift API or a CQL3 library such as php-driver?
There is a post that says 'Thrift API is not going to be getting new features' in this link, so I am not sure about Thrift.
If I decide to use CQL3, I would have to alter the table before every insert query to be sure the column exists, like this, which is discussed here. I think this will be a performance issue for me.
So which of them is best for my case?
Thrift is a legacy interface in Cassandra. All new development should use the native CQL interface.
I'm not clear on why you think you'd need to do an alter table frequently. Typically you would define a schema once and rarely if ever use alter table.
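To make the "define the schema once" point concrete: in CQL3, wide rows are usually modeled with a clustering column, so a new "column" is just a new row inside the partition and no ALTER TABLE is involved. The sketch below uses a hypothetical table and hypothetical names throughout; it only builds the bound parameters for an insert and does not talk to a cluster.

```python
# Hypothetical schema, created once and never altered afterwards.
# attr_name is a clustering column, so each attribute becomes its own
# CQL row inside the entity's partition (a "wide row" in Thrift terms).
SCHEMA_CQL = """
CREATE TABLE demo.entity_attributes (
    entity_id  text,
    attr_name  text,
    attr_value text,
    PRIMARY KEY (entity_id, attr_name)
)
"""

INSERT_CQL = (
    "INSERT INTO demo.entity_attributes (entity_id, attr_name, attr_value) "
    "VALUES (?, ?, ?)"
)


def to_rows(entity_id, attributes):
    """Turn a dict of arbitrary attribute names into one parameter tuple per CQL row."""
    return [(entity_id, name, str(value)) for name, value in sorted(attributes.items())]


if __name__ == "__main__":
    # Attribute names never seen before need no schema change.
    for params in to_rows("user-42", {"plan": "free", "last_login": "2015-01-01"}):
        print(INSERT_CQL, params)
```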
I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, server-side filtering on the partition key is possible using equality or the IN operator, but unfortunately my partition key is a timestamp, so I cannot use either.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, so you have to load the whole table into a CassandraRDD and then apply a Spark filter operation to it.
So your code (in Scala) should look something like this:
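import com.datastax.spark.connector._
// Parse the range bounds once on the driver; "X" accepts the trailing "Z" (Java 7+).
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (from, until) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
// Keep rows with from <= timestamp < until; the check runs in Spark, not in Cassandra.
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter { row => val ts = row.getDate("timestamp"); !ts.before(from) && ts.before(until) }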
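On the "I don't know what filter is" point: a filter is a predicate evaluated on every row after it has been read, instead of being pushed down to Cassandra. The plain-Python sketch below mimics that with a half-open range check (start inclusive, end exclusive); the rows are made-up stand-ins for the RDD contents, not real data.

```python
from datetime import datetime, timezone

START = datetime(2013, 1, 1, tzinfo=timezone.utc)
END = datetime(2013, 12, 31, tzinfo=timezone.utc)


def in_range(ts, start=START, end=END):
    """Half-open range check: start <= ts < end."""
    return start <= ts < end


# Made-up rows standing in for what the full table scan would return.
rows = [
    {"id": 1, "timestamp": datetime(2012, 12, 31, 23, 59, 59, tzinfo=timezone.utc)},
    {"id": 2, "timestamp": datetime(2013, 6, 15, 12, 0, tzinfo=timezone.utc)},
    {"id": 3, "timestamp": datetime(2013, 12, 31, tzinfo=timezone.utc)},
]

# Equivalent in spirit to rdd.filter(...): every row is read, then tested.
kept = [r for r in rows if in_range(r["timestamp"])]

if __name__ == "__main__":
    print([r["id"] for r in kept])  # prints [2]
```

The cost is the same as in the Spark version: every row is fetched before the predicate can discard it.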
If you need to run this type of query, you might want to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes through it. I would recommend Lucene indexes when you are executing a restricted query that retrieves a small to medium result set; if you are going to retrieve a big piece of your data set, you should use the last option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so that you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
However, as I said before, I don't know if this fits your needs, since you might not be able to truncate your data and group it by date/time.
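The elided list of days in the IN query above does not have to be typed by hand. Here is a minimal sketch that generates it, assuming the partition key has been truncated to one value per day as described; the helper names are hypothetical, and the values are interpolated directly only because they are generated dates, not user input.

```python
from datetime import date, timedelta


def day_buckets(first, last):
    """Return every day from first to last (inclusive) as 'YYYY-MM-DD' strings."""
    days = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(days + 1)]


def build_in_query(table, column, buckets):
    """Build a CQL select with an IN clause over the given day buckets."""
    values = ", ".join("'{}'".format(b) for b in buckets)
    return "select * from {} where {} IN ({})".format(table, column, values)


buckets = day_buckets(date(2013, 1, 1), date(2013, 12, 31))
query = build_in_query("datastore.data", "timestamp", buckets)

if __name__ == "__main__":
    print(len(buckets), "buckets")
    print(query[:90] + " ...")
```

A full year means 365 partitions in a single IN clause, so in practice it may be gentler on the coordinator to split the range into smaller batches of days.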
The last option, and the least efficient one, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
If we need to add new columns to an existing static column family in Cassandra (version 1.2) in production, can we do it without downtime, given that we have hundreds of nodes and multiple data centers?
It would be disappointing if not possible.
In the case of adding columns, all that is really going on with an 'ALTER' statement in CQL is some meta-data entry in the system tables. No data files are being re-written.
This meta-data is then used for validation from both the API transports and compaction.
If you really have that big a cluster, you will need to wait a short while for the change to propagate - cqlsh blocks until this happens, IIRC.
When trying to run Pig against a Cassandra schema created with CQL3,
-- This script simply gets a row count of the given column family
rows = LOAD 'cassandra://Keyspace1/ColumnFamily/' USING CassandraStorage();
counted = foreach (group rows all) generate COUNT($1);
dump counted;
I get the following error:
Error: Column family 'ColumnFamily' not found in keyspace 'KeySpace1'
I understand that this is by design, but I have been having trouble finding the correct method to load CQL3 tables into PIG.
Can someone point me in the right direction? Is there a missing bit of documentation?
This is now supported in Cassandra 1.2.8.
As you mention, this is by design: updating Thrift to allow for this would compromise backwards compatibility. Instead of creating keyspaces and column families using CQL (I'm guessing you used cqlsh), try using the C* CLI.
Take a look at these issues as well:
https://issues.apache.org/jira/browse/CASSANDRA-4924
https://issues.apache.org/jira/browse/CASSANDRA-4377
Per this https://github.com/alexliu68/cassandra/pull/3, it appears that this fix is planned for the 1.2.6 release of Cassandra. It sounds like they're trying to get that out in the reasonably near future, but of course there's no certain ETA.
As e90jimmy said, it is supported in Cassandra 1.2.8, but there is an issue when using the counter column type. This was fixed by Alex Liu, but due to a regression problem in 1.2.7 the patch did not go ahead:
https://issues.apache.org/jira/browse/CASSANDRA-5234
To correct this, wait until 2.0 becomes production ready, or download the source, apply the patch from the above link yourself and rebuild the Cassandra .jar. This has worked for me so far...
The best way to access CQL3 tables in Pig is by using the CqlStorage handler.
The syntax is similar to what you have above:
row = LOAD 'cql://Keyspace/ColumnFamily/' USING CqlStorage();
More info in the Dev Blog Post.