Using Pig with Cassandra CQL3 - cassandra

When trying to run PIG against a CQL3 created Cassandra Schema,
-- This script simply gets a row count of the given column family
rows = LOAD 'cassandra://Keyspace1/ColumnFamily/' USING CassandraStorage();
counted = foreach (group rows all) generate COUNT($1);
dump counted;
I get the following Error.
Error: Column family 'ColumnFamily' not found in keyspace 'KeySpace1'
I understand that this is by design, but I have been having trouble finding the correct method to load CQL3 tables into PIG.
Can someone point me in the right direction? Is there a missing bit of documentation?

This is now supported in Cassandra 1.2.8

As you mention this is by design because if thrift was updated to allow for this it would compromise backwards computability. Instead of creating keyspaces and column families using CQL (I'm guessing you used cqlsh) try using the C* CLI.
Take a look at these issues as well:
https://issues.apache.org/jira/browse/CASSANDRA-4924
https://issues.apache.org/jira/browse/CASSANDRA-4377

Per this https://github.com/alexliu68/cassandra/pull/3, it appears that this fix is planned for the 1.2.6 release of Cassandra. It sounds like they're trying to get that out in the reasonably near future, but of course there's no certain ETA.

As e90jimmy said, its supported in Cassandra 1.2.8, but we have a issue when using counter column type. This was fixed by Alex Liu but due to regression problem in 1.2.7 the patch doesn't go ahead:
https://issues.apache.org/jira/browse/CASSANDRA-5234
To correct this, wait until 2.0 become production ready or download the source, apply the patch from the above link by yourself and rebuild the cassandra .jar. Worked for me by now...

The best way to access Cql3 Tables in Pig is by using the CqlStorage Handler
The syntax is similar to what you have a above
row = Load 'cql://Keyspace/ColumnFamily/' Using CqlStorage()
More info In the Dev Blog Post

Related

ScyllaDB 2.1 - Inconsistency with Materialized View

While deciding on the technology stack for my own product, I decided to go with scyllaDB for database due to it's impressive performance.
For local development, I setup Cassandra on my Macbook.
Considering ScyllaDB now supports (experimental) MV (Materialized View), it made the development easy. For dev server, I'm running ScyllaDB on Ubuntu 16.04 hosted on Linod.
I am facing following issues :
After a few weeks, one day when I deleted an entry from base table (from ScyllaDB running on Ubuntu) using the partition key, the respective MV still showed the respective entry for the deleted record.
It was fixed after I dropped the whole Key-Space and recreated it, but I'm unable to pinpoint what caused this inconsistency.
When I dropped the MV and recreated it, it did not copy the old data.
I tried to search, but could not find a way to force MV to read from base table and populate itself.
For the first issue, I would like to know if anyone faced similar scenario. Also if there is anything I can do to prevent this from happening or if it can't be prevented and that is what it means to be "experimental".
Any help or reference is appreciated.
In 2.1 Scylla lacked view building (that is, using existing data to populate a view on creation), but that is solved in 2.2.
Indeed the MV status of 2.1 is incomplete. It gotten much better in 2.2 which will be released this week. It's still not GA yet but we have a branch on top of 2.2 that merged newer changes from master which is almost there. It should reach GA quality within 2 months.
Note that the Cassandra MV status is experimental and we have been opening JIRA tickets everywhere we identified there is design flaw in C*'s MV.
tldr; I would suggest you either stick with cassandra if you want MV, or manually do the MV's in scylla.
Materialized views are super experimental. I ran them for about 6 months in production replacing their functionality manually. This was done to improve performance. So if performance is your goal here, I suggest avoiding them.
I can attest that the materialized views if created on a already populated table will infact populate the materialized view on their own so this seems like a scylladb problem. Cassandra has a different problem where the writes will crater the DB if you do this on a large production table.
I also did not have issues with truncating the primary table and seeing the reflection in cassandra.
Additionally I had tried scylladb for a spike for performance reasons. I found it very difficult to work with and dropped it after spending a week trying to get it to do what I knew cassandra would do.
Thanks #Highstead for confirming the automatic population of MV if base table has entries while creating the MV.
For the main query of the inconsistency in tables and MV, I found out that it was due to truncate query on base table.
Also found an issue for it https://github.com/scylladb/scylla/issues/3188
It states that currently, truncating the base table wont clear the MVs created from that table.
Vice-versa, you can run truncate query on the MV and it won't throw an exception (where it should've) and MV will be cleared even when base table contains entries.
So solution for now is to truncate each MV along with tables separately.

How to inspect the local hints directory on a Cassandra node?

I'm encountering the same problem as Cassandra system.hints table is empty even when the one of the node is down:
I am learning Cassandra from academy.datastax.com. I am trying the Replication and Consistency demo on local machine. RF = 3 and Consistency = 1.
When my Node3 is down and I am updating my table using update command, the SYSTEM.HINTS table is expected to store hint for node3 but it is always empty.
#amalober pointed out that this was due to a difference the Cassandra version being used. From the Cassandra docs at DataStax:
In Cassandra 3.0 and later, the hint is stored in a local hints directory on each node for improved replay.
This same question was asked 3 years ago, How to access the local data of a Cassandra node, but the accepted solution was to
...Hack something together using the Cassandra source that reads SSTables and have that feed the local client you're hoping to build. A great starting point would be looking at the source of org.apache.cassandra.tools.SSTableExport which is used in the sstable2json tool.
Is there an easier way to access the local hints directory of a Cassandra node?
Is there an easier way to access the local hints directory of a Cassandra node?
The hint directory is defined in $CASSANDRA_HOME/conf/cassandra.yaml file (sometimes it is located under /etc/cassandra also, depending on how you install Cassandra)
Look for the property hints_directory
I guess you are using ccm. So, the hint file should be in $CASSANDRA_HOME/.ccm/yourcluster/yournode/hints directory
I haven't been able to reproduce your issue with not getting a hints file. Every attempt I had resulted in the hints file as expected. There is a way to view the hints easier now.
We added a dump for hints in sstable-tools that you can use to view the mutations in the HH files. We may in the future add ability to use the HH files like sstables in the shell (use mutations to build memtable and include in queries) but for now its pretty raw.
Its pretty simple (sans metadata setup) if you wanna do analysis of data yourself. You can see what we did here and change to your needs: https://github.com/tolbertam/sstable-tools/blob/master/src/main/java/org/apache/cassandra/hints/HintsTool.java#L39

thrift or CQL3 for cassandra wide rows?

we are going to create a new project on cassandra with php or java.
As we estimated, there will be 20K req/sec to cassandra cluster.
Specially wide column feature is important for this project, but i can not make it clear: should i prefer thrift API or CQL3 library like php-driver etc?
There is an post that says 'Thrift API is not going to be getting new features' in this link. So i am not sure about thrift.
if i decided to use cql3, i have to alter table to be sure column exists before all insert queries like this, which is discussed at here. i think this will be a performance issue for me.
So which of them is best to my case ?
Thrift is a legacy interface in Cassandra. All new development should use the native CQL interface.
I'm not clear on why you think you'd need to do an alter table frequently. Typically you would define a schema once and rarely if ever use alter table.

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and i'm struggling trying to get a range query on partition key to work.
According to the connector's documentation it seems that's possible to make server-side filtering on partition key using equality or IN operator, but unfortunately, my partition key is a timestamp, so I can not use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
i think the CassandraRDD error is telling that the query that you are trying to do is not allowed in Cassandra and you have to load all the table in a CassandraRDD and then make a spark filter operation over this CassandraRDD.
So your code (in scala) should something like this:
val cassRDD= sc.cassandraTable("keyspace name", "table name").filter(row=> row.getDate("timestamp")>=DateFormat('2013-01-01T00:00:00.000Z')&&row.getDate("timestamp") < DateFormat('2013-12-31T00:00:00.000Z'))
If you are interested in making this type of queries you might have to take a look to others Cassandra connectors, like the one developed by Stratio
You have several options to get the solution you are looking for.
The most powerful one would be to use Lucene indexes integrated with Cassandra by Stratio, which allows you to search by any indexed field in the server side. Your writing time will be increased but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project so you can take all the advantages of the Lucene indexes in Cassandra through it. I would recommend you to use Lucene indexes when you are executing a restricted query that retrieves a small-medium result set, if you are going to retrieve a big piece of your data set, you should use the third option underneath.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look for it using an IN operator. The problem is, as far as I know, you can't use the spark-cassandra-connector for that, you should use the direct Cassandra driver which is not integrated with Spark, or you can have a look at the deep-spark project where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
, but, as I said before, I don't know if it fits to your needs since you might not be able to truncate your data and group it by date/time.
The last option you have, but the less efficient, is to bring the full data set to your spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate on contacting us if you need any help.
I hope it helps!

What is a good Bulk data loading tool for Cassandra

I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place you can try sstableloader(only for cassandra 0.8.x onwards) utility to bulk load the data.For more details see:cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat bulk loading data into cassandra with hadoop job in latest version that is cassandra-1.1.x onwards.
For more details see:Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV-> Cassandra may not be appropriate since secondary indexes and denormalisation are commonly needed.
For Cassandra 1.1.3 and higher, there is the CQL COPY command that is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing less than 2 million rows, roughly, then this is a good option. Is is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files whereas the CQL COPY command accepts a delimited text file. Documenation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use the sstableloader.http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here http://www.datastax.com/dev/blog/bulk-loading.

Resources