While deciding on the technology stack for my own product, I decided to go with ScyllaDB for the database due to its impressive performance.
For local development, I set up Cassandra on my MacBook.
Since ScyllaDB now supports (experimental) materialized views (MVs), development was easy. For the dev server, I'm running ScyllaDB on Ubuntu 16.04 hosted on Linode.
I am facing the following issues:
1. After a few weeks, when I deleted an entry from a base table (on the ScyllaDB instance running on Ubuntu) using the partition key, the MV still showed the entry for the deleted record (a rough sketch of this scenario is below). It was fixed after I dropped the whole keyspace and recreated it, but I'm unable to pinpoint what caused the inconsistency.
2. When I dropped the MV and recreated it, it did not copy the existing data from the base table. I searched around but could not find a way to force the MV to read from the base table and populate itself.
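For concreteness, here is a rough CQL sketch of the shape of schema and delete involved in the first issue (keyspace, table, and column names are made up; my real schema differs):

    -- Hypothetical base table and view, for illustration only
    CREATE TABLE app.users (
        user_id uuid,
        email   text,
        name    text,
        PRIMARY KEY (user_id)
    );

    CREATE MATERIALIZED VIEW app.users_by_email AS
        SELECT user_id, email, name
        FROM app.users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id);

    -- Issue 1: after deleting by partition key from the base table...
    DELETE FROM app.users WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

    -- ...the corresponding row was still visible in the view:
    SELECT * FROM app.users_by_email WHERE email = 'a@example.com';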
For the first issue, I would like to know if anyone has faced a similar scenario, and whether there is anything I can do to prevent it, or whether it can't be prevented and that is simply what "experimental" means.
Any help or reference is appreciated.
In 2.1 Scylla lacked view building (that is, using existing data to populate a view on creation), but that is solved in 2.2.
Indeed, the MV implementation in 2.1 is incomplete. It has gotten much better in 2.2, which will be released this week. It's still not GA, but we have a branch on top of 2.2 that merged newer changes from master and is almost there. It should reach GA quality within two months.
Note that Cassandra's MV status is also experimental, and we have been opening JIRA tickets wherever we identified a design flaw in C*'s MV implementation.
tl;dr: I would suggest you either stick with Cassandra if you want MVs, or maintain the MVs manually in Scylla.
Materialized views are super experimental. I ran for about 6 months in production with their functionality replaced manually (a sketch of the manual approach is below). This was done to improve performance, so if performance is your goal here, I suggest avoiding MVs.
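To illustrate what "maintaining the MV manually" means, here is a minimal sketch in CQL: you keep a second, hand-denormalized table and write to both tables yourself, e.g. in a logged batch (all names are hypothetical):

    CREATE TABLE app.users (
        user_id uuid PRIMARY KEY,
        email   text,
        name    text
    );

    -- Hand-maintained "view" table, keyed the way you want to query
    CREATE TABLE app.users_by_email (
        email   text,
        user_id uuid,
        name    text,
        PRIMARY KEY (email, user_id)
    );

    -- A logged batch keeps the two tables from diverging if a write fails midway
    BEGIN BATCH
        INSERT INTO app.users (user_id, email, name)
            VALUES (123e4567-e89b-12d3-a456-426655440000, 'a@example.com', 'Alice');
        INSERT INTO app.users_by_email (email, user_id, name)
            VALUES ('a@example.com', 123e4567-e89b-12d3-a456-426655440000, 'Alice');
    APPLY BATCH;

The hard part, and exactly what MVs try to automate, is handling updates that change the view key: you must delete the old view row and insert the new one yourself.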
I can attest that a materialized view created on an already-populated table will in fact populate itself, so this seems like a ScyllaDB problem. Cassandra has a different problem: the resulting writes will crater the DB if you do this on a large production table.
I also did not have issues with truncating the primary table and seeing the change reflected in Cassandra.
Additionally, I tried ScyllaDB in a spike for performance reasons. I found it very difficult to work with and dropped it after spending a week trying to get it to do what I knew Cassandra could do.
Thanks #Highstead for confirming that an MV is populated automatically from existing base-table entries when the MV is created.
As for the main question about the inconsistency between the tables and the MV, I found out that it was caused by a TRUNCATE query on the base table.
I also found an issue for it: https://github.com/scylladb/scylla/issues/3188
It states that, currently, truncating the base table won't clear the MVs created from that table.
Conversely, you can run a TRUNCATE query on the MV directly; it won't throw an exception (where it arguably should), and the MV will be cleared even when the base table still contains entries.
So the solution for now is to truncate each MV separately along with its base table, as sketched below.
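For example (hypothetical names; this relies on Scylla accepting TRUNCATE on an MV, as described above):

    -- Truncating only the base table leaves stale rows in its views (issue #3188),
    -- so truncate the base table and every MV built from it explicitly:
    TRUNCATE app.users;
    TRUNCATE app.users_by_email;
    TRUNCATE app.users_by_name;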
Related
Cassandra's MVs are not production ready:
Cassandra Materialized views impact
Limitations: https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/knownLimitationsMV.html
https://techblog.fexcofts.com/2018/05/08/cassandra-materialized-views-ready-for-production/:
It turns out there have been issues with MVs. The biggest issue being the MV not keeping in sync with the base table. This seems to occur when creating a MV with a key that is not a key of the base table.
Cassandra does not offer any mechanism for checking the integrity between the base table and any MVs. So unless you do this manually, you will be oblivious to any discrepancies. If you do find any discrepancies, the only way to fix them is to drop and recreate the MV.
Cassandra has had MVs since 2015, i.e. for 5.5 years already: https://www.datastax.com/blog/new-cassandra-30-materialized-views.
Over to ScyllaDB, a database whose first version was released in 2016: https://www.scylladb.com/2016/03/31/release-1-0/. ScyllaDB promotes MVs as production ready.
Why isn't Cassandra able to deliver production-ready MVs the way ScyllaDB can? I don't see any limitations for MVs on ScyllaDB's website. MVs are super useful, and I don't understand why Cassandra has never achieved production-ready MVs; the issue has been open for over 5 years: https://issues.apache.org/jira/browse/CASSANDRA-10346.
How did ScyllaDB solve the inconsistent-MV problem? Why can't/hasn't Cassandra solved its MV problems?
The Scylla implementation of MV resembles the Cassandra one but isn't identical.
Today, even with Scylla, if the view and the base go out of sync, there is no 100% safe way to fix it other than a complete view rebuild.
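In CQL terms, a complete view rebuild amounts to dropping and recreating the view so the server repopulates it from the base table; a minimal sketch with hypothetical names:

    DROP MATERIALIZED VIEW app.users_by_email;

    CREATE MATERIALIZED VIEW app.users_by_email AS
        SELECT user_id, email, name
        FROM app.users
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id);
    -- View building then rescans the existing base data in the background.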
However, we have fixed and improved the MV implementation a lot and decreased the chances of this happening. Here's a search capturing the current open MV bugs:
https://github.com/scylladb/scylla/issues?q=is%3Aissue+is%3Aopen+materialize+view+
The development pace at Scylla is higher, and so is the number of active committers (surprising as that may be). Nowadays we're working on Raft in order to make regular operations consistent and thus keep the view and the base fully in sync for good.
Full disclosure - I work on the Scylla project.
I don't know who could answer definitively "why hasn't Cassandra done X". This is open-source software, so I think the best answer is that nobody in the community cared enough to do anything about it. And if you care enough, you are welcome to fix it. Maybe there's a better answer, but that's what I've got for you.
As for the "how Scylla did it", there is a detailed technical talk on the implementation of indexing and materialized views at https://www.youtube.com/watch?v=dyWZRjtPI2s. The talk is a bit old - there are recent updates to both functionality and performance - but the underlying infrastructure is all there.
And I can confirm that 2i and MV in Scylla are very widely used at scale in Production.
I want to migrate data from Exasol to Exasol, but I do not want to use files, as it would take a lot of time to move terabytes of data. I am totally new to Exasol and have never worked on a migration. A script is given on GitHub (https://github.com/EXASOL/database-migration/blob/master/exasol_to_exasol.sql), but that again uses file import. Any lead would be appreciated!
Thanks
OK, we did this migration for an ~80 TB compressed (~400 TB raw) database.
First of all, Exasol v6 works with data volumes created in v5 without any problems. There is no need to make this migration ASAP.
The simplest way is:
1. Upgrade to Exasol v6.
2. Create an archive volume and make a full backup.
3. Create a data volume and restore the backup.
4. Create a new ExaSolution instance pointing to the restored data volume.
5. If everything is OK, drop the old Exasol instance and the old data volume.
This is the fastest and easiest method, but you'll need a lot of disk space. It is a good idea to drop all indexes and truncate all staging tables to reduce the size of the backup.
I'm encountering the same problem as in "Cassandra system.hints table is empty even when one of the nodes is down":
I am learning Cassandra from academy.datastax.com. I am trying the Replication and Consistency demo on a local machine, with RF = 3 and consistency ONE.
When node3 is down and I update my table with an UPDATE command, the system.hints table is expected to store a hint for node3, but it is always empty.
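Roughly, the demo in question looks like this in cqlsh (hypothetical table; node3 stopped beforehand):

    -- Write at consistency ONE while one replica is down; the coordinator
    -- should store a hint for the dead replica
    CONSISTENCY ONE;
    UPDATE demo.users SET name = 'alice' WHERE user_id = 1;

    -- Before Cassandra 3.0, the stored hint would be visible here:
    SELECT * FROM system.hints;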
#amalober pointed out that this was due to a difference in the Cassandra version being used. From the Cassandra docs at DataStax:
In Cassandra 3.0 and later, the hint is stored in a local hints directory on each node for improved replay.
This same question was asked 3 years ago, How to access the local data of a Cassandra node, but the accepted solution was to
...Hack something together using the Cassandra source that reads SSTables and have that feed the local client you're hoping to build. A great starting point would be looking at the source of org.apache.cassandra.tools.SSTableExport which is used in the sstable2json tool.
Is there an easier way to access the local hints directory of a Cassandra node?
Is there an easier way to access the local hints directory of a Cassandra node?
The hints directory is defined in the $CASSANDRA_HOME/conf/cassandra.yaml file (sometimes it is located under /etc/cassandra instead, depending on how you installed Cassandra).
Look for the hints_directory property.
I guess you are using ccm, so the hints file should be in the $CASSANDRA_HOME/.ccm/yourcluster/yournode/hints directory.
I haven't been able to reproduce your issue of not getting a hints file; every attempt I made resulted in a hints file as expected. There is, however, an easier way to view the hints now.
We added a hints dump to sstable-tools that you can use to view the mutations in the hinted-handoff (HH) files. We may in the future add the ability to use the HH files like sstables in the shell (using their mutations to build a memtable and include it in queries), but for now it's pretty raw.
It's pretty simple (aside from the metadata setup) if you want to analyze the data yourself. You can see what we did here and adapt it to your needs: https://github.com/tolbertam/sstable-tools/blob/master/src/main/java/org/apache/cassandra/hints/HintsTool.java#L39
Sorry if this is an existing question, but none of the existing ones resolved my problem.
I've installed Cassandra as a single node. I don't have a large application right now, but I think that may be the case soon, and I will need more and more nodes.
Well, I'm saving data from a stream to Cassandra, and this was going well, but suddenly, when I tried to read data, I started to receive this error:
"Not enough replica available for query at consistency ONE (1 required but only 0 alive)"
My keyspace was built using SimpleStrategy with replication_factor = 1. I'm saving data partitioned by a field called catchId, so most of my queries are like select * from data where catchId='xxx'. catchId is the partition key (a rough sketch of the schema is below).
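For reference, the schema is along these lines (every name except catchId is made up):

    CREATE KEYSPACE mykeyspace
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    CREATE TABLE mykeyspace.data (
        catchId    text,
        event_time timestamp,
        payload    text,
        PRIMARY KEY (catchId, event_time)
    );

    -- With RF = 1 there is exactly one replica per partition, so if the node
    -- owning a partition is down or unresponsive, even consistency ONE fails:
    SELECT * FROM mykeyspace.data WHERE catchId = 'xxx';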
I'm using the cassandra-driver-core version 3.0.0-rc1.
The thing is that I don't have that much data right now, and I'm wondering whether it would be better to use an RDBMS for now and migrate to Cassandra only when I have better infrastructure.
Thanks :)
It seems that your node is unable to respond when you try to perform your read (in general this error appears when more than one node is involved). If you do not have lots of data, that is very strange, so it probably points to a design problem. It can stem from several things, so you have to investigate a bit:
Study your logs, in particular system.log.
You can increase the read_request_timeout_in_ms parameter in cassandra.yaml. Although it's not a good idea in production, it will tell you whether it's just a temporary problem (your request succeeds after a little while) or a bigger one.
Study your CPU and memory behavior while you are making requests.
If you are very motivated, you can install OpsCenter, which will give you more valuable information.
How, and how many, write requests are you making? They can overwhelm Cassandra (even though it's designed for heavy writes). I recommend making async requests to avoid problems.
When trying to run Pig against a CQL3-created Cassandra schema,
-- This script simply gets a row count of the given column family
rows = LOAD 'cassandra://Keyspace1/ColumnFamily/' USING CassandraStorage();
counted = foreach (group rows all) generate COUNT($1);
dump counted;
I get the following error:
Error: Column family 'ColumnFamily' not found in keyspace 'KeySpace1'
I understand that this is by design, but I have been having trouble finding the correct method to load CQL3 tables into Pig.
Can someone point me in the right direction? Is there a missing bit of documentation?
This is now supported in Cassandra 1.2.8.
As you mention, this is by design: if Thrift were updated to allow it, backwards compatibility would be compromised. Instead of creating keyspaces and column families using CQL (I'm guessing you used cqlsh), try using the C* CLI.
Take a look at these issues as well:
https://issues.apache.org/jira/browse/CASSANDRA-4924
https://issues.apache.org/jira/browse/CASSANDRA-4377
Per this https://github.com/alexliu68/cassandra/pull/3, it appears that this fix is planned for the 1.2.6 release of Cassandra. It sounds like they're trying to get that out in the reasonably near future, but of course there's no certain ETA.
As e90jimmy said, it's supported in Cassandra 1.2.8, but there is an issue when using the counter column type. This was fixed by Alex Liu, but due to a regression in 1.2.7 the patch didn't go ahead:
https://issues.apache.org/jira/browse/CASSANDRA-5234
To correct this, wait until 2.0 becomes production ready, or download the source, apply the patch from the link above yourself, and rebuild the Cassandra .jar. That has worked for me so far...
The best way to access CQL3 tables in Pig is by using the CqlStorage handler.
The syntax is similar to what you have above:
rows = LOAD 'cql://Keyspace/ColumnFamily/' USING CqlStorage();
More info in the Dev Blog Post.