Can I get some help in understanding the real difference between YSQL and YCQL? Based on the documentation, I understand that the current underlying storage implementation for YugabyteDB is DocDB, and that it uses Raft for replication.
Based on this, can I assume that the only difference between YSQL and YCQL is that YSQL has triggers, stored procedures, and other SQL features that YCQL lacks?
Great question. The plan is that over time YSQL will have most of the features in YCQL, but that is not the case today. This is because there is significant work left to be done in YSQL to achieve parity, some of which is already in progress.
YSQL features
YSQL re-uses the upper half of PostgreSQL with a horizontally scalable lower half called DocDB. Thus, YSQL would support all PostgreSQL features - including stored procedures, triggers, common table expressions, extensions, and foreign data wrappers (the last of these is not yet supported).
YCQL features not in YSQL
Here is a list of YCQL features not in YSQL.
Cluster awareness: The client drivers are cluster aware, meaning the clients can discover all nodes of the cluster given just one contact point. These client drivers also get notified of node add/remove, and therefore apps do not need a load balancer to use a distributed cluster. There is ongoing work to incorporate this functionality into YSQL as part of the jdbc-yugabytedb project.
Topology awareness: The client drivers are also topology aware, meaning they are notified of the regions/zones in which the various nodes of the cluster are deployed. They can perform operations such as reading from nearest region/datacenter.
Automatic data expiry: YCQL supports automatic expiry of data using the TTL feature - you can set a retention policy for data at a table or row level and the older data is automatically purged from the DB.
Collection data types: YCQL supports collection data types such as sets, maps, and lists. Note that both YCQL and YSQL support JSONB, which can also be used to model the above (a short sketch of both TTL and collections follows this list).
Cassandra API compatibility: YCQL is Cassandra API compatible, and therefore supports the Cassandra ecosystem projects. Examples include Spark and Kafka connectors, JanusGraph and KairosDB support, etc. Note that while these ecosystem integrations could be built on top of YSQL, they do not exist today and are a matter of prioritization.
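To make the TTL and collection points concrete, here is a rough sketch (not taken from the YugabyteDB docs): the keyspace and table names are made up, and it assumes the DataStax Java driver 4.x talking to a local YCQL endpoint on the default port.

```java
// Rough sketch only: "demo" and the table names are hypothetical; the driver
// connects to 127.0.0.1:9042 by default when no contact points are given.
import com.datastax.oss.driver.api.core.CqlSession;

public class YcqlTtlAndCollections {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo");

            // Collection data type: a map column on each row.
            session.execute("CREATE TABLE IF NOT EXISTS demo.user_sessions ("
                    + "user_id int PRIMARY KEY, attributes map<text, text>)");

            // Table-level retention: rows expire one hour after they are written
            // (default_time_to_live is the YCQL table property for this).
            session.execute("CREATE TABLE IF NOT EXISTS demo.events ("
                    + "id int PRIMARY KEY, payload text) "
                    + "WITH default_time_to_live = 3600");

            // Row-level TTL on a single write (86400 seconds = 1 day).
            session.execute("INSERT INTO demo.events (id, payload) "
                    + "VALUES (1, 'hello') USING TTL 86400");
        }
    }
}
```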
Related
If I understand correctly, multiple Gremlin servers don't communicate with each other; the scaling happens only in Cassandra/ES.
If that is true, how many vertices can each Gremlin server support?
When the graph is updated by one Gremlin server, when will the other Gremlin servers see that change?
Thanks!
The maximum number of vertices supported is 2^59 (roughly half a quintillion).
The storage backend is the sole source of state between multiple Gremlin servers. The number of vertices will not be increased by adding additional Gremlin servers.
The limitations on the number of vertices are outlined in the Technical Limitations page in the JanusGraph Manual.
When one Gremlin Server sees changes made by another is determined by the choice of storage backend, but it's still tricky to answer.
As far as when the other Gremlin servers will see changes, that is a bit tricky to answer. If you are using a consistent data backend, the answer will generally be as soon as Gremlin finishes its transaction.
But Cassandra is a different beast.
Using an eventually consistent storage backend
Cassandra is what's known as an eventually-consistent database. This means that it trades transactional consistency for availability and partition tolerance; even if you start to lose nodes in the cluster, it will continue to function and serve requests.
The downside to this is that mutations in Cassandra do not instantly become available to consumers; you can even have the case where a client writes a change to Cassandra and that very same client doesn't see the change if they immediately try to read that data.
Chapter 31 in the JanusGraph Manual covers dealing with an eventually consistent storage backend like Cassandra.
Realistically, the amount of time between a mutation and all clients being able to see the mutation in Cassandra depends entirely upon the data load, the nature of the write, and the read/write consistency levels that JanusGraph is configured to read and write to Cassandra with.
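For example, with the CQL storage backend those consistency levels are set through the graph configuration. A rough sketch follows; the property names are taken from the JanusGraph configuration reference and should be verified against the JanusGraph version you run.

```java
// Rough sketch, assuming the CQL storage backend.
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraphWithConsistency {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "127.0.0.1")
                // Stronger levels (e.g. QUORUM) narrow the window in which other
                // Gremlin Servers can read stale data, at the cost of latency
                // and availability during node failures.
                .set("storage.cql.read-consistency-level", "QUORUM")
                .set("storage.cql.write-consistency-level", "QUORUM")
                .open();
        graph.close();
    }
}
```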
I would like to ask whether Ignite is suitable for my use case, which is:
Load all the data from Oracle tables into the Ignite cache, and then run various SQL queries (aggregation/join/sub-query) against the data in the cache.
When Oracle has newly created data, or some data is updated, is there some way for this data to be inserted into the cache, or for the corresponding entries in the cache to be updated?
When the cache is down, is there some way to restore the data from Oracle?
I'm not sure whether Ignite SQL Grid fits this use case.
Also, I notice that IgniteRDD is not immutable. Is IgniteRDD suitable for this use case? That is, I would first load the data from Oracle into an IgniteRDD,
and then apply the corresponding changes to the IgniteRDD as data is created or updated in Oracle. But it looks like IgniteRDD doesn't support complicated SQL (aggregation/join/sub-query)?
This is one of the basic use cases supported by Ignite.
Data can be pre-loaded from Oracle using one of the methods covered in this documentation section.
If you're planning to update the data in Ignite first and propagate it to Oracle after (which is the preferred way), then it makes sense to use Oracle as a CacheStore in write-through/read-through mode. Ignite will make sure to sync up the data with the persistence layer. Moreover, it'll be straightforward to pre-load data from Oracle if the cluster is restarted.
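For illustration, here is a minimal sketch of that read-through/write-through wiring. The `Person` and `OracleCacheStore` classes are hypothetical placeholders; in practice Ignite also ships `CacheJdbcPojoStoreFactory`, and the Web Console can generate the store configuration and POJOs for you.

```java
import java.io.Serializable;
import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

public class OracleBackedCache {

    /** Hypothetical value type; Web Console can generate such POJOs from the Oracle schema. */
    public static class Person implements Serializable {
        public long id;
        public String name;
    }

    /** Hypothetical CacheStore; real JDBC calls against Oracle would go in these methods. */
    public static class OracleCacheStore extends CacheStoreAdapter<Long, Person> {
        @Override public Person load(Long key) {
            return null; // SELECT ... FROM persons WHERE id = ?
        }
        @Override public void write(Cache.Entry<? extends Long, ? extends Person> entry) {
            // MERGE/INSERT into Oracle
        }
        @Override public void delete(Object key) {
            // DELETE FROM persons WHERE id = ?
        }
    }

    public static void main(String[] args) {
        CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("persons");
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(OracleCacheStore.class));
        cfg.setReadThrough(true);   // cache misses are loaded from Oracle
        cfg.setWriteThrough(true);  // puts/removes are propagated to Oracle

        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);
            // Pre-load from Oracle; delegates to CacheStore.loadCache, which you
            // would override for efficient bulk loading.
            cache.loadCache(null);
        }
    }
}
```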
Finally, you can take advantage of GridGain Web Console by connecting it to Oracle and mapping Oracle's schema to Ignite cache configurations and POJOs.
As I mentioned, it's recommended to make all the updates through Ignite first, which will persist them to Oracle. But if Oracle is updated by other applications that are not aware of Ignite, you need to update the Ignite cluster on your own somehow. Ignite doesn't have any feature that covers this use case. However, this can easily be implemented with GridGain, which is built on top of Ignite, using its Oracle GoldenGate integration.
Once the data is in the Ignite cluster, use SQL Grid to query and/or update your data. The SQL Grid engine is ANSI-99 compliant and has no problem with the aggregations, joins, and sub-queries you mention.
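For example, assuming the cache above was configured with query entities or indexed types so that its value type is visible to SQL (an assumption on my part), a fields query could look like this:

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class QueryExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Aggregation over the hypothetical Person type cached in "persons".
            SqlFieldsQuery qry = new SqlFieldsQuery(
                    "SELECT name, COUNT(*) FROM Person GROUP BY name");
            List<List<?>> rows = ignite.cache("persons").query(qry).getAll();
            rows.forEach(System.out::println);
        }
    }
}
```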
As for the Ignite Shared RDD, it stores data in a distributed Ignite cache. This is why it's mutable, in contrast to Spark's native RDDs. A Shared RDD's SQL capabilities are exactly the same; it's just one more API on top of the SQL Grid.
I have checked the main features of Cassandra and Infinispan. They seem to deliver pretty similar characteristics and functionality:
NoSQL data store
persistence
decentralized
support replication
scalability
fault tolerant
MapReduce support
Queries
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
When learning about Infinispan I came across the Cassandra Cache Store (http://infinispan.org/docs/cachestores/cassandra/). It provides persistence of data.
But then why I would still want to use Infinispan and not Cassandra directly?
Do these solutions complement each other or they are more competing on the same level?
Infinispan is mainly used as a distributed cache, like memcached/hazelcast and so on.
Natively, data is written in memory, but you can persist it into what they call "cache stores" -- there are many cache stores ready to use (for file/Cassandra/HBase/Mongo) or you can write your own implementation.
One difference I have found out is that Infinispan does not provide tunable consistency (every node has the same data).
Tunable consistency and data distribution are two different things. It's not true that "every node has the same data"; it depends on how you choose to cluster the data. Infinispan, like others, offers both replication (all nodes store the same cache) and distribution (each node is responsible for a range of tokens). Tunable consistency in Cassandra means that you can choose how many nodes must acknowledge your read/write operation before control is returned to the client.
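To illustrate the difference between the two clustering modes, here is a small sketch using Infinispan's programmatic configuration API (the cache names are made up):

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ClusteringModes {
    public static void main(String[] args) {
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // Replication: every node holds a full copy of the cache.
        manager.defineConfiguration("replicated-cache",
                new ConfigurationBuilder().clustering()
                        .cacheMode(CacheMode.REPL_SYNC).build());

        // Distribution: each entry is stored only on a limited number of owners.
        manager.defineConfiguration("distributed-cache",
                new ConfigurationBuilder().clustering()
                        .cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2).build());

        Cache<String, String> cache = manager.getCache("distributed-cache");
        cache.put("key", "value");
        manager.stop();
    }
}
```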
You might want to use Infinispan rather than Cassandra directly for many reasons: for instance, if you have a huge amount of memory in your application servers and you want to keep a bigger/different cache than what you can store inside your Cassandra nodes. Another feature you might need is plugging in the infinispan-query module in order to perform full-text searches without installing a Solr/Elasticsearch/whatever cluster, or using the transactional capabilities it offers.
IMHO these two products do not compare directly; they were born for different use cases and offer different features. You can use either one or both, depending on your application architecture and needs.
HTH,
Carlo
I'm getting started with Cassandra. I use CQL 2.0 and I would like to create a table with an auto-incrementing primary key. I use Cassandra on one node.
Cassandra doesn't have any kind of auto-increment key feature like you would normally find in an RDBMS. The coordination cost across nodes is too high to make it a worthwhile feature.
Generally you should be using UUIDs whenever you would have used an auto incrementing sequence in an RDBMS. Clients can create these independently of each other with a guarantee of uniqueness (if you are using them correctly). You can use TimeUUIDs if you want to be able to order your keys by creation time (assuming that your clients have synchronized clocks).
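For illustration, here is a minimal sketch with the DataStax Java driver (note this is CQL3 and the 4.x driver rather than CQL 2.0; the keyspace and table are made up and assumed to exist, and the driver connects to 127.0.0.1:9042 by default):

```java
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.uuid.Uuids;

public class UuidKeys {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                    + "id uuid PRIMARY KEY, name text)");

            // Random UUID generated client-side: unique without any coordination.
            session.execute("INSERT INTO demo.users (id, name) VALUES (?, ?)",
                    UUID.randomUUID(), "alice");

            // TimeUUID (version 1) if you want keys roughly ordered by creation time.
            session.execute("INSERT INTO demo.users (id, name) VALUES (?, ?)",
                    Uuids.timeBased(), "bob");

            // Or let the server generate them with the built-in uuid()/now() functions.
            session.execute("INSERT INTO demo.users (id, name) VALUES (uuid(), 'carol')");
        }
    }
}
```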
You said you are only using a 1-node cluster. If you don't ever plan on growing your cluster to be larger than 1 node, then I would suggest using a different database. Cassandra sacrifices many traditional database features to work really well distributed across a ring of machines. When you only run a single-node cluster, you lose all of the nice features of an RDBMS without gaining any of the benefits of running a multi-node Cassandra cluster.
I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
Zookeeper creates znodes to perform read and write operations, and data flows to and from Zookeeper through znodes. I want to know how to support rollback and commit features in Cassandra using Zookeeper. Is there any way to specify Cassandra configuration in Zookeeper, or Zookeeper configuration in Cassandra?
I know how data is read and written in Cassandra and Zookeeper individually, but I don't know how to integrate the two of them using Java.
How can we do transactions in Cassandra using Zookeeper?
Thanks.
I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
With great difficulty. Cassandra does not work well as a transactional system. Writes to multiple rows are not atomic, there is no way to rollback writes if some writes fail, and there is no way to ensure readers read a consistent view when reading.
I want to know how to support rollback and commit features in Cassandra using Zookeeper.
Zookeeper won't help you with this, especially the commit feature. You may be able to write enough information to zookeeper to roll back in case of failure, but if you are doing that, you might as well store the rollback info in cassandra.
Zookeeper and Cassandra work well together when you use Zookeeper as a locking service. Look at the Cages library. Use zookeeper to co-ordinate read/writes to cassandra.
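As an illustration of that locking pattern, here is a rough sketch using Apache Curator instead of Cages (that substitution is my own choice; the lock path, keyspace, and statements are made up). Note that the lock only gives mutual exclusion between clients that honour it; it does not make the writes atomic or give you rollback.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import com.datastax.oss.driver.api.core.CqlSession;

public class LockedWrite {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        try (CqlSession session = CqlSession.builder().build()) {
            InterProcessMutex lock = new InterProcessMutex(zk, "/locks/account-42");
            lock.acquire();
            try {
                // Read-modify-write protected by the ZooKeeper lock: no other
                // client holding the same lock can interleave with these writes.
                session.execute("UPDATE demo.accounts SET balance = 90 WHERE id = 42");
                session.execute("UPDATE demo.audit_log SET last_op = 'debit' WHERE id = 42");
            } finally {
                lock.release();
            }
        } finally {
            zk.close();
        }
    }
}
```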
Trying to use cassandra as a transactional system with atomic commits to multiple rows and rollbacks is going to be very frustrating.
There are ways to implement transactions in Cassandra without ZooKeeper.
Cassandra itself has a feature called lightweight transactions, which provides per-key linearizability and compare-and-set. With such primitives you can implement serializable transactions at the application level yourself.
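For example, with the Java driver (keyspace/table names are made up, and the driver connects to 127.0.0.1:9042 by default):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class LwtExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Insert only if the row doesn't exist yet.
            ResultSet rs = session.execute(
                "INSERT INTO demo.accounts (id, balance) VALUES (42, 100) IF NOT EXISTS");
            System.out.println("created: " + rs.wasApplied());

            // Compare-and-set: update only if the expected value is still there.
            rs = session.execute(
                "UPDATE demo.accounts SET balance = 90 WHERE id = 42 IF balance = 100");
            System.out.println("applied: " + rs.wasApplied());
        }
    }
}
```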
Please see the Visualization of serializable cross shard client-side transactions post for details and a step-by-step visualization.
The variants of this approach are used in Google's Percolator system and in CockroachDB.
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at the RAMP transactions paper by Peter Bailis.
There is a BATCH feature in Cassandra's CQL3 (CQL3 was formally released with Cassandra 1.2), which can atomically apply all the updates in the BATCH as one all-or-nothing unit.
This does not mean you can roll back a successfully executed BATCH as an RDBMS could; that would have to be done manually.
Note that the UNLOGGED option relaxes this guarantee: an unlogged batch skips the batch log, so the updates are no longer applied as an all-or-nothing unit, in exchange for better performance.
http://www.datastax.com/docs/1.2/cql_cli/cql/BATCH
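A minimal sketch of a logged batch via the Java driver (table names are made up; the driver connects to 127.0.0.1:9042 by default):

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class BatchExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // All three statements are applied as one all-or-nothing unit thanks
            // to the batch log. Replace BEGIN BATCH with BEGIN UNLOGGED BATCH to
            // skip the batch log and give up that guarantee.
            session.execute(
                "BEGIN BATCH "
              + "  UPDATE demo.accounts SET balance = 90  WHERE id = 42; "
              + "  UPDATE demo.accounts SET balance = 110 WHERE id = 43; "
              + "  INSERT INTO demo.transfers (id, amount) VALUES (uuid(), 10); "
              + "APPLY BATCH");
        }
    }
}
```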
Well, I'm not an expert at this (far from it actually), but the way I see it, either you deploy some middleware made by yourself in order to guarantee the specific properties you are looking for, or you can have Cassandra write the data to auxiliary files and then move them into place through the file system, since Java offers an atomic move/rename operation (a plain copy is not atomic).
I don't know anything about the size of the data files you are considering, so I don't really know if it is doable; however, there might be a way to use this property on smaller pieces of information and then combine them into a whole.
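As a sketch of that idea, using java.nio's atomic move rather than a copy (the paths are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {
    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get("/data/export/snapshot.tmp");
        Path target = Paths.get("/data/export/snapshot.dat");

        // Write the full content to a temporary file on the same file system...
        Files.write(tmp, "exported rows go here".getBytes());

        // ...then atomically rename it into place: readers see either the old
        // file or the complete new one, never a half-written file.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```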
Just my 2 cents...