What's the difference between creating a table and creating a columnfamily in Cassandra?

I need details from both the performance and the query perspective. I learnt from some site that only a key can be given when using a columnfamily. If so, what would you suggest for my keyspace? I need to use group by, order by, count, sum, ifnull, concat, joins, and sometimes nested queries.

To answer the original question you posed: a column family and a table are the same thing.
The name "column family" was used in the older Thrift API.
The name "table" is used in the newer CQL API.
More info on the APIs can be found here:
http://wiki.apache.org/cassandra/API
If you need to use "group by, order by, count, sum, ifnull, concat, joins, and sometimes nested queries" as you state, then you probably don't want to use Cassandra, since it doesn't support most of those.
CQL supports COUNT, but only up to 10000. It supports ORDER BY, but only on clustering keys. The other things you mention are not supported at all.

Refer to the document: https://cassandra.apache.org/doc/old/CQL-3.0.html
It specifies that CQL accepts the TABLE keyword wherever COLUMNFAMILY is supported.
This shows that TABLE and COLUMNFAMILY are synonyms.

In Cassandra there is no difference between a table and a column family. They are one concept.

For Cassandra 3+ and cqlsh 5.0.1:
To verify, open a cqlsh prompt within a keyspace (ksp) and run:
CREATE COLUMNFAMILY myTable (
id text PRIMARY KEY,
name int
);
Then type 'desc myTable'.
You'll see:
CREATE TABLE ksp.mytable (
id text PRIMARY KEY,
name int
);
They are synonyms, and Cassandra uses table by default.

Here is a small example to illustrate the concept.
A keyspace is an object that holds column families and user-defined types.
Create keyspace University
with replication={'class':'SimpleStrategy',
'replication_factor': 3};
create table University.student(roll int Primary KEY,
dept text,
name text,
semester int);
After 'Create table' runs, the table 'student' is created in the keyspace 'University' with the columns roll, dept, name and semester. roll is the primary key, and it is also the partition key.
Because roll is the only primary key column, each row is stored in its own partition.
Key aspects of altering a keyspace in Cassandra:
Keyspace Name: The keyspace name cannot be altered in Cassandra.
Strategy Name: The strategy name can be altered by specifying a new strategy name.
Replication Factor: The replication factor can be altered by specifying a new replication factor.
DURABLE_WRITES: The DURABLE_WRITES value can be altered by setting it to true or false. By default, it is true. If set to false, no updates are written to the commit log; if set to true, they are.
Execution: An "Alter Keyspace" command can, for example, change the keyspace strategy from 'SimpleStrategy' to 'NetworkTopologyStrategy' and the replication factor from 3 to 1 for DataCenter1.

A column family is somewhat similar to a relational database's table, with differences in distribution and perhaps in philosophy.
Imagine you have a user entity that might contain 15 columns. In a relational DB you might want to divide the columns into small groups of related columns, the structures we all know as tables. In a distributed DB such as Cassandra you can concatenate the entries of all those tables into a single long row, so if you use a profiler or DB manager you'll see a single table with 15 columns instead of 2 or 3 tables. Another interesting thing is that every column family is written to different nodes, maybe even on different clusters, and is recognized by the row key. That means you have a single key for all the column families and won't need to maintain a PK or FK for every table, or maintain 1-1, 1-n, n-n relationships between them. Easy!

Related

Cassandra select CQL: Cannot add column after wildcard

I need to output the write timestamp as part of a table export for lots of tables, but I cannot quite figure out a way that does not force me to explicitly select all columns in the statement.
Instead of being able to do just this:
SELECT *, writetime(data) AS timestamp FROM dls.licenses;
I have to do that:
SELECT column1, column2, ... , writetime(data) AS timestamp FROM dls.licenses;
This is pretty inconvenient, since it means I'd have to change the export tool every time the schema of any of the tables changes.
Is there a better way?
Edit: To clarify, the actual error I get is the following. From the way the syntax is presented in the error, one could think that the SQL should be OK:
SELECT *, writetime(id) AS timestamp FROM dls.licenses;
SyntaxException: line 1:8 mismatched input ',' expecting K_FROM (SELECT *[,]...)
Edit 2: Here is the keyspace and create statement used for this table:
CREATE KEYSPACE IF NOT EXISTS dls WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
CREATE TABLE IF NOT EXISTS dls.licenses (subscription_id text, id text, key text, data text, PRIMARY KEY (key));
CREATE INDEX IF NOT EXISTS ON dls.licenses (id);
BTW: I'm using the fresh Cassandra 4.0.0 (GA).
If you are exporting to CSV or JSON files, you may consider using DataStax's dsbulk.
https://github.com/datastax/dsbulk
The latest version of dsbulk 1.8.0 added support to export writetime and ttl.
https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaOptionsPreserveTimestamp
dsbulk unload -url myData.csv -k ks1 -t table1 --timestamp
The WHERE clause specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.
The column specification of the relation must be one of the following:
One or more members of the partition key of the table
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key
A column that is indexed using CREATE INDEX.
In Cassandra 3.6 and later, add ALLOW FILTERING to filter only on a non-indexed cluster column.
You may be able to solve your query problem by creating a secondary index on the column you want the writetime for. Keep in mind that secondary indexes create overhead, which may result in unintended consequences.
The star (*) in SELECT * is the CQL syntax for "ALL columns" so by definition, it is not possible to include another column since ALL of them are selected even for native CQL functions. For this reason, you need to enumerate all column names + functions-on-columns.
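Since the columns have to be enumerated anyway, the export tool could build the statement from a column list instead of hard-coding it. Below is a minimal Python sketch of that idea. The function name build_export_query and the _writetime alias suffix are made up for illustration, and the column list is typed in by hand from the dls.licenses definition in the question; a real tool could presumably read it from the system_schema.columns table. Primary key columns are skipped because writetime() cannot be applied to them.

```python
def build_export_query(keyspace, table, columns, primary_key):
    """Build a SELECT listing every column plus writetime() for the regular ones."""
    # writetime() is not allowed on primary key columns, so leave those out.
    regular = [c for c in columns if c not in primary_key]
    parts = list(columns) + [f"writetime({c}) AS {c}_writetime" for c in regular]
    return f"SELECT {', '.join(parts)} FROM {keyspace}.{table};"


# Column list copied by hand from the dls.licenses table in the question.
columns = ["key", "subscription_id", "id", "data"]
query = build_export_query("dls", "licenses", columns, primary_key={"key"})
print(query)
```

With this approach a schema change only requires refreshing the column list, not rewriting the query text.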
+1 to Yuki's answer. I wanted to add that DSBulk adds a WRITETIME() column for every column in the table because it isn't possible to know in advance the write-time of each column in the partition until the full partition has been read.
Allow me to explain it using a couple of examples.
Schema
Consider this table:
CREATE TABLE users_by_email (
email text,
name text,
address text,
mobile text,
PRIMARY KEY (email)
);
Example 1
If we add a new record with a value specified for all columns:
INSERT INTO users_by_email (email, name, address, mobile)
VALUES ('alice@staysafe.com', 'Alice', '221B Baker St', '098-765-432-109');
then for this partition, all columns will have the same write-time.
Example 2
Consider a situation where a record is fragmented across multiple inserts over a period of time such as:
INSERT INTO users_by_email (email, name) VALUES ('dude@getvaccinated.now', 'Bob');
INSERT INTO users_by_email (email, address) VALUES ('dude@getvaccinated.now', '350 Fifth Ave');
INSERT INTO users_by_email (email, mobile) VALUES ('dude@getvaccinated.now', '012-555-123-456');
Each of the columns name, address and mobile would all have different write-times.
From these 2 examples, you should see that there isn't always a single write-time that applies to all columns in the partition.
For your specific use case, you need to figure out from the DSBulk output which write-time to use for situations where the partition fragments are inserted/updated at different times. Cheers!
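To make Example 2 concrete, here is a rough Python model of a partition in which every column is a cell carrying its own timestamp. This is only a conceptual sketch, not how Cassandra is implemented: apply_insert is a hypothetical helper and the timestamps 100, 200 and 300 are made-up values standing in for real write-times.

```python
def apply_insert(partition, values, timestamp):
    """Record each written column as a cell carrying its own write timestamp."""
    for column, value in values.items():
        current = partition.get(column)
        # Last write wins: keep the cell with the newest timestamp.
        if current is None or timestamp >= current[1]:
            partition[column] = (value, timestamp)


# Three separate inserts into the same partition, as in Example 2.
bob = {}
apply_insert(bob, {"name": "Bob"}, timestamp=100)
apply_insert(bob, {"address": "350 Fifth Ave"}, timestamp=200)
apply_insert(bob, {"mobile": "012-555-123-456"}, timestamp=300)

write_times = {column: ts for column, (_, ts) in bob.items()}
print(write_times)  # {'name': 100, 'address': 200, 'mobile': 300}
```

There is no single timestamp for the partition here, which is why an export has to carry one write-time per column.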

In Cassandra, why is dropping a column from tables defined with compact storage not allowed?

As per the DataStax documentation here, we cannot delete a column from tables defined with the COMPACT STORAGE option. What is the reason for this?
This goes back to the original implementation of CQL3, and changes which were made to allow it to abstract a "SQL-like," wide-row structure on top of the original Thrift-based storage engine. Ultimately, managing the schema comes down to whether or not the underlying structure is a table or a column_family.
As an example, I'll create two tables using an old install of Apache Cassandra (2.1.19):
CREATE TABLE student (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT);
CREATE TABLE studentcomp (
studentid TEXT PRIMARY KEY,
fname TEXT,
lname TEXT)
WITH COMPACT STORAGE;
I'll insert one row into each table:
INSERT INTO student (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
INSERT INTO studentcomp (studentid, fname, lname) VALUES ('janderson','Jordy','Anderson');
And then I'll look at the tables with the old cassandra-cli tool:
[default@stackoverflow] list student;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=, value=, timestamp=1599248215128672)
=> (name=fname, value=4a6f726479, timestamp=1599248215128672)
=> (name=lname, value=416e646572736f6e, timestamp=1599248215128672)
[default@stackoverflow] list studentcomp;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: janderson
=> (name=fname, value=Jordy, timestamp=1599248302715066)
=> (name=lname, value=Anderson, timestamp=1599248302715066)
Do you see the empty/"ghost" column value in the first result? That empty column value was CQL3's link between the column values and the table's meta data. If it's not there, then CQL cannot be used to manage a table's columns.
The comparator used for type conversion was all that was really exposed via Thrift. This lack of meta data control/exposure is what allowed Cassandra to be considered "schemaless" in the pre-CQL days. If I run a describe studentcomp from within the cassandra-cli, I can see the comparators (validation class) used:
Column Metadata:
Column Name: lname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: fname
Validation Class: org.apache.cassandra.db.marshal.UTF8Type
But if I try describe student, I see this:
WARNING: CQL3 tables are intentionally omitted from 'describe' output.
See https://issues.apache.org/jira/browse/CASSANDRA-4377 for details.
Sorry, no Keyspace nor (non-CQL3) ColumnFamily was found with name: student (if this is a CQL3 table, you should use cqlsh instead)
Basically, tables and column families were different entities forced into the same bucket. Adding WITH COMPACT STORAGE essentially made a table a column family.
With that came the lack of any schema management (adding or removing columns), outside of access to the comparators.
Edit 20200905
Can we somehow / someway (hack) drop the columns from table?
You might be able to accomplish this. Sylvain Lebresne wrote A Thrift to CQL3 Upgrade Guide which will have some necessary details for you. I also advise reading through the Jira ticket mentioned above (CASSANDRA-4377), as that covers many of the in-depth technical challenges that make this difficult.

Is it possible to shard Vitess with a secondary sharding key?

We are using the Vitess database to scale and achieve horizontal sharding in MySQL. Is it possible to do secondary sharding in Vitess?
For eg:
Table 1 - Agency
(
AgencyID INT,
CreatedOn DATETIME
)
Table 2 - PayrollDetails
(
AgencyID INT FOREIGN KEY TO Agency Table,
PayrollID INT,
PayrollCreatedOn DATETIME
)
We have sharded both tables with AgencyID as the sharding key, but the PayrollDetails table is very large, with more than 100 million records. So we are now planning to shard the PayrollDetails table again on the PayrollCreatedOn field. The primary shard for both tables should still be the Agency key, but the PayrollDetails table should be sharded on both AgencyID and PayrollCreatedOn. How can we achieve this in Vitess?
Conceptually, the sharding key (primary vindex) is used to decide which shard a row goes to. So, it's not possible to have two sharding keys because they would dictate conflicting locations for the row.
If I understand correctly and you want to query the table using PayrollCreatedOn in the where clause, you can create a secondary Vindex. This will create a lookup table that points at where the row lives, and Vitess can exploit that. There's an explanation for this here: https://vitess.io/docs/reference/vindexes/. There is a new command called CreateLookupVindex that is capable of backfilling this lookup table. It's yet to be documented, though.
Vitess also lets you "materialize" a table by using a different primary vindex. In that case, the second table will be a real-time copy of the first table, but sharded differently. You can see a demo for this on the vitess front page (scroll down to the video).
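The difference between the primary vindex and a secondary lookup vindex can be sketched in a few lines of Python. This is a conceptual model only and not Vitess code: NUM_SHARDS, shard_for, insert_payroll and find_by_created_on are hypothetical names, the MD5-based hash is just a stand-in for whatever the primary vindex computes, and the dates are made-up sample data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(agency_id):
    """Primary vindex stand-in: map the sharding key to exactly one shard."""
    digest = hashlib.md5(str(agency_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


shards = {i: [] for i in range(NUM_SHARDS)}
lookup = {}  # secondary vindex stand-in: PayrollCreatedOn -> set of AgencyIDs


def insert_payroll(agency_id, payroll_id, created_on):
    # The row always lives on the shard chosen by AgencyID.
    shards[shard_for(agency_id)].append((agency_id, payroll_id, created_on))
    # The lookup table only records where rows with this date can be found.
    lookup.setdefault(created_on, set()).add(agency_id)


def find_by_created_on(created_on):
    """Visit only the shards the lookup table points at, not every shard."""
    targets = {shard_for(a) for a in lookup.get(created_on, set())}
    return [row for s in sorted(targets) for row in shards[s] if row[2] == created_on]


insert_payroll(1, 10, "2020-01-01")
insert_payroll(2, 11, "2020-01-02")
insert_payroll(1, 12, "2020-01-02")
print(sorted(find_by_created_on("2020-01-02")))
# [(1, 12, '2020-01-02'), (2, 11, '2020-01-02')]
```

The row location is still decided by AgencyID alone; the lookup table just saves a query on PayrollCreatedOn from having to scatter to all shards.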

Insert identical records into multiple tables with different primary keys

I have some billions of records with 15 fields that I want to insert into Cassandra (with the Java API). Since the search key of my queries can be any one of five different fields of a record (i.e. a search query on field 3, 7, 8, 13 or 14), I have created 5 identical tables with different primary keys in Cassandra (similar to the note mentioned in the linked post).
Now I read a record (or a batch of records) and call "inserting into Cassandra" 5 times.
I want to know whether there is a mechanism in Cassandra that lets me call "inserting into Cassandra" once and have the record(s) stored in all 5 tables automatically.
For example, could the record(s) be stored in the MemTable once (by a single insert from my code), with the Cassandra core then storing them in 5 tables as SSTables?
Since Cassandra 3.0 there is support for materialized views, which could help you. But you need to design your source table carefully, as there are a number of limitations on how the structure of a materialized view can differ from the source table, most notably:
* you can add at most one column to the primary key that isn't in the primary key of the source table;
* the materialized view's primary key must contain all components of the primary key of the source table, but you can use a different order of columns in the primary key;
* all columns of the materialized view's primary key must be non-null.
You can find more details on these limitations in this blog post.
You also need to be careful when changing the partition key so that you don't end up with big partitions (but you may have the same problem if you write the data manually). Also, take into account that this adds more load to the coordinator node, which needs to distribute data to other servers if the partition key is changed. When you write data "manually", the driver sends the request directly to a replica that holds that data.
The syntax for creating materialized views is in the documentation. It is quite similar to SQL's, but not exactly the same (example from the documentation):
CREATE TABLE cyclist_mv (cid UUID PRIMARY KEY,
name text, age int, birthday date, country text);
CREATE MATERIALIZED VIEW cyclist_by_age
AS SELECT age, birthday, name, country
FROM cyclist_mv
WHERE age IS NOT NULL AND cid IS NOT NULL
PRIMARY KEY (age, cid);
In this case, we move from one column in the primary key (cid) to 2 columns in the primary key (age and cid). Note the explicit check for non-NULL values in the WHERE condition.
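As a rough mental model of what the view does with the cyclist example, here is a small Python sketch that re-keys base rows by (age, cid) and drops rows where a view key column is null. It only illustrates the re-keying and the non-null rule; Cassandra maintains the view on the server at write time, not like this. build_view is a hypothetical helper, the rows are made-up sample data, and short strings stand in for the UUID cid values.

```python
def build_view(base_rows):
    """Re-key base rows by (age, cid), skipping rows where either is None."""
    view = {}
    for cid, row in base_rows.items():
        age = row.get("age")
        if age is None or cid is None:
            continue  # mirrors WHERE age IS NOT NULL AND cid IS NOT NULL
        view[(age, cid)] = {k: row.get(k) for k in ("birthday", "name", "country")}
    return view


# Made-up sample rows keyed by cid, like the cyclist_mv base table.
base = {
    "c1": {"name": "Alex", "age": 28, "birthday": "1993-06-18", "country": "NL"},
    "c2": {"name": "Sam", "age": None, "birthday": None, "country": "BE"},
}
view = build_view(base)
print(sorted(view))  # [(28, 'c1')]
```

The row for c2 never appears in the view because its age is null, which is the reason every view primary key column has to be non-null.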

Is cassandra a row column database?

I'm trying to learn Cassandra, but I'm confused by the terminology.
In many places it says that a row stores key/value pairs.
But when I define a table, it's more like declaring a SQL table, i.e. you create a table and specify the column names and data types.
Can someone clarify this?
Cassandra is a column-based NoSQL database. While it does store simple key-value pairs at its lowest level, it stores these key-value pairs in collections. This grouping of keys and collections is analogous to rows and columns in a traditional relational model. Cassandra tables have a schema and can be queried (with restrictions) using a SQL-like language called CQL.
In your comment you ask about apples being stored in a different table from oranges. The answer to that specific question is no, they will be in the same table. However, Cassandra tables have an additional concept called the Partition Key that doesn't really have an analogous concept in the relational world. Take for example the following table definition:
CREATE TABLE fruit_types (
fruit text,
location text,
cost float,
PRIMARY KEY ((fruit), location)
);
In this table definition you will notice that we are defining the schema for the table. You will also notice that we are defining a PRIMARY KEY. This primary key is similar to, but not exactly like, the relational concept. In Cassandra the PRIMARY KEY is made up of two parts: the PARTITION KEY and the CLUSTERING COLUMNS. The PARTITION KEY is the first element specified in the PRIMARY KEY and can contain one or more fields delimited by parentheses. The PARTITION KEY is hashed to determine which node owns the data, and it is also used to physically divide the information on disk into files. The CLUSTERING COLUMNS are the other columns listed in the PRIMARY KEY and, amongst other things, define how the data is physically ordered on disk within the files determined by the PARTITION KEY. I suggest you do some additional reading on the PRIMARY KEY here if you're interested in more detail:
https://docs.datastax.com/en/cql/3.0/cql/ddl/ddl_compound_keys_c.html
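To illustrate the two roles using the fruit_types table, here is a small Python sketch. It is a conceptual model under stated assumptions, not Cassandra's implementation: the three node names are hypothetical, the MD5-modulo mapping is only a stand-in for the real partitioner and token ring (Cassandra's default partitioner is Murmur3-based), and the rows are made-up sample data.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def owner(partition_key):
    """Hash the partition key and map it to a node (stand-in for the token ring)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Made-up rows for fruit_types: (fruit, location, cost).
rows = [
    ("apple", "Seattle", 1.25),
    ("orange", "Miami", 0.80),
    ("apple", "Boston", 1.10),
]

# Group rows by partition key (fruit); all rows of one fruit live together.
partitions = {}
for fruit, location, cost in rows:
    partitions.setdefault(fruit, []).append((location, cost))

# Within a partition, rows are kept ordered by the clustering column (location).
for cells in partitions.values():
    cells.sort()

print(partitions["apple"])  # [('Boston', 1.1), ('Seattle', 1.25)]
print(owner("apple") == owner("apple"))  # True: same key, same node every time
```

Apples and oranges are in the same table but in different partitions, and each partition can be owned by a different node.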
Basically, Cassandra storage is like a sparse matrix. Earlier versions had a command line tool called cassandra-cli which could show the exact storage footprint of your column family (aka table in the latest versions). Later the community decided to keep an RDBMS kind of syntax for better understanding, because the query language (CQL) syntax is similar to SQL.
The main storage is the key (partition), which is the hash function result of the chosen partition column in your table, and the rest of the columns are tagged to it like a sparse matrix.

Resources