Lists in NoSQL/BigTable Data Modeling & Super Columns (with Cassandra) - cassandra

I'm new to NoSQL and BigTable, and I'm trying to learn how I can (and whether I should) use super columns to create a BigTable-friendly schema.
Based on this article about NoSQL data modeling, it sounds like instead of using JOIN-centric RDBMS schemas, I should aggregate my data into larger tables to de-normalize where possible. Based on that, here's a simple schema I envisioned for a 'User', which I'm trying to create for Cassandra:
User: {
KEY: UserId {
name: {
first,
last
},
age,
gender
}
};
The above column family (User), whose key is a 'UserId', is composed of 3 columns (name, age, gender). Its column 'name' would be a super column composed of 'first' and 'last' columns.
So what I'm asking is:
What does the CQL 3.0 look like to create this column family 'User' with the 'name' super column within it? (Update: This doesn't appear possible.)
Should I be using super columns (like this)? Should I be using something else?
What's an alternative way of representing this schema?
How do I represent a list of values in a table/column family?
Here are some useful links about this that I found, but that I don't quite understand clearly enough to answer my question:
Create a Cassandra schema for a super column with metadata
Cassandra: How to create column in a super column family?
Modeling relational data with Cassandra
Thanks!
Update:
After a lot of research, I'm learning a few things:
You cannot create super columns using CQL; there might be other mechanisms to do so, but CQL does not appear to be one of them.
Syntax for CQL 3.0 seems to be drifting from a 'COLUMN FAMILY'-centric approach towards SQL-like 'TABLE'-based syntax.
Changed my questions accordingly.

Should I be using super columns (like this)? Should I be using
something else?
You can use the data model you suggested, but it is generally not recommended, for the reasons given in the link:
I'll also note that use of super columns is generally discouraged as
they have several disadvantages. All subcolumns in a super column
need to be deserialized when reading one sub column and you can not
set secondary indexes on super columns. They also only support one
level of nesting.
Weigh these drawbacks against your situation.
What's an alternative way of representing this schema?
You can try using composite columns. Read here for more information. Or you can probably just use a standard column family; I think a standard CF will be suitable for your situation. For example:
User : {
key: userId {
ColumnName:firstname
ColumnName:lastname
ColumnName:age
ColumnName:gender
ColumnName:zip
ColumnName:street
}
..
};
How do I represent a list of values in a table/column family?
It is possible to store the whole list as a single BytesType value in the CF. Alternatively, you can break the list into individual elements and store each one as its own column under a CompositeType name.
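One way to picture the second option, as a minimal Python sketch: each list element becomes its own column whose composite name is (list name, index). The row is modeled as a plain dict, and the helper names `list_to_columns` and `columns_to_list` as well as the sample values are made up for illustration; this is not a Cassandra client API.

```python
# Illustrative only: a row is modeled as a dict of column name -> value.
# Composite column names are modeled as (list_name, index) tuples.

def list_to_columns(prefix, items):
    """Break a list into one column per element, keyed by (prefix, index)."""
    return {(prefix, i): item for i, item in enumerate(items)}


def columns_to_list(row, prefix):
    """Rebuild the list by reading the columns for `prefix` in index order."""
    keys = sorted(k for k in row if isinstance(k, tuple) and k[0] == prefix)
    return [row[k] for k in keys]


# Hypothetical user row: two plain columns plus a list spread over columns.
row = {"firstname": "Ada", "lastname": "Lovelace"}
row.update(list_to_columns("emails", ["ada@example.com", "al@example.org"]))

emails = columns_to_list(row, "emails")
```

Because every element lives in its own column, a single element can be read or overwritten without deserializing the rest, which is the drawback of packing the whole list into one bytes value.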

Related

How do we handle alter table column datatypes in Cassandra, in actual scenarios?

I'm aware of the restrictions Cassandra places on modifying a column's data type once a table is created, and even on dropping a column and re-adding it under the same name with a different data type.
Dropping and adding are allowed, with restrictions.
But in practice, it's not that uncommon to modify a table schema during the initial phase of a project.
Example: changing the Name column of the User table from TEXT to a UDT (user-defined type) that could encapsulate more information.
Coming from an RDBMS background, I find this behaviour very strange, and maybe someone with actual project experience on Cassandra can answer.
How do we handle such a change of column data type? And what are the best practices?
Also, is this behaviour common to other NoSQL or columnar databases?

How CRUD operations are possible in Composite columns of a Cassandra Column Family?

I've decided to use NoSQL (Cassandra) in my new project.
I couldn't find any documentation that helps me. I have a huge one-to-many relation in my data model! The problem is that when I query my column family, it returns many rows with many identical columns and just one different column holding the values of the composite column:
ID   | Name   | Age   | ExtraInfo
-----|--------|-------|------------------------------
myId | myName | myAge | "info": 5423
myId | myName | myAge | "info": this's a test string
myId | myName | myAge | "info": 454$
The returned result is redundant!
Also,
how can we update a composite column in Cassandra? For example, how can we insert a new column into a composite column for an existing row? Do we have to repeat all the data for an id in this case (id, name and age)? What happens if the rows are very heavy?
Let me ask another question, too! Can the values of composite columns have different types, as in the example above?
As NoSQL is not a relational database, and as you mentioned that you have huge one-to-many relations, I'd recommend using a relational database like MSSQL or MySQL.
As it seems that you are already a relational database user, please consider reading more about NoSQL databases (how they handle joins, ordering, ...) before starting a whole new project using NoSQL.
P.S.: Your question is very general and cannot be answered here. Read up on the topic first and ask only specific questions that the articles don't cover.
Update:
Here is a Cassandra tutorial where you can find examples of CRUD operations and much more: http://www.tutorialspoint.com/cassandra/cassandra_introduction.htm

PouchDB structure

I am new to the NoSQL concept, so when I started to learn PouchDB, I found this conversion chart. My confusion is: how does PouchDB handle the case where I have multiple tables? Does it mean that I need to create multiple databases? From my understanding, a PouchDB database can store a lot of documents, but does a document correspond to a row in SQL, or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While @llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section "When not to use map/reduce", Nolan explains that for simpler applications, the key is to abuse _ids and leverage the power of allDocs().
In other words, if you had two separate types (say artists and albums), you could prefix the id of each doc with its type to obtain an easily searchable data set. For example, _id: 'artist_name' and _id: 'album_title' would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
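To illustrate why prefixed ids work, here is a small Python emulation of a range scan over sorted ids. It assumes, as in PouchDB, that documents are ordered by `_id`; the `all_docs` function and the sample documents are hypothetical stand-ins written for this sketch, not the PouchDB API (the real call would be `allDocs()` with `startkey` and `endkey` options).

```python
from bisect import bisect_left, bisect_right


def all_docs(docs, startkey, endkey):
    """Return the docs whose _id lies in [startkey, endkey], in _id order."""
    ids = sorted(docs)                   # the store keeps ids in sorted order
    lo = bisect_left(ids, startkey)      # first id >= startkey
    hi = bisect_right(ids, endkey)       # one past the last id <= endkey
    return [docs[i] for i in ids[lo:hi]]


# Hypothetical documents of two "types", distinguished only by id prefix.
docs = {
    "album_abbey_road": {"_id": "album_abbey_road", "year": 1969},
    "artist_wings": {"_id": "artist_wings"},
    "artist_beatles": {"_id": "artist_beatles"},
    "album_band_on_the_run": {"_id": "album_band_on_the_run", "year": 1973},
}

# "\uffff" sorts after any ordinary character, so this range covers
# every id that starts with "artist_".
artists = all_docs(docs, "artist_", "artist_\uffff")
artist_ids = [d["_id"] for d in artists]
```

The range lookup touches only the primary id ordering, which is why no extra index or view is needed for this kind of query.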
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. A SQL table defines column headers (name and type); these correspond to the JSON property names of the doc.
So, all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemas in one database as you want (visit json-schema.org for some inspiration).
How to request them separately? Create CouchDB views! You can get all or some "rows" of your tabular data (docs with the same schema) with one request, as you know it from SQL.
To make such views easy to write, CouchDB docs very commonly carry a type property. The name of your SQL table can become the type, like doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user feature. Take a look at this conversation on the CouchDB mailing list.

Is it possible to insert/write data without defining columns in Cassandra?

I am trying to understand the fundamentals of the Cassandra data model. I am using CQL. As far as I know, the schema must be defined before anyone can insert into new columns. To add a column, one can use ALTER TABLE and then INSERT values into that new column.
But Cassandra: The Definitive Guide says that Cassandra is schema-less:
In Cassandra, you don’t define the columns up front; you just define the column
families you want in the keyspace, and then you can start writing data without defining
the columns anywhere. That’s because in Cassandra, all of a column’s names are
supplied by the client.
I am getting confused and can't find a clear answer. Can someone please explain it to me or tell me if I am missing something?
Thanks in advance.
There are two different APIs for writing data to Cassandra. First there's the Thrift API, which has always allowed creating columns dynamically, but also supports adding metadata for your columns.
Next there's the newer CQL-based API. CQL was created to provide another abstraction layer that makes Cassandra more user-friendly to work with. With CQL you're required to define a schema upfront for your column names and data types. However, that doesn't mean it's not possible to use dynamic columns with CQL.
See here for differences:
http://www.datastax.com/dev/blog/thrift-to-cql3
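As a rough illustration of how dynamic columns are usually modeled under CQL: the would-be column name becomes the value of a clustering column, so each "dynamic column" is stored as its own entry within the partition. The sketch below mimics that layout with plain Python dicts; `insert`, `select` and the sample keys are hypothetical helpers for this example, not driver calls.

```python
from collections import defaultdict

# partition key -> {clustering value (the "dynamic column name") -> value}
table = defaultdict(dict)


def insert(user_id, column_name, value):
    """Write one dynamic column; no schema change is needed for new names."""
    table[user_id][column_name] = value


def select(user_id):
    """Return all (column_name, value) pairs for a partition, sorted by name."""
    return sorted(table.get(user_id, {}).items())


# The fixed schema is just (user_id, column_name, value); the set of
# column names per user is open-ended.
insert("u1", "nickname", "ace")
insert("u1", "age", "30")
insert("u2", "city", "Oslo")

u1_columns = select("u1")
```

The schema stays fixed while the set of names per partition remains open, which is how the old Thrift-style flexibility is kept under a declared table definition.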
You are reading "Cassandra: The Definitive Guide", a three- or four-year-old book that describes behaviour that changed a long time ago. Now you have to define the table structure before you can write data.
Here you can find some reasons behind CQL introduction and the schema-less abandonment.
The official Datastax documentation should be your definitive guide.
HTH,
Carlo

Secondary Index in Cassandra will lead to two DB reads

Let's assume a data model in which a user has blog posts. Each post has a unique title and many attributes.
I have a Column Family "posts" in which each row is like this:
posts = {
"yesterday" : {
date : 03-04-2012
userID : abfe222234
tags : "beatles,paul"
}
}
I want to index the posts by user, so I have another regular column family:
user_posts = {
abfe222234 : {
yesterday : null
....
}
}
This model comes after a lot of research about secondary indexing in Cassandra, during which I came across these slides: http://www.slideshare.net/edanuff/indexing-in-cassandra and understood that super column families are used less and less.
My question:
If I want all the details of a user's posts, I have to read the DB twice: once to get all the post IDs, and once to fetch the details of the posts with those IDs.
What am I missing?
Thanks,
Issahar.
edit:
The other option is to make "user_posts" a super CF and have it contain all the data that is inside "posts".
pros: you only have to fetch the data once.
cons: 1. You duplicate all of your data. 2. You can't search by a single attribute of a post.
What do you say?
Looks pretty straightforward to me: you do indeed need to perform two database reads to get the data in this case. For what it's worth, most relational databases also need to perform two logical reads, unless the data the user is interested in is fully contained in the index. The only difference is that in a relational DB there is only one network round trip.
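A minimal Python sketch of the two-read pattern, using plain dicts in place of the two column families from the question. The `posts_for_user` helper is a made-up name for illustration; in a real deployment each step would be a separate request to the cluster.

```python
# Stand-ins for the two column families described in the question.
posts = {
    "yesterday": {
        "date": "03-04-2012",
        "userID": "abfe222234",
        "tags": "beatles,paul",
    },
}
user_posts = {
    "abfe222234": {"yesterday": None},  # column names are post ids, values unused
}


def posts_for_user(user_id):
    """Fetch all post details for a user with an index read plus a data read."""
    # Read 1: the index row gives the ids of the user's posts.
    post_ids = list(user_posts.get(user_id, {}))
    # Read 2: fetch the details for those ids (a multi-key read in practice).
    return {pid: posts[pid] for pid in post_ids if pid in posts}


details = posts_for_user("abfe222234")
```

The second step can be issued as one multi-key request, so the cost is two round trips in total rather than one per post.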

Resources