Trying to learn how Cassandra works here; it was very hard to find insight into how data is stored behind the scenes, so I'm not sure I got it right and would like to be corrected if I'm wrong. So far I understand that for a typical row where the partition key (the leftmost column, or compound of columns, in the primary key) is unique, the row is written to disk in a chunk that looks like this:
where the columns are sorted by their names.
But if the partition key is not unique, then it is considered a "wide row", and a row would look like the following examples:
Please correct me if i got it wrong...
For the second part, where you have a compound primary key, the structure will be something similar to the first, except that:
the column name will be replaced by username1.comments|username1.comments_ts|username2.comments|username2.comments_ts
VideoId : username1.comments|username1.comments_ts|username2.comments|username2.comments_ts
The same thing will be true for comments_by_user:
username: videoid1.comments|videoid1.comments_ts|videoid2.comments|videoid2.comments_ts
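The comments_by_user layout above can be sketched in plain Python (illustrative only, not actual Cassandra internals; the values are made up):

```python
# Illustrative sketch of a "wide row": one partition keyed by username,
# whose cells have composite names "<videoid>.<field>". Within a partition,
# Cassandra keeps cells ordered by column name, so iterating in sorted
# order mimics the on-disk layout.
partition_key = "username1"
cells = {
    "videoid1.comments": "Great video!",
    "videoid1.comments_ts": "2013-05-01T10:00:00",
    "videoid2.comments": "Thanks for sharing",
    "videoid2.comments_ts": "2013-05-02T12:30:00",
}

ordered = sorted(cells)
print(ordered)
```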
PS: I am not good at drawing images so you have to do with this text answer.
For more details, refer to Slide 48.
Small question regarding Cassandra please.
I have a row, I can see it in Cassandra, I can query it etc.
The row looks like this.
CREATE TABLE student_Registration(
Id int PRIMARY KEY,
Name text,
Event text
);
Id Name Event
101 Ashish Ninza
I would like to ask, is there a way to know when this row got inserted please?
I tried looking for a command to get the insert time, similar to SELECT TTL(Name) FROM, but for "when the row got inserted", and no luck so far.
May I ask what would be the best way to know when the row got inserted please?
Thank you
That's the responsibility of the writetime function (doc):
SELECT writetime(Name) from student_Registration
Update: but you need to keep the following in mind:
writetime shows the update time of the individual cell, and cells can be updated separately
the write time can be set explicitly when writing the data
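That first caveat is worth illustrating: writetime is per cell, so updating one column does not touch the timestamps of the others. A small in-memory model (not driver code; timestamps are made up):

```python
# Model each cell as (value, writetime), mirroring the per-cell
# timestamp that Cassandra's writetime() function reports.
row = {
    "Name": ("Ashish", 1000),
    "Event": ("Ninza", 1000),
}

# A later update to only Event leaves Name's writetime unchanged,
# so writetime(Name) and writetime(Event) now disagree.
row["Event"] = ("Marathon", 2000)

print(row["Name"][1], row["Event"][1])
```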
Hi, I have a Cassandra table with around 200 records in it. Later I altered the table to add a new column named budget, of type boolean. I want to set the default value to true for that column. What should the CQL look like?
I am trying the following command but it didn't work:
cqlsh:Openmind> update mep_primecastaccount set budget = true ;
SyntaxException: line 1:46 mismatched input ';' expecting K_WHERE
appreciate any help
thank you
Any operation that would require a cluster-wide read before write is not supported (it won't work at the scale Cassandra is designed for). You must provide the partition and clustering key in an update statement. If there are only 200 records, a quick Python script can do this for you: do a SELECT * FROM mep_primecastaccount and iterate through the ResultSet, issuing an update for each row. If you have a lot more records you might want to use Spark or Hadoop, but for a small table like that a quick script can do it.
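A minimal sketch of that script's core, assuming the table's partition key column is named `id` (hypothetical; substitute your real key columns). It only builds the statements; in practice you would run the SELECT and execute each UPDATE through your driver's session:

```python
def build_budget_updates(rows, keyspace="Openmind", table="mep_primecastaccount"):
    """Build one UPDATE per row, each with an explicit primary key,
    since Cassandra requires the full key in an UPDATE's WHERE clause."""
    stmts = []
    for row in rows:
        stmts.append(
            f"UPDATE {keyspace}.{table} SET budget = true WHERE id = {row['id']};"
        )
    return stmts

# rows would come from: SELECT * FROM mep_primecastaccount
rows = [{"id": 1}, {"id": 2}]
for stmt in build_budget_updates(rows):
    print(stmt)
```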
Chris's answer is correct - there is no efficient or reliable way to modify a column value for each and every row in the database. But for a 200-row table that doesn't change in parallel, it's actually very easy to do.
But there's another way that can also work on a table of billions of rows:
You can handle the notion of a "default value" in your client code. Pre-existing rows will not have a value for "budget" at all: it will be neither true nor false, but rather outright missing (a.k.a. "null"). Your client code can, when it reads a row with a missing "budget" value, replace it with a default value of its choice - e.g., "true".
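That client-side default can be sketched in a few lines (the row dicts stand in for fetched result rows):

```python
def budget_with_default(row, default=True):
    """Client-side default: treat a missing or null budget as `default`."""
    value = row.get("budget")
    return default if value is None else value

old_row = {"id": 42}                   # written before the column existed
new_row = {"id": 43, "budget": False}  # written after, with an explicit value

print(budget_with_default(old_row))
print(budget_with_default(new_row))
```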
Imagine a table with thousands of columns, where most data in the row record is null. One of the columns is an ID, and this ID is known upfront.
select id,SomeRandomColumn
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
SomeRandomColumn is one of thousands, and in most cases the only column with data. SomeRandomColumn is NOT known upfront as the one that contains data.
Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless there is a CQL query that looks something like this, with one query:
select ColumnHint.{DataColumnName}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
In MongoDB I would just have a collection, and the document I got back would have a "Type" attribute describing the data. So perhaps my real question is how I replicate what I can do with MongoDB in Cassandra. My Cassandra journey so far has been to create a UDT for each unique document, then alter the table to add that new UDT as a column. My starter table looks like this, where ColumnDataName is the hint:
CREATE TABLE IF NOT EXISTS WideProductInstance (
Id uuid,
ColumnDataName text,
PRIMARY KEY (Id)
);
Thanks
Is there a CQL query that can do something like this?
select {Only Columns with data}
from LotsOfColumnsTable
where id = 92e72b9e-7507-4c83-9207-c357df57b318;
No, you cannot do that, and it's pretty easy to explain why. To know that a column contains data, Cassandra needs to read it. And if it has to read the data, since the effort has already been spent on disk, it will just return this data to the client.
The only saving you'd get, if Cassandra were capable of filtering out null columns, is on network bandwidth...
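Since the row is read in full anyway, the equivalent filtering is cheap to do client-side once the row arrives. A small sketch (the dict stands in for a fetched result row):

```python
def non_null_columns(row):
    """Drop null cells from a fetched row, client-side."""
    return {name: value for name, value in row.items() if value is not None}

row = {
    "id": "92e72b9e-7507-4c83-9207-c357df57b318",
    "SomeRandomColumn": "data",
    "other1": None,
    "other2": None,
}
print(non_null_columns(row))
```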
I was thinking of putting in a "hint" column that points to the column with data, but that feels wrong unless there is a CQL query that looks something like this with one query;
Your idea amounts to storing, in another table, a list of all columns that actually contain real data and are not null. That sounds like a JOIN, which is bad and not supported. And if you need to read this reference table before reading the original table, you'll have to read in many places, and that's going to be expensive.
So perhaps my real question is how do I replicate what I can do with MondoDB in Cassandra.
Don't try to replicate a feature from Mongo in Cassandra; the two databases have fundamentally different architectures. What you have to do is reason about your functional use case - "How do I want to fetch my data from Cassandra?" - and from that point design a proper data model. A Cassandra data model is designed by query.
The best advice for you is to watch some Cassandra data modeling videos (they're free) at http://academy.datastax.com
I've been given the task of modelling a simple feed system in Cassandra. Coming from an almost solely SQL background, though, I'm having a bit of trouble figuring it out.
Basically, we have a list of feeds that we're listening to that update periodically. This can be in RSS, JSON, ATOM, XML, etc (depending on the feed).
What we want to do is periodically check for new items in each feed, convert the data into a few formats (i.e. JSON and RSS) and store that in a Cassandra store.
So, in an RDBMS, the structure would be something akin to:
Feed:
feedId
name
URL
FeedItem:
feedItemId
feedId
title
json
rss
created_time
I'm confused as to how to model that data in Cassandra to facilitate simple things such as getting x amount of items for a specific feed in descending created order (which is probably the most common query).
I've heard of one strategy that mentions having a composite key storing, in this example, the created_time as a time-based UUID along with the feed item ID, but I'm still a little confused.
For example, let's say I have a series of rows whose key is basically the feedId. Inside each row, I store a range of columns as mentioned above. The question is, where does the actual data go (i.e., the JSON, RSS, and title)? Would I have to store all the data for that 'record' as the column value?
I think I'm confusing wide rows and narrow (short?) rows as I like the idea of the composite key but I also want to store other data with each record and I'm not sure how to meld the two together...
You can store everything in one column family. However, if the data for each FeedItem is very large, you can split the data for each FeedItem into another column family.
For example, you can have one column family for Feed, where the columns of each key are FeedItem ids, something like:
Feeds # column family
FeedId1 #key
time-stamp-1-feed-item-id1 #columns have no value, or values are enough info
time-stamp-2-feed-item-id2 #to show summary info in a results list
The Feeds column family allows you to quickly get the last N items from a feed, and querying for the last N items of a feed doesn't require fetching all the data for each FeedItem: either nothing is fetched, or just a summary.
Then you can use another column family to store the actual FeedItem data:
FeedItems # column family
feed-item-id1 # key
rss # 1 column for each field of a FeedItem
title #
...
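The two-column-family layout above can be sketched with in-memory stand-ins (illustrative only, not driver code; the names and timestamps are made up):

```python
# Feeds maps feedId -> cells named "<timestamp>-<feed-item-id>", kept
# sorted by name (so by timestamp). FeedItems maps item id -> full data.
feeds = {
    "FeedId1": {
        "20130501-feed-item-id1": "",
        "20130502-feed-item-id2": "",
    }
}
feed_items = {
    "feed-item-id1": {"title": "First post", "rss": "<rss>...</rss>"},
    "feed-item-id2": {"title": "Second post", "rss": "<rss>...</rss>"},
}

def last_n_items(feed_id, n):
    """Last N items: read only the tail of the Feeds row, then fetch
    each referenced FeedItem by key, newest first."""
    cells = sorted(feeds[feed_id])[-n:]            # newest cells sort last
    item_ids = [c.split("-", 1)[1] for c in cells]
    return [feed_items[i] for i in reversed(item_ids)]

print([item["title"] for item in last_n_items("FeedId1", 2)])
```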
Using CQL should be easier for you to understand, given your SQL background.
Cassandra (and NoSQL in general) is very fast, and you don't get real benefits from using a related table for feeds; in any case you will not be able to do JOINs. Obviously you can still create two tables if that's more comfortable for you, but you will have to manage linking the data inside your application code.
You can use something like:
CREATE TABLE FeedItem (
feedItemId ascii PRIMARY KEY,
feedId ascii,
feedName ascii,
feedURL ascii,
title ascii,
json ascii,
rss ascii,
created_time ascii );
Here I used ascii fields for everything. You can choose different data types for feedItemId or created_time (available data types can be found here), and depending on which language and client you are using, it can be transparent or require some more work to make them work.
You may want to add some secondary indexes. For example, if you want to search for feed items from a specific feedId, something like:
SELECT * FROM FeedItem where feedId = '123';
To create the index:
CREATE INDEX FeedItem_feedId ON FeedItem (feedId);
Sorting/ordering, alas, is not easy in Cassandra. Maybe reading here and here can give you some clues about where to start looking, and it also really depends on the Cassandra version you're going to use.
How do we retrieve a row in Cassandra using Astyanax?
I have a web application which requires pagination to be done on the server side; the DB is Cassandra. The row key is a UUID and I have a few columns within a row, so I am trying to paginate on the row keys.
I have put together a solution I am not completely happy with. The issue is that when I do my first search, based on the search filter I get from the UI, I don't know the UUID of the first row. So I prepare a query which gives me the first 6 records, store the key of the 6th record in a map in the session, and when the user requests the second page from the UI, I retrieve this key (UUID) and use it as the start of the next set of records to be retrieved. I was trying to find a cleaner approach.
EDIT in response to question changes...
In that case, it sounds like you are doing it just fine. It also sounds like you are using the order-preserving partitioner (OPP) then, as otherwise the rows would not be in order. While playOrm's solution is more elegant, returning you a cursor that you store in the session, I think what you have is just fine.
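The store-the-last-key approach can be sketched like this (a plain-Python stand-in for a range query over ordered row keys; here each page starts just after the stored key, so you don't re-fetch it):

```python
def page_after(sorted_keys, last_key, page_size):
    """Return the next page of row keys after `last_key`
    (pass None for the first page)."""
    if last_key is None:
        start = 0
    else:
        start = sorted_keys.index(last_key) + 1
    return sorted_keys[start:start + page_size]

keys = [f"key-{i:02d}" for i in range(15)]
first = page_after(keys, None, 6)        # page 1: keys 0..5
second = page_after(keys, first[-1], 6)  # page 2: keys 6..11, via stored key
print(first[-1], second[0])
```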
EDIT since our code changed
Line 74 in this link is how we do it (you can drill down into that cursor class to see what we do):
https://github.com/deanhiller/playorm/blob/master/src/main/java/com/alvazan/orm/layer9z/spi/db/cassandra/CassandraSession.java
which, if you have 10 row keys, means you just pass in the list of keys.
I am not sure what you mean by pagination? Are you saying you have a from and a to row key and want the rows between them? Are you ordering the cluster by row key then?
Another completely different direction to go for ordering is playOrm though which can do S-SQL and if things are partitioned, you can do joins and other such(that link above is to one of the files in playOrm actually).
I am really not sure what you mean by "you don't have the row key with you".
later,
Dean