I have a question about designing a data model in Cassandra. I have created this CF:
Page-Followers{ "page-id" : { "user-id" : "time" } }
I want to make 2 queries on the above CF.
1) To get all user-ids (as an array, using the multiget function of phpcassa) of the users who are following a particular page.
2) To check whether a particular user is following a particular page or not,
i.e. whether a user with user-id = 1111 is following the page with page-id = 100.
So, how can I run those queries against that CF?
Note: I don't want to create a new CF for this situation, because for this user action (i.e. the user clicking the follow button on a page) I already have to insert data into 3 CFs; with another CF I would have to insert into 4 CFs in total, which may cause a performance issue.
An example in phpcassa would be great...
Another doubt: I have created the Cassandra data model for my college social networking site (i.e. page-followers, user-followers, notifications, alerts, etc.). For each user action I have to insert data into 2, 3, or more CFs, so will that cause a performance issue? Is it a good design?
Please help me...
Thanks in advance
In general, while data modeling in Cassandra, you first look at your queries and then construct a data model suitable for them.
For your case, you can do the following (I have no experience with phpcassa, so I can only give you the approach; you'll have to figure out the phpcassa bit, though a rough sketch follows the two steps below):
1) Do a slice query with the start column as '' and the end column as '', and set the column count to a very large value. This will return all of the columns in the row.
2) Just do a get for row key = 100 and column = 1111 (the user id). If the column exists, the user follows the page.
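Here is that rough sketch in phpcassa (assuming phpcassa 1.x; the keyspace name, host, autoloader path, and the count of 10000 are illustrative, so check the tutorial linked in the edit below for the exact API):
<?php
// Path to phpcassa's autoloader (adjust to your setup).
require_once(__DIR__.'/lib/autoload.php');

use phpcassa\Connection\ConnectionPool;
use phpcassa\ColumnFamily;
use phpcassa\ColumnSlice;

$pool = new ConnectionPool("MyKeyspace", array("127.0.0.1"));
$cf = new ColumnFamily($pool, "PageFollowers");

// 1) All followers of page 100: slice the whole row ('' to '')
// with a large column count, then keep the column names (user-ids).
$row = $cf->get("100", new ColumnSlice("", "", 10000));
$userIds = array_keys($row);

// 2) Does user 1111 follow page 100? Ask for just that column
// and check whether anything comes back.
$isFollowing = $cf->get_count("100", null, array("1111")) > 0;
?>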
Cassandra is highly optimized for writes. The recommended way to model data in Cassandra is to write in a denormalized fashion, even to multiple CFs. Writing to 2 or 3 column families should not be an issue, and you can always make the writes asynchronous to achieve better performance.
EDIT: http://thobbs.github.com/phpcassa/tutorial.html is a good place to start with phpcassa.
I have this structure with about 1000 data points in a list on the website:
Datapoint1:
Datapoint2:
...
Datapoint1000:
With each datapoint containing 6 fields of information.
Each datapoint can be opened to reveal an additional 2-3x as much information in a sublist.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra? Should I just go ahead and get it all in one go?
Should I just go ahead and get it all in one go?
Definitely not.
Would making a new request upon the user clicking on one of my datapoints be considered bad practice in Cassandra?
That's absolutely the way you should do it. Cassandra is great at writing large amounts of data, but not so great at returning large amounts of data. Many small, key-based queries are definitely the way to go.
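A sketch of what that split might look like in CQL (table and column names here are hypothetical): one table serves the list view, and a second table, queried only when the user clicks a datapoint, serves the expanded sublist.
-- Summary row per datapoint, read once for the list view
CREATE TABLE datapoints (
    listid uuid,
    datapointid uuid,
    field1 text,   -- ... and the other summary fields
    PRIMARY KEY (listid, datapointid)
);

-- Detail rows, fetched on demand when a datapoint is opened
CREATE TABLE datapoint_details (
    datapointid uuid,
    detailid uuid,
    detail text,
    PRIMARY KEY (datapointid, detailid)
);

SELECT * FROM datapoint_details WHERE datapointid = ?;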
It is possible to do JOINs on the client side, but as a general proposition, queries which require joins indicate that you possibly didn't design the data model correctly.
You need to model your data such that each application query maps to a single table. If you need to do a client-side JOIN, then you need to query the database multiple times to get the data your app requires. It will work, but it's not efficient, so it affects the performance of both the app and the database.
To illustrate with an example, let's say your app needs to display a customer's list of orders. The table would need to be partitioned by customer, with the orders as multiple clustered rows:
CREATE TABLE orders_by_customerid (
customerid text,
orderid text,
orderdate timestamp,
ordertotal decimal,
...
PRIMARY KEY (customerid, orderid)
)
You would retrieve the list of orders for a customer with:
SELECT ... FROM orders_by_customerid WHERE customerid = ?
By default, the driver or Stargate API your app is using will page the results, so that only the first 100 rows (for example) are returned instead of thousands of rows in a single pass. Note that the page size is configurable, as shown below. Cheers!
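For example, in cqlsh (which pages results the same way) the fetch size can be adjusted before running the query; a sketch, assuming a reasonably recent cqlsh and a hypothetical customer id:
PAGING 100
SELECT orderid, orderdate, ordertotal
FROM orders_by_customerid
WHERE customerid = 'customer-42';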
Please help me resolve a confusion. My Cassandra book claims that an attempt to query based on a column that is not part of the PK should fail (there is no secondary index on this column either). However, when I try it, I just see this warning:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Once I append ALLOW FILTERING to my query, there is no more error. I understand the implications for performance; however, this clearly contradicts what is written in the book. Was this feature added later, or did the book's authors simply miss it?
I think it is great that you have a textbook to guide you through important NoSQL concepts, but don't rely on it too heavily: Cassandra is open source and is constantly updated by the community. Online resources such as the official Apache documentation are a much better option for up-to-date information and tutorials on new and existing features.
Although ALLOW FILTERING does exist, it is still recommended to use a different table construction (e.g. making the column part of the key) or to create an INDEX to keep querying fast.
AFAIK, Cassandra has had ALLOW FILTERING since version 1.
Also, to explain ALLOW FILTERING, as per the DataStax documentation:
Let’s take for example the following table:
CREATE TABLE blogs (blogId int,
time1 int,
time2 int,
author text,
content text,
PRIMARY KEY(blogId, time1, time2));
If you execute the following query:
SELECT * FROM blogs;
Cassandra will return you all the data that the table blogs contains.
If you now want only the data at a specified time1, you will naturally add an equal condition on the column time1:
SELECT * FROM blogs WHERE time1 = 1418306451235;
In response, you will receive the following error message:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING.
Cassandra knows that it might not be able to execute the query in an efficient way. It is therefore warning you: “Be careful. Executing this query as such might not be a good idea as it can use a lot of your computing resources”.
The only way Cassandra can execute this query is by retrieving all the rows from the table blogs and then by filtering out the ones which do not have the requested value for the time1 column.
If your table contains, for example, 1 million rows and 95% of them have the requested value for the time1 column, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value for the time1 column, your query is extremely inefficient: Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the time1 column.
Unfortunately, Cassandra has no way to differentiate between the 2 cases above as they are depending on the data distribution of the table. Cassandra is therefore warning you and relying on you to make the good choice.
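To make that second case concrete, here is a sketch of the index option mentioned above (the index name is arbitrary); with the index in place, the earlier query runs without ALLOW FILTERING:
CREATE INDEX blogs_time1_idx ON blogs (time1);

SELECT * FROM blogs WHERE time1 = 1418306451235;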
Thanks,
Harry
I have a single structured row as input, with a write rate of 10K rows per second. Each row has 20 columns. Several queries need to be answered from this input. Because most of the queries need a different WHERE, GROUP BY, or ORDER BY, the final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
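In CQL terms, each of those tables would look something like this (column names and types are hypothetical):
CREATE TABLE table_for_query1 (
    column1 text,
    column2 text,
    column3 text,
    column4 text,
    -- ... the remaining columns of the row
    PRIMARY KEY ((column1, column2), column3, column4)
);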
I am aware of the limit on the number of tables in a Cassandra data model (200 triggers a warning and 500 would fail).
Because for every input row I have to do an insert into every table, the writes per second get very big:
writes per second = 10K (input)
                  * number of tables (queries)
                  * replication factor
For example, with 5 query tables and a replication factor of 3, that is 10K * 5 * 3 = 150K writes per second.
The main question: am I on the right path? Is it normal to have a table for every query, even when the input rate is already so high?
Shouldn't I use something like Spark or Hadoop instead of relying on the bare data model? Or even HBase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication needed for indexing is managed automatically by Elassandra on the back end. All the columns of a table are indexed, so the Elasticsearch side of Elassandra can be used through the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data (8 GB) into an Elassandra database non-stop, and it never timed out; the search engine also remained responsive pretty much the whole time. That is more or less the load you are talking about. The docs say it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will depend somewhat on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once it's in place, it's incredible how fast you can find results. It certainly includes powerful WHERE capabilities. GROUP BY is a bit difficult to put in place. ORDER BY is simple enough, although you lose some speed when (re-)ordering; something to keep in mind. In my tests, though, even the ORDER BY equivalents were very fast.
If Apache Cassandra's architecture encourages the use of non-normalized column families designed specifically for anticipated queries, how do users edit data that is replicated across many columns without creating inconsistencies?
e.g., example 3 here: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
If Jay were no longer interested in iPhones, deleting this piece of information would require deleting columns in 2 separate column families. Do users just need to code add/edit/delete functions that appropriately update all the relevant tables, or does Cassandra somehow know how records are related and handle this for them?
In the Cassandra 2.x world, the way to keep your denormalized query tables consistent is to use atomic batches.
In an example taken from the CQL documentation, assume that I have two tables for user data. One is the "users" table and the other is "users_by_ssn." To keep these two tables in sync (should a user change their "state" of residence) I would need to apply an upsert like this:
BEGIN BATCH
UPDATE users
SET state = 'TX'
WHERE user_uuid = 8a172618-b121-4136-bb10-f665cfc469eb;
UPDATE users_by_ssn
SET state = 'TX'
WHERE ssn = '888-99-3987';
APPLY BATCH;
Users need to code the add/edit/delete functions themselves.
Note that Cassandra 3.0 has materialized views, which automate denormalization on the server side. A materialized view is added to/edited/updated automatically based on its parent (base) table.
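For example, the users_by_ssn table from the batch example above could instead be defined as a materialized view of users; a sketch, assuming the base table is keyed by user_uuid and has an ssn column:
CREATE MATERIALIZED VIEW users_by_ssn AS
    SELECT * FROM users
    WHERE ssn IS NOT NULL AND user_uuid IS NOT NULL
    PRIMARY KEY (ssn, user_uuid);

With this in place, the earlier UPDATE to users would be reflected in users_by_ssn automatically, with no second statement needed.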
Let's say I currently have a table like this:
CREATE TABLE comment_counters (
    contentid uuid,
    commentid uuid,
    ...
    liked counter,
    PRIMARY KEY (contentid, commentid)
);
The purpose of this table is to track comments and the number of times individual comments have been "liked".
What I would like to do is get the top comments (say, the top 20) for each content item, determined by their number of likes.
I know there's no way to ORDER BY a counter, so what I would like to know is: are there any other ways to do this in Cassandra, for instance by restructuring my tables or by tracking more/different information, or am I left with no choice but to do this in an RDBMS?
Sorting in the client is not really an option I would like to consider at this stage.
Unfortunately there's no way to do this type of aggregation using plain Cassandra queries. The best option for this kind of data analysis is to use an external tool such as Spark.
Using Spark, you can run periodic jobs that read and aggregate all the counters from the comment_counters table and then write the results (such as the top 20 comments) to a different table that you can query directly afterwards, as sketched below.
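A hypothetical sketch of such a results table (names and types are illustrative): the Spark job writes the top 20 comments per content item clustered by rank, so reading them back is a single small query.
CREATE TABLE top_comments_by_content (
    contentid uuid,
    rank int,
    commentid uuid,
    likes bigint,
    PRIMARY KEY (contentid, rank)
);

SELECT commentid, likes
FROM top_comments_by_content
WHERE contentid = ?
LIMIT 20;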
See here to get started with Cassandra and Spark.