I want to create a database of events. Events by the same user would have the same user id. I could then execute queries to retrieve users who had done events A and B but not C.
I've had a look at Cassandra but I'm unclear about how I should model this data.
The approach I thought of would be:
Every entry in the database gets a unique id (because I think Cassandra requires a unique primary key?), and then I have one column which is my user id, non-unique. Then I am free to give each event other columns, depending on what is relevant to that event. So I might have some entries:
1,user1,event_column=registered,fname_column=James,lname_column=Mason
2,user2,event_column=deleted
3,user1,event_column=pageview,page_column=homepage
and so on.
Then I'm a bit unclear about how I would select users who had done A and B but not C. Could I do that with one query? Or would I need to bring into Java all users who had done A, then all users who had done B, and filter for users in common?
Does that approach sound possible and a good way to use Cassandra?
Are there other open source distributed databases which might be appropriate?
Thanks for your help!
With a secondary index on the "event_column" you can ask ...WHERE event_column IN ('A', 'B'), but you cannot do a NOT IN clause like in conventional SQL. (See also this answer).
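Without NOT IN, the "A and B but not C" part has to be resolved on the client, much as the question suspected. A minimal sketch in Python, assuming the (user id, event) pairs have already been fetched from the database; the `events` list and the `users_matching` helper are hypothetical names for illustration, not part of any driver API:

```python
from collections import defaultdict

def users_matching(events, required, excluded):
    """Return user ids that have every event in `required` and none in `excluded`."""
    # Group user ids by event name.
    users_by_event = defaultdict(set)
    for user_id, event in events:
        users_by_event[event].add(user_id)

    # Intersect the user sets of all required events.
    matched = None
    for event in required:
        ids = users_by_event.get(event, set())
        matched = set(ids) if matched is None else matched & ids
    matched = matched or set()

    # Remove anyone who also has an excluded event.
    for event in excluded:
        matched -= users_by_event.get(event, set())
    return matched

events = [
    ("user1", "registered"),
    ("user1", "pageview"),
    ("user2", "registered"),
    ("user2", "pageview"),
    ("user2", "deleted"),
]
result = users_matching(events, required=["registered", "pageview"], excluded=["deleted"])
print(sorted(result))
```

For large user bases the sets may not fit comfortably in memory, so this is only reasonable when each event matches a manageable number of users.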
An example of a database that supports your query is MongoDB, where $nin and $in are similar to NOT IN (...) and IN (...), respectively. (MongoDB is a document database, whereas Cassandra is a column database.)
In order to prevent a complete scan of all documents, remember to put a secondary index on the events property if only a minority of documents will contain the events that you search for.
Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct: you need to know the primary key of a record to update or delete it. There are several options here, depending on your data model:
1. Perform a full scan of the table using an efficient token range scan (see this answer for more details).
2. If this is required very often, you can create a materialized view with User.email as the partition key and fetch all the message IDs that you can update (you'll need to do this from your application, since there is no nested query support in CQL). Be aware that materialized views are an "experimental" feature in Cassandra and may not work all the time (they are more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
3. Do the same as option 2 in your own code, by maintaining an additional table.
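The last option above, an additional lookup table maintained by your own code, can be sketched roughly as follows. Plain Python dictionaries stand in for the two Cassandra tables; `users_by_id`, `user_id_by_email`, and both helper functions are hypothetical names for illustration, and a real application would issue the equivalent CQL statements through a driver.

```python
import uuid

users_by_id = {}        # stand-in for the User table, keyed by primary key
user_id_by_email = {}   # stand-in for the lookup table, keyed by email

def insert_user(email, name):
    """Write the user row and the matching lookup row together."""
    user_id = uuid.uuid4()
    users_by_id[user_id] = {"email": email, "name": name}
    user_id_by_email[email] = user_id
    return user_id

def update_name_by_email(email, new_name):
    """Resolve the primary key via the lookup table, then update by key."""
    user_id = user_id_by_email.get(email)
    if user_id is None:
        return False
    users_by_id[user_id]["name"] = new_name
    return True

uid = insert_user("james@example.com", "James")
updated = update_name_by_email("james@example.com", "James Mason")
```

The cost of this design is that every write touching the email has to update both tables, which is exactly the work a materialized view would otherwise do for you.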
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.
I am new to NoSQL concepts, so when I started learning PouchDB I found this conversion chart. My confusion is how PouchDB handles multiple tables: does it mean that I need to create multiple databases? From my understanding, a PouchDB database can store many documents, but a document corresponds to a row in SQL. Or have I misunderstood?
The answer to this question seems to be surprisingly under-documented. While @llabball clearly gave a decent answer, I don't think that views are always the way to go.
As you can read here in the section When not to use map/reduce, Nolan explains that for simpler applications, the key is to abuse _ids, and leverage the power of allDocs().
In other words, if you had two separate types (say artists, and albums), then you could prefix the id of each type to obtain an easily searchable data set. For example _id: 'artist_name' & _id: 'album_title', would allow you to easily retrieve artists in name order.
Laying out the data this way will result in better performance due to not requiring extra indexes, and less code. Clearly however, if your data requirements are more complex, then views are the way to go.
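To illustrate why prefixed ids are enough for simple cases, here is a small sketch in plain Python. It does not call PouchDB; the sorted key list stands in for the id-ordered primary index that allDocs() reads from, and `ids_with_prefix` and the sample documents are hypothetical. The high sentinel character appended to the prefix plays the same role as the end key in a prefix range query.

```python
from bisect import bisect_left, bisect_right

docs = {
    "artist_Bowie": {"type": "artist"},
    "album_Low": {"type": "album"},
    "artist_Abba": {"type": "artist"},
    "artist_Cohen": {"type": "artist"},
}

def ids_with_prefix(docs, prefix):
    """Return all ids starting with `prefix`, in sorted order."""
    keys = sorted(docs)                           # ids are kept in sorted order
    lo = bisect_left(keys, prefix)                # first id >= prefix
    hi = bisect_right(keys, prefix + "\uffff")    # last id under the prefix
    return keys[lo:hi]

artist_ids = ids_with_prefix(docs, "artist_")
print(artist_ids)
```

Because the ids themselves carry the type and the sort key, no secondary index is involved in the lookup.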
... does it mean that i need to create multiple databases?
No.
... a document mean a row in sql or am i misunderstood?
That's right. The SQL table defines column headers (name and type); these correspond to the JSON property names of the doc.
So all docs (rows) with the same properties (a so-called "schema") are the equivalent of your SQL table. You can have as many different schemas in one database as you want (visit json-schema.org for some inspiration).
How do you request them separately? Create CouchDB views! You can get all or some of the "rows" of your tabular data (docs with the same schema) with one request, as you know it from SQL.
To make such views easy to write, a type property is very common in CouchDB docs. The table name you know from SQL can be your type, like doc.type: "animal".
Your view names might then be animalByName or animalByWeight, depending on your needs.
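The idea behind such a view can be sketched in plain Python. This is not CouchDB's view API (real views are defined as map functions in a design document); `animal_by_name` and `query_view` are hypothetical names that only mimic what a map function filtering on doc.type does.

```python
def animal_by_name(doc):
    """Map step: emit (key, value) pairs only for docs of type 'animal'."""
    if doc.get("type") == "animal":
        yield doc["name"], doc

def query_view(docs, map_fn):
    """Run the map step over all docs and return rows sorted by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])

docs = [
    {"_id": "1", "type": "animal", "name": "zebra", "weight": 300},
    {"_id": "2", "type": "plant", "name": "fern"},
    {"_id": "3", "type": "animal", "name": "aardvark", "weight": 60},
]
names = [key for key, _ in query_view(docs, animal_by_name)]
print(names)
```

An animalByWeight view would differ only in the emitted key, which is why one database can serve several "tables" and several sort orders at once.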
Sometimes a multiple-database plan is a good option, like a database per user or even a database per user feature. Take a look at this conversation on the CouchDB mailing list.
I'm taking my first steps with NoSQL databases, and so far my knowledge is very basic. I am trying to set up a database for a small invoice system.
In SQL I'd create 4 tables: Products, Customers, Invoices, and a match table for invoices and products.
But how do I do this with NoSQL? Do I even build relations, or just build 1 document for each invoice?
You should keep in mind that NoSQL design is not only based on data structure but also strongly on data function. So you should first ask yourself what kind of queries you need to do over your data and take it from there.
First figure out how far you want to go with denormalization and aggregation. For instance: which sets of data will you often need to query or update at once? Try to keep those in a single document, even if it means duplicating data from other entities (e.g. storing customer data along with the invoice data).
So ask yourself why you want to use non relational databases, and how will you use that data. Then decide which modeling techniques to apply and how far. The highly scalable blog has a great article about NoSQL data modeling if you care to give it a read.
... or just build 1 document for each invoice.
Yes, start with that. Imagine your data in CouchDB as a read-only copy of your data in the relational database. The docs are like the results of your SQL queries.
Do i even build relations?
Of course you can; it's the same as in your SQL tables. You include the ids of foreign docs and name the property after the relation you want to express, e.g. doc.customer_id in an invoice doc can point to the doc._id of a customer doc.
It's helpful to think of CouchDB views as "relations", e.g. you can create a view called InvoicesByCustomer with the example above.
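As a rough illustration of the two points above, the sketch below shows one self-contained document per invoice that still carries a customer_id reference, and a plain-Python stand-in for an InvoicesByCustomer view. The document fields other than customer_id and _id, and the `invoices_by_customer` function, are assumptions made for the example.

```python
from collections import defaultdict

docs = [
    {"_id": "cust-1", "type": "customer", "name": "Acme"},
    {"_id": "cust-2", "type": "customer", "name": "Globex"},
    # One document per invoice: line items are embedded, the customer is referenced.
    {"_id": "inv-1", "type": "invoice", "customer_id": "cust-1",
     "items": [{"product": "widget", "qty": 2, "price": 5.0}]},
    {"_id": "inv-2", "type": "invoice", "customer_id": "cust-2",
     "items": [{"product": "gadget", "qty": 1, "price": 12.5}]},
    {"_id": "inv-3", "type": "invoice", "customer_id": "cust-1",
     "items": [{"product": "gadget", "qty": 3, "price": 12.5}]},
]

def invoices_by_customer(docs):
    """Group invoice ids under the customer id they reference."""
    index = defaultdict(list)
    for doc in docs:
        if doc.get("type") == "invoice":
            index[doc["customer_id"]].append(doc["_id"])
    return index

index = invoices_by_customer(docs)
print(index["cust-1"])
```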
But in summary, I would recommend beginning with the "1 document for each invoice" approach and following @JavoSN's hint ...
So you should first ask yourself what kind of queries you need to do over your data and take it from there
... when you know that clearly, it's time to dig deeper into your document design options.
I am very new to Cassandra and currently in the early stage of a project where I am studying it.
Since Cassandra recommends denormalizing and replicating data, I have the following scenario:
I have a table, user_master, for users. A user has
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for other users. This search should find all users who match the subject and hobbies provided by the user. For this reason I am planning to create a separate table, user_discovery, which will have only the following attributes for every user:
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do I need to write to both tables for every insert/update in user_master? Can updating user_discovery be automated whenever there is an insert/update in user_master?
Even after studying a bit, I am still not sure that a separate table would improve performance, since the number of users would be the same in both tables (though user_discovery would have far fewer columns). Any comment on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" looks like you plan to have the user (Id?) as key. This would indeed have no performance advantage.
If you want to find users by their hobby make a table having the hobby as key, and the user id (or whatever it is you use to look up users) as columns. Write one row per hobby, listing all users having that hobby. Write the user into every row matching one of his hobbies.
Do the same for the subject (i.e. separate table, subject as key, user ids as columns).
Then, if you want to find users having a list of specific hobbies, make one query per hobby and take the intersection of the resulting users.
To use this kind of lookup table you would indeed have to update all tables every time you update a user.
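The lookup-table approach described above can be sketched in Python, with dictionaries of sets standing in for the two Cassandra tables (hobby as key, subject as key). The names `users_by_hobby`, `users_by_subject`, `save_user`, and `find_users` are hypothetical; in a real system each dictionary access would be a query by partition key.

```python
from collections import defaultdict

users_by_hobby = defaultdict(set)     # hobby   -> user ids
users_by_subject = defaultdict(set)   # subject -> user ids

def save_user(uid, subject, hobbies):
    """Write the user into the subject row and into every matching hobby row."""
    users_by_subject[subject].add(uid)
    for hobby in hobbies:
        users_by_hobby[hobby].add(uid)

def find_users(subject, hobbies):
    """One lookup per key, then intersect the resulting user id sets."""
    matches = set(users_by_subject.get(subject, set()))
    for hobby in hobbies:
        matches &= users_by_hobby.get(hobby, set())
    return matches

save_user(1, "math", ["chess", "hiking"])
save_user(2, "math", ["chess"])
save_user(3, "physics", ["chess", "hiking"])
found = find_users("math", ["chess", "hiking"])
print(found)
```

This sketch only covers inserts. When a user changes subject or drops a hobby, the old rows have to be cleaned up as well, which is part of the update cost mentioned above.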
Disclaimer: I used this kind of approach rather successfully in a relatively complex setting managing a few hundred thousand users. However, this was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether a more elegant approach would be possible today.
I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
create table subscriptions (person_uuid uuid,
subscribes_person_uuid uuid,
primary key (person_uuid, subscribes_person_uuid)
)
I need to be able to look up things in both directions, i.e. answer the questions:
Who are Bob's subscribers?
Who does Bob subscribe to?
Any ideas, feedback, suggestions would be useful.
Those two queries represent the start of your model:
you want the user to be the PK or part of the PK.
depending on the cardinality of subscriptions/subscribers you could go with:
for low numbers: using a single table and two sets
for high numbers: using 2 tables similar to the one you describe
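The two-table variant can be sketched in Python with one dictionary per lookup direction. The names `subscriptions`, `subscribers`, `subscribe`, and `unsubscribe` are hypothetical; in Cassandra these would be two tables with the same columns but the key order reversed, and the application would write to both.

```python
from collections import defaultdict

subscriptions = defaultdict(set)  # person -> people they subscribe to
subscribers = defaultdict(set)    # person -> people subscribed to them

def subscribe(person, target):
    """Record the relationship in both directions."""
    subscriptions[person].add(target)
    subscribers[target].add(person)

def unsubscribe(person, target):
    """Remove the relationship from both directions."""
    subscriptions[person].discard(target)
    subscribers[target].discard(person)

subscribe("alice", "bob")
subscribe("carol", "bob")
subscribe("bob", "alice")
print(sorted(subscribers["bob"]), sorted(subscriptions["bob"]))
```

Each question ("who are Bob's subscribers?" and "who does Bob subscribe to?") then becomes a single lookup by key, at the price of two writes per change.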
@Jacob
Your use case looks very similar to the Twitter example; I modeled it here.
If you want to track both sides of the relationship, you'll need a dedicated table to index them.
Last but not least, depending on whether the users are mutable or not, you can decide to denormalize (e.g. duplicate user content) or just store user ids and then fetch the user content from a separate table.
I've implemented a simple join feature in Achilles. Have a look if you want to go this way.