I am very new to cassandra and currently in early stage of project where i am studying cassandra.
Now since cassandra says to de-normalize data and replicate it. So, i have a following scenario :
I have table, user_master, for users. A user has
subject [type text]
hobbies [type list]
uid [type int]
around 40 more attributes
Now, a user wants to search for another user. This search should look for all user who matches the subject and hobbies provided by user. For this reason i am planning to make a different table user_discovery which will have following attribute only for every user
subject [type text]
hobbies [type list]
uid [type int]
*other irrelevant attributes won't be part of this table.
Now my question is:
Do i need to write on both tables for every insert/update in user_master? Can updation of user_discovery be automated when their is any insert/update in user_master.
Even after studying a bit, i am still not so much sure that making a separate table would increase the performance.Since, number of users would be same in both table (yes, number of column would be very less in user_discovery). Any comment on this would be highly appreciated.
Thanks
The idea of separate tables for queries is to have the key of the table contain what you are looking for.
You don't say what the key of your second table looks like, but your wording "the following attributes for every user" looks like you plan to have the user (Id?) as key. This would indeed have no performance advantage.
If you want to find users by their hobby make a table having the hobby as key, and the user id (or whatever it is you use to look up users) as columns. Write one row per hobby, listing all users having that hobby. Write the user into every row matching one of his hobbies.
Do the same for the subject (i.e. separate table, subject as key, user ids as columns).
Then, if you want to find a user having a list of specific hobbies, make one query per hobby, creating the intersection of the users.
To use these kind of lookup-tables you would have indeed to update all table every time you update a user.
Disclaimer: I used this kind of approach rather successfully in a relative complex setting managing a few hundred thousand users. However, this was two years ago, on a Cassandra 1.5 system. I haven't really looked into the new features of Cassandra 2.0, so I have no idea whether it would be possible to use a more elegant approach today.
Related
I've been using MongoDB for a few years now, and I'm starting to use Prisma's client, which seems to be providing excellent type safety. I'm not familiar at all with relational data storage, though...
This is how my database is looking:
Users are each associated with a single group
Each one of them can be a candidate in the group, and all vote for one of the group candidates during an election once they've been selected
What best practices would you advise me to follow here to store the votes? (I'm not talking about the privacy perspective specific to voting, this is a whole other topic, but only the performance & data storage stance)
Typically with MongoDB, I would create an embedded object in the Group document, with the User IDs of candidates as the keys, and an array of the User IDs of the voters for each candidate as the values.
But strong types & the relational database-inspired approach of Prisma seem to prevent me from doing anything like that, most likely for the better.
In a relational database, you would typically create a separate table to store the votes. Each row in the table would represent a single vote and would contain the following columns:
group_id: the ID of the group in which the vote took place
candidate_id: the ID of the candidate who received the vote
voter_id: the ID of the user who cast the vote
You can also add a timestamp column to store the date and time when the vote was cast.
To get the votes for a particular group, you would perform a query like this:
SELECT * FROM votes WHERE group_id = :groupId;
You can then use the results of this query to build the necessary data structures in your application.
Using a separate table to store votes has the advantage of keeping the data normalized, which can make it easier to work with and more performant. It also allows you to use foreign key constraints to ensure that each vote references a valid group and candidate.
Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct - you need to know primary key of the record to perform an update on the data, or deletion of specific record. There are several options here, depending of your data model:
Perform full scan of the table using effective token range scan (Look to this answer for more details);
If this is required very often, you can create a materialized view, with User.email as partition key, and fetch all message IDs that you can update (but you'll need to do this from your application, there is no nested query support in CQL). But also be aware that materialized views are "experimental" feature in Cassandra, and may not work all the time (it's more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
Do like 2nd item with your code, by using an additional table
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.
I have a system where actions of users need to be sent to other users who subscribe to those updates. There aren't a lot of users/subscribers at the moment, but it could grow rapidly so I want to make sure I get it right. Is it just this simple?
create table subscriptions (person_uuid uuid,
subscribes_person_uuid uuid,
primary key (person_uuid, subscribes_person_uuid)
)
I need to be able to look up things in both directions, i.e. answer the questions:
Who are Bob's subscribers.
Who does Bob subscribe to
Any ideas, feedback, suggestions would be useful.
Those two queries represent the start of your model:
you want the user to be the PK or part of the PK.
depending on the cardinality of subscriptions/subscribers you could go with:
for low numbers: using a single table and two sets
for high numbers: using 2 tables similar to the one you describe
#Jacob
Your use case looks very similar to the Twitter example, I did modelize it here
If you want to track both sides of relationship, I'll need to have a dedicated table to index them.
Last but not least, depending on the fact that the users are mutable OR not, you can decide to denormalize (e.g. duplicate User content) or just store user ids and then fetch users content in a separated table.
I've implemented simple join feature in Achilles. Have a look if you want to go this way
I know this question has been asked a million times, but I can't seem to find one that really gives me a good understanding of how relationships work in Kohana's ORM Module.
I have a database with 5 tables:
approved_submissions
-submission_id
-contents
favorites
-user_id
-submission_id
ratings
-user_id
-submission_id
-rating
users
-user_id
votes
-user_id
-submission_id
-vote
Right now, favorites,ratings, and votes have a Primary Key that consists of every column in the table, so as to prevent a user favoriting the same submission_id multiple times, a user voting on the same submission_id multiple times etc. I also believe these fields are set up using foreign keys that reference approved_submissions and users so as to prevent invalid data existing in the respective fields.
Using the DB module, I can access and update these tables no problem. I really feel as though ORM may offer a more powerful and accessible way to accomplish the same things using less code.
Can you demonstrate how I might update a user voting on a submission_id? A user removing a favorite submission_id? A user changing their rating on a particular submission_id?
Also, do I need to make changes to my database structure or is it okay the way it is?
You're probably looking for has_many_through relationships.
So to add a new submission, you'd do something like
$user->add('submissions', $submission);
and to remove
$user->remove('submissions', $submission);
You may want to consider restructuring your database table and key names so you don't end up doing a lot of configuration.
Does Core Data allow one to generate a statement like select FirstName, LastName from Employee instead of the entire row?
In the example of Departments/Employees, let's say I want to write a navagation controller style application to display the Departments available, and when clicked, the Employees available in that Department. Lets suppose that the Employee object is huge for whatever reason. I don't see why I would need to retrieve a huge set of objects in the EmployeesViewController just to display their names in a list view. Is there anyway I can just request the Name field (perhaps two: FirstName, LastName) for all Employees in a given Department?
Just as a side note:
Beware of the dangers of premature optimization. You shouldn't bother trying to tweak a fetch unless you've tested and found that the bare bones fetch is actually a problem. Core Data has a large number of optimizations under the hood that make it far more efficient than it would appear at first glance. In most cases, its an utter waste of time to tweak a fetch.
If you have a table that displays thousands of objects, you will usually only have a few dozen live objects in memory at anyone time. By default, Core Data fetches objects as faults i.e. ghost of the objects without the values of their attributes loaded. Only when you access a specific attribute of a specific object will that object load completely into memory.
If you come from an SQL background, you may be used to having to manually juggle objects created from SQL. Core Data handles all that for you and does so much more efficiently than you can do manually. You're intuitive assumption you developed working with SQL about the degree of manual optimization you need to do will be way off when applied to Core Data.
Instead, pick the simplest and easiest method first and optimize only if you test and find a bottleneck. Most of the time, you will find that your fretting over optimization was completely unwarranted.
If your employee object is very large, and loading it is too expensive, consider partitioning it with one-to-one relationships. If you have a list of fields like:
Employee:
id
name
SSN
DOB
home_address
home_phone
department
office_number
office_phone
manager
job_title
salary_class
start_date
...
You could break this up into:
Employee
id
department
manager
...
EmployeePersonalInfo
employee_id
SSN
DOB
home_address
home_phone
EmployeeJobDescription
employee_id
job_title
start_date
salary_class
And so on. For your most common objects, limiting the fetched data to the most pertinent and commonly-accessed field is good practice.
Yes. Use -[NSFetchRequest setPropertiesToFetch:]. Only available in iOS 3 & OS X 10.6 and later, though.