Secondary Index in Cassandra will lead to two DB reads

Let's assume a data model in which a user has blog posts. Each post has a unique title and many attributes.
I have a Column Family "posts" in which each row is like this:
posts = {
    "yesterday" : {
        date : 03-04-2012
        userID : abfe222234
        tags : "beatles,paul"
    }
}
I want to index the posts by user, so I have another regular column family:
user_posts = {
    abfe222234 : {
        yesterday : null
        ....
    }
}
This model came out of a lot of research into secondary indexing in Cassandra, during which I found these slides: http://www.slideshare.net/edanuff/indexing-in-cassandra and learned that super column families are used less and less.
My question:
If I want all the details about a user's posts, it means I have to read the DB twice: once to get all the post IDs, and once to fetch the details of the posts with those IDs.
What am I missing?
Thanks,
Issahar.
Edit:
The other option is to make "user_posts" a super CF that contains all the data from "posts".
Pros: you only have to fetch the data once.
Cons: 1. You duplicate all of your data. 2. You can't search by a single attribute of a post.
What do you say?

Looks pretty straightforward to me: you really do need to perform two database reads to get the data in this case. For what it's worth, most relational databases also have to perform two logical reads, unless the data the user is interested in is fully contained in the index. The only difference is that in a relational DB there is only one network round trip.
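As a sketch, the two reads with the DataStax Node.js driver could look like this (the table and column names are assumptions based on the model above, not anything from the original post):

// Sketch only: assumes a user_posts table keyed by user_id and a posts
// table keyed by title, mirroring the two column families above.
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
    contactPoints: ['127.0.0.1'],
    localDataCenter: 'datacenter1',
    keyspace: 'blog'
});

async function postsForUser(userId) {
    // Read 1: the index row gives us the post titles for this user.
    const index = await client.execute(
        'SELECT post_title FROM user_posts WHERE user_id = ?',
        [userId], { prepare: true });
    const titles = index.rows.map(function (r) { return r.post_title; });

    // Read 2: fetch the full post rows for those titles.
    const posts = await client.execute(
        'SELECT * FROM posts WHERE title IN ?',
        [titles], { prepare: true });
    return posts.rows;
}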

Related

MongoDB and Node.js transaction alternative and relations

I am trying to implement relations on collections. My requirement is:
POST request 1, JSON body:
{
    "username": "aaa",
    "password": "bbb",
    "role": "owner",
    "company": "SAS"
}
POST request 2, created after the first document, so the company name comes from the previous JSON body:
{
    "username": "eee",
    "password": "fff",
    "role": "engineer",
    "company": "SAS"
}
POST request 3, also created after the first document, so the company name again comes from the first JSON body:
{
    "username": "uuu",
    "password": "kkk",
    "role": "engineer",
    "company": "SAS"
}
POST request 4, next company, JSON body:
{
    "username": "hhh",
    "password": "ggg",
    "role": "owner",
    "company": "GVG"
}
Here company is a foreign-key field. How can I reference the company by an id field so that the two writes can't fail independently of one another, the way transactions prevent in SQL?
In MySQL I would create two tables, company and user, and use a transaction to insert into both tables in a single POST, using ids. If the company name is ever updated, the id stays the same for both owner and engineer.
How can I achieve this in MongoDB with Node.js?
In my online searches I have found that most people suggest avoiding transactions and using MongoDB features such as embedded documents.
I would suggest you start by making schemas for user and company using Mongoose. It's an ODM (object document mapper) that is almost always used with Node.js and MongoDB.
Now, this is a one-to-many relation. In a relational database, as you mentioned, you would make a company table and a user table.
In MongoDB it "depends". If it's a one-to-"few" relationship, you would just nest a users array into the company collection. Then, since you are only updating a single document (pushing a user onto the users array in the company's document), you won't need any transactions. A single-document update is always atomic (no matter how many fields you update on the same document).
But if each company can have a large number of users (an ever-growing nested array is not good, as it can cause data fragmentation and bad performance), then it's better to store the company's id in the user's document. Even in this case you would not need a transaction, since you are not updating the company's document.
Another reason for storing users as a separate collection is querying: if you just want to query users, it's difficult when they are nested inside companies. So basically you need to consider how you will query, figure out the cardinality of the relation, and then decide whether to nest or to store separate collections. Both patterns are sketched below.
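A minimal Mongoose sketch of the two patterns (the field names come from the question; everything else is illustrative):

const mongoose = require('mongoose');

// Pattern 1 (one-to-few): embed users inside the company document.
const Company = mongoose.model('Company', new mongoose.Schema({
    name: String,
    users: [{ username: String, password: String, role: String }]
}));

// Adding an engineer is a single-document update, hence atomic: no transaction.
async function addEngineer() {
    await Company.updateOne(
        { name: 'SAS' },
        { $push: { users: { username: 'eee', password: 'fff', role: 'engineer' } } });
}

// Pattern 2 (one-to-many): store the company's id on the user document.
const User = mongoose.model('User', new mongoose.Schema({
    username: String,
    password: String,
    role: String,
    company: { type: mongoose.Schema.Types.ObjectId, ref: 'Company' }
}));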
First of all, note that Mongo is a document-oriented DB, not a relational one. So if you need transactions and a relational model, perhaps you should use a SQL database instead, especially if you are more familiar with those.
About relations and data modeling: you should read this article (or even the entire section) in the official Mongo docs, Data Modelling.
TL;DR: you could create two separate collections (analogous to tables in SQL), such as employees and companies (by default, collection names are pluralized), and store the data separately.
So your employees will be stored as you showed above, but companies will look like:
{
    _id: ObjectId("35473645632"),
    name: "SAS"
}, ...
And in your employees collection, you should store not "company": "SAS" but "company": ObjectId("35473645632") (or even an array of ids if you want). But don't forget to edit your schema accordingly.
You don't have to use MongoDB's default _id; you could use your own, which can be any unique number/string combination.
That way, if a company is renamed, its connection to the other documents (employees) is still there.
To request all or any of your employees together with their company names, use the aggregation framework with $lookup instead of .find(), as sketched below.
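A sketch in shell syntax (the collection and field names are assumptions):

// Join each employee to its company document and surface the company name.
db.employees.aggregate([
    { $lookup: {
        from: 'companies',      // collection to join against
        localField: 'company',  // the ObjectId stored on the employee
        foreignField: '_id',
        as: 'companyDoc'
    } },
    { $unwind: '$companyDoc' },
    { $project: { username: 1, role: 1, companyName: '$companyDoc.name' } }
]);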

Homogeneous vs heterogeneous in DocumentDB

I am using Azure DocumentDB, and all my experience in NoSQL has been with MongoDB. I looked at the pricing model and the cost is per collection. In MongoDB I would have created three collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection with a field to describe what the data type is, and each collection should be partitioned by date or geographic area so one part of the world has a smaller portion to search,
and to:
"Combine different types of documents into a single collection and add a field across all to separate them in searching, like a type field or something."
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (for example, the Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage Cosmos DB to its full potential, you need to think of a collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you select a generic name such as key or partitionKey, you can easily separate the storage of your inbound emails from users and from anything else by picking appropriate values.
class InboundEmail
{
public string Key {get; set;} = "EmailsPartition";
// other properties
}
class User
{
public string Key {get; set;} = "UsersPartition";
// other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data, keeping a single user's activity together in one partition might make sense for that particular entity; a partition-scoped query is sketched below.
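A sketch with the JavaScript SDK (@azure/cosmos); the database, container, and key values are just illustrations matching the example classes above:

const { CosmosClient } = require('@azure/cosmos');
const client = new CosmosClient({ endpoint: '<endpoint>', key: '<key>' });
const container = client.database('appdb').container('everything');

async function loadUsers() {
    // Pinning the query to one partition keeps it fast and cheap; without
    // the partitionKey option it becomes a slower cross-partition scan.
    const { resources } = await container.items
        .query('SELECT * FROM c WHERE c.key = "UsersPartition"',
               { partitionKey: 'UsersPartition' })
        .fetchAll();
    return resources;
}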
If you want evidence that this is the appropriate way to use Cosmos DB, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph APIs or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." Cosmos DB does automatically index every field of every document. It uses a proprietary path-based indexing mechanism that ensures every path in your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature, as sketched below.
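Opting out is done through the container's indexing policy. A sketch with the same SDK, assuming the policy shape below is current (the excluded path is hypothetical):

async function createContainer() {
    // Everything stays auto-indexed except the large payload subtree.
    await client.database('appdb').containers.createIfNotExists({
        id: 'everything',
        partitionKey: { paths: ['/key'] },
        indexingPolicy: {
            indexingMode: 'consistent',
            automatic: true,
            includedPaths: [{ path: '/*' }],
            excludedPaths: [{ path: '/rawEmailBody/*' }]
        }
    });
}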

Mongo DB relations between documents in different collections

I'm not yet ready to let this go, which is why I re-thought the problem and edited the Q (original below).
I am using MongoDB for a weekend project and it requires some relations in the DB, which is what the misery is all about:
I have three collections:
Users
Lists
Texts
A user can have texts and lists - lists 'contain' texts. Texts can be in multiple lists.
I decided to go with separate collections (not embeds) because child documents don't always appear in the context of their parent (e.g. all texts, without being in a list).
So what needs to be done is reference the texts that belong into certain lists with exactly those lists. There can be unlimited lists and texts, though lists will be less in comparison.
In contrast to what I first thought of, I could also put the reference in every single text-document and not all text-ids in the list-documents. It would actually make a difference, because I could get away with one query to find every snippet in a list. Could even index that reference.
var TextSchema = new Schema({
    _id: Number,
    name: String,
    inListID: { type: Array, "default": [] },
    [...]
It is also rather seldom the case that texts will be in MANY lists, so the array would not really explode. The question kind of remains, though: is there a chance this scales, or is there actually a better way of implementing it with MongoDB? Would it help to limit the number of lists a text can be in (probably)? Is there a recipe for few:many relations?
It would even be awesome to get references to projects where this has been done and how it was implemented (few:many relations). I can't believe everybody shies away from MongoDB as soon as some relations are needed.
Original Question
I'll break it down in two problems I see so far:
1) Let's assume a list consists of 5 texts. How do I reference the texts contained in a list? Just open an array and store the texts' _ids in there? Seems like those arrays might grow to the moon and back, slowing the app down? On the other hand texts need to be available without a list, so embedding is not really an option. What if I want to get all texts of a list that contains 100 texts... sounds like two queries and an array with 100 fields :-/. So is this way of referencing the proper way to do it?
var ListSchema = new Schema({
    _id: Number,
    name: String,
    textids: { type: Array, "default": [] },
    [...]
Problem 2) The issue I see with this approach is cleaning up the references when a text is deleted. Its reference will still be in every list that contained the text, and I wouldn't want to iterate through all the lists to clean out those dead references. Or would I? Is there a smart way to solve this? Just making the texts hold the references (to which lists they belong) merely moves the problem around, so that's not an option.
I guess I'm not the first with this sort of problem but I was also unable to find a definitive answer on how to do it 'right'.
I'm also interested in general thoughts on best-practice for this sort of referencing (many-to-many?) and especially scalability/performance.
Relations are usually not a big problem, though certain operations involving relations might be. That depends largely on the problem you're trying to solve, and very strongly on the cardinality of the result set and the selectivity of the keys.
I have written a simple testbed that generates data following a typical long-tail distribution to play with. It turns out that MongoDB is usually better at relations than people believe.
After all, there are only three differences to relational databases:
Foreign key constraints: You have to manage these yourself, so there's some risk for dead links
Transaction isolation: Since there are no multi-document transactions, there's some likelihood for creating invalid foreign key constraints even if the code is correct (in the sense that it never tries to create a dead link), but merely interrupted at runtime. Also, it is hard to check for dead links because you could be observing a race condition
Joins: MongoDB doesn't support joins, though a manual subquery with $in does scale well up to several thousand items in the $in-clause, provided the reference values are indexed, of course
If you need to perform large joins, i.e. if your queries are truly relational and you need a large amount of the data joined accordingly, MongoDB is probably not a good fit. However, many joins required in relational databases aren't truly relational; they are required because you had to split your object up across multiple tables, for instance because it contains a list.
An example of a 'truly' relational query could be "Find me all customers who bought products that got >4 star reviews by customers that ranked high in turnover in June". Unless you have a very specialized schema that essentially was built to support this query, you'll most likely need to find all the orders, group them by customer ids, take the top n results, use these to query ratings using $in and use another $in to find the actual customers. Still, if you can limit yourself to the top, say 10k customers of June, this is three round-trips and some fast $in queries.
That will probably be in the range of 10-30ms on typical cloud hardware as long as your queries are supported by indexes in RAM and the network isn't completely congested. In this example, things get messy if the data is too sparse, i.e. the top 10k users hardly wrote >4 star reviews, which would force you to write program logic that is smart enough to keep iterating the first step which is both complicated and slow, but if that is such an important scenario, there is probably a better suited data structure anyway.
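That multi-step $in pattern could look like this in shell syntax (all collection and field names invented for illustration):

// 1) Rank customers by June turnover and keep the top 10k.
const top = db.orders.aggregate([
    { $match: { date: { $gte: ISODate('2012-06-01'), $lt: ISODate('2012-07-01') } } },
    { $group: { _id: '$customerId', turnover: { $sum: '$total' } } },
    { $sort: { turnover: -1 } },
    { $limit: 10000 }
]).toArray();
const topIds = top.map(function (d) { return d._id; });

// 2) Products those customers rated >4 stars (reviews.customerId indexed).
const products = db.reviews.distinct('productId',
    { customerId: { $in: topIds }, stars: { $gt: 4 } });

// 3) Customers who bought those products (orders.productId indexed).
const customers = db.orders.distinct('customerId',
    { productId: { $in: products } });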
Using MongoDB with references is a gateway to performance issues, and a perfect example of what not to use it for. This is an m:n kind of relation where m and n can scale to millions. MongoDB works well where we have 1:n(few), 1:n(many), or m(few):n(many), but not in situations where you have m(many):n(many). It will obviously result in two queries and a lot of housekeeping.
I am not sure whether this question is still relevant, but I have similar experience.
First of all, I want to point out what the official Mongo documentation says:
Use embedded data models when you have a one-to-one or one-to-many model.
For many-to-many models, use relationships with document references.
I think that is the answer, but this answer brings a lot of problems, because:
As was mentioned, Mongo doesn't provide transactions at all.
And you don't have foreign key constraints.
Even if you have references (DBRefs) between documents, you will be faced with the awkward problem of how to dereference these documents.
Each of these items is a huge piece of responsibility, even for a weekend project. It may mean you have to write a lot of code just to get simple behavior from your system (for example, you can see how to implement a transaction in Mongo here).
I have no idea how to implement foreign key constraints, and I haven't seen anything in this direction in the Mongo documentation, which is why I think it is a serious challenge (and a risk to the project).
And lastly, a Mongo reference is not a MySQL join: you don't receive the data from the parent collection together with the data from the child collection (like all fields from a table plus all fields from the joined table in MySQL). You receive just a REFERENCE to another document in another collection, and you have to do something with that reference (dereference it) yourself.
That is easily done in Node with a callback, but only when you need just one text from one list. If you need all texts in one list it's terrible, and if you need all texts in more than one list it becomes a nightmare...
Perhaps mine is not the best experience... but I think you should think about it...
Using arrays in MongoDB is generally not preferred, and generally not advised by experts.
Here is a solution that came to my mind :
Each document in Users is always unique. There can be Lists and Texts for an individual document in Users. Therefore, Lists and Texts each have a field for USER ID, which is the _id of the Users document.
Lists always have an owner in Users, so they are stored as they are.
The owner of a Text can be either a User or a List, so you should also keep a LIST ID field in it, which is the _id of the Lists document.
Mind that a Text cannot have both USER ID and LIST ID set, so you will have to enforce the condition that exactly ONE of the two is set and the other is null, so that we can easily tell who the primary owner of the Text is; a sketch of such a schema follows.
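One way to express that rule as a Mongoose schema (the validator is my addition; MongoDB itself won't enforce it):

const mongoose = require('mongoose');

const textSchema = new mongoose.Schema({
    name: String,
    userId: { type: mongoose.Schema.Types.ObjectId, ref: 'User', default: null },
    listId: { type: mongoose.Schema.Types.ObjectId, ref: 'List', default: null }
});

// Exactly one of userId/listId may be set, so the primary owner is unambiguous.
textSchema.pre('validate', function (next) {
    const owners = [this.userId, this.listId].filter(Boolean).length;
    next(owners === 1 ? undefined : new Error('Text needs exactly one owner'));
});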
Writing an answer as I want to explain how I will proceed from here.
Taking into consideration the answers here and my own research on the topic, it might actually be fine storing those references (not really relations) in an array, trying to keep it relatively small: fewer than 1000 entries is very likely in my case.
Especially because I can get away with one query (which I first thought I couldn't) that doesn't even require using $in so far, I'm confident that the approach will scale. After all it's 'just a weekend project', so if it doesn't and I end up rewriting, that's fine.
With a text-schema like this:
var textSchema = new Schema({
    _id: { type: Number, required: true, index: { unique: true } },
    ...
    inList: { type: [Number], "default": [], index: true }
});
I can simply get all texts in a list with this query, where inList is an indexed array containing the _ids of the lists a text belongs to.
Text.find({ inList: listID }, function (err, texts) {
    ...
});
I will still have to deal with foreign key constraints and write my own "clean-up" functions that take care of removing references when a list is removed: remove the reference from every text that was in the list (a sketch follows below).
Luckily this will happen very rarely, so I'm okay with going through every text once in a while.
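Such a clean-up could look like this, assuming the List and Text models from above (updateMany is the current Mongoose spelling; older versions used update with { multi: true }):

// When a list is removed, pull its id out of every text that referenced it.
function removeList(listId, callback) {
    List.findByIdAndRemove(listId, function (err) {
        if (err) return callback(err);
        Text.updateMany(
            { inList: listId },
            { $pull: { inList: listId } },
            callback);
    });
}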
On the other hand I don't have to care about deleting references in a list-document if a text is removed, because I only store the reference on one side of the relation (in the text-document). Quite an important point in my opinion!
@mnemosyn: thanks for the link and for pointing out that this is indeed not a large join, or in other words just a very simple relation. Also, some numbers on how long those complex operations take (hardware-dependent, of course) are a big help.
PS: Greetings from Bielefeld.
What I found most helpful during my own research was this video, where Alvin Richards also talks about many-to-many relations at around minute 17. This is where I got the idea of making the relation one-sided to save myself some work cleaning up dead references.
Thanks for the help guys

Lists in NoSQL/BigTable Data Modeling & Super Columns (with Cassandra)

I'm new to NoSQL and BigTable, and I'm trying to learn how I can (and whether I should) use super columns to create a BigTable-friendly schema.
Based on this article about NoSQL data modeling, it sounds like instead of using JOIN-centric RDBMS schemas, I should aggregate my data into larger tables to de-normalize where possible. Based on that, here's a simple schema I envisioned for a 'User', which I'm trying to create for Cassandra:
User: {
    KEY: UserId {
        name: {
            first,
            last
        },
        age,
        gender
    }
};
The above column family (User), whose key is a UserID, is composed of three columns (name, age, gender). Its column 'name' would be a super column, which is composed of 'first' and 'last' columns.
So what I'm asking is:
What does the CQL 3.0 look like to create this column family 'User' with the 'name' super column within it? (Update: This doesn't appear possible.)
Should I be using super columns (like this)? Should I be using something else?
What's an alternative way of representing this schema?
How do I represent a list of values in a table/column family?
Here are some useful links about this that I found, but that I don't quite understand clearly enough to answer my question:
Create a Cassandra schema for a super column with metadata
Cassandra: How to create column in a super column family?
Modeling relational data with Cassandra
Thanks!
Update:
After a lot of research, I'm learning a few things:
You cannot create super columns using CQL; there might be other mechanisms to do so, but CQL does not appear to be one of them.
Syntax for CQL 3.0 seems to be drifting from a 'COLUMN FAMILY'-centric approach towards SQL-like 'TABLE'-based syntax.
Changed my questions accordingly.
Should I be using super columns (like this)? Should I be using something else?
You can use the data model you suggested, but generally it is not recommended, for the reasons mentioned in the link:
I'll also note that use of super columns is generally discouraged as they have several disadvantages. All subcolumns in a super column need to be deserialized when reading one subcolumn, and you cannot set secondary indexes on super columns. They also only support one level of nesting.
Hence consider these reasons for your situation.
What's an alternative way of representing this schema?
You can try using composite columns; read here for more information. Or you can probably just use a standard column family; I think a standard CF will be suitable for your situation. For example, the following suggestion:
User : {
    key: userId {
        columnName: firstname
        columnName: lastname
        columnName: age
        columnName: gender
        columnName: zip
        columnName: street
    }
    ..
};
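Since the update above notes that CQL 3 has no super columns and speaks in tables, the same standard-CF layout could be created through the DataStax Node.js driver; this is only a sketch, with illustrative names:

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
    contactPoints: ['127.0.0.1'],
    localDataCenter: 'datacenter1',
    keyspace: 'demo'
});

async function createUsersTable() {
    // One row per user; the former 'name' super column is simply flattened
    // into firstname/lastname columns.
    await client.execute(
        'CREATE TABLE IF NOT EXISTS users (' +
        '  user_id   text PRIMARY KEY,' +
        '  firstname text,' +
        '  lastname  text,' +
        '  age       int,' +
        '  gender    text,' +
        '  zip       text,' +
        '  street    text' +
        ')');
}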
How do I represent a list of values in a table/column family?
It is possible to store the list as a BytesType in the CF. Or you can probably break the list into individual elements and store them as a CompositeType.

Representing a many-to-many relationship in CouchDB

Let's say I'm writing a log analysis application. The main domain object would be a LogEntry. In addition, users of the application define a LogTopic, which describes what log entries they are interested in. As the application receives log entries it adds them to CouchDB, and also checks them against all the LogTopics in the system to see if they match the criteria in the topic. If an entry matches, the system should record that the entry matches the topic. Thus, there is a many-to-many relationship between LogEntries and LogTopics.
If I were storing this in a RDBMS I would do something like:
CREATE TABLE Entry (
    id int,
    ...
)
CREATE TABLE Topic (
    id int,
    ...
)
CREATE TABLE TopicEntryMap (
    entry_id int,
    topic_id int
)
Using CouchDB I first tried having just two document types. I'd have a LogEntry type, looking something like this:
{
    'type': 'LogEntry',
    'severity': 'DEBUG',
    ...
}
and I'd have a LogTopic type, looking something like this:
{
    'type': 'LogTopic',
    'matching_entries': ['log_entry_1', 'log_entry_12', 'log_entry_34', ....],
    ...
}
You can see that I represent the relationship by using a matching_entries field in each LogTopic documents to store a list of LogEntry document ids. This works fine up to a point, but I have issues when multiple clients are both attempting to add a matching entry to a topic. Both attempt optimistic updates, and one fails. The solution I'm using now is to essentially reproduce the RDBMS approach, and add a third document type, something like:
{
    'type': 'LogTopicToLogEntryMap',
    'topic_id': 'topic_12',
    'entry_id': 'entry_15'
}
This works, and gets past the concurrent update issues, but I have two reservations:
1. I worry that I'm just using this approach because it's what I'd do in a relational DB. I wonder if there's a more CouchDB-like (relaxful?) solution.
2. My views can no longer retrieve all the entries for a specific topic in one call. My previous solution allowed that (if I used the include_docs parameter).
Anyone have a better solution for me? Would it help if I also posted the views I'm using?
I cross-posted this question to the couchdb-users mailing list, and Nathan Stott pointed me to a very helpful blog post by Christopher Lenz.
Your approach is fine. Using CouchDB doesn't mean you'll just abandon relational modeling. You will need to run two queries, but that's because this is a "join". SQL queries with joins are also slow, but the SQL syntax lets you express the query in one statement.
In my few months of experience with CouchDB this is what I've discovered:
No schema, so designing the application models is fast and flexible
CRUD is there, so developing your application is fast and flexible
Goodbye SQL injection
What would be a SQL join takes a little bit more work in CouchDB
Depending on your needs I've found that couchdb-lucene is also useful for building more complex queries.
I'd try setting up the relation so that LogEntries know to which LogTopics they belong. That way, inserting a LogEntry won't produce conflicts, as the LogTopics won't need to be changed.
Then, a simple map function would emit the LogEntry once for each LogTopic it belongs to, essentially building up your TopicEntryMap on the fly:
"map": function (doc) {
doc.topics.map(function (topic) {
emit(topic, doc);
});
}
This way, querying the view with a ?key=<topic> argument will give you all the entries that belong to a topic.
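Querying that view over plain HTTP might look like this (the database and design-document names are made up; note that CouchDB expects the key JSON-encoded):

// Node 18+ ships a global fetch.
async function entriesForTopic(topic) {
    const res = await fetch(
        'http://localhost:5984/logs/_design/entries/_view/by_topic' +
        '?key=' + encodeURIComponent(JSON.stringify(topic)));
    const { rows } = await res.json();
    return rows.map(function (r) { return r.value; }); // the LogEntry docs
}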
