MongoDB: big data structure - Node.js

I'm rebuilding my website, which is a search engine for nicknames from the most active forum in France: you search for a nickname and you get all of its messages.
My current database contains more than 60 GB of data, stored in a MySQL database. I'm now rewriting it into a MongoDB database, and after retrieving 1 million messages (1 message = 1 document), find() started to take a while.
The structure of a document is as follows:
{
    "_id": ObjectId(),
    "message": "<p>Hai guys</p>",
    "pseudo": "mahnickname", // the nickname (*pseudo* in my DB)
    "ancre": "774497928", // its id in the forum
    "datepost": "30/11/2015 20:57:44"
}
I set the ancre field as unique, so I don't get the same entry twice.
Then the user enters a nickname and the app finds all documents that have that nickname.
Here is the query:
Model.find({pseudo: "danickname"}).sort('-datepost').skip((r_page - 1) * 20).limit(20).exec(function(err, bears)...
Should I structure it differently? Instead of having one document for each message, should I have one document per nickname and update that document whenever I get a new message from that nickname?
I was using the first method with MySQL and it wasn't taking that long.
Edit: Or should I maybe just index the nicknames (pseudo)?
Thanks!

Here are some recommendations for your big data problem:
The ObjectId already contains a timestamp, and you can sort on it. You could save some disk space by removing the datepost field.
Do you absolutely need the ancre field? The ObjectId is already unique and indexed. If you absolutely need it, and need to keep datepost separate too, you could make ancre the value of the _id field.
As many mentioned, you should add an index on pseudo. This will make the "get all messages where the pseudo is mahnickname" search much faster.
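For example, assuming the collection is named messages (the question doesn't give the collection name), the index could be created in the mongo shell like this:
db.messages.createIndex( { "pseudo": 1 } )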
If the amount of messages per user is low, you could store all of them inside a single document per user. This would avoid having to skip to a specific page, which can be slow. However, be aware of the 16 MB document size limit. I would personally still keep them in multiple documents.
To keep query speeds fast, ensure that all your indexes fit in RAM. You can see the RAM consumption of the indexes by typing db.collection.stats() and looking at the indexSizes sub-document.
Would there be a way for you to not skip documents, but instead use the time a message was written as your paging boundary? If so, use the datepost field or the timestamp in _id for your paging strategy. If you decide on using datepost, make a compound index on pseudo and datepost.
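A minimal sketch of that range-based paging with Mongoose, assuming datepost is stored as a real Date (the DD/MM/YYYY string shown in the question would not sort chronologically) and lastSeen holds the datepost of the last message on the previous page:
// Compound index, created once in the mongo shell:
// db.messages.createIndex( { "pseudo": 1, "datepost": -1 } )

// Fetch the next page of 20 without skip():
Model.find({ pseudo: "danickname", datepost: { $lt: lastSeen } })
    .sort('-datepost')
    .limit(20)
    .exec(function (err, messages) {
        // remember the datepost of the last entry here;
        // it becomes lastSeen for the following page
    });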
As for your benchmarks, you can closely monitor MongoDB by using mongotop and mongostat.

Related

Node.js/Express - how to avoid multiple database queries

I have a basic Express app and I'm getting started with DB queries. I want to know how to avoid multiple DB queries, because I don't think the way I do it is efficient:
app.get('/:word', function(req, res){
    var word = req.params.word; // the word comes from the URL parameter
    db.create({'name': word});
    console.log('the word is ' + word);
});
What I want to do is:
get the word from the URL
check if it exists in the database (or was previously requested, because if it was, then it was probably already added through this basic code)
if it doesn't exist, add it and then proceed to the console.log
I want to add each word to my database only once, and not run the DB query again and again.
Here's what I'm thinking:
Not-so-efficient way:
query to check if it exists before inserting
Good way, but I don't know how to start here:
cache the word being queried and maintain the cache to prevent DB queries
Edit with more info:
I'm using MongoDB via Mongoose
the 'word' key is already unique, so I know it's not creating duplicate values
I don't want to run ANY DB queries if that value or that URL has already been hit once
The only way to check if the word already exists is to query the database before inserting. There are libraries (and also databases) that implement a findOrCreate method, but this is always just an abstraction: behind the scenes, the database will search for an existing value before writing.
If your database is huge and querying is not suitable, you could use a caching system (like Redis). But this definitely depends on your logic and your data size.
You can probably optimize the process by just adding an index to the column you want to be unique (I guess it's name?).
You could also define the column name as unique. When inserting, the database will throw an error if the document already exists. But keep in mind again that, behind the scenes, the database is querying for an existing value before inserting. The advantage of a "unique" column is that its index is created automatically, and in your app logic (Node.js) you can just call the insert method and add a little error-handling logic.
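A minimal sketch of that insert-and-handle-the-error approach with Mongoose, assuming a hypothetical Word model with a unique name field (the question doesn't show the schema):
var word = new Word({ name: req.params.word });
word.save(function (err) {
    if (err && err.code === 11000) {
        // E11000 duplicate key error: the word is already stored, ignore it
        return console.log('already stored: ' + req.params.word);
    }
    if (err) return console.error(err);
    console.log('the word is ' + req.params.word);
});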
MongoDB will create any collections you use in your app if they do not already exist.
Insert unique value:
Create a unique index on your key, so that the value is added only once. If you try to add it again, it will throw an error.
To create Unique Index,
db.collection.createIndex( { "name": 1 }, { unique: true } )
Caching:
For caching, store your data in a cache system (like memory-cache or Redis): the first time, the data is queried from MongoDB, and for subsequent needs you can serve it from the cache.
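A minimal in-memory sketch of that idea, matching the goal from the question (skip the DB entirely once a word has been seen); note that a plain Set resets on restart and isn't shared between processes:
var seen = new Set(); // words already handled by this process

app.get('/:word', function (req, res) {
    var word = req.params.word;
    if (!seen.has(word)) {
        seen.add(word);
        db.create({ 'name': word }); // only hits the DB on first sight
    }
    console.log('the word is ' + word);
    res.send(word);
});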
In MongoDB you can use findOneAndUpdate with the optional flag upsert: true (see the documentation and the sketch below).
To ensure that every word appears only once, you should also set a unique index on that field. However, remember that a unique index is case sensitive, so Cat and cat are different words.
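A minimal upsert sketch with Mongoose, reusing the hypothetical Word model from above; a single call either finds the existing document or creates it:
Word.findOneAndUpdate(
    { name: word },                    // look for this word
    { $setOnInsert: { name: word } },  // only set fields when inserting
    { upsert: true, new: true },
    function (err, doc) {
        if (err) return console.error(err);
        console.log('the word is ' + doc.name);
    }
);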

How to find a particular JSON document in CouchDB

How do I find the details of a particular JSON document in CouchDB?
For example: the database name is employee_mgmt, and that database contains 50 JSON documents. I want to find particular employee JSON documents (find by employee id).
CouchDB does not in itself provide you with collections/buckets, hence all your documents are peers. It's up to you to provide metadata, e.g. by having a property $doctype with a value representing what kind of document it is. This is useful if you are writing map functions and e.g. want to create a view (secondary index) returning something applicable only to employees.
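A minimal map function for such a view, assuming the $doctype convention above and a hypothetical employeeId property on employee documents:
function (doc) {
    // index only employee documents, keyed by their employee id
    if (doc.$doctype === 'employee') {
        emit(doc.employeeId, null);
    }
}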
Now, if you just want to query by _id you don't need the above. Just do a simple GET with a URI like: http://host:port/databasename/documentid
More information: http://docs.couchdb.org/en/1.6.1/api/document/common.html#get--db-docid
If you want to get a batch of documents matching many _id values, use the built-in index _all_docs: http://docs.couchdb.org/en/1.6.1/api/database/bulk-api.html#post--db-_all_docs
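For example, POSTing a list of keys (the document ids here are placeholders) returns just those documents:
curl -X POST 'http://host:port/employee_mgmt/_all_docs?include_docs=true' \
     -H 'Content-Type: application/json' \
     -d '{ "keys": ["employee-001", "employee-002"] }'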

Manipulating ref'd mongo records based on _id field

OK, so I have a pretty simple DB setup in a MEAN app (Node, Mongoose, Mongo) where I have Book records and User records. A book has a single owner, and can have any number of shared users, which are stored in an array in a field called sharedWith:. Originally I was storing the user records with an email address as the _id field. I now realize this was a dumb move on my part, because if someone wants to change their email address it effectively cuts them off from their books.
The app is not live yet, so it's not a fatal mistake.
My question is: once I revert the User documents to using the original hash value for _id, and store those in the Owner and sharedWith fields of the book documents, will I have to query each hash just to retrieve the actual usable user data?
I know Mongoose has a .populate() method which will resolve sub-documents, but what about inserting them? Will I POST the users as email addresses, then query each one and store the resulting hashes? I can do this manually, but I wanted to make sure there is not some secret mongo-sauce that can do this in the DB itself.
Thanks!
If you have the user's _id available in the frontend, you can directly share a book with them by adding the _id to the book's sharedWith array. If you don't have the user's _id available in the frontend, you need to get it by querying with the email, and then store the _id in sharedWith. As for retrieving the books, populate is indeed the best option to get the user data.
And to get all books shared with a user you can do something like this,
Book.find({ sharedWith: user1._id }, function(err, docs){ });
This query can be made efficient if you use an index on sharedWith, but that depends on your use case.
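A minimal sketch of resolving the referenced users with populate, assuming the Book schema declares sharedWith as an array of ObjectId refs to the User model (the question doesn't show the schema):
Book.find({ sharedWith: user1._id })
    .populate('sharedWith') // replaces each stored _id with the full User document
    .exec(function (err, books) {
        if (err) return console.error(err);
        // books[0].sharedWith is now an array of User documents
    });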

What's the best way to bind a MongoDB doc to a Node.js HTML page

In the past, with my PHP/Rails + MySQL apps, I've used the unique ID of a table record to keep track of a record in an HTML file.
So I'd keep track of how to delete a record shown like this (15 being the ID of the record):
<a href="/delete/15">Delete this record</a>
So now I'm using MongoDB. I've tried the same method, but the ObjectId _id attribute seems to be a loooong byte string that I can't use conveniently.
What's the most sensible way of binding a link in the view to a record (for deletion, or other purposes or whatever)?
If the answer is to create a new ID that's unique for each document in the collection, then what's the best way to generate those unique IDs?
Thank you.
You could use a counter instead of the ObjectId, but this could create a problem when inserting a new document after you have deleted a previous one.
See this blog post for more detailed info on sequential unique identifiers with Node.js and MongoDB.
Or you could use the timestamp part of the ObjectID:
objectId.getTimestamp().toString()
See the node objectid docs
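That said, the 24-character hex form of the ObjectId works fine in a link; a minimal Express sketch, where the /records routes, the Record model, and the EJS-style template line are all placeholders:
// In the template: <a href="/records/delete/<%= record._id %>">Delete this record</a>

app.get('/records/delete/:id', function (req, res) {
    // req.params.id is the 24-char hex string from the link
    Record.findByIdAndRemove(req.params.id, function (err) {
        if (err) return res.status(500).send(err);
        res.redirect('/records');
    });
});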

How to get Post with Comments Count in single query with CouchDB?

How do I get a post with its comments count in a single query with CouchDB?
I can use map-reduce to build a standalone view [{key: post_id, value: comments_count}], but then I have to hit the DB twice: one query to get the post, another to get comments_count.
There's also another way (Rails does this): count the comments manually on the application server and save the number in a comment_count attribute of the post. But then we need to update the whole post document every time a comment is added or deleted.
It seems to me that CouchDB is not tuned for this: unlike an RDBMS, where we could update only the comment_count attribute, in CouchDB we are forced to update the whole post document.
Maybe there's another way to do it?
Thanks.
The view's returned JSON includes the document count as total_rows, so you don't need to compute anything yourself; just emit all the documents you want counted:
{"total_rows":3,"offset":0,"rows":[
{"id":...,"key":...,value:doc1},
{"id":...,"key":...,value:doc2},
{"id":...,"key":...,value:doc3}]
}
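A common variant, sketched here under the assumption that comment documents carry a type marker and a post_id field, pairs such a map function with the built-in _count reduce, so querying the view with the post's id as the key returns the comment count directly:
// map
function (doc) {
    // one row per comment, keyed by the post it belongs to
    if (doc.type === 'comment') {
        emit(doc.post_id, null);
    }
}
// reduce: _count
// GET /db/_design/app/_view/comments_by_post?key="<post_id>"
// -> {"rows": [{"key": null, "value": 12}]}  (12 = comment count, made up)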
