Large Data Processing and Management in MongoDB using NodeJS

Large Data Processing and Management in MongoDB using NodeJS - node.js

I am trying to do CRUD operations on MongoDB of a very large size around 20GB data and there can be multiple such versions of data. Can anyone guide me on how to handle such high data for the CRUD operations and maintaining the previous versions of the data in MongoDB?
I am using NodeJS as backend and I can also use any other database if required.

Mongodb is a reliable database, I am using it to processes 10-11 billions of data every single day nodejs should also be fine as long as you are handling the files in streams of data.
Things you should do to Optimize:
Indexing - this will be the biggest part, if you want faster queries you better look into indexing in mongodb, every single document needs to be indexed according to your query, else you are going to have a tough time dealing with queries.
Sharding and Replication - this will help you organise the data and increases the query speed, replication would allow you to have your reads and writes separated(there are cons for replication you can read about that in the mongodb documentation).
This are the main things you need to consider, there are a lot but this should get you started...;) need any help please do let me know.

Related

is mongo stored javascript good solution to multiple DB inserts in nodejs

I have a functionality is Mean Stack which has multiple collection inserts and creation. If it do that in plain mongoose , its going to be multiple Mongo calls and it might be slow.
Can i use mongo stored javascript for this?Pass some values to mongo javascript and it can do all the things from there..
Is it a suggested approach?

The recommended way to do lots of inserts is to use the Bulk Write Operations feature. You can define a set of inserts to be done as a single batch, then pass them all to MongoDB in one go.
However, that is really only appropriate for jobs such as a big data take-on, where you are importing a large number of similar records in one go. If you are running a normal application where there might be inserts, updates, deletes and reads in varying proportions and at varying rates, you would be better off letting Mongoose submit them as individual queries, and making sure your server hardware can cope.

is it good to use different collections in a database in mongodb

I am going to do a project using nodejs and mongodb. We are designing the schema of database, we are not sure that whether we need to use different collections or same collection to store the data. Because each has its own pros and cons.
If we use single collection, whenever the database is invoked, total collection will be loaded into memory which reduces the RAM capacity.If we use different collections then to retrieve data we need to write different queries. By using one collection retrieving will be easy and by using different collections application will become faster. We are confused whether to use single collection or multiple collections. Please Guide me which one is better.

Usually you use different collections for different things. For example when you have users and articles in the systems, you usually create a "users" collection for users and "articles" collection for articles. You could create one collection called "objects" or something like that and put everything there but it would mean you would have to add some type fields and use it for searches and storage of data. You can use a single collection in the database but it would make the usage more complicated. Of course it would let you to load the entire collection at once but whether or not it is relevant for the performance of your application, that is something that would have to be profiled and tested to give your the performance impact for your particular use case.

Usually, developers create the different collection for different things. Like for post management, people create 'post' collection and save the posts in post collection and same for users and all.
Using different collection for different purpose is a good pratices.
MongoDB is great at scaling horizontally. It can shard a collection across a dynamic cluster to produce a fast, querable collection of your data.
So having a smaller collection size is not really a pro and I am not sure where this theory comes that it is, it isn't in SQL and it isn't in MongoDB. The performance of sharding, if done well, should be relative to the performance of querying a single small collection of data (with a small overhead). If it isn't then you have setup your sharding wrong.
MongoDB is not great at scaling vertically, as #Sushant quoted, the ns size of MongoDB would be a serious limitation here. One thing that quote does not mention is that index size and count also effect the ns size hence why it describes that:
By default MongoDB has a limit of approximately 24,000 namespaces per
database. Each namespace is 628 bytes, the .ns file is 16MB by
default.
Each collection counts as a namespace, as does each index. Thus if
every collection had one index, we can create up to 12,000
collections. The --nssize parameter allows you to increase this limit
(see below).
Be aware that there is a certain minimum overhead per collection -- a
few KB. Further, any index will require at least 8KB of data space as
the b-tree page size is 8KB. Certain operations can get slow if there
are a lot of collections and the meta data gets paged out.
So you won't be able to gracefully handle it if your users exceed the namespace limit. Also it won't be high on performance with the growth of your userbase.
UPDATE
For Mongodb 3.0 or above using WiredTiger storage engine, it will no longer be the limit.

Yes personally I think having multiple collections in a DB keeps it nice and clean. The only thing I would worry about is the size of the collections. Collections are used by a lot of developers to cut up their db into, for example, posts, comments, users.
Sorry about my grammar and lack of explanation I'm on my phone

How manage big data in MongoDb collections

I have a collection called data which is the destination of all the documents sent from many devices each n seconds.
What is the best practice to keep the collection alive in production without documents overflow?
How could I "clean" the collection and save the content in another one? Is it the correct way?
Thank you in advance.

You cannot overflow, if you use sharding you have almost unlimited space.
https://docs.mongodb.com/manual/reference/limits/#Sharding-Existing-Collection-Data-Size
Those are limits for single shard, and you have to start sharding before reaching them.
It depends on your architecture, however limit (in worst case) of 8.19200 exabytes (or 8,192,000 terabytes) is unreachable for most of even big data apps, if you multiply number of shard possible in replica set by max collection size in one of them.
See also:
What is the max size of collection in mongodb

Mongodb is a best database for storing large collection. You can do below steps for better performance.
Replication
Replication means copying your data several times on a single server or multiple server.
It provides a backup of your data every time when you insert data in your db.
Embedded document
Try to make your collection with refreences. It means that try to make refrences in your db.

Mongodb response slows down incredibly after 200,000 records

Currently our task is to fetch 1 million records from an external server, process it and save it in the db. We are using node.js for fetching the records and mongodb as the database.
We decided to split the process into 2 tasks, fetching the records and processing it. Now we are able to fetch all the records and dump it in mongo but when we are trying to process it(by processing I mean change a few attribute values, do some simple calculation and update the attributes), we see drastically slow response in mongodb updates around 200,000 records.
For processing the data, we take batches of 1000 records process it, update the records( individually) and then go for the next batch. How could the performance be made better?

if you want to maintain response speed in mongoDB after long data then use mongo sharding and replication in your database and collection
replication:-
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. This section introduces replication in MongoDB as well as the components and architecture of replica sets. The section also provides tutorials for common tasks related to replica sets.
Replication Link
sharding:-
Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Sharding Link

Potential issue with Couchbase paging

It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging for example the atomic counter. I'll try to explain best I can, this would only occur in a load balanced environment.
For example say we have 4 servers load balanced and storing data to our Couchbase cluster. We sort our records based on timestamps currently. If any of the 4 servers writing the data starts to lag behind the others than our pagination would possibly be missing records when retrieving client side. A SQL DB auto-increment and timestamps for example can be created when the record is stored to the DB which will avoid similar issues. Using a NoSql DB like Couchbase you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is if there is a delay in storing to the DB and you are retrieving in a pagination fashion while this delay has occurred, you run the real possibility of missing data. Since we are paging that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT**
Response to Andrew:
Example a facebook or pintrest type app is storing data to a DB, they have many load balanced servers from the frontend writing to the db. If for some reason writing is delayed its a non issue with a SQL DB because a timestamp or auto increment happens when the data is actually stored to the DB. There will be no missing data when paging. asking for 1-7 will give you data that is only stored in the DB, 7-* will contain anything that is delayed because an auto-increment value has not been created for that record becuase it is not actually stored.
In Couchbase its different, you actually get your auto increment value (atomic counter) and then save it. So for example say a record is going to be stored as atomic counter number 4. For some reasons this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results. As I mentioned are paging is timestamp sensitive.

Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string