Sorting enormous dataset

Sorting enormous dataset - node.js

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates trough the whole dataset and performs few post requests to some external end points, it's a simple piece of code.
The data being iterated over needs to be send ordered by the timestamp (filed in the document). The DB was down for some time, so backup DB was used, but has received older data which has been archived manually, and later all was merged with the main DB.
Older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How I can achieve this sorting?
PS: The documents do not contain any indexed fields.

There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

Related

AWS DynamoDB count query results without retrieving

I would like to check how many entries are in a DynamoDB table that matches a query without retrieving the actual entries, using boto3.
I want to run a machine learning job on data from DynamoDB table. The data I'm training on is a data that answers a query, not the entire table. I want to run the job only if I have enough data to train on.
Therefore, I want to check if I want to check that I have enough entries that match the query.
It is worth mentioning that the DynamoDB table I'm querying is really big, therefore actual retrieving is no option unless I actually want to run the job.
I know that I can use boto3.dynamodb.describe_table() to get how many entries there are in the entire table, but as I mentioned earlier, I want to know only how many entries match a query.
Any ideas?

This was asked and answered in the past, see How to get item count from DynamoDB?
Basically, you need to use the "Select" parameter to tell DynamoDB to only count the query's results, instead of retrieving them.
As usual in DynamoDB, this is truncated by paging: if the result set (not the count - the actual full results) is larger than 1 MB, then only the first 1 MB is retrieved, and the items in it counted, and you get back this partial count. If you're only interested in checking whether you have "enough" results - this may even be better for you - because you don't want to pay for reading a gigabyte of data just to check if the data is there. You can even ask for a smaller page, to read less - depending on what you consider enough data.
Just remember that you'll pay Amazon not by the amount of data returned (just one integer, the count) but by the amount of data read from disk. Using such counts excessively may lead to surprising large costs.

How to reduce the Querying time for larger size data in MongoDB

I am trying to querying the collection in MongoDB which matches more than 10000 data for the query. Even though I have used index, the querying time exceeds 25 seconds.
For example, I am having a table People with field name, age.
I need to fetch the People data whose age is 25, if query finds the matched objects is 10000, then it takes time to fetch the whole data.
I have created index like db.people.createIndex({"age":1})
Here, how can I reduce the querying time

run db.collection.find().explain() and make sure that your index is in fact used. Make sure that you do not have COLLSCANs there https://docs.mongodb.com/manual/reference/explain-results/.
if your documents have some/many large attributes and you need only some attributes try to request only them (e.g. only _id or _id and name). Less data transferred gives higher speed.
if your db does not fit in memory, make it fit in memory. Once the database does not fit the performance will be much worse.
if you are not running on a sharded cluster, create one based on a reasonable sharding key. Age may not be a good one because than all age=25 documents will end up on one node. Even if you have one computer with multiple CPUs it still may work better for you (if you have enough memory for that). It may even work the other way around. If you have a sharded cluster on one computer and your replicas do not fit in the memory, it may be better to use just one node.

Why is search performance is slow for about 1M documents - how to scale the application?

I have created a search project that based on lucene 4.5.1
There are about 1 million documents and each of them is about few kb, and I index them with fields: docname(stored), lastmodified,content. The overall size of index folder is about 1.7GB
I used one document (the original one) as a sample, and query the content of that document against index. the problems now is each query result is coming up slow. After some tests, I found that my queries are too large although I removed stopwords, but I have no idea how to reduce query string size. plus, the smaller size the query string is, the less accurate the result comes.
This is not limited to specific file, because I also tested with other original files, the performance of search is relatively slow (often 1-8 seconds)
Also, I have tried to copy entire index directory to RAMDirectory while search, that didn't help.
In addition, I have one index searcher only across multiple threads, but in testing, I only used one thread as benchmark, the expected response time should be a few ms
So, how can improve search performance in this case?
Hint: I'm searching top 1000

If the number of fields is large a nice solution is to not store them then serialize the whole object to a binary field.
The plus is, when projecting the object back out after query, it's a single field rather than many. getField(name) iterates over the entire set so O(n/2) then getting the values and setting fields. Just one field and deserialize.
Second might be worth at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700

Data retrieval - Database VS Programming language

I have been working with databases recently and before that I was developing standalone components that do not use databases.
With all the DB work I have a few questions that sprang up.
Why is a database query faster than a programming language data retrieval from a file.
To elaborate my question further -
Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.
Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.
Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk.
When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check for each row of data if the DOB="12/12/1985", if it matches then I store present the row to the user.
This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.
I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.
So what happens in a DBMS that makes it faster to retrieve data than through a programming language?
If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.
I use SQL Server if that is of any help.

Why is a database query faster than a programming language data retrieval from a file
That depends on many things - network latency and disk seek speeds being two of the important ones. Sometimes it is faster to read from a file.
In your description of finding a row within a million rows, a database will normally be faster than seeking in a file because it employs indexing on the data.
If you pre-process you data file and provided index files for the different fields, you could speedup data lookup from the filesystem as well.
Note: databases are normally used not for this feature, but because they are ACID compliant and therefore are suitable for working in environments where you have multiple processes (normally many clients on many computers) querying the database at the time.

There are lots of techniques to speed up various kinds of access. As #Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)
On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.
Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.
Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...
In other words, you will have written your own database.

...an old one, I know... just for if somebody finds this: The question contained "assume ... do not have any indexes"
...so the question was about the sequential dataread fight between the database and a flat file WITHOUT indexes, which the database wins...
And the answer is: if you read record by record from disk you do lots of disk seeking, which is expensive performance wise. A database always loads pages by concept - so a couple of records all at once. Less disk seeking is definitely faster. If you would do a mem buffered read from a flat file you could achieve the same or better read values.

Including documents in the emit compared to include_docs = true in CouchDB

I ran across a mention somewhere that doing an emit(key, doc) will increase the amount of time an index takes to build (or something to that effect).
Is there any merit to it, and is there any reason not to just always do emit(key, null) and then include_docs = true?

Yes, it will increase the size of your index, because CouchDB effectively copies the entire document in those cases. For cases in which you can, use include_docs=true.
There is, however, a race condition to be aware of when using this that is mentioned in the wiki. It is possible, during the time between reading the view data and fetching the document, that said document has changed (or has been deleted, in which case _deleted will be true). This is documented here under "Querying Options".

This is a classic time/space tradeoff.
Emitting document data into your index will increase the size of the index file on disk because CouchDB includes the emitted data directly into the index file. However, this means that, when querying your data, CouchDB can just stream the content directly from the index file on disk. This is obviously quite fast.
Relying instead on include_docs=true will decrease the size of your on-disk index, it's true. However, on querying, CouchDB must perform a document read for every returned row. This involves essentially random document lookups from the main data file, meaning that the cost and time of returning data increases significantly.
While the query time difference for small numbers of documents is slow, it will add up over every call made by the application. For me, therefore, emitting needed fields from a document into the index is usually the right call -- disk is cheap, user's attention spans less so. This is broadly similar to using covering indexes in a relational database, another widely echoed piece of advice.
I did a totally unscientific test on this to get a feel for what the difference is. I found about an 8x increase in response time and 50% increase in CPU when using include_docs=true to read 100,000 documents from a view when compared to a view where the documents were emitted directly into the index itself.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string