Out of memory [divide and conquer algorithm] - node.js

So I have a foo table, which is huge, and whenever I try to read all the data from that table, Node.js gives me an out-of-memory error. I can still get chunks of data using offset and limit, but I cannot merge all of the chunks and hold them in memory, because I run out of memory again. In my algorithm I have lots of ids and need to check whether each id exists in the foo table. What is the best solution (in terms of algorithmic complexity) for checking whether an id exists in the foo table when I cannot hold all of the data in memory?
PS: The naive solution is to fetch chunks of data and scan chunk by chunk for each id, but the complexity is n squared; I believe there should be a better way...

Under the constraints you specified, you could create a hash table containing the IDs you are looking for as keys, with all values initialized to false.
Then, read the table chunk by chunk, and for each item in the table, look it up in the hash table. If found, mark the hash table entry with true.
After going through all the chunks, your hash table will hold values of true for the IDs found in the table.
Provided that a hash table lookup takes constant time, this algorithm has a time complexity of O(N), where N is the number of rows in the table (plus the time to build the hash table from the IDs you are looking for).
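A minimal sketch of the approach above, written in Python for brevity. The `read_chunks` helper and the sample data are hypothetical stand-ins for paging through the foo table with offset and limit; only one chunk is held in memory at a time.

```python
from typing import Dict, Iterable, Iterator, List


def read_chunks(rows: List[int], limit: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for paging through the foo table with offset/limit."""
    for offset in range(0, len(rows), limit):
        yield rows[offset:offset + limit]


def find_existing(ids: Iterable[int], chunks: Iterable[List[int]]) -> Dict[int, bool]:
    # Keys are the IDs we are looking for; every value starts as False.
    found = {i: False for i in ids}
    for chunk in chunks:            # one chunk in memory at a time
        for row_id in chunk:
            if row_id in found:     # hash lookup, constant time on average
                found[row_id] = True
    return found


table_ids = [5, 9, 12, 40, 41, 77, 90]   # pretend this is too large to load at once
result = find_existing([9, 10, 77], read_chunks(table_ids, limit=3))
```

Each table row is touched exactly once, so the work grows linearly with the table size rather than with the product of the table size and the number of ids.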

You can sort your ids and break them into chunks. Then you can keep in memory the range of values in each chunk: (lowestId, highestId) for that chunk.
You can quickly find the chunk (if any) that may contain an id using an in-memory binary search over those ranges, then load that specific chunk into memory and do a binary search on it.
The complexity should be O(log N) for both searches. In general, read about the binary search algorithm.
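A minimal sketch of the two-level binary search described above, in Python. It assumes the chunks can be fetched ordered by id; `CHUNKS` and `load_chunk` are hypothetical stand-ins for that query, and only the small list of (lowest, highest) ranges stays in memory.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# Hypothetical stand-in for the table, already split into chunks ordered by id.
CHUNKS: List[List[int]] = [[1, 4, 7], [10, 15, 18], [22, 30, 31]]

# Small in-memory index: (lowest_id, highest_id) for each chunk.
RANGES: List[Tuple[int, int]] = [(c[0], c[-1]) for c in CHUNKS]
LOWS: List[int] = [low for low, _ in RANGES]


def load_chunk(index: int) -> List[int]:
    """Hypothetical stand-in for fetching one chunk with offset/limit, ordered by id."""
    return CHUNKS[index]


def exists(target: int) -> bool:
    # First binary search: the last chunk whose lowest id is <= target.
    i = bisect_right(LOWS, target) - 1
    if i < 0 or target > RANGES[i][1]:
        return False                    # target falls outside every chunk range
    chunk = load_chunk(i)               # only this one chunk is loaded
    # Second binary search: inside the loaded chunk.
    j = bisect_left(chunk, target)
    return j < len(chunk) and chunk[j] == target
```

Note that each lookup may still load one chunk from the database, so grouping the ids by chunk before looking them up would avoid reloading the same chunk repeatedly.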

"PS: The naive solution is to get chunks of data and see chunk by chunk for each id; but the complexity is n squared; there should be a better way I believe..."
Let's say you could load the whole table into memory. In any case, you would need to check for each ID whether or not it is in the DB. You can't do any better than comparing.
Having said that, a hash table comes to mind. Let's say the IDs are integers and they are randomly picked. You could hash the IDs you need to check by their last two digits (or the first two, for that matter). Then checking the items you have in memory will be quicker.

Related

How does Cassandra store variable data types like text

My assumption is that Cassandra stores fixed-length data in a column family. Take a column family like: id (bigint), age (int), description (text), picture (blob). Here description and picture have no size limit. How does Cassandra store them? Does it externalize them through an ID -> location mapping?
For example, it looks like relational databases use a pointer to the actual location of large texts. See how it is done
Also, it looks like in MySQL it is recommended to use char instead of varchar for better performance. I guess that is simply because there is no need for an "id lookup". See: mysql char vs varchar
Cassandra stores individual cells (column values) in its on-disk files ("sstables") as a 32-bit length followed by the data bytes. So string values do not need to have a fixed size, nor are stored as pointers to other locations - the complete string appears as-is inside the data file.
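To illustrate the idea of a length prefix followed by the data bytes, here is a small Python sketch. It is not Cassandra's actual sstable layout, which carries more metadata per cell; the signed big-endian 32-bit length and the function names are assumptions chosen for illustration.

```python
import struct
from typing import List


def encode_cells(values: List[bytes]) -> bytes:
    """Write each value inline as a 32-bit length followed by its bytes."""
    out = bytearray()
    for value in values:
        out += struct.pack(">i", len(value))   # assumed: signed 32-bit, big-endian
        out += value                           # the complete value, stored inline
    return bytes(out)


def decode_cells(buf: bytes) -> List[bytes]:
    """Read the values back by following the length prefixes."""
    values, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        values.append(buf[pos:pos + length])
        pos += length
    return values


cells = [b"short", b"", b"a much longer description with no fixed size"]
encoded = encode_cells(cells)
```

Because the length travels with the value, variable-size strings need neither padding nor a pointer to another location, and a signed 32-bit prefix is what caps a single value at about 2GB.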
The 32-bit length limit means that each "text" or "blob" value is limited to 2GB in length, but in practice, you shouldn't use anything even close to that - with Cassandra documentation suggesting you shouldn't use more than 1MB. There are several problems with having very large values:
Because values are not stored as pointers to some other storage, but rather stored inline in the sstable files, these large strings get copied around every time sstable files get rewritten, namely during compaction. It would be more efficient to keep the huge string on disk in a separate file and just copy around pointers to it - but Cassandra doesn't do this.
The Cassandra query language (CQL) does not have any mechanism for storing or retrieving a partial cell. So if you have a 2GB string, you have to retrieve it entirely - there is no way to "page" through it, nor a way to write it incrementally.
In Scylla, large cells will result in large latency spikes because Scylla will handle the very large cell atomically and not context-switch to do other work. In Cassandra this problem will be less pronounced but will still likely cause problems (the thread stuck on the large cell will monopolize the CPU until preempted by the operating system).

Storing arrays in Cassandra

I have lots of fast incoming data that is organised as follows:
Lots of 1D arrays, one per logical object, where the position of each element in the array is important and each element is calculated and produced individually in parallel, and so not necessarily in order.
The data arrays themselves are not necessarily written in order.
The length of the arrays may vary.
The data is read as an entire array at a time, so it makes sense to store the entire thing together.
The way I see it, the issue is primarily caused by the way the data is made available for writing. If it was all available together I'd just store the entire lot together at the same time and be done with it.
For smaller data loads I can get away with the postgres array datatype. One row per logical object with a key and an array column. This allows me to scale by having one writer per array, writing the elements in any order without blocking any other writer. This is limited by the rate of a single postgres node.
In Cassandra/Scylla it looks like I have the options of either:
Storing each element as its own row which would be very fast for writing, reads would be more cumbersome but doable and involve potentially lots of very wide scans,
or converting the array to json/string, reading the cell, tweaking the value then re-writing it which would be horribly slow and lead to lots of compaction overhead
or having the writer buffer until it receives all the array values and then writing the array in one go, except the writer won't know how long the array should be and will need a timeout to write down whatever it has by this time which ultimately means I'll need to update it at some point in the future if the late data turns up.
What other options do I have?
Thanks
Option 1 seems to be a good match.
I assume each logical object has a unique id (or better, a uuid).
In such a case, you can create something like:
CREATE TABLE tbl (id uuid, ord int, v text, PRIMARY KEY (id, ord));
Here id (a uuid) is the partition key and ord is the clustering (ordering) key, storing each "array" as a partition and each value as a row.
This allows:
fast retrieval of the entire "array", even a big one, using paging
fast retrieval of a single index in an array
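The access pattern this schema gives can be modelled with a small in-memory Python sketch. This is not a Cassandra client: the `table` dictionary and the three helper functions are hypothetical and only mimic "one partition per array, one row per element, rows ordered by ord".

```python
from collections import defaultdict
from typing import Dict, List, Optional
from uuid import UUID, uuid4

# partition key (id) -> {clustering key (ord) -> value}
table: Dict[UUID, Dict[int, str]] = defaultdict(dict)


def write_element(array_id: UUID, ord_: int, value: str) -> None:
    """One row per element; writers can insert positions in any order."""
    table[array_id][ord_] = value


def read_array(array_id: UUID) -> List[str]:
    """Read a whole partition; rows come back ordered by the clustering key."""
    rows = table.get(array_id, {})
    return [rows[k] for k in sorted(rows)]


def read_element(array_id: UUID, ord_: int) -> Optional[str]:
    """Read a single position by partition key plus clustering key."""
    return table.get(array_id, {}).get(ord_)


array_id = uuid4()
for position, value in [(2, "c"), (0, "a"), (1, "b")]:   # arrives out of order
    write_element(array_id, position, value)
```

Writers never need to read or rewrite an existing cell, and elements that arrive late simply become additional rows in the same partition.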

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates through the whole dataset and performs a few POST requests to some external endpoints; it's a simple piece of code.
The data being iterated over needs to be sent ordered by the timestamp (a field in the document). The DB was down for some time, so a backup DB was used, but it received older data which had been archived manually, and later everything was merged with the main DB.
The older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How can I achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort by for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

Data storage parallelization in .NET Core

I am a little lost in this task. There is a requirement for our caching solution to split a large data dictionary into partitions and perform operations on them in separate threads.
The scenario is: We have a large pool of data that is to be kept in memory (40 million rows). The chosen strategy is first to have a Dictionary with an int key. This dictionary contains a set of 16 dictionaries that are keyed by GUID and contain a data class.
The number 16 is calculated on startup and indicates CPU core count * 4.
The data class contains a byte[] which is basically a translated set of properties and their values, int pointer to metadata dictionary and checksum.
Then there is a set of control functions that takes care of locking and assigns/retrieves Guid keyed data based on a division of the first segment of guid (8 hex numbers) by divider. This divider is just FFFFFFFF / 16. This way each key will have a corresponding partition assigned.
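The partition assignment described above can be sketched as follows, in Python rather than C# for brevity. The names `partition_for`, `PARTITIONS` and `DIVIDER` are hypothetical, and the clamp at the end is an addition that the description does not mention: with integer division, the highest first segments (FFFFFFF0 through FFFFFFFF) would otherwise produce index 16, one past the last of 16 partitions.

```python
import uuid

PARTITIONS = 16
DIVIDER = 0xFFFFFFFF // PARTITIONS   # 0x0FFFFFFF


def partition_for(key: uuid.UUID) -> int:
    """Map a GUID to a partition index using its first 8 hex digits."""
    first_segment = int(key.hex[:8], 16)
    # Clamp so the top of the range does not spill into a 17th partition.
    return min(first_segment // DIVIDER, PARTITIONS - 1)


low = partition_for(uuid.UUID("00000000-0000-0000-0000-000000000000"))
mid = partition_for(uuid.UUID("80000000-0000-0000-0000-000000000000"))
high = partition_for(uuid.UUID("ffffffff-0000-0000-0000-000000000000"))
```

Since the index depends only on the key, lookups and writes for different partitions can take different locks and run in parallel without coordinating with each other.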
Now I need to figure out how to perform operations (key lookup, iterations and writes) on these dictionaries in separate threads in parallel? Will I just wrap these operations using Tasks? Or will it be better to load these behemoth dictionaries into separate threads whole?
I have a rough idea how to implement data collectors, that will be the easy part I guess.
Also, is using Dictionaries a good approach? Their size is limited to 3mil rows per partition and if one is full, the control mechanism tries to insert on another server that is using the exact same mechanism.
Is .NET actually a good language to implement this solution?
Any help will be extremely appreciated.
Okay, so I implemented ReaderWriterLockSlim and implemented concurrent access through System.Threading.Tasks. I also managed to exclude any dataClass object from the storage, now it is only a dictionary of byte[]s.
It's able to store all 40 million rows taking just under 4GB of RAM and through some careful SIMD optimized manipulations performs EQUALS, <, > and SUM operation iterations in under 20ms, so I guess this issue is solved.
Also the concurrency throughput is quite good.
I just wanted to post this in case anybody faces a similar issue in the future.

Cassandra distinct counting

I need to count a bunch of "things" in Cassandra.
I need to increase ~100-200 counters every few seconds or so.
However, I need to count distinct "things".
In order not to count something twice, I am setting a key in a CF, which the program reads before increasing the counter, e.g. something like:
result = get cf[key];
if (result == NULL){
    set cf[key][x] = 1;
    incr counter_cf[key][x];
}
However this read operation slows down the cluster a lot.
I tried to decrease reads, using several columns, e.g. something like:
result = get cf[key];
if (result[key1] == NULL){
    set cf[key1][x] = 1;
    incr counter_cf[key1][x];
}
if (result[key2] == NULL){
    set cf[key2][x] = 1;
    incr counter_cf[key2][x];
}
//etc....
Then I reduced the reads from 200+ to about 5-6, but it still slows down the cluster.
I do not need exact counting, but I can not use bit-masks, nor bloom-filters,
because there will be 1M+++ counters and some could go more than 4 000 000 000.
I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+++) either.
At the moment I am thinking of using Tokyo Cabinet as an external key/value store,
but this solution, if it works, will not be as scalable as Cassandra.
Using Cassandra for the distinct counting is not ideal when the number of distinct values is big. Any time you need to do a read before a write you should ask yourself if Cassandra is the right choice.
If the number of distinct items is smaller you can just store them as column keys and do a count. A count is not free, Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be ok. I assume you've already considered this option and that it's not feasible for you, I just thought I'd mention it.
The way people typically do it is by keeping the HLLs or Bloom filters in memory and then flushing them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.
Even if you switched to something else, for example to something where you can do bit operations on values, you still need to guard against race conditions. I suggest that you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do. When a node gets an increment operation for a key it does not have in memory it loads that state from Cassandra (or creates a new state if there's nothing in the database). If a node crashes the operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations you can be sure that a counter state is only ever touched by one node.
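A rough Python sketch of that design, under loud assumptions: the `cassandra` dictionary is a stand-in for the database, an exact Python set stands in for an HLL or Bloom filter, message ids stand in for whatever the queue uses to ack, and all names (`CounterNode`, `shard_for`, `route`) are hypothetical.

```python
import zlib
from typing import Dict, List, Set

cassandra: Dict[str, Set[str]] = {}   # stand-in for the persisted counter state


def shard_for(key: str, shard_count: int) -> int:
    """Stable hash so a given counter key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


class CounterNode:
    def __init__(self) -> None:
        self.state: Dict[str, Set[str]] = {}   # distinct items seen per counter
        self.dirty: Set[str] = set()           # counters changed since last flush
        self.unacked: List[int] = []           # messages not yet acked to the queue

    def increment(self, msg_id: int, key: str, item: str) -> None:
        if key not in self.state:
            # Load existing state from the store, or start a new one.
            self.state[key] = set(cassandra.get(key, set()))
        self.state[key].add(item)              # duplicates are absorbed here
        self.dirty.add(key)
        self.unacked.append(msg_id)

    def distinct_count(self, key: str) -> int:
        return len(self.state.get(key, ()))

    def flush(self) -> List[int]:
        """Persist changed counters, then return the message ids that can be acked."""
        for key in self.dirty:
            cassandra[key] = set(self.state[key])
        self.dirty.clear()
        acked, self.unacked = self.unacked, []
        return acked


nodes = [CounterNode() for _ in range(4)]


def route(msg_id: int, key: str, item: str) -> None:
    nodes[shard_for(key, len(nodes))].increment(msg_id, key, item)
```

If a node crashes before `flush`, nothing has been acked, so the queue redelivers those messages. Because adding the same item to a distinct set twice has no effect, the redelivery does not double count the distinct values.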
