C++ Armadillo matrix circular buffer and row assignment - multithreading

I'm trying to populate an arma::mat data_mat(n,n) from streaming data of size n at each time step. It's a high-frequency real-time system, so efficiency and thread safety are very important design goals. With this in mind:
1) What is the best (efficient, thread-safe) way to populate data_mat with a vector (double/float) at each time step until all n rows are populated?
2) After n rows are populated, the rows should be circularly buffered, with row elements being shifted upward or downward.
I tried data_mat.row(i) = vec,
however, I'm not sure it is the most efficient way.
And I'm emulating the circular buffer by copying rows in a loop, but I suspect that is not very efficient. Any advice would be highly appreciated.
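One way to avoid the copy-in-a-loop emulation is to keep a rotating write index and overwrite the oldest row in place, reordering only when a consumer actually needs the rows oldest-to-newest. The sketch below shows the indexing idea in Python/NumPy for brevity; the class name RollingRows is made up, and the same pattern carries over to arma::mat with .row() views. It is a sketch, not a measured or thread-hardened implementation.

import numpy as np

class RollingRows:
    # Fixed-size buffer of the last n rows; the newest row overwrites the oldest.
    # In a multi-threaded feed, push() and ordered() would still need a lock or a
    # single-producer/single-consumer arrangement, which is omitted here.
    def __init__(self, n):
        self.buf = np.zeros((n, n))   # storage is never shifted
        self.head = 0                 # next slot to overwrite
        self.count = 0                # rows written so far (caps at n)

    def push(self, row_vec):
        self.buf[self.head, :] = row_vec              # one row copy, no shifting
        self.head = (self.head + 1) % self.buf.shape[0]
        self.count = min(self.count + 1, self.buf.shape[0])

    def ordered(self):
        # Materialise rows oldest-to-newest only when a caller needs that view.
        n = self.buf.shape[0]
        idx = (np.arange(self.count) + self.head - self.count) % n
        return self.buf[idx, :]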

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes the values 0, 0.5 and 1). However, I don't need an overall grouping (that is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that is preceded and followed by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should get comfortable with the for loop. Python makes it easy to combine library calls with plain loops. Say you have a dataset and want to work with column a: iterate over the rows, for example with "for idx, row in dataset.iterrows():", and inside the loop test the criterion you want, e.g. "if row[a] > 0.5:". Whenever the criterion is met, you can write the code that collects that row into a new dataset, and start a fresh one when the value drops back. Deliberately, and for the sake of your own practice, I did not write ready-made code.
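For readers who do want something concrete, here is a rough sketch of the run-splitting idea in pandas. It is only a sketch: the column name 'Color', the run length of 5, and the helper name split_on_zero_runs are assumptions, not code from the question or the answer.

import pandas as pd

def split_on_zero_runs(df, col="Color", run_len=5):
    # Split df into sub-dataframes separated by runs of at least run_len
    # consecutive zeros in col; shorter zero runs stay inside a chunk.
    is_zero = df[col].eq(0)
    # Label consecutive runs of equal values, then measure each run's length.
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    run_size = is_zero.groupby(run_id).transform("size")
    # A row is a separator only if it sits in a long-enough run of zeros.
    separator = is_zero & (run_size >= run_len)
    # Rows between separators share a segment id; separator rows are dropped.
    segment = separator.ne(separator.shift()).cumsum()
    return [g for _, g in df[~separator].groupby(segment[~separator])]

# usage: for k, chunk in enumerate(split_on_zero_runs(df)):
#            chunk.to_csv(f"cluster_{k}.csv", index=False)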

hash on large set of values, retaining information about sequence and approximate size: via Excel VBA

In Excel, I have two large columns of values that are usually identical in size and sequence. I want a hash for each column to check that the columns are in fact identical (with pretty good probability).
I have an MD5 hash algorithm which gives a hash for a single string, but I want something for a large set of values (about 20k), and hashing them all that way would be slow.
I can use a simple function like this:
hash = mean + stdev + skewness
In VBA, this looks like:
Function hash(x As Range) As Double
    Application.Volatile
    hash = Application.WorksheetFunction.Average(x) _
         + Application.WorksheetFunction.StDev(x) _
         + Application.WorksheetFunction.Skew(x)
End Function
and this gives me some confidence that the columns are the same in terms of magnitudes; but sometimes the values are identical but not in the correct order, and my hash cannot detect this. I need my hash to be able to detect wrong ordering.
I do not require 'anonymizing' or 'randomizing' of the data; there is no issue of privacy etc. In fact, a kind of 'proportional' hash that returns a small value for small errors and a large value for large errors would be extremely useful. Given that some rounding errors may result in small differences that I do not care about, the MD5 algorithm sometimes gives me false warnings.
Unfortunately the data is in Excel (because it is the result of previous Excel manipulations), so a VBA function that keeps me in Excel, and lets me proceed once the columns have been verified, would be best. So I'd like a function along those lines.
Of course, I could just compare the Excel columns by making another column and performing a large Boolean AND (A1=B1, A2=B2, etc.). But this would be tedious and inefficient. I actually have thousands of these columns to compare in order to find bugs.
Any ideas?
The easiest way to compare two columns for near-equality is to use the worksheet function SUMXMY2(). This computes the squared Euclidean distance between two ranges, thought of as vectors in a higher-dimensional space. For example, to check whether A1:A20000 is very close to B1:B20000, use the comparison
SUMXMY2(A1:A20000, B1:B20000) < tol
where tol is an error threshold which determines how much round-off error you are willing to tolerate.
Your original idea of using hashing could be useful in some circumstances. To make it tolerant of round-off error, look into the theory of Locality-sensitive hashing rather than cryptographic hashes such as MD5. Any such algorithm implemented in VBA would be somewhat slow, but depending on what you are trying to do it could still be useful.
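As a quick illustration of what SUMXMY2 computes, and why it both tolerates round-off and catches reordering, here is a small sketch outside of Excel (in Python; the tolerance value is an arbitrary assumption):

# SUMXMY2(A, B) is the sum of squared element-wise differences between two ranges.
a = [1.0, 2.0, 3.0, 4.0]
b_rounded  = [1.0000001, 2.0, 2.9999999, 4.0]   # same values, tiny round-off
b_shuffled = [2.0, 1.0, 3.0, 4.0]               # same values, wrong order

def sumxmy2(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

tol = 1e-6
print(sumxmy2(a, b_rounded) < tol)    # True: differences stay below the tolerance
print(sumxmy2(a, b_shuffled) < tol)   # False: element-wise mismatch is large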

The most efficient way to break down a huge file into chunks of max 10 MB

There is a text file about 20 GB in size. Each line of the file is a JSON object. Each odd line describes the even line that follows it.
The goal is to divide the big file into chunks of at most 10 MB, with the constraint that each chunk has an even number of lines so the pairing doesn't get lost. What would be the most efficient way to do that?
My research so far made me lean towards:
1. The split function in Linux. Is there any way to make it always export an even number of lines based on the size?
2. A modified version of a divide & conquer algorithm. Would this even work?
3. Estimating the average number of lines that meet the 10 MB criterion and iterating through the file, exporting a chunk whenever it meets the criterion.
I'm thinking that 1. would be the most efficient, but I wanted to get the opinion of experts here.
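As a rough sketch of option 3 (stream the file, always write lines in pairs, and roll over to a new output file once the size budget would be exceeded), here is how it could look in Python; the function name, the output naming scheme, and the exact 10 MB figure are assumptions:

MAX_BYTES = 10 * 1024 * 1024

def split_in_pairs(path, prefix="chunk"):
    # Read the description line and its data line together, so every output
    # file always contains an even number of lines.
    part, written, out = 0, 0, None
    with open(path, "rb") as src:
        while True:
            meta = src.readline()        # odd line: description
            data = src.readline()        # even line: the object it describes
            if not meta:
                break
            pair = meta + data
            if out is None or written + len(pair) > MAX_BYTES:
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part:05d}.jsonl", "wb")
                written = 0
            out.write(pair)
            written += len(pair)
    if out:
        out.close()

# usage: split_in_pairs("big.jsonl")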

Lone reference error when calling INDEX with RANDBETWEEN in Excel

I'm trying to do some bootstrapping with a data set in Excel with the formula =INDEX($H$2:$H$5057,RANDBETWEEN(2,5057)), where my original data set is in column H. It seems to work most of the time, but there is always approximately one cell that outputs a reference error. Does anyone know why this happens, or how to avoid including that one cell? I'm trying to generate a histogram from this data, and FREQUENCY does not play nicely with an array that has an error in it.
Please try:
=INDEX($H$2:$H$5057,RANDBETWEEN(1,5056))
=RANDBETWEEN(2,5057) returns an arbitrary integer from 2 up to and including 5057. Used as above, it specifies the position within the chosen array (H2:H5057), which only has 5056 elements, so one problem arises whenever RANDBETWEEN hits 5057. This is much easier to observe with just H2:H4 and RANDBETWEEN(2,4).

Fast repeated row counting in vast data - what format?

My Node.js app needs to index several gigabytes of timestamped CSV data, in such a way that it can quickly get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values will not be read, only the row counts per interval for a given permutation. Reducing time to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format suitable for this single task - but what format would that be?
Storing actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match with regex, but since these codes would have to be duplicated each minute, I'm not sure it's a good approach.
Or it could use an embedded database like SQLite, NeDB or TingoDB, but I'm not entirely convinced, since they don't have native enum-like types and may or may not be built for this kind of counting. But maybe it would work just fine?
This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
Will answer with my own findings for now, but I'm still interested to know more theory about this problem.
NeDB was not a good solution here, as it saved my values as normal JSON under the hood, repeating the key names for each row and adding unique IDs. It wasted a lot of space and would surely have been too slow, even if only because of disk I/O.
SQLite might be better at compressing and indexing data, but I have yet to try it. Will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across and get a short string representing each permutation. Then, for each minute, add these strings as keys only if they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it over this small index very quickly.
This was easy enough to implement, but it would of course have been trickier if I had had more possible column values than the roughly 70 I found.
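To make the encoding concrete, here is a small sketch of that index structure (shown in Python for brevity, even though the app itself is Node.js). The letter alphabet, the per-minute keying, and the helper names are assumptions; it only illustrates the shape of the data, not the production code.

from collections import defaultdict
import re

# One letter per distinct value, per column, assigned as values are first seen.
letters = defaultdict(dict)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def code_for(row):
    # Turn one row (a list of column values) into a short string like "AcB".
    out = []
    for col, value in enumerate(row):
        if value not in letters[col]:
            letters[col][value] = ALPHABET[len(letters[col])]   # assumes few values
        out.append(letters[col][value])
    return "".join(out)

# counts[minute][code] = number of rows with that combination in that minute.
counts = defaultdict(lambda: defaultdict(int))

def index_row(minute, row):
    counts[minute][code_for(row)] += 1

def count_matching(minute, pattern):
    # pattern is a regex over codes built from the letter dictionary,
    # e.g. "A." = any row whose first column had its first-seen value.
    rx = re.compile(pattern)
    return sum(n for code, n in counts[minute].items() if rx.fullmatch(code))

With this layout, a query for each minute in a day becomes 1440 lookups into counts, each scanning only the handful of codes actually seen in that minute.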
