saving and loading large numpy matrix

The code below is how I save the NumPy array; the file is about 27 GB once saved. There are more than 200K images, and each has shape (224,224,3).
import h5py
hf = h5py.File('cropped data/features_train.h5', 'w')
for i, each in enumerate(features_train):
    hf.create_dataset(str(i), data=each)
hf.close()
This is how I load the data, and loading takes hours.
import numpy as np
features_train = np.zeros(shape=(1,224,224,3))
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = hf.get(key)
    x = np.array(x)
    features_train = np.append(features_train, np.array([x]), axis=0)
hf.close()
So, does anyone have a better solution for data of this size?

You didn't tell us how much physical RAM your server has,
but 27 GiB sounds like "a lot".
Consider breaking your run into several smaller batches.
There is an old saw in Java land that asks "why does this have quadratic runtime?",
that is, "why is this so slow?"
String s = "";
for (int i = 0; i < 1e6; i++) {
    s += "x";
}
The answer is that toward the end,
on each iteration we are reading ~ a million characters
then writing them, then appending a single character.
The total cost is O(n^2), on the order of 1e12 character copies for n = 1e6.
Standard solution is to use a StringBuilder so we're back
to the expected O(n), about 1e6 appends.
Here, I worry that calling np.append() pushes us into the quadratic regime.
To verify, replace the features_train assignment with a simple evaluation
of np.array([x]), so we spend a moment computing and then immediately discarding
that value on each iteration.
If the conjecture is right, runtime will be much smaller.
To remedy it, avoid calling .append().
Rather, preallocate 27 GiB with np.zeros()
(or np.empty())
and then within the loop assign each freshly read array
into the offset of its preallocated slot.
Linear runtime will allow the task to complete much more quickly.
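A minimal sketch of the preallocate-and-fill pattern described above. The reader function fake_read and the tiny shapes are stand-ins invented for illustration; with the file from the question, something like read_item = lambda i: hf[str(i)][()] would presumably take its place. Indexing by str(i) also sidesteps two side effects of the original loop: the all-zero first row left over from np.zeros(shape=(1,...)), and the name-ordered (not numeric) iteration that hf.keys() normally gives.

```python
import numpy as np

def load_preallocated(read_item, n_items, item_shape, dtype=np.float32):
    """Fill one preallocated array instead of growing it with np.append()."""
    # One allocation up front; np.empty skips the zero-fill that np.zeros does.
    out = np.empty((n_items, *item_shape), dtype=dtype)
    for i in range(n_items):
        # Each item is copied exactly once into its own slot: O(n) total work.
        out[i] = read_item(i)
    return out

# Hypothetical stand-in for the HDF5 file: item i is an array filled with i.
def fake_read(i):
    return np.full((4, 4, 3), i, dtype=np.float32)

features = load_preallocated(fake_read, n_items=5, item_shape=(4, 4, 3))
print(features.shape)
```

Whether float32 is the right dtype depends on how the images were stored; it is only an assumption here, chosen because it halves the memory of the float64 default.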

Related

Julia 1.5.2 Performance Questions

I am currently attempting to implement a metaheuristic (genetic) algorithm. In this venture I also want to try to write reasonably fast and efficient code. However, I have little experience writing efficient code. I was therefore wondering if some people could give some "quick tips" to increase the efficiency of my code. I have created a small working example that contains most of the elements the full code will contain, in regard to preallocating arrays, custom mutable structs, random numbers, pushing into arrays, etc.
One option I have already explored is the package "StaticArrays". However, many of my arrays must be mutable (so we need MArrays), and many of them will become very large (more than 100 elements). The StaticArrays documentation specifies that static arrays must remain small to stay efficient.
According to the documentation, rand() should be thread safe in Julia 1.5.2. I have therefore attempted to multithread the for-loops in my functions to make them run faster, and this results in a slight performance increase.
However, if people can see a more efficient way of allocating arrays or pushing SpotPrices into an array, it would be greatly appreciated! Any other performance tips are also very welcome!
# Packages
clearconsole()
using DataFrames
using Random
using BenchmarkTools
Random.seed!(42)
df = DataFrame(SpotPrice = convert(Array{Float64}, rand(-266:500,8832)),
               month = repeat([1,2,3,4,5,6,7,8,9,10,11,12]; outer = 736),
               hour = repeat([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]; outer = 368))
# Data structure for the prices per hour
mutable struct SpotPrices
    hour :: Array{Float64,1}
end
# Fill-out data structure
function setup_prices(df::DataFrame)
    prices = []
    for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end
prices = setup_prices(df)
# Sampler function
function MC_Sampler(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end
samples = MC_Sampler(prices, 100)
@btime setup_prices(df)
@btime MC_Sampler(prices,100)
function setup_prices_par(df::DataFrame)
    prices = []
    @sync Threads.@threads for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end
# Sampler function
function MC_Sampler_par(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    @sync Threads.@threads for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end
@btime setup_prices_par(df)
@btime MC_Sampler_par(prices,100)

Read https://docs.julialang.org/en/v1/manual/performance-tips/ very carefully.
Basic cleanups start with:
Your SpotPrices struct does not need to be mutable. In any case, since it has only one field, you could just define it as SpotPrices=Vector{Float64}
You do not want untyped containers: instead of prices = [] do prices = SpotPrices[] (the element type should match what you push! into it)
Using DataFrames.groupby will be much faster than finding unique elements and filtering by them
If you do not need to initialize an array, then do not: Matrix{Float64}(undef, sample_size, 24) is faster than zeros(sample_size, 24)
You do not need @sync before a Threads.@threads loop, because the loop already waits for all of its iterations to finish
Create separate random states, one for each thread, and use them whenever you call the rand function
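To illustrate the groupby tip in a language-neutral way, here is a small Python sketch (not Julia, and not the DataFrames API). It contrasts filtering the whole table once per distinct key with bucketing every row in a single pass. The rows list is a made-up stand-in for the DataFrame in the question, and both function names are hypothetical.

```python
import random
from collections import defaultdict

random.seed(42)
# Hypothetical stand-in for the DataFrame: (hour, spot_price) rows.
rows = [(h, float(random.randint(-266, 500)))
        for _ in range(368) for h in range(1, 25)]

def group_by_filtering(rows):
    # One full scan of the table per distinct hour (24 scans here),
    # which mirrors calling filter(...) inside the loop.
    hours = sorted({h for h, _ in rows})
    return {h: [p for hh, p in rows if hh == h] for h in hours}

def group_in_one_pass(rows):
    # A single scan that appends each price to its hour's bucket,
    # which is the idea behind a group-by.
    groups = defaultdict(list)
    for h, p in rows:
        groups[h].append(p)
    return dict(groups)

print(group_by_filtering(rows) == group_in_one_pass(rows))
```

Both functions return the same groups; the second touches each row once instead of once per key.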

What does flatten_parameters() do?

I saw many Pytorch examples using flatten_parameters in the forward function of the RNN
self.rnn.flatten_parameters()
I saw this RNNBase and it is written that it
Resets parameter data pointer so that they can use faster code paths
What does that mean?

This may not be a full answer to your question, but if you look at the source code of flatten_parameters, you will notice that it calls _cudnn_rnn_flatten_weight in
...
NoGradGuard no_grad;
torch::_cudnn_rnn_flatten_weight(...)
...
That is the function that does the job. What it actually does is copy the model's weights into a vector<Tensor> (check the params_arr declaration) in:
// Slice off views into weight_buf
std::vector<Tensor> params_arr;
size_t params_stride0;
std::tie(params_arr, params_stride0) = get_parameters(handle, rnn, rnn_desc, x_desc, w_desc, weight_buf);
MatrixRef<Tensor> weight{weight_arr, static_cast<size_t>(weight_stride0)},
    params{params_arr, params_stride0};
And the weights are copied in
// Copy weights
_copyParams(weight, params);
Also note that they update (or "reset", as the docs explicitly say) the original weight pointers to the new params pointers by calling the in-place operation .set_ (a trailing _ is their notation for in-place operations) in orig_param.set_(new_param.view_as(orig_param));
// Update the storage
for (size_t i = 0; i < weight.size(0); i++) {
  for (auto orig_param_it = weight[i].begin(), new_param_it = params[i].begin();
       orig_param_it != weight[i].end() && new_param_it != params[i].end();
       orig_param_it++, new_param_it++) {
    auto orig_param = *orig_param_it, new_param = *new_param_it;
    orig_param.set_(new_param.view_as(orig_param));
  }
}
And according to n2798 (draft of C++0x)
© ISO/IEC N3092
23.3.6 Class template vector
A vector is a sequence container that supports random access iterators. In addition, it supports (amortized) constant time insert and erase operations at the end; insert and erase in the middle take linear time. Storage management is handled automatically, though hints can be given to improve efficiency. The elements of a vector are stored contiguously, meaning that if v is a vector<T, Allocator> where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
In some situations you may see this warning:
UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().
So the library itself, through this warning, explicitly advises keeping the weights in one contiguous chunk of memory.

Parallelization of Piecewise Polynomial Evaluation

I am trying to evaluate points in a large piecewise polynomial, which is obtained from a cubic-spline. This takes a long time to do and I would like to speed it up.
As such, I would like to evaluate points on a piecewise polynomial with parallel processes, rather than sequentially.
Code:
z = zeros(1e6, 1); % preallocate some memory for speed
Y = rand(11220, 161); % some data, rand for generating a working example
X = 0 : 0.0125 : 2; % vector of data sites
pp = spline(X, Y); % get the piecewise polynomial form of the cubic spline; the resulting structure is large
for t = 1 : 1e6 % big number
    hcurrent = ppval(pp, t); % evaluate the piecewise polynomial at t
    z(t) = sum(x(t:t+M-1).*hcurrent, 1); % do some operation on the interpolated value. Most likely not relevant to this question.
end
Unfortunately, the matrix form, using
hcurrent = flipud(ppval(pp, 1 : 1e6))
requires too much memory, so it cannot be done. Is there a way that I can batch process this code to speed it up?

For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code that takes advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, moves some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(@power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
    hcurrent = sum(bsxfun(@times,c(idx(t)+d,:),x(t,:)),2);
    z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optimized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.
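For readers who want to see the two steps the optimized loop relies on (find the interval that contains t, then evaluate that piece's polynomial in the local variable t - b(i)), here is a minimal Python sketch. It is not a port of the MATLAB code above: it handles a scalar-valued polynomial only, and the function name pp_eval and the sample data are made up for illustration. It assumes the pp-form convention that each piece stores its coefficients with the highest power first.

```python
from bisect import bisect_right

def pp_eval(breaks, coefs, t):
    """Evaluate a scalar piecewise polynomial at t.

    breaks: sorted break points b[0] < ... < b[L], giving L pieces.
    coefs:  coefs[i] holds piece i's coefficients, highest power first,
            expressed in the local variable (t - breaks[i]).
    """
    # Search the interior breaks only, so that points outside the range
    # fall into the first or last piece (extrapolation).
    i = bisect_right(breaks, t, 1, len(breaks) - 1) - 1
    dt = t - breaks[i]
    # Horner's rule: one multiply and one add per coefficient.
    value = 0.0
    for c in coefs[i]:
        value = value * dt + c
    return value

# Two quadratic pieces: t^2 on [0, 1) and 2(t-1)^2 + 2(t-1) + 1 on [1, 2].
breaks = [0.0, 1.0, 2.0]
coefs = [[1.0, 0.0, 0.0], [2.0, 2.0, 1.0]]
print([pp_eval(breaks, coefs, t) for t in (0.5, 1.0, 1.5)])
```

The binary search plays the role of discretize in the answer, and the Horner loop replaces the explicit powers of T - b(idx).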

Use the parfor command for parallel loops. Also, for a speed-up, precompute the z vector as z(j) = x(j:j+M-1) and compute hcurrent inside the parfor loop.

The estimation of the spline parameters can be written in matrix form.
Once you have written it in matrix form and solved it, you can use the model matrix to evaluate the spline at all data points using matrix multiplication, which is probably the most heavily tuned operation in MATLAB.

MATLAB: fastest way to do a root-mean-squared error between a vector and array of vectors

I have a question regarding the fastest way to compute the RMSE between a single vector and an array of vectors. Specifically, I have a vector A representing a point and would like to find the index of the point in a list B that A is closest to. Right now I am using:
tempmat = bsxfun(@minus,A,B);
tempmat1 = sqrt(sum(tempmat.^2,2));
index = find(tempmat1 == min(tempmat1));
This takes about 0.058 seconds to calculate the index. Is there a faster way of doing this in MATLAB? I am performing this calculation literally millions of times.
Many thanks for reading,
Joe

tempmat = bsxfun(@minus,A,B);
tempmat1 = sum(tempmat.^2,2);
[m,index] = min(tempmat1);
m = sqrt(m); %# optional, only if you need the actual numerical value
This avoids calculating sqrt on the whole array, since the minimum of the squared differences will have the same index. It also uses the second output of min to avoid the second pass of find.
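The same two ideas (compare squared distances, and get the index from the minimum search itself) can be sketched in plain Python. This is an illustration and not a translation of the MATLAB snippet; the function name nearest_index and the sample points are made up.

```python
import math

def nearest_index(a, points):
    """Return (index, squared_distance) of the point closest to a."""
    best_i, best_d2 = -1, float("inf")
    for i, p in enumerate(points):
        # Squared Euclidean distance; no sqrt needed to compare distances.
        d2 = sum((x - y) ** 2 for x, y in zip(a, p))
        if d2 < best_d2:
            best_i, best_d2 = i, d2
    return best_i, best_d2

a = (1.0, 2.0)
points = [(4.0, 6.0), (1.0, 3.0), (0.0, 0.0)]
index, d2 = nearest_index(a, points)
# Take the square root once, and only if the actual distance is needed.
print(index, math.sqrt(d2))
```

Because the square root is monotonic, the point with the smallest squared distance is also the point with the smallest distance.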

You'll probably find that
tempmat = A - B(ones(1, size(A,1)), :)
is faster than the bsxfun version, unless size(A,1) is exceptionally large.
This assumes that A is your array and B is your vector. The RSS calculation implies that you have row vectors.
Also, I presume you know that you're calculating the RSS not RMS.

Why overwrite a file more than once to securely delete all traces of a file?

Erasing programs such as Eraser recommend overwriting data maybe 36 times.
As I understand it all data is stored on a hard drive as 1s or 0s.
If an overwrite of random 1s and 0s is carried out once over the whole file then why isn't that enough to remove all traces of the original file?

A hard drive bit which used to be a 0, and is then changed to a '1', has a slightly weaker magnetic field than one which used to be a 1 and was then written to 1 again. With sensitive equipment the previous contents of each bit can be discerned with a reasonable degree of accuracy, by measuring the slight variances in strength. The result won't be exactly correct and there will be errors, but a good portion of the previous contents can be retrieved.
By the time you've scribbled over the bits 35 times, it is effectively impossible to discern what used to be there.
Edit: A modern analysis shows that a single overwritten bit can be recovered with only 56% accuracy. Trying to recover an entire byte is only accurate 0.97% of the time. So I was just repeating an urban legend. Overwriting multiple times might have been necessary when working with floppy disks or some other medium, but hard disks do not need it.
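The byte figure follows from the bit figure if the eight bits are treated as independent, which is an assumption made here for illustration and not a claim taken from the analysis itself:

```python
p_bit = 0.56           # per-bit recovery accuracy quoted above
p_byte = p_bit ** 8    # all eight bits must be right, assuming independence
print(f"{p_byte:.2%}")  # prints 0.97%
```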

Daniel Feenberg (an economist at the private National Bureau of Economic Research) claims that the chances of overwritten data being recovered from a modern hard drive amount to "urban legend":
Can Intelligence Agencies Read Overwritten Data?
So theoretically overwriting the file once with zeroes would be sufficient.

In conventional terms, when a one is written to disk the media records a one, and when a zero is written the media records a zero. However the actual effect is closer to obtaining a 0.95 when a zero is overwritten with a one, and a 1.05 when a one is overwritten with a one. Normal disk circuitry is set up so that both these values are read as ones, but using specialised circuitry it is possible to work out what previous "layers" contained. The recovery of at least one or two layers of overwritten data isn't too hard to perform by reading the signal from the analog head electronics with a high-quality digital sampling oscilloscope, downloading the sampled waveform to a PC, and analysing it in software to recover the previously recorded signal. What the software does is generate an "ideal" read signal and subtract it from what was actually read, leaving as the difference the remnant of the previous signal. Since the analog circuitry in a commercial hard drive is nowhere near the quality of the circuitry in the oscilloscope used to sample the signal, the ability exists to recover a lot of extra information which isn't exploited by the hard drive electronics (although with newer channel coding techniques such as PRML (explained further on) which require extensive amounts of signal processing, the use of simple tools such as an oscilloscope to directly recover the data is no longer possible)
http://www.cs.auckland.ac.nz/~pgut001/pubs/secure_del.html

Imagine a sector of data on the physical disk. Within this sector is a magnetic pattern (a strip) which encodes the bits of data stored in the sector. This pattern is written by a write head which is more or less stationary while the disk rotates beneath it. Now, in order for your hard drive to function properly as a data storage device each time a new magnetic pattern strip is written to a sector it has to reset the magnetic pattern in that sector enough to be readable later. However, it doesn't have to completely erase all evidence of the previous magnetic pattern, it just has to be good enough (and with the amount of error correction used today good enough doesn't have to be all that good). Consider that the write head will not always take the same track as the previous pass over a given sector (it could be skewed a little to the left or the right, it could pass over the sector at a slight angle one way or the other due to vibration, etc.)
What you get is a series of layers of magnetic patterns, with the strongest pattern corresponding to the last data write. With the right instrumentation it may be possible to read this layering of patterns with enough detail to be able to determine some of the data in older layers.
It helps that the data is digital, because once you have extracted the data for a given layer you can determine exactly the magnetic pattern that would have been used to write it to disk and subtract that from the readings (and then do so on the next layer, and the next).

The reason why you want this is not harddisks, but SSDs. They remap clusters without telling the OS or filesystem drivers. This is done for wear-leveling purposes. So, the chances are quite high that the 0 bit written goes to a different place than the previous 1. Removing the SSD controller and reading the raw flash chips is well within the reach of even corporate espionage. But with 36 full disk overwrites, the wear leveling will likely have cycled through all spare blocks a few times.

"Data Remanence"
There's a pretty good set of references regarding possible attacks and their actual feasibility on Wikipedia.
There are DoD and NIST standards and recommendations cited there too.
Bottom line, it's possible but becoming ever harder to recover overwritten data from magnetic media. Nonetheless, some (US-government) standards still require multiple overwrites. Meanwhile, device internals continue to become more complex, and, even after overwriting, a drive or solid-state device may have copies in unexpected places (think about bad block handling or flash wear leveling; see Peter Gutmann). So the truly worried still destroy drives.

What we're looking at here is called "data remanence." In fact, most of the technologies that overwrite repeatedly are (harmlessly) doing more than what's actually necessary. There have been attempts to recover data from disks that have had data overwritten and with the exception of a few lab cases, there are really no examples of such a technique being successful.
When we talk about recovery methods, primarily you will see magnetic force microscopy as the silver bullet to get around a casual overwrite but even this has no recorded successes and can be quashed in any case by writing a good pattern of binary data across the region on your magnetic media (as opposed to simple 0000000000s).
Lastly, the 36 (actually 35) overwrites that you are referring to are recognized as dated and unnecessary today, as the technique (known as the Gutmann method) was designed to accommodate the various - and usually unknown to the user - encoding methods used in technologies like RLL and MFM, which you're not likely to run into anyhow. Even the US government guidelines state that one overwrite is sufficient to delete data, though for administrative purposes they do not consider this acceptable for "sanitization". The suggested reason for this disparity is that "bad" sectors can be marked bad by the disk hardware and not properly overwritten when the time comes to do the overwrite, therefore leaving the possibility open that visual inspection of the disk will be able to recover these regions.
In the end - writing with a 1010101010101010 or fairly random pattern is enough to erase data to the point that known techniques cannot recover it.
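A minimal sketch of a single random-pattern pass over one file, using only the Python standard library. The function name overwrite_once and the demo file are made up for illustration. Whether the new bytes actually land on the same physical sectors depends on the filesystem and the device (journaling, copy-on-write, bad-sector remapping, and the SSD wear leveling mentioned elsewhere in this thread), so treat this as an illustration of the idea and not as a guarantee.

```python
import os
import tempfile

def overwrite_once(path, chunk_size=1 << 20):
    """Overwrite a file's bytes in place with one pass of random data."""
    remaining = os.path.getsize(path)
    # "r+b" opens for update without truncating, so the same length is rewritten.
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

# Demonstration on a throwaway temporary file.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"secret" * 1000)
os.close(fd)
overwrite_once(demo_path)
with open(demo_path, "rb") as f:
    data = f.read()
os.remove(demo_path)
print(len(data), data == b"secret" * 1000)
```

Writing in chunks keeps memory use flat for large files, unlike building one buffer the size of the whole file.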

I've always wondered why the possibility that the file was previously stored in a different physical location on the disk isn't considered.
For example, if a defrag has just occurred there could easily be a copy of the file that's easily recoverable somewhere else on the disk.

Here's a Gutmann erasing implementation I put together. It uses the cryptographic random number generator to produce a strong block of random data.
public static void DeleteGutmann(string fileName)
{
    var fi = new FileInfo(fileName);
    if (!fi.Exists)
    {
        return;
    }
    const int GutmannPasses = 35;
    var gutmanns = new byte[GutmannPasses][];
    for (var i = 0; i < gutmanns.Length; i++)
    {
        if ((i == 14) || (i == 19) || (i == 25) || (i == 26) || (i == 27))
        {
            continue;
        }
        gutmanns[i] = new byte[fi.Length];
    }
    using (var rnd = new RNGCryptoServiceProvider())
    {
        for (var i = 0L; i < 4; i++)
        {
            rnd.GetBytes(gutmanns[i]);
            rnd.GetBytes(gutmanns[31 + i]);
        }
    }
    for (var i = 0L; i < fi.Length;)
    {
        gutmanns[4][i] = 0x55;
        gutmanns[5][i] = 0xAA;
        gutmanns[6][i] = 0x92;
        gutmanns[7][i] = 0x49;
        gutmanns[8][i] = 0x24;
        gutmanns[10][i] = 0x11;
        gutmanns[11][i] = 0x22;
        gutmanns[12][i] = 0x33;
        gutmanns[13][i] = 0x44;
        gutmanns[15][i] = 0x66;
        gutmanns[16][i] = 0x77;
        gutmanns[17][i] = 0x88;
        gutmanns[18][i] = 0x99;
        gutmanns[20][i] = 0xBB;
        gutmanns[21][i] = 0xCC;
        gutmanns[22][i] = 0xDD;
        gutmanns[23][i] = 0xEE;
        gutmanns[24][i] = 0xFF;
        gutmanns[28][i] = 0x6D;
        gutmanns[29][i] = 0xB6;
        gutmanns[30][i++] = 0xDB;
        if (i >= fi.Length)
        {
            continue;
        }
        gutmanns[4][i] = 0x55;
        gutmanns[5][i] = 0xAA;
        gutmanns[6][i] = 0x49;
        gutmanns[7][i] = 0x24;
        gutmanns[8][i] = 0x92;
        gutmanns[10][i] = 0x11;
        gutmanns[11][i] = 0x22;
        gutmanns[12][i] = 0x33;
        gutmanns[13][i] = 0x44;
        gutmanns[15][i] = 0x66;
        gutmanns[16][i] = 0x77;
        gutmanns[17][i] = 0x88;
        gutmanns[18][i] = 0x99;
        gutmanns[20][i] = 0xBB;
        gutmanns[21][i] = 0xCC;
        gutmanns[22][i] = 0xDD;
        gutmanns[23][i] = 0xEE;
        gutmanns[24][i] = 0xFF;
        gutmanns[28][i] = 0xB6;
        gutmanns[29][i] = 0xDB;
        gutmanns[30][i++] = 0x6D;
        if (i >= fi.Length)
        {
            continue;
        }
        gutmanns[4][i] = 0x55;
        gutmanns[5][i] = 0xAA;
        gutmanns[6][i] = 0x24;
        gutmanns[7][i] = 0x92;
        gutmanns[8][i] = 0x49;
        gutmanns[10][i] = 0x11;
        gutmanns[11][i] = 0x22;
        gutmanns[12][i] = 0x33;
        gutmanns[13][i] = 0x44;
        gutmanns[15][i] = 0x66;
        gutmanns[16][i] = 0x77;
        gutmanns[17][i] = 0x88;
        gutmanns[18][i] = 0x99;
        gutmanns[20][i] = 0xBB;
        gutmanns[21][i] = 0xCC;
        gutmanns[22][i] = 0xDD;
        gutmanns[23][i] = 0xEE;
        gutmanns[24][i] = 0xFF;
        gutmanns[28][i] = 0xDB;
        gutmanns[29][i] = 0x6D;
        gutmanns[30][i++] = 0xB6;
    }
    gutmanns[14] = gutmanns[4];
    gutmanns[19] = gutmanns[5];
    gutmanns[25] = gutmanns[6];
    gutmanns[26] = gutmanns[7];
    gutmanns[27] = gutmanns[8];
    Stream s;
    try
    {
        s = new FileStream(
            fi.FullName,
            FileMode.Open,
            FileAccess.Write,
            FileShare.None,
            (int)fi.Length,
            FileOptions.DeleteOnClose | FileOptions.RandomAccess | FileOptions.WriteThrough);
    }
    catch (UnauthorizedAccessException)
    {
        return;
    }
    catch (IOException)
    {
        return;
    }
    using (s)
    {
        if (!s.CanSeek || !s.CanWrite)
        {
            return;
        }
        for (var i = 0L; i < gutmanns.Length; i++)
        {
            s.Seek(0, SeekOrigin.Begin);
            s.Write(gutmanns[i], 0, gutmanns[i].Length);
            s.Flush();
        }
    }
}

There are "disk repair" type applications and services that can still read data off a hard drive even after it's been formatted, so simply overwriting with random 1s and 0s one time isn't sufficient if you really need to securely erase something.
I would say that for the average user, this is more than sufficient, but if you are in a high-security environment (government, military, etc.) then you need a much higher level of "delete" that can pretty effectively guarantee that no data will be recoverable from the drive.

The United States has requirements regarding the erasure of sensitive information (i.e. Top Secret info): the drive is to be destroyed. Basically, the drives were put into a machine with a huge magnet, which would also physically destroy the drive for disposal. This is because there is a possibility of reading information from a drive even after it has been overwritten many times.

See this: Gutmann's paper

Just invert the bits, so that 1s are written over all the 0s and 0s over all the 1s, then zero it all out. That should get rid of any variation in the magnetic field, and it only takes 2 passes.
