Size on disk of a partly filled HDF5 dataset

Size on disk of a partly filled HDF5 dataset - python-3.x

I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:
import numpy as np
import h5py
f = h5py.File("testfile.hdf5")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB
dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB
After filling part (first 1024 entries) of the dataset with values, I expected the file to grow, but not to 4GB. It's essentially the same size as when I do:
dset[...] = np.arange(1024**3)
The book states that the file size on disk should be around 66KB. Could anyone explain what the reason is for the sudden size increase?
Version info:
Python 3.6.1 (OSX)
h5py 2.7.0

If you open your file in HdfView you can see that chunking is off. This means that the array is stored in one contiguous block of memory in the file and cannot be resized. Thus all 4 GB must be allocated in the file.
If you create your data set with chunking enabled, the dataset is divided up into regularly-sized pieces which are stored haphazardly on disk, and indexed using a B-tree. In that case only the chunks that have (at least one element of) data are allocated on disk. If you create your dataset as follows the file will be much smaller:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
The chunks=True lets h5py determine the size of the chunks automatically. You can also set the chunk size explicitly. For example, to set it to 16384 floats (=64 Kb), use:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,) )
The best chunk size depends on the reading and writing patterns of your applications. Note that:
Chunking has performance implications. It’s recommended to keep the
total size of your chunks between 10 KiB and 1 MiB, larger for larger
datasets. Also keep in mind that when any element in a chunk is
accessed, the entire chunk is read from disk.
See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage

Related

NetCDF uses twice the memory when reading part of data. Why? How to rectify it?

I have 2 fairly large datasets (~47GB) stored in a netCDF file. The datasets have three dimensions: time, s, and s1. The first dataset is of shape (3000,2088,1000) and the second is of shape (1566,160000,25). Both datasets are equal in size. The only difference is their shape. Since my RAM size is only 32GB, I am accessing the data in blocks.
For the first dataset, when I read the first ~12GB chunk of data, the code uses almost twice the amount of memory. Whereas, for the second, it uses just the amount of memory as that of the chunk (12GB). Why is this happening? How do I stop the code from using more than what is necessary?
Not using more memory is very important for my code because my algorithm's efficiency hinges on the fact that every line of code uses just enough memory and not more. Also, because of this weird behaviour, my system starts swapping like crazy. I have a linux system, if that information is useful. And I use python 3.7.3 with netCDF 4.6.2
This is how I am accessing the datasets,
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory

Why the Python memory error using shutil.copyfileobj?

I created an in-memory file and then tried to save it as a file:
import pandas as pd
from io import StringIO
# various calculations
with open(s_outfile, "w") as outfile:
# make a header row
outfile.write('npi,NPImatched,lookalike_npi,domain,dist,rank\n')
stream_out = StringIO()
for i in big_iterator
# more calculations, creating dataframe df_info
df_info.to_csv(stream_out, index=False, header=False)
with open(s_outfile, 'a', newline='\n') as file:
stream_out.seek(0)
shutil.copyfileobj(stream_out, file)
stream_out.close()
The point of writing inside the loop to the StringIO object was to to speed up df_info.to_csv(), which worked ok (but less dramatically than I expected). But when I tried to copy the in-memory object to a file with shutil.copyfileobj(), I got MemoryError, with essentially no further information.
It's a large-ish situation; the loop runs about 1M times and the output data should have had a size of about 6GB. This was running on a GCP Linux compute instance with (I think) about 15GB RAM, although of course less than that (and perhaps less than the size of the in-memory data object) was free at the time.
But why would I get a memory error? Isn't shutil.copyfileobj() all about copying incrementally, using memory safely, and avoiding excessive memory consumption? I see now that it has an optional buffer size parameter, but as far as I can see, it defaults to something much smaller than the scale I'm working at with this data.
Would you expect the error to be avoided if I simply set the buffer size to something moderate like 64KB? Is my whole approach wrong-headed? It takes long enough to get the in-memory data established that I can't test things willy-nilly. Thanks in advance.

is python pandas, python Dask faster than julia CSV in the first time loading? [duplicate]

reading large text / csv files in Julia takes a long time compared to Python. Here are the times to read a file whose size is 486.6 MB and has 153895 rows and 644 columns.
python 3.3 example
import pandas as pd
import time
start=time.time()
myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
print(time.time()-start)
Output: 19.90
R 3.0.2 example
system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
stringsAsFactors=F,na.strings=""))
Output:
User System Elapsed
181.13 1.07 182.32
Julia 0.2.0 (Julia Studio 0.4.4) example # 1
using DataFrames
timing = #time myData = readtable("C:/myFile.txt",separator='|',header=false)
Output:
elapsed time: 80.35 seconds (10319624244 bytes allocated)
Julia 0.2.0 (Julia Studio 0.4.4) example # 2
timing = #time myData = readdlm("C:/myFile.txt",'|',header=false)
Output:
elapsed time: 65.96 seconds (9087413564 bytes allocated)
Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?
a separate issue is the size in memory is 18 x size of hard disk file size in Julia, but only 2.5 x size for python. in Matlab, which I have found to be most memory efficient for large files, it is 2 x size of hard disk file size. Any particular reason for the large file size in memory in Julia?

The best answer is probably that I'm not as a good a programmer as Wes.
In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.
As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.

There is a relatively new julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with pandas: https://github.com/JuliaData/CSV.jl

Note that the "n bytes allocated" output from #time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.

I've found a few things that can partially help this situation.
using the readdlm() function in Julia seems to work considerably faster (e.g. 3x on a recent trial) than readtable(). Of course, if you want the DataFrame object type, you'll then need to convert to it, which may eat up most or all of the speed improvement.
Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:
julia> #time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
julia> #time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
The type specification for your object matters a lot. For instance, if your data has strings in it, then the data of the array that you read in will be of type Any, which is expensive memory wise. If memory is really an issue, you may want to consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:
Data = readdlm("file.csv", ',', Float32)
Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).
It can also sometimes help to run the gc() (garbage collector) function after loading and formatting data to clear out any unneeded ram allocation, though generally Julia will do this automatically pretty well.
Still though, despite all of this, I'll be looking forward to further developments on Julia to enable faster loading and more efficient memory usage for large data sets.

Let us first create a file you are talking about to provide reproducibility:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
Now I read this file in in Julia 1.4.2 and CSV.jl 0.7.1.
Single threaded:
julia> #time CSV.File("myFile.txt", delim='|', header=false);
4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)
julia> #time CSV.File("myFile.txt", delim='|', header=false);
2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)
and using e.g. 4 threads:
julia> #time CSV.File("myFile.txt", delim='|', header=false);
4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)
julia> #time CSV.File("myFile.txt", delim='|', header=false);
0.812742 seconds (47.28 k allocations: 1.208 GiB)
In R it is:
> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+ stringsAsFactors=F,na.strings=""))
user system elapsed
28.615 0.436 29.048
In Python (Pandas) it is:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526
Now if we test fread from R (which is fast) we get:
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=1))
user system elapsed
1.043 0.036 1.082
> system.time(fread("myFile.txt", sep="|", header=F,
stringsAsFactors=F, na.strings="", nThread=4))
user system elapsed
1.361 0.028 0.416
So in this case the summary is:
despite the cost of compilation of CSV.File in Julia when you run it for the first time it is significantly faster than base R or Python
it is comparable in speed to fread in R (in this case slightly slower, but other benchmark made here shows cases when it is faster)
EDIT: Following the request I have added a benchmark for a small file: 10 columns, 100,000 rows Julia vs Pandas.
Data preparation step:
open("myFile.txt", "w") do io
foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end
CSV.jl, single threaded:
julia> #time CSV.File("myFile.txt", delim='|', header=false);
1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)
julia> #time CSV.File("myFile.txt", delim='|', header=false);
0.029965 seconds (248 allocations: 17.037 MiB)
Pandas:
>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406
Conclusions:
the compilation cost is a one-time cost that has to be paid and it is constant (roughly it does not depend on how big is the file you want to read in)
for small files CSV.jl is faster than Pandas (if we exclude compilation cost)
Now, if you would like to avoid having to pay compilation cost on every fresh Julia session this is doable with https://github.com/JuliaLang/PackageCompiler.jl.
From my experience, if you are doing data science work, where e.g. you read-in thousands of CSV files, I do not have a problem with waiting 2 seconds for the compilation, if later I can save hours. It takes more than 2 seconds to write the code that reads in the files.
Of course - if you write a script that does little work and terminates after it is done then it is a different use case as compilation time would be a majority of computational cost actually. In this case using PackageCompiler.jl is a strategy I use.

In my experience, the best way to deal with larger text files is not load them up into Julia, but rather to stream them. This method has some additional fixed costs, but generally runs extremely quickly. Some pseudo code is this:
function streamdat()
mycsv=open("/path/to/text.csv", "r") # <-- opens a path to your text file
sumvec = [0.0] # <-- store a sum here
i = 1
while(!eof(mycsv)) # <-- loop through each line of the file
row = readline(mycsv)
vector=split(row, "|") # <-- split each line by |
sumvec+=parse(Float64, vector[i])
i+=1
end
end
streamdat()
The code above is just a simple sum, but this logic can be expanded to more complex problems.

using CSV
#time df=CSV.read("C:/Users/hafez/personal/r/tutorial for students/Book2.csv")
recently I tried in Julia 1.4.2. I found different response and at first, I didn't understand Julia. then I posted the same thing in the Julia discussion forums. then I understood that this code will provide only compile time. here you can find benchmark

MemoryError with large data frame in deep learning

Preamble
Hi all,
I'm trying to make a geometric deep learning model using StellarGraph package. With smaller data set, it works well, but unfortunately it's not scalable to a larger data set. Information on machine, environment, used data and resulting error presented as follow.
Machine specification:
CPU: Intel core i5-8350U
RAM: 8GB DDR4
SWAP: 4 GB + 4 GB (Divided into two swapfiles in different SSD)
SSD: 250 GB + 250 GB (2280 and 2242 NVMe)
Environment:
Linux 5.3.11_1 64-bit
Python 3.6.9
Used data (size acquired from sys.getsizeof()):
Sparse block diagonal matrix (shape: 158,950 x 158,950; size: 56)
Dense feature matrix (shape: 158,950 x 14,450; size: 9,537,152)
Modules:
networkx 2.3
numpy 1.15.4
pandas 0.25.3
scipy 1.1.0
scikit-learn 0.21.3
stellargraph 0.8.2
tensorflow 1.14.0
Problem description
I aim to create a geometric deep learning to categorize subject based on adjacency matrices acquired from resting state functional MRI. Adjacency matrix assumes 55 region of interest, resulting in 55x55 matrices for all subjects. In constructing the deep learning model, I used spectral graph convolutional network model from StellarGraph, which take a graph object and nodal feature as its input. I created the graph object from sparse block diagonal matrix obtained by combining adjacency matrices from all subjects. While nodal feature is the characteristic of each node (1 node has 5 characteristic values), constructed into dense block diagonal matrix.
Previously, I made the model using a subset of population sample (around 170). It ran perfectly, and I thought I'd be able to do the same using larger data set. Unfortunately, using the same code I got a MemoryError when registering the StellarGraph object. Code and error presented on following section.
Code and error
# Data parsing with scipy.io as sio and pandas as pd
data = sio.mmread('_data/sparse.mtx')
feature = sio.mmread('_data/sparse-feature.mtx')
feature = pd.DataFrame.sparse.from_spmatrix(feature)
# Create graph object using networkx as nx
g = nx.from_scipy_sparse_matrix(data)
# Create StellarGraph object and its generator
gs = StellarGraph(g, node_features=feature) # MemoryError
generator = FullBatchNodeGenerator(gs)
I'm sorry for not providing sparse.mtx and sparse-feature.mtx file due to confidentiality reason, but I hope previous description on data shape and size may help you to understand its construct. Using above code, python gave me following error:
>>> gs = StellarGraph(g, node_features=feature) # MemoryError
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 786, in __init__
super().__init__(incoming_graph_data, **attr)
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 381, in __init__
node_features, type_for_node, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 216, in _convert_from_node_data
{node_type: data}, node_type_map, node_types, dtype
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 5443, in values
return self._data.as_array(transpose=self._AXIS_REVERSED)
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 822, in as_array
arr = mgr._interleave()
File "/home/lam/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 840, in _interleave
result = np.empty(self.shape, dtype=dtype)
MemoryError
While monitoring memory consumption, I observed that the RAM only used up to 55% of its total capacity, and the swap was not used at all. While running the code, I only used TTY + tmux with only vim, top and python session running. Moreover, I also made sure no other memory-hogging processes running in the background. So I'm certain the memory bottleneck is most likely caused by python.
What I have tried
To leverage the memory consumption, I tried to use dask in managing the dense feature data frame. Unfortunately, StellarGraph function can only have pandas array, pandas data frame, dictionary, tuple, or other iterable as its input.
Other than dask, I also tried using sparse matrix (since almost 80% of my data set is zero-valued anyways). However, it gave me TypeError since StellarGraph could not have sparse matrix as its node_features.
I've also read several solutions in managing large data set, which (mostly) suggest iteratively parsing the data into python session. However, I couldn'tgr find any documentation in StellarGraph on such method.
The other option would be using computer with better hardware, which to my regret, I couldn't do due to limited funding. I'm a student and couldn't afford buying better machines for now.
Potential solution
Upgrading the RAM. I'll try salvaging RAM from other computers, but current max size I have would be 16 GB. I'm not sure it will be enough.
Use smaller chunk of feature data set. I managed to go by this solution, but the model's accuracy was really bad (50-ish %).
Questions
Why python only use 55% of my total RAM without dynamic swap allocation?
How should I effectively manage large data frame?
How do I handle MemoryError when creating a StellarGraph object?
How much RAM do I actually need? Would 32GB suffice?

Python works fine. It's a implementation problem caused by StellarGraph.
I think StellarGraph so far doesn't support huge matrix.
File "/home/lam/.local/lib/python3.6/site-packages/stellargraph/core/graph.py", line 182, in _convert_from_node_data
data_arr = arr.values.astype(dtype)
From the beginning of your code to here, all of your data is stored as sparse array, which doesn't take too much memory. Here, arr is supposed to be a DataFrame with columns as pandas.SparseArray. This line of code converts the data structure to normal numpy array, which crash the memory usage.
import numpy as np
a = np.empty((158950,14450),float)
print(a.nbytes/2**30)
17.112698405981064
An empty numpy array here actually takes 17 G memory. I can initialize 3 array like that one on my 16 G computer. Then I get memory error if I try to get more than 3. And I can't initialize a 158,950 x 158,950 numpy array.

reading a 20gb csv file in python

I am trying to read a 20 gb file in python from a remote path. The below code reads the file in chunks but if for any reason the connection to remote path is lost, i have to restart the entire process of reading. Is there a way I can continue from my last read row and keep appending to the list that I am trying to create. Here is my code:
from tqdm import tqdm
chunksize=100000
df_list = [] # list to hold the batch dataframe
for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
df_list.append(df_chunk)
train_df = pd.concat(df_list)

Do you have much more than 20GB RAM? Because you're reading the entire file into RAM, and represent it as Python objects. That df_list.append(df_chunk) is the culprit.
What you need to is:
read it by smaller pieces (you already do);
process it piece by piece;
discard the old piece after processing. Python's garbage collection will do it for you unless you keep a reference to the spent chunk, as you currently do in df_list.
Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
Or get 64GB / 128GB RAM, whichever is faster for you. Sometimes just throwing more resources at a problem is faster.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string