Reading a 20 GB CSV file in Python (python-3.x)

I am trying to read a 20 GB file in Python from a remote path. The code below reads the file in chunks, but if the connection to the remote path is lost for any reason, I have to restart the entire read. Is there a way to continue from the last row read and keep appending to the list I am trying to build? Here is my code:
import pandas as pd
from tqdm import tqdm

chunksize = 100000
df_list = []  # list to hold the batch dataframes
for df_chunk in tqdm(pd.read_csv(pathtofile, chunksize=chunksize, engine='python')):
    df_list.append(df_chunk)
train_df = pd.concat(df_list)

Do you have much more than 20GB of RAM? Because you're reading the entire file into RAM and representing it as Python objects. That df_list.append(df_chunk) is the culprit.
What you need to do is:
read it by smaller pieces (you already do);
process it piece by piece;
discard the old piece after processing. Python's garbage collection will do this for you, unless you keep a reference to the spent chunk, as you currently do in df_list.
Note that you can keep the intermediate / summary data in RAM the whole time. Just don't keep the entire input in RAM the whole time.
Or get 64GB / 128GB RAM, whichever is faster for you. Sometimes just throwing more resources at a problem is faster.
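The process-and-discard pattern above can be sketched as follows. The column name "value", the per-chunk aggregation, and the tiny sample file are illustrative assumptions, not from the question:

```python
import pandas as pd

# tiny stand-in for the 20 GB CSV, just so the sketch is runnable
pd.DataFrame({"value": range(10)}).to_csv("sample.csv", index=False)

chunksize = 4  # use something like 100_000 for a real file
total_rows = 0
running_sum = 0

for df_chunk in pd.read_csv("sample.csv", chunksize=chunksize):
    # keep only per-chunk summaries; the chunk itself is freed on the
    # next iteration because nothing holds a reference to it
    total_rows += len(df_chunk)
    running_sum += int(df_chunk["value"].sum())

print(total_rows, running_sum)  # 10 45
```

If you also want to resume after a dropped connection, pd.read_csv's skiprows parameter lets you skip the rows already processed when you restart.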

Related

NetCDF uses twice the memory when reading part of data. Why? How to rectify it?

I have 2 fairly large datasets (~47GB) stored in a netCDF file. The datasets have three dimensions: time, s, and s1. The first dataset is of shape (3000,2088,1000) and the second is of shape (1566,160000,25). Both datasets are equal in size. The only difference is their shape. Since my RAM size is only 32GB, I am accessing the data in blocks.
For the first dataset, when I read the first ~12GB chunk of data, the code uses almost twice that amount of memory. For the second, it uses only as much memory as the chunk itself (12GB). Why is this happening? How do I stop the code from using more memory than necessary?
Not using more memory is very important for my code, because my algorithm's efficiency hinges on every line of code using just enough memory and no more. Also, because of this weird behaviour, my system starts swapping like crazy. I am on a Linux system, if that information is useful, using Python 3.7.3 with netCDF 4.6.2.
This is how I am accessing the datasets,
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory

Why the Python memory error using shutil.copyfileobj?

I created an in-memory file and then tried to save it as a file:
import shutil
import pandas as pd
from io import StringIO

# various calculations
with open(s_outfile, "w") as outfile:
    # make a header row
    outfile.write('npi,NPImatched,lookalike_npi,domain,dist,rank\n')

stream_out = StringIO()
for i in big_iterator:
    # more calculations, creating dataframe df_info
    df_info.to_csv(stream_out, index=False, header=False)

with open(s_outfile, 'a', newline='\n') as file:
    stream_out.seek(0)
    shutil.copyfileobj(stream_out, file)
stream_out.close()
The point of writing to the StringIO object inside the loop was to speed up df_info.to_csv(), which worked (though less dramatically than I expected). But when I tried to copy the in-memory object to a file with shutil.copyfileobj(), I got a MemoryError with essentially no further information.
It's a large-ish situation; the loop runs about 1M times and the output data should have had a size of about 6GB. This was running on a GCP Linux compute instance with (I think) about 15GB RAM, although of course less than that (and perhaps less than the size of the in-memory data object) was free at the time.
But why would I get a memory error? Isn't shutil.copyfileobj() all about copying incrementally, using memory safely, and avoiding excessive memory consumption? I see now that it has an optional buffer-size parameter, but as far as I can tell it defaults to something much smaller than the scale I'm working at with this data.
Would you expect the error to be avoided if I simply set the buffer size to something moderate like 64KB? Is my whole approach wrong-headed? It takes long enough to get the in-memory data established that I can't test things willy-nilly. Thanks in advance.
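For what it's worth, copyfileobj does copy incrementally; here is a minimal demonstration of its length (buffer-size) parameter, with a made-up file name and a 1 MB payload. Note that with a StringIO source the full 6 GB is already in memory before the copy even starts, so the copy's buffer size alone cannot be the whole story:

```python
import os
import shutil
from io import StringIO

src = StringIO("x" * 1_000_000)  # a 1 MB in-memory "file"
src.seek(0)

with open("copied.txt", "w") as dst:
    # copy in 64 KB pieces; copyfileobj holds only one buffer at a time
    shutil.copyfileobj(src, dst, length=64 * 1024)

print(os.path.getsize("copied.txt"))  # 1000000
```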

How to reduce time taken by to convert dask dataframe to pandas dataframe

I have a function that reads large CSV files using a dask dataframe and then converts the result to a pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob
import dask.dataframe as dd

def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]
Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]
P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()
P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disk interface, a bottleneck that likely performs much better reading sequentially than trying to read several files in parallel;
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually run in parallel;
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know which file/partition it is in;
you call compute twice, doing the work twice, even though Dask works hard to evict data from memory that is not currently needed by any computation. Calling compute once on both outputs would roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, don't bother loading everything and then subselecting; pass those columns right into the read_csv call (via usecols), saving a lot of time and memory (true for pandas or Dask).
To compute both lazy results at once:
P1MI3, P1MJC_old = dask.compute(P1MI3, P1MJC_main)

Size on disk of a partly filled HDF5 dataset

I'm reading the book Python and HDF5 (O'Reilly) which has a section on empty datasets and the size they take on disk:
import numpy as np
import h5py

f = h5py.File("testfile.hdf5", "w")
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)
f.flush()
# Size on disk is 1KB
dset[0:1024] = np.arange(1024)
f.flush()
# Size on disk is 4GB
After filling part (the first 1024 entries) of the dataset with values, I expected the file to grow, but not to 4GB. It's essentially the same size as when I do:
dset[...] = np.arange(1024**3)
The book states that the file size on disk should be around 66KB. Could anyone explain what the reason is for the sudden size increase?
Version info:
Python 3.6.1 (OSX)
h5py 2.7.0
If you open your file in HDFView you can see that chunking is disabled. This means the array is stored in one contiguous block in the file and cannot be resized, so all 4 GB must be allocated in the file.
If you create your dataset with chunking enabled, it is divided into regularly-sized pieces that are stored at arbitrary locations on disk and indexed using a B-tree. In that case, only the chunks that contain (at least one element of) data are allocated on disk. If you create your dataset as follows, the file will be much smaller:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=True)
The chunks=True lets h5py determine the chunk size automatically. You can also set the chunk size explicitly. For example, to set it to 16384 floats (= 64 KiB), use:
dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32, chunks=(2**14,) )
The best chunk size depends on the reading and writing patterns of your application. Note that:
Chunking has performance implications. It's recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
See http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage
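A small-scale version of the contrast can be seen directly; this uses 1M elements instead of the book's 1G so it runs quickly, and the file names are arbitrary:

```python
import os
import numpy as np
import h5py

N = 1024**2  # 1M float32 values = 4 MB of data

# contiguous layout: writing any element allocates the whole block
with h5py.File("contiguous.hdf5", "w") as f:
    dset = f.create_dataset("big dataset", (N,), dtype=np.float32)
    dset[0:1024] = np.arange(1024)

# chunked layout: only chunks that hold data are allocated
with h5py.File("chunked.hdf5", "w") as f:
    dset = f.create_dataset("big dataset", (N,), dtype=np.float32, chunks=True)
    dset[0:1024] = np.arange(1024)

# the contiguous file is roughly the full 4 MB; the chunked one is much smaller
print(os.path.getsize("contiguous.hdf5"), os.path.getsize("chunked.hdf5"))
```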

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried sort -R, but it is very slow (it takes about 50 minutes for a 16M-line file). Is there a faster utility I can use in its place?
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line. shuf just does a random permutation so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file of random numbers, one per line, with the same number of lines as the file to be sorted. This can be done at any time: run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GB in a reasonable time.
Here are the details on the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM when writing to the output. The bigger the better (unless you are out of RAM), because total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count lines in sourceFile. This is done simply by reading the whole file line-by-line. (See some comparisons here.) This also measures how much time it takes to read the whole file once, so we can estimate how long a complete shuffle would take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This would give us an order in which we want to have lines in a shuffled file. Note that this is a global order over the whole file, not per batch or chunk or something.
Now the actual shuffling. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we split the task.
We go through sourceFile reading all lines, storing in memory only those lines whose target position falls in the first batchSize entries of orderArray. Once we have them all, we write them to outFile in the required order, and batchSize/linesCount of the work is done.
We then repeat the whole process, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
Why it works?
Because all we do is read the source file from start to end. No seeking forwards or backwards, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a 132 GB file (84 000 000 lines) in around 5 hours using a batchSize of 3 500 000. With a batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118 000 lines per second.