I am trying to load this CSV file into a pandas data frame using
import pandas as pd
filename = '2016-2018_wave-IV.csv'
df = pd.read_csv(filename)
However, despite my PC not being particularly slow (8 GB RAM, 64-bit Python) and the file being somewhat but not extraordinarily large (< 33 MB), loading the file takes more than 10 minutes. It is my understanding that this shouldn't take nearly that long, and I would like to figure out what's behind it.
(As suggested in similar questions, I have tried using the chunksize and usecols parameters (EDIT: and also low_memory), yet without success; so I believe this is not a duplicate but has more to do with the file or the setup.)
Could someone give me a pointer? Many thanks. :)
I was testing the file you shared and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks the whole line is one column). They have to be removed before processing, for example by using sed on Linux, by processing and re-saving the file in Python, or by replacing all the double quotes in a text editor.
To summarize and expand the answer by @Hubert Dudek:
The issue was with the file; not only did it include "s at the start of every line but also within the lines themselves. After I fixed the former, the latter caused the column attribution to be messed up.
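In case it helps someone else, here is a minimal clean-up sketch in Python (it assumes the inner double quotes carry no meaning and can simply be dropped):

import io
import pandas as pd

filename = '2016-2018_wave-IV.csv'

with open(filename) as f:
    # strip the wrapping quote on each line, then drop the stray inner quotes
    cleaned = '\n'.join(line.rstrip('\n').strip('"').replace('"', '') for line in f)

df = pd.read_csv(io.StringIO(cleaned))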
I'm working on a log file reader which will parse files and display fields from lines in a nicely formatted table in a node/electron app.
If the files were small, I could just read them line by line, parse them, store the fields extracted from each line in a data structure, and allow clients to scroll back and forth throughout the whole file.
Since these files can be several gigabytes long, I need to do something more complex.
My current thinking is to either:
Read the whole file via the readline package.
Keep track of line ending offsets
Once the file is read, parse the bottom (hence most recent) 50 or so lines so I can extract relevant data and display visually
If the client wants to scroll beyond those 50 lines, use the offsets to go to previous lines (via fs.read(...)).
Another method (roughly sketched after this list):
Use fs.read() to go straight to the end
Work backwards until newline characters are found
If the client wants to scroll around the file, figure out the line offsets on demand
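Not an answer on the library front, but here is a rough sketch of that second approach. The logic is language-agnostic; it is shown in Python purely to keep it short, and the same seek-backwards-and-scan idea maps onto fs.read() in node:

import os

def read_last_lines(path, want=50, chunk_size=64 * 1024):
    # Work backwards from the end of the file in fixed-size chunks until we
    # have seen enough newline characters (assumes b'\n' line endings).
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b''
        while pos > 0 and buf.count(b'\n') <= want:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf
    # keep only the last `want` lines; earlier offsets can be recovered the same way
    lines = buf.split(b'\n')[-want:]
    return [line.decode('utf-8', errors='replace') for line in lines]

# hypothetical usage: show the 50 most recent lines of a log file
# for line in read_last_lines('app.log'):
#     print(line)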
This doesn't even take into account building tail -f style functionality.
I'll have to account for at least ASCII and UTF-8 encodings, as well as Windows vs Linux style line endings.
This is a LOT of low-level work.
Is there a library which already provides this functionality?
If I do this myself, are there any major caveats I haven't mentioned here? I haven't done low-level, random-access programming for 20 years.
I am trying to read a large number of files (20k+) from my computer using Python, but I keep getting a MemoryError (details below), even though I have 16 GB of RAM, of which 8 GB or more is free at all times, and the combined size of all the files is just 270 MB. I have tried many different solutions, such as pandas read_csv(), reading in chunks using open(file_path).read(100), and many others, but I am unable to read the files. I have to create a corpus of words after reading the files into the list. Below is my code so far. Any help will be highly appreciated.
import os
import pandas as pd
collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)
def wordList(file):
    list_of_words_from_file = []
    for line in file:
        for word in line.split():
            list_of_words_from_file.append(word)
    return list_of_words_from_file

list_of_file_word_Lists = {}
file = []
for file_name in listOfFilesInCollection:
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)
print(list_of_file_word_Lists)
The error that I get
Traceback (most recent call last):
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 25, in <module>
    list_of_file_word_Lists[file_name] = wordList(file)
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 14, in wordList
    list_of_words_from_file.append(word)
MemoryError
You probably want to move the file=[] to the beginning of the loop, because you're currently adding the lines of each new file you open without having first removed the lines of all the previous files.
Then, there are very likely more efficient approaches depending on what you're trying to achieve. If the order of words doesn't matter, then maybe using a dict or a collections.Counter instead of a list can help to avoid duplication of identical strings. If neither the order nor the frequency of words matters, then maybe using a set instead is going to be even better.
Finally, since it's likely you'll find most words in several files, try to store each of them only once in memory. That way, you'll be able to scale way higher than a mere 20k files: there's plenty of space in 16 GiB of RAM.
Keep in mind that Python has lots of fixed overheads and hidden costs: inefficient data structures can cost way more than you would expect.
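A minimal sketch of that idea, assuming the order of words doesn't matter: one collections.Counter per file, so each distinct word is stored only once per file together with its frequency.

import os
from collections import Counter

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"

word_counts_per_file = {}
for file_name in os.listdir(collectionPath):
    file_path = os.path.join(collectionPath, file_name)
    with open(file_path) as f:
        # Counter(...) deduplicates identical words and keeps their counts
        word_counts_per_file[file_name] = Counter(f.read().split())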
It is hard to tell why your memory problems arise without knowing the content of your files. Maybe it is enough to make your code more efficient. For example, the split() function can handle multiple lines itself, so you don't need a loop for that. And using a comprehension is always a good idea in Python.
The following code should return what you want, and I don't see a reason why you should run out of memory using it. Besides that, Arkanosis' hint about the importance of data types is very valid. It depends on what you want to achieve with all those words.
from pathlib import Path

def word_list_from_file(path):
    with open(path, 'rt') as f:
        list_words = f.read().split()
    return list_words

path_dir = Path(r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt")

dict_file_content = {
    str(path.name): word_list_from_file(path)
    for path in path_dir.rglob("*.txt")
}
P.S.: I'm not sure how the pathlib module behaves on Windows, but from what I read, this code snippet is platform-independent.
I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparator itself was the issue. In most non-typed languages, memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, creating a buffer from the read and comparing the result with a previously prepared NUL-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things that need to be taken into account are:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit)
The last buffer may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
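Putting it together, a minimal end-to-end sketch (untested; FILENAME, START_OFFSET and the chunk size are placeholders):

CHUNK = 64 * 1024 * 1024            # 64 MiB per read; tune to taste
data = bytearray(CHUNK)             # reusable read buffer
zeros = bytearray(CHUNK)            # reference buffer that stays all NUL

f = open(FILENAME, 'rb')            # FILENAME is a placeholder
f.seek(START_OFFSET)                # placeholder: last position known to hold data
offset = START_OFFSET
while True:
    n = f.readinto(data)            # read directly into the existing buffer
    if n == 0:                      # end of file reached, everything was NUL
        break
    if data[:n] != zeros[:n]:       # byte comparison; cheap next to the disk I/O
        print('non-NUL data found after offset %d' % offset)
        break
    offset += n
f.close()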
I'm downloading daily 600MB netcdf-4 files that have this structure:
netcdf myfile {
dimensions:
        time_counter = 18 ;
        depth = 50 ;
        latitude = 361 ;
        longitude = 601 ;
variables:
        salinity
        temp, etc
I'm looking for a better way to convert the time_counter dimension from a fixed size (18) to an unlimited dimension.
I found a way of doing it with the netcdf commands and sed. Like this:
ncdump myfile.nc | sed -e "s#^.time_counter = 18 ;#time_counter = UNLIMITED ; // (currently 18)#" | ncgen -o myfileunlimited.nc
which worked for me for small files, but when dumping a 600 MB netcdf file, it takes too much memory and time.
Does somebody know another method for accomplishing this?
Your answers are very insightful. I'm not really looking for a way to improve this ncdump-sed-ncgen method; I know that dumping a 600 MB netcdf file uses almost 5 times more space as a text file (CDL representation). Then modifying some header text and generating the netcdf file again doesn't feel very efficient.
I read the latest NCO commands documentation and found an option specific to ncks, "--mk_rec_dmn". Ncks mainly extracts and writes or appends data to a new netcdf file, so this seems the better approach: extract all the data of myfile.nc and write it with a new record dimension (unlimited dimension), which "--mk_rec_dmn" does, then replace the old file.
ncks --mk_rec_dmn time_counter myfile.nc -o myfileunlimited.nc ; mv myfileunlimited.nc myfile.nc
The opposite operation (record dimension to fixed-size) would be:
ncks --fix_rec_dmn time_counter myfile.nc -o myfilefixedsize.nc ; mv myfilefixedsize.nc myfile.nc
The shell pipeline can only be marginally improved by making the sed step only modify the beginning of the file and pass everything else through, but the expression you have is very cheap to process and will not make a dent in the time spent.
The core problem is likely that you're spending a lot of time in ncdump formatting the file information into textual data, and in ncgen parsing textual data into a NetCDF file format again.
As the route through dump+gen is about as slow as it is shown, that leaves using NetCDF functionality to do the conversion of your data files.
If you're lucky, there may be tools that operate directly on your data files to do changes or conversions. If not, you may have to write them yourself with the NetCDF libraries.
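For the write-it-yourself route, a rough sketch with the netCDF4 Python package (untested; it copies myfile.nc into a new file while recreating time_counter as the record dimension, and it loads each variable fully into memory, which should be fine for ~600 MB files):

from netCDF4 import Dataset

with Dataset('myfile.nc') as src, Dataset('myfileunlimited.nc', 'w') as dst:
    # copy global attributes
    dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
    # copy dimensions, recreating time_counter as the unlimited (record) dimension
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if name == 'time_counter' else len(dim))
    # copy variables, their attributes and their data
    for name, var in src.variables.items():
        fill = getattr(var, '_FillValue', None)
        out = dst.createVariable(name, var.datatype, var.dimensions, fill_value=fill)
        out.setncatts({k: var.getncattr(k) for k in var.ncattrs() if k != '_FillValue'})
        out[:] = var[:]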
If you're extremely unlucky, you'll have to go a level lower: NetCDF-4 files are HDF5 files with some extra metadata. In particular, the length of the dimensions is stored in the _netcdf_dim_info dataset in group _netCDF (or so the documentation tells me).
It may be possible to modify the information there to turn the current length of the time_counter dimension into the value for UNLIMITED (which is the number 0), but if you do this, you really need to verify the integrity of the resulting file, as the documentation neatly puts it:
"Note that modifying these files with HDF5 will almost certainly make them unreadable to netCDF-4."
As a side note, if this process is important to your group, it may be worth looking into what hardware could do the task faster. On my Bulldozer system, the process of converting a 78 megabyte file takes 20 seconds, using around 500 MB memory for ncgen working set (1 GB virtual) and 12 MB memory for ncdump working set (111 MB virtual), each task taking up the better part of a core.
Any decent disk should read/sink your files in 10 seconds or so, memory doesn't matter as long as you don't swap, so CPU is probably your primary concern if you take the dump+gen route.
If concurrent memory use is a big concern, you can trade disk space for memory by saving the intermediary result from sed onto disk, which will likely take up to 1.5 gigabytes or so.
You can use the xarray python package's to_netcdf() method, then optimise memory usage by using Dask.
You just need to pass the names of the dimensions to make unlimited to the unlimited_dims argument and use chunks to split the data. For instance:
import xarray as xr
ds = xr.open_dataset('myfile.nc', chunks={'time_counter': 18})
ds.to_netcdf('myfileunlimited.nc', unlimited_dims={'time_counter':True})
There is a nice summary of combining Dask and xarray linked here.
I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable that's used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler (Cygwin/mingw32) uses, or get it to have fopen64?
The ftell() function returns a long, which on 32-bit systems is only 32 bits wide and cannot represent offsets beyond about 2 GB. So you can't get the file offset for a 24 GB file to fit into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.
You might try using the OS provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64bit value.
Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.
I don't know of any way to do this in one file. It's a bit of a hack, but if splitting the file up properly isn't a real option, you could write a few functions that temporarily split the file: one that uses ftell() to move through the file and switches to a new file when it reaches the split point, and another that stitches the files back together before exiting. An absolutely botched-up approach, but if no better solution comes to light, it could be a way to get the job done.
I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, _lseeki64, read, write. And I am able to write and seek in > 4 GB files.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty to anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?
Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.