We have a csv file that is being used like a database, and an ETL script that takes input Excel files and transforms them into the same format to append to the csv file.
The script reads the csv file into a dataframe and appends the new input dataframe to the end, and then uses to_csv to overwrite the old csv file.
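For reference, a minimal sketch of the pattern described (file names and options are assumptions, not the actual script):

import pandas as pd

# hypothetical file names; read the existing "database", append the new rows,
# then overwrite the csv with the combined frame
existing = pd.read_csv('database.csv')
new_rows = pd.read_excel('input.xlsx')
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_csv('database.csv', index=False)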
The problem is, when we updated to a new version of Python (downloaded with Anaconda), the output csv file grows larger and larger every time we append data to it. The more lines in the original csv that the script reads (and writes back out with the newly appended data), the more the output file size is inflated. The actual number of rows and the data in the csv files are fine; it's just the file size itself that is unusually large.
Does anyone know if updating to a new version of Python could have broken this process?
Is Python storing data in the csv file that we cannot see?
Any ideas or help is appreciated! Thank you.
I have a .csv file that has around 2 million rows, and I want to add a new column. Problem is, I could only manage to do that by losing a lot of data (basically everything above ~1.1m rows). When I used a connection to the external file (so that I could read all rows) and made changes to it in Power Query, the changes were not saved to the .csv file.
You can apply one of several solutions:
Using a text editor which can handle huge files, split the csv file into smaller chunks. Apply the modifications to each chunk. Join the chunks again to get the desired file.
Create a "small" program yourself, which loads the csv line by line and applies the modification, adding the resulting data to a second file.
Maybe some other software can handle a csv of that size. You could even patch LibreOffice for this purpose so it handles 2,000,000+ lines - the source code is available :)
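For the second option, a minimal line-by-line sketch (the file names and the value written into the new column are assumptions):

import csv

# hypothetical file names; each row is processed and written immediately,
# so the ~2 million rows are never held in memory at once
with open('data.csv', newline='') as src, \
     open('data_with_column.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ['new_column'])
    for row in reader:
        writer.writerow(row + [''])  # fill in the new column's value as needed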
I'm writing a piece of code for my thesis which collects a series of .csv files that represent temperature images. I want to make the use of these .csv files more efficient by storing them in a dataframe.
I have multiple videos of 10,000+ frames, and each video should be stored in a separate dataframe. I made code that works for lower numbers of files, however when running it on a bunch of videos (let's say 10), it crashes after a couple of them with a MemoryError. I already tried calling gc.collect() and deleting the dataframe after df.to_hdf, to prevent Python from keeping lists in memory that have already been created. My memory usage still keeps increasing until it completely fills up the RAM, and it crashes.
The relevant piece of the code is added here:
import gc
import pandas as pd

dfs = {}
# read every csv of this chunk into its own dataframe, keyed by its (zero-padded) file name
for l in range(k * number_of_csv, len(i)):
    df = pd.read_csv(i[l], sep=',', header=None)
    dfs['{:0>8}'.format(i[l])] = df
# concatenate the chunk into one dataframe and write it to an HDF5 file
dfs = pd.concat(dfs)
dfs.to_hdf('experiment_data_series_' + str(k) + '.h5', key='raw', mode='w')
del dfs
gc.collect()
In short: it builds a dataframe from all the csv files and then stores it in the h5 file.
Can someone see what is missing to prevent this from consuming too much memory?
I already inserted a chunking procedure, so that the number of .csv files stored in a single h5 is always <20000 (usually the h5 file has a size of 3-4GB).
I suspect that Python allocates some memory for the storing operation but doesn't release it afterwards, or something like that.
I'd appreciate the help.
kind regards,
Cedric
Since you have stated that each video should be stored in a separate dataframe, I don't think you need to concatenate them all into dfs. Simply read in each .csv file and then write it to your HDFStore under its own key; think of the store as a dict of DataFrame objects.
Consider the following code:
import pandas as pd

# create an HDF5 file which will contain all your data;
# this statement automatically opens the h5 file for you
hdf = pd.HDFStore('experiment_data_series.h5')

for l in range(k * number_of_csv, len(i)):
    df = pd.read_csv(i[l], sep=',', header=None)
    # store this dataframe in the HDF5 file under its own key
    # (use l rather than k so each csv gets a distinct key instead of overwriting the same one)
    hdf['raw_{}'.format(l)] = df

# when you are finished writing to the HDF5 file remember to close it
hdf.close()
Later on, you can open the file again and view some information:
hdf = pd.HDFStore('experiment_data_series.h5', mode='r')
print(hdf.info())
print(hdf.keys())
# you can get one dataframe by its key
df = hdf.select('raw_0', auto_close=True)
I think this should work because you won't have all of your data loaded in memory at once, i.e., you are using the disk instead.
Is PDI inefficient at writing Excel xlsx files with the Microsoft Excel Writer step?
A transformed Excel data file output by Pentaho seems to be three times the size of the same data transformed manually. Is this inefficiency expected, or is there a workaround for it?
A CSV file of the same transformed output is much smaller. Have I configured something wrong?
xlsx files should normally be smaller in size than CSV, since they consist of XML data compressed in ZIP files. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and they create compressed files so this should not be your issue.
To check, you could open the files with a zip utility to see the entry sizes and compression ratios and find out whether there is a bug. You could also try opening the file in Excel and re-saving it, to see if that gives a smaller size, which could indicate an inefficiency.
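For a quick look, something like this minimal Python sketch (the file name is an assumption) lists the compressed and uncompressed size of each part inside an .xlsx, since the file is just a ZIP archive of XML parts:

import zipfile

# hypothetical file name produced by the transformation
path = 'transformed_output.xlsx'

with zipfile.ZipFile(path) as zf:
    for info in zf.infolist():
        ratio = info.compress_size / info.file_size if info.file_size else 0.0
        # entry name, uncompressed size, compressed size, compression ratio
        print('{:<40} {:>10} -> {:>10} ({:.0%})'.format(
            info.filename, info.file_size, info.compress_size, ratio))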
I have a Stack Overflow data dump file in .xml format, nearly 27GB, and I want to convert it to a .csv file. Can somebody please suggest tools to convert xml to csv, or a Python program?
Use one of the Python xml modules to parse the .xml file. Unless you have much more than 27GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.
Your real problem is this: csv files are lines of fields, representing a rectangular table, while xml files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real task is to understand the data dump format well enough to extract records to write to the .csv file.
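As an illustration of the incremental approach, here is a minimal sketch using xml.etree.ElementTree.iterparse and the csv module; the file name and the column list are assumptions (Stack Exchange dumps store each record as a <row .../> element whose fields are XML attributes):

import csv
import xml.etree.ElementTree as ET

# hypothetical column selection; adjust to the attributes present in the dump
columns = ['Id', 'CreationDate', 'Score', 'Title']

with open('Posts.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(columns)
    # iterparse streams the 27GB file element by element instead of loading it whole
    context = ET.iterparse('Posts.xml', events=('start', 'end'))
    _, root = next(context)  # keep a handle on the root so processed rows can be dropped
    for event, elem in context:
        if event == 'end' and elem.tag == 'row':
            writer.writerow([elem.get(c, '') for c in columns])
            root.clear()  # discard already-written rows so memory stays flat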
I have written a PySpark function to parse the .xml into .csv. XmltoCsv_StackExchange is the GitHub repo. I used it to convert 1 GB of xml within 2-3 minutes on a minimal 2-core, 2 GB RAM Spark setup. It can convert a 27GB file too; just increase minPartitions from 4 to around 128 in this line:
raw = (sc.textFile(fileName, 4))
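So for the 27GB dump, the same line would presumably become something like:

raw = (sc.textFile(fileName, 128))  # more partitions keep each task's slice of the file small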
I have a csv file which may have empty or blank rows. I want to delete such rows, but the problem is that the csv file can potentially be very large. So I am looking for a way to do it without having to load the whole file into memory.
The solution that works for me is using a DataTable or a StreamReader, but both of them load the data into memory, which is not preferable.
Thanks in advance.
I don't think you can do it without loading the file.
I would use a fast CSV reader/writer from http://www.filehelpers.net - I'm sure you can link a writer stream to a reader stream so that you write as you read and don't need to load the whole file at once.
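For comparison, the same read-as-you-write idea in plain Python (not the FileHelpers API; file names are assumptions) looks like this:

import csv

# hypothetical file names; rows stream through one at a time, so memory use stays constant
with open('input.csv', newline='') as src, \
     open('cleaned.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # skip rows that are empty or contain only blank fields
        if any(field.strip() for field in row):
            writer.writerow(row)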