I have an XML object that needs to be written to a file, and writing 10,000 records takes more than an hour to complete. I tried converting the column first with df_merge['xml'] = df_merge['xml'].astype(str), but the total time is still similar, i.e. more than an hour, only now the astype(str) step adds extra time on top. Whatever I try, to_csv takes more than an hour to complete.
So, can someone please tell me how to write a large XML object to a file quickly?
The 10,000 XMLs add up to around 600 MB.
df_merge.to_csv(settings.OUTPUT_XML, encoding='utf-8', index=False,
                columns=['xml'])
Later I tried np.savetxt, which also takes a similar amount of time.
import numpy as np
np.savetxt('output_xml.txt', df_merge['xml'], encoding='utf-8', fmt="%s")
You might consider using serialization. A good library for that is joblib; pickle is another common serialization tool.
A good Stack Overflow post outlining the differences and when to use each is here
In your case, you should be able to serialize your object in much less time, using the example code below:
# Import joblib's dump function
from joblib import dump
# For speed, keep compression = 0
dump(df_merge, 'df_merge.joblib')
# For smaller file size, you can increase compression, though it will slow your write time
# dump(df_merge, 'df_merge.joblib', compress=9)
You can then use joblib to load the file, like so:
# Import joblib's load function
from joblib import load
# Note: if you dumped with compress=n, loading will take correspondingly longer
df_merge = load('df_merge.joblib')
I'm trying to work on a dataset with 510,000 rows and 636 columns. I loaded it into a dataframe using the dask dataframe method, but the entries can't be displayed. When I try to get the shape, it results in delays. Is there a way for me to analyze the whole dataset without using big data technologies like PySpark?
from dask import dataframe
import requests
import zipfile
import os
import pandas as pd

if not os.path.exists('pisa2012.zip'):
    r = requests.get('https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip', allow_redirects=True)
    open('pisa2012.zip', 'wb').write(r.content)

if not os.path.exists('pisa2012.csv'):
    with zipfile.ZipFile('pisa2012.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

df_pisa = dataframe.read_csv('pisa2012.csv')
df_pisa.shape  # Output: (Delayed('int-e9d8366d-1b9e-4f8e-a83a-1d4cac510621'), 636)
Firstly, Spark, Dask and Vaex are all "big data" technologies.
it results in delays
If you read the documentation, you will see that Dask is lazy and only performs operations on demand; you have to ask for the result explicitly. The reason is that just getting the shape requires reading all the data, but the data will not be held in memory - that is the whole point, and the feature that lets you work with bigger-than-memory data (otherwise, just use pandas).
This works:
df_pisa.shape.compute()
But, better, figure out what you actually want to do with the data; I assume you are not just after the shape. You can put multiple operations/delayed objects into a single dask.compute() to do them at once and not have to repeat expensive tasks like reading/parsing the file. For example:
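A minimal sketch of that pattern, assuming the row count and a numeric-column mean are the quantities you actually want (both aggregations here are hypothetical):
import dask
from dask import dataframe

df_pisa = dataframe.read_csv('pisa2012.csv')

# Build the lazy results first...
n_rows = df_pisa.shape[0]
numeric_means = df_pisa.select_dtypes('number').mean()

# ...then evaluate them together, so the CSV is read and parsed only once
n_rows, numeric_means = dask.compute(n_rows, numeric_means)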
I created an in-memory file and then tried to save it as a file:
import shutil
import pandas as pd
from io import StringIO

# various calculations

with open(s_outfile, "w") as outfile:
    # make a header row
    outfile.write('npi,NPImatched,lookalike_npi,domain,dist,rank\n')

stream_out = StringIO()
for i in big_iterator:
    # more calculations, creating dataframe df_info
    df_info.to_csv(stream_out, index=False, header=False)

with open(s_outfile, 'a', newline='\n') as file:
    stream_out.seek(0)
    shutil.copyfileobj(stream_out, file)
stream_out.close()
The point of writing to the StringIO object inside the loop was to speed up df_info.to_csv(), which worked OK (though less dramatically than I expected). But when I tried to copy the in-memory object to a file with shutil.copyfileobj(), I got MemoryError, with essentially no further information.
It's a large-ish job: the loop runs about 1M times and the output data should be about 6 GB in size. This was running on a GCP Linux compute instance with (I think) about 15 GB of RAM, although of course less than that (and perhaps less than the size of the in-memory data object) was free at the time.
But why would I get a memory error? Isn't shutil.copyfileobj() all about copying incrementally, using memory safely, and avoiding excessive memory consumption? I see now that it has an optional buffer size parameter, but as far as I can see, it defaults to something much smaller than the scale I'm working at with this data.
Would you expect the error to be avoided if I simply set the buffer size to something moderate like 64KB? Is my whole approach wrong-headed? It takes long enough to get the in-memory data established that I can't test things willy-nilly. Thanks in advance.
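For reference, the optional buffer size mentioned above is copyfileobj()'s length argument; here is a minimal sketch passing an explicit 64 KB buffer, reusing s_outfile and stream_out from the code above. Whether this alone avoids the MemoryError is uncertain, since the full output already sits in the StringIO object before the copy starts.
import shutil

# Copy from the in-memory buffer to the file in 64 KB pieces
with open(s_outfile, 'a', newline='\n') as file:
    stream_out.seek(0)
    shutil.copyfileobj(stream_out, file, length=64 * 1024)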
I have an SPSS database that I need to open, but it is huge and, if opened naively as in the code below, it saturates RAM and eventually crashes.
import pandas as pd

def main():
    data = pd.read_spss('database.sav')
    print(data)

if __name__ == '__main__':
    main()
The equivalent pandas function for reading a SAS database allows for the chunksize and iterator keywords, mapping the file without reading it all into RAM in one shot, but for SPSS this option appears to be missing. Is there another Python module that I could use for this task that would allow mapping of the database without reading it into RAM in its entirety?
You can use pyreadstat with the generator read_file_in_chunks. Use the chunksize parameter to control how many rows are read on each iteration.
import pyreadstat

fpath = 'database.sav'
# read_sav is the pyreadstat reader for SPSS .sav files
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sav, fpath, chunksize=10000)

for df, meta in reader:
    print(df)  # df will contain 10K rows
    # do some cool calculations here for the chunk
Pandas read_spss uses pyreadstat under the hood, but exposes only a subset of options.
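If you only need some of the columns, a further sketch (assuming keyword arguments are forwarded to the underlying reader, as in recent pyreadstat versions; the column names here are hypothetical):
import pandas as pd
import pyreadstat

fpath = 'database.sav'
chunks = []
reader = pyreadstat.read_file_in_chunks(
    pyreadstat.read_sav, fpath,
    chunksize=10000,
    usecols=['age', 'income'],  # hypothetical column names
)

for df, meta in reader:
    chunks.append(df)  # each chunk is small enough to hold in RAM

# Only worthwhile if the selected columns fit in memory once combined
data = pd.concat(chunks, ignore_index=True)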
I have a function that reads large CSV files using a dask dataframe and then converts them to a pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob
import dask.dataframe as dd

def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest files
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])

latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disc interface, which is a bottleneck and likely performs much better reading sequentially than when trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory which is not currently needed by any computation, so you end up doing the expensive work twice. By calling compute on both outputs together, you would halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then sub-select; include just those columns right in the read_csv call (e.g. via usecols), saving a lot of time and memory (true for pandas or Dask) - a sketch combining this with the single compute call is below
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
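Putting the column selection and the single compute call together, a rough sketch (the separator, encoding, column names and IDs are taken from the question; the file list latest_Tea2Array is assumed to exist as above):
import dask
import dask.dataframe as dd

# Read only the columns that are actually needed
Tea2Array = dd.read_csv(
    latest_Tea2Array,
    sep=chr(1),
    encoding="utf-16",
    usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'],
)

# Build both selections lazily...
P1MI3_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MJC_lazy = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# ...and materialise them in a single pass over the files
P1MI3, P1MJC_old = dask.compute(P1MI3_lazy, P1MJC_lazy)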
I am trying to parallelize a pandas operation that splits a dataframe column containing comma-separated values into 2 columns. The normal pandas operation takes around 5 seconds on my Python instance, using df.str.split directly on that particular column. My dataframe contains 2 million rows, so I'm trying to bring down the code's running time.
As the first approach to parallelizing, I am using Python's multiprocessing library, creating a pool with as many workers as there are CPU cores on my instance. For the second approach to the same problem, I am using the concurrent.futures library with a chunksize of 4.
However, I see that the multiprocessing library takes around the same time as the normal pandas operation (5 seconds), whereas concurrent.futures takes more than a minute to run the same line.
1) Does Google Compute Engine support these Python multiprocessing libraries?
2) Why is the parallel processing not working on GCP?
Thanks in advance. Below is the sample code:
import pandas as pd
from multiprocessing import Pool

def split(e):
    return e.split(",")

df = pd.DataFrame({'XYZ': ['CAT,DOG',
                           'CAT,DOG', 'CAT,DOG']})

pool = Pool(4)
df_new = pd.DataFrame(pool.map(split, df['XYZ']), columns=['a', 'b'])
df_new = pd.concat([df, df_new], axis=1)
The above code takes about the same time as the code below, which is a normal pandas operation using only one core:
df['a'], df['b'] = df['XYZ'].str.split(',',1).str
Using concurrent.futures:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as pool:
    a = pd.DataFrame(pool.map(split, df['XYZ'], chunksize=4),
                     columns=['a', 'b'])
    print(a)
The above code using concurrent.futures takes more than a minute to run on GCP. Please note that the code I have posted is just sample code; the dataframe I am using in the project has 2 million such rows. Any help would be really appreciated!
Why did you choose chunksize=4? That is very small: for 2 million rows it splits the work into 500,000 tiny chunks. Four workers could in principle cut the compute time to roughly a quarter, but the per-chunk overhead (pickling arguments and results and shuttling them between processes) will likely make this slower than the single-threaded approach.
I'd recommend using a much larger chunksize. Anywhere from 10,000 to 200,000 might be appropriate, but you should tweak this based on some experimentation with the results you get. For example:
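A minimal sketch under those assumptions (4 worker processes, synthetic data of the same scale; the chunksize value is just a starting point to tune):
import concurrent.futures
import pandas as pd

def split(e):
    return e.split(",")

if __name__ == '__main__':
    # Synthetic stand-in for the real 2-million-row column
    df = pd.DataFrame({'XYZ': ['CAT,DOG'] * 2_000_000})

    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        # Larger chunks mean far fewer pickling/IPC round trips
        parts = pool.map(split, df['XYZ'], chunksize=100_000)
        df_new = pd.DataFrame(list(parts), columns=['a', 'b'])

    df_new = pd.concat([df, df_new], axis=1)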