Using constant memory with pandas xlsxwriter - python-3.x

I'm trying to use the code below to write large pandas dataframes to Excel worksheets; if I write them directly, the system runs out of RAM. Is this a viable option, or are there any alternatives?
writer = pd.ExcelWriter('Python Output Analysis.xlsx', engine='xlsxwriter',options=dict(constant_memory=True))

The constant_memory mode of XlsxWriter can be used to write very large Excel files with low, constant, memory usage. The catch is that the data needs to be written in row order and (as #Stef points out in the comments above) Pandas writes to Excel in column order. So constant_memory mode won't work with Pandas ExcelWriter.
As an alternative you could avoid ExcelWriter and write the data directly to XlsxWriter from the dataframe on a row by row basis. However, that will be slower from a Pandas point of view.
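As a rough sketch of that row-by-row approach (assuming df is the dataframe in question; the file and sheet names are just placeholders), it might look like this:
import xlsxwriter

workbook = xlsxwriter.Workbook('Python Output Analysis.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet()

# constant_memory mode requires rows to be written in order, top to bottom.
worksheet.write_row(0, 0, df.columns)
for row_num, row in enumerate(df.itertuples(index=False), start=1):
    worksheet.write_row(row_num, 0, row)

workbook.close()
With constant_memory enabled, rows are flushed to disk as you move past them, which is what keeps memory usage flat; the trade-off is that you cannot go back and modify rows that have already been written.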

If your data is large, consider saving it to a plain text file instead, e.g. CSV or TSV.
df.to_csv('file.csv', index=False, sep=',')
df.to_csv('file.tsv', index=False, sep='\t')
Or split the DataFrame and save it to several smaller files.
df_size = df.shape[0]
chunksize = df_size // 10
for i in range(0, df_size, chunksize):
    # print(i, i + chunksize)
    dfn = df.iloc[i:i + chunksize, :]
    dfn.to_excel('...')
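The output path above is left elided as in the original; purely as an illustration (the chunk_ file naming is an assumption), each chunk could be written to its own file inside the loop:
    dfn.to_excel(f'chunk_{i // chunksize}.xlsx', index=False)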

Related

pyarrow append and read row/columns for time series data

I am looking to use pyarrow to do memory-mapped reads, both by row and by column, for time series data with multiple columns. I don't really care about writing historical data at a slower speed. My main aim is the fastest read speed (for a single row, a single column, or multiple rows and columns), and after that the fastest possible append speed (with rows appended periodically). Here is the code that generates the data I am looking to test on. This is a multi-index dataframe with columns as fields (open, high, low, ...) and the index is a two-level multi-index with datetime and symbols as the two levels. Comments on this particular architecture are also welcome.
import time
import psutil, os
import numpy as np
import pandas as pd

KB = 1 << 10
MB = 1024 * KB
GB = 1024 * MB

idx = pd.date_range('20150101', '20210613', freq='T')
df = {}
for j in range(10):
    df[j] = pd.DataFrame(np.random.randn(len(idx), 6), index=idx, columns=[i for i in 'ohlcvi'])
df = pd.concat(df, axis=1)
df = df.stack(level=0)
df.index.names = ['datetime', 'sym']
df.columns.name = 'field'
print(df.memory_usage().sum() / GB)
Now I am looking for the most efficient code to do the following:
Write this data to disk in a memory-mapped format so that it can be used to read rows/columns or do some random access.
Append another row to this dataset at the end.
Query the last 5 rows.
Query a few random columns for a given set of contiguous rows.
Query non-contiguous rows and columns.
If the task masters are looking for how I did it before they allow anybody to answer this question, please respond and I will post all the preliminary code I wrote to accomplish this. I am not doing it here as it would probably clutter up the space without adding much information. I did not get the speeds promised in blog posts about pyarrow, and I am sure I am doing something wrong, hence this request for guidance.
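As a rough starting-point sketch of the memory-mapped read side using the Arrow IPC/Feather format (the file name is a placeholder and compression is left off so the file can be mapped directly; this is an illustration under those assumptions, not a benchmarked solution):
import pyarrow as pa
import pyarrow.feather as feather

# Write the dataframe as an uncompressed Feather v2 (Arrow IPC) file so it can be memory-mapped.
feather.write_feather(df.reset_index(), 'ohlcv.feather', compression='uncompressed')

# Memory-map the file and read it back as an Arrow table.
with pa.memory_map('ohlcv.feather', 'r') as source:
    table = pa.ipc.open_file(source).read_all()

# Read only a few columns, still via the memory map.
subset = feather.read_table('ohlcv.feather', columns=['datetime', 'sym', 'o'], memory_map=True)
Note that the IPC file format is not appendable in place, so periodic appends generally mean writing additional files or record batches (or using the stream format), which is part of the trade-off the question is asking about.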

Using a for loop in pandas

I have 2 different tabular files in Excel format. I want to know whether an ID number from one of the columns in the first Excel file (the "ID" column) exists in a specific column of the proteome file (take "IHD" for example), and if so, to display the value associated with it. Is there a way to do this, specifically in pandas, possibly using a for loop?
After loading the excel files with read_excel(), you should merge() the dataframes on ID and protein. This is the recommended approach with pandas rather than looping.
import pandas as pd
clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')
clusters.merge(proteins, left_on='ID', right_on='protein')

How to avoid reading empty rows in pandas.read_excel

I have an Excel sheet that contains one million rows, but only the first hundred or so have data; the remaining rows are blank. pandas.read_excel internally uses xlrd to read the data, and xlrd in turn reads the whole sheet, which takes a lot of time (around 65 seconds). I tried the code below but could not reduce the reading time.
df = pd.read_excel(file_path, sheet_name=sheetname, nrows=1000, skiprows=1, header=None)
I have 8 GB of RAM on my machine, running Windows 10.
I'm using pandas 0.25.3.
Is there any other way to reduce the reading time?
The keep_default_na=False parameter may reduce the read time and stops empty cells from being converted to NaN.
Example usage:
df = pd.read_excel('test.xlsx', keep_default_na=False)
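If the aim is simply to discard the blank rows once they are read (a small follow-up sketch, assuming the blank rows contain no data at all), the default NaN handling plus dropna can be used instead:
df = pd.read_excel('test.xlsx')   # empty cells become NaN by default
df = df.dropna(how='all')         # drop rows where every cell is empty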

How to write summary of spark sql dataframe to excel file

I have a very large dataframe with 8000 columns and 50000 rows.
I want to write its summary statistics to an Excel file.
I think we can use the describe() method, but how do I write the result to Excel in a good format? Thanks.
The return type of describe() is a PySpark dataframe. The easiest way to get it into an Excel-readable format is to convert it to a pandas dataframe and then write the pandas dataframe out as a CSV file, as below:
import pandas
df.describe().toPandas().to_csv('fileOutput.csv')
If you want it in Excel format, you can try the below:
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
Note: the above requires the xlwt package to be installed (pip install xlwt on the command line).
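If a modern .xlsx file is preferred (xlwt only handles the legacy .xls format and has been deprecated in recent pandas versions), a variation using pandas' default .xlsx engine might look like this (the file name is just a placeholder):
import pandas
df.describe().toPandas().to_excel('fileOutput.xlsx', sheet_name='Sheet1', index=False)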

Pandas / odo / bcolz selective loading of rows from a large CSV file

Say we have a large CSV file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.
Say we define such a condition as having one specific column contain a value from a pre-defined list (e.g. 10K values of interest).
Do odo or pandas provide methods for this type of selective loading of rows into a dataframe?
I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
good_data = []

for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # keep only the rows whose key column matches the pre-defined list
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)

df = pd.concat(good_data)
Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.
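One possible alternative (an assumption on my part, not necessarily what the linked answer describes, and whether it actually helps depends on the data) is to test membership against a Python set inside the loop instead of calling Series.isin:
lookup = set(pre_defined_list)
chunk = chunk[chunk['column_to_check'].map(lookup.__contains__)]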