I have an Excel sheet that contains one million rows, but only the first hundred or so have data; the remaining rows are blank. pandas.read_excel internally uses xlrd to read the data, and xlrd reads the whole sheet, which takes a long time (around 65 seconds). I tried the code below, but it did not reduce the reading time.
df = pd.read_excel(file_path, sheet_name=sheetname, nrows=1000, skiprows=1, header=None)
My machine has 8 GB of RAM and runs Windows 10.
I'm using pandas 0.25.3.
Is there another, more optimal way to reduce the reading time?
Passing keep_default_na=False may reduce the read time, since it skips converting the empty cells in the Excel file to NaN.
Example usage:
df = pd.read_excel('test.xlsx', keep_default_na=False)
I am looking to use pyarrow to do memory-mapped reads, both by row and by column, of time-series data with multiple columns. I don't really care if writing historical data is slower; my main aim is the fastest possible read speed (for a single row, a single column, or multiple rows and columns), and after that the fastest possible append speed (with rows appended periodically). Below is the code that generates the data I want to test on. It builds a DataFrame with fields (open, high, low, ...) as columns and a two-level MultiIndex of datetime and symbol. Comments on this particular architecture are also welcome.
import numpy as np
import pandas as pd

KB = 1 << 10
MB = 1024 * KB
GB = 1024 * MB

# One-minute timestamps over ~6.5 years, 10 symbols, 6 fields each.
idx = pd.date_range('20150101', '20210613', freq='T')
df = {}
for j in range(10):
    df[j] = pd.DataFrame(np.random.randn(len(idx), 6), index=idx,
                         columns=list('ohlcvi'))
df = pd.concat(df, axis=1)
df = df.stack(level=0)
df.index.names = ['datetime', 'sym']
df.columns.name = 'field'
print(df.memory_usage().sum() / GB)
Now I am looking for the most efficient code to do the following:
Write this data to disk in a memory-mapped format so that rows/columns can be read with some random access.
Append another row to the end of this dataset.
Query the last 5 rows.
Query a few random columns for a given range of contiguous rows.
Query non-contiguous rows and columns.
If the taskmasters want to see what I tried before anybody is allowed to answer this question, let me know and I will post all the preliminary code I wrote. I am not including it here because it would probably clutter the question without adding much information. I did not get the speeds promised in blog posts about pyarrow, and I am sure I am doing something wrong, hence this request for guidance.
I'm trying to use the code below to write large pandas DataFrames to Excel worksheets. If I write the DataFrame directly, the system runs out of RAM. Is this a viable option, or are there alternatives?
writer = pd.ExcelWriter('Python Output Analysis.xlsx', engine='xlsxwriter', options={'constant_memory': True})
The constant_memory mode of XlsxWriter can be used to write very large Excel files with low, constant memory usage. The catch is that the data needs to be written in row order and (as @Stef points out in the comments above) Pandas writes to Excel in column order. So constant_memory mode won't work with Pandas' ExcelWriter.
As an alternative, you could avoid ExcelWriter and write the data to XlsxWriter directly from the DataFrame, row by row. However, that will be slower from the Pandas point of view.
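A sketch of that row-by-row alternative, bypassing pd.ExcelWriter entirely: constant_memory requires rows to be written in order, which iterating over itertuples() gives us. The small frame and the filename here are illustrative stand-ins.

```python
import pandas as pd
import xlsxwriter

df = pd.DataFrame({'a': range(5), 'b': range(5)})  # stand-in for the big frame

workbook = xlsxwriter.Workbook('output.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet()

worksheet.write_row(0, 0, df.columns)  # header row first
for row_num, row in enumerate(df.itertuples(index=False), start=1):
    worksheet.write_row(row_num, 0, row)  # rows must arrive in order
workbook.close()
```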
If your data is large, consider saving it to a plain-text format instead, e.g. CSV or TSV:
df.to_csv('file.csv', index=False, sep=',')
df.to_csv('file.tsv', index=False, sep='\t')
Or split the DataFrame and save it to several smaller files:
df_size = df.shape[0]
chunksize = max(1, df_size // 10)  # guard against a zero chunk size for tiny frames
for i in range(0, df_size, chunksize):
    dfn = df.iloc[i:i + chunksize, :]
    dfn.to_excel('...')  # replace '...' with a distinct filename per chunk
I have a text file that is around 3.3 GB. I am only interested in 2 of its 47 columns, and from those 2 columns I only need the rows where col2 == 'text1'. For example, consider a text file with values such as:
text file:
col1~col2~~~~~~~~~~~~
12345~text1~~~~~~~~~~~~
12365~text1~~~~~~~~~~~~
25674~text2~~~~~~~~~~~~
35458~text3~~~~~~~~~~~~
44985~text4~~~~~~~~~~~~
I want to create a DataFrame containing only the rows where col2 == 'text1'. So far I have tried loading the entire text file into a DataFrame and then filtering out the needed rows. However, since this is a large file, creating the DataFrame takes more than 45 minutes. Loading only the necessary rows (if possible) would be ideal, as the DataFrame would be considerably smaller and I would not run into memory issues.
My code:
df = pd.read_csv('myfile.txt', low_memory=False, sep='~',
                 usecols=['col1', 'col2'], dtype={'col2': str})
df1 = df[df['col2'] == 'text1']
In short: can I filter a column on a criterion while loading the text file into the DataFrame, so as to 1) reduce the loading time and 2) reduce the memory footprint of the DataFrame?
Okay, so I came up with a solution. Basically, it loads the data in chunks and filters each chunk for col2 == 'text1'. This way, only one chunk is in memory at a time, and the final DataFrame contains only the data I need.
Code:
final = pd.DataFrame()
df = pd.read_csv('myfile.txt', low_memory=False, sep='~',
                 usecols=['col1', 'col2'], dtype={'col2': str},
                 chunksize=100000)
for chunk in df:
    a = chunk[chunk['col2'] == 'text1']
    final = pd.concat([final, a], axis=0)
Better alternatives, if any, will be most welcome!
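One such alternative: collect the filtered chunks in a list and concatenate once at the end, since calling pd.concat inside the loop copies the accumulated frame on every iteration. The in-memory StringIO below is just a small stand-in for myfile.txt so the sketch is self-contained; the separator and column names follow the question.

```python
import io
import pandas as pd

# Small in-memory stand-in for myfile.txt, for demonstration only.
data = io.StringIO(
    "col1~col2\n"
    "12345~text1\n"
    "25674~text2\n"
    "12365~text1\n"
)

parts = []
for chunk in pd.read_csv(data, sep='~', usecols=['col1', 'col2'],
                         dtype={'col2': str}, chunksize=2):
    parts.append(chunk[chunk['col2'] == 'text1'])

# Single concat at the end instead of one per chunk.
final = pd.concat(parts, ignore_index=True)
print(final)
```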
I have a large pandas DataFrame consisting of some 100k rows and ~100 columns with different dtypes and arbitrary content.
I need to assert that it does not contain a certain value, let's say -1.
Using assert(not (any(test1.isin([-1]).sum() > 0))) results in a processing time of several seconds.
Any idea how to speed it up?
Just to make a full answer out of my comment:
With -1 not in test1.values you can check if -1 is in your DataFrame.
Regarding performance, this still has to check every single value, which in your case is 10^5 * 10^2 = 10^7 values; you only save the cost of the summation and of the additional comparison of those results.
Say we have a large CSV file (e.g. 200 GB) where only a small fraction of its rows (e.g. 0.1% or less) contain data of interest.
Say we define this condition as one specific column containing a value from a pre-defined list (e.g. 10K values of interest).
Do odo or pandas provide methods for this kind of selective loading of rows into a dataframe?
I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.
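One possible variant along those lines, assuming the column holds plain scalars: filter each chunk with numpy's np.isin on the raw column values instead of Series.isin. The names mirror the snippet above, and the in-memory StringIO is a toy stand-in for large_file.csv so the sketch is self-contained.

```python
import io
import numpy as np
import pandas as pd

pre_defined_list = ['foo', 'bar', 'baz']
csv_data = io.StringIO("column_to_check,x\nfoo,1\nqux,2\nbar,3\n")  # toy stand-in

good_data = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    mask = np.isin(chunk['column_to_check'].to_numpy(), pre_defined_list)
    good_data.append(chunk[mask])

df = pd.concat(good_data, ignore_index=True)
print(df)
```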