Pandas / odo / bcolz selective loading of rows from a large CSV file - python-3.x

Say we have a large CSV file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.
Say we define such a condition as one specific column containing a value from a pre-defined list (e.g. 10K values of interest).
Do odo or Pandas provide methods for this kind of selective loading of rows into a dataframe?

I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd
pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
# Iterate over chunks that fit in memory, keeping only rows whose value is in the list.
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
    good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.
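For illustration, one possible alternative (a sketch only, not necessarily the approach in the linked answer, and not guaranteed to be faster for every workload) is to express the membership test as an inner merge against a one-column lookup frame:
lookup = pd.DataFrame({'column_to_check': pre_defined_list})
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
    # An inner merge keeps only rows whose value appears in the lookup frame.
    good_data.append(chunk.merge(lookup, on='column_to_check', how='inner'))
df = pd.concat(good_data)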

Related

pyarrow append and read row/columns for time series data

I am looking to use pyarrow to do memory-mapped reads, both by row and by column, for time series data with multiple columns. I don't really care if writing historical data is slower. My main aim is the fastest read speed (for single rows, single columns, and blocks of rows and columns), and after that the fastest possible append speed (with rows appended periodically). Here is the code that generates the data I am looking to test on. It is a multiindex dataframe with fields (open, high, low, ...) as columns and a two-level multiindex with datetime and symbol as the index levels. Comments on this particular architecture are also welcome.
import time
import psutil, os
import numpy as np
import pandas as pd

KB = 1 << 10
MB = 1024 * KB
GB = 1024 * MB

idx = pd.date_range('20150101', '20210613', freq='T')
df = {}
for j in range(10):
    df[j] = pd.DataFrame(np.random.randn(len(idx), 6), index=idx, columns=[i for i in 'ohlcvi'])
df = pd.concat(df, axis=1)
df = df.stack(level=0)
df.index.names = ['datetime', 'sym']
df.columns.name = 'field'
print(df.memory_usage().sum() / GB)
Now I am looking for the most efficient code to do the following:
Write this data to disk in a memory-mapped format so that it can be used for reading rows/columns or some random access.
Append another row to this dataset at the end.
Query the last 5 rows.
Query a few random columns for a given set of contiguous rows.
Query non-contiguous rows and columns.
If the taskmasters want to see how I did it before they allow anybody to answer this question, please respond and I will post all the preliminary code I wrote to accomplish this. I am not doing it here as it would probably clutter the space without adding much information. I did not get the speeds promised in blog posts about pyarrow, and I am sure I am doing it wrong, hence this request for guidance.
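One possible starting point is sketched below. It assumes pyarrow's uncompressed Feather (Arrow IPC) format, which supports zero-copy memory-mapped reads; the file name is made up, and note that IPC files cannot be appended to in place, so periodic appends would typically go into additional files rather than this one.
import pyarrow as pa
import pyarrow.feather as feather

# Write the dataframe as an uncompressed Feather (Arrow IPC) file.
table = pa.Table.from_pandas(df.reset_index())
feather.write_feather(table, 'ohlcv.feather', compression='uncompressed')

# Memory-map the file and read it back without copying the buffers.
with pa.memory_map('ohlcv.feather', 'r') as source:
    reader = pa.ipc.open_file(source)
    table = reader.read_all()
    last_five = table.slice(len(table) - 5).to_pandas()   # last 5 rows
    closes = table.column('c').to_pandas()                # a single field column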

Using constant memory with pandas xlsxwriter

I'm trying to use the code below to write large pandas dataframes to Excel worksheets. If I write them directly, the system runs out of RAM. Is this a viable option, or are there alternatives?
writer = pd.ExcelWriter('Python Output Analysis.xlsx', engine='xlsxwriter',options=dict(constant_memory=True))
The constant_memory mode of XlsxWriter can be used to write very large Excel files with low, constant memory usage. The catch is that the data needs to be written in row order, and (as @Stef points out in the comments above) Pandas writes to Excel in column order. So constant_memory mode won't work with Pandas ExcelWriter.
As an alternative you could avoid ExcelWriter and write the data directly to XlsxWriter from the dataframe on a row by row basis. However, that will be slower from a Pandas point of view.
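A minimal sketch of that row-by-row approach (it assumes a pandas DataFrame df is already in memory; the output file name is illustrative):
import xlsxwriter

# df is assumed to be an existing pandas DataFrame.
workbook = xlsxwriter.Workbook('output.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet()

# constant_memory mode requires writing strictly in row order.
worksheet.write_row(0, 0, df.columns)
for row_num, row in enumerate(df.itertuples(index=False), start=1):
    worksheet.write_row(row_num, 0, row)

workbook.close()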
If your data is large, consider saving it as a plain text file instead, e.g. CSV, TSV, etc.
df.to_csv('file.csv', index=False, sep=',')
df.to_csv('file.tsv', index=False, sep='\t')
Or split the DataFrame and save it as several smaller files.
df_size = df.shape[0]
chunksize = df_size // 10
for i in range(0, df_size, chunksize):
    # print(i, i + chunksize)
    dfn = df.iloc[i:i + chunksize, :]
    dfn.to_excel('...')

What's the fastest way to loop through a sorted Dask dataframe?

I'm new to Pandas and Dask. Dask dataframes wrap pandas dataframes and share most of the same function calls.
I'm using Dask to sort (set_index) a largish CSV file of ~1,000,000 rows and ~100 columns.
Once it's sorted, I use itertuples() to grab each dataframe row and compare it with a row from a database with ~1,000,000 rows and ~100 columns.
But it's running slowly (around 8 hours). Is there a faster way to do this?
I used Dask because it can sort very large CSV files and has a flexible CSV parsing engine. It'll also let me run more advanced operations on the dataset and parse more data formats in the future.
I could pre-sort the CSV, but I want to see if Dask can be fast enough for my use case; it would make things a lot more hands-off in the long run.
By using itertuples, you are bringing each row back to the client, one by one. Please read up on map_partitions or map to see how you can apply a function to rows or blocks of the dataframe without pulling data to the client.
Note that each worker should write to a different file, since they operate in parallel.
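For illustration, a minimal map_partitions sketch (the file names, column names, and the per-row comparison are placeholders, not taken from the question):
import dask.dataframe as dd

ddf = dd.read_csv('large_file.csv')
ddf = ddf.set_index('sort_key')

def compare_partition(pdf):
    # pdf is a regular pandas DataFrame holding one partition;
    # do the per-row comparison here instead of on the client.
    pdf['matched'] = pdf['value'] > 0   # placeholder for the real comparison
    return pdf

result = ddf.map_partitions(compare_partition)

# Dask writes one output file per partition, so parallel workers do not collide.
result.to_csv('output-*.csv')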

Create dataframe from text file based on certain criteria

I have a text file that is around 3.3GB. I am only interested in 2 columns in this text file (out of 47). From these 2 columns, I only need rows where col2=='text1'. For example, consider my text file to have values such as:
text file:
col1~col2~~~~~~~~~~~~
12345~text1~~~~~~~~~~~~
12365~text1~~~~~~~~~~~~
25674~text2~~~~~~~~~~~~
35458~text3~~~~~~~~~~~~
44985~text4~~~~~~~~~~~~
I want to create a df where col2=='text1'. What I have done so far is try to load the entire text file into my df and then filter down to the needed rows. However, since this is a large text file, creating the df takes more than 45 mins. I believe loading only the necessary rows (if possible) would be ideal, as the df would be considerably smaller and I wouldn't run into memory issues.
My code:
df=pd.read_csv('myfile.txt',low_memory=False,sep='~',usecols=['col1','col2'],dtype={'col2':str})
df1=df[df['col2']=='text1']
In short, can I filter on a column, based on a criterion, while loading the text file into a dataframe, so as to 1) reduce loading time and 2) reduce the size of the df in memory?
Okay, so I came up with a solution. Basically it loads the data in chunks and filters each chunk for col2=='text1'. This way, only one chunk is loaded in memory at a time and my final df only has the data I need.
Code:
final = pd.DataFrame()
df = pd.read_csv('myfile.txt', low_memory=False, sep='~', usecols=['col1', 'col2'], dtype={'col2': str}, chunksize=100000)
for chunk in df:
    a = chunk[chunk['col2'] == 'text1']
    final = pd.concat([final, a], axis=0)
Better alternatives, if any, will be most welcome!
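One possible refinement (a sketch, not from the original post): collect the filtered chunks and concatenate once at the end, which avoids re-copying the growing result on every iteration.
chunks = pd.read_csv('myfile.txt', low_memory=False, sep='~',
                     usecols=['col1', 'col2'], dtype={'col2': str},
                     chunksize=100000)
# Filter each chunk as it is read, then concatenate all filtered chunks in one pass.
final = pd.concat([c[c['col2'] == 'text1'] for c in chunks], axis=0)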

pyspark: isin vs join

What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically: depending on the size of the list, when is it best (with respect to runtime) to use isin vs an inner join vs a broadcast?
This question is the spark analogue of the following question in Pig:
Pig: efficient filtering by loaded list
Additional context:
Pyspark isin function
Considering
import pyspark.sql.functions as psf
There are two types of broadcasting:
sc.broadcast() to copy python objects to every node for a more efficient use of psf.isin
psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in pig).
In the question linked for context, the filtering was done using a column of another dataframe, hence the possible solution with a join.
Keep in mind that if your filtering list is relatively big, searching through it will take a while, and since it has to be done for each row it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.
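To make the two options concrete, here is a minimal sketch (the input file, column name, and value list are illustrative assumptions):
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

df = spark.read.csv('large_file.csv', header=True)
values = ['foo', 'bar', 'baz']

# Option 1: broadcast the python list and filter with isin.
bc_values = sc.broadcast(values)
filtered_isin = df.filter(psf.col('column_to_check').isin(bc_values.value))

# Option 2: turn the list into a small dataframe and broadcast-join it.
lookup = spark.createDataFrame([(v,) for v in values], ['column_to_check'])
filtered_join = df.join(psf.broadcast(lookup), on='column_to_check', how='inner')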
Both join and isin work well for all my daily use cases.
isin works well for both small and moderately large (~1M values) lists.
Note: if you have a large dataset (say ~500 GB) and you want to filter it and then process the filtered result, isin keeps the data read and processing significantly smaller and faster, since the whole 500 GB is never carried through once you have filtered down to the smaller dataset with .isin.
With a join, on the other hand, the whole 500 GB is loaded and processed, so the processing time is much higher.
In my case, after filtering with isin and then processing and converting to a pandas DataFrame, it took < 60 seconds; with a join followed by the same processing and conversion to a pandas DataFrame, it took > 1 hour.
