I have a text file that is around 3.3GB. I am only interested in 2 columns in this text file (out of 47). From these 2 columns, I only need rows where col2=='text1'. For example, consider my text file to have values such as:
text file:
col1~col2~~~~~~~~~~~~
12345~text1~~~~~~~~~~~~
12365~text1~~~~~~~~~~~~
25674~text2~~~~~~~~~~~~
35458~text3~~~~~~~~~~~~
44985~text4~~~~~~~~~~~~
I want to create a df where col2=='text1'. What I have done so far is tried to load the entire textfile into my df and then filter out the needed rows. However, since this is a large text file, creating a df takes more than 45 mins. I believe loading only the necessary rows (if possible) would be ideal as the df would be of considerably smaller size and I won't run into memory issues.
My code:
df=pd.read_csv('myfile.txt',low_memory=False,sep='~',usecols=['col1','col2'],dtype={'col2':str})
df1=df[df['col2']=='text1']
In short, can I filter a column, based on a criteria, while loading the text file to dataframe so as to 1) Reduce time for loading and 2) Reduce the size of df on my memory.
Okay, So I came up with a solution. Basically it has to do with loading the data in chunks, and filtering the chunks for col2=='text1'. This way, I only have a chunk loaded in memory each time and my final df will only have the data I need.
Code:
final=pd.DataFrame()
df=pd.read_csv('myfile.txt',low_memory=False,sep='~',usecols=['col1','col2'],dtype={'col2':str},chunksize=100000)
for chunk in df:
a=chunk[chunk['col2']=='text1']
final=pd.concat([final,a],axis=0)
Better alternatives, if any, will be most welcome!
Related
I am looking to use pyarrow to do memory mapped reads both for row and columns, for time series data with multiple columns. I Don't really care about writing historical data at a slower speed. My main aim is the fastest read speed (for single row, single columns, multiple row columns), and there after the fastest possible append speed (with rows appended periodically). Here is the code that generates data I am looking to test on. This is a multiindex dataframe with columns as fields (open, high, low ...) and the index is a two level multiindex with datetime and symbols as the two levels. Comments on this particular architecture are also welcome.
import time
import psutil, os
KB = 1<<10
MB = 1024 * KB
GB = 1024 * MB
idx = pd.date_range('20150101', '20210613', freq='T')
df = {}
for j in range(10):
df[j] = pd.DataFrame(np.random.randn(len(idx), 6), index=idx, columns=[i for i in 'ohlcvi'])
df = pd.concat(df, axis=1)
df = df.stack(level=0)
df.index.names=['datetime', 'sym']
df.columns.name = 'field'
print(df.memory_usage().sum()/GB)
Now I am looking for the most efficient code to do the following:
Write this data in a memory mapped format on disk so that It can be used to read rows/columns or some random access.
Append another row to this dataset at the end.
query the last 5 rows.
query a few random columns for a given set of continuous rows.
query non continuous rows and columns.
If the task masters are looking for how I did it before they allow anybody to answer this question, please respond and I will roll out all the preliminary code I wrote to accomplish this. I am not doing it here as It will probably dirty up the space without much info. I did not get speeds promised on blogs on pyarrow and I am sure I am doing it wrong, thus this request for guidance.
I'm new to Pandas and Dask, Dask dataframes wrap pandas dataframes and share most of the same function calls.
I using Dask to sort(set_index) a largeish csv file ~1,000,000 rows ~100columns.
Once it's sorted I use itertuples() to grab each dataframe row, to compare with a row from a database with ~1,000,000 rows ~100 columns.
But it's running slowly (takes around 8 hours), is there a faster way to do this?
I used dask because it can sort very large csv files and has a flexible csv parsing engine. It'll also let me run more advanced operations on the dataset, and parse more data formats in the future
I could presort the csv but I want to see if Dask can be fast enough for my use case, it would make things alot more hands off in the long run.
By using iter_tuples, you are bringing each row back to the client, one by one. Please read up on map_partitions or map to see how you can apply function to rows or blocks of the dataframe without pulling data to the client.
Note that each worker should write to a different file, since they operate in parallel.
When I try to write a dataframe as parquet, the file sizes are non-uniform. Although I don't want to make the files uniform, I want to set a max size for each file.
I can't afford to repartition the data as the dataframe is sorted(As per my understanding, repartitioning a sorted dataframe can distort the ordering).
Any help would be appreciated.
I have come across maxRecordsPerFile, but I don't want to limit the number of rows and I might not have full information about the columns(total number of columns and their types). So it's difficult to estimate file size based on rows.
I have read about parquet block size as well and I don't think that helps.
I am trying to use Spark to read data stored in a very large table (contains 181,843,820 rows and 50 columns) which is my training set, however, when I use spark.table() I noticed that the row count is different than the row count when calling the DataFrame's count(), I am currently using PyCharm.
I want to preprocess the data in the table before I can use it further as a training set for a model I need to train.
When loading the table I found out that the DataFrame I'm loading the table to is much smaller (10% of the data in this case).
what I have tried:
raised spark.kryoserializer.buffer.max capacity.
load a smaller table into the DataFrame (70k rows) and actually found no difference in the count() outputs.
this sample is very similar to the code I ran in order to investigate the problem.
df = spark.table('myTable')
print(spark.table('myTable').count()) # output: 181,843,820
print(df.count()) # output 18,261,961
I expect both outputs to be the same (the original 181m), yet they are not, and I dont understand why.
Say we have large csv file (e.g. 200 GB) where only a small fraction of rows (e.g. 0.1% or less) contain data of interest.
Say we define such condition as having one specific column contain a value from a pre-defined list (e.g. 10K values of interest).
Does odo or Pandas facilitate methods for this type of selective loading of rows into a dataframe?
I don't know of anything in odo or pandas that does exactly what you're looking for, in the sense that you just call a function and everything else is done under the hood. However, you can write a short pandas script that gets the job done.
The basic idea is to iterate over chunks of the csv file that will fit into memory, keeping only the rows of interest, and then combining all the rows of interest at the end.
import pandas as pd
pre_defined_list = ['foo', 'bar', 'baz']
good_data = []
for chunk in pd.read_csv('large_file.csv', chunksize=10**6):
chunk = chunk[chunk['column_to_check'].isin(pre_defined_list)]
good_data.append(chunk)
df = pd.concat(good_data)
Add/alter parameters for pd.read_csv and pd.concat as necessary for your specific situation.
If performance is an issue, you may be able to speed things up by using an alternative to .isin, as described in this answer.