TextLMDataBunch Memory issue Language Model Fastai - nlp

I have a dataset with 45 million rows of data and three GPUs with 6 GB of RAM each. I am trying to train a language model on the data.
To do that, I am trying to load the data as a fastai data bunch, but this step always fails because of a memory issue.
data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10)
How do I handle this issue?

When you use this function, your DataFrame is loaded into memory. Since you have a very big DataFrame, this causes your memory error. Fastai handles tokenization with a chunksize, so you should still be able to tokenize your text.
Here are two things you should try:
Add a chunksize argument (the default value is 10k) to your TextLMDataBunch.from_df call, so that the tokenization process needs less memory (see the sketch after this list).
If that is not enough, avoid loading your whole DataFrame into memory. Unfortunately, even TextLMDataBunch.from_folder just loads the full DataFrame and passes it to TextLMDataBunch.from_df, so you might have to create your own DataBunch constructor. Feel free to comment if you need help with that.
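A minimal sketch of the first suggestion, assuming fastai v1 (the exact keyword names may differ in your version; df_trn and df_val are the DataFrames from the question):

from fastai.text import TextLMDataBunch

# A smaller chunksize lowers the memory needed during tokenization
# (the default is 10,000 rows per chunk).
data_lm = TextLMDataBunch.from_df('./', train_df=df_trn, valid_df=df_val,
                                  bs=10, chunksize=2000)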

Related

Python Memory Error (After Appending DataFrame)

I want to know more about this, as it is new to me.
I am trying to query InfluxDB with Python to fetch data in 5-minute intervals. I used a simple for loop to get my data in small chunks and appended the chunks to another, initially empty dataframe one after another. This worked out pretty smoothly and I can see my output. But when I try to perform mathematical operations on this large dataframe, it gives me the memory error below:
"MemoryError: Unable to allocate 6.95 GiB for an array with shape (993407736) and data type int64"
My system has 8.00 GB RAM and a 64-bit OS on an x64-based processor.
Could it be that my system cannot support this?
Is there an alternate way to append small dataframes into another dataframe without these memory issues? I am new to working with data in Python and will need to work with this large data for maybe a year.
Even though your system has 8 GB of memory, part of it is used by the OS and the other applications running on your system, so it cannot allocate 6.95 GiB just for this program. If you are building an ML model and trying to run it on huge data, you need to consider one of the options below:
Use GPU machines offered by a cloud provider.
Process the data in small chunks (if it is not ML), as sketched below.
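As an illustration of the second option, a hedged sketch of chunk-wise processing with pandas: instead of concatenating every chunk into one huge dataframe and then computing on it, compute the reduction per chunk and combine the partial results. The file name and the 'value' column are placeholders for illustration.

import pandas as pd

# Aggregate per chunk instead of building one giant dataframe.
total, count = 0.0, 0
for chunk in pd.read_csv('large_data.csv', chunksize=100_000):
    total += chunk['value'].sum()
    count += len(chunk)
mean_value = total / count
print(mean_value)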

Speeding up dataframe.to_excel operations with a GPU

I was working on extracting some data where I constantly need to manipulate part of the fetched data and then append it to another dataframe that contains the combined dataset. I save the dataframe repeatedly using dataframe.to_excel. Since there is a lot of data, this has become a time-consuming operation (reading the previous file, appending, and saving it again), in spite of ample CPU and RAM. I am using GCP, an N1 type with 8 vCPUs and 30 GB of memory. Moreover, since I am running several instances of the same script for different projects together, would using a GPU speed these things up?
I never did it myself, but I think this is possible using some Pandas alternative.
I found this thread where users seem to provide solutions to a similar question.
I too have not tried this, but I can offer a couple of suggestions:
Rather than to_excel, try to_csv; there might be small gains.
You can try this library: https://github.com/modin-project/modin. It seems to make reads and operations faster, but I am not sure about the write operations.
Or you could move the to_excel call into a separate function and perform that operation by spinning off a new thread, as sketched below.
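A minimal sketch of that last suggestion, with the dataframe and file name as placeholders: the to_excel write is handed to a background thread so the main loop can keep fetching and appending.

import threading
import pandas as pd

def save_async(df, path):
    # Write on a background thread so the main loop is not blocked.
    t = threading.Thread(target=df.to_excel, args=(path,), kwargs={'index': False})
    t.start()
    return t

# Hypothetical usage: 'combined' stands in for the accumulated dataframe.
combined = pd.DataFrame({'a': [1, 2, 3]})
writer_thread = save_async(combined.copy(), 'combined.xlsx')
# ... continue fetching/appending here ...
writer_thread.join()  # make sure the last write finished before exiting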

What is best between multiple small h5 files and one huge one?

I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads.
[settings: Python, Ubuntu 18.04]
I can't find any answer about which is best in terms of data access and storage between:
storing all the data in one huge HDF5 file (over 20 GB)
splitting it into multiple (over 16,000) small HDF5 files (approx. 1.4 MB each).
Is there any problem with multiple threads accessing one file? And in the other case, is there an impact of having that many files?
I would go for multiple files if I were you (but read till the end).
Intuitively, you could load at least some files into memory, speeding up the process a little (it is unlikely you would be able to do so with 20 GB; if you are, then you definitely should, as RAM access is much faster).
You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples (from a list, or preferably some more memory-efficient data structure with better cache locality) instead of reading from disk, similar to the approach of Tensorflow's tf.data.Dataset object and its cache method.
On the other hand, this approach is more cumbersome and harder to implement correctly, though if you are only reading the files with multiple threads you should be fine and there shouldn't be any locks on this operation.
Remember to measure your approach with pytorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.
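A rough sketch of the caching Dataset idea described above, assuming one tile per small HDF5 file and an 'image' dataset name inside each file (both are assumptions for illustration):

import h5py
import torch
from torch.utils.data import Dataset

class CachedTileDataset(Dataset):
    """Reads one tile per small HDF5 file and caches it in RAM after the
    first access, so later epochs skip the disk read."""

    def __init__(self, file_paths):
        self.file_paths = file_paths
        self.cache = {}  # idx -> tensor, filled lazily during the first pass

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        if idx not in self.cache:
            with h5py.File(self.file_paths[idx], 'r') as f:
                self.cache[idx] = torch.from_numpy(f['image'][...])
        return self.cache[idx]

Note that with a DataLoader using num_workers > 0, each worker process keeps its own copy of the cache, so the memory cost multiplies; profile before and after as suggested above.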

Out of memory error because of giant input data

I am using RandomForestClassifier in Python to predict whether a pixel in the input image is inside a cell or outside it, as a pre-processing stage to improve the image. The problem is that the training set is 8.36 GB and the test data is another 8.29 GB, so whenever I run my program I get an out-of-memory error. Will extending the memory not work? Is there any way to read the CSV files that contain the data in more than one step and free the memory after each step?
Hopefully you are using pandas to process this CSV file, as it would be nearly impossible in native Python. As for your memory problem, here is a great article explaining how to process large CSV files by chunking the data in pandas:
http://pythondata.com/working-large-csv-files-python/
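Following that article's approach, a minimal sketch of chunked reading; the file name and process_chunk are placeholders for whatever per-chunk work you need (feature extraction, downsampling, accumulating statistics, and so on):

import pandas as pd

def process_chunk(chunk):
    pass  # placeholder: do the per-chunk work here

# Read the training CSV in chunks so the whole 8 GB file is never
# held in memory at once.
for chunk in pd.read_csv('train_pixels.csv', chunksize=500_000):
    process_chunk(chunk)
    del chunk  # each chunk can be freed before the next one is read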

Spark MLlib FPGrowth job fails with Memory Error

I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):
from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    pass  # do something with item
I find that whenever I kick off the actual processing by calling either count() or toLocalIterator(), my operation ultimately ends with an out-of-memory error. Is FPGrowth not partitioning my data? Is my result data so big that getting even a single partition chokes my memory? If yes, is there a way I can persist an RDD to disk in a "streaming" fashion without trying to hold it in memory?
Thanks for any insights.
Edit: A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.
-Raj
Well, the problem is most likely the support threshold. When you set a very low value like here (I wouldn't call one-in-a-million frequent), you basically throw away all the benefits of the downward-closure property.
It means that the number of itemsets considered grows exponentially, and in the worst case it will be equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
Edit:
Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset that appears in your data has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
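For example, reusing the question's code with a support threshold that actually prunes (0.01 here is only an illustrative starting point; tune it to your data):

from pyspark.mllib.fpm import FPGrowth

data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
# A higher minSupport keeps the FP-tree small enough to fit in memory.
model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=10)
for itemset in model.freqItemsets().toLocalIterator():
    print(itemset)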
