A map-style dataset in PyTorch implements __getitem__() and __len__(), while an iterable-style dataset implements the __iter__() protocol. With a map-style dataset we can access the data with dataset[idx], which is great; with an iterable-style dataset we can't.
My question is: why was this distinction necessary? What makes random reads of the data so expensive, or even impractical?

I wrote a short post on how to use PyTorch datasets, and on the difference between map-style and iterable-style datasets.
In essence, you should use map-style datasets when possible. Map-style datasets give you their size ahead of time, are easier to shuffle, and allow for easy parallel loading.
It’s a common misconception that if your data doesn’t fit in memory, you have to use an iterable-style dataset. That is not true. You can implement a map-style dataset such that it retrieves data as needed.
Check out the full post here.
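
To make the "retrieves data as needed" point concrete, here is a minimal sketch of a map-style dataset that keeps only byte offsets in memory and reads each record from disk on demand. The class name LazyLineDataset and the one-record-per-line file layout are assumptions for illustration. The class is plain Python to stay dependency-free; in PyTorch you would subclass torch.utils.data.Dataset and keep the same two methods.

```python
class LazyLineDataset:
    """Map-style dataset over a text file with one record per line.

    Only the byte offset of each line is held in memory; the record
    itself is read from disk when it is requested.
    """

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # One sequential pass to record where every line starts.
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                self.offsets.append(pos)

    def __len__(self):
        # The size is known ahead of time, which is what makes shuffling easy.
        return len(self.offsets)

    def __getitem__(self, idx):
        # Random access: jump straight to the requested record.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return f.readline().decode("utf-8").rstrip("\r\n")
```

The index costs one integer per record, so it usually fits in memory even when the data does not. Reopening the file on every access keeps the sketch simple; a real implementation would probably keep a handle open per worker.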

It's quite possible that the full dataset doesn't fit in memory (could be on a disk, or only accessible over a network). A stream of information doesn't have to be retained if you're not going to access arbitrary offsets. If you're going to request data[0], then data[1], then data[2] over a network, you're sending a lot of requests which introduces latency.
Iterable-like (ResultSet) objects are typical when incrementally reading rows in the results of a database query. It's also conceivable that a dataset could inherently be a stream of information, like logging data, or transactions, or incrementally discovered pages found by a web crawler.
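
As a rough illustration of the latency argument, the sketch below wraps a paged source in an iterable-style object. PagedStream and the fetch_page(offset, limit) callable are hypothetical names standing in for a database cursor or a network API; the point is only that one request returns a whole page of rows instead of one request per index.

```python
class PagedStream:
    """Iterable-style dataset over a source that can only be read in pages."""

    def __init__(self, fetch_page, page_size):
        # fetch_page(offset, limit) is a hypothetical callable that returns
        # a list of up to `limit` records starting at `offset`.
        self.fetch_page = fetch_page
        self.page_size = page_size

    def __iter__(self):
        offset = 0
        while True:
            page = self.fetch_page(offset, self.page_size)
            if not page:
                return  # source exhausted
            # Rows are handed out one by one and never retained here.
            yield from page
            offset += len(page)
```

There is deliberately no __len__ or __getitem__: the total size may be unknown (a log or a crawl in progress), and rows that have been consumed are not kept around.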


How to optimize memory footprint of Stanza models

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is to load only the processors that I really need, which decreases the memory footprint, but not by much.
I tried looking at open and closed issues on GitHub and on Google but didn't find much.
What other possible solutions are out there?
The bottom line is a model for a language has to be in memory during execution, so by some means or another you need to make the model smaller or tolerate storing models on disk. I can offer some suggestions to make the models smaller, though be warned that making your model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The most immediate impact strategy for reducing model size is to shrink the vocabulary size. It is possible you could cut the vocabulary even smaller and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with a smaller model size and smaller word embeddings, and you may only get a small accuracy drop. We haven't really aggressively experimented with different model sizes to see how much accuracy you lose. This would mean retraining the model and just setting the embedding size and model size parameters smaller.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. I think it currently has 250,000 words, and cutting to 10,000 or 50,000 will reduce model size, but may not affect accuracy too badly.
Unfortunately I don't think there is a magical option you can select that will just solve this issue. You will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.
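
The suggestion to keep commonly requested languages in memory and go to disk only for rarer ones can be sketched as a small least-recently-used cache. ModelCache and its loader argument are hypothetical names; in a real deployment the loader would presumably be something like lambda lang: stanza.Pipeline(lang, processors="tokenize,pos,lemma"), which is an assumption about your setup and not something taken from the question.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `max_models` loaded models, evicting the least recently used."""

    def __init__(self, loader, max_models):
        self.loader = loader          # callable: language code -> loaded model
        self.max_models = max_models
        self._models = OrderedDict()  # ordered from least to most recently used

    def get(self, lang):
        if lang in self._models:
            # Cache hit: mark this language as most recently used.
            self._models.move_to_end(lang)
            return self._models[lang]
        # Cache miss: pay the load-from-disk cost once.
        model = self.loader(lang)
        self._models[lang] = model
        if len(self._models) > self.max_models:
            # Drop the least recently used model to free its memory.
            self._models.popitem(last=False)
        return model
```

With max_models sized to what your instance can hold, popular languages stay resident and only rare ones incur the load overhead.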

How does the model in sklearn handle large data sets in python?

I have a 10 GB data set to train a model on in sklearn, but my computer only has 8 GB of memory. Are there other ways to go besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach then you're on track. The chunk_size may influence your success.
This link may be useful: "Working with big data in python and numpy, not enough ram, how to save partial results on the disc?"
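
A minimal sketch of the partial_fit approach, assuming NumPy and scikit-learn are installed. The iter_chunks generator is a hypothetical stand-in that produces synthetic data; in practice it would read one chunk of your file from disk at a time. SGDClassifier is just one of the estimators that support partial_fit.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(n_chunks, chunk_size, n_features=5, seed=0):
    """Stand-in for reading from disk: yields one (X, y) chunk at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable labels
        yield X, y


clf = SGDClassifier(random_state=0)
# partial_fit needs the full list of classes up front, because a single
# chunk is not guaranteed to contain every label.
classes = np.array([0, 1])
for X_chunk, y_chunk in iter_chunks(n_chunks=20, chunk_size=500):
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Evaluate on a held-out chunk drawn with a different seed.
X_test, y_test = next(iter_chunks(1, 2000, seed=1))
accuracy = clf.score(X_test, y_test)
```

Only one chunk is in memory at any time, so the chunk size is a trade-off between memory use and the number of passes through the read loop.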
Another thing you can do is to randomly pick whether or not to keep a row in your csv file...and save the result to a .npy file so it loads quicker. That way you get a sampling of your data that will allow you to start playing with it with all algorithms...and deal with the bigger data issue along the way(or not at all! sometimes a sample with a good approach is good enough depending on what you want).
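
The row-sampling idea could look roughly like this, using only the standard library. The function name sample_csv, the assumption that the file has a header row, and writing the sample to a smaller CSV (instead of a .npy file) are choices made for this sketch.

```python
import csv
import random


def sample_csv(src_path, dst_path, keep_prob, seed=0):
    """Copy each data row of src_path to dst_path with probability keep_prob.

    Assumes the first row is a header, which is always kept.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)
    kept = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # header
        for row in reader:
            # Rows are streamed one at a time, so the full file never sits in memory.
            if rng.random() < keep_prob:
                writer.writerow(row)
                kept += 1
    return kept
```

With keep_prob around 0.1 a 10 GB file shrinks to roughly 1 GB, which can then be loaded normally and converted to a faster binary format if desired.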

Efficient implementation of SOM (Self organizing map) on Pyspark

I am struggling with the implementation of a performant version of a SOM Batch algorithm on Spark / Pyspark for a huge dataset with > 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Using Accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net and sends the impacts to an accumulator in the driver. (I implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store results in a new column of the DataFrame and then sum them together at the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Does Spark's optimizer realize that it does not need to save each net, only sum them together?
Create a custom parallelized algorithm using RDDs, similar to that (but with more performant calculation algorithms). But I would have to use some kind of loop over each row to update the net, which sounds like it would perform rather poorly.
Any thoughts on the different options? Is there an even better option?
Or are all ideas not that good and I should just preselect a maximum variety subset of my dataset and train a SOM locally on that.
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map averaging method that gave me strange results (abnormal symmetries in the output map)
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator, and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where features are stored as a vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is a trained SOM model, and inherits from spark's Transformer/Model. It contains the map prototypes (center vectors), and contains a transform() method that can operate on DataFrames by taking an input feature column, and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
There is also SOMTrainingSummary that collects stuff such as the objective function.
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
Hope this answers your question. Concerning performance, since you asked for an efficient implementation: I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
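
To illustrate the "distribute your computations yourself" pattern, here is a minimal NumPy sketch of one batch SOM step. It is not the Scala implementation mentioned above. Partitions are simulated as a list of arrays, and all function names are made up for this example. In Spark, the per-partition function would run inside mapPartitions with the prototypes broadcast to the workers, and the partial sums would be combined with a reduce. Each partition returns one small (numerator, denominator) pair, so there is no per-row accumulator update and no net stored per row.

```python
import numpy as np


def grid_coords(rows, cols):
    """(row, col) position of every map unit, shape (K, 2)."""
    return np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)


def neighborhood(coords, sigma):
    """Gaussian neighborhood weights between all pairs of units, shape (K, K)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def partition_stats(X, W, H):
    """Partial sums for one partition: numerator (K, D) and denominator (K,)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (n, K) distances
    bmu = d2.argmin(axis=1)  # best matching unit of each row
    Hx = H[bmu]              # (n, K): influence of each row on each unit
    return Hx.T @ X, Hx.sum(axis=0)


def batch_som_step(partitions, W, H):
    """One batch update: sum the per-partition statistics, then normalize."""
    num = np.zeros_like(W)
    den = np.zeros(W.shape[0])
    for X in partitions:  # in Spark: mapPartitions followed by a reduce
        n, d = partition_stats(X, W, H)
        num += n
        den += d
    return num / den[:, None]
```

Because the update only depends on sums, the result does not depend on how the rows are split across partitions, which is what makes the batch variant a good fit for Spark.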

Deep learning on massive datasets

Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, Spark + EMR is a great way to go.
However, I would also like to use TensorFlow instead of Spark's MLlib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of PySpark, Elephas and EMR to achieve this. Alternatively there are BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only".
Any help to point me in the right direction would be greatly appreciated.
In TensorFlow, you can use so you can generate your dataset at runtime without any storage hassles.
See link for example
Since you mention "fitting a massive dataset into memory", I understand that you are trying to load all the data into memory at once and then start training. My reply is based on that assumption.
The general idea is that if the data does not fit your resources, you divide it into smaller chunks and train iteratively.
1- Load samples one by one instead of trying to load everything at once. If you set up an execution workflow of "Load Data -> Train -> Release Data (this can be done automatically by the garbage collector) -> Restart", you can measure how much memory is needed to train on a single sample.
2- Use mini-batches. Once you have the resource information from #1, you can easily estimate the mini-batch size. For example, if training on a single sample consumes 1.5 GB of RAM and your GPU has 8 GB of RAM, you can theoretically train on mini-batches of size 5 at once.
3- If the resources are not enough to train even a batch of size 1, you may think about increasing your PC's capacity or decreasing your model's capacity / layers / features. Alternatively, you can go for cloud computing solutions.
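
The arithmetic in step 2 and the chunked iteration can be sketched as follows. Both helper names are made up for this example, and the estimate ignores the memory taken by the model itself, so treat the result as an upper bound.

```python
import math


def max_batch_size(mem_per_sample_gb, device_mem_gb):
    """Largest number of samples that fit on the device at once."""
    return math.floor(device_mem_gb / mem_per_sample_gb)


def iter_minibatches(samples, batch_size):
    """Group any iterable of samples into lists of at most batch_size items."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


batch_size = max_batch_size(mem_per_sample_gb=1.5, device_mem_gb=8)  # 8 / 1.5 = 5.33 -> 5
```

Since iter_minibatches accepts any iterable, the samples can come from a generator that reads from disk, so only one mini-batch is held in memory at a time.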

Best practice for training on large scale datasets like ImageNet using Theano/Lasagne?

I found that all of the Theano/Lasagne examples deal with small data sets like MNIST and CIFAR-10, which can be loaded into memory completely.
My question is how to write efficient code for training on large scale datasets?
Specifically, what is the best way to prepare mini-batches (including real time data augmentation) in order to keep the GPU busy?
Maybe something like Caffe's ImageDataLayer?
For example, I have a big txt file which contains all the image paths and labels.
Some code would be appreciated.
Thank you very much!
In case the data doesn't fit into memory, a good way is to prepare the minibatches and store them in an HDF5 file, which is then used at training time.
However, this does not suffice when doing data augmentation, as that is done on the fly. Because of Python's global interpreter lock, the next images cannot be loaded and preprocessed in the same process while the GPU is busy.
The best way around this that I know of is the Fuel library.
Fuel loads and preprocesses the minibatches in a different Python process and then streams them to the training process over a TCP socket.
It additionally provides some functions to preprocess the data, such as scaling and mean subtraction.
Hope this helps.
