Writing a block-cyclic distributed matrix to a binary file

I am using ScaLAPACK to diagonalize matrices (25k × 25k) and need the eigenvectors. I am using pzlawrite to write the distributed matrices out to a file, but writing such huge files in ASCII is quite slow. I suppose writing the output to a binary file would be much faster, but I have been unable to find any examples of writing distributed matrices to a binary file in Fortran. Are there any built-in functions in ScaLAPACK to write out distributed matrices in binary?
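One approach sometimes used for this, in place of a ScaLAPACK built-in, is MPI-IO: describe the block-cyclic layout with an MPI "darray" filetype and have each rank write its local entries collectively. A minimal sketch with mpi4py follows (the equivalent Fortran calls are MPI_Type_create_darray and MPI_File_write_all); the matrix size, blocking factors, process grid, and file name are placeholders that must match the descriptor and BLACS grid actually used.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    nprocs = comm.Get_size()

    # Placeholder ScaLAPACK parameters; they must match the matrix descriptor.
    M = N = 25000            # global matrix dimensions
    MB = NB = 64             # block sizes of the block-cyclic distribution
    nprow, npcol = 2, 2      # BLACS process grid; run with nprow*npcol ranks,
                             # row-major rank ordering assumed

    # Describe this rank's pieces of the global matrix as an MPI "darray" filetype.
    filetype = MPI.DOUBLE_COMPLEX.Create_darray(
        nprocs, rank, [M, N],
        [MPI.DISTRIBUTE_CYCLIC, MPI.DISTRIBUTE_CYCLIC],
        [MB, NB], [nprow, npcol], order=MPI.ORDER_FORTRAN)
    filetype.Commit()

    # local_a must hold exactly this rank's local entries, column-major and
    # contiguous (local leading dimension equal to the local row count).
    nlocal = filetype.Get_size() // MPI.DOUBLE_COMPLEX.Get_size()
    local_a = np.zeros(nlocal, dtype=np.complex128)  # stand-in for the eigenvectors

    fh = MPI.File.Open(comm, "eigenvectors.bin",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    fh.Set_view(0, MPI.DOUBLE_COMPLEX, filetype)
    fh.Write_all(local_a)                            # collective binary write
    fh.Close()
    filetype.Free()

With this view the file ends up holding the full matrix in column-major order, so it can be read back serially or re-read in parallel with a matching darray view.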

Related

Calculate memory usage of RandomForestClassifier and IsolationForest

I'd like to evaluate how much memory is used up by both
sklearn.ensemble.IsolationForest
sklearn.ensemble.RandomForestClassifier
But
sys.getsizeof(my_isolation_forest_model)
sys.getsizeof(my_random_forest_classifier_model)
both always return a value of 48, no matter how the model is fit.
Can you help me find out how much memory my models are using?
Scikit-Learn trees are represented using Cython data structures, which do not interact with Python's getsizeof function in a meaningful way - you're just seeing the size of the pointer(s).
As a workaround, you can dump these Cython data structures into a Pickle file (or any alternative serialization protocol/data format) and then measure the size of this file. The data from the Cython data structures will all be there.
When using Pickle, be sure to disable any data compression!
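A minimal sketch of that measurement, with a toy dataset just to have something to fit:

    import pickle
    from sklearn.datasets import make_classification
    from sklearn.ensemble import IsolationForest, RandomForestClassifier

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

    for model in (IsolationForest(n_estimators=100, random_state=0).fit(X),
                  RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)):
        # Serialize without compression; the byte count includes the Cython tree data.
        payload = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
        print(f"{type(model).__name__}: {len(payload) / 1024:.0f} KiB pickled")

pickle.dumps avoids writing a file at all; dumping to disk and checking the file size gives the same number as long as compression is left off.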

Is it possible to load data on Spark workers directly into Apache Arrow in-memory format without first loading it into Spark's in-memory format?

We have a use case for doing a large number of vector multiplications and summing the results such that the input data typically will not fit into the RAM of a single host, even if using 0.5 TB RAM EC2 instances (fitting OLS regression models). Therefore we would like to:
Leverage PySpark for Spark's traditional capabilities (distributing the data, handling worker failures transparently, etc.)
But also leverage C/C++-based numerical computing for doing the actual math on the workers
The leading path seems to be to leverage Apache Arrow with PySpark, and to use Pandas functions backed by NumPy (in turn written in C) for the vector products. However, I would like to load the data directly into Arrow format on the Spark workers. The existing PySpark/Pandas/Arrow documentation seems to imply that the data is in fact loaded into Spark's internal representation first and only converted into Arrow when Pandas UDFs are called: https://spark.apache.org/docs/3.0.1/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size
I found one related paper in which the authors developed a zero-copy Arrow-based interface for Spark, so I take it that this is a highly custom thing that is not currently supported in Spark: https://users.soe.ucsc.edu/~carlosm/dev/publication/rodriguez-arxiv-21/rodriguez-arxiv-21.pdf
I would like to ask if anyone knows of a simple way other than what is described in this paper. Thank you!
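For reference, the path described in the linked documentation looks roughly like the sketch below: the DataFrame lives in Spark's internal format, and each batch is converted to Arrow (and then to pandas) only when the pandas UDF runs. The column name and the toy computation are made up for illustration.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
             .getOrCreate())

    # The data starts out in Spark's internal representation...
    df = spark.range(1_000_000).withColumnRenamed("id", "x")

    @pandas_udf("double")
    def scaled(x: pd.Series) -> pd.Series:
        # ...and only here does each batch arrive as an Arrow-backed pandas Series.
        return x * 2.0  # stand-in for the NumPy-backed vector math

    df.select(scaled("x")).show(5)

Skipping the initial load into Spark's representation entirely appears to require the kind of custom zero-copy interface the cited paper describes.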

How to better structure linear algebra-heavy code in PySpark?

I need some suggestions on scaling a Spark pipeline that performs collaborative filtering for about 200k-1M people, but does so in groups, with the largest group being approx. 40-50k customers. In addition to the collaborative filtering, which is reasonably fast with ALS, there is a lot of linear algebra that I couldn't figure out how to perform with the Spark DataFrame API, so I had to drop down to the RDD API, which leads to a significant loss in performance. I currently have multiple variations of this script - in Scala, PySpark, and plain Python - and by far the fastest, despite not being distributed/parallelized, is the Python one, where I use NumPy for all linear algebra tasks and plain Python for the remaining transformations.
So, to summarize, I've got a pipeline with a lot of complicated linear algebra that Spark doesn't seem to have performant native data structures for, and the workarounds I've devised - RDD-level manipulations for most operations, parallelizing and broadcasting the RDDs to perform matmul in chunks, etc. - are significantly slower than just performing the operations in memory with NumPy.
I've got a couple of ideas on how to scale this, but they are a bit hacky, so I was hoping that somebody more experienced could pitch in.
Keep the entire script in Python and use Dask to distribute the processing of the various customer groups in parallel across the cluster.
Keep the entire script in Python, but run it using PySpark, keeping a pandas UDF as the entry/exit point for the various Python functions. However, since pandas UDFs are limited to a single input and output DataFrame while my analysis requires multiple datasets, I need some workarounds. Here's what I've figured out (see the sketch after this question):
Read all datasets into PySpark. All relevant datasets have the same number of rows, indexed by customer and other attributes, so I'll concatenate each row of a dataset into a single array column. Basically, the 3-4 datasets become 3-4 array columns in a consolidated dataset, plus a customer index.
Transfer this across to Python via a pandas UDF.
Extract all relevant datasets from this combined structure in Python, perform all the operations (around 1000 LOC), reassemble the outputs into a structure similar to the input, and transfer it back to PySpark.
Since I used a pandas UDF, computations across all groups should occur in parallel. This then becomes akin to running Dask-like distributed compute via PySpark.
Extract all the data from this consolidated array, map types, and save via PySpark.
This is extremely hacky and has a few downsides, but I think it'll do the job. I realize that I won't really be able to debug the Python UDF code easily, which will be an irritant, and the solution is still fundamentally limited by the size of the largest single executor I can get, but despite that it'll likely perform better than native PySpark/Scala code.
Any suggestions on how to better structure this, or ideas on how to do linear algebra faster natively in PySpark, would be greatly appreciated.
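A minimal sketch of the pandas UDF entry/exit point from option 2, using groupBy().applyInPandas; the column names, the toy group data, and the placeholder linear algebra are hypothetical stand-ins for the real pipeline.

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy consolidated frame: one row per customer, each source dataset packed
    # into an array column, plus the group key used to split the work.
    sdf = spark.createDataFrame(
        [("a", 1, [1.0, 2.0], [0.1, 0.2]),
         ("a", 2, [3.0, 4.0], [0.3, 0.4]),
         ("b", 3, [5.0, 6.0], [0.5, 0.6]),
         ("b", 4, [7.0, 8.0], [0.7, 0.8])],
        "group string, customer long, ratings array<double>, features array<double>")

    def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Unpack the array columns back into dense NumPy matrices...
        ratings = np.vstack(pdf["ratings"].to_numpy())
        features = np.vstack(pdf["features"].to_numpy())
        # ...do the in-memory linear algebra for this one customer group...
        scores = (ratings @ features.T).sum(axis=1)
        # ...and repack the results next to the customer index.
        return pd.DataFrame({"customer": pdf["customer"], "score": scores})

    result = (sdf.groupBy("group")
                 .applyInPandas(process_group, schema="customer long, score double"))
    result.show()

Each group arrives in process_group as a single pandas DataFrame, so the groups run in parallel across executors while the roughly 1000 lines of NumPy code stay untouched inside the function.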

How do sklearn models handle large data sets in Python?

I have a 10 GB data set to train a model in sklearn, but my computer only has 8 GB of memory. Do I have any other options besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach, then you're on track. The chunk size may influence your success.
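A minimal sketch of that chunked partial_fit approach, assuming a CSV with a 'label' column; the file name, chunk size, and choice of estimator are placeholders.

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(random_state=0)
    classes = [0, 1]  # partial_fit needs the full set of labels up front

    # Stream the 10 GB file in chunks small enough to fit in 8 GB of RAM.
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        X = chunk.drop(columns=["label"]).to_numpy()
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)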
This link may be useful: Working with big data in python and numpy, not enough ram, how to save partial results on the disc?
Another thing you can do is randomly pick whether or not to keep each row of your CSV file... and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that will allow you to start playing with it using all the algorithms... and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
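A sketch of that row-sampling idea, assuming a purely numeric CSV; the file names and keep probability are made up.

    import random
    import numpy as np
    import pandas as pd

    keep_probability = 0.10  # keep roughly 10% of the rows

    # Decide row by row whether to keep it while streaming the CSV (row 0 is the header).
    sample = pd.read_csv(
        "big_dataset.csv",
        skiprows=lambda i: i > 0 and random.random() > keep_probability)

    # Cache the sample as a .npy file so later runs load it almost instantly.
    np.save("sample.npy", sample.to_numpy())
    data = np.load("sample.npy")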

Performance of DIM1 Repa Array vs Vector

I've written a program to process a large amount of data samples using Repa. Performance is key for this program. A large part of the operations require parallel maps/folds over multi-dimensional arrays, and Repa is perfect for this. However, there is still a part of my program that only uses one-dimensional arrays and doesn't require parallelism (i.e. the overhead of parallelism would harm performance). Some of these operations require functions like take or folds with custom accumulators, which Repa doesn't support, so I'm writing these operations myself by iterating over the Repa array.
Am I better off rewriting these operations to use Vector instead of Repa? Would that result in better performance?
I've read somewhere that one-dimensional Repa arrays are implemented as Vectors 'under the hood' so I doubt that Vectors result in better performance. On the other hand, Vector does have some nice built-in functions that I could use instead of writing them myself.
I've implemented some parts of my program with Data.Vector.Unboxed instead of using one-dimensional Data.Array.Repa. Except for some minor improvements, the algorithms are the same. Data.Vector.Unboxed seems to be 4 times faster than one-dimensional Data.Array.Repa for sequential operations.
