Confusion about data location when applying scikit-learn on a cluster (Dask)

I'm currently working on moving a machine learning workflow (scikit-learn) from a single machine to a Slurm cluster via Dask. According to some tutorials (e.g. https://examples.dask.org/machine-learning/scale-scikit-learn.html), it's quite simple using joblib.parallel_backend('dask'). However, the location of the data that is read in confuses me, and none of the tutorials mention it. Should I use dask.dataframe to read in the data to make sure it is passed to the cluster, or does it not matter if I just read it in with pandas (in which case the data is stored in the RAM of whichever machine runs the Jupyter notebook)?
Thank you very much.

If your data is small enough (which it is in the tutorial), and the preprocessing steps are rather trivial, then it is okay to read it in with pandas. This will read the data into your local session, not yet onto any of the Dask workers. Once you call joblib.parallel_backend('dask'), the data will be copied to each worker process and the scikit-learn work will be done there.
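A minimal sketch of that pattern, assuming a dask.distributed scheduler is already running (the scheduler address, file name, and column names are hypothetical):

import joblib
import pandas as pd
from dask.distributed import Client
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client("tcp://scheduler:8786")  # connect to the running cluster

df = pd.read_csv("train.csv")  # read locally, into the notebook's RAM
X, y = df.drop(columns="label"), df["label"]

search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]})
with joblib.parallel_backend("dask"):
    search.fit(X, y)  # the fits are farmed out to the Dask workers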
If your data is large, or you have intensive preprocessing steps, it's best to "load" the data with Dask, and then use Dask's built-in preprocessing and grid search where possible. In that case the data will actually be loaded directly by the workers, because of Dask's lazy execution paradigm. Dask's grid search will also cache repeated steps of the cross-validation, which can speed up computation immensely. More can be found here: https://ml.dask.org/hyper-parameter-search.html
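A hedged sketch of the large-data path with dask.dataframe and dask-ml (paths and column names are again hypothetical):

import dask.dataframe as dd
from dask_ml.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

ddf = dd.read_csv("data/train-*.csv")  # lazy: each worker reads its own partitions
X = ddf.drop(columns="label").to_dask_array(lengths=True)
y = ddf["label"].to_dask_array(lengths=True)

search = GridSearchCV(SGDClassifier(), {"alpha": [1e-4, 1e-3, 1e-2]})
search.fit(X, y)  # repeated pipeline steps are cached across CV candidates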

Related

Speeding up dataframe.to_excel operations with a GPU

I am working on extracting some data where I constantly need to manipulate part of the fetched data and then append it to another dataframe that contains the combined dataset. I constantly save the dataframe using dataframe.to_excel. Since there is a lot of data, this has become a time-consuming operation: reading the previous file, appending, and saving it again, in spite of ample CPU and RAM. I am using GCP, an N1 type with 8 vCPUs and 30 GB of memory. Moreover, since I am running various instances of the same script for various projects together, would using a GPU speed these things up?
I never did it myself, but I think this is possible using some pandas alternative.
I found this thread in which users seem to provide some solutions to a similar question.
I too have not tried this, but I can offer a couple of suggestions:
Rather than to_excel, try to_csv; there will probably be small gains.
You can try the library https://github.com/modin-project/modin, which seems to make reads and operations faster, but I am not sure about the write operations.
Or you could move the to_excel call to a different function and perform that operation by spinning up a new thread (see the sketch below).
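A hypothetical sketch of that last suggestion: offload the slow to_excel write to a background thread so the main loop keeps fetching and appending (the function name and file path are made up):

import threading
import pandas as pd

def save_async(df: pd.DataFrame, path: str) -> threading.Thread:
    # Write df to Excel on a background thread; the caller keeps working.
    t = threading.Thread(target=df.to_excel, args=(path,), kwargs={"index": False})
    t.start()
    return t

combined = pd.DataFrame({"a": [1, 2, 3]})
writer = save_async(combined, "combined.xlsx")
# ... continue fetching and manipulating data here ...
writer.join()  # make sure the last write finished before exiting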

Possible ways of comparing large numbers of records between the same table on two different databases

I am looking into ways of comparing records from the same table on different databases. I just need to compare them and find the missing records.
I tried out a few methods.
Loading the records into a pandas dataframe with read_sql. But it takes a lot of time and memory to complete the load, and if the record set is large, I get a memory error.
Setting up a standalone Spark cluster and running the comparison there; it throws a Java heap space error, and tuning the conf does not help either.
Please let me know if there are other ways to handle this huge record comparison.
-- Update
Is there a tool readily available for cross-data-source comparison?
If your data size is huge, you can use cloud services to run your Spark job and get the results. Here you can use AWS Glue, which is serverless and charged as you go.
Or, if your data is not considerably large and it is a one-time job, you can use Google Colab, which is free, and run your comparison on it.
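A hedged sketch of what the Spark job itself might look like: read both tables over JDBC and find rows present in one but missing from the other with a left_anti join (URLs, credentials, table and key names are all hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-diff").getOrCreate()

a = (spark.read.format("jdbc")
     .option("url", "jdbc:postgresql://host-a/db")
     .option("dbtable", "orders")
     .option("user", "user").option("password", "pass")
     .load())
b = (spark.read.format("jdbc")
     .option("url", "jdbc:mysql://host-b/db")
     .option("dbtable", "orders")
     .option("user", "user").option("password", "pass")
     .load())

missing = a.join(b, on="order_id", how="left_anti")  # in a, not in b
print(missing.count())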

Can the Kafka-Spark Streaming pair be used for both batch and real-time data?

Hi All,
I am currently working on developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions - third-party tools). The existing architecture is old school and mostly uses an RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
But I have been told to use the Kafka-Spark Streaming pair for handling all kinds of data.
If anyone has experience working with the Kafka-Spark Streaming pair for handling all kinds of data, could you please give me brief details on whether this could be a viable solution, and whether it is better than having two different pipelines?
Thanks in advance!
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
Pipeline 1: Sqoop is a good choice for batch loads, but it will be slow because the underlying architecture is still MapReduce. There are options to run Sqoop on Spark, but I haven't tried that. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. That said, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time, and you can do batch transformations in Spark as well. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice and a good one. If you are using the Confluent distribution of Kafka, you could replace most of the Spark transformations with KSQL or Kafka Streams, which handle transformations in real time.
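A minimal sketch of Pipeline 2 with Spark Structured Streaming consuming from Kafka (the broker address, topic, and output paths are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers bytes; cast the payload to a string before transforming.
parsed = events.select(col("value").cast("string").alias("payload"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/data/events")
         .option("checkpointLocation", "/chk/events")
         .start())
query.awaitTermination()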
I would say it's good to have a separate system for batch and one for real time; this is the Lambda architecture. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified framework for both batch and real-time processing and lets you choose from multiple runners to execute your job.
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Which is best between multiple small HDF5 files and one huge one?

I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple workers.
[settings : python, Ubuntu 18.04]
I can't find any answer on which is best in terms of data access and storage between:
registering all the data in one huge HDF5 file (over 20 GB), or
splitting it into multiple (over 16,000) small HDF5 files (approx. 1.4 MB each).
Is there any problem with multiple threads accessing one file? And in the other case, is there an impact from having that number of files?
I would go for multiple files if I were you (but read till the end).
Intuitively, you could load at least some of the files into memory, speeding the process up a little (it is unlikely you would be able to do so with 20 GB; if you are, then you definitely should, as RAM access is much faster).
You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples (from a list, or preferably some more memory-efficient data structure with better cache locality) instead of reading from disk, similar to Tensorflow's tf.data.Dataset object and its cache method; a sketch follows below.
On the other hand, this approach is more cumbersome and harder to implement correctly, though if you are only reading the files with multiple threads you should be fine, and there shouldn't be any locks on this operation.
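A minimal sketch of that caching Dataset, assuming one tile per small HDF5 file stored under a dataset key named "tile" (all names here are hypothetical):

import h5py
import torch
from torch.utils.data import Dataset

class CachedTileDataset(Dataset):
    def __init__(self, file_paths):
        self.file_paths = file_paths
        self.cache = {}  # index -> tensor, filled during the first pass

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        if idx not in self.cache:  # first epoch: read from disk
            with h5py.File(self.file_paths[idx], "r") as f:
                self.cache[idx] = torch.from_numpy(f["tile"][:])
        return self.cache[idx]  # later epochs: served from memory

Note that with num_workers > 0 each DataLoader worker is a separate process holding its own copy of the cache, so the memory cost multiplies accordingly.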
Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.

Spark: Importing Data

I currently have a Spark app that reads a couple of files, forms dataframes out of them, and implements some logic on those dataframes.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check: since all machines on the cluster can access the files (which is a requirement of Spark), is the task of reading in data from these files distributed, so that no single machine is burdened by it?
I was looking at the Spark UI for this app, but since it only shows which actions were performed by which machines, and since sc.textFile(filePath) is not an action, I couldn't be sure which machines perform the read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic as I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the driver. After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
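A minimal PySpark sketch of the distinction the answer draws (the file path and logic are hypothetical; the question's code is Scala, but the idea is the same):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-example").getOrCreate()
sc = spark.sparkContext

# Distributed read: each partition is read by whichever executor runs it.
lines = sc.textFile("/shared/data/trees.txt", minPartitions=8)

# Distributed computation: map/filter run on the executors.
lengths = lines.map(len).filter(lambda n: n > 0)
print(lengths.count())

# collect() pulls everything to the driver as a plain Python list;
# anything done with `local` from here on is single-machine work.
local = lengths.collect()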
