How to parallelize the processing of the DataFrames returned by polars partition_by in Python?

I have a polars DataFrame and a complex function that processes parts of it.
def processing_df(df):
    ...
    ...
    return result # np.array for example
df_gr = df_sourse.partition_by(groups="group_col", maintain_order=True)
results = []
for df in df_gr:
    res = processing_df(df)
    results.append(res)
Is it possible to parallelize the processing of the list of DataFrames returned by partition_by? polars already parallelizes its own computations, but as I understand it, it will not run the processing of the individual DataFrames in the list in parallel. I would like to speed this processing up.
P.S. I tried these approaches, but the processes did not run.
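One possible way to run such a loop concurrently is a thread pool from the standard library. This is only a sketch under stated assumptions: plain lists stand in for the DataFrames returned by partition_by, processing_part is a hypothetical placeholder for processing_df, and max_workers=4 is an arbitrary choice. Threads only help when the processing function spends most of its time in code that releases the GIL, which depends on what the function actually does.

```python
from concurrent.futures import ThreadPoolExecutor

def processing_part(part):
    # Hypothetical stand-in for processing_df: any per-partition computation.
    return sum(part) / len(part)

# Plain lists stand in for the list of DataFrames from partition_by.
parts = [[1, 2, 3], [10, 20], [5]]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in input order, so the order of the
    # partitions is preserved just as in the sequential loop.
    results = list(pool.map(processing_part, parts))

print(results)
```

Because the results come back in input order, the sketch can replace the sequential loop without changing how results is consumed afterwards.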

Related

the usage of aggregate(0, lambda,lambda) in pyspark

Here is a pyspark code segment:
seqOp = (lambda x,y: x+y)
sum_temp = df.rdd.map(lambda x: len(x.timestamp)).aggregate(0, seqOp, seqOp)
The output of sum_temp is a numerical value, but it is not clear to me how aggregate(0, seqOp, seqOp) works. I thought that an aggregate normally takes a single function, such as "avg".
Moreover, df.rdd.map(lambda x: len(x.timestamp)) is of type pyspark.rdd.PipelinedRDD. How can we get its contents?
According to the docs, the aggregation process:
Starts from the first argument as the zero-value (0),
Then each partition of the RDD is aggregated using the second argument, and
Finally the aggregated partitions are combined into the final result using the third argument. Here, you sum up each partition, and then you sum up the sums from each partition into the final result.
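The three steps above can be modelled in plain Python. This is a sketch of the semantics only, not PySpark itself: the aggregate function and the partitions list below are hypothetical stand-ins for RDD.aggregate and for the way an RDD is split across partitions.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    # Steps 1 and 2: fold each partition on its own, starting from the zero value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Step 3: merge the per-partition results, again starting from the zero value.
    return reduce(comb_op, partials, zero_value)

seq_op = lambda x, y: x + y

# Hypothetical data: the lengths produced by the map step, spread over 3 partitions.
partitions = [[3, 5], [2], [4, 1, 6]]

sum_temp = aggregate(partitions, 0, seq_op, seq_op)
print(sum_temp)  # 21
```

With addition used for both operators and 0 as the zero value, the result is simply the sum of all elements, whatever the partitioning.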
You might have confused this aggregate with the aggregate method of dataframes. RDDs are lower-level objects and you cannot use dataframe aggregation methods here, such as avg/mean/etc.
To get the contents of the RDD, you can call rdd.take(1) to look at the first element, or rdd.collect() to retrieve the whole RDD (mind that this collects all the data onto the driver and could cause memory errors if the RDD is huge).

Why does a distinct count after a SortMerge join on a Spark dataframe of Vectors give non-deterministic results?

A Spark inner SortMerge join on a column containing an ml Vector appears to give non-deterministic and inaccurate results on larger datasets.
I was using the approxNearestNeighbors method of BucketedRandomProjectionLSH in Spark v2.4.3, and discovered that it gives different numbers of results for large datasets.
This problem only appears when executing a SortMerge join; a broadcast join gives the same result every time.
I tracked the problem down to the join on the LSH hash keys. A reproducible example is below...
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// create a large dataframe containing an Array (length two) of ml Vectors of length one
val df = Seq.fill(30000)(
  (java.util.UUID.randomUUID.toString,
    Seq(
      Vectors.dense(Random.nextInt(10).toDouble),
      Vectors.dense(Random.nextInt(10).toDouble))
  )
).toDF
// ensure it's cached
df.cache.count()
// positional explode the vector column
val dfExploded = df.select(col("*"), posexplode(col("_2")))
// now self join on the exploded 'col' and 'pos' fields
dfExploded.join(dfExploded, Seq("pos","col")).drop("pos","col").distinct.count
Different result each time...
scala> dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res139: Long = 139663581
scala> dfExploded.join(dfExploded,Seq("pos","col")).drop("pos","col").distinct.count
res140: Long = 156349630

Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However, because of their size, a simple crossjoin fails. I would like to partition the data before performing the crossjoin, and I am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read the pieces into dataframes: df1A, df1B, df1C. Manually split file f2 into four and read the pieces into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B,..,df1A X df2D,...,df1C X df2D. Save each cross join in a file and manually put all the files together. This way Spark can perform each cross join in parallel and things should complete fairly quickly.
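The manual scheme above can be sketched in plain Python to show that the 3 x 4 block-wise cross joins together reproduce the full cross join. This is an illustration of the idea only, not Spark code: the split helper and the small f1 and f2 lists are hypothetical stand-ins for the files and dataframes.

```python
from itertools import product

def split(rows, n_chunks):
    # Cut rows into n_chunks contiguous pieces of near-equal size.
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

f1 = list(range(7))        # stand-in for the rows of file f1
f2 = list("abcdefghij")    # stand-in for the rows of file f2

blocks = []
for part1, part2 in product(split(f1, 3), split(f2, 4)):
    # Each block is an independent, smaller cross join (df1A X df2A, ...).
    blocks.append(list(product(part1, part2)))

# "Manually putting the files together" is a plain concatenation of the blocks.
combined = [pair for block in blocks for pair in block]
print(len(blocks), len(combined))
```

Each block depends only on one piece of each input, which is why the blocks can be computed and saved independently of each other.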
Question
Is there a more efficient way of accomplishing this by reading both files into two dataframes, then partitioning the dataframes into 3 and 4 "pieces" respectively, and cross joining each partition of one dataframe with every partition of the other?
A dataframe can be partitioned either by range or by hash.
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result = partitionedByRange1.crossJoin(partitionedByRange2)
NOTE: set the property spark.sql.crossJoin.enabled=true
You can convert the dataframes into RDDs and then use the cartesian operation on them. You should then be able to save the resulting RDD to a file. Hope that helps.

How to pass output of one function to another in Spark

I am passing the output of one function, which is a dataframe, to another function.
val df1 = fun1
val df11 = df1.collect
val df2 = df11.map(x => fun2(x, df3))
The lines above are written in the main function. df1 is very large, so if I call collect, the driver runs out of memory or hits GC issues.
What are the ways to pass the output of one function to another in Spark?
Spark can run the data processing for you. You don't need the intermediate collect step. You should just chain all of the transformations together and then add an action at the end to save the resulting data out to disk.
Calling collect() is only useful for debugging very small results.
For example, you could do something like this:
rdd.map(x => fun1(x))
  .map(y => fun2(y))
  .saveAsObjectFile("/path/to/output") // placeholder path; saveAsObjectFile requires an output path
This article might be helpful to explain more about this:
http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/

Linear Regression on Apache Spark

We need to run linear regression on millions of small datasets and store the weights and intercept for each of them. I wrote the Scala code below, in which each dataset is a row of an RDD and I try to run the regression on each row (data is the RDD, which stores (label, features) in each row; in this case there is one feature per label):
val x = data.flatMap { line => line.split(' ')}.map { line =>
  val parts = line.split(',')
  val parsedData1 = LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  val model = LinearRegressionWithSGD.train(sc.parallelize(List(parsedData1)),100)//using parallelize to convert data to type RDD
  (model.intercept,model.weights)
}
The problem is that LinearRegressionWithSGD expects an RDD as input, and nested RDDs are not supported in Spark. I chose this approach because all these datasets can be processed independently of each other, so I wanted to distribute them (which is why I ruled out looping).
Can you please suggest if I can use other types (Arrays, Lists etc) to input as a dataset to LinearRegressionWithSGD or even a better approach which will still distribute such computations in Spark?
val modelList = for {item <- dataSet} yield {
  val data = MLUtils.loadLibSVMFile(context, item).cache()
  val model = LinearRegressionWithSGD.train(data)
  model
}
Maybe you can split your input data into several files and store them in HDFS.
With the files in that directory as input, you can get a list of models.
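Because each dataset in the question is small and has a single feature, another option is to fit every dataset locally with closed-form ordinary least squares, so that no nested RDD is needed. This is a different technique from the SGD-based training used above, and the sketch below is plain Python under stated assumptions: fit_line and the datasets list are hypothetical, and the list comprehension stands in for a distributed map over the datasets.

```python
def fit_line(points):
    # points: list of (feature, label) pairs for one small dataset.
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    weight = sxy / sxx  # undefined when all features in the dataset are equal
    intercept = mean_y - weight * mean_x
    return intercept, weight

# Hypothetical datasets, one list of (feature, label) pairs per dataset.
datasets = [
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],  # label = 2 * feature + 1
    [(0.0, 4.0), (2.0, 0.0)],              # label = -2 * feature + 4
]

# One independent fit per dataset; each call touches only its own data.
models = [fit_line(d) for d in datasets]
print(models)
```

Since fit_line uses only local data, it could be applied per element of a distributed collection, though the closed-form solution will generally not match an SGD result exactly.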
