Spark: Transforming multiple dataframes in parallel - apache-spark

Understanding how to achieve best parallelism while transforming multiple dataframes in parallel
I have an array of paths
val paths = Array("path1", "path2", .....
I am loading dataframe from each path then transforming and writing to destination path
paths.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
The transformation processData is independent of dataframe I am loading.
This processes only one dataframe at a time, and most of my cluster resources sit idle. Since each dataframe is processed independently, I converted the Array to a Scala ParArray.
paths.par.foreach(path => {
val df = spark.read.parquet(path)
df.transform(processData).write.parquet(path+"_processed")
})
Now it uses more of the cluster's resources. I am still trying to understand how this works and how to fine-tune the parallel processing here:
If I increase the default Scala parallelism by using a ForkJoinPool with a higher number of threads, can that spawn more threads on the driver side that sit blocked waiting for the foreach function to finish, and eventually kill the driver?
How does it affect centralized Spark components such as EventLoggingListener, which must handle a higher inflow of events when multiple dataframes are processed in parallel?
Which parameters should I consider for optimal resource utilization?
Is there any other approach?
Any resources that would help me understand this scaling would be very helpful.

This is slow because Spark is very good at parallelizing computations on lots of data stored in one big dataframe, but very bad at dealing with lots of dataframes. It starts the computation on one dataframe using all its executors (even though not all of them are needed) and waits for it to finish before starting the next one. This leaves a lot of processors inactive. That is bad, but it is not what Spark was designed for.
I have a hack for you. It may need a little refinement, but you will get the idea. From the list of paths, I would extract the schemas of all the parquet files and create one big schema that gathers all the columns. Then I would ask Spark to read every parquet file using this schema (columns that are not present are set to null automatically). I would then union all the dataframes, perform the transformation on this big dataframe, and finally use partitionBy to store the dataframes in separate directories, while still doing all of it in parallel. It would look like this.
// let's create two sample datasets with one column in common (id)
// and two different columns (x in the first, y in the second)
val d1 = spark.range(3).withColumn("x", 'id * 10)
d1.show
+---+----+
| id| x |
+---+----+
| 0| 0|
| 1| 10|
| 2| 20|
+---+----+
val d2 = spark.range(2).withColumn("y", 'id cast "string")
d2.show
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 1|
+---+---+
// And I store them
d1.write.parquet("hdfs:///tmp/d1.parquet")
d2.write.parquet("hdfs:///tmp/d2.parquet")
// Now let's create the big schema
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val paths = Seq("hdfs:///tmp/d1.parquet", "hdfs:///tmp/d2.parquet")
val fields = paths
  .flatMap(path => spark.read.parquet(path).schema.fields)
  .toSet // removing duplicates
  .toArray
val big_schema = StructType(fields)
// and let's use it
val dfs = paths.map{ path =>
spark.read
.schema(big_schema)
.parquet(path)
.withColumn("path", lit(path.split("/").last))
}
// Then we are ready to create one big dataframe
dfs.reduce( _ unionAll _).show
+---+----+----+----------+
| id|   x|   y|      path|
+---+----+----+----------+
|  1|  10|null|d1.parquet|
|  2|  20|null|d1.parquet|
|  0|   0|null|d1.parquet|
|  0|null|   0|d2.parquet|
|  1|null|   1|d2.parquet|
+---+----+----+----------+
Yet, I do not recommend using unionAll on lots of dataframes. Because of spark's analysis of the execution plan, it can be very slow with many dataframes. I would use the RDD version although it is more verbose.
val rdds = sc.union(dfs.map(_.rdd))
// let's not forget to add the path to the schema
val big_df = spark.createDataFrame(rdds,
  big_schema.add(StructField("path", StringType, true)))
// processData is the transformation from the question
big_df.transform(processData)
  .write
  .partitionBy("path")
  .parquet("hdfs:///tmp/processed.parquet")
And having a look at my processed directory, I get this:
hdfs:///tmp/processed.parquet/_SUCCESS
hdfs:///tmp/processed.parquet/path=d1.parquet
hdfs:///tmp/processed.parquet/path=d2.parquet
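To make the idea easier to follow without a cluster, here is a minimal plain-Python sketch of the same three steps: build a superset schema, pad missing columns with nulls while tagging each row with its source, then split the combined rows by that tag. It uses no Spark at all; the `files` dictionary is a made-up stand-in for the two parquet files above.

```python
# Toy "files": each maps a file name to rows with its own set of columns.
files = {
    "d1.parquet": [{"id": 0, "x": 0}, {"id": 1, "x": 10}, {"id": 2, "x": 20}],
    "d2.parquet": [{"id": 0, "y": "0"}, {"id": 1, "y": "1"}],
}

# Superset of all columns, in first-seen order.
big_schema = []
for rows in files.values():
    for row in rows:
        for column in row:
            if column not in big_schema:
                big_schema.append(column)

# One combined table: pad missing columns with None and tag the source.
combined = [
    {**{c: row.get(c) for c in big_schema}, "path": name}
    for name, rows in files.items()
    for row in rows
]

# Rough equivalent of partitionBy("path"): split the combined rows by their tag.
by_path = {}
for row in combined:
    by_path.setdefault(row["path"], []).append(row)

print(big_schema)
print(sorted(by_path))
```

The point of the design is that the transformation runs once over `combined` instead of once per file, which is what lets Spark schedule all the work as a single job.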

You should play with a few variables here. The most important are the number of CPU cores, the size of each DF, and a careful use of futures. The purpose is to decide the priority with which each DF is processed. You can use the FAIR scheduler configuration, but that won't be enough on its own, and processing everything in parallel could consume a large part of your cluster. You have to assign priorities to the DFs and use a Future pool to control the number of parallel jobs running in your app.
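As a rough illustration of the "pool of futures" idea, here is a plain-Python sketch that caps the number of jobs in flight. It is not Spark code: `process_path` is a hypothetical stand-in for the read, transform and write of one path, and `max_parallel_jobs` is an assumed tuning knob, not a Spark setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_path(path):
    """Hypothetical stand-in for one job (read -> transform -> write)."""
    return path + "_processed"


def run_bounded(paths, max_parallel_jobs):
    """Run process_path over paths with at most max_parallel_jobs in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        # Submission order doubles as priority: earlier paths start first.
        futures = {pool.submit(process_path, p): p for p in paths}
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker thread.
            results[futures[future]] = future.result()
    return results


paths = ["path1", "path2", "path3", "path4"]
outputs = run_bounded(paths, max_parallel_jobs=2)
print(outputs)
```

Sorting `paths` before submitting (for example, largest input first) is one simple way to express priorities with this structure.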

Related

PySpark UDF issues when referencing outside of function

I am facing an issue where I get the error
TypeError: cannot pickle '_thread.RLock' object
when I try to run the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),
('Mike','Williams','M',77),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
def find_n_people_with_higher_age(x):
    return df_2.filter(df_2['age']>=x).count()
find_n_people_with_higher_age_udf = udf(find_n_people_with_higher_age, IntegerType())
df_1.select(find_n_people_with_higher_age_udf(col('category_id')))
Here's a good article on Python UDFs.
I use it as a reference because I suspected that you were running into a serialization issue. I'm quoting the entire paragraph for context, but serialization is really the issue here.
Performance Considerations
It’s important to understand the performance implications of Apache
Spark’s UDF features. Python UDFs for example (such as our CTOF
function) result in data being serialized between the executor JVM and
the Python interpreter running the UDF logic – this significantly
reduces performance as compared to UDF implementations in Java or
Scala. Potential solutions to alleviate this serialization bottleneck
include:
If you consider what you are asking, you may see why this doesn't work. You are asking for all the data in df_2 to be shipped (serialized) to each executor, which would then serialize it again and hand it to Python to be interpreted. DataFrames can't be serialized, so that's your issue. Even if they could be, you would be sending an entire dataframe to every executor. That isn't a problem with your sample data, but with trillions of records it would blow up the JVM.
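The error message itself can be reproduced without Spark. The sketch below assumes that the captured DataFrame ultimately references an object holding a thread lock; `FakeFrame` is a made-up class standing in for it, not a PySpark type. Pickling the object fails, while pickling plain values pulled out of it works.

```python
import pickle
import threading


class FakeFrame:
    """Hypothetical stand-in for an object tied to a live session."""

    def __init__(self, rows):
        self.rows = rows
        self._lock = threading.RLock()  # locks cannot be pickled


frame = FakeFrame([15, 33, 77])

try:
    pickle.dumps(frame)
    pickled_ok = True
except TypeError as err:
    pickled_ok = False
    print("pickling failed:", err)

# Plain values extracted from the object serialize without trouble.
payload = pickle.dumps(frame.rows)
restored = pickle.loads(payload)
print(restored)
```

This is why a UDF can close over small plain values (numbers, lists, dicts) but not over a DataFrame.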
What you're asking is doable; I just need to figure out how to do it. A window or group by would likely do the trick.
Here it is with some additional data:
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
data_1 = [('James','Smith','M',30),('Anna','Rose','F',41),
('Robert','Williams','M',62),
]
# add more data to make it more interesting.
data_2 = [('Junior','Smith','M',15),('Helga','Rose','F',33),('Gia','Rose','F',34),
('Mike','Williams','M',77), ('John','Williams','M',77), ('Bill','Williams','F',79),
]
columns = ["firstname","lastname","gender","age"]
df_1 = spark.createDataFrame(data=data_1, schema = columns)
df_2 = spark.createDataFrame(data=data_2, schema = columns)
# dataframe to help fill in missing ages
ref = spark.range( 1, 110, 1).toDF("numbers").withColumn("count", lit(0)).withColumn("rolling_Count", lit(0))
countAges = df_2.groupby("age").count()
#this actually give you the short list of ages
rollingCounts = countAges.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#fill in missing ages and remove duplicates
filled = rollingCounts.union(ref).groupBy("age").agg(sum("count").alias("count"))
#add a rolling count across all ages
allAgeCounts = filled.withColumn("rolling_Count", sum(col("count")).over(Window.partitionBy().orderBy(col("age").desc())))
#do inner join because we've filled in all ages.
df_1.join(allAgeCounts, df_1.age == allAgeCounts.age, "inner").show()
+---------+--------+------+---+---+-----+-------------+
|firstname|lastname|gender|age|age|count|rolling_Count|
+---------+--------+------+---+---+-----+-------------+
| Anna| Rose| F| 41| 41| 0| 3|
| Robert|Williams| M| 62| 62| 0| 3|
| James| Smith| M| 30| 30| 0| 5|
+---------+--------+------+---+---+-----+-------------+
I wouldn't normally want to use a window over an entire table, but here it iterates over fewer than 110 rows, so this is reasonable.
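For readers who want to check the numbers, here is the same logic as a plain-Python sketch: count people per age, walk the ages from high to low keeping a running (suffix) sum, then look up each age from the first dataset. The function name and the `max_age` default are my own choices, mirroring `spark.range(1, 110, 1)` above.

```python
from collections import Counter

ages_2 = [15, 33, 34, 77, 77, 79]  # ages from data_2
ages_1 = [30, 41, 62]              # ages from data_1


def counts_at_or_above(ages, max_age=110):
    """Return {age: how many entries of ages are >= age} for 1..max_age-1."""
    per_age = Counter(ages)
    rolling = {}
    running = 0
    # Walk ages from high to low, accumulating a suffix sum.
    for age in range(max_age - 1, 0, -1):
        running += per_age.get(age, 0)
        rolling[age] = running
    return rolling


rolling_count = counts_at_or_above(ages_2)
result = {age: rolling_count[age] for age in ages_1}
print(result)  # {30: 5, 41: 3, 62: 3}
```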

Spark StringIndexer.fit is very slow on large records

I have large data records formatted as the following sample:
// +---+------+------+
// |cid|itemId|bought|
// +---+------+------+
// |abc| 123| true|
// |abc| 345| true|
// |abc| 567| true|
// |def| 123| true|
// |def| 345| true|
// |def| 567| true|
// |def| 789| false|
// +---+------+------+
cid and itemId are strings.
There are 965,964,223 records.
I am trying to convert cid to an integer using StringIndexer as follows:
dataset.repartition(50)
val cidIndexer = new StringIndexer().setInputCol("cid").setOutputCol("cidIndex")
val cidIndexedMatrix = cidIndexer.fit(dataset).transform(dataset)
But these lines of code are very slow (around 30 minutes). The dataset is so large that I could not do anything further after that.
I am using an Amazon EMR cluster with 2 R4 2XLarge nodes (61 GB of memory).
Is there any further performance improvement I can make? Any help will be much appreciated.
That is expected behavior if the cardinality of the column is high. As part of the training process, StringIndexer collects all the labels and creates a label-to-index mapping (using Spark's o.a.s.util.collection.OpenHashMap).
This process requires O(N) memory in the worst case, and is both computationally and memory intensive.
In cases where the cardinality of the column is high and its content is going to be used as a feature, it is better to apply FeatureHasher (Spark 2.3 or later).
import org.apache.spark.ml.feature.FeatureHasher
val hasher = new FeatureHasher()
  .setInputCols("cid")
  .setOutputCol("cid_hash_vec")
hasher.transform(dataset)
It doesn't guarantee uniqueness and it is not reversible, but it is good enough for many applications, and it doesn't require a fitting process.
For a column that won't be used as a feature you can also use the hash function:
import org.apache.spark.sql.functions.hash
dataset.withColumn("cid_hash", hash($"cid"))
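The difference between the two approaches can be sketched in plain Python. Indexing needs a pass that holds every distinct label in memory, while hashing maps each value on its own. This is only an illustration: `zlib.crc32` is a stand-in (Spark's hash uses Murmur3), `NUM_BUCKETS` is an assumed value, and the index here is assigned in first-seen order for simplicity rather than by frequency.

```python
import zlib

cids = ["abc", "abc", "abc", "def", "def", "def", "def"]

# Indexing: a first pass must hold every distinct label in memory.
label_to_index = {}
for cid in cids:
    label_to_index.setdefault(cid, len(label_to_index))
indexed = [label_to_index[cid] for cid in cids]

# Hashing: stateless, each value is mapped independently of the others.
NUM_BUCKETS = 1 << 18  # assumed bucket count, chosen for illustration


def bucket(cid):
    """Map a string to a bucket; collisions are possible and not reversible."""
    return zlib.crc32(cid.encode("utf-8")) % NUM_BUCKETS


hashed = [bucket(cid) for cid in cids]

print(indexed)
print(len(label_to_index), "labels kept in memory for indexing")
```

With close to a billion records and a high-cardinality column, the size of `label_to_index` is what hurts; the hashing path has no such state.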
Assuming that:
You plan to use the cid as a feature (after StringIndexer + OneHotEncoderEstimator)
Your data sits in S3
A few questions first:
How many distinct values do you have in the cid column?
What's the data format (e.g. Parquet, Csv, etc...) and is it splittable?
See: https://community.hitachivantara.com/s/article/hadoop-file-formats-its-not-just-csv-anymore
Without knowing much more, my first guess is that you should not worry about memory now and check your degree of parallelism first. You only have 2 R4 2XLarge instances that will give you:
8 CPUs
61GB Memory
Personally, I would try to either:
Get more instances
Swap the R4 2XLarge instances with others that have more CPUs
Unfortunately, with the current EMR offering this can only be achieved by throwing money at the problem:
https://aws.amazon.com/ec2/instance-types/
https://aws.amazon.com/emr/pricing/
Finally, what's the need to repartition(50)? That might just introduce further delays...

Aggregating several fields simultaneously from Dataset

I have data with the following schema:
sourceip
destinationip
packets sent
I want to calculate several aggregate fields from this data and end up with the following schema:
ip
packets sent as sourceip
packets sent as destination
In the happy days of RDDs I could use aggregate, define a map of {ip -> []}, and count the appearances in a corresponding array location.
With Dataset/DataFrame, aggregate is no longer available. A UDAF could be used instead but, in my experience, UDAF buffers are immutable, which makes them impractical here (a new instance has to be created on every map update); example + explanation here.
Technically, I could convert the Dataset to an RDD, aggregate, and convert back to a Dataset, but I expect that would degrade performance, as Datasets are better optimized. UDAFs are out of the question because of the copying.
Is there any other way to perform aggregations?
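It sounds like you need a standard melt (How to melt Spark DataFrame?) and pivot combination. Note that melt is not built in with the signature used below; the call assumes a helper like the one from the linked answer is in scope: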
val df = Seq(
("192.168.1.102", "192.168.1.122", 10),
("192.168.1.122", "192.168.1.65", 10),
("192.168.1.102", "192.168.1.97", 10)
).toDF("sourceip", "destinationip", "packets sent")
df.melt(Seq("packets sent"), Seq("sourceip", "destinationip"), "type", "ip")
.groupBy("ip")
.pivot("type", Seq("sourceip", "destinationip"))
.sum("packets sent").na.fill(0).show
// +-------------+--------+-------------+
// | ip|sourceip|destinationip|
// +-------------+--------+-------------+
// | 192.168.1.65| 0| 10|
// |192.168.1.102| 20| 0|
// |192.168.1.122| 10| 10|
// | 192.168.1.97| 0| 10|
// +-------------+--------+-------------+
One way to go about it without any custom aggregation would be to use flatMap (or explode for dataframes) like this:
case class Info(ip : String, sent : Int, received : Int)
case class Message(from : String, to : String, p : Int)
val ds = Seq(Message("ip1", "ip2", 5),
Message("ip2", "ip3", 7),
Message("ip2", "ip1", 1),
Message("ip3", "ip2", 3)).toDS()
ds
.flatMap(x => Seq(Info(x.from, x.p, 0), Info(x.to, 0, x.p)))
.groupBy("ip")
.agg(sum('sent) as "sent", sum('received) as "received")
.show
// +---+----+--------+
// | ip|sent|received|
// +---+----+--------+
// |ip2| 8| 8|
// |ip3| 3| 7|
// |ip1| 5| 1|
// +---+----+--------+
As far as the performance is concerned, I am not sure a flatMap is an improvement versus a custom aggregation though.
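The logic of the flatMap approach is easy to check in plain Python: each message becomes two records, one crediting the sender and one crediting the receiver, and the records are then summed per ip. This is only a sketch of the logic with made-up helper names, not Spark code.

```python
from collections import defaultdict

messages = [("ip1", "ip2", 5), ("ip2", "ip3", 7), ("ip2", "ip1", 1), ("ip3", "ip2", 3)]


def explode(message):
    """Turn one (from, to, packets) message into two (ip, sent, received) records."""
    source, destination, packets = message
    return [(source, packets, 0), (destination, 0, packets)]


# Sum sent and received per ip, as groupBy("ip").agg(sum, sum) would.
totals = defaultdict(lambda: [0, 0])
for message in messages:
    for ip, sent, received in explode(message):
        totals[ip][0] += sent
        totals[ip][1] += received

summary = {ip: tuple(counts) for ip, counts in totals.items()}
print(summary)
```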
Here is a PySpark version using explode. It is more verbose, but the logic is exactly the same as the flatMap version, using only DataFrame code.
from pyspark.sql import functions as F
(sc
    .parallelize([("ip1", "ip2", 5), ("ip2", "ip3", 7), ("ip2", "ip1", 1), ("ip3", "ip2", 3)])
    .toDF(("from", "to", "p"))
    # one struct per role: the sender is credited with "sent", the receiver with "received"
    .select(F.explode(F.array(
        F.struct(F.col("from").alias("ip"),
                 F.col("p").alias("sent"),
                 F.lit(0).cast("long").alias("received")),
        F.struct(F.col("to").alias("ip"),
                 F.lit(0).cast("long").alias("sent"),
                 F.col("p").alias("received")))))
    .groupBy("col.ip")
    .agg(F.sum(F.col("col.sent")).alias("sent"), F.sum(F.col("col.received")).alias("received"))
    .show())
# +---+----+--------+
# | ip|sent|received|
# +---+----+--------+
# |ip2|   8|       8|
# |ip3|   3|       7|
# |ip1|   5|       1|
# +---+----+--------+
Since you didn't mention the context and aggregations, you may do something like below,
val df = ??? // your dataframe/ dataset
From Spark source:
(Scala-specific) Compute aggregates by specifying a map from column
name to aggregate methods. The resulting DataFrame will also contain
the grouping columns. The available aggregate methods are avg, max,
min, sum, count.
// Selects the age of the oldest employee and the aggregate expense
// for each department
df
.groupBy("department")
.agg(Map(
"age" -> "max",
"expense" -> "sum"
))

Running a high volume of Hive queries from PySpark

I want to execute a very large number of Hive queries and store the results in a dataframe.
I have a very large dataset structured like this:
+-------------------+-------------------+---------+--------+--------+
| visid_high| visid_low|visit_num|genderid|count(1)|
+-------------------+-------------------+---------+--------+--------+
|3666627339384069624| 693073552020244687| 24| 2| 14|
|1104606287317036885|3578924774645377283| 2| 2| 8|
|3102893676414472155|4502736478394082631| 1| 2| 11|
| 811298620687176957|4311066360872821354| 17| 2| 6|
|5221837665223655432| 474971729978862555| 38| 2| 4|
+-------------------+-------------------+---------+--------+--------+
I want to create a derived dataframe which uses each row as input for a secondary query:
result_set = []
for session in sessions.collect()[:100]:
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date,session['visid_high'],session['visid_low'],session['visit_num'])
    result = hc.sql(query).collect()
    result_set.append(result)
This works as expected for a hundred rows, but causes livy to time out with higher loads.
I tried using map or foreach:
def f(session):
    query = "SELECT prop8,count(1) FROM hit_data WHERE dt = {0} AND visid_high = {1} AND visid_low = {2} AND visit_num = {3} group by prop8".format(date,session.visid_high,session.visid_low,session.visit_num)
    return hc.sql(query)
test = sampleRdd.map(f)
causing PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable. I understand from this answer and this answer that the spark context object is not serializable.
I didn't try generating all queries first, then running the batch, because I understand from this question batch querying is not supported.
How do I proceed?
What I was looking for is:
Querying all required data in one go by writing the appropriate joins
Adding custom columns, based on the values of the large dataframe using pyspark.sql.functions.when() and df.withColumn(), then
Flattening the resulting dataframe with df.groupBy() and pyspark.sql.functions.sum()
I think I didn't fully realize that Spark handles dataframes lazily. The supported way of working is to define large dataframes and then the appropriate transforms. Spark will try to execute the data retrieval and the transforms in one go, at the last second and distributed. I was trying to limit the scope up front, which led to unsupported functionality.
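To show the shape of "one query with a join" versus "one query per row", here is a small sketch using Python's built-in sqlite3 rather than Hive or Spark. The table and column names follow the question, the rows are made up, and the dt filter is left out for brevity. Both approaches return the same counts, but the second issues a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (visid_high INTEGER, visid_low INTEGER, visit_num INTEGER);
    CREATE TABLE hit_data (visid_high INTEGER, visid_low INTEGER, visit_num INTEGER, prop8 TEXT);
    INSERT INTO sessions VALUES (1, 10, 1), (2, 20, 3);
    INSERT INTO hit_data VALUES
        (1, 10, 1, 'a'), (1, 10, 1, 'a'), (1, 10, 1, 'b'),
        (2, 20, 3, 'a'),
        (9, 90, 9, 'z');
""")
# The last hit_data row belongs to no selected session and must not be counted.

# Pattern from the question: one query per session row.
per_row = []
for high, low, num in conn.execute("SELECT * FROM sessions").fetchall():
    rows = conn.execute(
        "SELECT prop8, COUNT(1) FROM hit_data "
        "WHERE visid_high = ? AND visid_low = ? AND visit_num = ? GROUP BY prop8",
        (high, low, num),
    ).fetchall()
    per_row.extend((high, low, num, prop8, n) for prop8, n in rows)

# Single pass: join once, then group by the session key plus prop8.
joined = conn.execute("""
    SELECT s.visid_high, s.visid_low, s.visit_num, h.prop8, COUNT(1)
    FROM sessions s
    JOIN hit_data h
      ON h.visid_high = s.visid_high
     AND h.visid_low = s.visid_low
     AND h.visit_num = s.visit_num
    GROUP BY s.visid_high, s.visid_low, s.visit_num, h.prop8
""").fetchall()

print(sorted(joined))
```

In Spark the joined form also stays lazy and distributed, whereas the per-row loop forces a collect for every session.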

create column with a running total in a Spark Dataset

Suppose we have a Spark Dataset with two columns, say Index and Value, sorted by the first column (Index).
((1, 100), (2, 110), (3, 90), ...)
We'd like to have a Dataset with a third column with a running total of the values in the second column (Value).
((1, 100, 100), (2, 110, 210), (3, 90, 300), ...)
Any suggestions how to do this efficiently, with one pass through the data? Or are there any canned CDF type functions out there that could be utilized for this?
If need be, the Dataset can be converted to a Dataframe or an RDD to accomplish the task, but it will have to remain a distributed data structure. That is, it cannot be simply collected and turned to an array or sequence, and no mutable variables are to be used (val only, no var).
"...but it will have to remain a distributed data structure."
Unfortunately what you've said you seek to do isn't possible in Spark. If you are willing to repartition the data set to a single partition (in effect consolidating it on a single host) you could easily write a function to do what you wish, keeping the incremented value as a field.
Since Spark functions don't share state across the network when they execute, there's no way to create the shared state you would need to keep the data set completely distributed.
If you're willing to relax your requirement and allow the data to be consolidated and read through in a single pass on one host then you may do what you wish by repartitioning to a single partition and applying a function. This does not pull the data onto the driver (keeping it in HDFS/the cluster) but does still compute the output serially, on a single executor. For example:
package com.github.nevernaptitsa
import java.io.Serializable
import java.util
import org.apache.spark.sql.{Encoders, SparkSession}
object SparkTest {
class RunningSum extends (Int => (Int, Int)) with Serializable {
  // mutable state, private to each deserialized copy of this function
  private var runningSum = 0
  override def apply(v1: Int): (Int, Int) = {
    runningSum += v1
    (v1, runningSum)
  }
}
def main(args: Array[String]): Unit ={
val session = SparkSession.builder()
.appName("runningSumTest")
.master("local[*]")
.getOrCreate()
import session.implicits._
session.createDataset(Seq(1,2,3,4,5))
.repartition(1)
.map(new RunningSum)
.show(5)
session.createDataset(Seq(1,2,3,4,5))
.map(new RunningSum)
.show(5)
}
}
The two statements here show different output, the first providing the correct output (serial, because repartition(1) is called), and the second providing incorrect output because the result is computed in parallel.
Results from first statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 3|
| 3| 6|
| 4| 10|
| 5| 15|
+---+---+
Results from second statement:
+---+---+
| _1| _2|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 9|
+---+---+
A colleague suggested the following, which relies on the RDD.mapPartitionsWithIndex() method.
(To my knowledge, the other data structures do not provide this kind of reference to their partitions' indices.)
val data = sc.parallelize((1 to 5)) // sc is the SparkContext
val partialSums = data.mapPartitionsWithIndex{ (i, values) =>
Iterator((i, values.sum))
}.collect().toMap // will in general have size other than data.count
val cumSums = data.mapPartitionsWithIndex{ (i, values) =>
val prevSums = (0 until i).map(partialSums).sum
values.scanLeft(prevSums)(_+_).drop(1)
}
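The two-pass idea can be traced in plain Python, with a list of lists standing in for the RDD's partitions (the particular split below is an assumption for illustration). The first pass collects one partial sum per partition; the second pass has each partition compute its own running total, shifted by the sum of all partitions before it.

```python
from itertools import accumulate

# Assumed partitioning of the values 1..5 across three partitions.
partitions = [[1, 2], [3], [4, 5]]

# Pass 1: one partial sum per partition (small enough to collect centrally).
partial_sums = [sum(part) for part in partitions]

# Offset for partition i = sum of all partitions before it.
offsets = [0] + list(accumulate(partial_sums))[:-1]

# Pass 2: each partition computes its own running total, shifted by its offset.
running_totals = [
    [offset + s for s in accumulate(part)]
    for part, offset in zip(partitions, offsets)
]

flat = [value for part in running_totals for value in part]
print(flat)  # [1, 3, 6, 10, 15]
```

Only the per-partition sums travel to the driver, so the data itself stays distributed; the cost is reading the data twice.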

Resources