I have a dataset of ~8 GB with ~10 million rows (about 10 columns) and wanted to prove the point that SparkR could outperform SQL. On the contrary, I see extremely poor performance from SparkR compared with SQL.
My code simply loads the file from S3, then runs gapply, where my groupings will typically consist of 1-15 rows -- so 10 million rows divided by 15 gives a lot of groups. Am I forcing too much shuffling and serialization/deserialization? Is that why things run so slowly?
To show that my build_transition function is not the performance bottleneck, I created a trivial version called build_transition2, shown below, which returns dummy information with what should be constant execution time per group.
Is there anything fundamentally or obviously wrong with how I have formulated my solution?
build_transition2 <- function(key, x) {
  patient_id <- integer()
  seq_val <- integer()
  patient_id <- append(patient_id, as.integer(1234))
  seq_val <- append(seq_val, as.integer(5678))
  y <- data.frame(patient_id,
                  seq_val,
                  stringsAsFactors = FALSE)
  y
}
dat_spark <- read.df("s3n://my-awss3/data/myfile.csv", "csv", header = "true", inferSchema = "true", na.strings = "NA")
schema <- structType(structField("patient_ID", "integer"),
                     structField("sequence", "integer"))
result <- gapply(dat_spark, "patient_encrypted_id", build_transition2, schema)
"I wanted to prove the point that SparkR could outperform SQL."

That's just not the case. The overhead of indirection caused by the guest language is huge:

Internal Catalyst format
External Java type
Sending data to R
....
Sending data back to JVM
Converting back to Catalyst format
On top of that, gapply is basically an example of group-by-key -- something we normally avoid in Spark.
Overall, gapply should be used if, and only if, the business logic cannot be expressed with standard SQL functions. It is definitely not a way to optimize your code under normal circumstances. There might be borderline cases where it is faster, but in general any special logic, if required, will benefit more from native JVM execution with a Scala UDF, UDAF, Aggregator, or reduceGroups / mapGroups.
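The scale of that per-group overhead is easy to underestimate. A back-of-the-envelope sketch in plain Python (the 1 ms per-group round-trip cost is a purely assumed figure for illustration, not a measured one):

```python
# With ~10 million rows in groups of ~15, gapply must serialize each group,
# ship it to an R worker, evaluate the function, and ship the result back.
rows = 10_000_000
rows_per_group = 15
groups = rows // rows_per_group            # ~666,666 groups

# Hypothetical fixed round-trip cost of just 1 ms per group:
overhead_minutes = groups * 0.001 / 60
print(groups)                              # 666666
print(round(overhead_minutes, 1))          # 11.1 (minutes of pure overhead)
```

Even under that optimistic assumption, the per-group plumbing alone adds minutes before any real work is done.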
Related
I have a simple PySpark script and I would like to benchmark each section.
# section 1: prepare data
df = spark.read.option(...).csv(...)
df.registerTempTable("MyData")
# section 2: Dataframe API
avg_earnings = df.agg({"earnings": "avg"}).show()
# section 3: SQL
avg_earnings = spark.sql("""SELECT AVG(earnings)
FROM MyData""").show()
To generate reliable measurements, one would need to run each section multiple times. My solution using the Python time module looks like this.
import time

for _ in range(iterations):
    t1 = time.time()
    df = spark.read.option(...).csv(...)
    df.registerTempTable("MyData")
    t2 = time.time()
    avg_earnings = df.agg({"earnings": "avg"}).show()
    t3 = time.time()
    avg_earnings = spark.sql("""SELECT AVG(earnings)
                                FROM MyData""").show()
    t4 = time.time()
    write_to_csv(t1, t2, t3, t4)
My question is: how would one benchmark each section? Would you use the time module as well? How would one disable caching for PySpark?
Edit:
Plotting the first 5 iterations of the benchmark shows that PySpark is doing some form of caching.
How can I disable this behaviour?
First, you can't benchmark using show(): it only computes and returns the top 20 rows.
Second, in general, the PySpark DataFrame API and Spark SQL share the same Catalyst optimizer behind the scenes, so what you are doing (.agg vs. AVG()) is essentially equivalent and makes little difference.
Third, benchmarking is usually only meaningful if your data is really big or your operation runs much longer than expected. Other than that, if the runtime difference is only a couple of minutes, it doesn't really matter.
Anyway, to answer your question:
Yes, there is nothing wrong with using time.time() to measure.
You should use count() instead of show(). count() forces computation over your entire dataset.
You don't have to worry about cache if you don't call it. Spark won't cache unless you ask for it. In fact, you shouldn't cache at all when benchmarking.
You should also use static allocation instead of dynamic allocation. Or, if you're using Databricks or EMR, use a fixed number of workers and don't auto-scale.
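The advice above can be folded into a small, self-contained helper (plain Python, no Spark required; the function and parameter names are my own, and the workload lambda is a stand-in for something like df.agg({"earnings": "avg"}).count()):

```python
import time

def benchmark(section_fn, iterations=5):
    """Run section_fn repeatedly and return per-iteration wall-clock times.

    Uses time.perf_counter(), which is monotonic and has higher resolution
    than time.time(), making it better suited for interval measurement.
    """
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        section_fn()   # the section under test; use count(), not show()
        durations.append(time.perf_counter() - start)
    return durations

# Example with a stand-in workload:
times = benchmark(lambda: sum(range(100_000)), iterations=3)
print(len(times))  # 3
```

Reporting the median of the iterations (rather than the mean) also helps dampen one-off JVM warm-up and scheduling noise.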
I recently began to use Spark to process a huge amount of data (~1 TB), and I have been able to get the job done. However, I am still trying to understand how it works. Consider the following scenario:
Set reference time (say tref)
Do any one of the following two tasks:
a. Read large amount of data (~1TB) from tens of thousands of files using SciSpark into RDDs (OR)
b. Read data as above, do additional preprocessing work, and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable, and the time difference with respect to tref (i.e., t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = the time required for preprocessing? Where can I find a reliable reference for this?
Edit: Explanation added for the origin of question ...
My suspicion stems from Spark's lazy computation approach and its capability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that are computed while thousands of input files are still being read? The suspicion originates in the unbelievable performance I see (with results verified as correct), which looks too good to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
  val start = System.currentTimeMillis
  val res = op
  val end = System.currentTimeMillis
  (end - start) / 1000f
}
def XYZ = {
  val r00 = sc.parallelize(0 to 999999)
  val r01 = r00.map(x => (x, (x, x, x, x, x, x, x)))
  r01.join(r01).count()
}
val time1 = timeIt(XYZ)
// or like this on next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with Actions that cause actual execution. This has its limitations, though, because of lazy evaluation and the like.
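The lazy-evaluation pitfall can be illustrated in plain Python with a generator, which defers work the same way an RDD transformation does (a toy analogy, no Spark involved):

```python
import time

def read_data():
    # Like an RDD transformation: creating the generator does no work.
    for i in range(5):
        time.sleep(0.01)   # pretend each record is expensive to produce
        yield i

start = time.perf_counter()
data = read_data()                    # returns immediately; nothing read yet
build_time = time.perf_counter() - start

start = time.perf_counter()
total = sum(data)                     # the "action": all work happens here
action_time = time.perf_counter() - start

print(total)                          # 10
print(build_time < action_time)       # True: the action paid the cost
```

This is why a timer placed around the "read" step alone can report a misleadingly tiny number: the cost only materializes once an Action consumes the data.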
On the other hand, Spark Web UI records every Action, and records Stage duration for the Action.
In general, performance measurement in shared environments is difficult. With dynamic allocation in a noisy cluster, you hold on to acquired resources for the duration of a Stage, but on successive runs of the same or the next Stage you may get fewer resources. The measurements are at least indicative, though, and you can run in a less busy period.
I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall-clock time. With Python pandas, it takes a fraction of a second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling a job are by themselves expensive and add significant overhead, depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless the data is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that in the worst case the nested-loop-like structure you use can be roughly quadratic in the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
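The difference between the two strategies can be sketched in plain Python on made-up toy data (no Spark involved): one full scan of the data per column versus a single scan that updates all counters at once.

```python
rows = [
    [1, 0, 3],
    [0, 2, 0],
    [4, 0, 0],
]
n_cols = len(rows[0])

# Per-column approach: one full pass over the data for each column
# (analogous to the original versions, which launch one job per column).
per_column = [sum(1 for row in rows if row[c] > 0) for c in range(n_cols)]

# Single-pass approach: one scan counts all columns simultaneously
# (analogous to the single select with count(when(...)) per column).
single_pass = [0] * n_cols
for row in rows:
    for c, v in enumerate(row):
        if v > 0:
            single_pass[c] += 1

print(per_column)    # [2, 1, 1]
print(single_pass)   # [2, 1, 1]
```

Both produce identical counts; the difference is how many times the data is traversed, which is exactly what dominates when each traversal is a separate Spark job over files on disk.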
The problem with your approach is that the file is scanned once for every column (unless you have cached it in memory). The fastest way, with a single FileScan, should be:
import org.apache.spark.sql.functions.{explode, array}

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than pandas, as Spark has a certain overhead from the parallelization engine.
I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, from Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) //first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill,
    coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently they can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of them.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes in query-plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)
I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different.
The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, loaded from CSV and mapped to a case class.
val df = spark.read.csv(csvFile).as[FireIncident]
A filter on UnitId = 'B02' should return 47980 rows. I tested three ways as below:
1) Use typed column (~ 500 ms on local host)
df.where($"UnitID" === "B02").count()
2) Use temp table and sql query (~ same as option 1)
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()
3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)
df.filter(_.UnitID.orNull == "B02").count()
I tested it again with the Python API on the same data set; the timing is 17,046 ms, comparable to the performance of Scala API option 3.
df.filter(df['UnitID'] == 'B02').count()
Could someone shed some light on how 3) and the python API are executed differently from the first two options?
It's because of the deserialization in option 3.
In the first two, Spark doesn't need to deserialize the whole Java/Scala object -- it just looks at the one column and moves on.
In the third, since you're using a lambda function, Spark can't tell that you only want the one field, so it deserializes all 33 fields for each row just so you can check that one field.
I'm not sure why the Python version is so slow; since it uses a column expression rather than a lambda, it seems like it should work the same way as the first option.
When running Python, your driver code is interpreted by the Python runtime and communicates with the JVM through a gateway, so there is extra start-up and round-trip overhead that the Scala API, running natively on the JVM, avoids.