performance of UDF in apache spark - apache-spark

I am trying to do high-performance calculations that require custom functions.
As a first stage, I am trying to profile the effect of using a UDF, and I am getting weird results.
I created a simple test (in
Basically I create a dataframe using the range option with 50M records and cache it.
I then do a filter to find those smaller than 10 and count them. Once by doing column < 10 and once by doing it via UDF.
I ran each action 10 times to get a good time estimate.
What I found was that both methods took around the same time: ~4 seconds.
I also tried it in an on premise cluster I have (8 nodes, using yarn, each node with ~40GB memory and plenty of cores). There I got a result of 1 second for the first option and 8 for the second.
First, I do not understand why I got the same performance on the Databricks cluster. Shouldn't the UDF be much slower? After all, there is no codegen, so I should be seeing a much slower process.
Second, I don't understand the huge difference between the two clusters: in one the times are almost the same, and in the other there is an 8x difference.
Lastly, I was trying to figure out how to write a custom function natively (i.e. the way spark does it). I tried to look at the code for spark and came out with something like this:
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types._
import org.apache.spark.util.Utils
import org.apache.spark.sql.catalyst.expressions._
case class genf(child: Expression) extends UnaryExpression with Predicate with ImplicitCastInputTypes {
  override def inputTypes: Seq[AbstractDataType] = Seq(IntegerType)
  override def toString: String = s"$child < 10"
  override def eval(input: InternalRow): Any = {
    val value = child.eval(input)
    if (value == null) {
      null
    } else {
      child.dataType match {
        case IntegerType => value.asInstanceOf[Int] < 10
      }
    }
  }
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    defineCodeGen(ctx, ev, c => s"($c) < 10")
  }
}
This, however, doesn't work, as it would only work from within the sql package (for example, AbstractDataType is private).
Is this code even in the right direction? How would I make it work?


Network bound transformation and threading

I am trying to use a REST API to enrich data I have in a spark dataframe. The REST API isn't built by me and requires a single input at a time (no batch option). Unfortunately the REST API latency is slower than I would like so my spark applications seem to spend a lot of time waiting for the API to iterate over each row. Although my REST API has higher latency, it does have very high throughput/capacity which does not seem to get fully used by my spark application.
Since my application appears to be network bound, I was wondering if it would make sense to use threading to help improve the speed of my application. Is Spark already capable of doing this internally? If using threads does make sense, is there an easy way to accomplish this? Has anybody successfully done this?
I’ve encountered the same problem when fetching data from a blob storage.
Below is a small self-contained dummy example that I think you can easily modify for your needs.
In the example you should be able to observe that it takes a lot longer to construct df_slow than df_fast.
It works by making each worker process a list of rows in parallel, instead of processing one row at a time sequentially.
You might be able to just swap the slowAdd function with your own Row transforming function. The slowAdd function simulates network latency by sleeping 0.1 seconds.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row
# Just some dataframe with numbers
data = [(i,) for i in range(0, 1000)]
df = spark.createDataFrame(data, ["Data"])
# Get an rdd that contains 'list of Rows' instead of 'Row'
standardRdd = df.rdd # contains [row1, row2, row3,...]
number_of_partitions = 10
repartionedRdd = standardRdd.repartition(number_of_partitions) # contains [row1, row2, row3,...] but repartioned to increase parallelism
glomRdd = repartionedRdd.glom() # contains roughly [[row1, row2, row3,..., row100], [row101, row102, row103, ...], ...]
# where the number of sublists corresponds to the number of partitions
# Define a transformation function with an artificial delay.
# Substitute this with your own transformation function.
import time
def slowAdd(r):
    d = r.asDict()
    time.sleep(0.1)  # simulate network latency
    d["Data"] = d["Data"] + 100
    return Row(**d)
# Define a function that maps the slowAdd function from 'list of Rows' to 'list of Rows' in parallel
import concurrent.futures
def slowAdd_with_thread_pool(list_of_rows):
    thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=100)
    return [result for result in thread_pool.map(slowAdd, list_of_rows)]
# Perform a fast mapping from 'list of Rows' to 'Rows'.
transformed_fast_rdd = glomRdd.flatMap(slowAdd_with_thread_pool)
# For reference, perform a slow mapping from 'Rows' to 'Rows'
transformed_slow_rdd = repartionedRdd.map(slowAdd)
# Convert the rdds back to dataframes from the rdd's
df_fast = spark.createDataFrame(transformed_fast_rdd)
# This sum operation will be fast (~100 threads sleeping in parallel on each worker)
df_fast.agg(F.sum("Data")).show()
df_slow = spark.createDataFrame(transformed_slow_rdd)
# This sum operation will be slow (1 thread sleeping at a time on each worker)
df_slow.agg(F.sum("Data")).show()
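The effect of the thread pool can also be checked without a Spark cluster. The following is a minimal sketch using only the standard library: a plain list of dicts stands in for one glom()'d partition, and `slow_add`, `process_sequentially` and `process_with_threads` are hypothetical names, not part of the answer above. The delay and row count are arbitrary.

```python
import concurrent.futures
import time

def slow_add(row, delay=0.01):
    """Stand-in for one blocking call per row (for example a REST request)."""
    time.sleep(delay)  # simulated network latency
    return {**row, "Data": row["Data"] + 100}

def process_sequentially(rows):
    # One row at a time: total time is roughly len(rows) * delay.
    return [slow_add(r) for r in rows]

def process_with_threads(rows, max_workers=20):
    # Executor.map preserves input order; the with-block shuts the pool down.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_add, rows))

rows = [{"Data": i} for i in range(40)]  # plays the role of one partition

start = time.perf_counter()
slow_result = process_sequentially(rows)
slow_seconds = time.perf_counter() - start

start = time.perf_counter()
fast_result = process_with_threads(rows)
fast_seconds = time.perf_counter() - start

print(f"sequential: {slow_seconds:.2f}s, threaded: {fast_seconds:.2f}s")
```

Because the threads spend their time sleeping (or, in the real case, waiting on the network), the interpreter lock is not a limiting factor here. For CPU-bound row functions this approach would not be expected to help.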

Spark function aliases - performant udfs

In many of the sql queries I write, I find myself combining spark predefined functions in the exact same way, which often results in verbose and duplicated code, and my developer instinct is to want to refactor it.
So, my question is this: is there some way to define some kind of alias for function combinations without resorting to UDFs (which are to be avoided for performance reasons)? The goal is to make the code clearer and cleaner. Essentially, what I want is something like UDFs but without the performance penalty. Also, these functions MUST be callable from within a spark-sql query usable in spark.sql calls.
For example, let's say my business logic is to reverse some string and hash it like this : (please note that the function combination here is irrelevant, what is important is that it is some combination of existing pre-defined spark functions - possibly many of them)
FROM person
Is there a way of declaring a business function without paying the performance price of using a udf, allowing the code just above to be rewritten as :
FROM person
I have searched around quite a bit in the Spark documentation and on this website and have not found a way of achieving this, which is pretty weird to me because it looks like a pretty natural need, and I don't understand why you should necessarily pay the black-box price of defining and calling a UDF.
Is there a way of declaring a business function without paying the performance price of using a udf
You don't have to use a UDF: you can extend the Expression class or, for the simplest operations, UnaryExpression. You then only have to implement a few methods. The result is natively integrated into Spark and can also take advantage of features such as code generation.
In your case adding business function is pretty straightforward:
def business(column: Column): Column = {
MUST be callable from within a spark-sql query usable in spark.sql calls
This is more tricky but achievable.
You need to create custom functions registrar:
import scala.collection.mutable
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
object FunctionAliasRegistrar {
  val funcs: mutable.Map[String, Seq[Column] => Column] = mutable.Map.empty
  def add(name: String, builder: Seq[Column] => Column): this.type = {
    funcs += name -> builder
    this
  }
  def registerAll(spark: SparkSession) = {
    funcs.foreach { case (alias, builder) => {
      def b(children: Seq[Expression]) = builder.apply(children.map(expr => new Column(expr))).expr
      spark.sessionState.functionRegistry.registerFunction(FunctionIdentifier(alias), b)
    }}
  }
}
Then you can use it as follows:
FunctionAliasRegistrar
  .add("business1", child => lower(reverse(child.head)))
  .add("business2", child => upper(reverse(child.head)))
  .registerAll(spark)
spark.sql(
  """
    | SELECT business1(name), business2(name) FROM data
    |""".stripMargin).show()
|sined |SINED |
|taram |TARAM |
|1taram |1TARAM |
|2taram |2TARAM |
Hope this helps.

Spark write only to one hbase region server

import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.PairRDDFunctions
def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)
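val hJob = Job.getInstance(hConf)
hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}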
What I have found with this HBase bulk insertion is that Spark only ever writes to one single HBase region server, which becomes the bottleneck.
However, when I use almost the same approach to read from HBase, it uses multiple executors to read in parallel.
def bulkReadFromHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sourceTableName: String) = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sourceTableName)
val inputRDD = sparkContext.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
inputRDD
}
Can anyone please explain why this could happen? Or maybe I have used the wrong way for spark-hbase bulk I/O?
Question : I have used the wrong way for spark-hbase bulk I/O ?
No, your approach is right, but you need to pre-split the regions beforehand and create the table with pre-split regions.
for example create 'test_table', 'f1', SPLITS=> ['1', '2', '3', '4', '5', '6', '7', '8', '9']
The table above has 9 split points, so it is created with 10 regions.
Design a good rowkey that starts with 1-9.
you can use guava murmur hash like below.
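import com.google.common.base.Charsets;
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
/**
 * getMurmurHash.
 * @param content
 * @return HashCode
 */
public static HashCode getMurmurHash(String content) {
    final HashFunction hf = Hashing.murmur3_128();
    final HashCode hc = hf.newHasher().putString(content, Charsets.UTF_8).hash();
    return hc;
}
final long hash = getMurmurHash(Bytes.toString(yourrowkey as string)).asLong();
final int prefix = Math.abs((int) hash % 9); // prefix is in the range 0..8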
Now prepend this prefix to your rowkey.
For example
1rowkey1 // will go to the region starting at '1'
2rowkey2 // will go to the region starting at '2'
3rowkey3 // will go to the region starting at '3'
9rowkey9 // will go to the region starting at '9'
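To see how a hashed prefix spreads monotonically increasing keys, here is a small standard-library sketch. It is only an illustration under stated assumptions: MD5 stands in for Murmur3 (Guava is not available in Python), the bucket count of 9 mirrors the `% 9` in the snippet above, and `salt_prefix` and `salted_key` are hypothetical helper names.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 9  # assumption: same bucket count as the `% 9` above

def salt_prefix(rowkey: str, buckets: int = NUM_BUCKETS) -> int:
    # MD5 is only a stand-in for Murmur3 here; any well-mixed hash works.
    digest = hashlib.md5(rowkey.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def salted_key(rowkey: str) -> str:
    # The prefix is derived from the key itself, so it can be recomputed on read.
    return f"{salt_prefix(rowkey)}{rowkey}"

# Monotonically increasing keys, the classic hot-spotting pattern.
keys = [f"event-{i:08d}" for i in range(9000)]
distribution = Counter(salted_key(k)[0] for k in keys)
for prefix in sorted(distribution):
    print(prefix, distribution[prefix])
```

Without the prefix, all of these keys share the leading characters `event-` and would sort into a single region. With it, each prefix receives a roughly equal share of the keys.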
If you are pre-splitting and want to manage region splits manually, you can also disable automatic region splits by setting hbase.hregion.max.filesize to a high number and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value such as 100GB, so that regions do not grow beyond a region server's capabilities. You can consider disabling automated splitting and relying on the initial set of regions from pre-splitting if, for example, you are using uniform hashes for your key prefixes and can ensure that the read/write load to each region, as well as its size, is uniform across the regions in the table.
1) Make sure you pre-split the table before loading data into the HBase table. 2) Design a good rowkey, as explained here, using MurmurHash or some other hashing technique to ensure uniform distribution across the regions.
Question : can anyone please explain why this could happen?
The reason is simple: HOT SPOTTING of data into one specific region because of a poor rowkey for that table...
Consider a HashMap in Java whose elements all have the hash code 1234. It will put all the elements in one bucket, won't it? If the elements are spread across different, well-distributed hash codes, it will put them in different buckets. The same is the case with HBase: here your rowkey plays the role of the hash code...
Further more,
What happens if I already have a table and I want to split the regions
The RegionSplitter class provides several utilities to help in the administration lifecycle for developers who choose to manually split regions instead of having HBase handle that automatically.
The most useful utilities are:
Create a table with a specified number of pre-split regions
Execute a rolling split of all regions on an existing table
Example :
$ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
where -c 10, specifies the requested number of regions as 10, and -f specifies the column families you want in the table, separated by “:”. The tool will create a table named “test_table” with 10 regions:
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,,1358563771069.acc1ad1b7962564fc3a43e5907e8db33.', STARTKEY => '', ENDKEY => '19999999', ENCODED => acc1ad1b7962564fc3a43e5907e8db33,}
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,19999999,1358563771096.37ec12df6bd0078f5573565af415c91b.', STARTKEY => '19999999', ENDKEY => '33333332', ENCODED => 37ec12df6bd0078f5573565af415c91b,}
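The boundaries in the log above are consistent with dividing the 8-digit hex key space into 10 equal slices. The sketch below reproduces that arithmetic. It is not taken from the HBase source, and `hex_split_points` is a hypothetical helper, so treat it as an approximation of what HexStringSplit does rather than its implementation.

```python
def hex_split_points(num_regions: int, width: int = 8) -> list:
    """Evenly spaced split keys over the hex key space 00..0 to FF..F."""
    key_space = 16 ** width          # 2**32 possible keys for 8 hex digits
    step = key_space // num_regions  # size of each region, rounded down
    # num_regions regions need num_regions - 1 split points.
    return [format(i * step, f"0{width}X") for i in range(1, num_regions)]

splits = hex_split_points(10)
print(splits[:2])
```

The first two split points match the ENDKEY values of the first two regions shown in the log (19999999 and 33333332).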
As discussed in the comments, you found that:
my final RDD right before writing into hbase only has 1 partition! which indicates that there was only one executor holding the entire data... I am still trying to find out why.
Also, check:
spark.default.parallelism defaults to the number of all cores on all machines. The parallelize api has no parent RDD to determine the number of partitions, so it uses the spark.default.parallelism.
So you can increase the number of partitions by repartitioning.
NOTE: I observed that in MapReduce the number of regions/input splits equals the number of mappers launched. It may be the same situation in your case: the data is loaded into one particular region, which is why only one executor is launched. Please verify that as well.
Though you have not provided example data or enough explanation, this is most likely not due to your code or configuration.
It is happening because of non-optimal rowkey design.
The data you are writing has improperly structured keys (HBase rowkeys), maybe monotonically increasing or something similar, so the writes go to just one of the regions. You can prevent that in various ways (recommended practices for rowkey design such as salting, inverting, and other techniques).
In case,if you are wondering whether the write is done in parallel for all regions or one by one(not clear from question) look at this :

Trying to understand spark streaming windowing

I'm investigating Spark Streaming as a solution for an anti-fraud service I am building, but I am struggling to figure out exactly how to apply it to my use case. The use case is: data from a user session is streamed, and a risk score is calculated for a given user, after 10 seconds of data is collected for that user. I am planning on using a batch interval time of 2 seconds, but need to use data from the full 10 second window. At first, updateStateByKey() seemed to be the perfect solution, as I could build up a UserRisk object using the events the system collects. The trouble is, I am not sure how to tell Spark to stop updating a user after the 10 seconds have passed, as at the 10 second mark, I run our inference engine against the UserRisk object, and persist the result. The other approach is the window transformation. The issue with the window transformation is that I have to dedup data manually, which might be wasteful. Any suggestions on how to tell updateStateByKey to stop reducing on a certain key after an interval of time has passed?
If you don't want to use windowing, you can reduce the batch interval to 1s and then, in updateStateByKey, also update an incremental counter and bypass the update function when it reaches 10.
myDstreamByKey.updateStateByKey( (newValues: Seq[Row], runningState: Option[(UserRisk, Int)]) => {
  if (runningState.get._2 < 10) {
    Some((updateUserRisk(runningState.get._1, newValues), runningState.get._2 + 1))
  } else {
    runningState
  }
} )
For simplicity I'm treating the state as always being a Some, but you have to handle it according to your business logic, which I don't know.
In my example, Row is a fake case class that represents your original data and UserRisk is the accumulating state. The updateUserRisk function contains your business logic to update the UserRisk.
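The counter-based cut-off can be sketched outside Spark as a plain update function. Everything here is an assumption for illustration: the state is a `(risk, batches_seen)` tuple, `update_user_risk` is a placeholder that just sums scores, and, unlike the snippet above, a missing state is handled by starting from `(0.0, 0)`.

```python
from typing import List, Optional, Tuple

MAX_BATCHES = 10  # assumption: 1-second batches, stop after 10 of them

State = Tuple[float, int]  # (accumulated risk, batches seen)

def update_user_risk(risk: float, new_values: List[float]) -> float:
    # Hypothetical business logic: just add up the incoming event scores.
    return risk + sum(new_values)

def update_state(new_values: List[float], running_state: Optional[State]) -> Optional[State]:
    # A key seen for the first time has no state yet.
    risk, batches = running_state if running_state is not None else (0.0, 0)
    if batches < MAX_BATCHES:
        return (update_user_risk(risk, new_values), batches + 1)
    return (risk, batches)  # window is over: leave the state untouched

# Simulate 12 batches for one user, each carrying a single event with score 1.0.
state = None
for _ in range(12):
    state = update_state([1.0], state)
print(state)
```

After the tenth batch the state stops changing, which is the point at which the inference engine could be run and the result persisted.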
For your case, you can try the reduceByKeyAndWindow DStream function; it should fulfill your requirement.
Here is sample code in java
JavaPairDStream<String, Integer> counts = pairs.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) {
        return i1 - i2;
      }
    },
    new Duration(60 * 1000), new Duration(2 * 1000));
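The two functions passed above are a reduce function and its inverse: the first folds in the batch that enters the window, the second removes the batch that leaves it, so the window total never has to be recomputed from scratch. A minimal standard-library sketch of that idea for a single key follows. `sliding_window_sums` is a hypothetical helper, and the window of 5 batches is an assumption corresponding to a 10-second window over 2-second batches.

```python
from collections import deque

def sliding_window_sums(batch_totals, window_batches):
    """Yield the sum of the last `window_batches` batches after each batch."""
    window = deque()
    running = 0
    for total in batch_totals:
        running += total                 # reduce function: add the new batch
        window.append(total)
        if len(window) > window_batches:
            running -= window.popleft()  # inverse function: drop the expired batch
        yield running

# 2-second batches and a 10-second window would mean 5 batches per window.
sums = list(sliding_window_sums([1, 2, 3, 4, 5, 6, 7], window_batches=5))
print(sums)
```

Each step costs one addition and at most one subtraction regardless of the window length, which is why the inverse-function form is attractive for long windows with short slide intervals.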
Some important links
Spark Streaming Window Operation

Spark RDD.isEmpty costs much time

I built a Spark cluster.
Memory: 32.0 GB Total, 20.0 GB Used
Each worker gets 1 cpu, 6 cores and 10.0 GB memory
My program gets data source from MongoDB cluster. Spark and MongoDB cluster are in the same LAN(1000Mbps).
MongoDB document format:
{name:string, value:double, time:ISODate}
There are about 13 million documents.
I want to get the average value for a specific name over a specific hour, which contains 60 documents.
Here is my key function
*rdd=sc.newAPIHadoopRDD(configOriginal, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])
Apache-Spark-1.3.1 scala doc: SparkContext.newAPIHadoopFile[K, V, F <: InputFormat[K, V]](path: String, fClass: Class[F], kClass: Class[K], vClass: Class[V], conf: Configuration = hadoopConfiguration): RDD[(K, V)]
def findValueByNameAndRange(rdd:RDD[(Object,BSONObject)],name:String,time:Date): RDD[BasicBSONObject]={
  val nameRdd = rdd.map(arg=>arg._2).filter(_.get("name").equals(name))
  val timeRangeRdd1 = nameRdd.map(tuple=>(tuple, tuple.get("time").asInstanceOf[Date]))
  val timeRangeRdd2 = timeRangeRdd1.map(tuple=>(tuple._1,duringTime(tuple._2,time,getHourAgo(time,1))))
  val timeRangeRdd3 = timeRangeRdd2.filter(_._2).map(_._1)
  val timeRangeRdd4 = timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
  if (timeRangeRdd4.isEmpty()) {
    return basicBSONRDD(name, time)
  } else {
    return timeRangeRdd4.map(tuple => {
      val bson = new BasicBSONObject()
      bson.put("name", tuple._1)
      bson.put("value", tuple._2/60)
      bson.put("time", time)
      bson })
  }
}
Here is part of Job information
My program runs very slowly. Is it because of isEmpty and reduceByKey? If yes, how can I improve it? If not, why?
======= update =======
timeRangeRdd3.map(x => (x.get("name").toString, x.get("value").toString.toDouble)).reduceByKey(_ + _)
is on line 34.
I know reduceByKey is a global operation and may take a long time; however, what it cost is beyond my budget. How can I improve it, or is this a defect of Spark? With the same calculation and hardware, it takes just several seconds if I use multiple threads in Java.
First, isEmpty is merely the point at which the RDD stage ends. The maps and filters do not create a need for a shuffle, and the method shown in the UI is always the method that triggers a stage change, in this case isEmpty. Why it's running slowly is not as easy to discern from this perspective, especially without seeing the composition of the originating RDD. I can tell you that isEmpty first checks the partition size and then does a take(1) and verifies whether data was returned or not. So the odds are that there is a bottleneck in the network or something else blocking along the way. It could even be a GC issue... Click into the isEmpty and see what more you can discern from there.
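The way a cheap-looking check ends up carrying the cost of everything before it can be shown with a plain Python generator. This is only an analogy, not Spark code: `expensive_source` and `is_empty` are hypothetical names, and the sleep stands in for reading and transforming the input.

```python
import time

_MISSING = object()  # sentinel so that legitimate None elements are not mistaken for "empty"

def expensive_source(n, delay=0.001):
    # Stand-in for reading and transforming the input; nothing runs until iterated.
    for i in range(n):
        time.sleep(delay)
        yield i

def is_empty(iterable):
    # Like take(1): pull at most one element and report whether one arrived.
    return next(iter(iterable), _MISSING) is _MISSING

start = time.perf_counter()
pipeline = (x * 2 for x in expensive_source(200) if x >= 150)  # lazy, returns at once
build_seconds = time.perf_counter() - start

start = time.perf_counter()
empty = is_empty(pipeline)  # the source must produce 151 items before one passes the filter
check_seconds = time.perf_counter() - start

print(empty, f"build={build_seconds:.4f}s check={check_seconds:.4f}s")
```

Defining the pipeline is nearly free, and all of the time is attributed to the emptiness check, even though the check itself does almost nothing. The same reading applies to the time reported against isEmpty in the Spark UI.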
