Why does calculating an RDD count take so much time? - apache-spark

(English is not my first language so please excuse any mistakes)
I use Spark SQL to read 4.7 TB of data from a Hive table and perform a count operation, which takes about 1.6 hours. Reading the same data directly from the HDFS text files and performing a count takes only 10 minutes. The two jobs use the same resources and parallelism. Why does the RDD count take so much time?
The Hive table has about 3000 thousand columns, so maybe serialization is costly. I checked the Spark UI: each task reads about 240 MB of data and takes about 3.6 minutes to execute. I can't believe that the serialization overhead is that expensive.
Reading from Hive (takes 1.6 hours):
val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
val count = hiveData.count()
Reading from HDFS (takes 10 minutes):
val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
val count = hdfsData.count()
Using a SQL count still takes 5 minutes:
val sql = s"SELECT COUNT(*) FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd

Your first method is querying the data instead of fetching the data. Big difference.
val sql = s"SELECT * FROM xxxtable"
val hiveData = sqlContext.sql(sql).rdd
We can look at the above code as programmers and think "yes, this is how we grab all of the data". But the way that the data is being grabbed is via query instead of reading it from a file. Basically, the following steps occur:
1. Read from file into temporary storage
2. A query engine processes the query on temp storage and creates results
3. Results are read into an RDD
There are a lot of steps there! More than what occurs with the following:
val inputPath = s"/path/to/above/hivetable"
val hdfsData = sc.textFile(inputPath)
Here, we just have one step:
Read from file into RDD
See, that's 1/3 of the steps. Even though it is a simple query, there is still a lot of overhead and processing involved in order to get it into that RDD. Once it's in the RDD though, processing will be easier. As shown by your code:
val count = hdfsData.count()

Your first approach loads all of the data into Spark, so the network transfer, serialization, and transformation take a lot of time.
The second approach is faster, I think, because it skips the Hive layer.
If you only need the count, the third approach is best: it loads only the count result after the count has been executed.


What is the fastest way to get a large number of time ranges using Apache Spark?

I have about 100 GB of time series data in Hadoop. I'd like to use Spark to grab all data from 1000 different time ranges.
I have tried this using Apache Hive by creating an extremely long SQL statement that has about 1000 'OR BETWEEN X AND Y OR BETWEEN Q AND R' statements.
I have also tried using Spark. In this technique I've created a dataframe that has the time ranges in question and loaded that into spark with:
With this, I'm doing a join with the newly created timestamp dataframe and the larger set of timestamped data.
This query is taking an extremely long time and I'm wondering if there's a more efficient way to do this.
Especially if the data is not partitioned or ordered in any special way, you or Spark need to scan it all no matter what.
I would define a predicate given the set of time ranges:
import org.apache.spark.rdd.RDD
import scala.collection.immutable.Range
val ranges: List[Range] = ??? // load your ranges here
def matches(timestamp: Int): Boolean = {
  // This is not efficient, a better data structure than a List
  // should be used, but this is just an example
  ranges.exists(_.contains(timestamp)) // true if any range contains the timestamp
}
val data: RDD[(Int, T)] = ??? // load the data in an RDD (T is your record type)
val filtered = data.filter(x => matches(x._1))
You can do the same with DataFrame/DataSet and UDFs.
This works well if the set of ranges is available in the driver. If it instead comes from a table, like the 100 GB of data, first collect it back to the driver, provided it is not too big.
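The comment in the predicate above notes that a List is not an efficient structure. One possible alternative, sketched here in plain Scala with no Spark dependency, is to merge overlapping ranges once and then binary-search the sorted result. The class name `TimeRanges` and its layout are hypothetical, and ranges are assumed to be inclusive `(start, end)` pairs of `Long` timestamps.

```scala
// Sketch only: inclusive (start, end) ranges of Long timestamps are assumed.
// Serializable so an instance can be captured by a closure shipped to executors.
final class TimeRanges(raw: Seq[(Long, Long)]) extends Serializable {

  // Sort by start and merge overlapping ranges so the result is disjoint and ordered.
  private val merged: Array[(Long, Long)] =
    raw.sortBy(_._1).foldLeft(List.empty[(Long, Long)]) {
      case ((s, e) :: rest, (ns, ne)) if ns <= e => (s, math.max(e, ne)) :: rest
      case (acc, r)                              => r :: acc
    }.reverse.toArray

  // Binary search for the last range whose start is <= timestamp,
  // then check whether the timestamp falls before that range's end.
  def matches(timestamp: Long): Boolean = {
    var lo = 0
    var hi = merged.length - 1
    var found = -1
    while (lo <= hi) {
      val mid = (lo + hi) >>> 1
      if (merged(mid)._1 <= timestamp) { found = mid; lo = mid + 1 }
      else hi = mid - 1
    }
    found >= 0 && timestamp <= merged(found)._2
  }
}
```

Each lookup then costs O(log n) in the number of merged ranges instead of a linear scan. Assuming the RDD keys are Long timestamps, an instance built on the driver could be used as data.filter(x => ranges.matches(x._1)).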
Your Spark job goes through the whole 100 GB dataset to select the relevant data.
I don't think there is a big difference between using SQL and the DataFrame API, as a full scan happens under the hood either way.
I would consider restructuring your data so that it is optimised for these specific queries.
In your case, partitioning by time can give quite a significant improvement (for example, a Hive table with partitioning).
If you search on the same field that was used for partitioning, the Spark job will only look into the relevant partitions.
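Why partitioning by time helps can be illustrated with a small plain-Scala model. This is not Spark code: the object `PartitionPruningSketch`, the one-day bucket size, and the non-negative epoch-second timestamps are all assumptions made for the illustration.

```scala
// Plain-Scala model of partition pruning; names and bucket size are hypothetical.
object PartitionPruningSketch {
  val SecondsPerDay = 86400L

  // Group records by day bucket, the way a time-partitioned table lays out its data.
  // Assumes non-negative epoch-second timestamps.
  def partitionByDay[A](records: Seq[(Long, A)]): Map[Long, Seq[(Long, A)]] =
    records.groupBy { case (ts, _) => ts / SecondsPerDay }

  // Look only at the day buckets overlapping [start, end].
  // Returns the matching records and how many records had to be examined.
  def query[A](parts: Map[Long, Seq[(Long, A)]], start: Long, end: Long): (Seq[(Long, A)], Int) = {
    val candidates =
      (start / SecondsPerDay to end / SecondsPerDay)
        .flatMap(day => parts.getOrElse(day, Seq.empty))
    val hits = candidates.filter { case (ts, _) => ts >= start && ts <= end }
    (hits, candidates.size)
  }
}
```

Records in buckets outside the requested range are never touched, which is the effect a partitioned table gives you when the filter is on the partition column. In Spark, a comparable layout can be produced when writing with DataFrameWriter.partitionBy.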

What is the best way to collect the Spark job run statistics and save to database

My Spark program has several table joins (using Spark SQL), and I would like to collect the time taken to process each of those joins and save it to a statistics table. The purpose is to run it continuously over a period of time and gather performance data at a very granular level.
val DF1= spark.sql("select x,y from A,B ")
val DF2 =spark.sql("select k,v from TABLE1,TABLE2 ")
Finally I join DF1 and DF2 and then initiate an action like saveAsTable.
What I am looking for is to figure out
1. How much time it really took to compute DF1
2. How much time to compute DF2 and
3. How much time to persist those final joins to Hive / HDFS
and put all this info into a RUN-STATISTICS table / file.
Any help is appreciated and thanks in advance
Spark uses Lazy Evaluation, allowing the engine to optimize RDD transformations at a very granular level.
When you execute
val DF1= spark.sql("select x,y from A,B ")
nothing happens except the transformation is added to the Directed Acyclic Graph.
Only when you perform an action, such as DF1.count, is the driver forced to execute a physical execution plan. This is deferred as far down the chain of RDD transformations as possible.
Therefore it is not correct to ask
1. How much time it really took to compute DF1
2. How much time to compute DF2 and
at least based on the code examples you provided. Your code did not "compute" val DF1. We cannot know how long processing just DF1 took unless you somehow force each dataframe to be processed separately.
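The deferred execution described above can be seen in a plain-Scala analogy. This is not Spark code: a lazy Iterator stands in for a chain of transformations, and the object name `LazyAnalogy` is made up for the illustration.

```scala
// Analogy only: a lazy Iterator plays the role of a chain of transformations.
object LazyAnalogy {
  var evaluated = 0

  // Like a transformation: building the pipeline runs nothing yet.
  val pipeline: Iterator[Int] = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
  val before: Int = evaluated // no element has been mapped at this point

  // Like an action: consuming the iterator forces all of the deferred work.
  val total: Int = pipeline.sum
  val after: Int = evaluated
}
```

Timing the line that defines pipeline would measure almost nothing. All of the cost shows up when the result is consumed, which is why the time of a single dataframe definition is not a meaningful quantity on its own.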
A better way to structure the question might be "how many stages (tasks) is my job divided into overall, and how long does it take to finish those stages (tasks)"?
And this can be easily answered by looking at the log files/web GUI timeline (comes in different flavors depending on your setup)
3. How much time to persist those final joins to Hive / HDFS
Fair question. Check out Ganglia.
Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk bound, network bound, or CPU bound.
Another trick I like to use is defining every sequence of transformations that must end in an action inside a separate function, and then calling that function on the input RDD inside a "timer function" block.
For instance, my "timer" is defined as such
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // evaluate the by-name block exactly once
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0)/1e9 + "s")
  result
}
and can be used as
val df1 = Seq((1,"a"),(2,"b")).toDF("id","letter")
scala> time{df1.count}
Elapsed time: 1.306778691s
res1: Long = 2
However, don't call unnecessary actions just to break the DAG down into more stages/wide dependencies. This might lead to extra shuffles or slow down your execution.
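To feed a run-statistics table rather than just print, the timer idea above could be extended to keep labelled measurements. This is a minimal sketch in plain Scala. The names `RunStat`, `RunStats`, `timed` and `all` are hypothetical, and it is not thread-safe.

```scala
import scala.collection.mutable.ArrayBuffer

// One measurement: a label chosen by the caller and the elapsed wall-clock seconds.
final case class RunStat(label: String, seconds: Double)

// Hypothetical helper that collects measurements in driver memory (not thread-safe).
object RunStats {
  private val stats = ArrayBuffer.empty[RunStat]

  // Runs the block once, records its duration under the label, and returns its result.
  def timed[R](label: String)(block: => R): R = {
    val t0 = System.nanoTime()
    val result = block
    stats += RunStat(label, (System.nanoTime() - t0) / 1e9)
    result
  }

  // Snapshot of everything recorded so far.
  def all: Seq[RunStat] = stats.toList
}
```

It only gives meaningful numbers when the wrapped block ends in an action, for example RunStats.timed("final save")(...) around the saveAsTable call. The collected records could afterwards be converted to a DataFrame and appended to the statistics table.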

Spark Streaming appends to S3 as Parquet format, too many small partitions

I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2 minutes non-overlapping window.
My approaches:
Kinesis Stream -> Spark Streaming with batch duration about 60 seconds, using a non-overlapping window of 120s, save the streamed data into S3 as:
val rdd1 = kinesisStream.map( rdd => /* decode the data */)
rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
val spark = SparkSession...
import spark.implicits._
// convert rdd to df
val df = rdd.toDF(columnNames: _*)
Here is what s3://bucket/20161211.parquet looks like after a while:
As you can see, lots of fragmented small partitions (which is horrendous for read performance)...the question is, is there any way to control the number of small partitions as I stream data into this S3 parquet file?
What I am thinking to do, is to each day do something like this:
val df = spark.read.parquet("s3://bucket/20161211.parquet")
where I kind of repartition the dataframe to 4 partitions and save them back....
It works, but I feel that doing this every day is not an elegant solution...
That's actually pretty close to what you want to do; each partition gets written out as an individual file in Spark. However, coalesce is a bit confusing, since it can (effectively) apply upstream of where the coalesce is called. The warning from the Scala doc is:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
With Datasets it's a bit easier to persist and count to do wide evaluation, since the default coalesce function doesn't take repartition as a flag for input (although you could construct an instance of Repartition manually).
Another option is to have a second periodic batch job (or even a second streaming job) that cleans up/merges the results, but this can be a bit complicated as it introduces a second moving part to keep track of.

How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

For example, the docs on docs.datastax.com mention:
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
and it's the only way I know, but let's say that I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if this table has, for example, over 10 million entries.
You can't load the data faster, but you can load portions of it or terminate early. Spark DataFrames use Catalyst to optimize their underlying query plans, which enables some shortcuts.
For example, calling limit allows Spark to skip reading some portions of the underlying DataSource. This limits the amount of data read from Cassandra by canceling tasks before they are executed.
Filters added by calling filter can be used by the underlying DataSource to restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but these are all detailed in the documentation.
Note that all of this is accomplished simply by making further API calls on your DataSource once you have loaded it. For example:
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "ks"))
  .load()
df.show(10) // Will compute only enough tasks to get 10 records and no more
df.filter("clusteringKey > 5").show() // Will pass down the clustering predicate to C*

Why does Spark DataFrame run out of memory when same process on RDD completes fine?

I'm working with a fairly large amount of data (a few TB). When I use a subset of the data, I find that Spark dataframes are great to work with. However, when I try calculations on my full dataset, the same code gives me the dreaded "java.lang.OutOfMemoryError: GC overhead limit exceeded". What surprised me is that the process completes fine when doing the same thing with an RDD. I thought dataframes were supposed to have better optimization. Is this a mistake in my approach or a limitation of dataframes?
For example, here is a simple task using dataframes that completes fine for a subset of my data and chokes on the full sample:
val records = sqlContext.read.avro(datafile)
val uniqueIDs = records.select("device_id").dropDuplicates(Array("device_id"))
val uniqueIDsCount = uniqueIDs.count().toDouble
val sampleIDs = uniqueIDs.sample(withReplacement = false, 100000/uniqueIDsCount)
sampleIDs.write.format("com.databricks.spark.csv").option("delimiter", "|").save(outputfile)
In this case it even chokes on the count.
However, when I try the same thing using RDDs in the following way it calculates fine (and pretty quickly at that).
val rawinput = sc.hadoopFile[AvroWrapper[Observation],NullWritable,
AvroInputFormat[Observation]](rawinputfile).map(x=> x._1.datum)
val tfdistinct = rawinput.map(x => x.getDeviceId).distinct
val distinctCount = tfdistinct.count().toDouble
tfdistinct.sample(false, 100000/distinctCount.toDouble).saveAsTextFile(outputfile)
I'd love to keep using dataframes in the future. Am I approaching this wrong?
