How to get an Iterator of Rows using Dataframe in SparkSQL - apache-spark

I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way I can get all these rows as an Iterator instead of collecting the entire result as a list?
I am executing this SparkSQL application in yarn-client mode.

Generally speaking, transferring all the data to the driver is a pretty bad idea and most of the time there is a better solution, but if you really want to go this route you can use the toLocalIterator method on an RDD:
val df: org.apache.spark.sql.DataFrame = ???
df.cache // Optional, to avoid repeated computation, see docs for details
val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator

Actually, you can just use df.toLocalIterator. Here is the reference in the Spark source code:
/**
 * Return an iterator that contains all of [[Row]]s in this Dataset.
 *
 * The iterator will consume as much memory as the largest partition in this Dataset.
 *
 * Note: this results in multiple Spark jobs, and if the input Dataset is the result
 * of a wide transformation (e.g. join with different partitioners), to avoid
 * recomputing the input Dataset should be cached first.
 *
 * @group action
 * @since 2.0.0
 */
def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
  withNewExecutionId {
    queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
  }
}
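For completeness, here is a minimal sketch of consuming it from Scala, assuming df is the DataFrame from above (the returned java.util.Iterator is converted with JavaConverters):
import scala.collection.JavaConverters._

df.cache() // recommended when the plan contains wide transformations, as the docs above note
val rows: Iterator[org.apache.spark.sql.Row] = df.toLocalIterator().asScala
rows.foreach(println) // rows are pulled to the driver one partition at a time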

Related

spark: No of records in DataFrame is different in different runs

I am running a Spark job that reads data from Teradata. The query looks like:
select * from db_name.table_name sample 5000000;
I'm trying to pull a sample of 5 million rows of data. When I print the number of rows in the resulting DataFrame, it gives different results each time I run it: sometimes 4999937 and sometimes 5000124. Is there any particular reason for this kind of behaviour?
EDIT #1:
The code I'm using:
val query = "(select * from db_name.table_name sample 5000000) as data"
var teradataConfig = Map(
  "url"      -> "jdbc:teradata://HOSTNAME/DATABASE=db_name,DBS_PORT=1025,MAYBENULL=ON",
  "TMODE"    -> "TERA",
  "user"     -> "username",
  "password" -> "password",
  "driver"   -> "com.teradata.jdbc.TeraDriver",
  "dbtable"  -> query)
var df = spark.read.format("jdbc").options(teradataConfig).load()
df.count
Try caching the resulting DataFrame before performing the count action:
df.cache()
println(s"Record count: ${df.count()}")
From here on, when you reuse df to create a new DataFrame or apply any other transformation, you won't get mismatched counts, since the data is already in the cache.
Make sure you have given enough memory to hold the cached DataFrame in memory.
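For example, a short sketch of the reuse pattern under the same assumptions (the column name some_col is hypothetical):
import org.apache.spark.sql.functions.col

df.cache()
val total = df.count()                              // materializes the cache
val nonNull = df.filter(col("some_col").isNotNull)  // reuses the cached data, so counts stay consistent
println(s"Record count: $total, non-null rows: ${nonNull.count()}")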

Filtering Dataframe with predicate pushdown from another dataframe

How can I push down a filter to a DataFrame read based on another DataFrame I have? Basically I want to avoid reading the second DataFrame entirely and then doing an inner join. Instead I would like to just submit a filter on the read so the data is filtered at the source. Even if I use an inner join wrapped around the read, the plan doesn't show that it is getting filtered. I feel like there is definitely a better way to set this up. Using Spark 2.x, I have this so far, but I want to avoid collecting a List as below:
// Don't want to do this collect...too slow
val idFilter = df1.select("id").distinct().map(r => r.getLong(0)).collect.toList
val df2: DataFrame = spark.read.format("parquet").load("<path>")
.filter($"id".isin(idFilter: _*))
You cannot directly use predicate pushdown unless you are implementing a data source yourself. Predicate pushdown is a mechanism provided by the Spark data sources, and it must be implemented by each data source individually.
For file-based data sources there is already a simple mechanism in place, based on partitioning on disk.
Consider the following DataFrame:
val df = Seq(("test", "day1"), ("test2", "day2")).toDF("data", "day")
If we save that DataFrame to disk the following way:
df.write.partitionBy("day").save("/tmp/data")
The result will be the following folder structure
/tmp/data
 |-- day=day1
 |     |-- part1....parquet
 |     |-- part2....parquet
 |-- day=day2
 |     |-- part1....parquet
 |     |-- part2....parquet
If you now use this data source like this:
spark.read.load("/tmp/data").filter($"day" === "day1").show()
Spark doesn't even bother loading the data in the day2 folder, as there is no need for it.
This is one type of predicate pushdown, and it works for every standard file format Spark supports.
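You can verify this partition pruning in the physical plan; assuming the same path as above, something along these lines (the exact output differs between Spark versions):
spark.read.load("/tmp/data").filter($"day" === "day1").explain()
// The FileScan node should show PartitionFilters containing (day = day1),
// i.e. only the day=day1 folder is scanned.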
A more specific mechanism is Parquet. Parquet is a columnar file format, which means it is quite easy to skip whole columns. If you have Parquet files with the 3 columns a, b, c in a file /tmp/myparquet.parquet, the following query:
spark.read.parquet("/tmp/myparquet.parquet").select("a").show()
will result in internal column pruning (the "pruned" part of the trait shown below), where Spark only fetches the data for column a without reading the data for columns b or c.
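Again, this can be checked via explain(), assuming the same file as above:
spark.read.parquet("/tmp/myparquet.parquet").select("a").explain()
// The FileScan node should show ReadSchema: struct<a:...>, i.e. columns b and c are never read.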
If someone is interested, these mechanisms are established by implementing this trait:
/**
 * A BaseRelation that can eliminate unneeded columns and filter using selected
 * predicates before producing an RDD containing all matching tuples as Row objects.
 *
 * The actual filter should be the conjunction of all `filters`,
 * i.e. they should be "and" together.
 *
 * The pushed down filters are currently purely an optimization as they will all be evaluated
 * again. This means it is safe to use them with methods that produce false positives such
 * as filtering partitions based on a bloom filter.
 *
 * @since 1.3.0
 */
@Stable
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
to be found in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
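For illustration, here is a minimal sketch of what a relation implementing this trait might look like. InMemoryDayRelation, its hard-coded two-column schema, and the in-memory Seq are all hypothetical; a real data source would also need a RelationProvider to be usable via spark.read.format:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation over an in-memory Seq of (data, day) pairs.
class InMemoryDayRelation(val sqlContext: SQLContext, rows: Seq[(String, String)])
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("data", StringType), StructField("day", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Honour pushed-down equality filters on "day"; anything else is left for Spark to re-evaluate.
    val filtered = filters.foldLeft(rows) {
      case (rs, EqualTo("day", value: String)) => rs.filter(_._2 == value)
      case (rs, _)                             => rs
    }
    // Column pruning: emit only the columns Spark asked for, in the requested order.
    val projected = filtered.map { case (data, day) =>
      Row.fromSeq(requiredColumns.map {
        case "data" => data
        case "day"  => day
      })
    }
    sqlContext.sparkContext.parallelize(projected)
  }
}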

Broadcast a csv?

Assume I'm creating a spark dataset from a shared store of data as follows:
Dataset<Row> item = spark.read().option("delimiter", "|").option("header","true").csv(fName).cache();
Is there a way to tell Spark to broadcast item to all nodes, such that no shuffle is needed to use it? I have a bunch of little lookup tables and I'd like to see if broadcasting them helps avoid shuffles.
You can use two approaches:
collect() the given Dataset and broadcast it manually. You said those files are small, so this is possible. But it will only work with UDFs / strongly typed operators like map, not with standard functions.
Example:
val items = item.as[MyCaseClass].collect()
val itemsBcV = sparkContext.broadcast(items)

// later, in a UDF
val funnyUDF = udf { (x: String) =>
  val valueFromBroadcast = itemsBcV.value
  // ... do your processing against the broadcast value and return a result,
  // e.g. (field name is purely illustrative):
  valueFromBroadcast.exists(_.key == x)
}
Preferred: don't broadcast manually; just add a broadcast hint in your processing.
First, import org.apache.spark.sql.functions._
For example:
someBigTable.join(broadcast(item), "id")
in pure SQL syntax it is:
item.createOrReplaceTempView("item")
select /*+ BROADCAST(item) */ * from bigTable join item
Spark will manage broadcasting this variable and use the quicker Broadcast Hash Join instead of a Shuffle Hash Join or Sort Merge Join.
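A quick way to confirm the hint took effect, assuming the same someBigTable and item DataFrames as above:
someBigTable.join(broadcast(item), "id").explain()
// The physical plan should contain BroadcastHashJoin (with a BroadcastExchange) rather than SortMergeJoin.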

Spark RDD do not get processed in multiple nodes

I have a use case where I create an RDD from a Hive table. I wrote some business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it will utilise all of my Spark executors. But what I see in my log is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by default is 64MB (from Spark’s Programming Guide).
If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.
With bigger input data, Spark will use all of the nodes.
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
< do some computation on each row>
}
Further, if your map logic depends on a particular column, you can repartition by that column as below:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
< do some computation on each row>
}
A good partition column could be a date column.
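A quick sanity check, assuming the same flow DataFrame as above, is to compare the number of partitions before and after repartitioning:
println(s"Partitions before: ${flow.rdd.getNumPartitions}")
val repartitioned = flow.repartition(200)
println(s"Partitions after: ${repartitioned.rdd.getNumPartitions}") // should print 200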

Performing operations only on subset of a RDD

I would like to perform some transformations only on a subset of an RDD (to make experimenting in the REPL faster).
Is it possible?
RDD has a take(num: Int): Array[T] method. I think I need something similar, but returning an RDD[T].
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
...
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
RDDs are distributed collections which are materialized only on actions. It is not possible to truncate your RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).
If you want to get similarly sized RDDs regardless of the input size, you can truncate the items in each of your partitions - this way you can better control the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism.
An example from spark-shell:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions
Apparently it's possible to create an RDD subset by first using its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
This approach seems dodgy to me though. Is there a nicer way?
I always use the parallelize function of SparkContext to distribute from an Array[T], but it seems makeRDD does the same thing. Both of them are correct.
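A minimal sketch of that take-then-parallelize approach, assuming data is an existing RDD[Int] as in the shell session above:
val subset: org.apache.spark.rdd.RDD[Int] = sc.parallelize(data.take(1000))
subset.count // 1000 (or fewer, if data has fewer elements)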
