How to make Spark skip sort in merge-join?

I want to take advantage of the fact that my dataframes are already sorted by the key used for the join.
df1.join(df2, df1.sorted_key == df2.sorted_key)
Both dataframes are large, so a broadcast hash join (BHJ) or a shuffled hash join (SHJ) is not an option (SHJ crashes instead of spilling).
How can I hint to Spark that the join column is already sorted?
I read on SO that Hive + bucketing + pre-sorting helps. However, I can't see where a dataframe stores its sort status.
df = session.createDataFrame([
    ('Alice', 1),
    ('Bob', 2),
])
df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
df = df.sort('_1')
df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: long (nullable = true)
^ Even when I manually sort on the column _1, the dataframe doesn't seem to remember it's sorted by _1.
Also,
How does Spark know the sort status?
Does a Parquet dataset (without Hive metadata) remember which columns are sorted? Does Spark recognize it?
How does Hive + bucketing + pre-sorting help skip the sort?
Can I use Hive + pre-sorting without bucketing to skip the sort?
I saw in a Databricks talk that Spark bucketing has many limitations and is different from Hive bucketing. Is Hive bucketing preferred?
The Databricks optimization talk says to never use bucketing because it is too hard to maintain in practice. Is that true?
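For concreteness, the bucket + pre-sort setup I have in mind looks roughly like this (Scala shown, but the DataFrameWriter API is the same in PySpark; the table names and bucket count are illustrative, and whether the per-partition sort is actually skipped presumably depends on the Spark version and on each bucket being a single file):
// Write both sides bucketed and sorted by the join key into metastore tables
// (bucketBy/sortBy require saveAsTable).
df1.write
  .bucketBy(200, "sorted_key")
  .sortBy("sorted_key")
  .saveAsTable("df1_bucketed")

df2.write
  .bucketBy(200, "sorted_key")
  .sortBy("sorted_key")
  .saveAsTable("df2_bucketed")

// Join the bucketed tables and inspect the physical plan: with matching bucket
// counts the Exchange should disappear; ideally the Sort would too.
val joined = spark.table("df1_bucketed")
  .join(spark.table("df2_bucketed"), "sorted_key")
joined.explain()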

Related

Can IF statement work correctly to build spark dataframe?

I have the following code, which uses an IF statement to build a dataframe conditionally.
Does this work as I expect?
df = sqlContext.read.option("badRecordsPath", badRecordsPath).json([data_path_1, s3_prefix + "batch_01/2/2019-04-28/15723921/15723921_15.json"])
if "scrape_date" not in df.columns:
    df = df.withColumn("scrape_date", lit(None).cast(StringType()))
Is this what you are trying to do?
val result = <SOME Dataframe I previously created>
scala> result.printSchema
root
|-- VAR1: string (nullable = true)
|-- VAR2: double (nullable = true)
|-- VAR3: string (nullable = true)
|-- VAR4: string (nullable = true)
scala> result.columns.contains("VAR3")
res13: Boolean = true
scala> result.columns.contains("VAR9")
res14: Boolean = false
So the "result" dataframe has columns "VAR1", "VAR2" and so on.
The next line shows that it contains "VAR3" (result of expression is "true". But it does not contains a column called "VAR9" (result of the expression is "false").
The above is scala, but you should be able to do the same in Python (sorry I did not notice you were asking about python when I replied).
In terms of execution, the if statement will execute locally on the driver node. As a rule of thumb, if something returns an RDD, DataFrame or Dataset, it will be executed in parallel on the executor(s). Since DataFrame.columns returns an Array, any processing of the list of columns will be done on the driver node (because an Array is not an RDD, DataFrame or Dataset).
Also note that RDDs, DataFrames and Datasets are evaluated lazily. That is, Spark "accumulates" the operations that generate these objects and only executes them when you do something that doesn't generate an RDD, DataFrame or Dataset, for example a show, a count or a collect. Part of the reason for this is so Spark can optimise the execution of the process; another is so it only does what is actually needed to produce the answer.
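To make that concrete, here is a minimal Scala sketch (the path and column name are placeholders): the if branch is evaluated immediately on the driver, while withColumn merely extends the lazy plan until an action such as count runs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder.getOrCreate()
val df = spark.read.json("path/to/batch.json")   // placeholder input path

// df.columns is a plain local Array[String], so this check runs on the driver.
val withDate =
  if (!df.columns.contains("scrape_date"))
    df.withColumn("scrape_date", lit(null).cast(StringType))  // lazy: only recorded in the plan
  else
    df

withDate.count()  // action: only now does Spark read the files and execute the plan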

run SQL queries on record history (aggregation) spark streaming

I have a schema -
|-- record_id: integer (nullable = true)
|-- Data1: string (nullable = true)
|-- Data2: string (nullable = true)
|-- Data3: string (nullable = true)
|-- Time: timestamp (nullable = true)
I want to know, for each record id, the record with the latest timestamp. I have not been able to do this in Structured Streaming. In Spark Streaming, I have achieved this on each incoming batch by using foreachRDD, converting each incoming RDD to a dataframe and then running my SQL query on it.
However, this yields results only for each new RDD, not over the whole history. I know I can do this in Spark Streaming using key-value pairs, but I'm rather interested in running SQL queries over the whole history (group by, joins and such). How can I do this in Spark Streaming, and not in Spark Structured Streaming?
Another reason I can't do this in Structured Streaming is that I can't use a streaming aggregation before a join, which is roughly what I require here.
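For reference, the per-batch approach I described looks roughly like this (a sketch; the case class, DStream name and view name are illustrative), which is exactly why it only sees the current micro-batch:
import org.apache.spark.sql.SparkSession

case class Record(record_id: Int, Data1: String, Data2: String,
                  Data3: String, Time: java.sql.Timestamp)

// records: DStream[Record], built elsewhere from the input stream
records.foreachRDD { rdd =>
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  rdd.toDF().createOrReplaceTempView("events")   // only this batch's rows

  // Latest row per record_id -- but only within the current micro-batch
  spark.sql(
    """SELECT e.*
      |FROM events e
      |JOIN (SELECT record_id, MAX(Time) AS max_time
      |      FROM events GROUP BY record_id) m
      |  ON e.record_id = m.record_id AND e.Time = m.max_time""".stripMargin
  ).show()
}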

In Spark 2.0, jdbc dataframes schema is automatically applied as nullable = false

For JDBC dataframes, if I specify a custom query like
(select * from table1 where col4 > 10.0) AS table1
then the schema for all columns turns out to be nullable = false:
col1: string (nullable = false)
col2: string (nullable = false)
col3: string (nullable = false)
col4: float (nullable = false)
This causes a null pointer exception when I use custom queries and the result set contains any null value. I also tried to transform the schema programmatically, but it still fails because of Spark lineage: the original dataframe keeps the restricted schema regardless of the schema of the transformed dataframe.
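(For reference, the programmatic transformation I tried was roughly the following; it rebuilds the schema with nullable = true, but the underlying JDBC relation still reports non-nullable columns, so it still fails.)
import org.apache.spark.sql.types.StructType

// df is the JDBC dataframe loaded with the custom query above
val relaxedSchema = StructType(df.schema.map(_.copy(nullable = true)))
val relaxedDf = sqlContext.createDataFrame(df.rdd, relaxedSchema)
relaxedDf.printSchema()   // shows nullable = true, but reads still fail upstream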
I found a workaround for this. If I just provide the table name and then provide the select and where clause
sqlContext.read.jdbc(url, table1, dconnectionProperties)
  .select("col1", "col2", "col3", "col4")
  .where("col4 < 10.0")
the schema is correctly (or how I want it) inferred as:
col1: string (nullable = true)
col2: string (nullable = true)
col3: string (nullable = true)
col4: float (nullable = true)
But I wanted to use custom queries because my queries have some joins and aggregations that I want pushed down to the database for execution.
This started showing up after we moved to Spark 2.0.x; prior to that it was working fine.
This issue is related to the Teradata JDBC driver. The problem is discussed at https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/76667/highlight/true#M3798.
The root cause is discussed on the first page, and the solution is on the third page.
People from Teradata said they fixed the issue in the 16.10.* driver with a MAYBENULL parameter, but I am still seeing nondeterministic behaviour.
Here is a similar discussion
https://issues.apache.org/jira/browse/SPARK-17195
To close the loop, this issue is fixed in driver version 16.04.00.0. Two new parameters need to be added to the connection string: COLUMN_NAME=ON,MAYBENULL=ON.
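A sketch of what that looks like in practice (host and database are placeholders; this assumes the usual Teradata JDBC URL form where options follow the host, comma-separated):
// Placeholder host/database; COLUMN_NAME=ON and MAYBENULL=ON are the two new parameters.
val url = "jdbc:teradata://tdhost/DATABASE=mydb,COLUMN_NAME=ON,MAYBENULL=ON"

val df = sqlContext.read.jdbc(
  url,
  "(select * from table1 where col4 > 10.0) AS table1",
  new java.util.Properties()   // plus user/password etc. as needed
)
df.printSchema()   // columns should now come back as nullable = true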

Why do Spark DataFrames not change their schema and what to do about it?

I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary avro-encoded.
Thus, after setting up the DataFrame:
val messages = spark
.readStream
.format("kafka")
.options(kafkaConf)
.option("subscribe", config.getString("kafka.topic"))
.load()
If I print the schema of this DataFrame (messages.printSchema()), I get the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)
This question should be orthogonal to the problem of Avro decoding, but let's assume I want to somehow convert the value content of the messages DataFrame into a Dataset[BusinessObject], via a function Array[Byte] => BusinessObject. For completeness, the function may just be (using avro4s):
import java.io.ByteArrayInputStream
import com.sksamuel.avro4s.AvroInputStream

case class BusinessObject(userId: String, eventId: String)

def fromAvro(bytes: Array[Byte]): BusinessObject =
  AvroInputStream.binary[BusinessObject](new ByteArrayInputStream(bytes)).iterator.next
Of course, as miguno says in this related question, I cannot just apply the transformation with a DataFrame.map(), because I need to provide an implicit Encoder for such a BusinessObject.
That can be defined as:
implicit val myEncoder : Encoder[BusinessObject] = org.apache.spark.sql.Encoders.kryo[BusinessObject]
Now, perform the map:
val transformedMessages: Dataset[BusinessObject] = messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))
But if I query the new schema, I get the following:
root
|-- value: binary (nullable = true)
And I think that does not make any sense, as the Dataset should use the Product properties of the BusinessObject case class and get the correct values.
I've seen some examples of Spark SQL using .schema(StructType) in the reader, but I cannot do that, not just because I'm using readStream, but because I actually have to transform the column before being able to operate on those fields.
I am hoping to tell the Spark SQL engine that the schema of the transformedMessages Dataset is a StructType with the case class's fields.
I would say you get exactly what you ask for. As I already explained today, Encoders.kryo generates a blob with the serialized object. Its internal structure is opaque to the SQL engine and cannot be accessed without deserializing the object. So effectively what your code does is take one serialization format and replace it with another.
Another problem is that you are trying to mix a dynamically typed DataFrame (Dataset[Row]) with a statically typed object. Excluding the UDT API, Spark SQL doesn't work like this. Either you use a statically typed Dataset, or a DataFrame with the object structure encoded using a struct hierarchy.
The good news is that simple product types like BusinessObject should work just fine without any need for the clumsy Encoders.kryo. Just skip the Kryo encoder definition and be sure to import the implicit encoders:
import spark.implicits._
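With that import in place (and the Kryo encoder removed), the map from the question should yield a Dataset with a real struct schema; a rough sketch:
import org.apache.spark.sql.Dataset

// The product encoder for the BusinessObject case class is picked up implicitly.
val transformedMessages: Dataset[BusinessObject] =
  messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))

transformedMessages.printSchema()
// root
//  |-- userId: string (nullable = true)
//  |-- eventId: string (nullable = true)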

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema:
|-- ts: long (nullable = true)
|-- id: integer (nullable = true)
|-- managers: array (nullable = true)
| |-- element: string (containsNull = true)
|-- projects: array (nullable = true)
| |-- element: string (containsNull = true)
df1 is created from an Avro file, while df2 is created from an equivalent Parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to maintain the exact same ordering of the fields in both dataframes in order to make it work.
This is old and there are already some answers lying around, but I just faced this problem while trying to make a union of two dataframes like in...
//Join 2 dataframes
val df = left.unionAll(right)
As others have mentioned, order matters. So just select the right dataframe's columns in the same order as the left dataframe's columns:
//Join 2 dataframes, but take columns in the same order
val df = left.unionAll(right.select(left.columns.map(col):_*))
I found the following PR on GitHub: https://github.com/apache/spark/pull/11333.
It relates to UDF (user defined function) columns, which were not correctly handled during the union and would thus cause the union to fail. The PR fixes it, but it hasn't made it into Spark 1.6.2; I haven't checked on Spark 2.x yet.
If you're stuck on 1.6.x, there's a stupid workaround: map the DataFrame to an RDD and back to a DataFrame.
// for a DF with 2 columns (Long, Array[Long])
val simple = dfWithUDFColumn
.map{ r => (r.getLong(0), r.getAs[Array[Long]](1))} // DF --> RDD[(Long, Array[Long])]
.toDF("id", "tags") // RDD --> back to DF but now without UDF column
// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()
