Spark: reference columns in a refactorable way - apache-spark

Spark SQL is awesome. However, columns are inherently referenced by strings. Even with the Dataset API, only the presence of required columns is checked, not the absence of additional fields. And my main problem is that even with the Dataset API, strings are used to reference columns.
Is there a way to reference columns more type-safely in Spark SQL, without introducing an additional data structure for each table (besides the initial case class for the type information) just to address the names, so that refactoring and IDE support work better?
Edit
See the snippet below. It compiles even though it should be clear that 'fooWrong is a wrong column reference. Rename/refactor in the IDE also does not seem to work properly.
case class Foo(bar: Int)
import spark.implicits._
val ds = Seq(Foo(1), Foo(2)).toDS
ds.select('fooWrong)
NOTE:
import spark.implicits._
is already in scope, and 'fooWrong is already implicitly converted to a Column.

Frameless seems to be the go-to solution offering the desired properties: https://github.com/typelevel/frameless
The only downside is that joins currently only work with column equality; support for arbitrary boolean predicates is still in progress.
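For illustration, a minimal sketch of what this looks like with frameless's TypedDataset. The create/select calls below follow the frameless docs, but treat the exact API as an assumption, since it may differ between versions:

import org.apache.spark.sql.SparkSession
import frameless.TypedDataset

case class Foo(bar: Int)

implicit val session: SparkSession = spark // frameless needs an implicit SparkSession in scope

val fds = TypedDataset.create(Seq(Foo(1), Foo(2)))
fds.select(fds('bar))         // compiles: Foo has a field named bar
// fds.select(fds('fooWrong)) // does not compile: Foo has no such field

The key difference from plain Spark is that the invalid reference is rejected at compile time instead of at runtime.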

Related

PySpark "column" object content to display

I am just starting to learn PySpark. I have created a Column object, and now I want to see what is in it. Unfortunately, all my research efforts ended with proposals for accessing a column of a Spark DataFrame, but I want to know how to see what data is in the Column object that I already have.
There must be a simple way, but I have had no success finding it.
The code that created the column object:
baskets=groups.agg(pyspark.sql.functions.collect_list("product_id"))['collect_list(product_id)']
I expect something like baskets.show(), but that just tells me
column object is not callable
This creates a dataframe:
import pyspark
baskets=groups.agg(pyspark.sql.functions.collect_list("product_id"))
(However, we normally write this less verbosely:)
from pyspark.sql import functions as F
baskets = groups.agg(F.collect_list("product_id"))
Now baskets is a DataFrame, and you can use baskets.show()
In your code, you also appended ['collect_list(product_id)']. This creates a reference to the column in your code, but Spark has not materialized the column, so there is nothing to display; it is just a reference that can make the code more readable. If you look at the methods of the pyspark.sql.Column class, there is nothing there to display a column's values. A column only "gets" values when it is evaluated as part of a DataFrame.
It takes some time to understand how Spark works. It uses lazy evaluation.
lazy evaluation means that if you tell Spark to operate on a set of data, it listens to what you ask it to do, writes down some shorthand for it so it doesn’t forget, and then does absolutely nothing. It will continue to do nothing, until you ask it for the final answer.
https://developer.hpe.com/blog/the-5-minute-guide-to-understanding-the-significance-of-apache-spark/
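A tiny sketch of lazy evaluation (Scala shown, but PySpark behaves the same; it assumes an existing SparkSession named spark):

import org.apache.spark.sql.functions.col

val df = spark.range(10).filter(col("id") > 5) // a transformation: nothing executes yet
df.count()                                     // an action: only now does Spark run the plan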

Enforcing Schema using .as[A] doesn't enforce the type

I'm trying to read a table in SQL Server, and then I want to enforce the schema of what I read. So I defined:
case class WhatIWant(FieldA: String, FieldB: String)
Then I try to do a regular spark read from SQL server, the Dataframe I read back, is with type of:
input:org.apache.spark.sql.DataFrame
FieldA:integer
FieldB:String
Then I appended .as[WhatIWant] to the read. I thought this would make it a Dataset[WhatIWant] with the typing information as I defined it. It turns out that in the notebook it actually gives me:
input:org.apache.spark.sql.Dataset[WhatIWant]
FieldA:integer
FieldB:String
Now I'm confused about two things:
Will .as[] actually enforce the schema if the schema inferred for the DataFrame is different from the definition of the case class?
I was under the assumption that Dataset is strongly typed, but the typing info I got from the Databricks notebook is not the same as what I defined. Is there an explanation for this?
Quick note:
All the values for "FieldA" are null; I felt that played a part in this, but I'm still confused.
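For what it's worth, here is a minimal sketch reproducing the behaviour described (the sample data is made up): .as[A] only attaches an encoder and checks that the columns can be up-cast, so the printed schema stays the physical one until rows are actually deserialized.

import org.apache.spark.sql.SparkSession

case class WhatIWant(FieldA: String, FieldB: String)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Physical schema has FieldA as an integer, like the SQL Server read.
val input = Seq((1, "x"), (2, "y")).toDF("FieldA", "FieldB")

val ds = input.as[WhatIWant]
ds.printSchema()               // still reports FieldA: integer

// Deserializing (e.g. via map) applies the encoder, so the case class
// types finally take effect:
ds.map(identity).printSchema() // now reports FieldA: string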

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working with PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable, which, again as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?
As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in nature; hence, DataFrames are immutable in nature as well.
Regarding withColumn, or any other operation for that matter: when you apply such operations to a DataFrame, it generates a new DataFrame instead of updating the existing one.
However, when you are working with Python, which is a dynamically typed language, you overwrite the value of the previous reference. Hence, when you execute the statement below
df = df.withColumn(...)
it generates another DataFrame and assigns it to the reference "df".
To verify this, you can use the id() method of the RDD to get the unique identifier of your DataFrame:
df.rdd.id()
This will give you the unique identifier of your DataFrame.
I hope the above explanation helps.
Regards,
Neeraj
You aren't; the documentation explicitly says:
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
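A quick Scala sketch of that behaviour (PySpark works the same way; it assumes spark.implicits._ is in scope):

import org.apache.spark.sql.functions.lit

val df1 = Seq(1, 2).toDF("a")
val df2 = df1.withColumn("b", lit(0))

df1.columns              // Array(a): the original DataFrame is untouched
df2.columns              // Array(a, b): withColumn returned a new DataFrame
df1.rdd.id == df2.rdd.id // false: two distinct datasets, as the id() check above shows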
The core data structure of Spark, the RDD itself, is immutable. This is much like a String in Java, which is immutable as well:
when you concatenate a string with another literal, you are not modifying the original string; you are creating a new one altogether.
Similarly, with either a DataFrame or a Dataset, whenever you alter the underlying RDD by adding or dropping a column, you are not changing anything in it; instead, you are creating a new Dataset/DataFrame.

Spark Dataframe / SQL - Complex enriching nested data

Context
I have an example of event source data in a dataframe input as shown below.
SOURCE
where eventOccurredTime is a String. It comes from the source, and I want to retain it in its original string form (with nanoseconds).
I want to use the string to enrich some extra date/time-typed data for downstream usage. Below is an example:
TARGET
Now, as a one-off, I can execute some Spark SQL on the DataFrame as shown below to get the result I want:
import org.apache.spark.sql.DataFrame

def transformDF(): DataFrame = {
  spark.sql(
    s"""
    SELECT
      id,
      struct(
        event.eventCategory,
        event.eventName,
        event.eventOccurredTime,
        struct(
          CAST(date_format(event.eventOccurredTime, "yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
          CAST(date_format(event.eventOccurredTime, "yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
          unix_timestamp(substring(event.eventOccurredTime, 1, 23), "yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
          datesDim.dateSeq AS eventOccurredDateDimSeq
        ) AS eventOccurredTimeDim,
NOTE: This is a snippet; for the full event I have to do this explicitly in the long SQL, 20 times over for the 20 string dates.
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime, 1, 23), ...)
Above, I found I had to substring a date that has nano precision, or unix_timestamp would return null; hence the substring.
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
The above is the pattern/naming convention for the four nested xDim struct fields to derive; they are present in the predefined Spark schema that the JSON is read with to create the source DataFrame.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the above 'eventOccurredDateDimSeq' field, I need to join to a dates dimension table 'datesDim' (static, with an hourly grain), where dateSeq is the 'key' and this date falls into an hourly bucket whose datesDim.UTC is defined to the hour:
LEFT OUTER JOIN datesDim ON
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the Spark cluster, so it should be quick to look up, but I need to do this for every date enrichment in the payloads, and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
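If that lookup ever becomes a bottleneck, one option is to hint a broadcast join. This is only a sketch: eventsDF is a hypothetical stand-in for the source DataFrame, and the join condition mirrors the SQL above.

import org.apache.spark.sql.functions.{broadcast, expr}

val datesDim = spark.table("datesDim")
val enriched = eventsDF.join(
  broadcast(datesDim), // hint: ship the small static table to every executor
  expr("""CAST(date_format(event.eventOccurredTime, "yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP)""") === datesDim("UTC"),
  "left_outer"
)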
The general schema pattern is that if there is a string date whose field name is:
x
...there is an 'xDim' struct equivalent that immediately follows it in schema order, as described:
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although in the image above I am only showing 'eventOccurredTime', there are more of these throughout the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
So I have the Spark SQL (the full monty that the snippet came from) to do this as a one-off for one event type, and it's a large, explicit SQL statement that applies the time functions and joins I showed. But here is my problem I need help with:
I want to create a more generic, functionally oriented, reusable solution that traverses a nested DataFrame and applies this transformation pattern 'where it needs to'.
How do I define 'where it needs to'?
Perhaps the naming convention is a good start: traverse the DF, look for any struct fields that match the xDim ('Dim' suffix) pattern, use the preceding 'x' field as the input, and populate the xDim.* values in line with the naming pattern described (see the sketch after these questions).
How do I best join on the registered datesDim table from within a function (it's static, remember) so that it performs?
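To make the 'where it needs to' idea concrete, here is a rough sketch (names hypothetical) that recursively walks a schema and collects the dotted paths of string fields that have a sibling struct named '<field>Dim':

import org.apache.spark.sql.types._

def findDimTargets(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    // A string field qualifies if a "<name>Dim" struct sits next to it.
    val here =
      if (field.dataType == StringType && schema.fieldNames.contains(field.name + "Dim"))
        Seq(path)
      else Seq.empty
    // Recurse into nested structs to catch matches at lower levels.
    val nested = field.dataType match {
      case st: StructType => findDimTargets(st, path)
      case _              => Seq.empty
    }
    here ++ nested
  }

// e.g. findDimTargets(input.schema) would return Seq("event.eventOccurredTime")
// for the SOURCE example above, plus any deeper matches.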
Solution?
I think one or more UDFs are needed (we use Scala), either by themselves or as fragments within SQL, but I'm not sure. Ensuring the datesDim lookup performs well is key, I think.
Or maybe there is another way?
Note: I am working with DataFrames/Spark SQL, not Datasets, but options for either are welcome.
Databricks
NOTE: I'm actually using the Databricks platform for this, so for those versed in SQL there are 'higher-order functions' in Databricks:
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
...is there a slick option here using TRANSFORM as a SQL HOF (perhaps registering a utility UDF and using it with TRANSFORM)?
Awesome, thanks spark community for your help!!! Sorry this is a long post setting the scene.

How to use values (as Column) in function (from functions object) where Scala non-SQL types are expected?

I'd like to understand how I can dynamically add a number of days to a given timestamp. I tried something similar to the example shown below. The issue here is that the second argument is expected to be of type Int, but in my case it is of type Column. How do I unbox this / get the actual value? (The code examples below might not be 100% correct, as I'm writing this from the top of my head; I don't have the actual code with me currently.)
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days")))
I tried casting:
myDataset.withColumn("finalDate",date_add(col("date"),col("no_of_days").cast(IntegerType)))
But this did not help either. So how can this be solved?
I did find a workaround by using selectExpr:
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
While this works, I still would like to understand how to get the same result with withColumn.
withColumn("finalDate", expr("date_add(date,no_of_days)"))
The above syntax should work.
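For illustration, a self-contained sketch of the expr() approach (the sample data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val myDataset = Seq(("2020-01-01", 5), ("2020-02-01", 10)).toDF("date", "no_of_days")

// expr() hands the whole expression to Spark SQL's parser, so both arguments
// are resolved inside SQL's type system and no Scala Int is ever needed.
myDataset.withColumn("finalDate", expr("date_add(date, no_of_days)")).show()
// first row's finalDate: 2020-01-06

As an aside, newer Spark versions (3.0+) also added an overload of functions.date_add that accepts a Column for the days argument, which makes the withColumn form work directly.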
I think it's not possible, as you'd have to use two separate, similar-looking type systems: Scala's and Spark SQL's.
What you call a workaround using selectExpr is probably the only way to do it, as it keeps you confined to a single type system, Spark SQL's; since the parameters are all defined in Spark SQL's "realm", that's the only possible way.
myDataset.selectExpr("date_add(date,no_of_days) as finalDate")
BTW, you've just shown me another way in which support for SQL differs from the Dataset Query DSL: it's about the source of the parameters to functions, which can be only structured data sources, only Scala, or a mixture thereof (as in UDFs and UDAFs). Thanks!
