Spark Dataframe / SQL - Complex enriching nested data - apache-spark

Context
I have an example of event source data in a dataframe input as shown below.
SOURCE
where eventOccurredTime is a String type. It comes from the source, and I want to retain it in its original string form (with nanosecond precision).
I also want to use the string to derive some extra date/time typed fields for downstream usage. Below is an example:
TARGET
As a one-off, I can execute some Spark SQL on the dataframe, as shown below, to get the result I want:
import org.apache.spark.sql.DataFrame
def transformDF(): DataFrame = {
  spark.sql(
    s"""
    SELECT
      id,
      struct(
        event.eventCategory,
        event.eventName,
        event.eventOccurredTime,
        struct (
          CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
          CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
          unix_timestamp(substring(event.eventOccurredTime,1,23),"yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
          datesDim.dateSeq AS eventOccurredDateDimSeq
        ) AS eventOccurredTimeDim,
NOTE: This is a snippet. For the full event, I have to repeat this explicitly in one long SQL statement 20 times, once for each of the 20 string dates.
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime,1,23)
Above, I found I had to substring a date that had nanosecond precision, otherwise unix_timestamp would return null.
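The truncation idea can be illustrated outside Spark. This is a minimal sketch in plain Python, not the Spark expression itself: the function name `to_epoch_millis` and the sample input are hypothetical, and the string is assumed to be UTC in the form yyyy-MM-dd'T'HH:mm:ss followed by nine fractional digits. It keeps the first 23 characters, as substring(..., 1, 23) does, and derives epoch milliseconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(event_time: str) -> int:
    """Truncate a nano-precision timestamp string to millis and return epoch millis.

    Assumption: the input is UTC and starts with yyyy-MM-ddTHH:mm:ss.SSS.
    """
    truncated = event_time[:23]  # "yyyy-MM-ddTHH:mm:ss.SSS" is 23 characters long
    parsed = datetime.strptime(truncated, "%Y-%m-%dT%H:%M:%S.%f")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)

# Hypothetical sample value, not taken from the source data.
print(to_epoch_millis("2019-01-01T00:00:00.123456789"))
```

Note that Spark's unix_timestamp returns whole seconds, so multiplying its result by 1000 gives a value whose millisecond part is always zero, whereas the sketch above keeps the fractional part.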
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
The above is the pattern / naming convention for the four nested xDim struct fields to derive. They are present in the predefined Spark schema that is used to read the JSON when creating the source dataframe.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the 'eventOccurredDateDimSeq' field above, I need to join to a dates dimension table, 'datesDim' (static, with an hourly grain). dateSeq is the key of the hourly bucket that the date falls into, and datesDim.UTC is defined to the hour:
LEFT OUTER JOIN datesDim ON
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the Spark cluster, so it should be quick to look up, but I need to do this for every date enrichment in the payloads, and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
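The hourly bucketing behind the join on datesDim.UTC can be sketched in plain Python. This is only an illustration of the lookup logic under stated assumptions: the dictionary `dates_dim` is a hypothetical in-memory stand-in for the datesDim table, and its keys and dateSeq values are made up.

```python
from datetime import datetime
from typing import Optional

# Hypothetical stand-in for datesDim: start of hour (UTC) -> dateSeq.
dates_dim = {
    datetime(2019, 1, 1, 0): 1,
    datetime(2019, 1, 1, 1): 2,
}

def hour_bucket(event_time: str) -> datetime:
    """Truncate a timestamp string to the start of its hour."""
    # The first 19 characters cover yyyy-MM-ddTHH:mm:ss; fractions are ignored.
    parsed = datetime.strptime(event_time[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(minute=0, second=0)

def lookup_date_seq(event_time: str) -> Optional[int]:
    """Return the dateSeq of the matching hour, or None when no hour matches.

    Returning None mirrors the null a LEFT OUTER JOIN yields for unmatched rows.
    """
    return dates_dim.get(hour_bucket(event_time))

print(lookup_date_seq("2019-01-01T01:45:10.123456789"))
```

Because the dimension is static and keyed by the hour, every date field in a payload can reuse the same lookup once it has been truncated to the hour.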
The general schema pattern is that if there is a string date whose field name is:
x
...there is an equivalent 'xDim' struct that immediately follows it in schema order, as described below.
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although the image above shows only 'eventOccurredTime', there are more of these throughout the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
I have the Spark SQL (the full statement the snippet came from) to do this as a one-off for one event type. It is a large, explicit SQL statement that applies the time functions and joins I showed. Here is the problem I need help with.
I want to create a more generic, functionally oriented, reusable solution that traverses a nested dataframe and applies the transformation pattern described above 'where it needs to'.
How do I define 'where it needs to'?
Perhaps the naming convention is a good start: traverse the DF, look for any struct fields that match the xDim ('Dim' suffix) pattern, use the preceding 'x' field as the input, and populate the xDim.* values in line with the naming pattern described?
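The naming-convention traversal could look something like the following. This is a rough sketch, not a Spark implementation: the schema is modelled as a nested Python dict (a hypothetical stand-in for a StructType), the function name `find_dim_pairs` is made up, and the field types in the example are assumptions. Arrays of structs are not handled.

```python
from typing import Dict, List, Union

# A struct is a dict of field name -> type name or nested struct.
Schema = Dict[str, Union[str, "Schema"]]

def find_dim_pairs(schema: Schema, prefix: str = "") -> List[str]:
    """Return dotted paths of string fields `x` that have a sibling struct `xDim`."""
    pairs: List[str] = []
    for name, dtype in schema.items():
        path = prefix + name
        if isinstance(dtype, dict):
            # Nested struct: keep searching at the lower level.
            pairs.extend(find_dim_pairs(dtype, path + "."))
        elif dtype == "string" and isinstance(schema.get(name + "Dim"), dict):
            pairs.append(path)
    return pairs

# Assumed example schema, based on the field names in the snippet.
example_schema: Schema = {
    "id": "string",
    "event": {
        "eventCategory": "string",
        "eventName": "string",
        "eventOccurredTime": "string",
        "eventOccurredTimeDim": {
            "eventOccurredTimestampUTC": "timestamp",
            "eventOccurredDateUTC": "date",
            "eventOccurredTimestampMillis": "long",
            "eventOccurredDateDimSeq": "long",
        },
    },
}

print(find_dim_pairs(example_schema))
```

The returned paths identify the input fields; the four xDim.* expressions could then be generated from each path rather than written out by hand.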
How best, within a function, to join on the registered datesDim table (it is static, remember) so that it performs well?
Solution?
I think one or more UDFs are needed (we use Scala), either by themselves or as fragments within SQL, but I am not sure. Ensuring the datesDim lookup performs well is key, I think.
Or maybe there is another way?
Note: I am working with Dataframes / Spark SQL, not Datasets, though options for either are welcome.
Databricks
NOTE: I'm actually using the Databricks platform for this, so for those versed in SQL 'higher-order functions' in Databricks:
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
...is there a slick option here using 'TRANSFORM' as a SQL HOF (perhaps by registering a utility UDF and using it with TRANSFORM)?
Awesome, thanks spark community for your help!!! Sorry this is a long post setting the scene.

Related

PySpark: combine aggregate and window functions

I am working with legacy Spark SQL code like this:
SELECT
column1,
max(column2),
first_value(column3),
last_value(column4)
FROM
tableA
GROUP BY
column1
ORDER BY
columnN
I am rewriting it in PySpark as below
df.groupBy(column1).agg(max(column2), first(column3), last(column4)).orderBy(columnN)
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and you don't specify an ordering for the remaining columns. Under the hood, first() and last() return whichever matching row Spark happens to encounter first or last within each group, and that depends on how the rows are partitioned and shuffled. Spark SQL and PySpark might pick different rows because no ordering is specified for the remaining columns.
As for Window functions, you can use partitionBy(f.col('column_name')) in your Window, which works much like a groupBy: it groups the data according to a partitioning column. However, without fully specifying the ordering, you might arrive at the same problem of non-determinism. Hope this helps!
For completeness' sake, I recommend you have a look at the PySpark docs for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note sheds light on why your behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
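The dependence on row order can be shown with a small plain-Python analogy. This is only an illustration of the principle, not how Spark is implemented; the helper `first_per_group` and the sample rows are hypothetical, with column names borrowed from the question.

```python
from operator import itemgetter

def first_per_group(rows, key, value, order_by=None):
    """Take the first `value` seen for each `key`, optionally after sorting."""
    if order_by is not None:
        # An explicit ordering makes the result independent of arrival order.
        rows = sorted(rows, key=itemgetter(order_by))
    result = {}
    for row in rows:
        result.setdefault(row[key], row[value])
    return result

rows = [
    {"column1": "a", "column3": "x", "columnN": 2},
    {"column1": "a", "column3": "y", "columnN": 1},
]
shuffled = list(reversed(rows))  # same data, different arrival order

# Without an ordering, the answer depends on which row arrives first.
print(first_per_group(rows, "column1", "column3"))
print(first_per_group(shuffled, "column1", "column3"))

# With an ordering, both arrival orders agree.
print(first_per_group(rows, "column1", "column3", order_by="columnN"))
print(first_per_group(shuffled, "column1", "column3", order_by="columnN"))
```

The same data yields different "first" values until an ordering is imposed, which is the effect a shuffle can have on first() and last() in a plain groupBy.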
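Definitely!
import pyspark.sql.functions as F
from pyspark.sql import Window
# Use an explicit full-partition frame: with orderBy alone, the default frame
# ends at the current row, so last() would return the current row's value.
partition = Window.partitionBy("column1").orderBy("columnN").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
data = data.withColumn("max_col2", F.max(F.col("column2")).over(partition))\
    .withColumn("first_col3", F.first(F.col("column3")).over(partition))\
    .withColumn("last_col4", F.last(F.col("column4")).over(partition))
data.show(10, False)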

How to use Impala to read Hive view containing complex types?

I have some data that is processed and modeled based on case classes. The classes can also contain other case classes, so the final table has complex data types (struct, array). Using the case class, I save the data in Hive using dataframe.saveAsTextFile(path).
This data sometimes changes or needs to have a different model, so for each iteration I use a suffix in the table name (some_data_v01, some_data_v03, etc.).
I also have queries that run on a schedule against these tables, using Impala. So that I don't have to modify the queries each time I save a new table, I wanted to use a view that always points at the latest table whenever I change the model.
The problem is that I can't use Impala to create the view, because of the complex nature of the data in the tables (nested complex types). Apart from it being a lot of work to expand the complex types, I want these types to be preserved (lots of levels of nesting, duplication of data when joining arrays).
One solution was to create the view using Hive, like this
create view some_data as select * from some_data_v01;
But if I do this, when I want to use the table from Impala,
select * from some_data;
or even something simple, like
select some_value_not_nested, struct_type.some_int, struct_type.some_other_int from some_data;
the error is the following:
AnalysisException: Expr 'some_data_v01.struct_type' in select list returns a complex type
'STRUCT< some_int:INT, some_other_int:INT, nested_struct:STRUCT< nested_int:INT, nested_other_int:INT>, last_int:INT>'. Only scalar types are allowed in the select list.
Is there any way to access this view, or create it in some other way for it to work?

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?
As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in nature; hence DataFrames are immutable as well.
As for withColumn, or any other operation for that matter: when you apply such operations to a DataFrame, they generate a new DataFrame instead of updating the existing one.
However, when you are working with Python, which is a dynamically typed language, you overwrite the value of the previous reference. Hence, when you execute the statement below
df = df.withColumn()
It will generate another DataFrame and assign it to the reference "df".
To verify this, you can use the id() method of the DataFrame's underlying RDD to get a unique identifier:
df.rdd.id()
This gives you a unique identifier for the RDD behind your DataFrame.
I hope the above explanation helps.
Regards,
Neeraj
You aren't; the documentation explicitly says
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
The core data structure of Spark, i.e. the RDD, is itself immutable. This is much like a String in Java, which is immutable as well.
When you concatenate a string with another literal, you are not modifying the original string; you are creating a new one altogether.
Similarly, with either a Dataframe or a Dataset, whenever you alter it by adding a column or dropping one, you are not changing anything in it; instead you are creating a new Dataset/Dataframe.
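The string analogy works in Python too, since Python's str is also immutable. A small sketch (the variable names are made up for illustration) that mirrors the `df = df.withColumn(...)` pattern: the name is rebound to a new object while the old object stays unchanged.

```python
original = "spark"
alias = original               # second reference to the same string object

original = original + "_sql"   # builds a new string and rebinds the name

print(alias)                   # the old object is untouched
print(original)                # the name now refers to the new object
print(alias is original)       # the two names refer to different objects
```

Any variable that still refers to the old object, like `alias` here, does not see the change, just as a variable holding the earlier DataFrame does not gain the new column.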

spark reference columns in refactorable way

Spark SQL is awesome. However, columns are inherently referenced by strings. Even with the Dataset API, only the presence of required columns is checked, not the absence of additional fields. My main problem is that even with the Dataset API, strings are used to reference columns.
Is there a way to reference columns in Spark SQL in a more typesafe way, without introducing an additional data structure for each table (besides the initial case class for the type information) to address the names, in order to get better refactoring and IDE support?
edit
See the snippet below. It compiles even though the column reference is clearly wrong. Edit/refactor in the IDE also does not seem to work properly.
case class Foo(bar: Int)
import spark.implicits._
val ds = Seq(Foo(1), Foo(2)).toDS
ds.select('fooWrong)
NOTE:
import spark.implicits._
is already imported, so 'fooWrong is already treated as a column.
Frameless seems to be the go-to solution, offering the desired properties: https://github.com/typelevel/frameless
The only downside is that joins currently work only with column equality. Allowing any boolean predicate is still in progress.

FoundationDB - inserting data through key-value layer and reading it though SQL-layer. Is it possible?

I'm trying to use FoundationDB for a specific application, so I'm asking for help with an issue I cannot resolve or find any information about.
The thing is, in the application, I MUST read the data through the SQL layer (specifically, the ODBC driver). Nevertheless, I can, and would even prefer to, insert the data with the standard key-value layer (not through the SQL layer).
So the question is - is it possible? Could you help me with any information or at least point me where to look for it (I failed to find any brief info by myself)?
I believe that inserting the data through the SQL layer is probably less efficient, which seems understandable (since the DB itself is NoSQL), or maybe I am wrong here?
Let's not focus on the reasonableness of this approach, please, as this is an experimental academic project :).
Thank you for any help!
Even though you asked not to, I have to give a big warning: There be dragons down this path!
Think of it this way: to write data that always matches what the SQL Layer expects, you would have to re-implement the SQL Layer.
Academic demonstration follows :)
Starting table and row:
CREATE TABLE test.t(id INT NOT NULL PRIMARY KEY, str VARCHAR(32)) STORAGE_FORMAT tuple;
INSERT INTO test.t VALUES (1, 'one');
Python to read the current rows and add a new row:
import fdb
import fdb.tuple
fdb.api_version(200)
db = fdb.open()
# Directory for SQL Layer table 'test'.'t'
tdir = fdb.directory.open(db, ('sql', 'data', 'table', 'test', 't'))
# Read all current rows
for k, v in db[tdir.range()]:
    print(fdb.tuple.unpack(k), '=>', fdb.tuple.unpack(v))
# Write (2, 'two') row
db[tdir.pack((1, 2))] = fdb.tuple.pack((2, u'two'))
And finally, read the data back from SQL:
test=> SELECT * FROM t;
id | str
----+-----
1 | one
2 | two
(2 rows)
What is happening here:
Create a table with keys and values as Tuples using the STORAGE_FORMAT option
Insert a row
Import and open FDB
Open the Directory of the table
Scan all the rows and unpack for printing
Add a new row by creating Tuples containing the expected values
The key contains three components (something like (230, 1, 1)):
The directory prefix
The ordinal of the table, identifier within the SQL Layer Table Group
The value of the PRIMARY KEY
The value contains the columns in the table, in the order they were declared.
Now that we have a simple proof of concept, here are a handful of reasons why it is challenging to keep your data correct this way:
Schema generation, metadata and data format versions weren't checked
PRIMARY KEY wasn't maintained and is still in the "internal" format
No secondary indexes to maintain
No other tables in the Table Group to maintain (i.e. test table is a single table group)
Online DDL was ignored, which (basically) doubles the amount of work to do during DML
It's also important to note that these cautions apply only to writing data you want to access through the SQL Layer. The inverse, reading data the SQL Layer wrote, is much easier, as it doesn't have to worry about these problems.
Hopefully that gives you a sense of the scope!

Resources