How to use Impala to read Hive view containing complex types? - apache-spark

I have some data that is processed and modeled with case classes, and those classes can contain other case classes, so the final table has complex data: structs and arrays. Using the case classes I save the data to Hive with dataframe.write.saveAsTable(path).
This data sometimes changes or needs to have a different model, so for each iteration I use a suffix in the table name (some_data_v01, some_data_v03, etc.).
I also have queries that run on a schedule against these tables, using Impala, so to avoid modifying the query each time I save a new table, I wanted to use a view that I update whenever the model changes.
The problem is that I can't use Impala to create the view, because of the complex nature of the data in the tables (nested complex types). Apart from it being a lot of work to expand the complex types, I want those types preserved (many levels of nesting; joining on arrays duplicates data).
One solution was to create the view using Hive, like this
create view some_data as select * from some_data_v01;
But if I do this, when I want to query the view from Impala,
select * from some_data;
or even something simple, like
select some_value_not_nested, struct_type.some_int, struct_type.some_other_int from some_data;
the error is the following:
AnalysisException: Expr 'some_data_v01.struct_type' in select list returns a complex type
'STRUCT< some_int:INT, some_other_int:INT, nested_struct:STRUCT< nested_int:INT, nested_other_int:INT>, last_int:INT>'. Only scalar types are allowed in the select list.
Is there any way to access this view, or create it in some other way for it to work?

Related

Jooq - converting nested objects

The problem I have is how to convert a jOOQ select query into an object. If I use the default jOOQ mapper, it works, but all fields must be mentioned, and in exact order. If I use SimpleFlatMapper, I have problems with multiset.
The problem with simple flat mapper:
class Student {
    private final String id;
    private Set<String> bookIds;
}

private static final SelectQueryMapper<Student> studentMapper =
    SelectQueryMapperFactory.newInstance().newMapper(Student.class);

var students = studentMapper.asList(
    context.select(
            STUDENT.ID.as("id"),
            multiset(
                select(BOOK.ID).from(BOOK).where(BOOK.STUDENT_ID.eq(STUDENT.ID))
            ).convertFrom(r -> r.intoSet(BOOK.ID)).as("bookIds"))
        .from(STUDENT).where(STUDENT.ID.eq("<id>"))
);
For the attribute bookIds, SimpleFlatMapper returns a Set of exactly one String, ["[[book_id_1], [book_id_2]]"], instead of ["book_id_1", "book_id_2"].
As I already mentioned, this works with the default jOOQ mapper, but in my case not all attributes are mentioned in the columns, and some attributes may be added that are not present in the table.
The question is: is there any way to tell SimpleFlatMapper that the mapping is one-to-one (Set to set), or to have a default jOOQ mapper that ignores non-matching and out-of-order fields?
Also, what is the best approach in these situations?
Once you start using jOOQ's MULTISET and ad-hoc conversion capabilities, I doubt you still need third parties like SimpleFlatMapper, which I don't think can deserialise jOOQ's internally generated JSON serialisation format (currently an array of arrays, not an array of objects; there is no specification for this format, and it might change in any version).
Just use ad-hoc converters.
If I use default jooq mapper, it works but all fields must be mentioned, and in exact order
You should see that as a feature, not a bug. It increases type safety and forces you to think about your exact projection, helping you avoid projecting too much data (which would heavily slow down your queries!).
But you don't have to use the programmatic RecordMapper approach that is currently being advocated in the jOOQ manual and blog posts. The "old" reflective DefaultRecordMapper will continue to work, where you simply have to have matching column aliases / target type getters/setters/member names.

Raw sql with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example: I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-SQL the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL rather than pulling all the data and filtering in Haskell, because otherwise I'd have way too much data to handle.
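To make that lag condition concrete, here is a plain-Python sketch of the same filter (illustrative only; as noted, the real selection has to happen in SQL, and the sample values are made up):

```python
def filter_by_lag(vals, x):
    """Keep values that don't exceed the previous value by more than x."""
    out = []
    prev = None
    for v in vals:
        # First row has no lag, so it is always kept (as SQL's lag() is NULL there)
        if prev is None or v - prev <= x:
            out.append(v)
        prev = v
    return out

print(filter_by_lag([1, 2, 10, 11], 3))  # -> [1, 2, 11]
```

Note that the comparison uses each row's actual predecessor, not the last kept row, mirroring what lag() over an ordered window would do.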
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to de-normalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.

Spark Dataframe / SQL - Complex enriching nested data

Context
I have an example of event source data in a dataframe input as shown below.
SOURCE
where eventOccurredTime is a String. This comes from the source and I want to retain it in its original string form (with nanoseconds).
I also want to use the string to derive some extra date/time-typed data for downstream usage. Below is an example:
TARGET
Now, as a one-off, I can execute some Spark SQL on the dataframe as shown below to get the result I want:
import org.apache.spark.sql.DataFrame

def transformDF(): DataFrame = {
  spark.sql(
    s"""
    SELECT
      id,
      struct(
        event.eventCategory,
        event.eventName,
        event.eventOccurredTime,
        struct (
          CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
          CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
          unix_timestamp(substring(event.eventOccurredTime,1,23),"yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
          datesDim.dateSeq AS eventOccurredDateDimSeq
        ) AS eventOccurredTimeDim,
    ...
NOTE: This is a snippet; for the full event I have to do this explicitly, in this long SQL, 20 times for the 20 string dates.
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime,1,23)
I found I had to substring the date because it has nanosecond precision and unix_timestamp would otherwise return null; hence the substring to millisecond precision.
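The effect of that substring(...,1,23) truncation can be shown outside Spark with a small Python sketch (the timestamp value is a made-up sample):

```python
from datetime import datetime, timezone

ts = "2020-05-01T12:34:56.123456789"  # nanosecond-precision string (sample value)

# Parsing the full string fails: %f accepts at most 6 fractional digits
try:
    datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")
except ValueError:
    pass  # nine fractional digits are rejected

# Truncating to 23 characters keeps exactly millisecond precision and parses cleanly
dt = datetime.strptime(ts[:23], "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
millis = int(dt.timestamp() * 1000)
```

The same idea applies in Spark SQL: characters 1-23 of the ISO string cover year through milliseconds, which matches the "yyyy-MM-dd'T'HH:mm:ss.SSS" pattern.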
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
the above is the pattern / naming convention for the 4 nested xDim struct fields to derive; they are present in the predefined Spark schema the JSON is read with to create the source dataframe.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the 'eventOccurredDateDimSeq' field above, I need to join to a dates dimension table 'datesDim' (static, with an hourly grain), where dateSeq is the key and each date falls into an hourly bucket (datesDim.UTC is defined to the hour):
LEFT OUTER JOIN datesDim ON
  CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the Spark cluster, so it should be quick to look up, but I need to do this for every date enrichment in the payloads, and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
The general schema pattern is that if there is a string date field named x, there is an 'xDim' struct equivalent that immediately follows it in schema order, as described:
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although in the image above I only show 'eventOccurredTime', there are more of these throughout the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
So I have the Spark SQL (the full version the snippet came from) to do this as a one-off for one event type, and it's a large, explicit SQL statement that applies the time functions and joins I showed. But here is my problem, which I need help with:
I want to create a more generic, functionally oriented, reusable solution that traverses a nested dataframe and applies this transformation pattern 'where it needs to'.
How do I define 'where it needs to'?
Perhaps the naming convention is a good start: traverse the DF, look for any struct fields that match the xDim ('Dim' suffix) pattern, use the preceding 'x' field as the input, and populate the xDim.* values in line with the naming pattern described.
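As a sketch of that convention-driven traversal, here is plain Python over a toy schema representation (not Spark's StructType; the field names are hypothetical), finding every string field that has an xDim sibling:

```python
def find_dim_targets(schema, path=()):
    """schema: list of (name, type) pairs, where type is 'string' or a nested list.
    Returns dotted paths of string fields that have a sibling '<name>Dim' struct."""
    names = {name for name, _ in schema}
    targets = []
    for name, typ in schema:
        if isinstance(typ, list):
            # Recurse into nested structs so lower levels are covered too
            targets += find_dim_targets(typ, path + (name,))
        elif typ == "string" and name + "Dim" in names:
            targets.append(".".join(path + (name,)))
    return targets

schema = [
    ("id", "string"),
    ("event", [
        ("eventCategory", "string"),
        ("eventOccurredTime", "string"),
        ("eventOccurredTimeDim", [("eventOccurredTimestampUTC", "string")]),
    ]),
]
print(find_dim_targets(schema))  # -> ['event.eventOccurredTime']
```

In Spark the same walk would run over df.schema (StructType/StructField), and the returned paths would drive the generated struct(...) expressions.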
And how, in a function, do I best join on the registered datesDim table (it's static, remember) so that it performs?
Solution?
I think one or more UDFs are needed (we use Scala), maybe alone or as a fragment within SQL, but I'm not sure. Ensuring the datesDim lookup performs well is key, I think.
Or maybe there is another way?
Note: I am working with Dataframes / Spark SQL, not Datasets, but options for each are welcome.
Databricks
NOTE: I'm actually using the Databricks platform for this, so for those versed in SQL higher-order functions in Databricks:
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
...is there a slick option here using TRANSFORM as a SQL HOF (perhaps registering a utility UDF and using it with transform)?
Awesome, thanks spark community for your help!!! Sorry this is a long post setting the scene.

Storing a list of mixed types in Cassandra

In Cassandra, when specifying a table and its fields, one has to give each field a type (text, int, boolean, etc.). The same applies to collections: you have to lock a collection to a specific type (set<text> and such).
I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>.
Is this possible in Cassandra, and if not, what workaround would you suggest for storing a list of mixed-type items? I sketched a few, but none of them seem the right way to go...
Cassandra's CQL interface is strictly typed, so you will not be able to create a table with an untyped collection column.
I basically see two options:
Create a list field, and convert everything to text (not too nice, I agree)
Use the Thrift API and store everything as-is.
As suggested at http://www.mail-archive.com/user@cassandra.apache.org/msg37103.html, I decided to encode the various values into binary and store them in a list<blob>. This still allows querying the collection values (in Cassandra 2.1+); one just needs to encode the values in the query.
In Python, the simplest way is probably to pickle and hexify when storing data:
import pickle
# Python 2 style; on Python 3 use pickle.dumps('Hello world').hex()
pickle.dumps('Hello world').encode('hex')
And to load it:
# Python 2 style; on Python 3 use pickle.loads(bytes.fromhex(item))
pickle.loads(item.decode('hex'))
Using pickle ties the implementation to python, but it automatically converts to correct type (int, string, boolean, etc.) when loading, so it's convenient.
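For example, the round trip for a mixed-type list looks like this (shown without the Cassandra driver; the sample values are arbitrary):

```python
import pickle

values = [42, "hello", True]               # mixed-type list to store

blobs = [pickle.dumps(v) for v in values]  # one blob (bytes) per element, for list<blob>
decoded = [pickle.loads(b) for b in blobs] # types come back intact on read

print(decoded)  # -> [42, 'hello', True]
```

Each element is encoded independently, so individual collection values can still be matched in a query by encoding the probe value the same way.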

FoundationDB - inserting data through key-value layer and reading it though SQL-layer. Is it possible?

I'm trying to use FoundationDB for a specific application, so I'm asking for help with an issue I cannot resolve or find any information about.
The thing is, in the application I MUST read the data through the SQL Layer (specifically, the ODBC driver). Nevertheless, I can, or would even prefer to, insert the data with the standard key-value layer (not through the SQL Layer).
So the question is: is it possible? Could you help me with any information, or at least point me to where to look (I failed to find anything brief by myself)?
I believe that inserting the data through the SQL Layer is probably less efficient, which seems understandable (since the DB itself is NoSQL), or maybe I am wrong here?
Let's not focus about the reasonableness of this approach, please, as this is some experimental academic project :).
Thank you for any help!
Even though you asked not to, I have to give a big warning: There be dragons down this path!
Think of it this way: to write data that is always exactly what the SQL Layer expects, you will have to re-implement the SQL Layer.
Academic demonstration follows :)
Starting table and row:
CREATE TABLE test.t(id INT NOT NULL PRIMARY KEY, str VARCHAR(32)) STORAGE_FORMAT tuple;
INSERT INTO test.t VALUES (1, 'one');
Python to read the current rows and add a new one:
import fdb
import fdb.tuple

fdb.api_version(200)
db = fdb.open()

# Directory for SQL Layer table 'test'.'t'
tdir = fdb.directory.open(db, ('sql', 'data', 'table', 'test', 't'))

# Read all current rows
for k, v in db[tdir.range()]:
    print fdb.tuple.unpack(k), '=>', fdb.tuple.unpack(v)

# Write (2, 'two') row
db[tdir.pack((1, 2))] = fdb.tuple.pack((2, u'two'))
And finally, read the data back from SQL:
test=> SELECT * FROM t;
id | str
----+-----
1 | one
2 | two
(2 rows)
What is happening here:
Create a table with keys and values as Tuples using the STORAGE_FORMAT option
Insert a row
Import and open FDB
Open the Directory of the table
Scan all the rows and unpack for printing
Add a new row by creating Tuples containing the expected values
The key contains three components (something like (230, 1, 1)):
The directory prefix
The ordinal of the table, identifier within the SQL Layer Table Group
The value of the PRIMARY KEY
The value contains the columns in the table, in the order they were declared.
Now that we have a simple proof of concept, here are a handful of reasons why it is challenging to keep your data correct:
Schema generation, metadata and data format versions weren't checked
PRIMARY KEY wasn't maintained and is still in the "internal" format
No secondary indexes to maintain
No other tables in the Table Group to maintain (i.e. test table is a single table group)
Online DDL was ignored, which (basically) doubles the amount of work to do during DML
It's also important to note that these cautions only apply to writing data you want to access through the SQL Layer. The inverse, reading data the SQL Layer wrote, is much easier, as it doesn't have to worry about these problems.
Hopefully that gives you a sense of the scope!
