FoundationDB - inserting data through the key-value layer and reading it through the SQL layer. Is it possible?

I'm trying to use FoundationDB for a specific application, so I'm asking for help with an issue I cannot resolve or find any information about.
The thing is, in the application, I MUST read the data through the SQL layer (specifically, the ODBC driver). Nevertheless, I can, and would even prefer to, insert the data through the standard key-value layer (not through the SQL layer).
So the question is - is it possible? Could you help me with any information, or at least point me to where to look for it (I failed to find anything about this myself)?
I believe that inserting the data through the SQL layer is probably less efficient, which seems pretty understandable (since the DB itself is NoSQL), or maybe I am wrong here?
Let's not focus on the reasonableness of this approach, please, as this is an experimental academic project :).
Thank you for any help!

Even though you asked not to, I have to give a big warning: There be dragons down this path!
Think of it this way: to write data that is always in the form the SQL Layer expects, you will have to re-implement the SQL Layer.
Academic demonstration follows :)
Starting table and row:
CREATE TABLE test.t(id INT NOT NULL PRIMARY KEY, str VARCHAR(32)) STORAGE_FORMAT tuple;
INSERT INTO test.t VALUES (1, 'one');
Python to read the current and add a new row:
import fdb
import fdb.tuple
fdb.api_version(200)
db = fdb.open()
# Directory for SQL Layer table 'test'.'t'
tdir = fdb.directory.open(db, ('sql', 'data', 'table', 'test', 't'))
# Read all current rows
for k,v in db[tdir.range()]:
    print fdb.tuple.unpack(k), '=>', fdb.tuple.unpack(v)
# Write (2, 'two') row
db[tdir.pack((1, 2))] = fdb.tuple.pack((2, u'two'))
And finally, read the data back from SQL:
test=> SELECT * FROM t;
id | str
----+-----
1 | one
2 | two
(2 rows)
What is happening here:
Create a table with keys and values as Tuples using the STORAGE_FORMAT option
Insert a row
Import and open FDB
Open the Directory of the table
Scan all the rows and unpack for printing
Add a new row by creating Tuples containing the expected values
The key contains three components (something like (230, 1, 1)):
The directory prefix
The ordinal of the table, an identifier within the SQL Layer Table Group
The value of the PRIMARY KEY
The value contains the columns in the table, in the order they were declared.
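To make that concrete, here is a small sketch continuing the Python snippet above (the actual directory prefix bytes vary per database; table ordinal 1 is assumed, as in the example key):
# The packed key is the directory prefix followed by (table ordinal, PRIMARY KEY value);
# the packed value is the columns in declared order.
key = tdir.pack((1, 2))
value = fdb.tuple.pack((2, u'two'))
print tdir.unpack(key)            # -> (1, 2)
print fdb.tuple.unpack(value)     # -> (2, u'two')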
Now that we have a simple proof of concept, here are a handful of reasons why it is challenging to keep your data correct:
Schema generation, metadata and data format versions weren't checked
PRIMARY KEY wasn't maintained and is still in the "internal" format
No secondary indexes to maintain
No other tables in the Table Group to maintain (i.e. the test table is a single-table group)
Online DDL was ignored, which (basically) doubles the amount of work to do during DML
It's also important to note that these cautions only apply to writing data you want to access through the SQL Layer. The inverse, reading data the SQL Layer wrote, is much easier, as it doesn't have to worry about these problems.
Hopefully that gives you a sense of the scope!

Related

How to use Impala to read Hive view containing complex types?

I have some data that is processed and modeled based on case classes, and the classes can also contain other case classes, so the final table has complex data types (structs, arrays). Using the case classes I save the data in Hive using dataframe.saveAsTextFile(path).
This data sometimes changes or needs to have a different model, so for each iteration I use a suffix in the table name (some_data_v01, some_data_v03, etc.).
I also have queries that run on a schedule against these tables, using Impala, so in order not to modify the query each time I save a new table, I wanted to use a view that is always updated whenever I change the model.
The problem with that is I can't use Impala to create the view, because of the complex nature of the data in the tables (nested complex types). Apart from it being a lot of work to expand the complex types, I want these types to be preserved (lots of levels of nesting, duplication of data when joining arrays).
One solution was to create the view using Hive, like this
create view some_data as select * from some_data_v01;
But if I do this, when I want to use the table from Impala,
select * from some_data;
or even something simple, like
select some_value_not_nested, struct_type.some_int, struct_type.some_other_int from some_data;
the error is the following:
AnalysisException: Expr 'some_data_v01.struct_type' in select list returns a complex type
'STRUCT< some_int:INT, some_other_int:INT, nested_struct:STRUCT< nested_int:INT, nested_other_int:INT>, last_int:INT>'. Only scalar types are allowed in the select list.
Is there any way to access this view, or create it in some other way for it to work?

Raw sql with many columns

I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-SQL the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to denormalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.

spark reference columns in refactorable way

Spark SQL is awesome. However, columns are inherently referenced by strings. Even for the Dataset API, only the presence of required columns is checked - not the absence of additional fields. And my main problem is that even for the Dataset API, strings are used to reference columns.
Is there a way to have more typesafe referencing of columns in Spark SQL, without introducing an additional data structure for each table (besides the initial case class for the type information) to address the names, in order to have better refactoring and IDE support?
edit
See the snippet below. It will compile even though it should be clear that it is the wrong column reference. Also, edit/refactor in the IDE does not seem to work properly.
case class Foo(bar: Int)
import spark.implicits._
val ds = Seq(Foo(1), Foo(2)).toDS
ds.select('fooWrong)
NOTE:
import spark.implicits._
is already imported, and 'fooWrong is already treated as a column type
Frameless seems to be the go-to solution offering the desired properties: https://github.com/typelevel/frameless
The only downside is that joins currently only work with column equality. Allowing any boolean predicate is still in progress.

Spark Dataframe / SQL - Complex enriching nested data

Context
I have an example of event source data in a dataframe input as shown below.
SOURCE (example input event shown as an image in the original post)
where eventOccurredTime is a String type. This is from the source, and I want to retain it in its original string form (with nanosecond precision).
I want to use that string to derive some extra date/time-typed data for downstream usage. Below is an example:
TARGET (example enriched output shown as an image in the original post)
Now, as a one-off, I can execute some Spark SQL on the dataframe as shown below to get the result I want:
import org.apache.spark.sql.DataFrame
def transformDF(): DataFrame = {
spark.sql(
s"""
SELECT
id,
struct(
event.eventCategory,
event.eventName,
event.eventOccurredTime,
struct (
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
unix_timestamp(substring(event.eventOccurredTime,1,23),"yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
datesDim.dateSeq AS eventOccurredDateDimSeq
) AS eventOccurredTimeDim,
NOTE: this is a snippet; for the full event I have to do this explicitly in this long SQL 20 times, for the 20 string dates
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime,1,23)
Above, I found I had to substring a date that had nanosecond precision, or it would return null; hence the substring.
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
Above is the pattern / naming convention for the 4 nested xDim struct fields to derive; they are present in the predefined Spark schema the JSON is read with to create the source dataframe.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the 'eventOccurredDateDimSeq' field above, I need to join to a dates dimension table 'datesDim' (static, with an hourly grain), where dateSeq is the 'key' for the hourly bucket this date falls into (datesDim.UTC is defined to the hour):
LEFT OUTER JOIN datesDim ON
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the Spark cluster, so it should be quick to look up, but I need to do this for every date enrichment in the payloads, and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
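To make the per-column derivation concrete, here is a hedged PySpark sketch (we use Scala, but the DataFrame API calls have the same names) of producing the four values and the datesDim join for a single top-level string column. The column names UTC and dateSeq come from the description above; everything else is illustrative, it produces flat columns rather than the nested xDim struct, and it does not yet do the generic nested traversal this question is really about:
from pyspark.sql import functions as F

def enrichTimeColumn(df, colName, datesDim):
    # Parse the string timestamp; substring to millisecond precision first,
    # since nanosecond precision makes the parse return null (as noted above).
    ts = F.to_timestamp(F.substring(F.col(colName), 1, 23), "yyyy-MM-dd'T'HH:mm:ss.SSS")
    withDims = (df
        .withColumn(colName + "TimestampUTC", ts)
        .withColumn(colName + "DateUTC", ts.cast("date"))
        .withColumn(colName + "TimestampMillis", (F.unix_timestamp(ts) * 1000).cast("long"))
        .withColumn("hourBucket", F.date_trunc("hour", ts)))
    # Join to the static, hourly-grain dimension to pick up dateSeq.
    return (withDims
        .join(datesDim, withDims["hourBucket"] == datesDim["UTC"], "left_outer")
        .withColumnRenamed("dateSeq", colName + "DateDimSeq")
        .drop("hourBucket"))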
The general schema pattern is that if there is a string date whose field name is x, there is an 'xDim' struct equivalent that immediately follows it in schema order, as described below:
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although in the image above I am only showing 'eventOccurredTime', there are more of these throughout the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
So I have the Spark SQL (the full monty the snippet came from) to do this as a one-off for 1 event type, and it's a large, explicit SQL statement that applies the time functions and joins I showed. But here is my problem I need help with.
I want to try to create a more generic, functionally oriented, reusable solution that traverses a nested dataframe and applies this transformation pattern as described above, 'where it needs to'.
How do I define 'where it needs to'?
Perhaps the naming convention is a good start - traverse the DF, look for any struct fields that follow the xDim ('Dim' suffix) pattern, use the preceding 'x' field as the input, and populate the xDim.* values in line with the naming pattern as described?
How, in a function, do I best join on the registered datesDim table (it's static, remember) so that it performs?
Solution?
I think one or more UDFs are needed (we use Scala), maybe by themselves or as fragments within SQL, but I'm not sure. Ensuring the datesDim lookup performs well is key, I think.
Or maybe there is another way?
Note: I am working with DataFrames / Spark SQL, not Datasets, but options for each are welcome.
Databricks
NOTE: I'm actually using the Databricks platform for this, so for those versed in SQL higher-order functions in Databricks:
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
...is there a slick option here using 'TRANSFORM' as a SQL HOF (maybe register a utility UDF and use it with TRANSFORM)?
Awesome, thanks spark community for your help!!! Sorry this is a long post setting the scene.

cql binary protocol and named bound variables in prepared queries

Imagine I have a simple CQL table:
CREATE TABLE test (
k int PRIMARY KEY,
v1 text,
v2 int,
v3 float
)
There are many cases where one would want to make use of the schema-less essence of Cassandra and only set some of the values and do, for example, a
INSERT into test (k, v1) VALUES (1, 'something');
When writing an application to write to such a CQL table in a Cassandra cluster, the need to do this using prepared statements immediately arises, for performance reasons.
This is handled in different ways by different drivers. The Java driver, for example, has introduced (with the help of a modification to the CQL binary protocol) the possibility of using named bound variables. Very practical: CASSANDRA-6033
What I am wondering is what is the correct way, from a binary protocol point of view, to provide values only for a subset of bound variables in a prepared query?
Values are in fact provided to a prepared query by building a values list, as described in
4.1.4. QUERY
[...]
Values. In that case, a [short] <n> followed by <n> [bytes]
values are provided. Those value are used for bound variables in
the query.
Please note the definition of [bytes]
[bytes] A [int] n, followed by n bytes if n >= 0. If n < 0,
no byte should follow and the value represented is `null`.
From this description I get the following:
"Values" in QUERY offers no ways to provide a value for a specific column. It is just an ordered list of values. I guess the [short] must correspond to the exact number of bound variables in a prepared query?
All values, no matter what types they are, are represented as [bytes]. If that is true, any interpretation of the [bytes] value is left to the server (conversion to int, short, text,...)?
Assuming I got this all right, I wonder if a 'null' [bytes] value can be used to just 'skip' a bound variable and not assign a value for it.
I tried this and patched the cpp driver (which is what I am interested in). Queries get executed, but when I perform a SELECT from cqlsh, I don't see the 'null' string representation for empty fields, so I wonder whether this is a hack that for some reason just isn't crashing, or the intended way to do this.
I am sorry but I really don't think I can just download the java driver and see how named bound variables are implemented ! :(
---------- EDIT - SOLVED ----------
My assumptions were right, and support for skipping a field in a prepared query has now been added to the cpp driver (see here) by using a null [bytes] value.
What I am wondering is what is the correct way, from a binary protocol point of view, to provide values only for a subset of bound variables in a prepared query?
You need to prepare a query that only inserts/updates the subset of columns that you're interested in.
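For illustration, here is a hedged sketch with the DataStax Python driver (the question concerns the cpp driver, but the prepared-statement pattern is the same; the contact point and keyspace name 'ks' are assumptions):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')   # keyspace name is assumed

# Prepare a statement covering only the columns you intend to set.
insert_k_v1 = session.prepare("INSERT INTO test (k, v1) VALUES (?, ?)")
session.execute(insert_k_v1, (1, 'something'))

# Named bound variables (CASSANDRA-6033) can be bound with a dict.
insert_named = session.prepare("INSERT INTO test (k, v1) VALUES (:k, :v1)")
session.execute(insert_named, {'k': 2, 'v1': 'something else'})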
"Values" in QUERY offers no ways to provide a value for a specific column. It is just an ordered list of values. I guess the [short] must correspond to the exact number of bound variables in a prepared query?
That's correct. The ordering is determined by the column metadata that Cassandra returns when you prepare a query.
All values, no matter what types they are, are represented as [bytes]. If that is true, any interpretation of the [bytes] value is left to the server (conversion to int, short, text,...)?
That's also correct. The driver will use the returned column metadata to determine how to convert native values (strings, UUIDS, ints, etc) to a binary (bytes) format. Cassandra does the inverse of this operation server-side.
Assuming I got this all right, I wonder if a 'null' [bytes] value can be used to just 'skip' a bound variable and not assign a value for it.
A null column insertion is interpreted as a deletion.
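Continuing the hedged sketch above, binding an explicit null is not the same as preparing fewer columns:
# Uses the `session` from the sketch above; binding None writes nulls for v2 and v3,
# which Cassandra treats as deleting those cells.
insert_all = session.prepare("INSERT INTO test (k, v1, v2, v3) VALUES (?, ?, ?, ?)")
session.execute(insert_all, (3, 'three', None, None))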
Implementation of what I was trying to achieve has been done (see here) based on the principle I described.
