Hardcoded date formats in predicate push-down? - apache-spark

If a date literal is used in a pushed-down filter expression, e.g.
val postingDate = java.sql.Date.valueOf("2016-06-03")
val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count
where the POSTINGDATE column is of JDBC type Date, the resulting pushed-down SQL query looks like the following:
SELECT .. <columns> ... FROM <table> WHERE POSTINGDATE = '2016-06-03'
Specifically, the date is compiled into a string literal using the hardcoded yyyy-MM-dd format that java.sql.Date.toString emits. Note the implicit string conversion for date (and timestamp) values in JDBCRDD.compileValue:
/**
 * Converts value to SQL expression.
 */
private def compileValue(value: Any): Any = value match {
  case stringValue: String => s"'${escapeSql(stringValue)}'"
  case timestampValue: Timestamp => "'" + timestampValue + "'"
  case dateValue: Date => "'" + dateValue + "'"
  case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
  case _ => value
}
The resulting query fails if the database is expecting a different format for date string literals. For example, the default format for Oracle is 'dd-MMM-yy', so when the relation query is executed, it fails with a syntax error.
ORA-01861: literal does not match format string
01861. 00000 - "literal does not match format string"
In some situations it may be possible to change the database's expected date format to match the Java format, but in our case we don't have control over that.
It seems like this kind of conversion ought to go through some kind of vendor-specific translation (e.g. through a JDBCDialect). I've filed a JIRA issue to that effect but, in the meantime, are there any suggestions for how to work around this? I've tried a variety of approaches, on both the Spark side and the JDBC side, but haven't come up with anything yet. This is a critical issue for us: we're processing very large tables organized by date, and without pushdown we end up pulling the entire table over for a Spark-side filter.
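One interim workaround is to move the date filter into the dbtable option as a derived-table subquery, so the database parses the date literal itself (via TO_DATE on Oracle) and no yyyy-MM-dd string literal ever has to be compiled by Spark. The sketch below is in PySpark for illustration (the question uses the Scala API, where the same options apply); the table name, JDBC URL, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("date-pushdown-workaround").getOrCreate()

# Let Oracle parse the date literal with its own format model, bypassing
# Spark's compileValue string conversion entirely.
posting_date = "2016-06-03"
subquery = (
    "(SELECT * FROM SOME_TABLE "
    "WHERE POSTINGDATE = TO_DATE('" + posting_date + "', 'YYYY-MM-DD')) t"
)

jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  # placeholder URL
    .option("dbtable", subquery)  # the filter is evaluated inside the database
    .option("user", "username")   # placeholder credentials
    .option("password", "password")
    .load()
)
count = jdbc_df.count()
Since the derived table already restricts the rows, Spark never needs to push the date predicate down at all.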

Related

Convert and insert date format in Node.js into Oracle timestamp

let date = moment(new Date()).format("YYYY-MM-DD hh:mm:ss.000000000 A");
// When I try to insert this date into the table, it ends up null.
// The timestamp format in Oracle DB is 14-03-22 3:53:08.901008000 PM.
INSERT INTO STUDENT (join_date) VALUES ('14-03-22 3:53:08.901008000 PM');
How can I get a date format like YYYY-MM-DD HH:MM:SS.FF3 AM/PM, given that Oracle supports this kind of timestamp?
In Oracle, a TIMESTAMP is a binary data type that consists of 7 to 13 bytes (century, year-of-century, month, day, hour, minute, second and between zero and six bytes for fractional seconds). It ALWAYS contains those components and it is NEVER stored in a particular format.
The client application you are using (e.g. SQL*Plus, SQL Developer, Node.js, Java, etc.) may choose to DISPLAY the binary value with a default format, but this is a function of the client application and NOT a function of the database. (Some client applications may use the NLS_TIMESTAMP_FORMAT session parameter from the database as their default format model, but the implicit binary-to-string conversion for display purposes is still something the client application does, not the database, and not all clients use the database session variables for their defaults.)
You should either:
Use a timestamp literal:
INSERT INTO STUDENT (join_date) VALUES (TIMESTAMP '2022-03-14 15:53:08.901008000');
Or explicitly convert your formatted string to a timestamp binary data type using the TO_TIMESTAMP function with a format model:
INSERT INTO STUDENT (join_date)
VALUES (
  TO_TIMESTAMP('14-03-22 3:53:08.901008000 PM', 'DD-MM-RR HH12:MI:SS.FF9 AM')
);

Mapping data flow SQL query and Parameters failing

In my mapping data flow I have simplified this down to DimDate just for the test.
The source even tells you exactly how to enter the select query if you are using parameters, which is what I'm trying to achieve.
Then I import, but I get errors. For a parameterized table,
SELECT * FROM {$df_TableName}
fails whether I select * or individual columns.
I've tried just the WHERE clause (what I actually need) as a parameter, but I keep getting datatype-mismatch errors.
I then started testing multiple ways, and only the schema can be parameterized in the queries below; all of the other options fail no matter what I do:
SELECT * FROM [{$df_Schema}].[{$df_TableName}] Where [Period] = {$df_incomingPeriod}
SELECT * FROM [dbo].[DimDate] Where [Period] = {$df_incomingPeriod}
SELECT * FROM [dbo].[{$df_TableName}] Where [Period] = {$df_incomingPeriod}
SELECT * FROM [dbo].[DimDate] Where [Period] = 2106
I know there's an issue with the integer datatype, but I don't know how to pass it to the query within the parameter without changing its type, as the SQL engine cannot run [Period] as a string.
Use the concat function in the expression builder to build the query in the data flow:
concat(<this> : string, <that> : string, ...) => string
Note: concat joins a variable number of strings together; all of the arguments must be strings.
Example 1:
concat(toString("select * from "), toString($df_tablename))
Example 2:
concat(toString("select * from "), toString($df_tablename), ' ', toString(" where incomingperiod = "), toString($df_incomingPeriod))
Awesome, it worked like magic for me. I was struggling with parameterizing table names, which I was passing through an array list.
Created a data flow parameter and gave this value:
#item().TABLE_NAME

Inserting Timestamp Into Snowflake Using Python 3.8

I have an empty table defined in Snowflake as:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
And it creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we try to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake; here the value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query, we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also think about defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
It will automatically assign the insert time to TIME_PREDICTED.
Educated guess: when performing the insert with
insert_query = (
    f"INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    f"VALUES ({accountId}, {risk_score}, {ct});"
)
you are doing string interpolation: ct is provided as the string representation of a datetime, which does not match the timestamp data type, hence the error.
I would suggest using proper variable binding instead (note: the connector's paramstyle defaults to pyformat, so the numeric :1 style below requires snowflake.connector.paramstyle = 'numeric' to be set before the connection is created):
ctx.cursor().execute(
    "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
    "(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
    "VALUES (:1, :2, :3)",
    (
        accountId,
        risk_score,
        ("TIMESTAMP_LTZ", ct),
    ),
)
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) "
    "VALUES({col1}, '{col2}')".format(
        col1=789,
        col2='test string3'))
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
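For contrast, here is the same insert rewritten safely with qmark binding (a sketch: the connection arguments are placeholders, and paramstyle must be set at the module level before connecting):
import snowflake.connector

# The connector's default paramstyle is 'pyformat'; qmark (?) or numeric (:N)
# binding must be opted into before the connection is created.
snowflake.connector.paramstyle = 'qmark'

con = snowflake.connector.connect(
    account='my_account',  # placeholder credentials
    user='my_user',
    password='my_password',
)

# Binding data (safe): values travel as bind parameters rather than being
# spliced into the SQL text, so a malicious col2 cannot inject SQL.
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) VALUES (?, ?)",
    (789, 'test string3'),
)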
You forgot to place quotes before and after {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)

jOOQ Query OrderBy as String

I'm getting the order by clause as a String from the application configuration.
Example
String orderByString = "NAME DESC, NUMBER ASC";
Now I want to use this order by in jOOQ query:
Result<KampagneRecord> records = repository.dsl()
    .selectFrom(KAMPAGNE)
    .orderBy(orderByString)
    .fetch();
Unfortunately orderBy does not accept a String.
Is there a way to add the order by clause to the query?
You could use the fact that jOOQ does not validate your plain SQL templating, and just wrap your string in a DSL.field(String):
Result<KampagneRecord> records = repository.dsl()
    .selectFrom(KAMPAGNE)
    .orderBy(field(orderByString))
    .fetch();
Of course, you will have to make sure that syntactical correctness is guaranteed, and SQL injection is prevented.
Some edge cases that rely on jOOQ being able to transform your SQL's ORDER BY clause might stop working, but in your simple query example, this would not apply.
An alternative solution, in very simple cases, is to preprocess your string. It seems as though this would work:
String orderByString = "NAME DESC, NUMBER ASC";

List<SortField<?>> list =
    Stream.of(orderByString.split(","))
          .map(String::trim)
          .map(s -> s.split(" +"))
          .map(s -> {
              Field<?> field = field(s[0]);
              return s.length == 1
                  ? field.sortDefault()
                  : field.sort("DESC".equalsIgnoreCase(s[1])
                        ? SortOrder.DESC
                        : SortOrder.ASC);
          })
          .collect(Collectors.toList());

System.out.println(list);
System.out.println(list);
This list can now be passed to the orderBy() clause.

How to ensure Python3 infers numbers as a string instead of an integer?

I have a line of code here:
query = """SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = {}""".format(str(primary_account_number))
I tried to load in the string value of the number, but psycopg2 still throws this error.
psycopg2.ProgrammingError: operator does not exist: character varying = integer
What options do I have to ensure Psycopg2 sees this as a string? Or should I just change the overall structure of the database to just integers?
It's (almost) always better to let psycopg2 interpolate query parameters for you. (http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters)
query = """SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = %s"""
cur.execute(query, (str(primary_account_number),))
This way psycopg2 will deal with the proper type formatting based on the type of the python value passed.
Use
query = """
SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = '{}'
""".format(primary_account_number)
That way the number inside your query is passed as a string, provided v.vendor_account is of a string type (e.g. varchar). The important part is the ' before/after {}, so the query sees the value as a string.
As Jon Clements pointed out, it is better to let the API handle the conversion:
query = """
SELECT v.retailiqpo_ordernumber
FROM public.vmi_purchase_orders v
WHERE v.vendor_account = %s
"""
cursor.execute(query, (str(primary_account_number),))
Docs: Psycopg - Passing parameters to SQL queries
