I'm doing an ETL task of transforming queries from one SQL dialect to another. The old DB uses T-SQL, the new one uses HiveQL.
SELECT CAST(CONCAT(FMH.FLUIDMODELID,'_',RESERVOIR,'_',PRESSUREPSIA) AS NVARCHAR(255)) AS FACT_RRFP_INJ_PRESS_R_PHK
, FMH.FluidModelID ,FMH.FluidModelName ,[AnalysisDate]
FROM dbo.LZ_RRFP_FluidModelInj fmi
LEFT JOIN DBO.LZ_RRFP_FluidModelHeader fmh ON fmi.FluidModelIDFK = fmh.FluidModelID
LEFT JOIN LZ_RRFP_FluidModelAss fma on fma.InjectionFluidModelIDFK = fmi.FluidModelIDFK
WHERE FMA.RESERVOIR IN (SELECT RESERVOIR_CD FROM ATT_RESERVOIR)
The error is:
org.apache.spark.sql.catalyst.parser.ParseException:
DataType nvarchar(255) is not supported.
How do I convert NVARCHAR?
Hive stores STRING and VARCHAR values as UTF-8, so you are fine using VARCHAR or STRING instead of NVARCHAR.
VARCHAR in Hive is essentially STRING plus length validation. As @NickW mentioned in the comments, you can also drop the CAST entirely: if you insert the result into a table column defined as VARCHAR(255), it behaves the same without the CAST.
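For example, the query above could be written for HiveQL roughly like this (a sketch: it keeps the original table and column names, drops the T-SQL-only dbo schema prefix and square brackets, and casts to STRING instead of NVARCHAR):
SELECT CAST(CONCAT(fmh.FluidModelID, '_', RESERVOIR, '_', PRESSUREPSIA) AS STRING) AS FACT_RRFP_INJ_PRESS_R_PHK
     , fmh.FluidModelID
     , fmh.FluidModelName
     , AnalysisDate
FROM LZ_RRFP_FluidModelInj fmi
LEFT JOIN LZ_RRFP_FluidModelHeader fmh ON fmi.FluidModelIDFK = fmh.FluidModelID
LEFT JOIN LZ_RRFP_FluidModelAss fma ON fma.InjectionFluidModelIDFK = fmi.FluidModelIDFK
WHERE fma.RESERVOIR IN (SELECT RESERVOIR_CD FROM ATT_RESERVOIR)
In Hive, CAST(... AS VARCHAR(255)) would also work; STRING simply skips the length check.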
JDBC version 1.0.1
Server version 7.6
A table is defined as follows:
create table TVCHAR ( RNUM integer not null, CVCHAR varchar(32) null, SHARD KEY ( RNUM ) );
DatabaseMetaData.getColumns returns a type name of VARCHAR(32).
When the query select * from TVCHAR is executed, the ResultSetMetaData returned by the driver describes the column CVCHAR as VARSTRING rather than VARCHAR. I would expect a consistent type name from both result sets.
Example shown using SQuirreL SQL.
Any advice?
Try updating your JDBC 1.0.1 driver to a more stable version, or it may be that your varchar(32) data exceeds its limit, so the JDBC driver reported it as VARSTRING. The driver converts the data type in the ResultSet metadata, usually when there is something wrong.
I have an empty table defined in Snowflake as:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP
);
It creates the correct table, which has been checked using the DESC command in SQL. Then, using the Snowflake Python connector, we are trying to execute the following query:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},{ct});'
ctx.cursor().execute(insert_query)
Just before this query the variables are defined. The main challenge is getting the current timestamp written into Snowflake. The value of ct is defined as:
import datetime
ct = datetime.datetime.now()
print(ct)
2021-04-30 21:54:41.676406
But when we try to execute this INSERT query, we get the following error message:
ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 1 at position 157 unexpected '21'.
Can I kindly get some help on how to format the datetime value here? Help is appreciated.
In addition to the answer @Lukasz provided, you could also consider defining current_timestamp() as the default for the TIME_PREDICTED column:
CREATE OR REPLACE TABLE db1.schema1.table(
ACCOUNT_ID NUMBER NOT NULL PRIMARY KEY,
PREDICTED_PROBABILITY FLOAT,
TIME_PREDICTED TIMESTAMP DEFAULT current_timestamp
);
And then just insert ACCOUNT_ID and PREDICTED_PROBABILITY:
insert_query = f'INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY) VALUES ({accountId}, {risk_score});'
ctx.cursor().execute(insert_query)
Snowflake will then automatically assign the insert time to TIME_PREDICTED.
Educated guess. When performing the insert with:
insert_query = f'INSERT INTO ...(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES ({accountId}, {risk_score},{ct});'
it is string interpolation: ct is provided as the string representation of a datetime, which does not match the TIMESTAMP data type, hence the error.
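For illustration, the statement Snowflake actually receives then contains the bare timestamp (the ACCOUNT_ID and PREDICTED_PROBABILITY values below are made-up placeholders; the timestamp is the one printed in the question):
INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED)
VALUES (1001, 0.42, 2021-04-30 21:54:41.676406);
-- the unquoted 2021-04-30 parses as arithmetic, and the parser then stops at the following '21',
-- which matches the reported "unexpected '21'" error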
I would suggest using proper variable binding instead:
ctx.cursor().execute("INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES "
"(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) "
"VALUES(:1, :2, :3)",
(accountId,
risk_score,
("TIMESTAMP_LTZ", ct)
)
);
Avoid SQL Injection Attacks
Avoid binding data using Python’s formatting function because you risk SQL injection. For example:
# Binding data (UNSAFE EXAMPLE)
con.cursor().execute(
    "INSERT INTO testtable(col1, col2) "
    "VALUES({col1}, '{col2}')".format(
        col1=789,
        col2='test string3')
)
Instead, store the values in variables, check those values (for example, by looking for suspicious semicolons inside strings), and then bind the parameters using qmark or numeric binding style.
You forgot to place quotes before and after {ct}. The code should be:
insert_query = "INSERT INTO DATA_LAKE.CUSTOMER.ACT_PREDICTED_PROBABILITIES(ACCOUNT_ID, PREDICTED_PROBABILITY, TIME_PREDICTED) VALUES ({accountId}, {risk_score},'{ct}');".format(accountId=accountId,risk_score=risk_score,ct=ct)
ctx.cursor().execute(insert_query)
I am using Spark 3.0. In my Java program I am querying data from views in an Oracle DB, using the Java API JdbcRDD to query the views.
The problem is that the views don't contain any ID or timestamp columns, so I am unable to construct my SQL query with lowerBound and upperBound values.
Please find below the example query I need to run in Spark. Here usr_stg.usr and usr_stg.prtcpnt are the two views exposed to me.
"SELECT a.participant,
a.desc,
b.firstname,
b.lastname,
b.dept,
b.telno,
b.emailaddr
FROM usr_stg.prtcpnt a
LEFT OUTER JOIN usr_stg.usr b
ON a.participant = b.participant
WHERE a.class = 'SpSession' "
I tried using temp tables in Spark and joining them, but the query performance is poor as there are around 13,000,000 rows in each view. Hence I tried to push the join down to the Oracle DB.
I was able to overcome the constraint by using ROWNUM in the query. Using ROWNUM as the lowerBound and upperBound, I am now able to get the data using JdbcRDD.
SELECT ROWNUM as id, a.participant,
a.desc,
b.firstname,
b.lastname,
b.dept,
b.telno,
b.emailaddr
FROM usr_stg.prtcpnt a
LEFT OUTER JOIN usr_stg.usr b
ON a.participant = b.participant
WHERE a.class = 'SpSession' AND ? <= ROWNUM AND ROWNUM <= ?
I am trying to query a Cassandra database using the Spark SQL terminal.
Query:
select * from keyspace.tablename
where user_id = e3a119e0-8744-11e5-a557-e789fe3b4cc1;
Error: java.lang.RuntimeException: [1.88] failure: ``union'' expected but identifier e5 found
Also tried:
user_id= UUID.fromString(\`e3a119e0-8744-11e5-a557-e789fe3b4cc1\`)")
user_id= \'e3a119e0-8744-11e5-a557-e789fe3b4cc1\'")
token(user_id)= token(`e3a119e0-8744-11e5-a557-e789fe3b4cc1`)
I am not sure how I can query data on a timeuuid column.
TimeUUIDs are not supported as a type in Spark SQL, so you can only do direct string comparisons. Represent the TIMEUUID as a string:
select * from keyspace.tablename where user_id = "e3a119e0-8744-11e5-a557-e789fe3b4cc1"
I read the post:
Turning a Comma Separated string into individual rows
and I really like the solution:
SELECT A.OtherID,
Split.a.value('.', 'VARCHAR(100)') AS Data
FROM
( SELECT OtherID,
CAST ('<M>' + REPLACE(Data, ',', '</M><M>') + '</M>' AS XML) AS Data
FROM Table1
) AS A CROSS APPLY Data.nodes ('/M') AS Split(a);
But it did not work when I tried to apply the method in Teradata for a similar question. Here is the summarized error message:
SELECT Failed 3707: expected something between '.' and the 'value' keyword. So is the code only valid in SQL Server? Could anyone help me make it work in Teradata or SAS SQL? Your help will be really appreciated!
This is SQL Server syntax.
In Teradata there's a table UDF named STRTOK_SPLIT_TO_TABLE,
e.g.
SELECT * FROM dbc.DatabasesV AS db
JOIN
(
SELECT token AS DatabaseName, tokennum
FROM TABLE (STRTOK_SPLIT_TO_TABLE(1, 'dbc,systemfe', ',')
RETURNS (outkey INTEGER,
tokennum INTEGER,
token VARCHAR(128) CHARACTER SET UNICODE)
) AS d
) AS dt
ON db.DatabaseName = dt.DatabaseName
ORDER BY tokennum;
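Applied to the Table1 example from the linked question, a sketch could look like this (assuming Table1 has an integer OtherID key and a comma-separated Data column, as in the original post):
SELECT outkey AS OtherID,
       token AS Data
FROM TABLE (STRTOK_SPLIT_TO_TABLE(Table1.OtherID, Table1.Data, ',')
RETURNS (outkey INTEGER,
         tokennum INTEGER,
         token VARCHAR(100) CHARACTER SET UNICODE)
) AS dt
ORDER BY OtherID, tokennum;
Here outkey carries the original OtherID through to the result, and tokennum preserves the position of each value within the list.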
Or see my answer to this similar question