How to set QueryExecutionContext in boto3 when the query joins tables from multiple databases?

I am using the Boto3 package in Python 3 to execute an Athena query. From the Boto3 documentation, I understand that I can specify a query execution context, i.e. a database name under which the query is executed. With a properly specified query execution context, we can omit the fully qualified table name (db_name.table_name) from the query and instead use just the table name.
So the query SELECT * FROM db1.tab1 can be shortened to SELECT * FROM tab1 with QueryExecutionContext: {'Database': 'db1'}.
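For instance, a minimal sketch of such a call (the output bucket name is a placeholder):

import boto3

client = boto3.client('athena')

# 'tab1' resolves to db1.tab1 because of the execution context.
client.start_query_execution(
    QueryString='SELECT * FROM tab1',
    QueryExecutionContext={'Database': 'db1'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})  # placeholder bucket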
The problem: I need to run a query on Athena from Python that looks something like this:
SELECT *
FROM (SELECT *
      FROM db1.tab1) AS temp1
INNER JOIN (SELECT *
            FROM db2.tab2) AS temp2
    ON temp1.id = temp2.id
As we can see, the query joins tables from two different databases. If I want to omit the database names from this query, how do I specify the QueryExecutionContext?

QueryExecutionContext accepts only one database as an argument. So if you want to run a query across multiple databases, you have to use fully qualified table names (database.table) in the query.
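For example, a minimal sketch of the cross-database join with fully qualified names (the output bucket is again a placeholder):

import boto3

client = boto3.client('athena')

# Qualify every table with its database; QueryExecutionContext only
# supplies the default database for unqualified table names.
query = """
SELECT temp1.id, temp2.id
FROM db1.tab1 AS temp1
INNER JOIN db2.tab2 AS temp2
    ON temp1.id = temp2.id
"""

client.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'db1'},  # affects only unqualified names
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})  # placeholder bucket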

Related

Pyspark trying to write to DB2 table - truncate overwrite

I am trying to write data to IBM DB2 (10.5, fix pack 11) using PySpark (2.4).
When I try to execute the piece of code below:
df.write.format("jdbc")
.mode('overwrite').option("url",'jdbc:db2://<host>:<port>/<DB>').
option("driver", 'com.ibm.db2.jcc.DB2Driver').
option('sslConnection', 'true')
.option('sslCertLocation','</location/***_ssl.crt?').
option("numPartitions", 1).
option("batchsize", 1000)
.option('truncate','true').
option("dbtable", '<TABLE>').
option("user",'<user>').
option("password", '<PW>')
.save()
the job throws the following exception:
File "/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o97.save. :
com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;ABLE<SEHEMA.TABLE>;IMMEDIATE, DRIVER=4.19.80
at com.ibm.db2.jcc.am.b5.a(b5.java:747)
The job is trying to perform a truncate, but it seems DB2 expects the IMMEDIATE keyword.
In my code above all I am passing is the name of the dbtable; is there a way to pass the IMMEDIATE keyword?
Also, from the DB2 side, is there a way to set this while opening the session?
Just FYI, my code without truncate works, but that drops the table, recreates it, and reloads it, and I don't want to do that in a prod environment.
Any thoughts on how to solve this issue are highly appreciated.
DB2Dialect in Spark 2.4 doesn't override the default JDBCDialect implementation of TRUNCATE TABLE. The comments in the code suggest overriding this method to return a statement that suits your database engine:
/**
 * The SQL query that should be used to truncate a table. Dialects can override this method to
 * return a query that is suitable for a particular database. For PostgreSQL, for instance,
 * a different query is used to prevent "TRUNCATE" affecting other tables.
 * @param table The table to truncate
 * @param cascade Whether or not to cascade the truncation
 * @return The SQL query to use for truncating a table
 */
@Since("2.4.0")
def getTruncateQuery(
    table: String,
    cascade: Option[Boolean] = isCascadingTruncateTable): String = {
  s"TRUNCATE TABLE $table"
}
Perhaps in the DB2 case you can extend DB2Dialect itself, add your own getTruncateQuery() implementation, and define a "custom" JDBC protocol, for example "jdbc:mydb2". You can then use this protocol in the JDBC connection URL: .option("url", 'jdbc:mydb2://<host>:<port>/<DB>').

Running multiple SQL statements using Boto3 and AWS Glue

I would like to run multiple SQL statements in a single AWS Glue script using boto3.
The first query creates a table over Parquet files in an S3 bucket:
import boto3

client = boto3.client('athena')
config = {'OutputLocation': 's3://LOGS'}

client.start_query_execution(
    QueryString="""CREATE EXTERNAL TABLE IF NOT EXISTS my_database_name.my_table (
        `apples` string,
        `oranges` string,
        `price` int
    ) PARTITIONED BY (
        update_date string
    )
    STORED AS PARQUET
    LOCATION 's3://LOCATION'
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY');""",
    QueryExecutionContext={'Database': 'my_database_name'},
    ResultConfiguration=config)
This only creates the table. Then I have to run the following query to update the partitions and make the data queryable:
client.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_database_name.my_table;",
    QueryExecutionContext={'Database': 'my_database_name'},
    ResultConfiguration=config)
Unfortunately, when I run the above statements in a single Glue script, the partitions are not updated (only the table is created); I have to separate them into two jobs.
Is it possible to have a single script that executes multiple queries in sequence?
Using Glue crawlers is not an option.
You should explore partition projection as an alternative; it removes the need to load partitions via MSCK REPAIR TABLE or crawlers. See the docs: https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
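As a rough sketch, the CREATE TABLE from the question could be reworked as follows, assuming update_date holds yyyy-MM-dd strings; the date range and the storage.location.template are placeholder assumptions to adjust to the actual bucket layout:

import boto3

client = boto3.client('athena')
config = {'OutputLocation': 's3://LOGS'}

# With projection enabled, Athena derives the partition list from these
# properties at query time, so no MSCK REPAIR TABLE (or crawler) run is
# needed after new data lands.
client.start_query_execution(
    QueryString="""CREATE EXTERNAL TABLE IF NOT EXISTS my_database_name.my_table (
        `apples` string,
        `oranges` string,
        `price` int
    ) PARTITIONED BY (
        update_date string
    )
    STORED AS PARQUET
    LOCATION 's3://LOCATION'
    TBLPROPERTIES (
        'parquet.compression' = 'SNAPPY',
        'projection.enabled' = 'true',
        'projection.update_date.type' = 'date',
        'projection.update_date.format' = 'yyyy-MM-dd',
        'projection.update_date.range' = '2020-01-01,NOW',
        'storage.location.template' = 's3://LOCATION/${update_date}/'
    );""",
    QueryExecutionContext={'Database': 'my_database_name'},
    ResultConfiguration=config)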

Is there any alternative to CREATE TYPE, as CREATE TYPE is not supported in Azure SQL Data Warehouse?

I am trying to execute the query below, but user-defined table types (CREATE TYPE) are not supported in Azure SQL Data Warehouse, and I want to use the type in a stored procedure.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);
GO
CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
    MERGE customer_table AS target
    USING @customer_table AS source
    ON (target.PersonID = source.PersonID)
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
    WHEN NOT MATCHED THEN
        INSERT (PersonID, Name, LastModifytime)
        VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO
CREATE TYPE DataTypeforProjectTable AS TABLE(
    Project varchar(255),
    Creationtime datetime
);
GO
CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
    MERGE project_table AS target
    USING @project_table AS source
    ON (target.Project = source.Project)
    WHEN MATCHED THEN
        UPDATE SET Creationtime = source.Creationtime
    WHEN NOT MATCHED THEN
        INSERT (Project, Creationtime)
        VALUES (source.Project, source.Creationtime);
END
Is there any alternative way to do this?
You've got a few challenges there, because most of what you're trying to convert is not the way to do things on ASDW.
First, as you point out, CREATE TYPE is not supported, and there is no equivalent alternative.
Next, the code appears to be doing single-row inserts to a table. That's really bad on ASDW; performance will be dreadful.
Next, there's no MERGE statement (yet) for ASDW. That's because UPDATE is not the best way to handle changing data.
And last, stored procedures work a little differently on ASDW, they're not compiled, but interpreted each time the procedure is called. Stored procedures are great for big chunks of table-level logic, but not recommended for high volume calls with single-row operations.
I'd need to know more about the use case to make specific recommendations, but in general you need to think in tables rather than rows. In particular, focus on the CREATE TABLE AS (CTAS) way of handling your ELT.
Here's a good link showing how the equivalent of a MERGE/upsert can be handled using a CTAS:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-ctas#replace-merge-statements
As you'll see, it processes two tables at a time, rather than one row. This means you'll need to review the logic that called your stored procedure example.
If you get your head around doing everything in CTAS, and separately around Distribution, you're well on your way to having a high performance data warehouse.
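To make that concrete, here is a hedged sketch of the CTAS-plus-rename pattern from that article, driven from Python via pyodbc. The connection string and the dbo.customer_updates staging table are assumptions for illustration; the T-SQL pattern is the point, not the driver:

import pyodbc

# Placeholder connection string; autocommit because CTAS cannot run
# inside a user transaction on Azure SQL Data Warehouse.
conn = pyodbc.connect("DSN=my_asdw;UID=<user>;PWD=<pw>", autocommit=True)
cursor = conn.cursor()

# CTAS-based "merge": rebuild the table in one set-based statement,
# then swap it in with renames instead of row-by-row UPDATE/INSERT.
cursor.execute("""
CREATE TABLE dbo.customer_table_new
WITH (DISTRIBUTION = HASH(PersonID))
AS
SELECT s.PersonID, s.Name, s.LastModifytime          -- new and changed rows
FROM dbo.customer_updates AS s
UNION ALL
SELECT t.PersonID, t.Name, t.LastModifytime          -- unchanged rows
FROM dbo.customer_table AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.customer_updates AS u
                  WHERE u.PersonID = t.PersonID);
""")
cursor.execute("RENAME OBJECT dbo.customer_table TO customer_table_old;")
cursor.execute("RENAME OBJECT dbo.customer_table_new TO customer_table;")
cursor.execute("DROP TABLE dbo.customer_table_old;")

The rename swap keeps the table name stable for downstream queries while the whole "upsert" happens as two set-based selects.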
Temp tables in Azure SQL Data Warehouse behave slightly differently from the boxed SQL Server product or Azure SQL Database: they exist at the session level. So all you have to do is convert your CREATE TYPE statements to temp tables and split the MERGE into separate INSERT / UPDATE / DELETE statements as required.
Example:
CREATE TABLE #DataTypeforCustomerTable (
    PersonID INT,
    Name VARCHAR(255),
    LastModifytime DATETIME
)
WITH
(
    DISTRIBUTION = HASH( PersonID ),
    HEAP
)
GO

CREATE PROCEDURE usp_upsert_customer_table
AS
BEGIN
    -- Add records which do not already exist
    INSERT INTO customer_table ( PersonID, Name, LastModifytime )
    SELECT PersonID, Name, LastModifytime
    FROM #DataTypeforCustomerTable AS source
    WHERE NOT EXISTS
    (
        SELECT *
        FROM customer_table target
        WHERE source.PersonID = target.PersonID
    )
    ...
Simply load the temp table and execute the stored proc. See here for more details on temp table scope.
If you are altering a large portion of the table then you should consider the CTAS approach to create a new table, then rename it as suggested by Ron.

U-SQL Query data source

I would like to run a query against a remote Azure SQL database.
I followed the tutorial via Query Data Source - Method 1 and successfully ran the query from the tutorial:
@results1 =
    SELECT *
    FROM EXTERNAL MyAzureSQLDBDataSource EXECUTE @"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName";
But...
I would like to change this query to the following form:
DECLARE @queryA string = @"SELECT @@SERVERNAME AS serverName, GETDATE() AS dayTime, DB_NAME() AS databaseName";

@results2 =
    SELECT *
    FROM EXTERNAL MyAzureSQLDBDataSource EXECUTE @queryA;
I got an error
E_CSC_USER_SYNTAXERROR: syntax error. Expected one of: string-literal
Any idea why I cannot use a query stored in a string variable? In my real query I need to build the statement dynamically, based on parameters in the WHERE clause.
Thank you in advance
According to this article https://msdn.microsoft.com/en-us/library/azure/mt621291.aspx you can provide only a literal, not a variable:
EXECUTE csharp_string_literal
The string literal contains a query expression in the language supported by the remote data source. E.g., if the data source is an Azure SQL Database, then the query string would be T-SQL.
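Since EXECUTE only accepts a literal, a common workaround is to generate the U-SQL script text up front, so the submitted script already contains the final literal. A minimal Python sketch of that idea (the table name, parameter, and output path are placeholders, and how you submit the generated script, e.g. via the Azure CLI or an SDK, is left out):

# Sketch: splice the remote query into the U-SQL script text before
# submitting the job, so EXECUTE still receives a string literal.
def build_usql_script(name_filter: str) -> str:
    # Naive inlining for illustration only; real code must guard
    # against injection when building the embedded T-SQL.
    remote_query = f"SELECT * FROM dbo.MyTable WHERE Name = '{name_filter}'"
    return f'''
@results =
    SELECT *
    FROM EXTERNAL MyAzureSQLDBDataSource EXECUTE @"{remote_query}";

OUTPUT @results TO "/output/results.csv" USING Outputters.Csv();
'''

print(build_usql_script("abc"))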

Sourcing data from DocumentDB in Hadoop

I have a Hadoop application that sources data from two different DocumentDB collections. However, the JSON schemas of the documents in these two collections are different. Both have a field showing time, but one is called Timestamp and the other is called UpdatedOn. I'd like to know how I can specify a query based on this time field and retrieve only those JSON documents satisfying the condition. I specify my query like below:
String query = "SELECT * FROM c WHERE c.Timestamp > " + timestamp;
conf.set(ConfigurationUtil.QUERY, query);
This query applies to only one of the collections. I need a query like the one below:
"SELECT * FROM collection1 as c1, collection2 as c2 WHERE c1.Timestamp > x1 OR c2.UpdatedOn > x1"
Is this supported in DocumentDB?
This is not supported (it is not documented); your best bet is to execute these two queries separately and then merge the results using LINQ or any other technique to get one result set.
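For what it's worth, a rough sketch of that two-queries-then-merge idea using the Python SDK for what is now Azure Cosmos DB; the endpoint, key, and names are placeholders, and since the question uses the Hadoop connector, treat this purely as an illustration of the merging approach:

from azure.cosmos import CosmosClient

# Placeholder endpoint, key, and database name.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("mydb")

timestamp = 1500000000

# Query each collection with its own time field...
docs1 = db.get_container_client("collection1").query_items(
    query="SELECT * FROM c WHERE c.Timestamp > @ts",
    parameters=[{"name": "@ts", "value": timestamp}],
    enable_cross_partition_query=True)

docs2 = db.get_container_client("collection2").query_items(
    query="SELECT * FROM c WHERE c.UpdatedOn > @ts",
    parameters=[{"name": "@ts", "value": timestamp}],
    enable_cross_partition_query=True)

# ...then merge the two result sets client-side.
merged = list(docs1) + list(docs2)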
Hope this helps.
