Great Expectations with Delta table - Databricks

I am trying to run a Great Expectations suite on a Delta table in Databricks, but I want to run it on only part of the table, selected with a query. The validation runs fine, but it runs on the full table.
I know that I can load a DataFrame and pass it to the batch request, but I would like to load the data directly with a query.
batch_request = RuntimeBatchRequest(
    datasource_name="datasource",
    data_connector_name="data_quality_run",
    data_asset_name="Input Data",
    runtime_parameters={"path": "/delta table path"},
    batch_identifiers={"data_quality_check": f"data_quality_check_{datetime.date.today().strftime('%Y%m%d')}"},
    batch_spec_passthrough={"reader_method": "delta", "reader_options": {"header": True}, "query": {"name": "John"}},
)
The batch request above loads the data but ignores the query option. Is there any way to pass a query for a Delta table in the batch request?

You can try putting the query inside runtime_parameters.
This works for me when I am querying data in SQL Server:
batch_request = RuntimeBatchRequest(
    datasource_name="my_mssql_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="default_name",
    runtime_parameters={
        "query": "SELECT * from dbo.MyTable WHERE Created = GETDATE()"
    },
    batch_identifiers={"default_identifier_name": "default_identifier"},
)
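For the Delta table in the question, the same idea might look like the sketch below. It only builds the keyword arguments, so it can be read without a running cluster. The helper name delta_query_batch_kwargs, the example path and the filter are hypothetical, and it is an assumption that the Spark execution engine in your Great Expectations version accepts a query in runtime_parameters. Addressing a Delta table by path with delta.`<path>` is Spark SQL syntax.

```python
import datetime


def delta_query_batch_kwargs(delta_path, where_clause, run_date=None):
    """Build keyword arguments for RuntimeBatchRequest (hypothetical helper).

    Assumption: the datasource's Spark execution engine accepts a SQL
    string under runtime_parameters["query"].
    """
    run_date = run_date or datetime.date.today()
    # Spark SQL can address a Delta table by path with delta.`<path>`.
    query = f"SELECT * FROM delta.`{delta_path}` WHERE {where_clause}"
    return {
        "datasource_name": "datasource",
        "data_connector_name": "data_quality_run",
        "data_asset_name": "Input Data",
        "runtime_parameters": {"query": query},
        "batch_identifiers": {
            "data_quality_check": f"data_quality_check_{run_date.strftime('%Y%m%d')}"
        },
    }


kwargs = delta_query_batch_kwargs(
    "/mnt/example/delta_table",  # hypothetical path
    "name = 'John'",
    run_date=datetime.date(2023, 1, 15),
)
print(kwargs["runtime_parameters"]["query"])
# With Great Expectations installed:
# batch_request = RuntimeBatchRequest(**kwargs)
```

If the engine rejects the query, the fallback mentioned in the question (filter a DataFrame first and pass it as batch data) still applies.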

Related

How to load data from database to Spark before filtering

I'm trying to run the following PySpark application:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Spark App").getOrCreate() as spark:
    dataframe_mysql = spark.read.format('jdbc').options(
        url="jdbc:mysql://.../...",
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='my_table',
        user=...,
        password=...,
        partitionColumn='id',
        lowerBound=0,
        upperBound=10000000,
        numPartitions=11,
        fetchsize=1000000,
        isolationLevel='NONE'
    ).load()
    dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
    dataframe_mysql.write.parquet('...')
I found that Spark doesn't load any data from MySQL until the write is executed, which means Spark lets the database do the filtering. The SQL the database receives probably looks like:
select * from my_table where id > ... and id < ... and date > '2022-01-01'
My table is too big and there is no index on the date column, so the database can't handle the filtering. How can I load the data into Spark's memory before filtering? I would like the query sent to the database to be:
select * from my_table where id > ... and id < ...
According to @samkart's comment, setting pushDownPredicate to False could solve this problem.
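A minimal sketch of that suggestion follows. pushDownPredicate is an option of Spark's JDBC data source (filters are pushed down by default); the helper names, URL and credentials here are hypothetical placeholders. The partitioning options from the question can be added to the same dictionary.

```python
def jdbc_read_options(url, table, user, password, push_down_predicate=False):
    """Return JDBC reader options (hypothetical helper)."""
    return {
        "url": url,
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark's JDBC source pushes filters to the database by default;
        # "false" keeps the filter on the Spark side.
        "pushDownPredicate": str(push_down_predicate).lower(),
    }


def load_then_filter(spark, options, condition):
    """Read over JDBC, then filter in Spark. `spark` is a SparkSession."""
    return spark.read.format("jdbc").options(**options).load().filter(condition)


options = jdbc_read_options(
    "jdbc:mysql://example-host/example_db",  # hypothetical URL
    "my_table",
    "example_user",  # hypothetical credentials
    "example_password",
)
print(options["pushDownPredicate"])
# On a cluster:
# df = load_then_filter(spark, options, "date > '2022-01-01'")
```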

How do you set up a Synapse Serverless SQL External Table over partitioned data?

I have set up a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
        FORMAT = 'PARQUET'
    ) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function like this. Then filter on it. That achieves partition elimination. You can confirm by the bytes read compared to when you don’t filter on that column.
CREATE VIEW MyView
AS
SELECT
    *, [result].filepath(1) AS country_region
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
        FORMAT = 'PARQUET'
    ) AS [result]
GO
SELECT * FROM MyView WHERE country_region = 'Afghanistan'
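To make the role of filepath(1) concrete: it reports the part of each file's path that matched the first wildcard in the BULK pattern. The Python sketch below is only a rough stand-in for that behaviour, not Synapse code; the function name and the example file name are hypothetical.

```python
import re


def wildcard_value(pattern, path, index):
    """Return the text matched by the index-th '*' in pattern (1-based).

    A rough stand-in for what filepath(n) reports; not Synapse code.
    """
    # Turn each '*' into a capture group that does not cross a '/'.
    parts = [re.escape(part) for part in pattern.split("*")]
    regex = "([^/]*)".join(parts)
    match = re.fullmatch(regex, path)
    if match is None:
        return None
    return match.group(index)


pattern = "Covid19/country_region=*/*"
path = "Covid19/country_region=Afghanistan/part-0000.parquet"  # hypothetical file name
print(wildcard_value(pattern, path, 1))
```

Filtering the view on country_region therefore filters on the folder name, which is what lets the engine skip the other partitions.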

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query, but the datetime format is not right. So I tried this approach:
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push a query down to the database and get back only a few result rows, you can provide a query instead of the table name and receive just the result as a DataFrame. This way the database engine processes the query and returns only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause. Note that a subquery used this way must be given an alias.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
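Applied to the timestamp filter from the question, the subquery could be built as in the sketch below. The helper name and the alias new_rows are hypothetical; tablename, ts and last_datetime are taken from the question. The timestamp is formatted on the Python side, so the database receives a plain literal.

```python
import datetime


def timestamp_pushdown_query(table, ts_column, last_datetime, alias="new_rows"):
    """Build a parenthesised subquery usable as the JDBC table argument.

    Hypothetical helper; table and column names must come from trusted code,
    since they are interpolated directly into the SQL text.
    """
    # Format the timestamp in Python so the database sees a plain literal.
    ts_literal = last_datetime.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"(select * from {table} where {ts_column} > '{ts_literal}') {alias}"


last_datetime = datetime.datetime(2022, 1, 1, 13, 30, 0)  # example value
query = timestamp_pushdown_query("tablename", "ts", last_datetime)
print(query)
# On a cluster:
# df_new_data = spark.read.jdbc(url=data_base_url, table=query, properties=properties)
```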

UcanAccess retrieve stored query sql

I'm trying to retrieve the SQL that makes up a stored query inside an Access database.
I'm using a combination of UCanAccess 4.0.2, jaydebeapi, and the UCanAccess console. The ultimate goal is to be able to do the following from a Python script with no user intervention.
When UCanAccess loads, it successfully loads the query:
Please, enter the full path to the access file (.mdb or .accdb): /Users/.../SnohomishRiverEstuaryHydrology_RAW.accdb
Loaded Tables:
Sensor Data, Sensor Details, Site Details
Loaded Queries:
Jeff_Test
Loaded Procedures:
Loaded Indexes:
Primary Key on Sensor Data Columns: (ID)
, Primary Key on Sensor Details Columns: (ID)
, Primary Key on Site Details Columns: (ID)
, Index on Sensor Details Columns: (SiteID)
, Index on Site Details Columns: (SiteID)
UCanAccess>
When I run a query like the following from the UCanAccess console:
SELECT * FROM JEFF_TEST;
I get the expected results of the query.
I have tried several things, including this monstrous query from inside a Python script, even with the sysSchema=True option (from here: http://www.sqlquery.com/Microsoft_Access_useful_queries.html):
SELECT DISTINCT MSysObjects.Name,
    IIf([Flags]=0,"Select",IIf([Flags]=16,"Crosstab",IIf([Flags]=32,"Delete",
    IIf([Flags]=48,"Update",IIf([Flags]=64,"Append",IIf([Flags]=128,"Union",
    [Flags])))))) AS Type
FROM MSysObjects INNER JOIN MSysQueries ON MSysObjects.Id = MSysQueries.ObjectId;
But I get an "object not found or insufficient privileges" error.
At this point, I've tried mdbtools and can successfully retrieve metadata and data from Access. I just need to get the queries out too.
If anyone can point me in the right direction, I'd appreciate it. Windows is not a viable option.
Cheers, Seth
***********************************
* SOLUTION
***********************************
from jpype import *
startJVM(getDefaultJVMPath(), "-ea", "-Djava.class.path=/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/ucanaccess-4.0.2.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/commons-lang-2.6.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/commons-logging-1.1.1.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/hsqldb.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/jackcess-2.1.6.jar")
conn = java.sql.DriverManager.getConnection("jdbc:ucanaccess:///Users/seth.urion/PycharmProjects/pyAccess/FE_Hall_2010_2016_SnohomishRiverEstuaryHydrology_RAW.accdb")
# Print the name and SQL text of every saved query
for query in conn.getDbIO().getQueries():
    print(query.getName())
    print(query.toSQLString())
If you can find a satisfactory way to call Java methods from within Python then you could use the Jackcess Query#toSQLString() method to extract the SQL for a saved query. For example, I just got this to work under Jython:
from java.sql import DriverManager

def get_query_sql(conn, query_name):
    sql = ''
    for query in conn.getDbIO().getQueries():
        if query.getName() == query_name:
            sql = query.toSQLString()
            break
    return sql

# usage example
if __name__ == '__main__':
    conn = DriverManager.getConnection("jdbc:ucanaccess:///home/gord/UCanAccessTest.accdb")
    query_name = 'Jeff_Test'
    query_sql = get_query_sql(conn, query_name)
    if query_sql == '':
        print '(Query not found.)'
    else:
        print 'SQL for query [%s]:' % (query_name)
        print
        print query_sql
    conn.close()
producing
SQL for query [Jeff_Test]:
SELECT Invoice.InvoiceNumber, Invoice.InvoiceDate
FROM Invoice
WHERE (((Invoice.InvoiceNumber)>1));

Apache Spark Query with HiveContext doesn't work

I use Spark 1.6.1. In my Spark Java program I connect to a Postgres database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a problem when I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, although the same query works well on the original Postgres database. So I modified the query slightly:
hiveContext.sql("SELECT id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left").show();
This query works and returns results, but the other one does not. What is the difference, and why does the first query fail while the second works?
The database is not very big either. For testing it is only 4 MB.
Since you're trying to select a distinct ID, you need to select the timestamp as part of an aggregate function and then group by ID. Otherwise, it doesn't know which timestamp to pair with the ID.
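The suggested shape of the query can be tried out with the sketch below. It uses an in-memory SQLite database in place of Spark and Postgres, with made-up rows, and renames the columns to ts, range_start and range_end (hypothetical names, chosen to avoid reserved words). MAX is just one possible aggregate.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measure (id INTEGER, ts INTEGER)")
    conn.execute("CREATE TABLE measure_range (range_start INTEGER, range_end INTEGER)")
    conn.executemany(
        "INSERT INTO measure VALUES (?, ?)",
        [(1, 10), (1, 20), (2, 15), (2, 99)],
    )
    conn.execute("INSERT INTO measure_range VALUES (5, 25)")
    return conn


def latest_timestamp_per_id(conn):
    """One row per id: the timestamp is aggregated and rows are grouped by id."""
    query = """
        SELECT m.id, MAX(m.ts) AS latest_ts
        FROM measure AS m
        JOIN measure_range AS w
          ON m.ts >= w.range_start AND m.ts <= w.range_end
        GROUP BY m.id
        ORDER BY m.id
    """
    return conn.execute(query).fetchall()


rows = latest_timestamp_per_id(build_demo_db())
print(rows)
```

The same SELECT ... GROUP BY text could then be passed to hiveContext.sql with the real table and column names.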
