Multiple Aggregations Across Partition Keys Only Work From Web Base Azure Data Explorer But Not From Python Client Library - python-3.x

I'm working with the Azure Core API. Using the web based data explorer, I'm able to successfully run a query that has multiple aggregations across multiple partitions. Yet when I attempt to run such a query from my shell using the Azure client library for Python (pip install azure-cosmos==4.0.0), I get an error message. I've tried two variations of the query, where one included the partition key and one didn't. Both queries returned the same error message.
container = database.get_container_client('some_container')
query1 = "select c.fmonth, c.fquarter, c.fyear, sum(c.revenue) as actual_revenue__sum, sum(c.predicted_revenue_m1) as predicted_revenue__sum from c where c.fyear=2020 group by c.fmonth, c.fquarter, c.fyear"
query2 = "select c.fmonth, c.fquarter, c.fyear, sum(c.revenue) as actual_revenue__sum, sum(c.predicted_revenue_m1) as predicted_revenue__sum from c where c.date_start >='2020-01-01' and c.date_start < '2021-01-01' group by c.fmonth, c.fquarter, c.fyear"
res = container.query_items(query1, enable_cross_partition_query=True)
Error Message Returned:
CosmosHttpResponseError: (BadRequest) Message: {"Errors":["Cross partition query only supports 'VALUE ' for aggregates."]}
ActivityId: 4961b99e-7032-4eac-ae84-2c8cab03a496, Microsoft.Azure.Documents.Common/2.11.0

Errors out for aggregates on multiple partitions, with enable cross
partition query set to true, but no "value" keyword present
If you want to use aggregates on multiple partitions,select value count(1) from c,this will work.
But python sdk doesn't support group by now.
Refer to this document,.net sdk and js sdk support group by.Other sdk will support later.
Hope this can help you.

Related

Azure Synapse: Cannot bulk load because the file could not be opened. Operating system error code 12(The access code is invalid.)

I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command see here. The files are located on a Data Lake gen 2 connected to the Azure Synapse via a managed identity.
This is working fine when I am only querying a few files at a time, however when I increase the number of files which I am trying to query simultaneousally I am getting the following error:
Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)
Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view I can download and view the file. Also if I specify to run the query on the file mentioned previousally in an error it will work fine.
The code I am using to query the data lake is below:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed
Here the data source, analytics is a data source specified as follows:
CREATE EXTERNAL DATA SOURCE analytics
WITH
(
location = 'https://<url>.dfs.core.windows.net/analytics'
)
I have tried specifying a large number for the MAXERRORS parameter for BULK in OPENROWSET as I don't mind if only a few files are missed in executing this query, however this only seems to work at the row level for errors and these errors are at the file level.
The query is running here on the built-in serverless pool.
Any ideas on how to get around this issue would be appreciated.
Your code is doing pass through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage you will need a database scoped credential and will need to reference it in the external data source as in this example.
-- Optional: Create MASTER KEY if not exists in database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH ( LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
CREDENTIAL = SynapseIdentity
)
Also see the sections in that article about the firewall on your storage account if you have that locked down.
Just adding a quick answer to this. After making the changes which Greg suggested, I was able to query a little more data - but still getting hit with the error code 12.
I spoke with Azure support and they advised me that the error message was actually 412 (but it was not visable to me); so this implied file in use/being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
or for external tables:
TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
This made my final query:
SELECT
Parsed.*
FROM OPENROWSET
(
bulk '2021/*/**.log',
maxerrors = 2147483647,
data_source = 'analytics_master_key',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) WITH (doc nvarchar(max)) AS Rows
CROSS APPLY OPENJSON(Rows.doc)
WITH
(
col1 NVARCHAR(100),
col2 NVARCHAR(100),
...,
coln NVARCHAR(MAX)
) AS Parsed

How can I add parameters to my CosmosDB Data Explorer workbooks?

I have a CosmosDB Data Explorer workbook with a set of functional predefined queries. The workbook offers real benefits by improving efficiency when our organization needs to run some useful queries against when debugging our data/applications. The one draw back is that its prone to failure via user error.
As of now, users need to edit the cells, replacing query values for those that match their needs before running the cells. This is not optimal as it introduces the opportunity that the queries could be broken unintentionally by an unwitting user should they save an erroneous query, hence erasing record of the working query.
One of the queries looks as follows:
%%sql --database DatabaseName --container ContainerName
SELECT c.propertyOne, c.propertyTwo, SUM(c.propertyForSummation)
FROM c
WHERE c.propertyOne = "XYZ123"
AND c.propertyTwo = 1
AND c.timestamp >= '2021-07-15'
AND c.lastUpdatedTimestamp <= '2021-07-16'
GROUP BY c.propertyOne, c.propertyTwo
I'd like to introduce parameter fields {propertyOne, propertyTwo, timestamp, lastUpdatedTimestamp}, so the query would look as below for future users of the workbook to avoid the previously stated scenario. Is that possible?
%%sql --database DatabaseName --container ContainerName
SELECT c.propertyOne, c.propertyTwo, SUM(c.propertyForSummation)
FROM c
WHERE c.propertyOne = #propertyOne
AND c.propertyTwo = #propertyTwo
AND c.timestamp >= #timestamp
AND c.lastUpdatedTimestamp <= #lastUpdatedTimestamp
GROUP BY c.propertyOne, c.propertyTwo
I'm aware how to do this in Azure Data Studio (ADS) workbooks, but unfortunately can not see the option in Data Explorer, and am unaware of the possibility to connect to CosmosDb from ADS
I also posted the question in learn.microsoft.com and it has been confirmed that the feature is not supported at this time (Aug.10th, 2021)

Cosmos DB spatial query using Spark

I would like to query a cosmos db collection using a spatial query. Specifically the ST_DISTANCE query. This query works as intended using the azure-cosmos Python SDK.
I am looking to use this query via Apache Spark for a more complex query pattern. However, using the ST_DISTANCE query in a SQL cell in a notebook results in the following error.
Error in SQL statement: AnalysisException: Undefined function: 'ST_DISTANCE'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The notebook is initialized as follows.
# Configure Catalog Api to be used
spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", cosmosEndpoint)
spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountKey", cosmosMasterKey)
from pyspark.sql.functions import col
df = spark.read.format("cosmos.oltp").options(**cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
_______________________________________________________________________
%sql
SELECT * FROM outlets f WHERE ST_DISTANCE(f.boundary, POINT(0,0)) < 600
Based on what I understand from the Cosmos DB Spark connector github repo[1], not all Cosmos DB filter queries are supported via the connector (yet?). So the ST_DISTANCE and other filter functions in the spatial family aren't going to work as those aren't predicates that are natively supported by Spark to be pushed down to the database.
Found something that will help sail past this issue at least temporarily. The query config[2] allows sending a custom query directly to Cosmos DB. A temporary view can be built and queried over. This will not work for all use cases, but this solved my issue where I need a single view with distance filtering done. Rest can be handled via Spark SQL.
Refer spark.cosmos.read.customQuery[2] in below sample.
outlets_cfg = {
"spark.cosmos.accountEndpoint" : cosmosEndpoint,
"spark.cosmos.accountKey" : cosmosMasterKey,
"spark.cosmos.database" : cosmosDatabaseName,
"spark.cosmos.container" : cosmosContainerName,
"spark.cosmos.read.customQuery" : "SELECT * FROM c WHERE ST_DISTANCE(c.location,{\"type\":\"Point\",\"coordinates\": [12.832489, 18.9553242]}) < 1000"
}
df = spark.read.format("cosmos.oltp").options(**outlets_cfg)\
.option("spark.cosmos.read.inferSchema.enabled", "true")\
.load()
df.createOrReplaceTempView("outlets")
[1] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/
[2] https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/docs/configuration-reference.md#query-config

Unable to get scalar value of a query on cosmos db in azure data factory

I am trying to get the count of all records present in cosmos db in a lookup activity of azure data factory. I need this value to do a comparison with other value activity outputs.
The query I used is SELECT VALUE count(1) from c
When I try to preview the data after inserting this query I get an error saying
One or more errors occurred. Unable to cast object of type
'Newtonsoft.Json.Linq.JValue' to type 'Newtonsoft.Json.Linq.JObject'
as shown in the below image:
snapshot of my azure lookup activity settings
Could someone help me in resolving this error and if this is the limitation of azure data factory how can I get the count of all the rows of the cosmos db document using some other ways inside azure data factory?
I reproduce your issue on my side exactly.
I think the count result can't be mapped as normal JsonObject. As workaround,i think you could just use Azure Function Activity(Inside Azure Function method ,you could use SDK to execute any sql as you want) to output your desired result: {"number":10}.Then bind the Azure Function Activity with other activities in ADF.
Here is contradiction right now:
The query sql outputs a scalar array,not other things like jsonObject,or even jsonstring.
However, ADF Look Up Activity only accepts JObject,not JValue. I can't use any convert built-in function here because the query sql need to be produced with correct syntax anyway. I already submitted a ticket to MS support team,but get no luck with this limitation.
I also tried select count(1) as num from c which works in the cosmos db portal. But it still has limitation because the sql crosses partitions.
So,all i can do here is trying to explain the root cause of issue,but can't change the product behaviours.
2 rough ideas:
1.Try no-partitioned collection to execute above sql to produce json output.
2.If the count is not large,try to query columns from db and loop the result with ForEach Activity.
You can use:
select top 1 column from c order by column desc

how to execute query on simple vertical partitioning in sqlalchemy using python

i am trying to work with multiple database and schemas using simple vertical partitioning in sqlalchemy and python .
Have create two database engines and configured successfully to sessionmaker()
Session = sessionmaker()
Session.configure(binds={BaseA:engine1, BaseB:engine2})
Able to get the required sql query generated successfully
driverssql = session.query(drivers)
but when i execute the above query to fetch the requslts i get the follwing error :
resultset=session.execute(driversql)
sqlalchemy.exc.UnboundExecutionError: Could not locate a bind configured on SQL expression or this Session (how can i associate the correct engine with execute statement)
I'm seeing here two variants:
You can create 2 sessionmakers here and use them separately according to an engine.
You can choose necessary engine when executing a query:
engine1 = create_engine(first_db)
engine2 = create_engine(second_db)
session.execute(drivers, bind=engine1)

Resources