I am trying to build an Azure Stream Analytics job in VS Code using the Azure Stream Analytics Tools extension. I have added an Event Hub as an input and a Data Lake Gen2 storage account as an output, and I can successfully run the job in VS Code using "Use Live Input and Live Output".
The issue I'm having is that when I set the output to an Azure Cosmos DB (DocumentDB) instead, I get the error "Failed to convert output 'cosmosdb' : Unsupported data source type.." when trying to use live input and output. I can, however, successfully run the job using "Live Input and Local Output".
Is this a limitation of the VS Code extension, i.e. that you can't debug live output against Cosmos DB? Or have I set something up incorrectly in my Cosmos DB output? See my Cosmos DB output configuration below:
{
  "Name": "cosmosdb",
  "DataSourceType": "DocumentDB",
  "DocumentDbProperties": {
    "AccountId": "cosmosdb-dev-eastau-001",
    "AccountKey": null,
    "Database": "cosmosdb_db",
    "ContainerName": "container1",
    "DocumentId": ""
  },
  "DataSourceCredentialDomain": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx.StreamAnalystics",
  "ScriptType": "Output"
}
For the Live Input and Live Output mode, the only supported output adapters (for now) are Event Hub, Storage Account, and Azure SQL, so this is a limitation of the extension rather than a problem with your Cosmos DB output configuration. See https://learn.microsoft.com/en-us/azure/stream-analytics/visual-studio-code-local-run-all#local-run-modes
Related
Note: In the question below, I have not yet deployed the function.
I have an Azure Function that I test locally in VS Code (Azure extension). The function is blob triggered, i.e. it is triggered when an image is uploaded to a certain blob container (say c), and it later uploads some metadata (a dict) to Cosmos DB. I have also linked my blob storage to VS Code. The storage container c already has a lot of images (200+). When I run my Azure Function locally (in VS Code), it seems to run the function on all the existing images again.
That is an issue I have asked about separately.
But I have a second question: if this function is running on all those triggers in VS Code (not deployed yet), is it replacing/re-writing all my data in Cosmos DB (my Azure Function uploads some data to Cosmos DB at the end)?
Edit:
My blob trigger/Azure Function is way too long, so I will just post how it is triggered:
import logging
import azure.functions as func

def main(myblob: func.InputStream, doc: func.Out[func.Document]):
    logging.info(f"Python blob trigger function processed blob\n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
    blob_val = myblob.read()  # raw bytes of the uploaded image
.
.
.
and host.json:
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[2.*, 3.0.0)"
  }
}
Regardless of where the Function is running (your local machine, Azure, or any other supported hosting option), if your configuration has the connection string of a real Cosmos DB account, then the Function will perform the update on that account.
Simply put, the answer to your question is: the update will be performed on the account your connection string points to, regardless of where the Function is running.
One alternative, if you want to do local testing, is to use the Cosmos DB Emulator and have your local configuration use the Emulator's connection string, so your local testing won't affect any real account.
You can then have that setting overridden when running in Azure by adding the same setting, pointing to your real Cosmos DB account, in the Function App's Application Settings or Connection Strings.
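For example, a local.settings.json along these lines keeps local runs pointed at the Emulator (a minimal sketch: the setting name CosmosDbConnection is an assumption and must match whatever name your binding or code reads, while the AccountKey shown is the Emulator's well-known fixed key):
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "CosmosDbConnection": "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
  }
}
In Azure, add an application setting with the same name pointing at your real Cosmos DB account, and the deployed Function will write there instead.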
Is there a way to learn how many RUs were consumed when executing a query using the Cassandra api on CosmosDB?
(My understanding is that the normal API returns this in an additional HTTP header, but obviously that does not work with CQL as the wire protocol.)
The only way I know of to get the request charge for specific CQL queries in Cosmos DB is to turn on diagnostic logging. Each query you run will then result in a diagnostic log entry like this:
{ "time": "2020-03-30T23:55:10.9579593Z", "resourceId": "/SUBSCRIPTIONS/<your_subscription_ID>/RESOURCEGROUPS/<your_resource_group>/PROVIDERS/MICROSOFT.DOCUMENTDB/DATABASEACCOUNTS/<your_database_account>", "category": "CassandraRequests", "operationName": "QuerySelect", "properties": {"activityId": "6b33771c-baec-408a-b305-3127c17465b6","opCode": "<empty>","errorCode": "-1","duration": "0.311900","requestCharge": "1.589237","databaseName": "system","collectionName": "local","retryCount": "<empty>","authorizationTokenType": "PrimaryMasterKey","address": "104.42.195.92","piiCommandText": "{"request":"SELECT key from system.local"}","userAgent": """"}}
For details on how to configure diagnostic logging in Cosmos DB, see Monitor Azure Cosmos DB data by using diagnostic settings in Azure.
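If you route those logs to a storage account (or otherwise export them as JSON records), summing the charge per operation is straightforward. A minimal sketch in Python, assuming one JSON record per line in a local file named cassandra_requests.json (the file name and export format are assumptions):
import json
from collections import defaultdict

# Sum the RU charge reported in CassandraRequests diagnostic log records.
# Assumes each line of the file is one log record shaped like the sample above.
totals = defaultdict(float)
with open("cassandra_requests.json") as f:
    for line in f:
        record = json.loads(line)
        if record.get("category") != "CassandraRequests":
            continue
        totals[record["operationName"]] += float(record["properties"]["requestCharge"])

for operation, rus in sorted(totals.items()):
    print(f"{operation}: {rus:.2f} RUs")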
Hope this is helpful.
We are collecting device events via IoT Hub which are then processed with Stream Analytics. We want to generate a status overview containing the last value of every measurement. The status is then written to a CosmosDB output, one document per device.
The simplified query looks like this:
SELECT
device_id as id,
LAST(value) OVER (PARTITION BY device_id LIMIT DURATION(day, 1) WHEN name = 'battery_status') AS battery_status
INTO status
FROM iothub
The resulting document should be (also simplified):
{
"id": "8c03b6cef760",
"battery_status": 95
}
The problem is that not all events contain a battery_status and whenever the last event with battery_status is older than the specified duration, the last value in the CosmosDB document is overwritten with NULL.
What I would need is some construct to omit the value entirely when there is no data and consequently preserve the last value in the output document. Any ideas how I could achieve this?
Currently, Azure Stream Analytics does not support partitioning your output to Cosmos DB per device.
There are two options to work around this:
You can use an Azure Function. In the Azure Function you can create an IoT Hub trigger, filter the data on the battery_status property, and then store the data in Cosmos DB per device programmatically (see the sketch after these options).
You can use an Azure Storage container instead of Cosmos DB, and then configure the Azure Storage container as an endpoint and message route in Azure IoT Hub; please refer to IoT Hub Endpoints and this tutorial about how to save IoT Hub messages that contain sensor data to your Azure Blob storage. In the route configuration, you can add a query string for filtering the data.
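For the first option, a rough sketch of such a function is below (assuming the event body has the device_id/name/value shape shown in the other answer; the binding names and the Cosmos DB output binding configuration are assumptions):
import json
import logging
import azure.functions as func

# IoT Hub events arrive through the Event Hub-compatible endpoint.
# Only battery_status readings are written, one document per device,
# so other measurements never overwrite the stored value with NULL.
def main(event: func.EventHubEvent, doc: func.Out[func.Document]):
    body = json.loads(event.get_body().decode("utf-8"))
    if body.get("name") != "battery_status":
        logging.info("Skipping non-battery event")
        return
    doc.set(func.Document.from_dict({
        "id": body["device_id"],          # document id = device id
        "battery_status": body["value"],
    }))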
If I understand your problem correctly, you can just put a WHERE condition on the query to filter out the non-battery_status data; then the job only emits (and Cosmos DB only upserts) a document when a new battery event arrives, so the stored value is never overwritten with NULL. You can write multiple queries to process your data, handling the battery event data separately.
Sample Input
1.
{
"device_id": "8c03b6cef760",
"name": "battery_status",
"value": 67
}
2.
{
"device_id": "8c03b6cef760",
"name": "cellular_connectivity",
"value": 67
}
Output
{
"id": "8c03b6cef760",
"battery_status": 67,
"_rid": "vYYFAIRr5b8LAAAAAAAAAA==",
"_self": "dbs/vYYFAA==/colls/vYYFAIRr5b8=/docs/vYYFAIRr5b8LAAAAAAAAAA==/",
"_etag": "\"8d001092-0000-0000-0000-5b7ffe8e0000\"",
"_attachments": "attachments/",
"_ts": 1535114894
}
ASA Query
SELECT
device_id as id,
LAST(value) OVER (PARTITION BY device_id LIMIT DURATION(day, 1) WHEN name = 'battery_status') AS battery_status
INTO status
FROM iothub
WHERE name = 'battery_status'
I am designing an Azure Logic App that gets the latest blob added to a container, out of the list of blobs in that container, so it can be processed further.
My scenario is: if n files are added to the container path /destinationcontainer, then I need to store them in SQL Server using Azure Functions and delete them after successful insertion, using a Logic App.
I want to get all the blobs inside the container in a single pass, using the /*.txt or /*.csv filename extensions.
Is there any better way to get this done?
I figured out the solution. The problem is not with the Azure blob trigger; it is only with the approach to solving it.
The blob trigger automatically detects the /destinationcontainer path we give it and reports the recently added blob's metadata, such as:
"Id": "xxxxxxsadasdaasd=",
"Name": "text.txt",
"DisplayName": "text.txt",
"Path": "/destinationcontainer/text.txt",
"LastModified": "2017-07-19T09:16:41Z",
"Size": 208078,
"MediaType": "text/plain",
"IsFolder": false,
"ETag": "\"0xcv234232ssd\"",
"FileLocator": "xxxxxxsadasdaasd=",
"LastModifiedBy": null
Using the above metadata, we can get the file content of the most recently added blob.
My final Logic App flow builds on this.
Thanks all for your valuable insight.
How can you fetch data from an HTTP REST endpoint as an input for a Data Factory?
My use case is to fetch new data hourly from a REST HTTP GET and update/insert it into a DocumentDB database in Azure.
Can you just create a linked service like this and put in the REST endpoint?
{
  "name": "OnPremisesFileServerLinkedService",
  "properties": {
    "type": "OnPremisesFileServer",
    "description": "",
    "typeProperties": {
      "host": "<host name which can be either UNC name e.g. \\\\server or localhost for the same machine hosting the gateway>",
      "gatewayName": "<name of the gateway that will be used to connect to the shared folder or localhost>",
      "userId": "<domain user name e.g. domain\\user>",
      "password": "<domain password>"
    }
  }
}
And what kind of component do I add to create the data transformation job? I see that there are a bunch of things like HDInsight, Data Lake, and Batch, but I'm not sure what the differences are, or which service would be appropriate to simply upsert the new set into Azure DocumentDB.
I think the simplest way will be to use Azure Logic Apps.
You can make a call to any RESTful service using the HTTP connector in the Azure Logic App connectors.
So you can do GET and POST/PUT etc. in a flow based on a schedule or on some other trigger.
Here is the documentation for it:
https://azure.microsoft.com/en-us/documentation/articles/app-service-logic-connector-http/
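As a rough illustration, a workflow definition with a recurrence trigger and an HTTP GET action could look something like this (a sketch only; the trigger/action names and URI are placeholders, and you would still need a follow-up step to push the response into DocumentDB):
{
  "definition": {
    "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
    "triggers": {
      "Hourly": {
        "type": "Recurrence",
        "recurrence": { "frequency": "Hour", "interval": 1 }
      }
    },
    "actions": {
      "Fetch_new_data": {
        "type": "Http",
        "inputs": {
          "method": "GET",
          "uri": "https://example.com/api/new-items"
        },
        "runAfter": {}
      }
    },
    "outputs": {}
  }
}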
To do this with Azure Data Factory you will need to utilize Custom Activities.
Similar question here:
Using Azure Data Factory to get data from a REST API
If Azure Data Factory is not an absolute requirement, Aram's suggestion of utilizing Logic Apps might serve you better.
Hope that helps.
This can be achieved with Data Factory. It is especially good if you want to run batches on a schedule and have a single place for monitoring and management. There is sample code in our GitHub repo for an HTTP loader to blob here: https://github.com/Azure/Azure-DataFactory. Then, the act of moving data from the blob to DocumentDB will do the insert for you using our DocumentDB connector. There is a sample on how to use this connector here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-documentdb-connector/
Here are the brief steps you will take to fulfill your scenario (a rough configuration sketch follows the steps):
Create a custom .NET activity to get your data to blob.
Create a linked service of type DocumentDb.
Create linked service of type AzureStorage.
Use input dataset of type AzureBlob.
Use output dataset of type DocumentDbCollection.
Create and schedule a pipeline that includes your custom activity and a Copy Activity that uses BlobSource and DocumentDbCollectionSink; schedule the activities to the required frequency and availability of the datasets.
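As a rough sketch of steps 2 and 5 (the property names follow the DocumentDB connector article linked above; the resource names and placeholder values are illustrative):
{
  "name": "DocumentDbLinkedService",
  "properties": {
    "type": "DocumentDb",
    "typeProperties": {
      "connectionString": "AccountEndpoint=<endpoint URL>;AccountKey=<access key>;Database=<database>"
    }
  }
}
and an output dataset that refers to it:
{
  "name": "DocumentDbOutputDataset",
  "properties": {
    "type": "DocumentDbCollection",
    "linkedServiceName": "DocumentDbLinkedService",
    "typeProperties": {
      "collectionName": "<collection>"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}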
Aside from that, choosing where to run your transforms (HDInsight, Data Lake, Batch) will depend on your I/O and performance requirements. You can choose to run your custom activity on Azure Batch or HDInsight in this case.