I'm trying to read the query string from the state input, but the Step Functions execution fails with "Athena.InvalidRequestException".
{
"StartAt": "CallFunction",
"States": {
"CallFunction": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-west-2:12345:function:readFile",
"ResultPath": "$.query",
"Next": "Start an Athena query"
},
"Start an Athena query": {
"Resource": "arn:aws:states:::athena:startQueryExecution.sync",
"Parameters": {
"QueryString": "$.query",
"WorkGroup": "primary",
"ResultConfiguration": {
"OutputLocation": "s3://test_athena/test1"
}
}
}
}
Input of the Start an Athena query state:
{
"Comment": "Insert your JSON here",
"query": "\"SELECT * FROM test1 LIMIT 10; \""
}
I get the error below on the Start an Athena query state:
{
"resourceType": "athena",
"resource": "startQueryExecution.sync",
"error": "Athena.InvalidRequestException",
"cause": "line 1:1: mismatched input '$'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'PREPARE', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'UNLOAD', 'UPDATE', 'USE', <query> (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: 2a99f6eb-b853-407f-b229-d309a4ca3f5c; Proxy: null)"
}
I'm new to AWS. Can someone help me pass the query from the state input to the QueryString parameter of Athena?
You are missing the ".$" annotation in your QueryString key:
"QueryString.$": "$.query",
[...] The values of [your Parameter fields] can either be static values that you include in your state machine definition, or selected from either the input or the context object with a path. For key-value pairs where the value is selected using a path, the key name must end in .$.
Source: https://docs.aws.amazon.com/step-functions/latest/dg/input-output-inputpath-params.html#input-output-parameters
Also, you should add the following fields in your second state, to be explicit there:
"Type": "Task",
"End": true
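To make the ".$" rule concrete, here is a rough Python sketch of how a key ending in .$ is treated differently from a plain key. It is only an illustration: resolve_parameters and its simplified dotted-path lookup are hypothetical names of my own and do not reflect how Step Functions is actually implemented.

```python
def resolve_parameters(parameters, state_input):
    """Toy model of Parameters resolution (illustration only, not the AWS implementation)."""
    resolved = {}
    for key, value in parameters.items():
        if isinstance(value, dict):
            # Nested objects are resolved recursively.
            resolved[key] = resolve_parameters(value, state_input)
        elif key.endswith(".$"):
            # Key ends in ".$": the value is read as a path into the input.
            # Assumption: only simple "$.a.b" paths are supported here.
            path = value[2:] if value.startswith("$.") else value
            node = state_input
            for part in path.split("."):
                node = node[part]
            resolved[key[:-2]] = node
        else:
            # Plain key: the value is passed through literally.
            resolved[key] = value
    return resolved


state_input = {"query": "SELECT * FROM test1 LIMIT 10;"}
broken = resolve_parameters({"QueryString": "$.query"}, state_input)
fixed = resolve_parameters({"QueryString.$": "$.query"}, state_input)
print(broken["QueryString"])  # $.query
print(fixed["QueryString"])   # SELECT * FROM test1 LIMIT 10;
```

The first call shows why Athena complained about a mismatched '$': without the suffix, the literal text $.query is sent as the SQL statement.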
I have a problem indexing an array in Azure Cosmos DB
I am trying to save this indexing policy via the portal
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
}
],
"compositeIndexes": [
[
{
"path": "/DeviceId",
"order": "ascending"
},
{
"path": "/TimeStamp",
"order": "ascending"
},
{
"path": "/Items/[]/Name/?",
"order": "ascending"
},
{
"path": "/Items/[]/DoubleValue/?",
"order": "ascending"
}
]
]
}
I get the error "Failed to update container DeviceEvents:
Message: {"code":"BadRequest","message":"Message: {"Errors":["The indexing path '\/Items\/[]\/Name\/?' could not be accepted, failed near position '8'."
This seems to be the array [] syntax that is giving an error.
On a side note, I am not sure that what I am doing makes sense at all, but I have a query that looks like this:
SELECT SUM(de0["DoubleValue"])
FROM root JOIN de0 IN root["Items"]
WHERE root["ApplicationId"] = 57 AND root["DeviceId"] = 126 AND root["TimeStamp"] >= "2021-02-21T17:55:29.7389397Z" AND de0["Name"] = "Use Case"
Where ApplicationId is the partition key and the item saved looks like this
{
"id": "59ab9323-26ca-436f-8d29-e1ddd826f025",
"DeviceId": 3,
"ApplicationId": 3,
"RawData": "640F7A000A00E30142000000",
"TimeStamp": "2021-02-20T18:36:52.833174Z",
"Items": [
{
"Name": "Battery Status",
"StringValue": "Full",
"DoubleValue": null
},
{
"Name": "Use Case",
"StringValue": null,
"DoubleValue": 12
},
{
"Name": "Battery Voltage",
"StringValue": null,
"DoubleValue": 3.962
},
{
"Name": "Rain Gauge Count",
"StringValue": null,
"DoubleValue": 10
}
],
"_rid": "CgdVAO7B0DNkAAAAAAAAAA==",
"_self": "dbs/CgdVAA==/colls/CgdVAO7B0DM=/docs/CgdVAO7B0DNkAAAAAAAAAA==/",
"_etag": "\"61008771-0000-0d00-0000-603156c50000\"",
"_attachments": "attachments/",
"_ts": 1613846213
}
I need to aggregate over some of the items in the array, for example get the MAX of a temperature reading (I am using Use Case for testing, although that doesn't make sense as a real aggregate). I reasoned that if all the data in the query is in a single composite index, the database would be able to do the aggregation without reading the documents themselves. However, I can't seem to add a composite index containing an array path at all.
Yes, a composite index can't contain an array path. Each path must lead to a scalar value.
Unlike with included or excluded paths, you can't create a path with
the /* wildcard. Every composite path has an implicit /? at the end of
the path that you don't need to specify. Composite paths lead to a
scalar value and this is the only value that is included in the
composite index.
Reference: https://learn.microsoft.com/en-us/azure/cosmos-db/index-policy#composite-indexes
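As a rough illustration, the policy from the question could be reduced to composite paths that end in scalar values. The Python sketch below builds such a policy and adds a small pre-check, is_scalar_composite_path, that flags array or wildcard segments before you submit. The helper is a hypothetical check of my own, not part of any Azure SDK, and whether the reduced index actually helps the JOIN query would still need to be measured.

```python
def is_scalar_composite_path(path):
    """Hypothetical pre-check: reject array ([]) and wildcard (*, ?) segments."""
    segments = path.strip("/").split("/")
    return all(seg not in ("[]", "*", "?") for seg in segments)


# Same policy as in the question, with only the scalar composite paths kept.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/\"_etag\"/?"}],
    "compositeIndexes": [
        [
            {"path": "/DeviceId", "order": "ascending"},
            {"path": "/TimeStamp", "order": "ascending"},
        ]
    ],
}

for group in indexing_policy["compositeIndexes"]:
    for entry in group:
        assert is_scalar_composite_path(entry["path"])

print(is_scalar_composite_path("/Items/[]/Name/?"))  # False
```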
I need your suggestions on developing code in Azure Synapse.
We have a requirement where our jobs run in parallel at the same time and insert data into the same table.
During these inserts there is a chance that duplicate entries will be inserted into the same table.
For example: if Job A and Job B run at the same time with the same values, a "not exists" or "not in" check will fail to work, and I will get duplicates from both jobs. Primary key and unique constraints are not enforced in Azure Synapse, so they allow duplicates. Is there a good way to lock the table during the insert, so that if Job A is running, Job B cannot insert data into the same table? Please share your suggestions, as I am new to this. Note: we use a stored procedure to load the data through ADF V2.
Thanks,
Nandini
Duplicates must be handled within each job before the data is inserted into Azure Synapse. If the duplicates occur between the two jobs, remove them after both jobs have completed. It really depends on how you are loading the data. One manageable approach is to load into a temp (staging) table instead of directly into the final table. Make sure the staging table has the same structure as the final table (distribution, partitioning, constraints, and column nullability). To move the data from the staging table to the final table you can use SQL BCP, INSERT INTO, CTAS, or CTAS with partition switching.
If you can share your specific scenario, it will be easier to give suggestions relevant to your use case.
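The staging idea can be sketched in plain Python, independent of Synapse: rows from both jobs land in a stage, and a single step afterwards moves only unseen keys into the final table. The function and field names here (merge_stage_into_final, id) are made up for the example; in Synapse the equivalent step would be a SQL statement run once, after both jobs have finished.

```python
def merge_stage_into_final(final_rows, staged_rows, key="id"):
    """Append staged rows whose key is not already present; keep the first copy of each key."""
    seen = {row[key] for row in final_rows}
    merged = list(final_rows)
    for row in staged_rows:
        if row[key] not in seen:
            seen.add(row[key])
            merged.append(row)
    return merged


final_table = [{"id": 1, "value": "a"}]
# Job A and Job B both staged id=2; only one copy should reach the final table.
stage = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}, {"id": 2, "value": "b"}]
print(merge_stage_into_final(final_table, stage))
```

Because the merge runs in one place after both jobs complete, there is no window in which two writers can each miss the other's rows.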
I just ran into the same case and solved it with the Pipeline Runs - Query By Factory REST API.
Put an Until activity before the Data Flow activity that writes the values to the table, with this expression: @equals(activity('pingPL').output.value[0].runId, pipeline().RunId)
Inside the Until activity, put a Web activity and a Wait activity:
a. Web activity body - follow docs:
{
"lastUpdatedAfter": "@{addminutes(utcnow(), -30)}",
"lastUpdatedBefore": "@{utcnow()}",
"filters": [
{
"operand": "PipelineName",
"operator": "Equals",
"values": [
"pipeline_name_where_writeInSynapse_is_located"
]
},
{
"operand": "Status",
"operator": "Equals",
"values": [
"InProgress"
]
}
]
}
b. Wait activity: 30 seconds, or whatever makes sense.
What happens is this: if you trigger the same pipeline several times in parallel, the Web activity returns every run of that pipeline whose status is InProgress. The output looks like this:
{
"value": [
{
"id": "...",
"runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"debugRunId": null,
"runGroupId": "52004775-5ef5-493b-8a44-ee3fff6bff7b",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "80efce4dbda74636878bc99472978ccf",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:24:01.0210945Z",
"runEnd": "2021-10-13T17:25:06.9692394Z",
"durationInMs": 65948,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:25:06.9704432Z",
"annotations": [],
"runDimension": {},
"isLatest": true
},
{
"id": "...",
"runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"debugRunId": null,
"runGroupId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819",
"pipelineName": "synapse_writting",
"parameters": {
"region": "NW",
"unique_item": "a"
},
"invokedBy": {
"id": "08205e0eda0b41f6b5a90a8dda06a7f6",
"name": "Manual",
"invokedByType": "Manual"
},
"runStart": "2021-10-13T17:28:58.219611Z",
"runEnd": null,
"durationInMs": null,
"status": "InProgress",
"message": "",
"output": null,
"lastUpdated": "2021-10-13T17:29:00.9860175Z",
"annotations": [],
"runDimension": {},
"isLatest": true
}
],
"ADFWebActivityResponseHeaders": {
"Pragma": "no-cache",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"X-Content-Type-Options": "nosniff",
"x-ms-ratelimit-remaining-subscription-reads": "11999",
"x-ms-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-correlation-request-id": "188508ef-8897-4c21-8c37-ccdd4adc6d81",
"x-ms-routing-request-id": "WESTUS2:20211013T172902Z:188508ef-8897-4c21-8c37-ccdd4adc6d81",
"Cache-Control": "no-cache",
"Date": "Wed, 13 Oct 2021 17:29:02 GMT",
"Server": "Microsoft-IIS/10.0",
"X-Powered-By": "ASP.NET",
"Content-Length": "1492",
"Content-Type": "application/json; charset=utf-8",
"Expires": "-1"
},
"effectiveIntegrationRuntime": "NCAP-Simple-DataMovement (West US 2)",
"executionDuration": 0,
"durationInQueue": {
"integrationRuntimeQueue": 0
},
"billingReference": {
"activityType": "ExternalActivity",
"billableDuration": [
{
"meterType": "AzureIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
}
}
The Until expression then checks whether the runId of the first entry, value[0], equals the current pipeline run ID; if so, the Until activity stops and the data flow that writes to Synapse runs. Once that pipeline ends, its status becomes Succeeded, so the Web activity in the next job sees a new value[0] with status InProgress and that job continues with its write. This makes the parallel jobs depend on one another: each waits, if needed, until the previous data flow has validated and written to the table.
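The gate described above boils down to one comparison, which can be mimicked in Python for clarity. may_write is a hypothetical helper name, and the sketch assumes (as the approach itself does) that the query response lists the in-progress runs in a stable order.

```python
def may_write(query_response, current_run_id):
    """Mirror of the Until condition: proceed only if this run is value[0]."""
    runs = query_response.get("value", [])
    return bool(runs) and runs[0]["runId"] == current_run_id


# Trimmed version of the Web activity output shown above.
response = {
    "value": [
        {"runId": "52004775-5ef5-493b-8a44-ee3fff6bff7b", "status": "InProgress"},
        {"runId": "cf3f5038-ba10-44c3-b8f5-df8ad4c85819", "status": "InProgress"},
    ]
}

print(may_write(response, "52004775-5ef5-493b-8a44-ee3fff6bff7b"))  # True
print(may_write(response, "cf3f5038-ba10-44c3-b8f5-df8ad4c85819"))  # False
```

The second run keeps looping (Web activity, then Wait) until the first run drops out of the InProgress list and it becomes value[0] itself.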
I am using shc-core to write spark Dataset to hbase, for more details see here.
This is my current shc catalog:
def catalog = s"""{
|"table":{"namespace":"default", "name":"table1"},
|"rowkey":"key",
|"columns":{
|"col0":{"cf":"rowkey", "col":"key", "type":"string"},
|"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
|"col2":{"cf":"cf2", "col":"col2", "type":"double"},
|"col3":{"cf":"cf3", "col":"col3", "type":"float"},
|"col4":{"cf":"cf4", "col":"col4", "type":"int"},
|"col5":{"cf":"cf5", "col":"col5", "type":"bigint"},
|"col6":{"cf":"cf6", "col":"col6", "type":"smallint"},
|"col7":{"cf":"cf7", "col":"col7", "type":"string"},
|"col8":{"cf":"cf8", "col":"col8", "type":"tinyint"}
|}
|}""".stripMargin
Because Stack Overflow limits how much code a post can contain, I can only show part of it.
This is my HBase catalog:
{
"columns": {
"RXSJ": {
"col": "RXSJ",
"cf": "info",
"type": "bigint"
},
"LATITUDE": {
"col": "LATITUDE",
"cf": "info",
"type": "float"
},
"ZJHM": {
"col": "ZJHM",
"cf": "rowkey",
"type": "string"
},
"AGE": {
"col": "AGE",
"cf": "info",
"type": "int"
}
},
"rowkey": "ZJHM",
"table": {
"namespace": "default",
"name": "mongo_hbase_spark_out"
}
}
The other fields are output normally, but the rowkey column is not.
How can I additionally output the rowkey as a column?
You will not get the rowkey visible in the same way as the other columns. In the description of the HBase Catalog it is mentioned:
Note that the rowkey also has to be defined in details as a column (col0), which has a specific cf (rowkey).
Therefore, it will not show up although you have specified it in the columns section of your catalog.
The rowkey is only visible as actual rowkey as your screenshot also shows.
After testing, I solved the problem.
The whole idea is to output the same column twice: once as the rowkey and once as a regular column.
This is my new generated SHC catalog:
{
"columns": {
"rowkey_ZJHM": {
"col": "ZJHM",
"cf": "rowkey",
"type": "string"
},
"ZJHM": {
"col": "ZJHM",
"cf": "info",
"type": "string"
},
"AGE": {
"col": "AGE",
"cf": "info",
"type": "int"
}
},
"rowkey": "ZJHM",
"table": {
"namespace": "default",
"name": "mongo_hbase_spark_out"
}
}
I think the rowkey column is a special column in Hortonworks SHC: it is always written as the row key rather than as a regular column. The only workaround I could think of was to output the same value again under another column family.
Let me know if you have any better suggestions.
Thanks!
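If the catalog is generated rather than written by hand, the "output the column twice" trick could look roughly like the Python sketch below. build_catalog and its parameters are hypothetical names for this example; the resulting JSON simply mirrors the catalog shown above, and I have not verified it against SHC beyond what is reported in this thread.

```python
import json


def build_catalog(table, rowkey, columns, cf="info", namespace="default"):
    """Build an SHC-style catalog; `columns` maps column name -> SHC type string."""
    cols = {
        name: {"col": name, "cf": cf, "type": typ}
        for name, typ in columns.items()
    }
    # Extra entry mapping the same field to the special "rowkey" column family.
    cols["rowkey_" + rowkey] = {"col": rowkey, "cf": "rowkey", "type": columns[rowkey]}
    return json.dumps(
        {
            "columns": cols,
            "rowkey": rowkey,
            "table": {"namespace": namespace, "name": table},
        },
        indent=2,
    )


catalog = build_catalog("mongo_hbase_spark_out", "ZJHM", {"ZJHM": "string", "AGE": "int"})
print(catalog)
```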
I have a mongo db collection users with the following data format
{
"name": "abc",
"email": "abc@xyz.com",
"address": {
"city": "Gurgaon",
"state": "Haryana"
}
}
Now I'm creating a datasource, an index, and an indexer for this collection using the Azure REST APIs.
Datasource
import json
import requests

def create_datasource():
    # `headers` is assumed to be defined elsewhere (API key and content type).
    request_body = {
        "name": 'users-datasource',
        "description": "",
        "type": "cosmosdb",
        "credentials": {
            "connectionString": "<db connection url>"
        },
        "container": {"name": "users"},
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "_ts"
        }
    }
    resp = requests.post(url="<create-datasource-api-url>", data=json.dumps(request_body),
                         headers=headers)
Index for the above datasource
def create_index(config):
    request_body = {
        'name': "users-index",
        'fields': [
            {
                'name': 'name',
                'type': 'Edm.String'
            },
            {
                'name': 'email',
                'type': 'Edm.DateTimeOffset'
            },
            {
                'name': 'address',
                'type': 'Edm.String'
            },
            {
                'name': 'doc_id',
                'type': 'Edm.String',
                'key': True
            }
        ]
    }
    resp = requests.post(url="<azure-create-index-api-url>", data=json.dumps(request_body),
                         headers=config.headers)
Now the indexer for the above datasource and index
def create_interviews_indexer(config):
    request_body = {
        "name": "users-indexer",
        "dataSourceName": "users-datasource",
        "targetIndexName": "users-index",
        "schedule": {"interval": "PT5M"},
        "fieldMappings": [
            {"sourceFieldName": "address.city", "targetFieldName": "address"},
        ]
    }
    resp = requests.post("<create-indexer-api-url>", data=json.dumps(request_body),
                         headers=config.headers)
This creates the indexer without any exception, but when I check the retrieved data in the Azure portal for users-indexer, the address field is null: it is not getting any value from the address.city field mapping provided when the indexer was created.
I have also tried the following mapping, but it's not working either.
"fieldMappings": [
{"sourceFieldName": "/address/city", "targetFieldName": "address"},
]
The azure documentation also does not say anything about this kind of mapping. So if anyone can help me on this, it will be very much appreciated.
The container element in the data source definition lets you specify a query that flattens your JSON document (Ref: https://learn.microsoft.com/en-us/rest/api/searchservice/create-data-source), so instead of mapping fields in the indexer definition, you can write a query and get the output in the desired format.
Your code for creating data source in that case would be:
def create_datasource():
    # `headers` is assumed to be defined elsewhere (API key and content type).
    request_body = {
        "name": 'users-datasource',
        "description": "",
        "type": "cosmosdb",
        "credentials": {
            "connectionString": "<db connection url>",
        },
        "container": {
            "name": "users",
            # The query flattens address.city into a top-level "address" field.
            "query": "SELECT a.name, a.email, a.address.city as address FROM a",
        },
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "_ts"
        }
    }
    resp = requests.post(url="<create-datasource-api-url>", data=json.dumps(request_body),
                         headers=headers)
Support for MongoDb API flavor is in public preview - you need to explicitly indicate Mongo in the datasource's connection string as described in this article. Also note that with Mongo datasources, custom queries suggested by the previous response are not supported afaik. Hopefully someone from the team would clarify the current state of this support.
It's working for me with the below field mapping correctly. Azure search query is returning values for address properly.
"fieldMappings": [{"sourceFieldName": "address.city", "targetFieldName": "address"}]
I did make a few changes to the code you provided: when creating the indexer, I removed the extra comma at the end of fieldMappings, and when creating the index, I kept the email field as Edm.String rather than Edm.DateTimeOffset.
Please make sure you are using the Preview API version, since MongoDB API support is in preview in Azure Search.
For example: https://{azure search name}.search.windows.net/indexers?api-version=2019-05-06-Preview
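Putting the changes from this answer together, the request bodies might look like the Python sketch below. The helper names (build_index_body, build_indexer_body, indexers_url) are hypothetical; the field values and the API version are taken from this thread, and nothing is actually sent, so check the current Azure Search documentation before relying on them.

```python
import json


def build_index_body():
    """Index definition with email as Edm.String instead of Edm.DateTimeOffset."""
    return {
        "name": "users-index",
        "fields": [
            {"name": "name", "type": "Edm.String"},
            {"name": "email", "type": "Edm.String"},
            {"name": "address", "type": "Edm.String"},
            {"name": "doc_id", "type": "Edm.String", "key": True},
        ],
    }


def build_indexer_body():
    """Indexer definition mapping the nested address.city onto the flat address field."""
    return {
        "name": "users-indexer",
        "dataSourceName": "users-datasource",
        "targetIndexName": "users-index",
        "schedule": {"interval": "PT5M"},
        "fieldMappings": [
            {"sourceFieldName": "address.city", "targetFieldName": "address"}
        ],
    }


def indexers_url(service_name, api_version="2019-05-06-Preview"):
    """URL pattern quoted in the answer above; service_name is a placeholder."""
    return f"https://{service_name}.search.windows.net/indexers?api-version={api_version}"


print(json.dumps(build_indexer_body(), indent=2))
print(indexers_url("my-search-service"))
```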
Here is part of
[
UserJSONImpl{
"id"=26136358,
"name"='BryanConnor',
"screenName"='thewhyaxis',
"location"='null',
"description"='TheWhyAxisisacollectionofindepthwritingaboutthevisualizationsthatdeserveyourattention.',
"isContributorsEnabled"=false,
I'm not too familiar with JSON syntax and I haven't found a source on the web that provides an introduction; when I try to parse each JSONObject in the JSONArray I get an error like
Expected a ',' or ']' at character 14
When I input into jsonlint:
Parse error on line 1:
[ UserJSONImpl{
-----^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', ']'
What's wrong with my JSON?
[
{
"UserJSONImpl": {
"id": 26136358,
"name": "BryanConnor",
"screenName": "thewhyaxis",
"location": null,
"description": "TheWhyAxisisacollectionofindepthwritingaboutthevisualizationsthatdeserveyourattention.",
"isContributorsEnabled": false
}
}
]
Following http://json.org/
[ elements ] with elements as value,
value as object,
object as { members },
members as pair
pair as string : value
value as object
...
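A quick way to check text against these rules is to run it through a strict parser. The Python sketch below uses the standard json module on shortened versions of the two snippets above; is_valid_json is just a small helper written for this example.

```python
import json

# Shortened form of the original output: bare identifier, '=' and single quotes.
raw = "[ UserJSONImpl{ \"id\"=26136358, \"name\"='BryanConnor' } ]"

# Shortened form of the corrected JSON: every pair is "string": value.
fixed = '[ { "UserJSONImpl": { "id": 26136358, "name": "BryanConnor", "location": null } } ]'


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(raw))    # False
print(is_valid_json(fixed))  # True
print(json.loads(fixed)[0]["UserJSONImpl"]["name"])  # BryanConnor
```

The original text looks like the output of an object's toString() method rather than serialized JSON, which is why a parser rejects it at the first bare identifier.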