Azure Stream Processing upsert to DocumentDB with array - azure

I'm using Azure Stream Analytics to copy my JSON over to DocumentDB, using upsert to overwrite each document with the latest data. This works well for my base data, but I would love to be able to append to the list data, as unfortunately I can only send one list item at a time.
In the example below, the document is matched on id, and all items are updated, but I would like the "myList" array to keep growing with the "myList" data from each document (with the same id). Is this possible? Is there any other way to use Stream Analytics to update this list in the document?
I'd rather steer clear of using a tumbling window if possible, but is that an option that would work?
Sample documents:
{
"id": "1234",
"otherData": "example",
"myList": [{"listitem": 1}]
}
{
"id": "1234",
"otherData": "example 2",
"myList": [{"listitem": 2}]
}
Desired output:
{
"id": "1234",
"otherData": "example 2",
"myList": [{"listitem": 1}, {"listitem": 2}]
}
My current query:
SELECT id, otherData, myList INTO [myoutput] FROM [myinput]

Currently arrays are not merged; this is the existing behavior of the DocumentDB output from ASA, as also mentioned in this article. I doubt using a tumbling window would help here.
Note that changes in the values of array properties in your JSON document result in the entire array getting overwritten, i.e. the array is not merged.
You could transform the input array (myList) into individual rows using the GetArrayElements function.
Your query might look something like this --
SELECT i.id, i.otherData, listItemFromArray.ArrayValue AS listItem
INTO myoutput
FROM myinput i
CROSS APPLY GetArrayElements(i.myList) AS listItemFromArray
cheers!

Related

Search Items by multiple Tags DynamoDB NodeJS

I need to do a search in my dynamoDB table that matches multiple values from a single item.
This is the type of Items i am storing:
{
"id": "<product id>",
"name": "Product Name",
"price": 1.23,
"tags": [
"tag1",
"tag2",
"tag3"
]
}
I need to return an array of items having tags that match all of the tags in a comma-separated list.
For example: I am looking for items that contain both "tag1" and "tag2".
My first approach was getting all the items from the DynamoDB table and then iterating over each item to check whether this condition matches, then adding the matching item to an object of objects.
My approach is definitely not cost-effective. Any suggestions with Node.js?
There is not a way to index optimize this generic case (an arbitrary number of tags stored and searched) with DynamoDB.
You can optimize retrieval for one tag by adding extra items in the table where the tag is the partition key and then doing a query (with filter for the other tags) starting there.
Or you can duplicate the data to OpenSearch which is designed for this type of query.
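For reference, the scan-and-filter approach the question describes (which, as noted above, cannot be index-optimized for the generic case) can be sketched as follows. This is Python rather than Node.js for brevity; the item shape and tag names follow the question, and `matches_all_tags` / `filter_items` are hypothetical helper names.

```python
# Naive in-memory filter over a full table scan: keep items that
# carry every tag in a comma-separated list.

def matches_all_tags(item, required_tags):
    """Return True if the item carries every tag in required_tags."""
    return set(required_tags).issubset(item.get("tags", []))

def filter_items(items, tag_csv):
    """Filter scanned items by a comma-separated tag list."""
    required = [t.strip() for t in tag_csv.split(",")]
    return [item for item in items if matches_all_tags(item, required)]

items = [
    {"id": "p1", "tags": ["tag1", "tag2", "tag3"]},
    {"id": "p2", "tags": ["tag1"]},
    {"id": "p3", "tags": ["tag2", "tag1"]},
]
print(filter_items(items, "tag1, tag2"))  # items p1 and p3
```

This costs a full scan per search, which is why the answer suggests either per-tag index items (partition key = tag, filter on the rest) or duplicating the data into OpenSearch.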

how to compare 2 JSON files in Azure data factory

I'm new to Azure Data Factory. I want to compare two JSON files through Azure Data Factory: we need to get the list of ids in the current JSON file which are not in the previous JSON file. Below are the two sample JSON files.
Previous JSON file :
{
"count": 2,
"values": [
{
"id": "4e10aa02d0b945ae9dcf5cb9ded9a083"
},
{
"id": "cbc414db-4d08-48f2-8fb7-748c5da45ca9"
}
]
}
Current JSON file:
{
"count": 3,
"values": [
{
"id": "4e10aa02d0b945ae9dcf5cb9ded9a083"
},
{
"id": "cbc414db-4d08-48f2-8fb7-748c5da45ca9"
},
{
"id": "5ea951e3-88d7-40b4-9e3f-d787b94a43c8"
}
]
}
New ids have to trigger one activity, and old ids another.
We are running out of time, so please help me out.
Thanks in advance!
You can simply use an If Condition activity.
If expression:
#equals(activity('Lookup1').output.value,activity('Lookup2').output.value)
Further, I have used a Fail activity on the False condition for better visibility.
--
Lookup1 Activity --> Json1.json
Lookup2 Activity --> Json2.json
This can be done using a single Filter Activity.
I have assigned two parameters "Old_json" and "New_json" for your Previous Json and Current Json files respectively.
In the settings of Filter activity,
Items: #pipeline().parameters.New_json.values
Condition: #not(contains(pipeline().parameters.Old_json.values,item()))
So, this Filter activity goes through each item in the new JSON and checks whether it is present in the old JSON. If it is not present, the item is included in the output.
(Screenshot: output of the Filter activity.)
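To make the Filter activity's logic concrete, here is a pure-Python sketch of what it computes, using the two JSON payloads from the question (no ADF required; this only illustrates the membership test, not the pipeline itself):

```python
# Simulate the Filter activity: keep items of the current JSON
# that do not appear in the previous JSON.

previous = {"count": 2, "values": [
    {"id": "4e10aa02d0b945ae9dcf5cb9ded9a083"},
    {"id": "cbc414db-4d08-48f2-8fb7-748c5da45ca9"},
]}
current = {"count": 3, "values": [
    {"id": "4e10aa02d0b945ae9dcf5cb9ded9a083"},
    {"id": "cbc414db-4d08-48f2-8fb7-748c5da45ca9"},
    {"id": "5ea951e3-88d7-40b4-9e3f-d787b94a43c8"},
]}

# Equivalent of: @not(contains(pipeline().parameters.Old_json.values, item()))
new_ids = [item for item in current["values"] if item not in previous["values"]]
print(new_ids)  # [{'id': '5ea951e3-88d7-40b4-9e3f-d787b94a43c8'}]
```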
Thanks @KarthikBhyresh-MT for a helpful answer.
Just to add, if (like me) you want to compare two files (or in my case, a file with the output of a SQL query), but don't care about the order of the records, you can do this using a ForEach activity. This also has the benefit of allowing a more specific error message in the case of a difference between the files.
My first If Condition checks the two files have the same row count, with the expression:
#equals(activity('Select from SQL').output.count, activity('Lookup from CSV').output.count)
The False branch leads to a Fail activity with message:
#concat(pipeline().parameters.TestName, ': CSV has ', string(activity('Lookup from CSV').output.count), ' records but SQL query returned ', string(activity('Select from SQL').output.count))
If this succeeds, flow passes to a ForEach, iterating through items:
#activity('Lookup from CSV').output.value
... which contains an If Condition with expression:
#contains(string(activity('Select from SQL').output.value), string(item()))
The False branch for that If Condition contains an Append variable activity, which appends to a variable I've added to the pipeline called MismatchedRecords. The Value appended is:
#item()
Following the ForEach, a final If Condition then checks whether MismatchedRecords contains any items:
#equals(length(variables('MismatchedRecords')), 0)
... and the False branch contains another Fail activity, with message:
#concat(string(length(variables('MismatchedRecords'))), ' records from CSV not found in SQL. Missing records: ', string(variables('MismatchedRecords')), ' SQL output: ', string(activity('Select from SQL').output.value))
The message contains specific information about the records which could not be matched, to allow further investigation.
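The comparison pipeline above (row-count check, then ForEach collecting mismatches) can be sketched in plain Python. Note this uses an order-insensitive membership check on whole records, a slight simplification of the string-based contains used in the ADF expression; the sample rows are invented for illustration:

```python
# Compare two record sets regardless of order: first assert equal
# row counts, then collect records from one source missing in the other.

csv_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
sql_rows = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]  # order differs

# Equivalent of the first If Condition on output.count:
assert len(csv_rows) == len(sql_rows), "row counts differ"

# Equivalent of the ForEach + Append Variable on mismatch:
mismatched = [row for row in csv_rows if row not in sql_rows]
if mismatched:
    raise RuntimeError(
        f"{len(mismatched)} records from CSV not found in SQL: {mismatched}")
print("files match regardless of order")
```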

How can we get the same sequence of columns as a result in DynamoDB query present in the DB?

I am looking for a way to get the boto3 DynamoDB query results (columns) in the same order that I inserted them into the DB.
My insert operation:
DYNAMO_CLIENT.put_item(
    TableName=TABLE_NAME,
    Item={
        "id": {"S": data.id},
        "name": {"S": data.name},
        "account_id": {"S": data.provider_account_id},
        "provider": {"S": data.provider},
        "is_enabled": {"BOOL": data.enabled},
    },
)
My Query:
response = table.query(
    IndexName='provider-id-index',
    KeyConditionExpression=Key('provider').eq(provider)
)
for item in response["Items"]:
    print(item.values())
Output that I am getting:
{"is_enabled": False, "name": "sample", "account_id": "12345", "id": "345", "provider": "none"}
Expecting:
{"id": "345", "name": "sample", "account_id": "12345", "provider": "none", "is_enabled": False}
I know that response["Items"] returns a list of dict objects (unordered), but I am looking for a way to do it.
I tried collections.OrderedDict(item.values()[0]) but had no luck.
Any solution to this problem would be appreciated.
Thank you.
You can't control the order of the items in the dictionary. If you need things in a certain order you probably should rethink why that is. If the order is important for something that is out of your control you can always take the result and put the values into a new object that is ordered the way you want. If that order is unique to each item you could store the order in an attribute on the item. Generally speaking, this isn't something you should be concerned with.
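The last suggestion above (take the result and put the values into a new object ordered the way you want) can be sketched like this; the key list mirrors the insert order from the question, and `reorder` is a hypothetical helper name:

```python
# Rebuild each returned item in a fixed key order. Python dicts
# preserve insertion order (3.7+), so constructing a new dict suffices.

KEY_ORDER = ["id", "name", "account_id", "provider", "is_enabled"]

def reorder(item, key_order=KEY_ORDER):
    # Keys listed in key_order come first; any extras are appended after.
    ordered = {k: item[k] for k in key_order if k in item}
    ordered.update({k: v for k, v in item.items() if k not in ordered})
    return ordered

item = {"is_enabled": False, "name": "sample", "account_id": "12345",
        "id": "345", "provider": "none"}
print(reorder(item))
# {'id': '345', 'name': 'sample', 'account_id': '12345', 'provider': 'none', 'is_enabled': False}
```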

How do I keep existing data in couchbase and only update the new data without overwriting

So, say I have created some records/documents under a bucket, and the user updates only one column out of 10 in the RDBMS, so I am trying to send only that one column's data and update it in Couchbase. But the problem is that Couchbase is overwriting the entire record and putting NULLs for the rest of the columns.
One approach is to copy all the data from the existing record after fetching it from Couchbase, and then overwrite the new column while copying the data from the old one. But that doesn't look like an optimal approach.
Any suggestions?
You can use N1QL UPDATE statements (search for "Couchbase N1QL").
UPDATE replaces a document that already exists with updated values.
update:
UPDATE keyspace-ref [use-keys-clause] [set-clause] [unset-clause] [where-clause] [limit-clause] [returning-clause]
set-clause:
SET path = expression [update-for] [ , path = expression [update-for] ]*
update-for:
FOR variable (IN | WITHIN) path (, variable (IN | WITHIN) path)* [WHEN condition ] END
unset-clause:
UNSET path [update-for] (, path [ update-for ])*
keyspace-ref: Specifies the keyspace for which to update the document.
You can add an optional namespace-name to the keyspace-name in this way:
namespace-name:keyspace-name.
use-keys-clause: Specifies the keys of the data items to be updated. Optional. Keys can be any expression.
set-clause: Specifies the value for an attribute to be changed.
unset-clause: Removes the specified attribute from the document.
update-for: The update-for clause uses the FOR statement to iterate over a nested array and SET or UNSET the given attribute for every matching element in the array.
where-clause: Specifies the condition that needs to be met for data to be updated. Optional.
limit-clause: Specifies the greatest number of objects that can be updated. This clause must have a non-negative integer as its upper bound. Optional.
returning-clause: Returns the data you updated as specified in the result_expression.
RBAC Privileges
The user executing the UPDATE statement must have the Query Update privilege on the target keyspace. If the statement has any clauses that need to read data, such as a SELECT clause or RETURNING clause, then the Query Select privilege is also required on the keyspaces referred to in the respective clauses. For more details about user roles, see Authorization.
For example,
To execute the following statement, user must have the Query Update privilege on travel-sample.
UPDATE `travel-sample` SET foo = 5
To execute the following statement, user must have the Query Update privilege on the travel-sample and Query Select privilege on beer-sample.
UPDATE `travel-sample`
SET foo = 9
WHERE city = (SELECT raw city FROM `beer-sample` WHERE type = "brewery")
To execute the following statement, user must have the Query Update privilege on `travel-sample` and Query Select privilege on `travel-sample`.
UPDATE `travel-sample`
SET city = "San Francisco"
WHERE lower(city) = "sanfrancisco"
RETURNING *
Example
The following statement changes the "type" of the product, "odwalla-juice1" to "product-juice".
UPDATE product USE KEYS "odwalla-juice1" SET type = "product-juice" RETURNING product.type
"results": [
{
"type": "product-juice"
}
]
This statement removes the "type" attribute from the "product" keyspace for the document with the "odwalla-juice1" key.
UPDATE product USE KEYS "odwalla-juice1" UNSET type RETURNING product.*
"results": [
{
"productId": "odwalla-juice1",
"unitPrice": 5.4
}
]
This statement unsets the "gender" attribute in the "children" array for the document with the key, "dave" in the tutorial keyspace.
UPDATE tutorial t USE KEYS "dave" UNSET c.gender FOR c IN children END RETURNING t
"results": [
{
"t": {
"age": 46,
"children": [
{
"age": 17,
"fname": "Aiden"
},
{
"age": 2,
"fname": "Bill"
}
],
"email": "dave@gmail.com",
"fname": "Dave",
"hobbies": [
"golf",
"surfing"
],
"lname": "Smith",
"relation": "friend",
"title": "Mr.",
"type": "contact"
}
}
]
Starting version 4.5.1, the UPDATE statement has been improved to SET nested array elements. The FOR clause is enhanced to evaluate functions and expressions, and the new syntax supports multiple nested FOR expressions to access and update fields in nested arrays. Additional array levels are supported by chaining the FOR clauses.
Example
UPDATE default
SET i.subitems = ( ARRAY OBJECT_ADD(s, 'new', 'new_value' )
FOR s IN i.subitems END )
FOR s IN ARRAY_FLATTEN(ARRAY i.subitems
FOR i IN items END, 1) END;
If you're using structured (json) data, you need to read the existing record then update the field you want in your program's data structure and then send the record up again. You can't update individual fields in the json structure without sending it all up again. There isn't a way around this that I'm aware of.
It is indeed true, to update individual items in a JSON doc, you need to fetch the entire document and overwrite it.
We are working on adding individual item updates in the near future.
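The read-modify-write approach described in these answers can be sketched as follows. The merge itself is plain Python; the Couchbase fetch/store steps are shown only as comments, since the exact SDK calls depend on your client version:

```python
# Read-modify-write: fetch the full document, overlay only the
# changed fields, then write the whole document back.

def merge_update(existing_doc, partial_update):
    """Overlay only the changed fields onto the existing document."""
    merged = dict(existing_doc)    # copy everything already stored
    merged.update(partial_update)  # overwrite just the new fields
    return merged

# existing = bucket.get(key).value   # fetch the full document first
existing = {"name": "Alice", "city": "Bonn", "age": 30}
change = {"city": "Berlin"}          # only the column that changed

updated = merge_update(existing, change)
print(updated)  # {'name': 'Alice', 'city': 'Berlin', 'age': 30}
# bucket.upsert(key, updated)        # write the whole document back
```

Because the merge copies the existing document first, untouched fields keep their stored values instead of being nulled out.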

couchdb - Map Reduce - How to Join different documents and group results within a Reduce Function

I am struggling to implement a map / reduce function that joins two documents and sums the result with reduce.
First document type is Categories. Each category has an ID and within the attributes I stored a detail category, a main category and a division ("Bereich").
{
"_id": "a124",
"_rev": "8-089da95f148b446bd3b33a3182de709f",
"detCat": "Life_Ausgehen",
"mainCat": "COL_LEBEN",
"mainBereich": "COL",
"type": "Cash",
"dtCAT": true
}
The second document type is a transaction. The attributes show all the details for each transaction, including the field "newCat" which is a reference to the category ID.
{
"_id": "7568a6de86e5e7c6de0535d025069084",
"_rev": "2-501cd4eaf5f4dc56e906ea9f7ac05865",
"Value": 133.23,
"Sender": "Comtech",
"Booking Date": "11.02.2013",
"Detail": "Oki Drucker",
"newCat": "a124",
"dtTRA": true
}
Now if I want to develop a map/reduce to get the result in the form:
e.g.: "Name of Main Category", "Sum of all values in transactions".
I figured out that I could reference to another document with "_ID:" and ?include_docs=true, but in that case I can not use a reduce function.
I looked in other postings here, but couldn't find a suitable example.
Would be great if somebody has an idea how to solve this issue.
I understand that multiple Category documents may have the same mainCat value. The technique called view collation is suitable for some cases where a single join would be used in the relational model. In your case it will not help: although you use two document schemes, you really have a three-level structure: main-category <- category <- transaction. I think you should consider changing the DB design a bit.
Duplicating the data, by storing the mainCat value also in the transaction document, would help. I suggest using a meaningful ID for the transaction instead of a generated one. You could consider, for example, "COL_LEBEN-7568a6de86e5e" (the mainCat concatenated with some random value, where the "-" delimiter never appears in the mainCat). Then, with a simple parser in the map function, you emit ["COL_LEBEN", "7568a6de86e5e"] for transactions and ["COL_LEBEN"] for categories, and reduce to get the sum.
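As a toy simulation of this design (pure Python standing in for CouchDB's JavaScript map and reduce functions), with mainCat parsed out of the meaningful _id and a plain per-category sum as the reduce:

```python
# Simulated map/reduce: emit (mainCat, Value) per transaction,
# then sum the emitted values per mainCat.

from collections import defaultdict

docs = [
    {"_id": "COL_LEBEN-7568a6de86e5e", "Value": 133.23, "dtTRA": True},
    {"_id": "COL_LEBEN-9a1b2c3d4e5f6", "Value": 66.77, "dtTRA": True},
]

def map_doc(doc):
    # Parse mainCat out of the id and emit (key, value), like a CouchDB map.
    if doc.get("dtTRA"):
        main_cat = doc["_id"].split("-", 1)[0]
        yield main_cat, doc["Value"]

def reduce_sum(emitted):
    # Like CouchDB's built-in _sum reduce, grouped by key.
    totals = defaultdict(float)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)

emitted = [kv for doc in docs for kv in map_doc(doc)]
print(reduce_sum(emitted))  # totals per mainCat, here ~200.0 for COL_LEBEN
```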
