How can I turn a Cosmos DB list of documents into a hashmap and append values to it?

I am currently trying to turn the list of documents I get from a Cosmos DB query into a map, so that I can iterate over each object's elements without using their ids. I want to remove some elements, and I want to append some data to other elements as well. Finally, I want to output a JSON file with this data. How can I do this?
For example:
{
  "action": "A",
  "id": "138",
  "validate": "yes",
  "BaseVehicle": {
    "id": "105"
  },
  "Qty": {
    "value": "1"
  },
  "PartType": {
    "id": "8852"
  },
  "BatchNumber": 0,
  "_attachments": "attachments/",
  "_ts": 1551998460
}
Should look something like this:
{
  "type": "App",
  "data": {
    "attributes": {
      "Qty": {
        "values": [
          {"source": "internal", "locale": "en-US", "value": "1"}
        ]
      },
      "BaseVehicle": {
        "values": [
          {"source": "internal", "locale": "en-US", "value": "105"}
        ]
      },
      "PartType": {
        "values": [
          {"source": "internal", "locale": "en-US", "value": "8852"}
        ]
      }
    }
  }
}

You could use the Copy Activity in Azure Data Factory to implement your requirements.
1. Write an API to query data from Cosmos DB and process the data into the format you want using code.
2. Output the desired results and configure the HTTP connector as the source of the copy activity. Refer to this link.
3. Configure Azure Blob Storage as the sink of the copy activity. The dataset properties support the JSON format. Refer to this link.
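Independently of Data Factory, the in-code transformation in step 1 could be sketched in Python. This is a minimal sketch assuming the documents have already been fetched (for example via the azure-cosmos SDK's query_items); the sample document below is taken from the question:

```python
import json

# Hypothetical documents as returned by a Cosmos DB query; in practice they
# would come from something like container.query_items(...).
docs = [
    {
        "action": "A", "id": "138", "validate": "yes",
        "BaseVehicle": {"id": "105"},
        "Qty": {"value": "1"},
        "PartType": {"id": "8852"},
        "BatchNumber": 0,
        "_attachments": "attachments/",
        "_ts": 1551998460,
    }
]

# Elements to remove before building the attributes map.
DROP = {"action", "id", "validate", "BatchNumber", "_attachments", "_ts"}

def transform(doc):
    """Turn one document into the target {"type": ..., "data": ...} shape,
    appending source/locale metadata to each kept element."""
    attributes = {}
    for key, val in doc.items():
        if key in DROP:
            continue  # remove unwanted elements
        # Each remaining element holds a single "id" or "value" entry.
        inner = val.get("id") or val.get("value")
        attributes[key] = {
            "values": [
                {"source": "internal", "locale": "en-US", "value": inner}
            ]
        }
    return {"type": "App", "data": {"attributes": attributes}}

# Write the transformed documents out as a JSON file.
result = [transform(d) for d in docs]
print(json.dumps(result, indent=2))
```

The keys in DROP and the "internal"/"en-US" metadata are taken from the example above; adjust them to your actual schema.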

Related

How to filter/map an array in a data factory copy activity from a hierarchy JSON to tabular file?

The context
I'm new to the Azure environment, so please bear with me if the question is simple. I have called a REST API with pagination. The data has multiple arrays stored in a hierarchy. The arrays contain the same value translated into different languages, so in theory, if I only want one language from each array, the data is already in a tabular format. However, I'm having trouble filtering the data to the correct language in the mapping part of the copy activity.
Sample data
Below is a sample of the data. I have added 3 different 'rows' for the tabular format. There are 3 different arrays in the data:
['stage']['localization']
['disqualifyReason']['localization']
['title']['localization']
As I work for a dutch company, we only want the value where locale == 'nl-NL' to be returned.
[
  {
    "id": "f2597aa9-45b3-4142-a343-b1ec27fbfcea",
    "email": "some#email.com",
    "firstName": "Name",
    "lastName": "Name",
    "middleName": null,
    "created": "2023-01-03T13:29:15.7452993Z",
    "status": 1,
    "stage": {
      "localization": [
        {"locale": "da-DK", "value": "Ansøgt"},
        {"locale": "de-DE", "value": "Beworben"},
        {"locale": "en-GB", "value": "Applied"},
        {"locale": "nl-NL", "value": "Gesolliciteerd"}
      ]
    },
    "disqualifyReason": {
      "localization": [
        {"locale": "nl-NL", "value": "Geen match"},
        {"locale": "da-DK", "value": "Ikke et match"},
        {"locale": "de-DE", "value": "Absage - Screening"},
        {"locale": "en-GB", "value": "Not a match"}
      ]
    },
    "source": {
      "media": {
        "id": "c0772eab-09dd-4c7c-86b5-ee9b65ed8398",
        "title": {
          "localization": [
            {"locale": "nl-NL", "value": "Tegel voor URL"}
          ]
        }
      }
    }
  },
  {
    "id": "a72b856e-8000-4e51-b475-9e6af5cf9e19",
    "email": "some#email.com",
    "firstName": "Name",
    "lastName": "Name",
    "middleName": null,
    "created": "2023-01-03T13:29:15.7452993Z",
    "status": 1,
    "stage": {
      "localization": [
        {"locale": "nl-NL", "value": "Afwijzen op CV"}
      ]
    },
    "disqualifyReason": null,
    "source": {
      "media": {
        "id": "c0772eab-09dd-4c7c-86b5-ee9b65ed8398",
        "title": {
          "localization": [
            {"locale": "nl-NL", "value": "Tegel voor URL"}
          ]
        }
      }
    }
  },
  {
    "id": "f3898ebd-d6d6-4d9e-979e-348fe79325dc",
    "email": "some#email.com",
    "firstName": "Name",
    "lastName": "Name",
    "middleName": null,
    "created": "2023-01-03T14:36:04.4517426Z",
    "status": 1,
    "stage": {
      "localization": [
        {"locale": "nl-NL", "value": "1e interview"},
        {"locale": "da-DK", "value": "1. samtale"},
        {"locale": "en-GB", "value": "1st Interview"},
        {"locale": "nl-NL", "value": "1. Interview"}
      ]
    },
    "disqualifyReason": null,
    "source": {
      "media": {
        "id": "c0772eab-09dd-4c7c-86b5-ee9b65ed8398",
        "title": {
          "localization": [
            {"locale": "nl-NL", "value": "Tegel voor URL"}
          ]
        }
      }
    }
  }
]
What did I try
Lots of googling and Microsoft Learn pages. I thought the following dynamic expression would work in the mapping part of the copy activity:
@filter($['stage']['localization']['locale'] == 'nl-NL'), but it doesn't. I can't use the filter function in the copy activity pipeline. I believe I could save the API response to a JSON file, then use data flows to filter it in a data flow activity, which then stores it in a tabular format. However, isn't there a way to filter the data directly in the copy activity?
Many thanks for any help!
In the copy activity mapping there is a dynamic content option.
But AFAIK, this only applies to selecting specific columns from the source. In your case, you are trying to filter records, which is not possible with the copy activity alone.
I believe I can save the API call to a JSON file, then use data flows
to filter it out in a data flow activity, which then stores it to a
tabular format.
Yes, using data flows is the solution. Data flows also support a REST API source, so you can use a data flow directly and configure pagination just as in the copy activity.
Then use a filter transformation with your condition.
You will get the desired result in debug.
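For reference, the filtering the data flow performs, keeping only the nl-NL value from each localization array and flattening each candidate to one row, is equivalent to this Python sketch (field names are taken from the sample data above; to_row and pick_locale are illustrative helpers, not ADF functions):

```python
def pick_locale(localized, locale="nl-NL"):
    """Return the value for the requested locale from a
    {"localization": [...]} object, or None if absent."""
    if not localized:
        return None
    for entry in localized.get("localization", []):
        if entry["locale"] == locale:
            return entry["value"]
    return None

def to_row(candidate):
    """Flatten one candidate record into a tabular row."""
    return {
        "id": candidate["id"],
        "email": candidate["email"],
        "stage": pick_locale(candidate.get("stage")),
        "disqualifyReason": pick_locale(candidate.get("disqualifyReason")),
        "sourceTitle": pick_locale(candidate["source"]["media"].get("title")),
    }
```

Applying `[to_row(c) for c in data]` to the sample above yields one row per candidate, with None where an array has no nl-NL entry (e.g. a null disqualifyReason).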

How do I list the durations of videos in a YouTube playlist one by one?

There is a playlist that I did not create. I want to export the name and duration of each video in this list to Excel or anywhere else. It would be nice if I could copy the links of the videos as well, but that is not strictly necessary.
You are looking for the YouTube Data API v3 Videos: list method and its contentDetails.duration field.
Indeed, by retrieving the JSON at https://www.googleapis.com/youtube/v3/videos?part=contentDetails&id=VIDEO_ID&key=API_KEY you would get, for example for jNQXAC9IVRw:
{
  "kind": "youtube#videoListResponse",
  "etag": "yBF8nDhbRsQIALYRMSY1W9dtIPM",
  "items": [
    {
      "kind": "youtube#video",
      "etag": "-gM2wTC3rW9_yfOtoOD4fcaQvl4",
      "id": "jNQXAC9IVRw",
      "contentDetails": {
        "duration": "PT19S",
        "dimension": "2d",
        "definition": "sd",
        "caption": "true",
        "licensedContent": true,
        "contentRating": {},
        "projection": "rectangular"
      }
    }
  ],
  "pageInfo": {
    "totalResults": 1,
    "resultsPerPage": 1
  }
}
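A small Python sketch of the whole flow, assuming you have an API key and the playlist's video IDs (fetch_video_info is a hypothetical helper name). The duration comes back in ISO 8601 form ("PT19S", "PT1H2M3S"), so it needs parsing before it is useful in Excel:

```python
import json
import re
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def iso8601_to_seconds(duration):
    """Convert a contentDetails.duration such as 'PT1H2M3S' to seconds.
    (Durations of a day or more use a 'P#DT...' form not handled here.)"""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if not m:
        raise ValueError(f"unsupported duration: {duration}")
    h, mnt, s = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mnt * 60 + s

def fetch_video_info(video_ids, api_key):
    """Fetch title and duration (in seconds) for up to 50 video ids per call."""
    url = (f"{API_URL}?part=snippet,contentDetails"
           f"&id={','.join(video_ids)}&key={api_key}")
    with urlopen(url) as resp:
        data = json.load(resp)
    return {item["id"]: (item["snippet"]["title"],
                         iso8601_to_seconds(item["contentDetails"]["duration"]))
            for item in data["items"]}
```

To get the IDs of a playlist you do not own, the playlistItems endpoint (part=contentDetails, playlistId=...) returns them page by page; the resulting id/title/seconds tuples can then be written out as CSV for Excel.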

Lookup Activity in Azure Data Factory is not reading JSON file correctly and appending some additional data

I am trying to read a JSON file stored in an Azure Blob container and use the output to set a variable, but I am getting some additional data apart from the main data.
My JSON input is
{
  "resourceType": "cust",
  "gender": "U",
  "birthdate": "1890-07-31",
  "identifier": [
    {
      "system": "https://test.com",
      "value": "test"
    }
  ],
  "name": [
    {
      "use": "official",
      "family": "Test",
      "given": ["test"],
      "prefix": ["Mr"]
    }
  ],
  "telecom": [
    {
      "system": "phone",
      "value": "00000",
      "use": "home"
    }
  ]
}
The output of lookup activity is:
{
  "count": 1,
  "value": [
    {
      "JSON_F52E2B61-18A1-11d1-B105": "[{\"resourceType\":\"cust\",\"identifier\":[{\"system\":\"https://test.com\",\"value\":\"test\"}],\"name\":[{\"use\":\"official\",\"family\":\"Test\",\"given\":\"[ Test ]\",\"prefix\":\"[ ]\"}],\"telecom\":[{\"system\":\"phone\",\"value\":\"00000\",\"use\":\"home\"}],\"gender\":\"unknown\",\"birthDate\":\"1890-07-12T00:00:00\"}]"
    }
  ]
}
Now I don't understand why:
1. the key JSON_F52E2B61-18A1-11d1-B105 is present in the value, and
2. there are so many \ escape characters that are not present in the actual JSON.

Convert Azure Computer Vision Read response to a MergeSkill-compatible format in Azure Cognitive Search

Raw Read response from Azure Computer Vision looks like this:
{
  "status": "succeeded",
  "createdDateTime": "2021-04-08T21:56:17.6819115+00:00",
  "lastUpdatedDateTime": "2021-04-08T21:56:18.4161316+00:00",
  "analyzeResult": {
    "version": "3.2",
    "readResults": [
      {
        "page": 1,
        "angle": 0,
        "width": 338,
        "height": 479,
        "unit": "pixel",
        "lines": [
          {
            "boundingBox": [25, 14],
            "text": "NOTHING",
            "appearance": {
              "style": {"name": "other", "confidence": 0.971}
            },
            "words": [
              {
                "boundingBox": [27, 15],
                "text": "NOTHING",
                "confidence": 0.994
              }
            ]
          }
        ]
      }
    ]
  }
}
Copied from here
I want to create a custom skill in Azure Cognitive Search that does not use the built-in OcrSkill but my own Azure Function, which will call the Computer Vision client in code.
The problem is that to pass input to Text.MergeSkill:
{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    {"name": "text", "source": "/document/content"},
    {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
    {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"}
  ],
  "outputs": [
    {"name": "mergedText", "targetName": "merged_text"}
  ]
}
I need to convert the Read output to the form that OcrSkill returns from custom skills. That response must look like this:
{
  "text": "Hello World. -John",
  "layoutText": {
    "language": "en",
    "text": "Hello World.",
    "lines": [
      {
        "boundingBox": [{"x": 10, "y": 10}, {"x": 50, "y": 10}, {"x": 50, "y": 30}, {"x": 10, "y": 30}],
        "text": "Hello World."
      }
    ],
    "words": [
      {
        "boundingBox": [{"x": 110, "y": 10}, {"x": 150, "y": 10}, {"x": 150, "y": 30}, {"x": 110, "y": 30}],
        "text": "Hello"
      },
      {
        "boundingBox": [{"x": 110, "y": 10}, {"x": 150, "y": 10}, {"x": 150, "y": 30}, {"x": 110, "y": 30}],
        "text": "World."
      }
    ]
  }
}
And I copied it from here.
My question is: how do I convert the boundingBox parameter from the Read Computer Vision endpoint to the form that Text.MergeSkill accepts? Do we really need to do that, or can we pass the Read response to Text.MergeSkill differently?
The built-in OcrSkill calls the Cognitive Services Computer Vision Read API for certain languages, and it handles merging the text for you via the 'text' output. If at all possible, I would strongly suggest you use this skill instead of writing a custom one.
If you must write a custom skill and merge the output text yourself: per the MergeSkill documentation, the 'text' and 'offsets' inputs are optional. That means you should be able to pass the text from the individual Read API output objects directly to the MergeSkill via the 'itemsToInsert' input, if you just need a way to merge those outputs into one large text. This would make your skillset look something like the following (not tested), assuming you are still using the built-in Azure Search image extraction and your custom skill outputs the exact payload the Read API returns, as shared above.
{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Custom skill that calls Cognitive Services Computer Vision Read API",
      "uri": "<your custom skill uri>",
      "batchSize": 1,
      "context": "/document/normalized_images/*",
      "inputs": [
        {"name": "image", "source": "/document/normalized_images/*"}
      ],
      "outputs": [
        {"name": "readAPIOutput"}
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": "",
      "insertPostTag": "\n",
      "inputs": [
        {"name": "itemsToInsert", "source": "/document/normalized_images/*/readAPIOutput/analyzeResult/readResults/*/lines/*/text"}
      ],
      "outputs": [
        {"name": "mergedText", "targetName": "merged_text"}
      ]
    }
  ]
}
However, if you need to guarantee that the text appears in the correct order based on the bounding boxes, you will likely need to write a custom solution that calculates the positions and recombines the text yourself. Hence the suggestion to use the built-in OcrSkill if at all possible.
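If you do convert the payload yourself, the boundingBox translation is mechanical: the Read API returns a flat list of coordinates, while the OcrSkill layout format expects a list of {"x": ..., "y": ...} point objects. A minimal Python sketch (assuming full x/y-pair Read bounding boxes; the sample above is truncated, and to_points/line_to_layout are illustrative names):

```python
def to_points(flat_box):
    """Convert a flat Read API bounding box [x1, y1, x2, y2, ...]
    into OcrSkill-style point objects [{"x": x1, "y": y1}, ...]."""
    if len(flat_box) % 2 != 0:
        raise ValueError("bounding box must contain x/y pairs")
    return [{"x": x, "y": y} for x, y in zip(flat_box[0::2], flat_box[1::2])]

def line_to_layout(line):
    """Map one entry of the Read API 'lines' array to an OcrSkill
    layoutText line."""
    return {
        "boundingBox": to_points(line["boundingBox"]),
        "text": line["text"],
    }
```

The same conversion applies to the 'words' entries; the surrounding "text"/"layoutText" envelope is then assembled from the converted lines and words.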

Sorting in Elasticsearch using a nested object type

I am trying to get data using Elasticsearch in a Python program. Currently I am getting the following data from an Elasticsearch request. I wish to sort the data on rank.type; for example, I want to sort the data by raw_freq or by score.
What should the query look like?
I believe it will involve a nested query. Help would be very much appreciated.
{
  "data": [
    {
      "customer_id": 108,
      "id": "Qrkz-2QBigkG_fmtME8z",
      "rank": [
        {"type": "raw_freq", "value": 2},
        {"type": "score", "value": 3},
        {"type": "pmiii", "value": 1.584962}
      ],
      "status": "pending",
      "value": "testingFreq2"
    }
  ]
}
Here is a simple example of how you can sort your data:
"query": {
  "term": {"status": "pending"}
},
"sort": [
  {"rank.type.keyword": {"order": "desc"}}
]
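Note that if rank is mapped as a nested type, sorting on rank.type.keyword alone orders documents by the type string, not by the value of a particular rank entry. A nested sort with a filter on the type, sketched here as a Python query body for the elasticsearch client and assuming a nested mapping for rank, would sort documents by their raw_freq value:

```python
# Query body sorting documents by the "value" of the rank entry whose
# type is "raw_freq"; assumes "rank" is mapped as a nested field.
body = {
    "query": {"term": {"status": "pending"}},
    "sort": [
        {
            "rank.value": {
                "order": "desc",
                "nested": {
                    "path": "rank",
                    "filter": {"term": {"rank.type.keyword": "raw_freq"}},
                },
            }
        }
    ],
}
```

With the official client this would be passed as es.search(index=..., body=body), or via the equivalent keyword arguments in newer client versions; swap "raw_freq" for "score" to sort by score instead.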