Getting content for each file in a directory in a GitHub repo using the API - github-api

I was searching through the API docs. It appears the same API call is used for both cases: it returns details plus content when a single file path is given, but only details (no content) when a directory path is provided.
https://docs.github.com/en/rest/reference/repos#get-repository-content--code-samples
Response if the content is a file:
"type": "file",
"encoding": "base64",
"size": 5362,
"name": "README.md",
"path": "README.md",
"content": "encoded content ...",
"sha": "3d21ec53a331a6f037a91c368710b99387d012c1",
"url": "https://api.github.com/repos/octokit/octokit.rb/contents/README.md",
"git_url": "https://api.github.com/repos/octokit/octokit.rb/git/blobs/3d21ec53a331a6f037a91c368710b99387d012c1",
"html_url": "https://github.com/octokit/octokit.rb/blob/master/README.md",
"download_url": "https://raw.githubusercontent.com/octokit/octokit.rb/master/README.md",
"_links": {
"git": "https://api.github.com/repos/octokit/octokit.rb/git/blobs/3d21ec53a331a6f037a91c368710b99387d012c1",
"self": "https://api.github.com/repos/octokit/octokit.rb/contents/README.md",
"html": "https://github.com/octokit/octokit.rb/blob/master/README.md"
}
}
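Note that for a single file the content field is Base64-encoded, so it has to be decoded before use. A minimal sketch in Node.js, where fileJson is an assumed variable holding the parsed single-file response shown above:

// Sketch: decode the Base64 "content" field of a single-file response.
// `fileJson` is assumed to hold the parsed JSON object shown above.
const text = Buffer.from(fileJson.content, "base64").toString("utf8");
console.log(text); // the decoded README.md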
Response if the content is a directory:
[
  {
    "type": "file",
    "size": 625,
    "name": "octokit.rb",
    "path": "lib/octokit.rb",
    "sha": "fff6fe3a23bf1c8ea0692b4a883af99bee26fd3b",
    "url": "https://api.github.com/repos/octokit/octokit.rb/contents/lib/octokit.rb",
    "git_url": "https://api.github.com/repos/octokit/octokit.rb/git/blobs/fff6fe3a23bf1c8ea0692b4a883af99bee26fd3b",
    "html_url": "https://github.com/octokit/octokit.rb/blob/master/lib/octokit.rb",
    "download_url": "https://raw.githubusercontent.com/octokit/octokit.rb/master/lib/octokit.rb",
    "_links": {
      "self": "https://api.github.com/repos/octokit/octokit.rb/contents/lib/octokit.rb",
      "git": "https://api.github.com/repos/octokit/octokit.rb/git/blobs/fff6fe3a23bf1c8ea0692b4a883af99bee26fd3b",
      "html": "https://github.com/octokit/octokit.rb/blob/master/lib/octokit.rb"
    }
  },
  {
    "type": "dir",
    "size": 0,
    "name": "octokit",
    "path": "lib/octokit",
    "sha": "a84d88e7554fc1fa21bcbc4efae3c782a70d2b9d",
    "url": "https://api.github.com/repos/octokit/octokit.rb/contents/lib/octokit",
    "git_url": "https://api.github.com/repos/octokit/octokit.rb/git/trees/a84d88e7554fc1fa21bcbc4efae3c782a70d2b9d",
    "html_url": "https://github.com/octokit/octokit.rb/tree/master/lib/octokit",
    "download_url": null,
    "_links": {
      "self": "https://api.github.com/repos/octokit/octokit.rb/contents/lib/octokit",
      "git": "https://api.github.com/repos/octokit/octokit.rb/git/trees/a84d88e7554fc1fa21bcbc4efae3c782a70d2b9d",
      "html": "https://github.com/octokit/octokit.rb/tree/master/lib/octokit"
    }
  }
]
How can I get the content as an output field for all the files in a directory through a single API call? Is there any way to do so?
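There does not appear to be a single REST call for this, but as a workaround, here is a minimal sketch (Node.js; assumes a public repo, and owner, repo, and dir are placeholder values) that lists the directory and then fetches each file's content via its download_url from the listing above:

// Sketch: list a directory with the contents API, then fetch each file's
// raw content through its download_url. Note this is N+1 requests, not one.
const owner = "octokit"; // placeholder values for illustration
const repo = "octokit.rb";
const dir = "lib";

async function getDirContents() {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/contents/${dir}`);
  const entries = await res.json();
  // Only "file" entries have a download_url; "dir" entries have null.
  const files = entries.filter((e) => e.type === "file");
  return Promise.all(
    files.map(async (f) => ({
      path: f.path,
      content: await (await fetch(f.download_url)).text(),
    }))
  );
}

getDirContents().then((files) => console.log(files));

If a single round trip is a hard requirement, the GraphQL API can fetch several blobs in one query, but that is a different endpoint than the REST contents API.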

Related

Gitlab: Dependency scanner report is not shown on security dashboard

I am trying to create my own security scanner which will check dependencies. To test the functionality, I created a "mock scanner" which downloads a file from a webhook and saves it as an artifact to be uploaded to the server.
The artifact is uploaded successfully and in the CI output I can see the 201 code, but for some reason it is not presented in the security dashboard.
What am I doing wrong?
Thank you!
The CI job looks as follows:
mysec_dependency_scanning:
  stage: test
  script:
    - curl https://webhook.site/XXXX -o gl-dependency-scanning-report.json
    - sleep 3
  allow_failure: true
  artifacts:
    reports:
      dependency_scanning: gl-dependency-scanning-report.json
The content of the JSON file is taken from the example provided by GitLab and is as follows:
{
  "version": "2.0",
  "vulnerabilities": [
    {
      "id": "51e83874-0ff6-4677-a4c5-249060554eae",
      "category": "dependency_scanning",
      "name": "alik alik",
      "message": "Regular Expression Denial of Service in debug",
      "description": "alik to regular expression denial of service when untrusted user input is passed into the `o` formatter. It takes around 50k characters to block for 2 seconds making this a low severity issue.",
      "severity": "Unknown",
      "solution": "Upgrade to latest versions.",
      "scanner": {
        "id": "dadada",
        "name": "dadada"
      },
      "location": {
        "file": "yarn.lock",
        "dependency": {
          "package": {
            "name": "debug"
          },
          "version": "1.0.5"
        }
      },
      "identifiers": [
        {
          "type": "gemnasium",
          "name": "Gemnasium-37283ed4-0380-40d7-ada7-2d994afcc62a",
          "value": "37283ed4-0380-40d7-ada7-2d994afcc62a",
          "url": "https://deps.sec.gitlab.com/packages/npm/debug/versions/1.0.5/advisories"
        }
      ],
      "links": [
        {
          "url": "https://nodesecurity.io/advisories/534"
        },
        {
          "url": "https://github.com/visionmedia/debug/issues/501"
        },
        {
          "url": "https://github.com/visionmedia/debug/pull/504"
        }
      ]
    },
    {
      "id": "5d681b13-e8fa-4668-957e-8d88f932ddc7",
      "category": "dependency_scanning",
      "name": "Authentication bypass via incorrect DOM traversal and canonicalization",
      "message": "Authentication bypass via incorrect DOM traversal and canonicalization in saml2-js",
      "description": "Some XML DOM traversal and canonicalization APIs may be inconsistent in handling of comments within XML nodes. Incorrect use of these APIs by some SAML libraries results in incorrect parsing of the inner text of XML nodes such that any inner text after the comment is lost prior to cryptographically signing the SAML message. Text after the comment, therefore, has no impact on the signature on the SAML message.\r\n\r\nA remote attacker can modify SAML content for a SAML service provider without invalidating the cryptographic signature, which may allow attackers to bypass primary authentication for the affected SAML service provider.",
      "severity": "Unknown",
      "solution": "Upgrade to fixed version.\r\n",
      "scanner": {
        "id": "dadada",
        "name": "dadada"
      },
      "location": {
        "file": "yarn.lock",
        "dependency": {
          "package": {
            "name": "saml2-js"
          },
          "version": "1.5.0"
        }
      },
      "identifiers": [
        {
          "type": "gemnasium",
          "name": "Gemnasium-9952e574-7b5b-46fa-a270-aeb694198a98",
          "value": "9952e574-7b5b-46fa-a270-aeb694198a98",
          "url": "https://deps.sec.gitlab.com/packages/npm/saml2-js/versions/1.5.0/advisories"
        },
        {
          "type": "cve",
          "name": "CVE-2017-11429",
          "value": "CVE-2017-11429",
          "url": "https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-11429"
        }
      ],
      "links": [
        {
          "url": "https://github.com/Clever/saml2/commit/3546cb61fd541f219abda364c5b919633609ef3d#diff-af730f9f738de1c9ad87596df3f6de84R279"
        },
        {
          "url": "https://github.com/Clever/saml2/issues/127"
        },
        {
          "url": "https://www.kb.cert.org/vuls/id/475445"
        }
      ]
    }
  ],
  "remediations": [
    {
      "fixes": [
        {
          "id": "5d681b13-e8fa-4668-957e-8d88f932ddc7",
        }
      ],
      "summary": "Upgrade saml2-js",
      "diff": "ZGlmZiAtLWdpdCBhL...OR0d1ZUc2THh3UT09Cg==" // some content is omitted for brevity
    }
  ]
}
I was able to fix the problem: the issue was an invalid JSON format (note, for example, that the trailing comma after the "id" inside "fixes" above is not valid JSON). It took a lot of trial and error, but I was able to create a working template for a dependency scanning report, shown after the validation tip below.
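As an aside, a quick way to catch this class of error before uploading is to parse the report and fail the job if it is malformed. A minimal sketch in Node.js, assuming the report file name used in the CI job above:

// Sketch: fail fast if the report is not valid JSON.
const fs = require("fs");

try {
  JSON.parse(fs.readFileSync("gl-dependency-scanning-report.json", "utf8"));
  console.log("report is valid JSON");
} catch (err) {
  console.error("invalid report:", err.message);
  process.exit(1);
}

The working template: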
{
  "version": "3.0.0",
  "vulnerabilities": [
    {
      "id": "dfa1f7f3d56db6e1c3451a232de42f153e0335611de6f0344443d84e448ee2cf",
      "category": "dddda",
      "name": "dddda",
      "message": "ddda",
      "description": "dddda lack of validation in `index.js`.",
      "cve": "dada",
      "severity": "Critical",
      "solution": "Upgrade to version 2.0.5 or above.",
      "scanner": {
        "id": "lalal",
        "name": "Code_Analyzer"
      },
      "location": {
        "file": "yarn.lock",
        "dependency": {
          "iid": 447,
          "package": {
            "name": "copy-props"
          },
          "version": "2.0.4"
        }
      },
      "identifiers": [
        {
          "type": "dada",
          "name": "dada-e9e12690-2e4d-4251-bef0-7357ddc05881",
          "value": "e9e57890-5e4d-4832-bef2-7337ddc05889",
          "url": "https://gitlab.com/gitlab-org/security-products/gemnasium-db/-/blob/master/npm/copy-props/CVE-2219-28503.yml"
        },
        {
          "type": "cve",
          "name": "CVE-2237-28503",
          "value": "CVE-2237-28503",
          "url": "https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2237-28503"
        }
      ],
      "links": [
        {
          "url": "https://nvd.nist.gov/vuln/detail/CVE-2237-28503"
        }
      ]
    }
  ],
  "remediations": [],
  "dependency_files": [
    {
      "path": "yarn.lock",
      "package_manager": "yarn",
      "dependencies": [
        {
          "iid": 447,
          "dependency_path": [
            {
              "iid": 708
            },
            {
              "iid": 707
            }
          ],
          "package": {
            "name": "copy-props"
          },
          "version": "2.0.4"
        }
      ]
    }
  ],
  "scan": {
    "scanner": {
      "id": "lalal",
      "name": "Code_Analyzer",
      "url": "https://gitlab.com/gitlab-org/security-products/analyzers/gemnasium",
      "vendor": {
        "name": "lalal"
      },
      "version": "2.29.5"
    },
    "type": "dependency_scanning",
    "start_time": "2021-05-03T06:47:29",
    "end_time": "2021-05-03T06:47:30",
    "status": "success"
  }
}

How do I get a public gist filename from the GitHub gist API

I am trying to get the filename from the GitHub gist API; below is the output I get when calling the gist API. The problem is how to get the filename from this output, because the filename is under files, and I want to display it. I am new to APIs, so I am only able to display url, id, etc.
{
  "url": "https://api.github.com/gists/41da54b03986886e1a4e57c8cdb38dc2",
  "forks_url": "https://api.github.com/gists/41da54b03986886e1a4e57c8cdb38dc2/forks",
  "commits_url": "https://api.github.com/gists/41da54b03986886e1a4e57c8cdb38dc2/commits",
  "id": "41da54b03986886e1a4e57c8cdb38dc2",
  "node_id": "MDQ6R2lzdDQxZGE1NGIwMzk4Njg4NmUxYTRlNTdjOGNkYjM4ZGMy",
  "git_pull_url": "https://gist.github.com/41da54b03986886e1a4e57c8cdb38dc2.git",
  "git_push_url": "https://gist.github.com/41da54b03986886e1a4e57c8cdb38dc2.git",
  "html_url": "https://gist.github.com/41da54b03986886e1a4e57c8cdb38dc2",
  "files": {
    "telegram-white.svg": {
      "filename": "telegram-white.svg",
      "type": "image/svg+xml",
      "language": "SVG",
      "raw_url": "https://gist.githubusercontent.com/cmschandan/41da54b03986886e1a4e57c8cdb38dc2/raw/16b1aac41526997d87400f33ad9912d511faf7a4/telegram-white.svg",
      "size": 1266
    }
  },
  "public": true,
  "created_at": "2020-12-11T12:58:59Z",
  "updated_at": "2020-12-11T12:59:46Z",
  "description": "",
  "comments": 0,
  "user": null,
  "comments_url": "https://api.github.com/gists/41da54b03986886e1a4e57c8cdb38dc2/comments",
  "owner": {
    "login": "cmschandan",
    "id": 21227329,
    "node_id": "MDQ6VXNlcjIxMjI3MzI5",
    "avatar_url": "https://avatars.githubusercontent.com/u/21227329?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/cmschandan",
    "html_url": "https://github.com/cmschandan",
    "followers_url": "https://api.github.com/users/cmschandan/followers",
    "following_url": "https://api.github.com/users/cmschandan/following{/other_user}",
    "gists_url": "https://api.github.com/users/cmschandan/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/cmschandan/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/cmschandan/subscriptions",
    "organizations_url": "https://api.github.com/users/cmschandan/orgs",
    "repos_url": "https://api.github.com/users/cmschandan/repos",
    "events_url": "https://api.github.com/users/cmschandan/events{/privacy}",
    "received_events_url": "https://api.github.com/users/cmschandan/received_events",
    "type": "User",
    "site_admin": false
  },
  "truncated": false
}
If you are using JavaScript then you can do something like this:
// `getGitHubGist` is a function that I assume you already
// implemented which fetches your desired gist.
const gist = await getGitHubGist();
// Get an array of filename strings
const filenames = Object.keys(gist.files);
// Get an array of all the file details
const files = Object.values(gist.files);
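Since the snippet above assumes getGitHubGist is already implemented, here is one possible sketch of it against the public REST endpoint (the gist id is the one from the question):

// Sketch: fetch a single public gist by id from the GitHub REST API.
async function getGitHubGist() {
  const id = "41da54b03986886e1a4e57c8cdb38dc2"; // gist id from the question
  const res = await fetch(`https://api.github.com/gists/${id}`, {
    headers: { Accept: "application/vnd.github+json" },
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json();
}

// Usage: print every filename in the gist.
getGitHubGist().then((gist) =>
  Object.keys(gist.files).forEach((name) => console.log(name))
);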

How to get the running status of IoT Edge modules via the Azure REST API? (not the connectionState or status from the 'Get Modules on Device' API call)

Context:
Azure REST - Modules - Get Modules On Device
By using this API call, I can get information about the connectionState (connected/disconnected) and status (enabled/disabled) of modules. We can also check the runtime status of the modules deployed on the device by visiting the Azure IoT Hub web portal:
portal.azure.com -> IoT Hub -> IoT Edge section -> select the device you wish to find the details for
Question:
How can I get this RUNTIME STATUS via the Azure API?
If you check the module twin of $edgeAgent, it looks like the below:
{
  "deviceId": "edgeDevice",
  "moduleId": "$edgeAgent",
  "etag": "AAAAAAAAAEA=",
  "deviceEtag": "NDU1OTY3MjA=",
  "status": "enabled",
  "statusUpdateTime": "0001-01-01T00:00:00Z",
  "connectionState": "Disconnected",
  "lastActivityTime": "0001-01-01T00:00:00Z",
  "cloudToDeviceMessageCount": 0,
  "authenticationType": "sas",
  "x509Thumbprint": {
    "primaryThumbprint": null,
    "secondaryThumbprint": null
  },
  "version": 501,
  "properties": {
    "desired": {
      "schemaVersion": "1.0",
      "runtime": {
        "type": "docker",
        "settings": {
          "minDockerVersion": "v1.25",
          "loggingOptions": "",
          "registryCredentials": {}
        }
      },
      "systemModules": {
        "edgeAgent": {
          "type": "docker",
          "settings": {
            "image": "mcr.microsoft.com/azureiotedge-agent:1.0",
            "createOptions": "{}"
          }
        },
        "edgeHub": {
          "type": "docker",
          "status": "running",
          "restartPolicy": "always",
          "settings": {
            "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
            "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
          }
        }
      },
      "modules": {
        "CustomModuleName": {
          "version": "1.0",
          "type": "docker",
          "status": "running",
          "restartPolicy": "always",
          "settings": {
            "image": "iotregdev300.azurecr.io/customModuleName:0.0.2-amd64",
            "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"8080/tcp\":[{\"HostPort\":\"8080\"}]}}}"
          }
        }
      }
    },
    "reported": {
      "schemaVersion": "1.0",
      "version": {
        "version": "1.0.9.4",
        "build": "32971639",
        "commit": "12d55e582cc7ce95c8abfe11eddfbbc938ed6001"
      },
      "lastDesiredStatus": {
        "code": 200,
        "description": ""
      },
      "runtime": {
        "platform": {
          "os": "linux",
          "architecture": "x86_64",
          "version": "1.0.9.4"
        },
        "type": "docker",
        "settings": {
          "minDockerVersion": "v1.25",
          "loggingOptions": "",
          "registryCredentials": {}
        }
      },
      "systemModules": {
        "edgeAgent": {
          "type": "docker",
          "exitCode": 0,
          "statusDescription": "running",
          "lastStartTimeUtc": "2020-09-09T07:34:34.4585643Z",
          "lastExitTimeUtc": "2020-09-09T07:34:26.9869915Z",
          "runtimeStatus": "running",
          "imagePullPolicy": "on-create",
          "settings": {
            "image": "mcr.microsoft.com/azureiotedge-agent:1.0",
            "imageHash": "sha256:1a2fffc3c74a2b2510a3149bb2295b68a553e4c9aca90698879902f36fd6d163",
            "createOptions": "{}"
          }
        },
        "edgeHub": {
          "type": "docker",
          "status": "running",
          "restartPolicy": "always",
          "imagePullPolicy": "on-create",
          "env": {},
          "exitCode": 0,
          "statusDescription": "running",
          "lastStartTimeUtc": "2020-09-09T07:34:50.8012461Z",
          "lastExitTimeUtc": "2020-09-09T07:34:26.9845717Z",
          "restartCount": 0,
          "lastRestartTimeUtc": "2020-09-09T07:34:26.9845717Z",
          "runtimeStatus": "running",
          "settings": {
            "image": "mcr.microsoft.com/azureiotedge-hub:1.0",
            "imageHash": "sha256:f531eb6c23f347c37ea8c90204e9cb12024aec77d8b2e68e93b14c38ec066520",
            "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
          }
        }
      },
      "lastDesiredVersion": 64,
      "modules": {
        "CustomModuleName": {
          "exitCode": 0,
          "statusDescription": "running",
          "lastStartTimeUtc": "2020-09-09T07:34:49.3923079Z",
          "lastExitTimeUtc": "2020-09-09T07:34:26.9606688Z",
          "restartCount": 0,
          "lastRestartTimeUtc": "2020-09-09T07:34:26.9606688Z",
          "runtimeStatus": "running",
          "version": "1.0",
          "status": "running",
          "restartPolicy": "always",
          "imagePullPolicy": "on-create",
          "type": "docker",
          "settings": {
            "image": "iotregdev300.azurecr.io/custommodulename:0.0.2-amd64",
            "imageHash": "sha256:e728d4b8804d2114beab7c1903f706d8152e404be3f5601ee5e7371e8ac32ecf",
            "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"8080/tcp\":[{\"HostPort\":\"8080\"}]}}}"
          },
          "env": {}
        }
      }
    }
  }
}
In the above JSON, CustomModuleName is the custom module, and it has the field runtimeStatus: "running". The same field exists in the edgeHub and edgeAgent modules too. So you just need to fetch the $edgeAgent module twin through the REST API or the Azure Device/Service SDK.
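A minimal sketch of that REST call in Node.js (the hub name, device id, SAS token, and api-version value below are assumptions; check the version your hub supports):

// Sketch: read runtimeStatus values out of the $edgeAgent module twin
// via the IoT Hub service REST API. Hub name, device id and SAS token
// are placeholders.
const hub = "myhub.azure-devices.net";
const deviceId = "edgeDevice";
const sasToken = "SharedAccessSignature sr=..."; // service-policy SAS token

async function getRuntimeStatuses() {
  const url = `https://${hub}/twins/${deviceId}/modules/$edgeAgent?api-version=2021-04-12`;
  const res = await fetch(url, { headers: { Authorization: sasToken } });
  const twin = await res.json();
  const reported = twin.properties.reported;
  // Reported custom modules (edgeHub/edgeAgent live under systemModules)
  // each carry a runtimeStatus field.
  for (const [name, mod] of Object.entries(reported.modules ?? {})) {
    console.log(name, mod.runtimeStatus);
  }
}

getRuntimeStatuses();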

Using ADF REST connector to read and transform FHIR data

I am trying to use Azure Data Factory to read data from a FHIR server and transform the results into newline delimited JSON (ndjson) files in Azure Blob storage. Specifically, if you query a FHIR server, you might get something like:
{
  "resourceType": "Bundle",
  "id": "som-id",
  "type": "searchset",
  "link": [
    {
      "relation": "next",
      "url": "https://fhirserver/?ct=token"
    },
    {
      "relation": "self",
      "url": "https://fhirserver/"
    }
  ],
  "entry": [
    {
      "fullUrl": "https://fhirserver/Organization/1234",
      "resource": {
        "resourceType": "Organization",
        "id": "1234",
        // More fields
      }
    },
    {
      "fullUrl": "https://fhirserver/Organization/456",
      "resource": {
        "resourceType": "Organization",
        "id": "456",
        // More fields
      }
    }
    // More resources
  ]
}
Basically, a bundle of resources. I would like to transform that into a newline delimited (aka ndjson) file where each line is just the JSON for a resource:
{"resourceType": "Organization", "id": "1234", // More fields }
{"resourceType": "Organization", "id": "456", // More fields }
// More lines with resources
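(As an aside, the transformation itself is trivial once you have the bundle in hand; a minimal sketch in Node.js, assuming the parsed bundle above is in a variable named bundle:)

// Sketch: flatten a FHIR searchset bundle into ndjson,
// one serialized resource per line.
function bundleToNdjson(bundle) {
  return (bundle.entry ?? [])
    .map((entry) => JSON.stringify(entry.resource))
    .join("\n");
}
// Usage: bundleToNdjson(bundle) -> string ready to write to a .ndjson file.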
I am able to get the REST connector set up and it can query the FHIR server (including pagination), but no matter what I try I cannot seem to generate the output I want. I set up an Azure Blob storage dataset:
{
  "name": "AzureBlob1",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage1",
      "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "typeProperties": {
      "format": {
        "type": "JsonFormat",
        "filePattern": "setOfObjects"
      },
      "fileName": "myout.json",
      "folderPath": "outfhirfromadf"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
And configure a copy activity:
{
  "name": "pipeline1",
  "properties": {
    "activities": [
      {
        "name": "Copy Data1",
        "type": "Copy",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "typeProperties": {
          "source": {
            "type": "RestSource",
            "httpRequestTimeout": "00:01:40",
            "requestInterval": "00.00:00:00.010"
          },
          "sink": {
            "type": "BlobSink"
          },
          "enableStaging": false,
          "translator": {
            "type": "TabularTranslator",
            "schemaMapping": {
              "resource": "resource"
            },
            "collectionReference": "$.entry"
          }
        },
        "inputs": [
          {
            "referenceName": "FHIRSource",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "AzureBlob1",
            "type": "DatasetReference"
          }
        ]
      }
    ]
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
But at the end (in spite of configuring the schema mapping), the end result in the blob is always just the original bundle returned from the server. If I configure the output blob as comma-delimited text, I can extract fields and create a flattened tabular view, but that is not really what I want.
Any suggestions would be much appreciated.
So I sort of found a solution. If I keep the original step, where the bundles are simply dumped into the JSON file, and then do another conversion from the JSON file to what is declared to be a text file in another blob, I can get the ndjson file created.
Basically, define another blob dataset:
{
  "name": "AzureBlob2",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureBlobStorage1",
      "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "structure": [
      {
        "name": "Prop_0",
        "type": "String"
      }
    ],
    "typeProperties": {
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "rowDelimiter": "",
        "quoteChar": "",
        "nullValue": "\\N",
        "encodingName": null,
        "treatEmptyAsNull": true,
        "skipLineCount": 0,
        "firstRowAsHeader": false
      },
      "fileName": "myout.json",
      "folderPath": "adfjsonout2"
    }
  },
  "type": "Microsoft.DataFactory/factories/datasets"
}
Note that this one uses TextFormat and also that the quoteChar is blank. If I then add another Copy Activity:
{
  "name": "pipeline1",
  "properties": {
    "activities": [
      {
        "name": "Copy Data1",
        "type": "Copy",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "typeProperties": {
          "source": {
            "type": "RestSource",
            "httpRequestTimeout": "00:01:40",
            "requestInterval": "00.00:00:00.010"
          },
          "sink": {
            "type": "BlobSink"
          },
          "enableStaging": false,
          "translator": {
            "type": "TabularTranslator",
            "schemaMapping": {
              "['resource']": "resource"
            },
            "collectionReference": "$.entry"
          }
        },
        "inputs": [
          {
            "referenceName": "FHIRSource",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "AzureBlob1",
            "type": "DatasetReference"
          }
        ]
      },
      {
        "name": "Copy Data2",
        "type": "Copy",
        "dependsOn": [
          {
            "activity": "Copy Data1",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false,
          "secureInput": false
        },
        "typeProperties": {
          "source": {
            "type": "BlobSource",
            "recursive": true
          },
          "sink": {
            "type": "BlobSink"
          },
          "enableStaging": false,
          "translator": {
            "type": "TabularTranslator",
            "columnMappings": {
              "resource": "Prop_0"
            }
          }
        },
        "inputs": [
          {
            "referenceName": "AzureBlob1",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "AzureBlob2",
            "type": "DatasetReference"
          }
        ]
      }
    ]
  },
  "type": "Microsoft.DataFactory/factories/pipelines"
}
Then it all works out. It is not ideal in that I now have two copies of the data in blobs, but one can easily be deleted, I suppose.
I would still love to hear about it if somebody has a one-step solution.
As briefly discussed in the comment, the Copy Activity does not provide much functionality aside from mapping data. As stated in the documentation, the Copy Activity does the following operations:
Reads data from a source data store.
Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configuration of the input dataset, output dataset, and Copy Activity.
Writes data to the sink/destination data store.
It does not look like the Copy Activity does anything else aside from efficiently copying stuff around.
What I found to work was to use Databricks.
Here are the steps:
Add a Databricks account to your subscription;
Go to the Databricks page by clicking the authoring button;
Create a notebook;
Write the script (Scala, Python, or the recently announced .NET).
The script would do the following:
Read the data from the Blob storage;
Filter out & transform the data as needed;
Write the data back to a Blob storage;
You can test your script from there and, once ready, you can go back to your pipeline and create a Notebook activity that will point to your notebook containing the script.
I struggled coding in Scala but it was worth it :)
For anyone finding this post in the future: you can just use the $export API call to accomplish this. Note that you have to have a storage account linked to your FHIR server.
https://build.fhir.org/ig/HL7/bulk-data/export.html#endpoint---system-level-export
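A sketch of kicking off a system-level export per the bulk data spec linked above (the base URL is a placeholder, auth is omitted, and the export lands in the storage account configured on the server):

// Sketch: kick off a FHIR bulk-data system-level export ($export).
// The kickoff is asynchronous: the server replies 202 Accepted and
// returns a status URL in the Content-Location header to poll.
const base = "https://fhirserver"; // placeholder FHIR base URL

async function startExport() {
  const res = await fetch(`${base}/$export`, {
    headers: {
      Accept: "application/fhir+json",
      Prefer: "respond-async",
      // Real servers will also require an Authorization header.
    },
  });
  if (res.status !== 202) throw new Error(`kickoff failed: ${res.status}`);
  return res.headers.get("Content-Location"); // poll this for progress
}

startExport().then((statusUrl) => console.log("poll:", statusUrl));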

Azure Data Factory truncating files in a copy pipeline from an S3 bucket to ADLS

I have a copy pipeline set up to copy a handful of files from a daily folder in an S3 bucket into a data lake in Azure using Data Factory. I ran into a really weird issue.
Suppose there are three files in the S3 bucket. One is 30 MB, another is 50 MB, and the last is 70 MB. If I put the 30 MB file 'first' (name it test0.tsv), then it claims it successfully copies all three files to ADLS, but the second and third files are truncated to 30 MB. The data in each file is correct, just truncated. If I put the 70 MB file first, they are all copied properly. So it is using the first file's length as the maximum file size and truncating all subsequent longer files. This is also very worrisome to me, since Azure Data Factory claims it successfully copied them.
Here is my pipeline:
{
  "name": "[redacted]Pipeline",
  "properties": {
    "description": "[redacted]",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "FileSystemSource",
            "recursive": true
          },
          "sink": {
            "type": "AzureDataLakeStoreSink",
            "copyBehavior": "PreserveHierarchy",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [
          {
            "name": "InputDataset"
          }
        ],
        "outputs": [
          {
            "name": "OutputDataset"
          }
        ],
        "policy": {
          "retry": 3
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1
        },
        "name": "[redacted]"
      }
    ],
    "start": "2018-07-06T04:00:00Z",
    "end": "2018-07-30T04:00:00Z",
    "isPaused": false,
    "hubName": "[redacted]",
    "pipelineMode": "Scheduled"
  }
}
Here is my input dataset:
{
  "name": "InputDataset",
  "properties": {
    "published": false,
    "type": "AmazonS3",
    "linkedServiceName": "[redacted]",
    "typeProperties": {
      "bucketName": "[redacted]",
      "key": "$$Text.Format('{0:yyyy}/{0:MM}/{0:dd}/', SliceStart)"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}
Here is my output dataset:
{
  "name": "OutputDataset",
  "properties": {
    "published": false,
    "type": "AzureDataLakeStore",
    "linkedServiceName": "[redacted]",
    "typeProperties": {
      "folderPath": "[redacted]/{Year}/{Month}/{Day}",
      "partitionedBy": [
        {
          "name": "Year",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "yyyy"
          }
        },
        {
          "name": "Month",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "MM"
          }
        },
        {
          "name": "Day",
          "value": {
            "type": "DateTime",
            "date": "SliceStart",
            "format": "dd"
          }
        }
      ]
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
I have removed the format fields in both the input and output datasets because I thought maybe having it be a binary copy would fix it, but that didn't work.
