S3 doesObjectExist API is not recognizing path with equal symbol - apache-spark

I have an Apache Spark application that writes to an S3 folder. Since the application partitions the data while writing to S3, it adds an equals sign to the path, like below:
s3://biops/XXX/YYY/entryDateYear=2018/entryDateMonth=07
I understand that S3 does not allow a bucket name to contain "=", but Spark Streaming creates each partition as field_name followed by "=" and then the value.
Please advise on how to access an S3 folder whose path contains an equals sign.
// Actual path in S3 --> biops/XXX/YYY/royalty_raw_json/entryDateYear=2018/
String bucket = "biops";
String without_equal = "XXX/YYY/royalty_raw_json/";
String with_equal = "XXX/YYY/royalty_raw_json/entryDateYear=2018";
String with_equal_encoding = "XXX/YYY/royalty_raw_json/entryDateYear%3D2018";

AmazonS3 amazonS3 = AmazonS3ClientBuilder.standard()
        .withCredentials(getCredentialsProvider(credentials))
        .withRegion("us-east-1")
        .build();

amazonS3.doesObjectExist(bucket, without_equal);       // works
amazonS3.doesObjectExist(bucket, with_equal);          // does not work
amazonS3.doesObjectExist(bucket, with_equal_encoding); // does not work either
UPDATE
I managed a workaround that lists the objects as below to check whether a folder is present or not:
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(bucket)
        .withPrefix(with_equal);
ObjectListing bucketListing = amazonS3.listObjects(listObjectsRequest);

if (bucketListing != null && bucketListing.getObjectSummaries() != null
        && bucketListing.getObjectSummaries().size() > 0)
    System.out.println("Folder present with files");
else
    System.out.println("Folder present with zero files or folder not present");

Related

python script to delete all versions including delete marker in S3 for files with certain prefix [duplicate]

I'm using a good example to delete all file versions, delete markers and objects inside an Amazon S3 bucket and folder, from https://stackoverflow.com/a/53872419/387774:
import boto3

bucket = "my-bucket-name"
filename = "my-folder/filename.xml"

client = boto3.client("s3")
paginator = client.get_paginator("list_object_versions")
response_iterator = paginator.paginate(Bucket=bucket)

for response in response_iterator:
    versions = response.get("Versions", [])
    versions.extend(response.get("DeleteMarkers", []))
    for version_id in [
        x["VersionId"]
        for x in versions
        if x["Key"] == filename and x["VersionId"] != "null"
    ]:
        print("Deleting {} version {}".format(filename, version_id))
        client.delete_object(Bucket=bucket, Key=filename, VersionId=version_id)
The problem I'm facing is that there are thousands of files, each with many versions, and I do not know all their names. I would like to delete all of them using a wildcard such as filename = "my-folder/*.xml".
What changes can I make?
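A minimal sketch of one way to do this, assuming the same boto3 client as above: pass Prefix= to the paginator so the listing is restricted to my-folder/ on the server side, then emulate the *.xml wildcard client-side and delete each matching version under its own key:

import boto3

bucket = "my-bucket-name"
prefix = "my-folder/"  # list only keys under this "folder"

client = boto3.client("s3")
paginator = client.get_paginator("list_object_versions")

for response in paginator.paginate(Bucket=bucket, Prefix=prefix):
    versions = response.get("Versions", []) + response.get("DeleteMarkers", [])
    for v in versions:
        # emulate the *.xml wildcard client-side
        if not v["Key"].endswith(".xml") or v["VersionId"] == "null":
            continue
        print("Deleting {} version {}".format(v["Key"], v["VersionId"]))
        client.delete_object(Bucket=bucket, Key=v["Key"], VersionId=v["VersionId"])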

Copy files from gen2 ADLS based on acknowledgement file

I am trying to copy data from a Gen2 ADLS account into another ADLS account using a Data Factory pipeline.
This pipeline runs daily and copies data only for that particular day. This is done by providing a start and end time in the copy activity.
Some days the files in the source ADLS are delayed, so the pipeline runs but no data is copied.
To track this, we plan to place an acknowledgement file in the source ADLS after the data is written, so that before copying we can check for the ack file and proceed with the data copy only if it is present.
The check should happen every 10 minutes: if the ack file is not present, the check should run again after 10 minutes, and this should continue for 2 hours.
Within those 2 hours, if the file appears, the data copy should proceed and the check task should stop.
If there is still no data after 2 hours, the job should fail.
I was trying the Validation activity in ADF, but one issue is the folder name, since my folder is named with the date and timestamp of its creation (e.g. 2021-03-30-02-19-33).
I have to exclude the timestamp part when providing the folder name.
How is that possible? Is a wildcard path accepted by the Validation activity?
Any leads on how to implement this?
Is there any way to implement a repeated check every 10 minutes for 2 hours with the Get Metadata activity? Can the above scenario be implemented with Get Metadata?
If we do have a requirement for a wildcard path, we have to write a Scala/Python/... script in a notebook and execute it from ADF.
I have used the below Scala script, which takes input parameters from ADF.
import java.io.File
import java.util.Calendar

// Input parameters passed in from ADF
dbutils.widgets.text("mainFolderPath", "", "")
dbutils.widgets.text("finalFolderStartName", "", "")
dbutils.widgets.text("fileName", "", "")
dbutils.widgets.text("noOfTry", "1", "")

val mainFolderPath = dbutils.widgets.get("mainFolderPath")
val finalFolderStartName = dbutils.widgets.get("finalFolderStartName")
val fileName = dbutils.widgets.get("fileName")
val noOfTry = dbutils.widgets.get("noOfTry").toInt

println("Main folder path : " + mainFolderPath)
println("Final folder start name : " + finalFolderStartName)
println("File name to be checked : " + fileName)
println("Number of tries, with a gap of 1 minute between tries : " + noOfTry)

if (mainFolderPath == "" || finalFolderStartName == "" || fileName == "") {
  dbutils.notebook.exit("Please pass input parameters and rerun!")
}

def getListOfSubDirectories(directoryName: String): Array[String] = {
  new File(directoryName)
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

var counter = 0
var folderFound = false
var fileFound = false

try {
  while (counter < noOfTry && !fileFound) {
    val folders = getListOfSubDirectories(mainFolderPath)
    if (folders.exists(firstName => firstName.startsWith(finalFolderStartName))) {
      folders.foreach(fol => {
        if (fol.startsWith(finalFolderStartName)) {
          val finalPath = mainFolderPath + "/" + fol + "/" + fileName
          println("Final file path : " + finalPath)
          folderFound = true
          if (new File(finalPath).exists) {
            fileFound = true
          } else {
            println("Found the final folder but no file found!")
            println("Waiting before the next try: " + Calendar.getInstance().getTime())
            counter = counter + 1
            Thread.sleep(1 * 60 * 1000) // 1 minute between tries; adjust for a 10-minute gap
          }
        }
      })
    } else {
      println("Folder does not exist with name : " + mainFolderPath + "/" + finalFolderStartName + "*")
      println("Waiting before the next try: " + Calendar.getInstance().getTime())
      counter = counter + 1
      Thread.sleep(1 * 60 * 1000) // 1 minute between tries; adjust for a 10-minute gap
    }
  }
} catch {
  case e: Throwable => throw e
}

if (folderFound && fileFound) {
  println("File Exists!")
} else {
  throw new Exception("File does not exist!")
}
As far as I know, the ADF Validation and Get Metadata activities do not support wildcards in the folder/file path.

How do you process many files from blob storage with long paths in Databricks?

I've enabled logging for an API Management service and the logs are being stored in a storage account. Now I'm trying to process them in an Azure Databricks workspace but I'm struggling with accessing the files.
The issue seems to be that the automatically generated virtual folder structure looks like this:
/insights-logs-gatewaylogs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json
I've mounted the insights-logs-gatewaylogs container under /mnt/diags, and dbutils.fs.ls('/mnt/diags') correctly lists the resourceId= folder, but dbutils.fs.ls('/mnt/diags/resourceId=') claims the file is not found.
If I create empty marker blobs along the virtual folder structure I can list each subsequent level, but that strategy obviously falls down since the last part of the path is dynamically organized by year/month/day/hour.
For example, a
spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json")
yields this error:
java.io.FileNotFoundException: File/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019 does not exist.
So clearly the wildcard has found the first year folder but is refusing to go further down.
I set up a copy job in Azure Data Factory that successfully copies all the JSON blobs within the same blob storage account and removes the resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service> prefix (so the root folder starts with the year component), and the copied data can be accessed successfully all the way down without having to create empty marker blobs.
So the problem seems to be related to the long virtual folder structure, which is mostly empty.
Is there another way to process this kind of folder structure in Databricks?
Update: I've also tried providing the path as part of the source when mounting, but that doesn't help either.
I think I may have found the root cause of this. I should have tried this earlier: I provided the exact path to an existing blob, like this:
spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019/m=08/d=20/h=06/m=00/PT1H.json")
And I got a more meaningful error back:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
It turns out the out-of-the-box logging creates append blobs (and there doesn't seem to be a way to change this), and support for append blobs is still a work in progress by the looks of this ticket: https://issues.apache.org/jira/browse/HADOOP-13475
The FileNotFoundException could be a red herring, possibly caused by the inner exception being swallowed while trying to expand the wildcards and hitting an unsupported blob type.
Update
I finally found a reasonable workaround. I installed the azure-storage Python package in my workspace (if you're at home with Scala, it's already installed) and did the blob loading myself. Most of the code below is there to add globbing support; you don't need it if you're happy to just match on a path prefix:
%python
import re
import json
from azure.storage.blob import AppendBlobService

abs = AppendBlobService(account_name='<account>', account_key="<access_key>")
base_path = 'resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>'
pattern = base_path + '/*/*/*/*/m=00/*.json'
filter = glob2re(pattern)

spark.sparkContext \
    .parallelize([blob.name for blob in abs.list_blobs('insights-logs-gatewaylogs', prefix=base_path) if re.match(filter, blob.name)]) \
    .map(lambda blob_name: abs.get_blob_to_bytes('insights-logs-gatewaylogs', blob_name).content.decode('utf-8').splitlines()) \
    .flatMap(lambda lines: [json.loads(l) for l in lines]) \
    .collect()
glob2re is courtesy of https://stackoverflow.com/a/29820981/220986:
def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.
    There is no way to quote meta-characters.
    """
    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i + 1
        if c == '*':
            # res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            # res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j + 1
            if j < n and pat[j] == ']':
                j = j + 1
            while j < n and pat[j] != ']':
                j = j + 1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\', '\\\\')
                i = j + 1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    return res + '\Z(?ms)'
It's not pretty, but it avoids copying data around and can be wrapped up in a little utility class.
Try reading directly from blob storage, not through the mount.
You need to set up either an access key or a SAS for this, but I assume you know that.
SAS
spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
  "<complete-query-string-of-sas-for-the-container>")
or Access key
spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>")
then
val df = spark.read.json("wasbs://<container>@<account-name>.blob.core.windows.net/<path>")
For now this operation is not supported. It is frustrating that Microsoft provides technologies that do not work with each other (for example, BYOML -> Log Analytics with an export rule to a storage account, and then reading the data from Databricks).
There is a workaround for that: you can create your own custom class and read the data with it. Please take a look at the example of reading am-securityevent data in the BYOML Git repo:
https://github.com/Azure/Azure-Sentinel-BYOML

How to get an artifact file's URI using Artifactory's checksum API when multiple artifacts have the same SHA-1 / SHA-256 values (i.e. the same file content)

Artifactory Version: 5.8.4
In Artifactory, files are stored internally by the file's checksum (SHA-1), and for retrieval purposes SHA-256 is useful (for verifying whether the file is intact).
Read this first: https://www.jfrog.com/confluence/display/RTF/Checksum-Based+Storage
Let's say there are 2 Jenkins jobs, each of which creates a few artifacts/files (rpm/jar/etc.). In my case, I'll take a simple .txt file which stores the date in MM/DD/YYYY format, plus some other jobA/B-specific build result files (jars/rpms, etc.).
If we focus only on the text file (as I mentioned above), then:
Jenkins_jobA > generates jobA.date_mm_dd_yy.txt
Jenkins_jobB > generates jobB.date_mm_dd_yy.txt
Jenkins jobA and jobB run multiple times per day in no particular order. Sometimes jobA runs first and sometimes jobB.
As the content of the file is mostly the same for both jobs (per day), the SHA-1 value of jobA's .txt file and jobB's .txt file will be the same, i.e. in Artifactory both files will be stored under the same first-two-character directory structure (as per the checksum-based storage mechanism).
Basically, running sha1sum and sha256sum on both files in Linux would return the exact same output.
Over time, these artifacts (.txt, etc.) get promoted from one repository to another (the promotion process, i.e. snapshot -> stage -> release repo). My current logic, written in Groovy, to find the URI of an artifact sitting behind a "VIRTUAL" repository (which contains a set of physical local repositories in some order) is listed below:
// Groovy code
import groovy.json.JsonSlurper
import groovy.json.JsonOutput

jsonSlurper = new JsonSlurper()

// The following function takes artifact.SHA_256 as its parameter to find the URI of the artifact
def checkSumBasedSearch(artifactSha) {
    virt_repo = "jar-repo" // this virtual repo may have many physical repos (release/stage/snapshot) for jar (Maven), or it can be a YUM repo (rpm) or a generic repo (.txt file)
    // Note: virtual repos don't span different repo types (i.e. a virtual repository in Artifactory for "Maven" artifacts (jar/war/etc.) can NOT see YUM/PyPI/Generic physical repos).
    // Running the curl command from Groovy requires a separate "..." element for every distinct word/argument on the command line.
    checkSum_URL = artifactoryURL + "/api/search/checksum?sha256="
    aqlCmd = ["curl", "-u", username + ":" + password, "${checkSum_URL}" + artifactSha + "&repos=" + virt_repo]

    def procedure = aqlCmd.execute()
    def standardOut = new StringBuilder(), standardErr = new StringBuilder()
    procedure.waitForProcessOutput(standardOut, standardErr)

    // Fail early if the curl call wrote anything to stderr
    if (standardErr) {
        println "\n\n-- checkSumBasedSearch() - standardErr exists ---\n" + standardErr + "\n\n-- Exiting with error 12!!!\n\n"
        System.exit(12)
    }

    def obj = jsonSlurper.parseText(standardOut.toString())
    def results = obj.results
    def uri = results[0].uri // This would work if a file's SHA-1/256 were always different, or if the file at least sat in a different repo.
    return uri

    // To get the download URL, I can use:
    // aqlCmd = ["curl", "-u", username + ":" + password, "${uri}"]
    // def procedure = aqlCmd.execute()
    // def obj = jsonSlurper.parseText(standardOut.toString())
    // def url = obj.downloadUri
    // return url
    // aqlCmd = ["curl", "-u", username + ":" + password, "${url}", "-o", somedirectory + "/" + variableContainingSomeArtifactFilenameThatIWant]
    //
    // def procedure = aqlCmd.execute()
    // def standardOut = new StringBuilder(), standardErr = new StringBuilder()
    // procedure.waitForProcessOutput(standardOut, standardErr)
    // Now, I'll get the artifact downloaded in some directory as some filename.
}
My concern is: as both files (even though they have different names, e.g. file-<versioned-timestamp>.txt) have the same content and are generated multiple times per day, how can I download a specific versioned file for jobA or jobB?
In Artifactory, the SHA-256 property for all files containing the same content will be the same!! (Artifactory uses SHA-1 to store these files efficiently and save space; new uploads are just minimal database-level transactions, transparent to the user.)
Questions:
Will the above logic return jobA's .txt file, jobB's .txt file, or whichever job's .txt file was uploaded first or last (according to LastModified, i.e. last upload time)?
How can I get jobA's .txt file and jobB's .txt file downloaded for a given timestamp?
Do I need to add more properties to my REST API call?
If I were only concerned with the file content, it wouldn't matter much (being SHA-1/256 dependent) whether it came from jobA's .txt file or jobB's .txt file, but in a more complex case the file name may contain meaningful info that you'd want, in order to know which file was downloaded (A / B)!
You can use AQL (Artifactory Query Language):
https://www.jfrog.com/confluence/display/RTF/Artifactory+Query+Language
curl -u<username>:<password> -XPOST https://repo.jfrog.io/artifactory/api/search/aql -H "Content-Type: text/plain" -T ./search
The content of the file named search is:
items.find(
    {
        "artifact.module.build.name": {"$eq": "<build name>"},
        "artifact.sha1": "<sha1>"
    }
)
The above logic (in the original question) will return one of them arbitrarily, since you are taking the first result returned and there is no guarantee on the order.
Since your text file contains the timestamp in its name, you can add the name to the AQL given above so that it also filters by name, as in the sketch below.
The AQL search API is more flexible than the checksum search; use it and customise your query according to the parameters you need.
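For example, a hypothetical variant of the search file above that also matches on the item name (the filename pattern here is only an illustration; adjust it to your naming convention):

items.find(
    {
        "artifact.module.build.name": {"$eq": "<build name>"},
        "artifact.sha1": "<sha1>",
        "name": {"$match": "jobA.date_*.txt"}
    }
)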
So, I ended up doing this instead of just returning the [0]th element of the array in every case.
// Do NOT just return the [0]th element, since Artifactory stores by SHA-1/256; return the [N]th uri whose artifact full name matches the SHA-256 we searched for
// def uri = results[0].uri
def nThIndex = 0
def foundFlag = 'false'
for (r in results) {
    println "> " + r.uri + " < " + r.uri.toString() + " artifact: " + artFullName
    if (r.uri.toString().contains(artFullName)) {
        foundFlag = 'true'
        println "- OK - Found artifact: " + artFullName + " at results[" + nThIndex + "] index."
        break // i.e. a match for the artifact name with the SHA-256 we want has been found
    } else {
        nThIndex++
    }
}
if (foundFlag == 'true') {
    def uri = results[nThIndex].uri
    return uri
} else {
    // Fail early if results were found for the SHA-256 but none of them is the artifact we want (same SHA-256, different filename)
    if (!standardErr) {
        println "\n\n\n\n-- checkSum_Search() - SHA-256 unwanted situation occurred !!! -- the results array was set with some values BUT it didn't contain the artifact (" + artFullName + ") that we were looking for\n\n\n-- !!! Artifact NOT FOUND in the results array during checkSum_Search() ---\n\n\n-- Exiting with error 17!!!\n\n\n\n"
        System.exit(17)
    }
}

Spark: Traverse HDFS subfolders and find all files with name "X"

I have an HDFS path and I want to traverse all the subfolders and find all the files within it that have the name "X".
I have tried this:
FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path("hdfs://..."))
  .foreach(x => println(x.getPath))
But this only searches files one level deep, and I want all levels.
You need to get all the files recursively: loop through the path and get all the entries; if an entry is a directory, call the same function on it again.
Below is simple code that you can modify for your configuration and test.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

var fileSystem: FileSystem = _
var configuration: Configuration = _

def init() {
  configuration = new Configuration
  fileSystem = FileSystem.get(configuration)
  val fileStatus: Array[FileStatus] = fileSystem.listStatus(new Path(""))  // start from your root path
  getAllFiles(fileStatus)
}

// Recursively expand directories and return the flattened list of files
def getAllFiles(fileStatus: Array[FileStatus]): Seq[FileStatus] = {
  fileStatus.flatMap(fs => {
    if (fs.isDirectory)
      getAllFiles(fileSystem.listStatus(fs.getPath))
    else
      Seq(fs)
  })
}
Then filter the resulting file list for names that contain "X".
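Alternatively, a minimal sketch using Hadoop's built-in recursive listing, FileSystem.listFiles with recursive = true, filtering by file name (the HDFS URI is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// recursive = true walks every subfolder and yields only files (not directories)
val files = fs.listFiles(new Path("hdfs://..."), true)
while (files.hasNext) {
  val status = files.next()
  if (status.getPath.getName == "X")   // or .contains("X") for partial matches
    println(status.getPath)
}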
