I would like to programmatically create the url to download a file.
To do this I need the workspaceUrl and clusterOwnerUserId.
How can I retrieve those in a Databricks notebook?
# how to get the `workspaceUrl` and `clusterOwnerUserId`?
tmp_file = '/tmp/output_abcd.xlsx'
filestore_file = '/FileStore/output_abcd.xlsx'
# code to create file omitted for brevity ...
dbutils.fs.cp(f'file:{tmp_file}', filestore_file)
downloadUrl = f'https://{workspaceUrl}/files/output_abcd.xlsx?o={clusterOwnerUserId}'
displayHTML(f"<a href='{downloadUrl}'>download</a>")
The variables are available in the spark conf.
E.g.
clusterOwnerUserId = spark.conf.get('spark.databricks.clusterUsageTags.orgId')
workspaceUrl = spark.conf.get('spark.databricks.workspaceUrl')
You can then use the details as follows:
tmp_file = '/tmp/output_abcd.xlsx'
filestore_file = '/FileStore/output_abcd.xlsx'
# code to create file omitted for brevity ...
dbutils.fs.cp(f'file:{tmp_file}', filestore_file)
downloadUrl = f'https://{workspaceUrl}/files/output_abcd.xlsx?o={clusterOwnerUserId}'
displayHTML(f"<a href='{downloadUrl}'>download</a>")
Files in the Databricks FileStore at
/FileStore/my-stuff/my-file.txt are accessible at:
"https://databricks-instance-name.cloud.databricks.com/files/my-stuff/my-file.txt"
I don't think you need the o=... part. By the way, that is the workspace ID, not the cluster owner's user ID.
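If you drop the o= parameter, the link reduces to the workspace URL plus the /files/ path. A minimal sketch, assuming the same workspaceUrl read from the Spark conf above and that your deployment serves FileStore contents at /files/ without the query string:

# Hedged sketch: build the download link without the o= parameter.
# Assumes workspaceUrl comes from the Spark conf as shown above.
workspaceUrl = spark.conf.get('spark.databricks.workspaceUrl')
downloadUrl = f'https://{workspaceUrl}/files/output_abcd.xlsx'
displayHTML(f"<a href='{downloadUrl}'>download</a>")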
I have files in Azure File Storage, and I listed them by upload date because I want to select the most recently uploaded file.
To do this, I created a function that should return the list of files. However, when I look at the output, it returns only one file and the others are missing.
Here is my code:
from azure.storage.file import FileService

file_service = FileService(account_name='', account_key='')
generator = list(file_service.list_directories_and_files(''))

def list_files_in(generator, file_service):
    list_files = []
    for file_or_dir in generator:
        file_in = file_service.get_file_properties(share_name='', directory_name="", file_name=file_or_dir.name)
        file_date = file_in.properties.last_modified.date()
        list_tuple = (file_date, file_or_dir.name)
        list_files.append(list_tuple)
        return list_files
To get the latest file you need to write that logic yourself; the code below shows how to read the last-modified timestamps:
For Files
from azure.storage.file import File

result = file_service.list_directories_and_files(share_name, directory_name)
for file_or_dir in result:
    if isinstance(file_or_dir, File):
        file = file_service.get_file_properties(share_name, directory_name, file_or_dir.name, timeout=None, snapshot=None)
        print(file_or_dir.name, file.properties.last_modified)
For Blob
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str={your_connection_string}, container_name={your_container_name})
for blob in container.list_blobs():
    print(f'{blob.name} : {blob.last_modified}')
You can get the keys and account details from the Azure portal under the account's access keys. Here, blob.last_modified gives you the timestamp you need to determine the latest item.
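To actually pick the most recent item, which is what the question asks for, here is a minimal sketch built on the blob listing above; the connection string and container name are placeholders:

import os
from azure.storage.blob import ContainerClient

# Hedged sketch: select the most recently modified blob by comparing last_modified.
# The connection string env var and container name below are placeholders.
container = ContainerClient.from_connection_string(conn_str=os.environ['AZURE_CONN_STR'], container_name='my-container')
blobs = list(container.list_blobs())
if blobs:
    latest = max(blobs, key=lambda b: b.last_modified)
    print(f'latest blob: {latest.name} ({latest.last_modified})')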
From the OutputFileDatasetConfig documentation for the destination class member,
If set to None, we will copy the output to the workspaceblobstore datastore, under the path /dataset/{run-id}/{output-name}
Given a handle to such an OutputFileDatasetConfig with destination set to None, how can I get the generated destination without recomputing the default myself, since that default can be subject to change?
If you do not want to pass a name and path, the run details will give you the run ID, and the path can be built from it. Ideally you would pass these details explicitly; if you do not, the recommended approach is to consume the output in an intermediate step so the SDK can handle the destination for you, as in this PythonScriptStep():
from azureml.core import Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

dataprep_output = OutputFileDatasetConfig()
input_dataset = Dataset.get_by_name(workspace, 'raw_data')

dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=cluster,
    arguments=[input_dataset.as_named_input('raw_data').as_mount(), dataprep_output]
)
output = OutputFileDatasetConfig()

src = ScriptRunConfig(source_directory=path,
                      script='script.py',
                      compute_target=ct,
                      environment=env,
                      arguments=["--output_path", output])
run = exp.submit(src, tags=tags)

############### INSIDE script.py
import argparse, os

parser = argparse.ArgumentParser()
parser.add_argument("--output_path")
args = parser.parse_args()

mount_point = os.path.dirname(args.output_path)
os.makedirs(mount_point, exist_ok=True)
print("mount_point : " + mount_point)
Here's my sample code which works:
import os, io, dropbox

def createFolder(dropboxBaseFolder, newFolder):
    # creating a temp dummy destination file path
    dummyFileTo = dropboxBaseFolder + newFolder + '/' + 'temp.bin'
    # creating a virtual in-memory binary file
    f = io.BytesIO(b"\x00")
    # uploading the dummy file in order to cause creation of the containing folder
    dbx.files_upload(f.read(), dummyFileTo)
    # now that the folder is created, delete the dummy file
    dbx.files_delete_v2(dummyFileTo)

accessToken = '....'
dbx = dropbox.Dropbox(accessToken)
dropboxBaseDir = '/test_dropbox'
dropboxNewSubDir = '/new_empty_sub_dir'
createFolder(dropboxBaseDir, dropboxNewSubDir)
But is there a more efficient/simpler way to do this task?
Yes, as Ronald mentioned in the comments, you can use the files_create_folder_v2 method to create a new folder.
That would look like this, modifying your code:
import dropbox
accessToken = '....'
dbx = dropbox.Dropbox(accessToken)
dropboxBaseDir = '/test_dropbox'
dropboxNewSubDir = '/new_empty_sub_dir'
res = dbx.files_create_folder_v2(dropboxBaseDir + dropboxNewSubDir)
# access the information for the newly created folder in `res`
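For completeness, here is a small sketch of how you might inspect the result and handle the case where the folder already exists. It assumes the SDK's usual CreateFolderResult and ApiError behavior rather than anything specific to your app:

import dropbox

try:
    res = dbx.files_create_folder_v2(dropboxBaseDir + dropboxNewSubDir)
    # CreateFolderResult.metadata is the FolderMetadata of the new folder
    print(res.metadata.path_display, res.metadata.id)
except dropbox.exceptions.ApiError as err:
    # Raised, for example, with a path conflict error if the folder already exists
    print('create_folder failed:', err)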
I went through the official Google Cloud docs, but I don't see how to use them to list the resources of a specific organization by providing the organization ID.
organizations = CloudResourceManager.Organizations.Search()
projects = emptyList()
parentsToList = queueOf(organizations)
while (parent = parentsToList.pop()) {
    // NOTE: Don't forget to iterate over paginated results.
    // TODO: handle PERMISSION_DENIED appropriately.
    projects.addAll(CloudResourceManager.Projects.List(
        "parent.type:" + parent.type + " parent.id:" + parent.id))
    parentsToList.addAll(CloudResourceManager.Folders.List(parent))
}
You can use Cloud Asset Inventory for this. I wrote this code to perform an export (sink) into BigQuery.
import os

from google.cloud import asset_v1
from google.cloud.asset_v1.proto import asset_service_pb2

def asset_to_bq(request):
    client = asset_v1.AssetServiceClient()
    parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))

    output_config = asset_service_pb2.OutputConfig()
    output_config.bigquery_destination.dataset = 'projects/{}/datasets/{}'.format(os.getenv('PROJECT_ID'),
                                                                                  os.getenv('DATASET'))
    output_config.bigquery_destination.table = 'asset_export'
    output_config.bigquery_destination.force = True

    response = client.export_assets(parent, output_config)

    # For waiting for the export to finish
    # response.result()
    # Do stuff after export

    return "done", 200

if __name__ == "__main__":
    asset_to_bq('')
Be careful if you use it: the export must go to an empty or non-existent table, or you must set force to true.
In my case, a few minutes after the Cloud Scheduler triggers my function and extracts the data to BigQuery, a scheduled query in BigQuery copies the data to another table to keep a history.
Note: it's also possible to configure an extract to Cloud Storage if you prefer.
I hope this gives you a starting point for achieving what you want to do.
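For the Cloud Storage option mentioned above, a sketch of the same export with a GCS destination; the bucket URI is a placeholder and the snippet uses the current asset_v1 message classes, so adjust it to whichever library version you actually run:

import os
from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()
parent = 'organizations/{}'.format(os.getenv('ORGANIZATION_ID'))

# Hedged sketch: export to a Cloud Storage object instead of BigQuery.
# 'gs://my-bucket/asset_export.json' is a placeholder URI.
output_config = asset_v1.OutputConfig()
output_config.gcs_destination.uri = 'gs://my-bucket/asset_export.json'

response = client.export_assets(request={"parent": parent, "output_config": output_config})
# response.result()  # wait for the long-running export to finish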
I am able to list the projects, but I also want to list the folders and the resources under each folder (including folder.name and tags), and I also want to specify the organization ID so the resource information comes from a specific organization.
import os
from google.cloud import resource_manager

def export_resource(organizations):
    client = resource_manager.Client()
    for project in client.list_projects():
        print("%s, %s" % (project.project_id, project.status))
I created a separate section in my kinto.ini file and I want to use those settings in my program.
#kinto.ini
[mysetting]
name = json
username = jsonmellow
password = *********
[app:main]
use = egg:kinto
kinto.storage_url = postgre//
If I use Kinto's 'config.get_setting' function, it gives me only the settings from the default "app:main" section. How can I get the other settings from the "mysetting" section?
You can use a prefix for your settings, like:
[app:main]
...
mysetting.name = json
mysetting.username = jsonmellow
But if you still need a separate section in the ini file, see: How can I access a custom section in a Pyramid .ini file?
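With the prefix approach, the values end up in the normal Pyramid settings mapping, so you can read them wherever you have the configurator or the request. A minimal sketch, assuming a Kinto plugin's includeme hook; the setting names match the example above:

def includeme(config):
    # All [app:main] keys, including the prefixed ones, live in registry.settings.
    settings = config.registry.settings
    name = settings.get('mysetting.name')          # 'json'
    username = settings.get('mysetting.username')  # 'jsonmellow'
    print(name, username)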