I have a single container with around 200k images in my blob storage. I want to write a Python script that copies these images out in batches of 20k to new containers called something like imageset1, imageset2, ..., imageset20 (the last container will have fewer than 20k images in it, which is fine).
I have the following so far:
from azure.storage.blob import BlockBlobService
from io import BytesIO
from shutil import copyfileobj

with BytesIO() as input_blob:
    with BytesIO() as output_blob:
        block_blob_service = BlockBlobService(account_name='my_account_name', account_key='my_account_key')
        # Download as a stream
        block_blob_service.get_blob_to_stream('mycontainer', 'myinputfilename', input_blob)
        # Here is where I want to chunk up the container contents into batches of 20k
        # Then I want to write the above to a set of new containers using, I think, something like this...
        block_blob_service.create_blob_from_stream('mycontainer', 'myoutputfilename', output_blob)
It's chunking up the contents of a container and writing the results out to new containers that I don't know how to do. Can anyone help?
Here is my sample code to meet your needs; it works on my container.
from azure.storage.blob.baseblobservice import BaseBlobService
account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<the source container name>'
blob_service = BaseBlobService(
account_name=account_name,
account_key=account_key
)
blobs = blob_service.list_blobs(container_name)
# The target container index starts with 1
container_index = 1
# The blob number in new container, such as 3 in my testing
num_per_container = 3
count = 0
# The prefix of new container name
prefix_of_new_container = 'imageset'
flag_of_new_container = False
for blob in blobs:
    if flag_of_new_container == False:
        flag_of_new_container = blob_service.create_container("%s%d" % (prefix_of_new_container, container_index))
    print(blob.name, "%s%d" % (prefix_of_new_container, container_index))
    blob_service.copy_blob("%s%d" % (prefix_of_new_container, container_index), blob.name, "https://%s.blob.core.windows.net/%s/%s" % (account_name, container_name, blob.name))
    count += 1
    if count == num_per_container:
        container_index += 1
        count = 0
        flag_of_new_container = False
Note: I use BaseBlobService only because it is enough for your needs, and it works even for Append Blobs or Page Blobs. You can use BlockBlobService instead if you prefer.
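If you are on the newer azure-storage-blob v12 SDK, the same batching idea could look roughly like the sketch below. This is a minimal, hedged outline based on the numbers in the question; the account URL, key, and container names are placeholders, and the 20000 batch size comes from the question.

from azure.storage.blob import BlobServiceClient

account_url = "https://<your account name>.blob.core.windows.net"
service = BlobServiceClient(account_url=account_url, credential="<your account key>")
source_container = "<the source container name>"

batch_size = 20000            # 20k images per target container, as in the question
prefix = "imageset"           # target containers: imageset1, imageset2, ...

container_index = 1
count = 0
target = None

for blob in service.get_container_client(source_container).list_blobs():
    if target is None:
        # Create the next target container lazily, only when there is a blob to copy into it.
        target = service.create_container(f"{prefix}{container_index}")
    # start_copy_from_url begins a server-side (asynchronous) copy; if the source
    # container is not publicly readable, append a SAS token to the source URL.
    source_url = f"{account_url}/{source_container}/{blob.name}"
    target.get_blob_client(blob.name).start_copy_from_url(source_url)
    count += 1
    if count == batch_size:
        container_index += 1
        count = 0
        target = None

Because copy_blob / start_copy_from_url start asynchronous server-side copies, you may want to check the copy status on the destination blobs (for example via get_blob_properties) before deleting anything from the source container.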
Related
I am trying to find out how Azure Blob Storage behaves with AD authentication when uploading a single big file takes more than 90 minutes. Unfortunately my internet connection is quite fast and my disk can't fit a TB-scale file, so I am trying to simulate a slow upload.
I tried the following code:
import os
import time
from io import BufferedReader, FileIO

class ProgressFile(BufferedReader):
    # For binary opening only
    def __init__(self, filename, read_callback):
        f = FileIO(file=filename, mode='r')
        self._read_callback = read_callback
        super().__init__(raw=f)
        # I prefer Pathlib but this should still support 2.x
        self.length = os.stat(filename).st_size

    def read(self, size=None):
        calc_sz = size
        if not calc_sz:
            calc_sz = self.length - self.tell()
        self._read_callback(position=self.tell(), read_size=calc_sz, total=self.length)
        return super(ProgressFile, self).read(size)
def my_callback(position, read_size, total):
    if position > 0 and position <= 4194304:
        time.sleep(5520)
    print("position: {position}, read_size: {read_size}, total: {total}".format(position=position,
                                                                                read_size=read_size,
                                                                                total=total))
myfile = ProgressFile(filename='./testfile', read_callback=my_callback)
from azure.identity import ClientSecretCredential
from azure.storage.blob import ContainerClient

token_credential = ClientSecretCredential(
    # tenant id, client id and client secret omitted
)
container_client = ContainerClient(oauth_url, "containername", token_credential)
def upload(filename):
    blob_client = container_client.get_blob_client("myfile")
    blob_client.upload_blob(myfile, blob_type="BlockBlob")
    print("finish uploading")

upload(int(time.time()))
However, I don't see a token expiration error, even after 90 minutes.
In what circumstances does a token expiration error appear?
Since you are using azure.identity.ClientSecretCredential, it renews the token when it is close to expiration.
(I work on the Microsoft Azure SDK team.)
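If you want to observe this yourself, one option is to request a token directly from the credential and look at its expiry. This is a minimal sketch with placeholder tenant/client values; the scope below is the standard Azure Storage data-plane scope.

import time
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant id>",
    client_id="<client id>",
    client_secret="<client secret>",
)

# Tokens for the storage data plane are typically short-lived (on the order of an hour).
token = credential.get_token("https://storage.azure.com/.default")
print("token expires in ~%d minutes" % ((token.expires_on - time.time()) / 60))

# Requesting a token again near expiry yields a fresh one; the storage SDK's request
# pipeline does the same on each call, which is why a 90+ minute upload does not
# fail with an expired token.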
I am trying to download a list of CSV files from Azure Blob Storage using a shared SAS token, but I am getting all sorts of errors.
I tried looking this up and tried multiple code samples from contributors on Stack Overflow and in the Azure documentation. Here is the final state of the code sample I constructed from those sources. It tries to download the list of CSV files in a pooled manner (the blob storage contains 200 CSV files):
NB: I left commented-out code snippets to show the different approaches I tried. Sorry if they are confusing!
from itertools import tee
from multiprocessing import Process
from multiprocessing.pool import ThreadPool
import os
from azure.storage.blob import BlobServiceClient, BlobClient
from azure.storage.blob import ContentSettings, ContainerClient
#from azure.storage.blob import BlockBlobService
STORAGEACCOUNTURL = "https://myaccount.blob.core.windows.net"
STORAGEACCOUNTKEY = "sv=2020-08-04&si=blobpolicyXYZ&sr=c&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CONTAINERNAME = "mycontainer"
##BLOBNAME = "??"
sas_url = 'https://myaccount.blob.core.windows.net/mycontainer/mydir?sv=2020-08-04&si=blobpolicyXYZ&sr=c&sig=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
LOCAL_BLOB_PATH = "./downloads"
class AzureBlobFileDownloader:
def __init__(self):
print("Intializing AzureBlobFileDownloader")
# Initialize the connection to Azure storage account
self.blob_service_client_instance = ContainerClient.from_container_url #BlobClient.from_blob_url(sas_url) #BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
#self.blob_client_instance = self.blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME)
#self.blob_service_client = BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
#self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)
#self.blob_service_client = BlockBlobService("storage_account",sas_token="?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-04-24T10:01:58Z&st=2019-04-23T02:01:58Z&spr=https&sig=xxxxxxxxx")
#self.my_container = self.blob_service_client.get_blob_to_path("container_name","blob_name","local_file_path")
def save_blob(self,file_name,file_content):
# Get full path to the file
download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)
# for nested blobs, create local path as well!
os.makedirs(os.path.dirname(download_file_path), exist_ok=True)
with open(download_file_path, "wb") as file:
file.write(file_content)
def download_all_blobs_in_container(self):
# get a list of blobs
my_blobs = self.blob_service_client_instance.get_block_list() #list_blobs() #self.blob_client_instance.list_blobs() download_blob() #
print(my_blobs)
#iterate through the iterable object for testing purposes, maybe wrong approach!
result, result_backup = tee(my_blobs)
print("**first iterate**")
for i, r in enumerate(result):
print(r)
#start downloading my_blobs
result = self.run(my_blobs)
print(result)
def run(self,blobs):
# Download 3 files at a time!
with ThreadPool(processes=int(3)) as pool:
return pool.map(self.save_blob_locally, blobs)
def save_blob_locally(self,blob):
file_name = blob.name
print(file_name)
bytes = self.blob_service_client_instance.get_blob_client(CONTAINERNAME,blob).download_blob().readall()
# Get full path to the file
download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)
# for nested blobs, create local path as well!
os.makedirs(os.path.dirname(download_file_path), exist_ok=True)
with open(download_file_path, "wb") as file:
file.write(bytes)
return file_name
# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()
Could someone help me achieve this task in Python:
get a list of all the files in the blob storage whose names are prefixed with part-
download them to a local folder
Thanks
Could someone help me achieve this task in Python:
get a list of all the files in the blob storage whose names are prefixed with part-
To list all the blobs whose names have the prefix "part-", you can use blob_service.list_blobs(<Container Name>, prefix="<Your Prefix>"). Below is the code to get that list of blobs.
print("\nList blobs in the container")
generator = blob_service.list_blobs(CONTAINER_NAME, prefix="part-")
for blob in generator:
    print("\t Blob name: " + blob.name)
download them to a local folder
To download a blob you can use blob_client = blob_service.get_blob_to_path(<Container Name>, <Blob Name>, <File Path>). Below is the code to download the blobs as per your requirement.
blob_client = blob_service.get_blob_to_path(CONTAINER_NAME,blob.name,fname)
Below is the complete code that worked for us and achieves your requirement.
import os
from azure.storage.blob import BlockBlobService
ACCOUNT_NAME = "<Your_ACCOUNT_NAME>"
ACCOUNT_KEY = "<YOUR_ACCOUNT_KEY>"
CONTAINER_NAME = "<YOUR_CONTAINER_NAME>"
LOCAL_BLOB_PATH = "C:\\<YOUR_PATH>\\downloadedfiles"
blob_service = BlockBlobService(ACCOUNT_NAME, ACCOUNT_KEY)
# Lists all blobs that have a prefix of part-
print("\nList blobs in the container")
generator = blob_service.list_blobs(CONTAINER_NAME, prefix="part-")
for blob in generator:
    print("\t Blob name: " + blob.name)
# Downloading the blobs to a folder
for blob in generator:
    # Adds the blob name to the path
    fname = os.path.join(LOCAL_BLOB_PATH, blob.name)
    print(f'Downloading {blob.name} to {fname}')
    # Downloading blob into file
    blob_client = blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, fname)
RESULT: screenshots of the files in my Storage Account and of the downloaded files in my local folder (images omitted).
Updated Answer
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=None, sas_token=SAS_TOKEN)
# Lists all blobs that have a prefix of part-
print("\nList blobs in the container")
generator = blob_service.list_blobs(CONTAINER_NAME, prefix="directory1"+"/"+"part-")
for blob in generator:
    print("\t Blob name: " + blob.name)
# Downloading the blobs to a folder
for blob in generator:
    # Adds the blob name to the path
    fname = os.path.join(LOCAL_BLOB_PATH, blob.name)
    print(f'Downloading {blob.name} to {fname}')
    # Downloading blob into file
    blob_client = blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, fname)
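Since your question code already imports the v12 SDK (BlobServiceClient / ContainerClient), here is a minimal, hedged sketch of the same listing and downloading with that SDK and a container-level SAS URL. The SAS URL, container name, and local path are placeholders taken from the question.

import os
from azure.storage.blob import ContainerClient

sas_url = "https://myaccount.blob.core.windows.net/mycontainer?<sas token>"   # container-level SAS URL (placeholder)
LOCAL_BLOB_PATH = "./downloads"

container_client = ContainerClient.from_container_url(sas_url)

# List only the blobs whose names start with "part-" and download each one.
for blob in container_client.list_blobs(name_starts_with="part-"):
    download_file_path = os.path.join(LOCAL_BLOB_PATH, blob.name)
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)
    with open(download_file_path, "wb") as f:
        f.write(container_client.download_blob(blob.name).readall())
    print(f"Downloaded {blob.name} to {download_file_path}")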
I am following this link and running into a problem:
How to upload folder on Google Cloud Storage using Python API
I have saved a model in the container environment, and from there I want to copy it to a GCP bucket.
Here is my code:
import os
import glob
from google.cloud import storage

storage_client = storage.Client(project='*****')
def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        blob = bucket.blob(gcs_path)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")
path="/pythonPackage/trainer/model_mlm_demo" #this is local absolute path where my folder is. Folder name is **model_mlm_demo**
buc="py*****" #this is my GCP bucket address
gcs="model_mlm_demo2/" #this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
/pythonPackage/trainer/model_mlm_demo has 3 files in it: config, model.bin and arguments.bin
ERROR
The code doesn't throw any error, but no files are uploaded to the GCP bucket. It just creates an empty folder.
From what I can see, the problem is that you don't need to pass the gs:// prefix as the bucket parameter. Here is an example you may want to check out:
https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    blob.upload_from_filename(source_file_name)

    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
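For example, a hypothetical call using the (redacted) bucket and file names from your question, one file at a time, would look like this:

# Hypothetical usage with the bucket and paths mentioned in the question.
upload_blob("py*****", "/pythonPackage/trainer/model_mlm_demo/model.bin", "model_mlm_demo2/model.bin")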
I have reproduced your issue, and the code snippet below works fine. I have updated the code based on the folders and names you mentioned in the question; the key change is appending each file name to gcs_path so every file gets its own object name instead of all of them being written to the same "folder" object. Let me know if you have any issues.
import os
import glob
from google.cloud import storage
storage_client = storage.Client(project='')
def upload_local_directory_to_gcs(local_path, bucket, gcs_path):
    bucket = storage_client.bucket(bucket)
    assert os.path.isdir(local_path)
    for local_file in glob.glob(local_path + '/**'):
        print(local_file)
        print("this is bucket", bucket)
        filename = local_file.split('/')[-1]
        blob = bucket.blob(gcs_path + filename)
        print("here")
        blob.upload_from_filename(local_file)
        print("done")
# this is local absolute path where my folder is. Folder name is **model_mlm_demo**
path = "/pythonPackage/trainer/model_mlm_demo"
buc = "py*****" # this is my GCP bucket address
gcs = "model_mlm_demo2/" # this is the new folder that I want to store files in GCP
upload_local_directory_to_gcs(local_path=path, bucket=buc, gcs_path=gcs)
I just came across the gcsfs library, which also seems to offer a nicer interface for this.
You could copy an entire directory into a gcs location like this:
import gcsfs

def upload_to_gcs(src_dir: str, gcs_dst: str):
    fs = gcsfs.GCSFileSystem()
    fs.put(src_dir, gcs_dst, recursive=True)
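For example, a hypothetical call with the (redacted) local folder and bucket names from the question:

# Hypothetical usage: copy the whole model folder into the bucket under model_mlm_demo2/.
upload_to_gcs("/pythonPackage/trainer/model_mlm_demo", "py*****/model_mlm_demo2")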
I figured out a way to upload the model artefacts to the GCP bucket using subprocess.
import subprocess
subprocess.call('gsutil cp -r source_folder_in_local gs://*****/folder_name', shell=True, stdout=subprocess.PIPE)
If gsutil is not installed, you can install it using this link:
https://cloud.google.com/storage/docs/gsutil_install
I have files in Azure File Storage. I listed the files by upload date, and now I want to select the most recently uploaded file.
To do this, I created a function that should return the list of files. However, when I look at the output, it returns only one file and the others are missing.
Here is my code:
file_service = FileService(account_name='', account_key='')
generator = list(file_service.list_directories_and_files(''))
def list_files_in(generator,file_service):
list_files=[]
for file_or_dir in generator:
file_in = file_service.get_file_properties(share_name='', directory_name="", file_name=file_or_dir.name)
file_date= file_in.properties.last_modified.date()
list_tuple = (file_date,file_or_dir.name)
list_files.append(list_tuple)
return list_files
To get the latest files in Azure Storage you need to write that logic yourself; below is code that gives you the last-modified time for each item.
For Files
from azure.storage.file import File

result = file_service.list_directories_and_files(share_name, directory_name)
for file_or_dir in result:
    if isinstance(file_or_dir, File):
        file = file_service.get_file_properties(share_name, directory_name, file_or_dir.name, timeout=None, snapshot=None)
        print(file_or_dir.name, file.properties.last_modified)
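To then pick the most recent file, a minimal sketch (reusing the share_name and directory_name placeholders from above) is to collect (last_modified, name) pairs and take the maximum:

# Collect (last_modified, name) pairs and pick the newest file.
dated = []
for file_or_dir in file_service.list_directories_and_files(share_name, directory_name):
    if isinstance(file_or_dir, File):
        props = file_service.get_file_properties(share_name, directory_name, file_or_dir.name)
        dated.append((props.properties.last_modified, file_or_dir.name))

latest_date, latest_name = max(dated)
print("Most recent file:", latest_name, latest_date)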
For Blob
from azure.storage.blob import ContainerClient
container = ContainerClient.from_connection_string(conn_str="<your_connection_string>", container_name="<your_container_name>")
for blob in container.list_blobs():
    print(f'{blob.name} : {blob.last_modified}')
You can get the keys and account details for access from the Azure portal. Here, blob.last_modified gives you the last-modified time of each blob, which you can use to find the latest item.
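For blobs, a minimal sketch of picking the most recently modified one (reusing the container client from above):

# Pick the blob with the greatest last_modified timestamp.
latest = max(container.list_blobs(), key=lambda b: b.last_modified)
print(f'Most recent blob: {latest.name} ({latest.last_modified})')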
I have developed the code below, which exports a BigQuery table to a Google Cloud Storage bucket. I want to merge the exported files into a single file without a header, so that downstream processes can use the file without any issue.
def export_bq_table_to_gcs(self, table_name):
    client = bigquery.Client(project=project_name)
    print("Exporting table {}".format(table_name))
    dataset_ref = client.dataset(dataset_name, project=project_name)
    dataset = bigquery.Dataset(dataset_ref)
    table_ref = dataset.table(table_name)
    size_bytes = client.get_table(table_ref).num_bytes
    # For tables bigger than 1GB use Google auto-split, otherwise force the export into a single file.
    if size_bytes > 10 ** 9:
        destination_uris = [
            'gs://{}/{}{}*.csv'.format(bucket_name, f'{table_name}_temp', uid)]
    else:
        destination_uris = [
            'gs://{}/{}{}.csv'.format(bucket_name, f'{table_name}_temp', uid)]
    extract_job = client.extract_table(table_ref, destination_uris)  # API request
    result = extract_job.result()  # Waits for job to complete.

    if result.state != 'DONE' or result.errors:
        raise Exception('Failed extract job {} for table {}'.format(result.job_id, table_name))
    else:
        print('BQ table(s) export completed successfully')
        storage_client = storage.Client(project=gs_project_name)
        bucket = storage_client.get_bucket(gs_bucket_name)
        blob_list = bucket.list_blobs(prefix=f'{table_name}_temp')
        print('Merging shard files into single file')
        bucket.blob(f'{table_name}.csv').compose(blob_list)
Can you please help me find a way to skip the header?
Thanks,
Raghunath.
We can avoid the header by using a job config that sets the print_header parameter to False. Sample code:
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(table_ref, destination_uris,
                                   job_config=job_config)
Thanks
You can also use skipLeadingRows (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.googleSheetsOptions.skipLeadingRows), which tells BigQuery to ignore a number of leading rows when it reads the data.
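A minimal sketch of that option, assuming the merged CSV is later loaded back into BigQuery with a load job (the bucket URI and destination table are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# skip_leading_rows=1 makes BigQuery ignore the header row of the CSV on load.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://<your bucket>/<your table>.csv",   # placeholder source URI
    "<project>.<dataset>.<table>",           # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete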