I'm able to use dask.dataframe.read_sql_table to read the data, e.g. df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N).
What would be the next (best) steps to save it as a Parquet file in Azure Blob Storage?
From my small amount of research there are a couple of options:
Save locally and use AzCopy (https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json), which is not great for big data
I believe adlfs is for reading from Blob Storage
Use dask.dataframe.to_parquet and work out how to point it at the blob container
The intake project (not sure where to start)
$ pip install adlfs
dd.to_parquet(
    df=df,
    path='abfs://{BLOB}/{FILE_NAME}.parquet',
    storage_options={'account_name': 'ACCOUNT_NAME',
                     'account_key': 'ACCOUNT_KEY'},
)
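If you want to sanity-check the write, you can read the dataset back with the same storage options. A minimal sketch, reusing the placeholder container, file, and credential names from above:

import dask.dataframe as dd

# Read the freshly written Parquet dataset back from the same container
df_check = dd.read_parquet(
    'abfs://{BLOB}/{FILE_NAME}.parquet',
    storage_options={'account_name': 'ACCOUNT_NAME',
                     'account_key': 'ACCOUNT_KEY'},
)
print(df_check.head())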
I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the local download step. Is there a way I could achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating that the path is not found.
Any help is appreciated. Thanks!
The file_path in create_blob_from_path is the path of a local file, something like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a blob URL.
You could download your file to a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
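For example, a minimal sketch of that approach with the legacy SDK, using the requests library to pull the CSV into memory (the URL, container name, blob name, and credentials are placeholders):

import requests
from azure.storage.blob import BlockBlobService, ContentSettings

block_blob_service = BlockBlobService(account_name='ACCOUNT_NAME', account_key='ACCOUNT_KEY')

# Download the CSV into memory instead of saving it to disk
response = requests.get('https://example.com/FlightStats.csv')
response.raise_for_status()

# Push the raw bytes straight into the container
block_blob_service.create_blob_from_bytes(
    'containername',
    'FlightStats.csv',
    response.content,
    content_settings=ContentSettings(content_type='text/csv'),
)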
The other answer uses the so-called legacy Azure SDK for Python.
If it's a fresh implementation, I recommend using a Gen2 storage account (instead of Gen1 or plain Blob storage).
For a Gen2 storage account, see the example below:
from azure.storage.filedatalake import DataLakeFileClient

data = b"abc"
file = DataLakeFileClient.from_connection_string("my_connection_string",
                                                 file_system_name="myfilesystem",
                                                 file_path="myfile")
# Append the bytes at the given offset, then flush to commit them
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you'll have to keep track of the offset on the client side.
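If you only need to write the whole payload once, newer versions of azure-storage-file-datalake also expose upload_data, which avoids the offset bookkeeping. A minimal sketch, assuming a recent SDK version and the same placeholder names:

from azure.storage.filedatalake import DataLakeFileClient

file = DataLakeFileClient.from_connection_string("my_connection_string",
                                                 file_system_name="myfilesystem",
                                                 file_path="myfile")
# Single call: no append_data/flush_data offset tracking needed
file.upload_data(b"abc", overwrite=True)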
In Azure SDK v11, we had the option to specify the ParallelOperationThreadCount through BlobRequestOptions. In Azure SDK v12, I see that BlobClientOptions does not have this, and on BlockBlobClient (previously CloudBlockBlob in v11) parallelism is only mentioned for the download methods.
We have three files: one 200MB, one 150MB, and one 20MB. For each file, we want the file to be split into blocks and have those uploaded in parallel. Is this automatically done by the BlockBlobClient? If possible, we would like to do these operations for the 3 files in parallel as well.
You can also make use of StorageTransferOptions in v12.
Sample code below:
BlobServiceClient blobServiceClient = new BlobServiceClient(conn_str);
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient("xxx");
BlobClient blobClient = containerClient.GetBlobClient("xxx");

// Set the transfer options here, e.g. transferOptions.MaximumConcurrency or other settings.
StorageTransferOptions transferOptions = new StorageTransferOptions();

blobClient.Upload("xxx", transferOptions: transferOptions);
By the way, for uploading large files you can also use the Microsoft Azure Storage Data Movement Library for better performance.
Using Fiddler, I verified that BlockBlobClient does indeed upload the files in chunks without any extra work. For doing each of the major files in parallel, I simply created a task for each one, added them to a list of tasks, and used await Task.WhenAll(tasks).
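For comparison, if you ever need the same behaviour from Python, the v12 azure-storage-blob package exposes the equivalent knob as max_concurrency on upload_blob, and a thread pool covers uploading the three files in parallel. A rough sketch; the connection string, container, and file names are placeholders:

from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("my_connection_string")
container = service.get_container_client("my-container")

def upload(path):
    # max_concurrency splits the blob into blocks and uploads them in parallel
    with open(path, "rb") as data:
        container.get_blob_client(path).upload_blob(data, overwrite=True, max_concurrency=8)

# Upload the three files themselves in parallel as well
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(upload, ["file200mb.bin", "file150mb.bin", "file20mb.bin"]))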
I have the files in S3 because Microsoft's storage is too complicated for me to deal with, but I am using the Azure edition of Snowflake. Every time I run the load process, it loads every file that I have. The AWS edition of Snowflake doesn't do that; it keeps track of the file names that I've already loaded and doesn't load them again. What's going on?
Here is an example of one that I am loading through a procedure:
var stmt1 = snowflake.execute({ sqlText:
    `
    copy into dw_order_charges
    from @dw_order_charges
    file_format = load_format_pipe
    `
});
Thanks, --sw
I am trying to write a new file (not upload an existing file) to a Google Cloud Storage bucket from inside a Python Google Cloud Function.
I tried using google-cloud-storage, but it does not have an "open" attribute on the bucket.
I tried to use the App Engine library GoogleAppEngineCloudStorageClient, but the function cannot be deployed with this dependency.
I tried to use gcs-client, but I cannot pass the credentials inside the function as it requires a JSON file.
Any ideas would be much appreciated.
Thanks.
from google.cloud import storage
import io

# Bucket name.
bucket_name = "my_bucket_name"

# Get the bucket that the file will be uploaded to.
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# Create a new blob for the file's content.
my_file = bucket.blob('media/teste_file01.txt')

# Create an in-memory file.
output = io.StringIO("This is a test \n")

# Upload from the string.
my_file.upload_from_string(output.read(), content_type="text/plain")
output.close()

# List the created blobs.
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
    print(blob.name)

# Make the blob publicly viewable.
my_file.make_public()
You can now write files directly to Google Cloud Storage; it is no longer necessary to create a file locally and then upload it.
You can use blob.open() as follows:
from google.cloud import storage

def write_file(lines):
    client = storage.Client()
    bucket = client.get_bucket('bucket-name')
    blob = bucket.blob('path/to/new-blob.txt')
    # blob.open(mode='w') streams the content straight into the new object
    with blob.open(mode='w') as f:
        for line in lines:
            f.write(line)
You can find more examples and snippets here:
https://github.com/googleapis/python-storage/tree/main/samples/snippets
You have to create your file locally and then push it to GCS; you can't create a file dynamically in GCS by using open.
For this, you can write to the /tmp directory, which is an in-memory file system (see the sketch below). Because of that, you will never be able to create a file bigger than the memory allocated to your function minus the memory footprint of your code; with a 2GB function, you can expect a maximum file size of about 1.5GB.
Note: GCS is not a file system, and you shouldn't try to use it like one.
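A minimal sketch of that write-to-/tmp-then-upload flow, assuming it runs inside an HTTP Cloud Function; the bucket name and paths are placeholders:

from google.cloud import storage

def handler(request):
    local_path = '/tmp/report.csv'  # /tmp is the writable, in-memory filesystem
    with open(local_path, 'w') as f:
        f.write('col_a,col_b\n1,2\n')

    client = storage.Client()
    bucket = client.bucket('my_bucket_name')
    # Push the temporary file to GCS; it can then be deleted to free memory
    bucket.blob('reports/report.csv').upload_from_filename(local_path)
    return 'ok'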
EDIT 1
Things have changed since my answer:
It's now possible to write to any directory in the container (not only /tmp).
You can stream-write a file to GCS, just as you can receive one in streaming mode on Cloud Run; the blob.open() example above is exactly that kind of streaming write.
Note: streaming writes disable checksum validation, so you won't have an integrity check at the end of the streamed write.
I am trying to export a large CSV dataset from BigQuery. The table is over 90,000 rows, so BigQuery prompts me to export it to Google Cloud Storage, which I did with the following options:
Export format: CSV,
Compression: GZIP
Google Cloud Storage URI: my_bucket/2015/feb.csv
After a few minutes, the dataset appears in my Google Cloud Storage bucket, so I go to download it from there. The file is about 200MB, and when I finally open it, the Excel sheet is crammed with Wingdings characters; none of the data made it through.
Did I go wrong somewhere? How can I download and open this file properly?
Try
mv feb.csv feb.csv.gz
gunzip feb.csv.gz
According to the question, you asked for GZIP compression, so the exported file is gzipped even though its name ends in .csv; decompress it first.
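If you'd rather check the contents in Python instead of Excel, you can also read the gzipped export directly. A small sketch; pandas is an assumption here, not part of the original answer:

import pandas as pd

# pandas can decompress on the fly, even though the file name ends in .csv
df = pd.read_csv('feb.csv', compression='gzip')
print(df.head())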