I am trying to export a large CSV dataset from BigQuery. The file is over 90000 rows, so BigQuery prompts me to export the table to Google Cloud Storage
...so I did that with the following options:
Export format: CSV,
Compression: GZIP
Google Cloud Storage URI: my_bucket/2015/feb.csv
After a few minutes, the dataset appears in my Google Cloud Storage bucket. Then I go to download it from there. The file is about 200 MB; when I finally open it, the Excel sheet is crammed with Wingdings and none of the data made it through.
Did I go wrong somewhere? How can I download and open this file properly?
Try
mv feb.csv feb.csv.gz
gunzip feb.csv.gz
As described in the question, you asked for a compressed file, so uncompress it first.
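If you'd rather script the whole thing, here's a minimal sketch using the google-cloud-storage Python client plus the standard gzip module; the client library is an assumption on my part, and the bucket/object names are just the ones from your question:
from google.cloud import storage
import gzip
import shutil
client = storage.Client()
bucket = client.bucket('my_bucket')
blob = bucket.blob('2015/feb.csv')  # the object is gzip-compressed even though it is named .csv
blob.download_to_filename('feb.csv.gz')  # save it with a .gz extension
with gzip.open('feb.csv.gz', 'rb') as src, open('feb.csv', 'wb') as dst:
    shutil.copyfileobj(src, dst)  # decompress to a plain CSV that Excel can open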
Related
I have a requirement to upload a PDF from SFTP to Azure Blob, which works for text-based PDFs only.
If the PDF contains pictures, the uploaded PDF is faulty (missing pictures).
If I use the "normal" ByteArrayOutputStream and return a string, I can convert it back to a PDF and it works.
The issue only occurs when using the Azure methods:
I use these lines of code to do so:
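// note: getBody(String.class) converts the payload to text and body.getBytes() re-encodes it, which can alter binary content such as embedded images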
def body = message.getBody(String.class)
BlobOutputStream blobOutputStream = blob.openOutputStream()
blobOutputStream.write(body.getBytes())
blobOutputStream.close()
A correct PDF with pictures is expected. How can I do this?
As suggested by Priyanka Chakraborti, if using Open Connectors with Cloud Platform Integration (CPI) is an option, then you can refer to the blog post. The files will not be corrupted in that case.
I am accessing a website that allows me to download a CSV file. I would like to store the CSV file directly in the blob container. I know that one way is to download the file locally and then upload it, but I would like to skip the step of downloading the file locally. Is there a way I could achieve this?
I tried the following:
block_blob_service.create_blob_from_path('containername','blobname','https://*****.blob.core.windows.net/containername/FlightStats',content_settings=ContentSettings(content_type='application/CSV'))
but I keep getting errors stating the path is not found.
Any help is appreciated. Thanks!
The file_path in create_blob_from_path is the path of a local file, something like "C:\xxx\xxx". The path you passed ('https://*****.blob.core.windows.net/containername/FlightStats') is a blob URL, not a local path.
You could download your file to a byte array or stream, then use the create_blob_from_bytes or create_blob_from_stream method.
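For example, here's a minimal sketch using the same legacy SDK, assuming the site exposes a plain HTTPS download URL and that requests is installed; the URL, account credentials, container and blob names below are all placeholders:
import requests
from azure.storage.blob import BlockBlobService, ContentSettings
block_blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
response = requests.get('https://example.com/FlightStats.csv')  # hypothetical download URL
response.raise_for_status()
block_blob_service.create_blob_from_bytes(
    'containername', 'blobname', response.content,
    content_settings=ContentSettings(content_type='text/csv'))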
The other answer uses the so-called legacy Azure SDK for Python.
I recommend that, if this is a fresh implementation, you use a Gen2 storage account (instead of Gen1 or plain Blob storage).
For a Gen2 storage account, see the example here:
from azure.storage.filedatalake import DataLakeFileClient
data = b"abc"
file = DataLakeFileClient.from_connection_string("my_connection_string",
    file_system_name="myfilesystem", file_path="myfile")
file.append_data(data, offset=0, length=len(data))
file.flush_data(len(data))
It's painful: if you're appending multiple times, you have to keep track of the offset on the client side.
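For what it's worth, here's a rough sketch of that bookkeeping, reusing the placeholder connection string and file names from the example above (the chunks are only illustrative):
from azure.storage.filedatalake import DataLakeFileClient
file = DataLakeFileClient.from_connection_string("my_connection_string",
    file_system_name="myfilesystem", file_path="myfile")
file.create_file()  # start a fresh file
offset = 0
for chunk in (b"abc", b"def"):  # illustrative chunks
    file.append_data(chunk, offset=offset, length=len(chunk))
    offset += len(chunk)  # track the offset on the client side
file.flush_data(offset)  # commit everything appended so far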
I currently have a Node.js server running where I can grab a CSV file stored in a storage bucket and save it to a local file.
However, when I try to do the same thing with an xlsx file, the file seems to get messed up and cannot be read when I download it to a local directory.
Here is my code for getting the file to a stream:
async function getFileFromBucket(fileName) {
var fileTemp = await storage.bucket(bucketName).file(fileName);
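// note: download() resolves with an array, [contents], where contents is a Buffer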
return await fileTemp.download()
}
and with the data returned from the above code, I store it in a local directory by doing the following:
fs.promises.writeFile('local_directory', DataFromAboveCode)
It seems to work fine with the .csv file but not with the .xlsx file: I can open the csv file, but the xlsx file gets corrupted and cannot be opened.
I tried downloading the xlsx file directly from the storage bucket in the Google Cloud console and it works fine, meaning that something's gone wrong in the downloading / saving process.
Could someone guide me to what I am doing wrong here?
Thank you
I'm able to use dask.dataframe.read_sql_table to read the data, e.g. df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
What would be the next (best) steps to saving it as a parquet file in Azure blob storage?
From my brief research, there are a couple of options:
Save locally and use https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json (not great for big data)
I believe adlfs is for reading from blob storage
use dask.dataframe.to_parquet and work out how to point to the blob container
intake project (not sure where to start)
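You can install adlfs and have dask write straight to the container via the abfs:// protocol: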
$ pip install adlfs
dd.to_parquet(
df=df,
path='abfs://{BLOB}/{FILE_NAME}.parquet',
storage_options={'account_name': 'ACCOUNT_NAME',
'account_key': 'ACCOUNT_KEY'},
)
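Note that dask writes the dataset as a directory at that path, with one part file per partition.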
I'm using AWS S3 to host a static webpage, almost all assets are gzipped before being uploaded.
During the upload the "content-encoding" header is correctly set to "gzip" (and this is also reflected when actually loading the file from AWS).
The thing is, the files can't be read: they are still in gzip format even though the correct headers are set...
The files are uploaded using npm s3-deploy; here's a screenshot of what the request looks like:
and the contents of the file in the browser:
If I upload the file manually and set the content-encoding header to "gzip" it works perfectly. Sadly, I have a couple hundred files to upload for every deployment and cannot do this manually every time (I hope that's understandable ;) ).
Does anyone have an idea of what's going on here? Has anyone worked with s3-deploy who can help?
I use my own bash script for S3 deployments; you can try it:
webpath='path'
BUCKET='BUCKETNAME'
for file in $webpath/js/*.gz; do
aws s3 cp "$file" s3://"$BUCKET/js/" --content-encoding 'gzip' --region='eu-west-1'
done
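If you'd rather script it in Python than bash, roughly the same thing can be done with boto3; this is only a sketch, and the file path, bucket name, key and content type are placeholders:
import boto3
s3 = boto3.client('s3')
# upload a pre-gzipped asset and record the encoding so browsers decompress it transparently
s3.upload_file(
    'dist/js/app.js.gz', 'BUCKETNAME', 'js/app.js.gz',
    ExtraArgs={'ContentEncoding': 'gzip', 'ContentType': 'application/javascript'})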