parse gz file in aws s3 using python - python-3.x

I am trying to bulk copy tables from SnowFlake to postgreSQL. From SnowFlake, I was able to extract tables in CSV format using COPY. The COPY compresses the extract in gz format in aws s3.
Now the second step is to load these files in postgreSQL. I am planning to use postgreSQL COPY utility to ingest the data. However, I don't want to unzip the files. I would rather like to buffer the data directly from gz files and give the buffer file as input to the psycopg2 copy_from function.
Is there a way to parse gz files in AWS S3 using python? Thanks in advance!

Related

Import XLS file from GCS to BigQuery

I have some .xls datas in my Google Cloud Storage and want to use airflow to store it to GCP. Can I export it directly to BigQuery or can i use additional library (such a pandas and xlrd) to convert the files and store it into BigQuery?
Thanks
Bigquery don't support xls format. The easiest way is to transform the file in CSV and to load it into big query.
However, I don't know your xls format. If it's multisheet you have to work on the file.

How to append files in GCS with the same schema?

Is there any way one can append two files in GCS, suppose file one is a full
load and second file is an incremental load. Then what's the way we can append
the two?
Secondly, using gsutil compose will append the two files including the attributes
names as well. So, in the final file I want the data of the two files.
You can append two separate files using compose in the Google Cloud Shell and rename the output file as the first file, like this:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/obj1
This command is meant for parallel uploads in which you divide a large object file in smaller objects. They get uploaded to Google Cloud Storage and then you can append them to get the original file. You can find more information on Composite Objects and Parallel Uploads.
I've come up with two possible solutions:
Google Cloud Function solution
The option I would go for is using a Cloud Function. Doing something like the following:
Create an empty bucket like append_bucket.
Upload the first file.
Create a Cloud Function to be triggered by new uploaded files on the
bucket.
Upload the second file.
Read the first and the second file (you will have to download them as string first).
Make the append operation.
Upload the result to the bucket.
Google Dataflow solution
You can also do it with Dataflow for BigQuery (keep in mind it’s still in beta).
Create a BigQuery dataset and table.
Create a Dataflow instance, from the template Cloud Storage Text to BigQuery.
Create a Javascript file with the logic to transform the text.
Upload your files in Json format to the bucket.
Dataflow will read the Json file, execute the Javascript code and append the new data to the BigQuery dataset.
At last, export the BigQuery query result to Cloud Storage.

How can we export a cassandra table into a csv format using its snapshots file

I have taken snapshot of a cassandra table . Following are the files generated :-
manifest.json mc-10-big-Filter.db mc-10-big-TOC.txt mc-11-big-Filter.db mc-11-big-TOC.txt mc-9-big-Filter.db mc-9-big-TOC.txt
mc-10-big-CompressionInfo.db mc-10-big-Index.db mc-11-big-CompressionInfo.db mc-11-big-Index.db mc-9-big-CompressionInfo.db mc-9-big-Index.db schema.cql
mc-10-big-Data.db mc-10-big-Statistics.db mc-11-big-Data.db mc-11-big-Statistics.db mc-9-big-Data.db mc-9-big-Statistics.db
mc-10-big-Digest.crc32 mc-10-big-Summary.db mc-11-big-Digest.crc32 mc-11-big-Summary.db mc-9-big-Digest.crc32 mc-9-big-Summary.db
Is there a way to use these files to extract data of the table into a csv file .
Yes, you can do that with the sstable2json tool.
Use the tool against the *Data.db file
This outputs in JSON format. You need to convert to CSV after.

copy and decompress .tar file with Azure Data Factory

I m trying to copy and decompress .tar file from FTP to Azure Data Lake Store.
.tar file contains HTML files. In the copy activity, on a dataset, i select Compression type GZipDeflate, but I wonder what file format do I need to use? Is it supported to do such I thing without custom activity?
Unfortunately, Data factory doesn't support decompression of .tar files. The supported types for ftp are GZip, Deflate, BZip2, and ZipDeflate. (as seen here: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#compression-support).
A solution may be to save the files in one of the supported formats, or try a custom activity as was explained here, although I'm not sure if it was for data factory v1 or v2: Import .tar file using Azure Data Factory
Hope this helped!
So its true that there is no way just to decompress .tar files with ADF or ADL Analytics, but there is an option to take a content from every file in .tar file and save as an output in U-SQL.
I have a scenario that I need to take content from html files inside the .tar file, so i just created html extractor that will take stream content of each html file in .tar file and save in a U-SQL output variable.
Maybe this can help someone who has a similar use case.
I used SharpCompress.dll for extracting and looping over .tar files in c#.

Appending to a text file in S3

I know how to write and read from a file in S3 using boto. I'm wondering if there is a way to append to a file without having to download the file and re-upload an edited version?
There is no way to append data to an existing object in S3. You would have to grab the data locally, add the extra data, and then write it back to S3.

Resources