Import XLS file from GCS to BigQuery - excel

I have some .xls files in Google Cloud Storage and want to use Airflow to load them into BigQuery. Can I load them directly into BigQuery, or do I need an additional library (such as pandas and xlrd) to convert the files first and then store them in BigQuery?
Thanks

BigQuery doesn't support the XLS format. The easiest way is to convert the file to CSV and load that into BigQuery.
However, I don't know how your XLS files are laid out. If a workbook has multiple sheets, you will have to process each sheet separately (for example, one CSV per sheet).
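A minimal sketch of the conversion step, assuming the workbook has already been downloaded from Cloud Storage to a local path and that pandas and xlrd are installed. The function names and the output naming scheme (`<workbook>_<sheet>.csv`) are illustrative assumptions, not part of any library:

```python
from pathlib import Path


def csv_path_for_sheet(xls_path: str, sheet_name: str) -> Path:
    """Build an output CSV path such as report_Sheet1.csv for one sheet."""
    source = Path(xls_path)
    # Replace anything that is not a letter or digit so the name is file-safe.
    safe_sheet = "".join(c if c.isalnum() else "_" for c in sheet_name)
    return source.with_name(f"{source.stem}_{safe_sheet}.csv")


def xls_to_csv_files(xls_path: str) -> list[Path]:
    """Write every sheet of an .xls workbook to its own CSV file."""
    import pandas as pd  # imported lazily; reading .xls also needs xlrd installed

    # sheet_name=None asks pandas for a dict of {sheet name: DataFrame}.
    sheets = pd.read_excel(xls_path, sheet_name=None)
    written = []
    for sheet_name, frame in sheets.items():
        out_path = csv_path_for_sheet(xls_path, sheet_name)
        frame.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

In an Airflow DAG, a function like this could run in a Python task between the download from Cloud Storage and the BigQuery load of the resulting CSV files.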

Related

parse gz file in aws s3 using python

I am trying to bulk copy tables from Snowflake to PostgreSQL. From Snowflake, I was able to extract the tables in CSV format using COPY. COPY compresses the extract in gz format in AWS S3.
The second step is to load these files into PostgreSQL. I plan to use the PostgreSQL COPY utility to ingest the data. However, I don't want to unzip the files. I would rather buffer the data directly from the gz files and pass that buffer as input to the psycopg2 copy_from function.
Is there a way to parse gz files in AWS S3 using python? Thanks in advance!
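One way the buffering idea above might be sketched: wrap the compressed byte stream in a decompressing, file-like text reader and hand that reader to the COPY call. The helper name `open_gzip_as_text` is hypothetical, and the S3 and PostgreSQL wiring in the trailing comment uses placeholder bucket, key, and table names; it is untested here and only indicates where the reader would plug in.

```python
import gzip
import io


def open_gzip_as_text(compressed_stream, encoding="utf-8"):
    """Wrap a gzip-compressed binary stream so it reads as decompressed text.

    Data is decompressed on the fly as the caller reads; nothing is written to disk.
    """
    decompressed = gzip.GzipFile(fileobj=compressed_stream, mode="rb")
    return io.TextIOWrapper(decompressed, encoding=encoding, newline="")


# Hypothetical wiring (bucket, key, table, and connection are placeholders):
#   body = boto3.client("s3").get_object(Bucket="my-bucket", Key="extract.csv.gz")["Body"]
#   with conn.cursor() as cur:
#       cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", open_gzip_as_text(body))
```

Any object with a binary `read()` method can serve as the input stream, so the same helper can be exercised locally with an `io.BytesIO` buffer.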

BigQuery howto load data from local file as content

I have a requirement in which I receive file content that I need to load into BigQuery tables. The standard API shows how to load data from a local file, but I don't see any variant of the load method that accepts the file content as a string rather than a file path. Any idea how I can achieve this?
As the source code and official documentation show, the load function only loads data from a local file or a Cloud Storage file. The allowed formats are: AVRO, CSV, JSON, ORC, and PARQUET.
The load job is created and runs your data load asynchronously. If you need immediate access to your data, insert it with the table's insert function instead, passing the rows to insert into the table:
// Insert a single row
table.insert({
  INSTNM: 'Motion Picture Institute of Michigan',
  CITY: 'Troy',
  STABBR: 'MI'
}, insertHandler);
If you want to load, for example, a CSV file, you first need to save the data to a CSV file in Node.js yourself. Then load it as a single-column CSV using the load() method, which loads the whole string as a single column.
Additionally, I can recommend Dataflow templates, e.g. Cloud Storage Text to BigQuery, which read text files stored in Cloud Storage, transform them using a JavaScript user-defined function (UDF), and write the result to BigQuery. Note that the data to load must already be stored in Cloud Storage.
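The "save the content to a file first" step is not specific to Node.js. As an illustration only, here is the same idea sketched in Python with the standard library; the helper name `write_content_to_temp_csv` is hypothetical, and the returned path is what would then be passed to a file-based load call:

```python
import tempfile


def write_content_to_temp_csv(content: str) -> str:
    """Save in-memory CSV content to a temporary file and return its path."""
    # delete=False keeps the file on disk after the handle closes, so a
    # later load call can open it by path; the caller removes it afterwards.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, encoding="utf-8", newline=""
    ) as handle:
        handle.write(content)
        return handle.name
```

The caller is responsible for deleting the temporary file once the load job has finished.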

How to convert CSV to ORC format using Azure Datafactory

I am copying comma-separated, partitioned data files into ADLS using Azure Data Factory.
The requirement is to convert the comma-separated files to ORC format with SNAPPY compression.
Is it possible to achieve this with ADF? If yes, could you please help me?
Unfortunately, Data Factory can read ORC files compressed with ZLIB or SNAPPY, but it can only write ZLIB, which is the default for the ORC file format.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#orc-format
Hope this helped!!

How can we export a cassandra table into a csv format using its snapshots file

I have taken a snapshot of a Cassandra table. The following files were generated:
manifest.json mc-10-big-Filter.db mc-10-big-TOC.txt mc-11-big-Filter.db mc-11-big-TOC.txt mc-9-big-Filter.db mc-9-big-TOC.txt
mc-10-big-CompressionInfo.db mc-10-big-Index.db mc-11-big-CompressionInfo.db mc-11-big-Index.db mc-9-big-CompressionInfo.db mc-9-big-Index.db schema.cql
mc-10-big-Data.db mc-10-big-Statistics.db mc-11-big-Data.db mc-11-big-Statistics.db mc-9-big-Data.db mc-9-big-Statistics.db
mc-10-big-Digest.crc32 mc-10-big-Summary.db mc-11-big-Digest.crc32 mc-11-big-Summary.db mc-9-big-Digest.crc32 mc-9-big-Summary.db
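Is there a way to use these files to extract the table's data into a CSV file?
Yes, you can do that with the sstable2json tool (in Cassandra 3.0 and later it was replaced by sstabledump, and the mc- prefix on your files suggests a 3.x SSTable format).
Run the tool against the *Data.db files.
It outputs JSON, which you then need to convert to CSV.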
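The JSON-to-CSV step could look roughly like the sketch below. It assumes the dump is a JSON array of partitions, each with a `partition.key` list and a `rows` list whose entries carry `cells` of `name`/`value` pairs; that layout is an assumption about the tool's output and should be checked against a real dump. The function name is hypothetical, and clustering columns are ignored for brevity.

```python
import csv
import io


def partitions_to_csv(partitions) -> str:
    """Flatten dump-style partitions into CSV text, one line per row."""
    records = []
    columns = ["partition_key"]
    for partition in partitions:
        # Composite partition keys are joined into a single column value.
        key = ":".join(str(part) for part in partition["partition"]["key"])
        for row in partition.get("rows", []):
            record = {"partition_key": key}
            for cell in row.get("cells", []):
                record[cell["name"]] = cell.get("value")
                if cell["name"] not in columns:
                    columns.append(cell["name"])
            records.append(record)

    buffer = io.StringIO()
    # restval fills in columns that a given row does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The input would typically come from `json.load()` on the tool's output file, and the returned text can be written straight to a .csv file.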

Using Pyspark how to convert Text file to CSV file

I am new to PySpark. My project requires reading a JSON file with a schema and converting it to a CSV file.
Can someone help me with how to do this using PySpark?
You can load the JSON into a DataFrame through a SparkSession and then write that DataFrame out as CSV.
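from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()
# Read the JSON input into a DataFrame, then write that DataFrame as CSV.
df = spark.read.json("path-to-txt")
df.write.csv("path-to-csv")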
