How to unzip and load a TSV file into BigQuery from a GCS bucket - python-3.x

Below is the code to fetch the tsv.gz file from GCS, unzip it, and convert it into a comma-separated csv file in order to load the csv data into BigQuery.
import gzip

import gcsfs
import pandas as pd
from google.cloud import storage

storage_client = storage.Client(project=project_id)
blobs_list = list(storage_client.list_blobs(bucket_name))
for blobs in blobs_list:
    if blobs.name.endswith(".tsv.gz"):
        source_file = blobs.name
        uri = "gs://{}/{}".format(bucket_name, source_file)
        gcs_file_system = gcsfs.GCSFileSystem(project=project_id)
        with gcs_file_system.open(uri) as f:
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            csv_table = pd.read_table(gzf)  # read_table defaults to a tab separator
            csv_table.to_csv('GfG.csv', index=False)
The code doesn't seem effective for loading the data into BQ, as I am running into many issues. I think I am doing something wrong with the conversion of the file. Please share your thoughts on where it went wrong.

If your file is gzipped (not zipped, I mean gzip) and already in Cloud Storage, don't download it, unzip it, and stream-load it.
You can load it directly, as is, into BigQuery, it's magic!! Here is a sample:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # Automatic schema detection
    field_delimiter=",",  # Use "\t" if your separator is a tab, as in a TSV file
    skip_leading_rows=1,  # Skip the header row (but keep it for the column naming)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)

uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.
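
For the original layout (several .tsv.gz objects in one bucket), the listing loop from the question can be combined with this direct load. Below is a minimal sketch, assuming project_id, bucket_name and table_id are placeholders you replace and that every matching object shares the same schema:

from google.cloud import bigquery, storage

# Placeholder values; replace with your own project, bucket, and table.
project_id = "your-project"
bucket_name = "your-bucket"
table_id = "your-project.your_dataset.your_table_name"

storage_client = storage.Client(project=project_id)
bq_client = bigquery.Client(project=project_id)

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    field_delimiter="\t",  # tab-separated source files
    skip_leading_rows=1,   # header row used only for column naming
    source_format=bigquery.SourceFormat.CSV,  # gzip-compressed CSV/TSV is read directly
)

for blob in storage_client.list_blobs(bucket_name):
    if blob.name.endswith(".tsv.gz"):
        uri = "gs://{}/{}".format(bucket_name, blob.name)
        load_job = bq_client.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # wait for each load job to finish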

Related

Data ingestion using BigQuery's Python client exceeds the Cloud Functions maximum limit

I am trying to auto-ingest data from GCS into BigQuery using a bucket-triggered Cloud Function. The files are gzipped JSON files which can have a maximum size of 2 GB. The Cloud Function works fine with small files; however, it tends to time out when I give it large files in the 1 to 2 GB range. Is there a way to further optimize my function? Here is the code:
from google.cloud import bigquery, storage

def bigquery_job_trigger(data, context):
    # Set up our GCS and BigQuery clients
    storage_client = storage.Client()
    client = bigquery.Client()

    file_data = data
    file_name = file_data["name"]
    table_id = 'BqJsonIngest'
    bucket_name = file_data["bucket"]
    dataset_id = 'dataDelivery'

    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True

    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    file = blob.open("rb")
    client.load_table_from_file(
        file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )
If the file's already in GCS, there's no need to open the blob inside your function (or, at least, the need to do so is not apparent from the snippet provided).
See client.load_table_from_uri, or just check out one of the existing code samples like https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-csv#bigquery_load_table_gcs_csv-python
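
A hedged sketch of that suggestion, reusing the dataset and table names from the question and a placeholder project id: the function hands BigQuery the gs:// URI instead of streaming the bytes itself, so the large gzipped object never has to pass through the function's memory:

from google.cloud import bigquery

def bigquery_job_trigger(data, context):
    """Bucket-triggered Cloud Function that loads the new object by URI."""
    client = bigquery.Client()

    # Placeholder project id; dataset and table names taken from the question.
    table_id = "your-project.dataDelivery.BqJsonIngest"

    uri = "gs://{}/{}".format(data["bucket"], data["name"])

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # BigQuery reads the gzipped JSON directly from GCS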

How to pass multiple delimiters in Python for BigQuery storage using a Cloud Function

I am trying to load multiple csv files into a BigQuery table. For some csv files the delimiter is a comma and for some it is a semicolon. Is there any way to pass multiple delimiters in the job config?
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    write_disposition="WRITE_APPEND",
    skip_leading_rows=1,
)
Thanks
Ritz
I deployed the following code in Cloud Functions for this purpose. I am using “Cloud Storage” as the trigger and “Finalize/Create” as the event type. The code works successfully for running BigQuery load jobs on comma- and semicolon-delimited files.
main.py
def hello_gcs(event, context):
    from google.cloud import bigquery
    from google.cloud import storage
    import subprocess

    # Construct BigQuery and Storage client objects.
    client = bigquery.Client()
    client1 = storage.Client()
    bucket = client1.get_bucket('Bucket-Name')
    blob = bucket.get_blob(event['name'])

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "ProjectID.DatasetName.TableName"

    with open("/tmp/z", "wb") as file_obj:
        blob.download_to_file(file_obj)

    # Replace every semicolon with a comma (the "g" flag makes sed replace
    # all occurrences on each line, not just the first one).
    subprocess.call(["sed", "-i", "-e", "s/;/,/g", "/tmp/z"])

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        field_delimiter=",",
        write_disposition="WRITE_APPEND",
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    with open("/tmp/z", "rb") as source_file:
        source_file.seek(0)
        job = client.load_table_from_file(source_file, table_id, job_config=job_config)
        # Make an API request.
        job.result()  # Waits for the job to complete.
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud
google-cloud-bigquery
google-cloud-storage
Here, I am substituting the “;” with “,” using the sed command. One point to note: when writing to a file in Cloud Functions, we need to give the path as /tmp/file_name, since /tmp is the only place in Cloud Functions where writing to a file is allowed. It is also assumed that the files contain no commas or semicolons other than the delimiter.
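
An alternative that avoids rewriting the file is to detect each file's delimiter and pass it to field_delimiter. This is only a sketch, assuming the delimiter can be sniffed from the first line of the object and that the table id below is a placeholder:

import csv

from google.cloud import bigquery, storage

def load_with_detected_delimiter(event, context):
    """Sniff the delimiter of the uploaded CSV and load it as-is."""
    storage_client = storage.Client()
    bq_client = bigquery.Client()

    blob = storage_client.bucket(event['bucket']).blob(event['name'])
    sample = blob.download_as_bytes(start=0, end=4096).decode("utf-8", errors="replace")
    # Sniffer raises if it cannot decide; here we only allow "," or ";".
    delimiter = csv.Sniffer().sniff(sample.splitlines()[0], delimiters=",;").delimiter

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        field_delimiter=delimiter,  # "," or ";" depending on the file
        write_disposition="WRITE_APPEND",
        source_format=bigquery.SourceFormat.CSV,
    )

    # Placeholder table id; replace with your own.
    table_id = "ProjectID.DatasetName.TableName"
    uri = "gs://{}/{}".format(event['bucket'], event['name'])
    bq_client.load_table_from_uri(uri, table_id, job_config=job_config).result()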

Autocreate tables in BigQuery for multiple CSV files

I want to generate tables automatically in BigQuery whenever a file is uploaded to a storage bucket, using a Cloud Function in Python.
For example, if a sample1.csv file is uploaded to the bucket, then a sample1 table should be created in BigQuery.
How can I automate this with a Cloud Function in Python? I tried the code below, but I was only able to generate one table and all the data got appended to that table. How should I proceed?
def hello_gcs(event, context):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")
Sounds like you need to do three things:
Extract the name of the CSV file/object from the notification event you're receiving to fire your function.
Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
Update the uri in your example code to only use the single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS to the table.
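A minimal sketch of those three steps, assuming the Finalize/Create event payload carries the object and bucket names and that the project and dataset below are placeholders:

import os

from google.cloud import bigquery

def hello_gcs(event, context):
    """Create a table named after the uploaded CSV file."""
    client = bigquery.Client()

    file_name = event['name']                    # e.g. "sample1.csv"
    table_name = os.path.splitext(file_name)[0]  # e.g. "sample1"

    # Placeholder project/dataset; replace with your own.
    table_id = "test_project.test_dataset.{}".format(table_name)

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )

    # Load only the object that triggered the function, not every CSV in the bucket.
    uri = "gs://{}/{}".format(event['bucket'], file_name)
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()
    print("Loaded {} into {}".format(file_name, table_id))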

Calling a REST API in Snowflake

How can I call a REST API from a notebook and store its output in Snowflake? The API call generates output files which need to be stored in Snowflake itself.
You do not have to call the API directly from Snowflake. You can load the files from your Python notebook with a connection to the Snowflake DB:
SQL code:
-- Create a destination table to store your API query results
create or replace table public.tmp_table
(page_id int, id int, status varchar, provider_status varchar, ts_created timestamp);

-- Create a new file format for your csv files
create or replace file format my_new_format type = 'csv' field_delimiter = ';' field_optionally_enclosed_by = '"' skip_header = 1;

-- Put your local file into Snowflake's temporary (user) stage
put file:///Users/Admin/Downloads/your_csv_file_name.csv @~/staged;

-- Copy the data from the stage into the table
copy into public.tmp_table from @~/staged/your_csv_file_name.csv.gz file_format = my_new_format ON_ERROR=CONTINUE;

select * from public.tmp_table;

-- Delete the temporary data
remove @~/staged/your_csv_file_name.csv.gz;
You can do the same with python:
https://docs.snowflake.com/en/user-guide/python-connector-example.html#loading-data
import snowflake.connector

target_table = 'public.tmp_table'
filename = 'your_csv_file_name'
filepath = f'/home/Users/Admin/Downloads/{filename}.csv'

conn = snowflake.connector.connect(
    user=USER,
    password=PASSWORD,
    account=ACCOUNT,
    warehouse=WAREHOUSE,
    database=DATABASE,
    schema=SCHEMA
)

# PUT compresses the local csv into <filename>.csv.gz in the user stage
conn.cursor().execute(f'put file://{filepath} @~/staged;')

result = conn.cursor().execute(f'''
    COPY INTO {target_table}
    FROM @~/staged/{filename}.csv.gz
    file_format = (format_name = 'my_new_format'
                   FIELD_OPTIONALLY_ENCLOSED_BY = '"'
                   ESCAPE_UNENCLOSED_FIELD = NONE) ON_ERROR=CONTINUE;
''')

conn.cursor().execute(f'REMOVE @~/staged/{filename}.csv.gz;')
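
To tie this back to the original question (the REST call itself), here is a rough sketch under the assumption that the API returns CSV text; https://api.example.com/report is a hypothetical endpoint and the connection parameters are the same placeholders as above:

import requests
import snowflake.connector

# Hypothetical endpoint; replace with the real API you need to call.
response = requests.get("https://api.example.com/report", timeout=60)
response.raise_for_status()

# Write the API output to a local CSV file so it can be staged.
local_path = "/tmp/api_output.csv"
with open(local_path, "w", newline="") as f:
    f.write(response.text)

# USER, PASSWORD, etc. are the same placeholder credentials as in the snippet above.
conn = snowflake.connector.connect(
    user=USER, password=PASSWORD, account=ACCOUNT,
    warehouse=WAREHOUSE, database=DATABASE, schema=SCHEMA,
)

# Stage the file and copy it into the destination table created earlier.
conn.cursor().execute(f"put file://{local_path} @~/staged;")
conn.cursor().execute("""
    COPY INTO public.tmp_table
    FROM @~/staged/api_output.csv.gz
    file_format = (format_name = 'my_new_format') ON_ERROR=CONTINUE;
""")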

How to read a csv file using pandas and Cloud Functions in GCP?

I am just trying to read a csv file which was uploaded to GCS.
I want to read a csv file which is uploaded to GCS with Cloud Functions in GCP, and I want to handle the csv data as a "DataFrame".
But I can't read the csv file using pandas.
This is the code to read the csv file on GCS using Cloud Functions.
from io import BytesIO

import pandas as pd
from google.cloud import storage as gcs

def read_csvfile(data, context):
    try:
        bucket_name = "my_bucket_name"
        file_name = "my_csvfile_name.csv"
        project_name = "my_project_name"

        # create gcs client
        client = gcs.Client(project_name)
        bucket = client.get_bucket(bucket_name)

        # create blob
        blob = gcs.Blob(file_name, bucket)
        content = blob.download_as_string()
        train = pd.read_csv(BytesIO(content))
        print(train.head())
    except Exception as e:
        print("error:{}".format(e))
When I ran my Python code, I got the following error:
No columns to parse from file
Some websites say this error means the csv file being read is empty, but I actually uploaded a non-empty csv file.
So how can I solve this problem?
Please give me your help. Thanks.
----add at 2020/08/08-------
Thank you for giving me your help!
But in the end I could not read the csv file using your code... I still get the error, No columns to parse from file.
So I tried a new way, reading the csv file as the Byte type. The new Python code to read the csv file is below.
main.py
from google.cloud import storage
import pandas as pd
import io
import csv
from io import BytesIO
def check_columns(data, context):
    try:
        object_name = data['name']
        bucket_name = data['bucket']
        storage_client = storage.Client()
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(object_name)
        data = blob.download_as_string()

        # read the uploaded csv file as the Byte type.
        f = io.StringIO(str(data))
        df = pd.read_csv(f, encoding="shift-jis")
        print("df:{}".format(df))
        print("df.columns:{}".format(df.columns))
        print("The number of columns:{}".format(len(df.columns)))
    except Exception as e:
        print("error:{}".format(e))
requirements.txt
Click==7.0
Flask==1.0.2
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
Pillow==5.4.1
qrcode==6.1
six==1.12.0
Werkzeug==0.14.1
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
The output I got is below.
df:Empty DataFrame
Columns: [b'Apple, Lemon, Orange, Grape]
Index: []
df.columns:Index(['b'Apple', 'Lemon', 'Orange', 'Grape'])
The number of columns:4
So I could only read the first record of the csv file, and it came out as df.columns!? I could not get the other records in the csv file... and the first row is not a real header but a normal record.
So how can I read the records of the csv file as a DataFrame using pandas?
Could you help me again? Thank you.
Pandas, since version 0.24.1, can directly read a Google Cloud Storage URI.
For example:
gs://awesomefakebucket/my.csv
Your service account attached to your function must have access to read the CSV file.
Please, feel free to test and modify this code.
I used Python 3.7
function.py
from google.cloud import storage
import pandas as pd

def hello_world(request):
    # it is mandatory to initialize the storage client
    client = storage.Client()

    # please change the file's URI
    temp = pd.read_csv('gs://awesomefakebucket/my.csv', encoding='utf-8')
    print(temp.head())

    return 'check the results in the logs'
requirements.txt
google-cloud-storage==1.30.0
gcsfs==0.6.2
pandas==1.1.0
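
As a side note on the follow-up snippet in the question: io.StringIO(str(data)) turns the downloaded bytes into their b'...' literal representation, which is why the header came out as b'Apple. A minimal sketch of the usual fix is to decode the bytes before parsing (shift-jis assumed, as in the question):

import io

import pandas as pd
from google.cloud import storage

def check_columns(data, context):
    storage_client = storage.Client()
    blob = storage_client.bucket(data['bucket']).blob(data['name'])

    raw = blob.download_as_string()   # bytes
    text = raw.decode("shift-jis")    # decode instead of str(raw)
    df = pd.read_csv(io.StringIO(text))

    print("df.columns:{}".format(df.columns))
    print("The number of rows:{}".format(len(df)))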