Autocreate tables in BigQuery for multiple CSV files - python-3.x

I want to generate tables automatically in BigQuery whenever a file is uploaded to a storage bucket, using a Cloud Function in Python.
For example, if a sample1.csv file is uploaded to the bucket, then a sample1 table should be created in BigQuery.
How do I automate this with a Python Cloud Function? I tried the code below, but it only generated one table and all the data got appended to that table. How should I proceed?
def hello_gcs(event, context):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")

Sounds like you need to do three things:
1. Extract the name of the CSV file/object from the notification event that fires your function.
2. Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
3. Update the uri in your example code to use only that single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS into the one table. A sketch of these changes follows.
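A minimal sketch of those three changes, assuming the project and dataset names from your own snippet are placeholders and that the table name is simply the object name with its .csv extension stripped (your real naming/sanitization rules may differ):

import os
from google.cloud import bigquery

def hello_gcs(event, context):
    client = bigquery.Client()

    bucket_name = event["bucket"]                 # 1. taken from the trigger event
    file_name = event["name"]                     # e.g. "sample1.csv"
    table_name = os.path.splitext(file_name)[0]   # "sample1"

    table_id = f"test_project.test_dataset.{table_name}"  # 2. one table per file
    uri = f"gs://{bucket_name}/{file_name}"                # 3. load only this object

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.
    print(f"Loaded {file_name} into {table_id}.")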

Related

How to unzip and load a TSV file into BigQuery from a GCS bucket

Below is the code that fetches the tsv.gz file from GCS, unzips it, and converts it into a comma-separated CSV file in order to load the CSV data into BigQuery.
from google.cloud import storage
import gcsfs
import gzip
import pandas as pd

storage_client = storage.Client(project=project_id)
blobs_list = list(storage_client.list_blobs(bucket_name))
for blobs in blobs_list:
    if blobs.name.endswith(".tsv.gz"):
        source_file = blobs.name
        uri = "gs://{}/{}".format(bucket_name, source_file)
        gcs_file_system = gcsfs.GCSFileSystem(project=project_id)
        with gcs_file_system.open(uri) as f:
            gzf = gzip.GzipFile(mode="rb", fileobj=f)
            csv_table = pd.read_table(gzf)
            csv_table.to_csv('GfG.csv', index=False)
This code does not seem to be an effective way to load data into BigQuery, as I am running into many issues. I suspect I am doing something wrong with the file conversion. Could you share your thoughts on where it went wrong?
If your file is gzip-compressed (not zip, I mean gzip) and already in Cloud Storage, don't download it, unzip it, and stream-load it.
You can load it directly, as is, into BigQuery, it's magic!! Here is a sample:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    autodetect=True,  # Automatic schema
    field_delimiter=",",  # Use "\t" if the separator in your TSV file is a tab
    skip_leading_rows=1,  # Skip the header row (it is still used for column naming)
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://{}/{}".format(bucket_name, source_file)

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.
load_job.result()  # Waits for the job to complete.
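For completeness, here is how this direct-load approach might slot into the blob loop from the question. This is a sketch, not a drop-in: project_id, bucket_name, and table_id are placeholders from the question, and the delimiter is set to tab because the source objects are TSV files. As the answer notes, BigQuery reads the gzipped file straight from GCS, so there is no need to unzip or convert it:

from google.cloud import bigquery
from google.cloud import storage

client = bigquery.Client()
storage_client = storage.Client(project=project_id)

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    field_delimiter="\t",  # the source objects are tab-separated
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
)

for blob in storage_client.list_blobs(bucket_name):
    if blob.name.endswith(".tsv.gz"):
        uri = "gs://{}/{}".format(bucket_name, blob.name)
        load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # BigQuery decompresses and loads the TSV server-side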

Data ingestion using BigQuery's Python client exceeds Cloud Functions maximum limit

I am trying to auto-ingest data from GCS into BigQuery using a bucket-triggered Cloud Function. The files are gzipped JSON files which can have a maximum size of 2 GB. The Cloud Function works fine with small files; however, it tends to time out when I give it large files in the 1-2 GB range. Is there a way to further optimize my function? Here is the code:
from google.cloud import bigquery
from google.cloud import storage

def bigquery_job_trigger(data, context):
    # Set up our GCS and BigQuery clients
    storage_client = storage.Client()
    client = bigquery.Client()

    file_data = data
    file_name = file_data["name"]
    table_id = 'BqJsonIngest'
    bucket_name = file_data["bucket"]
    dataset_id = 'dataDelivery'
    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.autodetect = True

    blob = storage_client.bucket(bucket_name).get_blob(file_name)
    file = blob.open("rb")
    client.load_table_from_file(
        file,
        table_ref,
        location="US",  # Must match the destination dataset location.
        job_config=job_config,
    )
If the file's already in GCS, there's no need to open the blob inside your function (or the need to do so is not apparent from the snippet provided).
See client.load_table_from_uri, or just check out one of the existing code samples like https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-csv#bigquery_load_table_gcs_csv-python
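A sketch of what that change might look like for the function above (the dataset and table names are carried over from the snippet and should be treated as placeholders). Because load_table_from_uri only submits a job that BigQuery executes server-side, the function never streams the 1-2 GB file through its own memory:

from google.cloud import bigquery

def bigquery_job_trigger(data, context):
    client = bigquery.Client()

    uri = "gs://{}/{}".format(data["bucket"], data["name"])
    table_id = "dataDelivery.BqJsonIngest"  # dataset/table from the original snippet

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )

    # The load job runs inside BigQuery; the function only waits for the result.
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()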

How to pass multiple delimiters in Python for BigQuery storage using Cloud Function

I am trying to load multiple CSV files into a BigQuery table. For some CSV files the delimiter is a comma and for some it is a semicolon. Is there any way to pass multiple delimiters in the job config?
job_config = bigquery.LoadJobConfig(
    autodetect=True,
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter=",",
    write_disposition="WRITE_APPEND",
    skip_leading_rows=1,
)
Thanks
Ritz
I deployed the following code in Cloud Functions for this purpose. I am using “Cloud Storage” as the trigger and “Finalize/Create” as the event type. The code works successfully for running BigQuery load jobs on comma- and semicolon-delimited files.
main.py
def hello_gcs(event, context):
    from google.cloud import bigquery
    from google.cloud import storage
    import subprocess

    # Construct a BigQuery client object.
    client = bigquery.Client()
    client1 = storage.Client()
    bucket = client1.get_bucket('Bucket-Name')
    blob = bucket.get_blob(event['name'])

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "ProjectID.DatasetName.TableName"

    with open("/tmp/z", "wb") as file_obj:
        blob.download_to_file(file_obj)

    # Replace every semicolon with a comma (the trailing "g" handles multiple per line).
    subprocess.call(["sed", "-i", "-e", "s/;/,/g", "/tmp/z"])

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        field_delimiter=",",
        write_disposition="WRITE_APPEND",
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    with open("/tmp/z", "rb") as source_file:
        source_file.seek(0)
        job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )  # Make an API request.

    job.result()  # Waits for the job to complete.
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud
google-cloud-bigquery
google-cloud-storage
Here, I am substituting “;” with “,” using the sed command. One point to note: when writing a file in Cloud Functions, we need to give the path as /tmp/file_name, as /tmp is the only place in Cloud Functions where writing to a file is allowed. It is also assumed that the files contain no commas or semicolons other than the delimiter.
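If you would rather avoid shelling out to sed, the same substitution can be done in pure Python while the blob is downloaded. This is an alternative sketch under the same assumption that semicolons only ever appear as delimiters:

# Download the object as text, normalize the delimiter, and write it to /tmp.
text = blob.download_as_text()
with open("/tmp/z", "w") as file_obj:
    file_obj.write(text.replace(";", ","))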

Azure Python SDK data tables

I need help to get through this workflow.
I have 2 storage accounts, which I name storage1 and storage2.
storage1 contains a list of tables with some data in them, and I would like to loop through all those tables and copy their content into storage2. I tried with AzCopy but had no luck, as this feature is available only in AzCopy v7.3 and I couldn't find that version for macOS on M1. The other option is Data Factory, but it is too complex for what I want to achieve. So I decided to go with the Azure Python SDK.
As a library I am using azure.data.tables and its TableServiceClient.
The code I wrote looks like this:
from azure.data.tables import TableServiceClient

my_conn_str_out = 'storage1-Conn-Str'
table_service_client_out = TableServiceClient.from_connection_string(my_conn_str_out)

list_table = []
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)

my_conn_str_in = 'Storage2-Conn-str'
table_service_client_in = TableServiceClient.from_connection_string(my_conn_str_in)

for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
This is how I structured my code.
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)
Here I loop through all the tables in the storage account and append their names to a list.
Then:
for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
Here I create the same tables in storage2.
So far everything works just fine.
What I would like to achieve now is to query all the data in each table in storage1 and pass it to the respective table in storage2.
According to the Microsoft documentation, I can query a table using this:
query = table_service_client_out.query_tables(filter=table)
So I integrated this into my loop like this:
for table in table_service_client_out.list_tables():
    query = table_service_client_out.query_tables(filter=table)
    list_table.append(table.table_name)
    print(query)
When I run my Python code, I get back the repr of the paged iterator (its memory address) and not the data in the tables:
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fbb0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8f7f0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fd60>
I was wondering if there is a way I can query all the data in my tables and pass it to storage2.
Try this:
from azure.cosmosdb.table.tableservice import TableService, ListGenerator

table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')

# Query 100 items per request, to avoid consuming too much memory by loading all data at once
query_size = 100

# Save data to storage2 and check whether there is data left in the current table; if yes, recurse
def queryAndSaveAllDataBySize(tb_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # Remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data:" + str(item) + " into table:" + tb_name)
        table_in.insert_entity(tb_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()

for tb in tbs_out:
    # Create a table with the same name in storage2
    table_service_in.create_table(tb.name)
    # First query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, data, table_service_out, table_service_in, query_size)
Of course, this is a simple demo for your requirement. For more efficiency, you can also query table data by partition key and commit the entities in batches.
Let me know if you have any more questions.
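If you prefer to stay on the azure.data.tables SDK that the question already uses (rather than the older azure.cosmosdb.table package), a minimal sketch of the same copy could look like the following. It upserts entity by entity with no batching, so treat it as a starting point rather than an optimized implementation; the connection strings are the question's placeholders:

from azure.data.tables import TableServiceClient

table_service_client_out = TableServiceClient.from_connection_string('storage1-Conn-Str')
table_service_client_in = TableServiceClient.from_connection_string('Storage2-Conn-str')

for table in table_service_client_out.list_tables():
    name = table.table_name  # newer releases of azure-data-tables expose this as table.name

    # create_table_if_not_exists returns a TableClient for the destination table
    destination = table_service_client_in.create_table_if_not_exists(name)
    source = table_service_client_out.get_table_client(name)

    # list_entities() pages through every row; iterating the ItemPaged yields the entities
    # (printing the iterator itself is what produced the "<iterator object ...>" output above).
    for entity in source.list_entities():
        destination.upsert_entity(entity)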

How to convert sqlite3 to csv format using API for a chatbot?

When I run my chatbot it creates a db.sqlite3 file in the backend for storing all the conversations. I want to convert this db.sqlite3 file into CSV files using an API. How should I implement it in Python? The image contains the type of file.
There are multiple tables in the db file associated with ChatterBot. They are conversation_association, conversation, response, statement, tag_association and tag. Out of all these tables, only the response and statement tables have proper data (at least in my case). However, I converted all the tables into CSV, so you may find some empty CSV files too.
import sqlite3, csv

db = sqlite3.connect("chatterbot-database")  # enter your db name here
cursor = db.cursor()

# Fetch table names from the db
tables = [table[0] for table in cursor.execute("select name from sqlite_master where type = 'table'")]

for table in tables:
    with open('%s.csv' % table, 'w') as fd:
        csvwriter = csv.writer(fd)
        for data in cursor.execute("select * from '%s'" % table):  # get data from each table
            csvwriter.writerow(data)
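One small extension, in case you also want column headers in each CSV: after a SELECT has run, cursor.description carries the column names, so you can write them as the first row. This is an addition to the answer above, not part of it:

import sqlite3, csv

db = sqlite3.connect("chatterbot-database")
cursor = db.cursor()
tables = [t[0] for t in cursor.execute("select name from sqlite_master where type = 'table'")]

for table in tables:
    rows = cursor.execute("select * from '%s'" % table)
    with open('%s.csv' % table, 'w', newline='') as fd:
        csvwriter = csv.writer(fd)
        # The first element of each cursor.description tuple is the column name
        csvwriter.writerow(col[0] for col in cursor.description)
        csvwriter.writerows(rows)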
