How to avoid duplication in BigQuery with streaming inserts - python-3.x

I made a function that inserts .CSV data into BigQuery every 5~6 seconds. I've been looking for a way to avoid duplicating the data in BigQuery after inserting. I want to remove rows that share the same luid, but I have no idea how to remove them, so is it possible to check whether each row of the .CSV already exists in the BigQuery table before inserting?
I passed the row_ids parameter to avoid duplicate luid values, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
import csv
import time

import schedule
from google.cloud import bigquery

def stream_upload():
    # BigQuery client and fully qualified target table
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name
    # Read the CSV and build one dict per row for the streaming insert
    json_rows = []
    with open('./test.csv', 'r') as f:
        for line in csv.DictReader(f):
            del line[None]  # drop any extra columns parsed under the None key
            line_json = dict(line)
            json_rows.append(line_json)
    # row_ids is used as the insertId, BigQuery's best-effort de-duplication key
    errors = client.insert_rows_json(
        full_table_name, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")

schedule.every(0.5).seconds.do(stream_upload)
while True:
    schedule.run_pending()
    time.sleep(0.1)

BigQuery doesn't have a native way to deal with this. You could either create a view over this table that performs the de-duplication, or keep an external cache of luids: look up whether a luid has already been written to BigQuery before inserting, and update the cache after writing new data. The cache could be as simple as a file, or you could use an additional database. A rough sketch of the view approach is below.
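A minimal sketch of the de-duplicating view, assuming the table from the question lives at test.test and that luid is the de-duplication key (the table and view names are placeholders to adjust):
from google.cloud import bigquery

client = bigquery.Client()
# Keep only one row per luid; `test.test_deduped` is a placeholder view name.
client.query("""
    CREATE OR REPLACE VIEW `test.test_deduped` AS
    SELECT * EXCEPT(rn)
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY luid ORDER BY luid) AS rn
        FROM `test.test`
    )
    WHERE rn = 1
""").result()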

Related

Using "UPDATE" and "SET" in Python to Update Snowflake Table

I have been using Python to read and write data to Snowflake for some time now, to a table I have full update rights on, using a Snowflake helper class my colleague found on the internet. Please see below for the class I have been using (with my personal Snowflake connection information abstracted) and a simple read query that works, given you have a 'TEST' table in your schema.
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine, text
import keyring
import pandas as pd

# Pull the username and password used to connect to Snowflake
stored_username = keyring.get_password('my_username', 'username')
stored_password = keyring.get_password('my_password', 'password')

class SNOWDBHelper:
    def __init__(self):
        self.user = stored_username
        self.password = stored_password
        self.account = 'account'
        self.authenticator = 'authenticator'
        self.role = stored_username + '_DEV_ROLE'
        self.warehouse = 'warehouse'
        self.database = 'database'
        self.schema = 'schema'

    def __connect__(self):
        self.url = URL(
            user=self.user,
            password=self.password,
            account=self.account,
            authenticator=self.authenticator,
            role=self.role,
            warehouse=self.warehouse,
            database=self.database,
            schema=self.schema
        )
        self.engine = create_engine(self.url)
        self.connection = self.engine.connect()

    def __disconnect__(self):
        self.connection.close()

    def read(self, sql):
        self.__connect__()
        result = pd.read_sql_query(sql, self.engine)
        self.__disconnect__()
        return result

    def write(self, wdf, tablename):
        self.__connect__()
        wdf.to_sql(tablename.lower(), con=self.engine, if_exists='append', index=False)
        self.__disconnect__()

# Initiate the SNOWDBHelper
SNOWDB = SNOWDBHelper()
query = """SELECT * FROM """ + 'TEST'
snow_table = SNOWDB.read(query)
I now need to update an existing Snowflake table, and my colleague suggested I could use the read function to send the query containing the update SQL to my Snowflake table. So I adapted an update query that I use successfully in the Snowflake UI and used the read function to send it to Snowflake. It actually tells me that the relevant rows in the table have been updated, but they have not. Please see below for the update query I use to attempt to change a field "field" in the "test" table to "X", and the success message I get back. I am not thrilled with this hacky update attempt overall (where the table update is a side effect of sorts), but could someone please help with a method to update within this framework?
# Query I actually store in file: '0-Query-Update-Effective-Dating.sql'
UPDATE "Database"."Schema"."Test" AS UP
SET UP.FIELD = 'X'
# Read the query in from file and utilize it
update_test = open('0-Query-Update-Effective-Dating.sql')
update_query = text(update_test.read())
SNOWDB.read(update_query)
# Returns a message with updated-row counts, but no rows are actually updated:
#    number of rows updated  number of multi-joined rows updated
# 0                     316                                    0
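A minimal sketch of one possible way to run the UPDATE within this helper framework; this is an assumption about the fix, not part of the original post. It reuses the helper only to build the SQLAlchemy engine, then runs the statement inside engine.begin(), which commits on successful exit (unlike the pandas read path):
from sqlalchemy import text

helper = SNOWDBHelper()
helper.__connect__()  # builds helper.url and helper.engine
with helper.engine.begin() as conn:
    # The transaction opened by engine.begin() is committed when the block exits
    with open('0-Query-Update-Effective-Dating.sql') as f:
        conn.execute(text(f.read()))
helper.__disconnect__()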

Write Pandas dataframe data to CSV file

I am trying to write a pipeline to bring Oracle database table data to AWS.
It only takes a few ms to fill the dataframe, but when I try to write the dataframe to a CSV file it takes more than 2 minutes to write 10,000 rows. In addition, one of the columns has the cx_Oracle LOB datatype.
I thought this meant it must take time to write the data, so I converted the data to categorical data, but then the operation takes more memory. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
    results = cursor.fetchmany()
    if not results:
        break
    iter_idx += 1
    df = pd.DataFrame(results)
    df.columns = field['source_field_names']
    rec_count = df.shape[0]
    t_rec_count += rec_count
    file = generate_micro_file()
    print('memory usage : \n', df.info(memory_usage='deep'))
    # sd = dd.from_pandas(df, npartitions=1)
    df.to_csv(file, encoding=str(encoding_type), header=False, index=False,
              escapechar='\\', chunksize=arraysize)
From the data access side, there is room for improvement by optimizing how the rows are fetched across the network, either by:
passing a larger numRows value to fetchmany(); see the cx_Oracle documentation on Cursor.fetchmany() (https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany),
or by increasing the value of Cursor.arraysize.
Your question didn't explain enough about your LOB usage; see the sample return_lobs_as_strings.py for optimizing LOB fetches.
Also see the cx_Oracle documentation on Tuning Fetch Performance. A short sketch of the two fetch-tuning settings is below.
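A minimal sketch, assuming `connection` is the existing cx_Oracle connection and `tablename` is the question's placeholder table; the batch sizes are only example values to benchmark against your own data:
cursor = connection.cursor()
# A larger arraysize reduces the number of network round trips per fetch
cursor.arraysize = 5000
cursor.execute('select * from tablename')
while True:
    # Ask for bigger batches than the default (which is cursor.arraysize)
    rows = cursor.fetchmany(10000)
    if not rows:
        break
    # ... build the dataframe / write the CSV from `rows` as before ...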
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
import csv

# `connection` is an existing cx_Oracle connection
with connection.cursor() as cursor:
    sql = "select * from all_objects where rownum <= 100000"
    cursor.arraysize = 10000
    with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
        writer = csv.writer(outputfile, lineterminator="\n")
        results = cursor.execute(sql)
        writer.writerows(results)
You should benchmark and choose the best solution.

python code to execute multiple queries and create csv

Hi, I am pretty new to Python. I have code that runs a MySQL query through pandas and, if there is data, converts it into a CSV. Now I need to take a step further and add another query that creates another CSV, and I am not sure of the best way to do it. Any help is appreciated, thanks.
My code is something like this:
def data_to_df(connection):
    query = """
    select * from abs
    """
    data = pd.read_sql(sql=query, con=connection)
    return data

def main():
    # DB connection and data retrieval
    cnx = database_connection(db_credentials)
    print(cnx)
    exit()  # note: this exit() stops the program before the CSV is written
    df = data_to_df(cnx)
    # Convert dataframe to text file
    df.to_csv(file_location, sep='|', na_rep='NULL', index=False, quoting=csv.QUOTE_NONE)
    print('File created successfully! \n')
How can I add another query that will be executed and will create a different file altogether?
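A minimal sketch of one way to do this, assuming the same database_connection helper, db_credentials, and pandas/csv imports as above; the second query and both file names are placeholders:
def query_to_csv(connection, query, file_location):
    # Run one query and write its result to its own CSV file
    df = pd.read_sql(sql=query, con=connection)
    df.to_csv(file_location, sep='|', na_rep='NULL', index=False,
              quoting=csv.QUOTE_NONE)
    print('File created successfully: {}'.format(file_location))

def main():
    cnx = database_connection(db_credentials)
    # Each (query, output file) pair produces a separate CSV
    jobs = [
        ("select * from abs", "abs.csv"),
        ("select * from another_table", "another_table.csv"),
    ]
    for query, file_location in jobs:
        query_to_csv(cnx, query, file_location)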

BigQuery Storage API: Best Practice for Using Client from Spark Pandas UDF?

I have a Spark script that needs to make 60 API calls for every row. Currently I am using BigQuery as a data warehouse. I was wondering whether there is a way I can use either the BigQuery API or the BigQuery Storage API to query the database from my UDF? Maybe a way to perform batch queries? Would pandas-gbq be a better solution? Each query that I need to make per row is a select count(*) from dataset.table where {...} query.
Currently I am using the BigQuery client as shown in the code snippet below, but I am not sure if this is the best way to utilize my resources. Apologies if the code is not done properly for this use case; I am new to Spark and BigQuery.
import os

import google.auth
from google.cloud import bigquery, bigquery_storage_v1beta1
from pyspark.sql.functions import pandas_udf, PandasUDFType

def clients():
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/hadoop/credentials.json'
    credentials, your_project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    # Make clients.
    bqclient = bigquery.Client(
        credentials=credentials,
        project=your_project_id,
    )
    bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
        credentials=credentials
    )
    return bqclient, bqstorageclient

def query_cache(row, query):
    # `row` is passed in by DataFrame.apply and is not used here
    bqclient, bqstorageclient = clients()
    dataframe = (
        bqclient.query(query)
        .result()
        .to_dataframe(bqstorage_client=bqstorageclient)
    )
    return dataframe['f0_'][0]

@pandas_udf(schema(), PandasUDFType.GROUPED_MAP)  # schema() is my helper returning the output schema
def calc_counts(df):
    query = "select count(*) from dataset.table where ...{some column filters}..."
    df['count'] = df.apply(query_cache, args=(query,), axis=1)  # args must be a tuple
    return df  # a GROUPED_MAP pandas UDF must return a pandas DataFrame
The simpler option is to use the spark-bigquery-connector, which lets you query BigQuery directly and get the result as a Spark dataframe. Converting this dataframe to pandas is then simple:
spark_df = spark.read.format('bigquery').option('table', table).load()
pandas_df = spark_df.toPandas()
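As a hypothetical extension of this idea (the filter and grouping columns below are placeholders, not from the original answer), the per-row count(*) queries could instead be computed once in Spark from the loaded dataframe:
from pyspark.sql import functions as F

counts_df = (
    spark_df
    .where(F.col("some_column") == "some_value")   # placeholder filter
    .groupBy("key_column")                          # placeholder grouping column
    .agg(F.count("*").alias("count"))
)
pandas_counts = counts_df.toPandas()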

Need better approach to load oracle blob data into Mongodb collection using Gridfs

Recently I started working on a new project where I need to transfer Oracle table data into MongoDB collections.
The Oracle table contains one BLOB column.
I wanted to transfer the Oracle BLOB data into MongoDB using GridFS, and I even succeeded, but I am unable to scale it up.
If I use the same script for 10k or 50k records, it takes a very long time.
Please suggest whether there is anywhere I can improve, or a better way to achieve my goal.
Thank you in advance.
Please find below the sample code I am using to load a small amount of data:
from pymongo import MongoClient
import cx_Oracle
from gridfs import GridFSBucket  # GridFSBucket provides open_upload_stream
import pickle
import sys

client = MongoClient('mongodb://localhost:27017/sample')
dbm = client.sample
db = <--oracle connection----->
cursor = db.cursor()

def get_notes_file_sys():
    return GridFSBucket(dbm, 'notes')

def save_data_in_file(fs, note, file_name):
    file_ids = {}
    data_blob = pickle.dumps(note['file_content_blob'])
    del note['file_content_blob']
    gridin = fs.open_upload_stream(file_name, chunk_size_bytes=261120, metadata=note)
    gridin.write(data_blob)
    gridin.close()
    file_ids['note_id'] = gridin._id
    return file_ids

# ---------------------------Uploading files start---------------------------------------
fs = get_notes_file_sys()
query = ("""SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only""")
cursor.execute(query)
rows = cursor.fetchall()
col = [co[0] for co in cursor.description]
final_arr = []
for row in rows:
    data = dict(zip(col, row))
    file_name = data['file_name']
    if data["file_content_blob"] is None:
        data["file_content_blob"] = None
    else:
        # This line below is taking most of the time
        data["file_content_blob"] = data["file_content_blob"].read()
    note_id = save_data_in_file(fs, data, file_name)
    data['note_id'] = note_id
    final_arr.append(data)
dbm['notes'].insert_many(final_arr)  # bulk insert of the metadata documents
Two things come to mind:
1. Don't move to MongoDB. Just use Oracle's SODA document storage model: https://cx-oracle.readthedocs.io/en/latest/user_guide/soda.html Also take a look at Oracle's JSON DB service: https://blogs.oracle.com/jsondb/autonomous-json-database
2. Fetch BLOBs as bytes, which is much faster than the method you are using: https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes There is an example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py A short sketch of this approach is below.
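A minimal sketch of the second suggestion, following the fetching-LOBs-as-bytes pattern from the cx_Oracle documentation; `db` is the question's existing Oracle connection:
import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Return BLOBs as bytes and CLOBs as strings instead of LOB locators,
    # so no per-row LOB.read() round trip is needed.
    if default_type == cx_Oracle.BLOB:
        return cursor.var(cx_Oracle.LONG_BINARY, arraysize=cursor.arraysize)
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

db.outputtypehandler = output_type_handler
cursor = db.cursor()
cursor.execute("SELECT id, file_name, file_content_blob, author, created_at FROM notes")
# file_content_blob now arrives as bytes, so data["file_content_blob"].read() is no longer needed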
