Azure Python SDK data tables - python-3.x

I need help getting through this workflow.
I have 2 storage accounts, which I will call storage1 and storage2.
storage1 contains a list of tables with some data in them, and I would like to loop through all those tables and copy their content into storage2. I tried AzCopy but had no luck, as this feature is only available in AzCopy v7.3 and I couldn't find that version for macOS on an M1. The other option is Data Factory, but it is too complex for what I want to achieve. So I decided to go with the Azure Python SDK.
As the library I am using TableServiceClient from azure.data.tables.
The code I wrote looks like this:
from azure.data.tables import TableServiceClient

my_conn_str_out = 'storage1-Conn-Str'
table_service_client_out = TableServiceClient.from_connection_string(my_conn_str_out)

list_table = []
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)

my_conn_str_in = 'Storage2-Conn-str'
table_service_client_in = TableServiceClient.from_connection_string(my_conn_str_in)

for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
This is how I structured my code:
for table in table_service_client_out.list_tables():
    list_table.append(table.table_name)
I loop through all the tables in the storage account and append their names to a list.
then:
for new_tables in table_service_client_out.list_tables():
    table_service_client_in.create_table_if_not_exists(new_tables.table_name)
    print(f'tables created successfully {new_tables.table_name}')
Here I create the same tables in storage2.
So far everything works just fine.
What I would like to achieve now is to query all the data in each table in storage1 and pass it to the respective table in storage2.
According to the Microsoft documentation, I can query tables using this:
query = table_service_client_out.query_tables(filter=table)
So I integrated it into my loop like this:
for table in table_service_client_out.list_tables():
    query = table_service_client_out.query_tables(filter=table)
    list_table.append(table.table_name)
    print(query)
When I run my Python code, I get back the query object's memory address rather than the data in the tables:
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fbb0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8f7f0>
<iterator object azure.core.paging.ItemPaged at 0x7fcd90c8fd60>
I was wondering if there is a way to query all the data in my tables and copy it over to storage2.
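(Side note on the snippet above: query_tables enumerates tables rather than rows, and ItemPaged is a lazy iterator, so printing it only shows the object itself; iterating it yields the items. A minimal, untested sketch of reading and copying the rows with the same azure.data.tables client, reusing the placeholder connection strings from above, might look like this:)
from azure.data.tables import TableServiceClient

svc_out = TableServiceClient.from_connection_string('storage1-Conn-Str')
svc_in = TableServiceClient.from_connection_string('Storage2-Conn-str')

for table in svc_out.list_tables():
    source_client = svc_out.get_table_client(table.table_name)
    # create_table_if_not_exists returns a TableClient for the target table
    target_client = svc_in.create_table_if_not_exists(table.table_name)
    # list_entities() pages through every row of the source table
    for entity in source_client.list_entities():
        # each entity carries its PartitionKey and RowKey, so it can be written as-is
        target_client.upsert_entity(entity)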

Try this:
from azure.cosmosdb.table.tableservice import TableService, ListGenerator

table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')

# query 100 items per request, in case loading all the data at once consumes too much memory
query_size = 100

# save data to storage2 and check whether there is data left in the current table; if yes, recurse
def queryAndSaveAllDataBySize(tb_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data: " + str(item) + " into table: " + tb_name)
        table_in.insert_entity(tb_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
for tb in tbs_out:
    # create a table with the same name in storage2
    table_service_in.create_table(tb.name)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, data, table_service_out, table_service_in, query_size)
Of course, this is a simple demo for your requirement. For more efficiency, you can also query the table data by partition key and commit the entities in batches; a rough sketch of that follows.
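As an illustration of that batching idea (an addition, not part of the demo above): the same azure.cosmosdb.table package exposes TableBatch, a batch may hold at most 100 entities, and all entities in one batch must share the same PartitionKey, so entities queried per partition key could be committed roughly like this:
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.tablebatch import TableBatch

table_service_in = TableService(account_name='', account_key='')

def commit_in_batches(table_name, entities, batch_size=100):
    # all entities added to a single TableBatch must share the same PartitionKey
    batch = TableBatch()
    count = 0
    for entity in entities:
        batch.insert_or_replace_entity(entity)
        count += 1
        if count == batch_size:
            table_service_in.commit_batch(table_name, batch)
            batch = TableBatch()
            count = 0
    if count:
        # commit the remainder
        table_service_in.commit_batch(table_name, batch)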
Let me know if you have any more questions.

Related

How do you setup a Synapse Serverless SQL External Table over partitioned data?

I have set up a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""

# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)

df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region and written it back to my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All of that works fine, as you can see. So far I have only found a way to query data from an exact partition using OPENROWSET, like this...
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
        FORMAT = 'PARQUET'
    ) AS [result]
I want to set up a Serverless SQL external table over the partitioned data, so that when people run a query and use "WHERE country_region = x" it will only read the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function like this, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without the filter on that column.
CREATE VIEW MyView
AS
SELECT
    *, [result].filepath(1) AS country_region
FROM
    OPENROWSET(
        BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
        FORMAT = 'PARQUET'
    ) AS [result]
GO

SELECT * FROM MyView WHERE country_region = 'Afghanistan'

Azure table backup retrieve more than 1000 rows

I hope somebody can help me debug this issue.
I have the following script:
from azure.cosmosdb.table.tableservice import TableService, ListGenerator
from azure.storage.blob import BlobServiceClient
from datetime import date
from datetime import *

def queryAndSaveAllDataBySize(tb_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data: " + str(item) + " into table: " + tb_name)
        table_in.insert_or_replace_entity(tb_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
for tb in tbs_out:
    # create a table with the same name in storage2
    table_service_in.create_table(table_name=tb.name, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, data, table_service_out, table_service_in, query_size)
This code checks the tables in storageA, copies them, and creates the same tables in storageB; thanks to the marker, I can get the x_ms_continuation token whenever a request returns more than 1000 rows.
It goes without saying that this works just fine as it is.
But yesterday I was trying to make some changes to the code, as follows:
If storageA has a table named TEST, in storageB I want to create a table named TEST20210930, basically the table name from storageA plus today's date.
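(For reference, a suffix in that form could be built with something like the snippet below; this is only an illustration, not part of the original script.)
from datetime import date

today = date.today().strftime('%Y%m%d')  # e.g. '20210930' on 30 September 2021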
This is where the code starts breaking down:
table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')
query_size = 100

# save data to storage2 and check whether there is data left in the current table; if yes, recurse
def queryAndSaveAllDataBySize(tb_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data: " + str(item) + " into table: " + tb_name)
        table_in.insert_or_replace_entity(tb_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)
for tb in tbs_out:
    table = tb.name + today
    print(target_connection_string)
    # create the renamed table in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(table, data, table_service_out, table_service_in, query_size)
What happens here is that the code runs up to the query_size limit, but then fails saying that the table was not found.
I am a bit confused here; maybe somebody can help me spot my error.
Please, if you need more info, just ask.
Thank you so so so much.
HOW TO REPRODUCE:
In the Azure portal, create 2 storage accounts: StorageA and StorageB.
In StorageA, create a table and fill it with data, more than 100 rows (based on the query_size).
Set the configuration endpoints: table_service_out = StorageA and table_service_in = StorageB.
I believe the issue is with the following line of code:
data = table_out.query_entities(table_name=tb_name,num_results=query_size,marker=resp_data.next_marker)
If you notice, tb_name is the name of the table in your target account, which is obviously not present in your source account. Because you're querying a table that does not exist, you're getting this error.
To fix this, you should also pass the name of the source table to queryAndSaveAllDataBySize and use that when querying entities in that function.
UPDATE
Please take a look at the code below:
table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')
query_size = 100

# save data to storage2 and check whether there is data left in the current table; if yes, recurse
def queryAndSaveAllDataBySize(source_table_name, target_table_name, resp_data: ListGenerator, table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        # remove etag and Timestamp appended by the table service
        del item.etag
        del item.Timestamp
        print("insert data: " + str(item) + " into table: " + target_table_name)
        table_in.insert_or_replace_entity(target_table_name, item)
    if resp_data.next_marker:
        # follow the continuation marker against the source table, not the renamed target table
        data = table_out.query_entities(table_name=source_table_name, num_results=query_size, marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)
for tb in tbs_out:
    table = tb.name + today
    print(target_connection_string)
    # create the renamed table in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table, data, table_service_out, table_service_in, query_size)

Autocreate tables in Bigquery for multiple CSV files

I want to generate tables automatically in BigQuery whenever a file is uploaded to a storage bucket, using a Cloud Function in Python.
For example, if a sample1.csv file is uploaded to the bucket, then a sample1 table should be created in BigQuery.
How do I automate this with a Cloud Function in Python? I tried the code below, but it only generated one table and all the data got appended to it. How should I proceed?
def hello_gcs(event, context):
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")
It sounds like you need to do three things (a sketch follows the list):
Extract the name of the CSV file/object from the notification event you're receiving to fire your function.
Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
Update the uri in your example code to only use the single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS to the table.
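A minimal sketch of what that could look like in a background-triggered Cloud Function is below. It assumes the event dict carries the uploaded object's bucket and name (as it does for google.storage.object.finalize triggers), and the project/dataset names are placeholders taken from the question:
from google.cloud import bigquery

def hello_gcs(event, context):
    file_name = event['name']    # e.g. 'sample1.csv'
    bucket = event['bucket']
    if not file_name.endswith('.csv'):
        return

    client = bigquery.Client()

    # Derive the table name from the file name: sample1.csv -> sample1
    table_name = file_name.rsplit('.', 1)[0]
    table_id = f"test_project.test_dataset.{table_name}"  # placeholder project/dataset

    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )

    # Load only the file that triggered the function, not every CSV in the bucket.
    uri = f"gs://{bucket}/{file_name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # Waits for the job to complete.
    print(f"Loaded {file_name} into {table_id}.")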

how to avoid duplication in BigQuery by streaming insert

I made a function that inserts .CSV data into BigQuery every 5~6 seconds. I've been looking for a way to avoid duplicating the data in BigQuery after inserting. I want to remove data that has the same luid, but I have no idea how to remove it, so: is it possible to check whether each row of the .CSV already exists in the BigQuery table before inserting?
I passed the row_ids parameter to avoid duplicate luid values, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
import csv
import time

import schedule
from google.cloud import bigquery

def stream_upload():
    # BigQuery
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name

    json_rows = []
    with open('./test.csv', 'r') as f:
        for line in csv.DictReader(f):
            del line[None]
            line_json = dict(line)
            json_rows.append(line_json)

    errors = client.insert_rows_json(
        full_table_name, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")

schedule.every(0.5).seconds.do(stream_upload)

while True:
    schedule.run_pending()
    time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view on top of this table that performs the deduping, or keep an external cache of luids, check whether they have already been written to BigQuery before writing, and update the cache after writing new data. This could be as simple as a file cache, or you could use an additional database; a sketch of the cache approach follows.
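As a rough sketch of that second idea, an external cache of luids (here just a local JSON file; the cache path and table name are placeholders):
import csv
import json
import os

from google.cloud import bigquery

SEEN_PATH = './seen_luids.json'   # hypothetical local cache file
FULL_TABLE_NAME = 'test.test'     # placeholder dataset.table from the question

def load_seen():
    if os.path.exists(SEEN_PATH):
        with open(SEEN_PATH) as f:
            return set(json.load(f))
    return set()

def save_seen(seen):
    with open(SEEN_PATH, 'w') as f:
        json.dump(sorted(seen), f)

def stream_upload():
    client = bigquery.Client()
    seen = load_seen()
    json_rows = []
    with open('./test.csv') as f:
        for line in csv.DictReader(f):
            row = dict(line)
            row.pop(None, None)
            if row['luid'] in seen:
                continue          # skip rows that were already streamed
            json_rows.append(row)
    if not json_rows:
        return
    errors = client.insert_rows_json(
        FULL_TABLE_NAME, json_rows, row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        seen.update(row['luid'] for row in json_rows)
        save_seen(seen)
    else:
        print("Encountered errors while inserting rows: {}".format(errors))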

Need better approach to load oracle blob data into Mongodb collection using Gridfs

Recently, I started working on a new project where I need to transfer Oracle table data into MongoDB collections.
The Oracle table contains one BLOB datatype column.
I wanted to transfer the Oracle BLOB data into MongoDB using GridFS, and I even succeeded, but I am unable to scale it up.
If I use the same script for 10k or 50k records, it takes a very, very long time.
Please suggest where I can improve, or whether there is a better way to achieve my goal.
Thank you in advance.
Please find below the sample code I am using to load a small amount of data.
from pymongo import MongoClient
import cx_Oracle
from gridfs import GridFS
import pickle
import sys

client = MongoClient('localhost:27017/sample')
dbm = client.sample
db = <--oracle connection----->
cursor = db.cursor()

def get_notes_file_sys():
    return GridFS(dbm, 'notes')

def save_data_in_file(fs, note, file_name):
    gridin = None
    file_ids = {}
    data_blob = pickle.dumps(note['file_content_blob'])
    del note['file_content_blob']
    gridin = fs.open_upload_stream(file_name, chunk_size_bytes=261120, metadata=note)
    gridin.write(data_blob)
    gridin.close()
    file_ids['note_id'] = gridin._id
    return file_ids

# ---------------------------Uploading files start---------------------------------------
fs = get_notes_file_sys()
query = ("""SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only""")
cursor.execute(query)
rows = cursor.fetchall()
col = [co[0] for co in cursor.description]
final_arr = []
for row in rows:
    data = dict(zip(col, row))
    file_name = data['file_name']
    if data["file_content_blob"] is None:
        data["file_content_blob"] = None
    else:
        # This line below is taking the most time
        data["file_content_blob"] = data["file_content_blob"].read()
    note_id = save_data_in_file(fs, data, file_name)
    data['note_id'] = note_id
    final_arr.append(data)

dbm['notes'].insert_many(final_arr)
Two things come to mind:
Don't move to MongoDB. Just use Oracle's SODA document storage model: https://cx-oracle.readthedocs.io/en/latest/user_guide/soda.html Also take a look at Oracle's JSON DB service: https://blogs.oracle.com/jsondb/autonomous-json-database
Fetch BLOBs as bytes, which is much faster than the method you are using: https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes There is an example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py (see the sketch after this list).
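To illustrate the second point, a sketch of a cx_Oracle output type handler along the lines of the linked sample (connection details are placeholders) could look like this:
import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # return BLOB columns directly as bytes instead of LOB locators
    if default_type == cx_Oracle.BLOB:
        return cursor.var(cx_Oracle.LONG_BINARY, arraysize=cursor.arraysize)

connection = cx_Oracle.connect("user", "password", "dsn")  # placeholder credentials
connection.outputtypehandler = output_type_handler
cursor = connection.cursor()

cursor.execute("SELECT id, file_name, file_content_blob, author, created_at FROM notes")
for id_, file_name, file_content_blob, author, created_at in cursor:
    # file_content_blob arrives as a bytes object; no per-row .read() round trip is needed
    pass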
