I have set up a Synapse workspace and imported the COVID-19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I then partitioned the data by country_region and wrote it back to my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All of that works fine. So far, the only way I have found to query the data is to point OPENROWSET at one exact partition, like this:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the matching partition. Is this possible, and if so, how?
You need to expose the partition value using the filepath function, as in the view below, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without a filter on that column.
CREATE VIEW MyView
As
SELECT
*, result.filepath(1) AS country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
Select * from MyView where country_region='Afghanistan'
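To make the role of filepath(1) concrete, here is a small pure-Python illustration of the idea: the value returned for a file is the text matched by the first wildcard in the BULK path. This is only a sketch of the concept, not how Synapse implements it; the helper name wildcard_value and the file name part-0000.parquet are made up for the example.

```python
import re
from typing import Optional

def wildcard_value(pattern: str, path: str, n: int) -> Optional[str]:
    """Return the text matched by the n-th '*' in pattern, or None if path does not match."""
    # Turn each '*' into a capture group that stops at a path separator.
    literal_parts = pattern.split("*")
    regex = "([^/]*)".join(re.escape(part) for part in literal_parts)
    match = re.fullmatch(regex, path)
    return match.group(n) if match else None

# Hypothetical file written by partitionBy("country_region")
pattern = "Covid19/country_region=*/*"
path = "Covid19/country_region=Afghanistan/part-0000.parquet"
print(wildcard_value(pattern, path, 1))
```

Filtering the view on country_region is then a filter on this folder-derived value, which is what lets the engine skip folders that cannot match.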
I hope somebody can help me debug this issue.
I have the following script:
from azure.cosmosdb.table.tableservice import TableService,ListGenerator
from azure.storage.blob import BlobServiceClient
from datetime import date
from datetime import *
def queryAndSaveAllDataBySize(tb_name,resp_data:ListGenerator ,table_out:TableService,table_in:TableService,query_size:int):
    for item in resp_data:
        #remove etag and Timestamp appended by table service
        del item.etag
        del item.Timestamp
        print("insert data:" + str(item) + "into table:"+ tb_name)
        table_in.insert_or_replace_entity(tb_name,item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name,num_results=query_size,marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name,data,table_out,table_in,query_size)
tbs_out = table_service_out.list_tables()
for tb in tbs_out:
    #create table with same name in storage2
    table_service_in.create_table(table_name=tb.name, fail_on_exist=False)
    #first query
    data = table_service_out.query_entities(tb.name,num_results=query_size)
    queryAndSaveAllDataBySize(tb.name,data,table_service_out,table_service_in,query_size)
This code lists the tables in storageA and creates tables with the same names in storageB, copying the entities across. Thanks to the marker, I can pick up the x_ms_continuation token when a table has more than 1000 rows per request.
This works fine as it is.
But yesterday I tried to change the code as follows:
If storageA has a table named TEST, I want to create a table named TEST20210930 in storageB, that is, the table name from storageA plus today's date.
This is where the code starts breaking down.
table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')
query_size = 100
#save data to storage2; if the current table has data left, recurse
def queryAndSaveAllDataBySize(tb_name,resp_data:ListGenerator ,table_out:TableService,table_in:TableService,query_size:int):
    for item in resp_data:
        #remove etag and Timestamp appended by table service
        del item.etag
        del item.Timestamp
        print("insert data:" + str(item) + "into table:"+ tb_name)
        table_in.insert_or_replace_entity(tb_name,item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=tb_name,num_results=query_size,marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(tb_name,data,table_out,table_in,query_size)
tbs_out = table_service_out.list_tables()
print(tbs_out)
for tb in tbs_out:
    table = tb.name + today
    print(target_connection_string)
    #create table with same name in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    #first query
    data = table_service_out.query_entities(tb.name,num_results=query_size)
    queryAndSaveAllDataBySize(table,data,table_service_out,table_service_in,query_size)
What happens here is that the code runs up to the query_size limit but then fails, saying that the table was not found.
I am a bit confused here and maybe somebody can help to spot my error.
Please if you need more info just ask
Thank you so so so much.
HOW TO REPRODUCE:
In the Azure portal, create two storage accounts, StorageA and StorageB.
In StorageA, create a table and fill it with more than 100 rows (based on query_size).
Set the configuration endpoints: table_service_out = StorageA and table_service_in = StorageB.
I believe the issue is with the following line of code:
data = table_out.query_entities(table_name=tb_name,num_results=query_size,marker=resp_data.next_marker)
If you notice, tb_name is the name of the table in your target account, which does not exist in your source account. The first query uses the source name, so the first page works; the follow-up query for the next page uses tb_name, and because that table does not exist in the source account, you get this error.
To fix this, also pass the name of the source table to queryAndSaveAllDataBySize and use it when querying entities in that function.
UPDATE
Please take a look at code below:
from azure.cosmosdb.table.tableservice import TableService,ListGenerator
from datetime import date
table_service_out = TableService(account_name='', account_key='')
table_service_in = TableService(account_name='', account_key='')
query_size = 100
# date suffix such as 20210930 (format assumed from the TEST20210930 example)
today = date.today().strftime("%Y%m%d")
#save data to storage2; if the current table has data left, recurse
def queryAndSaveAllDataBySize(source_table_name, target_table_name,resp_data:ListGenerator ,table_out:TableService,table_in:TableService,query_size:int):
    for item in resp_data:
        #remove etag and Timestamp appended by table service
        del item.etag
        del item.Timestamp
        print("insert data:" + str(item) + "into table:"+ target_table_name)
        table_in.insert_or_replace_entity(target_table_name,item)
    if resp_data.next_marker:
        #query the next page from the SOURCE table, not the target
        data = table_out.query_entities(table_name=source_table_name,num_results=query_size,marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data,table_out,table_in,query_size)
tbs_out = table_service_out.list_tables()
print(tbs_out)
for tb in tbs_out:
    table = tb.name + today
    #create table named <source name><date> in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    #first query
    data = table_service_out.query_entities(tb.name,num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table,data,table_service_out,table_service_in,query_size)
I want to generate tables automatically in BigQuery whenever a file is uploaded to a storage bucket, using a Cloud Function written in Python.
For example, if sample1.csv is uploaded to the bucket, then a sample1 table should be created in BigQuery.
How can I automate this with a Cloud Function in Python? I tried the code below, but it only generated one table and all the data was appended to that table. How should I proceed?
def hello_gcs(event, context):
    from google.cloud import bigquery
    # Construct a BigQuery client object.
    client = bigquery.Client()
    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "test_project.test_dataset.test_Table"
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )
    uri = "gs://test_bucket/*.csv"
    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.
    destination_table = client.get_table(table_id)  # Make an API request.
    print(f"Processing file: {event['name']}.")
Sounds like you need to do three things:
Extract the name of the CSV file/object from the notification event you're receiving to fire your function.
Update the table_id in your example code to set the table name based on the filename you extracted in the first step.
Update the uri in your example code to only use the single file as the input. As written, your example attempts to load data from all matching CSV objects in GCS to the table.
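The three steps above could be sketched roughly as follows. This assumes a background Cloud Function triggered by a Cloud Storage upload, whose event dict carries the "bucket" and "name" keys. The helper target_for and the rule of replacing non-word characters with underscores are illustrative choices, not something from the question; the project and dataset names are the placeholders from the question.

```python
import os
import re
from typing import Tuple

PROJECT = "test_project"   # placeholder from the question
DATASET = "test_dataset"   # placeholder from the question

def target_for(event: dict) -> Tuple[str, str]:
    """Map a storage event to (table_id, uri) for the single uploaded object."""
    name = event["name"]          # e.g. "sample1.csv"
    bucket = event["bucket"]
    stem = os.path.splitext(os.path.basename(name))[0]
    # Conservative assumption: keep only letters, digits and underscores.
    table = re.sub(r"\W", "_", stem)
    return f"{PROJECT}.{DATASET}.{table}", f"gs://{bucket}/{name}"

def hello_gcs(event, context):
    # Imported here, as in the question, so the helper above has no dependencies.
    from google.cloud import bigquery

    if not event["name"].lower().endswith(".csv"):
        return  # ignore uploads that are not CSV files

    table_id, uri = target_for(event)
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV,
    )
    # Load only the uploaded file into the table named after it.
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the job to complete
    print(f"Loaded {uri} into {table_id}.")
```

With this shape, uploading sample1.csv targets test_project.test_dataset.sample1, and each new file name gets its own table instead of appending to a single one.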
I made a function that inserts CSV data into BigQuery every 5 to 6 seconds. I've been looking for a way to avoid duplicate data in BigQuery after inserting. I want to remove rows that have the same luid, but I don't know how. Is it possible to check whether each row of the CSV already exists in the BigQuery table before inserting it?
I passed the row_ids parameter to avoid duplicate luids, but it doesn't seem to work well.
Could you give me any ideas? Thanks.
import csv
import time
import schedule
from google.cloud import bigquery
def stream_upload():
    # BigQuery
    client = bigquery.Client()
    project_id = 'test'
    dataset_name = 'test'
    table_name = "test"
    full_table_name = dataset_name + '.' + table_name
    json_rows = []
    with open('./test.csv','r') as f:
        for line in csv.DictReader(f):
            # drop surplus columns, which DictReader stores under the key None
            line.pop(None, None)
            line_json = dict(line)
            json_rows.append(line_json)
    errors = client.insert_rows_json(
        full_table_name,json_rows,row_ids=[row['luid'] for row in json_rows]
    )
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
    print("end")
schedule.every(0.5).seconds.do(stream_upload)
while True:
    schedule.run_pending()
    time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view on top of this table that performs the deduplication, or keep an external cache of luids: look up whether each luid has already been written to BigQuery before writing, and update the cache after writing new data. This could be as simple as a file cache, or you could use an additional database.
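A minimal sketch of the file-cache option, assuming luid values are strings and a single process does the writing. The helper names and the luid_cache.json file name are made up for illustration.

```python
import json
import os

def load_seen(cache_path):
    """Read the set of luids already written; a missing file means none yet."""
    if not os.path.exists(cache_path):
        return set()
    with open(cache_path) as f:
        return set(json.load(f))

def save_seen(cache_path, seen):
    """Persist the set of luids as a JSON list."""
    with open(cache_path, "w") as f:
        json.dump(sorted(seen), f)

def filter_new(rows, seen, key="luid"):
    """Return rows whose key is not in seen, also dropping repeats within the batch."""
    fresh, batch_keys = [], set()
    for row in rows:
        value = row[key]
        if value in seen or value in batch_keys:
            continue
        batch_keys.add(value)
        fresh.append(row)
    return fresh

# Intended use inside stream_upload (commented out: needs BigQuery access):
# seen = load_seen("luid_cache.json")
# fresh = filter_new(json_rows, seen)
# if fresh and client.insert_rows_json(full_table_name, fresh) == []:
#     seen.update(row["luid"] for row in fresh)
#     save_seen("luid_cache.json", seen)
```

Updating the cache only after a successful insert means a failed insert is retried on the next run rather than silently skipped.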
I recently started working on a new project where I need to transfer Oracle table data into MongoDB collections.
The Oracle table contains one BLOB column.
I wanted to transfer the BLOB data into MongoDB using GridFS, and I succeeded, but I am unable to scale it up.
If I use the same script for 10k or 50k records, it takes a very long time.
Please suggest where I can improve it, or whether there is a better way to achieve my goal.
Thank you in advance.
Below is the sample code I am using to load a small amount of data:
from pymongo import MongoClient
import cx_Oracle
from gridfs import GridFSBucket
import pickle
import sys
client = MongoClient('localhost:27017/sample')
dbm = client.sample
db = <--oracle connection----->
cursor = db.cursor()
def get_notes_file_sys():
    # open_upload_stream is a GridFSBucket method, so use a bucket named 'notes'
    return GridFSBucket(dbm, bucket_name='notes')
def save_data_in_file(fs,note,file_name):
    gridin = None
    file_ids = {}
    data_blob = pickle.dumps(note['file_content_blob'])
    del note['file_content_blob']
    gridin = fs.open_upload_stream(file_name, chunk_size_bytes=261120, metadata=note)
    gridin.write(data_blob)
    gridin.close()
    file_ids['note_id'] = gridin._id
    return file_ids
# ---------------------------Uploading files start---------------------------------------
fs = get_notes_file_sys()
query = ("""SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only""")
cursor.execute(query)
rows = cursor.fetchall()
col = [co[0] for co in cursor.description]
final_arr= []
for row in rows:
    data = dict(zip(col,row))
    file_name = data['file_name']
    if data["file_content_blob"] is None:
        data["file_content_blob"] = None
    else:
        # This below line is taking more time
        data["file_content_blob"] = data["file_content_blob"].read()
    note_id = save_data_in_file(fs,data,file_name)
    data['note_id'] = note_id
    final_arr.append(data)
# pymongo collections have insert_many, not bulk_insert
dbm['notes'].insert_many(final_arr)
Two things come to mind:
Don't move to Mongo. Just use Oracle's SODA document storage model: https://cx-oracle.readthedocs.io/en/latest/user_guide/soda.html Also take a look at Oracle's JSON DB service: https://blogs.oracle.com/jsondb/autonomous-json-database
Fetch BLOBs as bytes, which is much faster than the method you are using: https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes There is an example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
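The second suggestion relies on an output type handler, which tells the driver to return BLOB columns directly as bytes instead of LOB locators that each need a separate read() call. Below is a rough sketch of that pattern, based on the approach described in the linked cx_Oracle documentation. The factory function make_blob_handler is a made-up wrapper that takes the type constants as arguments so the logic can be read and exercised without a database; with cx_Oracle 8 the constants would be cx_Oracle.DB_TYPE_BLOB and cx_Oracle.DB_TYPE_LONG_RAW.

```python
def make_blob_handler(blob_type, long_raw_type):
    """Build an output type handler that fetches BLOB columns as bytes."""
    def output_type_handler(cursor, name, default_type, size, precision, scale):
        if default_type == blob_type:
            # Ask the driver to fetch this column as LONG RAW, i.e. plain bytes.
            return cursor.var(long_raw_type, arraysize=cursor.arraysize)
        return None  # leave every other column type unchanged
    return output_type_handler

# Usage with cx_Oracle (commented out: needs a live connection):
# import cx_Oracle
# db.outputtypehandler = make_blob_handler(
#     cx_Oracle.DB_TYPE_BLOB, cx_Oracle.DB_TYPE_LONG_RAW)
# cursor = db.cursor()
# cursor.execute(query)
# rows = cursor.fetchall()   # file_content_blob now arrives as bytes
```

With the handler installed, the per-row .read() call in the loop is no longer needed, since the column value is already a bytes object. The documentation notes that this approach applies to LOBs up to 1 GB.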