Loading data from a file in GCS to GCP firestore - python-3.x

I have written a script which loops through each record in the file and writes it to a Firestore collection.
Firestore Schema {COLLECTION.DOCUMENT.SUBCOLLECTION.DOCUMENT.SUBCOLLECTION}
'{"KEY":"1234","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1234,"SUB_DOC":{"KEY1" : :"VAL1"}}'
'{"KEY":"1235","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1235,"SUB_DOC":{"KEY1" : :"VAL1"}}'
'{"KEY":"1236","DATE":"2022-10-10","SUB_COLLECTION":{"KEY":1236,"SUB_DOC":{"KEY1" : :"VAL1"}}'
...
The file is read in the line below:
read_file = filename.download_as_string()
and then converted to a list of strings:
import json
from google.cloud import firestore

fire_client = firestore.Client(project=PROJECT)
dict_str = read_file.decode("UTF-8")
records = dict_str.split('\n')

for line in records:
    if not line.strip():
        continue  # skip the trailing empty line
    rec = json.loads(line)
    doc_ref = fire_client.collection('STATIC_COLLECTION_NAME').document(rec['KEY'])
    doc_ref.set({"KEY": int(rec['KEY']), "DATE": rec['DATE']})
    sub_ref = doc_ref.collection('STATIC_SUB_COLLECTION_NAME').document('STATIC_SUB_DOC_NAME')
    sub_ref.set(rec['SUB_COLLECTION'])
However, this job takes hours to process a 100 MB file. Is there a way to issue multiple writes at a time, for example batch-processing X records from the file and writing them to X documents and sub-collections in Firestore?
I am looking for a way to make this more efficient instead of looping over millions of records; my current script ended up with:
503 The datastore operation timed out, or the data was temporarily unavailable.

You'll want to use the bulk_writer to accumulate & send writes to Firestore
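As a hedged sketch of how that could look for the loop above, assuming the google-cloud-firestore 2.x client (which exposes client.bulk_writer()) and reusing the PROJECT and collection/document names from the question:

import json
from google.cloud import firestore

fire_client = firestore.Client(project=PROJECT)
bulk_writer = fire_client.bulk_writer()  # queues writes and sends them in rate-limited batches

for line in read_file.decode("UTF-8").split("\n"):
    if not line.strip():
        continue
    rec = json.loads(line)
    doc_ref = fire_client.collection('STATIC_COLLECTION_NAME').document(rec['KEY'])
    bulk_writer.set(doc_ref, {"KEY": int(rec['KEY']), "DATE": rec['DATE']})
    sub_ref = doc_ref.collection('STATIC_SUB_COLLECTION_NAME').document('STATIC_SUB_DOC_NAME')
    bulk_writer.set(sub_ref, rec['SUB_COLLECTION'])

bulk_writer.close()  # flush any writes still queued and wait for them to finish

Queuing the writes this way should be much faster than issuing one blocking set() call per record.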

Related

How to convert selected files in S3 bucket into snowflake stage in order to load data into snowflake using python and boto3

I need to stage files which are in an S3 bucket. First I find the latest file uploaded to the given bucket, and then I need to make those files into a stage, not the whole bucket. For example, let's say I have a bucket called topic. Inside that I have 2 folders, topic1 and topic2, and each of them has 2 newly uploaded files. In this case I need to make those newly uploaded files into a stage in order to load their data into Snowflake. I want to do this using python and boto3. I already built code to find the latest file, but I don't know how to make them into a stage. When I used the CREATE OR REPLACE STAGE command with a for loop over each file, it only created a stage for the last file, not a stage for each file. How should I do this?
def download_s3_files(self):
    s3_object = boto3.client('s3', aws_access_key_id=self.s3_acc_key, aws_secret_access_key=self.s3_sec_key)
    if self.source_as_stage:
        no_of_dir = []
        try:
            bucket = s3_object.list_objects(Bucket=self.s3_bucket, Prefix=self.file_path, Delimiter='/')
            print("object bucket list >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", bucket)
        except Exception as e:
            self.propagate_log_msg('check [%s] and Source File Location Path' % e)
        try:
            for directory in bucket['CommonPrefixes']:
                no_of_dir.append(str(directory['Prefix']).rsplit('/', 2)[-2])
            print(no_of_dir)
            no_of_dir.sort(reverse=True)
            latest_dir = no_of_dir[0]
            self.convert_source_as_stage(latest_dir)
        except Exception as e:
            print(e)
            exit(-1)

def convert_source_as_stage(self, latest_file):
    source_file_format = str(self.metadata['source_file_format']).lower() + '_format' \
        if self.metadata['source_file_format'] is not None else 'pipe_format'
    url = 's3://{bucket}/{location}/{dir_}'.format(location=self.s3_file_loc.strip("/"),
                                                   bucket=self.s3_bucket, dir_=latest_file)
    print("formatted url >>>>>>>>>>>>>>>>>>", url)
    file_name_dw = str(latest_file.rsplit('/', 1)[-1])
    print("File_Name>>>>>>>>>>>>>", file_name_dw)
    print("Source file format :", source_file_format)
    print("source url: ", url)
    self.create_stage = """
        CREATE OR REPLACE STAGE {sa}.{table} URL='{url}'
        CREDENTIALS=(AWS_KEY_ID='{access_key}' AWS_SECRET_KEY='{secret}')
        FILE_FORMAT = {file};
        // create or replace stage {sa}.{table}
        //     file_format = (type = 'csv' field_delimiter = '|' record_delimiter = '\\n');
    """.format(sa=self.ss_cd, table=self.table.lower(), access_key=self.s3_acc_key, secret=self.s3_sec_key,
               url=url, file=source_file_format, filename=str(self.metadata['source_table']))

    # ---------------- CONNECT TO SNOWFLAKE ----------------
    print("Create Stage Statement :", self.create_stage)
    con = snowflake.connector.connect(
        user=self.USER,
        password=self.PASSWORD,
        account=self.ACCOUNT,
    )
    self.propagate_log_msg("Env metadata = [%s]" % self.env_metadata)

    # ---------------- REFRESH DDL ----------------
    try:
        file_format_full_path = os.path.join(self.root, 'sql', str(source_file_format) + '.sql')
        self.create_file_format = open(file_format_full_path, 'r').read()
        self.create_schema = "CREATE schema if not exists {db_lz}.{sa}".format(sa=self.ss_cd, db_lz=self.db_lz)
        env_sql = 'USE database {db_lz}'.format(db_lz=self.db_lz)
        self.propagate_log_msg(env_sql)
        con.cursor().execute(env_sql)
        con.cursor().execute(self.create_schema)
        env_sql = 'USE schema {schema}'.format(schema=self.ss_cd)
        self.propagate_log_msg(env_sql)
        con.cursor().execute(env_sql)
        con.cursor().execute(self.create_file_format)
        con.cursor().execute(self.create_stage)
    except snowflake.connector.ProgrammingError as e:
        self.propagate_log_msg('Invalid sql, fix sql and retry')
        self.propagate_log_msg(e)
        exit()
    except KeyError:
        self.propagate_log_msg(traceback.format_exc())
        self.propagate_log_msg('deploy_ods is not set in schedule metadata, assuming it is False')
    except Exception as e:
        self.propagate_log_msg('unhandled exception, debug')
        self.propagate_log_msg(traceback.format_exc())
        exit()
    else:
        self.propagate_log_msg(
            "Successfully dropped and recreated table/stage for [{sa}.{table}]".format(sa=self.ss_cd,
                                                                                       table=self.table))
Perhaps you can take a step back and give a bigger picture of what you are trying to achieve. That will help others give good advice.
Best practice is to create one Snowflake STAGE for the whole bucket. The STAGE object then mirrors the bucket. If your setup needs e.g. different permissions for different parts of the bucket, then it can make sense to create multiple stages with different access rights.
It looks like the purpose of setting up stages is to import S3 objects into Snowflake tables. This is done with the COPY INTO <table> command, and that command has two options for selecting objects/filenames to import:
FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] )
PATTERN = '<regex_pattern>'
I suggest you put your effort into the COPY INTO <table> parameters instead of creating excessive numbers of STAGE objects in the database.
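As a rough sketch of that approach (not the poster's code): with a single bucket-wide stage, a COPY INTO run through snowflake.connector can pick out just the latest folder's files via PATTERN. The stage, schema, table, and pattern names below are hypothetical:

import snowflake.connector

con = snowflake.connector.connect(user=USER, password=PASSWORD, account=ACCOUNT)
cur = con.cursor()

# One stage for the whole bucket, created once (not one stage per file).
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_schema.topic_stage
    URL = 's3://topic/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# Load only the newest folder's files by matching their path with a regex.
cur.execute("""
    COPY INTO my_schema.my_table
    FROM @my_schema.topic_stage
    PATTERN = '.*topic1/latest_dir/.*'
    FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|')
""")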
You should also take a serious look at Snowpipe. Snowpipe imports S3 objects into Snowflake tables in near real time, using COPY INTO <table> commands triggered by S3 events, e.g. object creation. Snowpipes cost less than warehouses as they are not dedicated resources.
Simple and effective.

How to avoid header while exporting BigQuery table in to Google Storage

I have developed the code below, which exports a BigQuery table into a Google Storage bucket. I want to merge the files into a single file without a header, so that downstream processes can use the file without any issue.
def export_bq_table_to_gcs(self, table_name):
    client = bigquery.Client(project=project_name)
    print("Exporting table {}".format(table_name))
    dataset_ref = client.dataset(dataset_name, project=project_name)
    dataset = bigquery.Dataset(dataset_ref)
    table_ref = dataset.table(table_name)
    size_bytes = client.get_table(table_ref).num_bytes
    # For tables bigger than 1GB uses Google auto split, otherwise export is forced in a single file.
    if size_bytes > 10 ** 9:
        destination_uris = [
            'gs://{}/{}{}*.csv'.format(bucket_name, f'{table_name}_temp', uid)]
    else:
        destination_uris = [
            'gs://{}/{}{}.csv'.format(bucket_name, f'{table_name}_temp', uid)]
    extract_job = client.extract_table(table_ref, destination_uris)  # API request
    result = extract_job.result()  # Waits for job to complete.
    if result.state != 'DONE' or result.errors:
        raise Exception('Failed extract job {} for table {}'.format(result.job_id, table_name))
    else:
        print('BQ table(s) export completed successfully')
        storage_client = storage.Client(project=gs_project_name)
        bucket = storage_client.get_bucket(gs_bucket_name)
        blob_list = bucket.list_blobs(prefix=f'{table_name}_temp')
        print('Merging shard files into single file')
        bucket.blob(f'{table_name}.csv').compose(blob_list)
Can you please help me find a way to skip the header?
Thanks,
Raghunath.
We can avoid the header by using a job config that sets the print_header parameter to False. Sample code:
job_config = bigquery.job.ExtractJobConfig(print_header=False)
extract_job = client.extract_table(table_ref, destination_uris,
                                   job_config=job_config)
Thanks
You can use skipLeadingRows (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#externalDataConfiguration.googleSheetsOptions.skipLeadingRows)
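skipLeadingRows applies when the CSVs are read back in rather than at export time. As a hedged sketch, if the next process loads the merged file into BigQuery again, the Python client exposes the equivalent option as skip_leading_rows on a load job config (the destination table name here is hypothetical, and project_name, bucket_name and table_name are reused from the question):

from google.cloud import bigquery

client = bigquery.Client(project=project_name)
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # ignore the header row of each source file
)
load_job = client.load_table_from_uri(
    f'gs://{bucket_name}/{table_name}.csv',
    'my_dataset.my_table',  # hypothetical destination table
    job_config=load_config,
)
load_job.result()  # wait for the load to finish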

Dump series back into InfluxDB after querying with replaced field value

Scenario
I want to send data to an MQTT Broker (Cloud) by querying measurements from InfluxDB.
I have a field in the schema called status. It can either be 1 or 0. status=0 indicates that the series has not been sent to the cloud. If I get an acknowledgement from the MQTT broker, then I wish to write the queried series back into the database with status=1.
As mentioned in the InfluxDB FAQ regarding duplicate data, if a point has the same timestamp as an existing point but a different field value, the field is overwritten and the updated value is returned on the next query.
In order to test this I created the following:
CREATE DATABASE dummy
USE dummy
INSERT meas_1,type=t1 status=0,value=123 1536157064275338300
query:
SELECT * FROM meas_1
provides
time                 status  type  value
----                 ------  ----  -----
1536157064275338300  0       t1    234
now if I want to overwrite the series I do the following:
INSERT meas_1,type=t1 status=1,value=123 1536157064275338300
which will overwrite the series
time                 status  type  value
----                 ------  ----  -----
1536157064275338300  1       t1    234
(Note: this is not possible via Tags currently in InfluxDB)
Usage
Query some information using the client with "status"=0.
Restructure the JSON to be sent to the cloud.
Send the information to the cloud.
If successful, then write the output from step 1 back into the DB but with status=1.
I am using the Python 3 InfluxDBClient to create the application (MQTT + InfluxDB).
The write_points API has a batch_size parameter which requires an int as input.
I am not sure how I can use this in my application. Can someone guide me with this, or with the schema of the DB, so that I can upload actual and non-redundant information to the cloud?
The batch_size is actually the length of the list of measurements that needs to be passed to write_points.
Steps
Create client and query from measurement (here, we query gps information)
client = InfluxDBClient(database='dummy')
op = client.query('SELECT * FROM gps WHERE "status"=0', epoch='ns')
Make the ResultSet into a list:
batch = list(op.get_points('gps'))
Create an empty list for the updated points:
updated_batch = []
Parse through each measurement and change the status flag to 1. Note that numeric field values in InfluxDB default to float:
for each in batch:
    new_mes = {
        'measurement': 'gps',
        'tags': {
            'type': 'gps'
        },
        'time': each['time'],
        'fields': {
            'lat': float(each['lat']),
            'lon': float(each['lon']),
            'alt': float(each['alt']),
            'status': float(1)
        }
    }
    updated_batch.append(new_mes)
Finally, dump the points back via the client with batch_size set to the length of updated_batch:
client.write_points(updated_batch, batch_size=len(updated_batch))
This overwrites the series because it contains the same timestamps, now with the status field set to 1.

MafftCommandline and io.StringIO

I've been trying to use the Mafft alignment tool from Bio.Align.Applications. Currently, I've had success writing my sequence information out to temporary text files that are then read by MafftCommandline(). However, I'd like to avoid redundant steps as much as possible, so I've been trying to write to a memory file instead using io.StringIO(). This is where I've been having problems. I can't get MafftCommandline() to read internal files made by io.StringIO(). I've confirmed that the internal files are compatible with functions such as AlignIO.read(). The following is my test code:
from Bio.Align.Applications import MafftCommandline
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import io
from Bio import AlignIO
sequences1 = ["AGGGGC",
              "AGGGC",
              "AGGGGGC",
              "AGGAGC",
              "AGGGGG"]
longest_length = max(len(s) for s in sequences1)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences1]  # padded sequences used to test compatibility with AlignIO
ioSeq = ''
for items in padded_sequences:
    ioSeq += '>unknown\n'
    ioSeq += items + '\n'
newC = io.StringIO(ioSeq)
cLoc = str(newC).strip()
cLocEdit = cLoc[:len(cLoc)] #create string to remove < and >
test1Handle = AlignIO.read(newC, "fasta")
#test1HandleString = AlignIO.read(cLocEdit, "fasta") #fails to interpret cLocEdit string
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
test1Handle1 = AlignIO.read("msa_example.fasta", "fasta") #alignIO same for both #demonstrates working AlignIO
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
mafft_cline1 = MafftCommandline(mafft_exe, input=cLocEdit) #fails to read string (same as AlignIO)
mafft_cline2 = MafftCommandline(mafft_exe, input=newC)
stdout, stderr = mafft_cline()
print(stdout) #corresponds to MafftCommandline with input file
stdout1, stderr1 = mafft_cline1()
print(stdout1) #corresponds to MafftCommandline with internal file
I get the following error messages:
ApplicationError: Non-zero return code 2 from '/usr/local/bin/mafft <_io.StringIO object at 0x10f439798>', message "/bin/sh: -c: line 0: syntax error near unexpected token `newline'"
I believe this results from the angle brackets ('<' and '>') present in the file path.
ApplicationError: Non-zero return code 1 from '/usr/local/bin/mafft "_io.StringIO object at 0x10f439af8"', message '/usr/local/bin/mafft: Cannot open _io.StringIO object at 0x10f439af8.'
Attempting to remove the arrows by converting the file path to a string and indexing resulted in the above error.
Ultimately my goal is to reduce computation time. I hope to accomplish this by calling internal memory instead of writing out to a separate text file. Any advice or feedback regarding my goal is much appreciated. Thanks in advance.
I can't get MafftCommandline() to read internal files made by io.StringIO().
This is not surprising, for a couple of reasons:
As you're aware, Biopython doesn't implement Mafft; it simply provides a convenient interface to set up a call to mafft in /usr/local/bin. The mafft executable runs as a separate process that does not have access to your Python program's internal memory, including your StringIO file.
The mafft program only works with an input file; it doesn't even allow stdin as a data source. (Though it does allow stdout as a data sink.) So ultimately, there must be a file in the file system for mafft to open. Thus the need for your temporary file.
Perhaps tempfile.NamedTemporaryFile() or tempfile.mkstemp() might be a reasonable compromise.
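A minimal sketch of that compromise, assuming the mafft path and padded_sequences from the question: NamedTemporaryFile still creates a real file on disk, but Python handles the naming and cleanup, and the external mafft process gets a path it can open:

import os
import tempfile
from Bio.Align.Applications import MafftCommandline

mafft_exe = '/usr/local/bin/mafft'

# Write the FASTA records to a named temporary file that mafft can open by path.
with tempfile.NamedTemporaryFile(mode='w', suffix='.fasta', delete=False) as tmp:
    for seq in padded_sequences:
        tmp.write('>unknown\n')
        tmp.write(seq + '\n')
    tmp_path = tmp.name

mafft_cline = MafftCommandline(mafft_exe, input=tmp_path)
stdout, stderr = mafft_cline()
print(stdout)

os.unlink(tmp_path)  # remove the temporary file once mafft is done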

How to compress warc records with lzma (*.warc.xz) in python3?

I have a list of warc records. Every single item in the list is created like this:
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8','replace'))
Now, I am using *.warc.gz to store my records like this:
output_file = warc.open("my_file.warc.gz", 'wb')
And write records like this:
output_file.write_record(record) # type of record is WARCRecord
But how can I compress with lzma as *.warc.xz? I have tried replacing gz with xz when calling warc.open, but the warc package in Python 3 does not support this format. I found this approach, but I was not able to save a WARCRecord with it:
output_file = lzma.open("my_file.warc.xz", 'ab', preset=9)
header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
data = "Some string"
record = warc.WARCRecord(header, data.encode('utf-8','replace'))
output_file.write(record)
The error message is:
TypeError: a bytes-like object is required, not 'WARCRecord'
Thanks for any help.
The WARCRecord class has a write_to method, to write records to a file object.
You could use that to write records to a file created with lzma.open().
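A hedged sketch of that suggestion, reusing the header and record from the question and relying on the write_to method described above:

import lzma
import warc

header = warc.WARCHeader({
    "WARC-Type": "response",
    "WARC-Target-URI": "www.somelink.com",
}, defaults=True)
record = warc.WARCRecord(header, "Some string".encode('utf-8', 'replace'))

# lzma.open returns a binary file object; write_to serialises the record into it.
with lzma.open("my_file.warc.xz", 'wb', preset=9) as output_file:
    record.write_to(output_file)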
