I'm new to Apache Beam, so I'm struggling a bit with the following scenario:
Pub/Sub topic using Stream mode
Transform to take out customerId
Parallel PCollection with Transform/ParDo that fetches data from Firestore based on the "customerId" received in the Pub/Sub Topic (using Side Input)
...
The ParDo transform that tries to fetch the Firestore data does not run at all. If I use a fixed "customerId" value everything works as expected, and with a simple ParDo that does not do a proper fetch from Firestore it also works. Am I doing something that is not supposed to work?
I'm including my code below:
class getFirestoreUsers(beam.DoFn):
def process(self, element, customerId):
print(f'Getting Users from Firestore, ID: {customerId}')
# Call function to initialize Database
db = intializeFirebase()
""" # get customer information from the database
doc = db.document(f'Customers/{customerId}').get()
customer = doc.to_dict() """
usersList = {}
# Get Optin Users
try:
docs = db.collection(
f'Customers/{customerId}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
usersList = {user.id: user.to_dict() for user in docs}
except Exception as err:
print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
print(err)
return([usersList])
Main code
def run(argv=None):
"""Build and run the pipeline."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--topic',
type=str,
help='Pub/Sub topic to read from')
parser.add_argument(
'--output',
help=('Output local filename'))
args, pipeline_args = parser.parse_known_args(argv)
options = PipelineOptions(pipeline_args)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)
users = (p | 'Create chars' >> beam.Create([
{
"clientMac": "7c:d9:5c:b8:6f:38",
"username": "Louis"
},
{
"clientMac": "48:fd:8e:b0:6f:38",
"username": "Paul"
}
]))
# Get Dictionary from Pub/Sub
data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
| 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e))
)
# Get customerId from Pub/Sub information
PcustomerId = (data | 'get customerId from Firestore' >>
beam.ParDo(lambda x: [x.get('customerId')]))
PcustomerId | 'print customerId' >> beam.Map(print)
# Get Users from Firestore
custUsers = (users | 'Read from Firestore' >> beam.ParDo(
getFirestoreUsers(), customerId=beam.pvalue.AsSingleton(PcustomerId)))
custUsers | 'print Users from Firestore' >> beam.Map(print)
To avoid errors when running the pipeline I had to initialise the "users" dictionary, which I completely ignore afterwards.
I suppose I have several errors here, so your help is much appreciated.
It's not clear to me how the users PCollection is used in the example code (since element is not processed in the process definition). I've rearranged the code a little bit with windowing and used the customer_id as the main input.
class GetFirestoreUsers(beam.DoFn):
def setup(self):
# Call function to initialize Database
self.db = intializeFirebase()
def process(self, element):
print(f'Getting Users from Firestore, ID: {element}')
""" # get customer information from the database
doc = self.db.document(f'Customers/{element}').get()
customer = doc.to_dict() """
usersList = {}
# Get Optin Users
try:
docs = self.db.collection(
f'Customers/{element}/DevicesWiFi_v3').where(u'optIn', u'==', True).stream()
usersList = {user.id: user.to_dict() for user in docs}
except Exception as err:
print(f"Error: couldn't retrieve OPTIN users from DevicesWiFi")
print(err)
return([usersList])
data = (p | 'Read from PubSub' >> beam.io.ReadFromPubSub(topic=args.topic)
| beam.WindowInto(window.FixedWindows(60))
| 'Parse JSON to Dict' >> beam.Map(lambda e: json.loads(e)))
# Get customerId from Pub/Sub information
customer_id = (data | 'get customerId from Firestore' >>
beam.Map(lambda x: x.get('customerId')))
customer_id | 'print customerId' >> beam.Map(print)
# Get Users from Firestore
custUsers = (customer_id | 'Read from Firestore' >> beam.ParDo(
    GetFirestoreUsers()))
custUsers | 'print Users from Firestore' >> beam.Map(print)
From your comment:
the data needed (customerID first and customers data after) is not ready when running the "main" PCollection with original JSON data from Pub/Sub
Did you mean the data in Firestore is not ready when reading the Pub/Sub topic?
You can always split the logic into 2 pipelines in your main function and run them one after another.
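A minimal sketch of that idea, reusing the GetFirestoreUsers DoFn from above; the Create step and the /tmp path are just stand-ins for the real Pub/Sub source and whatever intermediate storage you choose:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()

# First pipeline: extract the customer IDs and persist them somewhere.
with beam.Pipeline(options=options) as p1:
    (p1
     | 'Messages' >> beam.Create([{'customerId': '12345'}])  # stand-in for the Pub/Sub read
     | 'Get customerId' >> beam.Map(lambda x: x.get('customerId'))
     | 'Write IDs' >> beam.io.WriteToText('/tmp/customer_ids'))  # placeholder path

# The with-block above waits for p1 to finish, so the files exist here.
with beam.Pipeline(options=options) as p2:
    (p2
     | 'Read IDs back' >> beam.io.ReadFromText('/tmp/customer_ids*')
     | 'Fetch users' >> beam.ParDo(GetFirestoreUsers())
     | 'Print users' >> beam.Map(print))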
Related
I am using the Python package ctrader-fix (https://pypi.org/project/ctrader-fix/) to download historical price data from cTrader's API (https://help.ctrader.com/fix/).
The code does not make clear to me where exactly I declare the symbol (e.g. 'NatGas') through its SymbolID code number (for 'NatGas' the SymbolID code number is 10055) for which I request historical data, nor where I specify the timeframe I am interested in (e.g. 'H' for hourly data) and the number of records I want to retrieve.
[Image: section of cTrader where the FIX SymbolID number of 'NatGas' is provided]
The code that is provided is the following (I have filled in the values except the username).
config = {
'Host': '',
'Port': 5201,
'SSL': False,
'Username': '****************',
'Password': '3672075',
'BeginString': 'FIX.4.4',
'SenderCompID': 'demo.pepperstoneuk.3672025',
'SenderSubID': 'QUOTE',
'TargetCompID': 'cServer',
'TargetSubID': 'QUOTE',
'HeartBeat': '30'
}
client = Client(config["Host"], config["Port"], ssl = config["SSL"])
def send(request):
    deferred = client.send(request)
    # replace the (invisible) FIX SOH delimiter with "|" for readability
    deferred.addCallback(lambda _: print("\nSent: ", request.getMessage(client.getMessageSequenceNumber()).replace("\u0001", "|")))
def onMessageReceived(client, responseMessage): # Callback for receiving all messages
    print("\nReceived: ", responseMessage.getMessage().replace("\u0001", "|"))
# We get the message type field value
messageType = responseMessage.getFieldValue(35)
# we send a security list request after we received logon message response
if messageType == "A":
securityListRequest = SecurityListRequest(config)
securityListRequest.SecurityReqID = "A"
securityListRequest.SecurityListRequestType = 0
send(securityListRequest)
# After receiving the security list we send a market order request by using the security list first symbol
elif messageType == "y":
# We use getFieldValue to get all symbol IDs, it will return a list in this case
# because the symbol ID field is repetitive
symboldIds = responseMessage.getFieldValue(55)
if config["TargetSubID"] == "TRADE":
newOrderSingle = NewOrderSingle(config)
newOrderSingle.ClOrdID = "B"
newOrderSingle.Symbol = symboldIds[1]
newOrderSingle.Side = 1
newOrderSingle.OrderQty = 1000
newOrderSingle.OrdType = 1
newOrderSingle.Designation = "From Jupyter"
send(newOrderSingle)
else:
marketDataRequest = MarketDataRequest(config)
marketDataRequest.MDReqID = "a"
marketDataRequest.SubscriptionRequestType = 1
marketDataRequest.MarketDepth = 1
marketDataRequest.NoMDEntryTypes = 1
marketDataRequest.MDEntryType = 0
marketDataRequest.NoRelatedSym = 1
marketDataRequest.Symbol = symboldIds[1]
send(marketDataRequest)
# after receiving the new order request response we stop the reactor
# And we will be disconnected from API
elif messageType == "8" or messageType == "j":
print("We are done, stopping the reactor")
reactor.stop()
def disconnected(client, reason): # Callback for client disconnection
print("\nDisconnected, reason: ", reason)
def connected(client): # Callback for client connection
print("Connected")
logonRequest = LogonRequest(config)
send(logonRequest)
# Setting client callbacks
client.setConnectedCallback(connected)
client.setDisconnectedCallback(disconnected)
client.setMessageReceivedCallback(onMessageReceived)
# Starting the client service
client.startService()
# Run Twisted reactor, we imported it earlier
reactor.run()
Can you explain the code to me and provide instructions on how to get, for example, hourly data for NatGas (1,000 observations)?
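A hedged sketch based only on the sample above: the symbol is whatever you assign to the Symbol field of the MarketDataRequest (the provided code picks symboldIds[1] from the security list), so NatGas could be requested by the SymbolID quoted in the question. The sample does not expose a timeframe or record-count field, so those parts are left open here:
# Hypothetical change to the messageType == "y" branch above: subscribe to
# NatGas directly via its FIX SymbolID instead of the security-list entry.
marketDataRequest = MarketDataRequest(config)
marketDataRequest.MDReqID = "a"
marketDataRequest.SubscriptionRequestType = 1
marketDataRequest.MarketDepth = 1
marketDataRequest.NoMDEntryTypes = 1
marketDataRequest.MDEntryType = 0
marketDataRequest.NoRelatedSym = 1
marketDataRequest.Symbol = "10055"  # SymbolID for NatGas, taken from the question
send(marketDataRequest)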
We have code that reads some electricity meter data, which we want to push to BigQuery so that it can be visualized in Data Studio. We tried using a Cloud Function, but it seems the code generates streaming data and the Cloud Function times out, so this may not be the right use case for Cloud Functions.
# Imports assumed for this snippet (it appears to use the PyEmVue library):
from datetime import datetime
from pyemvue.enums import Scale, Unit

def test():
def print_recursive(usage_dict, info, depth=0):
for gid, device in usage_dict.items():
for channelnum, channel in device.channels.items():
name = channel.name
if name == 'Main':
name = info[gid].device_name
d = datetime.now()
t = d.strftime("%x")+' '+d.strftime("%X")
print(d.strftime("%x"),d.strftime("%X"))
res={'Gid' : gid,
'ChannelNumber' : channelnum[0],
'Name' : channel.name,
'Usage' : channel.usage,
'unit':'kwh',
'Timestamp':t
}
global resp
resp = res
print(resp)
return resp
devices = vue.get_devices()
deviceGids = []
info ={}
for device in devices:
if not device.device_gid in deviceGids:
deviceGids.append(device.device_gid)
info[device.device_gid] = device
else:
info[device.device_gid].channels += device.channels
device_usage_dict = vue.get_device_list_usage(deviceGids=deviceGids,
instant=datetime.utcnow(), scale=Scale.SECOND.value, unit=Unit.KWH.value)
print_recursive(device_usage_dict, info)
This generates electricity consumption data in real time.
Can anyone suggest which GCP service would be ideal here? Based on my research it seems Pub/Sub => BigQuery, but my question is: can we programmatically ingest data into Pub/Sub? If yes, what are the prerequisites?
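On the second question: yes, data can be published to Pub/Sub programmatically with the google-cloud-pubsub client library; the prerequisites are an existing topic and credentials with the Pub/Sub Publisher role. A minimal sketch, where the project and topic names are placeholders and res is the dict built in print_recursive above:
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "my-project" and "meter-readings" are placeholders for your project and topic.
topic_path = publisher.topic_path("my-project", "meter-readings")

def publish_reading(res):
    # Pub/Sub messages are bytes, so serialize the reading dict to JSON first.
    future = publisher.publish(topic_path, json.dumps(res).encode("utf-8"))
    return future.result()  # blocks until the message has been accepted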
I'm a new Django programmer. I wrote an API with rest_framework in Django. When this API is called, my program connects to KuCoin and gets a list of all cryptocurrencies. I want to save the symbol and name of these cryptocurrencies in the database. To save the data, I use a for loop and query the database in every iteration. My code:
for currencie in currencies:
name = currencie['name']
symbol = currencie['symbol']
active = (False, True)[symbol.endswith('USDT')]
oms = 'kucoin'
try:
obj = Instrument.objects.get(symbol=symbol, oms=oms)
setattr(obj, 'name', name)
setattr(obj, 'active', active)
obj.save()
except Instrument.DoesNotExist:
obj = Instrument(name=name, symbol=symbol,
active=active, oms=oms)
obj.save()
Querying the database in every iteration of the for loop is the problem. How can I solve it?
Is there any way in Django to save the data to the database with one query?
All my code:
class getKucoinInstrument(APIView):
def post(self, request):
try:
person = Client.objects.filter(data_provider=True).first()
person_data = ClientSerializer(person, many=False).data
api_key = person_data['api_key']
api_secret = person_data['secret_key']
api_passphrase = person_data['api_passphrase']
client = kucoin_client(api_key, api_secret, api_passphrase)
currencies = client.get_symbols()
for currencie in currencies:
name = currencie['name']
symbol = currencie['symbol']
active = (False, True)[symbol.endswith('USDT')]
oms = 'kucoin'
try:
obj = Instrument.objects.get(symbol=symbol, oms=oms)
setattr(obj, 'name', name)
setattr(obj, 'active', active)
obj.save()
except Instrument.DoesNotExist:
obj = Instrument(name=name, symbol=symbol,
active=active, oms=oms)
obj.save()
return Response({'response': 'Instruments get from kucoin'}, status=status.HTTP_200_OK)
except Exception as e:
print(e)
return Response({'response': 'Internal server error'}, status=status.HTTP_500_INTERNAL_SERVER_ERROR)
Thank you for you help.
Yes! Take a look at bulk_create() documentation. https://docs.djangoproject.com/en/4.0/ref/models/querysets/#bulk-create
If you have a db that supports ignore_conflicts parameter (all do, except Oracle), you can do this:
new_currencies = []
for currencie in currencies:
name = currencie['name']
symbol = currencie['symbol']
active = (False, True)[symbol.endswith('USDT')]
oms = 'kucoin'
new_currencies.append(Instrument(name=name, symbol=symbol,
active=active, oms=oms))
Instrument.objects.bulk_create(new_currencies, ignore_conflicts=True)
1-liner:
Instrument.objects.bulk_create(
[
Instrument(
name=currencie['name'], symbol=currencie['symbol'],
active=currencie['symbol'].endswith('USDT'), oms='kucoin'
)
for currencie in currencies
],
ignore_conflicts=True
)
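One caveat, since the original loop also refreshes name and active on rows that already exist: ignore_conflicts skips those rows entirely. If updates are needed too, Django 4.1+ lets bulk_create upsert instead; this sketch assumes a unique constraint on symbol and oms, matching the lookup in the original code:
Instrument.objects.bulk_create(
    new_currencies,
    update_conflicts=True,                 # requires Django 4.1+
    update_fields=['name', 'active'],      # refreshed on existing rows
    unique_fields=['symbol', 'oms'],       # must match a unique constraint
)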
TL;DR
How do I correctly trigger count windows with the Python SDK?
Problem
I'm trying to make a pipeline for transforming and indexing a Wikipedia dump.
The objective is:
Read from a compressed file - just one process and in a streaming fashion as the file doesn't fit in RAM
Process each element in parallel (ParDo)
Group these elements in a count window (GroupBy in just one key to do streaming -> batch ) in just one process to save them in a DB.
Development
For that, I created a simple source class that returns a tuple in the form (index, data, count):
class CountingSource(beam.io.filebasedsource.FileBasedSource):
def read_records(self, file_name, offset_range_tracker):
# timestamp = datetime.now()
k = 0
with gzip.open(file_name, "rt", encoding="utf-8", errors="strict") as f:
line = f.readline()
while line:
# Structure: index, page, index, page,...
line = f.readline()
yield line, f.readline(), k
k += 1
And I made the pipeline:
_beam_pipeline_args = [
"--runner=DirectRunner",
"--streaming",
# "--direct_num_workers=5",
# "--direct_running_mode=multi_processing",
]
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
pipeline = (
pipeline
| "Read dump" >> beam.io.Read(CountingSource(dump_path))
| "With timestamps" >> beam.Map(lambda data: beam.window.TimestampedValue(data, data[-1]))
| "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
| "Process element" >> beam.ParDo(ProcessPage())
| "Filter nones" >> beam.Filter(lambda data: data != [])
# * not working, keep stuck at group - not triggering the window
| "window"
>> beam.WindowInto(
beam.window.GlobalWindows(),
trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(10)),
accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING,
)
| "Map to tuple" >> beam.Map(lambda data: (None, data))
# | "Print" >> beam.Map(lambda data: print(data))
| "Group all per window" >> beam.GroupByKey()
| "Discard key" >> beam.Values()
| "Index data" >> beam.Map(index_data)
)
If I remove the window and pass directly from "Filter nones" to "Index data", the pipeline works, but it indexes the elements individually. Also, if I uncomment the print step I can see I still have data after the "Map to tuple" step, but it hangs on "Group all per window" without any log. I tried timed triggering too, changing the window to
>> beam.WindowInto(
beam.window.FixedWindows(10))
but this changed nothing (it was supposed to behave the same way, since I create a "count timestamp" during data extraction).
Am I misunderstanding something about the windowing? The objective was just to index the data in batches.
Alternative
I can "hack" this last step using a custom do.Fn like:
class BatchIndexing(beam.DoFn):
def __init__(self, connection_string, batch_size=50000):
self._connection_string = connection_string
self._batch_size = batch_size
self._total = 0
def setup(self):
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from scripts.wikipedia.wikipedia_articles.beam_module.documents import Base
engine = create_engine(self._connection_string, echo=False)
self.session = sessionmaker(bind=engine)(autocommit=False, autoflush=False)
Base.metadata.create_all(engine)
def start_bundle(self):
# buffer for string of lines
self._lines = []
def process(self, element):
# Input element is the processed pair
self._lines.append(element)
if len(self._lines) >= self._batch_size:
self._total += len(self._lines)
self._flush_batch()
def finish_bundle(self):
# takes care of the unflushed buffer before finishing
if self._lines:
self._flush_batch()
def _flush_batch(self):
self.index_data(self._lines)
# Clear the buffer.
self._lines = []
def index_data(self, entries_to_index):
"""
Index batch of data.
"""
print(f"Indexed {self._total} entries")
self.session.add_all(entries_to_index)
self.session.commit()
and change the pipeline to:
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
pipeline = (
pipeline
| "Read dump" >> beam.io.Read(CountingSource(dump_path))
| "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
| "Process element" >> beam.ParDo(ProcessPage())
| "Filter nones" >> beam.Filter(lambda data: data != [])
| "Unroll" >> beam.FlatMap(lambda data: data)
| "Index data" >> beam.ParDo(BatchIndexing(connection_string, batch_size=10000))
)
Which "works" but do the last step in parallel (thus, overwhelming de database or generating locked database problems with sqlite) and I would like to have just one Sink to communicate with the database.
Triggering in Beam is not a hard requirement. My guess would be that the trigger does not manage to trigger before the input ends. The early trigger of 10 elements means the runner is allowed to trigger after 10 elements, but does not have to (relates to how Beam splits inputs into bundles).
FixedWindows(10) is fixed on a 10-second interval, and your data will all have the same timestamp, so that is not going to help either.
If your goal is to group data to batches there is a very handy transform for that: GroupIntoBatches, which should work for the use case and has additional features like limiting the time a record can wait in the batch before being processed.
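A minimal sketch of that suggestion, keeping the rest of the pipeline from the question: key everything onto a single dummy key, let GroupIntoBatches emit up to 10,000 elements per batch (max_buffering_duration_secs flushes partial batches after 60 seconds, where the runner supports it), and hand each batch to index_data:
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    _ = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop counter" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Single key" >> beam.Map(lambda data: (None, data))
        | "Batch" >> beam.GroupIntoBatches(10000, max_buffering_duration_secs=60)
        | "Drop key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )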
My task is to write a Python script that can take results from BigQuery and email them out. I've written code that can successfully send an email, but I am having trouble including the results of the BigQuery query in the actual email. The query results are correct, but the object I am returning from the query (results) always comes back as NoneType.
For example, the email should look like this:
Hello,
You have the following issues that have been "open" for more than 7 days:
-List issues here from bigquery code
Thanks.
The code reads in contacts from a contacts.txt file, and it reads in the email message template from a message.txt file. I tried to make the BigQuery object into a string, but it still results in an error.
from google.cloud import bigquery
import warnings
warnings.filterwarnings("ignore", "Your application has authenticated using end user credentials")
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from string import Template
def query_emailtest():
client = bigquery.Client(project=("analytics-merch-svcs-thd"))
query_job = client.query("""
select dept, project_name, reset, tier, project_status, IssueStatus, division, store_number, top_category,
DATE_DIFF(CURRENT_DATE(), in_review, DAY) as days_in_review
from `analytics-merch-svcs-thd.MPC.RESET_DETAILS`
where in_review IS NOT NULL
AND IssueStatus = "In Review"
AND DATE_DIFF(CURRENT_DATE(), in_review, DAY) > 7
AND ready_for_execution IS NULL
AND project_status = "Active"
AND program_name <> "Capital"
AND program_name <> "SSI - Capital"
LIMIT 50
""")
results = query_job.result() # Waits for job to complete.
return results #THIS IS A NONETYPE
def get_queryresults(results): #created new method to put query results into a for loop and store it in a variable
for i,row in enumerate(results,1):
bq_data = (i , '. ' + str(row.dept) + " " + row.project_name + ", Reset #: " + str(row.reset) + ", Store #: " + str(row.store_number) + ", " + row.IssueStatus + " for " + str(row.days_in_review)+ " days")
print (bq_data)
def get_contacts(filename):
names = []
emails = []
with open(filename, mode='r', encoding='utf-8') as contacts_file:
for a_contact in contacts_file:
names.append(a_contact.split()[0])
emails.append(a_contact.split()[1])
return names, emails
def read_template(filename):
with open(filename, 'r', encoding='utf-8') as template_file:
template_file_content = template_file.read()
return Template(template_file_content)
names, emails = get_contacts('mycontacts.txt') # read contacts
message_template = read_template('message.txt')
results = query_emailtest()
bq_results = get_queryresults(query_emailtest())
import smtplib
# set up the SMTP server
s = smtplib.SMTP(host='smtp-mail.outlook.com', port=587)
s.starttls()
s.login('email', 'password')
# For each contact, send the email:
for name, email in zip(names, emails):
msg = MIMEMultipart() # create a message
# bq_data = get_queryresults(query_emailtest())
# add in the actual person name to the message template
message = message_template.substitute(PERSON_NAME=name.title())
message = message_template.substitute(QUERY_RESULTS=bq_results) #SUBSTITUTE QUERY RESULTS IN MESSAGE TEMPLATE. This is where I am having trouble because the Row Iterator object results in Nonetype.
# setup the parameters of the message
msg['From']='email'
msg['To']='email'
msg['Subject']="This is TEST"
# body = str(get_queryresults(query_emailtest())) #get query results from method to put into message body
# add in the message body
# body = MIMEText(body)
#msg.attach(body)
msg.attach(MIMEText(message, 'plain'))
# query_emailtest()
# get_queryresults(query_emailtest())
# send the message via the server set up earlier.
s.send_message(msg)
del msg
Message template:
Dear ${PERSON_NAME},
Hope you are doing well. Please find the following alert for Issues that have been "In Review" for greater than 7 days.
${QUERY_RESULTS}
If you would like more information, please visit this link that contains a complete dashboard view of the alert.
ISE Services
The BQ result() function returns a generator, so I think you need to change your return to yield from.
I'm far from a python expert, but the following pared-down code worked for me.
from google.cloud import bigquery
import warnings
warnings.filterwarnings("ignore", "Your application has authenticated using end user credentials")
def query_emailtest():
client = bigquery.Client(project=("my_project"))
query_job = client.query("""
select field1, field2 from `my_dataset.my_table` limit 5
""")
results = query_job.result()
yield from results # NOTE THE CHANGE HERE
results = query_emailtest()
for row in results:
print(row.field1, row.field2)
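On top of that, get_queryresults() only prints and never returns anything, which is where the NoneType comes from. A minimal sketch of building the string the template expects (field names reused from the original query) and filling both placeholders in a single substitute() call inside the contacts loop:
def format_results(rows):
    # Build one line per row, numbered, using the columns selected in the query.
    lines = []
    for i, row in enumerate(rows, 1):
        lines.append(f"{i}. {row.dept} {row.project_name}, Reset #: {row.reset}, "
                     f"Store #: {row.store_number}, {row.IssueStatus} for {row.days_in_review} days")
    return "\n".join(lines)

bq_results = format_results(query_emailtest())  # format once, reuse for every contact

# Inside the loop over contacts: fill both placeholders at the same time,
# otherwise the second substitute() call discards the first one's result.
message = message_template.substitute(PERSON_NAME=name.title(),
                                      QUERY_RESULTS=bq_results)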