Python Streaming Dataflow "WriteToPubSub" behaviour - python-3.x

I am trying out a streaming Dataflow pipeline that reads from Pub/Sub and writes to another Pub/Sub topic. I am using Python 3.7.3. The pipeline looks something like this:
lines = (pipe
         | "Read from PubSub" >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
         | "Transformation" >> beam.ParDo(PubSubToDict())
         | "Write to PubSub" >> beam.io.WriteToPubSub(topic=OUTPUT, with_attributes=False)
         )
The "Transformation" step is something where I need to so some custom transformation. I am ensuring that the output of this transform is bytes. Something like this,
class PubSubToDict(beam.DoFn):
    def process(self, element):
        """pubsub input is a byte string"""
        data = element.decode('utf-8')
        """do some custom transform here"""
        data = data.encode('utf-8')
        return data
Now when I publish a test message, I get an error like this,
ERROR: Data being published to Pub/Sub must be sent as a bytestring. [while running 'Write to PubSub']
I managed to solve this by returning a list instead, like this:
return [data]
But I don't know why this worked, so I was looking for an explanation.
Regards,
Prasad

It worked because ParDo lets a pipeline step return multiple output elements for a single input element, so it expects process to return an iterable. Returning a one-element list satisfies that.
You could also do yield data.
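For example, the DoFn from the question works once process yields the element instead of returning the raw bytes (a sketch; the custom transform is elided):
class PubSubToDict(beam.DoFn):
    def process(self, element):
        data = element.decode('utf-8')
        # ... custom transform here ...
        data = data.encode('utf-8')
        yield data  # yielding (or returning [data]) gives Beam the iterable it expects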

Related

Writing unique parquet file per windows with Apache Beam Python

I am trying to stream messages from a Kafka consumer to Google Cloud Storage with 30-second windows using Apache Beam. I used beam_nuggets.io for reading from a Kafka topic. However, I wasn't able to write a unique parquet file to GCS for each window.
You can see my code below:
import apache_beam as beam
from apache_beam.transforms.trigger import AfterAny, AfterCount, AfterProcessingTime, AfterWatermark, Repeatedly
from apache_beam.portability.api.beam_runner_api_pb2 import AccumulationMode
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
import json
from datetime import datetime
import pandas as pd
import config as conf
import apache_beam.transforms.window as window

consumer_config = {"topic": "Uswrite",
                   "bootstrap_servers": "*.*.*.*:9092",
                   "group_id": "notification_consumer_group_33"}

folder_name = datetime.now().strftime('%Y-%m-%d')

def format_result(consume_message):
    data = json.loads(consume_message[1])
    file_name = datetime.now().strftime("%Y_%m_%d-%I_%M_%S")
    df = pd.DataFrame(data).T  # , orient='index'
    df.to_parquet(f'gs://{conf.gcs}/{folder_name}/{file_name}.parquet',
                  storage_options={"token": "gcp.json"}, engine='fastparquet')
    print(consume_message)

with beam.Pipeline(options=PipelineOptions()) as p:
    consumer_message = (p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
                        | 'Windowing' >> beam.WindowInto(window.FixedWindows(30),
                                                         trigger=AfterProcessingTime(30),
                                                         allowed_lateness=900,
                                                         accumulation_mode=AccumulationMode.ACCUMULATING)
                        | 'CombineGlobally' >> beam.Map(format_result))
# window.FixedWindows(30), trigger=beam.transforms.trigger.AfterProcessingTime(30),
# accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING
# allowed_lateness=100, CombineGlobally(format_result).without_defaults() allowed_lateness=30,
Using the code above, a new parquet file is generated for each message. What I would like to do is to group messages by 30 seconds windows and generate one parquet file for each window.
I tried different configurations below with no success:
beam.CombineGlobally(format_result).without_defaults()) instead of beam.Map(format_result))
beam.ParDo(format_result))
In addition, I have a few more questions:
Even though I set the offset with "auto.offset.reset": "earliest", the Kafka consumer starts to read from the last message even if I change the consumer group, and I can't figure out why.
Also, I am puzzled by the usage of trigger, allowed_lateness, and accumulation_mode. I am not sure if I need them for this task. As you can see in the code block above, I also tried using these parameters, but it didn't help.
I searched everywhere but couldn't find a single example that explains this use case.
Here are some changes you should make to your pipeline to get this result:
Remove your trigger if you want a single output per window. Triggers are only needed for getting multiple results per window.
Add a GroupByKey or Combine operation to aggregate the elements. Without such an operation, the windowing has no effect.
I recommend using parquetio from the Beam project itself to ensure you get scalable exactly-once behavior. (See the pydoc from 2.33.0 release)
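Putting those points together, here is a minimal sketch, not a drop-in solution: it reuses p, kafkaio and consumer_config from the question, keys every message with a single dummy key so GroupByKey collects one batch per 30-second window, and writes that batch with pandas the way the original format_result did (switching the write to Beam's parquetio would be the more scalable path). The bucket path is a placeholder, and the group only fires once the source advances the watermark past the window end.
import json
import apache_beam as beam
import apache_beam.transforms.window as window
import pandas as pd

class WriteWindowToParquet(beam.DoFn):
    """Write all messages that fell into one window as a single parquet file."""
    def process(self, keyed_batch, win=beam.DoFn.WindowParam):
        _, messages = keyed_batch                      # all (key, value) tuples of this window
        rows = [json.loads(value) for _, value in messages]
        file_name = win.start.to_utc_datetime().strftime("%Y_%m_%d-%H_%M_%S")
        pd.DataFrame(rows).to_parquet(
            f"gs://YOUR_BUCKET/{file_name}.parquet",   # placeholder bucket/path
            storage_options={"token": "gcp.json"}, engine="fastparquet")

result = (p
          | "Read from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
          | "Window" >> beam.WindowInto(window.FixedWindows(30))  # no trigger: one pane per window
          | "Key by constant" >> beam.Map(lambda msg: (None, msg))
          | "Group per window" >> beam.GroupByKey()               # one element per window
          | "Write one file per window" >> beam.ParDo(WriteWindowToParquet()))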
I took a look at the GroupByKey example in the Python documentation.
Messages I read from KafkaConsumer (I used kafkaio from beam_nuggets.io) have a type of tuple, and in order to use GroupByKey, I tried to create a list in the convert_to_list function by appending the tuples I got from the Kafka consumer. However, GroupByKey still produces no output.
import apache_beam as beam
from beam_nuggets.io import kafkaio

new_list = []

def convert_to_list(consume_message):
    new_list.append(consume_message)
    return new_list

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'consume message added list' >> beam.ParDo(convert_to_list)
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I also tried a similar pipeline, but this time I created a list of tuples with beam.Create() instead of reading from Kafka, and it works successfully. You can view this pipeline below:
import apache_beam as beam
from beam_nuggets.io import kafkaio

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | 'Created Pipeline' >> beam.Create([(None, '{"userId": "921", "xx": "123"}'),
                                             (None, '{"userId": "92111", "yy": "123"}')])
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I assume the issue in the first approach is related to generating an external list instead of a PCollection, but I am not sure. Can you guide me on how to proceed?
Another thing I tried is to use the ReadFromKafka function from the apache_beam.io.kafka module. But this time I got the following error:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'user_directory'/.apache_beam/cache/jars\\beam-sdks-java-io-expansion-service-2.33.0.jar', '59627']
Java version 11.0.12 is installed on my computer and the 'java' command is available.

Parametrize and loop KQL queries in JupyterLab

My question is how to assign variables within a loop with the KQL magic command in JupyterLab. I refer to Microsoft's documentation on this subject and will base my question on the code given here:
https://learn.microsoft.com/en-us/azure/data-explorer/kqlmagic
1. First, the query below:
%%kql
StormEvents
| summarize max(DamageProperty) by State
| order by max_DamageProperty desc
| limit 10
2. Second: convert the query result to a dataframe and assign it to the variable 'statefilter':
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0].State
statefilter
3. This is where I would like to modify the above query and let statefilter have multiple variables (i.e. consist of different states):
df = _kql_raw_result_.to_dataframe()
statefilter =df.loc[0:3].State
statefilter
4. And finally, I would like to run my KQL query within a for loop for each of the values within statefilter. The syntax below may not be correct, but it gives an example of what I am looking for:
dfs = []  # an empty list to store dataframes
for state in statefilter:
    %%kql
    let _state = state;
    StormEvents
    | where State in (_state)
    | do some operations here for that specific state
    df = _kql_raw_result_.to_dataframe()
    dfs.append(df)  # store the df specific to this state in the list
The reason why I am not querying all the desired states within one KQL query is to prevent really large query results being assigned to dataframes. This is not an issue for this sample StormEvents table, which has a reasonable size, but my research data consists of many sites and is really big. Therefore I would like to be able to run a KQL query/analysis for each site within a for loop and assign each site's query results to a dataframe. Please let me know if this is possible, or perhaps there are other logical ways to do this within KQL.
There are a few ways to do it.
The simplest is to refactor your %%kql cell magic to a %kql line magic.
A line magic can be embedded in a Python cell.
Another option is to: from Kqlmagic import kql
The Kqlmagic kql method accepts a kql cell or line as a string.
You can call kql from Python.
A third way is to call the kql magic via the IPython method:
ip.run_cell_magic('kql', {your kql magic cell text})
You can call it from Python.
An example of using the single-line magic mentioned by Michael, with a return statement that converts the result to JSON. Without the conversion to JSON I wasn't getting anything back:
def testKQL():
    %kql DatabaseName | take 10000
    return _kql_raw_result_.to_dataframe().to_json(orient='records')
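Building on the answers above, here is one hedged way to put the loop from step 4 together using IPython's run_cell_magic, whose signature is run_cell_magic(magic_name, line, cell). The query body is only a placeholder, and statefilter comes from step 3:
from IPython import get_ipython

ip = get_ipython()
dfs = []  # one dataframe per state

for state in statefilter:  # statefilter from step 3 above
    # Build the KQL text in Python, then execute it as a kql cell magic.
    query = f"""
    StormEvents
    | where State == '{state}'
    | take 1000
    """
    ip.run_cell_magic('kql', '', query)
    dfs.append(_kql_raw_result_.to_dataframe())  # store the df for this state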

Why is the output from Google Video Intelligence not in JSON format - python-3.x

I have been trying to use the Google Video Intelligence API from https://cloud.google.com/video-intelligence/docs/libraries and I tried the exact same code. The response output was supposed to be in JSON format; however, the output was a google.cloud.videointelligence_v1.types.AnnotateVideoResponse or something similar to that.
I have tried the code from many resources, most recently from https://cloud.google.com/video-intelligence/docs/libraries, but still no JSON output was given. This is what I got when I checked the type of the output:
type(result)
google.cloud.videointelligence_v1.types.AnnotateVideoResponse
So, how do I get a JSON response from this?
If you specify an outputUri, the results will be stored in your GCS bucket in json format. https://cloud.google.com/video-intelligence/docs/reference/rest/v1/videos/annotate
It seems like you aren't storing the result in GCS. Instead you are getting the result via the GetOperation call, which has the result in AnnotateVideoResponse format.
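For illustration (the bucket paths are placeholders, and the string feature name mirrors the question's own call), the request with an output URI could look something like this:
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    input_uri='gs://YOUR_BUCKET/xxxx.mp4',
    features=['OBJECT_TRACKING'],
    output_uri='gs://YOUR_BUCKET/annotations/result.json')  # JSON is written here
operation.result(timeout=600)  # wait for the annotation job to finish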
I have found a solution for this. What I had to do was import this
from google.protobuf.json_format import MessageToJson
import json
and run
job = client.annotate_video(
    input_uri='gs://xxxx.mp4',
    features=['OBJECT_TRACKING'])
result = job.result()
serialized = MessageToJson(result)
a = json.loads(serialized)
type(a)
What I was doing was turning the result into a dictionary.
Or for more info, try going to this link: google forums thread

How do I convert table row PCollections to key,value PCollections in Python?

There is NO documentation regarding how to convert PCollections into the PCollections necessary for input into CoGroupByKey().
Context
Essentially I have two large pCollections and I need to be able to find differences between the two, for type II ETL changes (if it doesn't exist in pColl1 then add to a nested field found in pColl2), so that I am able to retain history of these records from BigQuery.
Pipeline Architecture:
Read BQ Tables into 2 pCollections: dwsku and product.
Apply a CoGroupByKey() to the two sets to return --> Results
Parse results to find and nest all changes in dwsku into product.
Any help would be appreciated. I found a Java link on SO that does the same thing I need to accomplish (but there's nothing on the Python SDK).
Convert from PCollection<TableRow> to PCollection<KV<K,V>>
Is there a documentation / support for Apache Beam, especially Python SDK?
In order to get CoGroupByKey() working, you need to have PCollections of tuples, in which the first element is the key and the second is the data.
In your case, you said that you have BigQuerySource, which in the current version of Apache Beam outputs a PCollection of dictionaries (code), in which every entry represents a row in the table that was read. You need to map these PCollections to tuples as stated above. This is easy to do using ParDo:
class MapBigQueryRow(beam.DoFn):
    def process(self, element, key_column):
        key = element.get(key_column)
        yield key, element

data1 = (p
         | "Read #1 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #1"))
         | "Map #1 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_1"))

data2 = (p
         | "Read #2 BigQuery table" >> beam.io.Read(beam.io.BigQuerySource(query="your query #2"))
         | "Map #2 to KV" >> beam.ParDo(MapBigQueryRow(), key_column="KEY_COLUMN_IN_TABLE_2"))

co_grouped = ({"data1": data1, "data2": data2} | beam.CoGroupByKey())
# do your processing with co_grouped here
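For illustration only (the merge logic is a placeholder), each element that comes out of CoGroupByKey is a (key, dict of lists) pair, which can be unpacked like this:
def merge_rows(element):
    key, grouped = element
    dwsku_rows = grouped["data1"]    # all rows from table #1 with this key
    product_rows = grouped["data2"]  # all rows from table #2 with this key
    # ... compare the two lists and nest the dwsku changes into product here ...
    yield key, (dwsku_rows, product_rows)

merged = co_grouped | "Merge rows" >> beam.FlatMap(merge_rows)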
BTW, documentation of Python SDK for Apache Beam can be found here.

Using python 3.x how can I pass a Tree object from ete3 to DendroPy without writing to file

I'm using the ete3 package in Python to build phylogenetic trees from data I've generated with a stochastic model, and it works well. I have previously written these trees to newick format and then used another script, with the package DendroPy, to read these trees and do some analysis of them. Both of these scripts work fine.
I am now trying to do a large amount of this sort of data processing and want to write a single script in which I skip the file writing. Both classes are called Tree, so I got around this by importing the dendropy class like:
from dendropy import Tree as DTree
and the ete3 method like:
from ete3 import Tree
which seems to be ok.
The question I have is how to pass the object from one package to the other. I have a loop in which I first build the tree object using the ete3 methods, and I call it 't'. My plan was then to use the Tree.write method in ete3 to pass the tree object to DendroPy using the 'get' method, skipping the actual outfile bit, like this:
treePass = t.write(format = 1)
DendroTree = DTree.get(treePass, schema = 'newick')
but this gives the error:
DendroTree = DTree.get(treePass)
TypeError: get() takes 1 positional argument but 2 were given
Any thoughts are welcome.
DTree.get() takes only self as a positional argument; everything else has to be given through keyword arguments. This basically means you cannot pass treePass to DTree.get() as a positional argument.
I haven't used either of those libs, but I have found a way to import data into a dendropy tree here.
tree = DTree.get(data="((A,B),(C,D));",schema="newick")
This means you'd have to get your tree from ete3 in this format. It doesn't seem that unusual for a tree, so after a bit more looking there seems to be a supported format in ete3, which you can read about here. I believe it's number 9.
So in the end I'd try this:
from dendropy import Tree as DTree
from ete3 import Tree

# do your Tree generation magic here
DendroTree = DTree.get(data=t.write(format=9), schema='newick')
Edit:
As I'm reading more and more, I believe that any newick format should be readable, so basically all you have to add to your original example is data=: DendroTree = DTree.get(data=treePass, schema='newick')
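As a small self-contained check (using a toy newick string in place of the simulated tree):
from ete3 import Tree
from dendropy import Tree as DTree

t = Tree("((A:1,B:1):1,(C:1,D:1):1);")           # stand-in for the simulated tree
newick_str = t.write(format=1)                    # serialize to a newick string in memory
DendroTree = DTree.get(data=newick_str, schema="newick")
print(DendroTree.as_string(schema="newick"))      # confirm the round trip worked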
