Merge PCollection with apache_beam - python-3.x

I'm trying to run a pipeline with apache_beam (at the end will get to DataFlow).
The pipeline should look like the following:
I format the data from PubSub, write the raw results to Firestore, run the ML model, and once I have the results from the ML model I want to update Firestore with the ID I got from the first write to FS.
The pipeline code in general looks like this:
with beam.Pipeline(options=options) as p:
    # read and format
    formated_msgs = (
        p
        | "Read from PubSub" >> LoadPubSubData(known_args.topic)
    )

    # write the raw results to firestore
    write_results = (
        formated_msgs
        | "Write to FS" >> beam.ParDo(WriteToFS())
        | "Key FS" >> beam.Map(lambda fs: (fs["record_uuid"], fs))
    )

    # Run the ML model
    ml_results = (
        formated_msgs
        | "ML" >> ML()
        | "Key ML" >> beam.Map(lambda row: (row["record_uuid"], row))
    )

    # Merge by key and update - HERE IS THE PROBLEM
    (
        (write_results, ml_results)  # I want to have the data from both merged by the key at this point
        | "group" >> beam.CoGroupByKey()
        | "log" >> beam.ParDo(LogFn())
    )
I have tried so many ways, but I can't seem to find the correct way to do so. Any ideas?
--- update 1 ---
The problem is that on the log line I don't get anything. Sometimes, I even get a timeout on the operation.
It might be important to note that I'm streaming the data from PubSub at the beginning.

OK, so I finally figured it out. The only thing I was missing was windowing, presumably because I'm streaming the data.
So I've added the following:
with beam.Pipeline(options=options) as p:
    # read and format
    formated_msgs = (
        p
        | "Read from PubSub" >> LoadPubSubData(known_args.topic)
        | "Windowing" >> beam.WindowInto(window.FixedWindows(30))
    )
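For reference, once both keyed collections share the same windowing, CoGroupByKey emits one element per key per window, shaped as (key, (values_from_first_input, values_from_second_input)). Below is a minimal sketch of a DoFn that consumes that shape; what the original LogFn actually does is an assumption here.

import logging

import apache_beam as beam


class LogFn(beam.DoFn):
    # Sketch only: merge the Firestore write result with the ML result per record_uuid.
    def process(self, element):
        record_uuid, (fs_rows, ml_rows) = element
        # Each side is an iterable of the values keyed by record_uuid in this window.
        for fs in fs_rows:
            for ml in ml_rows:
                merged = {**fs, **ml}
                logging.info("merged %s: %s", record_uuid, merged)
                yield merged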

Related

Beam - Filter out Records from Bigquery

I am new to Apache Beam, and I am trying to do three tasks:
Read the top 30 items from the table
Read the top 30 stores from the table
Select the required columns from BigQuery and apply a filter on the Items and Stores columns.
I have the below code to execute the pipeline:
with beam.Pipeline(options=pipeline_args) as p:
    # read the dataset from bigquery
    query_top_30_items = (
        p
        | 'GetTopItemNumbers' >> beam.io.ReadFromBigQuery(
            query="""SELECT item_number, COUNT(item_number) AS freq_count
                     FROM [bigquery-public-data.iowa_liquor_sales.sales]
                     GROUP BY item_number
                     ORDER BY freq_count DESC
                     LIMIT 30""")
        | 'ReadItemNumbers' >> beam.Map(lambda elem: elem['item_number'])
        | 'ItemNumberAsList' >> beam.combiners.ToList()
    )

    query_top_30_stores = (
        p
        | 'GetTopStores' >> beam.io.ReadFromBigQuery(
            query="""SELECT store_number, COUNT(store_number) AS store_count
                     FROM [bigquery-public-data.iowa_liquor_sales.sales]
                     GROUP BY store_number
                     ORDER BY store_count DESC
                     LIMIT 30""")
        | 'ReadStoreValues' >> beam.Map(lambda elem: elem['store_number'])
        | 'StoreValuesAsList' >> beam.combiners.ToList()
    )

    query_whole_table = (
        (query_top_30_items, query_top_30_stores)
        | 'ReadTable' >> beam.io.ReadFromBigQuery(
            query="""SELECT item_number, store_number, bottles_sold, state_bottle_retail
                     FROM [bigquery-public-data.iowa_liquor_sales.sales]""")
        | 'FilterByItems' >> beam.Filter(lambda row: row['item_number'] in query_top_30_items)
        | 'FilterByStore' >> beam.Filter(lambda row: row['store_number'] in query_top_30_stores)
    )
I have attached the traceback for reference. How can I solve this error?
Traceback (most recent call last):
  File "run.py", line 113, in <module>
    run()
  File "run.py", line 100, in run
    | 'FilterByStore' >> beam.Filter(lambda row:row['store_number'] in query_top_30_stores)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 1058, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/transforms/ptransform.py", line 573, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/pipeline.py", line 646, in apply
    return self.apply(transform, pvalueish)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/pipeline.py", line 689, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 188, in apply
    return m(transform, input, options)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/runners/runner.py", line 218, in apply_PTransform
    return transform.expand(input)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 1881, in expand
    temp_location = pcoll.pipeline.options.view_as(
AttributeError: 'tuple' object has no attribute 'pipeline'
Since I am new to Beam, the code is not that optimized. Please let me know if I can optimize this code further.
Thanks for your time and help!
Applying the filter condition against the output of a function will not work in a pipeline. You have two options:
Apply the filter condition within the pipeline.
Apply the filter condition in the BigQuery SQL itself.
A filter over a function is ambiguous, because it is unclear what the function should return to its caller. Modify your code to apply the filter conditions in either of the two places highlighted above.
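For the second option, here is a hedged sketch of pushing both filters into a single standard-SQL query; the exact query shape is an assumption, not taken from the original answer.

top_30_filter_query = """
SELECT item_number, store_number, bottles_sold, state_bottle_retail
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE item_number IN (
        SELECT item_number FROM `bigquery-public-data.iowa_liquor_sales.sales`
        GROUP BY item_number ORDER BY COUNT(*) DESC LIMIT 30)
  AND store_number IN (
        SELECT store_number FROM `bigquery-public-data.iowa_liquor_sales.sales`
        GROUP BY store_number ORDER BY COUNT(*) DESC LIMIT 30)
"""

with beam.Pipeline(options=pipeline_args) as p:
    # All filtering happens in BigQuery; Beam only reads the already-filtered rows.
    filtered = p | 'ReadFiltered' >> beam.io.ReadFromBigQuery(
        query=top_30_filter_query, use_standard_sql=True)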
beam.io.ReadFromBigQuery must be at the root of your pipeline, and takes the pipeline object (not a PCollection or tuple of PCollections) as input. Hence the error.
As the other answer mentions, you could try to write the whole thing as a single BigQuery query. Otherwise, you could do the filtering after the read using side inputs, e.g.
with beam.Pipeline(options=pipeline_args) as p:
    # read the dataset from bigquery
    query_top_30_items = ...
    query_top_30_stores = ...

    sales = p | 'ReadTable' >> beam.io.ReadFromBigQuery(
        query="""SELECT item_number, store_number, bottles_sold, state_bottle_retail
                 FROM [bigquery-public-data.iowa_liquor_sales.sales]""")

    filtered = (
        sales
        | 'FilterByItems' >> beam.Filter(
            lambda row, items_side_input: row['item_number'] in items_side_input,
            items_side_input=beam.pvalue.AsList(query_top_30_items))
        | 'FilterByStore' >> beam.Filter(
            lambda row, stores_side_input: row['store_number'] in stores_side_input,
            stores_side_input=beam.pvalue.AsList(query_top_30_stores))
    )

What do the "|" and ">>" means in Apache Beam?

I'm trying to understand Apache Beam. I was following the programming guide, and in one example they say: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection."
I was quite surprised, because I didn't see a ParDo operation at any point, so I started to wonder whether the | was actually the ParDo. The code looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

emails_list = [
    ('amy', 'amy@example.com'),
    ('carl', 'carl@example.com'),
    ('julia', 'julia@example.com'),
    ('carl', 'carl@email.com'),
]
phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)

    results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())

    def join_info(name_info):
        (name, info) = name_info
        return '%s; %s; %s' % \
            (name, sorted(info['emails']), sorted(info['phones']))

    contact_lines = results | beam.Map(join_info)
I do notice that emails and phones are read at the start of the pipeline, so I guess both of them are different PCollections, right? But where is the ParDo executed? What do the "|" and ">>" actually mean? And how can I see the actual output of this? Does it matter if the join_info function, the emails_list and the phones_list are defined outside the DAG?
The | represents a separation between steps; that is (using p as the PBegin): p | ReadFromText(..) | ParDo(..) | GroupByKey().
You can also reference other PCollections before |:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
That's equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()
The >> is used between | and the PTransform to name the step: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
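As for where the ParDo is: beam.Map (and beam.FlatMap) are convenience wrappers built on ParDo, so the join step in the example could equivalently be written with an explicit DoFn. A minimal sketch follows; JoinInfoFn is a name introduced here for illustration.

import apache_beam as beam


class JoinInfoFn(beam.DoFn):
    # Equivalent to beam.Map(join_info): one formatted output element per input element.
    def process(self, name_info):
        name, info = name_info
        yield '%s; %s; %s' % (name, sorted(info['emails']), sorted(info['phones']))


# contact_lines = results | 'JoinInfo' >> beam.ParDo(JoinInfoFn())
# To inspect the output locally with the DirectRunner:
# contact_lines | 'Print' >> beam.Map(print)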

How can I read from BigQuery and save a csv to Google cloud storage using Dataflow beam/python/Jupyter [duplicate]

I am pretty new to Apache Beam, and I am trying to write a pipeline to extract data from Google BigQuery and write it to GCS in CSV format using Python.
Using beam.io.Read(beam.io.BigQuerySource()) I am able to read the data from BigQuery, but I am not sure how to write it to GCS in CSV format.
Is there a function to achieve this? Could you please help me?
import logging

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition

PROJECT = 'project_id'
BUCKET = 'project_bucket'

def run():
    argv = [
        '--project={0}'.format(PROJECT),
        '--job_name=readwritebq',
        '--save_main_session',
        '--staging_location=gs://{0}/staging/'.format(BUCKET),
        '--temp_location=gs://{0}/staging/'.format(BUCKET),
        '--runner=DataflowRunner'
    ]
    with beam.Pipeline(argv=argv) as p:
        # Execute the SQL in BigQuery and store the result data set in the given destination BigQuery table.
        BQ_SQL_TO_TABLE = p | 'read_bq_view' >> beam.io.Read(
            beam.io.BigQuerySource(query='Select * from `dataset.table`', use_standard_sql=True))

        # Extract data from BigQuery to GCS in CSV format.
        # This is where I need your help
        BQ_SQL_TO_TABLE | 'Write_bq_table' >> beam.io.WriteToBigQuery(
            table='tablename',
            dataset='datasetname',
            project='project_id',
            schema='name:string,gender:string,count:integer',
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_TRUNCATE)

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
You can do so using WriteToText to add a .csv suffix and headers. Take into account that you'll need to parse the query results to CSV format. As an example, I used the Shakespeare public dataset and the following query:
SELECT word, word_count, corpus FROM `bigquery-public-data.samples.shakespeare` WHERE CHAR_LENGTH(word) > 3 ORDER BY word_count DESC LIMIT 10
We now read the query results with:
BQ_DATA = p | 'read_bq_view' >> beam.io.Read(
    beam.io.BigQuerySource(query=query, use_standard_sql=True))
BQ_DATA now contains dictionaries of key-value pairs, one per row:
{u'corpus': u'hamlet', u'word': u'HAMLET', u'word_count': 407}
{u'corpus': u'kingrichardiii', u'word': u'that', u'word_count': 319}
{u'corpus': u'othello', u'word': u'OTHELLO', u'word_count': 313}
We can apply a beam.Map function to yield only values:
BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: x.values())
Excerpt of BQ_VALUES:
[u'hamlet', u'HAMLET', 407]
[u'kingrichardiii', u'that', 319]
[u'othello', u'OTHELLO', 313]
And finally map again to have all column values separated by commas instead of a list (take into account that you would need to escape double quotes if they can appear within a field):
BQ_CSV = BQ_VALUES | 'CSV format' >> beam.Map(
    lambda row: ', '.join(['"' + str(column) + '"' for column in row]))
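If fields can themselves contain commas or double quotes, a hedged alternative is to let Python's csv module handle the quoting; format_csv_row is a helper name introduced here, not part of the original answer.

import csv
import io

def format_csv_row(values):
    # Let the csv module handle quoting/escaping instead of joining by hand.
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(values)
    return buf.getvalue().rstrip('\r\n')

# This would replace the ', '.join(...) step above:
BQ_CSV = BQ_VALUES | 'CSV format' >> beam.Map(format_csv_row)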
Now we write the results to GCS with the suffix and headers:
BQ_CSV | 'Write_to_GCS' >> beam.io.WriteToText(
    'gs://{0}/results/output'.format(BUCKET), file_name_suffix='.csv',
    header='word, word count, corpus')
Written results:
$ gsutil cat gs://$BUCKET/results/output-00000-of-00001.csv
word, word count, corpus
"hamlet", "HAMLET", "407"
"kingrichardiii", "that", "319"
"othello", "OTHELLO", "313"
"merrywivesofwindsor", "MISTRESS", "310"
"othello", "IAGO", "299"
"antonyandcleopatra", "ANTONY", "284"
"asyoulikeit", "that", "281"
"antonyandcleopatra", "CLEOPATRA", "274"
"measureforemeasure", "your", "274"
"romeoandjuliet", "that", "270"
For anyone looking for an update using Python 3, replace the line of
BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: x.values())
with
BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(lambda x: list(x.values()))
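Since dict.values() depends on the column order of the returned rows, a hedged alternative is to pick the fields explicitly by name (keys taken from the sample rows shown above):

BQ_VALUES = BQ_DATA | 'read values' >> beam.Map(
    lambda x: [x['corpus'], x['word'], x['word_count']])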

How to summarize time window based on a status in Kusto

I have recently started working with Kusto. I am stuck on a use case and need to confirm that the approach I am taking is right.
I have data in the following format:
In the above example, if the status is 1 and the time frame equals 15 seconds, then I need to count it as one occurrence.
So in this case there are 2 occurrences of the status.
My approach was:
If the current and next row's status is equal to 1, take the time difference and do a row_cumsum, breaking it when next(STATUS) != 0.
Even though the approach gives me the correct output, I assume the performance can degrade once the data size increases.
I am looking for an alternative approach, if any. I am also adding the complete scenario to reproduce this with sample data.
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeTrue() {
range LoopTime from ago(365d) to now() step 6s
| project TIME=LoopTime,STATUS=toint(1)
}
.create-or-alter function with (folder = "Tests", skipvalidation = "true") InsertFakeFalse() {
range LoopTime from ago(365d) to now() step 29s
| project TIME=LoopTime,STATUS=toint(0)
}
.set-or-append FAKEDATA <| InsertFakeTrue();
.set-or-append FAKEDATA <| InsertFakeFalse();
FAKEDATA
| order by TIME asc
| serialize
| extend cstatus=STATUS
| extend nstatus=next(STATUS)
| extend WindowRowSum=row_cumsum(iff(nstatus ==1 and cstatus ==1, datetime_diff('second',next(TIME),TIME),0),cstatus !=1)
| extend windowCount=iff(nstatus !=1 or isnull(next(TIME)), iff(WindowRowSum ==15, 1,iff(WindowRowSum >15,(WindowRowSum/15)+((WindowRowSum%15)/15),0)),0 )
| summarize IDLE_COUNT=sum(windowCount)
The approach in the question is the way to achieve such calculations in Kusto, and since the logic requires sorting, it is also efficient (as long as the sorted data can reside on a single machine).
Regarding the union operator: it runs in parallel by default; you can control the concurrency and spread using hints. See: union operator

How many events are stored in my PredictionIO event server?

I imported an unknown number of events into my PIO eventserver and now I want to know that number (in order to measure and compare recommendation engines). I could not find an API for that, so I had a look at the MySQL database my server uses. I found two tables:
mysql> select count(*) from pio_event_1;
+----------+
| count(*) |
+----------+
| 6371759 |
+----------+
1 row in set (8.39 sec)
mysql> select count(*) from pio_event_2;
+----------+
| count(*) |
+----------+
| 2018200 |
+----------+
1 row in set (9.79 sec)
Both tables look very similar, so I am still unsure.
Which table is relevant? What is the difference between pio_event_1 and pio_event_2?
Is there a command or REST API where I can look up the number of stored events?
You could go through the spark shell, described in the troubleshooting docs
Launch the shell with
pio-shell --with-spark
Then find all events for your app and count them
import io.prediction.data.store.PEventStore
PEventStore.find(appName="MyApp1")(sc).count
You could also filter to find different subsets of events by passing more parameters to find. See the API docs for more details. The LEventStore is also an option.
Connect to your database
\c db_name
List tables
\dt;
Run query
select count(*) from pio_event_1;
PHP
<?php
$dbconn = pg_connect("host=localhost port=5432 dbname=db_name user=postgres");
$result = pg_query($dbconn, "select count(*) from pio_event_1");
if (!$result) {
    echo "An error occurred.\n";
    exit;
}
// Not the best way, but output the total number of events.
while ($row = pg_fetch_row($result)) {
    echo '<P><center>'.number_format($row[0]).' Events</center></P>';
}
?>
