I'm trying to understand Apache Beam. I was following the programming guide, and in one example they say: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection."
I was quite surprised, because I didn't see a ParDo operation at any point, so I started wondering whether the | was actually the ParDo. The code looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

emails_list = [
    ('amy', 'amy@example.com'),
    ('carl', 'carl@example.com'),
    ('julia', 'julia@example.com'),
    ('carl', 'carl@email.com'),
]
phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)

    results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())

    def join_info(name_info):
        (name, info) = name_info
        return '%s; %s; %s' % \
            (name, sorted(info['emails']), sorted(info['phones']))

    contact_lines = results | beam.Map(join_info)
I do notice that emails and phones are read at the start of the pipeline, so I guess that both of them are different PCollections, right? But where is the ParDo executed? What do the "|" and ">>" actually mean? And how can I see the actual output of this? Does it matter whether the join_info function, emails_list and phones_list are defined outside the DAG?
The | represents a separation between steps; it applies the PTransform on the right to the PCollection on the left. This is (using p as the PBegin): p | ReadFromText(..) | ParDo(..) | GroupByKey().
You can also reference other PCollections before |:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
That's equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()
The >> is used between | and the PTransform to name the step: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
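As for the other questions: the ParDo in your example is hiding inside beam.Map(join_info), since Map is a thin convenience wrapper around ParDo that applies a function to each element. To see the actual output, you can tack another step onto contact_lines inside the with beam.Pipeline(...) as p: block. A minimal sketch, assuming the question's code (the output path prefix is just an example):

    # Option 1: print each formatted line (handy for local runs with the DirectRunner).
    contact_lines | 'Print' >> beam.Map(print)

    # Option 2: write the results to text files (path prefix is an example).
    contact_lines | 'WriteResults' >> beam.io.WriteToText('contacts', file_name_suffix='.txt')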
I have a dataframe df with a string column sld that contains words run together with no space/delimiter. One of the libraries that can be used to split them is wordninja:
E.g. wordninja.split('culturetosuccess') outputs ['culture','to','success']
Using pandas_udf, I have:
@pandas_udf(ArrayType(StringType()))
def split_word(x):
    splitted = wordninja.split(x)
    return splitted
However, it throws an error when I apply it to the column sld:
df1 = df.withColumn('test', split_word(col('sld')))
TypeError: expected string or bytes-like object
What I tried:
I noticed that there is a similar problem with the well-known split() function, but the workaround there is to use string.str, as mentioned here. That doesn't work with wordninja.split.
Any work around this issue?
Edit: I think, in a nutshell, the issue is:
the pandas_udf input is a pd.Series, while wordninja.split expects a single string.
My df looks like this:
+-------------+
|sld |
+-------------+
|"hellofriend"|
|"restinpeace"|
|"this" |
|"that" |
+-------------+
I want something like this:
+-------------+---------------------+
| sld | test |
+-------------+---------------------+
|"hellofriend"|["hello","friend"] |
|"restinpeace"|["rest","in","peace"]|
|"this" |["this"] |
|"that" |["that"] |
+-------------+---------------------+
Just use .apply to perform computation on each element of the Pandas series, something like this:
@pandas_udf(ArrayType(StringType()))
def split_word(x: pd.Series) -> pd.Series:
    splitted = x.apply(lambda s: wordninja.split(s))
    return splitted
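For context, a hedged sketch of the imports and the call site this assumes (df and its sld column come from the question; wordninja has to be importable on the executors):

import pandas as pd
import wordninja
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import ArrayType, StringType

# With split_word defined as above:
df1 = df.withColumn('test', split_word(col('sld')))
df1.show(truncate=False)  # should show arrays like [hello, friend], as in the question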
One way is using udf.
import wordninja
from pyspark.sql import functions as F
df = spark.createDataFrame([("hellofriend",), ("restinpeace",), ("this",), ("that",)], ['sld'])
@F.udf
def split_word(x):
    return wordninja.split(x)
df.withColumn('col2', split_word('sld')).show()
# +-----------+-----------------+
# | sld| col2|
# +-----------+-----------------+
# |hellofriend| [hello, friend]|
# |restinpeace|[rest, in, peace]|
# | this| [this]|
# | that| [that]|
# +-----------+-----------------+
I'm trying to run a pipeline with apache_beam (at the end will get to DataFlow).
The pipeline should look like the following:
I format the data from PubSub, write the raw results to Firestore, and then run the ML model; once I have the results from the ML model, I want to update Firestore with the ID I got from the first write to FS.
The pipeline code in general looks like this:
with beam.Pipeline(options=options) as p:
    # read and format
    formated_msgs = (
        p
        | "Read from PubSub" >> LoadPubSubData(known_args.topic)
    )

    # write the raw results to firestore
    write_results = (
        formated_msgs
        | "Write to FS" >> beam.ParDo(WriteToFS())
        | "Key FS" >> beam.Map(lambda fs: (fs["record_uuid"], fs))
    )

    # Run the ML model
    ml_results = (
        formated_msgs
        | "ML" >> ML()
        | "Key ML" >> beam.Map(lambda row: (row["record_uuid"], row))
    )

    # Merge by key and update - HERE IS THE PROBLEM
    (
        (write_results, ml_results)  # I want to have the data from both, merged by the key, at this point
        | "group" >> beam.CoGroupByKey()
        | "log" >> beam.ParDo(LogFn())
    )
I have tried so many ways, but I can't seem to find the correct way to do so. Any ideas?
--- update 1 ---
The problem is that nothing ever reaches the log step. Sometimes I even get a timeout on the operation.
It might be important to note that I'm streaming the data from PubSub at the beginning.
OK, so I finally figured it out. The only thing I was missing was windowing, presumably because I'm streaming the data: CoGroupByKey on an unbounded PCollection needs a window (or trigger) to know when each key's group can be emitted.
So I've added the following:
from apache_beam.transforms import window  # needed for FixedWindows

with beam.Pipeline(options=options) as p:
    # read and format
    formated_msgs = (
        p
        | "Read from PubSub" >> LoadPubSubData(known_args.topic)
        | "Windowing" >> beam.WindowInto(window.FixedWindows(30))
    )
Let's imagine we have the following dataframe:
port | flag | timestamp
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00
30 | R | 2009-04-24T17:14:14+00:00
32 | S | 2009-04-24T17:15:14+00:00
21 | R | 2009-04-24T17:16:14+00:00
54 | R | 2009-04-24T17:17:14+00:00
24 | R | 2009-04-24T17:18:14+00:00
I would like to calculate the number of distinct (port, flag) pairs over the last 3 hours, in PySpark.
The result would be something like:
port | flag | timestamp | distinct_port_flag_overs_3h
---------------------------------------
20 | S | 2009-04-24T17:13:14+00:00 | 1
30 | R | 2009-04-24T17:14:14+00:00 | 1
32 | S | 2009-04-24T17:15:14+00:00 | 2
21 | R | 2009-04-24T17:16:14+00:00 | 2
54 | R | 2009-04-24T17:17:14+00:00 | 2
24 | R | 2009-04-24T17:18:14+00:00 | 3
The SQL query looks like:
SELECT
    COUNT(DISTINCT port) OVER my_window AS distinct_port_flag_overs_3h
FROM my_table
WINDOW my_window AS (
    PARTITION BY flag
    ORDER BY CAST(timestamp AS timestamp)
    RANGE BETWEEN INTERVAL 3 HOUR PRECEDING AND CURRENT ROW
)
I found this topic that solves the problem but only if we want to count distinct elements over one field.
Does someone have any idea how to achieve that in:
python 3.7
pyspark 2.4.4
Just collect a set of structs (port, flag) and get its size. Something like this:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_set, size, struct, to_timestamp

# Look back 10800 seconds (3 hours) from the current row, per flag.
w = Window.partitionBy("flag").orderBy("timestamp").rangeBetween(-10800, Window.currentRow)

df.withColumn("timestamp", to_timestamp("timestamp").cast("long"))\
  .withColumn("distinct_port_flag_overs_3h", size(collect_set(struct("port", "flag")).over(w)))\
  .orderBy(col("timestamp"))\
  .show()
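If you want to try this end to end, here is a hedged sketch that builds the sample frame from the question (it assumes an active SparkSession; the rows are copied from the question's table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(20, "S", "2009-04-24T17:13:14+00:00"),
     (30, "R", "2009-04-24T17:14:14+00:00"),
     (32, "S", "2009-04-24T17:15:14+00:00"),
     (21, "R", "2009-04-24T17:16:14+00:00"),
     (54, "R", "2009-04-24T17:17:14+00:00"),
     (24, "R", "2009-04-24T17:18:14+00:00")],
    ["port", "flag", "timestamp"],
)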
I've just coded something like this that works too:
import re
import traceback

from pyspark.sql import Window
from pyspark.sql import functions as F

def hive_time(time: str) -> int:
    """
    Convert a string duration to a number of seconds.
    time : str : must be in the format <number><unit>,
    for example 1hour, 4day, 3month
    """
    match = re.match(r"([0-9]+)([a-z]+)", time, re.I)
    if match:
        items = match.groups()
        nb, kind = items[0], items[1]
        try:
            nb = int(nb)
        except ValueError as e:
            print(e, traceback.format_exc())
            print("The format of {}, which is your time aggregation, is not recognized. Please read the doc.".format(time))
        if kind == "second":
            return nb
        if kind == "minute":
            return 60 * nb
        if kind == "hour":
            return 3600 * nb
        if kind == "day":
            return 24 * 3600 * nb
    assert False, "The format of {}, which is your time aggregation, is not recognized. Please read the doc.".format(time)
# Rolling window in Spark
def distinct_count_over(data, window_size: str, out_column: str, *input_columns, time_column: str = 'timestamp'):
    """
    data          : pyspark dataframe
    window_size   : size of the rolling window, see hive_time for the accepted format
    out_column    : name of the column where you want to store the result
    input_columns : the columns over which you want to count distinct combinations
    time_column   : name of the column holding the time field (must be in ISO 8601)
    return        : a new dataframe with the result column added
    """
    concatenated_columns = F.concat(*input_columns)
    # Order by the time column as epoch seconds and look back window_size seconds.
    w = (Window.orderBy(F.col(time_column).cast('timestamp').cast('long'))
               .rangeBetween(-hive_time(window_size), 0))
    return data \
        .withColumn(time_column, F.col(time_column).cast('timestamp')) \
        .withColumn(out_column, F.size(F.collect_set(concatenated_columns).over(w)))
It works well; I haven't checked the performance yet.
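A hedged usage sketch of the two helpers above, assuming the df with port/flag/timestamp columns built from the question's sample data:

# Count distinct (port, flag) combinations over a trailing 3-hour window.
result = distinct_count_over(df, "3hour", "distinct_port_flag_overs_3h",
                             "port", "flag", time_column="timestamp")
result.orderBy("timestamp").show(truncate=False)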
I have a DataFrame(df) in pyspark, by reading from a hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1': ['google', 'msn', 'yahoo']}
{'person2': ['fb.com', 'airbnb', 'wired.com']}
{'person3': ['fb.com', 'google.com']}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
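For example, a hedged way to peek at a few elements locally (take() only pulls a handful of rows back to the driver; the column names follow the question's table):

df.rdd.map(lambda row: row.asDict()).take(2)
# roughly: [{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
#           {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']}]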
How about using the pyspark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and does not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect() [1] to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
[1] Be advised that for large data sets, this operation can be slow and may fail with an out-of-memory error. Consider first whether this is really what you want to do, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)
I need to combine the requests and customMetrics tables by parsed URL. The output should have the common parsed URL, the average duration from requests, and the average value from customMetrics.
This code doesn't work:
let parseUrlOwn = (stringUrl:string) {
    let halfparsed = substring(stringUrl, 157);
    substring(halfparsed, 0, indexof(halfparsed, "?"))
};
customMetrics
| where name == "Api.GetData"
| extend urlURI = tostring(customDimensions.RequestedUri)
| extend urlcustomMeticsParsed = parseUrlOwn(urlURI)
| extend unionColumnUrl = urlcustomMeticsParsed
| summarize summaryCustom = avg(value) by unionColumnUrl
| project summaryCustom, unionColumnUrl
| join (
requests
| where isnotempty(cloud_RoleName)
| extend urlRequestsParsed = parseUrlOwn(url)
| extend unionColumnUrl = urlRequestsParsed
| summarize summaryRequests =sum(itemCount), avg(duration)
| project summaryRequests, unionColumnUrl
) on unionColumnUrl
Instead of inventing your own URL parsing, how about using the built-in parse_url() function (https://docs.loganalytics.io/docs/Language-Reference/Scalar-functions/parse_url()) instead?
It also appears that the summarize line inside the requests join isn't summarizing by URL, so I'm not sure how that works.
Shouldn't this line:
| summarize summaryRequests =sum(itemCount), avg(duration)
be
| summarize summaryRequests =sum(itemCount), avg(duration) by unionColumnUrl
like it is in the metrics part of the query? Also, why are you calculating the average in that summarize? You're just throwing it away by not projecting it on the next line.