PySpark Structured Streaming: Pass output of Query to API endpoint - apache-spark

I have the following dataframe in Structured Streaming:
TimeStamp | Room | Temperature
----------|------|------------
00:01:29  |  1   |  55
00:01:34  |  2   |  51
00:01:36  |  1   |  56
00:02:03  |  2   |  49
I am trying to detect when the temperature falls below a certain threshold (50 in this case). I have that part of the query working. Now I need to pass this information to an API endpoint via a POST call like this: '/api/lowTemperature/', with the timestamp and the temperature in the body of the request. So, in the above case, I need to send along:
POST /api/lowTemperature/2
BODY: { "TimeStamp":"00:02:03",
"Temperature":"49" }
Any idea how I can achieve this using PySpark?
One way I thought of doing this was with a custom streaming sink, but I can't seem to find any documentation on implementing one in Python.

Good news: support for Python has recently been added for the ForeachWriter. I created one in Python for REST and Azure Event Grid and it's rather straightforward.
The (basic) documentation can be found here: https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#using-python

At the time of my original response, ForeachWriter was only supported for Java/Scala; however, it now supports Python as well.
Make sure you read the section on execution semantics and understand how to avoid duplicate API calls if that's an issue.
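For reference, here is a minimal sketch of such a Python foreach handler that POSTs low-temperature rows to the endpoint. The base URL, the use of the requests library, and the column names are assumptions taken from the question, not something the docs prescribe:

import requests
from pyspark.sql.functions import col

API_BASE = "https://example.com/api/lowTemperature"  # placeholder host

def post_low_temp(row):
    # Invoked once per output row on the executors, so keep it lightweight.
    # Note: foreach gives at-least-once semantics, so the endpoint may
    # receive duplicates after a failure/retry.
    requests.post(
        "{}/{}".format(API_BASE, row["Room"]),
        json={"TimeStamp": row["TimeStamp"], "Temperature": str(row["Temperature"])},
        timeout=5,
    )

query = (df
    .filter(col("Temperature") < 50)
    .writeStream
    .foreach(post_low_temp)   # row-at-a-time Python sink
    .outputMode("append")
    .start())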

Related

Kappa architecture - conceptual question about historical data processing

This is a question about building a pipeline for data analytics in a kappa architecture. The question is conceptual.
Assume you have a system that emits events; for simplicity, let's assume there are just two events, CREATED and DELETED, which indicate that an item gets created or deleted at a given point in time. Those events contain an id and a timestamp. An item will get created and then deleted again after a certain time. Assume the application ensures correct ordering of events, prevents duplicate events, and never emits two events with the exact same timestamp.
The metrics that should be available in data analytics are:
Current amount of items
Amount of items as graph over the last week
Amount of items per day as historical data
Now a proposal for an architecture for such a scenario would be like this:
Emit events to Kafka
Use Kafka as short-term storage
Use Superset to display live data directly from Kafka via Presto
Use Spark to consume Kafka events and write aggregations to an analytics Postgres DB
Schematically it would look like this:
Application
|
| (publish events)
↓
Kafka [topics: item_created, item_deleted]
| ↑
| | (query short-time)
| |
| Presto ←-----------┐
| |
| (read event stream) |
↓ |
Spark |
| |
| (update metrics) |
↓ |
Postgres |
↑ |
| (query) | (query)
| |
└-----Superset-----┘
Now this data-analytics setup should be used to visualise historical and live data. It is important to note that the application may already have a database with historical data. To make this work, when the data-analytics stack starts up, that database is first parsed and events are emitted to Kafka to transfer the historical data. Live data can come in at any time and will also be processed.
An idea to make the metrics work is the following: with the help of Presto, the events can easily be aggregated from Kafka's short-term storage itself.
For the historical data, the idea could be to create a table Items with the schema:
--------------------------------------------
| Items |
--------------------------------------------
| timestamp | numberOfItems |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0 |
| 2021-11-17 09:00:00.000 | 20 |
| 2021-11-18 09:00:00.000 | 5 |
| 2021-11-19 09:00:00.000 | 7 |
| 2021-11-20 09:00:00.000 | 14 |
Now the idea would be that the Spark program (which would of course need to parse the schema of the topic messages) reads the timestamp of each event, determines which time window the event falls into (in this case which day), and updates the count by +1 for a CREATED event or -1 for a DELETED event.
The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. On startup it would mean a lot of reads and writes to the analytics database. There will be multiple Spark workers updating the analytics database in parallel, so the queries must be written as atomic operations, not as a read followed by a write-back, because the value might have been altered in the meantime by another Spark node. What could be done to make this process efficient? How could Kafka be prevented from being flooded during the startup process?
Is this an intended use case for Spark? What would be a good alternative for this problem?
In terms of data throughput, assume something like 1,000-10,000 of these events per day.
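For illustration, a rough sketch of the Spark side I have in mind, using foreachBatch so each micro-batch becomes one atomic upsert per day. The stream, table and column names are placeholders, and the Postgres call is only indicated in comments:

from pyspark.sql import functions as F

def update_daily_counts(batch_df, batch_id):
    # +1 for CREATED, -1 for DELETED, pre-aggregated per day inside the batch.
    deltas = (batch_df
        .withColumn("day", F.date_trunc("day", F.col("timestamp")))
        .withColumn("delta", F.when(F.col("event") == "CREATED", 1).otherwise(-1))
        .groupBy("day")
        .agg(F.sum("delta").alias("delta"))
        .collect())
    for row in deltas:
        # Apply each delta with a single atomic statement on the DB side
        # (e.g. via psycopg2), so concurrent writers never read-then-write-back:
        #   INSERT INTO items (timestamp, numberOfItems) VALUES (%s, %s)
        #   ON CONFLICT (timestamp) DO UPDATE
        #     SET numberOfItems = items.numberOfItems + EXCLUDED.numberOfItems;
        pass

(events                       # the already-parsed stream of CREATED/DELETED events
    .writeStream
    .foreachBatch(update_daily_counts)
    .option("checkpointLocation", "/tmp/checkpoints/item-counts")
    .start())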
Update:
Apparently Spark is not intended to be used like this, as can be seen from this issue.
Apparently Spark is not intended to be used like this
You don't need Spark, or at least, not completely.
Kafka Streams can be used to move data between various Kafka topics.
Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
Also, you can use Apache Pinot for indexed real-time and batch/historical analytics on the Kafka data, rather than having Presto just consume and parse the data (or needing a separate Postgres database for analytical purposes only).
assume something like 1,000-10,000 of these events per day
Should be fine. I've worked with systems that did millions of events, but were mostly written to Hadoop or S3 rather than directly into a database, which you could also have Presto query.

Spark Structured Streaming - Log the internal progress of a query

Let's assume the following setting: I have a stream of events and I want some specific events to trigger an action. A concrete case could be a stream of customers' orders where, if an order meets a certain set of conditions, I want to send the customer a notification/SMS. At the same time, I want to track how fast I am processing the messages and monitor which order met which condition.
For notifications, I use Spark Structured Streaming code consisting of several operations:
df_orders = spark.readStream.format("eventhubs").options(**conf).load()

(df_orders
    .filter(col('sms_consent') == True)
    .filter(col('order_price') > 1000)
    .dropDuplicates(['order_id', 'customer_id'])
    .writeStream
    .format('eventhubs')
    .options(**conf)
    .start()
)
Now I want to build a "monitoring/reporting" solution, which will export the following data for every incoming order:
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
| order_id | filtered_sms_consent | filtered_order_price | time_messageReceived | time_processingFinished | time_sentToEventHub |
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
| 1 | True | None | 9:40:00 | 9:41:00 | None |
| 2 | False | False | 9:41:00 | 9:42:00 | 9:42:21 |
| 3 | False | True | 9:43:00 | 9:45:00 | None |
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
(The shape does not matter; the table can be de-pivoted to a more "log-like" structure...)
My experiments:
First, I thought about using the Spark listeners (StreamingQueryListener), as the listeners are able to log things such as the query state, average processing time, etc. But I couldn't find any way to match a specific event (order_id) with the data coming from the query listener.
Next, I wrote a separate query for monitoring while keeping the query for the actual logic. The issue is that since these are two separate queries, each is executed independently, so the timestamps are off. I managed to bind them together using the foreachBatch() approach; however, that runs into a problem with dropDuplicates (the query must be split in two) and it feels very "heavy" (it slows down the execution quite a bit).
Dream:
What I would love to have is something like:
(df_orders
    .log('order_id {}: processing started at {}'.format(col('order_id'), time.now()))
    .filter(col('sms_consent') == True)
    .log('order_id {}: filtered on sms_consent'.format(col('order_id')))
    .filter(col('order_price') > 1000)
    .log('order_id {}: filtered on order_price'.format(col('order_id')))
    ...
)
or to have this information in spark logs by default and that have means to extract it.
How is this achievable?
You can create a UDF that sends log records to the storage of your choice and call it during streaming, so the data is sent from each worker. It can be slow.
You can create a UDF that writes to the standard Spark logs. To inspect the logs you then need to collect them from all nodes; I used Logstash to collect the local logs from all nodes and Kibana as the dashboard.
If you need to log time-series data, you can use the Spark metrics system (https://spark.apache.org/docs/latest/monitoring.html#metrics, and https://github.com/groupon/spark-metrics for custom metrics). This allows you to create a UDF and send custom metrics during streaming.
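A minimal sketch of the first suggestion, assuming a UDF that writes one log line per row and passes the value through unchanged; the logger name, stage labels and column types are made up for illustration:

import logging
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def log_stage(stage):
    # Returns a UDF that logs the order_id and returns it unchanged.
    def _log(order_id):
        # Runs on the executors; the lines end up in each worker's log output,
        # which you then ship to a central store (e.g. Logstash -> Kibana).
        logging.getLogger("order-tracking").warning("order_id %s: passed %s", order_id, stage)
        return order_id
    return F.udf(_log, StringType())

tracked = (df_orders
    .withColumn("order_id", log_stage("ingest")(F.col("order_id")))
    .filter(F.col("sms_consent") == True)
    .withColumn("order_id", log_stage("sms_consent")(F.col("order_id")))
    .filter(F.col("order_price") > 1000)
    .withColumn("order_id", log_stage("order_price")(F.col("order_id"))))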

Best way to filter to a specific row in pyspark dataframe

I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 were on the first partition and you explicitly set the limit to 1,
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is larger than the number of values that satisfy the predicate, or the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it faster you'll need the data bucketed (on disk) or partitioned (in memory) using the field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
But really, if you need fast lookups, use a proper database; this is what they are designed for.
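For instance, a hedged sketch of the bucketing route (the table name and bucket count are arbitrary): with the data written bucketed and sorted on id, an equality filter on id should only have to read the matching bucket rather than every file.

# One-time write, bucketed and sorted by the lookup key.
(df.write
    .bucketBy(256, "id")
    .sortBy("id")
    .mode("overwrite")
    .saveAsTable("people_bucketed"))

# Subsequent point lookups can prune down to a single bucket.
spark.table("people_bucketed").filter("id = 1234").show()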

Retain last row for given key in spark structured streaming

Similar to Kafka's log compaction there are quite a few use cases where it is required to keep only the last update on a given key and use the result for example for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example suppose I have table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need some kind of dropDuplicates() function with reverse ordering.
Please note that this question is explicitly about Structured Streaming and how to solve the problem without constructs that waste the aggregation step (hence, anything with window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
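Note that flatMapGroupsWithState/mapGroupsWithState are only exposed in Scala/Java; in PySpark the closest counterpart is applyInPandasWithState (Spark 3.4+). A rough sketch of the same idea with it; the schemas and column names are taken from the example above, everything else is an assumption:

from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

output_schema = "key string, time long, value string"
state_schema = "time long, value string"

def keep_latest(key: Tuple, pdfs: Iterator[pd.DataFrame], state: GroupState) -> Iterator[pd.DataFrame]:
    # Start from the previously stored latest row for this key, if any.
    latest_time, latest_value = state.get if state.exists else (-1, None)
    for pdf in pdfs:
        if pdf.empty:
            continue
        row = pdf.loc[pdf["time"].idxmax()]      # newest row in this chunk
        if row["time"] > latest_time:
            latest_time, latest_value = int(row["time"]), row["value"]
    state.update((latest_time, latest_value))
    yield pd.DataFrame({"key": [key[0]], "time": [latest_time], "value": [latest_value]})

latest = (df
    .groupBy("key")
    .applyInPandasWithState(keep_latest, output_schema, state_schema,
                            "update", GroupStateTimeout.NoTimeout))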

Collect rows from spark DataFrame into JSON object, then put the object to another DF

I have a Spark DataFrame which contains some application usage data.
I'm aiming to collect certain metrics from this DataFrame, and then accumulate them together.
For instance, I may want to obtain a total number of users of my product in this DataFrame:
df.select($"user").distinct.count
100500
And then I want to build stats across different application versions:
df.groupBy("version").count.toJSON.show(false)
+-----------------------------------------+
|value |
+-----------------------------------------+
|{"version":"1.2.3.4","count":4051} |
|{"version":"1.2.3.5","count":1} |
|{"version":"1.2.4.6","count":1} |
|{"version":"2.0.0.1","count":30433} |
|{"version":"3.1.2.3","count":112195}|
|{"version":"3.1.0.4","count":11457} |
+-----------------------------------------+
Then I'd like to squash the records in the second DF, so in the end I need to have an object like this:
{
  "totalUsers": 100500,
  "versions": [
    {"version":"1.2.3.4","count":4051},
    {"version":"1.2.3.5","count":1},
    {"version":"1.2.4.6","count":1},
    {"version":"2.0.0.1","count":30433},
    {"version":"3.1.2.3","count":112195},
    {"version":"3.1.0.4","count":11457}
  ]
}
Then this object shall be written to another spark DF.
What could be the right way to implement this?
Disclaimer: I'm quite new to spark, so I'm sorry if my question is too noobish.
I've read plenty of similar questions, including seemingly similar ones like this and this. The latter is close, but still doesn't give a clue on how to accumulate multiple rows into one object. Neither was I able to figure it out from the Apache Spark docs.
Try the collect_list function, for example:
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

totalUsers = 100500
agg = df.groupBy().agg(F.collect_list("value").alias("versions")).withColumn("totalUsers", lit(totalUsers))
agg.show()
Where df is the data frame with the aggregated versions (the toJSON output above). I get the following result:
+--------------------+----------+
| versions|totalUsers|
+--------------------+----------+
|[{"version":"1.2....| 100500|
+--------------------+----------+
My example is written in Python, but I believe you can use the same approach in your language.
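If you then need the exact nested shape from the question (totalUsers plus a versions array in one JSON object), here is a sketch building it with struct and to_json; it assumes raw_df is the original, un-aggregated DataFrame from the question, and the hard-coded total is just the number from the example:

from pyspark.sql import functions as F

versions = raw_df.groupBy("version").count()      # per-version counts

summary = (versions
    .agg(F.collect_list(F.struct("version", "count")).alias("versions"))
    .withColumn("totalUsers", F.lit(100500)))

# One row, one column, containing the nested JSON document.
summary.select(F.to_json(F.struct("totalUsers", "versions")).alias("value")).show(truncate=False)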
