This is a question about building a data-analytics pipeline in a kappa architecture. The question is conceptual.
Assume you have a system that emits events. For simplicity, assume there are just two event types, CREATED and DELETED, which signal that an item gets created or deleted at a given point in time. Each event contains an id and a timestamp. An item will get created and deleted again after a certain time. Assume the application ensures correct ordering of events, prevents duplicates, and never emits two events with the exact same timestamp.
The metrics that should be available in data analytics are:
Current number of items
Number of items as a graph over the last week
Number of items per day as historical data
Now a proposal for an architecture for such a scenario would be like this:
Emit events to Kafka
Use Kafka as short-term storage
Use Superset to display live data directly from Kafka via Presto
Use Spark to consume Kafka events and write aggregations to an analytics Postgres DB
Schematically it would look like this:
Application
|
| (publish events)
↓
Kafka [topics: item_created, item_deleted]
| ↑
| | (query short-time)
| |
| Presto ←-----------┐
| |
| (read event stream) |
↓ |
Spark |
| |
| (update metrics) |
↓ |
Postgres |
↑ |
| (query) | (query)
| |
└-----Superset-----┘
Now this data-analytics setup should be used to visualise historical and live data. Importantly, the application may already have a database with historical data. To make this work, when the data-analytics stack starts up, that database is first parsed and events are emitted to Kafka to transfer the historical data. Live data can arrive at any time and will also be processed.
An idea to make the metrics work is the following: with the help of Presto, the events can easily be aggregated using Kafka's short-term retention itself.
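As a rough illustration (not from the original post), the "current amount of items" metric could be computed in Presto over the Kafka connector's topic tables. The catalog, schema, and table names below are assumptions, and the connector must be configured to decode the message payloads:

```sql
-- Hypothetical Presto Kafka connector tables; names are assumptions.
SELECT (SELECT count(*) FROM kafka.default.item_created)
     - (SELECT count(*) FROM kafka.default.item_deleted) AS current_items;
```

This only works while all relevant events are still within Kafka's retention window.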
For historical data, the idea could be to create a table Items with the schema:
--------------------------------------------
| Items |
--------------------------------------------
| timestamp | numberOfItems |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0 |
| 2021-11-17 09:00:00.000 | 20 |
| 2021-11-18 09:00:00.000 | 5 |
| 2021-11-19 09:00:00.000 | 7 |
| 2021-11-20 09:00:00.000 | 14 |
Now the idea would be that the Spark program (which would of course need to parse the schema of the topic messages) assesses the timestamp, checks which time window the event falls into (in this case, which day), and updates the count by +1 in case of a CREATED or -1 in case of a DELETED event.
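The per-event update logic described above can be sketched in plain Python (an illustration, not the actual Spark job); the event field names are assumptions:

```python
from datetime import datetime, timezone

def event_to_update(event):
    """Map a CREATED/DELETED event to (day_bucket, delta)."""
    # Assumed fields: "type", "timestamp" (epoch milliseconds, UTC).
    ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
    day = ts.date().isoformat()
    delta = 1 if event["type"] == "CREATED" else -1
    return day, delta

# A CREATED event at 2021-11-17 09:00:00 UTC maps to its day bucket with +1.
print(event_to_update({"id": "a1", "type": "CREATED", "timestamp": 1637139600000}))
# -> ('2021-11-17', 1)
```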
The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. On startup it would mean a lot of reads and writes to the analytics database. There will be multiple Spark workers updating the analytics database in parallel, so the queries must be written as atomic operations, not as read-then-write-back, because the value might have been altered in the meantime by another Spark node. What could be done to make this process efficient? How could Kafka be prevented from being flooded during the startup process?
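The atomic-update concern can be illustrated with an upsert that performs the increment inside the database (sketched here with SQLite's `INSERT ... ON CONFLICT`, a form Postgres also supports), so concurrent workers never do read-then-write-back:

```python
import sqlite3

# In-memory stand-in for the analytics database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (day TEXT PRIMARY KEY, numberOfItems INTEGER)")

def apply_delta(day, delta):
    # The increment happens atomically inside the database engine.
    conn.execute(
        "INSERT INTO Items (day, numberOfItems) VALUES (?, ?) "
        "ON CONFLICT(day) DO UPDATE SET "
        "numberOfItems = numberOfItems + excluded.numberOfItems",
        (day, delta),
    )

# Two CREATED events and one DELETED event on the same day net out to 1.
for day, delta in [("2021-11-17", 1), ("2021-11-17", 1), ("2021-11-17", -1)]:
    apply_delta(day, delta)

print(conn.execute(
    "SELECT numberOfItems FROM Items WHERE day = ?", ("2021-11-17",)
).fetchone())
# -> (1,)
```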
Is this an intended use case for Spark? What would be a good alternative for this problem?
In terms of data throughput, assume on the order of 1,000-10,000 of these events per day.
Update:
Apparently Spark is not intended to be used like this, as can be seen from this issue.
Apparently spark is not intended to be used like this
You don't need Spark, or at least, not completely.
Kafka Streams can be used to move data between various Kafka topics.
Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
Also, you can use Apache Pinot for indexed real-time and batch/historical analytics from Kafka data rather than having Presto just consume and parse the data (or needing a separate Postgres database only for analytical purposes)
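As a rough illustration of the Kafka Connect suggestion above, a JDBC sink configured for upserts into Postgres might look like this (connector name, topic, and connection details are placeholders):

```json
{
  "name": "analytics-postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://postgres:5432/analytics",
    "topics": "item_counts_per_day",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true"
  }
}
```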
assume like 1000-10000 of this events per day
Should be fine. I've worked with systems that did millions of events, but they were mostly written to Hadoop or S3 rather than directly into a database; you could also have Presto query those.
Related
Let's assume the following setting: I have a stream of events, and I want some specific events to trigger an action. A concrete case: a stream of customers' orders, where, if an order meets a certain set of conditions, I want to send the customer a notification/SMS. At the same time, I want to track how fast I am processing the messages and monitor which order met which condition.
For notifications, I use Spark Structured Streaming code consisting of several operations:
from pyspark.sql.functions import col

df_orders = spark.readStream.format("eventhubs").options(**conf).load()
(df_orders
.filter(col('sms_consent') == True)
.filter(col('order_price') > 1000)
.dropDuplicates(['order_id', 'customer_id'])
.writeStream
.format('eventhubs')
.options(**conf)
.start()
)
Now I want to build a "monitoring/reporting" solution, which will export the following data for every incoming order:
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
| order_id | filtered_sms_consent | filtered_order_price | time_messageReceived | time_processingFinished | time_sentToEventHub |
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
| 1 | True | None | 9:40:00 | 9:41:00 | None |
| 2 | False | False | 9:41:00 | 9:42:00 | 9:42:21 |
| 3 | False | True | 9:43:00 | 9:45:00 | None |
+----------+-----------------------+-----------------------+-----------------------+--------------------------+----------------------+
(The shape does not matter - the table can be de-pivoted to more "log-like" structure...)
My experiments:
First, I thought about using Spark listeners (StreamingQueryListener), as the listeners seem able to log things such as the query state, average processing time, etc. But I couldn't find any way to match a specific event (order_id) with data from the query listener.
Next, I wrote a separate query for monitoring while keeping the query for the actual logic. The issue is that since these are two separate queries, each is executed independently, so the timestamps are off. I managed to bind them together using the foreachBatch() approach. This, however, runs into a problem with dropDuplicates (the query must be split in two) and it feels very "heavy" (it slows down the execution quite a bit).
Dream:
What I would love to have is something like:
(df_orders
 .log('order_id {}: processing started at {}'.format(col('order_id'), time.now()))
 .filter(col('sms_consent') == True)
 .log('order_id {}: filtered on sms_consent'.format(col('order_id')))
 .filter(col('order_price') > 1000)
 .log('order_id {}: filtered on order_price'.format(col('order_id')))
 ...
)
or to have this information in the Spark logs by default, with some means to extract it.
How is this achievable?
You can create a UDF that sends logs to whatever storage you need and call it during streaming, sending data from each worker. It can be slow.
You can also create a UDF that writes to the standard Spark logs. To inspect those you need to collect logs from all nodes; I used Logstash to collect the local logs from all nodes and Kibana as a dashboard.
If you need to log time-series data, you can use the Spark metrics system https://spark.apache.org/docs/latest/monitoring.html#metrics and https://github.com/groupon/spark-metrics for custom metrics. This will allow you to create a UDF and send custom metrics during streaming.
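As a plain-Python sketch of the per-stage logging idea (an illustration outside Spark; `logged_filter` is a hypothetical helper, not a Spark API): a wrapper applies each filter stage and records which order passed it and when.

```python
from datetime import datetime, timezone

log = []  # collected (order_id, stage, timestamp) entries

def logged_filter(rows, stage, predicate):
    """Apply a filter stage and log every row that passes it."""
    passed = [r for r in rows if predicate(r)]
    for r in passed:
        log.append((r["order_id"], stage, datetime.now(timezone.utc)))
    return passed

orders = [
    {"order_id": 1, "sms_consent": True, "order_price": 500},
    {"order_id": 2, "sms_consent": True, "order_price": 1500},
]
remaining = logged_filter(orders, "sms_consent", lambda r: r["sms_consent"])
remaining = logged_filter(remaining, "order_price", lambda r: r["order_price"] > 1000)
print([r["order_id"] for r in remaining])  # -> [2]
```

The `log` list plays the role of the external log store; in a real job each worker would ship these entries to such a store instead.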
I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~300MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check "Spark count vs take and length" for a detailed description of LIMIT behavior).
If 1234 were in the first partition and you had explicitly set the limit to 1,
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is larger than the number of values that satisfy the predicate, or the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it faster you'll need the data bucketed (on disk) or partitioned (in memory) by the field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
But really, if you need fast lookups, use a proper database; this is what they are designed for.
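The LIMIT behavior described above can be illustrated with a toy Python model (an illustration, not Spark itself): the scan stops early once enough matching rows have been found, so where the matches live determines how many partitions get read.

```python
def take_with_limit(partitions, predicate, limit):
    """Scan partitions in order, stopping once `limit` matches are found."""
    found, scanned = [], 0
    for part in partitions:
        scanned += 1
        found.extend(r for r in part if predicate(r))
        if len(found) >= limit:
            break
    return found[:limit], scanned

# 10 partitions of 100 consecutive integers each.
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]

# Common predicate: plenty of matches in the first partition alone.
_, scanned_common = take_with_limit(partitions, lambda r: r % 2 == 0, 20)
# Selective predicate: the only match sits in the last partition.
_, scanned_rare = take_with_limit(partitions, lambda r: r == 950, 1)

print(scanned_common, scanned_rare)  # -> 1 10
```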
In a production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. Because of this, clients face a lot of read and write timeouts. This repeats every few hours.
Observation:
Cluster write latency (99th percentile): 4 s
Local write latency (99th percentile): 10 ms
Read & write consistency: LOCAL_ONE
Total nodes: 7
I tried enabling tracing using settraceprobability for a few minutes and observed that most of the time is taken in internode communication:
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
I checked the connectivity between the Cassandra nodes but did not see any issues. The Cassandra logs are flooded with read timeout exceptions, as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue window Cassandra is not even able to read the metadata.
Has anyone faced similar issues? Any suggestions?
(From the data in your comments above) the long full-GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which is most likely from a silly DGC/RMI thing that gets called at regular intervals regardless of whether it is needed. With the larger heap that is VERY expensive. It is safe to disable.
Check your GC log header and make sure the min heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.
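If the flags above are adopted, they would typically go into Cassandra's JVM options file (conf/jvm.options in recent versions, cassandra-env.sh in older ones; the exact file varies by version):

```
-XX:+DisableExplicitGC
-XX:G1ReservePercent=20
```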
Similar to Kafka's log compaction, there are quite a few use cases where it is required to keep only the last update for a given key and use the result, for example, for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example suppose I have table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need a kind of dropDuplicates() function with reverse ordering.
Please note that this question is explicitly about Structured Streaming and how to solve the problem without constructs that waste the aggregation step (hence, anything with window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
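As a plain-Python illustration (not Spark) of the state such a function would keep, using the example table above: per key, retain only the row with the greatest event time.

```python
def compact_last_by_key(rows):
    """Keep only the latest (time, value) per key, like log compaction."""
    state = {}
    for key, time, value in rows:
        if key not in state or time > state[key][0]:
            state[key] = (time, value)
    return dict(sorted(state.items()))

rows = [("A", 1, "foo"), ("B", 2, "foobar"), ("A", 2, "bar"), ("A", 15, "foobeedoo")]
print(compact_last_by_key(rows))
# -> {'A': (15, 'foobeedoo'), 'B': (2, 'foobar')}
```

In the streaming version, the dict corresponds to the per-group GroupState and watermarking bounds how long stale keys are retained.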
I am looking to use the Apache Cassandra database to store a time series of 1-minute OHLCV financial data for ~1000 symbols. This will need to be updated in real time as data is streamed in. All entries older than 24 hours are not needed and should be discarded.
Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.
I am interested in designing this database to efficiently retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast query times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates a [30m, 1h, 12h, 24h] OHLCV from the time of the query. E.g. if I want to retrieve the past 1h OHLCV at 1:32pm, the query will give me a 1h OHLCV that represents data from 12:32pm-1:32pm.
What would be a good design to meet these requirements? I am not concerned with the database's footprint on disk. The real issue is fast query times that are light on CPU and RAM.
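Whatever the storage design, the rollup itself can be sketched in plain Python (an illustration, not part of the original question): collapsing a time-sorted slice of 1-minute bars for one symbol into a single OHLCV bar.

```python
def rollup(bars):
    """bars: list of (time, open, high, low, close, volume), sorted by time."""
    return {
        "open": bars[0][1],                     # FIRST(open)
        "high": max(b[2] for b in bars),        # MAX(high)
        "low": min(b[3] for b in bars),         # MIN(low)
        "close": bars[-1][4],                   # LAST(close)
        "volume": sum(b[5] for b in bars),      # SUM(volume)
    }

bars = [
    (1, 10.0, 11.0, 9.5, 10.5, 100),
    (2, 10.5, 12.0, 10.0, 11.5, 150),
    (3, 11.5, 11.8, 11.0, 11.2, 120),
]
print(rollup(bars))
# -> {'open': 10.0, 'high': 12.0, 'low': 9.5, 'close': 11.2, 'volume': 370}
```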
I have come up with a simple and naive way to store each record with clustering ordered by time:
CREATE TABLE symbols (
    symbol text,
    time timestamp,
    open double,
    high double,
    low double,
    close double,
    volume double,
    PRIMARY KEY (symbol, time)
) WITH CLUSTERING ORDER BY (time DESC);
But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
Any suggestions will be much appreciated.
While not based on Cassandra, Axibase Time Series Database (ATSD) can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.
An OHLCV query for a 15-minute window might look as follows:
SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time,
FIRST(t_open.value) AS open,
MAX(t_high.value) AS high,
MIN(t_low.value) AS low,
LAST(t_close.value) AS close,
SUM(t_volume.value) AS volume
FROM stock.open AS t_open
JOIN stock.high AS t_high
JOIN stock.low AS t_low
JOIN stock.close AS t_close
JOIN stock.volume AS t_volume
WHERE t_open.entity = 'ibm'
AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
GROUP BY PERIOD(15 MINUTE, END_TIME)
ORDER BY datetime
Note the GROUP BY PERIOD clause above which does all the work behind the scenes.
Query results:
| time | open | high | low | close | volume |
|----------------------|----------|---------|----------|---------|--------|
| 2018-03-29 10:32:00 | 151.8 | 152.14 | 151.65 | 152.14 | 85188 |
| 2018-03-29 10:47:00 | 152.18 | 152.64 | 152 | 152.64 | 88065 |
| 2018-03-29 11:02:00 | 152.641 | 153.04 | 152.641 | 152.69 | 126511 |
| 2018-03-29 11:17:00 | 152.68 | 152.75 | 152.43 | 152.51 | 104068 |
You can use a Type 4 JDBC driver, API clients or just curl to run these queries.
I'm using sample 1-minute data for the above example which you can download from Kibot as described in these compression tests.
Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.
Disclaimer: I work for Axibase.