Related
This is a question about building a pipeline for data-analytics in a kappa architecture. The question is conceptional.
Assume you have a system that emits events, for simplicity let's assume you just have two events CREATED and DELETED which tell that an item get's created or deleted at a given point in time. Those events contain an id and a timestamp. An item will get created and deleted again after a certain time. Assume the application ensures correct order of events and prevents duplicate events and no event is emitted with the exact same timestamp.
The metrics that should be available in data analytics are:
Current amount of items
Amount of items as graph over the last week
Amount of items per day as historical data
Now a proposal for an architecture for such a scenario would be like this:
Emit events to Kafka
Use kafka as short term storage
Use superset to display live data directly on kafka with presto
Use spark to consume kafka events to write aggregations to analytics Postgres db
Schematically it would look like this:
Application
|
| (publish events)
↓
Kafka [topics: item_created, item_deleted]
| ↑
| | (query short-time)
| |
| Presto ←-----------┐
| |
| (read event stream) |
↓ |
Spark |
| |
| (update metrics) |
↓ |
Postgres |
↑ |
| (query) | (query)
| |
└-----Superset-----┘
Now this data-analytics setup should be used to visualise historical and live data. Very important to note is that in this case the application can have already a database with historical data. To make this work when starting up the data analytics first the database is parsed and events are emitted to kafka to transfer the historical data. Live data can come at any time and will also be progressed.
An idea to make the metric work is the following. With the help of presto the events can easily be aggregated through the short term memory of kafka itself.
For historical data the idea could be to create a table Items that with the schema:
--------------------------------------------
| Items |
--------------------------------------------
| timestamp | numberOfItems |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0 |
| 2021-11-17 09:00:00.000 | 20 |
| 2021-11-18 09:00:00.000 | 5 |
| 2021-11-19 09:00:00.000 | 7 |
| 2021-11-20 09:00:00.000 | 14 |
Now the idea would that the spark program (which would need of course to parse the schema of the topic messages) and this will assess the timestamp check in which time-window the event falls (in this case which day) and update the number by +1 in case of a CREATED or -1 in case of a DELTED event.
The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. In startup it would mean a lot of read and writes to the analytics database. There will be multiple spark workers to update the analytics database in parallel and the queries must be written such that it's all atomic operations and not like read and then write back because the value might have been altered in the meanwhile by another spark node. What could be done to make this process efficient? How would it be possible to prevent kafka being flooded in the startup process?
Is this an intended use case for spark? What would be a good alternative for this problem?
In terms of data-throughput assume like 1000-10000 of this events per day.
Update:
Apparently spark is not intended to be used like this as it can be seen from this issue.
Apparently spark is not intended to be used like this
You don't need Spark, or at least, not completely.
Kafka Streams can be used to move data between various Kafka topics.
Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
Also, you can use Apache Pinot for indexed real-time and batch/historical analytics from Kafka data rather than having Presto just consume and parse the data (or needing a separate Postgres database only for analytical purposes)
assume like 1000-10000 of this events per day
Should be fine. I've worked with systems that did millions of events, but were mostly written to Hadoop or S3 rather than directly into a database, which you could also have Presto query.
I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds and the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all show takes only as little data as possible, so as long there is enough data to collect 20 rows (defualt value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 was on the first partition and you've explicitly set limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if limit is smaller than number of values that satisfy the predicate, or values of interest reside in the further partitions, Spark will have to scan all data.
If you want to make it work faster you'll need data bucketed (on disk) or partitioned (in memory) using field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like unfortunately inactive, succint).
But really, if you need fast lookups, use a proper database - this what they are designed for.
lately I've been playing around with Stream Analytics queries with PowerBI as output sink. I made a simple query which retrieves the total count of http responsecodes of our website requests over time and groups them by date and response code.
The input data is retrieved from a storage account which holds BLOB storage. This is my query:
SELECT
DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0) as datum,
request.ArrayValue.responseCode,
count(request.ArrayValue.responseCode)
INTO
[requests-httpresponsecode]
FROM
[cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) as request
GROUP BY DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0), request.ArrayValue.responseCode, System.TimeStamp
Since continuous export became active on 3 september 2018, I chose a job start time of 3 september 2018. Since I am interested in the statistics until today, I did not include a date interval so I am expecting to see data from 3 september 2018 until now (20 december 2018). The job is running fine without errors and I chose PowerBI as an output sink. Immediately I saw the chart being propagated starting from 3 september grouped by day and counting. So far, so good. A few days later I noticed the output dataset didnt start from 3 september anymore but from 2 December until now. Apparently data is being overwritten.
The following link says:
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-power-bi-dashboard
"defaultRetentionPolicy: BasicFIFO: Data is FIFO, with a maximum of 200,000 rows."
But my output table does not have close to 200.000 rows:
datum,count,responsecode
2018-12-02 00:00:00,332348,527387
2018-12-03 00:00:00,3178250,3282791
2018-12-04 00:00:00,3170981,4236046
2018-12-05 00:00:00,2943513,3911390
2018-12-06 00:00:00,2966448,3914963
2018-12-07 00:00:00,2825741,3999027
2018-12-08 00:00:00,1621555,3353481
2018-12-09 00:00:00,2278784,3706966
2018-12-10 00:00:00,3160370,3911582
2018-12-11 00:00:00,3806272,3681742
2018-12-12 00:00:00,4402169,3751960
2018-12-13 00:00:00,2924212,3733805
2018-12-14 00:00:00,2815931,3618851
2018-12-15 00:00:00,1954330,3240276
2018-12-16 00:00:00,2327456,3375378
2018-12-17 00:00:00,3321780,3794147
2018-12-18 00:00:00,3229474,4335080
2018-12-19 00:00:00,3329212,4269236
2018-12-20 00:00:00,651642,1195501
EDIT: I have created the STREAM input source according to
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-quick-create-portal. I can create a REFERENCE input as well, but this invalidates my query since APPLY and GROUP BY are not supported and I also think STREAM input is what I want according to https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-add-inputs.
What am I missing? Is it my query?
It looks like you are streaming to a Streaming dataset. Streaming datasets doesn't store the data in a database, but keeps only the last hour of data. If you want to keep the data pushed to it, then you must enable Historic data analysis option, when you create the dataset:
This will create PushStreaming dataset (a.k.a. Hybrid) with basicFIFO retention policy (i.e. about 200k-210k records kept).
You're correct that Azure Stream Analytics should be creating a "PushStreaming" or "Hybrid" dataset. Can you confirm that your dataset is correctly configured as "Hybrid" (you can check this attribute even after creation as shown here)?
If it is the correct type, can you please clarify the following:
Does the schema of your data change? If, for example, you send the datum {a: 1, b: 2} and then {c: 3, d: 4}, Azure Stream Analytics will attempt to change the schema of your table, which can invalidate older data.
How are you confirming the number of rows in the dataset?
Looks like my query was the problem. I had to use TUMBLINGWINDOW(day,1) instead of System.TimeStamp.
TUMBLINGWINDOW and System.TimeStamp produce exactly the same chart output on the frontend, but seem to be processed in a different way in the backend. This was not reflected to the frontend in any way so this was confusing. I suspect something is happening in the backend due to the way the query is processed when not using TUMBLINGWINDOW and you happen to hit the 200k row per dataset limit sooner than expected. The query below is the one which is producing the expected result.
SELECT
request.ArrayValue.responseCode,
count(request.ArrayValue.responseCode),
DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0) as date
INTO
[requests-httpstatuscode]
FROM
[cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) as request
GROUP BY DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0),
TUMBLINGWINDOW(day,1),
request.ArrayValue.responseCode
As we speak my stream analytics job is running smoothly and producing the expected output from 3 september until now without data being overwritten.
I receive regularly two types of sets of data:
Network flows, thousands per second:
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
(... etc ..)
}
And interface tables, every hour:
[
{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 19
'iface_name' : 'Fa0/0'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 20
'iface_name' : 'Fa0/1'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.157.38',
'iface_id' : 20
'iface_name' : 'Gi0/3'
}
]
I want to insert those flows in Cassandra, with interface names instead of IDs, based on the latest matching host/iface_id value. I cannot rely on a memory-only solution, otherwise I may loose up to one hour of flows every time I restart the application.
What I had in mind, is to use two Cassandra tables: One that holds the flows, and one that holds the latest host/iface_id table. Then, when receiving a flow, I would use this data to properly fill interface name.
Ideally, I would like to let Cassandra take care of this. In my mind, it seems more efficient than pulling out interface names from the application side every time.
The thing is that I cannot figure out how to do that - and having never worked with NoSQL before, I am not even sure that this is the right approach... Could someone point me in the right direction?
Inserting data in the interface table and keeping only the latest version is quite trivial, but I cannot wrap my mind around the 'inserting interface name in flow record' part. In a traditional RDBMS I would use a nested query, but those don't seem to exist in Cassandra.
Reading your question, I can hope that the data hourly received in interface table is not too big. So we can keep that data (single row) in memory as well as in cassandra database. For every hour, the in memory data will get updated as well as a new inserted in to database. We can save interface data with below table definition -
create table interface_by_hour(
year int,
month int,
day int,
hour int,
data text, -- enitre json string for one hour interface data.
primary key((year,month,day,hour)));
Few insert statements --
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,23,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,00,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,28,1,'{complete json.........}');
keep every hours interface data in this table and update it in memory as well. Benefit of having in memory data is that you don't have to read it from table thousand of time every second. If application goes down, you can read the current/previous hour data from table using below query, and build the in memory cache.
cqlsh:mykeyspace> select * from interface_by_hour where year=2017 and month=1 and day=27 and hour=0;
year | month | day | hour | data
------+-------+-----+------+--------------------------
2017 | 1 | 27 | 0 | {complete json.........}
Now comes the flow data --
As we have current hour interface table data cached in memory, we can quickly map interface name to host. Use below table to save flow data.
create table flow(
iface_name text,
createdon bingint, -- time stamp in milliseconds.
host text, -----this is optionl, if you want dont use it as column.
flowdata text, -- entire json string
primarykey(iface_name,createdon,host));
Only issue I see in above table is that it will not distribute data evenly across the partitions, if you have too many flow data for one interface name, whole data will inserted in to one partition.
I designed this table just to save the data, if you could have specified how you going to use this data, I would have done some more Thinking.
hope this helps.
Hi as far as I can tell the interface data is not so heavy on the writes that it would need partitioning by time. It changes only once per hour so it's not necessary to save data for every hour just the latest version. Also I will assume that you want to query this in some way I'm not sure how so I'll just propose something general for interface and will threat the flows as time series data:
create table interface(
iface_name text primary key,
iface_id int,
host text,
stamp timestamp
);
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/0', 19, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/1', 20, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Gi0/3', 20, '192.168.157.38', '2017-01-19 03:00:00');
usually this is an antipatern with cassandra:
cqlsh:test> select * from interface;
iface_name | host | iface_id | stamp
------------+----------------+----------+---------------------------------
Fa0/0 | 192.168.2.6 | 19 | 2017-01-19 02:00:00.000000+0000
Gi0/3 | 192.168.157.38 | 20 | 2017-01-19 02:00:00.000000+0000
Fa0/1 | 192.168.2.6 | 20 | 2017-01-19 02:00:00.000000+0000
But as far as I can see you don't have that many interfaces
So basically anything up to thousands will be o.k. here in worst case you might want to
use the token function to get the data out from partitions but the thing is this will save you
a lot on space and you don't need to save this by the hour.
I would simply keep this table in memory also and then enrich the data as it comes in.
If there is updates, update the in memory cache ... but also put writes to cassandra.
If something fails then simply restore from interface table and continue.
basically your flow info would then become
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
'iface_name' : 'key put from in memory cache',
}
This is how you will get the bigest performance now saving flows is just
time series data then, take into account that you are hitting the cluster
with thousands per second and that when you are paritioning by time you
get at least 7000 if not more columns every second in (with the model
I'm proposing here) usually you will want to have up to 100 000 columns
within single partition, which would say that your partition goes over
ideal size withing 20 seconds or even less so I would even suggest using
random buckets (when inserting just use some number in defined range
let's say 10):
create table flow(
time_with_minute text,
artificial_bucket int,
stamp timeuuid,
host text,
ip_src text,
ip_dst text,
iface_in int,
iface_out int,
iface_name text,
primary key((time_with_minute, artificial_bucket), stamp)
);
When wanting to fetch flows over time you would simply use the parts of
a timestamp plus make 10 queries at the same time or one by one to access all the data. There are various techniques here, you simply need to tell more about your use case.
inserting is then something like:
insert into flow(time_with_minute, artificial_bucket, stamp, host, ip_src, ip_dst, iface_in, iface_out, iface_name)
values ('2017-01-19 01:37', 1, now(), '192.168.2.6', '10.29.78.3', '8.8.4.4', 19, 20, 'Fa0/0');
I used now just for an example, use https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/utils/UUIDGen.java to
generate timeuuid with the time when you inserted flow. Also I inserted 1 into artificial bucket, here you would insert random number with range, let's say 0-10 at least. Some people, depending on the load insert multiple random buckets, even 60 or more. It all depends on how heavy writes are. If you just put it to minute every minute a group of nodes within the cluster will be hot and this will switch around. Having hot nodes is usually not a good idea.
With cassandra you are writing the information that you need right away, you are not doing
any joins during write or something similar. Keep the data in memory that you need to
stamp the data with the information that you need and just insert without denormalisation.
Also you can model the solution in a relational way and just tell how you would like to
access the data then we can go into details.
I am working with a model that utilizes the Data Table functionality in Excel to process many scenarios and collect the output. However, I will occasionally find that the data table will repeat results for sequential scenarios, which would be impossible.
The data table is generated with the following VBA code:
Sheets("ProjSheet").Range(Range("DTAnchor").Offset(-1, -1),Range("DTAnchor").Offset(NumScns - 1, 1))
.Table ColumnInput:=Range("CurrScen")
The calculation mode is semiautomatic throughout the code, and the data table is updated using Calculate.
The erroneous output might look something like this:
Scn | Result
------------
1 | 341.5
2 | 0
3 | 861.4
4 | 861.4 <- Wrong!
5 | 861.4 <- Wrong!
6 | 10.5
7 | 64.9
...
The workbook is not small at ~22MB, and the data table typically takes a minute or so to churn through 1,000 scenarios.
I can verify the result of any particular scenario by running it individually, so we know that the results in this case should not be identical.
These models are typically saved with different model settings and run simultaneously in different instances of Excel. They are also generally run on remote computers where multiple users could log in, kick off some models, and then disconnect and leave them running in the background. These computers have 8 cores so they could run up to 8 models simultaneously. Sometimes, the models are opened and kicked off in different Excel instances using a macro that writes a VB Script which is then kicked off by a .bat file.
I've had difficulty replicating the error reliably. One theory I have is that when more Excel models are running than there are cores in the remote computer, Excel freezes up during the data table process and cannot always complete its calculation for some scenarios, and therefore it spits out the same results it currently has stored. I don't know enough about Excel's inner workings to decide if this makes sense, though.
Has anyone else come across data tables that repeat results before? If you know more about the mechanics of data tables, do you know what might cause this error and how to prevent it in the future?
Let me know if you need any more information about the error or things that I've tried to identify the cause. Thanks!