I have a use case for Cassandra where I need to store multiple rows of data, which will belong to different customers. I'm new to Cassandra and I need to provide a permissions model where only one customer is accessible at once from a base permissions role but all could be accessible from a 'supervisor' role. Essentially every time a query is made, one customer cannot see another customer's data, except for when the query is made from a supervisor. We have to enforce a security as a design approach.
The data could look like this:
-----------------------------------------
| id | customer name | data column1... |
-----------------------------------------
| 0 | customer1 | 3 |
-----------------------------------------
| 1 | customer2 | 23 |
-----------------------------------------
| 2 | customer3 | 33 |
-----------------------------------------
| 3 | customer3 | 32 |
-----------------------------------------
Is something like this easily doable with Cassandra?
The way you have modeled this is a perfectly good way to do multi-tenant. This is how UserGrid models multiple tenants and is used in several large scale applications.
Couple of drawbacks to be up-front:
Doesn't help with a "noisy neighbor" problem and unequal tenants
Application code has to manage the tenant security
Related
This is a question about building a pipeline for data-analytics in a kappa architecture. The question is conceptional.
Assume you have a system that emits events, for simplicity let's assume you just have two events CREATED and DELETED which tell that an item get's created or deleted at a given point in time. Those events contain an id and a timestamp. An item will get created and deleted again after a certain time. Assume the application ensures correct order of events and prevents duplicate events and no event is emitted with the exact same timestamp.
The metrics that should be available in data analytics are:
Current amount of items
Amount of items as graph over the last week
Amount of items per day as historical data
Now a proposal for an architecture for such a scenario would be like this:
Emit events to Kafka
Use kafka as short term storage
Use superset to display live data directly on kafka with presto
Use spark to consume kafka events to write aggregations to analytics Postgres db
Schematically it would look like this:
Application
|
| (publish events)
↓
Kafka [topics: item_created, item_deleted]
| ↑
| | (query short-time)
| |
| Presto ←-----------┐
| |
| (read event stream) |
↓ |
Spark |
| |
| (update metrics) |
↓ |
Postgres |
↑ |
| (query) | (query)
| |
└-----Superset-----┘
Now this data-analytics setup should be used to visualise historical and live data. Very important to note is that in this case the application can have already a database with historical data. To make this work when starting up the data analytics first the database is parsed and events are emitted to kafka to transfer the historical data. Live data can come at any time and will also be progressed.
An idea to make the metric work is the following. With the help of presto the events can easily be aggregated through the short term memory of kafka itself.
For historical data the idea could be to create a table Items that with the schema:
--------------------------------------------
| Items |
--------------------------------------------
| timestamp | numberOfItems |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0 |
| 2021-11-17 09:00:00.000 | 20 |
| 2021-11-18 09:00:00.000 | 5 |
| 2021-11-19 09:00:00.000 | 7 |
| 2021-11-20 09:00:00.000 | 14 |
Now the idea would that the spark program (which would need of course to parse the schema of the topic messages) and this will assess the timestamp check in which time-window the event falls (in this case which day) and update the number by +1 in case of a CREATED or -1 in case of a DELTED event.
The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. In startup it would mean a lot of read and writes to the analytics database. There will be multiple spark workers to update the analytics database in parallel and the queries must be written such that it's all atomic operations and not like read and then write back because the value might have been altered in the meanwhile by another spark node. What could be done to make this process efficient? How would it be possible to prevent kafka being flooded in the startup process?
Is this an intended use case for spark? What would be a good alternative for this problem?
In terms of data-throughput assume like 1000-10000 of this events per day.
Update:
Apparently spark is not intended to be used like this as it can be seen from this issue.
Apparently spark is not intended to be used like this
You don't need Spark, or at least, not completely.
Kafka Streams can be used to move data between various Kafka topics.
Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
Also, you can use Apache Pinot for indexed real-time and batch/historical analytics from Kafka data rather than having Presto just consume and parse the data (or needing a separate Postgres database only for analytical purposes)
assume like 1000-10000 of this events per day
Should be fine. I've worked with systems that did millions of events, but were mostly written to Hadoop or S3 rather than directly into a database, which you could also have Presto query.
I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds and the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all show takes only as little data as possible, so as long there is enough data to collect 20 rows (defualt value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 was on the first partition and you've explicitly set limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if limit is smaller than number of values that satisfy the predicate, or values of interest reside in the further partitions, Spark will have to scan all data.
If you want to make it work faster you'll need data bucketed (on disk) or partitioned (in memory) using field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like unfortunately inactive, succint).
But really, if you need fast lookups, use a proper database - this what they are designed for.
I am looking to use the apache cassandra database to store a time series of 1 minute OHLCV financial data for ~1000 symbols. This will need to be updated in real-time as data is streamed in. All entries where time>24hr oldare not needed and should be discarded.
Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.
I am interested in designing this database to efficiency retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast querying times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates an [30m, 1h, 12h, 24h] OHLCV from the time of the query. E.g. If I want to retrieve the past 1h OHLCV from 1:32pm, the query will give me a 1h OHLCV that represents data from 12:32pm-1:32pm.
What would be a good design to meet these requirements? I am not concerned with the database's memory footprint on the hard drive. The real issue is with fast querying times that is light on cpu and ram.
I have come up with a simple and naive way to store each record with clustering ordered by time:
CREATE TABLE symbols (
time timestamp,
symbol text,
open double,
high double,
low double,
close double,
volume double
PRIMARY KEY (time, symbol)
) WITH CLUSTERING ORDER BY (time DESC);
But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
Any suggestions will be much appreciated.
While not based on Cassandra, Axibase Time Series Database can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.
An OHLCV query for a 15-minute window might look as follows:
SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time,
FIRST(t_open.value) AS open,
MAX(t_high.value) AS high,
MIN(t_low.value) AS low,
LAST(t_close.value) AS close,
SUM(t_volume.value) AS volume
FROM stock.open AS t_open
JOIN stock.high AS t_high
JOIN stock.low AS t_low
JOIN stock.close AS t_close
JOIN stock.volume AS t_volume
WHERE t_open.entity = 'ibm'
AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
GROUP BY PERIOD(15 MINUTE, END_TIME)
ORDER BY datetime
Note the GROUP BY PERIOD clause above which does all the work behind the scenes.
Query results:
| time | open | high | low | close | volume |
|----------------------|----------|---------|----------|---------|--------|
| 2018-03-29 10:32:00 | 151.8 | 152.14 | 151.65 | 152.14 | 85188 |
| 2018-03-29 10:47:00 | 152.18 | 152.64 | 152 | 152.64 | 88065 |
| 2018-03-29 11:02:00 | 152.641 | 153.04 | 152.641 | 152.69 | 126511 |
| 2018-03-29 11:17:00 | 152.68 | 152.75 | 152.43 | 152.51 | 104068 |
You can use a Type 4 JDBC driver, API clients or just curl to run these queries.
I'm using sample 1-minute data for the above example which you can download from Kibot as described in these compression tests.
Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.
Disclaimer: I work for Axibase.
I have a dataset of of 500 mobile devices having 10 attributes namely
Date|Company|ModelName|Price|HardDisk|RAM|Colour|Display size|Cam1|Cam2
The sample dataset is given below :
24/10/2015 | walmart | Samsung Galaxy Note 4 N910H 32GB Unlocked GSM OctaCore Cell Phone-N910H 32GB GOLD | 599.99 | 32 | N/A | cell gold | N/A | 10.2 | 16
25/10/2015 | walmart | Samsung Galaxy Note 5 SM-N920i Gold International Model Unlocked GSM Mobile Phone | 717.95 | 32 | N/A | gold | N/A | 5.7 | 16
26/10/2015 | amazon | T-Mobile AllShare Cast Wireless Hub | 65.15 | N/A | N/A | streaming | N/A | N/A | N/A
I have to find the the most similar or unique devices or remove duplicate mobile devices from the dataset by taking into account the various attributes of the mobile devices.
I have explored many similarity algorithms like Jaccard similarity, cosine similarity. Levenshtein Distance but they seem to work upon attributes with same datatype.
Please suggest an algorithm or approach that could work on this type of mixed datatype dataset taking into account almost all attributes.
You can compute the hash code of each row.
Then use the difference of the hash codes as similarity measure.
Obviously, this depends on all the attributes.
It is very good for finding duplicates!
It may not be good for your application - but you did not specify what is good for your application.
I am not much sure this question is suit for SF. if not im sorry. my question is how could i draw the database transaction commit in the sequence diagram.
regards
You should have an object representing the transaction or the database.
Then you could represent it with a message labeled as "commit" from your business object to the transaction/database object.
For example (can't post image because of my reputation):
+-----------------+
| Business Object |
+-----------------+
|
| start transaction +----------------------+
+------------------------> | Database Transaction |
| +----------------------+
| |
| do lots of things |
+----------------------------------->|
| |
| commit |
+----------------------------------->|