I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds and the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all show takes only as little data as possible, so as long there is enough data to collect 20 rows (defualt value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 was on the first partition and you've explicitly set limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if limit is smaller than number of values that satisfy the predicate, or values of interest reside in the further partitions, Spark will have to scan all data.
If you want to make it work faster you'll need data bucketed (on disk) or partitioned (in memory) using field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like unfortunately inactive, succint).
But really, if you need fast lookups, use a proper database - this what they are designed for.
We have reserved various number of RUs per second for our various collections. I'm trying to optimize this to save money. For each response from Cosmos, we're logging the request charge property to Application Insights. I have one analytics query that returns the average number of request units per second and one that returns the maximum.
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start < timestamp and timestamp < end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| make-series sum(value) default=0 on timestamp in range(start, end, 1s) by Database, Collection
| mvexpand sum_value to typeof(double), timestamp limit 36000
| summarize avg(sum_value) by Database, Collection
| order by Database asc, Collection asc
let start = datetime(2019-01-24 11:00:00);
let end = datetime(2019-01-24 21:00:00);
customMetrics
| where name == 'RequestCharge' and start <= timestamp and timestamp <= end
| project timestamp, value, Database=tostring(customDimensions['Database']), Collection=tostring(customDimensions['Collection'])
| summarize sum(value) by Database, Collection, bin(timestamp, 1s)
| summarize arg_max(sum_value, *) by Database, Collection
| order by Database asc, Collection asc
The averages are fairly low but the maxima can be unbelievably high in some cases. An extreme example is a collection with a reservation of 1,000, an average used of 15,59 and a maximum used of 63,341 RUs/s.
My question is: How can this be? Are my queries wrong? Is throttling not working? Or does throttling only work on a longer period of time than a single second? I have checked for request throttling on the Azure Cosmos DB overview dashboard (response code 429), and there was none.
I have to answer myself. I found two problems:
Application Insights logs an inaccurate timestamp. I added a timestamp as a custom dimension, and within a certain minute I get different seconds in my custom timestamp but the built-in timestamp is one second past the minute for many of these. That is why I got (false) peaks in request charge.
We did have throttling. When viewing request throttling in the portal, I have to select a specific database. If I try to view request throttling for all databases, it looks like there is none.
I have a Data model like below,
CREATE TABLE appstat.nodedata (
nodeip text,
timestamp timestamp,
flashmode text,
physicalusage int,
readbw int,
readiops int,
totalcapacity int,
writebw int,
writeiops int,
writelatency int,
PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
where, nodeip - primary key and timestamp - clustering key (Sorted by descinding oder to get the latest),
Sample data in this table,
SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;
nodeip | timestamp | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
172.30.56.60 | 2017-12-08 06:13:07.161000+0000 | yes | 34 | 57 | 19 | 27 | 8 | 89 | 57
172.30.56.60 | 2017-12-08 06:12:07.161000+0000 | yes | 70 | 6 | 43 | 88 | 79 | 83 | 89
This is properly available and whenever I need to get the statistics I am able to get the data using the partition key like below,
(The above logic seems similar to my previous question : Aggregation in Cassandra across partitions) but expectation is different,
I have value for each column (like readbw, latency etc.,) populated for every one minute in all the 4 nodes.
Now, If I need to get the max value for a column (Example : readbw), It is possible using the following query,
SELECT max(readbw) FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.60','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;
1) First question : Is there a way to perform max aggregation on all nodes of a column (readbw) without using IN query?
2) Second question : Is there a way in Cassandra, whenever I insert the data in Node 1, Node 2, Node 3 and Node 4.
It needs to be aggregated and stored in another table. So that I will collect the aggregated value of each column from the aggregated table.
If any of my point is not clear, please let me know.
Thanks,
Harry
If you are dse Cassandra you can enable spark and write the aggregation queries
Disclaimer. In your question you should define restrictions to query speed. Readers do not know whether you're trying to show this in real time, or is it more for analytical purposes. It's also not clear on how much data you're operating and the answers might depend on that.
Firstly decide whether you want to do aggregation on read or write. This largely depends on your read/write patterns.
1) First question: (aggregation on read)
The short answer is no - it's not possible. If you want to use Cassandra for this, the best approach would be doing aggregation in your application by reading each nodeip with timestamp restriction. That would be slow. But Cassandra aggregations are also potentially slow. This warning exists for a reason:
Warnings :
Aggregation query used without partition key
I found C++ Cassandra driver to be the fastest option, if you're into that.
If your data size allows, I'd look into using other databases. Regular old MySQL or Postgres will do the job just fine, unless you have terabytes of data. There's also influx DB if you want a more exotic one. But I'm getting off-topic here.
2) Second question: (aggregation on write)
That's the approach I've been using for a while. Whenever I need some aggregations, I would do them in memory (redis) and then flush to Cassandra. Remember, Cassandra is super efficient at writing data, don't be afraid to create some extra tables for your aggregations. I can't say exactly how to do this for your data, as it all depends on your requirements. It doesn't seem feasible to provide results for arbitrary timestamp intervals when aggregating on write.
Just don't try to put large sets of data into a single partition. You're better of with traditional SQL databases then.
I receive regularly two types of sets of data:
Network flows, thousands per second:
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
(... etc ..)
}
And interface tables, every hour:
[
{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 19
'iface_name' : 'Fa0/0'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 20
'iface_name' : 'Fa0/1'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.157.38',
'iface_id' : 20
'iface_name' : 'Gi0/3'
}
]
I want to insert those flows in Cassandra, with interface names instead of IDs, based on the latest matching host/iface_id value. I cannot rely on a memory-only solution, otherwise I may loose up to one hour of flows every time I restart the application.
What I had in mind, is to use two Cassandra tables: One that holds the flows, and one that holds the latest host/iface_id table. Then, when receiving a flow, I would use this data to properly fill interface name.
Ideally, I would like to let Cassandra take care of this. In my mind, it seems more efficient than pulling out interface names from the application side every time.
The thing is that I cannot figure out how to do that - and having never worked with NoSQL before, I am not even sure that this is the right approach... Could someone point me in the right direction?
Inserting data in the interface table and keeping only the latest version is quite trivial, but I cannot wrap my mind around the 'inserting interface name in flow record' part. In a traditional RDBMS I would use a nested query, but those don't seem to exist in Cassandra.
Reading your question, I can hope that the data hourly received in interface table is not too big. So we can keep that data (single row) in memory as well as in cassandra database. For every hour, the in memory data will get updated as well as a new inserted in to database. We can save interface data with below table definition -
create table interface_by_hour(
year int,
month int,
day int,
hour int,
data text, -- enitre json string for one hour interface data.
primary key((year,month,day,hour)));
Few insert statements --
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,23,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,00,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,28,1,'{complete json.........}');
keep every hours interface data in this table and update it in memory as well. Benefit of having in memory data is that you don't have to read it from table thousand of time every second. If application goes down, you can read the current/previous hour data from table using below query, and build the in memory cache.
cqlsh:mykeyspace> select * from interface_by_hour where year=2017 and month=1 and day=27 and hour=0;
year | month | day | hour | data
------+-------+-----+------+--------------------------
2017 | 1 | 27 | 0 | {complete json.........}
Now comes the flow data --
As we have current hour interface table data cached in memory, we can quickly map interface name to host. Use below table to save flow data.
create table flow(
iface_name text,
createdon bingint, -- time stamp in milliseconds.
host text, -----this is optionl, if you want dont use it as column.
flowdata text, -- entire json string
primarykey(iface_name,createdon,host));
Only issue I see in above table is that it will not distribute data evenly across the partitions, if you have too many flow data for one interface name, whole data will inserted in to one partition.
I designed this table just to save the data, if you could have specified how you going to use this data, I would have done some more Thinking.
hope this helps.
Hi as far as I can tell the interface data is not so heavy on the writes that it would need partitioning by time. It changes only once per hour so it's not necessary to save data for every hour just the latest version. Also I will assume that you want to query this in some way I'm not sure how so I'll just propose something general for interface and will threat the flows as time series data:
create table interface(
iface_name text primary key,
iface_id int,
host text,
stamp timestamp
);
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/0', 19, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/1', 20, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Gi0/3', 20, '192.168.157.38', '2017-01-19 03:00:00');
usually this is an antipatern with cassandra:
cqlsh:test> select * from interface;
iface_name | host | iface_id | stamp
------------+----------------+----------+---------------------------------
Fa0/0 | 192.168.2.6 | 19 | 2017-01-19 02:00:00.000000+0000
Gi0/3 | 192.168.157.38 | 20 | 2017-01-19 02:00:00.000000+0000
Fa0/1 | 192.168.2.6 | 20 | 2017-01-19 02:00:00.000000+0000
But as far as I can see you don't have that many interfaces
So basically anything up to thousands will be o.k. here in worst case you might want to
use the token function to get the data out from partitions but the thing is this will save you
a lot on space and you don't need to save this by the hour.
I would simply keep this table in memory also and then enrich the data as it comes in.
If there is updates, update the in memory cache ... but also put writes to cassandra.
If something fails then simply restore from interface table and continue.
basically your flow info would then become
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
'iface_name' : 'key put from in memory cache',
}
This is how you will get the bigest performance now saving flows is just
time series data then, take into account that you are hitting the cluster
with thousands per second and that when you are paritioning by time you
get at least 7000 if not more columns every second in (with the model
I'm proposing here) usually you will want to have up to 100 000 columns
within single partition, which would say that your partition goes over
ideal size withing 20 seconds or even less so I would even suggest using
random buckets (when inserting just use some number in defined range
let's say 10):
create table flow(
time_with_minute text,
artificial_bucket int,
stamp timeuuid,
host text,
ip_src text,
ip_dst text,
iface_in int,
iface_out int,
iface_name text,
primary key((time_with_minute, artificial_bucket), stamp)
);
When wanting to fetch flows over time you would simply use the parts of
a timestamp plus make 10 queries at the same time or one by one to access all the data. There are various techniques here, you simply need to tell more about your use case.
inserting is then something like:
insert into flow(time_with_minute, artificial_bucket, stamp, host, ip_src, ip_dst, iface_in, iface_out, iface_name)
values ('2017-01-19 01:37', 1, now(), '192.168.2.6', '10.29.78.3', '8.8.4.4', 19, 20, 'Fa0/0');
I used now just for an example, use https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/utils/UUIDGen.java to
generate timeuuid with the time when you inserted flow. Also I inserted 1 into artificial bucket, here you would insert random number with range, let's say 0-10 at least. Some people, depending on the load insert multiple random buckets, even 60 or more. It all depends on how heavy writes are. If you just put it to minute every minute a group of nodes within the cluster will be hot and this will switch around. Having hot nodes is usually not a good idea.
With cassandra you are writing the information that you need right away, you are not doing
any joins during write or something similar. Keep the data in memory that you need to
stamp the data with the information that you need and just insert without denormalisation.
Also you can model the solution in a relational way and just tell how you would like to
access the data then we can go into details.
i have a coulmn family in cassandra 1.2 as below:
time | class_name | level_log | message | thread_name
-----------------+-----------------------------+-----------+---------------+-------------
121118135945759 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135947310 | ir.apk.tm.test.LoggerSimple | ERROR | This is ERROR | main
121118135947855 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
121118135946221 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135951461 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
when i use this query:
SELECT * FROM LogTM WHERE token(time) > token(0);
i get nothing!!! but as you see all of time values are greater than zero!
this is CF schema:
CREATE TABLE logtm(
time bigint PRIMARY KEY ,
level_log text ,
thread_name text ,
class_name text ,
msg text
);
can any body help?
thanks :)
If you're not using an ordered partitioner (if you don't know what that means you don't) that query doesn't do what you think. Just because two timestamps sort one way doesn't mean that their tokens do. The token is the (Murmur3) hash of the cell value (unless you've changed the partitioner).
If you need to do range queries you can't do it on the partition key, only on clustering keys. One way you can do it is to use a schema like this:
CREATE TABLE LogTM (
shard INT,
time INT,
class_name ASCII,
level_log ASCII,
thread_name ASCII,
message TEXT,
PRIMARY KEY (shard, time, class_name, level_log, thread_name)
)
If you set shard to zero the schema will be roughly equivalent to what you're doing now, but the query SELECT * FROM LogTM WHERE timestamp > 0 will give you the results you expect.
However, the performance will be awful. With a single value of shard only a single partition/row will be created, and you will only use a single node of your cluster (and that node will be very busy trying to compact that single row).
So you need to figure out a way to spread the load across more nodes. One way is to pick a random shard between something like 0 and 359 (or 0 and 255 if you like multiples of two, the exact range isn't important, it just needs to be an order of magnitude or so larger than the number of nodes) for each insert, and read from all shards when you read back: SELECT * FROM LogTM WHERE shard IN (0,1,2,...) (you need to include all shards in the list, in place of ...).
You can also pick the shard by hashing the message, that way you don't have to worry about duplicates.
You need to tell us more about what exactly it is that you're trying to do, especially how you intend to query the data. Don't go do the thing I described above, it is probably completely wrong for your use case, I just wanted to give you an example so that I could explain what is going on inside Cassandra.