Cassandra sstables accumulating - cassandra

I've been testing out Cassandra to store observations.
All "things" belong to one or more reporting groups:
CREATE TABLE observations (
group_id int,
actual_time timestamp, /* 1 second granularity */
is_something int, /* 0/1 bool */
thing_id int,
data1 text, /* JSON encoded dict/hash */
data2 text, /* JSON encoded dict/hash */
PRIMARY KEY (group_id, actual_time, thing_id)
)
WITH compaction={'class': 'DateTieredCompactionStrategy',
'tombstone_threshold': '.01'}
AND gc_grace_seconds = 3600;
CREATE INDEX something_index ON observations (is_something);
All inserts are done with a TTL, and should expire 36 hours after
"actual_time". Something that is beyond our control is that duplicate
observations are sent to us. Some observations are sent in near real
time, others delayed by hours.
The "something_index" is an experiment to see if we can slice queries
on a boolean property without having to create separate tables, and
seems to work.
"data2" is not currently being written-- it is meant to be written by
a different process than writes "data1", but will be given the same
TTL (based on "actual_time").
Particulars:
Three nodes (EC2 m3.xlarge)
Datastax ami-ada2b6c4 (us-east-1) installed 8/26/2015
Cassandra 2.2.0
Inserts from Python program using "cql" module
(had to enable "thrift" RPC)
Running "nodetool repair -pr" on each node every three hours (staggered).
Inserting between 1 and 4 million rows per hour.
I'm seeing large numbers of data files:
$ ls *Data* | wc -l
42150
$ ls | wc -l
337201
Queries don't return expired entries,
but files older than 36 hours are not going away!

The large number SSTables is probably caused by the frequent repairs you are running. Repair would normally only be run once a day or once a week, so I'm not sure why you are running repair every three hours. If you are worried about short term downtime missing writes, then you could set the hint window to three hours instead of running repair so frequently.
You might have a look at CASSANDRA-9644. This sounds like it is describing your situation. Also CASSANDRA-10253 might be of interest.
I'm not sure why your TTL isn't working to drop old SSTables. Are you setting the TTL on a whole row insert, or individual column updates? If you run sstable2json on a data file, I think you can see the TTL values.

Full disclosure: I have a love/hate relationship with DTCS. I manage a cluster with hundreds of terabytes of data in DTCS, and one of the things it does absolutely horribly is streaming of any kind. For that reason, I've recommended replacing it ( https://issues.apache.org/jira/browse/CASSANDRA-9666 ).
That said, it should mostly just work. However, there are parameters that come into play, such as timestamp_resolution, that can throw things off if set improperly.
Have you checked the sstable timestamps to ensure they match timestamp_resolution (default: microseconds)?

Related

Cassandra approach of RDBMS nested insertions

I receive regularly two types of sets of data:
Network flows, thousands per second:
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
(... etc ..)
}
And interface tables, every hour:
[
{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 19
'iface_name' : 'Fa0/0'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.2.6',
'iface_id' : 20
'iface_name' : 'Fa0/1'
},{
'stamp' : '2017-01-19 03:00:00'
'host' : '192.168.157.38',
'iface_id' : 20
'iface_name' : 'Gi0/3'
}
]
I want to insert those flows in Cassandra, with interface names instead of IDs, based on the latest matching host/iface_id value. I cannot rely on a memory-only solution, otherwise I may loose up to one hour of flows every time I restart the application.
What I had in mind, is to use two Cassandra tables: One that holds the flows, and one that holds the latest host/iface_id table. Then, when receiving a flow, I would use this data to properly fill interface name.
Ideally, I would like to let Cassandra take care of this. In my mind, it seems more efficient than pulling out interface names from the application side every time.
The thing is that I cannot figure out how to do that - and having never worked with NoSQL before, I am not even sure that this is the right approach... Could someone point me in the right direction?
Inserting data in the interface table and keeping only the latest version is quite trivial, but I cannot wrap my mind around the 'inserting interface name in flow record' part. In a traditional RDBMS I would use a nested query, but those don't seem to exist in Cassandra.
Reading your question, I can hope that the data hourly received in interface table is not too big. So we can keep that data (single row) in memory as well as in cassandra database. For every hour, the in memory data will get updated as well as a new inserted in to database. We can save interface data with below table definition -
create table interface_by_hour(
year int,
month int,
day int,
hour int,
data text, -- enitre json string for one hour interface data.
primary key((year,month,day,hour)));
Few insert statements --
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,23,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,27,00,'{complete json.........}');
insert into interface_by_hour (year,month,day,hour,data) values (2017,1,28,1,'{complete json.........}');
keep every hours interface data in this table and update it in memory as well. Benefit of having in memory data is that you don't have to read it from table thousand of time every second. If application goes down, you can read the current/previous hour data from table using below query, and build the in memory cache.
cqlsh:mykeyspace> select * from interface_by_hour where year=2017 and month=1 and day=27 and hour=0;
year | month | day | hour | data
------+-------+-----+------+--------------------------
2017 | 1 | 27 | 0 | {complete json.........}
Now comes the flow data --
As we have current hour interface table data cached in memory, we can quickly map interface name to host. Use below table to save flow data.
create table flow(
iface_name text,
createdon bingint, -- time stamp in milliseconds.
host text, -----this is optionl, if you want dont use it as column.
flowdata text, -- entire json string
primarykey(iface_name,createdon,host));
Only issue I see in above table is that it will not distribute data evenly across the partitions, if you have too many flow data for one interface name, whole data will inserted in to one partition.
I designed this table just to save the data, if you could have specified how you going to use this data, I would have done some more Thinking.
hope this helps.
Hi as far as I can tell the interface data is not so heavy on the writes that it would need partitioning by time. It changes only once per hour so it's not necessary to save data for every hour just the latest version. Also I will assume that you want to query this in some way I'm not sure how so I'll just propose something general for interface and will threat the flows as time series data:
create table interface(
iface_name text primary key,
iface_id int,
host text,
stamp timestamp
);
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/0', 19, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Fa0/1', 20, '192.168.2.6', '2017-01-19 03:00:00');
insert into interface(iface_name, iface_id, host, stamp) values ('Gi0/3', 20, '192.168.157.38', '2017-01-19 03:00:00');
usually this is an antipatern with cassandra:
cqlsh:test> select * from interface;
iface_name | host | iface_id | stamp
------------+----------------+----------+---------------------------------
Fa0/0 | 192.168.2.6 | 19 | 2017-01-19 02:00:00.000000+0000
Gi0/3 | 192.168.157.38 | 20 | 2017-01-19 02:00:00.000000+0000
Fa0/1 | 192.168.2.6 | 20 | 2017-01-19 02:00:00.000000+0000
But as far as I can see you don't have that many interfaces
So basically anything up to thousands will be o.k. here in worst case you might want to
use the token function to get the data out from partitions but the thing is this will save you
a lot on space and you don't need to save this by the hour.
I would simply keep this table in memory also and then enrich the data as it comes in.
If there is updates, update the in memory cache ... but also put writes to cassandra.
If something fails then simply restore from interface table and continue.
basically your flow info would then become
{
'stamp' : '2017-01-19 01:37:22'
'host' : '192.168.2.6',
'ip_src' : '10.29.78.3',
'ip_dst' : '8.8.4.4',
'iface_in' : 19,
'iface_out' : 20,
'iface_name' : 'key put from in memory cache',
}
This is how you will get the bigest performance now saving flows is just
time series data then, take into account that you are hitting the cluster
with thousands per second and that when you are paritioning by time you
get at least 7000 if not more columns every second in (with the model
I'm proposing here) usually you will want to have up to 100 000 columns
within single partition, which would say that your partition goes over
ideal size withing 20 seconds or even less so I would even suggest using
random buckets (when inserting just use some number in defined range
let's say 10):
create table flow(
time_with_minute text,
artificial_bucket int,
stamp timeuuid,
host text,
ip_src text,
ip_dst text,
iface_in int,
iface_out int,
iface_name text,
primary key((time_with_minute, artificial_bucket), stamp)
);
When wanting to fetch flows over time you would simply use the parts of
a timestamp plus make 10 queries at the same time or one by one to access all the data. There are various techniques here, you simply need to tell more about your use case.
inserting is then something like:
insert into flow(time_with_minute, artificial_bucket, stamp, host, ip_src, ip_dst, iface_in, iface_out, iface_name)
values ('2017-01-19 01:37', 1, now(), '192.168.2.6', '10.29.78.3', '8.8.4.4', 19, 20, 'Fa0/0');
I used now just for an example, use https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/utils/UUIDGen.java to
generate timeuuid with the time when you inserted flow. Also I inserted 1 into artificial bucket, here you would insert random number with range, let's say 0-10 at least. Some people, depending on the load insert multiple random buckets, even 60 or more. It all depends on how heavy writes are. If you just put it to minute every minute a group of nodes within the cluster will be hot and this will switch around. Having hot nodes is usually not a good idea.
With cassandra you are writing the information that you need right away, you are not doing
any joins during write or something similar. Keep the data in memory that you need to
stamp the data with the information that you need and just insert without denormalisation.
Also you can model the solution in a relational way and just tell how you would like to
access the data then we can go into details.

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or TOKEN query for querying an entire partition of my data? Or should I use 23 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete
partition of my data? According to this blog, using SELECT * from
mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp
IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you add the hour to avoid an hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have two choices:
Run 1 query at time.
Run 2 queries the first time, and then one at time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for the hour 0, process the data and, when finished, run the query for the hour 1 and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data of both the hours 0 and 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish to process data for hour 0 you could prefetch data for hour 2 and start processing data for hour 1. And so on.... In this way you could speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
rs.fetchMoreResults(); // this is asynchronous
// Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, eg you need both data for hour 3 and 6, then you could try to group data by "dependency" (eg query both hour 3 and 6 in parallel).
If you need all of them then should run 24 queries in parallel and then join them at application level (you already know why you should avoid the IN for multiple partitions). Remember that your data is already ordered, so your application level efforts would be very small.

Azure Sql Database replication - 9 1/2 hour delay + cant remove the replica

Note
This is getting quite long so I will try and re-edit parts through the day.
These databases are no long active, which means I can play with them to work out what is going wrong.
The only thing left to answer: Given two databases running on Azure Databases at S3 (100 DTU). Should any secondary ever get significantly behind the primary database? Even while the DTU is hammered to 100% for over half the day. The reason for the DTU being hammered being IO writes mostly.
The Start: a few problems.
DTU limits were hit on Monday, Tuesday and to some extent on Wednesday for a significant amount of time. 3PM UTC - 6AM UTC.
Problem 1 (lag in data on the secondary): This had appeared to have caused a lag of data in the secondary of about 9 1/2 hours. The servers were effectively being spammed with updates causing a lot of IO updates. 6-8 million records on one table for the 24 hour period for example. This problem drove the reason for the post:
Shouldn't these be much more in sync?
The data became out of sync on Monday morning and continued out of sync until Friday. On Thursday some new databases were started up to replace these standard SQL databases and so they were left to rot. Well, for me to experiment with at least.
The application causing the redundant queries couldn't be fixed for a few days (I'm doubting they will ever fix it now) so that leaves changing the instance type. That action was attempted on the current instance but, the instance must disconnect with all standard replicas to increase to the performance tier. This led to the second problem (see below). The replica taking its time to be removed. Began on Wednesday morning and did not complete until Friday.
Problem 2 (can't remove the replica):
(Solved itself after two days)
Disconnecting the secondary database process began ~ Wed 8UTC (when the primary was at about 80GBs). The secondary being about 11GB behind in size at this point.
Setup
The databases (primary and secondary) are S3 that is geo-replicated (North + West Europe).
It has an audit log table(which I read from the secondary - normally with an SQL query), but this is currently 9 1/2 hours behind the last entry for the primary database. Running the query again on the secondary a few seconds later it is slowly catching up, but appears to be relative to the refresh rather than playing catch-up.
Both primary and secondary (read-only) databases are S3 (about to be bumped to P2).
the azure documentation states:
Active Geo-Replication (readable secondaries) is now available for all databases in all service tiers. In April 2017, the non-readable secondary type will be retired and existing non-readable databases will automatically be upgraded to readable secondaries.
How has the secondary has got so far behind? seconds to minutes would be acceptable. Hours not so much. The link above describes it as slightly:
While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to always be transactionally consistent with changes committed to the primary database.
Given the secondary is about to be destroyed and replaced by a higher level (need to remove replicas when upgrading from standard -> premium). I'm curious to know if it will happen again as well as what the definition of slight might be in this instance?
Notes: The primary did reach maximum DTU for a few hours but didn't harm the performance significantly, which is where the 9-hour difference was noticed.
Stats
Update for TheGameiswar:
I can't query it right now as it started removing itself as a replica (to be able to move the primary up to the P2 level, but that began hours ago at ~8.30UTC and 5 hours later it is still going). I think it's quite broken now.
Query - nothing special:
SELECT TOP 1000 [ID]
,[DateCreated]
,[SrcID]
,[HardwareID]
,[IP_Port]
,[Action]
,[ResponseTime]
,[ErrorCode]
,[ExtraInfo]
FROM [dbo].[Audit]
order by datecreated desc
I can't compare the tables anymore as it's quite stuck and refusing connections.
The 586 hours (10-14GB) are inserts into the primary database audit table. It was similar yesterday when noticing the 9 1/2 hour difference in data.
When the attempt to remove the replica (another person stated the process) it had about 10GB difference in size.
Cant compare data but can show DB-size at equivalent time
Primary DB Size (last 24 hours):
Secondary DB Size (last 24 hours):
Primary database size - week view
Secondary database size - week view
As mentioned ... it is being removed as a replica... but is still playing catch up with the DB size if you observe the charts above.
Stop replication errored for serverName: ---------, databaseName: Cloud4
ErrorCode: 400
ErrorMessage: A terminate operation is already in progress for database 'Cloud4'.
Update 2 - Replication - dm_continuous_copy_status
Replication is still removing ... moving on...
select * from sys.dm_continuous_copy_status
sys.dm_exec_requests
Querying from Thursday
Appears to be quite empty. The only record being
Replica removed itself at last.
The replica has removed itself after 2 days. At the 80GB mark that I predicted. It waited to replay the data in the transactions (till the point it was removed as a replica) before it would remove the replica.
A Week after on the P2 databases
DTU is holding between 20-40% at busy periods and currently performing ~12 million data entries every 24 hours (a similar amount for reads, but writing is much worse on the indexes and the table). 70-100% inserts extra in a week. This time, the replica is not struggling to keep up, which is good but that is likely due to it not reaching 100% DTU.
Conclusion
The replicas are useful but not in this case. This one caused degraded performance for several days that could have been averted. A simple increase to the performance tier until the cause of the problem could be fixed. IF the replica looks like it is dragging behind and you are on the border of Basic -> Standard or Standard -> Performance it would be safe to remove the replica as soon as possible and increase to a different tier.
Now we are on P2. The database is increasing at 20GB a day... and they say they have fixed the problem that sends 15 thousand redundant updates per minute. Thanks to the query performance insight for highlight that as querying the table is extremely painful on the DTU (even querying the last minute of data in that table is bad on the DTU. ~15 thousand new records every minute).
62617: insert ...
62618: select ...
62619: select ...
A positive from the above is that it's moved from 586 hours combined time for the insert statements (7.5 million entry rows per day) on S3 to 3 hours on P2 (12.4 million row rows per day). An extremely significant decrease in processing time. It did start with an empty table on Thursday but that has surpassed the previous size in a week whereas the previous one took a few months to get there.
It's doing well on the new tier. It should be ~5% if the applications were using the database responsibly and the secondary is up to date.
Spoke too soon. Now on P2
Someone thought it was a good idea to run an SQL query that repeats itself that deletes a thousand rows at a time. 12 million new rows a day.
10AM - 12AM it's managed to remove about 5.2 million rows. Now the database is showing signs of being in the same state as last week. Im curious if that is what happened now.

Cassandra TTL not working

I am using datastax with cassandra. I want a row to be automatically deleted after 15 minutes of its insertion. But the row still remains.
My code is below:
Insert insertStatement = QueryBuilder.insertInto(keySpace, "device_activity");
insertStatement.using(QueryBuilder.ttl(15* 60));
insertStatement.value("device", UUID.fromString(persistData.getSourceId()));
insertStatement.value("lastupdatedtime", persistData.getLastUpdatedTime());
insertStatement.value("devicename", persistData.getDeviceName());
insertStatement.value("datasourcename", persistData.getDatasourceName());
The table consist of 4 columns : device (uuid), datasourcename(text), devicename(text), lastupdatedtime (timestamp).
If I query the TTL of some field it shows me 4126 seconds which is wrong.
//Select TTL(devicename) from device_activity; // Gives me 4126 seconds
In the below link, the explanation of TTL is provided.
https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_expire_c.html
"TTL data has a precision of one second, as calculated on the server. Therefore, a very small TTL probably does not make much sense. Moreover, the clocks on the servers should be synchronized; otherwise reduced precision could be observed because the expiration time is computed on the primary host that receives the initial insertion but is then interpreted by other hosts on the cluster."
After reading this i could resolve by setting proper time on the corresponding node(machine.)

select count(*) runs into timeout issues in Cassandra

Maybe it is a stupid question, but I'm not able to determine the size of a table in Cassandra.
This is what I tried:
select count(*) from articles;
It works fine if the table is small but once it fills up, I always run into timeout issues:
cqlsh:
OperationTimedOut: errors={}, last_host=127.0.0.1
DBeaver:
Run 1: 225,000 (7477 ms)
Run 2: 233,637 (8265 ms)
Run 3: 216,595 (7269 ms)
I assume that it hits some timeout and just aborts. The actual number of entries in the table is probably much higher.
I'm testing against a local Cassandra instance which is completely idle. I would not mind if it has to do a full table scan and is unresponsive during that time.
Is there a way to reliably count the number of entries in a Cassandra table?
I'm using Cassandra 2.1.13.
Here is my current workaround:
COPY articles TO '/dev/null';
...
3568068 rows exported to 1 files in 2 minutes and 16.606 seconds.
Background: Cassandra supports to
export a table to a text file, for instance:
COPY articles TO '/tmp/data.csv';
Output: 3568068 rows exported to 1 files in 2 minutes and 25.559 seconds
That also matches the number of lines in the generated file:
$ wc -l /tmp/data.csv
3568068
As far as I see you problem connected to timeout of cqlsh: OperationTimedOut: errors={}, last_host=127.0.0.1
you can simple increase it with options:
--connect-timeout=CONNECT_TIMEOUT
Specify the connection timeout in seconds (default: 5
seconds).
--request-timeout=REQUEST_TIMEOUT
Specify the default request timeout in seconds
(default: 10 seconds).
Is there a way to reliably count the number of entries in a Cassandra table?
Plain answer is no. It is not a Cassandra limitation but a hard challenge for distributed systems to count unique items reliably.
That's the challenge that approximation algorithms like HyperLogLog address.
One possible solution is to use counter in Cassandra to count the number of distinct rows but even counters can miscount in some corner cases so you'll get a few % error.
This is a good utility for counting rows that avoids the timeout issues that happen when running a large COUNT(*) in Cassandra:
https://github.com/brianmhess/cassandra-count
The reason is simple:
When you're using:
SELECT count(*) FROM articles;
it has the same effect on the database as:
SELECT * FROM articles;
You have to query over all your nodes. Cassandra simply runs into a timeout.
You can change the timeout, but it isn't a good solution. (For one time it's fine but don't use it in your regular queries.)
There's a better solution: make your client count your rows. You can create a java app where you count your rows, when you inserting them, and insert the result using a counter column in a Cassandra table.
You can use copy to avoid cassandra timeout usually happens on count(*)
use this bash
cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'

Resources