Manipulating Cassandra writetime value

Let's say I have an application which periodically receives some measurement data.
I know the exact time the data was measured, and I want every piece of data to be deleted 30 days after it was measured.
I'm not inserting the data into the database immediately, but I want to use Cassandra's time-to-live functionality.
Is there a way to manipulate the internal timestamp of a row in Cassandra so that I can set the time-to-live to 60 days, but it actually measures the lifespan of each row from my timestamp?
E.g. I measure something on 27.08.2014 at 19:00. I insert this data into the database on 27.08.2014 at 20:00 and set the time-to-live value to 1 day. I now want the row to be deleted on 28.08.2014 at 19:00 and not on 28.08.2014 at 20:00 like it normally would be.
Is something like this possible?

I suggest the following approach, based on your example:
Before insertion, calculate Δx = insertTime - measureTime.
Then set TTL = 1 day - Δx for the inserted row.
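For instance, using the numbers from the example above (Δx = 1 hour, so TTL = 86400 - 3600 = 82800 seconds). Note that CQL expects the TTL in seconds; the table and column names here are made up for illustration:

INSERT INTO measurements (sensor_id, measured_at, value)
VALUES (42, '2014-08-27 19:00:00', 23.5)
USING TTL 82800;  -- 1 day minus the 1 hour that has already passed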
Addition, on the basis of a comment:
You can use the Astyanax client with a batch mutation to insert multiple values at once. There is the possibility to set a TTL on each column, or on the whole row at once.
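If you are using CQL rather than Astyanax, the same idea can be sketched as a batch in which each insert carries its own computed TTL (again with made-up table and values):

BEGIN BATCH
  INSERT INTO measurements (sensor_id, measured_at, value)
    VALUES (42, '2014-08-27 19:00:00', 23.5) USING TTL 82800;
  INSERT INTO measurements (sensor_id, measured_at, value)
    VALUES (43, '2014-08-27 19:30:00', 21.0) USING TTL 84600;
APPLY BATCH;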

Related

Excel - How to manually calculate a worksheet every second?

My sheet contains links from a third-party application and is set to Automatic calculation.
New data is retrieved every few seconds, so I should get the data every second or close to it (one of the fields the sheet retrieves is "time", so I can tell how current the data is).
Yet I found that the data is updated every 3 minutes (or more) rather than every second or every few seconds.
I changed Automatic calculation to Manual calculation. The result: every click on Calculate Now yields newly updated data that really is only a few seconds old rather than 3 minutes, as it should have been in the first place.
I suppose that Automatic calculation has an algorithm that decides when to calculate, and that algorithm doesn't work in my favor.
So how can I set the sheet calculation to be immediate, or at least to run every second or so?

How can I generate activity map for customer care executives?

I have data on customer care executives which records how many calls they have attended, from one time to another, continuously. I need to find out whether a particular executive is busy or free in a particular period. The office timings are 10:00 to 17:00, so I have sliced the time into one-hour slices from 10:00 to 17:00.
The data that I have looks like this:
Note:
The data given here is part of the original data; we have 20 executives and each has 5 to 10 rows of data. For simplicity we have used 3 executives with fewer than 5 rows each.
The start timings do not follow any ascending or descending order.
Please suggest formulas that work without any sorting or filtering of the given data.
Required: the result table should show whether a particular executive is busy or free in every hour. If he is on a call for even one minute, it should show Busy for that entire one-hour period.
The result should look like this:
The same file is attached here:
Thanks In Advance!!!
You need to put in an extra logical test in your OR function that tests for start times less than the time interval start and end times greater than the time interval end. So in cell G31 your formula should read:
=IF(OR(COUNTIFS($A$3:$A$14,A31,$C$3:$C$14,">0",$D$3:$D$14,">=14:00",$D$3:$D$14,"<15:00"),COUNTIFS($A$3:$A$14,A31,C$3:$C$14,">0",$E$3:$E$14,">=14:00",$E$3:$E$14,"<15:00"),COUNTIFS($A$3:$A$14,A31,C$3:$C$14,">0",$D$3:$D$14,"<14:00",$E$3:$E$14,">=15:00")),"Busy","Free")

Using QDigest over a date range

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to do this by keeping a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 54 rows in the DB. This ends up being a large table, and it is time-consuming even to calculate approx_distinct, let alone an exact distinct. But the issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the daily sketches rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.
I then noticed a similar function for medians. Qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately the documentation is not nearly as good on this page as it is on previous pages, so it took me a while to figure it out. This works great for calculating daily medians. But it does not work if I want to calculate the median actions per user over the longer time period. The examples in HyperLogLog demonstrate how to calculate approx_distinct users over a time period but the Qdigest docs do not give such an example.
When I try something analogous to the HLL date-range example with Qdigest, I get results similar to the 1-day results.
Because you need medians that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation before inserting into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
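As a rough sketch of what that could look like in Presto SQL (the table and column names here are hypothetical), sum the per-user counts over the full window first, and only then build the digest:

WITH user_totals AS (
    SELECT user_id, SUM(action_count) AS actions_28d
    FROM daily_user_actions
    WHERE event_date >= date_add('day', -28, current_date)
    GROUP BY user_id
)
SELECT value_at_quantile(qdigest_agg(actions_28d), 0.5) AS median_actions_28d
FROM user_totals;

The point is that each value fed to qdigest_agg is already a 28-day per-user total, so the resulting median is in the right units; merging 28 daily digests instead would give you the median of daily counts, not of 28-day counts.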

REDIS - creating meaningful keys to reduce queries

I have an app (nodejs / express) that needs to find a routing rule to apply based on time of day and day of week.
So for example, I have the following business rules:
on Mondays and Tuesdays between 09:00 GMT and 12:00 GMT, I need to route object ABC to "location x".
on Tuesdays between 13:00 and 13:30 I need to route ABC to "location y".
( For the purposes of this discussion, it really doesn't matter what object ABC is. )
I am debating between two options for how I should design my keys in my Redis database.
Option 1
Make the day information a part of the object data, like this:
HMSET routing_rules:objectABC_09:00_12:00 days 'mon tues' location X
HMSET routing_rules:objectABC_13:00_13:30 days 'tues' location Y
Advantage of this method - When it's time to update the days list I can simply do this:
HMSET routing_rules:objectABC_09:00_12:00 days 'mon tues thu'
The downside here is that in order to find the right rule, I have to make two queries: first a SCAN command to find the right time range, and then, if there's a match, another query to find the days value.
Option 2
Include the day of week information as a part of the key
HMSET routing_rules:objectABC_09:00_12:00_mt location X
HMSET routing_rules:objectABC_13:00_13:30_t location Y
I would use a naming convention like
m = monday
t = tuesday
w = wed
r = thursday
etc.
The advantage to option 2 is that in order to find the right routing rule based on current time and day, I only need to run one SCAN command (we can assume that my SCAN command will return all results in one shot)
But the downside to option 2 is that when I need to add a new day to the key, I think I need to delete the key and value and then recreate them. Is this correct?
And as of right now, the only way I know how to delete is by doing an HDEL for each value in the object, and then the key is deleted.
So for example, I've been doing something like this:
127.0.0.1:6379> HDEL routing_rules:objectABC_00:00_00:00 days 'mon tues' location x
where I have to list all the values in the object to delete the entire key / value pair.
In this example, it's not so bad because I only have two values for this key - the location and the days fields. But if there was a lot more data, it'd be a little cumbersome. And I'm not sure if there are other considerations to factor in besides the number of fields associated with this key.
If you have any suggestions on how to design this key for optimal performance and maintenance, I'm all ears. The way I see it, there is no way to avoid running the scan at least once. But this is my first redis database attempt so I apologize in advance for remedial questions / noob mistakes.
EDIT 1
Assuming that I have enough memory and assuming that I only have to save one field / value per key, let's say I create my keys like this:
SET routing_rules:objectABC_09:00_12:00_m X
SET routing_rules:objectABC_09:00_12:00_t X
SET routing_rules:objectABC_13:00_13:30_t Y
And now a request comes in for object ABC and it's Monday at UTC 11. Because my keys represent start times and end times (aka a range ), I don't see how I can find the right key / value pair without doing a scan.
Am I missing something?
I would go with a sorted set solution: one set per object, where the member is the location* and the score is the minute of the week at which the rule expires.
e.g. (week starts at Monday 00:00)
on Mondays and Tuesdays between 09:00 GMT and 12:00 GMT, I need route
object ABC to "location x".
Monday 12:00 => 720
Tuesday 12:00 => 2160
ZADD ABC_rules 720 x 2160 x
There are two issues here. First, your example has times for which there are no rules, so this must be taken into account. Second, and more importantly, sorted set members must be unique, so x cannot be stored twice. Both together are the reason for the * above; the way to solve it is to append (or prepend) to the value the minute of the week at which the rule starts:
Monday 9:00 => 540
Tuesday 9:00 => 1980
ZADD ABC_rules 720 x:540 2160 x:1980
To query, all you need is ZRANGEBYSCORE with the current minute of the week, and then make sure that the start time appended to the location is before the time you queried with.
Query for Monday 10:00 (600):
ZRANGEBYSCORE ABC_rules 600 +inf LIMIT 0 1
The result will be x:540 and since 540 is lower than 600 you know x is a valid answer.
Query for Monday 13:00 (780):
ZRANGEBYSCORE ABC_rules 780 +inf LIMIT 0 1
The result will be x:1980, and since 1980 is larger than your query (780) this result is invalid and you should take your default routing (or whatever your solution is to the unmapped times in your schedule).
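As a minimal sketch of that query-and-validate step on the node.js side (ioredis is just an assumption here; any client that exposes ZRANGEBYSCORE works the same way):

const Redis = require("ioredis");
const redis = new Redis();

// minuteOfWeek: 0 = Monday 00:00, so Monday 10:00 => 600
async function findRoute(minuteOfWeek) {
  // first rule whose expiry minute is >= the current minute
  const [member] = await redis.zrangebyscore(
    "ABC_rules", minuteOfWeek, "+inf", "LIMIT", 0, 1
  );
  if (!member) return null;                    // no rule left this week
  const [location, start] = member.split(":"); // e.g. "x:540"
  // the rule only applies if it has already started
  return Number(start) <= minuteOfWeek ? location : null;
}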
To delete a rule, remove the location appended with its start time:
ZREM ABC_rules x:540
You can also use ZRANGEBYSCORE to get all the rules that apply on a specific day, and you can write a Lua script that clears them.
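A rough sketch of what such a Lua script could look like (it clears every rule whose expiry minute falls in the range you pass in; this is an illustration, not a drop-in script):

-- KEYS[1] = the rules set (e.g. ABC_rules)
-- ARGV[1], ARGV[2] = first and last minute of the day to clear
local expiring = redis.call('ZRANGEBYSCORE', KEYS[1], ARGV[1], ARGV[2])
if #expiring > 0 then
    redis.call('ZREM', KEYS[1], unpack(expiring))
end
return #expiring

Invoked with EVAL "<script>" 1 ABC_rules 0 1439, for example, it would clear every rule that expires on Monday.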
I would not use any SCAN command in this case (and in most cases). You would probably have to invoke it multiple times to scan the whole key space, while there are other alternatives providing direct access to the data you are looking for - which is what makes a K/V store performant.
For example, with your first solution, put all values in a hash and get all routes in one request with HGETALL. Then you will have to iterate over the values in your application to select the right one.
Another solution, which does not require any iteration on the application side, is to create a route per day and per hour range:
SET routing_rules:objectABC_09:00_12:00_m "location X"
SET routing_rules:objectABC_13:00_13:30_t "location Y"
...
Then in one GET request you have the value you are looking for.
Adding a day just requires another SET.
The drawback compared to your solution is memory usage: it multiplies the number of entries. You didn't give any clue about the number of entries, but if it's very high, it could be a problem. To reduce the memory required, you can start by using shorter key names, like r:objectABC_09001200m instead of routing_rules:objectABC_09:00_12:00_m.
Update
Given that the time ranges do not seem to be constant, and assuming there is no algorithm to deduce the time range from the current time, the first solution, based on a hash, seems better than the second, based on GET/SET. But I would name the fields according to the time ranges:
HSET routing_rules:objectABC 09:00_12:00 X
HSET routing_rules:objectABC 12:00_12:30 Y
Then I would get all fields for a given object using HGETALL routing_rules:objectABC and iterate over the field names to find the right one.
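A sketch of that client-side iteration (again assuming ioredis and the hypothetical key above):

const Redis = require("ioredis");
const redis = new Redis();

// fields look like "09:00_12:00", values are the locations
async function findRouteForTime(hhmm) {              // e.g. "11:00"
  const rules = await redis.hgetall("routing_rules:objectABC");
  for (const [range, location] of Object.entries(rules)) {
    const [start, end] = range.split("_");
    // lexicographic comparison is fine for zero-padded HH:MM strings
    if (start <= hhmm && hhmm < end) return location;
  }
  return null;                                        // no rule covers this time
}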

On the SQL tab what is happening during the Offset time?

On the SQL tab in Glimpse, there is the Duration column next to the Records column, which I suppose is the execution time of the command, and then the next column is a time period labeled Offset. What is that actually measuring? Then there is the Duration at the far right, which I was guessing is the total time, but the two detailed columns don't add up to that total.
Thanks!
The first duration column is the duration, in milliseconds, for the command. (Your query).
The offset column is the length of time, in milliseconds, since the beginning of the request.
The second duration column is the duration, in milliseconds, of the open connection time to the database. Often one command will run on one connection, but sometimes you'll see multiple commands happening within the same connection.
