Timestamp data persistence in Cassandra

In our crawling system we have a five-node Cassandra cluster. I have a scenario in which I want to delete Cassandra data as soon as it becomes older than x days.
Example:
id | name | created_date
1 | Dan | "2017-08-01"
2 | Monk | "2017-08-02"
3 | Shibuya | "2017-08-03"
4 | Rewa | "2017-08-04"
5 | Himan | "2017-08-05"
If x = 3, then the following should be the result:
id | name | created_date
1 | Dan | "2017-08-01" --------------> DELETE
2 | Monk | "2017-08-02" --------------> DELETE
3 | Shibuya | "2017-08-03" -------------->(REMAIN)Latest 3 days data
4 | Rewa | "2017-08-04" -------------->(REMAIN)Latest 3 days data
5 | Himan | "2017-08-05" -------------->(REMAIN)Latest 3 days data
If new data is added (e.g. for 2017-08-06), then id=3 should be deleted.
Is there any Cassandra configuration or any approach to do this?

Cassandra has a TTL (time-to-live) feature that allows you to specify how long each CQL cell is valid. Details are available in the INSERT docs, but it also applies to UPDATE.
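For example, a minimal sketch against the question's table (the keyspace/table name crawl.entries and the sample values are assumptions):

INSERT INTO crawl.entries (id, name, created_date)
VALUES (6, 'Arya', '2017-08-06')
USING TTL 259200;  -- 259200 seconds = 3 days; the written cells expire after that

-- the same option works on UPDATE:
UPDATE crawl.entries USING TTL 259200 SET name = 'Arya' WHERE id = 6;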

You can use TTL, but be careful with tombstones and the compaction process.
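If every row in the table should expire, a table-level default avoids repeating the TTL on each write (same assumed table name as above):

ALTER TABLE crawl.entries WITH default_time_to_live = 259200;

Expired cells become tombstones that are only purged by compaction after gc_grace_seconds has passed, so a high expiry rate can degrade read performance until compaction catches up.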

Related

Formula to Repeat and increment Numbers and reset when ID changes

I am looking for a start in the right direction and I hope someone on this forum has run into this issue. I have an Excel table with 24K jobs on it and a technician assigned to every job. These technicians have 40 weeks to complete all the jobs assigned. I have a helper table with each technician's id and how many jobs per week they need to complete to finish all the work. I have sorted the jobs by geographic area for efficiency. I need a formula that will look at the technician id and, if they are receiving 3 jobs per week, number the first 3 with a 1, the next 3 with a 2, and so on. When it switches technician, it should reset the counter.
In the example below Tech 1 is assigned 3 jobs per week, and Tech 2 has 2 jobs per week.
| JobID | Tech | Grouping |
|-------|--------|----------|
| BK025 | Tech 1 | 1 |
| CD044 | Tech 1 | 1 |
| DE024 | Tech 1 | 1 |
| DE031 | Tech 1 | 2 |
| DE035 | Tech 1 | 2 |
| FT083 | Tech 1 | 2 |
| IR004 | Tech 2 | 1 |
| IR006 | Tech 2 | 1 |
| IR052 | Tech 2 | 2 |
| IR061 | Tech 2 | 2 |
| IR062 | Tech 2 | 3 |
| IR072 | Tech 2 | 3 |
I have been searching SO and Google looking for an answer but may not be using the correct keywords. I have found the formula =ROUNDUP((ROW()-offset)/repeat,0) (found on Exceljet), which will work, but in order to get it to work properly I would have to filter to each tech individually.
Assuming your helper table lists each technician next to their jobs-per-week in E1:F3 (e.g. Tech 1 → 3, Tech 2 → 2; the original answer showed this in a screenshot), you could use an approach like the following:
=ROUNDUP(COUNTIF(B$2:B2,B2)/VLOOKUP(B2,$E$1:$F$3,2,0),0)
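To see why this resets per technician, take the fourth Tech 1 row (DE031, sheet row 5): COUNTIF(B$2:B5,B5) gives a running count of 4 occurrences of Tech 1 so far, VLOOKUP looks up 3 jobs per week, and ROUNDUP(4/3,0) = 2, matching the expected grouping. When the technician changes, the running count restarts at 1, so the numbering resets automatically.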

Node.js: Check entries regularly based on their timestamps

I have a Node.js backend and MSSQL as the database. I do some data processing and store logs in a database table which shows which entity is at what stage of the process, e.g. (there are three stages each entity has to pass):
+----+-----------+-------+------------------------+
| id | entity_id | stage | timestamp |
+----+-----------+-------+------------------------+
| 1 | 1 | 1 | 2019-01-01 12:12:01 |
| 2 | 1 | 2 | 2019-01-01 12:12:10 |
| 3 | 1 | 3 | 2019-01-01 12:12:15 |
| 4 | 2 | 1 | 2019-01-01 12:14:01 |
| 5 | 2 | 2 | 2019-01-01 12:14:10 <--|
| 6 | 3 | 1 | 2019-01-01 12:24:01 |
+----+-----------+-------+------------------------+
As you can see in the line with the arrow, entity no. 2 did not reach stage 3. After a certain amount of time (maybe 120 seconds), such an entity should be considered faulty and reported somehow.
What would be a good approach in Node.js to check the table for those timed-out entities? Some kind of cron job which checks all lines every x seconds? That sounds rather clumsy to me.
I am looking forward to your ideas!
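For what it's worth, here is a minimal sketch of the polling idea from the question, using the mssql package (the table name progress_log, the connection details, and the 30-second poll interval are assumptions; the 120-second threshold comes from the question):

const sql = require('mssql');

// Find entities whose latest stage is below 3 and whose newest log entry
// is older than 120 seconds, then report them.
async function reportStalled(pool) {
  const result = await pool.request().query(`
    SELECT entity_id, MAX(stage) AS last_stage, MAX([timestamp]) AS last_seen
    FROM progress_log
    GROUP BY entity_id
    HAVING MAX(stage) < 3
       AND MAX([timestamp]) < DATEADD(SECOND, -120, GETDATE())`);
  for (const row of result.recordset) {
    console.warn(`entity ${row.entity_id} stuck at stage ${row.last_stage} since ${row.last_seen}`);
  }
}

sql.connect({ server: 'localhost', database: 'mydb' /* credentials omitted */ })
  .then(pool => setInterval(() => reportStalled(pool).catch(console.error), 30000));

Pushing the staleness check into the HAVING clause keeps the scan in the database, so Node.js only ever sees the faulty entities rather than every row.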

PySpark - Select users seen for 3 days a week for 3 weeks a month

I know this is a very specific problem and it is not usual to post this kind of question on Stack Overflow, but I am in the strange situation of having an idea of a naive algorithm that would solve my issue, but not being able to implement it. Hence my question.
I have a data frame
|user_id| action | day | week |
------------------------------
| d25as | AB | 2 | 1 |
| d25as | AB | 3 | 2 |
| d25as | AB | 5 | 1 |
| m3562 | AB | 1 | 3 |
| m3562 | AB | 7 | 1 |
| m3562 | AB | 9 | 1 |
| ha42a | AB | 3 | 2 |
| ha42a | AB | 4 | 3 |
| ha42a | AB | 5 | 1 |
I want to create a dataframe with users that are seen at least 3 days a week for at least 3 weeks a month. The "day" column goes from 1 to 31 and the "week" column goes from 1 to 4.
The way I thought about doing it is:
split the dataframe into 4 dataframes, one for each week
for every week-dataframe, count the days seen per user
count for every user how many weeks with >= 3 days they were seen
only add to the new df the users seen for >= 3 such weeks
Now I need to do this in Spark in a way that scales, and I have no idea how to implement it. Also, if you have a better idea of an algorithm than my naive approach, that would really be helpful.
I suggest using the groupBy function and selecting users with a where filter:
df.groupBy('user_id', 'week') \
    .agg(countDistinct('day').alias('days_per_week')) \
    .where('days_per_week >= 3') \
    .groupBy('user_id') \
    .agg(count('week').alias('weeks_per_user')) \
    .where('weeks_per_user >= 3')
@eakotelnikov is correct.
But if anyone is facing the error
NameError: name 'countDistinct' is not defined
then please run the statement below before executing eakotelnikov's solution:
from pyspark.sql.functions import *
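A narrower import also works and avoids the wildcard shadowing Python builtins such as sum and max (a style note, not part of the original answer):

from pyspark.sql.functions import count, countDistinct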
Adding another solution for this problem:
tdf.createOrReplaceTempView("tbl")
outdf = spark.sql("""
    SELECT user_id, COUNT(*) AS weeks_per_user
    FROM (
        SELECT user_id, week, COUNT(*) AS days_per_week
        FROM tbl
        GROUP BY user_id, week
        HAVING COUNT(*) >= 3
    ) x
    GROUP BY user_id
    HAVING COUNT(*) >= 3
""")
outdf.show()

Add uniqueness to a table column just for some cases in MySQL using knex

I'm using MySQL. I want a column to have unique values just in some cases.
For example, the table can have the following values:
+----+-----------+----------+------------+
| id | user_id | col1 | col2 |
+----+-----------+----------+------------+
| 1 | 2 | no | no |
| 2 | 2 | no | no |
| 3 | 3 | no | yes |
| 4 | 2 | yes | no |
| 5 | 2 | no | yes |
+----+-----------+----------+------------+
I want the no|no combination to be able to repeat for the same user, but not the yes|no combination. Is this possible in MySQL? And with knex?
My migration for that table looks like this:
return knex.schema.createTable('myTable', table => {
    table.increments('id').unsigned().primary();
    table.integer('user_id').unsigned().notNullable()
        .references('id').inTable('table_user').onDelete('CASCADE').index();
    table.string('col1').defaultTo('yes');
    table.string('col2').defaultTo('no');
});
That doesn't seem to be an easy task: you would need a partial unique index over multiple columns.
I couldn't find any indication that MySQL supports partial indexes (https://dev.mysql.com/doc/refman/8.0/en/create-index.html).
So it could be done with something like what is described here, but using triggers for that seems a bit of an overkill: https://dba.stackexchange.com/questions/41030/creating-a-partial-unique-constraint-for-mysql
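One workaround worth sketching (not from the linked answers; the column and index names here are made up): MySQL 5.7+ supports generated columns, and a UNIQUE index does not enforce uniqueness for rows where any indexed column is NULL, so a guard column that is non-NULL only for the yes|no combination constrains just that case:

ALTER TABLE myTable
  ADD COLUMN yes_no_guard TINYINT
    GENERATED ALWAYS AS (IF(col1 = 'yes' AND col2 = 'no', 1, NULL)) STORED,
  ADD UNIQUE INDEX uq_user_yes_no (user_id, yes_no_guard);

From knex this would have to run as a raw statement in a migration, e.g. await knex.raw('ALTER TABLE ...').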

Retrieving data from a Cassandra database

I'm working on smart parking data stored in a Cassandra database and I'm trying to get the last status of each device.
I'm working on a self-made dataset.
Here's the description of the table:
[table description was posted as an image; the PRIMARY KEY definition is quoted in the answer below]
select * from parking.meters
Need help, please!
"trying to get the last status of each device"
In Cassandra, you need to design your tables according to your query patterns. Building a table, filling it with data, and then trying to fulfill a query requirement is a very backward approach. The point is that if you really need to satisfy that query, then your table should have been designed to serve that query from the beginning.
That being said, there may still be a way to make this work. You haven't mentioned which version of Cassandra you are using, but if you are on 3.6+, you can use the PER PARTITION LIMIT clause on your SELECT.
If I build your table structure and INSERT some of your rows:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 20 | 2017-01-10T09:11:51Z | True
1 | 20 | 2017-01-01T13:51:50Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 7 | 2016-12-02T16:50:04Z | True
1 | 7 | 2016-11-24T23:38:31Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
1 | 19 | 2016-11-22T15:15:23Z | False
(8 rows)
And considering your PRIMARY KEY and CLUSTERING ORDER definitions:
PRIMARY KEY ((parking_id, device_id), date, status)
) WITH CLUSTERING ORDER BY (date DESC, status ASC);
You are at least clustering by date (which should be an actual date type, not text), so that will order your rows in a way that helps you here:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters PER PARTITION LIMIT 1;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
(3 rows)
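If you are on a version older than 3.6, a fallback (sketched here, not from the original answer) is one query per partition, relying on the DESC clustering order to return the newest row first:

SELECT * FROM meters WHERE parking_id = 1 AND device_id = 20 LIMIT 1;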
