Database design for a kids' chore schema - Node.js

I'm wondering about best practice for keeping a database as tidy as possible. The database is PostgreSQL, accessed via Express.js/Node. It's for a kids' chores app I'm working on, and it has the following schema:
CHILDREN
id
name
points
rate
created_at
updated_at
user_id
TASKS
id
description
value
days (boolean array - e.g. [0,0,0,0,0,0,0])
periods (boolean array - e.g. [0,0])
created_at
updated_at
user_id
FINISHED TASKS
id
task_id
child_id
completed (boolean)
created_at
updated_at
period (boolean)
day (int (0-6))
A row is created in the database for every individual finished task. With only around 400 children doing chores, roughly 800 rows are already being added to the FINISHED TASKS table each day.
I have two questions:
Is there a more efficient way of storing FINISHED TASKS either for a full day per child or similar?
With scale I'm going to end up with potentially tens of thousands of rows per day - is this acceptable for an app like this?

Relating a child table to a task table through an intermediate bridge table is the common way of doing this. My experience with large hospital applications is that once tables reach millions of rows and performance starts to degrade, the application typically archives the finished tasks into a separate archive table. You might end up with two tables: an 'active tasks' table containing tasks where 'completed' is false, and an archived 'finished tasks' table that rows are moved into once the task is finished.
Depending on how much effort you want to put into future proofing the application, this could be done now to prevent having to revisit this.
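A minimal sketch of that archiving step in PostgreSQL, assuming the schema above and a hypothetical finished_tasks_archive table with the same columns:

```sql
-- Move completed rows older than 30 days into the archive
-- in a single atomic statement (writable CTE).
WITH moved AS (
  DELETE FROM finished_tasks
  WHERE completed
    AND created_at < now() - INTERVAL '30 days'
  RETURNING *
)
INSERT INTO finished_tasks_archive
SELECT * FROM moved;
```

This would be run periodically (e.g. from cron or the pg_cron extension). At larger scale, native table partitioning on created_at achieves a similar split without moving rows at all.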

Related

How to query the database after a limited time?

There is a SQL table that contains bids.
When the first bid is inserted into the table, a countdown starts. After some time, for instance 5 minutes, I must aggregate all the data and find the maximum price across the bids.
How do I trigger this event and send a message to the Node service that should handle it?
Another direction would be for the service to poll the database every second, compare startDate and endDate, and run the aggregation.
Which approach should I choose?
What about creating a UNIX cron task when a bid is inserted into the DB?
The bidding would then continue for the time configured in the script, 5 minutes in my case. After that, no one can submit a bid any more.
Then I need to select all participants who made bids and find the maximum price among them.
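A minimal sketch of the in-process timer approach in Node (all table, column, and function names here are illustrative; closeAuction stands in for the real aggregate-and-notify step):

```javascript
// Query to run when the window closes: the winning (highest) bid.
// Table/column names are assumptions, not a real schema.
const CLOSE_SQL = `
  SELECT user_id, price
  FROM bids
  WHERE auction_id = $1
  ORDER BY price DESC
  LIMIT 1`;

const timers = new Map();

// Returns true when this bid started the countdown (i.e. it was the
// first bid for the auction); later bids for the same auction do nothing.
function onBidInserted(auctionId, closeAuction, windowMs = 5 * 60 * 1000) {
  if (timers.has(auctionId)) return false;
  timers.set(auctionId, setTimeout(() => {
    timers.delete(auctionId);
    closeAuction(auctionId); // e.g. run CLOSE_SQL here and notify the winner
  }, windowMs));
  return true;
}
```

One caveat with pure in-process timers: they are lost if the service restarts mid-window. A more robust variant persists the window's end time with the first bid and re-schedules pending timers on boot, or delegates the scheduling to the database (e.g. pg_cron in PostgreSQL).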

Excel Query looking up multiple values for the same name and presenting averages

Apologies if this has been asked before. I would be surprised if it hasn't, but I'm just not hitting the correct search terms to find the answer.
I have a table of raw data for my staff. It contains the name of the employee who completed a job and the start and finish times, among other things. I have no unique IDs other than name, and I can't change that, as I'm part of a large organisation and have to make do with the data I'm given.
What I would like to do is present a table (Table 2) that shows the name of the employee, takes the start/finish times for all of their jobs in Table 1, and presents the average time taken across all of their jobs.
I have used VLOOKUP in the past but I'm not sure it will cut it here. The raw data table contains approx. 6000 jobs each month.
On Table 1 I work out the time taken for each job with this formula:
=IF(V6>R6,V6-R6,24-R6+V6) (R = started time, V = completed time, in 24hr clock)
I have gone this route because some jobs are started before midnight and completed afterwards. My raw data also contains the started/completed dates in separate columns, so I am open to an expert's feedback on whether there is a better way to work out the total time from start to completion.
I believe the easiest way to tackle this would be with a pivot table. Calculate the time taken for each Name and Job combination in Table 1, create a pivot table with the Name in the Row Labels and the Time in the Values, then change the Time values to be an average instead of a sum.
Alternatively, you could create a unique list of names, perhaps with Data > Remove Duplicates, and then use an =AVERAGEIF formula, e.g. =AVERAGEIF(Table1!A:A, E2, Table1!D:D) (column references illustrative: names in column A, computed times in column D, and the looked-up name in E2).
Thanks, this gives me a thread to pull on. I do have unique names, as it's each person's full name, but I'll try pivot tables to hopefully make it a little more future-proof for other things to be reported on later.

Cassandra data modeling

So I'm designing this data model for product price tracking.
A product can be followed by many users and a user can follow many products, so it's a many-to-many relation.
The products are under constant tracking, but a new price is inserted only if it has varied from the previous one.
The users have set an upper price limit for their followed products, so every time a price varies, the preferences are checked and the users are notified if the price has dropped below their threshold.
So initially I thought of the following product model:
However, "subscriberEmails" is a list collection, which is limited to 65,536 elements. For a big-data solution, that's a boundary we don't want to have, so we end up writing a separate table for that:
Now "usersByProduct" can have up to 2 billion columns, fair enough. The user preferences are stored in a "Map", which is again limited, but we think it's a reasonable maximum number of products for a user to follow.
Now the problem we're facing is the following:
Every time we want to update a product's price we would have to make a query like this:
INSERT INTO products("Id", date, price) VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, dateof(now()), 24.87); // Example only
But INSERT operations don't accept conditional clauses other than IF NOT EXISTS, and that isn't what we want. We need to update the price only if it's different from the previous one, so this forces us to make two queries (one to read the current value and another to update it if necessary).
PS: UPDATE operations do have IF conditions, but that doesn't fit our case because we need an INSERT.
UPDATE products SET date = dateof(now()) WHERE "Id" = 7dacedd2-c09b-46c5-8686-00c2a03c71dd IF price != 20.3; // example only
Don't try to apply a normalized relational model to a Cassandra database. It may work, but you'll end up with terrible performance and scalability.
The recommended approach to Cassandra data modeling is to first figure out your read queries against the database and structure your data so that these reads are cheap. You'll probably need to duplicate writes somewhat but it's OK because writes are pretty cheap in Cassandra.
For your specific use case, the key query seems to be getting all users interested in a price change on a product, so you create a table for this, for example:
create table productSubscriptions (
productId uuid,
priceLimit float,
createdAt timestamp,
email text,
primary key (productId,priceLimit,createdAt)
);
but since you also need to know all product subscriptions for a user, you also need a user-keyed table of the same data:
create table userProductSubscriptions (
email text,
productId uuid,
priceLimit float,
primary key (email, productId)
)
With these 2 tables, I guess you can see that all your main queries can be done with a single-row select, and your inserts/deletes are straightforward but will require you to modify both tables in sync.
Obviously, you'll need to flesh out the schema a bit more for your complete needs, but this should give you an example of how to think about your Cassandra schema.
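Keeping the two tables in sync could be done with a logged batch; a sketch, with placeholder values:

```sql
BEGIN BATCH
  INSERT INTO productSubscriptions (productId, priceLimit, createdAt, email)
    VALUES (7dacedd2-c09b-46c5-8686-00c2a03c71dd, 19.99, dateof(now()), 'user@example.com');
  INSERT INTO userProductSubscriptions (email, productId, priceLimit)
    VALUES ('user@example.com', 7dacedd2-c09b-46c5-8686-00c2a03c71dd, 19.99);
APPLY BATCH;
```

A logged batch guarantees that, once accepted, both writes will eventually be applied, at some coordination cost since the two rows live in different partitions.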
Conditional update issue
For your conditional insert issue, the easiest answer is: do it with an UPDATE if you really need it (update and insert are nearly identical in CQL) but it's a very expensive operation so avoid it if you can.
For your use case, I would split your product table in three :
create table products (
category uuid,
productId uuid,
url text,
price float,
primary key (category, productId)
)
create table productPricingAudit (
productId uuid,
date timestamp,
price float,
primary key (productId, date)
)
create table priceScheduler (
day text,
checktime timestamp,
productId uuid,
url text,
primary key (day, checktime)
)
The products table holds the full catalog, optionally split into categories (so that listing all products in a single category is a single-row select).
productPricingAudit gets an insert with the latest retrieved price, whatever it is, since this will let you debug any pricing issue you may have.
priceScheduler holds all the checks to be made for a given day, ordered by check time. Your scheduler simply has to make a column range query on a single row whenever it runs.
With such a schema, you don't care about conditional updates; you simply issue 3 inserts whenever you update a product price, even if it doesn't change.
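The three inserts on each price check could be sketched like this, with ? bind markers standing in for the application-supplied values:

```sql
BEGIN BATCH
  INSERT INTO products (category, productId, url, price) VALUES (?, ?, ?, ?);
  INSERT INTO productPricingAudit (productId, date, price) VALUES (?, dateof(now()), ?);
  INSERT INTO priceScheduler (day, checktime, productId, url) VALUES (?, ?, ?, ?);
APPLY BATCH;
```

Since the three rows land in different partitions, issuing them as three separate asynchronous inserts (or an unlogged batch) is usually cheaper than a logged batch here; none of the writes is conditional, so atomicity matters less.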
Okay, I will try to answer my own question: conditional inserts other than IF NOT EXISTS are, to date, simply not supported in Cassandra.
The closest thing is a conditional update, but that doesn't work in our scenario. So there's one simple option left: application-side logic. This means that you have to read the previous entry and make the decision in your application. The obvious downside is that 2 queries are performed (one SELECT and one INSERT), which adds latency.
However, this suits our application, because every time we perform a query to enqueue all items that should be checked, we can select the items' URLs and their current prices too. The workers that check the latest price can then decide whether or not to insert, because they have the current price to compare with.
So... A query similar to this would be performed every X minutes:
SELECT id, url, price FROM products WHERE "nextCheckTime" < now();
// example only, wouldn't even work if nextCheckTime is not part of the PK or index
This is a very costly operation to perform on a Cassandra cluster, because it has to go through all rows, which are by default stored randomly across different nodes. Another downside is that we need some advanced and specific statistics regarding products and users.
So we've decided that a relational database will serve us better than Cassandra in this particular case.
We sadly give up all of Cassandra's advantages (fast inserts, easy scaling, built-in sharding...) and look towards a MySQL Cluster or master-slave implementation.

Azure table storage: storing multiple types

What do you recommend in the following scenario:
I have an Azure table called Users whose columns are:
PrimaryKey
RowKey
Timestamp
FirstName
LastName
Email
Phone
Then there are different types of tasks for each user let's call them TaskType1 and TaskType2.
Both task types have common columns but then have also type specific columns like this:
PrimaryKey (this is the same as the Users PrimaryKey to find all tasks belonging to one user)
RowKey
Timestamp
Name
DueDate
Description
Priority
then TaskType1 has additional columns:
EstimationCompletionDate
IsFeasible
and TaskType2 has its own specific column:
EstimatedCosts
I know I can store both types in the same table and my question is:
If I use different tables for TaskType1 and TaskType2, what will be the impact on transaction costs? I would guess that if I have a table for each task type and then issue a query like "get me all tasks whose PartitionKey equals a specific user's PartitionKey from the Users table", I will have to run one query per type (because users can have both task types), which means more transactions. Instead, if both task types are in the same table, it will be one query (within the limit of 1000 entities before pagination), because I will get all the rows where the PartitionKey is the user's PartitionKey, so the partition is not split, meaning 1 transaction, right?
So did I understand it right that I will have more transactions if I store the tasks in different tables?
Your understanding is completely correct. Having the tasks split into 2 separate tables would mean 2 separate queries, thus 2 transactions (let's keep more than 1000 entities out of the equation for now). Though transaction cost is one reason to keep them in the same table, there are other reasons too:
By keeping them in the same table, you would be making full use of schema-less nature of Azure Table Storage.
2 tables means 2 network calls. Though the service is highly available, you would need to consider the scenario where the call to the 1st table succeeds but the call to the 2nd table fails. How would your application behave in that scenario? Do you discard the result from the 1st table as well? Keeping them in just one table saves you from this scenario.
Assume your application has a scenario where a user could subscribe to both Task 1 and Task 2 simultaneously. If you keep them in the same table, you can make use of Entity Group Transactions, as both entities (one for Task 1 and the other for Task 2) will have the same PartitionKey (i.e. the user id). If you keep them in separate tables, you cannot take advantage of entity group transactions.
One suggestion I would give is to have a "TaskType" attribute in your Tasks table. That way you would have an easier way of filtering by tasks as well.
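With a TaskType attribute, a single query against the one table can serve both shapes. OData filter sketches (the PartitionKey value is illustrative):

```text
PartitionKey eq 'user-123'                               (all tasks for the user, both types)
PartitionKey eq 'user-123' and TaskType eq 'TaskType1'   (only TaskType1 entities)
```

The first filter stays within one partition, so it is the single-transaction case discussed above; the second simply narrows it by the extra attribute.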

Model an ordered list in Cassandra

I need to model a list of items which is sorted by the time of last update of the item.
Consider for instance a user task list. Each user has a list of tasks and each tasks has a due date. Tasks can be added to that list, but also the due date of a task can change after it has been added to the list.
That is, a task which is in the 3rd position in the task list of User A may have to be moved to the 1st, as a result of the due date of the task being updated.
What I have right now is the following CF:
Create Table UserTasks (
user_id uuid,
task_id timeuuid,
new_due_date timestamp,
PRIMARY KEY (user_id, task_id));
I understand that I cannot sort on 'new_due_date' unless it is made part of the key.
But if it's part of the key, then it cannot be updated; the row must instead be deleted and recreated.
My concern in doing so is that if a task exists in the task lists of 100,000 users, then I need to run 100,000 select/delete/insert sequences,
while if I could sort on new_due_date it would be 100,000 updates.
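The delete-and-recreate sequence described above would look roughly like this in CQL, assuming a hypothetical table clustered by due date:

```sql
CREATE TABLE UserTasksByDueDate (
  user_id uuid,
  due_date timestamp,
  task_id timeuuid,
  PRIMARY KEY (user_id, due_date, task_id)
);

-- When a task's due date changes, for each affected user:
BEGIN BATCH
  DELETE FROM UserTasksByDueDate
    WHERE user_id = ? AND due_date = ? AND task_id = ?;  -- old due date
  INSERT INTO UserTasksByDueDate (user_id, due_date, task_id)
    VALUES (?, ?, ?);                                    -- new due date
APPLY BATCH;
```

Rows within a partition are stored in clustering order, so reading a user's task list sorted by due date is then a single-partition select.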
Any suggestions would be greatly appreciated.
Well, one option is, if you use PlayOrm with Cassandra, you can partition by user_id and query for the UserTasks of a user. If you query where time > 0 and time < MAX, it returns a cursor (reading batchSize rows at a time) and you can traverse the cursor in reverse or normal order. This solution scales infinitely with the number of users, but only scales to millions of tasks per user, which may be OK; I don't know your domain well enough.
Dean
