SQL to Cassandra Data Model Structure

Forgive me for asking something that is probably explained elsewhere, but I haven't found a simple, plain conversion/explanation of a SQL model to a Cassandra model.
Let's say I have a use case: designing a DB structure for employee details and records in an organization. In SQL (having years of experience), I could have modelled it using normalization techniques, but coming into the world of NoSQL, it will take me some time to get a hold on designing a DB for NoSQL, hence I'm here (for better understanding).
Can someone transform this SQL model into a NoSQL (Cassandra) model, thereby giving a lot of newbies (like me) a simple and plain illustration of a SQL-to-NoSQL migration?
Since SO works on the concept of "try first, then ask", I've thought of a structure as well. Let me know if that works well.
Since data can be denormalized in Cassandra, I thought of this structure.
Employee(ColumnFamily) = {
    "01234"(EmployeeId) : {
        "EmpName" : "Jack",
        "mail" : "Jack@xyz.com",
        "phone" : ["9999900000", "8888888888"],
        "DOB" : 4/1/91,
        "Contact" : { "Street" : XYZ2, "City" : ABC, "Pincode" : PQR },
        "UnitID" : { "UnitName" : XYZ, "UnitHead" : ABC },
        "RoleID" : { "Designation" : Manager, "Band" : Something },
    },
    "01235"(EmployeeId) : {
        "EmpName" : "Jackyyy",
        "mail" : "Jackyyy@xyz.com",
        "phone" : ["99565600000", "88888846468"],
        "DOB" : 4/1/91,
        "Contact" : { "Street" : XYZ2, "City" : ABC1, "Pincode" : PQR },
        "UnitID" : { "UnitName" : XYZ1, "UnitHead" : ABC1 },
        "RoleID" : { "Designation" : Faculty, "Band" : Something },
    },
    and so on...
}
Projects(ColumnFamily) = {
    "1213"(ProjectId) : {
        "EmpID" : [01234, 01235],
        "StartDate" : 4/1/2001,
        "EndDate" : 4/1/2012,
        "ClientName" : Apple,
        "Description" : "Something",
    },
    and so on...
}
Attendance Detail(ColumnFamily) = {
    "1213"(DetailId) : {
        "EmpID" : 01234,
        "SwipeInTime" : Something,
        "SwipeOutTime" : Something,
        "Status" : Something
    },
    and so on...
}
Firstly, please let me know if this structure is correct. If yes, how would I design queries for the following?
1) Select the employee whose phone number = something;
2) Select employees who live in the 'XYZ' location;
3) Select employees whose age is > 40 years;
4) Select the employee whose Designation is 'Manager' in the Unit named 'XYZ';
5) Select employees who work for over 10 hours a day;
6) Get the names (not IDs) of all employees who were working for client 'Apple';
Let me know if I can provide more clarity on the question!

Your structure is not correct because you won't be able to express any of your 6 queries :-(
The main rule of Cassandra modeling is: start from your queries and denormalize. In your case, you would have 6 tables: employee_by_phone, employee_by_location, employee_by_age, and so on (see the sketch below the link).
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
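For example, the first query (find an employee by phone number) could map to its own table. Below is a minimal sketch using the Node.js cassandra-driver package; the keyspace, table, and column names are assumptions, not the only correct model:

// Minimal sketch with the Node.js cassandra-driver; names are assumptions.
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
    contactPoints: ['127.0.0.1'],
    localDataCenter: 'datacenter1',
    keyspace: 'hr' // assumed keyspace
});

// One table per query; the partition key is the column you filter on.
// Created once, e.g. at deployment time:
// CREATE TABLE IF NOT EXISTS employee_by_phone (
//     phone    text,
//     emp_id   text,
//     emp_name text,
//     mail     text,
//     PRIMARY KEY (phone, emp_id)
// );

// Query 1 ("employee whose phone number = something") then becomes a
// straight partition-key lookup:
async function findByPhone(phone) {
    const result = await client.execute(
        'SELECT emp_id, emp_name, mail FROM employee_by_phone WHERE phone = ?',
        [phone],
        { prepare: true }
    );
    return result.rows;
}

Each of the other five queries would get its own table with the filtered column as the partition key, written at the same time as the main employee record.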
However, if you have a lot of multi-criteria queries like these, Cassandra (DataStax Enterprise edition) has a Solr extension which will let you express richer queries. In that case, your model may be right.

Related

NodeJS with MS-SQL

In Node.js with MS-SQL, I want to retrieve data from two or three tables in the form of an array of objects.
Hello there, my name is Shaziya, please help me out.
I have actually learnt Node.js from YouTube.
I want to learn Node.js with MS-SQL; do you or any friends have a course for advanced understanding?
For example, how to join 4 or 5 tables and show the data in array-of-objects format, and how to write nested queries or subqueries.
I mean, if I want to match the Product table with the Order table on product ID, the result should look like this:
data {
    productId : 1,
    productName : "abc",
    orders : [{
        orderId : 1,
        orderName : "xyz"
    },
    {
        orderId : 2,
        orderName : "pqr"
    }]
}
At least point me to a course or solution for where I'm stuck.

Cosmos DB null value

I have two kinds of records, shown below, in my table staudentdetail in Cosmos DB. In the example below, previousSchooldetail is a nullable field and it may or may not be present for a student.
Sample records below:
{
"empid": "1234",
"empname": "ram",
"schoolname": "high school ,bankur",
"class": "10",
"previousSchooldetail": {
"prevSchoolName": "1763440",
"YearLeft": "2001"
} --(Nullable)
}
{
"empid": "12345",
"empname": "shyam",
"schoolname": "high school",
"class": "10"
}
I am trying to access the above records from Azure Databricks using PySpark or Scala code. But when we build the DataFrame by reading from Cosmos DB, it does not bring the previousSchooldetail field into the DataFrame. But when we change the query to include an id for which previousSchooldetail exists, it shows up in the DataFrame.
Case 1:
val Query = "SELECT * FROM c"
Result when the query is fired directly: empid, empname, schoolname, class
Case 2:
val Query = "SELECT * FROM c where c.empid=1234"
Result when the query is fired with the where clause: empid, empname, schoolname, class, previousSchooldetail (prevSchoolName, YearLeft)
Could you please tell me why I am not able to get previousSchooldetail in Case 1, and how should I proceed?
As @Jayendran mentioned in the comments, the first query will give you the previousSchooldetail document wherever it is available; otherwise, the column will not be present.
You can have this column present in all scenarios by using the IS_DEFINED function. Try tweaking your query as below:
SELECT c.empid,
c.empname,
IS_DEFINED(c.previousSchooldetail) ? c.previousSchooldetail : null
as previousSchooldetail,
c.schoolname,
c.class
FROM c
If you are looking to get the result as a flat structure, it can be tricky, and you would need to use two separate queries, such as:
Query 1
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
p.prevSchoolName,
p.YearLeft
FROM c JOIN c.previousSchooldetail p
Query 2
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
null as prevSchoolName,
null as YearLeft
FROM c
WHERE not IS_DEFINED (c.previousSchooldetail) or
c.previousSchooldetail = null
Unfortunately, Cosmos DB does not support LEFT JOIN or UNION. Hence, I'm not sure if you can achieve this in a single query.
Alternatively, you can create a stored procedure to return the desired result, as sketched below.
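For reference, Cosmos DB stored procedures are written in JavaScript and run server-side against the collection. Here is a rough sketch; the procedure name and the flattening logic are my assumptions, not a tested implementation:

// Server-side stored procedure: flatten previousSchooldetail into
// top-level fields, emitting null when it is not defined.
function getStudentsFlat() {
    var collection = getContext().getCollection();
    var accepted = collection.queryDocuments(
        collection.getSelfLink(),
        'SELECT * FROM c',
        function (err, docs) {
            if (err) throw err;
            var flat = docs.map(function (d) {
                var prev = d.previousSchooldetail || {};
                return {
                    empid: d.empid,
                    empname: d.empname,
                    schoolname: d.schoolname,
                    class: d.class,
                    prevSchoolName: prev.prevSchoolName || null,
                    YearLeft: prev.YearLeft || null
                };
            });
            getContext().getResponse().setBody(flat);
        }
    );
    if (!accepted) throw new Error('Query was not accepted by the server.');
}

Note that a stored procedure is scoped to a single partition key, so this works best for collections with a small number of partitions or when called per partition.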

Possible to add dynamic WHERE clause with a QueryFile?

I have a complex query stored in an SQL file and I would like to reuse it for various routes, changing only the WHERE clause depending on the route. This would be instead of having the same large, complex query in multiple files with the only difference being the WHERE statement.
Is it possible to dynamically add a WHERE when using QueryFile? Simplified example below:
SELECT "id", "userId", "locationId", "title", "body",
(
SELECT row_to_json(sqUser)
FROM (
SELECT "id", "firstname", "lastname"
FROM "users"
WHERE "users"."id" = "todos"."userId"
) sqUser
) as "user"
FROM "todos"
const queryIndex = new pgp.QueryFile('sql/todos/index.pgsql', queryOptions);
// 1. Use as is to get a list of all todos
// 2. OR Append WHERE "locationId" = $1 to get list filtered by location
// 3. OR Append WHERE "id" = $1 to get a specific item
// without having three separate SQL files?
It seems like (maybe?) you could get away with adding the below in the query file, but that still feels limiting (you would still need two files to cover = and LIKE, and it still limits you to a single WHERE condition). It also feels weird to do something like WHERE 1 = 1 to get all records back.
WHERE $1 = $2
I would be interested in hearing peoples' thoughts on this or if there is a better approach.
You can inject a dynamic condition into a query file as Raw Text:
SELECT "id", "userId", "locationId", "title", "body",
(
SELECT row_to_json(sqUser)
FROM (
SELECT "id", "firstname", "lastname"
FROM "users"
${condition:raw}
) sqUser
) as "user"
FROM "todos"
Pre-formatted parameters, based on the condition:
// Generate a condition, based on the business logic:
const condition = pgp.as.format('WHERE col_1 = $1 AND col_2 = $2', [111, 222]);
Executing your query-file:
await db.any(myQueryFile, {condition});
Advanced
Above is the scenario where you have a simple dynamic condition that you want to generate in code. But sometimes you may have complex static conditions that you want to alternate between. In that case, you can have your master query file refer to the condition from a slave query file (nested query files are supported right out of the box), and you do not even need the :raw filter, because query files are injected as raw text by default:
Master query:
SELECT * FROM table ${condition}
Load your slave query files with complex conditions (when the app starts):
const conditionQueryFile1 = new QueryFile(...);
const conditionQueryFile2 = new QueryFile(...);
Selecting the right slave query, based on the business logic:
const condition = conditionQueryFile1; // some business logic here;
Executing master query with a slave as parameter:
await db.any(myQueryFile, {condition});

How should I attack a large GroupBy recordset in a JavaScript heavy stack?

I'm currently using Node.js and Firebase on a project, and I love both. My challenge is that I need to store millions of sales order rows that would look something like this:
{ companyKey: 'xxx',
orderKey : 'xxx',
rowKey : 'xxx',
itemKey : 'xxx',
orderQty: '5',
orderDate: '12/02/2015'
}
I'd like to query these records like the pseudocode below:
Select sum(orderQty) from mydb where companyKey = 'xxx' and itemKey = 'xxx' groupby orderDate
For various reasons, discussed in places such as Firebase count group by, GROUP BY in general can be a tough nut to crack. I've done it before using Oracle Materialized Views, but I would like to use some kind of service that does all of that backend work for me, so I can CRUD those sales orders without worrying about maintaining the aggregation. I read in another Stack Overflow post that Keen.io might be a good approach to this problem.
How would the internet experts attack this problem if they were using a JavaScript heavy stack and they wanted an outside service to do aggregation by day for them?
A couple of points I'm considering. I'll update as they come up:
1) It seems I might have to take Keen.io off the list. It's $125 for 1M rows. I don't need all the power Keen.io provides, only aggregation by day.
2) Going the Sequelize + PostgreSQL route seems to be a decent compromise. I can still use JavaScript, with an ORM to alleviate the pain, and PostgreSQL hosting is usually cheap (see the sketch below).
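For what it's worth, a minimal sketch of what point 2 could look like; the model and column names mirror the sample record above, and the connection string is a placeholder:

const { Sequelize, DataTypes } = require('sequelize');

const sequelize = new Sequelize('postgres://user:pass@localhost:5432/mydb'); // placeholder

// Model mirroring the sample sales order row.
const SalesOrderRow = sequelize.define('SalesOrderRow', {
    companyKey: DataTypes.STRING,
    orderKey: DataTypes.STRING,
    rowKey: DataTypes.STRING,
    itemKey: DataTypes.STRING,
    orderQty: DataTypes.INTEGER,
    orderDate: DataTypes.DATEONLY
});

// select sum(orderQty) ... group by orderDate, as in the pseudocode above.
async function dailyTotals(companyKey, itemKey) {
    return SalesOrderRow.findAll({
        attributes: ['orderDate', [Sequelize.fn('SUM', Sequelize.col('orderQty')), 'totalQty']],
        where: { companyKey, itemKey },
        group: ['orderDate'],
        order: [['orderDate', 'ASC']]
    });
}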
It sounds like you want to show a trend in sales of an item over time. That's a very good fit for an event data platform because showing trends over time is really native to the query language. In Keen IO, the idea of "grouping by time" is instead expressed as the concept of "timeframe" (e.g. previous_7_days) and "interval" (e.g. daily).
Here's how you would run that with a simple sum query in Keen:
var sum = new Keen.Query("sum", {
event_collection: "sales",
target_property: "orderQty",
timeframe: "previous_12_weeks",
interval: "weekly",
filters: [
{
property_name: "companyKey",
operator: "eq",
property_value: "xxx"
},
{
property_name: "itemKey",
operator: "eq",
property_value: "yyy"
}
]
});
In fact you could calculate the sum for ALL of your companies and products in a single query by using group_by.
var sum = new Keen.Query("sum", {
event_collection: "sales",
target_property: "orderQty",
timeframe: "previous_12_weeks",
interval: "weekly",
group_by: ["companyKey", "itemKey"]
});
Keen recently updated their pricing. Depending on the frequency of querying, something like this would be pretty light, in the tens of dollars per month, if you have millions of new transactions monthly.

MongoDB Data Structure

I'm a bit of a noob with MongoDB, so would appreciate some help with figuring out the best solution/format/structure in storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name, and then the level and temperature will be read and stored every second. Overall, there will be hundreds of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options as how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
_id : "id",
name : "name"
}
Values : {
_id : "id",
item_id : "item_id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The more document db denormalized method:
This method involves one collection of items each with an array of timestamped values
Items : {
_id : "id",
name : "name",
values : [{
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
_id : "id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The majority of read queries will be to retrieve the timestamped values for a specified time period of an item (i.e. tank) and display them in a graph. For this, the first option makes more sense to me, as I don't want to retrieve millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third options would work: a collection for all the items, and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp if you are not modifying the ObjectId, as the ObjectId itself has a timestamp embedded in it. You will save a lot of space that way.
MongoDB Id Documentation
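For illustration, a minimal mongo-shell sketch of querying a time range via _id; the collection name items and the helper function are hypothetical:

// The first 4 bytes of an ObjectId encode seconds since the Unix epoch,
// so an ObjectId built from a date can bound a range query on _id.
function objectIdFromDate(date) {
    // 8 hex chars for the seconds, padded with zeros for the remaining 8 bytes
    return ObjectId(Math.floor(date.getTime() / 1000).toString(16) + '0000000000000000');
}

db.items.find({
    _id: {
        $gte: objectIdFromDate(new Date('2015-12-02T00:00:00Z')),
        $lt: objectIdFromDate(new Date('2015-12-03T00:00:00Z'))
    }
});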
In case you don't require the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you want to store the updated data each time, then instead of updating, store it in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments
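A minimal mongo-shell sketch of both suggestions; the collection name items and the field values are placeholders:

// Option A: overwrite in place when previous data is not needed
// (upsert creates the document on the first write).
db.items.update(
    { name: 'ItemName' },
    { $set: { value: '42', created_at: new Date() } },
    { upsert: true }
);

// Option B: insert a new flat document for every reading when history matters.
db.items.insert({ name: 'ItemName', value: '42', created_at: new Date() });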
