Data reference and updates in Cassandra tables - reference

I have a table called 'usertab' to store user details such as:
userid uuid,
firstname text,
lastname text,
email text,
gender int,
image text
Most of the other tables contain userid as a field referencing 'usertab',
but when I retrieve data from another table, I need to execute another select query to get the user details.
So if 10,000 or more rows are retrieved, the same number of select queries is executed to get the user details. This makes our system slow.
So we added usertab fields such as firstname, lastname, gender and image to the other tables, in addition to the userid field.
Data retrieval is now fast, but we face another problem. If anything changes in usertab, such as firstname, lastname, gender or image, we need to update every other table that contains user details. Given the huge amount of data in the other tables, how can I handle this?
We are using a Lucene index and C#.

Cassandra writes are significantly faster and more efficient than reads.
That's why Cassandra prefers denormalization over normalization.
Denormalization is the concept that a data model should be designed so that a given query can be served from one row and one query. Instead of doing multiple reads from multiple tables and rows to gather all the required data for a response, modify your application logic to insert the required data multiple times into every row that might need it in the future. This way, all required data is available in just one read, which prevents multiple lookups.
When executing multiple updates you can use ExecuteAsync.
Session allows asynchronous execution of statements (for any type of statement: simple, bound or batch) by exposing the ExecuteAsync method.
//Execute a statement asynchronously using await
var rs = await session.ExecuteAsync(statement);
Source : https://www.hakkalabs.co/articles/cassandra-data-modeling-guide
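To illustrate the fan-out that denormalization implies, here is a minimal Python sketch, not the C# driver code itself. The `TABLES` dict, `update_row` and `propagate_user_change` are hypothetical in-memory stand-ins: in a real system each `update_row` call would be one asynchronous UPDATE, and finding the affected rows assumes the other tables can be looked up by userid (for example through their key or an index). The one idea worth carrying over is capping the number of in-flight writes.

```python
import asyncio

# Hypothetical in-memory stand-ins for denormalized tables: every row
# carries a copy of the user's firstname next to the userid.
TABLES = {
    "posts_by_user": [
        {"userid": "u1", "firstname": "Ann", "title": "Hello"},
        {"userid": "u2", "firstname": "Bob", "title": "Hi"},
    ],
    "comments_by_user": [
        {"userid": "u1", "firstname": "Ann", "body": "Nice"},
    ],
}

async def update_row(row, changes):
    # Stand-in for one asynchronous UPDATE statement sent to the cluster.
    await asyncio.sleep(0)
    row.update(changes)

async def propagate_user_change(userid, changes, max_in_flight=64):
    # Cap concurrent writes so a large fan-out does not overwhelm
    # the client or the cluster.
    limit = asyncio.Semaphore(max_in_flight)

    async def guarded(row):
        async with limit:
            await update_row(row, changes)

    rows = [r for table in TABLES.values() for r in table if r["userid"] == userid]
    await asyncio.gather(*(guarded(r) for r in rows))
    return len(rows)

updated = asyncio.run(propagate_user_change("u1", {"firstname": "Anna"}))
```

Because profile changes are usually rare compared to reads, paying this write cost once per change is often an acceptable trade for single-read queries.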

Related

In Cassandra does creating a tables with multiple columns take more space compared to multiple tables?

I have 6 tables in my database, each consisting of approximately 12-15 columns, and each relates to main_table through its id. I have to migrate my database to Cassandra, so my question is: should I create one main_table consisting of multiple columns, or different tables as in my MySQL database?
Will creating multiple columns take more space, or will multiple tables take more space?
Your line of questioning is flawed. It is a common mistake for DBAs who only have a background in traditional relational databases to view data as normalised tables.
When you switch to NoSQL, you are doing it because you are trying to solve a problem that traditional RDBMS can't. A paradigm shift is required since you can't just migrate relational tables the way they are, otherwise you're back to where you started.
The principal philosophy of data modelling in Cassandra is that you need to design a CQL table for each application query. It is a one-to-one mapping between app queries and CQL tables. The crucial point is you need to start with the app queries, not the tables.
Let us say that you have an application that stores information about users, which includes usernames, email addresses, first/last names, phone numbers, etc. If you have an app query like "get the email address for username X", it means that you need a table of email addresses and the schema would look something like:
CREATE TABLE emails_by_username (
username text,
email text,
firstname text,
lastname text,
...
PRIMARY KEY(username)
)
You would then query this table with:
SELECT email FROM emails_by_username WHERE username = ?
Another example is where you have an app query like "get the first and last names for a user where email address is Y". You need a table of users partitioned by email instead:
CREATE TABLE users_by_email (
email text,
firstname text,
lastname text,
...
PRIMARY KEY(email)
)
You would query the table with:
SELECT firstname, lastname FROM users_by_email WHERE email = ?
Hopefully with these examples you can see that the disk space consumption is completely irrelevant. What is important is that you design your tables so they are optimised for the application queries. Cheers!
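As a rough illustration of the "one table per query" idea, here is a small Python sketch that models each CQL table as a dict keyed by its partition key. The helper names (`add_user`, `email_for_username`, `name_for_email`) are made up for the example, and the dicts only mimic the lookup pattern, not Cassandra itself. The point is that the application writes the same user into every table that serves a query.

```python
# Hypothetical in-memory model: one dict per query table,
# keyed by that table's partition key.
emails_by_username = {}
users_by_email = {}

def add_user(username, email, firstname, lastname):
    # Write the same user into every table that serves a query.
    emails_by_username[username] = {
        "email": email, "firstname": firstname, "lastname": lastname,
    }
    users_by_email[email] = {"firstname": firstname, "lastname": lastname}

def email_for_username(username):
    # Mirrors: SELECT email FROM emails_by_username WHERE username = ?
    return emails_by_username[username]["email"]

def name_for_email(email):
    # Mirrors: SELECT firstname, lastname FROM users_by_email WHERE email = ?
    row = users_by_email[email]
    return row["firstname"], row["lastname"]

add_user("alice", "alice@example.com", "Alice", "Liddell")
```

Each read is a single key lookup; the cost is the duplicated write in `add_user`.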

Creating database tables with same columns or one master table

I am building a website with a large database. There are 6 types of data, so 6 forms pass data to the database.
Each form has unique parameters, and 4 of the 6 forms share the same fields, which can hold multiple values: email, address and phone can be multiple on those 4 forms.
At first I wanted to create 4 different tables such as store_contacts, warehouse_contacts, delivery_contacts, etc. to keep the different types separated,
so I would have 4 similar tables containing the same fields:
id, phone, email, address, store_id/delivery_id/etc
I have read that it is better practice to create one table containing them all, a Contacts table:
id, type, type_id, phone, email, address
from similar questions:
Two tables with same columns or one table with additional column?
https://softwareengineering.stackexchange.com/questions/302573/one-wide-table-or-multiple-themed-tables
https://dba.stackexchange.com/questions/46852/multiple-similar-tables-vs-one-master-table
But I'm not sure whether the tables will change later and new fields will be added for store only or for delivery only. Apart from contacts, I have a similar situation with other fields.
Would it be comfortable to filter by type every time I need to pull or delete data for a certain type? Won't it get messy when a lot of rows are inserted? And if a new field is created for 'store', is it okay that the other types will contain NULL in that field?
You should probably read a bit about relational entities or object-oriented inheritance, depending on the paradigm you are working in.
For example, you can learn about it in articles like this
Usually you should store contacts in a separate, dedicated entity, for plenty of reasons. Sector-specific fields can be stored in each table, but only if you are sure they will be of no use to other entities. For example, warehouse_contacts could have a hypothetical employee id field to represent the warehouse employee responsible for attending to a given contact. Even then, the best practice would probably be to build a third table to manage this information.
Nevertheless, if performance is an issue, I mean, if you have millions of records and dozens and dozens of simultaneous accesses on your website, your database may run faster with fewer, less normalized tables. But this situation is quite improbable for most enterprises and users. Rather, it is common practice in large-scale and legacy systems.
Good luck.
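For what the single Contacts table from the question might look like in practice, here is a small sketch using Python's built-in sqlite3 module. The column names come from the question; the index name, sample rows and helper functions are assumptions made for the example, and a production database would likely differ in types and constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contacts (
        id      INTEGER PRIMARY KEY,
        type    TEXT NOT NULL,      -- 'store', 'warehouse', 'delivery', ...
        type_id INTEGER NOT NULL,   -- id of the owning store/warehouse/delivery row
        phone   TEXT,
        email   TEXT,
        address TEXT
    )
""")
# An index on (type, type_id) keeps per-owner lookups cheap as rows accumulate.
conn.execute("CREATE INDEX contacts_owner ON contacts (type, type_id)")

conn.executemany(
    "INSERT INTO contacts (type, type_id, phone, email, address) VALUES (?, ?, ?, ?, ?)",
    [
        ("store", 1, "555-0100", "store1@example.com", "1 Main St"),
        ("store", 1, "555-0101", None, None),  # NULLs are fine for unused fields
        ("warehouse", 7, "555-0200", "wh7@example.com", "9 Dock Rd"),
    ],
)

def contacts_for(owner_type, owner_id):
    # Every read filters on the (type, type_id) pair.
    cur = conn.execute(
        "SELECT phone, email, address FROM contacts "
        "WHERE type = ? AND type_id = ? ORDER BY id",
        (owner_type, owner_id),
    )
    return cur.fetchall()

def delete_contacts_for(owner_type, owner_id):
    # Deleting by owner uses the same filter; returns the number of rows removed.
    cur = conn.execute(
        "DELETE FROM contacts WHERE type = ? AND type_id = ?",
        (owner_type, owner_id),
    )
    return cur.rowcount

store_contacts = contacts_for("store", 1)
```

Wrapping the type filter in two small functions like this is one way to keep the "query with type every time" concern from spreading through the code.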

sequelize: retrieve model of a manually created table

If I create tables with the sequelize API (sequelize.define), it returns a model object (User in the following example) that I can use for queries (User.find) and other kinds of operations:
var User = sequelize.define('User', {/* ... */})
If I need to create a table in the db without that API, using a pure SQL query instead, is there a way to retrieve the same kind of model object for my manual table and use it like the others?
In sequelize.models I see all my tables but not the custom one.
You need to define a model using define with the tableName property set to your manually defined table.
The DB columns that you want to retrieve as a model's attributes must meet a number of criteria. Each should have a datatype that matches the column's type. They should have the same constraints (cascade, null etc) as the DB columns. The column name should either match what Sequelize automatically generates or be specified manually using field.
Since the table for this model has already been created manually, make sure that the model is not sync'ed to the DB. To test that the model you have specified will work with the manually created table, sync the model to a test database and compare the automatically generated SQL with the SQL you wrote by hand.
This answer is based on the documentation here.

Cassandra - join two tables and save result to new table

I am working on a self-bi application where users can upload their own datasets which are stored in Cassandra tables that are created dynamically. The data is extracted from files that the user can upload. So, each dataset is written into its own Cassandra table modeled based on column headers in the uploaded file while indexing the dimensions.
Once the data is uploaded, the users are allowed to build reports, analyze, etc., from within the application. I need a way to allow users to merge/join data from two or more datasets/tables based on matching keys and write the result into a new Cassandra table. Once a dataset/table is created, it will stay immutable and data is only read from it.
user table 1
username
email
employee id
user table 2
employee id
manager
I need to merge data in user table 1 and user table 2 on matching employee id and write to new table that is created dynamically.
new table
username
email
employee id
manager
What would be the best way to do this?
The only option you have is to do the join in your application code. There are too few details to suggest a proper solution.
Please add details about table keys, usage patterns... In general, in Cassandra you model from the usage point of view, i.e. starting with the queries that you'll execute on the data.
To merge 2 tables following this pattern, you have to do it in the application: create the third table (the target table) and fill it with data from both tables. Make sure that you read the data in pages so you don't run out of memory; how much this matters depends on the size of the data.
Another alternative is to build the joins in Spark, but that may be over-engineering in your case.
You can give the merged table a primary key of user so that the merged data goes in one row; that should be unique since it is a one-time action.
Then, when the user clicks, you can go through one table in batches using a fetch size (for Java, check the query options; it gives you a fixed window that is loaded and, once exhausted, moves on to the next fetch size of elements). Let's say you have a fetch size of 1000 items: iterate over them from one table, find matches in the second table, and after 1000 is reached, send a batch of 1000 inserts to the new table.
If that is time consuming you can, as suggested, use some other tool like Apache Spark or Spring Batch and do it in the background, informing the user that it will take place.
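The page-by-page join described above can be sketched roughly as follows. This is plain Python over in-memory stand-ins, not driver code: `pages`, `join_tables`, `lookup_manager` and `write_batch` are hypothetical names, the first table is any iterable of rows, the second is looked up by employee id, and the "new table" is just a list. A real implementation would swap these for paged reads and batched inserts against the cluster.

```python
from itertools import islice

def pages(rows, fetch_size):
    # Yield successive lists of at most fetch_size rows, so only one page
    # is held in memory at a time.
    it = iter(rows)
    while True:
        page = list(islice(it, fetch_size))
        if not page:
            return
        yield page

def join_tables(table1_rows, lookup_manager, write_batch, fetch_size=1000):
    written = 0
    for page in pages(table1_rows, fetch_size):
        batch = []
        for row in page:
            manager = lookup_manager(row["employee_id"])
            if manager is not None:  # inner join: skip rows without a match
                batch.append({**row, "manager": manager})
        if batch:
            write_batch(batch)  # one batch of inserts per page
            written += len(batch)
    return written

# Tiny stand-ins for the two source tables and the target table.
table1 = [
    {"username": "ann", "email": "ann@example.com", "employee_id": 1},
    {"username": "bob", "email": "bob@example.com", "employee_id": 2},
    {"username": "cy", "email": "cy@example.com", "employee_id": 3},
]
table2 = {1: "dana", 3: "eli"}
new_table = []

count = join_tables(table1, table2.get, new_table.extend, fetch_size=2)
```

Here rows without a matching employee id are dropped; if unmatched rows should be kept with an empty manager, the `if manager is not None` check is the line to change.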

How would you store contacts in Azure Tables?

Each user of my system can have contacts. Each contact has details like Name, Address, Email, Phone, etc.
Do you think it is a good idea to store these contacts in Azure Tables? I am worried about the following:
How do I search for a specific field (like Email or Phone)?
How do I get only the contacts belonging to a specific user?
How do I sort the contacts by a field?
I think that contacts could be a good candidate for storing in Table Storage - but only if you can partition on the owning person and never really need to search or aggregate across multiple owning users.
One possible design for this is:
store the contacts once with the owning user as partition key and some unique field for row key, but with the fields as columns within each row.
How do I search for a specific field (like Email or Phone)?
You can then ask table storage to search within a partition - it will then do a scan within that partition - which shouldn't be particularly large or slow for any single partition.
How do I get only the contacts belonging to a specific user?
This is just a simple query by partition key only.
How do I sort the contacts by a field?
All results from table storage are sorted by (partitionkey, rowkey) so to sort the contacts for a user, you'll need to query for all of them, and then sort them within your web or worker role.
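To make the three answers above concrete, here is a small Python sketch over an in-memory list of entities. It does not call the Azure SDK; `entities`, `contacts_of`, `find_by_field` and `sorted_by` are hypothetical names that only mimic the access pattern: filter by partition key, scan within that partition for a field value, and sort on the client.

```python
# Hypothetical in-memory entities; in Table Storage each dict would be one row
# with the owning user as PartitionKey and a unique contact id as RowKey.
entities = [
    {"PartitionKey": "user-1", "RowKey": "c2", "Name": "Zoe", "Email": "zoe@example.com"},
    {"PartitionKey": "user-1", "RowKey": "c1", "Name": "Adam", "Email": "adam@example.com"},
    {"PartitionKey": "user-2", "RowKey": "c1", "Name": "Mia", "Email": "mia@example.com"},
]

def contacts_of(user):
    # Query by partition key only; results are ordered by RowKey,
    # matching the (partitionkey, rowkey) ordering described above.
    rows = [e for e in entities if e["PartitionKey"] == user]
    return sorted(rows, key=lambda e: e["RowKey"])

def find_by_field(user, field, value):
    # Search on a non-key field: a scan, but limited to one user's partition.
    return [e for e in contacts_of(user) if e.get(field) == value]

def sorted_by(user, field):
    # Sorting on any other field happens client-side, after fetching the partition.
    return sorted(contacts_of(user), key=lambda e: e[field])
```

As long as a single user's contact list stays small, both the scan and the client-side sort remain cheap.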
Other designs are, of course, possible -
e.g. you could store each contact in multiple rows in multiple tables - this would then allow you to have pre-formed sort orders within the table storage.
e.g. you could use separate tables instead of separate partitionkeys for each user - this has the advantage that when you delete a user, you can delete the entire table belonging to that user.
Note... while it's possible to use table storage for this... actually I almost always seem to end up back in SQL Azure at the moment - it's just so much more powerful and predictable (IMO). When the team delivers secondary indexing, then I might be tempted to use it for more of my data.
