Let's say I have a table pagelabels which keeps track of the labels users have applied to pages.
The table I have is as below:
CREATE TABLE pagelabels (
    pageid text PRIMARY KEY,
    label list<text> // the labels on this pageid
)
I want to show the user 10 labels initially, sorted by insertion time, when the page is opened. When the user clicks "load more labels", I should return the next 10 labels for this page. How can I support such a pagination scheme? Should I consider changing my data model in this case?
You cannot get your desired results with the current data model. If you query by pageid, all the labels are returned at once because they live in a single list, and there is no provision to sort by insertion time.
A better data model could be:
CREATE TABLE pagelabels (
    pageid text,
    insertion_time timestamp,
    label text,
    label_attribute1 text,
    PRIMARY KEY (pageid, insertion_time, label)
);
If you want the newest labels first, add WITH CLUSTERING ORDER BY (insertion_time DESC) to the table definition.
With this data model you can page through the labels using the paging support in the drivers (the DataStax Java Driver documents this, for example).
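For illustration, a minimal sketch of that paging flow with the DataStax Python driver (contact point, keyspace, and pageid value are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])      # assumed contact point
session = cluster.connect('mykeyspace')  # assumed keyspace

# fetch_size makes the driver pull 10 rows per page.
query = SimpleStatement(
    "SELECT label FROM pagelabels WHERE pageid = %s",
    fetch_size=10)

# First page: the first 10 labels in clustering order.
result = session.execute(query, ('page-1',))
labels = result.current_rows
page_state = result.paging_state  # opaque token; hand it back to the client

# "Load more": resume exactly where the previous page ended.
result = session.execute(query, ('page-1',), paging_state=page_state)
more_labels = result.current_rows

The paging_state token can be round-tripped through the web tier, so each "load more" click resumes the same query instead of re-reading earlier rows.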
I have 6 tables in my database, each with approximately 12-15 columns, and they are related to a main_table through its id. I have to migrate my database to Cassandra, so my question is: should I create one main_table with many columns, or separate tables as in my MySQL database?
Will multiple columns take more space, or will multiple tables take more space?
Your line of questioning is flawed. It is a common mistake for DBAs who only have a background in traditional relational databases to view data as normalised tables.
When you switch to NoSQL, you are doing it because you are trying to solve a problem that a traditional RDBMS can't. A paradigm shift is required, since you can't just migrate relational tables as they are; otherwise you're back to where you started.
The principal philosophy of data modelling in Cassandra is that you need to design a CQL table for each application query. It is a one-to-one mapping between app queries and CQL tables. The crucial point is you need to start with the app queries, not the tables.
Let us say that you have an application that stores information about users which include usernames, email addresses, first/last name, phone numbers, etc. If you have an app query like "get the email address for username X", it means that you need a table of email addresses and the schema would look something like:
CREATE TABLE emails_by_username (
    username text,
    email text,
    firstname text,
    lastname text,
    ...
    PRIMARY KEY (username)
)
You would then query this table with:
SELECT email FROM emails_by_username WHERE username = ?
Another example is where you have an app query like "get the first and last names for a user where email address is Y". You need a table of users partitioned by email instead:
CREATE TABLE users_by_email (
    email text,
    firstname text,
    lastname text,
    ...
    PRIMARY KEY (email)
)
You would query the table with:
SELECT firstname, lastname FROM users_by_email WHERE email = ?
Hopefully these examples show that disk space consumption is not the thing to optimise for. What is important is that you design your tables so they are optimised for the application queries. Cheers!
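A consequence of this design is that the application writes the same data into every table that serves a query. As a minimal sketch with the DataStax Python driver (contact point and keyspace are assumptions), a logged batch keeps the two tables above in sync on write:

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['127.0.0.1'])   # assumed contact point
session = cluster.connect('myks')  # assumed keyspace

# A logged batch guarantees both denormalized copies are eventually
# written, so the two query tables stay consistent with each other.
batch = BatchStatement(batch_type=BatchType.LOGGED)
batch.add(
    "INSERT INTO emails_by_username (username, email, firstname, lastname) "
    "VALUES (%s, %s, %s, %s)",
    ('jdoe', 'jdoe@example.com', 'John', 'Doe'))
batch.add(
    "INSERT INTO users_by_email (email, firstname, lastname) "
    "VALUES (%s, %s, %s)",
    ('jdoe@example.com', 'John', 'Doe'))
session.execute(batch)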
I have a PDF file which contains data in a structure like the one below.
I want to use Azure Form Recognizer to extract the data.
How can I label the data as a table?
While tagging a table, I need to specify the column and row.
You must select a table from the Form Recognizer tag insertion field. The important thing here is to choose the table type (fixed-size or row-dynamic).
Then you must point to the fields in the table by manually creating the columns. If there are no columns, I recommend labelling the fields one by one; alternatively, you can create placeholder columns and later delete them with a script.
You can find more detailed information at the link below.
Use table tags to train your custom template model
Form Recognizer should extract this table automatically as part of the layout information extracted from the document. Did Layout succeed in extracting the table? Do you also need the data in a particular structure? If so, you can train a custom template model and label the table. The table you label does not need to be in the same structure as the source: for example, you can create a 4-column table from this data with Age, Name, Weight, and Height as the columns, or, if this is a single-row table, you can label it as key-value pairs instead.
Try out the Form Recognizer General Document prebuilt model; it might extract these as key-value pairs out of the box: https://formrecognizer.appliedai.azure.com/studio/document
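For illustration, here is roughly how the General Document prebuilt can be called from the azure-ai-formrecognizer Python SDK (endpoint, key, and file name are placeholders):

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"))

with open("input.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-document", document=f)
result = poller.result()

# Tables found by Layout come back as structured objects.
for table in result.tables:
    for cell in table.cells:
        print(cell.row_index, cell.column_index, cell.content)

# The general document model also extracts key-value pairs.
for kv in result.key_value_pairs:
    if kv.key and kv.value:
        print(kv.key.content, "->", kv.value.content)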
In this minimal situation I have a table with a field referencing entries of another table.
When I'm adding a main_entries entry, I get a dropdown with entries from the referenced table. When the entry I need doesn't exist in the referenced table yet, how can I create it from this view (i.e. without leaving the main_entries form)?
I have this situation:
Model:
db.define_table('main_entries',
    Field('type', 'reference entry_type')
)
db.define_table('entry_type',
    Field('label')
)
Controller:
def entries_edit():
    form = SQLFORM.grid(db.main_entries)
    return dict(form=form)
View:
{{extend 'layout.html'}}
{{=form}}
You can manage this by using the left option of SQLFORM.grid
left is an optional left join expression used to build ...select(left=...).
It makes sense to combine this with the fields option to specify the fields of both tables that should be displayed.
fields is a list of fields to be fetched from the database. It is also used to determine which fields are shown in the grid view. However, it doesn't control what is displayed in the separate form used to edit rows; for that, use the readable and writable attributes of the database fields.
And don't forget to reference the leading table via the field_id option.
field_id must be the field of the table to be used as ID, for example db.mytable.id. This is useful when the grid query is a join of several tables. Any action button on the grid (add record, view, edit, delete) will work on db.mytable.
cf. SQLFORM.grid signature
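Putting those options together, the controller might look something like this sketch (the field list and readable flag are assumptions for the model above):

def entries_edit():
    # Make the joined label visible in the grid rows.
    db.entry_type.label.readable = True
    form = SQLFORM.grid(
        db.main_entries,
        left=db.entry_type.on(db.main_entries.type == db.entry_type.id),
        fields=[db.main_entries.id, db.main_entries.type, db.entry_type.label],
        field_id=db.main_entries.id,  # action buttons operate on main_entries
    )
    return dict(form=form)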
You must specify the relationship in the query.
You can try something like this:
def entries_edit():
    query = db(db.main_entries.type == db.entry_type.id)
    form = SQLFORM.grid(query)
    return dict(form=form)
I have a table called 'usertab' to store user details such as:
userid uuid,
firstname text,
lastname text,
email text,
gender int,
image text
Most of the other tables contain userid as a field referencing 'usertab', but when I retrieve data from another table, I need to execute an additional select query to get the user details.
So if 10,000 or more rows are retrieved, the same number of select queries is executed just to get the user details. This makes our system slow.
So we added usertab fields such as firstname, lastname, gender, and image to the other tables in addition to the userid field.
Data retrieval became fast, but we face another problem: if anything changes in usertab, such as the firstname, lastname, gender, or image, we need to update every other table that contains those user details. Considering the huge amount of data in the other tables, how can I handle this?
We are using a Lucene index and C#.
Cassandra writes are significantly faster and more efficient than its reads.
That's why Cassandra prefers denormalization over normalization.
Denormalization is the concept that a data model should be designed so that a given query can be served from one row with one query. Instead of doing multiple reads from multiple tables and rows to gather all the data required for a response, modify your application logic to insert the required data multiple times, into every row that might need it in the future. That way, all the required data is available in just one read, which prevents multiple lookups.
When executing multiple updates you can use ExecuteAsync.
Session allows asynchronous execution of statements (for any type of statement: simple, bound or batch) by exposing the ExecuteAsync method.
//Execute a statement asynchronously using await
var rs = await session.ExecuteAsync(statement);
Source: https://www.hakkalabs.co/articles/cassandra-data-modeling-guide
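For illustration only, the same fan-out-the-updates pattern sketched with the DataStax Python driver (the denormalized table names and the updated column are hypothetical):

import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # assumed contact point
session = cluster.connect('myks')  # assumed keyspace

user_id = uuid.UUID('00000000-0000-0000-0000-000000000001')  # placeholder

# Hypothetical denormalized tables that duplicate usertab fields.
updates = [
    "UPDATE posts_by_user SET firstname = %s WHERE userid = %s",
    "UPDATE orders_by_user SET firstname = %s WHERE userid = %s",
]

# Fire all updates concurrently, then wait for each; future.result()
# raises if the corresponding update failed.
futures = [session.execute_async(cql, ('NewName', user_id)) for cql in updates]
for future in futures:
    future.result()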
I am working on a self-service BI application where users can upload their own datasets, which are stored in dynamically created Cassandra tables. The data is extracted from files the user uploads, so each dataset is written into its own Cassandra table, modeled on the column headers in the uploaded file, with the dimensions indexed.
Once the data is uploaded, users can build reports, analyze, etc. from within the application. I need a way to allow users to merge/join data from two or more datasets/tables on matching keys and write the result into a new Cassandra table. Once a dataset/table is created, it stays immutable and data is only read from it.
user table 1
username
email
employee id
user table 2
employee id
manager
I need to merge data in user table 1 and user table 2 on matching employee id and write to new table that is created dynamically.
new table
username
email
employee id
manager
What would be the best way to do this?
The only option you have is to do the join in your application code. However, there are too few details here to suggest a proper solution.
Please add details about the table keys, usage patterns, and so on. In general, in Cassandra you model from the usage point of view, i.e. starting with the queries that you'll execute on the data.
In order to merge the 2 tables with this pattern, you have to do it in the application: create the third (target) table and fill it with the data from both tables. Make sure you read the data in pages so you don't run out of memory; it really depends on the size of the data.
Another alternative is to do the join in Spark, but that may be over-engineering in your case.
You can give the merged table a primary key on the user (the matching employee id) so the merged data goes into one row, and that should be unique since this is a one-time action.
Then, when the user clicks, you can go through one table in pages using a fetch size (for Java, see the query options; it gives you a fixed window that is loaded, and when its end is reached the driver fetches the next fetch-size worth of elements). Let's say you have a fetch size of 1000 items: iterate over them from one table, find the matches in the second table, and after 1000 are processed, issue a batch of 1000 inserts into the new table. A sketch of this follows below.
If that is too time consuming, you can, as suggested, use another tool like Apache Spark or Spring Batch, run the merge in the background, and inform the user that it is in progress.
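A rough sketch of the paged merge described above, using the DataStax Python driver (table and column names follow the question, but the exact schemas are assumptions, user_table_2 is assumed to be partitioned by employee_id, and individual inserts are used instead of 1000-row batches):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])   # assumed contact point
session = cluster.connect('myks')  # assumed keyspace

# Page through user_table_1 so the whole table is never held in memory.
scan = SimpleStatement(
    "SELECT username, email, employee_id FROM user_table_1",
    fetch_size=1000)
lookup = session.prepare(
    "SELECT manager FROM user_table_2 WHERE employee_id = ?")
insert = session.prepare(
    "INSERT INTO merged_users (employee_id, username, email, manager) "
    "VALUES (?, ?, ?, ?)")

for row in session.execute(scan):  # the driver fetches pages on demand
    match = session.execute(lookup, (row.employee_id,)).one()
    if match:
        session.execute(
            insert,
            (row.employee_id, row.username, row.email, match.manager))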