I am new to Cassandra and am trying to implement a Reddit mock with limited functionality. I am not considering subreddits and comments for now. There is a single home page that displays 'Top' posts and 'New' posts, and clicking any post navigates into it.
1) Is this a correct schema design?
2) If I want to show all-time top posts, how can that be achieved?
Table for Post Details
CREATE TABLE main.post (
user_id text,
post_id text,
timeuuid timeuuid,
downvoted_user_id list<text>,
img_ids list<text>,
islocked boolean,
isnsfw boolean,
post_date date,
score int,
upvoted_user_id list<text>,
PRIMARY KEY ((user_id, post_id), timeuuid)
) WITH CLUSTERING ORDER BY (timeuuid DESC);
Table for Top & New Posts
CREATE TABLE main.posts_by_year (
post_year text,
timeuuid timeuuid,
score int,
img_ids list<text>,
islocked boolean,
isnsfw boolean,
post_date date,
post_id text,
user_id text,
PRIMARY KEY (post_year, timeuuid, score)
) WITH CLUSTERING ORDER BY (timeuuid DESC, score DESC);
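For reference on question 2, serving 'all-time top' reads generally needs a table clustered by score first, so that one partition read returns the highest-scoring posts. Below is a minimal sketch; posts_by_score and its single 'all' bucket are assumptions, not part of the schema above. Since scores change with votes, every score change means deleting and re-inserting the row, and a single bucket grows without bound, so this only illustrates the read pattern:
CREATE TABLE main.posts_by_score (
    bucket text,       -- e.g. the fixed value 'all' (hypothetical leaderboard bucket)
    score int,
    timeuuid timeuuid, -- tie-breaker so posts with equal scores stay unique
    post_id text,
    user_id text,
    PRIMARY KEY (bucket, score, timeuuid)
) WITH CLUSTERING ORDER BY (score DESC, timeuuid DESC);
-- All-time top 25:
-- SELECT * FROM main.posts_by_score WHERE bucket = 'all' LIMIT 25;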
Imagine we have 2 tables in an RDBMS, INVOICE and INVOICE_LINE_ITEMS, with a one-to-many relationship between them.
INVOICE (1) --------> (*) INVOICE_LINE_ITEMS
These entities now need to be stored in Cassandra; to do this we can follow 3 approaches:
1. A row for INVOICE with a SET<FROZEN<INVOICE_LINE_ITEMS_UDT>>.
2. A denormalized table with PRIMARY KEY (invoice_id, invoice_line_item_id); for one invoice there will be multiple line_item_ids, i.e. multiple rows.
3. Two tables, taking care of updating both and joining the query results in DAO code.
The use cases are:
A user can create an invoice and keep adding, updating and deleting lines.
A user can search by invoice or invoice_line_udt attributes and get the invoice details (using DSE Search solr_query).
An INVOICE (header) may contain 20 attributes, each item (invoice_line) may contain around 30+ attributes (a big UDT), and each collection may have ~1000 lines.
Questions:
Using a frozen collection affects read and write performance due to serialization and deserialization. Considering that the UDT contains 30+ fields and the collection holds a max of 1000 items, is this a good approach and data model?
Because of the serialization and deserialization, the collection of UDTs gets replaced every time the record (partition) is updated. Will these column updates create tombstones? Considering that we have a lot of updates to the items (the collection of UDTs), will this create a problem?
Here is the CQL for approach 1 (an invoice header row holding the collection of UDTs):
CREATE TYPE IF NOT EXISTS comment_udt (
created_on timestamp,
user text,
comment_type text,
comment text
);
CREATE TYPE IF NOT EXISTS invoice_line_udt ( -- represents each line item
invoice_line_id text,
invoice_line_number int,
parent_id text,
item_id text,
item_name text,
item_type text,
uplift_start_end_indicator text,
uplift_start_date timestamp,
uplift_end_date timestamp,
bol_number text,
ap_only text,
uom_code text,
gross_net_indicator text,
gross_quantity decimal,
net_quantity decimal,
unit_cost decimal,
extended_cost decimal,
available_quantity decimal,
total_cost_adjustment decimal,
total_quantity_adjustment decimal,
total_variance decimal,
alt_quantity decimal,
alt_quantity_uom_code text,
adj_density decimal,
location_id text,
location_name text,
origin_location_id text,
origin_location_name text,
intermediate_location_id text,
intermediate_location_name text,
dest_location_id text,
dest_location_name text,
aircraft_tail_number text,
flight_number text,
aircraft_type text,
carrier_id text,
carrier_name text,
created_on timestamp,
created_by text,
updated_on timestamp,
updated_by text,
status text,
matched_tier_name text,
matched_on text,
workflow_action text,
adj_reason text,
credit_reason text,
hold_reason text,
delete_reason text,
ap_only_reason text
);
CREATE TABLE IF NOT EXISTS invoice_by_id ( -- MAIN TABLE --
invoice_id text,
parent_id text,
segment text,
invoice_number text,
invoice_type text,
source text,
ap_only text,
invoice_date timestamp,
received_date timestamp,
due_date timestamp,
vendor_id text,
vendor_name text,
vendor_site_id text,
vendor_site_name text,
currency_code text,
local_currency_code text,
exchange_rate decimal,
exchange_rate_date timestamp,
extended_cost decimal,
early_pay_discount decimal,
payment_method text,
invoice_amount decimal,
total_tolerance decimal,
total_variance decimal,
location_id text,
location_name text,
dest_location_override text,
company_id text,
company_name text,
org_id text,
sold_to_number text,
ship_to_number text,
ref_po_number text,
sanction_indicator text,
created_on timestamp,
created_by text,
updated_on timestamp,
updated_by text,
manually_assigned text,
assigned_user text,
assigned_group text,
workflow_process_id text,
version int,
comments set<frozen<comment_udt>>,
status text,
lines set<frozen<invoice_line_udt>>, -- collection of UDTs
PRIMARY KEY (invoice_id, invoice_type));
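As background for the tombstone question: a frozen collection cannot be modified element-wise, so every change rewrites the whole cell. That rewrite is a single-cell overwrite, which does not generate the range tombstone you get when overwriting a non-frozen collection, but the full ~1000-item set still has to be re-serialized and shipped on every update. A minimal sketch with placeholder values:
-- Element operations such as SET lines = lines + {...} are rejected for
-- frozen collections; the complete set must be supplied on every update:
UPDATE invoice_by_id
SET lines = {
    { invoice_line_id: 'line-1', invoice_line_number: 1, item_name: 'Jet A-1' }
    -- ...and every other line item, re-serialized and re-sent...
}
WHERE invoice_id = 'inv-001' AND invoice_type = 'STANDARD';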
Here is the script for approach 2 (denormalized: invoice and lines in one partition, but in multiple rows):
CREATE TABLE wfs_eam_ap_matching.invoice_and_lines_copy1 (
invoice_id uuid,
invoice_line_id uuid,
record_type text,
active boolean,
adj_density decimal,
adj_reason text,
aircraft_tail_number text,
aircraft_type text,
alt_quantity decimal,
alt_quantity_uom_code text,
ap_only boolean,
ap_only_reason text,
assignment_group text,
available_quantity decimal,
bol_number text,
cancel_reason text,
carrier_id uuid,
carrier_name text,
comments LIST<FROZEN<comment_udt>>,
company_id uuid,
company_name text,
created_by text,
created_on timestamp,
credit_reason text,
dest_location_id uuid,
dest_location_name text,
dest_location_override boolean,
dom_intl_indicator text,
due_date timestamp,
early_pay_discount decimal,
exchange_rate decimal,
exchange_rate_date timestamp,
extended_cost decimal,
flight_number text,
fob_point text,
gross_net_indicator text,
gross_quantity decimal,
hold_reason text,
intermediate_location_id uuid,
intermediate_location_name text,
invoice_currency_code text,
invoice_date timestamp,
invoice_line_number int,
invoice_number text,
invoice_type text,
item_id uuid,
item_name text,
item_type text,
local_currency_code text,
location_id uuid,
location_name text,
manually_assigned boolean,
matched_on timestamp,
matched_pos text,
matched_tier_name text,
net_quantity decimal,
org_id int,
origin_location_id uuid,
origin_location_name text,
parent_id uuid,
payment_method text,
received_date timestamp,
ref_po_number text,
sanction_indicator text,
segment text,
ship_to_number text,
sold_to_number text,
solr_query text,
source text,
status text,
total_tolerance decimal,
total_variance decimal,
unique_identifier FROZEN<TUPLE<text, text>>,
unit_cost decimal,
uom_code text,
updated_by text,
updated_on timestamp,
uplift_end_date timestamp,
uplift_start_date timestamp,
uplift_start_end_indicator text,
user_assignee text,
vendor_id uuid,
vendor_name text,
vendor_site_id uuid,
vendor_site_name text,
version int,
workflow_process_id text,
PRIMARY KEY (invoice_id, invoice_line_id, record_type)
);
Note: we use DataStax Cassandra + DSE Search, which doesn't support static columns, hence we are not using them. Also, in order to give a real picture, I have listed the tables and UDT with all of their columns, which is why this ended up as a long question.
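For completeness, the read pattern approach 2 implies is a single-partition SELECT that returns the header row together with all of its line rows; a sketch with a placeholder id (the 'HEADER'/'LINE' values for record_type are an assumption, not from the schema above):
-- One partition read brings back the invoice header and every line item;
-- record_type (e.g. 'HEADER' vs 'LINE') distinguishes the rows client-side.
SELECT * FROM wfs_eam_ap_matching.invoice_and_lines_copy1
WHERE invoice_id = 11111111-1111-1111-1111-111111111111;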
I have two Cassandra models with the same partition key:
CREATE TABLE users(
parent_id int,
user_id text,
PRIMARY KEY ((parent_id), user_id )
);
CREATE TABLE user_actions(
parent_id int,
user_id text,
type text,
created_at int,
data map<text, text>,
PRIMARY KEY((parent_id), user_id, created_at)
);
I want to find all the users who made an action and belong to the same parent_id.
Right now I'm getting all the users, even those who did not make an action. I'm using it like this:
http://ADDRESS/solr/name.users/select?q=parent_id:1&fq={!join+fromIndex=name.user_actions}type:click
Thanks!
There are no 'from' and 'to' parameters to tell Solr on which fields it should make the join, so your filter query should be something like:
fq={!join from=user_id fromIndex=name.user_actions to=user_id force=true}type:click
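Combined with the original request, the full URL would look something like this (ADDRESS and the index names taken from the question):
http://ADDRESS/solr/name.users/select?q=parent_id:1&fq={!join+from=user_id+fromIndex=name.user_actions+to=user_id+force=true}type:click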
I have these tables:
CREATE TABLE user_info (
userId uuid PRIMARY KEY,
userName varchar,
fullName varchar,
sex varchar,
bizzCateg varchar,
userType varchar,
about text,
joined bigint,
contact text,
job set<text>,
blocked boolean,
emails set<text>,
websites set<text>,
professionTag set<text>,
location frozen<location>
);
create table publishMsg
(
rowKey uuid,
msgId timeuuid,
postedById uuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
esIndx boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
create table publishMsg_by_user
(
rowKey uuid,
msgId timeuuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
CREATE TABLE followers
(
rowKey UUID,
followedBy uuid,
time bigint,
PRIMARY KEY(rowKey, orderKey)
);
I am doing 3 INSERT statements in a BATCH to put data into the publishMsg, publishMsg_by_user and followers tables.
To show a single message I have to run three SELECT queries against different tables:
publishMsg - to get the details of a published message, given rowKey & msgId.
user_info - to get fullName based on postedById.
followers - to know whether postedById is following a given topic or not.
Is this a fit way of using Cassandra? Will it be efficient, given that in this scenario the data can't fit in a single table?
Sorry to ask this in an answer, but I don't have the rep to comment.
Ignoring the tables for now, what information does your application need to ask for? Ideally in Cassandra you only have to execute one query on one table to get the data you need to return to the client. You shouldn't have to execute 3 queries to get what you want.
Also, your followers table appears to be missing the orderKey field.
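To illustrate the one-query idea, here is a minimal sketch that denormalizes the poster's name into the message table at write time; publishMsg_denorm and postedByName are made-up names, and duplicating the name is only acceptable if it rarely changes (or you are willing to update old messages when it does):
CREATE TABLE publishMsg_denorm (
    rowKey uuid,
    msgId timeuuid,
    postedById uuid,
    postedByName text, -- copied from user_info when the message is written
    title text,
    details text,
    tags set<text>,
    PRIMARY KEY (rowKey, msgId)
) WITH CLUSTERING ORDER BY (msgId DESC);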
I created one table, posts. When I make a SELECT request:
return $this->db->query('SELECT * FROM "posts" WHERE "id" IN(:id) LIMIT '.$this->limit_per_page, ['id' => $id]);
I get this error:
PRIMARY KEY column "id" cannot be restricted (preceding column
"post_at" is either not restricted or by a non-EQ relation)
My table dump is:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id bigint,
name text,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY (user_id,post_at,id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
I read some articles about primary and clustering keys and understood that when there are multiple primary key columns, I need to use the = operator on the preceding columns in order to use IN. In my case, I cannot use a single-column PRIMARY KEY. What would you advise me to change in the table structure so that the error disappears?
My dummy table structure:
CREATE TABLE posts (
id timeuuid,
post_at timestamp,
user_id bigint,
PRIMARY KEY (id,post_at,user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
And after inserting some dummy data, I ran the query:
select * from posts where id in (timeuuid1, timeuuid2, timeuuid3);
I was using Cassandra 2.0 with CQL 3.0.
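For comparison, with the original PRIMARY KEY (user_id, post_at, id), the IN restriction on id only becomes legal once the preceding key columns are pinned with =; a sketch with placeholder values:
-- user_id and post_at must be EQ-restricted before id may use IN:
SELECT * FROM posts
WHERE user_id = 42
  AND post_at = '2015-01-01 00:00:00'
  AND id IN (timeuuid1, timeuuid2);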