Using collection of UDTs vs Denormalized rows in Cassandra

Imagine we have two tables in an RDBMS, INVOICE and INVOICE_LINE_ITEMS, with a One-To-Many relationship between them:
INVOICE (1) --------> (*) INVOICE_LINE_ITEMS
These entities now need to be stored in Cassandra. To do this we can follow one of these approaches:
A denormalized table with PRIMARY KEY (invoice_id, invoice_line_item_id); for one invoice there will be multiple line_item_ids.
A single row for INVOICE with a SET<FROZEN<INVOICE_LINE_ITEMS_UDT>>.
Two separate tables, with the DAO code taking care of updating both tables and joining the query results.
The use cases are:
A user can create an invoice and keep adding, updating, and deleting lines.
A user can search on invoice or invoice_line_udt attributes and get the invoice details (using DSE Search solr_query).
An INVOICE (header) may contain about 20 attributes, and each item (invoice_line) may contain 30+ attributes, which makes for a big UDT; each collection may hold ~1000 lines.
Questions:
Using a frozen collection affects read and write performance due to serialization and deserialization. Considering that the UDT contains 30+ fields and the collection holds at most ~1000 items, is this a good approach and data model?
Because there is serialization and deserialization, the collection of UDTs gets replaced every time the record or partition is updated. Will column updates create tombstones? Considering that we have a lot of updates to the items (the collection of UDTs), will this create a problem? (The example UPDATE statements after the schema below illustrate the write patterns in question.)
Here is the CQL for approach 1 (invoice header row holding a collection of UDTs):
CREATE TYPE IF NOT EXISTS comment_udt (
created_on timestamp,
user text,
comment_type text,
comment text
);
CREATE TYPE IF NOT EXISTS invoice_line_udt ( -- TO REPRESENT EACH ITEM --
invoice_line_id text,
invoice_line_number int,
parent_id text,
item_id text,
item_name text,
item_type text,
uplift_start_end_indicator text,
uplift_start_date timestamp,
uplift_end_date timestamp,
bol_number text,
ap_only text,
uom_code text,
gross_net_indicator text,
gross_quantity decimal,
net_quantity decimal,
unit_cost decimal,
extended_cost decimal,
available_quantity decimal,
total_cost_adjustment decimal,
total_quantity_adjustment decimal,
total_variance decimal,
alt_quantity decimal,
alt_quantity_uom_code text,
adj_density decimal,
location_id text,
location_name text,
origin_location_id text,
origin_location_name text,
intermediate_location_id text,
intermediate_location_name text,
dest_location_id text,
dest_location_name text,
aircraft_tail_number text,
flight_number text,
aircraft_type text,
carrier_id text,
carrier_name text,
created_on timestamp,
created_by text,
updated_on timestamp,
updated_by text,
status text,
matched_tier_name text,
matched_on text,
workflow_action text,
adj_reason text,
credit_reason text,
hold_reason text,
delete_reason text,
ap_only_reason text
);
CREATE TABLE IF NOT EXISTS invoice_by_id ( -- MAIN TABLE --
invoice_id text,
parent_id text,
segment text,
invoice_number text,
invoice_type text,
source text,
ap_only text,
invoice_date timestamp,
received_date timestamp,
due_date timestamp,
vendor_id text,
vendor_name text,
vendor_site_id text,
vendor_site_name text,
currency_code text,
local_currency_code text,
exchange_rate decimal,
exchange_rate_date timestamp,
extended_cost decimal,
early_pay_discount decimal,
payment_method text,
invoice_amount decimal,
total_tolerance decimal,
total_variance decimal,
location_id text,
location_name text,
dest_location_override text,
company_id text,
company_name text,
org_id text,
sold_to_number text,
ship_to_number text,
ref_po_number text,
sanction_indicator text,
created_on timestamp,
created_by text,
updated_on timestamp,
updated_by text,
manually_assigned text,
assigned_user text,
assigned_group text,
workflow_process_id text,
version int,
comments set<frozen<comment_udt>>,
status text,
lines set<frozen<invoice_line_udt>>, -- COLLECTION OF UDTs --
PRIMARY KEY (invoice_id, invoice_type));
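To make the tombstone question concrete, here is roughly how the line-level writes would look against this schema (a sketch with illustrative key values, not our actual application code):
-- Element-wise add: appends one element without touching the rest of the set
UPDATE invoice_by_id
SET lines = lines + { { invoice_line_id: 'line-1', invoice_line_number: 1, status: 'NEW' } }
WHERE invoice_id = 'inv-1' AND invoice_type = 'STANDARD';
-- Element-wise remove: writes a tombstone for that one element only; the frozen
-- UDT literal must match the stored element exactly, including unset fields
UPDATE invoice_by_id
SET lines = lines - { { invoice_line_id: 'line-1', invoice_line_number: 1, status: 'NEW' } }
WHERE invoice_id = 'inv-1' AND invoice_type = 'STANDARD';
-- Whole-collection assignment: replaces everything; for a non-frozen set this
-- also writes a range tombstone covering the previous contents
UPDATE invoice_by_id
SET lines = { { invoice_line_id: 'line-2', invoice_line_number: 2, status: 'NEW' } }
WHERE invoice_id = 'inv-1' AND invoice_type = 'STANDARD';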
Here is the script for approach 2 (denormalized invoice and lines in one partition, as multiple rows):
CREATE TABLE wfs_eam_ap_matching.invoice_and_lines_copy1 (
invoice_id uuid,
invoice_line_id uuid,
record_type text,
active boolean,
adj_density decimal,
adj_reason text,
aircraft_tail_number text,
aircraft_type text,
alt_quantity decimal,
alt_quantity_uom_code text,
ap_only boolean,
ap_only_reason text,
assignment_group text,
available_quantity decimal,
bol_number text,
cancel_reason text,
carrier_id uuid,
carrier_name text,
comments list<frozen<comment_udt>>,
company_id uuid,
company_name text,
created_by text,
created_on timestamp,
credit_reason text,
dest_location_id uuid,
dest_location_name text,
dest_location_override boolean,
dom_intl_indicator text,
due_date timestamp,
early_pay_discount decimal,
exchange_rate decimal,
exchange_rate_date timestamp,
extended_cost decimal,
flight_number text,
fob_point text,
gross_net_indicator text,
gross_quantity decimal,
hold_reason text,
intermediate_location_id uuid,
intermediate_location_name text,
invoice_currency_code text,
invoice_date timestamp,
invoice_line_number int,
invoice_number text,
invoice_type text,
item_id uuid,
item_name text,
item_type text,
local_currency_code text,
location_id uuid,
location_name text,
manually_assigned boolean,
matched_on timestamp,
matched_pos text,
matched_tier_name text,
net_quantity decimal,
org_id int,
origin_location_id uuid,
origin_location_name text,
parent_id uuid,
payment_method text,
received_date timestamp,
ref_po_number text,
sanction_indicator text,
segment text,
ship_to_number text,
sold_to_number text,
solr_query text,
source text,
status text,
total_tolerance decimal,
total_variance decimal,
unique_identifier frozen<tuple<text, text>>,
unit_cost decimal,
uom_code text,
updated_by text,
updated_on timestamp,
uplift_end_date timestamp,
uplift_start_date timestamp,
uplift_start_end_indicator text,
user_assignee text,
vendor_id uuid,
vendor_name text,
vendor_site_id uuid,
vendor_site_name text,
version int,
workflow_process_id text,
PRIMARY KEY (invoice_id, invoice_line_id, record_type)
);
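For reference, reading a whole invoice in approach 2 is a single-partition query, and the solr_query column above is the DSE Search pseudo-column, so searches stay one query too. A sketch with made-up values (it assumes a search index exists on the table):
-- One partition read returns the header row plus all of its line rows
SELECT * FROM invoice_and_lines_copy1
WHERE invoice_id = 123e4567-e89b-12d3-a456-426614174000;
-- DSE Search across invoice/line attributes
SELECT invoice_id, invoice_line_id FROM invoice_and_lines_copy1
WHERE solr_query = 'vendor_name:Acme*';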
Note: we use DataStax Cassandra + DSE Search. DSE Search doesn't support static columns, hence we are not using them. Also, in order to give a realistic picture, I have listed the tables and UDTs with all of their columns, which ended up making this a long question.

Related

How to index list of user defined data type as a frozen on table "list<frozen<UDT>>"?

I have a table that contains a column foos list<frozen<foo>>, where the type foo is defined as:
CREATE TYPE api.foo (
arrival_date_time text,
carrier_iata text,
carrier_id text,
carrier_name text,
class_code text,
departure_date_time text,
flight_duration int,
"from" text,
"to" text,
via text
);
How do I create an index on a table that contains foos?
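Plain Cassandra can index a frozen collection only with a FULL index, which supports equality on the entire collection value. A sketch (the table name flights is made up, since the question doesn't name the table):
CREATE INDEX foos_full_idx ON api.flights (FULL(foos));
-- A FULL index only matches the whole list value; fields omitted from the
-- UDT literal default to null, and all of them take part in the comparison
SELECT * FROM api.flights
WHERE foos = [{carrier_iata: 'BA', class_code: 'Y', flight_duration: 180}];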

Running into unrecognized token error when writing csv file into sqlite3 database

I'm trying to write the contents of a CSV file into an sqlite3 database, but I'm running into an unrecognized token error while creating the database and defining the schema:
import csv
import sqlite3

# Connect to database
conn = sqlite3.connect('test.db')
# Create cursor
c = conn.cursor()
# Open CSV file
with open('500000 Records.csv', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            # Create table
            query = '''CREATE TABLE IF NOT EXISTS Employee({} INT, {} TEXT,
            {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT,
            {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT,
            {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT, {} TEXT,
            {} TEXT, {} TEXT, {} TEXT)'''.format(*row)
            print(query)
            c.execute(query)
This is the query that is printed when the print(query) line is executed:
CREATE TABLE IF NOT EXISTS Employee(Emp ID INT, Name Prefix TEXT,
First Name TEXT, Middle Initial TEXT, Last Name TEXT, Gender TEXT, E Mail TEXT, Father's Name TEXT, Mother's Name TEXT, Mother's Maiden Name TEXT,
Date of Birth TEXT, Time of Birth TEXT, Age in Yrs. TEXT, Weight in Kgs. TEXT, Date of Joining TEXT, Quarter of Joining TEXT, Half of Joining TEXT, Year of Joining TEXT,
Month of Joining TEXT, Month Name of Joining TEXT, Short Month TEXT, Day of Joining TEXT, DOW of Joining TEXT, Short DOW TEXT, Age in Company (Years) TEXT, Salary TEXT,
Last % Hike TEXT, SSN TEXT, Phone No. TEXT)
This is the error that results from the c.execute(query) line:
Traceback (most recent call last):
File "C:\Users\User\Google Drive\CSC443\A1\create_database.py", line 21, in <module>
c.execute(query)
sqlite3.OperationalError: unrecognized token: "'s Maiden Name TEXT,
Date of Birth TEXT, Time of Birth TEXT, Age in Yrs. TEXT, Weight in Kgs. TEXT, Date of Joining TEXT, Quarter of Joining TEXT, Half of Joining TEXT, Year of Joining TEXT,
Month of Joining TEXT, Month Name of Joining TEXT, Short Month TEXT, Day of Joining TEXT, DOW of Joining TEXT, Short DOW TEXT, Age in Company (Years) TEXT, Salary TEXT,
Last % Hike TEXT, SSN TEXT, Phone No. TEXT)"
sqlite3 is taking issue with the "Mother's Maiden Name" column for some reason and I can't figure it out. It's not the first apostrophe symbol to occur; that would be in the "Father's Name" column.
Basically, you have numerous issues.
First, consider removing all the column definitions from the point of the syntax error onwards, e.g. using:
DROP TABLE IF EXISTS Employee;
CREATE TABLE IF NOT EXISTS Employee(Emp ID INT, Name Prefix TEXT,
First Name TEXT, Middle Initial TEXT, Last Name TEXT, Gender TEXT, E Mail TEXT, Father's Name TEXT, Mother'
/*s Name TEXT, Mother's Maiden Name TEXT,
Date of Birth TEXT, Time of Birth TEXT, Age in Yrs. TEXT, Weight in Kgs. TEXT, Date of Joining TEXT, Quarter of Joining TEXT, Half of Joining TEXT, Year of Joining TEXT,
Month of Joining TEXT, Month Name of Joining TEXT, Short Month TEXT, Day of Joining TEXT, DOW of Joining TEXT, Short DOW TEXT, Age in Company (Years) TEXT, Salary TEXT,
Last % Hike TEXT, SSN TEXT, Phone No. TEXT
*/
)
;
SELECT * FROM Employee;
The resultant table does not have the columns that you would probably have expected, such as Emp ID, Name Prefix, First Name, etc.
What is happening is that the text up to the first space is used as the column name; the subsequent text is then used as the column definition, which, due to the flexibility of SQLite, can be rather forgiving about the column type.
See How flexible/restrictive are SQLite column types?
Column names (and names in general) cannot contain an embedded space unless the name is suitably enclosed.
If you now enclose all the names, e.g. as per:
DROP TABLE IF EXISTS Employee;
CREATE TABLE IF NOT EXISTS Employee(`Emp ID` INT, `Name Prefix` TEXT,
`First Name` TEXT, `Middle Initial` TEXT, `Last Name` TEXT, `Gender` TEXT, `E Mail` TEXT, `Father's Name` TEXT, `Mother's Name` TEXT,
`Mother's Maiden Name` TEXT,
`Date of Birth` TEXT, `Time of Birth` TEXT, `Age in Yrs.` TEXT, `Weight in Kgs.` TEXT, `Date of Joining` TEXT, `Quarter of Joining` TEXT, `Half of Joining` TEXT, `Year of Joining` TEXT,
`Month of Joining` TEXT, `Month Name of Joining` TEXT, `Short Month` TEXT, `Day of Joining` TEXT, `DOW of Joining` TEXT, `Short DOW` TEXT, `Age in Company (Years)` TEXT, `Salary` TEXT,
`Last % Hike` TEXT, `SSN` TEXT, `Phone No.` TEXT
)
;
SELECT * FROM Employee;
The result then has the expected column names (only a subset of the columns was shown).
In short, because the column names include spaces, you need to enclose the names (identifiers) according to:
SQL As Understood By SQLite - SQLite Keywords
Of course, using such names/identifiers will probably only result in ongoing issues, and it is doubtful that many would recommend the use of such conventions.
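For completeness, here is a minimal sketch of building the CREATE TABLE statement with properly quoted identifiers (it assumes the first CSV row holds the column names; the quote_ident helper is illustrative, not part of the sqlite3 module):
import csv
import sqlite3

def quote_ident(name):
    # Wrap the identifier in double quotes (the SQL-standard style)
    # and escape any embedded double quotes by doubling them
    return '"' + name.replace('"', '""') + '"'

conn = sqlite3.connect('test.db')
c = conn.cursor()
with open('500000 Records.csv', mode='r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    header = next(csv_reader)  # first row = column names
    # First column as INT, the rest as TEXT, mirroring the original query
    cols = ', '.join(
        quote_ident(name) + (' INT' if i == 0 else ' TEXT')
        for i, name in enumerate(header)
    )
    c.execute('CREATE TABLE IF NOT EXISTS Employee({})'.format(cols))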

How to find last and first entry in cassandra (date is part of partition key)

Is it possible to find the first and last entry in a Cassandra database if my partition key contains a text date as part of the key (to avoid large partitions)?
CREATE TABLE trades (
stockexchange text,
symbol text,
ts timestamp,
date text,
tid text,
price decimal,
side text,
size decimal,
PRIMARY KEY ((stockexchange, symbol, date), ts, tid)
) WITH CLUSTERING ORDER BY (ts ASC, tid ASC)
One solution is to create a second table that stores only stockexchange, symbol, tid, and ts.
This gives you the ability to find the first and last timestamp by your key (stockexchange:symbol).
Please pay attention that you have to store the data in both tables at the same moment, and that Cassandra is not an ACID database.
CREATE TABLE trades (
stockexchange text,
symbol text,
ts timestamp,
date text,
tid text,
price decimal,
side text,
size decimal,
PRIMARY KEY ((stockexchange, symbol, date), ts, tid)
) WITH CLUSTERING ORDER BY (ts ASC, tid ASC)
CREATE TABLE trades_timestampts (
stockexchange text,
symbol text,
tid text,
ts timestamp,
PRIMARY KEY ((stockexchange, symbol), ts, tid)) WITH CLUSTERING ORDER BY (ts asc, tid asc);
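A sketch of how this might be used (values are made up): write both tables in one logged batch so they stay in step, then read the first and last entries from the small table:
BEGIN BATCH
INSERT INTO trades (stockexchange, symbol, ts, date, tid, price, side, size)
VALUES ('NASDAQ', 'AAPL', '2020-01-02 09:30:00', '2020-01-02', 't-1', 300.35, 'buy', 100);
INSERT INTO trades_timestampts (stockexchange, symbol, tid, ts)
VALUES ('NASDAQ', 'AAPL', 't-1', '2020-01-02 09:30:00');
APPLY BATCH;
-- First (oldest) entry for a key
SELECT ts, tid FROM trades_timestampts
WHERE stockexchange = 'NASDAQ' AND symbol = 'AAPL' LIMIT 1;
-- Last (newest) entry for a key
SELECT ts, tid FROM trades_timestampts
WHERE stockexchange = 'NASDAQ' AND symbol = 'AAPL' ORDER BY ts DESC LIMIT 1;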

Cassandra upsert not working

My cqlsh version is 3.3.1, and I am using the DataStax Cassandra Java driver for connecting. Following is my Cassandra schema:
url text PRIMARY KEY,
ae_rank int,
age int,
brands list<text>,
cities list<text>,
content text,
countries list<text>,
created timestamp,
customer_id text,
datasource list<text>,
diseases list<frozen<disease>>,
drugs list<text>,
gender int,
host text,
lang text,
meddracoding list<text>,
molecules list<text>,
owners list<text>,
page_rank int,
pj_terms list<frozen<pj_term>>,
projects list<text>,
quintilesims_id int,
reviewed int,
sentiment int,
social_tags list<text>,
switchovers list<frozen<switchover>>,
taxonomies list<frozen<taxonomy>>,
therapy_areas list<text>,
title text,
total_count bigint,
ts timestamp,
type text,
updated timestamp,
viralities map<text, bigint>
When an update (or insert) statement is issued with the same primary key value, the relevant row values do not get updated. Here we use a URL as the primary key value (e.g. http://example.com/example/1234test).
Is this an issue with the versions of Cassandra or the Java driver?
Please help get this resolved.

Am I using cassandra efficiently?

I have these tables:
CREATE TABLE user_info (
userId uuid PRIMARY KEY,
userName varchar,
fullName varchar,
sex varchar,
bizzCateg varchar,
userType varchar,
about text,
joined bigint,
contact text,
job set<text>,
blocked boolean,
emails set<text>,
websites set<text>,
professionTag set<text>,
location frozen<location>
);
create table publishMsg
(
rowKey uuid,
msgId timeuuid,
postedById uuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
esIndx boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
create table publishMsg_by_user
(
rowKey uuid,
msgId timeuuid,
title text,
time bigint,
details text,
tags set<text>,
location frozen<location>,
blocked boolean,
anonymous boolean,
hasPhotos boolean,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);
CREATE TABLE followers
(
rowKey UUID,
followedBy uuid,
time bigint,
PRIMARY KEY(rowKey, orderKey)
);
I am doing 3 INSERT statements in a BATCH to put data into the publishMsg, publishMsg_by_user, and followers tables.
To show a single message I have to execute three SELECT queries on different tables:
publishMsg - to get the published message details where rowKey & msgId are given.
user_info - to get fullName based on postedById.
followers - to know whether a postedById is following a given topic or not.
Is this a fit way of using Cassandra? Will it be efficient, given that the data for this scenario can't fit in a single table?
Sorry to ask this in an answer but I don't have the rep to comment.
Ignoring the tables for now, what information does your application need to ask for? Ideally in Cassandra, you will only have to execute one query on one table to get the data you need to return to the client. You shouldn't have to execute 3 queries to get what you want.
Also, your followers table appears to be missing the orderKey field that its primary key references.
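For example, a hypothetical sketch (the table and the postedByFullName column are mine, not from the question): the poster's name could be copied into the message table at write time, so a single partition read renders a message without the extra user_info lookup:
create table publishMsg_for_display
(
rowKey uuid,
msgId timeuuid,
postedById uuid,
postedByFullName varchar, -- copied from user_info when the message is written
title text,
details text,
tags set<text>,
location frozen<location>,
PRIMARY KEY(rowKey, msgId)
) with clustering order by (msgId desc);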
