Cassandra table modeling for EDI data

I am trying to create tables to insert EDI data into a Cassandra database. As you may know, there are more than 300 fields in an EDI record, so what is the best way to create the tables? I know we will not need all 300+ fields, but we will definitely use about 100-120 of them, and the rest must be saved in case we need them in the future. Out of those 120+ fields we might use 12-15 for search purposes. Let us assume these are some of the fields, and that there are more of them, close to 300+ in total:
Shipment ID Number Date Reference Number Equipment Number
Shipper Name Shipper ID Shipper Addr1 Shipper Addr2 Shipper City Shipper State Shipper Zip
Consignee Name Consignee ID Consignee Addr Consignee Addr2 Consignee City Consignee State Consignee Zip Contact Name Phone
Item No. Authorization No. Lading Desc. Lading Qty Packaging Code Weight (Pounds) Commodity Code
SSCC Application ID
Purchase Order Reference ID Unit Quantity,
A B C D ….. AA AB AC ….. ZZ
Now, out of these, say we need:
ID Number, Reference Number, Equipment Number, A, AD, GG, … up to 15 as search fields.
What is the best way to model tables in this scenario?
Case 1) Keep all the fields in one table.
This sounds good, but the table is going to be huge, and search needs too many secondary indexes. Please let me know if this is a good approach or if my thinking is wrong.
Case 2) Split the table into 6 or 7 tables, separating the frequently needed fields from the fields kept just for information's sake.
Say Table one: Shipment ID Number Date Reference Number Equipment Number Consignee Name Consignee ID Consignee Addr Consignee Addr2 Consignee City Consignee State Consignee Zip Contact Name Phone A D FA … etc., say up to 40 fields.
Assume Table one holds the most frequently used fields.
Say Table two:
Item No. Authorization No. Lading Desc. Lading Qty Packaging Code Weight (Pounds) Commodity Code … and say up to 200 fields or so …
Assume there are up to six or seven tables in total; a rough CQL sketch of this kind of split follows below.
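Just to make Case 2 concrete, here is a minimal sketch of what such a split could look like in CQL; the column names are abbreviated and the types are assumptions on my part, so it only illustrates the shape of the split rather than a recommendation:
create table shipment_core (
    shipment_id text,
    ship_date timestamp,
    reference_number text,
    equipment_number text,
    consignee_name text,
    consignee_id text,
    -- ... the rest of the ~40 frequently used fields ...
    primary key (shipment_id)
);
create table shipment_info (
    shipment_id text,
    item_no text,
    authorization_no text,
    lading_desc text,
    lading_qty int,
    packaging_code text,
    weight_pounds decimal,
    commodity_code text,
    -- ... the remaining, rarely used fields ...
    primary key (shipment_id)
);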
Questions regarding Case 2)
1) How do I achieve joins in case I need them?
2) Most of the time I read suggestions to de-normalize the data and repeat the fields in both tables.
2a) If so, how do I insert the data into both tables? Do I write my code in such a way, or do I use some other tool like Spark?
2b) What is the best way to achieve joins in these scenarios?
2c) Also, what is the best way to still support the search scenarios with minimal secondary indexes?
I know the question is vague and needs a lot of assumptions, but I would still appreciate an answer. I am looking more for the direction my thinking should take. I have gone through many notes but did not find any solutions; I only keep reading suggestions, mostly telling me to use de-normalized data, but not how to achieve it.
Please make as many assumptions as needed; I will try to follow, and I am open to using Spark alongside Cassandra wherever required. Please explain as clearly as possible, with an example if possible.
I appreciate the effort you will be taking in answering these questions.
Thanks
Tom

Related

Cassandra: filtering based on one specific value in a set

I have a table in Cassandra and one of the columns is
customer_favourites; each value is of type set and holds the customer's favourite foods. For example, one customer could have {'Mexican', 'Italian', 'Indian'}, another customer could have {'Mexican', 'French'}, and another could have {'Mexican'}.
I have the following code:
SELECT customer_id, customer_fname, customer_lname FROM customers WHERE customer_favourites CONTAINS 'Mexican' ALLOW FILTERING;
I want it to filter on those customers whose favourite food is ONLY Mexican, but right now it's returning the details of every customer who has Mexican as one of their favourite foods. How do I filter my query to return customers who like ONLY Mexican food?
Naive approach: you need to use customer_favourites = {'Mexican'}...
Better approach - create a secondary index on the corresponding field using the FULL keyword (the column must be frozen), and then use customer_favourites = {'Mexican'}.
Best approach - create a separate table with customer_favourites as the partition key (the column should be frozen), and search for users in it. One of the problems with this approach is data skew, as the number of distinct favourite-food combinations is relatively small and quite imbalanced.
Alternative approach - reconsider the use of Cassandra if you need to search by a non-partition key very often.
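A minimal sketch of the second and third approaches, assuming the column is (or can be re-declared as) frozen<set<text>>; the column names are taken from the question and the types are guesses:
-- Better approach: index the full (frozen) collection and match on the whole set
CREATE INDEX customers_fav_full_idx ON customers (FULL(customer_favourites));
SELECT customer_id, customer_fname, customer_lname
FROM customers
WHERE customer_favourites = {'Mexican'};
-- Best approach: a dedicated lookup table keyed by the frozen set
CREATE TABLE customers_by_favourites (
    customer_favourites frozen<set<text>>,
    customer_id uuid,
    customer_fname text,
    customer_lname text,
    PRIMARY KEY (customer_favourites, customer_id)
);
SELECT customer_id, customer_fname, customer_lname
FROM customers_by_favourites
WHERE customer_favourites = {'Mexican'};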

Static column in Cassandra

Can someone explain in simple terms what a static column in Cassandra is, and what it is used for?
I came across this link Static column, but wasn't able to understand it much.
A static column is a way to associate data with the whole partition, so it is shared between all rows inside that partition. There are legitimate cases when all rows need to have the same data, and when that data is updated we don't want to update every row.
One example that comes to mind is e-commerce. For example, you're selling something, and you're selling it in different countries with different currencies and different prices. But some things are common between them, like the description, sizes, etc. In this case we can model it as follows:
create table articles (
sku text,
description text static,
country text,
currency text,
price decimal,
primary key (sku, country)
);
In this case, when you do select * from articles where sku = ... and country = ... you get the description anyway, and you can update the description with just update articles set description = '...' where sku = ...; the next select will pull the updated description.
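For concreteness, here is how that plays out against the table above (the values are made up for illustration):
INSERT INTO articles (sku, country, currency, price, description)
VALUES ('SKU-1', 'DE', 'EUR', 19.99, 'Cotton T-shirt');
INSERT INTO articles (sku, country, currency, price)
VALUES ('SKU-1', 'US', 'USD', 24.99);
-- both rows return the same description, stored once per partition
SELECT sku, country, price, description FROM articles WHERE sku = 'SKU-1';
-- updating the static column touches the partition once, not every row
UPDATE articles SET description = 'Organic cotton T-shirt' WHERE sku = 'SKU-1';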
Also, static columns may exist in a partition without any rows. One use case I've seen is collecting aggregated information: the detailed data were stored as individual rows with a TTL, and a job aggregated the data into the static column, so when the rows expired the partition still remained, holding only the aggregated data.
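A minimal sketch of that aggregation pattern; the table and column names here are illustrative, not from the original answer:
create table sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    daily_total double static,   -- survives after the detail rows expire
    primary key (sensor_id, reading_time)
);
-- detail rows expire after a day
insert into sensor_readings (sensor_id, reading_time, value)
values ('s1', '2020-01-01 10:00:00', 4.2) using ttl 86400;
-- a periodic job keeps the per-partition aggregate in the static column
update sensor_readings set daily_total = 123.4 where sensor_id = 's1';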

Concatenate all values for a column as one string for each row in select

I have been looking into this issue for some time and need some help. I have looked into sub-queries and CTEs with no luck.
I want to create a select of all of the data in Table A. I also want to add some columns from Table B, so an inner join will work initially.
However in Table B I want to create a string that is concatenated from all the different categories for each person and list this as a column in my select.
For example, one person might be in IT and also Construction categories so I would like a string that says "Construction, IT" listed for that person and the same basis for all others.
I have looked into COALESCE,
which gets me part of the way there. Any advice would be much appreciated.
Table A
Name
Surname
Person Ref
Table B
Person Ref
Person Category
Person Gender
These are example tables; I can't post the ones I'm actually looking at for security reasons, but hopefully you get my point.
Martin

How to optimize Cassandra model while still supporting querying by contents of lists

I just switched from Oracle to Cassandra 2.0 with the Datastax driver and I'm having difficulty structuring my model for this big data approach. I have a Persons table with a UUID and serialized Persons. These Persons have lists of addresses, names, identifications, and DOBs. For each of these lists I have an additional table with a compound key on each value in the respective list plus an additional person_UUID column. This model feels too relational to me, but I don't know how else to structure it so that I have an index on (am able to search by) address, name, identification, and DOB. If Cassandra supported indexes on lists, I would have just the one Persons table containing indexed lists for each of these.
In my application we receive transactions, which can contain within them 0 or more of each of those address, name, identification, and DOB. The persons are scored based on which person matched which criteria. A single person with the highest score is matched to a transaction. Any additional address, name, identification, and DOB data from the transaction that was matched is then added to that person.
The problem I'm having is that this matching is taking too long and the processing is falling far behind. This is caused by having to loop through result sets performing additional queries, since I can't make complex queries in Cassandra, and I don't have sufficient memory to just do a huge select-all and filter in Java. For instance, I would like to select all Persons having at least two names in common with the transaction (names can have their order scrambled, so there is no first, middle, last; that would just be three names), but this would require a 'group by', which Cassandra does not support, and if I just selected all persons having any of the names in common in order to filter in Java, the result set is too large and I run out of memory.
I'm currently searching by only Identifications and Addresses, which yield a smaller result set (although it could still be hundreds) and for each one in this result set I query to see if it also matches on names and/or DOB. Besides still being slow this does not meet the project's requirements as a match on Name and DOB alone would be sufficient to link a transaction to a person if no higher score is found.
I know in Cassandra you should model your tables by the queries you do, not by the relationships of the entities, but I don't know how to apply this while maintaining the ability to query individually by address, name, identification, and DOB.
Any help or advice would be greatly appreciated. I'm very impressed by Cassandra but I haven't quite figured out how to make it work for me.
Tables:
Persons
[UUID | serialized_Person]
addresses
[address | person_UUID]
names
[name | person_UUID]
identifications
[identification | person_UUID]
DOBs
[DOB | person_UUID]
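For reference, a possible CQL rendering of two of these lookup tables; the original post gives no DDL, so the types here are my assumptions:
CREATE TABLE addresses (
    address text,
    person_uuid uuid,
    PRIMARY KEY (address, person_uuid)
);
CREATE TABLE names (
    name text,
    person_uuid uuid,
    PRIMARY KEY (name, person_uuid)
);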
I did a lot more reading, and I'm now thinking I should change these tables around to the following:
Persons
[UUID | serialized_Person]
addresses
[address | Set of person_UUID]
names
[name | Set of person_UUID]
identifications
[identification | Set of person_UUID]
DOBs
[DOB | Set of person_UUID]
But I'm afraid of going beyond the maximum size of a set (65,536 UUIDs) for some names and DOBs. Instead I think I'll have to use a dynamic column family with the column names as the Person_UUIDs, or is a row with over 65k columns very problematic as well? Thoughts?
It looks like you can't have these dynamic column families in the new version of Cassandra; you have to alter the table to insert the new column with a specific name. I don't know how to store more than 64k values for a row then. With a perfect distribution I will run out of space for DOBs at 23 million persons, and I'm expecting to have over 200 million persons. Maybe I just have to have multiple set columns?
DOBs
[DOB | Set of person_UUID_A | Set of person_UUID_B | Set of person_UUID_C]
and I just check size and alter table if size = 64k? Anything better I can do?
I guess it's just CQL3 that enforces this and that if I really wanted I can still do dynamic columns with the Cassandra 2.0?
Ugh, this page from Datastax doc seems to say I had it right the first way...:
When to use a collection
This answer is not very specific, but I'll come back and add to it when I get a chance.
First thing - don't serialize your Persons into a single column. This complicates searching and updating any person info. OTOH, there are people who know what they're talking about who disagree with this view. ;)
Next, don't normalize your data. Disk space is cheap, so don't be afraid to write the same data to two places. Your code will need to make sure that the right thing is done.
Those items feed into this: If you want queries to be fast, consider what you need to make that query fast. That is, create a table just for that query. That may mean writing data to multiple tables for multiple queries. Pick a query, and build a table that holds exactly what you need for that query, indexed on whatever you have available for the lookup, such as an id.
So, if you need to query by address, build a table (really, a column family) indexed on address. If you need to support another query based on identification, index on that. Each table may contain duplicate data, which means that when you add a new person you may be writing the same data to more than one table. This may seem unnatural if relational databases are the only kind you've ever used, but you get benefits in return - namely, horizontal scalability, thanks to the CAP theorem.
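A minimal sketch of that per-query layout and the corresponding multi-table write; the table names, types, and values are assumed for illustration and are not from the original post:
CREATE TABLE persons_by_address (
    address text,
    person_uuid uuid,
    PRIMARY KEY (address, person_uuid)
);
CREATE TABLE persons_by_identification (
    identification text,
    person_uuid uuid,
    PRIMARY KEY (identification, person_uuid)
);
-- when new data arrives for a person, write to every table that serves a query,
-- e.g. in a logged batch so the writes eventually all apply or none do
BEGIN BATCH
    INSERT INTO persons_by_address (address, person_uuid)
    VALUES ('1 Main St', 123e4567-e89b-12d3-a456-426614174000);
    INSERT INTO persons_by_identification (identification, person_uuid)
    VALUES ('ID-12345', 123e4567-e89b-12d3-a456-426614174000);
APPLY BATCH;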
Edit:
The two column families in that last example could just hold identifiers into another table. So, voilà, you have made an index. OTOH, that means each query takes two reads; but it will still be a performance improvement in many cases.
Edit:
Attempting to explain the previous edit:
Say you have a users table/column family:
CREATE TABLE users (
id uuid PRIMARY KEY,
display_name text,
avatar text
);
And you want to find a user's avatar given a display name (a contrived example). Searching users will be slow. So, you could create a table/CF that serves as an index, let's call it users_by_name:
CREATE TABLE users_by_name (
display_name text PRIMARY KEY,
user_id uuid
);
The search on display_name is now done against users_by_name, and that gives you the user_id, which you use to issue a second query against users. In this case, user_id in users_by_name has the value of the primary key id in users. Both queries are fast.
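Spelled out, the two lookups might look like this (the display name and the returned id are example values):
SELECT user_id FROM users_by_name WHERE display_name = 'jdoe';
-- then, using the user_id returned above:
SELECT avatar FROM users WHERE id = 123e4567-e89b-12d3-a456-426614174000;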
Or, you could put avatar in users_by_name, and accomplish the same thing with one query by using more disk space.
CREATE TABLE users_by_name (
display_name text PRIMARY KEY,
avatar text
);
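With this denormalized variant the lookup becomes a single query (the display name is again an example value):
SELECT avatar FROM users_by_name WHERE display_name = 'jdoe';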

Azure Tables - Partition Key and Row Key - Correct Choice

I am new to Azure Tables and have read a lot of articles, but would like some reassurance on the above, given it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I get stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the RowKey is that every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge, so I need to think about query performance from the outset.
Updated:
As requested, the solution is for vehicle telematics, so think of yourself in your own car: a black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a Customer table and a Trip table with a foreign key back to the Customer table.
The TripId is auto-generated by the black box. TripId does not need to be stored ordered by date/time from a query point of view; however, it may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for a customer: filter by customer and then by trip, then iterate over each row (journey step) to build the map.
Per customer, I will score each trip and then retrieve the trips for, let's say, the last month to aggregate a score. I do have a SQL database to enrich the data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table, so if someone made 10 trips in one month, I would run the second query, which would score each trip, then produce a score for all trips that month, and store both answers: potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the PartitionKey is that it represents a logical grouping; you cannot, for example, batch-insert data spanning multiple partition keys. Similarly, rows with the same partition key are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key.
You will need to have, separately, a table that lists all the Trip IDs that belong to a given customer - which sort of makes sense, as you probably want to store some data, such as "trip name" etc., in such a table anyway.
Your design has to be driven by your queries. You can filter your data based on two columns, PartitionKey and RowKey; PartitionKey is your most important column, since your queries will hit that column first.
In your case CustomerId should be your PartitionKey, since most of the time you will try to reach your data based on the customer. (You may also need to keep another table for your client list.)
Now, the RowKey can be your TripId or the time. If I were you, I would probably use a RowKey in the yyyyMMddHHmm|tripId format, which lets you run "starts with"-style range queries on the key.
Adding to @Frans' answer:
One thing you could do is create a separate table for each customer, so each customer's data is neatly segregated into its own table. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing metadata about the trip, instead of going into a separate table, I would still use the same table but keep the RowKey empty and put the other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())
Such combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get the last X journey steps when searching by either CustomerId or CustomerId/TripId, by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of your table storage should optimize for two major capabilities of Azure Tables:
Scalability
Search performance
As the user @Frans already pointed out, Azure Tables uses the partition key to decide how to scale your data out across multiple storage server nodes. Because of this, I would advise against having unique partition keys, since, in theory, you would have Azure spreading your data across storage nodes that each serve only one customer. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your partition keys and group them (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage becomes unpredictable and sits at the hands of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both PartitionKey and RowKey in your search queries. Unlike Amazon DynamoDB, Azure Tables does not support secondary column indexes, so if your search queries look for attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?
