database, table and column naming conventions [closed]

What naming conventions do you use for MySQL databases? I've downloaded a MySQL sample database.
Here it is:
CREATE DATABASE IF NOT EXISTS classicmodels DEFAULT CHARACTER SET latin1;
USE classicmodels ;
DROP TABLE IF EXISTS customers ;
CREATE TABLE customers (
    customerNumber int(11) NOT NULL,
    customerName varchar(50) NOT NULL,
    contactLastName varchar(50) NOT NULL,
    contactFirstName varchar(50) NOT NULL,
    phone varchar(50) NOT NULL,
    addressLine1 varchar(50) NOT NULL,
    addressLine2 varchar(50) DEFAULT NULL,
    city varchar(50) NOT NULL,
    state varchar(50) DEFAULT NULL,
    postalCode varchar(15) DEFAULT NULL,
    country varchar(50) NOT NULL,
    salesRepEmployeeNumber int(11) DEFAULT NULL,
    creditLimit double DEFAULT NULL,
    PRIMARY KEY (customerNumber)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Edit:
What I prefer:
CREATE DATABASE IF NOT EXISTS classic_models;
USE classic_models ;
DROP TABLE IF EXISTS customers ;
CREATE TABLE customers (
    customer_number int(11) NOT NULL,
    customer_name varchar(50) NOT NULL,
    -- or I define the column name this way:
    name varchar(50) NOT NULL, -- NOT customerName and NOT customer_name
    PRIMARY KEY (customer_number)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Am I right?
I recommend an article on SQL conventions by Faruk Ateş.
Do you have any advice for naming conventions here?

You're never right (or wrong) about naming conventions. As you go from employer to employer, you'll encounter different conventions at every workplace, and you'll always have to adapt. You've already stated what you prefer, so when working on your own projects, simply use that. The exception is when your system is built on top of something else that consistently uses another convention; then I'd say you'd be better off using that convention in that project. Consistency > Preference.

Please keep calm if you don't like this idea, but have you considered making column names independent of table names?
Semantics
The semantics would be:
"if two fields are called startDate and endDate, then they designate the dates that determine the period considered for the current table."
Those semantics usually cross tables, so having a consistent name is good.
Implementation concerns
In the database
Maybe some people would say that they still understand a common column name when it is prefixed by the table name. But we got bitten by this in several use cases and now prefer fixed names:
To use meta-queries efficiently (e.g. a query that reads all columns named startDate, or finds all tables that use some reference data), fixed names are much easier; see the sketch below.
Stored procedures and triggers can also be reused more easily if the names are fixed.
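As a minimal sketch of such a meta-query, using MySQL's information_schema (assuming the shared column name is startDate and the classic_models schema from the question):
-- find every table that has a column named startDate
SELECT table_name, column_name
FROM information_schema.columns
WHERE column_name = 'startDate'
  AND table_schema = 'classic_models';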
ORM
ORMs are really good at letting you define a field once, then create subclasses that inherit that field automatically (composition is also used). The database has no subclassing, but:
If the various tables mapped onto the classes use the same name for columns, everything is natural.
Otherwise, you have to hand-code (or declare) the fact that the startDate in the code is implemented in the database as XXXStartDate or XXX_start_date...
Requiring Aliases
Some queries are self-joins, joining the same table twice, and thus require table aliases in every case.
Most other hand-coded queries join several tables. This naming policy increases the likelihood that two columns share a name, which requires aliases. Is that a problem? I think it isn't (an example follows this list), because:
Using aliases is often recommended anyway, and considered good practice.
Using aliases in both cases allows for a more consistent coding environment.
Using aliases has a few advantages over table names:
a. It allows long and clear table names, including a prefix to group tables by "modules", since a shorter alias can be used in queries.
b. While a table name is fixed for all modules of all applications that access the database, applications or modules can use varying aliases in their queries, which lets them express more semantics (just like choosing a variable name in code, with the same rules).
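As an illustration (the project and task tables here are hypothetical), a join where both tables carry the shared startDate column and aliases keep the two apart:
SELECT p.startDate AS project_start,
       t.startDate AS task_start
FROM project AS p
JOIN task AS t ON t.project_id = p.id;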

Both naming conventions you've shown are completely acceptable. writingLikeThis is easier to type, but writing_like_this is easier to read. The most important thing is consistency: pick one naming convention and stick with it.

Related

How do you UPDATE a Cassandra column without directly knowing the primary key?

Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct: you need to know the primary key of the record to perform an update on the data, or to delete a specific record. There are several options here, depending on your data model:
Perform a full scan of the table using an effective token range scan (look at this answer for more details);
If this is needed very often, you can create a materialized view with User.email as the partition key, and fetch all the IDs that you need to update (you'll have to do this from your application; there is no nested query support in CQL). Be aware, though, that materialized views are an "experimental" feature in Cassandra and may not work all the time (they're more stable in DataStax Enterprise). Also, if some users have hundreds of thousands of emails, this may create big partitions.
Do the same as the 2nd item, but maintain the additional table from your own code.
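A sketch of the 2nd option (assuming the User table from the question, with columns id, email and name; the view name user_by_email is my choice):
CREATE MATERIALIZED VIEW user_by_email AS
    SELECT email, id, name FROM user
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);
-- first fetch the matching primary key(s)...
SELECT id FROM user_by_email WHERE email = 'a@example.com';
-- ...then update each returned id from the application (? is a bind marker)
UPDATE user SET name = 'New Name' WHERE id = ?;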
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.

Search for more than one element in a list in Cassandra

I'm learning how the data model works in Cassandra: what you can do, what you can't, etc.
I've seen that you can have collections, and I'm wondering whether you can search for elements inside a collection. I know you can look for one element with CONTAINS, but to look for more than one you need to add more filters. Is there a better way to do this? Is it bad practice?
This is my table definition:
CREATE TABLE data (
    group_id int,
    user timeuuid,
    friends LIST<VARCHAR>,
    PRIMARY KEY (group_id, user)
);
And this is what I know I can use to look for more than one item in the list:
SELECT * FROM data WHERE friends CONTAINS 'bob' AND friends CONTAINS 'Pete' ALLOW FILTERING;
Thank you
Secondary indexes are generally not recommended for performance reasons.
Generally, in Cassandra, query-based modelling should be followed.
So that would mean another table:
CREATE TABLE friend_group_relation (
    friend VARCHAR,
    group_id int,
    -- add a user column here if needed
    PRIMARY KEY ((friend), group_id)
);
Now you can use either an IN query (not recommended) or asynchronous queries (strongly recommended; very fast response) on this table.
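For example, the async approach boils down to one query per friend, intersected client-side (a sketch, using the values from the question):
SELECT group_id FROM friend_group_relation WHERE friend = 'bob';
SELECT group_id FROM friend_group_relation WHERE friend = 'Pete';
-- the groups containing both friends are the group_ids returned by both queries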
You can follow 2 different approaches:
Pure Cassandra: use a secondary index on your collection type, as described in the documentation (a sketch follows this list).
You may also be able to use Solr and create a query against Solr to retrieve your entries. This may look like a more complicated solution because it requires an extra tool, but it avoids secondary indexes in Cassandra. Secondary indexes in Cassandra are really expensive and, depending on your schema definition, may impact your performance.
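A sketch of the first approach, using the table from the question (the index name is my choice):
-- index the values of the collection column
CREATE INDEX friends_idx ON data (friends);
-- a single CONTAINS can now use the index;
-- combining several CONTAINS clauses still needs ALLOW FILTERING
SELECT * FROM data WHERE friends CONTAINS 'bob';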

Node.js - ORM with hierarchical data support

Like the title says, I am looking for an ORM that supports hierarchical data.
For example, I need to represent a relation like this (a category with subcategories, and so on):
CREATE TABLE "category"
(
"id" SERIAL PRIMARY KEY,
"parent" INTEGER NULL DEFAULT NULL REFERENCES "category" ("id")
"name" VARCHAR(50) NOT NULL UNIQUE,
"description" VARCHAR(100) NOT NULL,
"sort_order" INTEGER NULL DEFAULT NULL,
);
Is there an ORM that can do that?
You should check out sails.js. Their Waterline ORM supports dozens of databases, has excellent relational support, and has a huge community surrounding it.
From the docs:
You can do all the same things you might be used to (one-to-many, many-to-many), but you can also assign multiple named associations per-model (for instance, a cake might have two collections of people: "havers" and "eaters"). Better yet, you can assign different models to different databases, and your associations/joins will still work, even across NoSQL and relational boundaries. Sails has no problem implicitly/automatically joining a MySQL table with a Mongo collection and vice versa.
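Whichever ORM you pick, note that a self-referencing table like the one in the question can also be walked in plain SQL with a recursive CTE; a minimal PostgreSQL sketch (the root id 1 is arbitrary):
WITH RECURSIVE subtree AS (
    SELECT id, parent, name, 1 AS depth
    FROM category
    WHERE id = 1  -- root of the subtree
    UNION ALL
    SELECT c.id, c.parent, c.name, s.depth + 1
    FROM category c
    JOIN subtree s ON c.parent = s.id
)
SELECT id, parent, name, depth FROM subtree ORDER BY depth;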

Efficient modeling of versioned hierarchies in Cassandra

Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.
My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...
My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).
Each SuperDocument's structure can change over time (mostly renaming of headings), giving me multiple versions of the structure.
What I'm looking for:
For each SuperDocument I need to timestamp those structures by date, and, for a given date, I'd like to find the closest earlier version of the SuperDocument structure (i.e. the most recent version for which version_date < given_date).
These considerations might help solve the problem:
Versions are immutable: changes are rare enough that I can create a new representation of the whole structure each time it changes.
I do not need to access a subtree of the structure.
I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.
OK let's do it
Please keep in mind that I am really just starting to use Cassandra. I've read/watched a lot of resources about data modeling, but haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!
My first attempt at solving this was to create the following table:
CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    pre_pos int,
    post_pos int,
    title text,
    PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (pre_pos ASC);
I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.
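For illustration (values invented), a chapter containing two sections would be stored in one partition as three rows:
INSERT INTO superdoc_structures (doc_id, version_date, pre_pos, post_pos, title)
VALUES ('doc1', '2014-03-01', 1, 6, 'Chapter 1');
INSERT INTO superdoc_structures (doc_id, version_date, pre_pos, post_pos, title)
VALUES ('doc1', '2014-03-01', 2, 3, 'Section 1.1');
INSERT INTO superdoc_structures (doc_id, version_date, pre_pos, post_pos, title)
VALUES ('doc1', '2014-03-01', 4, 5, 'Section 1.2');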
I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
The problem though is that I (candidly) intended to query my data as follows:
SELECT * FROM superdoc_structures
WHERE doc_id = '3399c35...14e1' AND version_date < '2014-03-11' LIMIT 1;
Cassandra quickly reminded me I was not allowed to do that! (because the partitioner does not preserve row order on the cluster nodes, so it is not possible to scan through partition keys)
What then...?
Well, because Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date a clustering key and all my problems will be gone. Yeah, not really...
First try:
CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    pre_pos int,
    post_pos int,
    title text,
    PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)
) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC);
I find this one less elegant: all versions and structure levels become columns of a now very wide row (compared to my previous solution).
Problem: with the same query, LIMIT 1 will only return the first heading. And using no LIMIT would return the structure levels of all versions, which I would have to filter to keep only the most recent ones.
Second try:
There's no second try yet... I have an idea, though, but I feel it's not using Cassandra wisely.
The idea would be to cluster by version_date only, and somehow store whole hierarchies in the column values. Sounds bad, doesn't it?
I would do something like this:
CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    nested_sets map<int, int>,
    titles list<text>,
    PRIMARY KEY (doc_id, version_date)
) WITH CLUSTERING ORDER BY (version_date DESC);
Each row would then hold one whole version of the structure in its collection columns.
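For illustration, the same one-chapter, two-section example as above (values invented) would become a single row:
INSERT INTO superdoc_structures (doc_id, version_date, nested_sets, titles)
VALUES ('doc1', '2014-03-01',
        {1: 6, 2: 3, 4: 5},
        ['Chapter 1', 'Section 1.1', 'Section 1.2']);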
It looks kind of all right to me, in fact, but I will probably have more data than the level title to de-normalize into my columns. If it's only two attributes, I could go with another map (associating titles with IDs, for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all the lists together in my client app when the data comes in!
ALTERNATIVE & BEST GUESS
After giving it some more thought, there's a "hybrid" solution that might work and may be efficient and elegant:
I could use another table listing only the version dates of a SuperDocument, and cache those dates in a Memcache instance (or Redis, or whatever) for really quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.
That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?
All in all, I really feel the first solution is the most elegant one to model my data. What about you?!
First, you don't need to use memcache or redis. Cassandra will give you very fast access to that information. You could certainly have a table that was something like:
create table superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    /* stuff */
    primary key (doc_id, version_date)
) with clustering order by (version_date desc);
which would give you a quick way to access a given version (this query may look familiar ;-):
select * from superdoc_structures
where doc_id = '3399c35...14e1' and
      version_date < '2014-03-11'
order by version_date desc
limit 1;
Since nothing about the document tree structure seems to be relevant from the schema's point of view, and you are happy as a clam to create the document in its entirety every time there is a new version, I don't see why you'd even bother breaking out the tree into separate rows. Why not just have the entire document in the table as a text or blob field?
create table superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    contents text,
    primary key (doc_id, version_date)
) with clustering order by (version_date desc);
So to get the contents of the document as it existed at the new year, you'd do:
select contents from superdoc_structures
where doc_id = '....' and
      version_date < '2014-01-01'
order by version_date desc
limit 1;
Now, if you did want to maintain some kind of hierarchy of the document components, I'd recommend something like a closure table to represent it. Alternatively, since you are willing to copy the entire document on each write anyway, why not copy the entire section info on each write too, with a schema like:
create table superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    section_path varchar,
    contents text,
    primary key (doc_id, version_date, section_path)
) with clustering order by (version_date desc, section_path asc);
Then have section_path use a syntax like "first_level next_level sub_level leaf_name". As a side benefit, when you have the version_date of the document (or if you create a secondary index on section_path), because a space is lexically "lower" than any other valid character, you can actually grab a subsection very cleanly:
select section_path, contents from superdoc_structures
where doc_id = '....' and
      version_date = '2013-12-22' and
      section_path >= 'chapter4 subsection2' and
      section_path < 'chapter4 subsection2!';
Alternatively, you can store the sections using Cassandra's support for collections, but again... I'm not sure why you'd even bother breaking them out, when doing it as one big chunk works just great.

Using CF's as an extra key level? Is there a limit on tables/column families in Cassandra?

As usual, I don't know if this is a good idea, which is why I'm asking Stack Overflow!
I'm toying with the idea of using CF's as an extra layer of partitioning data. For example (and using the sensor example, which seems to be pretty common), a traditional schema would be something like:
CREATE TABLE data (
    area_id int,
    sensor varchar,
    date ascii,
    event_time timeuuid,
    some_property1 varchar,
    some_property2 varchar,
    some_property3 varchar,
    PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
This is a bit problematic if some_property1, 2, 3, etc. are not known at design time and can change over the life of the platform. One possibility is to just declare more properties as needed, but then I think it makes more sense to bring the sensors into their own CFs, as each will have a different schema. You could do this just by naming the CF something composite (managed outside Cassandra), e.g. {area_id}_{sensor_name}, and then altering the schema as needed when new properties are requested for insert.
My question is twofold: a) Is this a reasonable idea? b) Are there any limitations of Cassandra (such as a cap on the number of CFs) that this might fall foul of?
For reference, this is a possible design for a previous question, but I think this question is valid stand-alone.
Andy,
Adding an excessively large number of column families will create maintainability issues for you down the road. I'd advise against it.
Consider using CQL3 collections to address the unknown-property issue: they allow the objects in this column family to have a variable number of properties that may not be known at design time. You can use the map type to give each of your dynamic properties a strong name and a correlated value (we do this).
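A sketch of that idea applied to the question's table (the properties column name is my choice):
CREATE TABLE data (
    area_id int,
    sensor varchar,
    date ascii,
    event_time timeuuid,
    properties map<varchar, varchar>,  -- dynamic property name -> value
    PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);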
However, if you need wildly different data types for each property, or if you need more than 10-15 properties per sensor, then CQL3 collections might not be the right tool for the job. You can technically store up to 65,000 objects in a CQL3 collection, but in truth they should never approach that size. CQL3 collections aren't indexed, and working with really large CQL3 collections incurs performance penalties.
