Table 1
I have a column which consists of Primary and Secondary groups, as in the example above. I would like to fold all Secondary groups into the Primary groups, so that the final column consists only of Primary groups. However, I need to split some of the Secondary groups' values by percentage among the Primary groups. For example: Secondary D is split 50/50 between Primary A and B, Secondary E is split in thirds among Primary A, B and C, and all of Secondary F goes into Primary C. The end result will look something like the table below.
Table 2
Could someone give me some pointers on how I can achieve this?
Thanks.
DL
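One way to think about the split described above is as an allocation table that maps each Secondary group to the share each Primary group receives. The following is only a minimal sketch in Python: the numeric values are hypothetical (the original tables are not reproduced here), and the names `values`, `allocation` and `roll_up` are made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical values per group; not taken from the original tables.
values = {"A": 100, "B": 200, "C": 300, "D": 40, "E": 90, "F": 10}

# Share of each Secondary group that goes to each Primary group,
# following the split described in the question.
allocation = {
    "D": {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    "E": {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)},
    "F": {"C": Fraction(1)},
}


def roll_up(values, allocation):
    """Fold Secondary group values into Primary groups by share."""
    totals = defaultdict(Fraction)
    for group, value in values.items():
        # Groups without an allocation entry are Primary and keep their value.
        shares = allocation.get(group, {group: Fraction(1)})
        for primary, share in shares.items():
            totals[primary] += value * share
    return dict(totals)


result = roll_up(values, allocation)
```

Exact fractions are used so that the thirds add up without rounding drift. The same idea carries over to SQL as a join against an allocation table followed by a grouped sum.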
I'm currently trying to optimize a join between 2 rather large tables, which are characterized like this:
Table 1: id column - alphanumerical, about 300mil unique ids, more than 1bil rows overall
Table 2: id column - identical semantics, about 200mil unique ids, more than 1bil rows overall
Let's say that on a given day, 17.03., I want to join those two tables on id.
Table 1 is the left side, table 2 is the right side, and I get about 90% matches, meaning about 90% of the ids in table 1 are also present in table 2.
One week later, table 1 has not changed (it could, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I do the join again and some of the formerly missing ids have now appeared, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and these may change from day to day.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table 1 but not in table 2. ABC123 gets sorted into a certain bucket, e.g. "1".
2nd join (a week later): id "ABC123" has now appeared in table 2; how can it be ensured that it lands in the bucket of table 2 that is co-located with the matching bucket of table 1?
Or am I misunderstanding how bucketing works in general?
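A sketch that may help with the "ABC123" example, assuming the bucketing feature in question assigns buckets by hashing the join column modulo a fixed bucket count (the usual approach for hash bucketing; the engine is not named here, so the hash function and the bucket count of 16 below are stand-ins). Under that assumption the bucket depends only on the id, not on which table the row is in or when it arrived:

```python
import zlib

NUM_BUCKETS = 16  # assumed; both tables would need the same bucket count


def bucket_for(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a bucket from the id alone, using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets


# Week 1: the id exists only in table 1 and is bucketed there.
table1_bucket = bucket_for("ABC123")

# Week 2: the same id arrives in table 2 and is bucketed independently.
table2_bucket = bucket_for("ABC123")

# Same id, same hash, same bucket count: the buckets line up.
same_bucket = table1_bucket == table2_bucket
```

If this assumption holds for your engine, nothing has to be "remembered" from the first join; the rows line up as long as both tables are bucketed on the same column with the same number of buckets.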
When searching by only one column of a composite primary key, Spanner behaves differently depending on which column is used. For example, suppose a table has ColA and ColB as its primary key (declared in that order). If you search by the first key column, select * from table where ColA = 'dfdf', it scans a few rows and returns the result quickly (~10ms). But if you search by the second key column, select * from table where ColB = 'dfdf', it does a full table scan. Why this inconsistency? If we are not searching by the full key, I would expect both queries to behave the same way, either a full table scan or a scan of particular rows. Primary keys are indexed, so it should never fall back to a full table scan.
A composite key is not 2 separate keys, but one single key concatenated from 2 parts...
Imagine a list of words, alphabetically sorted... Finding words whose first letter is H is easy...
But how would you find all words whose second letter is 'H'...
The only way is to do a complete scan of all the words -- unless there is a second index of words by their second letter, which is what secondary indexes are for ..
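The word-list analogy can be sketched in a few lines of Python. This is only an illustration of the idea with made-up rows, not how Spanner is implemented: rows are kept sorted by the composite key (ColA, ColB), so a lookup on the first part can use binary search, while a lookup on the second part has to check every row.

```python
from bisect import bisect_left

# Rows kept sorted by the composite key (ColA, ColB), like a primary key index.
rows = sorted([
    ("apple", "x"), ("hat", "a"), ("hat", "h"), ("hill", "b"), ("zoo", "h"),
])


def find_by_first(rows, col_a):
    """Binary search to the first match, then read the contiguous matches."""
    start = bisect_left(rows, (col_a,))
    matches = []
    for row in rows[start:]:
        if row[0] != col_a:
            break  # sorted order: no further matches are possible
        matches.append(row)
    return matches


def find_by_second(rows, col_b):
    """The sort order says nothing about ColB alone, so every row is checked."""
    return [row for row in rows if row[1] == col_b]
```

Here `find_by_first(rows, "hat")` touches only the neighbouring "hat" rows, while `find_by_second(rows, "h")` walks the whole list to find matches that are scattered across it.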
For example, if my primary key is a and clustering columns are b and c.
Can I only use the following in where condition?
select * from table where a = 1 and b = 2 and c = 3
Or are there any other queries that I can use?
I want to use
select * from table where a=1
and
select * from table where a = 1 and b = 2 and c = 3 and d = 4
Is that possible?
If not, then how can I model my data to make this possible?
Cassandra has lots of advantages, but it does not fit every need.
Cassandra is a good choice when you need to handle a large number of writes. People like it because Cassandra is easily scalable, can handle huge datasets and is highly fault tolerant.
You need to keep in mind that with Cassandra (if you really want to utilize it) the basic rule is to model your data to fit your queries. Don't model around relations. Don't model around objects. Model around your queries. This way you can minimize partition reads.
And of course you are not limited to querying by the partition key and clustering columns. You can:
add a secondary index to some columns, or
use the ALLOW FILTERING keyword
but of course, these are not as effective as having a well-modeled table.
For example, if my primary key is a and clustering columns are b and c.
So this translates into a definition of: PRIMARY KEY ((a),b,c). Based on that...
are there any other queries that I can use?
Yes. Some important points to understand about PRIMARY KEY columns in a query's WHERE clause:
They must be specified in the order of the key definition.
A key in the middle cannot be skipped.
Trailing keys can be omitted, as long as all the keys before them are specified.
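These rules can be written down as a small check. The sketch below is a deliberately simplified model in Python, not Cassandra's actual validation logic: it assumes equality restrictions only, no secondary indexes and no ALLOW FILTERING, and the function name `is_valid_restriction` is made up for illustration.

```python
def is_valid_restriction(partition_key, clustering_columns, where_columns):
    """Simplified model: equality restrictions only, no indexes, no ALLOW FILTERING."""
    where = set(where_columns)

    # Every restricted column must be part of the primary key.
    if not where <= set(partition_key) | set(clustering_columns):
        return False

    # The partition key must be fully specified.
    if not set(partition_key) <= where:
        return False

    # Clustering columns must form a prefix: once one is left out,
    # no later clustering column may be restricted.
    seen_gap = False
    for column in clustering_columns:
        if column in where:
            if seen_gap:
                return False
        else:
            seen_gap = True
    return True
```

For PRIMARY KEY ((a),b,c), this accepts restrictions on a, on a and b, and on a, b and c, and rejects restrictions on a and c (b skipped), on b alone (partition key missing) and anything involving d.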
select * from table where a=1
Yes, this query will work. That's because you're still querying by your partition key (a).
select * from table where a = 1 and b = 2 and c = 3 and d = 4
However, this will not work. That is because d is not (based on my understanding of your first statement) a part of your PRIMARY KEY definition.
If not, then how can I model my data to make this possible?
As Andrea mentioned, you should build your table according to the queries it needs to support. So if you need to query by a, b, c, and d, you'll need to make d a part of your PRIMARY KEY.
In the Cassandra Wiki, it is said that there is a limit of 2 billion cells (rows x columns) per partition. But it is unclear to me what a partition is.
Do we have one partition per node per column family, which would mean that the maximum size of a column family would be 2 billion cells * the number of nodes in the cluster?
Or will Cassandra create as many partitions as required to store all the data of a column family?
I am starting a new project so I will use Cassandra 2.0.
With the advent of CQL3 the terminology has changed slightly from the old thrift terms.
Basically
Create Table foo (a int , b int, c int, d int, PRIMARY KEY ((a,b),c))
Will make a CQL3 table. The information in a and b is used to make the partition key, which determines which node the information will reside on. This is the 'partition' referred to in the 2 billion cell limit.
Within that partition the information will be organized by c, known as the clustering key. Together, a, b and c define a unique value of d. In this case each distinct value of c contributes one cell (holding d) to the partition, so for any given pair of a and b there can be at most 2 billion combinations of c and d.
So as you model your data you want to ensure that the partition key will vary so that your data will be randomly distributed across Cassandra. Then use clustering keys to ensure that your data is available in the way you want it.
Watch this video for more info on data modeling in Cassandra
The Datamodel is Dead, Long live the datamodel
Edit: One more example from the comments
Create Table foo (a int , b int, c int, d int, e int, f int, PRIMARY KEY ((a,b),c,d))
Partitions will be uniquely identified by a combination of a and b.
Within a partition c and d will be used to order cells within the partition so the layout will look a little like:
(a1,b1) --> [c1,d1 : e1], [c1,d1 :f1], [c1,d2 : e2] ....
So in this example you can have 2 Billion cells with each cell containing:
A value of c
A value of d
A value of either e or f
So the 2 billion limit refers to the sum of unique tuples of (c,d,e) and (c,d,f).
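To make the counting concrete, here is a small Python sketch of the same bookkeeping. It follows the simplified cell model described above (one cell per populated non-key column per (c, d) row) rather than Cassandra's storage format, and the rows and the helper name `cells_in_partition` are made up for illustration.

```python
CELL_LIMIT = 2_000_000_000  # per-partition limit quoted above


def cells_in_partition(rows, value_columns=("e", "f")):
    """Count one cell per populated non-key column in each (c, d) row."""
    return sum(
        1
        for row in rows
        for column in value_columns
        if row.get(column) is not None
    )


# One partition (a1, b1), matching the layout sketched above:
# (c1, d1) has both e and f, (c1, d2) has only e.
partition = [
    {"c": 1, "d": 1, "e": 10, "f": 20},
    {"c": 1, "d": 2, "e": 30, "f": None},
]

cell_count = cells_in_partition(partition)

# If every row populated both e and f, the limit would allow this many rows.
max_full_rows = CELL_LIMIT // 2
```

Under this model the example partition holds three cells, and a partition whose rows all populate both e and f would reach the limit at one billion rows.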
From : http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html
Using a composite partition key
A composite partition key is a partition key consisting of multiple columns. You use an extra set of parentheses to enclose columns that make up the composite partition key. The columns within the primary key definition but outside the nested parentheses are clustering columns. These columns form logical sets inside a partition to facilitate retrieval.
CREATE TABLE Cats (
block_id uuid,
breed text,
color text,
short_hair boolean,
PRIMARY KEY ((block_id, breed), color, short_hair)
);
For example, the composite partition key consists of block_id and breed. The clustering columns, color and short_hair, determine the clustering order of the data. Generally, Cassandra will store columns having the same block_id but a different breed on different nodes, and columns having the same block_id and breed on the same node.
Implication
==> Partition is the smallest unit of replication (which on its own makes sh** no sense. :) )
==> Every combination of block_id and breed is a Partition.
==> On any given machine in cluster, either all or none of the rows with same partition-key will exist.
I'm undecided whether it's better, performance-wise, to use a very commonly shared column value (like Country) as partition key for a compound primary key or a rather unique column value (like Last_Name).
Looking at Cassandra 1.2's documentation about indexes I get this:
"When to use an index:
Cassandra's built-in indexes are best on a table
having many rows that contain the indexed value. The more unique
values that exist in a particular column, the more overhead you will
have, on average, to query and maintain the index. For example,
suppose you had a user table with a billion users and wanted to look
up users by the state they lived in. Many users will share the same
column value for state (such as CA, NY, TX, etc.). This would be a
good candidate for an index."
"When not to use an index:
Do not use an index to query a huge volume of records for a small
number of results. For example, if you create an index on a column
that has many distinct values, a query between the fields will incur
many seeks for very few results. In the table with a billion users,
looking up users by their email address (a value that is typically
unique for each user) instead of by their state, is likely to be very
inefficient. It would probably be more efficient to manually maintain
the table as a form of an index instead of using the Cassandra
built-in index. For columns containing unique data, it is sometimes
fine performance-wise to use an index for convenience, as long as the
query volume to the table having an indexed column is moderate and not
under constant load."
Looking at the examples from CQL's SELECT for "Querying compound primary keys and sorting results", I see something like a UUID being used as partition key... which would indicate that it's preferable to use something rather unique?
The indexing described in the documentation you quoted refers to secondary indexes. In Cassandra there is a difference between the primary and secondary indexes. For a secondary index it would indeed be bad to have very unique values; for the components of a primary key, however, this depends on which component we are focusing on. In the primary key we have these components:
PRIMARY KEY(partitioning key, clustering key_1 ... clustering key_n)
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible. That is why the example you have uses UUIDs.
The clustering key is used for ordering so that querying columns with a particular clustering key can be more efficient. That is where you want your values to not be unique and where there would be a performance hit if unique rows were frequent.
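The effect of partition key cardinality on balance can be illustrated with a toy model. The sketch below is in Python and makes loud simplifications: it maps a key to a node with a hash modulo a hypothetical node count of 8, whereas a real cluster uses a partitioner over token ranges plus replication. The key values and helper names are made up.

```python
import hashlib
from collections import Counter

NUM_NODES = 8  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Toy placement: stable hash of the partition key, modulo the node count."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


def rows_per_node(partition_keys):
    """Count how many rows each node would receive."""
    return Counter(node_for(key) for key in partition_keys)


countries = ["US", "DE", "IT"]

# 1000 rows partitioned by a commonly shared value (country)...
by_country = rows_per_node(countries[i % 3] for i in range(1000))

# ...versus 1000 rows partitioned by a nearly unique value (user id).
by_user_id = rows_per_node(f"user-{i}" for i in range(1000))
```

With only three distinct countries, at most three nodes receive any rows and each of them holds hundreds, while the unique ids spread over all eight nodes in roughly equal shares.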
The cql docs have a good explanation of what is going on.
if you use cql3, given a column family:
CREATE TABLE table1 (
a1 text,
a2 text,
b1 text,
b2 text,
c1 text,
c2 text,
PRIMARY KEY ( (a1, a2), b1, b2)
);
by defining a
primary key ( (a1, a2, ...), b1, b2, ... )
This implies that:
a1, a2, ... are fields used to craft a row key in order to:
determine how the data is partitioned
determine what is physically stored in a single row
referred as row key or partition key
b1, b2, ... are column family fields used to cluster a row key in order to:
create logical sets inside a single row
allow more flexible search schemes such as range queries
referred as column key or cluster key
All the remaining fields are effectively multiplexed / duplicated for every possible combination of column keys. Below is an example of how composite keys with partition keys and clustering keys work.
If you want to use range queries, you can use secondary indexes or (starting from cql3) you can declare those fields as clustering keys. In terms of speed, having them as clustering keys will create a single wide row. This has an impact on speed, since you will fetch multiple clustering key values with a query such as:
select * from accounts where Country>'Italy' and Country<'Spain'
I am sure you would have got the answer but still this can help you for better understanding.
CREATE TABLE table1 (
a1 text,
a2 text,
b1 text,
b2 text,
c1 text,
c2 text,
PRIMARY KEY ( (a1, a2), b1, b2)
);
Here the partition key is (a1, a2) and the row keys are b1, b2.
The combination of partition key and row keys must be unique for each new record.
The above primary key can be modeled like this:
Node< key, value>
Node<(a1a2), Map< b1b2, otherColumnValues>>
As we know, the partition key is responsible for distributing data across your nodes.
So if you insert 100 records into table1 with the same partition key and different row keys, the data will be stored on the same node but in different columns.
Logically, we can represent it like this:
Node<(a1a2), Map< string1, otherColumnValues>, Map< string2, otherColumnValues> .... Map< string100, otherColumnValues>>
So the records will be stored sequentially in memory.
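The Node<(a1a2), Map<b1b2, otherColumnValues>> picture translates almost directly into nested dictionaries. The following Python sketch is only a logical model of table1 as described above, not Cassandra's storage engine; the helper names `insert` and `read_partition` and the sample values are made up.

```python
from collections import defaultdict

# partition key (a1, a2) -> {row key (b1, b2) -> other column values}
table1 = defaultdict(dict)


def insert(a1, a2, b1, b2, c1, c2):
    """Store a record under its partition key and row key (upsert semantics)."""
    table1[(a1, a2)][(b1, b2)] = {"c1": c1, "c2": c2}


def read_partition(a1, a2):
    """Return the records of one partition, ordered by row key."""
    return sorted(table1[(a1, a2)].items())


insert("x", "y", "b", "2", "v1", "v2")
insert("x", "y", "a", "1", "v3", "v4")
insert("x", "y", "a", "1", "v5", "v6")  # same full key: replaces the previous record
insert("p", "q", "a", "1", "v7", "v8")  # different partition key: separate partition
```

Records that share (a1, a2) end up in the same inner map, ordered by (b1, b2), and writing the same full key twice overwrites the earlier record, which mirrors the uniqueness rule stated above.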