Relational Stores & Many-to-one joins - activepivot

David, could I ask for some clarification on what you say about joins in this answer?
When you say "You cannot, using the join of the relational stores, join one entry to multiple ones", does that apply in either direction?
E.g. Store 1:
| Key1 | Measure1 |
Store 2:
| Key 1 | SomeId1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure4 | Measure4 |
So is it not possible to join these two stores by putting the join from Store 2 to Store 1?
And if not, are you saying then that the only way to manage this is to duplicate the entries in Store 1? E.g.:
Store 1
| Key 1 | SomeId1 | Measure1 | Measure2 | Measure3 |
| Key 1 | SomeId2 | Measure1 | Measure4 | Measure4 |

The direction matters for the one-to-many: it depends on which store is the "parent" one.
The relational stores include the concept of an "ActivePivot Store", which is your main store (the one your schema is based on). This store can then be joined, on a set of key fields, to one or more stores that we'll call "child" stores for simplicity. Each of these child stores can in turn be joined to other stores, and so on (you can represent it as a directed graph).
The main rule to respect is that a "parent" store entry should never resolve to multiple "child" store entries (nor, I believe, should you have any cyclic relationship).
The simplified idea behind the relational stores (as of RS 1.5.x / AP 4.4.x) is that when an entry is submitted into the "ActivePivot Store", the joins are resolved recursively, starting from the ActivePivot Store, in order to retrieve at most one entry from each of the joined stores. Depending on your schema definition, these entries are then used to populate the fact before it is inserted into the cube.
If resolving a join results in more than one entry, AP cannot choose which one to use to populate the fact and will throw an exception.
Coming back to your example, you can join Store 1 and Store 2 only if Store 2 is your ActivePivot Store or a "parent" of Store 1 (APStore->...->Store2->Store1), which seems to be your case.
If not (Store1->Store2), you will have to duplicate the entries of Store 1 to ensure that at most one entry is ever found when resolving the join. Store 1 will then look like:
| Key 1 | SomeId1 | Measure1 |
| Key 1 | SomeId2 | Measure1 |
Your join with Store 2 will then be done on the fields "Key, SomeId" instead of just "Key", which guarantees that at most one entry is found when resolving Store1->Store2.
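To make the resolution rule concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration only, not ActivePivot's actual API; the Store class, the resolve function and the field names are all assumptions:

class Store:
    def __init__(self, name, rows, joins=()):
        self.name = name          # store name, used in error messages
        self.rows = rows          # entries, as a list of dicts
        self.joins = list(joins)  # (child_store, key_fields) pairs

def resolve(store, entry):
    """Recursively enrich an entry, allowing at most one match per join."""
    fact = dict(entry)
    for child, keys in store.joins:
        matches = [r for r in child.rows
                   if all(r[k] == fact[k] for k in keys)]
        if len(matches) > 1:
            # this is the case where AP throws: the join is ambiguous
            raise ValueError("join to %s resolved %d entries, expected at most 1"
                             % (child.name, len(matches)))
        if matches:
            fact.update(resolve(child, matches[0]))
    return fact

store1 = Store("Store 1", [{"Key": 1, "Measure1": 10.0}])
store2 = Store("Store 2",
               [{"Key": 1, "SomeId": "SomeId1", "Measure2": 2.0},
                {"Key": 1, "SomeId": "SomeId2", "Measure4": 4.0}],
               joins=[(store1, ["Key"])])

# Store2->Store1 is fine: each Store 2 entry resolves to exactly one
# Store 1 entry. The reverse direction would raise, because Store 1's
# single entry would resolve to two Store 2 entries.
print(resolve(store2, store2.rows[0]))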

Related

Find by value on leveldb

I've been playing with leveldb and it's really good at what it's designed to do: storing and getting key/value pairs by key.
But now I want to do something more advanced and find myself immediately stuck. Is there no way to find a record by value? The only way I can think of is to iterate through the entire database until I find an entry with the value I'm looking for. This becomes worse if I'm looking for multiple entries with the value (basically a "where" query) since I have to iterate through the entire database every time I try to do this type of query.
Am I trying to do what Leveldb isn't designed to do and should I be using another database instead? Or is there a nice way to do this?
You are right. Basically, what you need to know about is key composition.
Also note that even in SQL you don't query by the value itself: a WHERE clause uses a boolean predicate such as age = 42, and that is what a second key namespace lets you emulate.
To answer your particular question, imagine you have a first key-value namespace in leveldb where you store your objects, with the value serialized as JSON for instance:
key              | value
(namespace, uid) |
-----------------+--------------------------
(users, 1)       | {name: "amz", age: 32}
(users, 2)       | {name: "abki", age: 42}
In another namespace, you index user uids by age (call it users-by-age, since age is what it indexes):
key                   | value
(namespace, age, uid) |
----------------------+-------
(users-by-age, 32, 1) | empty
(users-by-age, 42, 2) | empty
Here the value is empty because the key must be unique. What we would normally think of as the value of a given row, the uid, is composed into the key precisely to make each row's key unique.
In that second namespace, every key that starts with (users-by-age, 32) matches a record that answers the query age = 32, so a simple prefix scan replaces a full-database iteration.
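Here is a minimal sketch of this two-namespace scheme in Python. The plyvel binding, the \x00 separator and the helper names are my own assumptions; any LevelDB client that supports prefix iteration works the same way:

import json
import plyvel

db = plyvel.DB('/tmp/users-db', create_if_missing=True)

SEP = b'\x00'  # assumed separator: any byte that never appears in key parts

def put_user(uid, name, age):
    # primary namespace: key = (users, uid), value = serialized object
    db.put(SEP.join([b'users', str(uid).encode()]),
           json.dumps({'name': name, 'age': age}).encode())
    # index namespace: key = (users-by-age, age, uid), value = empty
    # (zero-pad the age if you also need ordered range scans)
    db.put(SEP.join([b'users-by-age', str(age).encode(), str(uid).encode()]),
           b'')

def users_with_age(age):
    # a prefix scan over the index namespace answers "WHERE age = ?"
    prefix = SEP.join([b'users-by-age', str(age).encode()]) + SEP
    for key, _ in db.iterator(prefix=prefix):
        uid = key.rsplit(SEP, 1)[-1]
        yield json.loads(db.get(SEP.join([b'users', uid])))

put_user(1, 'amz', 32)
put_user(2, 'abki', 42)
print(list(users_with_age(32)))  # [{'name': 'amz', 'age': 32}]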

Converting multi-list row into multiple rows with single elements Cassandra

I have a table in Cassandra with the following schema (simplified):
id | lat | lng
--------+--------------+-------------
uuid | list<double> | list<double>
I would like to query one row of this table (by id) and obtain something like this:
id | lat | lng
--------+-----------+-----------
id0 | lat[0] | lng[0]
id0 | lat[1] | lng[1]
id0 | lat[2] | lng[2]
.
.
.
id0 | lat[n] | lng[n]
Is this possible? What would be a good approach?
Note: both lists always have the same length.
I think there is no easy way to select this from Cassandra. Just note that if you know your values will be used in different ways, you should also store them in a normalized form (in this context, that means maintaining a second table).
A good approach would be to work out what is most critical in your task: whether the application can simply explode the lists after reading the row, whether you need a dedicated table for this query shape, and how many requests you expect, and then find a compromise.
Generally, in Cassandra it is common to duplicate the same data into different tables, one per query, whenever the queries need different keys/indexes.
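For example, a query-specific second table could look like the sketch below (the table name and the seq clustering column are my own assumptions; the application writes one row per list position):

CREATE TABLE coords_by_id (
    id uuid,
    seq int,      -- position in the original lists
    lat double,
    lng double,
    PRIMARY KEY (id, seq)
);

-- the application explodes the lists on write:
INSERT INTO coords_by_id (id, seq, lat, lng)
VALUES (123e4567-e89b-12d3-a456-426614174000, 0, 51.5, -0.1);
INSERT INTO coords_by_id (id, seq, lat, lng)
VALUES (123e4567-e89b-12d3-a456-426614174000, 1, 48.9, 2.4);

-- a single-partition read then returns exactly the shape in the question:
SELECT id, lat, lng FROM coords_by_id
WHERE id = 123e4567-e89b-12d3-a456-426614174000;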

Many-To-Many Cassandra Database

Let's say I have users. Those users can have access to multiple projects, and a project can also allow multiple users.
So I model four tables: users (by id), projects (by id), projects_by_user and users_by_project.
-----------   ------------   --------------------   --------------------
| users   |   | projects |   | projects_by_user |   | users_by_project |
|---------|   |----------|   |------------------|   |------------------|
| id    K |   | id     K |   | user_id        K |   | project_id     K |
| name    |   | name     |   | project_id     C |   | user_id        C |
-----------   ------------   | project_name   S |   | user_name      S |
                              --------------------   --------------------
So the user_name is stored in the users_by_project table and the project_name in the projects_by_user table for querying.
The problem I have is that when a user updates the project_name, this will of course update the projects table. But for data consistency I also need to update each partition in the projects_by_user table.
As far as I can see, this is only possible by querying all the users from the users_by_project table and doing an update for each user.
Is there any better way without first reading lots of data?
I don't see why you need four tables. Your users and projects tables could contain all of the data.
If you define the tables like this:
CREATE TABLE users (
    user_id int PRIMARY KEY,
    name text,
    project_ids list<int>
);

CREATE TABLE projects (
    project_id int PRIMARY KEY,
    name text,
    user_ids list<int>
);
Then each user would have a list of project ids they have access to, and each project would have a list of users that have access to it.
To add access to project 123 to user 1 you would run:
BEGIN BATCH
UPDATE users SET project_ids = project_ids + [123] WHERE user_id=1;
UPDATE projects SET user_ids = user_ids + [1] WHERE project_id=123;
APPLY BATCH;
To change a project name, you would just do:
UPDATE projects SET name = 'New project name' WHERE project_id=123;
For simplicity I showed the id fields as ints, but normally you would use uuids for that.
I don't think there is a better way. Cassandra places a lot of limitations on the queries you can make. In your case you have a compound key (user_id, project_id), and in order to update a row you have to provide both parts in the WHERE clause, which means you have to read all the users for the specific project and update each of them. If you have a large database and this scenario happens often, that is significant overhead, so it may be better to remove the project_name field from the table and join projects and projects_by_user at the application level, as sketched below.
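For reference, the read-then-fan-out update described above looks like this (a sketch against the tables from the question; the ? is a bind parameter filled in once per user_id returned by the first query):

-- 1. find every user that can see the project
SELECT user_id FROM users_by_project WHERE project_id = 123;

-- 2. for each user_id returned, rewrite the denormalized name
UPDATE projects_by_user
SET project_name = 'New project name'
WHERE user_id = ? AND project_id = 123;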
BTW: the scenario you describe here is a more natural fit for the relational model, so if the rest of your data model looks similar to this, I would consider using a relational database instead.

postgres join list with $ delimiter

From these tables:
select group, group_ids
from some.groups_and_ids;
Result:
  group  | group_ids
---------+-----------
 winners | 1$4
 losers  | 4
 others  | 2$3$4
and:
select id, name from some.ids_and_names;
 id | name
----+---------
  1 | bob
  2 | robert
  3 | dingus
  4 | norbert
How would you go about returning something like:
winners | bob, norbert
losers | norbert
others | robert, dingus, norbert
with normalized (group_name, id) as (
    select group_name, unnest(string_to_array(group_ids, '$')::int[])
    from groups_and_ids
)
select n.group_name, string_agg(p.name, ', ' order by p.name)
from normalized n
join ids_and_names p on p.id = n.id
group by n.group_name;
The first part (the common table expression) normalizes your broken table design by creating a proper view on the groups_and_ids table. The actual query then joins the ids_and_names table to the normalized version of your groups and then aggregates the names again.
Note I renamed group to group_name because group is a reserved keyword.
SQLFiddle: http://sqlfiddle.com/#!15/2205b/2
Is it possible to redesign your database? Putting all the group_ids into one column makes life hard. If your table were e.g.
 group   | group_id
---------+----------
 winners | 1
 winners | 4
 losers  | 4
etc., this would be trivially easy. As it is, the query below will do it, although I hesitated to post it, since it encourages bad database design (IMHO)!
p.s. I took the liberty of renaming some columns, because they are reserved words. You can escape them, but why make life difficult for yourself?
select group_name, array_to_string(array_agg(username), ', ')  -- aggregate the names into a single string
from
(
    select group_name, theids, username
    from ids_and_names
    inner join
    (
        select group_name, unnest(string_to_array(group_ids, '$')) as theids  -- unnest the $-delimited ids into rows
        from groups_and_ids
    ) i
    on i.theids = cast(id as text)
) a
group by group_name;
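For completeness, the normalized design both answers advocate could be sketched like this (the table and column names are my own assumptions, and ids_and_names.id is assumed to be unique):

-- one row per (group, member) instead of a $-delimited string
CREATE TABLE group_members (
    group_name text NOT NULL,
    member_id  int  NOT NULL REFERENCES ids_and_names (id),
    PRIMARY KEY (group_name, member_id)
);

-- the original question then becomes a plain join plus aggregate:
SELECT gm.group_name, string_agg(n.name, ', ' ORDER BY n.name)
FROM group_members gm
JOIN ids_and_names n ON n.id = gm.member_id
GROUP BY gm.group_name;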

How Cassandra stores multicolumn primary key (CQL)

I am a little confused about composite row keys with CQL in Cassandra.
Let's say I have the following
cqlsh:testcql> CREATE TABLE Note (
... key int,
... user text,
... name text
... , PRIMARY KEY (key, user)
... );
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user1', 'name1');
cqlsh:testcql> INSERT INTO Note (key, user, name) VALUES (1, 'user2', 'name1');
cqlsh:testcql>
cqlsh:testcql> SELECT * FROM Note;
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
How is this data stored? Are there two rows or one?
If two, how is it possible to have more than one row with the same key?
If one, then with records where key=1 and user ranges from "user1" to "user1000", does that mean the table will have one row with key=1 and 1000 columns containing the name for each user?
Can someone explain what's going on in the background? Thanks.
So, after digging a bit more and reading an article suggested by Lyuben Todorov (thank you), I found the answer to my question.
Cassandra stores data in structures called rows, and these are quite different from rows in a relational database. Each row has a unique key.
Now, what's happening in my example... In the table Note I have a composite key defined as PRIMARY KEY (key, user). Only the first element of this key acts as the row key; it's called the partition key. Internally, the rest of the key is used to build composite column names.
In my example
key | user | name
-----+-------+-------
1 | user1 | name1
1 | user2 | name1
This will be represented in Cassandra as a single row:
-------------------------------
|   | user1:name | user2:name |
| 1 |------------+------------|
|   | name1      | name1      |
-------------------------------
Knowing that, it's clear that it's not a good idea to add to the composite key any column with a huge (and growing) number of unique values, because all of those values will be stored in a single row. It's even worse if you have several such columns in a composite primary key.
Update: Later I found this blog post by Aaron Morton that explains the same thing in more detail.
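A quick way to see the partition-key/clustering-column split in action is to look at which queries the Note table supports (a sketch; the exact error wording varies by Cassandra version):

-- one partition, one clustering value: a single logical row
SELECT * FROM Note WHERE key = 1 AND user = 'user1';

-- a whole partition: every clustering value under key = 1
SELECT * FROM Note WHERE key = 1;

-- filtering on the clustering column alone is rejected, because it
-- would have to scan every partition:
SELECT * FROM Note WHERE user = 'user1';  -- error without ALLOW FILTERING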
