How to identify different user_ids of the same person in hive? - apache-spark

I have a Hive table with the following two columns:
user_id_a,user_id_b
Each row links two different user_ids that belong to the same person.
For example:
user_id_a,user_id_b
3242,5897
5897,6752
3242,9876
7654,1287
1287,3421
There are two people in this table:
"3242_5897_6752_9876" represents one person, and
"1287_3421_7654" represents another.
How can I use HQL to extract these groups from the table?
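The grouping asked for here is a connected-components problem: each row is an edge between two IDs, and every connected set of IDs is one person. The sketch below shows that logic in plain Python with a union-find structure, not in HQL. The function name `group_user_ids` is made up for illustration, and the sketch assumes all pairs fit in memory.

```python
def group_user_ids(pairs):
    """Group IDs that are linked, directly or transitively, by the pairs."""
    parent = {}

    def find(x):
        # Walk to the root of x's set, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    groups = {}
    for user_id in list(parent):
        groups.setdefault(find(user_id), []).append(user_id)
    # One underscore-joined, sorted string per person, as in the question.
    return sorted("_".join(str(i) for i in sorted(ids)) for ids in groups.values())


pairs = [(3242, 5897), (5897, 6752), (3242, 9876), (7654, 1287), (1287, 3421)]
print(group_user_ids(pairs))
# ['1287_3421_7654', '3242_5897_6752_9876']
```

In a distributed setting the same idea is usually expressed as an iterative join or a graph connected-components job, since a single in-memory pass like this does not scale to a large table.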

Related

How do primary keys and clustering columns operate in Cassandra?

I'm confused as to how primary keys in Cassandra allow for quick data access. Say for example I create a table of Students with the following schema columns:
I choose the primary key to be Student Id. My understanding is that the students will be distributed around the cluster based on a hash of this value. Say I also choose Country as a clustering column. Then, within each partition of students (who have been split based on their Id), rows will be ordered by Country (presumably alphabetically).
So if I then want to retrieve all students for a specific country, will I have to visit multiple nodes in the cluster? While the students have been ordered by Country within each node, there is nothing to say that all the students for a specific country have been stored on the same node? Is this type of query even supported?
If I had only added 5 students to a 5-node cluster, would it be possible that all the students would be stored on separate nodes if the Student Id was a UUID?
So if I then want to retrieve all students for a specific country will I have to visit multiple nodes in the cluster?
Yes.
While the students have been ordered by Country within each node there is nothing to say that all the students for a specific country have been stored on the same node?
Correct.
Is this type of query even supported?
It is, but it is considered an anti-pattern in Cassandra. The coordinator (the node that receives the request from the client) has to query all other nodes, because the rows of that column family must be scanned on every node.
If I had only added 5 students to a 5 nodes cluster would it be possible that all the students would be stored on separate nodes if the Student Id was a UUID?
Yes.
You can solve this by having one column family per query (one for selecting by Student ID and another for selecting by Country, each with a different primary key) and duplicating the rows: when you create a student, you insert it into both column families.
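To make the "query all nodes" point concrete, here is a rough Python simulation of hash-based placement. It is only a sketch: real Cassandra maps partition keys to tokens (Murmur3 by default) and assigns token ranges to nodes, whereas this example uses MD5 modulo a node count. `NODE_COUNT`, `node_for`, `students_in`, and the student data are all invented for illustration.

```python
import hashlib
import uuid

NODE_COUNT = 5  # hypothetical cluster size


def node_for(partition_key):
    """Map a partition key to a node index by hashing it (simplified)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


# Made-up students; the UUIDs are fixed so the run is repeatable.
students = [
    (uuid.UUID(int=1), "Ann", "France"),
    (uuid.UUID(int=2), "Bob", "France"),
    (uuid.UUID(int=3), "Cho", "Japan"),
    (uuid.UUID(int=4), "Dev", "France"),
    (uuid.UUID(int=5), "Eve", "Japan"),
]

# Each node holds only the partitions whose key hashes to it.
nodes = {n: [] for n in range(NODE_COUNT)}
for student_id, name, country in students:
    nodes[node_for(student_id)].append((student_id, name, country))


def students_in(country):
    """Without the partition key, every node has to be checked."""
    found = []
    for rows in nodes.values():  # full scan across the cluster
        found.extend(name for _, name, c in rows if c == country)
    return sorted(found)


print(students_in("France"))
```

Because placement depends only on the hash of the student ID, students from the same country can land on any node, which is why a lookup by country cannot be routed to a single node.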

Understanding Cassandra Data Model

I have recently started learning No-SQL and Cassandra through this article. The author explains the data model through this diagram:
The author also gives the below column family example:
Book {
key: 9352130677 { name: "Hadoop The Definitive Guide", author: "Tom White", publisher: "Oreilly", priceInr: 650, category: "hadoop", edition: 4 },
key: 8177228137 { name: "Hadoop in Action", author: "Chuck Lam", publisher: "manning", priceInr: 590, category: "hadoop" },
key: 8177228137 { name: "Cassandra: The Definitive Guide", author: "Eben Hewitt", publisher: "Oreilly", priceInr: 600, category: "cassandra" },
}
But in that tutorial and every other tutorial I have gone through, they end up creating regular tables in Cassandra. I am unable to connect the Cassandra data model with what I am creating.
For example, I created a column family called Employee as below:
create columnfamily Employee(empid int primary key,empName text,age int);
Now I inserted some data and my column family looks like this:
To me this looks like a regular relational table and not like the data model the author has explained. How do I create an Employee column family where each row represents an employee with different attributes? Something like:
Employee{
101:{name:Emp1,age:20}
102:{name:Emp2,salary:1000}
102:{manager_name:Emp3,age:45}
}
You need to understand that although the CQL representation may look like a regular relational table, the internal structure of rows in Cassandra is completely different. It saves a different set of attributes for each employee, and the nulls you see when querying with CQL are just a representation of empty/nonexistent cells.
What you are trying to achieve is an unstructured data model. Cassandra started with this model, and everything worked as described in the tutorial you read, but there is an opinion that unstructured data design is unhealthy for development and creates more problems than it solves. So, after some time, Cassandra moved to a "structured" data model (and from Thrift to CQL). This doesn't mean that you have to store all attributes for all keys/rows, or that all rows have the same number of attributes; it just means that you have to declare attributes before you use them.
You can achieve a kind of unstructured data modeling using the Map, List, and Set data types, UDTs (user-defined types), or by saving your data as a JSON string and parsing it on the application side.
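As a loose analogy (not Cassandra's actual storage format), the difference between what is stored and what CQL displays can be sketched in Python: each row keeps only the cells that were written, and the grid view fills the gaps with None. The names `employees`, `declared_columns`, and `as_table` are invented for this sketch.

```python
# Each row stores only the cells that were actually written.
employees = {
    101: {"name": "Emp1", "age": 20},
    102: {"name": "Emp2", "salary": 1000},
}

# Columns have to be declared up front, as in a CQL table definition.
declared_columns = ["name", "age", "salary"]


def as_table(rows, columns):
    """Render sparse rows as a grid, showing missing cells as None."""
    return [
        (key,) + tuple(cells.get(col) for col in columns)
        for key, cells in sorted(rows.items())
    ]


for line in as_table(employees, declared_columns):
    print(line)
# (101, 'Emp1', 20, None)
# (102, 'Emp2', None, 1000)
```

The None values exist only in the rendered grid; nothing is stored for them in the underlying dictionaries.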
What you have understood is correct. Just believe it. Internally Cassandra stores columns exactly like the image in your question.
Now, what you are expecting is to insert a column that was not defined when creating the Employee table. For dynamic columns, you can always use the Map data type.
For example:
create table Employee(
empid int primary key,
empName text,
age int,
attributes Map<text,text>);
To add new attributes, you can use a query like the one below.
UPDATE Employee SET attributes = attributes + { 'manager_name' : 'Emp3', 'age' : '45' } WHERE empid = 102;
Update -
Another way to create a dynamic column model is as below:
create table Employee(
empid int,
empName text,
attribute text,
attributevalue text,
primary key (empid,empName,attribute)
);
Let's take a few inserts -
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','age','25') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','manager','emp2') ;
insert into Employee (empid,empName,attribute,attributevalue) values (102,'Emp1','department','hr') ;
This data structure creates a wide row and behaves like dynamic columns. You can see that empid and empName are the same for all three rows; only the attribute and its value change.
Hope this will help
Cassandra uses a special kind of primary key called a composite key. Its partition key part determines the partitions, and therefore the nodes on which the rows are stored. This is also one reason why Cassandra scales well.
The result in your console may be a result set of rows, but the internal organization of Cassandra is different from that. Have you ever tried to query a table without using the primary key? You will quickly see that you can't query that flexibly (because of the partitioning).
After that you will understand why we have to use a query-first design approach for Cassandra. This is completely different from an RDBMS.

What's the difference among row key, primary key and index in Cassandra?

I'm so confused.
When to use them and how to determine which one to use?
If a column is index/primary key/row key, could it be duplicated?
I want to create a column family to store some many-to-many info. For example, one column is the given name and the other is the surname. One given name can relate to many surnames, and one surname can have different given names.
I need to query surnames by a given name, and the given names by a specified surname too.
How should I create the table?
Thanks!
Cassandra is a NoSQL database, and as such has no concept of many-to-many relationships. Ideally a table should not have anything other than a primary key. In your case the right way to model it in Cassandra is to create two tables, one with the given name as the primary key and the other with the surname as the primary key.
When you need to query by either key, query the table that has that key as its primary key.
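The two-table idea can be sketched in Python with two dictionaries, each keyed by the value it will be queried with. This only illustrates the access pattern, not CQL. It assumes each table also keeps the other column so that one key can map to several values, and the names `surnames_by_given_name`, `given_names_by_surname`, and `add_person` are made up.

```python
from collections import defaultdict

# Two lookup "tables", each keyed by the value it will be queried with.
surnames_by_given_name = defaultdict(set)
given_names_by_surname = defaultdict(set)


def add_person(given_name, surname):
    """Write the same pair to both tables (denormalized on purpose)."""
    surnames_by_given_name[given_name].add(surname)
    given_names_by_surname[surname].add(given_name)


add_person("John", "Smith")
add_person("John", "Doe")
add_person("Jane", "Smith")

print(sorted(surnames_by_given_name["John"]))   # ['Doe', 'Smith']
print(sorted(given_names_by_surname["Smith"]))  # ['Jane', 'John']
```

Every write goes to both structures, which is the price paid for making both lookups a direct key access.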
EDIT:
From the Cassandra docs:
Cassandra's built-in indexes are best on a table having many rows that
contain the indexed value. The more unique values that exist in a
particular column, the more overhead you will have, on average, to
query and maintain the index. For example, suppose you had a races
table with a billion entries for cyclists in hundreds of races and
wanted to look up rank by the cyclist. Many cyclists' ranks will share
the same column value for race year. The race_year column is a good
candidate for an index.
Do not use an index in these situations:
On high-cardinality columns for a query of a huge volume of records for a small number of results.
In tables that use a counter column.
On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.

Cassandra Hierarchy Data Model

I'm new to designing Cassandra data models and I need some help thinking outside the box.
Basically I need a hierarchical table, something pretty standard when talking about employees.
You have an employee, say Big Boss, who has a list of employees under him.
Something like:
create table employee(id timeuuid, name text, employees list<employee>, primary key(id));
So, is there a way to model a hierarchy in Cassandra by using the table type itself, or is there another approach?
When I try the line above, it gives me
Bad Request: line 1:61 no viable alternative at input 'employee'
EDITED
I was thinking about 2 possibilities:
Store UUIDs instead, and in my Java application look up each Employee by UUID when loading the "boss".
Work with a Map, where the key is the employee's uuid and the text value is the entire row; then in my Java application read the map, convert each "text" employee into an Employee entity, and finally return the whole object.
It really depends on your queries...one particular model would only be good for a set of queries, but not others.
You can store ids and look them up again on the client side. This means n extra queries for each "query". This may or may not be a problem, as queries that hit a single partition are fast. Using a map from id to name is also an option. This means you do extra work and denormalise the names into the map values. That's also valid. A third option is to use a UDT (user-defined type). You could then have a list, a set, or even a map of them. In Cassandra 2.1, you can index the map keys/values as well, allowing for some quite flexible querying.
https://www.datastax.com/documentation/cql/3.1/cql/cql_using/cqlUseUDT.html
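The first option (store ids and resolve them on the client) could look roughly like the Python sketch below. `EMPLOYEES` and `fetch_employee` are stand-ins for a real table and a single-partition query by id; they are hypothetical and no driver is involved.

```python
# Stand-in for an employee table: id -> (name, ids of direct reports).
EMPLOYEES = {
    1: ("Big Boss", [2, 3]),
    2: ("Alice", [4]),
    3: ("Bob", []),
    4: ("Carol", []),
}


def fetch_employee(employee_id):
    """Stand-in for one single-partition query by id."""
    return EMPLOYEES[employee_id]


def load_tree(employee_id):
    """Resolve an employee and all reports, one lookup per employee."""
    name, report_ids = fetch_employee(employee_id)
    return {
        "id": employee_id,
        "name": name,
        "employees": [load_tree(report_id) for report_id in report_ids],
    }


tree = load_tree(1)
print(tree["name"], [e["name"] for e in tree["employees"]])
# Big Boss ['Alice', 'Bob']
```

Each level of the hierarchy costs one extra round of lookups, which is the "n extra queries" trade-off mentioned above. The sketch also assumes the data contains no cycles.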
One more approach could be to store a person's details as id, static columns for their attributes, and have "children" as columns in wide row format.
This could look like
create table person(
id int,
name text static,
age int static,
employee_id int,
employee frozen<employeeudt>,
primary key (id, employee_id)
);
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/refStaticCol.html
Querying this will give you rows with the static properties repeated, but on disk, it's still held once. You can resolve the rest client side.

Should I create separate column families if I want to query on many columns, or use a composite PK?

I have a column family like
object
(
object_id,
company_id,
group_id,
family_id,
description,
..
);
I want to query it by object id, company id, group id, and any combination of these.
My question is:
should I make a composite primary key
(object id, company id, group id)
or create separate column families?
Only object id is unique in the CF; company id can repeat in multiple rows, but group id does not repeat in many rows.
You may well want to duplicate your data in multiple CFs depending on your query patterns. This is quite common practice.
If a common query is "Get all objects by company_id", then you might want to store all objects in a CF partitioned just by company_id as the row key. If you need to do individual object lookups as well, then you store that data duplicated in another CF, with each object partitioned by object_id. If groups are always a subset of a specific company, perhaps you want to use company as the row key and then cluster by group.
You should be designing your Cassandra schema based on the queries you need to run, rather than the data that needs to go in it.
