Cassandra rangequery, knowing the beginning of start key

Let me clarify.
I have a column family with UTF8 keys like apple#NewYork, banana#LosAngeles, banana#NewYork, cherry#NewYork, and so on.
I chose this format because the keys are sorted, and I would like to get all keys starting with 'banana'.
Is that possible, or is there a workaround?

How about composite types?
Map your current rows to columns like
row_key => {
banana:a => "value you wish",
banana:b => "value you wish",
...
}
Advantages of composite types:
They preserve each component's type when sorting.
If your composite key is a:b:c, you can query for all columns in the range of a or a:b, and also run point queries such as a:b:c.
Now you can perform a column slice, for example using phpcassa.
Example:
row_key => { 1:a => 1, 1:b => 1, 10:bb => 1, 1:c => 2}
ColumnSlice(array(1), array(1)) => All Columns with first component equal to 1
ColumnSlice(array(1), array(10)) => All Columns with first component between 1 and 10
ColumnSlice("", array(1, 'c')) => All columns from the beginning of the row up to the composite column 1:c
You can also do the above in reverse, and you can choose inclusive or exclusive limits.
One point to remember: you can't directly ask for all columns where the second component is < x while skipping the first component.
Row keys support composite types too, but this makes no sense if you are using the random partitioner, because rows are then not stored in key order.
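As a rough illustration of why a slice on the first component works, here is a minimal Python sketch. It does not talk to Cassandra or phpcassa; the `columns` list and the `slice_by_first_component` helper are hypothetical stand-ins for a row whose composite column names are kept in sorted order.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for one wide row: composite column names
# kept sorted, the way columns are sorted within a row.
columns = sorted([
    ("apple", "NewYork"),
    ("banana", "LosAngeles"),
    ("banana", "NewYork"),
    ("cherry", "NewYork"),
])


def slice_by_first_component(cols, first):
    """Return every composite column whose first component equals `first`."""
    firsts = [c[0] for c in cols]      # already sorted, because cols is sorted
    lo = bisect_left(firsts, first)    # first position with component >= first
    hi = bisect_right(firsts, first)   # first position with component > first
    return cols[lo:hi]


banana_columns = slice_by_first_component(columns, "banana")
```

Because the matching columns are contiguous in sorted order, the slice only needs a start and an end position, which is what makes this kind of query cheap.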

Related

is a row in Cassandra same as key->value where value is a super column

I am trying to build a mental model of Cassandra's data model. What I have so far is that the basic unit of data is a column (name, value, timestamp). A super column can contain several columns (it has a name, and its value is a map). An example of a ColumnFamily (which I suppose contains several entries of data, or rows) is
UserProfile = { // this is a ColumnFamily
    phatduckk: { // this is the key to this Row inside the CF
        username: "phatduckk", // column
        email: "phatduckk@example.com", // column
        phone: "(900) 976-6666" // column
    }, // end row
    ieure: { // another row in the same CF; this is the key to another row in the CF
        username: "ieure",
        email: "ieure@example.com",
        phone: "(888) 555-1212",
        age: "66", // a different column than the previous row has
        gender: "undecided" // a different column than the previous row has
    },
}
Question 1- To me it seems that a row in a CF is nothing but a key-value pair where the value is a super column. Am I correct?
Question 2- Could the value (of a row key) be a map of several super columns? What I am thinking is, say I want to create a row with a user's name and address; then the row could be a key (user id) whose value maps to two super columns, C1 (first name, last name) and C2 (street, country).

I think you're trying to wrap your head around the (very) old nomenclature, which was renamed to make it less confusing.
Table
{
    partition key: { // partition
        clustering: { // row
            key: value // column
            key2: value // column
            key3: value // column
        }
        clustering2: { // row
            key: value // column
            ...
        }
        ...
    }
    ...
}
Partitions are ordered by the murmur3 hash of the partition key, which is also used to determine which hosts are replicas. The clustering keys are sorted within each partition, and there is a fixed schema for the fields within a row, each of which is a column.
Using the super column family, column family, super column, column and row nomenclature is just going to confuse you when you read anything that has come out in the last 6 years. Thrift has been deprecated as well, for what it's worth, so don't plan your application around it.
For your questions:
Question 1- To me it seems that a row in CF is nothing but a key-value
pair where value is a super-column Am I correct?
Yes, but the super columns are sorted by their keys, i.e. phatduckk would come after ieure if they are text types in ascending order. That way you can read a slice of names between ph and pk, for instance, and pull them off disk (more useful when clustering on a timestamp and looking for ranges of data).
Question 2- Could the value (of row key) be a map of several super
columns?What I am thinking is say I want to create a row with User's
name and address then the row could be key (user id) and value maps to
two super columns, C1 (firstname, last name) and C2 (street, country)
You should really look at some newer documentation. I think you have the right idea, but it is hard to relate it exactly to how C* works now. Try starting with
https://academy.datastax.com/resources/ds101-introduction-cassandra
https://academy.datastax.com/resources/ds220-data-modeling
These are free courses that do a good job of explaining it.

Spark: filter out all rows based on key/value

I have an RDD, x, in which I have two fields: id, value. If a row has a particular value, I want to take the id and filter out all rows with that id.
For example if I have:
id1,value1
id1,value2
and I want to filter out all ids if any rows with that id have value value1, then I would expect all rows to be filtered out. But currently only the first row is filtered out because it has a value of value1.
I've tried something like
val filter = x.filter(row => (set contains row.value))
This filters out all rows with a particular value, but leaves the other rows with the same id still in the RDD.

You have to apply a filter function to each RDD row; the function after the => receives the row and returns whether to keep it. You might have to adjust the index of the field, but it should look something like this (whether you use contains or its negation depends on whether you want to filter in or filter out):
val filteredRDD = rawRDD
  .filter(rowItem => !(rowItem.map(_.toString).toSeq
    .contains(rowItem(0).toString)))
or even something like:
val filteredRDD = rdd1.filter(rowItem => !(rowItem._2 contains rowItem._1))
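Since the question asks to drop every row that shares an id with a matching row, a single per-row filter is not enough; one possible approach is two passes. The sketch below shows the idea in plain Python on a list of (id, value) tuples, not on a real RDD. The sample data and the `drop_ids_with_blocked_values` helper are hypothetical. In Spark, the first pass would roughly correspond to a filter, a map to the id, and a collect into a set, followed by a second filter on the original RDD.

```python
# Hypothetical sample data: (id, value) pairs.
rows = [("id1", "value1"), ("id1", "value2"), ("id2", "value3")]
blocked_values = {"value1"}


def drop_ids_with_blocked_values(rows, blocked_values):
    """Remove every row whose id appears on at least one row with a blocked value."""
    # Pass 1: collect the ids that have any blocked value.
    bad_ids = {row_id for row_id, value in rows if value in blocked_values}
    # Pass 2: keep only rows whose id was never flagged.
    return [(row_id, value) for row_id, value in rows if row_id not in bad_ids]


remaining = drop_ids_with_blocked_values(rows, blocked_values)
```

This assumes the set of flagged ids is small enough to hold in memory; if it is not, a join-based approach may be more appropriate.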

How to make the query to work?

I have Cassandra version 2.0 and I am totally new to it, hence the question.
I have a table T1 with columns named 1, 2, 3, ..., 14 (for simplicity).
The partition key is columns 1 and 2.
The clustering key is columns 3, 1 and 5.
I need to perform the following query:
SELECT 1,2,7 FROM T1 where 2='A';
Column 2 is a flag, so values are repeating.
I get the following error:
Unable to execute CQL query: Partitioning column 2 cannot be restricted because the preceding column 1 is either not restricted or is restricted by a non-EQ relation
So what is the right way to do it? I really need to get the data already filtered. Thanks.

So, to make sure I understand your schema, you have defined a table T1:
CREATE TABLE T1 (
1 INT,
2 INT,
3 INT,
...
14 INT,
PRIMARY KEY ((1, 2), 3, 1, 5)
);
Correct?
If this is the case, then Cassandra cannot find the data to answer your CQL query:
SELECT 1,2,7 FROM T1 where 2 = 'A';
because your query has not provided a value for column "1". Without it, Cassandra cannot compute the partition key (which, per your composite PRIMARY KEY definition, requires both columns "1" and "2"), and without the partition key it cannot determine which nodes in the ring to look on. By including "2" in your partition key, you are telling Cassandra that this value is required to determine where to store (and thus where to read) the data.
For example, given your schema, this query should work:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
since you are providing both values of your partition key.
@Caleb Rockcliffe has good advice, though, regarding the need for other, secondary/supplemental lookup mechanisms if queries like this are a big part of your workload. You may need to find some way to first look up the values for "1" and "2", then issue your query. E.g.:
CREATE TABLE T1_index (
1 INT,
2 INT,
PRIMARY KEY (1, 2)
);
Given a value for "1", the above will provide all of the possible "2" values, through which you can then iterate:
SELECT 2 FROM T1_index WHERE 1 = 'X';
And then, for each "1" and "2" combination, you can then issue your query against table T1:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
Hope this helps!

Your WHERE clause needs to include the first element of the partition key.
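To make the reason concrete, here is a minimal Python sketch of how a node could be chosen from a composite partition key. It is only an illustration: the `NODES` list and the `node_for` helper are hypothetical, MD5 stands in for Cassandra's actual partitioner hash, and a simple modulo stands in for token ranges.

```python
import hashlib

# Hypothetical three-node ring.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key):
    """Pick a node from the hash of the full partition key (all components)."""
    if any(part is None for part in partition_key):
        # With a component missing there is nothing complete to hash,
        # so no single node can be selected.
        raise ValueError("every partition key column must be given a value")
    joined = "|".join(str(part) for part in partition_key)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Both components known: the lookup is deterministic.
target = node_for(("X", "A"))
```

Calling `node_for((None, "A"))` raises an error in this sketch, which mirrors why a query restricting only column "2" is rejected.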

intersect cassandra rows

We have a Cassandra column family.
Each row has multiple columns. The columns have names, but their values are empty.
If we have 5-10 row keys, how can we find the column names that appear in all of these rows?
e.g.
row1: php, programming, accounting
row2: php, bookkeeping, accounting
row3: php, accounting
must return:
result: php, accounting
Note that we cannot easily load a whole row into memory, because it may contain 1M+ columns.
The solution does not need to be fast.

To intersect several rows, we first intersect two of them, then intersect the result with the third, and so on.
It looks like in Cassandra we can query data by column names, and this is a relatively fast operation.
So we first get a column slice of 10k columns from the first row and make a list of the column names (in phpcassa, put them in an array). Then we select those columns from the second row.
The code may look like this:
$x = $cf->get($first_key, <some column slice>);
$column_names = array();
foreach(array_keys($x) as $k)
$column_names[] = $k;
$result = $cf->get($second_key, $column_slice = null, $column_names);
// write result somewhere, and proceed with next slice

Your column names are sorted, so you can create an iterator for each row (the iterator loads a portion of the data at a time, for example 10k columns). Put each iterator into a priority queue ordered by its next column name. If you take k iterators with the same column name from the queue in a row, where k is the number of rows, that name is common to all rows. Otherwise, advance the iterators you took to their next element and return them to the queue.
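The priority-queue idea can be sketched in Python with `heapq`. This is a minimal illustration under stated assumptions, not a Cassandra client: each input is any iterator that yields column names in sorted order without duplicates (in practice it would page through a row in chunks), and `intersect_sorted` is a hypothetical helper name.

```python
import heapq


def intersect_sorted(*iterables):
    """Yield names present in every input; inputs must be sorted and duplicate-free."""
    iterators = [iter(it) for it in iterables]
    k = len(iterators)
    if k == 0:
        return
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, None)  # None marks an exhausted iterator
        if first is None:
            return  # an empty row makes the intersection empty
        heap.append((first, idx))
    heapq.heapify(heap)
    # Once any iterator runs dry the heap shrinks below k and we can stop.
    while len(heap) == k:
        smallest = heap[0][0]
        matched = []
        # Take every iterator currently positioned on the smallest name.
        while heap and heap[0][0] == smallest:
            matched.append(heapq.heappop(heap)[1])
        if len(matched) == k:
            yield smallest  # all k rows contain this name
        # Advance the iterators we took and return them to the queue.
        for idx in matched:
            nxt = next(iterators[idx], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, idx))


common = list(intersect_sorted(
    ["accounting", "php", "programming"],
    ["accounting", "bookkeeping", "php"],
    ["accounting", "php"],
))
```

Only one name per row is held in the queue at any time, so memory use does not depend on how many columns each row has.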

You could use a Hadoop map/reduce job as follows:
Map output key = column name
Map output value = row key
Reducer counts row keys for each column and outputs column name & count to a CF with the following schema:
key : [column name] {
Count : [count]
}
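You can then query counts from this CF in reverse order. The first record will be the max, so you can keep iterating until a value is < max. This will be your intersection, provided the max equals the number of rows being intersected; if it is lower, no column name appears in every row.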
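The counting idea behind the map/reduce job can be shown with a small in-memory Python sketch. It is not Hadoop code: the `rows` dictionary and the `intersect_by_count` helper are hypothetical, and the sketch compares each count against the number of rows directly instead of against the maximum count.

```python
from collections import Counter

# Hypothetical sample data: row key -> column names in that row.
rows = {
    "row1": ["php", "programming", "accounting"],
    "row2": ["php", "bookkeeping", "accounting"],
    "row3": ["php", "accounting"],
}


def intersect_by_count(rows):
    """Return the column names that occur in every row, in sorted order."""
    counts = Counter()
    for column_names in rows.values():
        # "Map" step: emit each column name once per row.
        for name in set(column_names):
            counts[name] += 1  # "Reduce" step: count rows per column name
    # A name is in the intersection when it was seen in every row.
    return sorted(name for name, n in counts.items() if n == len(rows))


common_columns = intersect_by_count(rows)
```

Unlike this sketch, a real job would not hold all counts in memory; the reducer would write them out per column name as described above.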

In Cassandra How In operation will work like SQL?

My requirement is to fetch records from Cassandra where one column's value is not a particular string, i.e. it is one of a list of strings that I pass in. For example, the data has a column named service, which may contain the values 1, 2, 3, 4, 5. I don't want to display 5; I want to display the records that have 1, 2, 3 or 4. How can I achieve this? Could anyone please help me with this?

If you are storing 1..5 as values in your column, there is no other way: you need to filter on the client side. You can't filter on the Cassandra side.
If 1..5 are your column names, like
row_1 => { 1: value, 2:value... }
row_2 => { 1: value, 2:value... }
..
row_3 => { 1: value, 2:value... }
then you can use
SELECT 1..4 from YOUR_COLUMN_FAMILY where key='yourKey'
As another option, if you can have 1..5 as separate rows, like
1 => { c1: value, c2:value... }
2 => { c1: value, c2:value... }
..
5 => { c1: value, c2:value... }
you can do
SELECT * from YOUR_COLUMN_FAMILY where key in (1,2,3,4)
You can also have a look at secondary indexes on columns:
Because of the secondary index created on the columns, their values can be queried directly
