How does an IN operation work in Cassandra, as it does in SQL?

I need to fetch records from Cassandra based on one column: its value must be one of a list of strings that I pass in, and must not be a particular excluded string. For example, the data has a column named service, which may contain the values 1, 2, 3, 4 or 5. I don't want to display 5; I want to display only the records that have 1, 2, 3 or 4. How can I achieve this? Could anyone please help me with this?

If you are storing 1..5 as the values of your column, there is no other way: you need to filter on the client side. You can't filter on the Cassandra side.
If 1..5 are themselves your column names, like
row_1 => { 1: value, 2:value... }
row_2 => { 1: value, 2:value... }
..
row_3 => { 1: value, 2:value... }
then you can use
SELECT 1..4 from YOUR_COLUMN_FAMILY where key='yourKey'
As another option, if you can have 1..5 as separate rows, like
1 => { c1: value, c2:value... }
2 => { c1: value, c2:value... }
..
5 => { c1: value, c2:value... }
you can do
SELECT * from YOUR_COLUMN_FAMILY where key in (1,2,3,4)
You can also have a look at secondary indexes on columns.
Because of the secondary index created on the columns, their
values can be queried directly.
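
The client-side filtering suggested in the first option could look roughly like the following Python sketch. It is an illustration only: the rows are made-up dictionaries standing in for whatever your Cassandra client returns, and `filter_by_service` and `ALLOWED_SERVICES` are hypothetical names, not part of any driver.

```python
# Hypothetical rows, as a client might hold them after reading everything.
# Only the "service" column name comes from the question; the rest is made up.
rows = [
    {"key": "a", "service": "1"},
    {"key": "b", "service": "5"},
    {"key": "c", "service": "3"},
    {"key": "d", "service": "4"},
]

# The list of values we want to keep (everything except "5").
ALLOWED_SERVICES = {"1", "2", "3", "4"}


def filter_by_service(records, allowed):
    """Keep only the records whose 'service' value is in the allowed set."""
    return [r for r in records if r.get("service") in allowed]


wanted = filter_by_service(rows, ALLOWED_SERVICES)
print([r["key"] for r in wanted])
```

Because the filter runs on the client, every row is still read from the cluster first, which is why the other two layouts in the answer may be preferable for large data sets.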

Related

Find the column in Subquery coalesce function

I am using the Coalesce function to return a value from my preferred ranking of Columns but I also want to include the name of the column that the value was derived from.
i.e.
Table:
Apples  Pears  Mangos
NULL    4      5
SQL:
;with CTE as
(
Select
Coalesce(Apples,Pears,Mangos) as QTY_Fruit
from Table
) select *, column name from QTY_Fruit
from CTE
Result:
QTY_Fruit  Col Name
4          Pears
I am trying to avoid a case statement if possible because there are about 12 fields that I will need to use in my Coalesce. Would love for an easy way to pull the column name based on value in QTY_Fruit. I'm all ears if the answer lies outside the use of subqueries but I figured this would be a start.

Spark: filter out all rows based on key/value

I have an RDD, x, in which I have two fields: id, value. If a row has a particular value, I want to take the id and filter out all rows with that id.
For example if I have:
id1,value1
id1,value2
and I want to filter out all ids if any rows with that id have value value1, then I would expect all rows to be filtered out. But currently only the first row is filtered out because it has a value of value1.
I've tried something like
val filter = x.filter(row => (set contains row.value))
This filters out all rows with a particular value, but leaves the other rows with the same id still in the RDD.

You have to apply a filter function to each RDD row, and the function after the => should treat the row as an array and test whether or not it includes the token at a given index. You might have to adjust the index of the token, but it should look something like this (whether you use contains or !contains depends on whether you want to filter in or filter out):
val filteredRDD = rawRDD
.filter(rowItem => !(rowItem.map(_.toString).toSeq
.contains(rowItem(0).toString)))
or even something like:
val filteredRDD = rdd1.filter(rowItem => !(rowItem._2 contains rowItem._1))
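
The behaviour the question asks for (drop every row of an id as soon as any of its rows has a blocked value) needs two passes: one to find the offending ids, one to filter on them. Here is a minimal sketch in plain Python rather than Spark, with ordinary lists standing in for the RDD; `drop_ids_with_blocked_value` is a hypothetical helper name.

```python
# Plain-Python stand-in for the RDD of (id, value) pairs from the question,
# plus one extra made-up row so that something survives the filter.
rows = [
    ("id1", "value1"),
    ("id1", "value2"),
    ("id2", "value3"),
]

blocked_values = {"value1"}


def drop_ids_with_blocked_value(pairs, blocked):
    """Remove every row whose id has at least one row with a blocked value."""
    # Pass 1: collect the ids that appear together with a blocked value.
    bad_ids = {row_id for row_id, value in pairs if value in blocked}
    # Pass 2: keep only rows whose id was never seen with a blocked value.
    return [(row_id, value) for row_id, value in pairs if row_id not in bad_ids]


print(drop_ids_with_blocked_value(rows, blocked_values))
```

In Spark the same idea would presumably become one filter that gathers the distinct offending ids, followed by a second filter (or a join) against that set.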

Eliminate duplicates and Insert Unique records having max no. of column values present through Talend

I have an excel file which gets updated on a daily basis i.e the data is always different every time.
I am pulling the data from the Excel sheet into the table using Talend. I have a primary key Company_ID defined in the table.
The problem I am facing is that the Excel sheet has a few duplicate Company_ID values. It will also pick up more duplicate values in the future, as the Excel file is updated daily.
I want to choose the first record where the Company_ID field is 1 and the record doesn't have null in the rest of the columns. Also, for a Company_ID of 3 there is a null value in one column, which is OK since it is a unique record for that Company_ID.
In Talend, how do I choose the unique row that has the maximum number of column values present, for example in the case of Company_ID 1?

tUniqRow is usually the easiest way to handle duplicates.
If you are worried that the first row coming to tUniqRow may not be the first row that you want there, you can sort your rows, so they enter tUniqRow in your preferred order:
(used components: tFileInputExcel, tJavaRow, tSortRow, tUniqRow, tFilterColumns)
In your particular case, the tJavaRow could look like this:
// Code generated according to input schema and output schema
output_row.company_id = input_row.company_id;
output_row.name = input_row.name;
output_row.et_cetera = input_row.et_cetera;
// End of pre-generated code
int i = 0;
if (input_row.company_id == null) { i++; }
if (input_row.name == null) { i++; }
if (input_row.et_cetera == null) { i++; }
output_row.priority = i;
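
To make the whole flow easier to follow outside Talend, here is a rough Python sketch of the same idea: count the nulls per row (the tJavaRow step), sort so that the most complete row comes first (tSortRow), then keep the first row per key (tUniqRow). The sample records and the helper names `null_count` and `keep_most_complete` are invented for illustration.

```python
# Made-up sample rows; None plays the role of a null cell in the Excel sheet.
records = [
    {"company_id": 1, "name": None, "et_cetera": "x"},
    {"company_id": 1, "name": "Acme", "et_cetera": "x"},
    {"company_id": 3, "name": "Beta", "et_cetera": None},
]


def null_count(record):
    """Number of null columns in a row (the 'priority' of the tJavaRow step)."""
    return sum(1 for value in record.values() if value is None)


def keep_most_complete(rows, key="company_id"):
    """Keep one row per key, preferring the row with the fewest nulls."""
    # Sort by key, then by null count, so the best row of each key comes first.
    ordered = sorted(rows, key=lambda r: (r[key], null_count(r)))
    unique = {}
    for row in ordered:
        # setdefault only stores the first row seen for each key.
        unique.setdefault(row[key], row)
    return list(unique.values())


for row in keep_most_complete(records):
    print(row)
```

The sketch assumes the key column itself is never null; rows with an equal null count keep their original relative order because Python's sort is stable.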

Reading the most recent updated row in cassandra

I have a use case and want suggestion on the below.
Structure :
Rowkey_1:
Column1 = value1;
Column2 = value2;
Rowkey_2:
Column1 = value1;
Column2 = value2;
Suppose I am writing 1000 rows into Cassandra, with each row having a couple of columns. After some time I update only 100 rows and change their column values.
When I read data from Cassandra, I only want to get these 100 updated rows, not the entire row key information.
Is there a way to tell Cassandra: give me all row keys from start -> end where the time is between "Time_start" and "Time_end"?
In SQL lingo: select from "" to "" where time between "time_start" and "time_end".
P.S. I read Basic Time Series with Cassandra, where it says you can annotate the row key like below:
Inserting data: {:key => 'server1-load-20110306', :column_name => TimeUUID(now), :column_value => 0.75}
Here the column family has TimeUUID columns.
My question is: can you annotate your row key with a date and time like this: { :key => 2012-11-18 16:00:15 }
Or is there any other way to get only the recently updated rows?
Any suggestions or guidance would be really appreciated.

You can't do range queries on keys unless you use ByteOrderedPartitioner, which you shouldn't use. The way to do this is by writing known sentinel values as keys, such as a timestamp representing the beginning of the day. Then you can do the column slice by time.
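
As a rough illustration of the sentinel-key idea, the Python sketch below models it in memory: each update is also recorded under a row key derived from the start of its day, with the update time as the column name, so a time slice inside one known key finds the updated rows. This is not Cassandra client code; `updates`, `day_bucket`, `record_update` and `updated_between` are hypothetical names, and the date-string key format is an assumption.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime

# In-memory stand-in for an index column family:
# sentinel row key (one per day) -> time-ordered list of (timestamp, row key).
updates = defaultdict(list)


def day_bucket(ts):
    """Sentinel row key for a timestamp: the day it falls on (assumed format)."""
    return ts.strftime("%Y-%m-%d")


def record_update(ts, row_key):
    """Record that row_key was updated at ts, keeping columns sorted by time."""
    insort(updates[day_bucket(ts)], (ts, row_key))


def updated_between(bucket, start, end):
    """Row keys updated within [start, end] inside one day bucket."""
    return [key for ts, key in updates.get(bucket, []) if start <= ts <= end]


record_update(datetime(2012, 11, 18, 16, 0, 15), "Rowkey_1")
record_update(datetime(2012, 11, 18, 9, 30, 0), "Rowkey_2")
record_update(datetime(2012, 11, 19, 8, 0, 0), "Rowkey_3")

recent = updated_between(
    "2012-11-18",
    datetime(2012, 11, 18, 9, 0, 0),
    datetime(2012, 11, 18, 17, 0, 0),
)
print(recent)
```

A query spanning several days would read one sentinel key per day, since each key is known in advance and no range scan over keys is required.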

Cassandra rangequery, knowing the beginning of start key

Let me clarify it
I have a column family with UTF8 keys like this: apple#NewYork, banana#LosAngeles, banana#NewYork, cherry#NewYork, ... and so on.
I need this because the keys are sorted, and I would like to get all keys starting with 'banana'.
Is it possible, or is there a workaround?

How about composite types?
Map your current rows to columns like
row_key => {
banana:a => "value you wish",
banana:b => "value you wish",
...
}
The advantages of composite types are:
They preserve the type of each component while sorting.
In case your composite key is a:b:c, you can query for all columns in the range of a or a:b, and also make point queries like a:b:c.
Now you can perform Column Slice for example using phpcassa
Ex:
row_key => { 1:a => 1, 1:b => 1, 10:bb => 1, 1:c => 2}
ColumnSlice(array(1), array(1)) => All Columns with first component equal to 1
ColumnSlice(array(1), array(10)) => All Columns with first component between 1 and 10
ColumnSlice("", array(1, 'c')) => All columns from the beginning of the row up to the composite 1:c
You can do all of the above in reverse, and you can also play with inclusive and exclusive limits.
One more point to remember: you can't directly ask for all columns where the second component is < x while skipping the first component.
Row keys support composite types too, but if you are using the random partitioner that makes no sense, because rows are not stored in key order.
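
To show why sorted composite column names make the 'banana' lookup cheap, here is a small Python model of a column slice on the first component. Tuples stand in for composite column names and Python's tuple ordering stands in for the composite comparator; that is an approximation, and `slice_by_first_component` is a hypothetical helper, not a phpcassa call.

```python
from bisect import bisect_left

# Composite column names of one row, kept sorted the way a composite
# comparator would keep them (modelled here with Python tuple ordering).
columns = sorted([
    ("apple", "NewYork"),
    ("banana", "LosAngeles"),
    ("banana", "NewYork"),
    ("cherry", "NewYork"),
])


def slice_by_first_component(sorted_columns, first):
    """Return the contiguous run of columns whose first component equals first."""
    # (first,) sorts before every (first, anything), so this finds the run start.
    lo = bisect_left(sorted_columns, (first,))
    hi = lo
    while hi < len(sorted_columns) and sorted_columns[hi][0] == first:
        hi += 1
    return sorted_columns[lo:hi]


print(slice_by_first_component(columns, "banana"))
```

The matching columns form one contiguous range, which is what lets a slice with the same first component as start and end return them without scanning the whole row.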
