Cassandra: How to check if a column value is already present - cassandra

I am using Cassandra. There is a column name of type text which stores usernames, e.g.:
name
------
bob
Bob
bobby
mike
michael
micky
BOB
I have 2 questions:
I have to select all user names that start with 'bo'. I know there is no LIKE equivalent in Cassandra, but is there any way to achieve that? (An additional column is an option, but is there something else?)
There are 3 entries: bob, Bob and BOB. Is there any way to fetch all 3 rows if I pass WHERE name='bob'? I need to fetch the names case-insensitively.
Thanks in advance.

Let's start with the second question first. If you want to support case-insensitive queries, you should store a second, upper-case copy of the text data you want to search for in another column. Then by querying by that column you'll be able to do case-insensitive requests.
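A minimal Python sketch of that write-time normalization (illustrative names, not the actual Cassandra driver API):

```python
# Sketch of the shadow-column idea: store an upper-cased copy of the
# name at write time and query by it. The structure here is
# illustrative only -- in Cassandra, name_upper would be a real column.
rows = []

def insert_user(name):
    # name_upper plays the role of the extra, query-by column
    rows.append({"name": name, "name_upper": name.upper()})

def find_case_insensitive(name):
    key = name.upper()
    return [r["name"] for r in rows if r["name_upper"] == key]

for n in ["bob", "Bob", "bobby", "mike", "michael", "micky", "BOB"]:
    insert_user(n)

print(find_case_insensitive("bob"))  # ['bob', 'Bob', 'BOB']
```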
Going back to searches for bo*. The best way to do that is to use a schema that allows you to leverage clustering columns (columns 2 and higher of the primary key) for range searches. Here is an example:
CREATE TABLE t1 (region INT, name TEXT, PRIMARY KEY (region, name));
In particular, if you make name the second column of the key, you will be able to perform searches such as
SELECT * FROM t1 WHERE name >= 'bo' and name < 'bp' ALLOW FILTERING;
which will return the results you're looking for. This only works for trailing wildcards: the leading characters have to be constant for range queries to work. Again, if you want to do case-insensitive searches, make the upper-cased copy the second part of the primary key (e.g. CREATE TABLE t2 (region INT, name_upper TEXT, name TEXT, PRIMARY KEY (region, name_upper))) and query by it:
SELECT * FROM t2 WHERE name_upper >= 'BO' AND name_upper < 'BP' ALLOW FILTERING;
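The exclusive upper bound of the range is just the prefix with its last character bumped ('bo' → 'bp'). A small Python sketch of computing it, assuming the last character is not the maximum code point:

```python
def prefix_range(prefix):
    # Exclusive upper bound for a trailing-wildcard search: increment
    # the last character, so 'bo' -> ('bo', 'bp').
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

lo, hi = prefix_range("bo")
names = ["bob", "Bob", "bobby", "mike", "michael", "micky"]
# Same filter the CQL range query expresses:
print([n for n in names if lo <= n < hi])  # ['bob', 'bobby']
```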

Related

When and why are Google Cloud Spanner table and column names case-sensitive?

Spanner documentation says:
Table and column names:
Can be between 1-128 characters long. Must start with an uppercase or lowercase letter.
Can contain uppercase and lowercase letters, numbers, and underscores, but not hyphens.
Are case-insensitive. For example, you cannot create tables named mytable and MyTable in the same database or column names mycolumn and MyColumn in the same table.
https://cloud.google.com/spanner/docs/data-definition-language#table_statements
Given that, I have no idea what this means:
Table names are usually case insensitive, but may be case sensitive
when querying a database that uses case sensitive table names.
https://cloud.google.com/spanner/docs/lexical#case-sensitivity
In fact it seems that table names are case-sensitive, for example:
Queries fail if we don't match the case shown in the UI.
This seems to be an error in the documentation. Table names are case insensitive in cloud spanner. I'll follow up with the docs team.
Edit: Updated docs https://cloud.google.com/spanner/docs/data-definition-language#naming_conventions
I'll add a couple of examples so we can see the difference.
Table names are case sensitive. In Example 1 it does not matter, since there is only one table:
Example 1:
SELECT *
FROM Roster
WHERE LastName = @myparam
returns all rows where LastName is equal to the value of query parameter myparam.
But in Example 2 we are comparing two tables, or making other kinds of queries that use multiple tables:
SELECT id, name FROM Table1
EXCEPT
SELECT id, name FROM Table2
It will give you everything in Table1 but not in Table2.
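EXCEPT semantics are standard SQL, so the behavior can be sketched locally with Python's sqlite3 as a stand-in for Spanner (table names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, name TEXT);
    CREATE TABLE Table2 (id INTEGER, name TEXT);
    INSERT INTO Table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO Table2 VALUES (2, 'b');
""")
# Rows in Table1 that do not appear in Table2
rows = conn.execute(
    "SELECT id, name FROM Table1 EXCEPT SELECT id, name FROM Table2"
).fetchall()
print(sorted(rows))  # [(1, 'a'), (3, 'c')]
```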

How to split name data in the same column in PostgreSQL?

I am new to PostgreSQL and am using PGADMIN 4 on a Mac. I have one column of imported data that has some usernames, sometimes a last name and mostly a first and last name in the same column.
I mainly want to query and count the most frequent names in the column; I can tell from the results whether a name is a first or last name. Listing the top 50 should do it. Please assist with the specific code, including how to address the table and column.
Have played with this, but need more:
select surname, count(*) from atreedata
group by surname
order by count(*) desc limit 40;
Works great with only one name! I need the most common names listed by name and count.
Common Column Example:
John Smith
jsmith3
Stacey123
Bob Smith
Jones
So, if I understand it correctly, you just need to find the most frequent words in the surname column.
There's a built-in function regexp_split_to_table that can split strings to words and creates rows from those words. So:
select surname_word, count(*) as surname_word_count
from (
select regexp_split_to_table(surname, E'\\s+') as surname_word
from atreedata
) as surname_words
group by surname_word
order by surname_word_count desc
limit 40;
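The same split-then-count logic can be sanity-checked in Python against the sample data, with re.split playing the role of regexp_split_to_table:

```python
import re
from collections import Counter

# Sample values from the question's surname column
surnames = ["John Smith", "jsmith3", "Stacey123", "Bob Smith", "Jones"]

# Split each value on whitespace (like E'\\s+'), flatten into one list
# of words, then count occurrences of each word.
words = [w for s in surnames for w in re.split(r"\s+", s)]
print(Counter(words).most_common(3))
```

The top entry is ('Smith', 2), matching what the regexp_split_to_table query would return for this data.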

number listing for conditions in sqlite

I have a table in which I am trying to create a column that will increment based on conditions being valid.
In my example, I need to update the department count where the department = Marketing. Anyone have a good way to do this using SQLite?
Current Table
Name      Department  Department_Count
James     Accounting  NULL
Jennifer  Marketing   NULL
Micheal   Warehouse   NULL
Natalie   Marketing   NULL
Rebecca   Marketing   NULL
Update Table
Name      Department  Department_Count
James     Accounting  NULL
Jennifer  Marketing   1
Micheal   Warehouse   NULL
Natalie   Marketing   2
Rebecca   Marketing   3
Edit:
Currently, I insert rows where the department is 'Marketing' into a new table and then use the primary key or rowid to auto-increment, so I can number these rows.
This requires creating a new table, which is not ideal: it takes up space and is redundant, since the underlying data is already in the original table.
I'm using python to interact with my database, if that helps with solving this problem.
UPDATE
Actually, thinking about it a little further, you may not need a trigger:
INSERT INTO Table (Department, Department_Count)
VALUES (?, (SELECT IFNULL(MAX(Department_Count), 0) + 1 FROM Table WHERE Department = ?))
may give you what you want.
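A runnable sketch of that insert-time counter using Python's sqlite3 (table name emp is illustrative; note that under this approach every department gets its own running counter):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (Name TEXT, Department TEXT, Department_Count INTEGER)"
)

def insert_emp(name, dept):
    # The subquery computes max(existing counter for this department)+1,
    # falling back to 1 when the department has no counted rows yet.
    conn.execute(
        "INSERT INTO emp VALUES (?, ?, "
        "(SELECT IFNULL(MAX(Department_Count), 0) + 1 "
        " FROM emp WHERE Department = ?))",
        (name, dept, dept),
    )

for name, dept in [("James", "Accounting"), ("Jennifer", "Marketing"),
                   ("Natalie", "Marketing"), ("Rebecca", "Marketing")]:
    insert_emp(name, dept)

result = conn.execute(
    "SELECT Name, Department_Count FROM emp WHERE Department = 'Marketing'"
).fetchall()
print(result)  # [('Jennifer', 1), ('Natalie', 2), ('Rebecca', 3)]
```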
Original Answer
You cannot do this declaratively, but you can probably accomplish what you want procedurally using a trigger.
Two possible strategies:
Use an AFTER INSERT trigger to execute an UPDATE statement against the most recently inserted row (its rowid is available in an AFTER INSERT trigger) to set the Department_Count column to a SELECT expression based on the current data in the table.
Use an INSTEAD OF trigger to perform an alternate INSERT combining the values from the NEW cursor with a similar SELECT statement to get the maximum value (plus 1) from the Department_Count column.
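A sketch of the first strategy, using Python's sqlite3 and the sample data from the question (the trigger name and WHEN condition are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (Name TEXT, Department TEXT, Department_Count INTEGER);

    -- Strategy 1: an AFTER INSERT trigger fills in the counter for the
    -- row just inserted, but only for the department we care about.
    CREATE TRIGGER emp_count AFTER INSERT ON emp
    WHEN NEW.Department = 'Marketing'
    BEGIN
        UPDATE emp
        SET Department_Count =
            (SELECT IFNULL(MAX(Department_Count), 0) + 1
             FROM emp WHERE Department = NEW.Department)
        WHERE rowid = NEW.rowid;
    END;
""")
for name, dept in [("James", "Accounting"), ("Jennifer", "Marketing"),
                   ("Micheal", "Warehouse"), ("Natalie", "Marketing"),
                   ("Rebecca", "Marketing")]:
    conn.execute("INSERT INTO emp (Name, Department) VALUES (?, ?)",
                 (name, dept))

result = conn.execute("SELECT Name, Department_Count FROM emp").fetchall()
print(result)
```

This reproduces the "Update Table" from the question: Marketing rows are numbered 1, 2, 3 and the other departments keep NULL.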

Why does Cassandra/CQL restrict WHERE clauses on columns that are not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
when I try to search by: "select * from emp where first_name='John';"
cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue and everywhere it says to add a secondary index for the column first_name.
But I need to know the exact reason why that column needs to be indexed.
The only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows, which is not supported.
The data is internally organised into something one can compare to a HashMap[X, SortedMap[Y, Z]]. The key of the outer map is the partition key value, and the key of the inner map is a kind of concatenation of all clustering column values and the name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
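A toy Python model of that HashMap[X, SortedMap[Y, Z]] layout makes the restriction concrete: a query by partition key is a single lookup, while a query on a regular column has no path to the data and must scan everything (data values are illustrative, column names from the emp table above):

```python
# Outer dict key = partition key (empid); inner dict key = clustering
# value (deptid); inner values = the regular columns.
table = {
    1: {10: {"first_name": "John", "last_name": "Doe"}},
    2: {20: {"first_name": "Jane", "last_name": "Roe"},
        30: {"first_name": "Jane", "last_name": "Poe"}},
}

# WHERE empid = 2: a single hash lookup, no scanning.
print(table[2])

# WHERE first_name = 'John': every partition and every clustering key
# must be visited -- this is the full scan Cassandra refuses to do
# without a secondary index (or ALLOW FILTERING).
matches = [(empid, deptid)
           for empid, part in table.items()
           for deptid, row in part.items()
           if row["first_name"] == "John"]
print(matches)  # [(1, 10)]
```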
You may read about what is allowed and what is not here
Alternatively, you can create a secondary index in Cassandra, but that will hamper your write performance.

Cassandra super column structure

I'm new to Cassandra, and I'm not familiar with super columns.
Consider this scenario: suppose we have some fields of a customer entity, like
Name
Contact_no
address
and we can store all these values in normal columns. I want to arrange the data so that when a person moves from one location to another (a field could store the longitude and latitude), the values are stored consecutively with respect to customer location. I think we can do this with super columns, but I'm confused about how to design the schema to accomplish this.
Please help me to create this schema and come to understand the concepts behind super columns.
Super columns are really not recommended anymore... they are still used, but more and more people have switched to composite columns. For example, playOrm uses this concept for indexing. If I am indexing an integer, an indexing row may look like this:
rowkey = 10.pk56 10.pk39 11.pk50
Here the column name type is a composite of integer and string. These rows can hold up to about 10 million columns, though I have only run experiments up to 1 million myself. For example, playOrm's queries use these types of indexes to do a query that took 60 ms on 1,000,000 rows.
With playOrm, you can do scalable relational models in NoSQL... you just need to figure out how to partition your data correctly, as you can have as many partitions as you want in each table, but a partition should really not exceed 10 million rows.
Back to the example though: if you have a table with columns numShares, price, username, age, you may want to index numShares, and the above row would be that index. You could then grab the index by key, or better yet, grab all column names with numShares > 20 and numShares < 50.
Once you have those columns, you can take the second half of each column name, which is the primary key. The reason the primary key is part of the column name rather than the value is that, as in the example above, there are two rows pk56 and pk39 with the same value 10, and you can't have two columns named 10, but you can have 10.pk56 and 10.pk39.
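A small Python sketch of such an index row, with composite column names modeled as (value, primary_key) tuples kept in sorted order (values are illustrative):

```python
# Composite column names: the value being indexed comes first, so the
# sorted order groups equal values and supports range scans; the
# primary key suffix keeps each name unique (10.pk39 vs 10.pk56).
index_row = sorted([(10, "pk56"), (10, "pk39"), (35, "pk12"),
                    (42, "pk77"), (60, "pk90")])

# "numShares > 20 and numShares < 50": scan the sorted range and keep
# the primary-key half of each matching composite name.
pks = [pk for value, pk in index_row if 20 < value < 50]
print(pks)  # ['pk12', 'pk77']
```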
later,
Dean
