Cassandra (Pycassa/CQL) Return Partial Match

Cassandra (Pycassa/CQL) Return Partial Match - cassandra

I'm trying to do a partial search through a column family in Cassandra similar to an SQL query like: SELECT * FROM columnfamily WHERE col = 'val*' where val* means any value matching at least the first three characters 'val'.
I've read datastax's documentation on the SELECT function, but can't seem to find any support for the partial WHERE criteria. Any ideas?

There is no wildcard support like this in Cassandra, but you can model your data in such a way that you could get the same end result.
You would take the column that you want to perform this query on and denormalize it into a second column family. This CF would have a single wide row with the column name as the value of the col you want to do the wild card query on. The column value for this CF could either be the row key for the original CF or some other representation of the original row.
Then you would use slicing to get out the values you care about. For example if this was the wide row to slice on:
+---------+----------+--------+----------+---------+--------+----------+
| RowKey | aardvark | abacus | abacuses | abandon | accent | accident |
| +----------+--------+----------+---------+--------+----------+
| | | | | | | |
| | | | | | | |
+---------+----------+-----------------------------+--------+----------+
Using CQL you could select out everything starting with 'aba*' using this query*:
SELECT 'aba'..'abb' from some_cf where RowKey = some_row_key;
This would give you the columns for 'abacus', 'abacuses', and 'abandon'.
There are some things to be aware of with this strategy:
In the above example, if you have things with the same column_name you need to have some way to differentiate between them (otherwise inserting into the wide column family will clobber other valid values). One way that you could do this is by using a composite column of word:some_unique_value.
The above model only allows wild cards at the end of the string. Wild cards at the beginning of the string could also easily be handled with a few modifications. Wild cards in the middle of a string would be much more challenging.
Remember that Cassandra doesn't give you an easy way to do ad-hoc queries. Instead you need to figure out how you will be using the data and model your CFs accordingly. Take a look at this blog post from Ed Anuff on indexing data in Cassandra for more info on modeling data like this.
*Note that the CQL syntax for slicing columns is changing in an upcoming release of Cassandra.

Related

MariaDB 10.2.19 on Linux - Performance very slow

Main issue - I have a query joining two tables that are indexed.. When I run the query, even after five hours, it's still running.. I've never gotten results back. it's like a never ending query.
This is a new database that we just installed and really the first query we have tried. Performance is very slow.. At this point, I'm trying to figure
out if the issue is the SERVER/CPU/Network/IO/RAM or the database itself (need to tune configuration file) or if it's the query. When I do explain plan,
I do see indexes being used. Is my volume of data too big for MariaDB to handle based on the numbers presented below? Where do I begin to troubleshoot?
Any help is greatly appreciated.
Specs -
• We have MariaDB 10.2.19 on linux rhel7.5
• Our database is mainly used to store monthly data and then we report on it and do analytic work.
• We have around 70 plus tables.
• 3-4 users
• I am running a query that joins two tables. The size of the tables are listed below. The query is not complicated – a simple Inner join on columns and for both tables I am looking for data from the year/month of 201911. The two tables are joined by two columns and both columns are indexed on both tables. I have modified the query to many variation and none of the queries return any rows… even after running for 5 hours, it’s still never finished and I had to kill the query.
Size of (Innodb) Database 1919.69G
Size of table1 - daily_2019
Total Rows 1,034,987,628
Data I’m analyzing (using where clause) 73,895,929
Total table size 230.62GB
Size of table2 - ZIPCODES
Total Rows 68,429,146
Data I’m analyzing (using where clause) 3,998,975
Total table size 28.70GB
Explain plan - Tables daily_2019 and zipcodes each have 3 indexes on three different fields. Col1, col2 and col3 each have indexes both for daily_2019 and zipcodes. Two keys are used in the explain plan. One from each table.
We have set the innodb_buffer_pool_size to 128G
Explain Plan
+------+-------------+-------+------+-------------------------------------+-------------+---------+-----------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+------+-------------+-------+------+-------------------------------------+-------------+---------+-----------------------+---------+------------------------------------+
| 1 | SIMPLE | T2 | ref | ZIP_I2,ZIP_I1,ZIP_I3 | ZIP_I1 | 5 | const | 7994358 | Using where
| 1 | SIMPLE | T1 | ref | ADHOC_I2,ADHOC_I1,ADHOC_I3 | ADHOC_I3 | 13 | T2.ID | 12 | Using index condition; Using where
+------+-------------+-------+------+-------------------------------------+-------------+---------+-----------------------+---------+------------------------------------+

Select From Cassandra with Order By and Include position in order by

I have a table in cassandra which has the following structure
CREATE TABLE example (session text, seq_number int, Primary key((session), seq_num))
In the data I would expect all sequence numbers to be in the table starting at 0.
So a table might look like
session text| seq_number|
session1 | 0 |
session1 | 1 |
session1 | 2 |
session1 | 4 | // bad row, means missing data
In this example I would like to read only the first 3 rows. I don't want the fourth row because it is not 3 and it is at index 3. Is this possible?

Possible yes, you can use group by on session and a user defined aggregation function. It may be more work than its worth though, if you just set fetch size low (say 100) on queries then iterate through resultset on client side it might save you a lotta work and potentially even be more efficient overall. I would recommend implementing the client side solution first and benchmarking it to see if its even necessary or beneficial.

Are there any major disadvantages to having multiple clustering columns in cassandra?

I'm designing a cassandra table where I need to be able able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more so than I'm currently able to.
The current table schema is this, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range filter on the full geohash, allowing me to prefix search based on a 5 or greater length geohash:
CREATE TABLE georecords (geo_key text,geohash text, data text) PRIMARY KEY (geo_key, geohash))
My idea is that I could instead store the characters of the geohash as seperate columns, allowing me to specify as many caracters as I wanted, to do a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:
CREATE TABLE georecords (g1 text,g2 text,g3 text,g4 text,g5 text,g6 text,g7 text,g8 text,geohash text, data text) PRIMARY KEY (g1,g2,g3,g4,g5,g6,g7,g8,geohash,pid))
(I'm not really concerned about the cardinality of the partition key - g1 would have minimum 30 values, and I have other workarounds for it as well)
Other that cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many cluster column approach?

Other that cardinality of the partition key, and extra storage requirements, what should I be aware of if I used the many cluster column approach?
This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
aploetz#cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROm georecords3;
g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash | pid | data
----+----+----+----+----+----+----+----+--------------+------+---------------
d | p | 8 | 9 | v | c | n | e | dp89vcnem4n | 1001 | Beloit, WI
d | p | 8 | c | p | w | g | v | dp8cpwgv3 | 1003 | Harvard, IL
d | p | c | 8 | g | e | k | t | dpc8gektg8w7 | 1002 | Sheboygan, WI
9 | x | j | 6 | 5 | j | 5 | 1 | 9xj65j518 | 1004 | Denver, CO
(4 rows)
As you know, Cassandra is designed to return data with a specific, precise key. Using multiple clustering columns helps in that approach, in that you are helping Cassandra quickly identify the data you wish to retrieve.
The only thing I would think about changing, is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is that of uniqueness, which you will need if you plan on storing the same geohashes multiple times.
Including pid in your PRIMARY KEY leaves you with one non-key column, and that allows you to use the WITH COMPACT STORAGE directive. Really the only true edge that gets you, is in saving disk space as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:
Without compact storage:
[default#stackoverflow] list georecords3;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
2 Rows Returned.
Elapsed time: 217 msec(s).
With compact storage:
[default#stackoverflow] list georecords2;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
2 Rows Returned.
Elapsed time: 39 msec(s).
But, I would recommend against using WITH COMPACT STORAGE for the following reasons:
You cannot add or remove columns after table creation.
It prevents you from having multiple non-key columns in the table.
It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.
I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set, that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
The other issue with storing everything in the same partition key, is that each partition can store a max of about 2 billion columns. So it would also make sense to put some though behind whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.
By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.
Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.
Edit
The still unresolved question for me is which approach will scale better, in which situations.
Scalability is more tied to how many R replicas you have across N nodes. As Cassandra scales linearly; the more nodes you add, the more transactions your application can handle. Purely from a data distribution scenario, your first model will have a higher cardinality partition key, so it will distribute much more evenly than the second. However, the first model presents a much more restrictive model in terms of query flexibility.
Additionally, if you are doing range queries within a partition (which I believe you said you are) then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node. So querying multiple results for g1='d' AND g2='p'...etc...will perform extremely well.
I may just have to play with the data more and run test cases.
That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.

Here's the best Cassandra modeling guide I've found: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
I've used composite columns (6 of them) successfully for very high write/read loads. There is no significant performance penalty when using compact storage (http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html).
Compact storage means the data is stored internally in a single row, with the limitation that you can only have one data column. That seems to suit your application well, regardless of which data model you choose, and would make maximal use of your geo_key filtering.
Another aspect to consider is that the columns are sorted in Cassandra. Having more clustering columns will improve the sorting speed and potentially the lookup.
However, in your case, I'd start with having the geohash as a row key and turn on row cache for fast lookup (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1). If the performance is lacking there, I'd run performance tests on different data representations.

CQL select showing encoded values

I am new to cassandra and trying my hands on basic commands. The following is how I am inserting using cassandra-cli
set contactManagementSystem['rowkey3']['firstName'] ='xyz';set contactManagementSystem['rowkey3']['lastName'] ='abc';
but when i try to view those values on CQLSH this is what it shows:
cqlsh:test> select * from "contactManagementSystem";
key | column1 | value
-----+-
0x726f776b657933 | 0x66697273744e616d65 | 0x41616b61
0x726f776b657933 | 0x6c6173744e616d65 | 0x4d
0x726f776b657933 | 0x70686f6e65 | 0x3631372d3132332d373839
I just wanted to understand why is it happening like this and what am I doing wrong? (apologies for the weird looking code. I do not have enough reputations to post images)

While creating table/columnfamily using cassandra-cli the default datatype of columns will be BytesType. So when you describe/select the data it will show you in bytes format. But you can declare the datatype for columns while creating column family.
You can find it here
Also you can create column family using CQLSH as well, and you can declare datatype for each column. Here you can find how to create table/column family using cqlsh.
And doing CRUD using CQLSH is very simple.
Hope it will help you.

Find rows without a specific column family in Accumulo?

Is there a straightforward way to find a row in Accumulo that does not have a specific column family?
For example, here is some simple sample data (omitting timestamp and visibility):
r|cf|cq|v
1|A | |
2|A | |
2|B | |
3|A | |
3|B | |
4|C | |
I'd like to create a scanner that looks for rows without a "B" column family. In this case, it would return row 1 and 4.

There isn't a specific API call in Accumulo that you can use, but this is a great example as to why Accumulo's (SortedKeyValue)Iterator concept is cool. We can write a small amount of code and perform this filtering on the server instead of on the client.
Rather than leave you hanging, here's some code: https://github.com/joshelser/RowsWithoutColumns
Specifically, you can find the iterator: https://github.com/joshelser/RowsWithoutColumns/blob/master/src/main/java/accumulo/RowsWithoutColumnIterator.java
And some code that invokes it: https://github.com/joshelser/RowsWithoutColumns/blob/master/src/test/java/test/RowsWithoutColumnIteratorTest.java
A few things to note, the RowsWithoutColumnIterator needs to buffer an entire row in memory to accomplish what you're asking. This approach will run you out of memory if you have rows with many many columns. If you have 1,000 columns per row (each key-value being 1KB), the server will have to keep 1MB in memory. If you don't have wide columns, this isn't an issue. This example also depends on 1.5.0 but this code can run against any version of Accumulo (if you change some API calls in the test case).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string