Find rows without a specific column family in Accumulo?

Is there a straightforward way to find a row in Accumulo that does not have a specific column family?
For example, here is some simple sample data (omitting timestamp and visibility):
r | cf | cq | v
--+----+----+--
1 | A  |    |
2 | A  |    |
2 | B  |    |
3 | A  |    |
3 | B  |    |
4 | C  |    |
I'd like to create a scanner that looks for rows without a "B" column family. In this case, it would return rows 1 and 4.

There isn't a specific API call in Accumulo that you can use, but this is a great example of why Accumulo's (SortedKeyValue)Iterator concept is so useful. We can write a small amount of code and perform this filtering on the server instead of on the client.
Rather than leave you hanging, here's some code: https://github.com/joshelser/RowsWithoutColumns
Specifically, you can find the iterator: https://github.com/joshelser/RowsWithoutColumns/blob/master/src/main/java/accumulo/RowsWithoutColumnIterator.java
And some code that invokes it: https://github.com/joshelser/RowsWithoutColumns/blob/master/src/test/java/test/RowsWithoutColumnIteratorTest.java
A few things to note: the RowsWithoutColumnIterator needs to buffer an entire row in memory to accomplish what you're asking, so this approach can run you out of memory if you have rows with very many columns. If you have 1,000 columns per row (each key-value pair being 1KB), the server will have to keep 1MB in memory per row. If you don't have wide rows, this isn't an issue. The example also depends on 1.5.0, but the code can run against any version of Accumulo (if you change some API calls in the test case).
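To give a feel for the shape of that iterator, here is a minimal sketch (not the linked code itself) built on Accumulo's WholeRowIterator, whose filter hook is handed one fully buffered row at a time; the class name and the hard-coded "B" family are illustrative:

import java.util.List;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.hadoop.io.Text;

// Emits only the rows that do NOT contain the "B" column family.
public class RowsWithoutColumnFamilyIterator extends WholeRowIterator {

    private static final Text UNWANTED_CF = new Text("B");

    @Override
    protected boolean filter(Text currentRow, List<Key> keys, List<Value> values) {
        for (Key key : keys) {
            if (key.getColumnFamily().equals(UNWANTED_CF)) {
                return false; // row contains a "B" column family, suppress it
            }
        }
        return true; // no "B" seen, emit the buffered row
    }
}

On the client you would attach it to a scanner with an IteratorSetting and decode each returned entry with WholeRowIterator.decodeRow(key, value), since WholeRowIterator serializes each row into a single key-value pair.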

Related

Excel - VBA + API - Most performant solution to add a column and keep historical data from a daily API

This is a high-level question, meaning I'd like to discuss the best way to solve the problem rather than get technical about the code.
Here's the description of what I would like to build.
Current situation
Using an issue tracker, I call an API and get all the data from a tracker query.
I.e., in Redmine or JIRA, I create a query, save it, and then via the API I import the data into a worksheet.
This works fine now, and I have created infographics and a dashboard which update every 30 minutes.
Smooth.
Evolution
The data from API are saved in a worksheet called "DATA".
There's a field, call it "% Done", which should change at least at the end of each day.
When the API updates the query, the field is updated correctly in the "DATA" worksheet.
What I need is a worksheet in which, from this query or another (it makes no difference), I have this kind of mockup:
+---------+------------+
| issue # | 20/01/2020 |
+---------+------------+
| 23415   | 10%        |
+---------+------------+
When the API updates the data, if the date is a new day, here's what happens:
+---------+------------+------------+
| issue # | 20/01/2020 | 21/01/2020 |
+---------+------------+------------+
| 23415   | 10%        | 20%        |
+---------+------------+------------+
And obviously, if "% Done" does not change, on the third day I will have the table:
+---------+------------+------------+------------+
| issue # | 20/01/2020 | 21/01/2020 | 22/01/2020 |
+---------+------------+------------+------------+
| 23415   | 10%        | 20%        | 20%        |
+---------+------------+------------+------------+
The date is TODAY(), as the API is called once per day, and is written in the header cell.
The "% Done" is reloaded every day, and what I would like to discuss is the most performant way to update maybe 20k records, with 20k rows per call.
Any suggestions on how to achieve the best result, more on the "architectural" side?
Thank you all; if you need more information, just ask.
CG.
There is a significant overhead each time VBA calls Excel to get data and an even larger overhead every time VBA calls Excel to put data onto a sheet. Therefore the most performant way is to minimise the number of calls by maximising the amount of data transferred in each call.
In practice this means using arrays to read as large a block of data as possible, caching anything retrieved from your API in an array, and then writing the array back as infrequently as possible.
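As a minimal sketch of that pattern in VBA (the sheet names, ranges, and 20k row count are illustrative, and it assumes the row order from the API is stable between calls):

Sub AppendTodayColumn()
    ' One bulk read of the "% Done" column from the API sheet.
    Dim vals As Variant
    vals = ThisWorkbook.Worksheets("DATA").Range("B2:B20001").Value

    Dim hist As Worksheet
    Set hist = ThisWorkbook.Worksheets("HISTORY")

    ' Find the next free column in the history sheet.
    Dim col As Long
    col = hist.Cells(1, hist.Columns.Count).End(xlToLeft).Column + 1

    ' Today's date goes in the header cell, then one bulk write of all values.
    hist.Cells(1, col).Value = Date
    hist.Cells(2, col).Resize(UBound(vals, 1), 1).Value = vals
End Sub

That is two calls into Excel per day regardless of the row count; in a real version you would also match rows on the issue # inside the array rather than relying on position.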

Select From Cassandra with Order By and Include position in order by

I have a table in cassandra which has the following structure
CREATE TABLE example (session text, seq_number int, PRIMARY KEY ((session), seq_number))
In the data I would expect all sequence numbers to be in the table starting at 0.
So a table might look like
session  | seq_number
---------+-----------
session1 | 0
session1 | 1
session1 | 2
session1 | 4    // bad row, means missing data
In this example I would like to read only the first 3 rows. I don't want the fourth row because its sequence number (4) does not match its position (3), i.e. there is a gap. Is this possible?
Possible, yes: you can use GROUP BY on session and a user-defined aggregate function. It may be more work than it's worth, though; if you just set the fetch size low (say 100) on your queries and then iterate through the result set on the client side, it might save you a lot of work and potentially even be more efficient overall. I would recommend implementing the client-side solution first and benchmarking it to see if the aggregate is even necessary or beneficial.
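A minimal sketch of the client-side approach with the DataStax Java driver (the contact point and keyspace name are illustrative; the stop-at-the-first-gap rule implements the check from the question):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");

Statement stmt = new SimpleStatement(
        "SELECT seq_number FROM example WHERE session = ?", "session1");
stmt.setFetchSize(100); // pull rows back in small pages

int expected = 0;
for (Row row : session.execute(stmt)) {
    if (row.getInt("seq_number") != expected) {
        break; // first gap found, stop reading
    }
    expected++;
    // ... process the row ...
}

The driver pages transparently, so once the loop breaks, no further pages are requested from the cluster.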

Cassandra - do group by and join in the right way

I know Cassandra does not support GROUP BY. But how can I achieve a similar result on a big collection of data?
Let's say I have a table with 1 million rows of clicks, 1 million of shares, and a table user_profile. clicks and shares store one operation per row, with a created_at column. On a dashboard I would like to show results grouped by day, for example:
2016-06-01 - 2016-07-01
+-------------+--------+------+
|user_profile | like |share |
+-------------+--------+------+
| John | 34 | 12 |
| Adam | 12 | 4 |
| Bruce | 4 | 2 |
+-------------+--------+------+
The question is, how can I do this the right way:
1. Create a table user_likes_shares with counters by date
2. Create a UDF to group by each column, and join the results in code by merging arrays by key
3. Select the data from the 3 tables, then group and join them in code by merging arrays by key
4. Another option?
If you use code to join the results, do you use Apache Spark SQL? Is Spark the right way to go in this case?
Assuming that your dashboard page will show all historical results, grouped by day:
1. 'Group by' in a table: The denormalised approach is the accepted way of doing things in Cassandra, as writes and disk space are cheap. If you can structure your data model (and application writes) to support this, then this is the best approach (see the sketch after this list).
2. 'Group by' in a UDA: In this blog post, the author notes that all rows are pulled back to the coordinator, reconciled and aggregated there (for CL>1). So even if your clicks and shares tables are partitioned by date, Cassandra will still have to pull all rows for that date back to the coordinator, store them in the JVM heap and then process them. So this approach has reduced scalability.
3. Merging in code: This will be a much slower approach as you will have to transfer a lot more data from the coordinator to your application server.
4. Spark: This is a good approach if you have to make ad-hoc queries (e.g. analyzing data, rather than populating a web page) and can be simplified by running your Spark jobs through a notebook application (e.g. Apache Zeppelin). However, in your use case, you have the complexity of having to wait for that job to finish, write the output somewhere and then display it on a web page.
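As a minimal sketch of option 1 (table and column names are illustrative), a counter table keyed by day keeps the 'group by' pre-computed at write time:

CREATE TABLE user_actions_by_day (
    day text,             -- e.g. '2016-06-01'
    user_profile text,
    likes counter,
    shares counter,
    PRIMARY KEY ((day), user_profile)
);

-- On every click or share, bump the matching counter:
UPDATE user_actions_by_day SET likes = likes + 1
WHERE day = '2016-06-01' AND user_profile = 'John';

-- The dashboard then reads a single partition per day:
SELECT user_profile, likes, shares
FROM user_actions_by_day
WHERE day = '2016-06-01';

Note that in a counter table every non-key column must be a counter, which happens to fit this use case well.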

Are there any major disadvantages to having multiple clustering columns in cassandra?

I'm designing a Cassandra table where I need to be able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more than I currently can.
The current table schema is this, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range-filter on the full geohash, allowing me to prefix-search on a geohash of length 5 or greater:
CREATE TABLE georecords (geo_key text, geohash text, data text, PRIMARY KEY (geo_key, geohash))
My idea is that I could instead store the characters of the geohash as separate columns, allowing me to specify as many characters as I want to do a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:
CREATE TABLE georecords (g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text, geohash text, pid int, data text, PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid))
(I'm not really concerned about the cardinality of the partition key - g1 would have minimum 30 values, and I have other workarounds for it as well)
Other than cardinality of the partition key and extra storage requirements, what should I be aware of if I used the many-clustering-column approach?
Other than cardinality of the partition key and extra storage requirements, what should I be aware of if I used the many-clustering-column approach?
This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
aploetz@cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROM georecords3;
g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash | pid | data
----+----+----+----+----+----+----+----+--------------+------+---------------
d | p | 8 | 9 | v | c | n | e | dp89vcnem4n | 1001 | Beloit, WI
d | p | 8 | c | p | w | g | v | dp8cpwgv3 | 1003 | Harvard, IL
d | p | c | 8 | g | e | k | t | dpc8gektg8w7 | 1002 | Sheboygan, WI
9 | x | j | 6 | 5 | j | 5 | 1 | 9xj65j518 | 1004 | Denver, CO
(4 rows)
As you know, Cassandra is designed to return data with a specific, precise key. Using multiple clustering columns helps in that approach, in that you are helping Cassandra quickly identify the data you wish to retrieve.
The only thing I would think about changing is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is uniqueness, which you will need if you plan on storing the same geohashes multiple times.
Including pid in your PRIMARY KEY leaves you with one non-key column, which allows you to use the WITH COMPACT STORAGE directive. Really the only true edge that gets you is in saving disk space, as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:
Without compact storage:
[default@stackoverflow] list georecords3;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
=> (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
2 Rows Returned.
Elapsed time: 217 msec(s).
With compact storage:
[default@stackoverflow] list georecords2;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: d
=> (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
=> (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
=> (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
-------------------
RowKey: 9
=> (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
2 Rows Returned.
Elapsed time: 39 msec(s).
But, I would recommend against using WITH COMPACT STORAGE for the following reasons:
You cannot add or remove columns after table creation.
It prevents you from having multiple non-key columns in the table.
It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.
I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So, knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
The other issue with storing everything in the same partition key is that each partition can store a maximum of about 2 billion columns. So it would also make sense to put some thought behind whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.
By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.
Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.
Edit
The still unresolved question for me is which approach will scale better, in which situations.
Scalability is tied more to how many replicas (R) you have across your N nodes. Cassandra scales linearly: the more nodes you add, the more transactions your application can handle. Purely from a data-distribution standpoint, your first model will have a higher-cardinality partition key, so it will distribute much more evenly than the second. However, the first model is much more restrictive in terms of query flexibility.
Additionally, if you are doing range queries within a partition (which I believe you said you are), then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node. So querying multiple results for g1='d' AND g2='p' (etc.) will perform extremely well.
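For instance, against the georecords3 table above, a prefix lookup like this reads a single partition and slices contiguous clustering columns:

SELECT geohash, pid, data FROM georecords3 WHERE g1 = 'd' AND g2 = 'p';

Adding g3, g4, ... to the WHERE clause narrows the prefix further, one character at a time.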
I may just have to play with the data more and run test cases.
That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.
Here's the best Cassandra modeling guide I've found: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
I've used composite columns (6 of them) successfully for very high write/read loads. There is no significant performance penalty when using compact storage (http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html).
Compact storage means the data is stored internally in a single row, with the limitation that you can only have one data column. That seems to suit your application well, regardless of which data model you choose, and would make maximal use of your geo_key filtering.
Another aspect to consider is that the columns are sorted in Cassandra. Having more clustering columns will improve the sorting speed and potentially the lookup.
However, in your case, I'd start with having the geohash as a row key and turn on row cache for fast lookup (http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1). If the performance is lacking there, I'd run performance tests on different data representations.
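In Cassandra 2.1 syntax (matching the linked post; tune rows_per_partition to your data), turning the row cache on for the table would look like:

ALTER TABLE georecords WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : 'ALL' };

The row_cache_size_in_mb setting in cassandra.yaml must also be non-zero for the row cache to be active at all.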

Cassandra (Pycassa/CQL) Return Partial Match

I'm trying to do a partial search through a column family in Cassandra, similar to an SQL query like SELECT * FROM columnfamily WHERE col = 'val*', where 'val*' means any value starting with the three characters 'val'.
I've read DataStax's documentation on the SELECT function, but can't seem to find any support for a partial-match WHERE criterion. Any ideas?
There is no wildcard support like this in Cassandra, but you can model your data in such a way that you could get the same end result.
You would take the column that you want to perform this query on and denormalize it into a second column family. This CF would have a single wide row, in which each column name is a value of the column you want to run the wildcard query on. The column value for this CF could either be the row key of the original CF or some other representation of the original row.
Then you would use slicing to get out the values you care about. For example if this was the wide row to slice on:
+--------+----------+--------+----------+---------+--------+----------+
| RowKey | aardvark | abacus | abacuses | abandon | accent | accident |
+--------+----------+--------+----------+---------+--------+----------+
|        |          |        |          |         |        |          |
+--------+----------+--------+----------+---------+--------+----------+
Using CQL you could select out everything starting with 'aba*' using this query*:
SELECT 'aba'..'abb' from some_cf where RowKey = some_row_key;
This would give you the columns for 'abacus', 'abacuses', and 'abandon'.
There are some things to be aware of with this strategy:
In the above example, if you have entries with the same column name, you need some way to differentiate between them (otherwise inserting into the wide column family will clobber other valid values). One way you could do this is by using a composite column of word:some_unique_value.
The above model only allows wildcards at the end of the string. Wildcards at the beginning of the string could also easily be handled with a few modifications. Wildcards in the middle of a string would be much more challenging.
Remember that Cassandra doesn't give you an easy way to do ad-hoc queries. Instead you need to figure out how you will be using the data and model your CFs accordingly. Take a look at this blog post from Ed Anuff on indexing data in Cassandra for more info on modeling data like this.
*Note that the CQL syntax for slicing columns is changing in an upcoming release of Cassandra.
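For reference, in CQL3 terms the same idea becomes a range slice over a clustering column (a sketch; the table, bucket, and column names are illustrative):

CREATE TABLE words_by_prefix (
    bucket text,           -- the single wide row becomes a single partition
    word text,
    original_key text,     -- points back at the row in the original CF
    PRIMARY KEY ((bucket), word)
);

SELECT word, original_key FROM words_by_prefix
WHERE bucket = 'all' AND word >= 'aba' AND word < 'abb';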
