How to get a range of rows (e.g. 1000th~2000th rows) with Apache Kudu? - apache-kudu

I'm using Apache Kudu for study, but how can I get a specific range of rows? For example, I want to get the 1000th to the 2000th rows.
I have found some client APIs for setting scan bounds on the key:
Status AddLowerBound(const KuduPartialRow& key);
Status AddExclusiveUpperBound(const KuduPartialRow& key);
However, these APIs use key-column bounds to filter rows; they do not use a row range to select rows by position. I could scan and count my way to the 1000th and 2000th rows, but the next time I want the 1500th to 3000th rows I would have to scan again, which does not seem like a good solution.
So how can I solve this? Thanks :)
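For reference, this is roughly how the key-bound style of scan mentioned above looks with the Kudu Python client. It is only a sketch: it uses column predicates on the key column rather than the C++ KuduPartialRow bound calls, and the master address, table name, and integer key column "id" are assumptions. Note that it still selects rows by key value, not by row position.

import kudu

# Sketch only: host, table, and column names are hypothetical.
client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('my_table')

scanner = table.scanner()
scanner.add_predicate(table['id'] >= 1000)   # inclusive lower bound on the key column
scanner.add_predicate(table['id'] < 2000)    # exclusive upper bound on the key column
scanner.open()

for row in scanner.read_all_tuples():
    print(row)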

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
*My intention is to filter and then delete the rows highlighted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the pattern depicted in the screenshot here. Instead of a new branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregate expression I used count(1) as the row-count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
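If it helps to sanity-check the logic outside Data Flow, the same pattern (row number, max row count, filter to max minus 2) looks like this in a small pandas sketch; the file name and the "sk" column are just placeholders.

import pandas as pd

# Placeholder file name; in ADF this would be the source dataset.
df = pd.read_excel("monthly_extract.xlsx")

df["sk"] = range(1, len(df) + 1)        # surrogate key: a 1-based row number
row_count = df["sk"].max()              # aggregate branch: max row number
df = df[df["sk"] <= row_count - 2]      # filter: drop the bottom two rows
df = df.drop(columns="sk")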

Display whole rows in great_expectations dashboard

When an expectation fails, I cannot view the entire row that caused the failure on the dashboard (the Data Docs), only the column value. For example, if I have a failure because the maximum value of a numerical column is over a threshold, I can see that value, but I cannot see the related ID column.
I am aware that there is the option "unexpected_index_list" to get the IDs of the rows that cause the failures, but I would like to be able to view the entire row directly on the dashboard. I also tried the option "include_unexpected_rows", but it does not work as expected.
Any help?
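For what it's worth, the two options mentioned above are normally passed through result_format on the expectation itself. The following is a minimal sketch, assuming a pandas-backed dataset and hypothetical column names; include_unexpected_rows is experimental and whether the rows end up rendered in Data Docs varies across great_expectations versions.

import great_expectations as ge
import pandas as pd

# Hypothetical data and column names, just to show where the options go.
dataset = ge.from_pandas(pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 999]}))

result = dataset.expect_column_values_to_be_between(
    "amount",
    min_value=0,
    max_value=100,
    result_format={
        "result_format": "COMPLETE",       # returns unexpected_index_list
        "include_unexpected_rows": True,   # experimental: also return the failing rows
    },
)

# The "result" section of the validation result holds unexpected_index_list and,
# where supported, unexpected_rows; rendering them in Data Docs is version-dependent.
print(result)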

How to retrieve a very big Cassandra table and delete some unused data from it?

I have created a Cassandra table with 20 million records. Now I want to delete the expired data, as determined by one non-primary-key column, but Cassandra doesn't support deleting by a condition on that column. So I tried to retrieve the table and go through the data row by row to delete it. Unfortunately, the table is too huge to retrieve in one go. I can't simply delete the whole table either, so how can I achieve my goal?
Your question is really about how to get the data from the table in chunks (also called pagination).
You can do that by selecting different slices from your primary key: For example, if your primary key is some sort of ID, select a range of IDs each time, process the results and do whatever you want to do with them, then get the next range, and so on.
Another way, which depends on the driver you're working with, will be to use fetch_size. You can see a Python example here and a Java example here.
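Here is a rough sketch of the fetch_size approach with the DataStax Python driver, combined with per-row deletes for the original use case; the keyspace, table, and the "id"/"expires_at" columns are assumptions.

from datetime import datetime, timedelta
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Keyspace, table, and column names are hypothetical.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# The driver returns timestamp columns as naive UTC datetimes, so compare against a naive cutoff.
cutoff = datetime.utcnow() - timedelta(days=30)

# fetch_size makes the driver page transparently: 1000 rows per round trip as you iterate.
query = SimpleStatement("SELECT id, expires_at FROM my_table", fetch_size=1000)
delete_stmt = session.prepare("DELETE FROM my_table WHERE id = ?")

for row in session.execute(query):
    if row.expires_at is not None and row.expires_at < cutoff:
        session.execute(delete_stmt, (row.id,))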

Cassandra: "contains key" operation?

Is there a Cassandra operation to verify whether a column family contains a key? We don't need any row data, only whether the key exists or not.
Best Regards
If you're using Java then create a SliceQuery for the rowKey and set begin/end values equal to the specific column key you're looking for. If there is a column with the specific key then the following expression will be true:
sliceQuery.execute().get().getColumns().size() > 0
One quick way of doing it is to ask for the column count for the row, if it's positive the row exists. Because of tombstones there's a gray area around "does not exist". You can remove all columns for a row, but asking for data for the row may result in an empty set of columns instead of null (this depends a lot on which driver you're using). You should consider rows that don't have columns as non-existent, and therefore asking for the column count is probably the best way to determine if a row exists or not.
There's some more information about this in the Cassandra FAQ under "range ghosts".
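In CQL terms, the same "does this row have any live data" check can be done with a key lookup limited to one row. Below is a sketch with the DataStax Python driver; the keyspace, table, and key column are placeholders.

from cassandra.cluster import Cluster

# Keyspace, table, and key column are hypothetical.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

row = session.execute(
    "SELECT id FROM my_table WHERE id = %s LIMIT 1", ('some-key',)
).one()
exists = row is not None  # an empty result means the key does not exist (fully deleted rows return nothing)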

MultiGet or multiple Get operations when paging

I have a wide column family used as a 'timeline' index, where column names are timestamps. In order to prevent hotspots, I shard the CF by month so that each month has its own row in the CF.
I query the CF for a slice range between two dates and limit the number of columns returned based on the page's records per page, say to 10.
The problem is that if my date range spans several months, I get 10 columns returned from each row, even if there are 10 matching columns in the first row alone - thus already satisfying my paging requirement.
I can see the logic in this, but it strikes me as a real inefficiency if I have to retrieve redundant records from potentially multiple nodes when I only need the first 10 matching columns regardless of how many rows they span.
So my question is: am I better off doing a single Get operation on the first row and then another Get operation on the second row if my first call doesn't return 10 records, continuing until I have the required number of records (or hit the row limit), or should I just accept the redundancy and discard the unneeded records?
I would sample your queries and record how many rows you needed to fetch for each one in order to get your 10 results and build a histogram of those numbers. Then, based on the histogram, figure out how many rows you would need to fetch at once in order to complete, say, 90% of your lookups with only a single query to Cassandra. That's a good start, at least.
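As a concrete illustration of that sampling step, here is a small sketch that takes hypothetical "rows fetched per query" samples and finds the smallest per-query row count that covers about 90% of lookups.

from collections import Counter

# Hypothetical samples: how many monthly rows each sampled query had to read
# before it collected its 10 results.
rows_needed = [1, 1, 2, 1, 1, 3, 1, 2, 1, 1]

histogram = Counter(rows_needed)

total = len(rows_needed)
covered = 0
for n in sorted(histogram):
    covered += histogram[n]
    if covered / total >= 0.9:
        print(f"fetching {n} row(s) per query satisfies ~90% of lookups in one round trip")
        break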
If you almost always need to fetch more than one row, consider splitting your timeline by larger chunks than a month. Or, if you want to take a more flexible approach, use different bucket sizes based on the traffic for each individual timeline: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra (see the "Variable Time Bucket Sizes" section).
