How to GROUP BY a subselect result that uses TOP to reduce read rows (Sybase) - sap-ase

I'm trying to do a group by on a subselect result to reduce the amount of data processed.
My_table has more than 20 million rows.
example:
SELECT TOP 100 A.Column FROM (
SELECT TOP 500 Column FROM My_table) A
GROUP BY A.Column
I want the query to work with only 500 rows from my table, but when I use GROUP BY it takes a long time, as if it were grouping the whole 20 million rows when I'm only grouping 500.
Is there a way to make the SQL engine work with only the 500 rows?

If it's irrelevant to you which 500 rows are used, have you considered using set rowcount?
set {rowcount number, textsize number} – causes the SAP ASE server to stop processing the query (select, insert, update, or delete) after the specified number of rows are affected. The number can be a numeric literal with no decimal point or a local variable of type integer.
Infocenter source
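As a sketch (table and column names follow the question; the temp table is made up), you could capture the 500 rows first and then group only that copy:
set rowcount 500
select Column into #first_rows from My_table
set rowcount 0
-- group only the captured 500 rows
select Column from #first_rows group by Column
Remember to reset with set rowcount 0, otherwise the limit stays in effect for the rest of the session.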

Related

Removing irrelevant data with Power Query

I have a situation where my data is in the following format.
There are thousands of rows with such statuses. What I would like is a new table where rows 2 and 3 are removed and only the bottom row is left for reporting.
Currently I have a VBA macro which first concatenates [sales document and product], then checks for and tags repeating values. For the tagged lines, the concatenated value times the billed price is matched against the next line (-1 * concatenated next value * billed price), and both lines are deleted in a loop.
This operation sometimes takes a long time, as the file can be big. I would like to move to Power Query because my other related files already have transformations happening there.
Would be glad if anyone can help me.
BR,
Manoj
I would recommend doing a Group By on the first four columns, using Sum as the aggregation for the billing column, and then simply filtering out the 0 rows.
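The Group By step computes the same thing as the following SQL, shown only to make the logic concrete (table and column names are hypothetical, since the sample data isn't reproduced here):
SELECT SalesDocument, Product, Status, BilledQty,
       SUM(BilledPrice) AS NetBilled
FROM Billing
GROUP BY SalesDocument, Product, Status, BilledQty
HAVING SUM(BilledPrice) <> 0  -- the "filter out the 0 rows" step
The reversal lines cancel out in the sum, so only the rows you want to keep survive the filter.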

Total row count in Cassandra

I totally understand that count(*) from table where partitionId = 'test' will return the count of the rows. I could see that it takes the same time as select * from table where partitionId = 'test'.
Is there any other alternative in Cassandra to retrieve the count of the rows in an efficient way?
You can compare the results of select * and select count(*) if you run cqlsh and enable tracing there with the tracing on command - it will print the time required to execute each command. The difference between the two queries is only in the amount of data that has to be returned.
But in any case, to find the number of rows Cassandra needs to hit the SSTable(s) and scan entries - performance could differ if your partition is spread across multiple SSTables; this may depend on the compaction strategy for your tables, which is selected based on your reading/writing patterns.
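For instance, the comparison in cqlsh might look like this (keeping the placeholder table and column names from the question):
TRACING ON;
SELECT count(*) FROM table WHERE partitionId = 'test';
SELECT * FROM table WHERE partitionId = 'test';
TRACING OFF;
With tracing on, cqlsh prints a per-step trace and the total elapsed time after each query, which is what you compare.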
As Alex Ott mentioned, the COUNT(*) needs to go through the entire partition to know that total.
The fact is that Cassandra wants to avoid locks, and as a result it does not maintain a row count in its SSTables; each time you do an INSERT, UPDATE, or DELETE, you may actually overwrite another entry, which is simply marked with a tombstone (i.e. it's not an in-place overwrite; instead the new data is saved at the end of the SSTable and the old data is marked as dead).
The COUNT(*) will go through the SSTables and count all the entries not marked as tombstones. That's very costly. We're used to SQL keeping the total number of rows for a table or an index, so COUNT(*) on those is instantaneous... not here.
One solution I've used is to install Elasticsearch on the Cassandra cluster. One of the statistics Elasticsearch keeps is the number of rows in a table. I don't remember the exact query, but more or less you can just issue a count request and get a result in around 100ms, always, whatever the number is - even in the tens of millions of rows. Just like with a SELECT COUNT(*), the result will always be an approximation if you have many writes happening in parallel. It stabilizes if the writes stop for long enough (possibly about 1 or 2 seconds).

Why does aggregating paginated queries take less time than fetching the entire table

I have a table in my database indexed over three columns: PropertyId, ConceptId, and Sequence. This particular table has about 90,000 rows in it.
Now, when I run this query, the total time required is greater than 2 minutes:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
However, if I paginate the query like so:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the aggregate time (x goes from 0 to 8) required is only around 20 seconds.
This seems counterintuitive to me because the pagination requires additional operations over and above the simpler query, and we're adding the latency of sequential network calls because I haven't parallelized this query at all. And I know it's not a caching issue, because running these queries one after the other does not affect the latencies very much.
So, my question is this: why is one so much faster than the other?
This seems counterintuitive to me because the pagination requires additional operations over and above the simpler query
Pagination queries sometimes work very fast, if you have the right index...
For example, with the query below
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
the maximum number of rows you might read is 20,000. Below is an excerpt from an actual execution plan that shows the same thing:
RunTimeCountersPerThread Thread="0" ActualRows="60" ActualRowsRead="60"
but with the select * query, you are reading all the rows.
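One way to verify this yourself (assuming SQL Server, as the OFFSET ... FETCH syntax suggests) is to compare the logical reads reported for the two forms:
SET STATISTICS IO ON;
-- full sort/scan of all ~90,000 rows
SELECT * FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence;
-- one page: the index on (PropertyId, ConceptId, Sequence) bounds the rows read
SELECT * FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY;
SET STATISTICS IO OFF;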
After a prolonged search into what's going on here, I discovered that the reason behind this difference in performance (> 2 minutes) was hosting the database on Azure. Since Azure partitions any tables you host on it across multiple partitions (i.e. multiple machines), running a query like:
SELECT *
FROM MSC_NPV
ORDER BY PropertyId, ConceptId, Sequence
would run more slowly because the query pulls data from all the partitions before ordering it, which could result in multiple queries across multiple partitions of the same table. By paginating the query over indexed properties, I was looking at a particular partition and querying the table stored there, which is why it performed significantly better than the unpaginated query.
To prove this, I ran another query:
SELECT *
FROM MSC_NPV
ORDER BY Narrative
OFFSET x * 10000 ROWS
FETCH NEXT 10000 ROWS ONLY
This query ran anemically compared to the first paginated query, because Narrative is not a primary key and therefore is not used by Azure to build a partition key. So ordering on Narrative required the same work as the first query, plus additional operations on top of that, because the entire table had to be retrieved first.

SELECT COUNT(*) returns 0 but I have 800 rows

I use Cassandra 2.0 and cqlsh:
cqlsh:node1> SELECT count(*) FROM users;
count
-------
0
(1 rows)
but when I do:
cqlsh:node1> select id from users LIMIT 10;
id
--------------------
8acecf2
f638215
8b33e24
470a2cb
0f9a5c2
4c49298
2e28a56
b42ce98
19b68c5
2a207f2
(10 rows)
My users table has 5 "text" columns with more than 100Kb of base64 data.
When I do a SELECT * FROM users; cqlsh takes 3 seconds before showing the data.
Does anyone have a solution?
Is it possible to do a COUNT(column)?
PS: what do you need? Logs? Where?
What needs to be done when counting is to specify a limit:
If you are sure that the number of "rows" is less than 5,000,000 (5 million), then in cql3.0 you can do:
select count(*) from mycolumnfamilyname limit 5000000;
You need to rethink why your application needs counts. If you have millions/billions of rows, counting will be time- and resource-consuming.
If your application is OK with an "approximate" count of users, then you may use "nodetool cfstats". It gives you an estimate of the number of keys (users) and is generally accurate.
If you need an "exact" count, then there are different techniques to do that.
For example, you can maintain a special row and keep adding columns to it whenever a new row is inserted; counting the columns of that row then gives you the number of rows. A related sketch in CQL terms follows.
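A minimal sketch of that bookkeeping idea using a CQL counter table (all names are made up; a counter column is the CQL-era analogue of the special-row approach, not exactly the same thing):
CREATE TABLE user_count (
    id int PRIMARY KEY,
    total counter
);
-- bump the counter alongside every user insert
UPDATE user_count SET total = total + 1 WHERE id = 0;
-- read the current count back
SELECT total FROM user_count WHERE id = 0;
Counter updates are not atomic with the insert itself, so the value can drift and may need periodic reconciliation, as suggested below.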
In order to count a specific column, you have to have the column in the WHERE clause.
For example, assuming the 'id' column is your primary key, you could do this:
SELECT COUNT(id) FROM users WHERE id > '';
If the column is not the primary key, then you have to allow filtering as in:
SELECT COUNT(name) FROM users WHERE name > '' ALLOW FILTERING;
As mentioned by others, this is slow, and the LIMIT keyword is required if you expect a large number of users. The slowness comes from the fact that Cassandra reads all the rows one by one and, from what I understand, it reads the entire rows (i.e. your really big columns get loaded each time), because there is no way to read just one column when filtering. Cassandra 3.x may have ameliorated that, though.
If you really need that number often, you could use a lock and increment a field representing the number of users. You could also have a process that adjusts the number once in a while if it somehow gets out of sync.

MultiGet or multiple Get operations when paging

I have a wide column family used as a 'timeline' index, where column names are timestamps. In order to prevent hotspots, I shard the CF by month so that each month has its own row in the CF.
I query the CF for a slice range between two dates and limit the number of columns returned based on the page's records per page, say to 10.
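In CQL terms, the layout described above might look roughly like this (all names are hypothetical; the question itself uses Thrift-style wide rows where column names are timestamps):
CREATE TABLE timeline (
    entity_id text,
    month text,           -- e.g. '2014-03', the monthly shard
    event_time timestamp,
    payload text,
    PRIMARY KEY ((entity_id, month), event_time)
);
-- one page: a slice between two dates within a single month shard
SELECT * FROM timeline
WHERE entity_id = 'user1' AND month = '2014-03'
  AND event_time >= '2014-03-01' AND event_time < '2014-04-01'
LIMIT 10;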
The problem is that if my date range spans several months, I get 10 columns returned from each row, even if there are 10 matching columns in the first row - which would already satisfy my paging requirement.
I can see the logic in this, but it strikes me as a real inefficiency if I have to retrieve redundant records from potentially multiple nodes when I only need the first 10 matching columns regardless of how many rows they span.
So my question is: am I better off doing a single Get operation on the first row and then another Get on the second row if the first call doesn't return 10 records, continuing until I have the required number of records (or hit the row limit), or should I just accept the redundancy and throw away the unneeded records?
I would sample your queries and record how many rows you needed to fetch for each one in order to get your 10 results and build a histogram of those numbers. Then, based on the histogram, figure out how many rows you would need to fetch at once in order to complete, say, 90% of your lookups with only a single query to Cassandra. That's a good start, at least.
If you almost always need to fetch more than one row, consider splitting your timeline by larger chunks than a month. Or, if you want to take a more flexible approach, use different bucket sizes based on the traffic for each individual timeline: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra (see the "Variable Time Bucket Sizes" section).
