YugabyteDB not using expression index on json column - yugabytedb

[Question posted by a user of YugabyteDB]
I'm having difficulty making the query planner use the index in the query below:
postgres=# create table books(k int primary key, doc jsonb not null);
postgres=# CREATE INDEX books_year
ON books (((doc->>'year')::int) ASC)
WHERE doc->>'year' is not null;
postgres=# EXPLAIN select
(doc->>'ISBN')::bigint as isbn,
doc->>'title' as title,
(doc->>'year')::int as year
from books
where (doc->>'year')::int > 1850
order by 3;
                            QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=177.33..179.83 rows=1000 width=44)
   Sort Key: (((doc ->> 'year'::text))::integer)
   ->  Seq Scan on books  (cost=0.00..127.50 rows=1000 width=44)
         Filter: (((doc ->> 'year'::text))::integer > 1850)
(4 rows)
When querying by the string value, it looks like the planner does use it:
postgres=# EXPLAIN select
(doc->>'ISBN')::bigint as isbn,
doc->>'title' as title,
(doc->>'year')::int as year
from books
where (doc->>'year') = '1988'
order by 3;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using books_year on books  (cost=0.00..125.50 rows=1000 width=44)
   Filter: ((doc ->> 'year'::text) = '1988'::text)
(2 rows)

For the planner to use this partial index, the query must repeat the index's WHERE predicate, like below:
postgres=# CREATE INDEX books_year ON books (((doc->>'year')::int) asc) where doc->>'year' is not null;
postgres=# EXPLAIN select
(doc->>'ISBN')::bigint as isbn,
doc->>'title' as title,
(doc->>'year')::int as year
from books
where (doc->>'year')::int > 1850 and doc->>'year' is not null;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Scan using books_year on books  (cost=0.00..5.24 rows=10 width=44)
   Index Cond: (((doc ->> 'year'::text))::integer > 1850)
(2 rows)
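Alternatively, if the partial predicate isn't needed, a sketch of the other way around (reusing the index name and test query from above for illustration): create the index without the WHERE clause, and the original range query should be able to use it without the extra IS NOT NULL condition:
postgres=# DROP INDEX books_year;
postgres=# CREATE INDEX books_year ON books (((doc->>'year')::int) ASC);
postgres=# EXPLAIN select (doc->>'year')::int as year
from books
where (doc->>'year')::int > 1850;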

Related

Index not used in basic query

I have the following table, block:
CREATE TABLE IF NOT EXISTS "block" (
"hash" char(66) CONSTRAINT block_pk PRIMARY KEY,
"size" text,
"miner" text,
"nonce" text,
"number" text,
"number_int" integer not null,
"gasused" text,
"mixhash" text,
"gaslimit" text,
"extradata" text,
"logsbloom" text,
"stateroot" char(66),
"timestamp" text,
"difficulty" text,
"parenthash" char(66),
"sha3uncles" char(66),
"receiptsroot" char(66),
"totaldifficulty" text,
"transactionsroot" char(66)
);
CREATE INDEX number_int_index ON block (number_int);
The table has about 3M rows. When I run a simple query, the results are:
EXPLAIN ANALYZE select number_int from block where number_int > 1999999 and number_int < 2999999 order by number_int desc limit 1;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=110.00..110.00 rows=1 width=4) (actual time=16154.891..16154.894 rows=1 loops=1)
   ->  Sort  (cost=110.00..112.50 rows=1000 width=4) (actual time=16154.890..16154.890 rows=1 loops=1)
         Sort Key: number_int DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on block  (cost=0.00..105.00 rows=1000 width=4) (actual time=172.766..16126.135 rows=190186 loops=1)
               Remote Filter: ((number_int > 1999999) AND (number_int < 2999999))
 Planning Time: 19.961 ms
 Execution Time: 16155.382 ms
 Peak Memory Usage: 1113 kB
(9 rows)
Any advice?
Regards.
I tried something I found here on Stack Overflow, with the same result:
select number_int from block where number_int > 1999999 and number_int < 2999999 order by number_int+0 desc limit 1;
Hi, the problem was related to YugabyteDB; it was not an issue with the index or with anything else related to PostgreSQL. I ended up migrating to a self-managed PostgreSQL database. At least YugabyteDB is fully compatible with PostgreSQL: I migrated with pg_dump without any problem. It is worth it when you are starting out, if you don't want to manage the database server.
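For anyone hitting the same plan, one thing worth checking first (an assumption about this particular setup, not a confirmed diagnosis): in YSQL, CREATE INDEX defaults to HASH sharding on the first indexed column, and a hash-sharded index cannot serve range predicates such as number_int > 1999999. Recreating the index with an explicit sort order makes it range-sharded, so the planner can push the > / < bounds into an index scan:
DROP INDEX number_int_index;
CREATE INDEX number_int_index ON block (number_int DESC);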

Data model in Cassandra and proper deletion Strategy

I have the following table in Cassandra:
CREATE TABLE article (
id text,
price int,
validFrom timestamp,
PRIMARY KEY (id, validFrom)
) WITH CLUSTERING ORDER BY (validFrom DESC);
The table holds articles with historical price information (validFrom is the timestamp at which a new price takes effect). Article prices change often. I want to query for:
Historic price for a particular article.
Last price for an article.
From my understanding, I can solve both problems with the following query:
select id, price from article where id = X and validFrom < Y limit 1;
This query restricts by article id, i.e. by the partition key. Since the clustering order is based on the validFrom timestamp in reverse order, Cassandra can perform this query efficiently.
Am I getting this right?
What is the best approach to deleting old data (housekeeping)? Let's assume I want to delete all articles with validFrom > 20150101 and validFrom < 20151231. Since I can't restrict such a range delete by the partition key, this would be inefficient even with an index on validFrom, right? How can I achieve this?
You can use external tools for that:
Spark with the Spark Cassandra Connector (even in local mode). The code could look like the following (note that I'm using validfrom as the column name, not validFrom, as it's not quoted in your schema and is therefore stored lower-cased):
import com.datastax.spark.connector._
// Read the rows falling inside the window to be purged,
// keeping only the full primary key (partition + clustering column).
val data = sc.cassandraTable("test", "article")
  .where("validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z'")
  .select("id", "validfrom")
// Delete the matching rows by primary key.
data.deleteFromCassandra("test", "article", keyColumns = SomeColumns("id", "validfrom"))
Use DSBulk to find the matching entries and unload them into a file (output.csv in my case), then perform their deletion:
bin/dsbulk unload -url output.csv \
-query "SELECT id, validfrom FROM test.article WHERE token(id) > :start AND token(id) <= :end AND validFrom >= '2020-07-28T11:50:00Z' AND validFrom < '2020-07-28T12:50:00Z' ALLOW FILTERING"
bin/dsbulk load -query "DELETE from test.article WHERE id = :id and validfrom = :validfrom" \
-url output.csv
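If the retention period is already known when rows are written, another common housekeeping approach (a sketch, not part of the answer above; the values are illustrative) is to let rows expire automatically via a TTL instead of deleting them explicitly:
INSERT INTO test.article (id, price, validfrom)
VALUES ('A1', 100, '2020-07-28T12:00:00Z')
USING TTL 31536000; -- row expires after one year (TTL is in seconds)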
To add to Alex Ott's answer, this comment of yours is incorrect:
This query uses article id as restriction, query uses the partition key. Since the clustering order is based on price, cassandra can efficient perform this query.
The rows are not ordered by price. They are ordered by validFrom in reverse-chronological order. Cheers!
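To make that concrete, a minimal sketch (the article id 'A1' is illustrative): because of the DESC clustering order, the first row of the partition is the most recent one, so the latest price needs no ORDER BY at all:
SELECT id, price FROM article WHERE id = 'A1' LIMIT 1;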

Cassandra QueryBuilder not returning any result, whereas same query works fine in CQL shell

SELECT count(*) FROM device_stats
WHERE orgid = 'XYZ'
AND regionid = 'NY'
AND campusid = 'C1'
AND buildingid = 'C1'
AND floorid = '2'
AND year = 2017;
The above CQL query returns the correct result, 32032, in the CQL shell.
But when I run the same query using the QueryBuilder Java API, I see the count as 0:
BuiltStatement summaryQuery = QueryBuilder.select()
    .countAll()
    .from("device_stats")
    .where(eq("orgid", "XYZ"))
    .and(eq("regionid", "NY"))
    .and(eq("campusid", "C1"))
    .and(eq("buildingid", "C1"))
    .and(eq("floorid", "2"))
    .and(eq("year", "2017"));
try {
    ResultSetFuture summaryResults = session.executeAsync(summaryQuery);
    summaryResults.getUninterruptibly().all().stream().forEach(result -> {
        System.out.println(" totalCount > " + result.getLong(0));
    });
} catch (Exception e) {
    e.printStackTrace();
}
I have only 20 partitions and 32032 rows per partition.
What could be the reason QueryBuilder is not executing the query correctly?
Schema :
CREATE TABLE device_stats (
orgid text,
regionid text,
campusid text,
buildingid text,
floorid text,
year int,
endofwindow timestamp,
categoryid timeuuid,
devicestats map<text,bigint>,
PRIMARY KEY ((orgid, regionid, campusid, buildingid, floorid,year),endofwindow,categoryid)
) WITH CLUSTERING ORDER BY (endofwindow DESC,categoryid ASC);
// Using the keys function to index the map keys
CREATE INDEX ON device_stats (keys(devicestats));
I am using Cassandra 3.10 and com.datastax.cassandra:cassandra-driver-core:3.1.4.
Moving my comment to an answer since that seems to solve the original problem:
Changing .and(eq("year", "2017")) to .and(eq("year", 2017)) solves the issue since year is an int and not a text.
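A quick way to see the same type mismatch from cqlsh, as a sketch (the exact error text may vary by version): the string literal that QueryBuilder was sending is rejected outright once the types are checked:
SELECT count(*) FROM device_stats
WHERE orgid = 'XYZ' AND regionid = 'NY' AND campusid = 'C1'
AND buildingid = 'C1' AND floorid = '2' AND year = '2017';
-- InvalidRequest: Invalid STRING constant (2017) for "year" of type int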

Cassandra + Fetch the last records using in query

I am new to Cassandra, using it with Node.js.
I have a user_activity table, into which a row is inserted for each user action.
I also have a list of users, and I need to fetch the last record for each of those particular users.
I don't want to run the query in a for loop. Is there another way to achieve this?
Example Code:
var userlist = ["12", "34", "56"];
var query = 'SELECT * FROM user_activity WHERE userid IN ?';
server.user.execute(query, [userlist], {
    prepare: true
}, function(err, result) {
    console.log(result);
});
How do I get the last record for each user in the list?
Example:
user id = 12 - need to get last record;
user id = 34 - need to get last record;
user id = 56 - need to get last record;
I need to get these 3 records.
Table Schema:
CREATE TABLE test.user_activity (
userid text,
ts timestamp,
clientid text,
clientip text,
status text,
PRIMARY KEY (userid, ts)
)
It is not possible if you use the IN filter.
If you filter on a single userid, you can apply ORDER BY on the clustering column; you already have one for the inserted time (ts). The query would look like this:
SELECT * FROM user_activity WHERE userid = '12' ORDER BY ts DESC LIMIT 1;
You can change the LIMIT value to N to get N records.
SELECT * FROM user_activity WHERE userid IN ? ORDER BY ts DESC LIMIT N
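Note that Cassandra generally rejects ORDER BY together with an IN restriction on the partition key (unless paging is disabled), so in practice a workable sketch is one single-partition query per user (the ids are from the example above), issued concurrently from the driver:
SELECT * FROM user_activity WHERE userid = '12' ORDER BY ts DESC LIMIT 1;
SELECT * FROM user_activity WHERE userid = '34' ORDER BY ts DESC LIMIT 1;
SELECT * FROM user_activity WHERE userid = '56' ORDER BY ts DESC LIMIT 1;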

How to return subset of rows from Cassandra query?

How do I limit the query result? I mean, how do I return a subset of rows from a Cassandra table?
Repository:
@Query("select * from customer_request_v2 where product_id = ?0 and receipt_period = ?1")
Page<CustomerReq> findByProductIdAndReceiptPeriod(String productId, String receiptPeriod);
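A minimal sketch, assuming a fixed cap on the result set is acceptable (100 is an arbitrary example): CQL supports LIMIT directly, so the query string itself can bound the number of rows returned:
select * from customer_request_v2 where product_id = ?0 and receipt_period = ?1 LIMIT 100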
