I'm new to Couchbase and I'm running some queries using N1QL, but they take a lot of time (9 minutes).
My data set has 200,000 documents, and the documents contain nested types; there are 6,000,000 nested items distributed across the 200,000 documents, so the UNNEST operation is important. One sample of my data is:
{"p_partkey": 2, "lineorder": [{"customer": [{"c_city": "INDONESIA1"}], "lo_supplycost": 54120, "orderdate": [{"d_weeknuminyear": 19}], "supplier": [{"s_phone": "16-789-973-6601|"}], "commitdate": [{"d_year": 1993}], "lo_tax": 7}, {"customer": [{...
One query I'm doing is:
SELECT SUM(l.lo_extendedprice*l.lo_discount*0.01) as revenue
from part p UNNEST p.lineorder l UNNEST l.orderdate o
where o.d_year=1993 and l.lo_discount between 1 and 3 and l.lo_quantity<25;
The data has the fields mentioned above.
But it takes 9 minutes to execute.
I'm running this on my own computer, so there is just one node.
My computer has 16GB of RAM, the cluster RAM quota is 3.2GB, and there is just one bucket with a 3GB quota. My data is 2.45GB in total size. I used the calculation described here to size my cluster and bucket: http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html
Am I doing something wrong, or is this time expected for this amount of data?
For now I have created the following indexes:
CREATE INDEX idx_discount ON part( DISTINCT ARRAY l.lo_discount FOR l IN lineorder END );
CREATE INDEX idx_quantity ON part( DISTINCT ARRAY l.lo_quantity FOR l IN lineorder END );
CREATE INDEX idx_year ON part( DISTINCT ARRAY o.d_year FOR o IN ( DISTINCT ARRAY l.orderdate FOR l IN lineorder END ) END );
But the database doesn't use them for that query.
In another example, I created the index:
CREATE INDEX teste3 ON `part` (DISTINCT ARRAY l.lo_quantity FOR l IN lineorder END );
and queried:
select l.lo_quantity from part as p UNNEST p.lineorder l where l.lo_quantity>20 limit 3
Because I have deleted the primary index, it doesn't execute, returning the error:
"No primary index on keyspace part. Use CREATE PRIMARY INDEX to create one.",
After reading the blog post at http://blog.couchbase.com/2016/may/1.making-most-of-your-arrays..-with-covering-array-indexes-and-more I discovered the problem:
If you create the INDEX like this:
CREATE INDEX iflight_day
ON `travel-sample` ( DISTINCT ARRAY v.flight FOR v IN schedule END );
You have to use the same variable names in the queries, in this case the variable 'v':
SELECT v.day from `travel-sample` as t UNNEST t.schedule v where v.flight="LY104";
The same applies to deeper nesting levels:
CREATE INDEX inested ON `travel-sample`
( DISTINCT ARRAY (DISTINCT ARRAY y.flight FOR y IN x.special_flights END) FOR x IN schedule END);
In this case you have to use 'y' and 'x':
SELECT x.day from `travel-sample` as t UNNEST t.schedule x UNNEST x.special_flights y where y.flight="AI444";
And now everything works fine.
But another problem arises when I query like this:
SELECT * from `travel-sample` as t UNNEST t.schedule x UNNEST x.special_flights y
where x.day=7 and y.flight="AI444";
Only the 'day' index, created as follows, is used:
CREATE INDEX day
ON `travel-sample` ( DISTINCT ARRAY y.day FOR y IN schedule END );
In fact only one index is used at a time, sometimes 'day', sometimes 'inested'.
You can use Couchbase 4.5 (GA upcoming) with array indexing. Array indexing can be used with UNNEST. It allows you to index individual elements of arrays, including arrays nested within other arrays.
You can create the following indexes, and then use EXPLAIN to make sure there is an IndexScan using your intended index.
CREATE INDEX idx_discount ON part( DISTINCT ARRAY l.lo_discount FOR l IN lineorder END );
CREATE INDEX idx_quantity ON part( DISTINCT ARRAY l.lo_quantity FOR l IN lineorder END );
CREATE INDEX idx_year ON part( DISTINCT ARRAY ( DISTINCT ARRAY o.d_year FOR o IN l.orderdate END ) FOR l IN lineorder END );
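For instance, a quick check could look like this (a minimal sketch; the exact plan output depends on your Couchbase version, but for a DISTINCT ARRAY index it typically shows a DistinctScan over an IndexScan naming the index):
EXPLAIN SELECT SUM(l.lo_extendedprice*l.lo_discount*0.01) AS revenue
FROM part p UNNEST p.lineorder l UNNEST l.orderdate o
WHERE o.d_year=1993 AND l.lo_discount BETWEEN 1 AND 3 AND l.lo_quantity<25;
If the plan falls back to the primary index instead, recheck that the UNNEST aliases in the query match the variable names used in the index definitions, as discussed above.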
I have a query of Olympic countries in Power Query which I wish to sort using another query containing "prioritised countries" (the current top 10). I wish to sort the original query so that if a country is on the prioritised list it is sorted alphabetically at the top of the query.
The best I have been able to do is merge the queries; however, this removes countries that are not on the prioritised list. I appreciate that I could create a second copy of the original query, append it to the prioritised countries and then remove duplicates, but I am looking for a more elegant solution, as that would require refreshing the data twice.
Let Q be the query to sort and P be the priority list. Then you can get your desired result by appending the set difference Q \ P to the intersection Q ∩ P, with each part sorted alphabetically.
Here's one way to do this in M:
let
    Source =
        Table.FromList(
            List.Combine(
                {
                    // prioritised countries that appear in Q, sorted alphabetically
                    List.Sort( List.Intersect( { P[Country], Q[Country] } ) ),
                    // all remaining countries of Q, sorted alphabetically
                    List.Sort( List.RemoveItems( Q[Country], P[Country] ) )
                }
            ),
            null,
            {"Country"}
        )
in
    Source
I need to write an N1QL query that requires a subquery in the SELECT clause. Since it is mandatory to use USE KEYS when writing subqueries in N1QL, how do I write the USE KEYS clause for an inner joined query? Below is an example of the same case:
select meta(m).id as _ID, meta(m).cas as _CAS,
(select c.description
from bucketName p join bucketName c on p.categoryId = c.categoryId and p.type='product' and
c.type='category' and p.masterId=m.masterId ) as description //--How to use USE KEYS here ?
from bucketName m where m.type='master' and m.caseId='12345'
My requirement is to fetch some values from another two joined tables; however, I simplified the query above to make it more understandable.
Please suggest the correct way to implement this.
Also, is writing sub-queries in N1QL better than fetching the documents separately and merging them in code?
Correlated subqueries outside the FROM clause require USE KEYS, because global secondary index queries can take a long time and consume resources. This is a restriction in N1QL at present. If you can derive p's document key from m, you can supply it as USE KEYS on p.
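For example, if the product and category document keys can be derived from fields of m and p, the correlated subquery could look roughly like the sketch below (the "product::"/"category::" key patterns are hypothetical; adjust them to your actual key scheme):
SELECT META(m).id AS _ID, META(m).cas AS _CAS,
    (SELECT c.description
     FROM bucketName p USE KEYS ("product::" || TOSTRING(m.masterId))
     JOIN bucketName c ON KEYS ("category::" || TOSTRING(p.categoryId))
     WHERE c.type = 'category') AS description
FROM bucketName m
WHERE m.type = 'master' AND m.caseId = '12345';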
Otherwise you have two options
Option 1: As your subquery is in the projection, use an ANSI JOIN (https://blog.couchbase.com/ansi-join-support-n1ql/):
SELECT META(m).id AS _ID, META(m).cas AS _CAS, c.description
FROM bucketName AS m
LEFT JOIN bucketName AS p ON p.masterId=m.masterId AND p.type='product'
LEFT JOIN bucketName AS c ON c.type='category' AND p.categoryId = c.categoryId
WHERE m.type='master' AND m.caseId='12345';
CREATE INDEX ix1 ON bucketName(caseId) WHERE type='master';
CREATE INDEX ix2 ON bucketName(masterId, categoryId) WHERE type='product';
CREATE INDEX ix3 ON bucketName(categoryId, description) WHERE type='category';
NOTE: If there is no unique relation from m to p to c, the JOIN can produce extra rows.
If that is the case, you can GROUP BY META(m).id, META(m).cas and use
ARRAY_AGG(c.description); all descriptions are then returned as an array, as sketched below.
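A minimal sketch of that grouped form, reusing the query above:
SELECT META(m).id AS _ID, META(m).cas AS _CAS, ARRAY_AGG(c.description) AS descriptions
FROM bucketName AS m
LEFT JOIN bucketName AS p ON p.masterId = m.masterId AND p.type = 'product'
LEFT JOIN bucketName AS c ON c.type = 'category' AND p.categoryId = c.categoryId
WHERE m.type = 'master' AND m.caseId = '12345'
GROUP BY META(m).id, META(m).cas;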
Option 2:
As you described, issue two separate queries and merge the results in the application.
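A rough sketch of that approach, assuming the same field names as above (the literal masterId array in the second query stands in for the ids collected from the first query):
Query 1, fetch the master documents:
SELECT META(m).id AS _ID, META(m).cas AS _CAS, m.masterId
FROM bucketName AS m
WHERE m.type = 'master' AND m.caseId = '12345';
Query 2, fetch the descriptions for the collected masterIds and merge them in application code:
SELECT p.masterId, c.description
FROM bucketName AS p
JOIN bucketName AS c ON c.type = 'category' AND p.categoryId = c.categoryId
WHERE p.type = 'product' AND p.masterId IN ["m1", "m2"];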
Is there a way to do the Postgres equivalent of array_agg or string_agg in Stream Analytics? I have data that comes in every few seconds, and I would like to get the counts of the values within a time frame.
Data:
{time:12:01:01,name:A,location:X,value:10}
{time:12:01:01,name:B,location:X,value:9}
{time:12:01:02,name:C,location:Y,value:5}
{time:12:01:02,name:B,location:Y,value:4}
{time:12:01:03,name:B,location:Z,value:2}
{time:12:01:03,name:A,location:Z,value:3}
{time:12:01:06,name:B,location:Z,value:4}
{time:12:01:06,name:C,location:Z,value:7}
{time:12:01:08,name:B,location:Y,value:1}
{time:12:01:13,name:B,location:X,value:8}
With a sliding window of 2 seconds, I want to group the data to see the following:
12:01:01, 2 events, 9.5 avg, 2 distinct names, 1 distinct location, nameA:1, nameB:1, locationX:1
12:01:02, 4 events, 7 avg, 3 distinct names, 2 distinct location, nameA:1, nameB:2,nameC:1,locationX:1,locationY:1
12:01:03...
12:01:06...
...
I can get the number of events, average, and distinct counts without issue. I use a window as well as a with statement to join on the timestamp to get the aggregated counts for that timestamp. I am having trouble figuring out how to get the total counts by name and location, mostly because I do not know how to aggregate strings in Azure.
with agg1 as (
    select system.timestamp as start,
        avg(value) as avg,
        count(1) as events,
        count(distinct name) as distinct_names,
        count(distinct location) as distinct_locations
    from input timestamp by created
    group by slidingwindow(second, 2)
),
agg2 as (
    select agg2_inner.start,
        array_agg(name, '|', ct_name) as countbyname (????)
    from (
        select system.timestamp as start,
            name, count(1) as ct_name
        from input timestamp by created
        group by slidingwindow(second, 2), name
    ) as agg2_inner
    group by agg2_inner.start, slidingwindow(second, 2)
)
select * from agg1 join agg2 on (datediff(second, agg1, agg2) between 0 and 2
    and agg1.start = agg2.start)
There is no set list of names or locations, so the query needs to be a bit dynamic. It is OK if the counts are in an object within a single query; a process later on can parse it to get the individual counts.
As far as I know, Azure Stream Analytics doesn't provide an array_agg method. But it provides the COLLECT method, which returns all the record values from the window.
I suggest you first use the COLLECT method to return the array, grouped by the time and the window.
Then you can use an Azure Stream Analytics JavaScript user-defined function to write your own logic to convert the array into the result.
For more details, you can refer to the sample below.
The query looks like this:
SELECT
    time, udf.yourudfname(COLLECT()) AS Result
INTO
    [YourOutputAlias]
FROM
    [YourInputAlias]
GROUP BY time, TumblingWindow(minute, 10)
The UDF looks like this.
It just returns the average and the number of events:
function main(InputJSON) {
    // InputJSON is the array of records collected by COLLECT() for the group
    var sum = 0;
    for (var i = 0; i < InputJSON.length; i++) {
        sum += InputJSON[i].value;
    }
    var result = { events: InputJSON.length, avg: sum / InputJSON.length };
    return result;
}
Data:
{"name": "A", "time":"12:01:01","value":10}
{"name": "B", "time":"12:01:01","value":9}
{"name": "C", "time":"12:01:02","value":10}
Result: (output screenshot omitted)
We have a PostgreSQL connection pool used by a multithreaded application that continuously inserts records into a big table. So, let's say we have 10 database connections executing the same function, which inserts a record.
The trouble is, we end up with 10 records inserted, whereas it should be only 2-3 records, if only the transactions could see each other's records (our function decides whether to insert a record based on the date of the last record found).
We cannot afford to lock the table for the duration of the function.
We have tried different techniques to make the database apply our rules to new records immediately, despite the fact that they are created in parallel transactions, but we haven't succeeded yet.
So, I would be very grateful for any help or idea!
To be more specific, here is the code:
CREATE TABLE schm.events ( evtime TIMESTAMP, ref_id INTEGER, param INTEGER, type INTEGER );
record filter rule:
BEGIN
    select count(*) into nCnt
    from schm.events e
    where e.ref_id = ref_id and e.param = param and e.type = type
      and e.evtime between (evtime - interval '10 seconds') and (evtime + interval '10 seconds');
    if nCnt = 0 then
        insert into schm.events values (evtime, ref_id, param, type);
    end if;
END;
UPDATE (comment length is not enough unfortunately)
I've applied the unique index solution to production. The results are pretty acceptable, but the initial target has not been achieved.
The issue is that with the unique hash I cannot control the interval between two records with sequential hash codes.
Here is the code:
CREATE TABLE schm.events_hash (
hash_code bigint NOT NULL
);
CREATE UNIQUE INDEX ui_events_hash_hash_code ON schm.events_hash
USING btree (hash_code);
--generate the hash code data by partitioning (splitting) evtime into 10-second intervals:
INSERT into schm.events_hash
select distinct ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) )
from schm.events;
--and then in a concurrently executed function I insert sequentially:
begin
INSERT into schm.events_hash values ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) );
insert into schm.events values (evtime, ref_id, param, type);
end;
In that case, if evtime lies within the hash-determined interval, only one record is inserted.
The problem is that records which fall into different hash intervals, but are close to each other in time (e.g. less than 60 seconds apart), are not filtered out.
insert into schm.events values ( '2013-07-22 19:32:37', '123', '10', '20' ); --inserted, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:37' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:39', '123', '10', '20' ); --filtered out, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:39' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:41', '123', '10', '20' ); --inserted, test fail, (trunc( extract(epoch from cast('2013-07-22 19:32:41' as timestamp)) / 10 ) = 137450716 )
I think there must be a way to modify the hash function to achieve the initial target, but I haven't found it yet. Maybe there are some table constraint expressions that are evaluated by PostgreSQL itself, outside of the transaction?
About your only options are:
Using a unique index with a hack to collapse 20-second ranges to a single value;
Using advisory locking to control communication; or
SERIALIZABLE isolation and intentionally creating a mutual dependency between sessions. Not 100% sure this will be practical in your case.
What you really want is a dirty read, but PostgreSQL does not support dirty reads, so you're kind of stuck there.
You might land up needing a co-ordinator outside the database to manage your requirements.
Unique index
You can truncate your timestamps for the purpose of uniqueness checking, rounding them to regular boundaries so they jump in 20-second chunks. Then add them to a unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type).
Only one insert will succeed and the rest will fail with an error. You can trap the error in a BEGIN ... EXCEPTION block in PL/PgSQL, or preferably just handle it in the application.
I think a reasonable definition of chunk_time_seconds might be:
CREATE OR REPLACE FUNCTION chunk_time_seconds(t timestamptz, round_seconds integer)
RETURNS bigint
AS $$
SELECT (floor(extract(epoch from t) / round_seconds) * round_seconds)::bigint;
$$ LANGUAGE sql IMMUTABLE;
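A minimal sketch of the index and the insert-side handling, assuming the schm.events table from the question (evtime is a plain timestamp there, so the expression pins a time zone to keep the index expression immutable; the index name and the p_* parameters are placeholders):
CREATE UNIQUE INDEX events_dedup_ux
    ON schm.events (chunk_time_seconds(evtime AT TIME ZONE 'UTC', 20), ref_id, param, type);
-- inside the inserting PL/pgSQL function:
BEGIN
    INSERT INTO schm.events (evtime, ref_id, param, type)
    VALUES (p_evtime, p_ref_id, p_param, p_type);
EXCEPTION WHEN unique_violation THEN
    NULL;  -- another session already inserted a row for this 20-second chunk
END;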
A starting point for advisory locking:
Advisory locks can be taken on a single bigint or a pair of 32-bit integers. Your key is bigger than that, it's three integers, so you can't directly use the simplest approach of:
IF pg_try_advisory_lock(ref_id, param) THEN
... do insert ...
END IF;
then after 10 seconds, on the same connection (but not necessarily in the same transaction), issue pg_advisory_unlock(ref_id, param).
It won't work because you must also filter on type and there's no three-integer-argument form of pg_advisory_lock. If you can turn param and type into smallints you could:
IF pg_try_advisory_lock(ref_id, (param << 16) + type) THEN
but otherwise you're in a bit of a pickle. You could hash the values, of course, but then you run the (small) risk of incorrectly skipping an insert that should not be skipped in the case of a hash collision. There's no way to trigger a recheck because the conflicting rows aren't visible, so you can't use the usual solution of just comparing rows.
So ... if you can fit the key into 64 bits and your application can deal with the need to hold the lock for 10-20s before releasing it in the same connection, advisory locks will work for you and will be very low overhead.
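Putting that together, a rough sketch inside the inserting function might look like this (the p_* parameters are placeholders, and the cast to int avoids smallint overflow when shifting):
IF pg_try_advisory_lock(p_ref_id, (p_param::int << 16) + p_type) THEN
    INSERT INTO schm.events (evtime, ref_id, param, type)
    VALUES (p_evtime, p_ref_id, p_param, p_type);
    -- hold the lock for the dedup window, then release it later
    -- from the same connection with:
    -- PERFORM pg_advisory_unlock(p_ref_id, (p_param::int << 16) + p_type);
END IF;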
I am using the SDO_NN operator to find the nearest hydrant to a building.
Building:
CREATE TABLE "BUILDINGS"
(
"NAME" VARCHAR2(40),
"SHAPE" "SDO_GEOMETRY")
Hydrant:
CREATE TABLE "HYDRANTS"
( "NAME" VARCHAR2(10),
"POINT" "SDO_POINT_TYPE"
);
I have set up spatial indexes properly for buildings.shape, and I run this query to get the nearest hydrant to the building 'Motel':
select b1.name as name, h.point.x as x, h.point.y as y from buildings b1, hydrants h where b1.name ='Motel' and
SDO_nn( b1.shape, MDSYS.SDO_GEOMETRY(2003,NULL, NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),
SDO_ORDINATE_ARRAY( h.point.x,h.point.y)), 'sdo_num_res=1')= 'TRUE';
Here's the problem:
When I set the parameter sdo_num_res=1, I get zero tuples.
And when I make sdo_num_res=2, I get one tuple.
What is the reason for this weird behavior?
Note: I get zero rows only when building.name = 'Motel'; for all other buildings I get 1 row when sdo_num_res = 1.
Edit:
Insert queries
Insert into buildings (NAME,SHAPE) values ('Motel',MDSYS.SDO_GEOMETRY(2003,NULL,NULL,MDSYS.SDO_ELEM_INFO_ARRAY(1,1003,1),MDSYS.SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447)));
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
To perform spatial comparisons between a point and a polygon, the SDO_GEOMETRY should be defined with SDO_GTYPE 2001 and its SDO_POINT attribute set to the SDO_POINT_TYPE we want to compare:
MDSYS.SDO_GEOMETRY(2001, NULL, SDO_POINT_TYPE(-79, 37, NULL), NULL, NULL)
First of all, your query does not do what you say it does: it actually returns the nearest building called "Motel" from any of your hydrants. To do what you want (i.e. the opposite) you need to reverse the order of the arguments to SDO_NN: all spatial operators search the first argument, using the value of the second argument.
Then the insert into your HYDRANTS table is wrong:
Insert into hydrants (name,POINT) values ('p57',MDSYS.SDO_POINT_TYPE(589,448,0));
The SDO_POINT_TYPE object is not designed to be used that way: it is only used inside the SDO_GEOMETRY type. The proper way is this:
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
And of course you need to change your table definition accordingly.
Then your building is also incorrectly created: a polygon must always close, i.e. the last point must be the same as the first point. So the proper shape should be like this:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
Here is the full example:
Create the tables:
create table buildings (
name varchar2(40) primary key,
shape sdo_geometry
);
create table hydrants(
name varchar2(10) primary key,
point sdo_geometry
);
Populate the tables:
insert into buildings (NAME,SHAPE) values ('Motel', SDO_GEOMETRY(2003,NULL,NULL,SDO_ELEM_INFO_ARRAY(1,1003,1),SDO_ORDINATE_ARRAY(564,425,585,436,573,458,552,447,564,425)));
insert into hydrants (name,POINT) values ('p57',sdo_geometry(2001, null, SDO_POINT_TYPE(589,448,null), null, null));
commit;
Confirm that the geometries are all correct:
select name, sdo_geom.validate_geometry_with_context (point, 0.05) from hydrants;
select name, sdo_geom.validate_geometry_with_context (shape, 0.05) from buildings;
Setup spatial metadata and create spatial indexes:
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'BUILDINGS',
'SHAPE',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index buildings_sx on buildings (shape)
indextype is mdsys.spatial_index;
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values (
'HYDRANTS',
'POINT',
sdo_dim_array (
sdo_dim_element ('X', 0,1000,0.05),
sdo_dim_element ('Y', 0,1000,0.05)
),
null
);
commit;
create index hydrants_sx on hydrants (point)
indextype is mdsys.spatial_index;
Now try the properly written query:
select h.name, h.point.sdo_point.x as x, h.point.sdo_point.y as y
from buildings b, hydrants h
where b.name ='Motel'
and sdo_nn(h.point, b.shape, 'sdo_num_res=1')= 'TRUE';
which returns:
NAME X Y
---------------- ---------- ----------
p57 589 448
1 row selected.