I'm creating a thread-safe function.
I want the number to be incremented per 'p_sitename' and 'p_sourcename'.
CREATE OR REPLACE FUNCTION public.fn_ppp(
p_sitename character varying,
p_sourcename character varying)
RETURNS integer
LANGUAGE 'plpgsql'
COST 100
VOLATILE
AS $BODY$
declare
res integer ;
begin
lock table std_seq in access exclusive mode;
update
std_seq
set
post_id = (
select
post_id + 1 into res
from
std_seq
where
sitename = p_sitename and
sourcename = p_sourcename
limit 1
)
where
sitename = p_sitename and
sourcename = p_sourcename ;
return res;
end;
$BODY$;
The error message is
ERROR: INTO used with a command that cannot return data
CONTEXT: PL/pgSQL function fn_ppp(character varying,character varying) line
8 at SQL statement
SQL state: 42601
Why not?
SELECT ... INTO cannot be in a subquery, only on the top level.
But you can easily do what you want with UPDATE ... RETURNING:
UPDATE std_seq
SET post_id = post_id + 1
WHERE sitename = p_sitename
AND sourcename = p_sourcename
RETURNING post_id INTO res;
Since everything happens in one statement, there is no need to explicitly lock the table at all.
Every concurrent transaction that calls the same function with the same arguments will be blocked until your transaction is done, and no duplicates can occur.
The table should have a primary key constraint on (sitename, sourcename, post_id). That will prevent duplicates and may also speed up the UPDATE.
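For completeness, here is a sketch of the full corrected function (same signature as in the question; the explicit LOCK TABLE and the subquery are simply gone):
CREATE OR REPLACE FUNCTION public.fn_ppp(
    p_sitename character varying,
    p_sourcename character varying)
RETURNS integer
LANGUAGE plpgsql
VOLATILE
AS $BODY$
DECLARE
    res integer;
BEGIN
    -- the row lock taken by UPDATE serializes concurrent callers
    UPDATE std_seq
    SET post_id = post_id + 1
    WHERE sitename = p_sitename
      AND sourcename = p_sourcename
    RETURNING post_id INTO res;
    RETURN res;
END;
$BODY$;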
I'm dealing with 1 million tweets (arriving at roughly 5K per second) and I would like to do something similar to this code in Cassandra. Let's say that I'm using a Lambda Architecture.
I know the following code does not work; I just want to explain my logic through it.
DROP TABLE IF EXISTS hashtag_trend_by_week;
CREATE TABLE hashtag_trend_by_week(
shard_week timestamp,
hashtag text ,
counter counter,
PRIMARY KEY ( ( shard_week ), hashtag )
) ;
DROP TABLE IF EXISTS topten_hashtag_by_week;
CREATE TABLE topten_hashtag_by_week(
shard_week timestamp,
counter bigInt,
hashtag text ,
PRIMARY KEY ( ( shard_week ), counter, hashtag )
) WITH CLUSTERING ORDER BY ( counter DESC );
BEGIN BATCH
UPDATE hashtag_trend_by_week SET counter = counter + 22 WHERE shard_week='2021-06-15 12:00:00' and hashtag ='Gino';
INSERT INTO topten_hashtag_by_week( shard_week, hashtag, counter) VALUES ('2021-06-15 12:00:00','Gino',
SELECT counter FROM hashtag_trend_by_week WHERE shard_week='2021-06-15 12:00:00' AND hashtag='Gino'
) USING TTL 7200;
APPLY BATCH;
Then the final query to satisfy my UI should be something like
SELECT hashtag, counter FROM topten_hashtag_by_week WHERE shard_week='2021-06-15 12:00:00' limit 10;
Any suggestions?
You can only have CQL counter columns in a counter table so you need to rethink the schema for the hashtag_trend_by_week table.
Batch statements are used for making writes atomic in Cassandra so including a SELECT statement does not make sense.
The final query for topten_hashtag_by_week looks fine to me. Cheers!
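As a rough sketch of the two-step write this implies (the counter value written in step 3 must be read back by the application in step 2; the 22 below is just illustrative, and the 7200-second TTL is carried over from the question):
-- step 1: bump the counter in the counter table
UPDATE hashtag_trend_by_week SET counter = counter + 22
WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino';
-- step 2: read the new total back in the application
SELECT counter FROM hashtag_trend_by_week
WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino';
-- step 3: write the denormalized top-ten row using the value read in step 2
INSERT INTO topten_hashtag_by_week (shard_week, counter, hashtag)
VALUES ('2021-06-15 12:00:00', 22, 'Gino')
USING TTL 7200;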
We have multiple update queries in a single partition of a single column family, like below:
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', weight = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', height = ? where userid = 100;
username and url will always be the same and are mandatory fields, but depending on the information given there will be extra columns.
As this is a single-partition operation and we need atomicity + isolation, we will execute it in a batch.
As per the docs:
A BATCH statement combines multiple data modification language (DML) statements (INSERT, UPDATE, DELETE) into a single logical operation, and sets a client-supplied timestamp for all columns written by the statements in the batch.
Now, as we are updating columns (username, url) with the same value in multiple statements, will C* combine them into a single statement before executing, like
update t1 set username = 'abc', url = 'www.something.com', age = ?, weight = ?, height = ? where userid = 100;
or will the same value simply be upserted?
Another question: as they all have the same timestamp, how does C* resolve that conflict? Will C* compare every column (username, url) value?
As they all have the same timestamp C* resolves the conflict by choosing the largest value for the cells. Atomic Batch in Cassandra
Or should we add queries in batch like below. In this case we have to check username, url has already been added in statement.
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set weight = ? where userid = 100;
update t1 set height = ? where userid = 100;
In short, what is the best way to do it?
For your first question (will C* combine them into a single statement?) the answer is yes.
A single partition batch is applied as a single row mutation.
check this link for details:
https://issues.apache.org/jira/browse/CASSANDRA-6737
For your second question (will C* compare every column (username, url) value?) the answer is also yes.
As given in the answer at the link you provided, "conflict is resolved by choosing the largest value for the cells".
So you can write the queries in the batch either way (as given in your question), as it will ultimately be converted to a single write internally.
You are using a single-partition batch, so everything goes into a single partition: all of your updates will be merged and applied with a single RowMutation.
And so your updates will be applied with no batch log, atomically and in isolation.
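For illustration, the second variant from the question wrapped in a batch would look like this; since every statement targets the same partition (userid = 100), Cassandra applies the whole thing as one mutation (UNLOGGED merely makes explicit that no batch log is needed for a single-partition batch):
BEGIN UNLOGGED BATCH
  UPDATE t1 SET username = 'abc', url = 'www.something.com', age = ? WHERE userid = 100;
  UPDATE t1 SET weight = ? WHERE userid = 100;
  UPDATE t1 SET height = ? WHERE userid = 100;
APPLY BATCH;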
I have a PostgreSQL database (v9.5.3) that's hosting "jobs" for workers to pull, run, and commit back.
When a worker wants a job, it runs something to the effect of:
SELECT MIN(sim_id) FROM job WHERE job_status = 0;
-- status 0 indicates it's ready to be run
job is a table with this schema:
CREATE TABLE commit_schema.job (
sim_id serial NOT NULL,
controller_id smallint NOT NULL,
controller_parameters smallint NOT NULL,
model_id smallint NOT NULL,
model_parameters smallint NOT NULL,
client_id smallint,
job_status smallint DEFAULT 0,
initial_glucose_id smallint NOT NULL
);
Afterwards, it uses this sim_id to piece together a bunch of parameters in a JOIN:
SELECT a.par1, b.par2 FROM
a INNER JOIN b ON a.sim_id = b.sim_id;
These parameters are then returned to the worker, along with the sim_id, and the job is run. The sim_id is locked by setting job.job_status to 1, using an UPDATE:
UPDATE job SET job_status = 1 WHERE sim_id = $1;
The results are then committed using that same sim_id.
Ideally,
Workers wouldn't under any circumstances be able to get the same sim_id.
Two workers requesting a job won't error out, one will just have to wait to receive a job.
I think that using the serializable isolation level will ensure that MIN() always returns unique sim_ids, but I believe this may also be achievable at the read committed isolation level. Then again, MIN() may not be able to concurrently and deterministically give unique sim_ids to two concurrent workers?
This should work just fine for concurrent access using the default isolation level Read Committed and FOR UPDATE SKIP LOCKED (new in pg 9.5):
UPDATE commit_schema.job j
SET job_status = 1
FROM (
SELECT sim_id
FROM commit_schema.job
WHERE job_status = 0
ORDER BY sim_id
LIMIT 1
FOR UPDATE SKIP LOCKED
) sub
WHERE j.sim_id = sub.sim_id
RETURNING sim_id;
job_status should probably be defined NOT NULL.
Be wary of certain corner cases - detailed explanation in this related answer on dba.SE:
Postgres UPDATE … LIMIT 1
To address your comment
There are various ways to return from a function:
Make it a simple SQL function instead of PL/pgSQL.
Or use RETURN QUERY with PL/pgSQL.
Or assign the result to a variable with RETURNING ... INTO - which can be an OUT parameter, so it will be returned at the end of the function automatically. Or any other variable and return it explicitly.
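For illustration, a minimal sketch of the RETURN QUERY variant (the function name fetch_next_job is made up here):
CREATE OR REPLACE FUNCTION fetch_next_job()
RETURNS SETOF integer
LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY
   UPDATE commit_schema.job j
   SET    job_status = 1
   FROM  (
      SELECT sim_id
      FROM   commit_schema.job
      WHERE  job_status = 0
      ORDER  BY sim_id
      LIMIT  1
      FOR    UPDATE SKIP LOCKED
      ) sub
   WHERE  j.sim_id = sub.sim_id
   RETURNING j.sim_id;
END
$func$;
A worker then just runs SELECT fetch_next_job(); and gets either one sim_id or no row when the queue is empty.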
Related (with code examples):
Can I make a plpgsql function return an integer without using a variable?
Returning from a function with OUT parameter
We have a PostgreSQL connection pool used by a multithreaded application that permanently inserts records into a big table. So, let's say we have 10 database connections executing the same function, which inserts a record.
The trouble is, we end up with 10 records inserted, whereas only 2-3 records should have been inserted, if only the transactions could see each other's records (our function decides not to insert a record based on the date of the last record found).
We cannot afford to lock the table for the duration of the function's execution.
We have tried different techniques to make the database apply our rules to new records immediately, despite the fact that they are created in parallel transactions, but haven't succeeded yet.
So, I would be very grateful for any help or idea!
To be more specific, here is the code:
schm.events ( evtime TIMESTAMP, ref_id INTEGER, param INTEGER, type INTEGER);
record filter rule:
BEGIN
    -- nCnt is a local variable; ref_id, param, type and evtime are the function's input parameters
    select count(*) into nCnt
    from schm.events e
    where e.ref_id = ref_id and e.param = param and e.type = type
      and e.evtime between (evtime - interval '10 seconds') and (evtime + interval '10 seconds');
    if nCnt = 0 then
        insert into schm.events values (evtime, ref_id, param, type);
    end if;
END;
UPDATE (comment length is not enough unfortunately)
I've applied the unique index solution to production. The results are pretty acceptable, but the initial target has not been achieved.
The issue is, with the unique hash I cannot control the interval between 2 records with sequential hash codes.
Here is the code:
CREATE TABLE schm.events_hash (
hash_code bigint NOT NULL
);
CREATE UNIQUE INDEX ui_events_hash_hash_code ON schm.events_hash
USING btree (hash_code);
--generate the hash code data by partitioning (splitting) evtime into 10-second intervals:
INSERT into schm.events_hash
select distinct ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) )
from schm.events;
--and then in a concurrently executed function I insert sequentially:
begin
INSERT into schm.events_hash values ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) );
insert into schm.events values (evtime, ref_id, param, type);
end;
In that case, if evtime lies within hash-determined interval, only one record is being inserted.
The problem is, records that fall into different computed intervals but are still close to each other in time (less than 60 seconds apart) slip through the filter.
insert into schm.events values ( '2013-07-22 19:32:37', '123', '10', '20' ); --inserted, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:37' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:39', '123', '10', '20' ); --filtered out, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:39' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:41', '123', '10', '20' ); --inserted, test fail, (trunc( extract(epoch from cast('2013-07-22 19:32:41' as timestamp)) / 10 ) = 137450716 )
I think there must be a way to modify the hash function to achieve the initial target, but I haven't found it yet. Maybe there are some table constraint expressions that are executed by PostgreSQL itself, outside of the transaction?
About your only options are:
Using a unique index with a hack to collapse 20-second ranges to a single value;
Using advisory locking to control communication; or
SERIALIZABLE isolation and intentionally creating a mutual dependency between sessions. Not 100% sure this will be practical in your case.
What you really want is a dirty read, but PostgreSQL does not support dirty reads, so you're kind of stuck there.
You might land up needing a co-ordinator outside the database to manage your requirements.
Unique index
You can truncate your timestamps for the purpose of uniqueness checking, rounding them to regular boundaries so they jump in 20-second chunks. Then add them to a unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type).
Only one insert will succeed and the rest will fail with an error. You can trap the error in a BEGIN ... EXCEPTION block in PL/PgSQL, or preferably just handle it in the application.
I think a reasonable definition of chunk_time_seconds might be:
CREATE OR REPLACE FUNCTION chunk_time_seconds(t timestamptz, round_seconds integer)
RETURNS bigint
AS $$
SELECT (floor(extract(epoch from t) / round_seconds) * round_seconds)::bigint;
$$ LANGUAGE sql IMMUTABLE;
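With that in place, a sketch of the unique index and of trapping the duplicate error in PL/pgSQL (the index name is made up; if evtime is timestamp without time zone, declare the function parameter as timestamp instead, or the implicit cast will keep the expression from being indexable):
CREATE UNIQUE INDEX ux_events_chunk
ON schm.events (chunk_time_seconds(evtime, 20), ref_id, param, type);
-- inside the inserting function:
BEGIN
    INSERT INTO schm.events VALUES (evtime, ref_id, param, type);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- a concurrent transaction already inserted a row for this chunk
END;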
A starting point for advisory locking:
Advisory locks can be taken on a single bigint or a pair of 32-bit integers. Your key is bigger than that, it's three integers, so you can't directly use the simplest approach of:
IF pg_try_advisory_lock(ref_id, param) THEN
... do insert ...
END IF;
then after 10 seconds, on the same connection (but not necessarily in the same transaction), issue pg_advisory_unlock(ref_id, param).
It won't work because you must also filter on type and there's no three-integer-argument form of pg_advisory_lock. If you can turn param and type into smallints you could:
IF pg_try_advisory_lock(ref_id, (param << 16) + type) THEN
but otherwise you're in a bit of a pickle. You could hash the values, of course, but then you run the (small) risk of incorrectly skipping an insert that should not be skipped in the case of a hash collision. There's no way to trigger a recheck because the conflicting rows aren't visible, so you can't use the usual solution of just comparing rows.
So ... if you can fit the key into 64 bits and your application can deal with the need to hold the lock for 10-20s before releasing it in the same connection, advisory locks will work for you and will be very low overhead.
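A minimal sketch of that flow, assuming param and type do fit into 16 bits each:
IF pg_try_advisory_lock(ref_id, (param::integer << 16) + type) THEN
    INSERT INTO schm.events VALUES (evtime, ref_id, param, type);
    -- keep the lock for the 10-second window, then release it
    -- on the same connection:
    -- SELECT pg_advisory_unlock(ref_id, (param::integer << 16) + type);
END IF;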
I have this kind of DELETE query:
DELETE
FROM SLAVE_TABLE
WHERE ITEM_ID NOT IN (SELECT ITEM_ID FROM MASTER_TABLE)
Is there any way to optimize this?
You can use EXECUTE BLOCK to sequentially scan the detail table and delete the records that have no matching master record.
EXECUTE BLOCK
AS
  -- cursor over detail rows that have no matching master row
  DECLARE VARIABLE C CURSOR FOR
    (SELECT d.id
     FROM detail d LEFT JOIN master m
       ON d.master_id = m.id
     WHERE m.id IS NULL);
  DECLARE VARIABLE I INTEGER;
BEGIN
  OPEN C;
  WHILE (1 = 1) DO
  BEGIN
    FETCH C INTO :I;
    -- ROW_COUNT is 0 when the fetch ran past the last row
    IF (ROW_COUNT = 0) THEN
      LEAVE;
    DELETE FROM detail WHERE id = :I;
  END
  CLOSE C;
END
(NOT) IN can usually be optimized by using (NOT) EXISTS instead.
DELETE
FROM SLAVE_TABLE S
WHERE NOT EXISTS (SELECT 1 FROM MASTER_TABLE M WHERE M.ITEM_ID = S.ITEM_ID)
I am not sure what you are trying to do here, but to me this query indicates that you should be using foreign keys to enforce these kinds of constraints, not running queries to clean up the mess afterwards.
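For illustration, a sketch of that foreign key in Firebird (the constraint name is made up; the orphaned rows must be deleted once before the constraint can be added):
ALTER TABLE SLAVE_TABLE
ADD CONSTRAINT FK_SLAVE_MASTER
FOREIGN KEY (ITEM_ID) REFERENCES MASTER_TABLE (ITEM_ID)
ON DELETE CASCADE;
With ON DELETE CASCADE, deleting a master row removes its slave rows automatically, so the cleanup query is never needed.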