Is there multithreading in Oracle PL/SQL?

At the moment I run data transformations sequentially in my PL/SQL ETL scripts.
I've noticed that some transformations do not depend on each other.
Is there a way to open a thread in PL/SQL and offload some of the processing so the job finishes faster?
I have tried some performance options, including setting the degree of parallelism, but it is capped for a session.
I cannot use Java and I cannot design jobs outside Oracle - it has to be PL/SQL.

Since you have already tried the degree-of-parallelism setting: yes, it is possible to run some of your ETL operations in parallel. You create a "task" that waits for its child jobs, and each of those jobs processes only its own "chunk" (a subset of all records). Since Oracle 11g Release 2 there is a package, DBMS_PARALLEL_EXECUTE, which manages this. Here is a simple example.
Preparing test data:
create table test_parallel_processing(
  col_id int
  , col number
);

begin
  for i in 1..1000 loop
    execute immediate 'insert into test_parallel_processing
                       select :p1, 100 from dual' using i;
  end loop;
  commit;
end;
/
Process all 1000 records in 4 chunks with a parallel degree of 2:
declare
  v_sql          varchar2(1000);
  v_chunking_sql varchar2(1000);
begin
  /* we can use our own logic for dividing the whole scope into chunks: */
  v_chunking_sql := ' select 1 as start_id, 250 as end_id from dual
                      union all
                      select 251 as start_id, 500 as end_id from dual
                      union all
                      select 501 as start_id, 750 as end_id from dual
                      union all
                      select 751 as start_id, 1000 as end_id from dual';

  DBMS_PARALLEL_EXECUTE.CREATE_TASK('mytask');

  /* creating chunks with our SQL */
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_SQL(
    task_name => 'mytask'
    , sql_stmt => v_chunking_sql
    , by_rowid => false
  );

  /* SQL which will be performed against each chunk */
  v_sql := 'update test_parallel_processing
            set col = col*2
            where col_id between :start_id and :end_id';

  /* running the task with any parallel degree, for example, 2 */
  DBMS_PARALLEL_EXECUTE.RUN_TASK('mytask', v_sql, DBMS_SQL.NATIVE,
                                 parallel_level => 2);

  /* I leave this commented out, because dropping the task also deletes the
     information about the task and its chunks from the views
     USER_PARALLEL_EXECUTE_TASKS and USER_PARALLEL_EXECUTE_CHUNKS */
  --DBMS_PARALLEL_EXECUTE.drop_task('mytask');
end;
/
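After the task completes, a quick sanity check on the data (every row started with col = 100, so after the update all 1000 rows should hold 200):
select count(*) as rows_doubled
from test_parallel_processing
where col = 200;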
You can monitor the state and result of the processing using the views:
USER_PARALLEL_EXECUTE_TASKS
USER_PARALLEL_EXECUTE_CHUNKS
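For example, to check whether the task and all of its chunks finished cleanly (both views expose TASK_NAME and STATUS columns):
select task_name, status
from user_parallel_execute_tasks
where task_name = 'mytask';

select status, count(*) as chunks
from user_parallel_execute_chunks
where task_name = 'mytask'
group by status;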

Related

force replication of replicated tables

Some of my tables are of type REPLICATE. I would like these tables to be actually replicated (not pending) before I start querying my data. This will help me avoid data movement.
I have a script, which I found online, that runs in a loop and does a SELECT TOP 1 on all the tables which are set for replication, but sometimes the script runs for hours. It seems as if the server sometimes won't trigger replication even if you do a SELECT TOP 1 from foo.
How can you force SQL Data Warehouse to complete replication?
The script looks something like this:
BEGIN
    CREATE TABLE #tbl
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Sequence
         , CONCAT('SELECT TOP(1) * FROM ', s.name, '.', t.[name]) AS sql_code
    FROM sys.pdw_replicated_table_cache_state AS p
    JOIN sys.tables AS t
        ON t.object_id = p.object_id
    JOIN sys.schemas AS s
        ON t.schema_id = s.schema_id
    WHERE p.[state] = 'NotReady';

    DECLARE @nbr_statements INT = (SELECT COUNT(*) FROM #tbl), @i INT = 1;

    WHILE @i <= @nbr_statements
    BEGIN
        DECLARE @sql_code NVARCHAR(4000) = (SELECT sql_code FROM #tbl WHERE Sequence = @i);
        EXEC sp_executesql @sql_code;
        SET @i += 1;
    END;

    DROP TABLE #tbl;

    SET @i = 0;
    WHILE (SELECT TOP (1) p.[state]
           FROM sys.pdw_replicated_table_cache_state AS p
           JOIN sys.tables AS t
               ON t.object_id = p.object_id
           JOIN sys.schemas AS s
               ON t.schema_id = s.schema_id
           WHERE p.[state] = 'NotReady') = 'NotReady'
    BEGIN
        IF @i % 100 = 0
        BEGIN
            RAISERROR('Replication in progress', 0, 0) WITH NOWAIT;
        END;
        SET @i = @i + 1;
    END;
END
Henrik, if 'select top 1' doesn't trigger a replicated table build, then that would be a defect. Please file a support ticket.
Without looking at your system, it is impossible to know exactly what is going on. Here are a few things that could be factoring into extended build times and are worth looking into:
The replicated tables are large (size, not necessarily rows) requiring long build times.
There are a lot of secondary indexes on the replicated table requiring long build times.
Replicated table builds require staticrc20 (2 concurrency slots). If the concurrency slots are not available, the build will queue behind other running queries.
The replicated tables are constantly being modified with inserts, updates and deletes. Modifications require the table to be built again.
The best way is to run a command like this as part of the job which creates/updates the table:
select top 1 * from <table>
That forces the replicated table cache to be built at the right time, without the slow loop through the stored procedure.
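If you want the job to verify the outcome afterwards, a simple check against the same catalog views used in the script above (nothing new assumed) is:
-- any replicated tables still waiting for a cache build?
SELECT s.name + '.' + t.[name] AS table_name, p.[state]
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
    ON t.object_id = p.object_id
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady';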

How to improve cassandra 3.0 read performance and throughput using async queries?

I have a table:
CREATE TABLE my_table (
user_id text,
ad_id text,
date timestamp,
PRIMARY KEY (user_id, ad_id)
);
The lengths of the user_id and ad_id that I use are not longer than 15 characters.
I query the table like this:
Set<String> users = ... filled somewhere
Session session = ... built somewhere
BoundStatement boundQuery = ... built somewhere
(using query: "SELECT * FROM my_table WHERE user_id=?")
List<Row> rowAds =
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
The Set of users has approximately 3000 elements, and each user has approximately 300 ads.
This code is executed in 50 threads on the same machine (with different users), using the same Session object.
The algorithm takes between 2 and 3 seconds to complete.
The Cassandra cluster has 3 nodes, with a replication factor of 2. Each node has 6 cores and 12 GB of RAM.
The Cassandra nodes are at 60% of their CPU capacity and 33% of RAM (66% including page cache).
The querying machine is at 50% of its CPU capacity and 50% of RAM.
How do I improve the read time to less than 1 second?
Thanks!
UPDATE:
After some answers (thank you very much), I realized that I wasn't doing the queries in parallel, so I changed the code to:
List<Row> rowAds =
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.collect(toList())
.stream()
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
So now the queries are being done in parallel, and this gave me times of approximately 300 milliseconds, so that is a great improvement.
But my question remains: can it be faster?
Again, thanks!
users.stream()
.map(user -> session.executeAsync(boundQuery.bind(user)))
.map(ResultSetFuture::getUninterruptibly)
.map(ResultSet::all)
.flatMap(List::stream)
.collect(toList());
A remark: on the 2nd map() you're calling ResultSetFuture::getUninterruptibly. It's a blocking call, so you don't benefit much from asynchronous execution ...
Instead, try to transform the list of futures returned by the driver (hint: ResultSetFuture implements the ListenableFuture interface of Guava) into a future of a list.
See: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/Futures.html#successfulAsList(java.lang.Iterable)

PL/SQL Join Collection Object problems

I am working with an Oracle 11g database, release 11.2.0.3.0 - 64 bit production
I have written the following procedure which uses a cursor to collect batches of benefit_ids (which are simply of type NUMBER) from a table called benefit_info. For each benefit_id within each batch, I need to obtain the associated customers and then perform various calculations etc. So far I have the following:
CREATE OR REPLACE PROCEDURE ben_correct(in_bulk_collect_limit IN PLS_INTEGER DEFAULT 1000)
IS
  TYPE ben_identity_rec IS RECORD
  (
    life_scd_id NUMBER,
    benefit_id  NUMBER
  );
  TYPE ben_identity_col IS TABLE OF ben_identity_rec INDEX BY PLS_INTEGER;

  life_col ben_identity_col;
  ben_id   NUMBER;

  CURSOR benefit_cur
  IS
    SELECT benefit_id FROM benefit_info;

  TYPE benefit_ids_t IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
  benefit_ids benefit_ids_t;

  PROCEDURE get_next_set_of_incoming(out_benefit_ids OUT NOCOPY benefit_ids_t)
  IS
  BEGIN
    FETCH benefit_cur
      BULK COLLECT INTO out_benefit_ids
      LIMIT in_bulk_collect_limit;
  END;
BEGIN
  OPEN benefit_cur;
  LOOP
    get_next_set_of_incoming(benefit_ids);
    /*
      The code below is too slow as each benefit_id is considered
      individually. Want to change FOR LOOP into LEFT JOIN of benefit_ids
    */
    FOR indx IN 1 .. benefit_ids.count LOOP
      ben_id := benefit_ids(indx);
      SELECT c.life_scd_id, c.benefit_id
        BULK COLLECT INTO life_col
        FROM customer c
       WHERE c.benefit_id = ben_id;
      -- Now do further processing with life_col
    END LOOP;
    EXIT WHEN benefit_ids.count = 0;
  END LOOP;
  CLOSE benefit_cur;
END;
/
As indicated in the code above, the FOR indx IN 1 .. LOOP is VERY slow, particularly as there are millions of benefit_ids. However, I am aware I can replace the entire FOR LOOP with something like:
SELECT c.life_scd_id, c.benefit_id
BULK COLLECT INTO life_col
FROM customer c
LEFT JOIN table(benefit_ids) b
WHERE b.benefit_id IS NOT NULL;
However, for that to work I think I need to declare an object type at the schema level, as I believe that in a SELECT query you can only join on actual tables or collections of SQL object types. Therefore, from the procedure I remove
TYPE benefit_ids_t IS TABLE OF NUMBER INDEX BY PLS_INTEGER;
and instead at the schema level I have defined
CREATE OR REPLACE TYPE ben_id FORCE AS object
(
benefit_id number
);
CREATE OR REPLACE TYPE benefit_ids_t FORCE AS TABLE OF ben_id;
My revised code essentially becomes:
CREATE OR REPLACE PROCEDURE ben_correct(in_bulk_collect_limit IN PLS_INTEGER DEFAULT 1000)
IS
  sql_str VARCHAR2(1000);

  TYPE ben_identity_rec IS RECORD
  (
    life_scd_id NUMBER,
    benefit_id  NUMBER
  );
  TYPE ben_identity_col IS TABLE OF ben_identity_rec INDEX BY PLS_INTEGER;

  life_col ben_identity_col;

  CURSOR benefit_cur
  IS
    SELECT benefit_id FROM benefit_info;

  --- benefit_ids_t has now been declared at schema level
  benefit_ids benefit_ids_t;

  PROCEDURE get_next_set_of_incoming(out_benefit_ids OUT NOCOPY benefit_ids_t)
  IS
  BEGIN
    FETCH benefit_cur
      BULK COLLECT INTO out_benefit_ids
      LIMIT in_bulk_collect_limit;
  END;
BEGIN
  OPEN benefit_cur;
  LOOP
    get_next_set_of_incoming(benefit_ids);
    sql_str := 'SELECT c.life_scd_id, c.benefit_id
                FROM customer c
                LEFT JOIN table(benefit_ids) b
                WHERE b.benefit_id IS NOT NULL';
    EXECUTE IMMEDIATE sql_str BULK COLLECT INTO life_col;
    -- Now do further processing with life_col
    EXIT WHEN benefit_ids.count = 0;
  END LOOP;
  CLOSE benefit_cur;
END;
/
However, this generates ORA-24344 and PLS-00386 errors, i.e. a type mismatch found at 'OUT_BENEFIT_IDS' between the FETCH cursor and the INTO variables.
I sort of understand that it is complaining that benefit_ids_t is now a table of ben_id objects, each wrapping a number, which isn't quite the same as a table of numbers.
I've tried various attempts at resolving the issues, but I can't seem to quite get it right. Any help would be gratefully appreciated.
Also, any general comments to improve are welcome.
You don't need your table type to be of an object containing a number field, it can just be a table of numbers:
CREATE OR REPLACE TYPE benefit_ids_t FORCE AS TABLE OF number;
Or you can use a built-in type like sys.odcinumberlist, but having your own type under your control isn't a bad thing.
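If you are curious how the built-in type behaves, here is a minimal, self-contained sketch (not part of the procedure above) showing that a SYS.ODCINUMBERLIST variable can be queried with TABLE() through the COLUMN_VALUE pseudo-column:
declare
  ids sys.odcinumberlist := sys.odcinumberlist(101, 102, 103);
  n   pls_integer;
begin
  -- the scalar collection is exposed through the COLUMN_VALUE pseudo-column
  select count(*) into n from table(ids) t where t.column_value > 100;
  dbms_output.put_line('matching elements: ' || n);
end;
/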
You don't want to use dynamic SQL though; this:
sql_str := 'SELECT c.life_scd_id, c.benefit_id
FROM customer c
LEFT JOIN table(benefit_ids) b
WHERE b.benefit_id IS NOT NULL';
EXECUTE IMMEDIATE sql_str BULK COLLECT INTO life_col;
won't work because benefit_ids isn't in scope when that dynamic statement is executed. You can just do it statically:
SELECT c.life_scd_id, c.benefit_id
BULK COLLECT INTO life_col
FROM table(benefit_ids) b
JOIN customer c
ON c.benefit_id = b.column_value;
which is closer to what you had in your original code.
Your EXIT is also in the wrong place - it will try to process rows in a loop when it doesn't find any. I wouldn't bother with the separate fetch procedure at all, it's easier to follow with the fetch directly in the loop:
BEGIN
  OPEN benefit_cur;
  LOOP
    FETCH benefit_cur
      BULK COLLECT INTO benefit_ids
      LIMIT in_bulk_collect_limit;
    EXIT WHEN benefit_ids.count = 0;

    SELECT c.life_scd_id, c.benefit_id
      BULK COLLECT INTO life_col
      FROM table(benefit_ids) b
      JOIN customer c
        ON c.benefit_id = b.column_value;

    -- Now do further processing with life_col
  END LOOP;
  CLOSE benefit_cur;
END;
If you did really want your object type, you could keep that, but you would need to make your cursor return instances of that object, via its default constructor:
CURSOR benefit_cur
IS
SELECT ben_id(benefit_id) FROM benefit_info;
The customer query join would then be:
SELECT c.life_scd_id, c.benefit_id
BULK COLLECT INTO life_col
FROM table(benefit_ids) b
JOIN customer c
ON c.benefit_id = b.benefit_id;
As it's an object type you can refer to its field name, benefit_id, rather than the generic column_value from the scalar type table.

postgresql insert rules for parallel transactions

We have a PostgreSQL connection pool used by a multithreaded application that constantly inserts records into a big table. So, let's say we have 10 database connections executing the same function, which inserts the record.
The trouble is that we end up with 10 records inserted, while there should only be 2-3 records, which would be the case if the transactions could see each other's records (our function decides whether or not to insert the record based on the date of the last record found).
We cannot afford to lock the table for the duration of the function's execution.
We have tried different techniques to make the database apply our rules to new records immediately, despite the fact that they are created in parallel transactions, but haven't succeeded yet.
So, I would be very grateful for any help or idea!
To be more specific, here is the code:
schm.events ( evtime TIMESTAMP, ref_id INTEGER, param INTEGER, type INTEGER);
record filter rule:
BEGIN
  select count(*) into nCnt
  from events e
  where e.ref_id = ref_id and e.param = param and e.type = type
    and e.evtime between (evtime - interval '10 seconds') and (evtime + interval '10 seconds');
  if nCnt = 0 then
    insert into schm.events values (evtime, ref_id, param, type);
  end if;
END;
UPDATE (comment length is not enough, unfortunately):
I've applied the unique index solution to production. The results are pretty acceptable, but the initial target has not been achieved.
The issue is that, with the unique hash, I cannot control the interval between two records with sequential hash codes.
Here is the code:
CREATE TABLE schm.events_hash (
  hash_code bigint NOT NULL
);
CREATE UNIQUE INDEX ui_events_hash_hash_code ON schm.events_hash
  USING btree (hash_code);
--generate the hash code data by partitioning (splitting) evtime into 10 sec intervals:
INSERT into schm.events_hash
select distinct ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) )
from schm.events;
--and then in a concurrently executed function I insert sequentially:
begin
INSERT into schm.events_hash values ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) );
insert into schm.events values (evtime, ref_id, param, type);
end;
In that case, if evtime lies within the hash-determined interval, only one record gets inserted.
The problem is that records which fall into different hash intervals but are still close to each other in time (less than a 60 sec interval) do not get skipped.
insert into schm.events values ( '2013-07-22 19:32:37', '123', '10', '20' ); --inserted, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:37' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:39', '123', '10', '20' ); --filtered out, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:39' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:41', '123', '10', '20' ); --inserted, test fail, (trunc( extract(epoch from cast('2013-07-22 19:32:41' as timestamp)) / 10 ) = 137450716 )
I think there must be a way to modify the hash function to achieve the initial target, but I haven't found it yet. Maybe there are some table constraint expressions that are executed by PostgreSQL itself, outside of the transaction?
About your only options are:
Using a unique index with a hack to collapse 20-second ranges to a single value;
Using advisory locking to control communication; or
SERIALIZABLE isolation and intentionally creating a mutual dependency between sessions. Not 100% sure this will be practical in your case.
What you really want is a dirty read, but PostgreSQL does not support dirty reads, so you're kind of stuck there.
You might land up needing a co-ordinator outside the database to manage your requirements.
Unique index
You can truncate your timestamps for the purpose of uniqueness checking, rounding them to regular boundaries so they jump in 20 second chunks. Then add them to a unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type).
Only one insert will succeed and the rest will fail with an error. You can trap the error in a BEGIN ... EXCEPTION block in PL/PgSQL, or preferably just handle it in the application.
I think a reasonable definition of chunk_time_seconds might be:
CREATE OR REPLACE FUNCTION chunk_time_seconds(t timestamptz, round_seconds integer)
RETURNS bigint
AS $$
SELECT floor(extract(epoch from t) / round_seconds) * round_seconds;
$$ LANGUAGE sql IMMUTABLE;
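Putting the two pieces together, a minimal sketch of the unique index and the insert-side error handling might look like this (the index name is made up for illustration, and it assumes evtime is timestamptz so that the index expression stays immutable):
CREATE UNIQUE INDEX ux_events_chunk_dedup
    ON schm.events (chunk_time_seconds(evtime, 20), ref_id, param, type);

-- insert and silently ignore the duplicate from a concurrent session
DO $$
BEGIN
    INSERT INTO schm.events (evtime, ref_id, param, type)
    VALUES (now(), 123, 10, 20);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- another transaction already claimed this 20-second bucket
END;
$$;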
A starting point for advisory locking:
Advisory locks can be taken on a single bigint or a pair of 32-bit integers. Your key is bigger than that, it's three integers, so you can't directly use the simplest approach of:
IF pg_try_advisory_lock(ref_id, param) THEN
... do insert ...
END IF;
then after 10 seconds, on the same connection (but not necessarily in the same transaction), issue pg_advisory_unlock(ref_id, param).
It won't work because you must also filter on type and there's no three-integer-argument form of pg_advisory_lock. If you can turn param and type into smallints you could:
IF pg_try_advisory_lock(ref_id, (param << 16) + type) THEN
but otherwise you're in a bit of a pickle. You could hash the values, of course, but then you run the (small) risk of incorrectly skipping an insert that should not be skipped in the case of a hash collision. There's no way to trigger a recheck because the conflicting rows aren't visible, so you can't use the usual solution of just comparing rows.
So ... if you can fit the key into 64 bits and your application can deal with the need to hold the lock for 10-20s before releasing it in the same connection, advisory locks will work for you and will be very low overhead.
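For completeness, a rough sketch of what that could look like with the smallint packing suggested above (illustrative values only; the unlock has to happen later on the same connection):
DO $$
BEGIN
    -- ref_id in one lock key slot, param/type packed into the other
    IF pg_try_advisory_lock(123, (10 << 16) + 20) THEN
        INSERT INTO schm.events (evtime, ref_id, param, type)
        VALUES (now(), 123, 10, 20);
        -- hold the lock for the 10-20 second window, then on this same
        -- connection: SELECT pg_advisory_unlock(123, (10 << 16) + 20);
    END IF;
END;
$$;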

how to convert csv to table in oracle

How can I make a package that returns results in table format when passed CSV values?
select * from table(schema.mypackage.myfunction('one, two, three'))
should return
one
two
three
I tried something from Ask Tom but that only works with SQL types.
I am using Oracle 11g. Is there something built-in?
The following works. Invoke it as:
select * from table(splitter('a,b,c,d'))
create or replace function splitter(p_str in varchar2) return sys.odcivarchar2list
is
v_tab sys.odcivarchar2list:=new sys.odcivarchar2list();
begin
with cte as (select level ind from dual
connect by
level <=regexp_count(p_str,',') +1
)
select regexp_substr(p_str,'[^,]+',1,ind)
bulk collect into v_tab
from cte;
return v_tab;
end;
/
Alas, in 11g we still have to hand-roll our own PL/SQL tokenizers, using SQL types. In 11gR2 Oracle gave us an aggregating function to concatenate results into a CSV string, so perhaps in 12i they will provide the reverse capability.
If you don't want to create a SQL type especially you can use the built-in SYS.DBMS_DEBUG_VC2COLL, like this:
create or replace function string_tokenizer
(p_string in varchar2
, p_separator in varchar2 := ',')
return sys.dbms_debug_vc2coll
is
return_value SYS.DBMS_DEBUG_VC2COLL;
pattern varchar2(250);
begin
pattern := '[^('''||p_separator||''')]+' ;
select trim(regexp_substr (p_string, pattern, 1, level)) token
bulk collect into return_value
from dual
where regexp_substr (p_string, pattern, 1, level) is not null
connect by regexp_instr (p_string, pattern, 1, level) > 0;
return return_value;
end string_tokenizer;
/
Here it is in action:
SQL> select * from table (string_tokenizer('one, two, three'))
2 /
COLUMN_VALUE
----------------------------------------------------------------
one
two
three
SQL>
Acknowledgement: this code is a variant of some code I found on Tanel Poder's blog.
Here is another solution using a regular expression matcher entirely in sql.
SELECT regexp_substr('one,two,three','[^,]+', 1, level) abc
FROM dual
CONNECT BY regexp_substr('one,two,three', '[^,]+', 1, level) IS NOT NULL
For optimal performance, it is best to avoid hierarchical (CONNECT BY) queries in the splitter function.
The following splitter function performs a good deal better when applied to larger data volumes:
CREATE OR REPLACE FUNCTION row2col(p_clob_text IN VARCHAR2)
RETURN sys.dbms_debug_vc2coll PIPELINED
IS
next_new_line_indx PLS_INTEGER;
remaining_text VARCHAR2(20000);
next_piece_for_piping VARCHAR2(20000);
BEGIN
remaining_text := p_clob_text;
LOOP
next_new_line_indx := instr(remaining_text, ',');
next_piece_for_piping :=
CASE
WHEN next_new_line_indx <> 0 THEN
TRIM(SUBSTR(remaining_text, 1, next_new_line_indx-1))
ELSE
TRIM(SUBSTR(remaining_text, 1))
END;
remaining_text := SUBSTR(remaining_text, next_new_line_indx+1 );
PIPE ROW(next_piece_for_piping);
EXIT WHEN next_new_line_indx = 0 OR remaining_text IS NULL;
END LOOP;
RETURN;
END row2col;
/
This performance difference can be observed below (I used the splitter function given earlier in this discussion).
SQL> SET TIMING ON
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(SYSTEM.row2col(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:00.93
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(splitter(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:14.90
SQL>
SQL> SET TIMING OFF
SQL>
I don't have 11g installed to play with, but there is a PIVOT and UNPIVOT operation for converting columns to rows / rows to columns, that may be a good starting point.
http://www.oracle.com/technology/pub/articles/oracle-database-11g-top-features/11g-pivot.html
(Having actually done some further investigation, this doesn't look suitable for this case - it works with actual rows / columns, but not sets of data in a column).
There is also DBMS_UTILITY.comma_to_table and table_to_comma for converting CSV lists into PL/SQL tables. There are some limitations (handling linefeeds, etc.) but it may be a good starting point.
My inclination would be to use the TYPE approach, with a simple function that does comma_to_table, then PIPE ROW for each entry in the result of comma_to_table (unfortunately, DBMS_UTILITY.comma_to_table is a procedure so it cannot be called from SQL).
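For what it's worth, a rough sketch of that idea might look like the following (the function name csv_to_rows is purely illustrative, it returns the built-in SYS.ODCIVARCHAR2LIST used earlier, and the identifier-style limitation of comma_to_table still applies):
create or replace function csv_to_rows(p_list in varchar2)
  return sys.odcivarchar2list pipelined
is
  v_count binary_integer;
  v_tab   dbms_utility.uncl_array;
begin
  -- comma_to_table splits the list into an index-by table of strings
  dbms_utility.comma_to_table(p_list, v_count, v_tab);
  for i in 1 .. v_count loop
    pipe row (trim(v_tab(i)));
  end loop;
  return;
end csv_to_rows;
/
-- usage: select * from table(csv_to_rows('one, two, three'));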
