how to convert csv to table in oracle

how to convert csv to table in oracle - string

How can I make a package that returns results in table format when passed in csv values.
select * from table(schema.mypackage.myfunction('one, two, three'))
should return
one
two
three
I tried something from ask tom but that only works with sql types.
I am using oracle 11g. Is there something built-in?

The following works
invoke it as
select * from table(splitter('a,b,c,d'))
create or replace function splitter(p_str in varchar2) return sys.odcivarchar2list
is
v_tab sys.odcivarchar2list:=new sys.odcivarchar2list();
begin
with cte as (select level ind from dual
connect by
level <=regexp_count(p_str,',') +1
)
select regexp_substr(p_str,'[^,]+',1,ind)
bulk collect into v_tab
from cte;
return v_tab;
end;
/

Alas, in 11g we still have to handroll our own PL/SQL tokenizers, using SQL types. In 11gR2 Oracle gave us a aggregating function to concatenate results into a CSV string, so perhaps in 12i they will provide the reverse capability.
If you don't want to create a SQL type especially you can use the built-in SYS.DBMS_DEBUG_VC2COLL, like this:
create or replace function string_tokenizer
(p_string in varchar2
, p_separator in varchar2 := ',')
return sys.dbms_debug_vc2coll
is
return_value SYS.DBMS_DEBUG_VC2COLL;
pattern varchar2(250);
begin
pattern := '[^('''||p_separator||''')]+' ;
select trim(regexp_substr (p_string, pattern, 1, level)) token
bulk collect into return_value
from dual
where regexp_substr (p_string, pattern, 1, level) is not null
connect by regexp_instr (p_string, pattern, 1, level) > 0;
return return_value;
end string_tokenizer;
/
Here it is in action:
SQL> select * from table (string_tokenizer('one, two, three'))
2 /
COLUMN_VALUE
----------------------------------------------------------------
one
two
three
SQL>
Acknowledgement: this code is a variant of some code I found on Tanel Poder's blog.

Here is another solution using a regular expression matcher entirely in sql.
SELECT regexp_substr('one,two,three','[^,]+', 1, level) abc
FROM dual
CONNECT BY regexp_substr('one,two,three', '[^,]+', 1, level) IS NOT NULL

For optimal performance, it is best to avoid using hierarchical (CONNECT BY) queries in the splitter function.
The following splitter function performs a good deal better when applied to greater data volumes
CREATE OR REPLACE FUNCTION row2col(p_clob_text IN VARCHAR2)
RETURN sys.dbms_debug_vc2coll PIPELINED
IS
next_new_line_indx PLS_INTEGER;
remaining_text VARCHAR2(20000);
next_piece_for_piping VARCHAR2(20000);
BEGIN
remaining_text := p_clob_text;
LOOP
next_new_line_indx := instr(remaining_text, ',');
next_piece_for_piping :=
CASE
WHEN next_new_line_indx <> 0 THEN
TRIM(SUBSTR(remaining_text, 1, next_new_line_indx-1))
ELSE
TRIM(SUBSTR(remaining_text, 1))
END;
remaining_text := SUBSTR(remaining_text, next_new_line_indx+1 );
PIPE ROW(next_piece_for_piping);
EXIT WHEN next_new_line_indx = 0 OR remaining_text IS NULL;
END LOOP;
RETURN;
END row2col;
/
This performance difference can be observed below (I used the function splitter as was given earlier in this discussion).
SQL> SET TIMING ON
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(SYSTEM.row2col(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:00.93
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(splitter(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:14.90
SQL>
SQL> SET TIMING OFF
SQL>

I don't have 11g installed to play with, but there is a PIVOT and UNPIVOT operation for converting columns to rows / rows to columns, that may be a good starting point.
http://www.oracle.com/technology/pub/articles/oracle-database-11g-top-features/11g-pivot.html
(Having actually done some further investigation, this doesn't look suitable for this case - it works with actual rows / columns, but not sets of data in a column).
There is also DBMS_UTILITY.comma_to_table and table_to_comma for converting CSV lists into pl/sql tables. There are some limitations (handling linefeeds, etc) but may be a good starting point.
My inclination would be to use the TYPE approach, with a simple function that does comma_to_table, then PIPE ROW for each entry in the result of comma_to_table (unfortunately, DBMS_UTILITY.comma_to_table is a procedure so cannot call from SQL).

Related

How can this statement be written in JOOQ?

I have a table that has ID varchar(255) as PK. I have managed to create an auto increment temporary column rowNumber to use with a SELECT, as follows:
SELECT
(#cnt := #cnt + 1) AS rowNumber, ID
from table
CROSS JOIN (SELECT #cnt := 0) AS n
where
(
some conditions
)
)
and date > {ts '2020-08-06 08:51:23.08'}
ORDER BY ID
LIMIT 10
;
How can the above query be expressed in JOOQ? Especially, the (#cnt := #cnt + 1) AS rowNumber part.

SQL variable support in jOOQ
jOOQ currently (version 3.13) doesn't support session variables like the ones you've shown. There's a pending feature request for this: https://github.com/jOOQ/jOOQ/issues/2558
Whenever jOOQ doesn't support a vendor specific feature, you can use it via plain SQL templating, instead. Write:
ctx
.select(
field("(#cnt := #cnt + 1)", SQLDataType.BIGINT).as("rowNumber"),
TABLE.ID)
.from(TABLE)
.crossJoin(table(select(field("#cnt := 0", SQLDataType.BIGINT))).as("n"))
.where(...)
.orderBy(TABLE.ID)
.limit(10)
.fetch();
Assuming, as always, this static import:
import static org.jooq.impl.DSL.*;
Starting from MySQL 8
Just a side-note for future visitors who might be using MySQL 8: In those cases, you'd be using the DSL.rowNumber() window function instead of the approach using variables

force replication of replicated tables

Some of my tables are of type REPLICATE. I would these tables to be actually replicated (not pending) before I start querying my data. This will help me avoid data movement.
I have a script, which I found online, which runs in a loop and do a SELECT TOP 1 on all the tables which are set for replication, but sometimes the script runs for hours. It may seem as the server sometimes won't trigger replication even if you do a SELECT TOP 1 from foo.
How can you force SQL Datawarehouse to complete replication?
The script looks something like this:
begin
CREATE TABLE #tbl
WITH
( DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
ROW_NUMBER() OVER(
ORDER BY
(
SELECT
NULL
)) AS Sequence
, CONCAT('SELECT TOP(1) * FROM ', s.name, '.', t.[name]) AS sql_code
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady';
DECLARE #nbr_statements INT=
(
SELECT
COUNT(*)
FROM #tbl
), #i INT= 1;
WHILE #i <= #nbr_statements
BEGIN
DECLARE #sql_code NVARCHAR(4000)= (SELECT
sql_code
FROM #tbl
WHERE Sequence = #i);
EXEC sp_executesql #sql_code;
SET #i+=1;
END;
DROP TABLE #tbl;
SET #i = 0;
WHILE
(
SELECT TOP (1)
p.[state]
FROM sys.pdw_replicated_table_cache_state AS p
JOIN sys.tables AS t
ON t.object_id = p.object_id
JOIN sys.schemas AS s
ON t.schema_id = s.schema_id
WHERE p.[state] = 'NotReady'
) = 'NotReady'
BEGIN
IF #i % 100 = 0
BEGIN
RAISERROR('Replication in progress' , 0, 0) WITH NOWAIT;
END;
SET #i = #i + 1;
END;
END

Henrik, if 'select top 1' doesn't trigger a replicated table build, then that would be a defect. Please file a support ticket.
Without looking at your system, it is impossible to know exactly what is going on. Here are a couple of things that could be in factoring into extended build time to look into:
The replicated tables are large (size, not necessarily rows) requiring long build times.
There are a lot of secondary indexes on the replicated table requiring long build times.
Replicated table builds require statirc20 (2 concurrency slots). If the concurrency slots are not available, the build will queue behind other running queries.
The replicated tables are constantly being modified with inserts, updates and deletes. Modifications require the table to be built again.

The best way is to run a command like this as part of the job which creates/updates the table:
select top 1 * from <table>
That will force its redistribution at the correct time, without the slow loop through the stored procedure.

Is there a multithreading in Oracle PL/SQL?

Now I run data transformations sequentially in my PL/SQL ETL scripts.
I've noticed that some transformations do not depend on each other.
Is there a way to open a thread in PL/SQL and offload some of the processing so job finishes faster?
I tried some performance options including setting parallelism degree but it's capped for a session.
I cannot use Java and I cannot design jobs outside Oracle - it has to be PL/SQL.

If you tried "parallelism degree" setting, I think it's possible to run some of your ETL operations in parallel, so you will have a "task" which will wait it's child threads, each of these threads will process only it's "chunk" (subset of all records). In Oracle since 11g Release 2 there is a package DBMS_PARALLEL_EXECUTE which manages it. I've got simple example.
Preparing test data:
create table test_parallel_processing(
col_id int
, col number
)
begin
for i in 1..1000 loop
execute immediate 'insert into test_parallel_processing
select :p1, 100 from dual' using i;
end loop;
commit;
end;
Process all 1000 records in 4 chunks with parallel degree of 2:
declare
v_sql varchar2(1000);
v_chunking_sql varchar2(1000);
begin
/*we can use our own logic of deviding all scope to chunks:*/
v_chunking_sql := ' select 1 as start_id, 250 as end_id from dual
union all
select 251 as start_id, 500 as end_id from dual
union all
select 501 as start_id, 750 as end_id from dual
union all
select 751 as start_id, 1000 as end_id from dual';
DBMS_PARALLEL_EXECUTE.CREATE_TASK ('mytask');
/*creating chunks with our sql*/
DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_SQL(
task_name => 'mytask'
, sql_stmt => v_chunking_sql
, by_rowid => false
);
/*SQL which will be performed in each chunk*/
v_sql := 'update test_parallel_processing
set col = col*2
where col_id between :start_id and :end_id';
/*running task with any parallel degree, for example, 2*/
DBMS_PARALLEL_EXECUTE.RUN_TASK('mytask', v_sql, DBMS_SQL.NATIVE,
parallel_level => 2);
/*I comment this, cause dropping task will also delete information
about task and chunks from views: USER_PARALLEL_EXECUTE_TASKS
and USER_PARALLEL_EXECUTE_CHUNKS*/
--DBMS_PARALLEL_EXECUTE.drop_task ('mytask');
end;
You can control state and result of processing using views:
USER_PARALLEL_EXECUTE_TASKS
USER_PARALLEL_EXECUTE_CHUNKS

postgresql insert rules for parallel transactions

We have a postgreql connection pool used by multithreaded application, that permanently inserts some records into big table. So, lets say we have 10 database connections, executing the same function, whcih inserts the record.
The trouble is, we have 10 records inserted as a result meanwhile it should be only 2-3 records inserted, if only transactions could see the records of each other (our function takes decision to do not insert the record according to the date of the last record found).
We can not afford table locking for func execution period.
We tried different tecniques to make the database apply our rules to new records immediately despite the fact they are created in parallel transactions, but havent succeeded yet.
So, I would be very grateful for any help or idea!
To be more specific, here is the code:
schm.events ( evtime TIMESTAMP, ref_id INTEGER, param INTEGER, type INTEGER);
record filter rule:
BEGIN
select count(*) into nCnt
from events e
where e.ref_id = ref_id and e.param = param and e.type = type
and e.evtime between (evtime - interval '10 seconds') and (evtime + interval '10 seconds')
if nCnt = 0 then
insert into schm.events values (evtime, ref_id, param, type);
end if;
END;
UPDATE (comment length is not enough unfortunately)
I've applied to production the unique index solution. The results are pretty acceptable, but the initial target has not been achieved.
The issue is, with the unique hash I can not control the interval between 2 records with sequential hash_codes.
Here is the code:
CREATE TABLE schm.events_hash (
hash_code bigint NOT NULL
);
CREATE UNIQUE INDEX ui_events_hash_hash_code ON its.events_hash
USING btree (hash_code);
--generate the hash codes data by partioning(splitting) evtime in 10 sec intervals:
INSERT into schm.events_hash
select distinct ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) )
from schm.events;
--and then in a concurrently executed function I insert sequentially:
begin
INSERT into schm.events_hash values ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) );
insert into schm.events values (evtime, ref_id, param, type);
end;
In that case, if evtime lies within hash-determined interval, only one record is being inserted.
The case is, we can skip records that refer to different determined intervals, but are close to each other (less than 60 sec interval).
insert into schm.events values ( '2013-07-22 19:32:37', '123', '10', '20' ); --inserted, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:37' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:39', '123', '10', '20' ); --filtered out, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:39' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:41', '123', '10', '20' ); --inserted, test fail, (trunc( extract(epoch from cast('2013-07-22 19:32:41' as timestamp)) / 10 ) = 137450716 )
I think there must be a way to modify the hash function to achieve the initial target, but havent found it yet. Maybe, there are some table constraint expressions, that are executed by the postgresql itself, out of the transaction?

About your only options are:
Using a unique index with a hack to collapse 20-second ranges to a single value;
Using advisory locking to control communication; or
SERIALIZABLE isolation and intentionally creating a mutual dependency between sessions. Not 100% sure this will be practical in your case.
What you really want is a dirty read, but PostgreSQL does not support dirty reads, so you're kind of stuck there.
You might land up needing a co-ordinator outside the database to manage your requirements.
Unique index
You can truncate your timestamps for the purpose of uniquenes checking, rounding them to regular boundaries so they jump in 20 second chunks. Then add them to a unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type) .
Only one insert will succeed and the rest will fail with an error. You can trap the error in a BEGIN ... EXCEPTION block in PL/PgSQL, or preferably just handle it in the application.
I think a reasonable definition of chunk_time_seconds might be:
CREATE OR REPLACE FUNCTION chunk_time_seconds(t timestamptz, round_seconds integer)
RETURNS bigint
AS $$
SELECT floor(extract(epoch from t) / 20) * 20;
$$ LANGUAGE sql IMMUTABLE;
A starting point for advisory locking:
Advisory locks can be taken on a single bigint or a pair of 32-bit integers. Your key is bigger than that, it's three integers, so you can't directly use the simplest approach of:
IF pg_try_advisory_lock(ref_id, param) THEN
... do insert ...
END IF;
then after 10 seconds, on the same connection (but not necessarily in the same transaction) issue pg_advisory_unlock(ref_id_param).
It won't work because you must also filter on type and there's no three-integer-argument form of pg_advisory_lock. If you can turn param and type into smallints you could:
IF pg_try_advisory_lock(ref_id, param << 16 + type) THEN
but otherwise you're in a bit of a pickle. You could hash the values, of course, but then you run the (small) risk of incorrectly skipping an insert that should not be skipped in the case of a hash collision. There's no way to trigger a recheck because the conflicting rows aren't visible, so you can't use the usual solution of just comparing rows.
So ... if you can fit the key into 64 bits and your application can deal with the need to hold the lock for 10-20s before releasing it in the same connection, advisory locks will work for you and will be very low overhead.

How can I return a CSV string from PL/SQL table type in Oracle

I have defined a table type PL/SQL variable and added some data there.
create or replace type varTableType as table of varchar2(32767);
my_table varTableType := varTableType()
...
my_table := some_function();
Now I have this my_table table type variable with several thousands of records.
I have to select only those records ending with specific character, say 'a' and get results in a comma separated string.
I think that COLLECT function could do this, but I do not understand exactly how.
I am using Oracle 10g.

Without getting into the question- why are you using a table type and not a table (or temporary table), you can do it like this:
declare
my_table varTableType;
i varchar2(32767);
begin
my_table := new
varTableType('bbbb', 'ccca', 'ddda', 'eee', 'fffa', 'gggg');
select trim(xmlagg(xmlelement(e, column_value || ','))
.extract('//text()'))
into i
from table(my_table)
where column_value like '%a';
dbms_output.put_line(i);
end;
There are more ways to concat rows- WM_CONCAT (if enabled) or LISTAGG (since 11g R2) but the basic idea of
select column_value
from table(my_table)
where column_value like '%a';
stays
There is another way without sql:
declare
my_table varTableType;
i varchar2(32767);
begin
my_table := new
varTableType('bbbb', 'ccca', 'ddda', 'eee', 'fffa', 'gggg');
FOR j IN my_table.first .. my_table.last LOOP
IF my_table(j) like '%a' THEN
i := i || my_table(j);
END IF;
END LOOP;
dbms_output.put_line(i);
end;

See this blog post by Tim Hall for a number of ways to do this, depending on Oracle version.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string