Oracle INSTR equivalent in Spark SQL - apache-spark

I tried to replicate the Oracle INSTR function, but it seems Spark's version does not accept all the arguments that exist in Oracle. I receive the error below when I try to use this transformation for the "plataforma" field in the table:
SELECT
SUBSTR(a.SOURCE, 0, INSTR(a.SOURCE, '-', 1, 2) - 1) AS plataforma,
COUNT(*) AS qtd
FROM db1.table AS a
LEFT JOIN db1.table2 AS b ON a.ID=b.id
GROUP BY SUBSTR(a.SOURCE, 0, INSTR(a.SOURCE, '-', 1, 2) - 1)
ORDER BY qtd
The Apache Spark 2.0 database encountered an error while running this query.
Error running query: org.apache.spark.sql.AnalysisException: Invalid number of arguments for function instr. Expected: 2; Found: 4;
line 8 pos 45
I transformed the field that way, but I don't know if it is the correct approach. How can I replicate the same Oracle function in Spark? I need to do just this:
Source:
apache-spark-sql
sql-server-dw
Result:
apache-spark
sql-server

What you're looking for is the substring_index function:
substring_index('apache-spark-sql', '-', 2)
It returns the substring before the second occurrence of -.
I suppose you want the substring before the last occurrence of -, so you can count the number of - characters in the input string and combine that with substring_index like this:
substring_index(col, '-', size(split(col, '-')) - 1)
Where size(split(col, '-')) - 1 gives the number of occurrences of -.
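For illustration, a minimal runnable sketch in Spark SQL (the literal strings and the subquery alias t stand in for the a.SOURCE values from the question):
-- keep everything before the last '-'
SELECT substring_index(src, '-', size(split(src, '-')) - 1) AS plataforma
FROM (
  SELECT 'apache-spark-sql' AS src
  UNION ALL
  SELECT 'sql-server-dw'
) t;
-- returns: apache-spark, sql-server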

Related

How to preserve a list type when calling that udf in pyspark?

I have a PySpark UDF which returns a list of weeks. yw_list contains a list of weeks like 202001, 202002, ..., 202048, etc.
def Week_generator(week, no_of_weeks):
    end_index = yw_list.index(week)
    start_index = end_index - no_of_weeks + 1
    return yw_list[start_index:end_index + 1]
spark.udf.register("Week_generator", Week_generator)
When I call this UDF in Spark SQL, the result is stored as a string instead of a list, so I'm not able to iterate over the values in the list.
spark.sql(""" select Week_generator('some week column', 4) as col1 from xyz""")
Output Schema: col1:String
Any idea or suggestion on how to resolve this ?
As pointed out by Suresh, I missed adding the return datatype.
spark.udf.register("Week_generator", Week_generator, ArrayType(StringType()))
This solved my issue.
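For completeness, a minimal sketch of the fixed registration with the required imports (the week value '202004' and table xyz are placeholders, assuming the Week_generator and yw_list definitions from the question hold string week values):
from pyspark.sql.types import ArrayType, StringType

# registering with an explicit return type makes Spark treat the result as array<string>
spark.udf.register("Week_generator", Week_generator, ArrayType(StringType()))

df = spark.sql("select Week_generator('202004', 4) as col1 from xyz")
df.printSchema()  # col1 is now array<string> instead of string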

string concat operator(||) throwing error in hive

I am trying to concatenate string columns in a table with the concat operator || and it is throwing an error.
Here is the query: select "Bob"||'~'||"glad" from table
It throws this error: ParseException - cannot recognize input near '|' '"~"' '|' in expression specification
It works with the concat function but not with the concat operator:
select concat("bob","~","glad") from table - this works
I am using Hive version 2.1. Could anyone tell me why this operator is not working?
Thanks, Babu
Hive doesn't support the concat operator ||; it's Oracle syntax. Please use the concat function to concatenate multiple values. You can use concat_ws to concatenate with a delimiter.
concat
select concat ('this','~','is','~','hello','~','world');
Output : this~is~hello~world
select concat_ws ('~','hello','world','is','not','enough');
Output : hello~world~is~not~enough
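The same works on columns; for example, a sketch assuming a table t with string columns first_name and last_name (all hypothetical names):
-- concatenate columns with a '~' delimiter instead of string literals
select concat_ws('~', first_name, last_name) from t;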

split a file path into its constituent paths in Hive/Presto

Using Presto/Hive, I'd like to split a string in the following way.
Input string:
\Users\Killer\Downloads\Temp
\Users\Killer\Downloads\welcome
and have the query return these rows:
\Users\
\Users\Killer\
\Users\Killer\Downloads\
\Users\Killer\Downloads\Temp
\Users\
\Users\Killer\
\Users\Killer\Downloads\
\Users\Killer\Downloads\welcome
Can anyone please help me.
Solution for Hive: split to get an array, explode the array using posexplode, collect the array again using an analytic function, and concatenate. (The literal \ should be shielded with one more backslash, \\, and in the regex used in split a single backslash is represented as four backslashes.)
select s.level,
concat(concat_ws('\\',collect_set(s.path) over(order by level rows between unbounded preceding and current row)),
case when level<size(split(t.str,'\\\\'))-1 then '\\' else '' end
) result
from mytable t lateral view posexplode(split(t.str,'\\\\')) s as level, path
Result:
level result
0 \
1 \Users\
2 \Users\Killer\
3 \Users\Killer\Downloads\
4 \Users\Killer\Downloads\Temp
This can do the job:
SELECT item, array_join( array_agg(item) over (order by id), '\' )
FROM UNNEST(split('\Users\Killer\Downloads\Temp','\')) WITH ORDINALITY t(item,id)
Explanation:
We first split the string into an array on the delimiter \, then we UNNEST this array into rows, one row per item. After that we do array_agg over all items up to this row's id (a "rolling" window aggregation), and finally we array_join them back with the \ delimiter.

remove last character from string

I am trying to create a new dataframe column (b) by removing the last character from (a).
Column a is a string with different lengths, so I am trying the following code -
from pyspark.sql.functions import *
df.select(substring('a', 1, length('a') -1 ) ).show()
I get a TypeError: 'Column' object is not callable
It seems to be due to using multiple functions, but I can't understand why, as these work on their own.
If I hardcode the column length this works:
df.select(substring('a', 1, 10)).show()
Or if I use length on its own it works:
df.select(length('a')).show()
Why can I not use multiple functions? Is there an easier method of removing the last character from all rows in a column?
Using substr
df.select(col('a').substr(lit(0), length(col('a')) - 1))
or using regexp_extract:
df.select(regexp_extract(col('a'), '(.*).$', 1))
The substring function does not work because the parameters pos and len need to be integers, not columns:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
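A small runnable sketch of both suggestions, plus an expr-based variant (not from the answer above) that pushes the expression into Spark SQL, where substring does accept column expressions; it assumes an active SparkSession named spark:
from pyspark.sql.functions import col, lit, length, regexp_extract, expr

df = spark.createDataFrame([('abcde',), ('xy',)], ['a'])

# Column.substr accepts Column arguments, unlike the substring function
df.select(col('a').substr(lit(0), length(col('a')) - 1).alias('b')).show()

# regexp_extract keeps everything captured before the final character
df.select(regexp_extract(col('a'), '(.*).$', 1).alias('b')).show()

# the SQL substring function accepts expressions when written via expr
df.select(expr("substring(a, 1, length(a) - 1)").alias('b')).show()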
Your code is almost correct. You just need to use the len function.
df = spark.createDataFrame([('abcde',)],['dummy'])
from pyspark.sql.functions import substring
df.select('dummy',substring('dummy', 1, len('dummy') -1).alias('substr_dummy')).show()
#+-----+------------+
#|dummy|substr_dummy|
#+-----+------------+
#|abcde|        abcd|
#+-----+------------+

how to convert csv to table in oracle

How can I make a package that returns results in table format when passed CSV values?
select * from table(schema.mypackage.myfunction('one, two, three'))
should return
one
two
three
I tried something from Ask Tom but that only works with SQL types.
I am using Oracle 11g. Is there something built-in?
The following works.
Invoke it as:
select * from table(splitter('a,b,c,d'))
create or replace function splitter(p_str in varchar2) return sys.odcivarchar2list
is
  v_tab sys.odcivarchar2list := new sys.odcivarchar2list();
begin
  with cte as (
    select level ind
    from dual
    connect by level <= regexp_count(p_str, ',') + 1
  )
  select regexp_substr(p_str, '[^,]+', 1, ind)
  bulk collect into v_tab
  from cte;
  return v_tab;
end;
/
Alas, in 11g we still have to hand-roll our own PL/SQL tokenizers using SQL types. In 11gR2 Oracle gave us an aggregating function (LISTAGG) to concatenate results into a CSV string, so perhaps a later release will provide the reverse capability.
If you don't want to create a SQL type specially, you can use the built-in SYS.DBMS_DEBUG_VC2COLL, like this:
create or replace function string_tokenizer
    (p_string    in varchar2
    , p_separator in varchar2 := ',')
    return sys.dbms_debug_vc2coll
is
    return_value SYS.DBMS_DEBUG_VC2COLL;
    pattern      varchar2(250);
begin
    pattern := '[^('''||p_separator||''')]+' ;
    select trim(regexp_substr (p_string, pattern, 1, level)) token
    bulk collect into return_value
    from dual
    where regexp_substr (p_string, pattern, 1, level) is not null
    connect by regexp_instr (p_string, pattern, 1, level) > 0;
    return return_value;
end string_tokenizer;
/
Here it is in action:
SQL> select * from table (string_tokenizer('one, two, three'))
2 /
COLUMN_VALUE
----------------------------------------------------------------
one
two
three
SQL>
Acknowledgement: this code is a variant of some code I found on Tanel Poder's blog.
Here is another solution using a regular expression matcher entirely in SQL.
SELECT regexp_substr('one,two,three','[^,]+', 1, level) abc
FROM dual
CONNECT BY regexp_substr('one,two,three', '[^,]+', 1, level) IS NOT NULL
For optimal performance, it is best to avoid hierarchical (CONNECT BY) queries in the splitter function.
The following splitter function performs a good deal better when applied to greater data volumes:
CREATE OR REPLACE FUNCTION row2col(p_clob_text IN VARCHAR2)
  RETURN sys.dbms_debug_vc2coll PIPELINED
IS
  next_new_line_indx    PLS_INTEGER;
  remaining_text        VARCHAR2(20000);
  next_piece_for_piping VARCHAR2(20000);
BEGIN
  remaining_text := p_clob_text;
  LOOP
    next_new_line_indx := instr(remaining_text, ',');
    next_piece_for_piping :=
      CASE
        WHEN next_new_line_indx <> 0 THEN
          TRIM(SUBSTR(remaining_text, 1, next_new_line_indx - 1))
        ELSE
          TRIM(SUBSTR(remaining_text, 1))
      END;
    remaining_text := SUBSTR(remaining_text, next_new_line_indx + 1);
    PIPE ROW(next_piece_for_piping);
    EXIT WHEN next_new_line_indx = 0 OR remaining_text IS NULL;
  END LOOP;
  RETURN;
END row2col;
/
This performance difference can be observed below (I used the splitter function given earlier in this discussion).
SQL> SET TIMING ON
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(SYSTEM.row2col(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:00.93
SQL>
SQL> WITH SRC AS (
2 SELECT rownum||',a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z'||rownum txt
3 FROM DUAL
4 CONNECT BY LEVEL <=10000
5 )
6 SELECT NULL
7 FROM SRC, TABLE(splitter(txt)) t
8 HAVING MAX(t.column_value) > 'zzz'
9 ;
no rows selected
Elapsed: 00:00:14.90
SQL>
SQL> SET TIMING OFF
SQL>
I don't have 11g installed to play with, but there is a PIVOT and UNPIVOT operation for converting columns to rows / rows to columns, that may be a good starting point.
http://www.oracle.com/technology/pub/articles/oracle-database-11g-top-features/11g-pivot.html
(Having actually done some further investigation, this doesn't look suitable for this case - it works with actual rows / columns, but not sets of data in a column).
There are also DBMS_UTILITY.comma_to_table and table_to_comma for converting CSV lists into PL/SQL tables. They have some limitations (handling linefeeds, etc.), but they may be a good starting point.
My inclination would be to use the TYPE approach, with a simple function that does comma_to_table, then PIPE ROW for each entry in the result of comma_to_table (unfortunately, DBMS_UTILITY.comma_to_table is a procedure, so it cannot be called from SQL).
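A minimal sketch of that approach; the function name csv_to_rows and the collection type vc2_tab are hypothetical (assume CREATE TYPE vc2_tab AS TABLE OF VARCHAR2(4000) has been run), and note the documented restriction that comma_to_table only accepts lists of valid identifiers:
create or replace function csv_to_rows(p_list in varchar2)
  return vc2_tab pipelined
is
  l_count binary_integer;
  l_tab   dbms_utility.uncl_array;
begin
  -- comma_to_table is a procedure, so wrap it and pipe each entry out
  dbms_utility.comma_to_table(p_list, l_count, l_tab);
  for i in 1 .. l_count loop
    pipe row (trim(l_tab(i)));
  end loop;
  return;
end csv_to_rows;
/

select * from table(csv_to_rows('one, two, three'));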
