T-SQL Split Word into characters

I have searched everywhere and I cannot find this implementation anywhere.
Let's say I have the word: QWERTY
I want to obtain this table:
Q
W
E
R
T
Y
Or for QWERTY AnotherWord I want to obtain
Q
W
E
R
T
Y
[space character here]
A
n
o
t
h
e
r
W
o
r
d

Do it like this:
select substring(a.b, v.number+1, 1)
from (select 'QWERTY AnotherWord' b) a
join master..spt_values v on v.number < len(a.b)
where v.type = 'P'

Declare @word nvarchar(max)
Select @word = 'Hello This is the test';
with cte (Number) as
(Select 1
union all
select Number + 1 From cte where Number < len(@word)
)
select * from cte Cross apply (Select SUBSTRING(@word, Number, 1)) as J(Letter)
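For readers who want to see the idea outside of T-SQL, here is a minimal, language-neutral sketch in Python of what the recursive CTE does: generate the positions 1 to LEN and take a one-character substring at each. The function name split_chars is made up for this illustration and is not part of the answer.

```python
def split_chars(word: str) -> list[tuple[int, str]]:
    """Return (position, letter) pairs, using 1-based positions like SUBSTRING."""
    rows = []
    number = 1                      # anchor member of the CTE: SELECT 1
    while number <= len(word):      # stop once every position has been emitted
        rows.append((number, word[number - 1]))  # SUBSTRING(word, Number, 1)
        number += 1                 # recursive member: Number + 1
    return rows


if __name__ == "__main__":
    for number, letter in split_chars("QWERTY AnotherWord"):
        print(number, repr(letter))
```

One thing the sketch hides: SQL Server stops a recursive CTE after 100 recursion levels by default, so for longer strings the query needs an OPTION (MAXRECURSION n) hint.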

Here you have it:
create table #words (
character varchar(1)
)
declare @test varchar(10)
select @test = 'QWERTY'
declare @count int, @total int
select @total = len(@test), @count = 0
while @count <= @total
begin
insert into #words select substring(@test, @count, 1)
select @count = @count + 1
end
select * from #words
drop table #words

Here is a table-valued function (derived from aF's temp table implementation). It differs slightly from aF's implementation in that it starts with @count = 1; this excludes an extraneous leading space.
CREATE FUNCTION [dbo].[Chars] (@string VARCHAR(max))
RETURNS @chars TABLE (character CHAR)
AS
BEGIN
DECLARE @count INT,
@total INT
SELECT @total = Len(@string),
@count = 1
WHILE @count <= @total
BEGIN
INSERT INTO @chars
SELECT Substring(@string, @count, 1)
SELECT @count = @count + 1
END
RETURN
END
Usage:
SELECT * FROM dbo.chars('QWERTY AnotherWord')

Please, PLEASE avoid referencing system tables, especially system tables in system databases. In fact, the selected answer above probably won't compile in a Visual Studio 2013 Database Project.
Table variables are fine, but recursion with a CTE is the answer:
DECLARE @str VARCHAR(max)
SET @str = 'QWERTY AnotherWord';
WITH Split(stpos,endpos)
AS(
SELECT 1 AS stpos, 2 AS endpos
UNION ALL
SELECT endpos, endpos+1
FROM Split
WHERE endpos <= LEN(@str)
)
SELECT
'character' = SUBSTRING(@str,stpos,COALESCE(NULLIF(endpos,0),LEN(@str)+1)-stpos)
,'charindex' = stpos
FROM Split
That said, the use for the code above is to get a table full of letters representing different permissions for a user. That is not the way to do this. Make a table with an ID, a permission code and a description, then make a linking table between the users table and the new permissions table. This gives you the same abilities and doesn't make you solve dumb problems like this.

I wanted to contribute my own solution to this problem.
Convert into table valued function as desired (and handle nulls however you wish)
DECLARE @str nvarchar(100) = 'QWERTY AnotherWord'
DECLARE @len int = LEN(@str)-1;
--create a string of len(@str)-1 commas
--because STRING_SPLIT will return n rows for n-1 commas
--split string to return a table of len(@str) rows
--provide an index column named [index]
WITH [rows] AS (
SELECT
ROW_NUMBER() OVER (ORDER BY [value]) [index]
FROM STRING_SPLIT(REPLICATE(',', @len), ',')
),
--for each row, take the index number
--and extract the character from that index
[split] AS (
SELECT
[index],
SUBSTRING(@str,[index],1) [char]
FROM [rows]
)
--maintain the same order
--and return just the extracted characters
SELECT
--[index],
[char]
FROM [split]
ORDER BY [index] ASC
output:
char
----
Q
W
E
R
T
Y
A
n
o
t
h
e
r
W
o
r
d

I like the use of REPLICATE() and SUBSTRING() in the answer by @drrollergator. The version below additionally accounts for:
The truncation to 8000 characters mentioned in the Microsoft Learn docs. Explicitly casting to a larger data type avoids this.
The unordered ROW_NUMBER, as discussed in [https://stackoverflow.com/questions/44105691/row-number-without-order-by].
Sample SQL:
DECLARE @str NVARCHAR(MAX) = N'QWERTY AnotherWord'
SELECT
ss.[value]
FROM
( SELECT TOP(LEN(@str))
SUBSTRING(@str,n.[i],1) [value]
,n.[i]
FROM ( SELECT ROW_NUMBER() OVER(ORDER BY (SELECT '.')) [i] FROM STRING_SPLIT(REPLICATE(CAST('.' AS VARCHAR(MAX)),LEN(@str) - 1),'.') ) n([i])
/* [A.] Generate numbers equal to character count in @expression */
ORDER BY n.[i]
/* [B.] Return 1-Char-Substring for each number/position */
) ss

Related

Condition WHERE string without/with space

I have a table with some strings, and I would like to select rows with the condition string = some value.
The table contains no other strings.
The select returns more rows than expected when I run the code below.
What is wrong?
DECLARE @C VARCHAR(2) = 'A'+SPACE(1)
DECLARE @T TABLE (id INT NOT NULL, string VARCHAR(200) NOT NULL)
INSERT INTO @T
(
id,
string
)
VALUES
( 1, 'A'), (2,'A'+SPACE(1))
SELECT * FROM @T WHERE string = @C--With space only
Returns:
id string
1 A
2 A
I know how to write the select with LIKE '%.... '.
I want to know why T-SQL returns more rows.
SQL 2019, MSSQL version 18.9.2

SQL Server follows the ANSI standard when it comes to comparing strings with =. Read a longer description over here: https://dba.stackexchange.com/a/10511/7656
The bottom line is, you can't check for trailing spaces with =. Use LIKE without any % instead.
Given
CREATE TABLE T (id INT NOT NULL, string VARCHAR(200) NOT NULL)
INSERT INTO T VALUES (1, 'A')
INSERT INTO T VALUES (2, 'A ')
this
SELECT id, len(string) len, datalength(string) datalength FROM T
results in
id len datalength
1 1 1
2 1 2
and
SELECT id FROM T WHERE string LIKE 'A '
will give you 2. See http://sqlfiddle.com/#!18/2356c9/1

You can use one of the following solutions:
-- Option 1: add the condition DATALENGTH(@C) <= DATALENGTH(string) to the filter
SELECT * FROM @T WHERE string = @C and DATALENGTH(@C) <= DATALENGTH(string)
-- Option 2: use LIKE and append '%' to the pattern
SELECT * FROM @T WHERE string like @C + '%'

The = operator ignores trailing spaces, just like LEN(). The LIKE operator does not:
SELECT * FROM @T WHERE string LIKE @C
You can prove this with
SELECT CASE WHEN 'A' = 'A ' THEN 'True' ELSE 'False' END -- True
SELECT CASE WHEN 'A' = ' A' THEN 'True' ELSE 'False' END -- False because of leading space
SELECT CASE WHEN 'A' LIKE 'A ' THEN 'True' ELSE 'False' END -- False
SELECT LEN(string) FROM @T -- both rows return 1
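The padding rule behind the = results above can be modelled in a few lines. This is only a sketch in Python, not SQL Server's implementation: it assumes the comparison pads the shorter operand with spaces before comparing, and it ignores collation and case rules. The names ansi_equals and strict_equals are hypothetical.

```python
def ansi_equals(left: str, right: str) -> bool:
    """Model of '=': pad the shorter operand with spaces, then compare."""
    width = max(len(left), len(right))
    return left.ljust(width) == right.ljust(width)


def strict_equals(left: str, right: str) -> bool:
    """Model of '=' combined with a check that both lengths are equal."""
    return ansi_equals(left, right) and len(left) == len(right)


if __name__ == "__main__":
    print(ansi_equals("A", "A "))    # True: trailing space is padded away
    print(ansi_equals("A", " A"))    # False: leading space is significant
    print(strict_equals("A", "A "))  # False: lengths differ
```

The second function corresponds to adding a DATALENGTH comparison to the WHERE clause, which is what makes the trailing space count.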

Oracle start and end position of string function

I'm looking to create a function where I pass in a string and it returns the start and end position of each match, along with the pattern I'm searching for. Would INSTR be the correct function to use?
create table data(
id NUMBER,
str VARCHAR2(100)
);
INSERT into data (id,str) VALUES (1,'123hellphello321hello64');
Expected outcome
start_pos end_pos str
9 13 hello
17 21 hello

You can use INSTR in a recursive sub-query factoring clause:
WITH search (term) AS (
SELECT 'hello' FROM DUAL
),
rsqfc (id, start_pos, end_pos, str, term) AS (
SELECT id,
INSTR(str, term, 1),
INSTR(str, term, 1) + LENGTH(term),
str,
term
FROM data
CROSS JOIN search
UNION ALL
SELECT id,
INSTR(str, term, end_pos),
INSTR(str, term, end_pos) + LENGTH(term),
str,
term
FROM rsqfc
WHERE start_pos > 0
)
SELECT *
FROM rsqfc
WHERE start_pos > 0;
Which, for the sample data:
create table data(id, str) AS
SELECT 1, '123hellphello321hello64' FROM DUAL;
Outputs:
ID START_POS END_POS STR TERM
1 9 14 123hellphello321hello64 hello
1 17 22 123hellphello321hello64 hello
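The same scan-and-resume logic can be sketched outside Oracle. The Python below is an illustration only, with a made-up function name find_all. It returns 1-based positions with an inclusive end position, whereas the END_POS column in the query above is the position just after the match.

```python
def find_all(text: str, term: str) -> list[tuple[int, int]]:
    """Return 1-based (start_pos, end_pos) pairs for each non-overlapping match."""
    positions = []
    if not term:
        return positions            # an empty search term matches nothing here
    start = text.find(term)         # 0-based index, or -1 when not found
    while start != -1:
        positions.append((start + 1, start + len(term)))
        # Resume the search right after the current match, like INSTR(str, term, end_pos).
        start = text.find(term, start + len(term))
    return positions


if __name__ == "__main__":
    for start_pos, end_pos in find_all("123hellphello321hello64", "hello"):
        print(start_pos, end_pos)
```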

You can also take advantage of the return_option parameter of the regexp_instr function, as below.
with YourTable (id, c) as (
select 1, '123hellphello321hello64' from dual
)
, search (term) as (
select 'hello' from dual
)
, rws (lvl) as (
select level from dual
connect by level <= (
select max(regexp_count(t.c, s.term)) from YourTable t cross join search s )
)
select t.id, t.c
, regexp_instr(t.c, s.term, 1, rws.lvl, 0) start_pos
, regexp_instr(t.c, s.term, 1, rws.lvl, 1) end_pos
, s.term
from YourTable t
cross join search s
join rws on rws.lvl <= regexp_count(t.c, s.term)
order by t.id, rws.lvl
;

Searching a substring in an string table?

I have an internal table with the following data (<fs_content>):
OFFER/005056B467AE1ED9B1962F12360477E9-A
OFFER/005056B467AE1ED9B1962F12360477E9-B
OFFER/005056B467AE1ED9B1962F12360477E9-C
OFFER/005056B467AE1ED9B1962F12360477E9-D
OFFER/005056B467AE1ED9B1962F12360477E9-E
I have to search repeatedly values like this (V1):
OFFER-A
OFFER-B
OFFER-C
OFFER-M
OFFER-L
I expect that the following values are identified, which match one line in the internal table (itab_v1_result):
OFFER-A
OFFER-B
OFFER-C
As you can see, every row in <fs_content> contains the same code 005056B467AE1ED9B1962F12360477E9 between OFFER/ and the - symbol.
I want to assign the matching row of <fs_content> to the field symbol <fs_my_content> by comparing the V1 value with each row in <fs_content>. The problem is that the V1 value is not exactly the same as the <fs_content> rows.
I've tried the following, but it doesn't work; <fs_my_content> is always empty:
READ TABLE <fs_content> ASSIGNING <fs_my_content> WITH KEY ('ATTR_NAME') = V1.
How can I get itab_v1_result to contain what I expect?
My minimal reproducible example:
TYPES:
BEGIN OF ty_content,
attr_name TYPE string,
END OF ty_content.
FIELD-SYMBOLS:
<fs_my_content> TYPE any,
<fs_content> TYPE ANY TABLE.
DATA:
itab_content TYPE STANDARD TABLE OF ty_content,
itab_v1 TYPE STANDARD TABLE OF string,
itab_v1_result TYPE STANDARD TABLE OF string,
v1 TYPE string.
itab_content = VALUE #(
( attr_name = 'OFFER/005056B467AE1ED9B1962F12360477E9-A' )
( attr_name = 'OFFER/005056B467AE1ED9B1962F12360477E9-B' )
( attr_name = 'OFFER/005056B467AE1ED9B1962F123604D7E9-C' )
( attr_name = 'OFFER/005056B467AE1ED9B1962F12360477E9-D' )
( attr_name = 'OFFER/005056B467AE1ED9B1962F12360477E9-E' ) ).
itab_v1 = VALUE #(
( `OFFER-A` )
( `OFFER-B` )
( `OFFER-C` )
( `OFFER-M` )
( `OFFER-L` ) ).
ASSIGN itab_content TO <fs_content>.
LOOP AT itab_v1 INTO v1.
READ TABLE <fs_content> ASSIGNING <fs_my_content> WITH KEY ('ATTR_NAME') = v1.
IF sy-subrc = 0.
APPEND v1 TO itab_v1_result.
ENDIF.
ENDLOOP.
" Here, itab_v1_result is empty unfortunately!?

You cannot use any operators other than = in READ TABLE. But you can use them in a LOOP.
First you'd have to arrange your V1 in a way that CS can identify, so just use the '-X', which seems to be unique. Then you can use your condition in the LOOP clause.
offset = STRLEN( v1 ) - 2.
v2 = v1+offset(2).
LOOP AT itab1 ASSIGNING <fs_itab1> WHERE attribute_name CS v2.
" do something
" if you only want to do it for the first entry you find, then just EXIT afterwards
ENDLOOP.

You are over-complicating the solution. Why not just use substring access?
LOOP AT itab_v1 INTO v1.
LOOP AT itab_content ASSIGNING FIELD-SYMBOL(<content>).
CHECK v1(5) = <content>-attr_name(5) AND substring( val = v1 off = strlen( v1 ) - 1 len = 1 ) = substring( val = <content>-attr_name off = strlen( <content>-attr_name ) - 1 len = 1 ).
APPEND v1 TO itab_v1_result.
ENDLOOP.
ENDLOOP.

Thanks a lot to all of you for your solutions. They were very helpful.
Here's the solution to my problem.
First, we loop at <fs_content> and assign each row to a new field symbol <dynamic_content>.
Then we get the ATTR_NAME field from <dynamic_content> and assign it to another field symbol <contact_attribute_name>.
We'll use some functions that work on STRING values, so we convert <contact_attribute_name> to lv_attr_name.
As we know from the task description, lv_attr_name holds values like OFFER/005056B467AE1ED9B1962F12360477E9-A and so on.
So we find the position of the first / with the find() function, searching from the beginning of the value, and put it into lv_slash_position.
We repeat this to find the position of the first - after lv_slash_position and put it into lv_dash_position.
After these two operations we use the replace() function to replace the lv_dash_position - lv_slash_position characters starting at lv_slash_position with an empty value. We end up with OFFER-A and put it into lv_attr_val_string.
Finally we compare lv_attr_val_string with v1: if lv_attr_val_string <> v1 we do not add it to the final table itab_v1_result; otherwise we do.
LOOP AT <fs_content> ASSIGNING <dynamic_content>.
ASSIGN COMPONENT 'ATTR_NAME' OF STRUCTURE <dynamic_content> TO <contact_attribute_name>.
DATA(lv_attr_name) = CONV string( <contact_attribute_name> ).
" Work on a copy; the part between '/' and '-' is cut out of it below.
DATA(lv_attr_val_string) = lv_attr_name.
DATA(lv_slash_position) = find( val = lv_attr_val_string
sub = '/'
off = 0 ).
IF lv_slash_position <> -1.
DATA(lv_dash_position) = find( val = lv_attr_val_string
sub = '-'
off = lv_slash_position ).
lv_attr_val_string = replace( val = lv_attr_val_string
off = lv_slash_position
len = ( lv_dash_position - lv_slash_position )
with = '' ).
ENDIF.
" Keep v1 only when the shortened value matches it.
IF lv_attr_val_string <> v1.
CONTINUE.
ENDIF.
APPEND v1 TO itab_v1_result.
ENDLOOP.
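The normalisation step described above (cut out everything from the first / up to, but not including, the following -) can be checked quickly outside ABAP. This Python sketch is an illustration only; strip_guid and matching_values are hypothetical names, and the sample rows are taken from the question.

```python
def strip_guid(attr_name: str) -> str:
    """Turn 'OFFER/<guid>-A' into 'OFFER-A'; leave other values unchanged."""
    slash = attr_name.find("/")
    if slash == -1:
        return attr_name
    dash = attr_name.find("-", slash)
    if dash == -1:
        return attr_name
    # Drop the slash and everything up to (not including) the dash.
    return attr_name[:slash] + attr_name[dash:]


def matching_values(content: list[str], values: list[str]) -> list[str]:
    """Return the search values that equal some normalised content row."""
    normalized = {strip_guid(name) for name in content}
    return [value for value in values if value in normalized]


if __name__ == "__main__":
    content = [f"OFFER/005056B467AE1ED9B1962F12360477E9-{c}" for c in "ABCDE"]
    print(matching_values(content, ["OFFER-A", "OFFER-B", "OFFER-C", "OFFER-M", "OFFER-L"]))
```

Normalising the table rows once and then doing exact lookups avoids a nested loop with substring comparisons for every search value.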

Oracle Function to return similarity between strings

I have an interesting problem and am wondering if oracle has a built-in function to do this or I need to find a fast way to do it in plsql.
Take 2 strings:
s1 = 'abc def hijk'
s2 = 'abc def iosk'
The function needs to return abc def because the strings are exactly the same up to that point.
Another example:
s1 = 'abc def hijk www'
s2 = 'abc def iosk www'
The function needs to return abc def.
The only way I can think of doing this is to loop through string 1 and compare each character with substr() against the substr of string 2.
Just wondering if Oracle's got something built-in. Performance is pretty important.

After re-reading your question, here would be what you really wanted:
with cte1 as (
select 1 id, 'abc def hijk www' str from dual
union all
select 2 id, 'abc def iosk www' str from dual
), num_gen as (
-- a number generator up to the minimum length of the strings
SELECT level num
FROM dual t
CONNECT BY level <= (select min(length(str)) from cte1)
), cte2 as (
-- build substrings of increasing length
select id, num_gen.num, substr(cte1.str, 1, num_gen.num) sub
from cte1
cross join num_gen
), cte3 as (
-- self join to check if the substrings are equal
select x1.num, x1.sub sub1, x2.sub sub2
from cte2 x1
join cte2 x2 on (x1.num = x2.num and x1.id != x2.id)
), cte4 as (
-- select maximum string length
select max(num) max_num
from cte3
where sub1 = sub2
)
-- finally, get the substring with the max length
select cte3.sub1
from cte3
join cte4 on (cte4.max_num = cte3.num)
where rownum = 1
Essentially, this is what you would do in pl/sql: Build substrings of increasing length and stop at the point at which they are not matching anymore.

I doubt that there is some built-in SQL function, but it can be done in SQL only using regular expressions:
with cte1 as (
select 1 id, 'abc def hijk www' str from dual
union all
select 2 id, 'abc def iosk www' str from dual
), cte2 as (
SELECT distinct id, trim(regexp_substr(str, '[^ ]+', 1, level)) str
FROM cte1 t
CONNECT BY instr(str, ' ', 1, level - 1) > 0
)
select distinct t1.str
from cte2 t1
join cte2 t2 on (t1.str = t2.str and t1.id != t2.id)
I haven't done any performance tests, but my experience tells me this is most likely faster than any pl/sql solution since you are totally avoiding context switches.

You should check the package UTL_MATCH for similar functionality, but to get exactly what you asked for you must write your own function.
The binary search for the common substring length provides good performance for long strings.
create or replace function ident_pfx(str1 varchar2, str2 varchar2) return varchar2
as
len_beg PLS_INTEGER;
len_end PLS_INTEGER;
len_mid PLS_INTEGER;
len_result PLS_INTEGER;
begin
if str1 is null or str2 is null then return null; end if;
--
len_result := 0;
len_beg := 0;
len_end := least(length(str1),length(str2));
LOOP
BEGIN
-- use binary search for the common substring length
len_mid := ceil((len_beg + len_end) / 2);
IF (substr(str1,1,len_mid) = substr(str2,1,len_mid))
THEN
len_beg := len_mid; len_result := len_mid;
ELSE
len_end := len_mid;
END IF;
END;
IF (len_end - len_beg) <= 1 THEN
-- check last character
IF (substr(str1,1,len_end) = substr(str2,1,len_end))
THEN
len_result := len_end;
END IF;
EXIT ;
END IF;
END LOOP;
return substr(str1,1,len_result);
end;
/
select ident_pfx('abc def hijk www','abc def iosk www') ident_pfx from dual;
abc def
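For comparison, the binary search idea can be written compactly in Python. This is a sketch of the technique rather than a translation of the PL/SQL above; the function name common_prefix is made up. It relies on the fact that if two strings agree on their first n characters, they also agree on every shorter prefix.

```python
def common_prefix(str1: str, str2: str) -> str:
    """Return the longest common prefix, found by binary search on its length."""
    low = 0                               # invariant: str1[:low] == str2[:low]
    high = min(len(str1), len(str2))      # the prefix cannot be longer than this
    while low < high:
        mid = (low + high + 1) // 2       # round up so the loop always makes progress
        if str1[:mid] == str2[:mid]:
            low = mid                     # prefixes of length mid match: search longer
        else:
            high = mid - 1                # mismatch within the first mid characters
    return str1[:low]


if __name__ == "__main__":
    print(repr(common_prefix("abc def hijk www", "abc def iosk www")))
```

Each step compares whole prefixes, so the number of comparisons grows with the logarithm of the string length, while each comparison itself is still linear in the prefix length.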

Another possible solution would be to use XOR.
If you XOR the two strings together, the result should have a NUL byte wherever the two strings match.
XOR is not a native operator, but I am pretty sure there is support for it in one of the libraries.

If "the performance is pretty important", you should avoid the "looping" on substrings.
Here is an alternative using XOR (as proposed by @EvilTeach).
with string_transform as (
select 'abc def hijk www' str1, 'abc def iosk www' str2 from dual
),
str as (
select
str1, str2,
-- add suffix to handle nulls and identical strings
-- calculate XOR
utl_raw.bit_xor(utl_raw.cast_to_raw(str1||'X'),utl_raw.cast_to_raw(str2||'Y')) str1_xor_str2
from string_transform
), str2 as (
select
str1, str2,
str1_xor_str2,
-- replace all non-identical characters (not 00) with 2D = '-'
utl_raw.translate(str1_xor_str2,
utl_raw.translate(str1_xor_str2,'00','01'),
utl_raw.copies('2D',length(str1_xor_str2))) xor1
from str
), str3 as (
select
str1, str2,
-- replace all identical characters (00) with 2B (= '+') and cast back to string
utl_raw.cast_to_varchar2(utl_raw.translate(xor1,'00','2B')) diff
-- diff = ++++++++---+++++ (+ means identical position; - difference)
from str2
)
select str1, str2,
-- remove the appended suffix character
substr(diff,1,length(diff)-1) diff,
-- calculate the length of the identical prefix
instr(diff,'-')-1 same_prf_length
from str3
;
Basically, both strings are first converted to RAW format. XOR sets the identical bytes (characters) to 00. With translate, the identical bytes are converted to '+' and all others to '-'.
The identical prefix length is the position of the first '-' in the string minus one.
Technically, a (different) suffix character is appended to each string to handle NULLs and identical strings.
Note that if the string is longer than 2000, some extra processing must be added due to a limitation of UTL_RAW.CAST_TO_VARCHAR2.
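The XOR idea itself is easy to try out in isolation. The Python below is a minimal sketch under the assumption that both strings are plain ASCII, so one byte equals one character; the names xor_diff_mask and same_prefix_length are hypothetical. Unlike the query above, it simply stops at the end of the shorter string instead of appending suffix characters.

```python
def xor_diff_mask(str1: str, str2: str) -> str:
    """Return '+' where the bytes are identical (XOR is 0) and '-' elsewhere."""
    raw1 = str1.encode("ascii")
    raw2 = str2.encode("ascii")
    return "".join("+" if (a ^ b) == 0 else "-" for a, b in zip(raw1, raw2))


def same_prefix_length(str1: str, str2: str) -> int:
    """Length of the identical prefix: the index of the first '-' in the mask."""
    mask = xor_diff_mask(str1, str2)
    first_diff = mask.find("-")
    return len(mask) if first_diff == -1 else first_diff


if __name__ == "__main__":
    print(xor_diff_mask("abc def hijk www", "abc def iosk www"))
    print(same_prefix_length("abc def hijk www", "abc def iosk www"))
```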

Split sql string into words

I want to split a string into words as shown below; the output for all of the strings should be the same:
INPUT:
1. This is a string
2. This is a string
3. This is a string
4. This is a string
OUTPUT:
This is a
That is, I want the first three words of the sentence, irrespective of the spaces.

Try this:
declare @s1 varchar(3000);
declare @xml xml, @str varchar(100), @delimiter varchar(10), @out varchar(max);
select @delimiter = ' '
select @s1 = 'This is a string';
select @s1 = 'This is a string ';
select @s1 = 'This is a string ';
select @s1 = 'This is a string';
select @xml = cast(('<X>'+replace(@s1,@delimiter ,'</X><X>')+'</X>') as xml)
select top 3 @out =
COALESCE(@out + ' ', '') + C.value('.', 'varchar(100)')
from @xml.nodes('X') as X(C)
where LEN(C.value('.', 'varchar(10)')) > 0
select @out

Your case involves two steps:
1. Remove the additional spaces, converting each run to a single space. You can use REPLACE() to do this.
SELECT REPLACE(REPLACE(REPLACE('This is a string',' ','<>'),'><',''),'<>',' ')
Process:
The innermost REPLACE changes every blank to a less-than greater-than pair.
If there are three spaces between This and is, the innermost REPLACE returns This<><><>is.
The middle REPLACE changes all greater-than less-than pairs to the empty string, which removes them.
This<><><>is becomes This<>is.
The outer REPLACE changes all less-than greater-than pairs to a single blank. This<>is becomes This is.
Now all the sentences are normalized with one space.
2. Split the words and take the first three.
There are a lot of Stack Overflow questions that discuss this. I liked the Common Table Expression approach to splitting the string: How do I split a string so I can access item x?
Let me know if you need any help splitting the words.
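The three nested replacements are easy to verify outside SQL. The following Python sketch mirrors them one to one and then takes the first three words; collapse_spaces and first_words are hypothetical names. Like the SQL version, it assumes the input does not already contain the characters < or >.

```python
def collapse_spaces(text: str) -> str:
    """Collapse each run of spaces to one space using three replacements."""
    marked = text.replace(" ", "<>")      # every blank becomes a '<>' pair
    squeezed = marked.replace("><", "")   # adjacent pairs merge into one '<>'
    return squeezed.replace("<>", " ")    # the remaining pair becomes one blank


def first_words(text: str, count: int = 3) -> str:
    """Return the first `count` words, separated by single spaces."""
    words = collapse_spaces(text).strip().split(" ")
    return " ".join(words[:count])


if __name__ == "__main__":
    print(first_words("This   is  a    string"))
```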

Create a Tally Table:
SELECT TOP 11000
IDENTITY( INT,1,1 ) AS Num
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
GO
Create a Table Valued Function:
CREATE FUNCTION dbo.[fnSetSplit]
(
@String VARCHAR(8000),
@Delimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
( SELECT Num,
SUBSTRING(@String, CASE Num
WHEN 1 THEN 1
ELSE Num + 1
END,
CASE CHARINDEX(@Delimiter, @String,
Num + 1)
WHEN 0
THEN LEN(@String) - Num + 1
ELSE CHARINDEX(@Delimiter,
@String, Num + 1)
- Num
- CASE WHEN Num > 1 THEN 1
ELSE 0
END
END) AS String
FROM dbo.Tally
WHERE Num <= LEN(@String)
AND ( SUBSTRING(@String, Num, 1) = @Delimiter
OR Num = 1 )
)
Query function:
SELECT TOP 3
fss.String
FROM dbo.fnSetSplit('This is a string', ' ') fss
WHERE NOT ( fss.String = '' )
If you need to reconcatenate, look at string concatenation using FOR XML (PATH)

SQL Server 2016 (compatibility level 130) lets you use the STRING_SPLIT function. Note that STRING_AGG, used below to reassemble the words, requires SQL Server 2017 or later, and that STRING_SPLIT does not guarantee the order of its output rows, so TOP 3 without an ORDER BY relies on the order in which the rows happen to be returned:
DECLARE @delimiter varchar(10) = ' '
SELECT STRING_AGG(value, @delimiter)
FROM (SELECT TOP 3 value FROM STRING_SPLIT('This is a string', @delimiter) WHERE LEN(value)>0) inq
SELECT STRING_AGG(value, @delimiter)
FROM (SELECT TOP 3 value FROM STRING_SPLIT('This is a string ', @delimiter) WHERE LEN(value)>0) inq
SELECT STRING_AGG(value, @delimiter)
FROM (SELECT TOP 3 value FROM STRING_SPLIT('This is a string', @delimiter) WHERE LEN(value)>0) inq
SELECT STRING_AGG(value, @delimiter)
FROM (SELECT TOP 3 value FROM STRING_SPLIT('This is a string', @delimiter) WHERE LEN(value)>0) inq
Result:
This is a
This is a
This is a
This is a
