I am trying to work out in Oracle how to isolate/highlight word combinations in a concatenated string like the one below:
Some words##Again words##More of this||####||Some words##Again words##Other
The idea is to find the word combinations that appear exactly twice and replace them by 0 so I'm left with the ones that appear only once, either on the left side of the ||####|| or on the right side. The result of the query should be something like this:
Highlighted
Some words##Again words##More of this||####||Some words##Again words##**Other**
Replaced
0##0##More of this||####||0##0##Other
To give you some more information about the concatenation: the left side (before the ||####||) is my current customer record, while on the right hand side I have the previous version. By making the replacements I can reveal any differences between customer records.
I have tried to get this done by using:
regexp_replace: this does not work entirely with REGEXP_REPLACE(MY STRING,'((Some words){1,2})|((Again words){1,2})','0',1,0) as for some reason the string parts in my first record are never correctly replaced. I'm also hitting the limits of this function due to the number of word combinations I need to match;
nested CASE WHEN: does not work either obviously as CASE WHEN - even nested - stops when the first match is found but I need to have all conditions checked and replaced.
I have thought about using subselects, but as this query uses one of the largest tables in my schema, this will not be usable except on a per customer basis. And it might still not work...
Some more information in order to find a solid, performant solution:
I have 34 possible word combinations to match
I have no idea which ones will be there, ever, except when I run the query obviously
I have no idea in which order they will be in the concatenated string
I hope this is clear. Anyone with some magical ideas?
Thanks in advance
You can use a recursive sub-query factoring clause to replace one duplicated term at each iteration:
WITH replaced ( value, start_char ) AS (
SELECT REGEXP_REPLACE(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
'\10\30\6',
1
),
REGEXP_INSTR(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
1
)
FROM table_name
UNION ALL
SELECT REGEXP_REPLACE(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
'\10\30\6',
start_char + 1
),
REGEXP_INSTR(
value,
'(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
start_char + 1
)
FROM replaced
WHERE start_char > 0
)
SELECT value
FROM replaced
WHERE start_char = 0;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'Some words##Again words##More of this||####||Some words##Again words##Other' FROM DUAL UNION ALL
SELECT '333##123##789##555||####||123##456##789##222##333' FROM DUAL;
Outputs:
| VALUE |
| :------------------------------------ |
| 0##0##More of this||####||0##0##Other |
| 0##0##0##555||####||0##456##0##222##0 |
Explanation:
The regular expression matches:
(##|^) either two # characters or the start of the string ^ (in the first capturing group ());
([^#]+?) one-or-more characters that are not # (in the second capturing group ());
( the start of the 3rd capturing group;
(##[^#]+?)* two # characters followed by one-or-more non-# characters (in the 4th capturing group ()) all repeated zero-or-more * times;
\|\|####\|\| then two | characters, four # characters and two | characters;
([^#]+?##)* then one-or-more non-# characters followed by two # characters (in the 5th capturing group ()) all repeated zero-or-more * times;
) the end of the 3rd capturing group;
\2 a duplicate of the 2nd capturing group; then
(##|$) either two # characters or the end-of-the-string $ (in the 6th capturing group).
This is replaced by:
\10\30\6 which is the contents of the 1st capturing group then a zero (replacing the 2nd capturing group) then the contents of the 3rd capturing group then a second zero (replacing the matched duplicate) then the contents of the 6th capturing group.
Each iteration replaces one pair of duplicate terms in the string (if one exists), and REGEXP_INSTR finds the start of that match; the two results go into value and start_char respectively. The next iteration starts searching one character after the start of the previous match, so the search gradually moves across the string. When no more duplicate terms can be found, REGEXP_REPLACE performs no replacement, REGEXP_INSTR returns 0 and the iteration terminates.
The final query filters to return only the final level of the iteration (when all the duplicates have been replaced).
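To trace the iteration outside the database, here is a minimal Python sketch of the same loop. It is an illustration, not part of the answer: the function name replace_duplicates is made up, and Python's re module stands in for Oracle's regular expression functions on the assumption that both engines treat this pattern the same way.

```python
import re

# Same pattern as in the Oracle query above.
PATTERN = re.compile(
    r"(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)"
)


def replace_duplicates(value: str) -> str:
    """Hypothetical helper: replace every term found on both sides by 0."""
    pos = 0  # plays the role of start_char (0-based here, 1-based in Oracle)
    while True:
        m = PATTERN.search(value, pos)
        if m is None:
            # No duplicate pair left: this is the final level of the iteration.
            return value
        # Equivalent of the '\10\30\6' replacement string.
        value = (
            value[:m.start()]
            + m.group(1) + "0" + m.group(3) + "0" + m.group(6)
            + value[m.end():]
        )
        # Resume one character after the start of the previous match.
        pos = m.start() + 1


rows = [
    "Some words##Again words##More of this||####||Some words##Again words##Other",
    "333##123##789##555||####||123##456##789##222##333",
]
for row in rows:
    print(replace_duplicates(row))
# 0##0##More of this||####||0##0##Other
# 0##0##0##555||####||0##456##0##222##0
```

Restarting the search just after the previous match is what stops an already inserted 0 on the left from being paired with a 0 on the right.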
I had a similar question (link below); this one is, let's say, an "add-on" to that issue that I found along the way.
Find all code combinations using text string in Power Query
What I need is to extract exact matches (or I would say fuzzy matches in Power Query) that are in one string using substring as lookup.
(Please ignore T1 and T2 in the screenshot and data)
As you can see, Table 3 (T3) holds the main string, and T4 holds the substring with slightly different markings (like JH instead of JH0, and so on). That's exactly what I need: to use the substring as it is to filter the main string and get the results as they are in T5.
I tried my luck with fuzzy matching in Power Query, but the problem comes afterwards: when I have a different substring with more instances, my query fails with "column doesn't exist" and so on. It has to be dynamic.
I would like to have a solution in Power Query!
https://docs.google.com/spreadsheets/d/1Ji1kyV7UsD2YBRJgWUY5zisyL3ySPGwW/edit?usp=sharing&ouid=101738555398870704584&rtpof=true&sd=true
let Source = Excel.CurrentWorkbook(){[Name="Table4"]}[Content],
FindList = Text.Split(Table.ReplaceValue(Table3,",","_",Replacer.ReplaceText,{"String"})[String]{0},"_"),
FindList2 = List.Transform(FindList, each Text.Remove(_,{"0".."9"})),
Newlist=Text.Split(Source[Substring]{0},"_"),
Newlist2=Text.Combine(List.Transform(Newlist, each try FindList{List.PositionOf(FindList2,_)} otherwise "missing"),"_")
in Newlist2
What it is doing: (a) split Table3 into a list at either a , or _; (b) duplicate the list from (a) and remove all digits; (c) split Table4 into a list at each _; (d) match each value from (c) against (b). If there is a match, use that position number to pull the value from (a), otherwise put "missing"; (e) put the results back together with an underscore separator.
Per comments, alternate version that works for multiple matches from Table3:
Newlist2=Text.Combine(List.Transform(Newlist, each try
if List.Count(List.PositionOf(FindList2,_,Occurrence.All))=0 then "missing" else
Text.Combine( List.Transform(List.PositionOf(FindList2,_,Occurrence.All), each FindList{_}),"_") otherwise "missing"),"_")
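For readers who want to check the matching logic outside Power Query, here is a rough Python sketch of steps (a) to (e), covering both the first-match and the all-matches variants. The function name match_codes, the all_matches flag and the sample strings are made up for illustration; the real data is only in the linked spreadsheet.

```python
import re


def match_codes(main: str, sub: str, all_matches: bool = False) -> str:
    """Hypothetical helper mirroring the Power Query steps (a) to (e)."""
    find_list = re.split(r"[,_]", main)                         # (a)
    stripped = [re.sub(r"[0-9]", "", c) for c in find_list]     # (b)
    results = []
    for key in sub.split("_"):                                  # (c)
        # (d) collect every code whose digit-free form equals the key
        hits = [find_list[i] for i, s in enumerate(stripped) if s == key]
        if not hits:
            results.append("missing")
        elif all_matches:
            results.append("_".join(hits))
        else:
            results.append(hits[0])
    return "_".join(results)                                    # (e)


# Made-up sample strings, not taken from the spreadsheet.
print(match_codes("JH0_AB12,CD3_JH7", "JH_CD_ZZ"))
# JH0_CD3_missing
print(match_codes("JH0_AB12,CD3_JH7", "JH_CD_ZZ", all_matches=True))
# JH0_JH7_CD3_missing
```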
This is an example of the string, and it can be longer:
1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)GL: Contract Services (510000), COT: Network (N), CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)
The result needs to be: column1 600007056, column2 600007070, column3 600007053
I am working in Spotfire and creating calculated columns through transformations, as I need the columns to join to other data sets.
I have tried the below, but it only picks up the first 600.. number and not the others, and there can be an undefined number of those.
Account is the column with the string
Mid([Account],
Find("(000",[Account]) + Len("(000"),
Find("0001)",[Account]) - Find("(000",[Account]) - Len("(000"))
Thank you!
Assuming my guess is correct, and the pattern to look for is:
9 digits, starting with 6, preceded by an opening parenthesis and 3 zeros, and followed by a space, 4 digits and a closing parenthesis
you can grab individual occurrences by:
column1: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',1)
column2: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',2)
etc.
The tricky bit is working out how many columns to define, since you say there can be many. One way is to first calculate the maximum number of occurrences like this:
maxn: Max((Len([Account]) - Len(RXReplace([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))','','g'))) / 9)
still assuming that each number to extract has 9 digits. This takes the difference in length between the original [Account] and the version with the matched patterns replaced by an empty string, and divides it by 9.
Then you know you can define up to maxn columns; for rows with fewer instances, the extra columns will be empty.
Note that Spotfire always wants two backslashes for escaping (I had to add more in the editor to make it render correctly; I hope I have not missed any).
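The pattern itself can be checked outside Spotfire. Below is a small Python sketch that applies the same lookbehind/lookahead expression to the example string from the question and reproduces the occurrence count used for maxn. Python's re module is only a stand-in here, and it needs single rather than double backslashes.

```python
import re

# Same pattern as above, written with single backslashes for Python.
PATTERN = re.compile(r"(?<=\(000)6\d{8}(?=\s\d{4}\))")

account = (
    "1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), "
    "PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)"
    "GL: Contract Services (510000), COT: Network (N), "
    "CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)"
)

# One list entry per occurrence, i.e. column1, column2, column3, ...
matches = PATTERN.findall(account)
print(matches)
# ['600007056', '600007070', '600007053']

# Same idea as maxn: characters removed by the replacement, divided by 9.
count = (len(account) - len(PATTERN.sub("", account))) // 9
print(count)
# 3
```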
I have values in rows like below:
Https://abc/uvw/xyz
Https://def/klm/qew/asdas
Https://ghi/sdk/asda/as/aa/
Https://jkl/asd/vcx/asdsss/ssss/
Now I want the result to be like below:
Https://abc/uvw/xyz
Https://def/klm/qew
Https://ghi/sdk/asda
Https://jkl/asd/vcx
So how can I take the result up to a certain number of / characters, or is there any other way to get this done in Excel? Is there any way to cut the result off once 4 '/' have been found in the string?
You could use SUBSTITUTE to replace the nth / (in this case the 5th) with a unique character, then use FIND to locate that character and LEFT to keep everything before it. If there is no 5th /, IFERROR falls back to the full length of the string. I'll take CHAR(1) as the unique character:
=LEFT(A1,IFERROR(FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),5))-1,LEN(A1)))
Another option would be to split on / using Text to Columns under the Data tab and join back only the columns you need.
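To make the logic of the formula easier to follow, here is a minimal Python sketch of the same idea: locate the nth separator and keep everything to its left, or keep the whole string when there are fewer than n separators (the IFERROR branch). The function name left_of_nth is made up for illustration.

```python
def left_of_nth(text: str, sep: str = "/", n: int = 5) -> str:
    """Hypothetical helper: keep the part of text before the nth sep."""
    pos = -1
    for _ in range(n):
        pos = text.find(sep, pos + 1)
        if pos == -1:
            # Fewer than n separators: return the whole string,
            # like the IFERROR(..., LEN(A1)) fallback.
            return text
    return text[:pos]


urls = [
    "Https://abc/uvw/xyz",
    "Https://def/klm/qew/asdas",
    "Https://ghi/sdk/asda/as/aa/",
    "Https://jkl/asd/vcx/asdsss/ssss/",
]
for url in urls:
    print(left_of_nth(url))
# Https://abc/uvw/xyz
# Https://def/klm/qew
# Https://ghi/sdk/asda
# Https://jkl/asd/vcx
```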
I'm stuck on trying to fetch all text up to (but not including) the last dash.
I can find a solution for fetching text to the left of 1 dash (eg. SUBSTRING(@ID, 1, CHARINDEX('-', @ID) -1) ) and even, say, the second dash, but the issue is that the number of dashes in my list varies wildly.
Eg.
ID
ABC-DEF-GHI-001
ABC-DEF-2
ABC-DEF-GHI-JKL-00003
ABC-DEF-GH-4
ABC-123-DEF-008
From the above I would like to fetch, all the text to the left of the last dash.
ABC-DEF-GHI
ABC-DEF
ABC-DEF-GHI-JKL
ABC-DEF-GH
ABC-123-DEF
Any pointers appreciated.
One trick we can use here is to find the first occurrence of a dash in the reversed string. That position, counted from the end, is where the last dash sits in the original string, so we can use it to take a substring from the original beginning up to, but not including, that dash.
SELECT
col,
LEFT(col, LEN(col) - CHARINDEX('-', REVERSE(col))) AS col_sub
FROM yourTable
WHERE
col LIKE '%-%';
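The reversal trick is not specific to SQL. Here is a short Python sketch of the same arithmetic, using the sample IDs from the question; the function name left_of_last_dash is made up, and Python's 0-based index means one extra character has to be subtracted compared with CHARINDEX.

```python
def left_of_last_dash(s: str) -> str:
    """Hypothetical helper: text to the left of the last dash in s."""
    # Position of the first dash in the reversed string (0-based).
    idx = s[::-1].index("-")
    # CHARINDEX is 1-based, so the SQL uses LEN - CHARINDEX;
    # with a 0-based index we subtract idx + 1 instead.
    return s[: len(s) - idx - 1]


ids = [
    "ABC-DEF-GHI-001",
    "ABC-DEF-2",
    "ABC-DEF-GHI-JKL-00003",
    "ABC-DEF-GH-4",
    "ABC-123-DEF-008",
]
# Like the WHERE clause, only rows that contain a dash are processed.
result = [left_of_last_dash(s) for s in ids if "-" in s]
print(result)
# ['ABC-DEF-GHI', 'ABC-DEF', 'ABC-DEF-GHI-JKL', 'ABC-DEF-GH', 'ABC-123-DEF']
```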
I have a text string whose length is user-definable.
As an example, the user has entered 1234567890
What I want is to pull out every first character followed by every 3rd character
So we get the following
1st | 1234567890 = 1
3rd | 234567890 = 14
1st | 23567890 = 142
3rd | 3567890 = 1426
1st | 357890 = 14263
3rd | 57890 = 142638
1st | 5790 = 1426385
3rd | 790 = 14263850
1st | 79 = 142638507
3rd | 9 = 1426385079
I also need to account for the fact that at the end the last two numbers will have fewer than three digits.
Any ideas on how I could achieve this in batch?
This is where batch string manipulation gets really useful:
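@echo off
set str="1234567890"
rem Strip the surrounding quotes from the input string
for %%a in (%str%) do set str=%%~a
set newstr=
:Loop
rem Take the 1st and 4th characters, then remove them from str
set "first=%str:~0,1%"
set "fourth=%str:~3,1%"
set "str=%str:~1,2%%str:~4%"
set "newstr=%newstr%%first%%fourth%"
if not "%fourth%"=="" goto Loop
rem Append the leftover characters
set "newstr=%newstr%%str:~1%%str:~0,1%"
echo.%newstr%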
For the example input 1234567890 from your question, the output is indeed:
1426385079
Explanation
This code works using a loop (which is somewhat of an equivalent to a while loop in C).
In every iteration of the loop the first and fourth characters are extracted from str and appended to newstr, which will eventually hold the final output.
Next, str is updated by joining the following two substrings:
%str:~1,2% extracts two characters starting from the second character (indices start from 0, so the second index is 1).
%str:~4% extracts all characters starting from the fifth.
The new value of str is basically the old value, without the first and fourth characters.
The loop stops when str holds three or fewer characters (that is, when fourth is an empty string!). After the loop, the remaining characters are dealt with and appended to newstr in the correct order. This is the special case that you wanted to account for.
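To follow the loop step by step without running the batch file, here is a line-for-line Python sketch of the same logic. The function name pick_first_and_fourth is made up, and Python slices stand in for the %str:~start,length% substring syntax.

```python
def pick_first_and_fourth(s: str) -> str:
    """Hypothetical helper mirroring the batch loop above."""
    newstr = ""
    while True:
        first = s[0:1]          # %str:~0,1%
        fourth = s[3:4]         # %str:~3,1% (empty if fewer than 4 chars)
        s = s[1:3] + s[4:]      # %str:~1,2%%str:~4%
        newstr += first + fourth
        if not fourth:          # same exit test as the batch "if not ... goto"
            break
    # Append the leftover characters, as in the final set statement.
    return newstr + s[1:] + s[0:1]


print(pick_first_and_fourth("1234567890"))
# 1426385079
```

Like the batch version, the loop body always runs at least once, which makes it closer to a do-while loop than to a plain while loop.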
Hope that helps!