Remove words of 1 to 3 characters from a string in SQL

From a space-delimited string, I would like to remove all words that are 1 to 3 characters long.
For example: this string
LCCPIT A2 LCCMAD B JBPM_JIT CCC
should become
LCCPIT LCCMAD JBPM_JIT
So the words A2, B and CCC are removed (since they are 2, 1 and 3 characters long). Is there a way to do it? I think I could use REGEXP_REPLACE, but I couldn't find the correct regular expression to get this result.

Split the string into words and aggregate back only those substrings whose length is greater than 3.
Sample data and query:
with test (col) as
  (select 'LCCPIT A2 LCCMAD B JBPM_JIT CCC' from dual)
-- query begins here
select listagg(val, ' ') within group (order by lvl) result
from (select regexp_substr(col, '[^ ]+', 1, level) val,
             level lvl
      from test
      connect by level <= regexp_count(col, ' ') + 1
     )
where length(val) > 3;

RESULT
--------------------------------------------------------------------------------
LCCPIT LCCMAD JBPM_JIT
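For readers who want to check the logic outside the database, the same split, filter and re-join steps can be sketched in Python. This is only an illustration of the idea, not part of the Oracle answer; the function name remove_short_words and the min_len parameter are made up for this example.

```python
def remove_short_words(text, min_len=4):
    """Keep only the whitespace-separated words with at least min_len characters."""
    words = text.split()                      # split the string into words
    kept = [w for w in words if len(w) >= min_len]
    return " ".join(kept)                     # aggregate the survivors back


print(remove_short_words("LCCPIT A2 LCCMAD B JBPM_JIT CCC"))
```

As in the SQL version, the original word order is preserved because the words are filtered in place rather than sorted.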

I prefer a regex replacement trick:
SELECT TRIM(REGEXP_REPLACE('LCCPIT A2 LCCMAD B JBPM_JIT CCC', '(^|\s+)\w{1,3}(\s+|$)', ' '))
FROM dual;
-- output is 'LCCPIT LCCMAD JBPM_JIT'
The strategy above is to match any word made of 1, 2, or 3 word characters (letters, digits or underscore), along with any surrounding whitespace, and to replace it with a single space. The outer call to TRIM() is necessary to remove dangling spaces which might arise from the first or last word being removed.
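One caveat is worth checking before relying on this pattern: each match consumes the whitespace on both sides of the word, so when two short words are adjacent, the second one no longer has leading whitespace to match and survives a single pass. The Python sketch below reproduces that behaviour with the same pattern and compares it with a lookaround-based variant. The function names are invented for this illustration, and the lookaround version is a Python-only alternative; as far as I know, Oracle's regular expression functions do not support lookarounds.

```python
import re


def naive_remove(text):
    # Same pattern as the REGEXP_REPLACE call: the match swallows the
    # whitespace around the short word and puts a single space back.
    return re.sub(r"(^|\s+)\w{1,3}(\s+|$)", " ", text).strip()


def lookaround_remove(text):
    # Lookarounds only assert that the word is bounded by whitespace or the
    # string edges, without consuming it, so adjacent short words all match.
    without_short = re.sub(r"(?<!\S)\w{1,3}(?!\S)", "", text)
    return " ".join(without_short.split())   # collapse the leftover spaces


print(naive_remove("AB CD LONGWORD"))        # the second short word survives
print(lookaround_remove("AB CD LONGWORD"))
```

For the sample string in the question the two functions agree, because no two short words sit next to each other there.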

Related

Splitting a pandas column every n characters

I have a dataframe where some columns contain long strings (e.g. 30000 characters). I would like to split these columns every 4000 characters so that I end up with a range of new columns containing strings of length at most 4000. I have an upper bound on the string lengths so I know there should be at most 9 new columns. I would like there to always be 9 new columns, having None/NaN in columns where the string is shorter.
As an example (with n = 10 instead of 4000 and 3 columns instead of 9), let's say I have the dataframe:
import pandas as pd

df_test = pd.DataFrame({'id': [1, 2, 3],
                        'str_1': ['This is a long string', 'This is an even longer string', 'This is the longest string of them all'],
                        'str_2': ['This is also a long string', 'a short string', 'mini_str']})
   id                                   str_1                       str_2
0   1                   This is a long string  This is also a long string
1   2           This is an even longer string              a short string
2   3  This is the longest string of them all                    mini_str
In this case I want to get the result
   id     str_1_1     str_1_2     str_1_3     str_1_4     str_2_1     str_2_2     str_2_3
0   1  This is a   long strin           g         NaN  This is al  so a long       string
1   2  This is an   even long   er string         NaN  a short st        ring         NaN
2   3  This is th  e longest   string of     them all    mini_str         NaN         NaN
Here, I want e.g. first row, column str_1_3 to be a string of length 1.
I tried using
df_test['str_1'].str.split(r".{10}", expand=True, n=10)
but that didn't work. It gave this as result
   0  1          2         3
0                g      None
1        er string      None
2                   them all
where the first columns aren't filled.
I also tried looping through every row and inserting '|' every 10 characters and then splitting on '|' but that seems tedious and slow.
Any help is appreciated.
The answer is quite simple, that is, insert a delimiter and split it.
For example, use | as the delimiter and let n = 4:
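series = pd.Series(['This is an even longer string', 'This is the longest string of them all'], name='str1')
name = series.name
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
cols = series.str.replace('(.{10})', r'\1|', regex=True).str.split('|', n=4, expand=True).add_prefix(f'{name}_')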
That is, use str.replace to add delimiter, use str.split to split them apart and use add_prefix to add the prefixes.
The output will be:
       str1_0      str1_1      str1_2      str1_3
0  This is an   even long   er string        None
1  This is th  e longest   string of     them all
The reason why str.split('.{10}') doesn't work is that the pat parameter of str.split is a pattern for the delimiters, not for the pieces that should appear in the result. With str.split('.{10}'), every full 10-character chunk is therefore consumed as a delimiter, and only the characters left over after the last full chunk remain.
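The difference between splitting on a pattern and matching it can be seen with the standard re module alone. The short sketch below uses the first sample string from the question; it is only meant to illustrate the point above and does not involve pandas.

```python
import re

text = "This is a long string"   # 21 characters

# Splitting on ".{10}" discards the two full 10-character chunks (they are
# the delimiters) and keeps only what lies between and after them.
leftovers = re.split(r".{10}", text)

# Matching ".{1,10}" instead returns the chunks themselves.
chunks = re.findall(r".{1,10}", text)

print(leftovers)   # ['', '', 'g']
print(chunks)      # ['This is a ', 'long strin', 'g']
```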
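UPDATE: According to the suggestion from @AKX, use \x1F as a better delimiter:
cols = series.str.replace('(.{10})', '\\1\x1F', regex=True).str.split('\x1F', n=4, expand=True).add_prefix(f'{name}_')
Note the absence of the r prefix on the string literals: \x1F has to be interpreted as an escape sequence, which is why the backreference is written as \\1.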
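The question also asks for a fixed number of new columns, padded with None/NaN when the string is short. A possible way to get that without a delimiter is plain slicing, sketched below in pure Python. This is a different technique from the replace-and-split answer above, and the helper name split_fixed and its parameters are assumptions made for this example.

```python
def split_fixed(text, width, parts):
    """Cut text into pieces of `width` characters, padded with None to `parts` pieces."""
    pieces = [text[i:i + width] for i in range(0, len(text), width)]
    if len(pieces) > parts:
        raise ValueError("text is longer than width * parts characters")
    return pieces + [None] * (parts - len(pieces))


print(split_fixed("This is a long string", 10, 4))
```

Each returned list could then become one row of the new columns, for instance by collecting the lists and passing them to pd.DataFrame together with the original index.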

Replace string parts that appear twice Oracle

I am trying to work out in Oracle how to isolate/highlight word combinations in a concatenated string like the one below:
Some words##Again words##More of this||####||Some words##Again words##Other
The idea is to find the word combinations that appear exactly twice and replace them by 0 so I'm left with the ones that appear only once, either on the left side of the ||####|| or on the right side. The result of the query should be something like this:
Highlighted
Some words##Again words##More of this||####||Some words##Again words##**Other**
Replaced
0##0##More of this||####||0##0##Other
To give you some more information about the concatenation: the left side (before the ||####||) is my current customer record, while on the right hand side I have the previous version. By making the replacements I can reveal any differences between customer records.
I have tried to get this done by using:
regexp_replace: this does not work entirely with REGEXP_REPLACE(MY STRING,'((Some words){1,2})|((Again words){1,2})','0',1,0) as for some reason the string parts in my first record are never correctly replaced. I'm also hitting the limits of this function due to the number of word combinations I need to match;
nested CASE WHEN: does not work either obviously as CASE WHEN - even nested - stops when the first match is found but I need to have all conditions checked and replaced.
I have thought about using subselects, but as this query uses one of the largest tables in my schema, this will not be usable except on a per customer basis. And it might still not work...
Some more information in order to find a solid, performant solution:
I have 34 possible word combinations to match
I have no idea which ones will be there, ever, except when I run the query obviously
I have no idea in which order they will be in the concatenated string
I hope this is clear. Anyone with some magical ideas?
Thanks in advance
You can use a recursive sub-query factoring clause to replace one duplicated term at each iteration:
WITH replaced ( value, start_char ) AS (
  SELECT REGEXP_REPLACE(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           '\10\30\6',
           1
         ),
         REGEXP_INSTR(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           1
         )
  FROM   table_name
UNION ALL
  SELECT REGEXP_REPLACE(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           '\10\30\6',
           start_char + 1
         ),
         REGEXP_INSTR(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           start_char + 1
         )
  FROM   replaced
  WHERE  start_char > 0
)
SELECT value
FROM   replaced
WHERE  start_char = 0;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'Some words##Again words##More of this||####||Some words##Again words##Other' FROM DUAL UNION ALL
SELECT '333##123##789##555||####||123##456##789##222##333' FROM DUAL;
Outputs:
| VALUE |
| :------------------------------------ |
| 0##0##More of this||####||0##0##Other |
| 0##0##0##555||####||0##456##0##222##0 |
Explanation:
The regular expression matches:
(##|^) either two # characters or the start of the string ^ (in the first capturing group ());
([^#]+?) one-or-more characters that are not # (in the second capturing group ());
( the start of the 3rd capturing group;
(##[^#]+?)* two # characters followed by one-or-more non-# characters (in the 4th capturing group ()) all repeated zero-or-more * times;
\|\|####\|\| then two | characters, four # characters and two | characters;
([^#]+?##)* then one-or-more non-# characters followed by two # characters (in the 5th capturing group ()), all repeated zero-or-more * times;
) the end of the 3rd capturing group;
\2 a duplicate of the 2nd capturing group; then
(##|$) either two # characters or the end-of-the-string $ (in the 6th capturing group).
This is replaced by:
\10\30\6 which is the contents of the 1st capturing group then a zero (replacing the 2nd capturing group) then the contents of the 3rd capturing group then a second zero (replacing the matched duplicate) then the contents of the 6th capturing group.
Each iteration replaces one pair of duplicate terms in the string (if one exists), while REGEXP_INSTR finds the start of that match; the two results go into value and start_char respectively. At the next iteration the regular expression starts looking from the character after the start of the previous match, so it gradually moves across the string. When no more duplicate terms can be found, REGEXP_REPLACE performs no replacement, REGEXP_INSTR returns 0 and the iteration terminates.
The final query filters to return only the final level of the iteration (when all the duplicates have been replaced).
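To sanity-check the expected output outside Oracle, the same result can be produced with a much simpler, non-regex approach in Python: split the value into its two sides, split each side into terms, and mask every term that occurs on both sides. This is a different technique from the recursive query above, and it assumes each term appears at most once per side (so "on both sides" means "exactly twice"). The function name and its parameters are made up for this sketch.

```python
def mask_shared_terms(value, side_sep="||####||", term_sep="##", mask="0"):
    """Replace every term that occurs on both sides of side_sep with mask."""
    left, _, right = value.partition(side_sep)
    left_terms = left.split(term_sep)
    right_terms = right.split(term_sep)
    shared = set(left_terms) & set(right_terms)   # terms present in both records

    def mask_side(terms):
        return term_sep.join(mask if term in shared else term for term in terms)

    return mask_side(left_terms) + side_sep + mask_side(right_terms)


print(mask_shared_terms(
    "Some words##Again words##More of this||####||Some words##Again words##Other"
))
```

The terms that remain unmasked are the differences between the current and the previous customer record.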

Tricky substring usage in Oracle

I am looking for a way to always retrieve the 2 digits after the last "/" character in a string. If there is something after the 2 digits that I want, I don't care.
The code that I came up with is this:
CASE when INSTRUMENT like '%/%' then SUBSTR(INSTRUMENT,INSTR(INSTRUMENT,'/',-3,1)+1,2) else '0' end
This seems to work fine.
The problem is that when it does not find any "/" character, it does not return 0 as I would like.
To give you an example of what I would like to perform:
XXX/YYY/ZZZ/92 ---> Returns 92
XXX/YYY/ZZZ/42 (test) ---> Returns 42
XXX YYY ZZZ 10 ---> Returns 0
All of this must be "plugged" in a select case statement, but this should not change the solution.
Thanks in advance
Perhaps regexp_substr will work for you. If it returns NULL, meaning the pattern of a slash followed by 2 digits and 0 or more characters to the end of the string was not found, replace with a '0' by using NVL().
SQL> with tbl(str) as (
select 'XXX/YYY/ZZZ/92' from dual union
select 'XXX/YYY/ZZZ/42 (test)' from dual union
select 'XXX YYY ZZZ 10' from dual
)
select nvl(regexp_substr(str, '/(\d{2}).*$', 1, 1, NULL, 1), '0') digits
from tbl;
DIGITS
---------------------
0
42
92
SQL>
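The "match or fall back to a default" logic can be sketched in Python as follows. Note one deliberate variation: the pattern below adds [^/]* before the end anchor so that the match is tied to the last slash, as the question asks, whereas the Oracle pattern above takes the first slash that is followed by two digits. The function name and the default parameter are assumptions for this example.

```python
import re


def digits_after_last_slash(text, default="0"):
    # "/" then two digits, then anything except another "/" up to the end,
    # which forces the slash to be the last one in the string.
    match = re.search(r"/(\d{2})[^/]*$", text)
    return match.group(1) if match else default


for sample in ("XXX/YYY/ZZZ/92", "XXX/YYY/ZZZ/42 (test)", "XXX YYY ZZZ 10"):
    print(sample, "->", digits_after_last_slash(sample))
```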

String Handling in Talend

I have this kind of data,
12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345
I want this to transform into this:
All the data that I'm going to process either has 1 comma to separate the address, or has 2 commas as in the example above.
An example of the 1-comma case is below:
12345 Lipa AVE, Lipa City, LP, 12345
The simplest solution is to unify the structure, and then make the mapping. In this case it means first convert the 4 column structure (1 comma case) into 5 columns (2 commas case) where the second field is empty.
The diagram is the following:
tFileInputFullRow -> tJavaRow -> tExtractDelimitedField -> tMap -> tFileOutputDelimited
So first read the full row, then detect the case and insert the extra column if necessary. The tJavaRow code is the following:
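output_row.line = "";
String[] elements = input_row.line.split(",");
// 1-comma case (4 fields): the extra comma creates an empty second address field
if (elements.length == 4) {
    elements[0] += ",";
}
// Rebuild the line; every element is followed by a comma
for (String element : elements) {
    output_row.line += element + ",";
}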
In tExtractDelimitedField set the separator to comma and finally in the tMap merge the two addresses field into one:
row3.address2 != null && !row3.address2.equals("") ? row3.address1 + "," + row3.address2 : row3.address1
The tExtractDelimitedField step can be skipped by changing the output schema of the tJavaRow and assigning the array elements to the output columns one by one.
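The "unify the structure, then map" idea is independent of Talend and can be sketched in a few lines of Python. This is only an illustration of the logic, not Talend code; the function names normalize_row and merged_address are invented for the example.

```python
def normalize_row(line):
    """Split on commas and pad the 4-field form to 5 fields with an empty second address."""
    fields = line.split(",")
    if len(fields) == 4:
        fields.insert(1, "")      # the missing second address becomes an empty field
    return fields


def merged_address(fields):
    """Join the two address fields, skipping the second one when it is empty."""
    address1, address2 = fields[0], fields[1]
    return address1 + "," + address2 if address2 else address1


for row in ("12345 Lipa AVE, AKA 1234 LIpa AVE, Lipa City, LP, 12345",
            "12345 Lipa AVE, Lipa City, LP, 12345"):
    print(merged_address(normalize_row(row)))
```

After normalization both inputs have five fields, so the later mapping step needs only one code path.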

Oracle extract variable number from string

I'm looking to extract a specific number out of a string of mixed alphanumeric characters that are variable in length in a query. I will need this in order to calculate a range based off that number. I'm using Oracle.
Example:
D-3-J32P232
-I need to grab the J32 at least, and most likely even the 32 out of that string. This range of numbers can change at any given time.
It could range from:
D-3-J1P232
to
D-3-J322P2342
The numbers after the second and third letters can be any number of length. Is there any way to do this?
This is simpler and gets both numbers for the range:
select substr( REGEXP_SUBSTR('D-3-J322P2342','[A-Z][0-9]+',1,1),2),
substr( REGEXP_SUBSTR('D-3-J322P2342','[A-Z][0-9]+',1,2),2)
from dual
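The same "letter followed by digits" idea can be tried quickly in Python, where a capturing group takes the place of the SUBSTR call that drops the leading letter. This is just a sketch for experimenting with the pattern; the function name is made up.

```python
import re


def numbers_after_letters(code):
    # Every uppercase letter that is immediately followed by digits;
    # only the digits are captured, so no substr step is needed.
    return re.findall(r"[A-Z](\d+)", code)


print(numbers_after_letters("D-3-J322P2342"))   # ['322', '2342']
```

The leading D is skipped because it is followed by a dash rather than a digit.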
REGEXP_SUBSTR could work (11g version):
SELECT REGEXP_SUBSTR('D-3-J322P2342','([A-Z]+-\d+-[A-Z]+)(\d+)',1,1,'i',2) num
FROM dual;
A test of your sample data:
SQL> SELECT REGEXP_SUBSTR('D-3-J322P2342','([A-Z]+-\d+-[A-Z]+)(\d+)',1,1,'i',2) num
2 FROM dual;
NUM
---
322
SQL>
This will accept any case alpha string, followed by a dash, followed by one or more digits, followed by a dash, followed by another any case alpha string, then your number of interest.
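For comparison, here is the same pattern with the second capturing group selected in Python. It is a sketch for testing the expression against other inputs, with re.IGNORECASE standing in for Oracle's 'i' match parameter; the names are invented for the example.

```python
import re

# Same expression as in the REGEXP_SUBSTR call; group 2 is the number of interest.
PATTERN = re.compile(r"([A-Z]+-\d+-[A-Z]+)(\d+)", re.IGNORECASE)


def first_number(code):
    match = PATTERN.search(code)
    return match.group(2) if match else None


print(first_number("D-3-J322P2342"))   # 322
```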
In 10g it's a bit less straightforward and needs REGEXP_REPLACE, as the ability to reference subexpressions in REGEXP_SUBSTR was not added until 11g:
SELECT REGEXP_SUBSTR(str,'\d+',1,1) NUM
FROM (SELECT REGEXP_REPLACE('D-3-J322P2342','([A-Z]+-\d+-[A-Z]+)','',1,1,'i') str
FROM dual);
Your sample data:
SQL> SELECT REGEXP_SUBSTR(str,'\d+',1,1) NUM
2 FROM
(SELECT REGEXP_REPLACE('D-3-J322P2342','([A-Z]+-\d+-[A-Z]+)','',1,1,'i') str
3 FROM dual);
NUM
---
322
REGEXP_SUBSTR would do the job
