To extract a string based on some specific characters - excel

I have values in rows like below:
Https://abc/uvw/xyz
Https://def/klm/qew/asdas
Https://ghi/sdk/asda/as/aa/
Https://jkl/asd/vcx/asdsss/ssss/
Now I want the result to be like below:
Https://abc/uvw/xyz
Https://def/klm/qew
Https://ghi/sdk/asda
Https://jkl/asd/vcx
How can I build the result by skipping '/' up to some count, or is there another way to do this in Excel? Is there a way to drop the right-hand part of the string once it finds 4 '/' in the string?

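You could use SUBSTITUTE to replace the nth / (in this case the 5th) with a unique character, then take a LEFT up to the position of that unique character as returned by FIND. If the string has fewer than five / characters, IFERROR falls back to the full length. I'll take CHAR(1) as the unique character: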
=LEFT(A1,IFERROR(FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),5))-1,LEN(A1)))
Another option would be to split on / using Text to Columns under the Data tab and join back only the columns you need.
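
For comparison, the same "cut before the 5th /" idea can be sketched outside Excel. This is only an illustrative Python sketch, not part of the Excel answer; the function name keep_before_nth is made up for the example:

```python
def keep_before_nth(text: str, sep: str = "/", n: int = 5) -> str:
    """Return text up to, but not including, the n-th occurrence of sep."""
    parts = text.split(sep)
    # With fewer than n separators there are at most n pieces, so the join
    # rebuilds the whole string (the role IFERROR(..., LEN(A1)) plays above).
    return sep.join(parts[:n])


urls = [
    "Https://abc/uvw/xyz",
    "Https://def/klm/qew/asdas",
    "Https://ghi/sdk/asda/as/aa/",
    "Https://jkl/asd/vcx/asdsss/ssss/",
]
for url in urls:
    print(keep_before_nth(url))
```

Note that the two slashes in "Https://" count as the first and second occurrences, which is why the cut happens at the 5th.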

Related

How do I drop complete rows (including all values in it) that contain a certain value in my Pandas dataframe?

I'm trying to write a python script that finds unique values (names) and reports the frequency of their occurrence, making use of Pandas library. There's a total of around 90 unique names, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that contains the regex expression "^monday.*", intending to indicate the word "monday" followed by any other number of random characters. I want to drop/deselect any cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the pattern "^monday.*" and then deselect the rows (and all values in those rows) that match the regex. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!

Postgresql, regex pattern match in "where"

One table column holds a string made of several substrings separated by the pipe character (|), for example "aa-a|aa-a-a|a-a|aa". The delimiter never appears at the start or end of the string. The match check happens in the WHERE clause: when the value matches, the row is selected. In effect it is a search: pass in a substring such as "aa-a" and find all rows that contain "aa-a" as a complete substring; "aa-a-a" should not match. The case where the column holds a single substring and no delimiter also needs to be handled. Something like this:
Select * from tb where REGEX_FUNC(tb.col_1, "pattern")>0
where the "pattern" might be something like "^aa-a$". (1) What should REGEX_FUNC be; do I need to create my own? (2) What should the "pattern" be?
No need for a regex match.
You can convert the delimited value into an array, then check whether any element of the array equals your comparison value:
where 'aa-a' = any(string_to_array(col_1, '|'))
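
The same split-then-compare logic can be illustrated in Python. This is a rough sketch of the idea only, not PostgreSQL code; has_token is a hypothetical helper name:

```python
def has_token(value: str, token: str, sep: str = "|") -> bool:
    """True if token equals one complete sep-delimited element of value."""
    # Splitting first means "aa-a" can never match inside "aa-a-a",
    # and a value with no delimiter is simply a one-element list.
    return token in value.split(sep)


print(has_token("aa-a|aa-a-a|a-a|aa", "aa-a"))  # whole element present
print(has_token("aa-a-a|a-a", "aa-a"))          # only a partial overlap
print(has_token("aa-a", "aa-a"))                # single value, no delimiter
```

Comparing whole elements avoids having to anchor a regular expression on the delimiter or on the string boundaries.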

How to extract text from a string where multiple entries meet the criteria, and return all values

This is an example of the string, and it can be longer:
1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)GL: Contract Services (510000), COT: Network (N), CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)
The result needs to be: column1 = 600007056, column2 = 600007070, column3 = 600007053.
I am working in Spotfire and creating calculated columns through transformations, as I need the columns to join to other data sets.
I have tried the expression below, but it only picks up the first 600... number and not the others, and there can be an undefined number of those.
Account is the column with the string:
Mid([Account],
Find("(000",[Account]) + Len("(000"),
Find("0001)",[Account]) - Find("(000",[Account]) - Len("(000"))
Thank you!
Assuming my guess is correct, and the pattern to look for is:
9 digits, starting with 6, preceded by an opening parenthesis and 3 zeros, and followed by a space, 4 digits and a closing parenthesis
you can grab individual occurrences by:
column1: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',1)
column2: RXExtract([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))',2)
etc.
The tricky bit is knowing how many columns to define, since you say there can be many. One way to find out is to first calculate the maximum number of occurrences, like this:
maxn: Max((Len([Account]) - Len(RXReplace([Account],'(?<=\\(000)6\\d{8}(?=\\s\\d{4}\\))','','g'))) / 9)
still assuming that each number to extract has 9 digits. This takes the difference between the length of the original [Account] and its length after the matched patterns are replaced by an empty string, then divides it by 9.
Then you know you can define up to maxn columns; the extra ones will be empty for rows with fewer occurrences.
Note that Spotfire always wants two backslashes for escaping (I had to add more in the editor to make it render correctly; I hope I have not missed any).
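
To sanity-check the pattern outside Spotfire, here is a small Python sketch using the standard re module on the sample string from the question. It is only an illustration of the regular expression and of the length-difference count; Python needs single backslashes in a raw string, and the variable names are made up:

```python
import re

# Same pattern as above: 9 digits starting with 6, preceded by "(000"
# and followed by a space, 4 digits and ")".
ACCOUNT_ID = re.compile(r"(?<=\(000)6\d{8}(?=\s\d{4}\))")

account = (
    "1160752 Meranji Oil Sats -Mt(MA) (000600007056 0001), "
    "PE:Toolachee Gas Sats -Mt(MA) (000600007070 0003)"
    "GL: Contract Services (510000), COT: Network (N), "
    "CO: OM-A00009.0723,Oil Sats -Mt(MA) (000600007053 0003)"
)

# All occurrences at once (what column1, column2, ... pick out one by one).
ids = ACCOUNT_ID.findall(account)

# The length-difference trick: each removed match is 9 characters long.
count_by_length = (len(account) - len(ACCOUNT_ID.sub("", account))) // 9

print(ids)
print(count_by_length)
```

The "(510000)" entry is skipped because it does not have the "(000" prefix or the trailing four-digit group.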

Replace string parts that appear twice Oracle

I am trying to work out in Oracle how to isolate/highlight word combinations in a concatenated string like the one below:
Some words##Again words##More of this||####||Some words##Again words##Other
The idea is to find the word combinations that appear exactly twice and replace them by 0 so I'm left with the ones that appear only once, either on the left side of the ||####|| or on the right side. The result of the query should be something like this:
Highlighted
Some words##Again words##More of this||####||Some words##Again words##**Other**
Replaced
0##0##More of this||####||0##0##Other
To give you some more information about the concatenation: the left side (before the ||####||) is my current customer record, while on the right hand side I have the previous version. By making the replacements I can reveal any differences between customer records.
I have tried to get this done by using:
regexp_replace: this does not work entirely with REGEXP_REPLACE(MY STRING,'((Some words){1,2})|((Again words){1,2})','0',1,0) as for some reason the string parts in my first record are never correctly replaced. I'm also hitting the limits of this function due to the number of word combinations I need to match;
nested CASE WHEN: does not work either obviously as CASE WHEN - even nested - stops when the first match is found but I need to have all conditions checked and replaced.
I have thought about using subselects, but as this query uses one of the largest tables in my schema, this will not be usable except on a per customer basis. And it might still not work...
Some more information in order to find a solid, performant solution:
I have 34 possible word combinations to match
I have no idea which ones will be there, ever, except when I run the query obviously
I have no idea in which order they will be in the concatenated string
I hope this is clear. Anyone with some magical ideas?
Thanks in advance
You can use a recursive sub-query factoring clause to replace one duplicated term at each iteration:
WITH replaced ( value, start_char ) AS (
  SELECT REGEXP_REPLACE(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           '\10\30\6',
           1
         ),
         REGEXP_INSTR(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           1
         )
  FROM   table_name
UNION ALL
  SELECT REGEXP_REPLACE(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           '\10\30\6',
           start_char + 1
         ),
         REGEXP_INSTR(
           value,
           '(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)',
           start_char + 1
         )
  FROM   replaced
  WHERE  start_char > 0
)
SELECT value
FROM   replaced
WHERE  start_char = 0;
Which, for the sample data:
CREATE TABLE table_name ( value ) AS
SELECT 'Some words##Again words##More of this||####||Some words##Again words##Other' FROM DUAL UNION ALL
SELECT '333##123##789##555||####||123##456##789##222##333' FROM DUAL;
Outputs:
| VALUE |
| :------------------------------------ |
| 0##0##More of this||####||0##0##Other |
| 0##0##0##555||####||0##456##0##222##0 |
db<>fiddle here
Explanation:
The regular expression matches:
(##|^) either two # characters or the start of the string ^ (in the first capturing group ());
([^#]+?) one-or-more characters that are not # (in the second capturing group ());
( the start of the 3rd capturing group;
(##[^#]+?)* two # characters followed by one-or-more non-# characters (in the 4th capturing group ()) all repeated zero-or-more * times;
\|\|####\|\| then two | characters, four # characters and two | characters;
([^#]+?##)* then one-or-more non-# characters followed by two # characters (in the 5th capturing group ()), all repeated zero-or-more * times;
) the end of the 3rd capturing group;
\2 a duplicate of the 2nd capturing group; then
(##|$) either two # characters or the end-of-the-string $ (in the 6th capturing group).
This is replaced by:
\10\30\6 which is the contents of the 1st capturing group then a zero (replacing the 2nd capturing group) then the contents of the 3rd capturing group then a second zero (replacing the matched duplicate) then the contents of the 6th capturing group.
At each iteration the query replaces one pair of duplicate terms in the string (if one exists) and REGEXP_INSTR finds the start of that match; the results go into value and start_char respectively. At the next iteration the regular expression starts looking from the character after the start of the previous match, so it gradually moves across the string finding matches. When no more duplicate terms can be found, REGEXP_REPLACE performs no replacement, REGEXP_INSTR returns 0, and the iteration terminates.
The final query filters to return only the final level of the iteration (when all the duplicates have been replaced).
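
The iteration can be easier to follow as a plain loop. The following Python sketch mimics the same idea with the standard re module; it is an illustration under the assumption that Python's regex engine treats this pattern the same way as Oracle's for the sample data, and replace_duplicates is a made-up name:

```python
import re

# Same six capturing groups as the Oracle pattern above.
PATTERN = re.compile(
    r"(##|^)([^#]+?)((##[^#]+?)*\|\|####\|\|([^#]+?##)*)\2(##|$)"
)


def replace_duplicates(value: str) -> str:
    """Replace each term that appears on both sides of ||####|| with 0."""
    pos = 0
    while True:
        match = PATTERN.search(value, pos)
        if match is None:
            return value  # nothing left to replace (start_char = 0 in SQL)
        # In Python "\10" would mean group 10, so \g<1>0 is used instead.
        replaced = match.expand(r"\g<1>0\g<3>0\g<6>")
        value = value[: match.start()] + replaced + value[match.end():]
        # Resume one character after the start of this match, so an
        # already-replaced 0 is never picked up again as the left term.
        pos = match.start() + 1


rows = [
    "Some words##Again words##More of this||####||Some words##Again words##Other",
    "333##123##789##555||####||123##456##789##222##333",
]
for row in rows:
    print(replace_duplicates(row))
```

Advancing the search position matters: restarting from the beginning each time would keep matching the pair of zeros produced by the first replacement.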

T-SQL: Fetch all text before last occurrence of dash

I'm stuck on trying to fetch all text up to (but not including) the last dash.
I can find a solution for fetching text to the left of 1 dash (eg. SUBSTRING(#ID, 1, CHARINDEX('-', #ID) -1) ) and even, say, the second dash, but the issue is that the number of dashes in my list varies wildly.
Eg.
ID
ABC-DEF-GHI-001
ABC-DEF-2
ABC-DEF-GHI-JKL-00003
ABC-DEF-GH-4
ABC-123-DEF-008
From the above I would like to fetch all the text to the left of the last dash.
ABC-DEF-GHI
ABC-DEF
ABC-DEF-GHI-JKL
ABC-DEF-GH
ABC-123-DEF
Any pointers appreciated.
One trick we can use here is to find the first occurrence of a dash in the reversed string. That position, counted from the end, is where the last dash sits, so subtracting it from the string's length gives the number of characters to take from the start of the original string.
SELECT
    col,
    LEFT(col, LEN(col) - CHARINDEX('-', REVERSE(col))) AS col_sub
FROM yourTable
WHERE
    col LIKE '%-%';
Demo
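
To make the arithmetic of the reverse trick concrete, here is a short Python sketch of the same calculation next to Python's built-in shortcut. It is illustrative only; the function names are invented for the example:

```python
def before_last_dash_reversed(text: str) -> str:
    """Mirror of LEFT(col, LEN(col) - CHARINDEX('-', REVERSE(col)))."""
    # index() is 0-based while CHARINDEX is 1-based, hence the extra + 1.
    # Like the WHERE clause above, this assumes text contains a dash.
    offset_from_end = text[::-1].index("-") + 1
    return text[: len(text) - offset_from_end]


def before_last_dash(text: str) -> str:
    """Same result using str.rpartition, which splits at the last dash."""
    return text.rpartition("-")[0]


ids = [
    "ABC-DEF-GHI-001",
    "ABC-DEF-2",
    "ABC-DEF-GHI-JKL-00003",
    "ABC-DEF-GH-4",
    "ABC-123-DEF-008",
]
for value in ids:
    print(before_last_dash_reversed(value), before_last_dash(value))
```

For "ABC-DEF-2" the reversed string is "2-FED-CBA", the dash is its second character, and 9 - 2 = 7 characters are kept, giving "ABC-DEF".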
