Multiple string substitutions in PostgreSQL

I have a column with abbreviations separated by spaces, like this:
'BG MSG'
There is also another table with substitutions:
target | replacement
-------+-----------------
'BG'   | 'Brick Galvan'
'MSG'  | 'Mosaic Galvan'
The goal is to apply all the substitutions to the abbreviations to obtain something like
'Brick Galvan Mosaic Galvan' from 'BG MSG'
I know I could do
replace( replace('BG MSG', 'BG', 'Brick Galvan'), 'MSG', 'Mosaic Galvan')
But imagine there are hundreds of substitutions, and they can change from one day to the next. The resulting query will be hideous to maintain.
I mean, I could do a code generator that will create the query with all the nested replaces, but I'm looking for something more elegant and postgres-native.
I've found solutions like "How to replace multiple special characters in Postgres 9.5", but they seem to work only for single characters.

Let's say your tables look like this:
create table my_table(id serial primary key, abbrevs text);
insert into my_table (abbrevs) values
('BG MSG');
create table substitutions(target text, replacement text);
insert into substitutions values
('BG', 'Brick Galvan'),
('MSG', 'Mosaic Galvan');
You can get each abbreviation in a single row:
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
id | abbrev
----+--------
1 | BG
1 | MSG
(2 rows)
and use them to join the substitution table and get full names:
select id, string_agg(replacement, ' ') as full_names
from (
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
) t
join substitutions on abbrev = target
group by id
id | full_names
----+----------------------------
1 | Brick Galvan Mosaic Galvan
(1 row)
Db<>fiddle.
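To make the split / look up / re-join idea concrete outside the database, here is a minimal Python sketch. The dictionary is a hypothetical in-memory stand-in for the substitutions table, and the function name is made up for illustration. Two things behave differently from the query above and are worth checking against your data: the inner join drops abbreviations that have no match, and string_agg without an ORDER BY does not guarantee the original token order. The sketch keeps unknown tokens and preserves order.

```python
# Hypothetical in-memory stand-in for the substitutions table.
SUBSTITUTIONS = {"BG": "Brick Galvan", "MSG": "Mosaic Galvan"}


def expand_abbrevs(abbrevs: str, table: dict[str, str]) -> str:
    """Split on spaces, look each token up, and join the results back."""
    tokens = abbrevs.split(" ")
    # Unlike the inner join in the query, unknown tokens are kept unchanged.
    return " ".join(table.get(token, token) for token in tokens)


print(expand_abbrevs("BG MSG", SUBSTITUTIONS))  # Brick Galvan Mosaic Galvan
```

Because whole tokens are matched, 'BG' never rewrites the inside of a longer abbreviation, which is the main advantage over substring replacement.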

The nested REPLACE approach works, but it is quite ugly, right?
SELECT REPLACE(REPLACE(REPLACE(REPLACE(…
Even when carefully formatted for readability, the best you can get is this:
SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(...
On the other hand, you could use a LATERAL JOIN, which takes more characters but is definitely more readable.
-- Input: BG, MSG
-- Output: Brick Galvan, Mosaic Galvan
SELECT msg.Materials
FROM (SELECT 'BG, MSG' AS Materials) mt
INNER JOIN LATERAL (SELECT REPLACE(mt.Materials::text, 'BG', 'Brick Galvan') AS Materials) bg ON true
INNER JOIN LATERAL (SELECT REPLACE(bg.Materials::text, 'MSG', 'Mosaic Galvan') AS Materials) msg ON true;

Related

U-Sql not allowing non-equijoins

I have stumbled across an issue with U-SQL that I haven't yet found a workaround for.
It seems U-SQL doesn't support anything but == in joins, so you can't put > or < in the join condition itself.
Consider the use case below, which I have done in Oracle:
create table trf.test_1(
number_col int
);
insert into trf.test_1 VALUES (10);
insert into trf.test_1 VALUES (20);
insert into trf.test_1 VALUES (30);
insert into trf.test_1 VALUES (60);
drop table trf.test_2;
create table trf.test_2(
number_col int
);
insert into trf.test_2 VALUES (20);
insert into trf.test_2 VALUES (30);
SELECT t1.number_col, t2.number_col
FROM trf.test_1 t1
LEFT JOIN trf.test_2 t2 ON t1.number_col < t2.number_col
;
I get the following:
How might I do that in u-sql without the < join?
I tried a cross join, but if you put the < in the WHERE clause it just turns into an inner join, and you don't get the rows with the nulls.
Any ideas appreciated.
@t1 =
SELECT * FROM
( VALUES
(10),
(20),
(30),
(60)
) AS T(num_col);
@t2 =
SELECT * FROM
( VALUES
(20),
(30)
) AS T(num_col);
// Non-equality comparison moved into WHERE; this alone behaves like an inner join.
@result =
SELECT t1.num_col, t2.num_col AS num_col_2
FROM @t1 AS t1
CROSS JOIN @t2 AS t2
WHERE t1.num_col < t2.num_col;
// Equality left join back to @t1 restores the unmatched rows.
// num_col_2 is the value that came from @t2.
@result2 =
SELECT t1.num_col, t2.num_col_2
FROM @t1 AS t1
LEFT JOIN @result AS t2 ON t1.num_col == t2.num_col;
OUTPUT @result2
TO "/Output/ReferenceGuide/Joins/exampleA.csv"
USING Outputters.Csv();
Edit: I added the left join from the @t1 dataset back to the @result rowset, which seems to work, but I would be interested in any better solutions. It seems like a bit of a workaround.
This is known behavior, discussed at length in the article "U-SQL SELECT Selecting from joins".
Some quotes from that article:
Join Comparisons
U-SQL, like most scaled out Big Data Query languages
that support joins, restricts the join comparison to equality
comparisons between columns in the rowsets to be joined...
...
If one has a non-equality comparison or a more complex expression (such as a method invocation) in the comparison, one can move the comparison to the SELECT’s WHERE clause. Or the more complex expression can be placed in an earlier SELECT statement’s column and then that alias can be referred to in the join comparison.
Basically, non-equality joins don't scale particularly well on a distributed platform like ADLA.
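For anyone who wants to check the logic of the workaround (filter a cross join, then left join the original rows back on equality), here is a small Python sketch using plain lists. It is only a model of the row logic under the sample data from the question, not U-SQL, and the function name is made up.

```python
t1 = [10, 20, 30, 60]
t2 = [20, 30]

# Step 1: CROSS JOIN plus WHERE keeps only the pairs with left < right.
matches = [(a, b) for a in t1 for b in t2 if a < b]


def left_join_on_equality(left, pairs):
    """Step 2: join every left row back to its matches, padding with None."""
    rows = []
    for a in left:
        hits = [b for (x, b) in pairs if x == a]
        if hits:
            rows.extend((a, b) for b in hits)
        else:
            rows.append((a, None))  # unmatched left row survives with a null
    return rows


result = left_join_on_equality(t1, matches)
print(result)
# [(10, 20), (10, 30), (20, 30), (30, None), (60, None)]
```

One caveat worth checking before relying on this pattern: if the left-hand values are not unique, the equality join in step 2 multiplies the matched rows, so you may need a real key column to join back on.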

how can i dynamically pass value to db2 search clause 'like' while fetching result from other table

Can someone help me dynamically pass a value to the DB2 LIKE search clause, where the value is fetched from another table?
I am trying this:
select * from table2 where file_name like '%(select file_name from table1)'
I've even tried CONCAT and the sysibm.sysdummy1 approach, but no luck.
Maybe this helps:
SELECT *
FROM table2
JOIN table1
ON table2.file_name LIKE CONCAT('%',table1.file_name)
The question shows no DDL, sample data, or expected results, so it is unclear whether other considerations apply. The following variation of the answer above is more liberal in interpreting what select * from table2 where file_name like '%(select file_name from table1)' was meant to do: rather than an ends-with (or starts-with) predicate on the file-name value, it implements a contains predicate.
select /* t1.file_name, */ t2.*
from table2 as t2
inner join
table1 as t1
on t2.file_name like '%' concat rtrim(t1.file_name) concat '%'
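The difference between the two join predicates can be sketched in Python as follows. The function names and sample values are made up, the padded value assumes file_name might be a fixed-width CHAR column (the question does not say), and LIKE wildcards inside the stored value itself (% and _) are ignored here.

```python
def like_ends_with(value: str, needle: str) -> bool:
    """Rough equivalent of: value LIKE CONCAT('%', needle)."""
    return value.endswith(needle)


def like_contains(value: str, needle: str) -> bool:
    """Rough equivalent of: value LIKE '%' CONCAT RTRIM(needle) CONCAT '%'."""
    return needle.rstrip(" ") in value


padded = "report    "  # assumed: a CHAR(10) value padded with blanks

print(like_ends_with("summary_report", "report"))   # True
print(like_ends_with("2024_report.csv", "report"))  # False
print(like_ends_with("summary_report", padded))     # False
print(like_contains("2024_report.csv", padded))     # True
```

The third call shows why the RTRIM matters: trailing blanks in the stored value become part of the pattern and defeat the match.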

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data.
I was able to write SQL in Google BigQuery that transforms one part of the string into another with the REPLACE function, for one particular string.
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more replacements to make with the REPLACE function, such as
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I could just use a CASE WHEN statement and hardcode each one, but I'm wondering if there is a better/faster way that uses only REPLACE.
Thanks for your help.
The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.
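The overlap problem with chained replacements is easy to reproduce. Here is a small Python sketch (the helper name and the second pair are made up for illustration) that applies pairs in order, the way nested REPLACE calls do, and shows that the result depends on the order of the pairs.

```python
from functools import reduce


def chained_replace(text, pairs):
    """Apply each (old, new) pair in order, like nested REPLACE calls."""
    return reduce(lambda acc, pair: acc.replace(pair[0], pair[1]), pairs, text)


pairs_a = [("ABC", "XYZ"), ("XYZ", "QWE")]
pairs_b = [("XYZ", "QWE"), ("ABC", "XYZ")]

print(chained_replace("ABC and XYZ", pairs_a))  # QWE and QWE
print(chained_replace("ABC and XYZ", pairs_b))  # XYZ and QWE
```

With pairs_a the output of the first replacement is consumed by the second one, so the original ABC ends up as QWE rather than XYZ.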
When you have a large number of replacements, chaining REPLACE calls becomes impractical and tedious manual work.
The query below addresses this (assuming you maintain a lookup table of pairs: Word, Replacement). Note that it uses the JS() function of BigQuery legacy SQL.
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
)
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it with the example below:
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still an issue if the result of one replacement matches the input of a subsequent replacement.
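One way around that remaining issue is to do all replacements in a single pass, so that replaced text is never scanned again. The following Python sketch shows the idea with a regular-expression alternation and a lookup callback. It is an illustration of the technique, not a drop-in for the BigQuery UDF, and unlike the 'gi' flag in the JavaScript above it matches case-sensitively.

```python
import re


def replace_all_once(text, mapping):
    """Replace every key of mapping in a single left-to-right pass."""
    # Longer keys first, so a key is not shadowed by a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up once; the replacement text is never re-scanned.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


mapping = {"XYZ": "QWE", "ABC": "XYZ", "PLO": "Rustic", "Kix": "BowWow"}
print(replace_all_once("98 ABC PLO Kix ABC 76 XYZ 54", mapping))
# 98 XYZ Rustic BowWow XYZ 76 QWE 54
```

Here ABC becomes XYZ and stays XYZ, while the XYZ that was in the input becomes QWE, regardless of the order of the pairs.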
I believe there are multiple ways to tackle this problem, and it depends on the size of your dataset, practicality of simply making a guiding table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values on different columns, and join that table with your main table, and retrieve those values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.OldValue = support_table.OldValue
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use STRING_AGG on a table with all the values you want to replace, or skip the STRING_AGG step and create said string by hand.
Both of the snippets below result in "item1|item2|item3". Choose which is faster for you to do.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM values_to_replace
# Or write the pattern by hand:
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
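One thing to watch with the generated pattern: if any of the values contains a regex metacharacter (a dot, a parenthesis, a plus sign), it is interpreted as regex syntax rather than literal text. The Python sketch below shows the same join-with-pipe idea with each value escaped first. The item values are made up, and in BigQuery you would need some equivalent escaping step before aggregating.

```python
import re

items = ["item1", "item2", "item3"]

# Escape each value, then join with "|" so the regex reads as item1 OR item2 OR item3.
pattern = "|".join(re.escape(item) for item in items)
print(pattern)  # item1|item2|item3

print(re.sub(pattern, "Replacer", "item1, other, item3"))
# Replacer, other, Replacer

# Why escaping matters: an unescaped dot matches any character.
print(re.sub("a.b", "X", "axb"))             # X
print(re.sub(re.escape("a.b"), "X", "axb"))  # axb
```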
Hope it helps.

Count null columns as zeros with oracle

I am running a query with Oracle:
SELECT
c.customer_number,
COUNT(DISTINCT o.ORDER_NUMBER),
COUNT(DISTINCT q.QUOTE_NUMBER)
FROM
Customer c
JOIN Orders o on c.customer_number = o.party_number
JOIN Quote q on c.customer_number = q.account_number
GROUP BY
c.customer_number
This works beautifully and I can get the customer and their order and quote counts.
However, not all customers have orders or quotes but I still want their data. When I use LEFT JOIN I get this error from Oracle:
ORA-24347: Warning of a NULL column in an aggregate function
Seemingly this error is caused by the eventual COUNT(NULL) for customers that are missing orders and/or quotes.
How can I get a COUNT of null values to come out to 0 in this query?
I can do COUNT(DISTINCT NVL(o.ORDER_NUMBER, 0)) but then the counts will come out to 1 if orders/quotes are missing which is no good. Using NVL(o.ORDER_NUMBER, NULL) has the same problem.
Try using inline views:
SELECT
c.customer_number,
NVL(o.order_count, 0) AS order_count,
NVL(q.quote_count, 0) AS quote_count
FROM
customer c,
( SELECT
party_number,
COUNT(DISTINCT order_number) AS order_count
FROM
orders
GROUP BY
party_number
) o,
( SELECT
account_number,
COUNT(DISTINCT quote_number) AS quote_count
FROM
quote
GROUP BY
account_number
) q
WHERE 1=1
AND c.customer_number = o.party_number (+)
AND c.customer_number = q.account_number (+)
;
Sorry, but I'm not working with any databases right now to test this, or to test whatever the ANSI SQL version might be. Just going on memory.
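For what it's worth, here is a runnable sketch of the ANSI-join form of the same idea. It runs against an in-memory SQLite database from Python as a stand-in, so it says nothing about Oracle-specific behavior; the sample rows are made up, and COALESCE plays the role NVL would play in Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_number INTEGER);
    CREATE TABLE orders (party_number INTEGER, order_number INTEGER);
    CREATE TABLE quote (account_number INTEGER, quote_number INTEGER);
    INSERT INTO customer VALUES (1), (2);          -- customer 2 has no activity
    INSERT INTO orders VALUES (1, 100), (1, 101);
    INSERT INTO quote VALUES (1, 500);
""")

# Pre-aggregate each child table, then left join; missing counts become 0.
rows = conn.execute("""
    SELECT c.customer_number,
           COALESCE(o.order_count, 0) AS order_count,
           COALESCE(q.quote_count, 0) AS quote_count
    FROM customer c
    LEFT JOIN (SELECT party_number, COUNT(DISTINCT order_number) AS order_count
               FROM orders GROUP BY party_number) o
           ON c.customer_number = o.party_number
    LEFT JOIN (SELECT account_number, COUNT(DISTINCT quote_number) AS quote_count
               FROM quote GROUP BY account_number) q
           ON c.customer_number = q.account_number
    ORDER BY c.customer_number
""").fetchall()

print(rows)  # [(1, 2, 1), (2, 0, 0)]
```

Aggregating before joining also avoids multiplying order rows by quote rows for the same customer.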

How to replace one or more consecutive symbols with one symbol in DB2

I am using DB2 LUW 9.5. In a field, I have a value like this one:
Test^test^^test^^^^test^^test^test
In a SELECT query, I would like to replace the duplicated ^ with only one ^. This would produce:
Test^test^test^test^test^test
The delimiter is known and static (can be hardcoded). Would you know a way to obtain the desired output using DB2 functions?
Thank you
You need one other character that can be used as delimiter, for example the pipe sign (|).
Let's say the table is defined as
create table myTable (
myColumn varchar(400)
);
Add a value for a test:
insert into myTable (myColumn) values
('Test^^^^^^^^test^^^^^^^test^^^^^^test^^^^^test^^^^test^^^test^^test^test');
Then do the replacement in three steps, using the other delimiter as a temporary marker:
select replace(replace(replace(myColumn, '^^', '^|^'), '|^^', ''), '^|^', '^')
from myTable;
The result:
Test^test^test^test^test^test^test^test^test^test
Instead of a one-character delimiter, you can use a string that you are sure will not occur in the values, for example 'xy'. The next query gives the same result:
select replace(replace(replace(myColumn, '^^', '^xy^'), 'xy^^', ''), '^xy^', '^')
from myTable;
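If you want to convince yourself that the three nested replaces handle runs of any length, the steps are easy to trace in Python. This is only a sketch of the string logic, assuming DB2's REPLACE substitutes non-overlapping occurrences from left to right the way Python's str.replace does; the function name is made up.

```python
import re


def collapse_carets(value: str, marker: str = "|") -> str:
    """Collapse runs of '^' using three plain replaces and a temporary marker."""
    if marker in value:
        raise ValueError("marker must not occur in the data")
    step1 = value.replace("^^", "^" + marker + "^")  # split every pair
    step2 = step1.replace(marker + "^^", "")         # drop the surplus carets
    return step2.replace("^" + marker + "^", "^")    # remove a leftover marker


sample = "Test^test^^test^^^^test^^test^test"
print(collapse_carets(sample))  # Test^test^test^test^test^test

# Cross-check against a regular expression for runs of 1 to 12 carets.
for n in range(1, 13):
    text = "a" + "^" * n + "b"
    assert collapse_carets(text) == re.sub(r"\^+", "^", text)
```

The check at the top is the one real precondition: the trick breaks if the marker already appears in the data.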
