Google BigQuery Replace function for string type - string

I am trying to replace certain customer names in my data.
I was able to do SQL using Google BigQuery language to transform one part of the string another via the replace function for one particular string.
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more that I would need to use the replace function such that
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I am able to just use "case when statement" and then hardcode each one, but I'm wondering if there is a better/faster way to just use replace statement instead.
Thanks for your help.

The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.

In cases when you have quite a number of replacements - chaining of REPLACEs can become not practical and annoying manual work.
Below addresses this potential issue (assuming you maintain Lookup table with pairs: Word, Replacement)
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it using below example
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still issue if result of one replacement matches the input to a subsequent replacement

I believe there are multiple ways to tackle this problem, and it depends on the size of your dataset, practicality of simply making a guiding table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values on different columns, and join that table with your main table, and retrieve those values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.old_value = support_table.old_value
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use
STRING_AGG in a table with all the values you want to replace, or skip the STRING_AGG step and create said string by hand.
Both of the snippets below result in "item1|item2|item3". Choose which is faster for you to do.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnsWithItemsToReplace,"|") FROM values_to_replace
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
Hope it helps.

Related

How to construct a `select ... in` SQL query in nim?

I'm using nim and db_sqlite to fetch some rows with certain _ids from a database table. For example:
for row in db.fastRows("SELECT * FROM t WHERE _id IN (?, ?)", #["1", "2"]):
echo row
This works as expected, however, the sequence at the end is constructed dynamically at runtime, which means I need a variable amount of ? in the query. I end up creating a sequence with question marks, joining them, interpolating the string and turning it into a database query:
var qs : seq[string]
for id in ids:
qs.add("?")
let query_string = """SELECT * FROM t WHERE _id IN ($1)""" % join(qs, ",")
let query = SqlQuery(query_string)
for row in db.fastRows(query, ids):
echo row
Is there a better way to construct a select ... in query in nim? Ideally one with just one ? in the SqlQuery.
(For what it's worth, the current behavior is similar to other languages I've used)
you could do the replacement manually, here's one way using strformat and map
import strformat,db_sqlite,sequtils,strutils
#assuming ids is a seq[string] here
let query = sql(&"SELECT * FROM t WHERE _id IN ({ids.map(dbQuote).join(\",\")})")
for row in db.fastRows(query):
echo row

Multiple string substitutions in PostgreSQL

I have a column with abbreviations separated by spaces like this
'BG MSG'
Also, there's another table with substitutions
target replacement
----------------------
'BG', 'Brick Galvan'
'MSG', 'Mosaic Galvan'
The goal is to apply all the substitutions to the abbreviations to obtain something like
'Brick Galvan Mosaic Galvan' from 'BG MSG'
I know I could do
replace( replace('BG MSG', 'BG', 'Brick Galvan'), 'MSG', 'Mosaic Galvan')
But imagine there are hundreds of substitutions, and they can change from one day to the next. The resulting query will be hideous to maintain.
I mean, I could do a code generator that will create the query with all the nested replaces, but I'm looking for something more elegant and postgres-native.
I've found solutions like this one
How to replace multiple special characters in Postgres 9.5 but they seem to work only for single characters.
Let's say your tables look like this:
create table my_table(id serial primary key, abbrevs text);
insert into my_table (abbrevs) values
('BG MSG');
create table substitutions(target text, replacement text);
insert into substitutions values
('BG', 'Brick Galvan'),
('MSG', 'Mosaic Galvan');
You can get each abbreviation in a single row:
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
id | abbrev
----+--------
1 | BG
1 | MSG
(2 rows)
and use them to join the substitution table and get full names:
select id, string_agg(replacement, ' ') as full_names
from (
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
) t
join substitutions on abbrev = target
group by id
id | full_names
----+----------------------------
1 | Brick Galvan Mosaic Galvan
(1 row)
Db<>fiddle.
Nested replace approach would work but it is quite ugly, right?
SELECT REPLACE(REPLACE(REPLACE(REPLACE(…
After carefully formatted to make it look readable, the best you can get follows:
SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(...
On the other hand, you might just use the LATERAL JOIN solution which uses more characters but, it is definitely more readable.
-- Input: BG, MSG
-- Output: Brick Galvan, Mosaic Galvan
SELECT msg.Materials
FROM (SELECT 'BG, MSG' AS Materials) mt
INNER JOIN LATERAL (SELECT REPLACE(mt.Materials::text, 'BG', 'Brick Galvan') AS Materials) bg ON true
INNER JOIN LATERAL (SELECT REPLACE(bg.Materials::text, 'MSG', 'Mosaic Galvan') AS Materials) msg ON true;

Multiple values in WHERE clause using sqldf in R

I am trying to query multiple values in the WHERE clause, using sqldf in R. I have the following query, however, it continues to throw an error. Any help would be appreciated.
sqldf("SELECT amount
from df
where category = 'description' and 'original description'")
ERROR: <0 rows> (or 0-length row.names)
You just need to use in condition
sqldf("SELECT amount
from df
where category in ('description','original description')")
If you want to use like operator, you need to use OR instead of AND.(not sure what other entries are in the category, if you don't have any other category that has "description" in its name, the following might be enough
sqldf("SELECT amount from df where category LIKE 'descriptio%'")
You need to define each where clause explicitly, so
SELECT amount FROM df WHERE category = 'description' OR category = 'original description'
You can pass in multiple values, it's done with the IN operator:
SELECT amount FROM df WHERE category IN ( 'description', 'original description' )

Third substring from the end in Google Big Query

I use standard sql and want to extract third substring from the end.
Example Input: "Search-site-variable-brand-0-city-none-18053517"
Output: "city"
I just wanted to point out that if you plan to apply this transformation to multiple columns, it may be useful to pull the logic into a UDF. Here's an example of how to do that:
CREATE TEMP FUNCTION SecondSubstringFromEnd(s STRING) AS ((
SELECT arr[SAFE_OFFSET(ARRAY_LENGTH(arr) - 3)]
FROM (
SELECT SPLIT(s, '-') AS arr
)
));
WITH Input AS (
SELECT 'Search-site-variable-brand-0-city-none-18053517' AS str UNION ALL
SELECT 'a-b' UNION ALL
SELECT 'w-x-yyy-z'
)
SELECT
str,
SecondSubstringFromEnd(str) AS second_substring_from_end
FROM Input;
This might do the trick:
WITH data AS(
select "Search-site-variable-brand-0-city-none-18053517" as Input
)
SELECT
CASE WHEN ARRAY_LENGTH(SPLIT(Input, '-')) > 3 THEN SPLIT(Input, '-')[OFFSET(ARRAY_LENGTH(SPLIT(Input, '-')) - 3)] END word
FROM data
It returns NULL in case the string has no split, such as empty strings.
Few more variations for BigQuery Standard SQL:
#standardSQL
WITH YourTable AS(
SELECT 'Search-site-variable-brand-0-city-none-18053517' AS Input UNION ALL
SELECT 'Second-substring-from-the-end-in-Google-BigQuery' UNION ALL
SELECT 'bigQuery-assign-a-value-to-table-1-based-on-table-2' UNION ALL
SELECT 'Error-Message-Too-many-sources-provided-15285-Limit-is-10000' UNION ALL
SELECT 'Google-Bigquery-data-import-from-Google-Analytics-360' UNION ALL
SELECT 'Bigquery-Partitioning-data-past-2000-limit'
)
SELECT
Input,
REVERSE(SPLIT(REVERSE(Input), '-')[SAFE_ORDINAL(3)]) AS Output_1,
ARRAY_REVERSE(SPLIT(Input, '-'))[SAFE_ORDINAL(3)] AS Output_2
FROM YourTable
The "ARRAY_REVERSE" function works wonders in this scenario.
with input AS
(
SELECT "Search-site-variable-brand-0-city-none-18053517" AS to_reverse_string
)
SELECT ARRAY_REVERSE(SPLIT(to_reverse_string, "-"))[SAFE_OFFSET(2)]
FROM input

How to replace one or more consecutive symbols with one symbol in DB2

I am using DB2 LUW 9.5. In a field, I have a value like this one:
Test^test^^test^^^^test^^test^test
In a SELECT query, I would like to replace the duplicated ^ with only one ^. This would produce:
Test^test^test^test^test^test
The delimiter is known and static (can be hardcoded). Would you know a way to obtain the desired output using DB2 functions?
Thank you
You need one other character that can be used as delimiter, for example the pipe sign (|).
Let's say the table is defined as
create table myTable (
myColumn varchar(400)
);
Add a value for a test:
insert into myTable (myColumn) values
('Test^^^^^^^^test^^^^^^^test^^^^^^test^^^^^test^^^^test^^^test^^test^test');
Then do a smart replacement with use of the other delimiter
select replace(replace(replace(myColumn, '^^', '^|^'), '|^^', ''), '^|^', '^')
from myTable;
The result:
Test^test^test^test^test^test^test^test^test^test
Instead of using a one character delimiter you can use a string of which you are sure it will not occur in the values, for example 'xy'. The next query will give the same results:
select replace(replace(replace(myColumn, '^^', '^xy^'), 'xy^^', ''), '^xy^', '^')
from myTable;

Resources