split a file path into its constituent paths in Hive/Presto - string

Using Presto/Hive, I'd like to split a string in the following way.
Input string:
\Users\Killer\Downloads\Temp
\Users\Killer\Downloads\welcome
and have the query return these rows:
\Users\
\Users\Killer\
\Users\Killer\Downloads\
\Users\Killer\Downloads\Temp
\Users\
\Users\Killer\
\Users\Killer\Downloads\
\Users\Killer\Downloads\welcome
Can anyone please help me.

Solution for Hive. split to get array, explode array using posexplode, collect array again using analytic function and concatenate (literals \ should be shielded with one more backslash - \\ and in the regex used in split, single backslash represented as four backslashes):
select s.level,
concat(concat_ws('\\',collect_set(s.path) over(order by level rows between unbounded preceding and current row)),
case when level<size(split(t.str,'\\\\'))-1 then '\\' else '' end
) result
from mytable t lateral view posexplode(split(t.str,'\\\\')) s as level, path
Result:
level result
0 \
1 \Users\
2 \Users\Killer\
3 \Users\Killer\Downloads\
4 \Users\Killer\Downloads\Temp

This can do the job:
SELECT item, array_join( array_agg(item) over (order by id), '\' )
FROM UNNEST(split('\Users\Killer\Downloads\Temp','\')) WITH ORDINALITY t(item,id)
Explanation:
We first split the sting to an array by the delimeter \, then we UNNEST this array into rows, one row per item. After that we do array_agg on all items till this row id ("rolling" aggregation of window function), and finally we array_join them back with the \ delimeter.

Related

Joining 2 tables in pyspark, multiple conditions, left join?

I have 2 tables like below. My code below is joining 2 tables (left join). The problem is I have to do the same join twice. The first join is happening on log_no and LogNumber which returns all records from the left table (table1), and the matched records from the right table (table2). The second join is doing the same thing but on the substring of log_no with LogNumber. for example, 777 will match with 777 from table 2, 777-A there is no match but when using a substring function 777-A becomes 777 which will be matched in table 2.
Instead of creating 2 separate joins like below, how can I cover both scenarios with one join instead. Code below:
# first join to match 1234-A (table 1) with 1234-A (table 2)
df5 = df5.join(df_app, trim(df5.LOG_NO) == trim(df_app.LogNumber), "left")\
.select (df5["*"], df_app["ApplicationId"])
df5 = df5.withColumnRenamed("ApplicationId","ApplicationId_1")
# second join with substring function, to match 777-C with 777,
# my string is longer than my examples, this is why I have a substring for the first 8 characters. I provided simple examples.
df5 = df5.join(df_app, substring(trim(df5.LOG_NO), 1, 8) == trim(df_app.LogNumber), "left")\
.select (df5["*"], df_app["ApplicationId"])
df5 = df5.withColumnRenamed("ApplicationId","ApplicationId_2")
You can combine the two join conditions using a bitwise OR:
df5 = df5.join(df_app,
(trim(df5.LOG_NO) == trim(df_app.LogNumber)) |
(substring(trim(df5.LOG_NO), 1, 8) == trim(df_app.LogNumber)),
"left") \
.select(df5["*"], df_app["ApplicationId"])

Multiple string substitutions in PostgreSQL

I have a column with abbreviations separated by spaces like this
'BG MSG'
Also, there's another table with substitutions
target replacement
----------------------
'BG', 'Brick Galvan'
'MSG', 'Mosaic Galvan'
The goal is to apply all the substitutions to the abbreviations to obtain something like
'Brick Galvan Mosaic Galvan' from 'BG MSG'
I know I could do
replace( replace('BG MSG', 'BG', 'Brick Galvan'), 'MSG', 'Mosaic Galvan')
But imagine there are hundreds of substitutions, and they can change from one day to the next. The resulting query will be hideous to maintain.
I mean, I could do a code generator that will create the query with all the nested replaces, but I'm looking for something more elegant and postgres-native.
I've found solutions like this one
How to replace multiple special characters in Postgres 9.5 but they seem to work only for single characters.
Let's say your tables look like this:
create table my_table(id serial primary key, abbrevs text);
insert into my_table (abbrevs) values
('BG MSG');
create table substitutions(target text, replacement text);
insert into substitutions values
('BG', 'Brick Galvan'),
('MSG', 'Mosaic Galvan');
You can get each abbreviation in a single row:
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
id | abbrev
----+--------
1 | BG
1 | MSG
(2 rows)
and use them to join the substitution table and get full names:
select id, string_agg(replacement, ' ') as full_names
from (
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
) t
join substitutions on abbrev = target
group by id
id | full_names
----+----------------------------
1 | Brick Galvan Mosaic Galvan
(1 row)
Db<>fiddle.
Nested replace approach would work but it is quite ugly, right?
SELECT REPLACE(REPLACE(REPLACE(REPLACE(…
After carefully formatted to make it look readable, the best you can get follows:
SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(...
On the other hand, you might just use the LATERAL JOIN solution which uses more characters but, it is definitely more readable.
-- Input: BG, MSG
-- Output: Brick Galvan, Mosaic Galvan
SELECT msg.Materials
FROM (SELECT 'BG, MSG' AS Materials) mt
INNER JOIN LATERAL (SELECT REPLACE(mt.Materials::text, 'BG', 'Brick Galvan') AS Materials) bg ON true
INNER JOIN LATERAL (SELECT REPLACE(bg.Materials::text, 'MSG', 'Mosaic Galvan') AS Materials) msg ON true;

Select Query containing tuple With mixed single as well as double quotes

Postgresql select query containing tuple with single quotes as well as double quotes when giving this tuple as the input to select query it genrates error stating that specific value is not present in the database.
I have treid converting that list of values to JSON list with double quotes but that doesn't help either.
list = ['mango', 'apple', "chikoo's", 'banana', "jackfruit's"]
query = """select category from "unique_shelf" where "Unique_Shelf_Names" in {}""" .format(list)
ERROR: column "chikoo's" doesn't exist
Infact chikoo's does exist
But due to double quotes its not fetching the value.
Firstly please don't use list as a variable name, list is a reserved keyword and you don't wanna overwrite it.
Secondly, using "" around tables and columns is bad practice, use ` instead.
Thirdly, when you format an array, it outputs as
select category from `unique_shelf`
where `Unique_Shelf_Names` in (['mango', 'apple', "chikoo's", 'banana', "jackfruit's"])
Which is not a valid SQL syntax.
You can join all values with a comma
>>>print("""select category from `unique_shelf` where `Unique_Shelf_Names` in {})""".format(','.join(l)))
select category from `unique_shelf`
where `Unique_Shelf_Names` in (mango,apple,chikoo's,banana,jackfruit's)
The issue here is that the values inside the in bracket are not quoted. We can do that by formatting them beforehand using double quotes(")
l = ['mango', 'apple', "chikoo's", 'banana', "jackfruit's"]
list_with_quotes = ['"{}"'.format(x) for x in l]
query = """
select category from `unique_shelf`
where `Unique_Shelf_Names` in ({})""" .format(','.join(list_with_quotes))
This will give you an output of
select category from `unique_shelf`
where `Unique_Shelf_Names` in ("mango","apple","chikoo's","banana","jackfruit's")

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data.
I was able to do SQL using Google BigQuery language to transform one part of the string another via the replace function for one particular string.
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more that I would need to use the replace function such that
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I am able to just use "case when statement" and then hardcode each one, but I'm wondering if there is a better/faster way to just use replace statement instead.
Thanks for your help.
The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.
In cases when you have quite a number of replacements - chaining of REPLACEs can become not practical and annoying manual work.
Below addresses this potential issue (assuming you maintain Lookup table with pairs: Word, Replacement)
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it using below example
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still issue if result of one replacement matches the input to a subsequent replacement
I believe there are multiple ways to tackle this problem, and it depends on the size of your dataset, practicality of simply making a guiding table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values on different columns, and join that table with your main table, and retrieve those values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.old_value = support_table.old_value
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use
STRING_AGG in a table with all the values you want to replace, or skip the STRING_AGG step and create said string by hand.
Both of the snippets below result in "item1|item2|item3". Choose which is faster for you to do.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnsWithItemsToReplace,"|") FROM values_to_replace
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
Hope it helps.

How to replace one or more consecutive symbols with one symbol in DB2

I am using DB2 LUW 9.5. In a field, I have a value like this one:
Test^test^^test^^^^test^^test^test
In a SELECT query, I would like to replace the duplicated ^ with only one ^. This would produce:
Test^test^test^test^test^test
The delimiter is known and static (can be hardcoded). Would you know a way to obtain the desired output using DB2 functions?
Thank you
You need one other character that can be used as delimiter, for example the pipe sign (|).
Let's say the table is defined as
create table myTable (
myColumn varchar(400)
);
Add a value for a test:
insert into myTable (myColumn) values
('Test^^^^^^^^test^^^^^^^test^^^^^^test^^^^^test^^^^test^^^test^^test^test');
Then do a smart replacement with use of the other delimiter
select replace(replace(replace(myColumn, '^^', '^|^'), '|^^', ''), '^|^', '^')
from myTable;
The result:
Test^test^test^test^test^test^test^test^test^test
Instead of using a one character delimiter you can use a string of which you are sure it will not occur in the values, for example 'xy'. The next query will give the same results:
select replace(replace(replace(myColumn, '^^', '^xy^'), 'xy^^', ''), '^xy^', '^')
from myTable;

Resources