Presto combining two columns and output as one

I'm trying to combine two columns into one in Presto.
This is part of a larger query, and the output has to be formatted in a certain way:
SELECT 'Display' AS channel,
DBM.dated,
DBM.market,
DBM.impressions,
DBM.clicks,
sum(DBM.amount_spent_EUR)+sum(DBm.platform_fee) as DBM.amount_spent_EUR
FROM
(
SELECT
DATE_FORMAT(DATE_PARSE(date,'%Y/%m/%d'),'%Y-%m-%d') AS dated,
trim(SPLIT_PART(insertion_order,'|',3)) AS market,
sum(cast(impressions as double)) as impressions,
sum(cast(clicks as double)) as clicks,
sum(CAST(media_cost_advertiser_currency AS DOUBLE)*1.15) AS amount_spent_EUR,
sum(CAST(media_fee_1_adv_currency AS DOUBLE)*1.15) as platform_fee
FROM ralph_lauren_google_sheet_dbm_data_2
WHERE dated <= {{days_ago 1}}
GROUP BY 1,2
)DBM
The error is as follows:
Query 20190814_125505_19433_rcrut failed: line 1:144: extraneous input
'.' expecting {, ',', 'EXCEPT', 'FROM', 'GROUP', 'HAVING',
'INTERSECT', 'LIMIT', 'ORDER', 'UNION', 'WHERE'}
The error is caused by the alias DBM.amount_spent_EUR, but the column has to come out exactly like this.
How can I get around this?

You can use double quotes in such cases.
as "DBM.amount_spent_EUR"
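For instance, a runnable toy version of the outer query (the inner SELECT is just a stand-in for the question's subquery; the point is the quoted alias, since a dot is not valid in an unquoted Presto identifier):
SELECT 'Display' AS channel,
       sum(DBM.amount_spent_EUR) + sum(DBM.platform_fee) AS "DBM.amount_spent_EUR"
FROM (
    SELECT 1.0 AS amount_spent_EUR, 0.1 AS platform_fee  -- stand-in for the real subquery
) DBM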

Related

Synapse Spark SQL Delta Merge Mismatched Input Error

I am trying to update the historical table, but am getting a merge error. When I run this cell:
%%sql
select * from main
UNION
select * from historical
where Summary_Employee_ID=25148
I get a two-row table that looks like:
EmployeeID  Name
25148       Wendy Clampett
25148       Wendy Monkey
I'm trying to update the Name... using the following merge command
%%sql
MERGE INTO main m
using historical h
on m.Employee_ID=h.Employee_ID
WHEN MATCHED THEN
UPDATE SET
m.Employee_ID=h.Employee_ID,
m.Name=h.Name
WHEN NOT MATCHED THEN
INSERT(Employee,Name)
VALUES(h.Employee,h.Name)
Here's my error:
Error:
mismatched input 'MERGE' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)
Synapse doesn't support the SQL MERGE statement the way Databricks does. However, you can use the Python solution. Note that historical was really my updates...
So for the above, I used:
import delta

# Load the target Delta table; "path" is the path to the main table
main = delta.DeltaTable.forPath(spark, "path")

(main
    .alias("main")
    .merge(historical.alias("historical"),
           "main.Employee_ID = historical.Employee_ID")
    .whenMatchedUpdate(set={"Name": "historical.Name"})
    .whenNotMatchedInsert(values={
        "Employee_ID": "historical.Employee_ID",
        "Name": "historical.Name"})
    .execute()
)
Your goal is to upsert the target table historical, but as per your query the target table is set to main instead of historical, so the UPDATE and INSERT clauses point the wrong way.
Try the following:
%%sql
MERGE INTO historical target
using main source
on source.Employee_ID=target.Employee_ID
WHEN MATCHED THEN
UPDATE SET
target.Name=source.Name
WHEN NOT MATCHED THEN
INSERT (Employee_ID, Name)
VALUES (source.Employee_ID, source.Name)
MERGE is supported in Spark 3.0, which is currently in preview, so this might be worth a try. I did see the same error on the Spark 3.0 pool, but it's quite misleading: it actually means that you're trying to merge on duplicate data, or that you're offering duplicate data to the original set. I've validated this by querying the delta lake and the raw file for duplicates with the serverless SQL pool and PolyBase.
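A quick sanity check for that case (a sketch, assuming the main and historical tables above) is to look for keys that appear more than once on either side before merging:
%%sql
-- Duplicate Employee_ID values in the source make the MERGE ambiguous
SELECT Employee_ID, COUNT(*) AS cnt
FROM historical
GROUP BY Employee_ID
HAVING COUNT(*) > 1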

Multiple string substitutions in PostgreSQL

I have a column with abbreviations separated by spaces, like this:
'BG MSG'
Also, there's another table with substitutions
target | replacement
-------+---------------
'BG'   | 'Brick Galvan'
'MSG'  | 'Mosaic Galvan'
The goal is to apply all the substitutions to the abbreviations to obtain something like
'Brick Galvan Mosaic Galvan' from 'BG MSG'
I know I could do
replace( replace('BG MSG', 'BG', 'Brick Galvan'), 'MSG', 'Mosaic Galvan')
But imagine there are hundreds of substitutions, and they can change from one day to the next. The resulting query will be hideous to maintain.
I mean, I could write a code generator that creates the query with all the nested replaces, but I'm looking for something more elegant and Postgres-native.
I've found solutions like this one:
How to replace multiple special characters in Postgres 9.5, but they seem to work only for single characters.
Let's say your tables look like this:
create table my_table(id serial primary key, abbrevs text);
insert into my_table (abbrevs) values
('BG MSG');
create table substitutions(target text, replacement text);
insert into substitutions values
('BG', 'Brick Galvan'),
('MSG', 'Mosaic Galvan');
You can get each abbreviation in its own row:
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
id | abbrev
----+--------
1 | BG
1 | MSG
(2 rows)
and use them to join the substitution table and get full names:
select id, string_agg(replacement, ' ') as full_names
from (
select id, unnest(string_to_array(abbrevs, ' ')) as abbrev
from my_table
) t
join substitutions on abbrev = target
group by id
id | full_names
----+----------------------------
1 | Brick Galvan Mosaic Galvan
(1 row)
Db<>fiddle.
The nested REPLACE approach would work, but it is quite ugly, right?
SELECT REPLACE(REPLACE(REPLACE(REPLACE(…
Even after careful formatting to make it look readable, the best you can get is:
SELECT
REPLACE(
REPLACE(
REPLACE(
REPLACE(...
On the other hand, you might just use the LATERAL JOIN solution, which uses more characters but is definitely more readable.
-- Input: BG, MSG
-- Output: Brick Galvan, Mosaic Galvan
SELECT msg.Materials
FROM (SELECT 'BG, MSG' AS Materials) mt
INNER JOIN LATERAL (SELECT REPLACE(mt.Materials::text, 'BG', 'Brick Galvan') AS Materials) bg ON true
INNER JOIN LATERAL (SELECT REPLACE(bg.Materials::text, 'MSG', 'Mosaic Galvan') AS Materials) msg ON true;

How to use groupby with array elements in Pyspark?

I'm running a groupBy operation on a dataframe in PySpark, and I need to group by a list that may contain one or two features. How can I do this?
record_fields = [['record_edu_desc'], ['record_construction_desc'], ['record_cost_grp'],
                 ['record_bsmnt_typ_grp_desc'], ['record_shape_desc'],
                 ['record_sqft_dec_grp', 'record_renter_grp_c_flag'], ['record_home_age'],
                 ['record_home_age_grp', 'record_home_age_missing']]

for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    ### df write to csv operation
My first thought was to create a list of lists and pass it to the groupBy operation, but I get the following error:
TypeError: Invalid argument, not a string or column:
['record_edu_desc'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How do I make this work? I'm open to other ways I could do this.
Try this (note the * [asterisk] before field):
for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', *field).sum('net_contributions')
The asterisk unpacks the list, so groupBy receives each column name as a separate argument. Also take a look at this question to learn more about the asterisk in Python.

How to Left Join in Presto SQL?

Can't for the life of me figure out a simple left join in Presto, even after reading the documentation. I'm very familiar with Postgres and tested my query there to make sure there wasn't a glaring error on my part. Please reference code below:
select * from
(select cast(order_date as date),
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1)
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date
Running that gives the following error:
SQL Error: Failed to run query
Failed to run query
line 1:1: mismatched input 'on' expecting {'(', 'SELECT', 'DESC', 'WITH',
'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'GRANT',
'REVOKE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'CALL', 'PREPARE', 'DEALLOCATE', 'EXECUTE'} (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: a33a6671-07a2-4d7b-bb75-f70f7b82409e)
The first problem I notice is that your join clause assumes the first sub-query is aliased as a, but it is not aliased at all. I recommend aliasing that table to see if that fixes it. I also recommend aliasing the order_date column explicitly outside of the cast() statement, since you are joining on that column.
Try this:
select * from
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1) as a
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date
One option is to declare your subqueries using WITH:
with a as
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1),
b as
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1)
select * from a
left join b
on a.order_date = b.summary_date;

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data.
I was able to write SQL in Google BigQuery to transform one part of the string into another via the REPLACE function for one particular string.
Replace(CustomerName, 'ABC', 'XYZ')
However, I have a couple more for which I would need to use the replace function, such as
Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')
and so on.
I've tried doing
Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')
but that got me an error message.
I've also tried
Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')
but that also got me an error message.
I am able to just use a CASE WHEN statement and hardcode each one, but I'm wondering if there is a better/faster way using the REPLACE statement instead.
Thanks for your help.
The CASE WHEN option is pretty reasonable. Another option is to chain them together:
REPLACE(
REPLACE(
REPLACE(
CustomerName,
'ABC',
'XYZ'),
'PLO',
'Rustic'),
'Kix',
'BowWow')
Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.
Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.
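For reference, a CASE WHEN version of the same mapping might look roughly like this (a sketch; YourTable is a placeholder, and it assumes each name contains at most one of the target substrings):
SELECT
  CASE
    WHEN CustomerName LIKE '%ABC%' THEN REPLACE(CustomerName, 'ABC', 'XYZ')
    WHEN CustomerName LIKE '%PLO%' THEN REPLACE(CustomerName, 'PLO', 'Rustic')
    WHEN CustomerName LIKE '%Kix%' THEN REPLACE(CustomerName, 'Kix', 'BowWow')
    ELSE CustomerName
  END AS CustomerName
FROM YourTable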
When you have quite a number of replacements, chaining REPLACEs becomes impractical and turns into annoying manual work.
The approach below addresses this potential issue (assuming you maintain a lookup table with pairs: Word, Replacement):
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM ReplacementLookup
  )
),
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
You can test it using the example below:
SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
SELECT
CustomerName, Replacements
FROM (
SELECT CustomerName FROM
(SELECT '1234ABC567' AS CustomerName),
(SELECT '12 34 PLO 56' AS CustomerName),
(SELECT 'Kix' AS CustomerName),
(SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
(SELECT 'ABCQweKIX' AS CustomerName)
) YourTable
CROSS JOIN (
SELECT
GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
FROM (
SELECT Word, Replacement FROM
(SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
(SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
(SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
(SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
)
) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
var Replacements = r.Replacements.split(';');
var fixedCustomerName = r.CustomerName;
for (var i = 0; i < Replacements.length; i++) {
var pat = new RegExp(Replacements[i].split(',')[0],'gi')
fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
}
emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
}"
)
Please note: there is still an issue if the result of one replacement matches the input of a subsequent replacement.
I believe there are multiple ways to tackle this problem; the right one depends on the size of your dataset, the practicality of simply building a mapping table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.
If your values are very granular, you can create a table with "from" and "to" values in different columns, join that table with your main table, and retrieve the new values very cleanly.
# Replace the support_table table with your actual table
WITH support_table AS (
  SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue
FROM main_table
JOIN support_table ON main_table.OldValue = support_table.OldValue
Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.
If you have a very big list of items, you can use STRING_AGG on a table with all the values you want to replace, or skip the STRING_AGG step and create that string by hand.
Both of the snippets below result in "item1|item2|item3". Choose whichever is faster for you.
# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
SELECT "item1" AS ColumnWithItemsToReplace
UNION ALL
SELECT "item2"
UNION ALL
SELECT "item3"
)
SELECT STRING_AGG(ColumnWithItemsToReplace, "|") FROM values_to_replace
SELECT r"item1|item2|item3"
STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."
For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.
Example code below:
REGEXP_REPLACE(
column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)
Hope it helps.
