Redshift - Unsupported PIVOT column type: text

I had a look at the topic "ERROR: Unsupported PIVOT column type: text", but unfortunately it didn't provide me with an answer.
I have a simple table that looks like the following:
user_id | type | reminder_type | sent_at
----------------------------------------------------
user_a | MID | REMINDER_1 | 2022-02-01 15:00:00
user_a | MID | REMINDER_2 | 2022-02-15 06:00:00
Then I try to perform this query:
SELECT
*
FROM table
PIVOT (
MIN(sent_at) FOR reminder_type IN('REMINDER_1', 'REMINDER_2')
)
In order to get the following result:
user_id | type | reminder_1 | reminder_2
----------------------------------------------------------
user_a | MID | 2022-02-01 15:00:00 | 2022-02-15 06:00:00
And it gives me the error mentioned in the title.
I can't wrap my head around it, and the AWS documentation doesn't provide any details about the error.

The column reminder_type was the result of a REGEXP_REPLACE, which produced type VARCHAR(101).
It worked as soon as I explicitly cast the column to VARCHAR:
REGEXP_REPLACE(remin_type, '<regex>', '') AS reminder_type doesn't work
REGEXP_REPLACE(remin_type, '<regex>', '')::VARCHAR AS reminder_type works perfectly
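Putting it together, a sketch of the full working query with the cast applied in the subquery that feeds the PIVOT (the table name, column names, and '<regex>' placeholder are taken from the question):
SELECT
*
FROM (
    SELECT
        user_id,
        type,
        REGEXP_REPLACE(remin_type, '<regex>', '')::VARCHAR AS reminder_type,
        sent_at
    FROM table
)
PIVOT (
    MIN(sent_at) FOR reminder_type IN('REMINDER_1', 'REMINDER_2')
)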

Related

partition by multiple columns in Spark SQL not working properly

I want to partition by three columns in my query:
user_id
cancelation month year
retention month year
I used row_number() and partition by as follows:
row_number() over (
    partition by user_id,
                 cast(date_format(cancelation_date, 'yyyyMM') as integer),
                 cast(date_format(retention_date, 'yyyyMM') as integer)
    order by cast(date_format(cancelation_date, 'yyyyMM') as integer) asc,
             cast(date_format(retention_date, 'yyyyMM') as integer) asc
) as row_count
example of the output I got:
| user_id |cancelation_date |cancelation_month_year|retention_date|retention_month_year|row_count|
| -------- | -------------- |----------------------|--------------|--------------------|---------|
| 566 | 28-5-2020 | 202005 | 20-7-2020 | 202007 |1 |
| 566 | 28-5-2020 | 202005 | 30-7-2020 | 202007 |2 |
example of the output I want to get:
| user_id | cancelation_date | cancelation_month_year | retention_date | retention_month_year | row_count |
| ------- | ---------------- | ---------------------- | -------------- | -------------------- | --------- |
| 566 | 28-5-2020 | 202005 | 20-7-2020 | 202007 | 1 |
| 566 | 28-5-2020 | 202005 | 30-7-2020 | 202007 | 1 |
Note that a user may have more than one cancelation month; for example, if he has canceled in August, I want row_count = 2 for all dates in August, and so on.
It's not obvious why partition by is partitioning by the retention date instead of by the retention month year.
I get the impression that row_number is not what you want; rather, you are interested in dense_rank, which would give you your expected output.
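As a rough sketch (assuming the rank should follow the cancelation month within each user, per the note above), swapping row_number for dense_rank would look something like:
dense_rank() over (
    partition by user_id
    order by cast(date_format(cancelation_date, 'yyyyMM') as integer)
) as row_count
Rows that tie on the ordering expression share the same rank, so both July retention rows get row_count = 1, while a second cancelation month (e.g. August) would get row_count = 2.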

Postgres query to find the max value of a column grouped by an ID, only if the date for that max value is the max(dt) for the table. Works in sqlite3

I have a query in sqlite3 that works, but I can't get the query to work on Postgres. The purpose of the query is to find whether a stock has hit a new high as of the max(Date) in the table, i.e. the last date for which stock prices were updated. So on a Monday it would look at Friday's data.
SQLite3 query that works:
SELECT * from (
select symbol, name, stock_id, min(close), dt
FROM day_stock_price join stock on stock.id = day_stock_price.stock_id
GROUP by stock_id
ORDER By symbol
) where dt = (select max(dt) from day_stock_price)
Table "public.day_stock_price"
Column | Type | Collation | Nullable | Default
-----------+-----------------------------+-----------+----------+---------
stock_id | integer | | not null |
dt | timestamp without time zone | | not null |
open | numeric | | not null |
high | numeric | | not null |
low | numeric | | not null |
close | numeric | | not null |
volume | numeric | | not null |
insqueeze | boolean | | |
At first I thought this was getting closer but it does NOT return the Maximum close price. I was just trying to get the max(close) for each stock_id.
select distinct on (stock_id) max(close), dt from day_stock_price Group by stock_id, dt;
In the result the first row returned was the max.
stock_id | max | dt
----------+-------+---------------------
2 | 10.51 | 2021-02-02 00:00:00
3 | 0.716 | 2020-05-05 00:00:00
but the second stock has a max of 14, which I found by selecting all closes for stock_id = 3 in DESC order.
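(For reference, the idiomatic DISTINCT ON pattern for "max close per stock" orders within each stock rather than grouping; a minimal sketch against the same table:)
select distinct on (stock_id) stock_id, close, dt
from day_stock_price
order by stock_id, close desc;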
That brings me to a very similar Postgres query that does run:
SELECT * from (
select symbol, name, stock_id, max(close), dt
FROM day_stock_price join stock on stock.id = day_stock_price.stock_id
GROUP by stock_id, symbol, name, dt
ORDER By symbol
) as X where dt = (select max(dt) from day_stock_price)
The answer is just totally wrong: it returns the close value for every stock at the max(dt). What I want is for a row to be returned only if its close is the max(close) for that stock AND its dt is the max(dt).
This seems far harder than it should be, so I think I'm just going about this in a very non-Postgres way. I'm new to postgres obviously. I appreciate the help.
Data in day_stock_price looks like this:
stockdb=# select * from day_stock_price limit 10;
stock_id | dt | open | high | low | close | volume | insqueeze
----------+---------------------+--------+---------+-------+--------+----------+-----------
4165 | 2021-02-03 00:00:00 | 2.02 | 2.31 | 1.87 | 1.965 | 21271087 | f
4165 | 2021-02-02 00:00:00 | 1.71 | 2.04 | 1.59 | 1.8911 | 15867773 | f
4165 | 2021-02-01 00:00:00 | 1.65 | 1.73 | 1.51 | 1.6999 | 8824377 | f
4165 | 2021-01-29 00:00:00 | 1.63 | 1.86 | 1.54 | 1.5899 | 9848362 | f
4165 | 2021-01-28 00:00:00 | 1.53 | 1.77 | 1.49 | 1.5701 | 8900787 | f
4666 | 2021-02-03 00:00:00 | 26.9 | 26.9738 | 26.9 | 26.932 | 15695 | f
4666 | 2021-02-02 00:00:00 | 26.885 | 26.895 | 26.88 | 26.895 | 1500 | f
4666 | 2021-02-01 00:00:00 | 26.875 | 26.885 | 26.86 | 26.885 | 1850 | f
4666 | 2021-01-29 00:00:00 | 26.87 | 26.9 | 26.8 | 26.83 | 120001 | f
4666 | 2021-01-28 00:00:00 | 26.86 | 26.87 | 26.86 | 26.87 | 831 | f
(10 rows)
In sqlite3 the response was something like this (I can't re-run it):
Symbol | Name  | Stock_id | Max(Close) | Date
-------+-------+----------+------------+------------
AAPL   | Apple | 5        | 103.89     | 2020-05-05
INTC   | Intel | 9        | 56.89      | 2020-05-05
The date would ALWAYS be the same because it's the max date in the DB.
So it's finding the stocks that hit a new high on the last trading day.
This should return all day_stock_price rows which reached a new high on the most recent day. Note that this will also return rows that exactly match a prior day's close; if you want to avoid that, add a filter on dt < (select max(dt) from day_stock_price) in the per_stock_maxes CTE, and change the join condition to (t1.stock_id=t2.stock_id and t1.close>t2.max_close).
with today_closes as (
select * from day_stock_price where dt = (select max(dt) from day_stock_price)
),
per_stock_maxes as (
select stock_id, max(close) max_close from day_stock_price group by stock_id
)
select t1.* from today_closes t1 join per_stock_maxes t2
on (t1.stock_id=t2.stock_id and t1.close=t2.max_close)
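Applying that modification, the strict new-high variant would look like this:
with today_closes as (
select * from day_stock_price where dt = (select max(dt) from day_stock_price)
),
per_stock_maxes as (
select stock_id, max(close) max_close from day_stock_price
where dt < (select max(dt) from day_stock_price)
group by stock_id
)
select t1.* from today_closes t1 join per_stock_maxes t2
on (t1.stock_id=t2.stock_id and t1.close>t2.max_close)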

Find and remove matching column values in pyspark

I have a pyspark dataframe where occasionally a column has a wrong value that matches another column. It would look something like this:
| Date | Latitude |
| ---------- | ---------- |
| 2017-01-01 | 43.4553 |
| 2017-01-02 | 42.9399 |
| 2017-01-03 | 43.0091 |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows that are like this. I thought about using .isin() but I can't seem to get it to work. If I try
df['Date'].isin(['Latitude'])
I get:
Column<(Date IN (Latitude))>
Any suggestions?
If you're more comfortable with SQL syntax, here is an alternative way using a pyspark-sql condition inside the filter():
df = df.filter("Date NOT IN (Latitude)")
Or equivalently using pyspark.sql.DataFrame.where():
df = df.where("Date NOT IN (Latitude)")
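The same condition can also be run as plain Spark SQL against a temporary view (the view name readings is just for illustration, and this assumes df.createOrReplaceTempView("readings") has been called first):
-- keep only the rows where the Date value differs from the Latitude value
SELECT *
FROM readings
WHERE Date NOT IN (Latitude)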

How to return datetime in Excel when SQL table has only time?

I am currently querying an Informix SQL table with a datetime column that only holds a time, such as "07:30:44". I need to be able to pull this data and have send_time show what the actual time was, and not "0:00:00".
Example:
Informix Table Data
--------------------------
| send_date | 02/09/2016 | --datetime field
--------------------------
| send_time | 07:30:44 | --datetime field
--------------------------
When I query through Excel via Microsoft Query editor, I can see the correct value, "07:30:44" in the preview, however what is returned in my sheet is "0:00:00". Normally I would change the formatting on the cell, however the literal value for the entire column is "12:00:00 AM".
Excel pulls/displays
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 12:00:00 AM| --displayed is "0:00:00"
--------------------------
I have cast the field as char to see if it would return correctly, and it does! However you can't use parameters/build quick reports with this method.
Desired Excel output
--------------------------
| send_date | 02/09/2016 |
--------------------------
| send_time | 07:30:44 |
--------------------------
Is there another way to resolve this? I'd love to give the users an easy way to build queries without having to cast anything.
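For reference, a minimal sketch of the cast workaround mentioned above (the table name message_log is a placeholder, and this assumes send_time is stored as DATETIME HOUR TO SECOND):
SELECT
    send_date,
    CAST(send_time AS CHAR(8)) AS send_time_text  -- returns the text '07:30:44' instead of a datetime
FROM message_log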

SSIS transformation: 46 columns into 4

I have a business intelligence SSIS project and I'm preparing my fact table. The problem is that my source is an Excel file with 46 columns while my fact table needs only 4, but I still need the information from all 46 columns.
I will try to simplify with a little example:
Source:
code_agency | date | OC_realisation | CV_realisation | NTC_realisation
700 | 1/1/2014 | 4 | 6 | 3
200 | 1/1/2014 | 5 | 1 | 0
Destination
code_agency | date | Code_realisation | value
700 | 1/1/2014 | OC | 4
700 | 1/1/2014 | CC | 6
700 | 1/1/2014 | NTC | 3
200 | 1/1/2014 | OC | 5
200 | 1/1/2014 | CC | 1
200 | 1/1/2014 | NTC | 0
This was an example with only 3 realisation columns, but I really have 46 in my Excel source.
Does anyone know how to achieve the desired output? Please help and thanks.
Option 1: Use the Unpivot transformation.
Unpivot Transformation Example
Option 2: Handle it in T-SQL (this assumes the Excel data has first been loaded into a staging table; dbo.AgencyRealisation below is a placeholder name):
SELECT Code_Agency, date, 'OC' AS Code_realisation, OC_realisation AS value
FROM dbo.AgencyRealisation
UNION ALL
SELECT Code_Agency, date, 'CC' AS Code_realisation, CC_realisation AS value
FROM dbo.AgencyRealisation
UNION ALL
SELECT Code_Agency, date, 'NTC' AS Code_realisation, NTC_realisation AS value
FROM dbo.AgencyRealisation
-- ... repeat the SELECT ... UNION ALL pattern for each of the remaining realisation columns
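With 46 realisation columns that UNION ALL gets long. A more compact sketch using T-SQL UNPIVOT over the same assumed staging table (only the three example columns are listed; the rest would be added to the IN list, and note that UNPIVOT requires all listed columns to share one data type):
SELECT
    code_agency,
    date,
    REPLACE(Code_realisation, '_realisation', '') AS Code_realisation,  -- 'OC_realisation' -> 'OC', etc.
    value
FROM dbo.AgencyRealisation
UNPIVOT (
    value FOR Code_realisation IN (OC_realisation, CV_realisation, NTC_realisation)
) AS u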
