Calculating %diff between 2 values in Presto

I have a field of type double called values.
I want to calculate the percentage difference between the last value and the one before it, for example:
value
10
2
4
2
The output should be: -50%
How can I do this in Presto?

If you have a field for ordering (otherwise the result is not guaranteed), you can use the lag window function:
-- sample data
WITH dataset ("values", date) AS (
    VALUES (10, now()),
           (4, now() + interval '1' hour),
           (2, now() + interval '2' hour)
)
-- query
SELECT ("values" - l) * 100.0 / l AS value
FROM (
    SELECT "values",
           lag("values") OVER (ORDER BY date) AS l,
           date
    FROM dataset
)
ORDER BY date DESC
LIMIT 1
Output:
value
-50.0
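Outside Presto, the same last-vs-previous calculation can be sanity-checked with a small Python sketch (assuming the values arrive already ordered by the ordering field):

```python
# Percentage difference between the last value and the one before it.
# Assumes `values` is already sorted by the ordering field (e.g. date).
def pct_diff_last(values):
    if len(values) < 2:
        raise ValueError("need at least two values")
    prev, last = values[-2], values[-1]
    return (last - prev) * 100.0 / prev

print(pct_diff_last([10, 2, 4, 2]))  # -50.0, matching the query output
```

This mirrors what the lag query does: `l` is `values[-2]` and the projected row is `values[-1]`.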

How can we filter rows based on timestamp column?

I have a cassandra column which is of type date and has values in timestamp format like below. How can we filter rows based on this column which have date greater than today's date?
Example:
Type: date
Timestamp: 2021-06-29 11:53:52 +00:00
TTL: null
Value: 2021-03-16T00:00:00.000+0000
I was able to filter rows using columnname <= '2021-09-25', which gives ten rows, some of them having dates on Sep 23 and 24. When I filter using columnname < '2021-09-24', I get an error like the one below:
An error occurred on line 1 (use Ctrl-L to toggle line numbers):
Cassandra failure during read query at consistency ONE (1 responses were required but only 0 replica responded, 1 failed)
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT) so you need to be precise when you're working with timestamps.
Depending on where you're running the query, the filter could be translated in the local timezone. Let me illustrate with this example table:
CREATE TABLE community.tstamptbl (
id int,
tstamp timestamp,
PRIMARY KEY (id, tstamp)
)
These 2 statements may appear similar but translate to 2 different entries:
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09');
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09 +0000');
The first statement creates an entry with a timestamp in my local timezone (Melbourne, Australia) while the second statement creates an entry with a timestamp in UTC (+0000):
cqlsh:community> SELECT * FROM tstamptbl WHERE id = 5;
id | tstamp
----+---------------------------------
5 | 2021-08-08 14:00:00.000000+0000
5 | 2021-08-09 00:00:00.000000+0000
Similarly, you need to be precise when reading the data. You need to specify the timezone to remove ambiguity. Here are some examples:
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-09 +0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:00+0000';
SELECT * FROM tstamptbl WHERE id = 1 AND tstamp < '2021-08-10 12:34:56+0000';
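For illustration only, the same timezone effect can be reproduced with a short Python sketch; the fixed +10:00 offset below is a stand-in for a local timezone such as Melbourne's:

```python
from datetime import datetime, timezone, timedelta

# CQL timestamps are stored as milliseconds since the Unix epoch (UTC),
# so the timezone attached to the literal changes the stored value.
local = timezone(timedelta(hours=10))  # stand-in for e.g. Australia/Melbourne

as_local = datetime(2021, 8, 9, tzinfo=local)        # like '2021-08-09' parsed locally
as_utc = datetime(2021, 8, 9, tzinfo=timezone.utc)   # like '2021-08-09 +0000'

print(int(as_local.timestamp() * 1000))   # epoch milliseconds actually stored
print(int(as_utc.timestamp() * 1000))     # 10 hours later than the local one
print(as_local.astimezone(timezone.utc))  # 2021-08-08 14:00:00+00:00
```

The last line reproduces the `2021-08-08 14:00:00.000000+0000` row from the cqlsh output above: the same wall-clock date maps to two different stored epoch values.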
In the second part of your question, the error isn't directly related to your filter. The problem is that the replica(s) failed to respond for whatever reason (e.g. unresponsive/overloaded, down, etc). You need to investigate that issue separately. Cheers!

Excel PowerPivot Count new and distinct items in a period not counted before

Assuming I have the following data table:
Period | ID
-----------
P1 | 1
P2 | 1
P1 | 2
P2 | 3
P1 | 2
I am interested in the number of unique IDs per period, but only if the ID has not already been counted in a previous period, ordered alphabetically. IDs in the source can already occur multiple times within a period and shall count as 1 per period (distinct count).
Also the data source is not pre-ordered by period and I have no influence on the sort order.
So the result I would like to get in a Pivot is like:
Period | Number of Unique IDs not already counted
-------------------------------------------------
P1 | 2 # Because there are unique IDs 1 and 2 in the period
P2 | 1 # Only counting ID 3, because ID 1 has already been counted in period 1
Please help me with the DAX measure I can use in the Pivot.
This is a measure written in DAX. It should work in a pivot table with Period selected on the rows.
DistinctID =
VAR PeriodsPerId =
    SELECTCOLUMNS (
        ALL ( T[ID] ),
        "ID", T[ID],
        "Period", CALCULATE ( MIN ( T[Period] ), ALLEXCEPT ( T, T[ID] ) )
    )
RETURN
    COUNTROWS ( FILTER ( PeriodsPerId, [Period] IN VALUES ( T[Period] ) ) )
It works by first preparing a table variable containing the minimum period per ID, and then filtering this table for the periods in the current selection.
Of course, if the Period is selected through a dimension, substitute the dimension in the last VALUES.
Here's one way, which requires repositioning your columns as well as adding a new column. This assumes you don't have duplicate ID/Period combos; you didn't list any duplicates in your sample, so I'm making this assumption.
In my data, I have ID as column A and Period as column B.
Order your data by Period, ascending. Then in column C, you can use this formula to determine if that ID has been used before.
Cell C2 formula: =IF(VLOOKUP(A2,A:B,2,FALSE) = B2,1,0)
Copy it down and then create your pivot table, summing column C.
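The logic both answers implement, attributing each ID only to its earliest period, can be sketched in Python (the sample rows mirror the question's data):

```python
# Count, per period, the distinct IDs whose first appearance is in that period.
# Rows may be unordered and may contain duplicates, as in the question.
from collections import Counter

rows = [("P1", 1), ("P2", 1), ("P1", 2), ("P2", 3), ("P1", 2)]

# Earliest period per ID (periods compare alphabetically: "P1" < "P2").
first_period = {}
for period, id_ in rows:
    if id_ not in first_period or period < first_period[id_]:
        first_period[id_] = period

counts = Counter(first_period.values())
print(dict(counts))  # {'P1': 2, 'P2': 1}
```

This is the same idea as the DAX `MIN ( T[Period] )` per ID and the VLOOKUP-on-sorted-data trick: both resolve each ID to its first period before counting.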

Presto / AWS Athena query, historicized table (last value in aggregation)

I've got a table split into a static part and a history part. I have to create a query which groups by a series of dimensions, including year and month, and does some aggregations. One of the values that I need to project is the value of the last tuple of the history table matching the given year/month pair.
The history table has validity_date_start and validity_date_end columns, and the latter is NULL if the row is up-to-date.
This is the query I've done so far (using temporary tables for ease of reproduction):
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
ELSE CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
ELSE CASE WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.
Can you please help me?
Thank you very much!
I've eventually found out what I was doing wrong here.
In the documentation it is written:
The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions
This led me to think that if I already had a GROUP BY clause, PARTITION BY was useless. It is not: in general, if you want to get the datum for the given group, you have to specify it in the PARTITION BY clause too (or rather, the dimensions that you're projecting in the SELECT part).
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(PARTITION BY (time.year, time.month, t1.name) ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN time.date >= timestamp '2020-07-01 00:00:00' AND t1.id = 1 THEN 27
ELSE CASE WHEN time.date >= timestamp '2020-05-01 00:00:00' AND t1.id = 2 THEN 42
ELSE CASE WHEN time.date >= timestamp '2020-03-01 00:00:00' AND t1.id = 1 THEN 1 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-03-01 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
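To see why the PARTITION BY matters, here is a rough Python sketch of what FIRST_VALUE(...) OVER (PARTITION BY ... ORDER BY start DESC) computes; the keys and dates are made-up stand-ins, not the query's actual data:

```python
# Mimic FIRST_VALUE(value) OVER (PARTITION BY key ORDER BY start DESC):
# within each partition, pick the value of the row with the latest start.
from collections import defaultdict

rows = [
    # (partition key: (year, month, name), validity_date_start, value1)
    (("2020", "07", "Hal"), "2020-01-03", 1),
    (("2020", "07", "Hal"), "2020-07-04", 27),
    (("2020", "07", "John"), "2020-05-29", 42),
]

parts = defaultdict(list)
for key, start, value in rows:
    parts[key].append((start, value))

first_value = {
    key: max(vals)[1]  # latest start wins, i.e. ORDER BY start DESC, row 1
    for key, vals in parts.items()
}
print(first_value)  # {('2020', '07', 'Hal'): 27, ('2020', '07', 'John'): 42}
```

Without the partition key, every row would share one window over the whole result set, which is why the original query returned the wrong (empty) value per group.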

Getting weekly and daily averages of timestamp data

I currently have data on a Spark data frame that is formatted as such:
Timestamp Number
......... ......
M-D-Y 3
M-D-Y 4900
The timestamp data is in no way uniform or consistent (i.e., I could have one value that is present on March 1, 2015, and the next value in the table be for the date September 1, 2015 ... also, I could have multiple entries per date).
So I wanted to do two things
Calculate the number of entries per week. So I would essentially want a new table that represented the number of rows in which the timestamp column was in the week that the row corresponded to. If there are multiple years present, I would ideally want to average the values per each year to get a single value.
Average the number column for each week. So for every week of the year, I would have a value that represents the average of the number column (0 if there is no entry within that week).
Parsing date is relatively easy using built-in functions by combining unix_timestamp and simple type casting:
sqlContext.sql(
"SELECT CAST(UNIX_TIMESTAMP('March 1, 2015', 'MMM d, yyyy') AS TIMESTAMP)"
).show(false)
// +---------------------+
// |_c0 |
// +---------------------+
// |2015-03-01 00:00:00.0|
// +---------------------+
With the DataFrame DSL, the equivalent code would be something like this:
import org.apache.spark.sql.functions.unix_timestamp
unix_timestamp($"date", "MMM d, yyyy").cast("timestamp")
To fill missing entries you can use different tricks. The simplest approach is to use the same parsing logic as above. First let's create a few helpers:
def leap(year: Int) = {
  ((year % 4 == 0) && (year % 100 != 0)) || (year % 400 == 0)
}
def weeksForYear(year: Int) = (1 to 52).map(w => s"$year $w")
def daysForYear(year: Int) = (1 to { if (leap(year)) 366 else 365 }).map(
  d => s"$year $d"
)
and example reference data (here for weeks but you can do the same thing for days):
import org.apache.spark.sql.functions.{year, weekofyear, count, avg}
val exprs = Seq(year($"date").alias("year"), weekofyear($"date").alias("week"))
val weeks2015 = Seq(2015)
.flatMap(weeksForYear _)
.map(Tuple1.apply)
.toDF("date")
.withColumn("date", unix_timestamp($"date", "yyyy w").cast("timestamp"))
.select(exprs: _*)
Finally you can transform the original data:
val df = Seq(
("March 1, 2015", 3), ("September 1, 2015", 4900)).toDF("Timestamp", "Number")
val dfParsed = df
.withColumn("date", unix_timestamp($"timestamp", "MMM d, yyyy").cast("timestamp"))
.select(exprs :+ $"Number": _*)
merge and aggregate:
weeks2015.join(dfParsed, Seq("year", "week"), "left")
.groupBy($"year", $"week")
.agg(count($"Number"), avg($"Number"))
.na.fill(0)
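Independent of Spark, the overall shape of the computation can be sketched in plain Python: build a full (year, week) spine, aggregate the parsed rows onto it, then fill missing weeks with 0. The names below are illustrative only:

```python
# Build a full (year, week) spine, aggregate observed rows onto it,
# and fill weeks with no data with count 0 / average 0.
from datetime import datetime

rows = [("March 1, 2015", 3), ("September 1, 2015", 4900)]

# Parse each timestamp and key the row by its ISO (year, week).
parsed = []
for ts, n in rows:
    d = datetime.strptime(ts, "%B %d, %Y")
    iso = d.isocalendar()
    parsed.append(((iso[0], iso[1]), n))

spine = [(2015, w) for w in range(1, 53)]  # reference weeks, like weeks2015

agg = {}
for key, n in parsed:
    cnt, total = agg.get(key, (0, 0))
    agg[key] = (cnt + 1, total + n)

# Left join onto the spine: (count, average), with 0 where nothing matched.
result = {}
for key in spine:
    cnt, total = agg.get(key, (0, 0))
    result[key] = (cnt, total / cnt if cnt else 0)

print(result[(2015, 9)])   # week containing March 1, 2015
```

This is the same left join / groupBy / na.fill pipeline as above, just spelled out with dictionaries.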

1 = 1 returns False in T-SQL - Why?

Please look at the snippet below
DECLARE @p__linq__0 datetime
SET @p__linq__0 = '2012-02-01 00:00:00'
SELECT (STR(CAST(DATEPART(day, @p__linq__0) AS float)))
SELECT
InvoicingActivityStartDay,
(STR(CAST(DATEPART(day, @p__linq__0) AS float))),
CASE WHEN STR(CAST(DATEPART(day, @p__linq__0) AS float)) = InvoicingActivityStartDay THEN 'EQUAL' ELSE 'NOT EQUAL' END
FROM INVOICEMETADATA
This was the rough SQL Translation of a Linq-to-Entities query I had in my application. The two possible values for InvoicingActivityStartDay are 1 and 20.
This snippet results in rows like this:
InvoicingActivityStartDay Column1 Column2
1 1 NOT EQUAL
20 1 NOT EQUAL
I understand why it returns NOT EQUAL for the second row; but why does it return NOT EQUAL for the first row where 1 = 1?
Is InvoicingActivityStartDay a string? SELECT STR(CAST(DATEPART(day, getdate()) AS float)) returns a string, and STR right-justifies its result in a field of 10 characters by default, so the value comes back with leading spaces. Are you expecting an integer comparison?
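To make the padding concrete: STR defaults to a length of 10 and right-justifies, so a rough Python analogue of the comparison looks like this (stored_value stands in for a varchar InvoicingActivityStartDay):

```python
# T-SQL STR(expr) defaults to a length of 10 and right-justifies,
# so STR(CAST(1 AS float)) yields '         1' (nine leading spaces).
day_as_str = "{:>10}".format(1)   # mimic STR's default padding
stored_value = "1"                # stand-in for a varchar column value

print(repr(day_as_str))                    # '         1'
print(day_as_str == stored_value)          # False: leading spaces differ
print(day_as_str.strip() == stored_value)  # True once trimmed
```

The same applies in SQL Server: trailing spaces are ignored in string comparisons, but leading spaces are not, so '         1' <> '1'.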
