LEAD function with date scenario - apache-spark

I have multiple files, but let's consider two files, each with a file name column and a start date column.
Start_Date   FileName
2022-01-01   product 1
2022-02-02   product 2
Please treat each row as the data of a separate file.
Now I want to generate a dim table that meets the following requirement.
The first time, when I read file 1, I am looking for a dim table like this:
Start_date   End_date   file_name
2022-01-01   null       product 1
The second time, when I read the second file, I am looking for a dim table like this:
Start_date   End_date     file_name
2022-01-01   2022-02-01   product 1
2022-02-02   null         product 2
Basically, I want to change the null in the first row to the second file's start_date minus 1 day.
Please help.
What I am planning: the first time, I insert the data using the query below. For the second load, I use the same query to insert the data first, and then I plan to update the table using the LEAD function.
select first_value(Start_date) as EFFECTIVE_START_DATE,
       null as End_date,
       first_value(Source) as SRCE_FILE_NAME
from pqdf_view
I was able to write a LEAD query that works, but how can I update the already inserted column using LEAD? With the query below I can create a new column, but I want to populate the existing end date column that I inserted with the query above.
select *, LEAD(date_sub(EFFECTIVE_START_DATE,1)) OVER(ORDER BY PRODUCT_QUALITY_SK ASC) as EFFECTIVE_END_DATE from edp_silver.dim_product_quality
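The target result (each row's end date is the next row's start date minus one day, and the latest row keeps a null end date) can be sketched outside Spark in plain Python. This is only an illustration of the logic under the assumption that rows are ordered by start date; the function name `with_end_dates` is hypothetical and not part of any Spark API.

```python
from datetime import date, timedelta


def with_end_dates(rows):
    """Take (start_date, file_name) pairs and return (start, end, file_name).

    The end date of each row is the next row's start date minus one day;
    the row with the latest start date keeps None (null).
    """
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    for i, (start, name) in enumerate(ordered):
        if i + 1 < len(ordered):
            # Same idea as LEAD(start_date) - 1 day over an ordered window
            end = ordered[i + 1][0] - timedelta(days=1)
        else:
            end = None
        result.append((start, end, name))
    return result


# The two rows from the question, one per file
dim = with_end_dates([
    (date(2022, 1, 1), "product 1"),
    (date(2022, 2, 2), "product 2"),
])
for row in dim:
    print(row)
```

Recomputing the end date for all rows on every load, rather than patching a single row, keeps the logic the same for the first and every later file.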

Related

How can I split weekly data to monthly using Excel/ Power Pivot?

My data is in weekly buckets. I want to split each number into monthly numbers, but since a week's days can fall in two different months, I want to weight the data by the number of days that fall in each month. For example:
Now, in the above picture, I want to split that 200 (5/7 * 200 in January and 2/7 * 200 in February). How can I do that using Excel, Power Pivot, or DAX functions? Any help is much appreciated.
Thank you!
Assume your fact table looks something like the one below. Each value is associated with the starting date of the week in which it occurred.
Your actual data may be more granular, with multiple rows per week and additional attributes (such as identifiers of a person or a store, depending on the business), but the approach shown below works the same way.
First, we need to create a date table. We can do that in the "Design" tab by clicking "Date Table", then "New".
In this date table, we need to add a column holding the start date of the week that each row's date falls in. Place the cursor in the "Add Column" area and enter the following formula. Then rename the column to "Week Start Date".
= [Date] - [Day Of Week Number] + 1
Now we can define a measure that calculates the amount allocated to each month with the following formula. The measure does the following:
Iterates over each row of the fact table
Counts the number of days of that week that are visible in the filter context
Adds the portion of the value that corresponds to the visible days
Value Allocation := SUMX (
    MyData,
    VAR WeekStartDate = MyData[Week]
    VAR NumDaysInSelection = COUNTROWS (
        FILTER (
            'Calendar',
            'Calendar'[Week Start Date] = WeekStartDate
        )
    )
    VAR AllocationRate = DIVIDE ( NumDaysInSelection, 7 )
    RETURN AllocationRate * MyData[Value]
)
The result in the pivot table will look like this.
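The day-weighted allocation described in this answer can also be sketched in plain Python, which may help when checking the pivot table numbers by hand. The function name and the week start date of 2019-01-27 are assumptions made for illustration (the original picture is not available); that week simply has 5 days in January and 2 in February, matching the 5/7 and 2/7 split in the question.

```python
from collections import defaultdict
from datetime import date, timedelta


def allocate_weekly_to_monthly(weekly):
    """Spread each weekly value evenly over its 7 days, then sum per month.

    weekly: list of (week_start_date, value) pairs.
    Returns a dict mapping (year, month) to the allocated amount.
    """
    totals = defaultdict(float)
    for week_start, value in weekly:
        for offset in range(7):
            day = week_start + timedelta(days=offset)
            # Each day of the week carries one seventh of the weekly value
            totals[(day.year, day.month)] += value / 7
    return dict(totals)


# Hypothetical week: 27-31 Jan (5 days) and 1-2 Feb (2 days), value 200
monthly = allocate_weekly_to_monthly([(date(2019, 1, 27), 200)])
for month, amount in sorted(monthly.items()):
    print(month, round(amount, 2))
```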

Excel VBA Power Query - How to create a query that dynamically returns only the sale's rows of the last minute?

I have a comma separated csv file with the following structure:
Col Headers:
ProdDate, ProdTime, OLEDATETIME, ProdBuyPrice, ProdSellPrice, ProdBoughtQTY, ProdSoldQTY, etc
09/21/2019, 13:54:22, 43729.5801, 12.45, 12.61, 8, 9, etc.
This CSV file is updated many times per minute (5 to 70 times), so it can have anywhere from 5 to 70 lines within the last minute of sales. That means I can't set an arbitrary fixed number in "keep first lines" to return only the rows that arrived in the last minute. I have never done this with Power Query before, so I need a complete recipe, but my googling has turned up nothing so far.
Any suggestion?
This is an example of how you can identify a dynamic row number. In this example, we have a table that shows fruit sales by store. We want to create a query that returns the highest number of bananas sold.
This is what our data table looks like.
Step 1 - Add an index column starting from 1. This assigns row numbers.
Add Column > Index Column > From 1
Step 2 - Filter and Sort the data.
Remove any columns that are unnecessary.
Filter the Item column for Bananas.
Sort the Values column in descending order.
Right-click on the first value in the Index column and choose Drill-Down.
RESULT
Now you have a dynamic row number. Instead of the index, you could also drill down on the value itself to return the sales figure. To apply this to other scenarios, just keep filtering and sorting until you get the result you need.
This is how you filter a time column to keep only the records that fall within one minute of the latest time in the data.
let
    Source = Excel.CurrentWorkbook(){[Name="t_DatesAndTimes"]}[Content],
    ChangedTypes_ColData = Table.TransformColumnTypes(Source,{{"Date", type date}, {"Time", type time}}),
    AddCol_DateAndTime = Table.AddColumn(ChangedTypes_ColData, "Date and Time", each [Date] & [Time], type datetime),
    LatestTime_ofReport_MinusOneMinute = List.Max(AddCol_DateAndTime[Date and Time])-#duration(0,0,1,0),
    FilterRows_KeepTimesInLastMinute = Table.SelectRows(AddCol_DateAndTime, each [Date and Time] >= LatestTime_ofReport_MinusOneMinute)
in
    FilterRows_KeepTimesInLastMinute
Data Table needing to be filtered
Table filtered for time in the last minute of times listed in the report.
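For readers who want to check the logic outside Power Query, the same filter (combine date and time, find the latest timestamp, keep everything within one minute of it) can be sketched in Python. The column names `ProdDate` and `ProdTime` come from the question's CSV header; the function name and all sample rows except the 13:54:22 one are hypothetical.

```python
from datetime import datetime, timedelta


def rows_in_last_minute(rows):
    """Keep rows whose timestamp is within one minute of the latest one.

    rows: list of dicts with 'ProdDate' (MM/DD/YYYY) and 'ProdTime' (HH:MM:SS).
    """
    if not rows:
        return []
    stamped = [
        (datetime.strptime(f"{r['ProdDate']} {r['ProdTime']}", "%m/%d/%Y %H:%M:%S"), r)
        for r in rows
    ]
    # The cutoff is relative to the newest row in the data, not the system clock
    cutoff = max(ts for ts, _ in stamped) - timedelta(minutes=1)
    return [r for ts, r in stamped if ts >= cutoff]


sales = [
    {"ProdDate": "09/21/2019", "ProdTime": "13:52:10"},  # older than one minute
    {"ProdDate": "09/21/2019", "ProdTime": "13:53:30"},
    {"ProdDate": "09/21/2019", "ProdTime": "13:54:22"},  # latest row
]
recent = rows_in_last_minute(sales)
print([r["ProdTime"] for r in recent])
```

Measuring the minute from the newest row rather than from the current time mirrors the `List.Max` step in the query above, so the result does not depend on when the query is refreshed.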

Get last item with date range and name filter in google sheets

I have the below set of records in Google Sheets. I would like to filter the rows with specific name and date range. Once I have the filtered data, I would like to fetch the last row's final amount cell data.
Ex: I would like to fetch a final amount of 300 if my date range (dd/mm/yyyy) is 01/01/2016 to 11/06/2016 and the name selected is 'Sandeep'.
Since I have experience with SQLite, I inserted the same records into a database and got the expected result using the query below.
select Final from MyTable where Date in (select max(Date) from MyTable WHERE Date BETWEEN '01/01/2016' AND '11/06/2016' and name = "Sandeep")
But I have no idea how to use multiple select statements in Google Sheets. Any other way of getting the result is fine with me, so please help me get the result explained above.
=QUERY(A1:E50, "select E where A > date '2016-01-01' and A < date '2016-06-11' and B = 'Sandeep' order by A desc limit 1")
Use column IDs (A, B, C) instead of column names such as name or income. Multiple columns can be given in a single select clause, separated by commas.
Dates in the where clause must be written in yyyy-mm-dd format (regardless of how the dates are formatted in the actual column).
See if this works
=index(E:E, max(filter(row(A:A), A:A>date(2016, 1, 1), A:A<date(2016, 6, 11), B:B="Sandeep")))
If you want to include start and end date, change > to >= and < to <=.
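The lookup itself (filter by name and date range, then take the final amount of the latest matching row) can be sketched in Python as a cross-check. This follows the SQL in the question, so both bounds are inclusive and "last" means the latest date. The function name, field names, and sample records are hypothetical, since the original sheet is not shown.

```python
from datetime import date


def last_final_amount(records, name, start, end):
    """Return the 'final' value of the latest-dated record for `name`
    whose date lies in [start, end], or None if nothing matches."""
    matches = [
        r for r in records
        if r["name"] == name and start <= r["date"] <= end
    ]
    if not matches:
        return None
    return max(matches, key=lambda r: r["date"])["final"]


# Hypothetical records standing in for the sheet
records = [
    {"date": date(2016, 1, 5), "name": "Sandeep", "final": 100},
    {"date": date(2016, 5, 20), "name": "Sandeep", "final": 300},
    {"date": date(2016, 6, 1), "name": "Other", "final": 500},
    {"date": date(2016, 7, 1), "name": "Sandeep", "final": 700},
]
result = last_final_amount(records, "Sandeep", date(2016, 1, 1), date(2016, 6, 11))
print(result)
```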

How get rows with oldest date per year

I have a task in Excel, which I will illustrate with an example. Let's say we have this table:
ID date
1 2015-03-11
1 2015-05-13
2 2013-01-10
2 2010-05-11
1 2014-09-19
2 2013-04-01
I need to perform some operations to get the row with the oldest date for each year. So I should end up with:
ID date
1 2015-03-11
1 2014-09-19
2 2013-01-10
2 2010-05-11
I will be grateful for any help. Thanks in advance!
This is but one option. I like using SQL for this type of work and since Excel can connect to itself as an ODBC data source, that's just what I did here...
Create a named range in Excel (I called mine SomeTable). I do this by selecting the range in question and typing the name into the drop-down field to the left of the formula bar, which usually shows the selected cell (B11 in the image below).
I then select Data, then From External Sources, and choose the option for Microsoft Query (ODBC). Select New Data Source and give it a name (the Excel file name). Select the Microsoft Excel driver and click Connect. Browse to the file containing the named range (SomeTable). Select OK, then in the fourth option select the named range (SomeTable), and choose a place to put the table on a worksheet.
Now click in the table it creates, go to the Data menu, open Properties, and enter the following in the Definition tab under Command text:
Select ID, Date
FROM SomeTable ST
INNER JOIN
    (Select MIN(date) as mDate, year(date) as mYear
     FROM someTable
     Group by year(date)) A
    on ST.Date = A.mDate
If all is done correctly, you should get results like this:
Columns E and F are the source table, named "SomeTable"
A10 is where I chose to put the table
B20 is where I put the SQL used to get the earliest date per year.
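The grouping logic in that query (find the minimum date within each year and return the matching row) can be sketched in plain Python using the sample rows from the question. The function name `earliest_per_year` is hypothetical; the output is sorted newest year first only to match the order shown in the question.

```python
from datetime import date


def earliest_per_year(rows):
    """Return one (id, date) row per calendar year: the one with the earliest date."""
    best = {}
    for row_id, d in rows:
        # Keep the row only if it is the first or the earliest seen for its year
        if d.year not in best or d < best[d.year][1]:
            best[d.year] = (row_id, d)
    return sorted(best.values(), key=lambda r: r[1], reverse=True)


# Sample table from the question
table = [
    (1, date(2015, 3, 11)),
    (1, date(2015, 5, 13)),
    (2, date(2013, 1, 10)),
    (2, date(2010, 5, 11)),
    (1, date(2014, 9, 19)),
    (2, date(2013, 4, 1)),
]
oldest = earliest_per_year(table)
for row_id, d in oldest:
    print(row_id, d.isoformat())
```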

Cumulative return in a PivotTable?

I have a table of daily data in Excel. Let's say it's daily stock return data that looks like:
Date, DailyReturn
1/1/2001, .021
1/2/2001, .005
1/3/2001, .0034
1/4/2001, .013
....
12/31/2001, .004
The data is in a table on one of the sheets. Let's call it TableOfData. I've added a pivot table on another sheet, and I've added the date as the Row Labels. I've also added grouping by year and month to the rows. Now it is easy to add the average and stddev of the returns for each month, but I would also like the total return for the month, which would be calculated as Product(DailyReturn + 1) -1
I can add a calculated field called DailyReturnFactor, which is simply DailyReturn + 1, and I can add the Product of that field to the values and the pivot table shows the return for each month, except that I still need to subtract 1. That is where I'm stuck. Any ideas on if this is possible? The other issue is that I would like to do other sets of calculations as well, e.g. kurtosis.
Note I'm using Excel 2007.
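To make the target calculation concrete, here is a minimal Python sketch of the formula Product(DailyReturn + 1) - 1 applied per month. It is not a pivot table solution, only a way to verify the numbers a pivot table should produce. The function name is hypothetical, and the dates are read as month/day/year, which is an assumption based on the sample ending in 12/31/2001.

```python
from collections import defaultdict
from datetime import date


def monthly_compounded_returns(daily):
    """Compound daily returns within each month: product(1 + r) - 1.

    daily: list of (date, daily_return) pairs.
    Returns a dict mapping (year, month) to the compounded return.
    """
    factors = defaultdict(lambda: 1.0)
    for d, r in daily:
        factors[(d.year, d.month)] *= 1.0 + r
    # Subtract 1 once per month, after all daily factors are multiplied
    return {month: factor - 1.0 for month, factor in factors.items()}


# First four rows of the sample data in the question
daily_returns = [
    (date(2001, 1, 1), 0.021),
    (date(2001, 1, 2), 0.005),
    (date(2001, 1, 3), 0.0034),
    (date(2001, 1, 4), 0.013),
]
by_month = monthly_compounded_returns(daily_returns)
print(by_month)
```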
