How to unnest multiple columns in presto, outputting into corresponding rows - presto

I'm trying to unnest some columns.
I have a couple of columns that contain arrays, both using | as a delimiter.
The data is stored looking like this, with extra values to the side that show the currency
I want to output it like this
I tried adding another UNNEST, like this
SELECT c.campaign, c.country, a.product_name, u.price--, u.price -- add price to this split. handy for QBR
FROM c, UNNEST(split(price, '|')) u(price), UNNEST(split(product_name, '|')) a(product_name)
group by 1,2, 3, 4
but this duplicated several rows, so I suspect that unnesting the two columns separately doesn't quite work.
Thanks

The issue with your query is that the clause FROM c, UNNEST(...), UNNEST(...) effectively computes the cross join between each row of c and the rows produced by each of the derived tables resulting from the UNNEST calls.
You can solve it by unnesting all your arrays in a single call to UNNEST, which produces a single derived table. Used that way, UNNEST produces a table with one column for each array and one row for each element position in the arrays. If the arrays have different lengths, it produces as many rows as the longest array has elements and fills the columns of the shorter arrays with NULL.
To illustrate, for your case, this is what you want:
WITH data(a, b, c) AS (
    VALUES
        ('a|b|c', '1|2|3', 'CAD'),
        ('d|e|f', '4|5|6', 'USD')
)
SELECT t.a, t.b, data.c
FROM data, UNNEST(split(a, '|'), split(b, '|')) t(a, b)
which produces:
a | b | c
---+---+-----
a | 1 | CAD
b | 2 | CAD
c | 3 | CAD
d | 4 | USD
e | 5 | USD
f | 6 | USD
(6 rows)
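To see why the two forms differ in row count, here is a small Python sketch that mimics both behaviours for the first sample row. It is only an analogy, not Presto itself: itertools.product stands in for the two separate UNNEST calls, and itertools.zip_longest stands in for the single UNNEST call with NULL padding. The variable names are made up for illustration.

```python
from itertools import product, zip_longest

# First sample row from the answer above: two pipe-delimited columns
# plus a currency column.
names, prices, currency = "a|b|c", "1|2|3", "CAD"
name_list = names.split("|")
price_list = prices.split("|")

# Two separate UNNEST calls behave like a cross join: every name is paired
# with every price, so this one input row becomes 3 x 3 = 9 rows.
cross_join_rows = [(n, p, currency) for n, p in product(name_list, price_list)]

# A single UNNEST call with both arrays pairs elements by position and pads
# the shorter array with NULL (None here), so the row becomes 3 rows.
zipped_rows = [(n, p, currency) for n, p in zip_longest(name_list, price_list)]

print(len(cross_join_rows))
print(zipped_rows)
```

The nine cross-joined rows are the duplicates seen in the question, and the three positionally paired rows match the first three rows of the output above.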

Related

Build 1D array / list in formula by multiplying values for use in AVERAGE()

I have an excel spreadsheet with a list of values, column A contains the grading, column B contains the number of occurrences:
A | B
---------------
Grading | Count
1 | 1
2 | 1
3 | 2
4 | 3
5 | 5
I would like to find the average grading based on the count, but to do this I need to build a list based on these values, i.e. the above chart should translate into:
=AVERAGE(1,2,3,3,4,4,4,5,5,5,5,5).
I have managed to come to a solution through a very convoluted method: creating a new table, using IF and COUNTIF to print out an array, and then taking the AVERAGE of the entire range. This is time-consuming to repeat, and I'm sure there is a much simpler way of doing this.
If I'm not mistaken, you can just take the sum of the products of columns A and B, then divide by the sum of the Count column:
=SUMPRODUCT(A2:A6, B2:B6) / SUM(B2:B6)
Note that your hand-written expanded formula yields the same result (46/12, about 3.83):
=AVERAGE(1,2,3,3,4,4,4,5,5,5,5,5)
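As a quick cross-check outside Excel, the sketch below computes the weighted average both ways in Python using the values from the question. The variable names are illustrative only.

```python
from statistics import mean

# Values from the question: grading in column A, count in column B.
gradings = [1, 2, 3, 4, 5]
counts = [1, 1, 2, 3, 5]

# Equivalent of SUMPRODUCT(A2:A6, B2:B6) / SUM(B2:B6).
weighted_average = sum(g * c for g, c in zip(gradings, counts)) / sum(counts)

# Equivalent of the hand-expanded AVERAGE(1,2,3,3,4,4,4,5,5,5,5,5):
# repeat each grading as many times as its count.
expanded = [g for g, c in zip(gradings, counts) for _ in range(c)]
expanded_average = mean(expanded)

print(weighted_average, expanded_average)
```

Both values come out as 46/12, which is why the expanded list never needs to be built.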

Compute Correlation Dataframe for each Vector Row by Index Python

I have a dataframe with 500 columns indexed by date, with four years of data.
| Date     | A         | AAL       | AAP      | AAPL      | ABC ...
| 1/2/2004 | 18.442521 | 25.954398 | 1.38449  | 11.528444 | ...
| 1/5/2004 | 18.922795 | 25.718507 | 1.442394 | 11.919131 | ...
| 1/6/2004 | 19.518334 | 26.177538 | 1.437189 | 11.870028 | ...
etc...
I would like to calculate the Pearson correlation matrix for each day, so each row. I want to save the matrices by date, in the most space efficient manner readable by R. (Right now my goal is separate sheets, by index date, in Excel. I am open to suggestions.)
I have tried several ways, but this seemed the most promising, because I could not apply the corr() to a df.groupby.
However this method returned empty dataframes, and now I am stuck!
I am looking for a method that doesn't involve iteration.
def do_Corr(df_group):
    """Apply the function to each group in the data and return one result."""
    X = df_group.corr()
    return X
df.groupby([df.index.year,df.index.month,df.index.day]).apply(do_Corr).dropna()
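You probably want df.T.corr(). .T transposes the DataFrame so that rows become columns, and .corr() then correlates those columns, i.e. your original rows (dates). Note that this gives a single date-by-date matrix rather than one matrix per day. If each date has only one row, as in your sample, the groupby attempt comes back empty because a correlation cannot be computed from a single observation: every entry is NaN, and dropna() then removes it.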
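A minimal sketch of the difference, assuming pandas is installed. The frame below is a hypothetical three-date, four-ticker miniature of the one in the question, with invented prices.

```python
import pandas as pd

# Hypothetical miniature of the frame in the question: dates as the index,
# tickers as columns. The prices are invented for illustration.
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0],
        "AAL": [2.0, 4.0, 1.0],
        "AAP": [3.0, 6.0, 2.0],
        "AAPL": [4.0, 8.0, 0.0],
    },
    index=pd.to_datetime(["2004-01-02", "2004-01-05", "2004-01-06"]),
)

# Default behaviour: Pearson correlation between columns (tickers), 4 x 4.
by_ticker = df.corr()

# Transposed: Pearson correlation between rows (dates), 3 x 3.
by_date = df.T.corr()

# A single day on its own has one observation per ticker, so every
# entry of its correlation matrix is NaN.
single_day = df.iloc[[0]].corr()

print(by_ticker.shape, by_date.shape)
print(single_day.isna().all().all())
```

In this toy data the second date is exactly twice the first, so by_date reports a correlation of 1 between those two dates.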

Compare two excel columns which the most frequently occur in specific date

I would like to compare a few columns to find out what the top 5 most popular products were in 2015.
I have this kind of data to work with:
Client | Product | Date of buy
------------------------------
client1 | A | 15.06.2015
client3 | A | 04.12.2015
client5 | F | 15.06.2015
client9 | G | 15.01.2015
client2 | G | 15.01.2015
client1 | R | 05.07.2015
client3 | G | 15.06.2015
client1 | F | 05.07.2015
client3 | F | 15.06.2016
Desired result: the top 5 product pairs that clients most often bought together (on the same date), e.g.:
1. Product A + Product H 222 times
2. Product A + Product E 77 times
3. Product B + Product O 70 times
4. etc
5. ...
Greetz,
Making the following assumptions:
You can use helper columns.
Your columns above are A, B and C.
You have two header rows and data starts in row 3.
Your dates are stored in an Excel date format and not as string values.
In E3 I generated a list of unique product items using the following formula:
=INDEX($B$3:$B$11,MATCH(0,INDEX(COUNTIF($E$2:E2,$B$3:$B$11),0,0),0))
I copied it down to match the number of rows in the initial list. It starts spitting #N/A when all the unique items in the list have been listed. If you want to avoid this you could put the formula inside of:
=IFERROR(insert formula,"")
Now in column F I counted each item based on your criteria, restricted to the year 2015. I used COUNTIFS, which counts with multiple criteria:
=COUNTIFS($C$3:$C$11,"<"&DATE(2016,1,1),
$C$3:$C$11,">"&DATE(2014,12,31),
$B$3:$B$11,E3)
I just reformatted that for easier reading. You will have to edit that slightly if you want to copy and paste. If you don't like seeing 0 when there is no product in the adjacent column you could wrap the equation in:
=IF(E3="","", insert formula )
I then skipped a column and listed the counts from largest to smallest, in sequence. I only went down two rows, but you could technically do the whole list. The LARGE function does this, and the formula in H3 looks like:
=LARGE($F$3:$F$11,ROWS($1:1))
I then went back one column and put in the product name that corresponds to each count, taking the next name in the list when products have an equal count. I put that in column G because when I read I normally want the product name first and then the quantity. If you want it the other way around, just swap the columns. The formula in G3 is:
=INDEX($E$3:$E$11,MATCH(H3,$F$3:$F$11,0)+COUNTIF($H$3:$H3,H3)-1)
Copy E3 and F3 down as far as you need. Copy G3 and H3 down one row and you will have the top two; down two rows and you have the top three, etc.
This is how it looks. The dates are displayed according to my computer's date format.
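For comparison, the same per-product count for 2015 can be sketched in Python with collections.Counter, using the sample rows from the question. This only reproduces the single-product ranking that the helper columns compute, not product pairs, and the function name top_products is made up for this example.

```python
from collections import Counter
from datetime import datetime

# Sample rows from the question: (client, product, date of purchase).
purchases = [
    ("client1", "A", "15.06.2015"),
    ("client3", "A", "04.12.2015"),
    ("client5", "F", "15.06.2015"),
    ("client9", "G", "15.01.2015"),
    ("client2", "G", "15.01.2015"),
    ("client1", "R", "05.07.2015"),
    ("client3", "G", "15.06.2015"),
    ("client1", "F", "05.07.2015"),
    ("client3", "F", "15.06.2016"),
]


def top_products(rows, year, n=5):
    """Count purchases per product within one year and return the top n."""
    counts = Counter(
        product
        for _client, product, date_text in rows
        if datetime.strptime(date_text, "%d.%m.%Y").year == year
    )
    return counts.most_common(n)


print(top_products(purchases, 2015))
```

With this sample, G is bought three times in 2015, A and F twice each, and R once. The 2016 purchase of F is excluded, just as the two date criteria in COUNTIFS exclude it.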

How to compare two columns value in excel?

I have over 100k rows of data like below:
ALLA,ALLA,"Company1, Inc.","Company1, Inc.",PSA,PSA,1,1,FALSE,FALSE
BCCO,BCCO,"Company2, Inc.","Company2, Inc.",PSB,PSB,1,1,FALSE,FALSE
CTTP,CTTP,"Company3, Inc.","Company3, Inc.",PSC,PSC,1,1,FALSE,FALSE
CMMZ,CMMZ,"Company4, Inc.","Company4, Inc.",PSD,PSD,1,1,FALSE,FALSE
I want to know how to figure if data in column 1 is the same as column 2, column 3 as column 4 and so on. How could I do that in excel?
Following Cory's formula, I found that I can compare whole columns using:
=if(A:A=B:B, "yay", "aww")
Problem is I have a header in the file:
c - symbol, symbol, c - companyname, companyname, c - tradingvenue, tradingvenue, c - tierrank, tierrank, c - iscaveatemptor, iscaveatemptor
Shouldn't this cause A:A=B:B to be false?
Given this:
| A | B |
---+-----+-----+
1 | X | X |
---+-----+-----+
2 | Y | Y |
---+-----+-----+
3 | Z | Z |
The formula =SUMPRODUCT(--(A1:A3=B1:B3)) will tell you how many times the A value matches the B value.
You should get 3 as a result here. If, for example, you change B3 to Q then it will give you 2.
To do this on two columns without specifying the end of the range, try:
=SUMPRODUCT(--(A:A=B:B),--(LEN(A:A)>0))
I've been using Excel since 1991, and unless you want to write a VB macro, I think the best way is to do the simple IF statement suggested in the comments. If you need to test several columns at once, which is what your question suggests, then I'd do
=IF(AND(A1=B1,C1=D1,E1=F1,G1=H1),0,1)
Fill that formula down the column and you'll be able to instantly count the number of rows that don't match. With a data filter, you can then select all the rows that have a '1' and examine the rows that don't match. Extend the AND with further pairs (e.g. I1=J1) if you have more columns.
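If the data is still in CSV form, the same pairwise check can be sketched in Python with the standard csv module. The sample below reuses rows from the question, with one value deliberately changed (CTTX) to produce a mismatch. The function name mismatched_rows is hypothetical, and a header row would have to be skipped first.

```python
import csv

# Rows from the question. The second value of the third row was changed
# to CTTX on purpose so that there is something to find.
sample = '''ALLA,ALLA,"Company1, Inc.","Company1, Inc.",PSA,PSA,1,1,FALSE,FALSE
BCCO,BCCO,"Company2, Inc.","Company2, Inc.",PSB,PSB,1,1,FALSE,FALSE
CTTP,CTTX,"Company3, Inc.","Company3, Inc.",PSC,PSC,1,1,FALSE,FALSE
'''


def mismatched_rows(lines):
    """Return (row_number, [pair_number, ...]) for each row in which
    column 1 != column 2, column 3 != column 4, and so on."""
    result = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        bad_pairs = [
            i // 2 + 1
            for i in range(0, len(row) - 1, 2)
            if row[i] != row[i + 1]
        ]
        if bad_pairs:
            result.append((row_number, bad_pairs))
    return result


print(mismatched_rows(sample.splitlines()))
```

Here only row 3 is reported, with the first column pair flagged. The csv module keeps quoted values such as "Company1, Inc." together as one field.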

Excel Sophisticated Sort - Return low/high values

I am trying to sort data imported from a csv file. The data comes in like this:
Columns
A | B
--------
t1 | 1
t3 | 9
t1 | 2
t2 | 5
t1 | 1
t3 | 13
t1 | 3
t3 | 11
t2 | 4
t2 | 7
t3 | 10
t3 | 10
and I want output similar to this:
Columns
D | E | F
----------------
t1 | 1 | 3
t2 | 4 | 7
t3 | 9 | 13
Explanation: Basically what I need to do is find the lowest and highest values from column B for each different value in column A, and list them neatly as shown in the second example.
I've worked with VBA before, so if this has to be done via VBA that's fine. I'm just at a loss as to how to accomplish this task. Any help would be appreciated.
EDIT: Forgot to mention, if it would make the task simpler, it's fine if I have to manually sort the data alphabetically based on col A (thus putting same values together).
I agree with @chrisneilsen that a Pivot Table is the best way to go. If you are set on using formulas, you can try the following (both entered as array formulas with Ctrl+Shift+Enter):
In cell E1, which will represent the minimum value:
=MIN(IF($A$1:$A$12=D1,1,MAX($B$1:$B$12)+1)*$B$1:$B$12)
And in cell F1, which will represent the maximum value:
=MAX(IF($A$1:$A$12=D1,1,MIN($B$1:$B$12)-1)*$B$1:$B$12)
The general idea is to check which values in column A are equal to your target value (column D). The IF returns an array with 1 wherever there is a match and, taking MIN as the example, the column maximum + 1 everywhere else. Multiplying by column B leaves matching values unchanged and inflates every non-matching value beyond the column maximum, so MIN can only return a value that legitimately belongs to the target key. Note that this relies on column B holding positive whole numbers, as in your sample.
Here is a Pivot Table using Excel 2007. To create it, add column headers to your data, select your data and then in the Ribbon click Insert -> Pivot Table. In the dialog box, you decide where you want to put it (it is commonly put in a New Worksheet, so you can leave the default if you want - I left it in the same worksheet for illustration purposes). From there, you can arrange it by dragging each field so it matches the pictures. For the Max/Min fields, just drag the Value field into the Values section twice. Then, in the actual Pivot Table, you can right-click on one of the values in the column and select Summarize Data By -> Min to summarize by the minimum value for each key.
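The grouping that the Pivot Table performs can also be sketched in a few lines of Python, using the sample data from the question. The function name min_max_by_key is made up for this example.

```python
# Sample data from the question: (column A key, column B value).
rows = [
    ("t1", 1), ("t3", 9), ("t1", 2), ("t2", 5), ("t1", 1), ("t3", 13),
    ("t1", 3), ("t3", 11), ("t2", 4), ("t2", 7), ("t3", 10), ("t3", 10),
]


def min_max_by_key(pairs):
    """Return {key: (lowest, highest)} with keys in sorted order."""
    summary = {}
    for key, value in pairs:
        # Start from the value itself the first time a key is seen.
        low, high = summary.get(key, (value, value))
        summary[key] = (min(low, value), max(high, value))
    return dict(sorted(summary.items()))


for key, (low, high) in min_max_by_key(rows).items():
    print(key, low, high)
```

This makes a single pass over the rows and does not require the data to be sorted by column A first.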
