Compute Correlation Dataframe for each Vector Row by Index Python - python-3.x

I have a dataframe with 500 columns indexed by date, with four years of data.
| Date | A | AAL | AAP | AAPL | ABC ......
| 1/2/2004 | 18.442521 |25.954398 |1.38449 |11.528444......
| 1/5/2004 | 18.922795 |25.718507 |1.442394 |11.919131...
| 1/6/2004 | 19.518334 |26.177538 |1.437189 |11.870028....
.
.
. etc...
I would like to calculate the Pearson correlation matrix for each day, that is, for each row. I want to save the matrices by date, in the most space-efficient format readable by R. (Right now my goal is separate sheets, by index date, in Excel. I am open to suggestions.)
I have tried several ways, but this seemed the most promising, because I could not apply the corr() to a df.groupby.
However this method returned empty dataframes, and now I am stuck!
I am looking for a method that doesn't involve iteration.
def do_Corr(df_group):
    """Apply the function to each group in the data and return one result."""
    X = df_group.corr()
    return X

df.groupby([df.index.year, df.index.month, df.index.day]).apply(do_Corr).dropna()

You probably want df.T.corr(). .T transposes the dataframe, so rows become columns, and .corr() then computes the correlations between the original rows.
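A minimal sketch of the transpose suggestion above, using hypothetical toy data (4 dates, 5 made-up column names) rather than the asker's 500-column frame. It also shows why the groupby attempt came back empty: each daily group holds a single row, and a correlation computed from one observation is NaN everywhere, so dropna() removes everything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 dates (rows) x 5 columns standing in for tickers.
rng = np.random.default_rng(0)
dates = pd.date_range("2004-01-02", periods=4, freq="D")
df = pd.DataFrame(rng.normal(size=(4, 5)), index=dates, columns=list("ABCDE"))

# Default behaviour: correlation between columns, giving a 5 x 5 matrix.
by_column = df.corr()

# Transpose first to correlate rows (dates) with each other: a 4 x 4 matrix.
by_row = df.T.corr()

# A single-row group has one observation per column, so every entry is NaN.
single_day = df.iloc[[0]].corr()
```

Note that by_row is one date-by-date matrix, not one matrix per date; a single row on its own does not carry enough observations for a correlation.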

Related

Convert date-time string to Date in excel

I have two columns with date-time values such as "2019-08-15T00:45:28.228Z". I want to convert each of them into date format columns, and then find number of minutes between the dates.
eg:
| A1 | A2 | Date(A1) | Date(A2) | A2-A1 in minutes |
|--------------------------|--------------------------|----------|----------|------------------|
| 2019-08-15T00:45:28.228Z | 2019-08-15T00:55:28.228Z | | | 10 |
| 2019-07-25T00:45:45.127Z | 2019-07-25T01:25:55.127Z | | | 40 |
I have not been able to convert the columns into a date format, because the values have a time element as well, and all the online examples seem to handle dates only.
To get the date-time, all we need to do is remove the T and the Z:
=--REPLACE(LEFT(A2,23),11,1," ")
Then format the cell with the desired date format.
A simple subtraction of the two dates, formatted as [mm], then returns the desired output.
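For cross-checking the spreadsheet result, the same string handling can be sketched in Python. This is only an illustration of the logic in the formula above; the function names parse_stamp and minutes_between are made up for this example.

```python
from datetime import datetime


def parse_stamp(text):
    """Parse a value such as '2019-08-15T00:45:28.228Z'."""
    # Mirror =--REPLACE(LEFT(A2,23),11,1," "): keep the first 23 characters
    # (which drops the trailing "Z") and swap the "T" for a space.
    cleaned = text[:23].replace("T", " ")
    return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")


def minutes_between(start, end):
    """Return the difference between two stamps in minutes (as a float)."""
    return (parse_stamp(end) - parse_stamp(start)).total_seconds() / 60


diff = minutes_between("2019-08-15T00:45:28.228Z", "2019-08-15T00:55:28.228Z")
```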
You can use the LEFT function to retrieve the date (it will then be displayed in its original format, like 2019-08-15), or simply use --LEFT to convert the result to a number and change its formatting to Date in Excel.
Then use the MID function to retrieve the times and do your A2-A1 calculation (format the last column as Time).
Update - as per Paul's suggestion, the length of the string will always be the same, so you could use the following formulas:
=--LEFT(A2,10)
=MID(A2,12,12)

How to unnest multiple columns in presto, outputting into corresponding rows

I'm trying to unnest some data.
I have a couple of columns that hold arrays, both using | as the delimiter.
The data would be stored looking like this, with extra values to the side which show the current currency
I want to output it like this
I tried doing another unnest column, like this
SELECT c.campaign, c.country, a.product_name, u.price--, u.price -- add price to this split. handy for QBR
FROM c, UNNEST(split(price, '|')) u(price), UNNEST(split(product_name, '|')) a(product_name)
group by 1,2, 3, 4
but this duplicated several rows, so I'm not sure whether unnesting the two columns this way actually works.
Thanks
The issue with your query is that the clause FROM c, UNNEST(...), UNNEST(...) is effectively computing the cross join between each row of c and the rows produced by each of the derived tables resulting from the UNNEST calls.
You can solve it by unnesting all your arrays in a single call to UNNEST, thus producing a single derived table. Used that way, UNNEST produces a table with one column for each array and one row for each element position in the arrays. If the arrays have different lengths, it produces as many rows as the longest array has elements and fills the columns of the shorter arrays with NULL.
To illustrate, for your case, this is what you want:
WITH data(a, b, c) AS (
    VALUES
        ('a|b|c', '1|2|3', 'CAD'),
        ('d|e|f', '4|5|6', 'USD')
)
SELECT t.a, t.b, data.c
FROM data, UNNEST(split(a, '|'), split(b, '|')) t(a, b)
which produces:
a | b | c
---+---+-----
a | 1 | CAD
b | 2 | CAD
c | 3 | CAD
d | 4 | USD
e | 5 | USD
f | 6 | USD
(6 rows)
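The difference between the two query shapes can be mimicked outside Presto. The sketch below is only an analogy in Python, not Presto behaviour itself: itertools.product stands in for two separate UNNEST calls (a cross join), and itertools.zip_longest stands in for a single UNNEST over both arrays, with None playing the role of NULL.

```python
from itertools import product, zip_longest

names = "a|b|c".split("|")
prices = "1|2|3".split("|")

# Two separate UNNEST calls: every name is paired with every price.
cross = list(product(names, prices))

# One UNNEST over both arrays: elements are paired by position.
paired = list(zip_longest(names, prices))

# Arrays of different lengths: the shorter one is padded with None (NULL).
uneven = list(zip_longest(["a", "b"], ["1"]))
```

With three elements per array, the cross join yields nine rows where the positional pairing yields three, which is the duplication the question describes.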

In Tableau, how do I use two parts of a pivoted column for x and y values on a graph?

I'm trying to plot some data (standard curves for analytical chemistry) where the x axis is the mass of a compound I added to a solution, and the y axis is the signal recorded from an instrument (peak height on a mass spectrometer). I'd like Tableau to color code the data by compound (compound A, compound B, compound C, etc.), so that I'd wind up with a graph that looks something like this:
The original structure of my data was like this:
SampleID | Mass A | Mass B | ... | Signal A | Signal B | ...
standard 0 | 0| 0| ... | 0| 0| ...
standard 5 | 2.535| 2.555| ... | 0.494| 1.240| ...
standard 25| 12.675| 12.775| ... | 2.426| 7.235| ...
I know how to make graphs one compound at a time with these original data, but for the purposes of other analyses I'm doing with these data and because I want multiple compounds on the same graph, I've pivoted them so that the structure is now like this:
SampleID | Compound | Parameter | Value
standard 0 | A | Mass | 0
standard 0 | A | Signal | 0
standard 5 | A | Mass | 2.535
etc.
How do I make a graph where the mass is on the x axis, the signal is on the y axis, and the points are colored by compound? I don't see a good way to do it when my data are in this format. I've tried making two calculated variables, one where the value is NULL if the parameter is not equal to "Mass" and another where the value is NULL if the parameter is not equal to "Signal", and then putting those pills on the columns and rows, but that's not working. Is there a way to do this in Tableau with data structured in this pivoted form?
Alternatively, is there a way to spread my pivoted data so that the new structure is like this:
SampleID | Compound | Mass | Signal
standard 0 | A | 0| 0
standard 5 | A | 2.535| 0.494
standard 25| A | 12.675| 2.426
standard 0 | B | 0| 0
etc.
and would that work better?
(For R users, that last bit would be the equivalent of the tidyr package gather and spread functions.)
To make the second structure appear like the third, add a calculated field called Mass defined as if Parameter = "Mass" then Value end. Do the same for Signal.
You can then hide the fields Parameter and Value if you like, and work with Mass and Signal instead.
Put AVG(Mass) on the Columns shelf and AVG(Signal) on the Rows shelf (AVG, not ATTR). Finally, put [Sample Id] on Detail.
If I had to deal with this, I'd prefer to pre-process the data so that it has the format "SampleID | Compound | Mass | Signal". That would make the Tableau chart straightforward.
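One way such pre-processing could look, sketched in plain Python before the data reaches Tableau. The function name spread and the in-memory tuples are assumptions for illustration; in practice the rows would come from whatever file or export holds the pivoted data.

```python
# Long-format rows: (SampleID, Compound, Parameter, Value).
long_rows = [
    ("standard 5", "A", "Mass", 2.535),
    ("standard 5", "A", "Signal", 0.494),
    ("standard 5", "B", "Mass", 2.555),
    ("standard 5", "B", "Signal", 1.24),
    ("standard 25", "A", "Mass", 12.675),
    ("standard 25", "A", "Signal", 2.426),
]


def spread(rows):
    """Collapse Mass/Signal rows into one (SampleID, Compound, Mass, Signal) row."""
    wide = {}
    for sample_id, compound, parameter, value in rows:
        # Group values by (sample, compound); the parameter name becomes a key.
        wide.setdefault((sample_id, compound), {})[parameter] = value
    return [
        (sample_id, compound, values.get("Mass"), values.get("Signal"))
        for (sample_id, compound), values in wide.items()
    ]


wide_rows = spread(long_rows)
```

A missing Mass or Signal entry comes out as None rather than raising an error, which keeps incomplete samples visible.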
I think there's a way to achieve the same with the data structure you have, but it's trickier. So, if I understand correctly, you have the data in this form:
SampleId Compound Parameter Value
standard 5 A Mass 2.535
standard 5 A Signal 0.494
standard 5 B Mass 2.555
standard 5 B Signal 1.24
standard 25 A Mass 12.675
standard 25 A Signal 2.426
standard 25 B Mass 12.775
standard 25 B Signal 7.235
1) You can create calculated fields for Mass and Signal using level of detail expressions, that exclude the Parameter granularity:
Mass
{exclude [Parameter] : min(if [Parameter] = 'Mass' then [Value] else NULL end)}
Signal
{exclude [Parameter] : min(if [Parameter] = 'Signal' then [Value] else NULL end)}
That will "collapse" nulls in case Parameter is not included in the view.
2) Using the Scatter Plot visualization, you can pull Mass to columns and Signal to rows, add Compound to Color pane and SampleId to Detail pane. The plot will look like this:

EXCEL find the last relative maximum in a array (formula, not VBA)

I have a range containing values such as:
169.7978
168.633
168.5479
168.7819
167.7407
165.4146
165.1232
I don't need the maximum value of the range (i.e., the first cell in this example), but the last relative maximum, which in this case is the fourth cell. Is there a way to get this value without having to write a VBA macro? The formula must be general enough to work with any number of maxima.
It may be a bit limited, but you may start somewhere as below.
Stated array in the OP is:
+----------+---+
| y | x |
+----------+---+
| 169.7978 | 1 |
| 168.633 | 2 |
| 168.5479 | 3 |
| 168.7819 | 4 |
| 167.7407 | 5 |
| 165.4146 | 6 |
| 165.1232 | 7 |
+----------+---+
Given this, you can flag relative minima and maxima by comparing each point with its two immediate neighbours, using a helper column.
Add a Global_Rank helper column (column D in the formulas below, ranked so that the largest y gets rank 1) and check whether each row's rank is on the same side of both adjacent rows. This assumes your data is sorted by the x index; the formulas start in row 2 and are filled down.
RelativeMax:
=IF(AND(D2<=D1,D2<=D3),"RelativeMax","")
RelativeMin:
=IF(AND(D2>=D1,D2>=D3),"RelativeMin","")
Modify as needed. Hope this helps.
Edit:
Although...
If you're going to assume the data is ordered properly, you could also just compare the values directly with =IF(AND(B2>=B1,B2>=B3),"RelativeMax",IF(AND(B2<=B1,B2<=B3),"RelativeMin","")) and skip all the malarkey. This should work with multiple maxima/minima. Please report back with results from your dataset!
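The neighbour comparison can be checked against the sample values with a short Python sketch. It is not a replacement for the worksheet formula; it only confirms which cell the logic should pick. The function name last_relative_max is made up here, and endpoints are deliberately not treated as relative maxima.

```python
def last_relative_max(values):
    """Return (index, value) of the last interior local maximum, or None."""
    # Walk backwards over interior points so the first hit is the last maximum.
    for i in range(len(values) - 2, 0, -1):
        if values[i] >= values[i - 1] and values[i] >= values[i + 1]:
            return i, values[i]
    return None


y = [169.7978, 168.633, 168.5479, 168.7819, 167.7407, 165.4146, 165.1232]
result = last_relative_max(y)  # zero-based index 3 is the fourth cell
```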

Count number of rows where multiple criteria are met

I'm trying to generate a table that shows a count of how many items are in any given status on any given day. My result table has a set of Dates down column A and column headers are various statuses. A sample of my data table with headers looks like this:
Product | Notice | Assigned | Complete | In Office | In Accounting
1 | 5/5/13 | 5/7/13 | 5/9/13 | 5/10/13 | 5/11/13
2 | 5/5/13 | 5/6/13 | 5/8/13 | 5/9/13 | 5/10/13
3 | 5/6/13 | 5/9/13 | 5/10/13 | 5/10/13 | 5/10/13
4 | 5/4/13 | 5/5/13 | 5/7/13 | 5/8/13 | 5/9/13
5 | 5/7/13 | 5/8/13 | 5/10/13 | 5/11/13 | 5/11/13
If my output table were to contain a set of dates in the first column with the statuses as headers, I need a count of how many rows were at the given status and had not yet transitioned to the next status so that in the Notice column, I'd have a count of rows where the Notice Date was <= X AND where the Assigned, Complete, In Office, In Accounting are all greater than X.
I've used a SUM(IF(FREQUENCY(IF(...)))) construction that gets me REALLY close, but I feel like I need an AND statement within the second IF, like this: =SUM(IF(FREQUENCY(IF(AND
Here's what I have that won't work:
=SUM(IF(FREQUENCY(IF(AND(Table1[Assigned]<=A279,Table1[[Complete]:[In Accounting]]<=A279),ROW(Table1[[Complete]:[In Accounting]])),ROW(Table1[[Complete]:[In Accounting]]))>0,1))
If I take the "AND" portion out, this works fine except I need it to ONLY count rows where the given status actually has a date so if an "Assigned" date is empty, I don't want that row to be counted for the Assigned column.
Here's an example of what I'd expect to see in the results. I've listed the count in the each column as well as the corresponding product numbers in parenthesis. The corresponding product numbers are for reference only and won't actually be in the result table.
Date | Notice | Assigned | Complete
5/6 | 2 (1,3) | 2 (2,4) | 0
5/7 | 2 (3,5) | 2 (1,2) | 1 (4)
5/8 | 1 (3) | 2 (1,5) | 1 (2)
OK, assuming you have the original data in A1:F6 then with 2nd table headers in B9:D9 and row labels in A10:A12 then you can use this "array formula" in B10
=SUM((B$2:B$6<=$A10)*(MMULT((C$2:$F$6>$A10)+(C$2:$F$6=""),TRANSPOSE(COLUMN(C$2:$F$6)^0))=COLUMNS(C$2:$F$6)))
confirmed with CTRL+SHIFT+ENTER and copied down and across.
The results are as per your requirement. If you replace dates with blanks it will still work.
MMULT is a way to get a single value from each row even when you are looking at multiple columns.
I used cell references because I think that's easier, especially when copying the formula across with a shrinking range, but you can use structured references if you want.
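The counting rule behind the array formula (status date on or before X, every later status date after X or blank) can be sketched in Python to verify the expected table from the question. The helper names d and count_in_status are hypothetical, blanks are modelled as None, and a blank in the status being counted is treated as "not reached", which is the behaviour the question asks for rather than a claim about how the formula treats empty cells.

```python
from datetime import date


def d(month, day):
    """Shorthand for a 2013 date, matching the sample data."""
    return date(2013, month, day)


# Columns: Notice, Assigned, Complete, In Office, In Accounting.
rows = {
    1: [d(5, 5), d(5, 7), d(5, 9), d(5, 10), d(5, 11)],
    2: [d(5, 5), d(5, 6), d(5, 8), d(5, 9), d(5, 10)],
    3: [d(5, 6), d(5, 9), d(5, 10), d(5, 10), d(5, 10)],
    4: [d(5, 4), d(5, 5), d(5, 7), d(5, 8), d(5, 9)],
    5: [d(5, 7), d(5, 8), d(5, 10), d(5, 11), d(5, 11)],
}


def count_in_status(rows, status_idx, on_date):
    """Count products sitting in the given status on the given date."""
    count = 0
    for dates in rows.values():
        current = dates[status_idx]
        # The status must have been reached on or before the date...
        reached = current is not None and current <= on_date
        # ...and no later status may have been reached yet (blank = not reached).
        not_moved_on = all(x is None or x > on_date for x in dates[status_idx + 1:])
        if reached and not_moved_on:
            count += 1
    return count


may_6 = [count_in_status(rows, i, d(5, 6)) for i in range(3)]
```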
Have you tried using COUNTIFS to count based on multiple criteria? It is fairly well documented here: http://office.microsoft.com/en-us/excel-help/countifs-function-HA010047494.aspx (2007+ only)
Basically, you use it like
=COUNTIFS(first_range_to_check, value_you_want_in_first_range, ...)
where the ... represents as many range/criteria pairs as you want (up to 127 pairs in total). Note that the conditions are combined with AND, so if you have two pairs, both the first AND the second must be true for a row to be counted.
