Finding if n out of m columns are null over each row using calculated column functions in Spotfire

I have the following table and I would like to get the number of nulls for each SEQ_ID:
SEQ_ID zScore for 7d zScore for 14d zScore for 21d zScore for 28d zScore for 35d
456 11.353 13.2922 9.0162 8.8533
789 8.5991 8.8244 5.7394
So for SEQ_ID 456 I would have 1 null.
For SEQ_ID 789 I would have 2 nulls.
Is there a way to do this in the calculated column area in Spotfire without writing complicated CASE statements covering brute-force combinations of columns?

I guess you are looking for a Spotfire custom expression not involving R.
This would give you the number of columns that are not null. If you know the total number of columns, you can easily turn it into the number of null columns by subtracting this result from that total:
Len(RXReplace(Concatenate($map("[yourtable].$esc($csearch([yourtable],"*"))",",'-',")),'\\w+','Z','g')) -
Len(RXReplace(Concatenate($map("[yourtable].$esc($csearch([yourtable],"*"))",",'-',")),'\\w+','','g'))
[yourtable] would be the name of your data table. Note that this acts on all columns, including SEQ_ID, so count that column in your total.

Related

Find the column in Subquery coalesce function

I am using the Coalesce function to return a value from my preferred ranking of Columns but I also want to include the name of the column that the value was derived from.
i.e.
Table:
Apples  Pears  Mangos
        4      5
SQL:
;with CTE as
(
    select Coalesce(Apples, Pears, Mangos) as QTY_Fruit
    from Table
)
select *, <name of the column QTY_Fruit came from>
from CTE
Result:
QTY_Fruit  Col Name
4          Pears
I am trying to avoid a CASE statement if possible because there are about 12 fields that I will need to use in my Coalesce. I would love an easy way to pull the column name based on the value in QTY_Fruit. I'm all ears if the answer lies outside the use of subqueries, but I figured this would be a start.

How to drop entire record if more than 90% of features have missing value in pandas

I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% missing values.
But how can I drop an entire record in pandas if 90% or more of its columns have missing values?
I have seen a similar post for R, but I am coding in Python at the moment.
You can use df.dropna() and set its thresh parameter to the value that corresponds to 10% of your columns (thresh is the minimum number of non-NA values a row must have to be kept):
df.dropna(axis=0, thresh=50, inplace=True)
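If you would rather not hard-code 50, the threshold can be derived from the column count. A minimal sketch of that arithmetic, using a small made-up DataFrame in place of the 500-column df from the question:
import math
import pandas as pd

# small stand-in for the question's 500-column df
df = pd.DataFrame({"a": [1, None, None], "b": [None, None, None], "c": [3, 4, None]})

# minimum number of non-NA values a row must have: 10% of the columns, rounded up
# (for the question's 500 columns this gives 50, matching the answer above)
min_non_na = math.ceil(0.1 * df.shape[1])

# drop rows that have fewer non-NA values than the threshold
# (rows with exactly 10% non-NA values are kept; bump the threshold by 1 if those should go too)
df.dropna(axis=0, thresh=min_non_na, inplace=True)
print(df)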
You could use isna + mean on axis=1 to find the fraction of NaN values in each row, then select the rows where it is less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]
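A quick sanity check of that expression on a toy frame (the column names here are made up for illustration):
import numpy as np
import pandas as pd

# toy frame: row 0 is fully populated, row 1 is entirely NaN
df = pd.DataFrame({"x": [1.0, np.nan], "y": [2.0, np.nan], "z": [3.0, np.nan]})

# fraction of missing values per row: 0.0 for row 0, 1.0 for row 1
row_missing = df.isna().mean(axis=1)

# keep only rows where less than 90% of the columns are missing
out = df.loc[row_missing < 0.9]
print(out)  # only row 0 survives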

Snowflake unpivoting

I need to transpose a table in which column1 is the name of an entity and column2 to column366 are dates in a year that hold a dollar amount. The table, the select statement and the output result are all given below.
Question: this syntax requires me to create a comma-separated list of columns - which are basically 365 dates - and use that list in the IN clause of the select statement, like this:
.....unpivot (cash for dates in ("1-1-2020" , "1-2-2020" , "1-3-2020"........."12-31-2020")) order by 2
Is there any better way of doing this? Like with regular expressions? I don't want to type 365 dates in mm-dd-yyyy format and get carpal tunnel for my trouble.
Here is the table. The first line is the column header, the second line is a separator, and the remaining lines are sample data.
Name      01-01-2020  01-02-2020  01-03-2020  12-31-2020
--------  ----------  ----------  ----------  ----------
Entity1        10.00       15.75       20.00      100.00
Entity2        11.00       16.75       20.00       10.00
Entity3       112.00      166.75       29.00      108.00
I can transpose it using the select statement below
select * from Table1
unpivot (cash for dates in ("1-1-2020" , "1-2-2020" , "1-3-2020")) order by 2
to get an output like the one below:
Name      dates        cash
--------  ----------   ------
Entity1   01-01-2020    10.00
Entity2   01-01-2020    11.00
Entity3   01-01-2020   112.00
...
and so on
There is a simpler way to do this without UNPIVOT. Snowflake gives you a function to represent an entire row as an OBJECT -- a collection of key-value pairs. With that representation, you can FLATTEN each row's object and extract both the column name (key == date) and the value inside (value == cash). Here is a query that will do it:
with obj as (
    select OBJECT_CONSTRUCT(*) o from Table1
)
select o:NAME::varchar as name,
       f.key::date as date,
       f.value::float as cash
from obj,
     lateral flatten (input => obj.o, mode => 'OBJECT') f
where f.key != 'NAME'
;

Return the mean() for a column for rows in DF with highest value in adjacent column

I am trying to calculate the mean of a column in a DataFrame for the rows with the X highest values in another column of the same DataFrame. I have been trying and searching for hours without luck.
summary_percent_df
          TTM_Si      TTM_F    Rev_Met     price
ticker
AVP    -0.082571  -7.927108  -2.287786  0.000000
HELE    0.005513   1.542568   1.244480  0.629727
IPAR   -0.024999  -1.611722  -0.309357  0.705969
NUS    -0.049710   0.664017   0.208076  0.656487
REV    -0.016126  -4.113906  -1.297464  0.218214
I want to return a single mean() value of the price column for the 3 stocks that have the highest value in the TTM_Si column. I have come close using groupby() with .head(), but I am having an issue sorting the data after having unstacked it for the groupby.
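For what it's worth, this can be done without groupby() by taking the top rows by TTM_Si directly. A minimal sketch, rebuilding the sample frame shown above:
import pandas as pd

# rebuild the sample frame from the question
summary_percent_df = pd.DataFrame(
    {
        "TTM_Si": [-0.082571, 0.005513, -0.024999, -0.049710, -0.016126],
        "TTM_F": [-7.927108, 1.542568, -1.611722, 0.664017, -4.113906],
        "Rev_Met": [-2.287786, 1.244480, -0.309357, 0.208076, -1.297464],
        "price": [0.000000, 0.629727, 0.705969, 0.656487, 0.218214],
    },
    index=pd.Index(["AVP", "HELE", "IPAR", "NUS", "REV"], name="ticker"),
)

# the 3 rows with the largest TTM_Si values (HELE, REV, IPAR here)
top3 = summary_percent_df.nlargest(3, "TTM_Si")

# single mean of their price column
print(top3["price"].mean())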

Break ties in RANKX Powerpivot formula

I can rank my data with this formula, which groups by Year, Trust and ID, and ranks the Areas.
rankx(
    filter(Table,
        [Year] = earlier([Year]) &&
        [Trust] = earlier([Trust]) &&
        [ID] = earlier([ID])),
    [Area], , 1, Dense)
This works fine - unless you have data where the same Area appears more than once in the same group, whereupon it gives all rows the rank of 1. Is there any way to force unique rank values? So two rows that have the same Area would be given the rank of 1 and 2 (in an arbitrary order)? Thank you for your time.
Assuming you don't have duplicate rows in your table, you can add another column as a tie-breaker in your expression.
Suppose your table has an additional column, [Name], that is distinct between your multiple [Area] rows. Then you could write your formula like this:
= RANKX(
FILTER(Table,
[Year] = EARLIER([Year]) &&
[Trust] = EARLIER([Trust]) &&
[ID] = EARLIER([ID])),
[Area] & [Name], , 1, Dense)
You can append as many columns as you need to get the tie-breaking done.
