Cross table in Spotfire - calculated-columns

*UPDATE based on ksp's answer (thank you very much for that, it was almost what I was looking for.)
Can somebody help me with the following problem.
Given the data table:
Key Rec Period DOW Category Value
Key1 Rec1 Period1 dow1 KPIa x1
Key1 Rec2 Period1 dow1 KPIb z1
Key1 Rec3 Period2 dow1 KPIa y1
Key2 Rec4 Period1 dow1 KPIa x1
Key2 Rec5 Period1 dow1 KPIb z1
Key2 Rec6 Period2 dow1 KPIa y1
Key1 Rec7 Period1 dow2 KPIa x2
Key1 Rec8 Period1 dow2 KPIb z2
Key1 Rec9 Period2 dow2 KPIa y2
Key2 Rec10 Period1 dow2 KPIa x2
Key2 Rec11 Period1 dow2 KPIb z2
Key2 Rec12 Period2 dow2 KPIa y2
Key1 Rec13 Period1 dow1 Delta d1
Key1 Rec14 Period1 dow2 Delta d2
Key2 Rec15 Period1 dow1 Delta d3
Key2 Rec16 Period1 dow2 Delta d4
In Spotfire, it is possible to create the following cross table:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1 Period1
dow1 dow2 dow1 dow2 dow1 dow2 dow1 dow2
Key1 x1 x2 y1 y2 z1 z2 d1 d2
Key2 x1 y1 y2 z1 z2 d3 d4
Now there is something I would want to change in this cross table but I can’t manage to figure out how:
Delta is a column which is only valid for Period1. Is it possible to apply the extra Period and DOW level only to certain columns of the cross table?
So what I want is:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1
dow1 dow2 dow1 dow2 dow1 dow2
Key1 x1 x2 y1 y2 z1 z2 (d1 + d2) / 2
Key2 x1 y1 y2 z1 z2 (d3 + d4) / 2
And when the dow2 is filtered out:
Avg(KPIa) Avg(KPIb) Delta
Period1 Period2 Period1
dow1 dow1 dow1
Key1 x1 y1 z1 d1
Key2 x1 y1 z1 d3
Thanks in advance.

# user6076025 - Please check this solution and let me know if this helps.
I have considered X as 1, Y as 2 and Z as 3 for computation purpose.
I have unpivoted your data which is there in the first screenshot of your post and then created a cross table from the unpivoted data.
Attached are the screenshots for your reference.

# user6076025 - I have assigned values to 'dummy value' column in your table for computation purpose and added a calculated column 'new delta' which will average d1,d2 and d3,d4.
Here is the formula:
Now, I have created a cross table from this data. Below are the screenshots of the table and cross table.
Please let me know if this helps.

Regarding the Dow issue, I would place a Drop-down list in a text area with Fixed Value options
Display Name : 'Include Dow2' Value: 0
Display Name : 'Exclude Dow2' Value: 1
Which would have a script for on-change that would do the following:
if Document.Properties["udDowChoice"] == '0':
Document.Properties["PivotString"] = '<[Category] NEST [Period] NEST [DOW]>'
else:
Document.Properties["PivotString"] = '<[Category] NEST [Period]>'
Then in the Custom Expression for the Horizontal Axes, you make it equal ${PivotString}
And Limit Data Using Expression
If(${udDow} = 0, 1=1, [DOW] <> 'dow2')
To avoid potential confusion from the users, I also recommend hiding the DOW filter from the Filtering Scheme.

Related

For each row, add column name to list in new column if row value matches a condition

I have a series of columns, each containing either Y or N.
I would like to create a new column that contains a list of columns (for that particular row) that contain Y.
Old DataFrame
>>> df
col1 col2 col3 col4 col5
a Y N N N Y
b Y N Y Y Y
c N N Y N N
New Dataframe
>>> df_new
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
So far I can get it working for a single column with:
df["col6"] = ["col1" if val == "Y" else "" for val in df["col1"]]
But ideally I want to do the same for all columns, so I somehow end up with the result above. I could imagine doing some kind of loop, but then how I go about appending the result to the list value in col6 I'm unsure on. Can someone steer me in the right direction please?
Compare values by Y first, then use DataFrame.dot with Series.str.split:
df["col6"] = df.eq('Y').dot(df.columns + ',').str[:-1].str.split(',')
print (df)
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
Or if need better performance use list comprehension with numpy arrays:
cols = df.columns.to_numpy()
df["col6"] = [cols[x].tolist() for x in df.eq('Y').to_numpy()]

Fill NaN if values in another column are identical

I have the following dataframe:
Out[117]: mydata
author email ri oi
0 X1 NaN NaN 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com NaN 0000-0001-8437-498X
4 X5 ab#ma.com NaN 0000-0001-8437-498X
where column ri represents an author's ResearcherID, and oi the ORCID. One author may has more than one email address, so column email has duplicates.
Firstly, I'm trying to fill na in ri if the corresponding rows in oi share the same value, using a non-NaN value in ri. The result I want is:
author email ri oi
0 X1 NaN K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com K-5448-2012 0000-0001-8437-498X
Secondly, merging emails and using the merged value to fill na in column email, if the values in ri (or oi) are identical. I want to get a dataframe like the following one:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
I've tried the following code:
final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
fill_df = mydata[mydata.oi == i]
fill_df.ri = fill_df.ri.fillna(method='ffill')
fill_df.ri = fill_df.ri.fillna(method='bfill')
null_df = pd.concat([null_df, fill_df])
final_df = pd.concat([final_df, na_df])
This code returned the one I want in the the frist step, but is there an elegent way to approach this? Furthermore, how to get the merged value in email and then use the merged value as an input in the process of filling na?
Try 2 transform. One for each column. On ri, use first. On email, use combination of dropna, unique, and join
g = df.dropna(subset=['oi']).groupby('oi')
df['ri'] = g.ri.transform('first')
df['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))
Out[79]:
author email ri oi
0 X1 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab#ma.com;ab2#ma.com K-5448-2012 0000-0001-8437-498X

get distinct columns dataframe

Hello how can i do to only the lines where val is different in the 2 dataframes.
Notice that i can have id1 or id2 or both as below.
d2 = {'id1': ['X22', 'X13',np.nan,'X02','X14'],'id2': ['Y1','Y2','Y3','Y4',np.nan],'VAL1':[1,0,2,3,0]}
F1 = pd.DataFrame(data=d2)
d2 = {'id1': ['X02', 'X13',np.nan,'X22','X14'],'id2': ['Y4','Y2','Y3','Y1','Y22'],'VAL2':[1,0,4,3,1]}
F2 = pd.DataFrame(data=d2)
Expected Output
d2 = {'id1': ['X02',np.nan,'X22','X14'],'id2': ['Y4','Y3','Y1',np.nan],'VAL1':[3,2,1,0],'VAL2':[1,4,3,1]}
F3 = pd.DataFrame(data=d2)
First merge by all columns with left_on and right_on parameters, then filter out both rows and remove missing values by reshape by stack with unstack:
df=pd.merge(F1, F2, left_on=['id1','id2','VAL2'],
right_on=['id1','id2','VAL1'], how="outer", indicator=True)
df=(df[df['_merge'] !='both']
.set_index(['id1','id2'])
.drop('_merge', 1)
.stack()
.unstack()
.reset_index())
print (df)
id1 id2 VAL2 VAL1
0 X02 Y4 3 1
1 X22 Y1 1 3
F1.merge(F2,how='left',left_on=['id1','id2'],right_on=['id1','id2'])\
.query("VAL1!=VAL2")

Replace column values based on partial string match from another dataframe python pandas

I need to update some cell values, based on keys from a different dataframe. The keys are always unique strings, but the second dataframe may or may not contain some extra text at the beginning or at the end of the key. (not necessarily separated by " ")
Frame:
Keys Values
x1 1
x2 0
x3 0
x4 0
x5 1
Correction:
Name Values
SS x1 1
x2 AA 1
x4 1
Expected output Frame:
Keys Values
x1 1
x2 1
x3 0
x4 1
x5 1
I am using the following:
frame.loc[frame['Keys'].isin(correction['Keys']), ['Values']] = correction['Values']
The problem is that isin returns True only on exact mach (as far as I know), which works for only about 30% of my data.
First extract values by Frame['Keys'] joined by | for OR:
pat = '|'.join(x for x in Frame['Keys'])
Correction['Name'] = Correction['Name'].str.extract('('+ pat + ')', expand=False)
#remove non matched rows filled by NaNs
Correction = Correction.dropna(subset=['Name'])
print (Correction)
Name Values
0 x1 1
1 x2 1
2 x4 1
Then create dictionary and map for map by Correction['Name']:
d = dict(zip(Correction['Name'], Correction['Values']))
Frame['Values'] = Frame['Keys'].map(d).fillna(Frame['Values']).astype(int)
print (Frame)
Keys Values
0 x1 1
1 x2 1
2 x3 0
3 x4 1
4 x5 1

Adding attribute "keys" to concatenated dataframes

I am concatenating two dataframes along axis = 1 (columns) and try to use "keys" to later be able to distinguish between the columns of the two dataframes that have the same name.
df1 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
'ff': ['y1', 'y2', 'y3', 'y4'],
'dd': ['z1', 'z2', 'z3', 'z4']},
index=[2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04])
df2 = pd.DataFrame({'tl': ['x1', 'x2', 'x3', 'x4'],
'ff': ['y1', 'y2', 'y3', 'y4'],
'rf': ['z1', 'z2', 'z3', 'z4']},
index=[2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04])
data = pd.concat([df1, df2],keys=['snow','wind'], axis=1, ignore_index=True)
However, when trying to print all the columns belonging to one of the keys as suggested by #YashTD in Pandas add keys while concatenating dataframes at column level
print(comb_data.snow.tl)
I get the following error message:
AttributeError: 'DataFrame' object has no attribute 'snow'
I think, the keys are just not being added to the dataframe, but I don't know why. They also don't show up wenn printing the dataframe head() at they should be suggested by
Pandas add keys while concatenating dataframes at column level
Do you know how to add the keys to the dataframe?
First remove parameter ignore_index=True for MultiIndex in columns and then select by tuple:
data = pd.concat([df1, df2],keys=['snow','wind'], axis=1)
print (data)
snow wind
tl ff dd tl ff rf
2016-01-01 x1 y1 z1 x1 y1 z1
2016-01-02 x2 y2 z2 x2 y2 z2
2016-01-03 x3 y3 z3 x3 y3 z3
2016-01-04 x4 y4 z4 x4 y4 z4
print (data[('snow','tl')])
2016-01-01 x1
2016-01-02 x2
2016-01-03 x3
2016-01-04 x4
Name: (snow, tl), dtype: object

Resources