Add sections of a table column as new Columns - Power Query - excel

I have the following unpivoted table, which contains stat-tested % values, their stat letters, and the stat-letter position indicators on separate rows.
----------------------------------------
CODE | ATTR | TEXT | VALUE
----------------------------------------
1 mean I love it 0.45
2 mean I love it 0.67
3 mean I love it 0.49
4 mean I love it 0.21
5 mean I love it 0.66
1 mean I love it abd
2 mean I love it e
3 mean I love it cd
4 mean I love it a
5 mean I love it ab
1 mean I love it 1
2 mean I love it 1
3 mean I love it 1
4 mean I love it 1
5 mean I love it 1
1 wt-mean I hate it 0.22
2 wt-mean I hate it 0.56
3 wt-mean I hate it 0.13
4 wt-mean I hate it 0.89
5 wt-mean I hate it 0.50
1 wt-mean I hate it ab
2 wt-mean I hate it ae
3 wt-mean I hate it c
4 wt-mean I hate it b
5 wt-mean I hate it de
1 wt-mean I hate it 1
2 wt-mean I hate it 1
3 wt-mean I hate it 1
4 wt-mean I hate it 1
5 wt-mean I hate it 1
I want to group on the CODE column and add the Stat-tested Letters and position indicators as separate columns like below:
----------------------------------------------------------------
CODE | ATTR | TEXT | VALUE | LETTERS | POSITION
----------------------------------------------------------------
1 mean I love it 0.45 abd 1
2 mean I love it 0.67 e 1
3 mean I love it 0.49 cd 1
4 mean I love it 0.21 a 1
5 mean I love it 0.66 ab 1
1 wt-mean I hate it 0.22 ab 1
2 wt-mean I hate it 0.56 ae 1
3 wt-mean I hate it 0.13 c 1
4 wt-mean I hate it 0.89 b 1
5 wt-mean I hate it 0.50 de 1
The problem I am encountering while grouping the data is that the VALUE column has mixed data types (text and number). How can I split these into individual columns as shown above?

You can add custom columns for this and use try to check whether each value is a number.
My source table is named Tabelle1.
let
    // Load the source table from the current workbook
    Quelle = Excel.CurrentWorkbook(){[Name="Tabelle1"]}[Content],
    Change_Type = Table.TransformColumnTypes(Quelle,{{"CODE", Int64.Type}, {"ATTR", type text}, {"TEXT", type text}, {"VALUE", type any}}),
    // Number: keep VALUE when it converts to a number below 1 (the % values)
    Custom_Number = Table.AddColumn(Change_Type, "Number", each if try Number.From([VALUE]) < 1 otherwise null = true then [VALUE] else null),
    // Letters: keep VALUE when it cannot be converted to a number (the stat letters)
    Custom_Letters = Table.AddColumn(Custom_Number, "Letters", each if (try Number.From([VALUE]) >= 1 otherwise null) = null then [VALUE] else null),
    // POSITION: whatever is neither a % value nor letters (numbers of 1 or more)
    #"Hinzugefügte benutzerdefinierte Spalte" = Table.AddColumn(Custom_Letters, "POSITION", each if [Number] = null and [Letters]= null then [VALUE] else null),
    // Collapse the three rows per CODE/ATTR/TEXT into a single row
    Grouped_Rows = Table.Group(#"Hinzugefügte benutzerdefinierte Spalte", {"CODE", "ATTR", "TEXT"}, {{"VALUE", each List.Max([Number]), type nullable number}, {"LETTERS", each List.Max([Letters]), type nullable text}, {"POSITION", each List.Max([POSITION]), type nullable number}})
in
    Grouped_Rows

Related

Count unique values in a MS Excel column based on values of other column

I am trying to find the unique number of Customers, O (Orders), Q (Quotations) and D (Drafts) our team has dealt with on a particular day from this sample dataset. Please note that there are repeated "Quote/Order #"s in the dataset. I need to figure out the unique numbers of Q/O/D on a given day.
I have figured out all the values except the fields highlighted in light orange color of my Expected output table. Can someone help me figure out the MS Excel formula for these four values as requested above?
Below is the given dataset. Please note that there can be empty values against a date. But those will always be found in the bottom few rows of the table:
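------------------------------------------------------------------------
Date | Job # | Job Type | Quote/Ordr # | Parts | Customer | man-hr
------------------------------------------------------------------------
4-Apr-22 | 1 | O | 307585 | 1 | FRU | 0.35
4-Apr-22 | 2 | D | 307267 | 28 | ATM | 4.00
4-Apr-22 | 2 | D | 307267 | 25 | ATM | 3.75
4-Apr-22 | 2 | D | 307267 | 6 | ATM | 0.17
4-Apr-22 | 3 | D | 307438 | 3 | ELCTRC | 0.45
4-Apr-22 | 4 | D | 307515 | 7 | ATM | 0.60
4-Apr-22 | 4 | D | 307515 | 5 | ATM | 0.55
4-Apr-22 | 4 | D | 307515 | 4 | ATM | 0.35
4-Apr-22 | 5 | O | 307587 | 4 | PULSE | 0.30
4-Apr-22 | 6 | O | 307588 | 3 | PULSE | 0.40
5-Apr-22 | 1 | O | 307623 | 1 | WST | 0.45
5-Apr-22 | 2 | O | 307629 | 4 | CG | 0.50
5-Apr-22 | 3 | O | 307630 | 10 | SUPER | 1.50
5-Apr-22 | 4 | O | 307631 | 3 | SUPER | 0.60
5-Apr-22 | 5 | O | 307640 | 7 | CAM | 0.40
5-Apr-22 | 6 | Q | 307527 | 6 | WG | 0.55
5-Apr-22 | 6 | Q | 307527 | 3 | WG | 0.30
5-Apr-22 | | | | | |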
To figure out the unique "Number of Jobs" on Apr 4, I used the Excel formula:
=MAXIFS($K$3:$K$20,$J$3:$J$20,R3) Where, R3 ='4-Apr-22'
To figure out the unique "Number of D (Draft) Jobs" I used the Excel formula:
=SUMIFS($P$3:$P$20,$J$3:$J$20,R3,$L$3:$L$20,"D")

How to display the rows with the most number of occurrences in a column of a dataframe?

I have a data frame with 6 columns:
taken person quant reading personal family
0 1 lake rad 9.7 Anderson Lake
1 1 lake sal 0.21 Anderson Lake
2 5 Lim sal 0.08 Andy Lim
3 2 Lim rad 9.82 Andy Lim
4 2 Lim sal 0.13 Andy Lim
5 3 dyer rad 7.7 William Dyer
Output I want:
taken person quant reading personal family
0 5 Lim sal 0.08 Andy Lim
1 2 Lim rad 9.82 Andy Lim
2 2 Lim sal 0.13 Andy Lim
Basically, I want to display all the rows of the df whose value in the personal column is the most frequent one. This is what I've tried, but it doesn't work:
test = df.personal.mode()
test1 = df.loc[df.personal == test]
display(test1)
You can combine value_counts and boolean indexing:
df[df['person'] == df['person'].value_counts().index[0] ]
Output:
taken person quant reading personal family
2 5 Lim sal 0.08 Andy Lim
3 2 Lim rad 9.82 Andy Lim
4 2 Lim sal 0.13 Andy Lim
Note that this keeps only one person when several persons are tied for the most appearances. If you want to keep all of them, mode and isin are a better choice:
df[df['person'].isin(df['person'].mode())]
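To see the difference between the two approaches, here is a small self-contained sketch. It rebuilds the frame from the question and adds a hypothetical tie_df (not part of the original post) in which two values are tied for most frequent.

```python
import pandas as pd

# Frame rebuilt from the question
df = pd.DataFrame({
    "taken": [1, 1, 5, 2, 2, 3],
    "person": ["lake", "lake", "Lim", "Lim", "Lim", "dyer"],
    "quant": ["rad", "sal", "sal", "rad", "sal", "rad"],
    "reading": [9.7, 0.21, 0.08, 9.82, 0.13, 7.7],
    "personal": ["Anderson", "Anderson", "Andy", "Andy", "Andy", "William"],
    "family": ["Lake", "Lake", "Lim", "Lim", "Lim", "Dyer"],
})

# value_counts sorts by frequency, so index[0] is a single most frequent value
top_only = df[df["person"] == df["person"].value_counts().index[0]]

# mode() returns every value tied for most frequent; isin keeps rows for all of them
all_tied = df[df["person"].isin(df["person"].mode())]

# Hypothetical frame with a tie: "a" and "b" both appear twice
tie_df = pd.DataFrame({"person": ["a", "a", "b", "b", "c"]})
tie_top_only = tie_df[tie_df["person"] == tie_df["person"].value_counts().index[0]]
tie_all = tie_df[tie_df["person"].isin(tie_df["person"].mode())]

print(top_only)
print(len(tie_top_only), len(tie_all))
```

On the question's data both selections return the same three rows. On tie_df the first keeps the rows of only one of the tied values, while the second keeps the rows of both.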

How to have a cross tabulation for categorical data in Pandas (Python)?

I have the following code for example.
import numpy as np
import pandas as pd
df = pd.DataFrame(dtype="category")
df["Gender"]=np.random.randint(2, size=100)
df["Q1"] = np.random.randint(3, size=100)
df["Q2"] = np.random.randint(3, size=100)
df["Q3"] = np.random.randint(3, size=100)
df[["Gender", "Q1", "Q2", "Q3"]] = df[["Gender", "Q1", "Q2", "Q3"]].astype('category')
pd.pivot_table(data=df,index=["Gender"])
I want to have a pivot table with percentages over gender for all the other columns. In fact, like the following.
How to achieve this?
The above code gives an error saying that
No numeric types to aggregate
I don't have any numerical columns. I just want to find the frequency of each category under male and female, and then the percentage of each over male and female respectively.
As suggested by your question, you can use the pd.crosstab to make the cross tabulation you need.
You just need to do a quick preprocessing with your data, which is to melt and convert Q columns to rows (see details below):
df = df.melt(id_vars='Gender',
             value_vars=['Q1', 'Q2', 'Q3'],
             var_name='Question', value_name='Answer')
Then you can use pd.crosstab and calculate percentage as needed (here the percentage for each Question per Gender per Answer is shown)
pd.crosstab(df.Question, columns=[df.Gender, df.Answer]).apply(lambda row: row/row.sum(), axis=1)
Gender       0                 1
Answer       0     1     2     0     1     2
Question
Q1        0.13  0.18  0.18  0.13  0.19  0.19
Q2        0.09  0.21  0.19  0.22  0.13  0.16
Q3        0.19  0.10  0.20  0.16  0.18  0.17
Details
df.head()
Gender Q1 Q2 Q3
0 1 0 2 0
1 1 0 0 1
2 0 2 0 2
3 0 0 2 0
4 0 1 1 1
df.melt(id_vars='Gender', value_vars=['Q1', 'Q2', 'Q3'], var_name='Question', value_name='Answer').head()
Gender Question Answer
0 1 Q1 0
1 1 Q1 0
2 0 Q1 2
3 0 Q1 0
4 0 Q1 1
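If the goal is the share of each answer within each gender, as the question describes, one possible variation is to put Question and Gender on the rows and let pd.crosstab normalize each row. The sketch below is an assumption about the desired layout, not part of the original answer. It uses a seeded generator so the random data is reproducible, and the name long_df is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random survey data in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gender": rng.integers(2, size=100),
    "Q1": rng.integers(3, size=100),
    "Q2": rng.integers(3, size=100),
    "Q3": rng.integers(3, size=100),
})

# Wide to long: one row per (respondent, question)
long_df = df.melt(id_vars="Gender", value_vars=["Q1", "Q2", "Q3"],
                  var_name="Question", value_name="Answer")

# normalize="index" divides every row by its own total, so each
# (Question, Gender) row shows the answer shares within that gender
shares = pd.crosstab(index=[long_df["Question"], long_df["Gender"]],
                     columns=long_df["Answer"],
                     normalize="index")
print(shares.round(2))
```

Each row of shares sums to 1, so the values can be read directly as proportions of that gender's answers to that question.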

Which statsmodels ANOVA model for within- and between-subjects design?

I have a classic ANOVA design: two experimental conditions with two levels each; each participant answers in two of the four resulting conditions. A sample of my data looks like this:
participant_ID Condition_1 Condition_2 dependent_var
1 1 1 0.71
1 2 1 0.43
2 1 1 0.77
2 2 1 0.37
3 1 1 0.58
3 2 1 0.69
4 2 1 0.72
4 1 1 0.12
26 2 2 0.91
26 1 2 0.53
27 1 2 0.29
27 2 2 0.39
28 2 2 0.75
28 1 2 0.51
29 1 2 0.42
29 2 2 0.31
Using statsmodels, I wish to identify the effects of both conditions on the dependent variable, allowing for the fact that each participant answers twice and that there may be interactions. My expectation would be that I would use the repeated-measures ANOVA option as follows:
from statsmodels.stats.anova import AnovaRM
aovrm = AnovaRM(data, 'dependent_var', 'participant_ID', within=['Condition_1'], between = ['Condition_2'], aggregate_func= 'mean').fit()
However, when I do this, I get the following error:
NotImplementedError: Between subject effect not yet supported!
Does anyone know of a workaround for this that doesn't involve learning R? My instinct would be to try a mixed linear model, but I don't know how to account for the fact that each participant answered twice.
Apologies if this turns out to really be a Cross Validated question!
You could try out the pingouin package: https://pingouin-stats.org/index.html
It seems to cover mixed anovas, which are not yet fully implemented in statsmodels.
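On the mixed linear model idea raised in the question: one common way to account for each participant answering twice is a random intercept per participant. Below is a minimal sketch with statsmodels' mixedlm, using only the sample rows posted in the question. That is far too little data for real inference, and convergence or boundary warnings are likely. It is a related model, not the same test as a mixed ANOVA.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sample rows copied from the question
data = pd.DataFrame({
    "participant_ID": [1, 1, 2, 2, 3, 3, 4, 4,
                       26, 26, 27, 27, 28, 28, 29, 29],
    "Condition_1": [1, 2, 1, 2, 1, 2, 2, 1,
                    2, 1, 1, 2, 2, 1, 1, 2],
    "Condition_2": [1] * 8 + [2] * 8,
    "dependent_var": [0.71, 0.43, 0.77, 0.37, 0.58, 0.69, 0.72, 0.12,
                      0.91, 0.53, 0.29, 0.39, 0.75, 0.51, 0.42, 0.31],
})

# Fixed effects: both conditions and their interaction (treated as categorical).
# groups=participant_ID adds a random intercept per participant, which models
# the correlation between the two answers given by the same person.
model = smf.mixedlm("dependent_var ~ C(Condition_1) * C(Condition_2)",
                    data, groups=data["participant_ID"])
result = model.fit()
print(result.summary())
```

The fixed-effect rows of the summary correspond to the intercept, the two main effects, and the interaction.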

Python Life Expectancy

I am trying to use pandas to calculate life expectancy with complex equations.
Multiplying or dividing one column by another is not difficult to do.
My data is:
A b
1 0.99 1000
2 0.95 = 0.99 * 1000 = 990
3 0.93 = 0.95 * 990
Field A is populated, and field b has only the first value, 1000.
Each later value of b is the previous row's A times the previous row's b, e.g. b2 = A1 * b1.
I tried the shift function but got a result for b2 only, and zeros for the rest. Any help, please? Thanks, Mazin
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])
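The reason cumprod works here is that the recurrence b[i] = A[i-1] * b[i-1] unrolls to b[i] = b[0] * A[0] * ... * A[i-1], which is a running product shifted down by one row. A short sketch, using the three rows from the question, that compares an explicit loop against the vectorized form (the names b_loop and b_vec are chosen here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.99, 0.95, 0.93],
                   "b": [1000.0, np.nan, np.nan]})

# Explicit recurrence: each b is the previous row's A times the previous row's b
b_loop = [df["b"].iloc[0]]
for a in df["A"].iloc[:-1]:
    b_loop.append(a * b_loop[-1])

# Vectorized form: running product of A, scaled by the starting value,
# shifted one row down; the first row is filled with the starting value
b_vec = (df["A"].cumprod() * df["b"].iloc[0]).shift().fillna(df["b"].iloc[0])

print(b_loop)
print(b_vec.tolist())
```

Both give 1000, 990 and 940.5 up to floating-point rounding.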
