Which statsmodels ANOVA model for within- and between-subjects design? - python-3.x

I have a classic ANOVA design: two experimental factors with two levels each; each participant answers in two of the four resulting conditions. A sample of my data looks like this:
participant_ID Condition_1 Condition_2 dependent_var
1 1 1 0.71
1 2 1 0.43
2 1 1 0.77
2 2 1 0.37
3 1 1 0.58
3 2 1 0.69
4 2 1 0.72
4 1 1 0.12
26 2 2 0.91
26 1 2 0.53
27 1 2 0.29
27 2 2 0.39
28 2 2 0.75
28 1 2 0.51
29 1 2 0.42
29 2 2 0.31
Using statsmodels, I wish to identify the effects of both conditions on the dependent variable, allowing for the fact that each participant answers twice and that there may be interactions. My expectation was that I would use the repeated-measures ANOVA option as follows:
from statsmodels.stats.anova import AnovaRM
aovrm = AnovaRM(data, 'dependent_var', 'participant_ID', within=['Condition_1'],
                between=['Condition_2'], aggregate_func='mean').fit()
However, when I do this, I get the following error:
NotImplementedError: Between subject effect not yet supported!
Does anyone know of a workaround for this that doesn't involve learning R? My instinct would be to try a mixed linear model, but I don't know how to account for the fact that each participant answered twice.
Apologies if this turns out to really be a Cross Validated question!

You could try out the pingouin package: https://pingouin-stats.org/index.html
It seems to cover mixed ANOVAs, which are not yet fully implemented in statsmodels.
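For example, a minimal sketch with pingouin's mixed_anova, assuming your data is in a pandas DataFrame with the column names from your sample (within takes the repeated factor, between the group factor):

import pingouin as pg

# mixed ANOVA: Condition_1 varies within participants,
# Condition_2 varies between participants
aov = pg.mixed_anova(data=data, dv='dependent_var',
                     within='Condition_1', subject='participant_ID',
                     between='Condition_2')
print(aov)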

Related

Python/how to insert data from one table into another according to the data condition

I have the first table:
user_cnt cost acquisition_cost
channel
facebook_ads 2726 2140.904643 0.79
instagram_new_adverts 3347 2161.441691 0.65
yandex_direct 4817 2233.111449 0.46
youtube_channel_reklama 2686 1068.119204 0.40
and the second
user_id profit source cost_per_user income
8a28b283ac30 0.91 facebook_ads ? 0.12
d7cf130a0105 0.63 youtube_channel ? 0.17
The second table has more than 200k rows, but I showed only two. I need to put the "acquisition_cost" value from the first table into the "cost_per_user" column of the second table, matched on the channel/source name. For instance, cost_per_user in the first row of the second table should be 0.79, because its source is facebook_ads.
I would be grateful if someone could help me solve this task.
First of all, I tried this function:
def cost(source_type):
    if source_type == 'instagram_new_adverts':
        return 0.65
    elif source_type == 'facebook_ads':
        return 0.79
    elif source_type == 'youtube_channel_reklama':
        return 0.40
    else:
        return 0.46

target_build['cost_per_user'] = target_build['source'].apply(cost)
but I need to find another solution that doesn't hard-code the constants (return 0.65).
Another attempt was:
for row in first_table['channel'].unique():
    second_table.loc[second_table['source'] == row, 'cost_per_user'] = first_table['acquisition_cost']
This code works only for the first four rows; for the others it puts in zero values.
And the last idea was:
second_table['cost_per_user'] = second_table['cost_per_user'].where(
    second_table['source'].isin(b.index), b['acquisition_cost'])
and again it didn't work.
It looks like you are using pandas. One possible solution is to use the pandas version of an inner join: merge.
Example
Suppose you don't want to modify either your first or second table. You can create a temporary table containing just channel and acquisition_cost from the first table, while also renaming the columns to source and cost_per_user. This can be implemented in various ways. One possible way, assuming channel is an index, is shown below.
temp_df = first_df.acquisition_cost.reset_index().rename(
    columns={'channel': 'source', 'acquisition_cost': 'cost_per_user'},
)
temp_df looks like this
source cost_per_user
0 facebook_ads 0.79
1 instaram_new_adverts 0.65
2 yandex_direct 0.46
3 youtube_channel_reklama 0.40
Say your second table looks like this:
user_id profit source income
0 c3519e80c071 0.773956 yandex_direct 0.227239
1 cc39ba469a08 0.438878 instaram_new_adverts 0.554585
2 a44a621e0222 0.858598 facebook_ads 0.063817
3 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631
4 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664
5 57dbe1efd8b1 0.975622 yandex_direct 0.758088
6 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526
7 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698
8 360dfd543fb5 0.128114 yandex_direct 0.893121
9 a31f46c26abb 0.450386 instaram_new_adverts 0.778383
You can run the merge call to attach cost_per_user to each source.
new_df = pd.merge(left=second_df, right=temp_df, on='source', how='inner')
new_df would look like this
user_id profit source income cost_per_user
0 c3519e80c071 0.773956 yandex_direct 0.227239 0.46
1 57dbe1efd8b1 0.975622 yandex_direct 0.758088 0.46
2 360dfd543fb5 0.128114 yandex_direct 0.893121 0.46
3 cc39ba469a08 0.438878 instaram_new_adverts 0.554585 0.65
4 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526 0.65
5 a31f46c26abb 0.450386 instaram_new_adverts 0.778383 0.65
6 a44a621e0222 0.858598 facebook_ads 0.063817 0.79
7 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631 0.40
8 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664 0.40
9 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698 0.40
Notes
Complications arise if the source column of the second table does not have a one-to-one match with channel in the first table. You will need to read the merge documentation to decide how to handle that situation (e.g. an inner join discards any mismatch, while a left join keeps unmatched sources but leaves their cost_per_user empty).
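As a side note, if channel is the index of first_df (as assumed above), a shorter sketch using Series.map avoids the temporary table entirely:

# look up each source in the first table's index; unmatched sources get NaN
second_df['cost_per_user'] = second_df['source'].map(first_df['acquisition_cost'])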

Count unique values in an MS Excel column based on values of another column

I am trying to find the unique number of Customers, O (Orders), Q (Quotations) and D (Drafts) our team has dealt with on a particular day from this sample dataset. Please note that there are repeated "Quote/Order #"s in the dataset. I need to figure out the unique numbers of Q/O/D on a given day.
I have figured out all the values except the four fields highlighted in light orange in my expected-output table. Can someone help me figure out the MS Excel formulas for those four values?
Below is the given dataset. Please note that there can be empty values against a date. But those will always be found in the bottom few rows of the table:
Date      Job #  Job Type  Quote/Order #  Parts  Customer  man-hr
4-Apr-22  1      O         307585         1      FRU       0.35
4-Apr-22  2      D         307267         28     ATM       4.00
4-Apr-22  2      D         307267         25     ATM       3.75
4-Apr-22  2      D         307267         6      ATM       0.17
4-Apr-22  3      D         307438         3      ELCTRC    0.45
4-Apr-22  4      D         307515         7      ATM       0.60
4-Apr-22  4      D         307515         5      ATM       0.55
4-Apr-22  4      D         307515         4      ATM       0.35
4-Apr-22  5      O         307587         4      PULSE     0.30
4-Apr-22  6      O         307588         3      PULSE     0.40
5-Apr-22  1      O         307623         1      WST       0.45
5-Apr-22  2      O         307629         4      CG        0.50
5-Apr-22  3      O         307630         10     SUPER     1.50
5-Apr-22  4      O         307631         3      SUPER     0.60
5-Apr-22  5      O         307640         7      CAM       0.40
5-Apr-22  6      Q         307527         6      WG        0.55
5-Apr-22  6      Q         307527         3      WG        0.30
5-Apr-22
To figure out the unique "Number of Jobs" on Apr 4, I used the Excel formula:
=MAXIFS($K$3:$K$20,$J$3:$J$20,R3)
where R3 = '4-Apr-22'.
To figure out the unique "Number of D (Draft) Jobs" I used the Excel formula:
=SUMIFS($P$3:$P$20,$J$3:$J$20,R3,$L$3:$L$20,"D")
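For the unique-count cells themselves, one possible approach, which works only in Excel 365 and assumes the Quote/Order # values sit in column M with the same row layout as the ranges above (adjust to your actual sheet), is to filter the order numbers for the given date and job type, remove duplicates, and count what is left:
=COUNTA(UNIQUE(FILTER($M$3:$M$20,($J$3:$J$20=R3)*($L$3:$L$20="D"))))
Swap "D" for "O" or "Q" for the other job types, or point FILTER at the Customer column to count unique customers for the day.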

How to use Word2Vec CBOW in statistical algorithm?

I have seen a few examples of using CBOW in neural-network models (although I did not understand them). I know that Word2Vec is not like BOW or TFIDF, as CBOW does not produce a single value per word, and all the examples I saw used neural networks.
I have two questions:
1. Can we convert the vector to a single value and put it in a dataframe so we can use it in a logistic regression model?
2. Is there any simple code for using CBOW with logistic regression?
More explanation: in my case, I have a corpus and I want to compare the top features of BOW and CBOW.
After converting to BOW, I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 3 8 2 0
2 0 1 0 0 6 9
3 1 4 1 5 1 7
After converting to TFIDF, I get this dataset:
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 0.02 0.22 0.00 0.19
2 0 0.75 0.20 0.08 0.12 0.37 0.21
3 1 0.17 0.84 0.88 0.11 0.07 0.44
I am looking at the results of the top 3 features in each model, so my dataset becomes like this:
BOW (I put null here for the values that will be omitted)
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 5 null 8 null null 7
2 0 null null null 6 9 2
3 1 4 null 5 null 7 null
TFIDF (I put null here for the values that will be omitted)
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 0.38 0.42 null 0.22 null null
2 0 0.75 null null null 0.37 0.21
3 1 null 0.84 0.88 null null 0.44
I now want to do the same with Word2Vec CBOW, taking the highest values in the CBOW model
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 v12 v13 v14 v15 v16
2 0 v21 v22 v23 v24 v25 v26
3 1 v31 v32 v33 v34 v35 v36
to be like this
RepID Label Cat Dog Snake Rabbit Apple Orange ...
1 1 v11 null v13 null v15 null
2 0 null null v23 null v25 v26
3 1 v31 null v33 v34 null null
No matter the internal training method, CBOW or skip-gram, a word-vector is always a multidimensional vector: it contains many floating-point numbers.
So at one level, that is one "value" - where the "value" is a vector. But it's never a single number.
Word-vectors, even with all their dimensions, can absolutely serve as inputs for a downstream logistic regression task. But the exact particulars depend on exactly what data you're operating on, and what you intend to achieve - so you may want to expand your question, or ask a more specific followup, with more info about the specific data/task you're considering.
Note also: this is done more often with the pipeline of a library like scikit-learn. Putting dense high-dimensional word-vectors themselves (or other features derived from word-vectors) directly into "dataframes" is often a mistake, adding overhead & indirection compared to working with such large feature-vectors in their more compact/raw format of (say) numpy arrays.
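As a rough illustration of question 2 (a sketch, not a definitive recipe, with a made-up toy corpus): train a CBOW Word2Vec model with gensim, average each document's word-vectors into one fixed-length feature vector, and fit scikit-learn's LogisticRegression on the result:

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# toy corpus: each document is a list of tokens (invented for illustration)
docs = [['cat', 'dog', 'rabbit'], ['apple', 'orange', 'apple'],
        ['snake', 'cat', 'dog'], ['orange', 'apple', 'orange']]
labels = [1, 0, 1, 0]

# sg=0 selects the CBOW training method
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, sg=0)

def doc_vector(tokens, model):
    # average the vectors of the tokens that are in the vocabulary
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))

Note that this represents each document by one dense vector rather than by per-word columns, so there is no direct analogue of the "top 3 features" tables above.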

reading data with varying length header

I want to read, in Python, a file that contains a variable-length header and then extract the variables that come after the header into a dataframe/series.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as:
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with:
infile = "/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'])
but I'm skipping 53 rows because I counted how many I needed to skip. I have a bunch of these files, and some don't have exactly 53 header rows, so I was wondering how best to deal with this: what criterion would make Python always start reading at the three columns of data when it finds them? For instance, suppose I want Python to start reading the data from where it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
what should I do? Or is there another criterion that would work better?
You can split on the header delimiter, like so:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()

# keep only what follows the header delimiter
infile = myfile.split('Mole fraction error flag description :')[-1]

# drop the remaining flag-description lines (e.g. "0 : Valid data");
# you know the data better, so there is likely a better indicator
# of a line with incorrect format
infile = '\n'.join(line for line in infile.split('\n') if ' : ' not in line)

# create the dataframe; read_fwf needs a path or file-like object,
# so wrap the string in StringIO
flightdata = pd.read_fwf(io.StringIO(infile), header=None,
                         names=['Time', 'C2H6', 'Flag'])
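Alternatively, to keep a single read_fwf call per file, you can compute skiprows by scanning for the marker line. find_data_start below is a hypothetical helper, and the + 3 assumes exactly two flag-description lines follow the marker, as in the sample shown:

import pandas as pd

def find_data_start(path, marker='Mole fraction error flag description :'):
    # return the number of lines up to and including the marker,
    # plus the two flag-description lines that follow it
    with open(path) as f:
        for i, line in enumerate(f):
            if marker in line:
                return i + 3
    raise ValueError('marker not found in ' + path)

flightdata = pd.read_fwf(infile, skiprows=find_data_start(infile),
                         header=None, names=['Time', 'C2H6', 'Flag'])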

Excel: need to sum distinct IDs' values

I am struggling to find the sum of distinct IDs' values. An example is given below.
Week TID Ano Points
1 111 ANo1 1
1 112 ANo1 1
2 221 ANo2 0.25
2 222 ANo2 0.25
2 223 ANo2 0.25
2 331 ANo3 1
2 332 ANo3 1
2 333 ANo3 1
2 999 Ano9 0.25
2 998 Ano9 0.25
3 421 ANo4 0.25
3 422 ANo4 0.25
3 423 ANo4 0.25
3 531 ANo5 0.5
3 532 ANo5 0.5
3 533 ANo5 0.5
From the above data I need to produce the result below. Could anyone please help with an Excel formula?
Week Points_Sum
1 1
2 1.50
3 0.75
You say "sum of distinct id's value"? All the IDs are different so I'm assuming you want to sum for each different "Ano" within the week?
=SUM(IF(FREQUENCY(IF(A$2:A$17=F2,MATCH(C$2:C$17,C$2:C$17,0)),ROW(A$2:A$17)-ROW(A$2)+1),D$2:D$17))
confirmed with CTRL+SHIFT+ENTER
where F2 contains a specific week number
Assumes that each "Ano" will always have the same points value
Probably not the most efficient solution... but this array formula works:
=SUMPRODUCT(IF($A$2:$A$15=$F2,$D$2:$D$15),1/MMULT((IF($A$2:$A$15=$F2,$D$2:$D$15)=TRANSPOSE(IF($A$2:$A$15=$F2,$D$2:$D$15)))+0,(ROW($A$2:$A$15)>0)+0))
Note this is an array formula, so you have to press Ctrl+Shift+Enter after typing this formula instead of just Enter.
See working example below. This formula is in cell G2 and dragged down.
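If you have Excel 365, a simpler alternative is possible (assuming the data sits in A2:D17 with the week number in F2, and that each "Ano" always has the same points value): deduplicate the Ano/Points pairs for the week and sum the points column.
=SUM(INDEX(UNIQUE(FILTER($C$2:$D$17,$A$2:$A$17=F2)),0,2))
FILTER keeps the Ano and Points columns for the chosen week, UNIQUE collapses the repeated pairs, and INDEX(...,0,2) picks out the points column for SUM.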
