Appropriate stat test for multiple 3x3 tables

In my study, each subject produces a 3x3 contingency table. So I have a table of counts of how many times a subject responded X, Y, or Z in condition A, B, or C. Each column sums to 50 (the number of trials in the task).
    A   B   C
X  20  15  25
Y  20  15  20
Z  10  20   5
I have multiple such tables. I want to know whether it is correct to pool all the data together and run a chi-square test, or whether there is something else I can do (something that includes random factors).
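For illustration, a minimal Python sketch of the pooled chi-square, assuming per-subject 3x3 tables shaped like the example above; scipy's chi2_contingency tests independence on the pooled counts. Note that pooling throws away between-subject variability, which is exactly what a model with random factors would account for.

import numpy as np
from scipy.stats import chi2_contingency

# One 3x3 count table per subject; only the example table is filled in here.
subject_tables = [
    np.array([[20, 15, 25],
              [20, 15, 20],
              [10, 20, 5]]),
    # ... one 3x3 table per subject
]

# Pool by summing the counts cell-wise across subjects, then test independence.
pooled = np.sum(subject_tables, axis=0)
chi2, p, dof, expected = chi2_contingency(pooled)
print(chi2, p, dof)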

TRUE/FALSE ← VLOOKUP ← Identify the ROW! of the first negative value within a column

First, we have an array of predetermined factors, i.e. V through Z. Each has three attributes, the first two (• x m) multiplied together giving the third:
f ... factor
• ... cap, the maximum by which values in the data set may increase
m ... fixed multiplier
p ... let's call it power
This is a separate, standalone array that we'd access with e.g. VLOOKUP.
f   •   m   pwr
V   1   9   9
W   2   8   16
X   3   7   21
Y   4   6   24
Z   5   5   25
—————————————————————————————————————————————
Then we have 6 columns containing the actual data to be processed, from which we derive the next-level result based on the interaction of both arrays. In addition, two columns are added, for balance & profit.
Here's a short data sample:
f   •   m   bal   profit
V   2   3   377   1
Y   2   3   156   7
Y   1   1   122   0
X   1   2   -27   2
Z   3   3   223   3
—————————————————————————————————————————————
Ultimately, starting at the end, we are testing IF -27, inverted to 27, is within X's power range, i.e. 21 (as per the first array). That result is then fed into a bigger formula, beyond the scope of this post.
That part can be done with VLOOKUP; all fine so far.
—————————————————————————————————————————————
To get to that, for the working example we are focusing on row 5, since that's the one with the first negative value in the 'balance' column. So we need:
factor X = which factor exactly is unknown to us, &
balance -27 = which we have to locate amongst potentially dozens to hundreds of rows.
Why? Once we know that the factor is X, then based on the cap (•) and multiplier pertaining to it we also know which 'power' (from the top array) to compare -27, the identified first negative value in the balance column, to.
Is that clear?
I'd like to know the formula for achieving that, so I can move on with the broader-scope work.
—————————————————————————————————————————————
The main issue for me is, first, not knowing how to identify the first negative value, i.e. the row that -27 pertains to; and second, having that piece of information, how to leverage it to get the X, i.e. identify the factor type. That is especially tricky since the factor sits to the left of the balance column, and to the best of my knowledge I cannot use a negative column index number in VLOOKUP (so plain VLOOKUP, even if possible, is out of the question anyway).
To recap:
IF(21>27) = IF(-21<-27)
27 → LOCATE the ROW with the first negative number (-27)
21 → IDENTIFY the FACTOR TYPE in the same row as (-27)
→ VLOOKUP pwr, based on the factor type identified (top array, 4th column)
→ invert either 21 to a negative number or (-27) to a positive number
= TRUE/FALSE
Guessing your columns, I'll say your first chart is in columns A to D and the second in columns G to K.
You could find the letter of that factor with something like this:
=INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0)))
INDEX(J:J<0) converts that column to TRUE and FALSE depending on whether each value is negative, and with XMATCH you find the first TRUE. You could then use that in VLOOKUP:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0)
That would return the 21. You can use the first concept again to find the -27 and, with ABS, take its positive value:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0) > ABS(INDEX(J:J,XMATCH(TRUE,INDEX(J:J<0))))
That should return TRUE or FALSE from the comparison.
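For reference, the same logic as a hedged Python/pandas sketch; the two frames below are just stand-ins for the worksheet ranges, not part of the original question:

import pandas as pd

# Top array: factor, cap, multiplier, power (power = cap * multiplier).
factors = pd.DataFrame({
    "f": ["V", "W", "X", "Y", "Z"],
    "cap": [1, 2, 3, 4, 5],
    "m": [9, 8, 7, 6, 5],
    "pwr": [9, 16, 21, 24, 25],
})
# Data sample with the balance and profit columns.
data = pd.DataFrame({
    "f": ["V", "Y", "Y", "X", "Z"],
    "cap": [2, 2, 1, 1, 3],
    "m": [3, 3, 1, 2, 3],
    "bal": [377, 156, 122, -27, 223],
    "profit": [1, 7, 0, 2, 3],
})

first_neg = data[data["bal"] < 0].iloc[0]               # row with the first negative balance
pwr = factors.set_index("f").at[first_neg["f"], "pwr"]  # that factor's power (21 for X)
print(pwr > abs(first_neg["bal"]))                      # IF(21 > 27) -> False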

Calculate Average Across Multiple Pairs/Permutations

Not sure how to use AVERAGEIFS, or a combination of SUMIFS and COUNTIFS, or some other function to solve this efficiently.
Basically, assume I have the following dataset of trip times between certain points:
Start End Trip Time (Minutes)
A B 12
A B 8
B A 9
B A 2
A C 15
C A 5
C B 11
C B 9
B C 7
A B 16
A D 18
D C 21
E A 11
X Y 19
There could be n points in the dataset, but assume we are only interested in the average trip time of all trip pairs between 4 cities (A, B, C, D), i.e. AB, BA, AC, CA, BD, DB, etc., but not AA, BB, CC, DD.
How can I go about averaging the trip time between all these permutations? Much help would be appreciated... thank you!
Not very pretty, but using a named range "CITIES" (A20:A23 below):
In E3, to arrange the trips as unique pairs regardless of direction (and fill down):
=IFERROR(INDEX(CITIES,MIN(MATCH(A3,CITIES,0),MATCH(B3,CITIES,0)))&":"&
INDEX(CITIES,MAX(MATCH(A3,CITIES,0),MATCH(B3,CITIES,0))),"")
In F3:
=IF(E3<>"",AVERAGEIFS($C$3:$C$16,$E$3:$E$16,E3),"")
You can then copy, paste values, and remove duplicates to get the unique pairs.
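If it helps, here is a rough pandas equivalent of the same approach; the frame and column names below are illustrative, not from the original sheet:

import pandas as pd

trips = pd.DataFrame({
    "Start": ["A", "A", "B", "B", "A", "C", "C", "C", "B", "A", "A", "D", "E", "X"],
    "End":   ["B", "B", "A", "A", "C", "A", "B", "B", "C", "B", "D", "C", "A", "Y"],
    "Minutes": [12, 8, 9, 2, 15, 5, 11, 9, 7, 16, 18, 21, 11, 19],
})
cities = {"A", "B", "C", "D"}

# Keep only trips whose endpoints are both cities of interest (drops E->A, X->Y).
subset = trips[trips["Start"].isin(cities) & trips["End"].isin(cities)].copy()

# Canonical key so A->B and B->A land in the same bucket, mirroring the MIN/MAX trick.
subset["pair"] = [":".join(sorted(p)) for p in zip(subset["Start"], subset["End"])]
print(subset.groupby("pair")["Minutes"].mean())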

Distributing data with lower and upper boundaries in Excel

Here's a link to a screenshot with the formula used in column B and some sample data.
I have a spreadsheet with 48 rows of data in column A.
The values range from 0 to 19.
The average of these 48 rows is 8.71;
the standard deviation of the population is 3.77.
I've used the STANDARDIZE function in Excel in column B to return the z-score of each item in column A, given that I know the mean (8.71), std dev (3.77), and x (whatever is in column A).
For example (row 2) has:
x = 2
z = -1.779
Using the z-value, I want to apply a lower (4) and upper (24) boundary and calculate what the value would be in a 3rd column.
Essentially, if x = 0 (min value), then z = -2.3096, and columnC = 4 (lower boundary condition).
Conversely, if x = 19 (max value), then z = 2.9947, and columnC = 24 (upper boundary condition).
All other values between 0 and 19 would then be calculated in between.
Any ideas how I can accomplish this with a formula in column C?
So if your lowest original value is 0 and your highest is 19, and you want to re-distribute them from 4 to 24, and we assume that both scales are linear, that means:
x = 0  →  y = 4
x = 19 →  y = 24
Since both are linear, we have to use these formulas:
(1)  4 = m*0 + c
(2)  24 = m*19 + c
(3)  y = m*x + c
We solve the first for c, so we get
c = 4 - m*0
and replace the c in the second equation with that, so we get
24 = m*19 + 4 - m*0
and solve this for m as follows:
m = (24 - 4) / (19 - 0)
If we put this together with our third equation above, we get:
y = ((24 - 4) / (19 - 0)) * x + 4
So we finally have equations for m = (new upper - new lower) / (old upper - old lower) and c = new lower - m*(old lower), and we can use the numbers from our old and new lower and upper bounds to get:
m = 20/19 ≈ 1.0526 and c = 4
You can use these values with
y = m*x + c
where x is your old value in column A and y is the newly distributed value in column B.
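As a cross-check, the same linear mapping in Python, with the bounds from the question hard-coded for illustration:

# Linear re-distribution: map [old_lo, old_hi] onto [new_lo, new_hi].
old_lo, old_hi = 0.0, 19.0
new_lo, new_hi = 4.0, 24.0

m = (new_hi - new_lo) / (old_hi - old_lo)  # slope: 20/19 ≈ 1.0526
c = new_lo - m * old_lo                    # intercept: 4

def redistribute(x):
    return m * x + c

print(redistribute(0), redistribute(19), redistribute(2))  # 4.0, 24.0, ≈6.11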
(Chart omitted: visualization of how the mapping changes if you change the boundaries.)
Idea for a non-linear solution
If you want 4 and 24 as boundaries and the mean should be 12, the solution cannot be linear, of course. But you could use, for example, any other formula with three adjustable parameters a, b and c.
You can use such a formula for column D (y2) with trial values of a, b and c, and calculate the mean, min and max over column D (y2).
Then use the Solver:
Goal: the mean in $M$15 should be 12.
Secondary conditions: $M$16 = 4 (lower boundary) and $M$17 = 24 (upper boundary).
Variable cells are a, b and c: $M$11:$M$13.
The Solver will now adjust the values a, b and c so that you get very close to your goal, with these results:
the min is 4, the max is almost 24 and the mean is almost 12; that is probably the closest you can get with a numeric method.
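For comparison, a hedged Python sketch of the same numeric fitting, with scipy standing in for the Excel Solver. It assumes a quadratic y2 = a*x^2 + b*x + c as the three-parameter formula, which is only one possible choice; the exact formula from the original screenshots is not shown here:

import numpy as np
from scipy.optimize import minimize

x = np.random.default_rng(0).integers(0, 20, size=48).astype(float)  # stand-in for column A

def objective(params):
    a, b, c = params
    y2 = a * x**2 + b * x + c
    # Penalize deviation from the three targets: mean 12, min 4, max 24.
    return (y2.mean() - 12)**2 + (y2.min() - 4)**2 + (y2.max() - 24)**2

res = minimize(objective, x0=[0.0, 1.0, 4.0], method="Nelder-Mead")
a, b, c = res.x
print(a, b, c, objective(res.x))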

Comparing data frames with a level of error

I have two dataframes:
df_schematic
layer x y
0 18 -10850.0 -6550.0
1 18 -10850.0 -5750.0
2 18 -10950.0 -5850.0
3 18 -10950.0 -5450.0
4 31 -10850.0 -5350.0
5 14 -10850.0 -4950.0
6 17 2945.5 6550.0
2278 rows × 3 columns
df_report
layer x y
0 18 9161.19 -3106.42
1 18 9141.51 -3185.38
2 18 9023.40 -3185.38
3 18 9003.71 -3106.42
4 18 8800.20 -2840.65
5 17 2945.8 6549.6
2216 rows × 3 columns
I am trying to compare df_schematic with df_report and find any missing or irregular values in the report. The main problem is the level of tolerance we can allow for a coordinate.
For example:
17 2945.5 6550.0
and
17 2945.8 6549.6
are clearly not equal, but they should pass as a correct entry, as the error level is +/-0.5.
Is there any way to find the missing values while keeping the tolerance in mind?
Make some experiments with np.isclose.
I mean the following scenario:
Write a function, say isClose, comparing one pair of coordinates (x1, y1) with another pair (x2, y2) from two source rows, something like np.isclose(x1, x2, atol=0.5) & np.isclose(y1, y2, atol=0.5).
Then, taking a row from df_schematic as a "base point":
find in df_report all rows with exactly the same value of layer,
for each such row check isClose for the x and y coordinates from both rows,
until you find one where this function returns True.
Repeat this procedure for each row from df_schematic; the sketch below shows one way to put it together.
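A minimal sketch of that procedure, assuming df_schematic and df_report are pandas DataFrames with the columns shown above:

import numpy as np
import pandas as pd

def isClose(x1, y1, x2, y2, tol=0.5):
    # Both coordinates must match within the absolute tolerance.
    return np.isclose(x1, x2, atol=tol) & np.isclose(y1, y2, atol=tol)

def find_missing(df_schematic, df_report, tol=0.5):
    missing = []
    for _, base in df_schematic.iterrows():
        # Only report rows with exactly the same layer are candidates.
        candidates = df_report[df_report["layer"] == base["layer"]]
        matched = isClose(base["x"], base["y"],
                          candidates["x"].to_numpy(),
                          candidates["y"].to_numpy(), tol).any()
        if not matched:
            missing.append(base)
    return pd.DataFrame(missing)

Rows of df_schematic that have no close-enough counterpart in df_report come back as the "missing" entries.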

compare values of two dataframes based on certain filter conditions and then get count

I am new to Spark. I am writing PySpark code where I have two dataframes such that:
DATAFRAME-1:
NAME BATCH MARKS
A 1 44
B 15 50
C 45 99
D 2 18
DATAFRAME-2:
NAME MARKS
A 36
B 100
C 23
D 67
I want my output to be a comparison between these two dataframes such that I can store the counts as variables.
For instance:
improvedStudents = 1 (since D belongs to batch 1-15 and has improved his score)
badPerformance = 2 (A and B have performed badly, since they belong to batch 1-15 and their marks are lower than before)
neutralPerformance = 1 (C, because even though his marks went down, he belongs to batch 45, which we don't want to consider)
This is just an example out of a complex problem I'm trying to solve.
Thanks
If the data is as in your example, why don't you just join them and create a new column for every metric that you have (this snippet is Scala):
import org.apache.spark.sql.functions.{col, lit, when}

val df = df1.withColumnRenamed("MARKS", "PRE_MARKS")
  .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), Seq("NAME"))
  .withColumn("Evaluation",
    when(col("BATCH") > 15, lit("neutral"))
      .when(col("PRE_MARKS") gt col("POST_MARKS"), lit("bad"))
      .when(col("POST_MARKS") gt col("PRE_MARKS"), lit("improved"))
      .otherwise(lit("neutral")))
  .groupBy("Evaluation")
  .count()
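Since the question asks for PySpark, here is a rough equivalent of the snippet above; df1 and df2 are assumed to already hold the two example tables:

from pyspark.sql import functions as F

evaluated = (
    df1.withColumnRenamed("MARKS", "PRE_MARKS")
       .join(df2.withColumnRenamed("MARKS", "POST_MARKS"), on="NAME")
       .withColumn(
           "Evaluation",
           F.when(F.col("BATCH") > 15, F.lit("neutral"))
            .when(F.col("PRE_MARKS") > F.col("POST_MARKS"), F.lit("bad"))
            .when(F.col("POST_MARKS") > F.col("PRE_MARKS"), F.lit("improved"))
            .otherwise(F.lit("neutral")),
       )
)

# Collect the per-category counts into plain Python variables, as the question wants.
counts = {r["Evaluation"]: r["count"]
          for r in evaluated.groupBy("Evaluation").count().collect()}
badPerformance = counts.get("bad", 0)
improvedStudents = counts.get("improved", 0)
neutralPerformance = counts.get("neutral", 0)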
