Comparing data frames with a level of error - python-3.x

I have two dataframes:
df_schematic
layer x y
0 18 -10850.0 -6550.0
1 18 -10850.0 -5750.0
2 18 -10950.0 -5850.0
3 18 -10950.0 -5450.0
4 31 -10850.0 -5350.0
5 14 -10850.0 -4950.0
6 17 2945.5 6550.0
2278 rows × 3 columns
df_report
layer x y
0 18 9161.19 -3106.42
1 18 9141.51 -3185.38
2 18 9023.40 -3185.38
3 18 9003.71 -3106.42
4 18 8800.20 -2840.65
5 17 2945.8 6549.6
2216 rows × 3 columns
I am trying to compare df_schematic with df_report and find any missing or irregular values in the report. The main problem is the level of tolerance allowed for a coordinate.
For example:
17 2945.5 6550.0
and
17 2945.8 6549.6
are clearly not equal, but they should pass as a correct entry because the allowed error is +/-0.5.
Is there any way to find the missing values while keeping the tolerance in mind?

Experiment with np.isclose.
I mean the following scenario:
Write a function, say isClose, that compares one pair of coordinates (x1, y1) with
another pair (x2, y2), taken from two source rows, something like
np.isclose(x1, x2, rtol=0, atol=0.5) & np.isclose(y1, y2, rtol=0, atol=0.5)
(rtol=0 switches off the default relative tolerance, so only the absolute +/-0.5 applies).
Take a row from df_schematic as a "base point":
find all rows in df_report with exactly the same value of layer,
for each such row check isClose on the x and y coordinates of both rows,
until you find one where this function returns True.
If no such row exists, the base point is missing from the report.
Repeat this procedure for each row of df_schematic.
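The procedure above could be sketched roughly as follows. This is only a minimal illustration, not a tuned solution: the function name find_missing and the two tiny sample frames are made up for the example, and rtol=0 is set explicitly so that only the absolute tolerance of 0.5 is used.

```python
import numpy as np
import pandas as pd


def find_missing(df_schematic, df_report, atol=0.5):
    """Return the rows of df_schematic that have no row in df_report with
    the same layer and with x and y both within atol (hypothetical helper)."""
    missing = []
    for idx, row in df_schematic.iterrows():
        # Only rows with exactly the same layer are candidates.
        candidates = df_report[df_report["layer"] == row["layer"]]
        # rtol=0 disables the relative term, leaving a pure +/-atol check.
        close = (
            np.isclose(candidates["x"].to_numpy(), row["x"], rtol=0, atol=atol)
            & np.isclose(candidates["y"].to_numpy(), row["y"], rtol=0, atol=atol)
        )
        if not close.any():
            missing.append(idx)
    return df_schematic.loc[missing]


# Two-row samples built from values shown in the question.
df_schematic = pd.DataFrame(
    {"layer": [17, 18], "x": [2945.5, -10850.0], "y": [6550.0, -6550.0]}
)
df_report = pd.DataFrame(
    {"layer": [17, 18], "x": [2945.8, 9161.19], "y": [6549.6, -3106.42]}
)

missing = find_missing(df_schematic, df_report)
print(missing)
```

With a couple of thousand rows per frame a plain loop like this should be fast enough; for much larger data, grouping by layer first would avoid filtering df_report again for every row.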

Related

TRUE/FALSE ← VLOOKUP ← Identify the ROW! of the first negative value within a column

Firstly, we have an array of predetermined factors, ie. V-Z;
their attributes are 3, the first two (•xM) multiplied giving the 3rd.
f ... factors
• ... cap, the values in the data set may increase max
m ... fixed multiplier
p ... let's call it power
This is a separate, standalone array .. we'd access with eg. VLOOKUP
f • m pwr
V 1 9 9
W 2 8 16
X 3 7 21
Y 4 6 24
Z 5 5 25
—————————————————————————————————————————————
Then we have 6 columns, in which the actual data to be processed is in, & thereof derive the next-level result, based on the interaction of both samples introduced.
In addition, there are added two columns, for balance & profit.
Here's a short, 6-row data sample:
f • m bal profit
V 2 3 377 1
Y 2 3 156 7
Y 1 1 122 0
X 1 2 -27 2
Z 3 3 223 3
—————————————————————————————————————————————
Ultimately, starting at the end, we are comparing IF -27 inverted → so 27 is within the X's power range ie. 21 (as per the first sample) .. which is then fed into a bigger formula, beyond the scope of this post.
This can be done with VLOOKUP, all fine by now.
—————————————————————————————————————————————
To get to that .. for the working example, we are focusing coincidentally on row5, since that's the one with the first negative value in the 'balance' column, so ..
on factorX = which factor exactly is to us unknown &
balance -27 = which we have to locate amongst potentially dozens to hundreds of rows.
Why!?
Once we know that the factor is X, based on the * & multiplier pertaining to it, then we also know which 'power' (top array) to compare -27, as the identified first negative value in the balance column, to.
Is that clear?
I'd like to know the formula on how to achieve that, & (get to) move on with the broader-scope work.
—————————————————————————————————————————————
The main issue for me is not knowing how to identify the first negative value, or the row -27 belongs to, and then, having that piece of information, how to leverage it to get the X, i.e. identify the factor type, especially since it's positioned left of the balance column and to the best of my knowledge I cannot use a negative column index number in VLOOKUP (so the latter is out of the question anyway).
To recap;
IF(21>27) = IF(-21<-27)
27 → LOCATE ROW with the first negative number (-27)
21 → IDENTIFY the FACTOR TYPE, same row as (-27)
→ VLOOKUP pwr, based on factor type identified (top array, 4th column right)
→ invert either 21 to a negative number or (-27) to the positive number
= TRUE/FALSE
Guessing at your columns, I'll say your first chart is in columns A to D and the second in columns G to K.
You could find the letter of that factor with something like this:
=INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0)))
INDEX(J:J<0) converts that column to TRUE and FALSE depending on being negative or not and with XMATCH you find the first TRUE. You could then use that in VLOOKUP:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0)
That would return the 21. You can use the same idea to find the -27 and, with ABS, get its positive value:
=VLOOKUP(INDEX(G:G,XMATCH(TRUE,INDEX(J:J<0))),A:D,4,0) > ABS(INDEX(J:J,XMATCH(TRUE,INDEX(J:J<0))))
That should return TRUE or FALSE for the comparison (FALSE here, since 21 is not greater than 27).
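To check the logic outside the spreadsheet, the same steps can be mirrored in a few lines of Python. This is only an illustration of the lookup order (first negative balance, then its factor, then that factor's power), using the two samples from the question; the names power, rows and first_negative_row are made up for the sketch.

```python
# Top array: factor -> power (cap * multiplier), as in the first table.
power = {"V": 9, "W": 16, "X": 21, "Y": 24, "Z": 25}

# Data sample from the question: (factor, cap, multiplier, balance, profit).
rows = [
    ("V", 2, 3, 377, 1),
    ("Y", 2, 3, 156, 7),
    ("Y", 1, 1, 122, 0),
    ("X", 1, 2, -27, 2),
    ("Z", 3, 3, 223, 3),
]


def first_negative_row(rows):
    """Return the first row whose balance (4th field) is negative, or None."""
    for row in rows:
        if row[3] < 0:
            return row
    return None


hit = first_negative_row(rows)
factor, balance = hit[0], hit[3]

# Compare the factor's power with the balance made positive.
result = power[factor] > abs(balance)
print(factor, balance, power[factor], result)
```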

Appropriate stat test for multiple 3x3 tables

In my study, each subject produces a 3x3 contingency table. So I have a table of counts of how many times a subject responded X, Y or Z in condition A, B or C. All the columns sum to 50 (the number of trials in the task).
A B C
X 20 15 25
Y 20 15 20
Z 10 20 5
I have multiple such tables. I want to know whether it's correct to pool all the data together and run a chi-square test, or whether there is something else I can do (something that includes random factors).

Difference between consecutive maxima and minima in a .csv dataset

I have a dataset which represents tracking data of a mouse's paw moving up and down in the y-axis as it reaches up for and pulls down on a piece of string.
The output of the data is a list of y-coordinates corresponding to a fraction of a second. For example:
1 333.9929833
2 345.4504726
3 355.7046572
4 367.6136684
5 379.7906121
6 390.5470788
7 397.9017118
8 403.677123
9 412.1550843
10 416.516814
11 419.8205706
12 423.7994881
13 429.4874275
14 419.2652898
15 360.1626136
16 298.8212249
17 264.3647809
18 265.0078862
19 268.1828407
20 283.101321
21 294.8219163
22 308.4875135
In this series, there is a max value of 429... and a minimum of 264... - however, as you can see from an example image:
(excuse the gaps), there are multiple consecutive wave-like maxima and minima.
The goal is to find the difference between each maximum and the following minimum, and between each minimum and the following maximum (i.e. max1-min1, min2-max1, max2-min2...). Ideally, this would also provide the timepoints of each max and min (e.g. 13 and 17 for the provided dataset); there is a column with integer labels (1, 2, 3...) corresponding to each coordinate.
Thanks for your help!
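One possible way to approach this is to mark every point that is strictly higher or strictly lower than both of its neighbours, then take the differences between consecutive marked points. The sketch below does that for the 22 values listed above; the function names are made up for the example, and it assumes the trace is smooth enough that noise does not create spurious turning points (real tracking data would probably need smoothing first, and gaps or flat runs of equal values are not handled).

```python
y = [
    333.9929833, 345.4504726, 355.7046572, 367.6136684, 379.7906121,
    390.5470788, 397.9017118, 403.677123, 412.1550843, 416.516814,
    419.8205706, 423.7994881, 429.4874275, 419.2652898, 360.1626136,
    298.8212249, 264.3647809, 265.0078862, 268.1828407, 283.101321,
    294.8219163, 308.4875135,
]


def turning_points(values):
    """Return (label, value, kind) for each strict local max or min.
    Labels are 1-based to match the integer column in the data."""
    points = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            points.append((i + 1, cur, "max"))
        elif cur < prev and cur < nxt:
            points.append((i + 1, cur, "min"))
    return points


def swings(points):
    """Return (from_label, to_label, change) for consecutive turning points.
    The change is negative for a max -> min drop, positive for a rise."""
    return [
        (a[0], b[0], b[1] - a[1])
        for a, b in zip(points, points[1:])
    ]


extrema = turning_points(y)
changes = swings(extrema)
print(extrema)
print(changes)
```

Taking the absolute value of each change gives the amplitude of every pull and every reach, whichever sign convention is preferred.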

Microsoft Excel Pie Chart bug

The pie chart percentages are not calculated correctly by Excel. In the picture you can see that the c and d values are exactly the same, but for some reason "c" has a higher percentage assigned to it and I can't figure out why.
The values are a-21; b-5; c-11; d-11; e-3; f-5; g-1; h-39. On the pie chart the percentage received is a-22%; b-5%; c-12%; d-11%; e-3%; f-5%; g-1%; h-41%
While not the ideal solution, if you right-click one of the labels and choose the Format Data Labels option, you can change the Number display type to Percentage. This increases the number of decimal places shown in the percentage, but gives you the accurate result you asked for.
The problem is caused by your actual percentages being:
Name Val %
a 21 21.875
b 5 5.208333333
c 11 11.45833333
d 11 11.45833333
e 3 3.125
f 5 5.208333333
g 1 1.041666667
h 39 40.625
As you can see, these numbers can't be represented exactly as whole-number percentages, so compensations have to be made somewhere. It just so happened that the compensations were made on numbers that should be the same.
Another possible option would be to round your percentage results:
Name Val % Rounded %
a 21 21.875 22
b 5 5.208333333 5
c 11 11.45833333 11
d 11 11.45833333 11
e 3 3.125 3
f 5 5.208333333 5
g 1 1.041666667 1
h 39 40.625 41
The rounded values now sum to 99, instead of the 96 that your original values sum to, which results in a better graph:
You can do this using the formula =ROUND(num,0) for each of your calculated percentages.
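The arithmetic behind the two tables can be reproduced with a short script. This is just a check of the numbers above, written in Python rather than Excel; the variable names are chosen for the example.

```python
# Values from the question; the labels a-h are the pie slices.
values = {"a": 21, "b": 5, "c": 11, "d": 11, "e": 3, "f": 5, "g": 1, "h": 39}

total = sum(values.values())

# Exact share of each slice, in percent.
exact = {name: 100 * v / total for name, v in values.items()}

# Each share rounded independently to a whole number.
rounded = {name: round(p) for name, p in exact.items()}

for name in values:
    print(f"{name} {values[name]:>3} {exact[name]:>12.6f} {rounded[name]:>3}")
print("total of the original values:", total)
print("sum of the rounded percentages:", sum(rounded.values()))
```

Rounded independently, c and d both come out as 11 and the whole-number labels add up to 99 rather than 100, which is the gap the chart has to compensate for somewhere.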

In Excel aggregate values from a column if the following values are larger than zero

I would like Excel to iterate through a column of precipitation values and keep adding them as long as the value is larger than zero. The table looks like this, for example, with the required result in NewCol.
Date prec NewCol
x 10 10
x 8 18
x 3 21
x 0 0
x 1 1
x 0 0
I would like to use the value 21 (and the largest value of every other run) to assess consecutive rainfall days. Is this possible in Excel? I haven't been able to find a solution in MATLAB.
Thanks in Advance!
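The rule behind NewCol is a running total that resets to zero on every dry day. As a rough illustration of that rule only (in Python, not an Excel formula), with a made-up function name:

```python
def running_wet_totals(prec):
    """Running sum of precipitation that restarts after every zero value."""
    totals = []
    running = 0
    for value in prec:
        # Keep accumulating while it rains; a dry day resets the total.
        running = running + value if value > 0 else 0
        totals.append(running)
    return totals


prec = [10, 8, 3, 0, 1, 0]
new_col = running_wet_totals(prec)
print(new_col)
```

Each row depends only on the current value and the total in the row above, so the same rule can be written as a per-row conditional in a spreadsheet column.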
