Never worked with data structure this messy - python-3.x

I have this file for work (and 7000 others of the same format) that is very messy and not tidy in any way. I've been reading about tidying data using Pandas but feel I'm spinning my wheels at this point...
Here is the raw data viewed in Excel:
Here is some example text from the CSV:
Section 6. Reserve Summary
Ten Minute Reserve Requirement:, 1801
Ten Minute Reserve Estimate:, 1801
Thirty Minute Reserve Requirement:, 626
Thirty Minute Reserve Estimate:, 1926
Expected Actions of OP 4:, 0
Additional Capacity Available from OP 4 Actions:, 0
Section 7. Interchange Summary
Description, Import Limit MW, Export Limit MW, Scheduled, Contract
Highgate, -225, 0, -225
NB, -550, 200, -432
NYISO AC, -1400, 1200, 0
NYISO CSC, -346, 330, 330
NYISO NNC, -200, 200, 194
Phase 2, -2000, 1200, -1501
Section 8. Weather Forecast Summary for the Peak Hour
City, Conditions, Wind, High Temperature (F)
Boston, Partly Cloudy, NE-10, 66
Hartford, Mostly Clear, N-12, 77
You can see column A is useless, so I can remove it. Column B mostly has variable names but also has section names (rows 7, 9, 11...). Sometimes column B has the value, but most of the time the value is listed in column C, and sometimes in column D. Lines 44-54 have some extra formatting going on, where there is a table of variable names and values...
Anyway, I absolutely do not have the skills to turn this into a tidy dataframe and will need to throw this to someone else. However, I'm hoping anyone can give advice on what to do. Is this even called 'data cleaning' or 'data structuring'?
I dropped Col A, then transposed the data, but that is far from setting this dataframe up correctly. What are other techniques to move data into the tidy structure needed?
Any resources shared would be great! I searched for too long on 'tidy data', 'data cleaning', 'data structuring' but all were too simplistic compared to this application.
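One possible starting point, sketched under assumptions drawn only from the sample text above: lines beginning with "Section N." open a new section, lines whose first cell ends in a colon are single variable/value pairs, and the first remaining line in a section is a table header. The function name parse_report and the record layout are illustrative, not part of any library.

```python
import re

SAMPLE = """\
Section 6. Reserve Summary
Ten Minute Reserve Requirement:, 1801
Thirty Minute Reserve Estimate:, 1926
Section 7. Interchange Summary
Description, Import Limit MW, Export Limit MW, Scheduled, Contract
Highgate, -225, 0, -225
NB, -550, 200, -432
"""

def parse_report(text):
    """Return long-format records: (section, row_label, variable, value)."""
    records, section, header = [], None, None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = [c.strip() for c in line.split(",")]
        if re.match(r"Section \d+\.", cells[0]):
            section, header = cells[0], None  # new section resets the table header
        elif cells[0].endswith(":"):
            value = cells[1] if len(cells) > 1 else None
            records.append((section, None, cells[0].rstrip(":"), value))
        elif header is None:
            header = cells  # assumed: first non key/value line is a header
        else:
            # one record per table cell, labelled by the row's first column
            for name, value in zip(header[1:], cells[1:]):
                records.append((section, cells[0], name, value))
    return records

records = parse_report(SAMPLE)
```

Each record is one observation, which is the "long" shape that tidy-data write-ups aim for. The list can be passed to pandas.DataFrame with four column names, and the same function could then be applied to each of the other files.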

Related

Excel Array formula to count moving average outliers

I've tried a few things on this and settled on a 'cheap' solution. Wanted to know if this can be done directly and more elegantly.
Problem Statement and Sample Data
Assume we have a table in excel with ~200 columns and a large number of rows (~10k).
Sample Data:
identifier  val1  val2  val3  ...  val200
ID_1        100   102   34    ...  89
We want to add a column at the end that shows how many "moving average" outliers exist. A moving average outlier is a point outside the range (mean - 2 * std deviations, mean + 2 * std deviations), where the mean and standard deviation are calculated from the previous 10 values (hence "moving average" outlier).
We will not test the first 10 values. But from val11, the previous 10 values will be used to form the window and we want to test if the value is an outlier.
My Solution so far
I created another table of same dimensions as the original. In cells from val11 (to val200, for all columns), I put in the formula below in the new table. And then, I can simply sum the columns in each row in the new table.
Assume val11 is on X2 in the "shocks" worksheet (for first row):
=IF(OR(shocks!X2<AVERAGEA(shocks!D2:W2)-2*STDEVA(shocks!D2:W2),shocks!X2>AVERAGEA(shocks!D2:W2)+2*STDEVA(shocks!D2:W2)),1,"")
But if possible, I want to avoid having a second table since it bloats and slows down the file. Any help would be greatly appreciated.
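For reference, the rule described above can be sketched outside Excel. This is a minimal illustration, not a replacement for the worksheet formula: the function name count_outliers and its parameters are made up here, and it uses the sample standard deviation, which is what STDEVA computes for numeric cells.

```python
from statistics import mean, stdev

def count_outliers(values, window=10, k=2.0):
    """Count points lying outside mean +/- k * stdev of the previous `window` values."""
    count = 0
    for i in range(window, len(values)):
        previous = values[i - window:i]      # the 10 values before position i
        m, s = mean(previous), stdev(previous)
        if values[i] < m - k * s or values[i] > m + k * s:
            count += 1
    return count

# Hypothetical row: ten flat values, then a normal point, a spike, a normal point
row = [10] * 10 + [10, 50, 10]
outliers = count_outliers(row)
```

Only the spike is counted here: the point after it is tested against a window that already contains the spike, so the window's standard deviation is large enough to absorb it.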

VLOOKUP exact value and closest date

I have different water tanks and 2 employees who measure them. Sometimes they measure the volume of a tank on the same day and sometimes not. I want to see how much their measurements differ. Since the dates are not always the same, I would like to look up, for each of Bob's readings, the volume Sarah recorded for the same tank on the date closest to Bob's reading date.
Bob's Readings
Water_Tank_Name  Date        Volume
Red              15/02/2021  300
Blue             15/02/2021  145
Red              21/02/2021  280
Red              04/03/2021  339
Blue             05/03/2021  170
Sarah's Readings
Water_Tank_Name  Date        Volume
Blue             15/02/2021  148
Blue             19/02/2021  190
Red              23/02/2021  294
Blue             01/03/2021  140
I used XLOOKUP, but that only returns a value if both the exact Water_Tank_Name and the exact Date are found. However, I would like to match the Water_Tank_Name exactly and match to the closest Date.
=XLOOKUP(Bob!A2 & Bob!B2, Sarah!A:A & Sarah!B:B, Sarah!C:C)
You could use (with Excel 365):
=LET( tf, Bob!A2, df, Bob!B2,
tS, Sarah!A:A, dS,Sarah!B:B, dV, Sarah!C:C,
L, tS & dS,
S, SIGN(ABS(IFERROR(XLOOKUP(tf & df, L, dS,,-1)-df, 999)) - ABS(IFERROR(XLOOKUP(tf & df, L, dS,,1)-df, 999))),
XLOOKUP(tf & df, L, dV,,S) )
Where tf is the tank identity that you want to use for the search and df is the date value that you want to search. This finds the nearest date and determines if it is smaller or larger than df and then tells the XLOOKUP to search for the next larger or smaller (S is either 1 or -1) that will arrive at the nearest date. It might be possible to replace the two XLOOKUPs for S with FILTERs, but I am not sure if it would be faster. The use of whole columns for Sarah should be replaced with Excel table columns, otherwise it will run slowly.
I am almost positive there is something that is less verbose than this monstrosity, but it works and makes sense if you take it apart piece by piece.
=IFERROR(FILTER($EG$12:$EG$15,($EE$12:$EE$15=$EE5)*($EF$12:$EF$15=$EF5+MIN(ABS(EF5-FILTER($EF$12:$EF$15, $EE$12:$EE$15=$EE5))))),FILTER($EG$12:$EG$15,($EE$12:$EE$15=$EE5)*($EF$12:$EF$15=$EF5-MIN(ABS(EF5-FILTER($EF$12:$EF$15, $EE$12:$EE$15=$EE5))))))
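The logic both formulas aim for (exact match on the tank, nearest match on the date) can be sketched in a few lines of Python. This is only an illustration using Sarah's readings from the question; the function name nearest_reading is made up, and how to break a tie between two equally distant dates is an assumption (the earlier-listed reading wins).

```python
from datetime import date

# Sarah's readings from the question: (tank, date, volume)
sarah = [
    ("Blue", date(2021, 2, 15), 148),
    ("Blue", date(2021, 2, 19), 190),
    ("Red", date(2021, 2, 23), 294),
    ("Blue", date(2021, 3, 1), 140),
]

def nearest_reading(tank, when, readings):
    """Volume recorded for `tank` on the date closest to `when`, or None if the tank is absent."""
    same_tank = [r for r in readings if r[0] == tank]
    if not same_tank:
        return None
    # min() returns the first of several equally distant dates
    closest = min(same_tank, key=lambda r: abs((r[1] - when).days))
    return closest[2]

# Bob's first Red reading (15/02/2021) and last Blue reading (05/03/2021)
result_red = nearest_reading("Red", date(2021, 2, 15), sarah)
result_blue = nearest_reading("Blue", date(2021, 3, 5), sarah)
```

With this data, Bob's Red reading of 15/02/2021 pairs with Sarah's 23/02/2021 value (294), and his Blue reading of 05/03/2021 pairs with her 01/03/2021 value (140).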

(Excel)Calculating costs, where prices differ based on quantities

I'm looking for some help as I'm not really sure of the correct terms to use on my query below, so whilst normally I would google this, I'm not really sure what to search for.
I need to work out the total cost for something, where you have a flat rate, and then an additional cost that changes depending on how much of something you have.
So, as an example, you get expenses paid for mileage. If you drive 0-20 miles, you'll get £10. Between 30-50 miles you get 50p per mile. Between 51-100 miles you get £1 per mile and so on, added onto the base rate of the initial £10 you'd get paid as standard.
It's not the best example, but hoping it gives an idea of what I'm after.
If I was doing this by hand I'd know how to work it out, but I'm not too sure what kind of formula I need to be using - I've never had to work with complex formulas past "=sum" until now.
If anyone has any examples they can share or can point me in the right direction of what kind of things to google I'd be most grateful !
Thanks
Well, here is one way, but you don't state what the rate is between 21 and 30...
very basic, but you should be able to edit and expand as you want.
Do note that the limits (30 miles, 50 miles) and rates used in the formula all come from the sheet - so if the 30 mile limit changes to 25 miles - all you need to do is change cell A7...
I apologize for not answering sooner, but I find this question a bit difficult to address due to the complexity of formulas we can encounter. I know the one you documented is not the most complex one we might encounter, but I was not sure if that was your actual problem or if it was intended as a simple example. I have seen a variety of other things which have often thrown me for a loop.
For example, take this set of rules:
Minimum Fee is $23.50 up to $500
$501 - $2,000 = $3.05 per 100 unit increment
$2,001 - $25,000 = $14.00 per 1000 unit increment over $2,000
$25,001 - $50,000 = $10.10 per 1000 unit increment over $25,000
$50,001 - $100,000 = $7.00 per 1000 unit increment over $50,000
$100,001 - $500,000 = $5.60 per 1000 unit increment over $100,000
$500,001 - $1,000,000 = $4.75 per 1000 unit increment over $500,000
$1,000,001 - $9,999,000 = $3.65 per 1000 unit increment over $1,000,000
$10,000,001 and up = $3.65 per 1000 unit increment over $10,000,000
It does not look too different from yours except that there is an increment of something other than a single unit. In other words, for the $501 to $2,000 range, $501 to $600 would all get the same additional $3.05 incremental charge. Another dollar would actually double this because it jumps to the next increment. Like your example, each range builds on the prior range. Assuming that these amounts are in columns A through F:
i  Low         High       Fee     Base Fee  Per
0  1           500                23.50
1  501         2,000      $3.05             100
2  2,001       25,000     $14.00            1000
3  25,001      50,000     $10.10            1000
4  50,001      100,000    $7.00             1000
5  100,001     500,000    $5.60             1000
6  500,001     1,000,000  $4.75             1000
7  1,000,001   9,999,999  $3.65             1000
8  10,000,000             $3.65             1000
Note also that the rate declines as the amounts increase whereas yours appears to increase.
What I did with this is create a maximum value in Column H as follows:
i Max
0 =E3
1 =INT((C4-C3)/F4)*D4
2 =INT((C5-C4)/F5)*D5
3 =INT((C6-C5)/F6)*D6
4 =INT((C7-C6)/F7)*D7
5 =INT((C8-C7)/F8)*D8
6 =INT((C9-C8)/F9)*D9
7 =INT((C10-C9)/F10)*D10
8
The first one, where i is zero, is simply the base fee. The others are computed and copied. There is no maximum for the last row. I did not really think I needed this column but it made it easier to devise the formulas.
Assuming that I put an amount to evaluate in Cell I2, it will be evaluated as follows where the formula in row 3 (where i=0) is the set fee but all others are basically a copied formula:
i 4,950
0 =IF(I$2>=$B3,$H3,0)
1 =IF(I$2>=$B4,IF($H4="",INT((I$2-$C3)/$F4)*$D4,MIN($H4,INT((I$2-$C3)/$F4)*$D4)),0)
2 =IF(I$2>=$B5,IF($H5="",INT((I$2-$C4)/$F5)*$D5,MIN($H5,INT((I$2-$C4)/$F5)*$D5)),0)
3 =IF(I$2>=$B6,IF($H6="",INT((I$2-$C5)/$F6)*$D6,MIN($H6,INT((I$2-$C5)/$F6)*$D6)),0)
4 =IF(I$2>=$B7,IF($H7="",INT((I$2-$C6)/$F7)*$D7,MIN($H7,INT((I$2-$C6)/$F7)*$D7)),0)
5 =IF(I$2>=$B8,IF($H8="",INT((I$2-$C7)/$F8)*$D8,MIN($H8,INT((I$2-$C7)/$F8)*$D8)),0)
6 =IF(I$2>=$B9,IF($H9="",INT((I$2-$C8)/$F9)*$D9,MIN($H9,INT((I$2-$C8)/$F9)*$D9)),0)
7 =IF(I$2>=$B10,IF($H10="",INT((I$2-$C9)/$F10)*$D10,MIN($H10,INT((I$2-$C9)/$F10)*$D10)),0)
8 =IF(I$2>=$B11,IF($H11="",INT((I$2-$C10)/$F11)*$D11,MIN($H11,INT((I$2-$C10)/$F11)*$D11)),0)
The Fee for this is the sum of all of the rows (labeled i, 0 through 8 above). In this example, it would be 23.50 plus 45.75 plus 28.00 for a total of 97.25.
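As a cross-check of the 97.25 figure, the same row-by-row logic can be sketched in Python. This is an illustration only: the name tiered_fee is made up, the rates come from the rule list above, the range limits come from the Low/High columns, and Decimal is used so the cents stay exact.

```python
from decimal import Decimal

BASE_FEE = Decimal("23.50")  # row i=0: flat fee covering amounts 1 to 500
# Rows i=1..8: (low, high, fee per increment, increment size); high=None means no maximum
TIERS = [
    (501, 2_000, Decimal("3.05"), 100),
    (2_001, 25_000, Decimal("14.00"), 1_000),
    (25_001, 50_000, Decimal("10.10"), 1_000),
    (50_001, 100_000, Decimal("7.00"), 1_000),
    (100_001, 500_000, Decimal("5.60"), 1_000),
    (500_001, 1_000_000, Decimal("4.75"), 1_000),
    (1_000_001, 9_999_999, Decimal("3.65"), 1_000),
    (10_000_000, None, Decimal("3.65"), 1_000),
]

def tiered_fee(amount):
    """Sum the base fee and each tier's whole increments, capped at the tier maximum."""
    total = BASE_FEE if amount >= 1 else Decimal("0")
    prev_high = 500
    for low, high, fee, per in TIERS:
        if amount >= low:
            steps = (amount - prev_high) // per                 # INT((amount - prior High) / Per)
            if high is not None:
                steps = min(steps, (high - prev_high) // per)   # the "Max" column
            total += steps * fee
        if high is not None:
            prev_high = high
    return total

fee = tiered_fee(4_950)
```

For 4,950 this gives 23.50 + 15 x 3.05 + 2 x 14.00 = 97.25, matching the sum of the spreadsheet rows.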
Not too bad. How about a set like this:
No fee if $1,000 or less
$1,001 - $5,000 = $80.00 + 3% of excess over $1,000.00 per 100 unit increment
$5,001 - $10,000 = $250.00 + 2% of excess over $5,000.00 per 500 unit increment
$10,001 - $25,000 = $350.00 + 1% of excess over $10,000.00 per 1000 unit increment
$25,001 and Over = $520.00 + 3/4% of excess over $25,000.00 per 1000 unit increment
In your formula, the initial flat amount never changes and once you've computed the amount for that range, other ranges build upon it. Here, there are steps. For example at $1,000 the fee is zero, but at $1,001, it jumps to $80 as if there were an $80 fee for the first 1000. Without boring you with the entire table, here is the formula for computing the range from 5,001 to 10,000 assuming that G2 contains the amount to use and Row 5 columns A through E are the following:
Low High Rate Minimum Increment
5,001 10,000 2.00% 250 500
=($D5+$C5*INT(($G$2-($A5-1))/$E5)*$E5)*($G$2>=$A5)*OR($B5="",$G$2<=$B5)
The formula simply looks at the current row and does the computation if the amount in G2 falls within the range from Column A to Column B.
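The same row-wise rule, applied to all four ranges of this stepped schedule, might look like the sketch below. The name stepped_fee is invented for illustration, and 3/4% is written as 0.0075.

```python
from decimal import Decimal

# (low, high, rate, minimum, increment); high=None means "and over"
STEPPED = [
    (1_001, 5_000, Decimal("0.03"), Decimal("80"), 100),
    (5_001, 10_000, Decimal("0.02"), Decimal("250"), 500),
    (10_001, 25_000, Decimal("0.01"), Decimal("350"), 1_000),
    (25_001, None, Decimal("0.0075"), Decimal("520"), 1_000),
]

def stepped_fee(amount):
    """Minimum for the matching range plus the rate applied to whole increments of the excess."""
    for low, high, rate, minimum, increment in STEPPED:
        if amount >= low and (high is None or amount <= high):
            # INT((amount - (low - 1)) / increment) * increment
            whole_excess = (amount - (low - 1)) // increment * increment
            return minimum + rate * whole_excess
    return Decimal("0")  # no fee at 1,000 or less

fee_7300 = stepped_fee(7_300)
```

For 7,300 the excess over 5,000 is 2,300, which rounds down to 2,000 in whole 500-unit increments, so the fee is 250 + 2% x 2,000 = 290.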
A simplification of all of the above comes when each range cumulatively builds on the prior ranges AND the rate of payment is always increasing, like the U.S. Tax Tables:
Over Not Over
0 9,525 10% of taxable income
9,525 38,700 $952.50 plus 12% of the excess over $9,525
38,700 82,500 $4,453.50 plus 22% of the excess over $38,700
82,500 157,500 $14,089.50 plus 24% of the excess over $82,500
157,500 200,000 $32,089.50 plus 32% of the excess over $157,500
200,000 500,000 $45,689.50 plus 35% of the excess over $200,000
500,000 $150,689.50 plus 37% of the excess over $500,000
Here, we can use something referred to as the "deskpad method" to shortcut the computation.
Assuming that the amount to be evaluated is in G1 and these are in column A through C starting in Row 1:
Over Not Over Rate
0 9,525 10.0%
9,525 38,700 12.0%
38,700 82,500 22.0%
82,500 157,500 24.0%
157,500 200,000 32.0%
200,000 500,000 35.0%
500,000 37.0%
We compute the amount based on G1 as follows:
=ROUND(SUMPRODUCT($C$2:$C$8-$C$1:$C$7,$G$1-$A$2:$A$8,N($G$1>$A$2:$A$8)),0)
Note: this is not entered as an array formula.
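To see why the SUMPRODUCT works, note that each bracket only adds the difference between its rate and the previous rate, applied to the income above that bracket's threshold. A minimal Python sketch of that idea follows (the name bracket_tax is made up, and no final rounding is applied):

```python
from decimal import Decimal

# (threshold "Over", marginal rate) from the table above
BRACKETS = [
    (0, "0.10"), (9_525, "0.12"), (38_700, "0.22"), (82_500, "0.24"),
    (157_500, "0.32"), (200_000, "0.35"), (500_000, "0.37"),
]

def bracket_tax(income):
    """Sum of (rate - previous rate) * (income - threshold) over thresholds below income."""
    total = Decimal("0")
    prev_rate = Decimal("0")
    for over, rate in BRACKETS:
        rate = Decimal(rate)
        if income > over:
            total += (rate - prev_rate) * (income - over)
        prev_rate = rate
    return total

tax = bracket_tax(50_000)
```

For 50,000 this gives 10% x 50,000 + 2% x 40,475 + 10% x 11,300 = 6,939.50, the same as the table's $4,453.50 plus 22% of the excess over $38,700.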
How does this relate to your question? If the need is as simple as you stated (in other words, the rate is always increasing and we do not have any "steps" in the reimbursement), we can compute it similarly to the U.S. Tax computation.
I created these values in columns A through D starting in row 1:
Over Not Over
0 20 £- Flat Amount of £10.00
20 50 £0.50 £10.00 plus £.50 per mile over 20 miles
50 100 £1.00 £25.00 plus £1.00 per mile over 50 miles
100 £1.50 £75.00 plus £1.50 per mile over 100 miles
where column D is just descriptive. I put the £10.00 flat fee in Cell E1.
Assuming that G1 contains the number of miles, we would compute the reimbursement as:
=$E$1+ROUND(SUMPRODUCT($C$2:$C$5-$C$1:$C$4,$G$1-$A$2:$A$5,N($G$1>$A$2:$A$5)),2)
For example, when G1 is 52 miles, the computation is £27.00
Note: this is not entered as an array formula.
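The £27.00 figure can be checked the "by hand" way, band by band, which is a different route to the same number. A short Python sketch, where the name reimbursement is made up and the bands and rates are the ones assumed in the table above:

```python
FLAT = 10.0  # flat amount paid for any trip
# (over, not over, rate per mile); None means no upper limit
BANDS = [(20, 50, 0.50), (50, 100, 1.00), (100, None, 1.50)]

def reimbursement(miles):
    """Flat amount plus, for each band, the miles falling inside it times its rate."""
    total = FLAT
    for over, not_over, rate in BANDS:
        upper = miles if not_over is None else min(miles, not_over)
        total += max(0, upper - over) * rate
    return total

paid = reimbursement(52)
```

At 52 miles that is £10 + 30 miles at 50p + 2 miles at £1 = £27.00, and at 100 miles it reaches the £75.00 shown in the description column.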
So, if this is the situation, what you would need is a place to house Columns A through C, a place to house the flat amount and a formula similar to what I provided to compute the reimbursement based on the cell housing the number of miles.
Please note that all the earlier items indicate that this formula will not be so simple if the rate is stepped or the rate declines or if the incremental unit is something other than 1 mile.
I hope that some of this makes sense. Good luck.
Things to google : "nested IF in excel"
How to do this in a one-line-formula : enter " =IF(A1<20,10,IF(A1>50,IF(A1>50,10+A1,"u"),0.5*(A1))) " in B1, your mileage in A1.
To learn building this :
identify the conditions :
condition1 > 0-20 miles, you'll get £10.
condition2 > between 30-50 miles you get 50p per mile
condition3 > between 51-100 miles you get £1 per mile added onto £10
put the conditions into IF() statement
For condition1 > just type " =if(a1<20,10,0) " at B2 (and try it!) (:
Note : The syntax for IF() function is if("condition","if-true-do-this","if-false-do-this")
Thus, for condition2 > " =if(a1>20,a1*0.5,0) "
And for condition3 > " =if(a1>50,if(a1>50,10+a1),0) " correction : should be " =if(a1>50,10+a1,0) "
Combining all the conditions > "=IF(A1>20,IF(A1>50,IF(A1>50,10+A1,"error"),0.5*(A1)),10) "
Notice that I changed 0 in the "if-false-do-this" part of the equation just to make sure it shows something when the mileage entered is less than 0.
Hope that helps. /(^_^)

Find a temperature and work out how long it remained >= this temperature

I have an excel sheet with times in one column and temperatures in another. I'm trying to work out a formula that will find a certain temperature and measure how long it remained at that temperature.
11:25:29 AM 69.3°C
11:26:29 AM 69.6°C
11:27:29 AM 69.8°C
11:28:29 AM 70.0°C
11:29:29 AM 70.2°C
11:35:29 AM 70.8°C
11:36:29 AM 70.3°C
11:37:29 AM 69.5°C
11:38:29 AM 68.5°C
11:39:29 AM 67.5°C
12:39:29 PM 66.3°C
1:39:29 PM 52.1°C
2:39:29 PM 12.1°C
3:39:29 PM 5.0°C
In this example, I would like to find when it hit 70.0°C and how long it stayed above 70.0°C.
This is a bit of a tough problem because you might have multiple occasions where you go above 70 degrees. In that case, do you want the total time spent above 70 in the entire dataset, or do you want the total time spent above 70 consecutively? And then, how are you determining which of these potential multiple nonconsecutive periods you are talking about?
That said, you can try this. If column A is your datetime, and column B is your temp reading, specify another cell as your temperature reference value ($D$1 here), and in column C starting in row 2 do this:
=(A2-A1)*IF(B2>=$D$1,1,0)
and then copy that all the way down. What that does is calculate the time difference between measurements; if the temperature at that time is greater than or equal to your reference, it multiplies the difference by 1, otherwise by 0. Because a date/time in Excel is really just a number, what you get in each cell of column C is the interval between measurements expressed as a fraction of a day. In other words, .25 = 6 hours.
Now that you have that data in column C, you are free to further parse it. You can use a simple SUM(C:C) formula in a cell, or you can go back and sum up individual ranges. I hope this helps.
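The column C approach can be sketched in Python on part of the sample data. This mirrors the formula exactly: an interval is counted when the reading at its end is at or above the reference. The name time_at_or_above is made up, and the AM times are written in 24-hour form for parsing.

```python
from datetime import datetime, timedelta

# A slice of the sample readings (all AM, so 24-hour form is identical)
readings = [
    ("11:27:29", 69.8), ("11:28:29", 70.0), ("11:29:29", 70.2),
    ("11:35:29", 70.8), ("11:36:29", 70.3), ("11:37:29", 69.5),
]

def time_at_or_above(readings, threshold):
    """Sum the gaps between consecutive readings whose later reading is >= threshold."""
    parsed = [(datetime.strptime(t, "%H:%M:%S"), temp) for t, temp in readings]
    total = timedelta(0)
    for (prev_time, _), (time, temp) in zip(parsed, parsed[1:]):
        if temp >= threshold:      # same test as IF(B2>=$D$1,1,0)
            total += time - prev_time
    return total

total = time_at_or_above(readings, 70.0)
```

On these rows the total is 9 minutes (1 + 1 + 6 + 1). Note that the six-minute gap before 11:35:29 is counted in full, which assumes the temperature stayed at or above 70.0°C between those two readings.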

Using the Excel's Rank() function to calculate allocations based on ranking and constraints

I have the following table set up
Limit Allocation Yield Ranking
$600 [to calc] 0.07% 7
$600 0.09% 6
$600 0.20% 1
$400 0.20% 1
$400 0.13% 4
$200 0.19% 3
$200 0.12% 5
Additionally, I have a constraint: I can only allocate a total of $2000 across the 7 rows here, in order of their yield ranking (so a higher yield gets allocated up to its Limit for as long as anything is left of the $2000 total).
I was wondering how I could set up the equations so that it could perform the allocation automatically. Thanks!
I'm going to assume this table starts in A1...
In E1, put the amount you have to allocate
In B2 (and then copied to B3...B8) use the following formula
=MIN(A2,$E$1-SUMIF($D$2:$D$8,">"&D2,$B$2:$B$8))
This will work out how much has been taken by higher ranked rows, and take the rest, up to whichever is the lesser of their limit and what is left in the pot.
There is one fault with this equation that you will need to figure out how to handle:
If there are equal ranks at the end of the distribution, then both will get the final amount. (e.g. try this with $2,001, and you will see that the 2 rows that have the rank 1 will both claim the final dollar)
To solve the problem caused by tied ranks: in the rank column D, add a small tie-breaker to the rank, e.g. =rank(c2,$c$2:$c$8,0) + (.0000001 * row(a2)), using whatever row you are in. Then format the rank column to show only integers. The very small decimal addition acts as the tie-breaker, so the first row with the matching integer rank takes the allocation. Since you are adding it to the rank, it doesn't affect any totals, and with the column displayed as integers the viewer will not be aware of the tie-breaker.
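The allocation rule from the question, with ties broken by row order as the tie-breaker suggests, can be sketched in Python. The name allocate is made up, and the sketch assumes that rank 1 (the highest yield) is filled first; if your ranking runs the other way, reverse the sort.

```python
def allocate(limits, yields, budget):
    """Fill rows in descending yield order, each up to its limit, until the budget runs out."""
    # sorted() is stable, so rows with equal yield keep their original row order
    order = sorted(range(len(limits)), key=lambda i: -yields[i])
    allocation = [0] * len(limits)
    remaining = budget
    for i in order:
        allocation[i] = min(limits[i], remaining)
        remaining -= allocation[i]
    return allocation

# The table from the question, yields in percent
limits = [600, 600, 600, 400, 400, 200, 200]
yields = [0.07, 0.09, 0.20, 0.20, 0.13, 0.19, 0.12]
result = allocate(limits, yields, 2000)
```

With $2000 this gives 0, 200, 600, 400, 400, 200, 200 in row order: every row is filled to its limit except the 0.09% row, which takes the last $200, and the 0.07% row, which gets nothing.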
