I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR   COLOR    NAME
1900   Green    David
1901   Yellow   Sarah
1902   Green    ???
1902   Red      Sarah
…      …        …
2020   Purple   John
Any value for any field can be repeated in the dataset (also Year values).
The first two columns have no missing values, but only around 20% of the Name values in the third column are known. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example, in a boxplot).
I have imagined a process like the following, although I am not sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at using OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the categorical data to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands (for example, if year > 1985, etc.) within your for loop to specify the names.
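The process described in the question can be sketched in plain Python. This is only one possible reading of it, not the encoder-based approach above: it estimates how often each known name appears for a given colour and year band, uses those frequencies as sampling weights for the unknown rows, and repeats the fill-in 30 times. The 20-year band width, the smoothing constant alpha and all function names are assumptions made for illustration, and the five rows are the ones visible in the question's table.

```python
import random
from collections import Counter, defaultdict

def year_band(year, width=20):
    # Assumption: group years into 20-year bands so sparse years share data.
    return year // width

def fit_counts(rows):
    # Count known names overall and per (year band, color) cell.
    overall = Counter()
    per_cell = defaultdict(Counter)
    for year, color, name in rows:
        if name is not None:
            overall[name] += 1
            per_cell[(year_band(year), color)][name] += 1
    return overall, per_cell

def name_weights(year, color, overall, per_cell, alpha=1.0):
    # Weight = count in the matching cell + a small share of the overall
    # frequency, so every known name keeps a non-zero chance (assumption).
    cell = per_cell.get((year_band(year), color), Counter())
    total = sum(overall.values())
    names = sorted(overall)
    weights = [cell[n] + alpha * overall[n] / total for n in names]
    return names, weights

def impute_once(rows, overall, per_cell, rng):
    # Fill every unknown name by weighted random choice, then count names.
    counts = Counter()
    for year, color, name in rows:
        if name is None:
            names, weights = name_weights(year, color, overall, per_cell)
            name = rng.choices(names, weights=weights, k=1)[0]
        counts[name] += 1
    return counts

def simulate(rows, runs=30, seed=0):
    rng = random.Random(seed)
    overall, per_cell = fit_counts(rows)
    return [impute_once(rows, overall, per_cell, rng) for _ in range(runs)]

# None marks an unknown NAME ("???" in the table).
rows = [
    (1900, "Green", "David"),
    (1901, "Yellow", "Sarah"),
    (1902, "Green", None),
    (1902, "Red", "Sarah"),
    (2020, "Purple", "John"),
]
results = simulate(rows)
```

Each element of results is one run's name counts, so the 30 counts for a given name are the values a boxplot for that name would summarise.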
I have searched a lot for a solution to this, but have found nothing. Maybe that's because it's a little hard to describe, or at least, I'm having trouble describing it for a search engine.
I have two columns of dates, the first column is the date a purchase order was received to be inspected, the second is the date that purchase order was accepted or rejected. What I would like is a graph with dates on the X-axis, and then the number of purchase orders in the queue on that day on the Y-axis.
Some purchase orders are completed that day, so they would still be counted, but they might not get addressed for days or weeks, so they would be counted on all those days until they were addressed.
I've been trying to do this with a formula, but am stumped. I feel like I might need to use multiple formulas, or go over to VBA, but my VBA is a little limited.
Edit: Here is a sample dataset.
Date In Date Out
9/1/18 9/1/18
9/1/18 9/1/18
9/1/18 9/2/18
9/1/18 9/3/18
9/2/18 9/2/18
9/2/18 9/4/18
So, it would be 4 for 9/1/18, 4 for 9/2/18, 2 for 9/3/18, and 1 for 9/4/18.
I have tried using COUNTIFS, but I don't know how to check between the two columns for the "between" dates.
If your data is in columns A and B, put your dates in column C (the X axis of your chart); then in column D you can write =COUNTIFS($A$1:$A$1000,"<="&C1,$B$1:$B$1000,">="&C1). The first condition checks that the order had arrived by that date (column A), and the second that it had not yet been closed before that date (column B). COUNTIFS only counts a row when all of its conditions are met (a little weird, but definitely useful). See screenshot.
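To check the counting rule outside Excel, here is a small Python sketch of the same idea: an order is in the queue on a given day if its Date In is on or before that day and its Date Out is on or after it. It uses the sample dataset from the question and assumes the dates are month/day/year (so 9/1/18 is 1 September 2018); the function name is made up for illustration.

```python
from datetime import date, timedelta

# (Date In, Date Out) pairs from the sample dataset.
orders = [
    (date(2018, 9, 1), date(2018, 9, 1)),
    (date(2018, 9, 1), date(2018, 9, 1)),
    (date(2018, 9, 1), date(2018, 9, 2)),
    (date(2018, 9, 1), date(2018, 9, 3)),
    (date(2018, 9, 2), date(2018, 9, 2)),
    (date(2018, 9, 2), date(2018, 9, 4)),
]

def queue_by_day(orders):
    # Walk every calendar day from the first Date In to the last Date Out.
    first = min(d_in for d_in, _ in orders)
    last = max(d_out for _, d_out in orders)
    counts = {}
    day = first
    while day <= last:
        # An order counts on every day between its two dates, inclusive.
        counts[day] = sum(1 for d_in, d_out in orders if d_in <= day <= d_out)
        day += timedelta(days=1)
    return counts

queue = queue_by_day(orders)
for day, n in queue.items():
    print(day.isoformat(), n)
```

For the sample data this gives 4, 4, 2 and 1 for 9/1/18 through 9/4/18, matching the counts expected in the question.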
I am working on a model of the charging load of electric vehicles. I am attaching a link to an Excel workbook for better understanding.
Column B contains random time values.
Columns G to P represent houses, and each house can have 1 car, so each time value needs to be placed in one column. When a car is plugged in, its load stays constant for 3 cells.
I want Excel to randomly distribute these cars, e.g. 4 cars to 4 houses, and leave the others blank.
What I can think of is to assign each time a random house, then use an IF formula with an AND function whose first condition matches the random times with the time series and whose second condition matches the random houses with columns 1-10.
The problem I am facing is that the formula gives a value error and only works in the rows that have a randomly generated time in front of them (screenshot). I know there is a very small thing that I am missing. Please help me find it.
Regards
workbook
=IF(ISNA(MATCH(G$5,$C$6:$C$9,FALSE)),"",IF(AND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))>=$F6,INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))<=$F6+TIME(0,30,0)),11,""))
The two elements in the AND find the house number in column C and return the corresponding time in column B.
The first element compares the time in F to that time. The second element compares the time + 30 minutes to F (three cells). If it's between those two times, it gets an 11.
The ISNA makes sure that the house in question is on the list. You could also use an IFERROR, but I prefer the precision of ISNA.
Update
If you want the values to wrap around, you need to OR compare to the next day.
=IF(ISNA(MATCH(G$5,$C$6:$C$9,FALSE)),"",IF(OR(AND(ROUND($F6,5)>=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE)),5),ROUND($F6,5)<=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))+TIME(0,30,0),5)),AND(ROUND($F6+1,5)>=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE)),5),ROUND($F6+1,5)<=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))+TIME(0,30,0),5))),11,""))
That formula structure looks like
=If(isna(),"",if(or(and(today,today),and(tomorrow,tomorrow)),11,""))
This formula is already getting too big. If you triple it for your three voltages, it will be huge. You should consider writing a UDF in VBA. It won't be as quick to calculate, but will probably be more maintainable.
If you want to stick with a formula, you could put the wattage in row 4 above the house number. Then in another table, list the wattages and minutes to charge. So in, say, B12:C14 you have
3.7 120
11 30
22 15
Now where you have 11 in your formula, you'd have G$4, and in the two places you have TIME(0,30,0), you'd have TIME(0,INDEX($C$12:$C$14,MATCH(G$4,$B$12:$B$14,FALSE)),0). I re-arranged some stuff to make it more 'readable' (but it's still pretty tough) and here's the final formula
=IF(ISNA(MATCH(G$5,$C$6:$C$9,FALSE)),"",IF(OR(AND(ROUND($F6,5)>=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE)),5),ROUND($F6,5)<=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))+TIME(0,INDEX($C$12:$C$14,MATCH(G$4,$B$12:$B$14,FALSE)),0),5)),AND(ROUND($F6+1,5)>=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE)),5),ROUND($F6+1,5)<=ROUND(INDEX($B$6:$B$9,MATCH(G$5,$C$6:$C$9,FALSE))+TIME(0,INDEX($C$12:$C$14,MATCH(G$4,$B$12:$B$14,FALSE)),0),5))),G$4,""))
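The logic inside that formula may be easier to follow as a small function. The Python sketch below is not a VBA UDF; it only illustrates the same test: a time slot gets the car's load if it falls between the plug-in time and the plug-in time plus the charging duration, with wrap-around past midnight. Times are expressed as minutes since midnight, which is an assumption made for the example, and the names load_at and CHARGE_MINUTES are hypothetical. The durations come from the lookup table above.

```python
MINUTES_PER_DAY = 24 * 60

# Wattage -> minutes to charge, as in the B12:C14 lookup table.
CHARGE_MINUTES = {3.7: 120, 11: 30, 22: 15}

def load_at(slot_minute, start_minute, kw):
    """Return kw if the slot lies within the charging window, else None.

    slot_minute and start_minute are minutes since midnight (assumption).
    """
    duration = CHARGE_MINUTES[kw]
    # Minutes elapsed since plug-in; the modulo handles windows that run
    # past midnight, which the formula does by also testing $F6+1.
    elapsed = (slot_minute - start_minute) % MINUTES_PER_DAY
    # Both ends are inclusive, mirroring the >= and <= in the formula.
    return kw if elapsed <= duration else None

# A car plugged in at 23:50 with an 11 kW charger still loads the 00:10 slot.
print(load_at(10, 23 * 60 + 50, 11))
```

Taking the elapsed time modulo one day replaces the pair of AND branches (today and tomorrow) with a single comparison, which is the main reason the function stays short.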
I'm currently trying to calculate the following:
That is, the average daily sales of a firm. The example underneath is small, but in reality I have 280 days and over 100 firms. I've tried VLOOKUP(firmname A, (cells), sales), but it obviously only brings back one number. SUM(VLOOKUP) is also not the best choice.
Anyone who could point me in the right direction ... SUMIF?
http://i66.tinypic.com/2wmnrbn.png
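Assuming the data is located at B2:D11 (with the firm names in column C and the sales in column D, as the formula requires) and the list of firms is at F2:F5, enter this formula in G3 and copy it down to the last record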
=AVERAGEIFS($D$3:$D$11,$C$3:$C$11,$F3)
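For anyone who wants to verify the result outside Excel, the same per-firm average can be sketched in Python. The records below are made-up illustration values, not the data from the screenshot, and the function name is hypothetical.

```python
from collections import defaultdict

def average_sales_by_firm(records):
    # records: iterable of (firm, daily_sales) pairs, one per firm per day.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for firm, sales in records:
        totals[firm] += sales
        counts[firm] += 1
    # Average = sum of a firm's sales / number of days recorded for it,
    # which is what AVERAGEIFS computes for each firm in the list.
    return {firm: totals[firm] / counts[firm] for firm in totals}

# Hypothetical sample: two firms, two days each.
records = [
    ("Firm A", 100.0),
    ("Firm B", 50.0),
    ("Firm A", 200.0),
    ("Firm B", 70.0),
]
averages = average_sales_by_firm(records)
print(averages)
```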
Some friends and I are taking part in a weight loss challenge this year and I will be recording monthly weigh-ins and body measurements. I need to find a calculation which will work out the differences in inches and pounds.
I have the item title in column B from row 10 down to Row 17. The first one in Row 10 is weight which is calculated in pounds.
Then going across from Column C is the month starting with Jan ending in December in Column N.
The total loss needs to be updated after every monthly entry into column O.
Unfortunately I cannot post a picture of the table as I'm new to this group.
I've tried other formulas suggested to people with similar problems but they don't work for me.
Can anyone help?
Many Thanks
Helen
Bit hard to work it out from your description but I think you are looking for
=C10-MIN(D10:N10)
That assumes the largest figure will always be in column C and will update every time a new entry is placed in the row.
If the weight might go up (not that you are going to fail the challenge) you could use
=C10 - LOOKUP(1,1/(D10:N10<>""),D10:N10)
This should do the trick. (And you can copy down to other rows as necessary)
=INDEX(C10:N10,1,COUNT(C10:N10))-C10
INDEX, as used here, returns the value from the range C10:N10 in the first and only row, where the column is determined by the count of values already entered. So if you have values entered for 4 months, the formula will take April's value and subtract January's value. This relies on the months being filled in from January onwards without gaps.
A negative number will represent weight loss. A positive number means weight gain.
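The same "latest entry minus first entry" rule can be sketched in Python to see how the sign works. The function name and the weights are made up for illustration, and None stands for a month that has not been entered yet. Unlike the INDEX/COUNT formula, this sketch picks the last entered value even if an earlier month was skipped, which is an assumption about the desired behaviour.

```python
def total_change(monthly):
    # monthly: 12 values for Jan..Dec, with None for months not yet entered.
    entered = [v for v in monthly if v is not None]
    if len(entered) < 2:
        # With zero or one entry there is nothing to compare yet.
        return 0
    # Latest entry minus the first entry: negative means a loss.
    return entered[-1] - entered[0]

# Hypothetical weights in pounds for January to April, rest not yet entered.
weights = [200, 195, 190, 188] + [None] * 8
print(total_change(weights))  # -12, i.e. 12 pounds lost
```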
Total fat loss :
(AVERAGE(C10:N10) - C10)*2