Groupwise Probability Distribution - python-3.x

I have a dataframe df of GPS points. I have a geographical region that I divided into a grid. Each grid cell is represented by a pair of columns (row, col) in the dataframe. The GPS points are labelled with their transportation modes. I want to calculate the probability distribution of transportation modes for each grid cell. (There are five modes of transportation: walk, bike, car, train, subway.)
Row Col P(Walk) P(Bike) P(Car) P(Train) P(Subway)
8 8 Freq(walk)/n Freq(bike)/n Freq(car)/n Freq(train)/n Freq(subway)/n
8 9 Freq(walk)/n Freq(bike)/n Freq(car)/n Freq(train)/n Freq(subway)/n
8 10 Freq(walk)/n Freq(bike)/n Freq(car)/n Freq(train)/n Freq(subway)/n
For example, the grid cell at row 8, col 8 contains 638 GPS points: 598 walk points and 40 subway points. The probability of each transportation mode for this specific grid cell then becomes
Row Col P(Walk) P(Bike) P(Car) P(Train) P(Subway)
8 8 598/638 0/638 0/638 0/638 40/638
8 9 ... ... ... ... ...
8 10 ... ... ... ... ...
... ... ... ... ... ... ...
grp = df.groupby(['row','col','Transportation_Mode'])
One way is to iterate over the groups one by one with for loops to get the frequency of each transportation mode. But I think there should be an easier, more pandas-idiomatic way, or a library, that can solve this in just a few lines.
An image of geographical region is attached for better understanding of the problem where each geographical region is divided into grid cells represented by rows and cols. Each grid cell contains multiple gps points labelled with their transportation modes.
The csv file of dataframe is available in given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9

If I'm not mistaken, you're looking for a more elegant way to loop over each group object and generate a 2-dimensional probability distribution for each one?
It sounds like you should look into this pandas documentation (more specifically the apply function).
You could simply apply a visualization to each group such as this SNS KDE visualization and then join the individual plots back into a grid like the one you provided. With a little ax magic, you can construct a grid for each transportation type. I think those are the best tools at hand to use. I'll leave the logic to you.
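If the goal is the numeric table from the question rather than a plot, a cross-tabulation may be enough. The following is a minimal sketch, not a tested solution for the linked CSV: it assumes the columns are named row, col and Transportation_Mode (as in the groupby call in the question), that the mode labels are lowercase strings, and it uses a small made-up dataframe in place of the real data.

```python
import pandas as pd

# Assumed spelling of the five mode labels; adjust to match the real data.
MODES = ["walk", "bike", "car", "train", "subway"]

# Toy stand-in for the real data: one row per GPS point.
# Cell (8, 8) gets 598 walk + 40 subway points, cell (8, 9) gets 3 car + 1 bike.
df = pd.DataFrame({
    "row": [8] * 642,
    "col": [8] * 638 + [9] * 4,
    "Transportation_Mode": ["walk"] * 598 + ["subway"] * 40 + ["car"] * 3 + ["bike"],
})

# Count points per (row, col) cell and mode, then divide each cell's counts
# by that cell's total so every row of the result sums to 1.
probs = pd.crosstab(
    [df["row"], df["col"]],
    df["Transportation_Mode"],
    normalize="index",
)

# Ensure all five modes appear as columns, even those never observed.
probs = probs.reindex(columns=MODES, fill_value=0.0)
print(probs)
```

Each row of probs is one grid cell, and each column holds Freq(mode)/n for that cell.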

Related

How to Combine Columns with the Same Heading in Excel

I have a set of cost data for different pieces of unique equipment. Each piece of equipment is classified as a particular equipment class which I have pulled from an index match on the unique equipment number. I now have a set of ~9000 columns of cost data, each with a column header of one of the ~300 equipment classes.
What I want to do is to get the median, 25%, and 75% for the full data set for each of these equipment classes.
I either want to create a single long column of all the data for each equipment class, or have a way to calculate the Percentile() values for the data in all columns with the same heading.
I could filter the data for each equipment class one at a time and calculate the percentile values, but with 300 equipment classes it would take forever.
Example:
Class01 Class02 Class01 Class03 Class03
1 4 7 10 13
2 5 8 11 14
3 6 9 12 15
And I want the 25%, median and 75% for the distribution for Class01, Class02, and Class03
Thank you for your time.
I'll just tell you how to change your data around. From there the percentiles/quartiles will be straightforward.
Start with your data like this. Notice that I added a column on the left. It's easy to make, just type Item1 and drag down (or double click the small square in the bottom right corner of the cell)
You then need to hit Alt+D+P.
Select multiple consolidated ranges > next
Next (create page ranges for me...)
Select all of your data as the range, click add then finish
You will now get a pivot table that looks like this:
Click the grand-grand total (i.e. 120) and that will create another pivot table like this:
Et voila...
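For readers who can move the data out of Excel, the same reshaping (one long column per class, then quartiles) can be sketched in Python with pandas. This is only an illustration on the small example from the question; the names headers, values, equipment_class and cost are made up here, not taken from the workbook.

```python
import numpy as np
import pandas as pd

# Example from the question: repeated headers over columns of cost data.
headers = ["Class01", "Class02", "Class01", "Class03", "Class03"]
values = np.array([
    [1, 4, 7, 10, 13],
    [2, 5, 8, 11, 14],
    [3, 6, 9, 12, 15],
])

# Stack every column into one long (class, value) table, column by column.
long = pd.DataFrame({
    "equipment_class": np.repeat(headers, values.shape[0]),
    "cost": values.T.ravel(),
})

# 25%, median and 75% per class, pooled over all columns sharing a header.
quartiles = (
    long.groupby("equipment_class")["cost"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```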

Excel IF OR Statement

I am having trouble determining the correct way to calculate a final rank order for four categories. Each of the four metrics make up a higher group. A Top 10 of each category is applied to the respective product to risk analysis.
CURRENT LOGIC - Assignment of 25% max per category.
Columns - Y4
Parts
0.25
25
=IF(L9=1,$Y$4,IF(L9=2,$Y$4*0.9, IF(L9=3,$Y$4*0.8, IF(L9=4,$Y$4*0.7, IF(L9=5,$Y$4*0.6, IF(L9=6,$Y$4*0.5, IF(L9=7,$Y$4*0.4, IF(L9=8,$Y$4*0.3, IF(L9=9,$Y$4*0.2, IF(L9=10,$Y$4*0.1,0))))))))))
DESIRED...
I would like to use a statement to determine three criteria in order to apply a score (1=100, 2=90, 3=80, etc..).
SUM the rank positions of each of the four categories-apply product rank ascending (not including NULL since it's not in the Top 10)
IF a product is identified in more than one metric-apply a significant contribution weight of (*.75),
IF a product has the number 1 rank in any of the four metrics-apply a score of (100).
Data - UPDATED EXAMPLE
(Product) Parts Labor Overhead External Final Score
"XYZ" 3 1 7 7 100
"ABC" NULL 6 NULL 2 100
"LMN" 4 NULL NULL NULL 70
This is way beyond my capability. ANY assistance is appreciated greatly!!!
Jim
I figured this is a good start and I can alter the weight as needed to reflect the reality of the situation.
=AVERAGE(G28:I28)+SUM(G28:I28)*0.25
However, I couldn't figure out how to put a cap on the score of no more than 100 points.
I am still unclear of what exactly you are attempting and if this will work, but how about this simple matrix using an array formula and some conditional formatting.
Array Formula in F2 (make sure to press Ctrl+Shift+Enter when exiting formula edit mode)
=MIN(100,SUM(IF(B2:E2<>"NULL",CHOOSE(B2:E2,100,90,80,70,60,50,40,30,20,10))))
Conditional Formatting defined as shown below.
Red = 100 value where it comes from a 1
Yellow = 100 value where it comes from more than 1 factor, but without a 1.
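To make the logic of the array formula easier to check, here is a rough Python equivalent. It is a sketch of that formula only (rank 1 scores 100, rank 2 scores 90, down to rank 10 scoring 10, NULL ranks are skipped, and the total is capped at 100); it does not implement the 0.75 contribution weight from the question. The function name final_score and the use of None for NULL are choices made for this illustration.

```python
def final_score(ranks):
    """Sum the points for each Top 10 rank and cap the total at 100.

    ranks: iterable of ints in 1..10, or None where the product is not
    in the Top 10 for that metric (NULL in the spreadsheet).
    """
    # 110 - 10 * rank maps 1 -> 100, 2 -> 90, ..., 10 -> 10,
    # the same values the CHOOSE call lists.
    points = sum(110 - 10 * r for r in ranks if r is not None)
    return min(100, points)


# Rows from the question: Parts, Labor, Overhead, External.
products = {
    "XYZ": [3, 1, 7, 7],
    "ABC": [None, 6, None, 2],
    "LMN": [4, None, None, None],
}
scores = {name: final_score(ranks) for name, ranks in products.items()}
print(scores)
```

On the three example rows this gives 100, 100 and 70, matching the Final Score column in the question.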

Need to make a calculation based on multiple lookup results on another Excel sheet

Need to try and get a result based on possible 3 lookups in Excel.
I have a price for a certain size hire vehicle and need to check to see if I want to add in a supplement or not based on entry into a cell in another sheet.
I have a sheet called Keys that has the criteria I base my calculations on and a second sheet I have the rates loaded for all the vehicle sizes available, cars to coaches. I would like to calculate the supplement for customers of the move to a larger vehicle or even a reduction dependent on what I choose.
Keys data is:
Vehicle Sizes
Range # Seats Rate Column Supplement Range to work on
1 4 R N
2 7 S Y 1
3 16 T N
4 24 U Y 5
5 29 V N
6 35 W N
7 45 X N
So for example, if I have chosen to calculate the supplement on the 7 seater, then I want to calculate the difference between the 7 seater and the 4 seater, and that is my supplement. I have also chosen to calculate the reduction between the 29 and 24 seater vehicles.
I am trying to figure out how to combine multiple IF and LOOKUP functions, and whether they are the right ones to use.
So basically, IF I have a Y in the Supplement column on Keys, then calculate the difference in the rates, using the Rate Column and the Range to work on.
Any suggestions or help appreciated.
Sorry, I think I forgot about the actual rates. They are stored on another sheet as shown below. The charges are per service, like an airport transfer etc.; they are in VN Dong, so that's why they are in the 100,000+ range.
R S T U V W X
Rate with Surcharge
4 7 16 24 29 35 45
340000 373000 394000 735000 780000 1050000 1210000
I have tried to tweak the answer from pnuts but am getting a bit lost; I am not sure if I need the MATCH in the formula or not.
I doubt this will suit but it may help to clarify your requirements:
=IF($D2="N","",INDEX(Sheet2!$Q$2:$X$4,MATCH(F$1,Sheet2!$Q$2:$Q$4,0),CODE($C2)-80)-INDEX(Sheet2!$Q$2:$X$4,MATCH(F$1,Sheet2!$Q$2:$Q$4,0),CODE($C2)+$E2-$A1-81))
in F2 copied across and down to suit.
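As a way of pinning down the requirement, the rule described in the question can be written out in a few lines of Python. This is a sketch under assumptions, not a translation of the formula above: it assumes the supplement is the rate of the flagged vehicle minus the rate of the vehicle named in "Range to work on" (so a move to a cheaper vehicle comes out negative), and the names keys, rates and supplement are invented for the illustration.

```python
# Keys sheet: range number -> (seats, supplement flag, range to work on).
keys = {
    1: (4, "N", None),
    2: (7, "Y", 1),
    3: (16, "N", None),
    4: (24, "Y", 5),
    5: (29, "N", None),
    6: (35, "N", None),
    7: (45, "N", None),
}

# Rates sheet: seats -> rate with surcharge, in VN Dong.
rates = {4: 340000, 7: 373000, 16: 394000, 24: 735000,
         29: 780000, 35: 1050000, 45: 1210000}


def supplement(range_id):
    """Return the rate difference for a flagged range, or None if not flagged."""
    seats, flag, other_range = keys[range_id]
    if flag != "Y":
        return None
    other_seats = keys[other_range][0]
    # Assumed sign convention: this vehicle's rate minus the other vehicle's rate.
    return rates[seats] - rates[other_seats]


for range_id in keys:
    print(range_id, supplement(range_id))
```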

How to calculate average of a column of numbers linked to each frequency bin making up a histogram, Excel 2010?

I have three columns. Column A consists of numbers, column B consists of bin ranges, and column C consists of number data relevant to the individual data in column A.
Using columns A and B, I created a frequency histogram where all the data in column A have been grouped into the bins of column B. I would like to calculate the average value of each bin using the data from column C (i.e., calculate a mean value for each bin using data from column C that is associated to each value (from column A) that made up each bin).
Can anybody help?
Thanks for the replies. Here is an example of the data (Unfortunately I can not paste in images):
Below are three columns with headers Jar Type (volume in ml), Cookies (the number of chocolate chip cookies in the jar), and Interval for bins (bins to count the jar types):
Jar type-cookies-intervals for bins
500 3 100
500 1 150
500 0.5 200
250 3 250
150 1 300
500 1 350
150 2 400
250 2 450
### # 500
Making a histogram of the frequency of jar types gives this grouping:
Bin-Frequency
100 0
150 2
200 0
250 2
300 0
350 0
400 0
450 0
500 4
More 0
Now what I am trying to do is to find the mean number of cookies in each type of jar. For example, for the 500ml bin we know that there are 4 x 500ml jars, and that across those jars we have 3+1+0.5+1 = 5.5 cookies in total. The mean would be 5.5/4 = 1.375 cookies.
My issue is that I have 5000+ numbers that separate into 100 bins.
The question calls for a "wandering trace" of a scatterplot: the values of column A (plot them on the horizontal axis) are placed into bins, which therefore comprise vertical strips in the scatterplot. The values of column C (plotted on the vertical axis) are averaged within each strip. This technique smooths out and summarizes apparent trends in the scatterplot.
In this example with 100 records the original data are in black and computed values are in green. Here is the wandering trace of means:
The open circles plot column C (associated values) against column A (data) while the solid squares, connected with a dashed red trace, plot the bin means (column G) against the midpoints (column F).
Any statistical package will provide functions for grouping data and performing operations on those groups. Excel does this to a limited extent with its SUMIF and COUNTIF functions. To use them, create a column (D in the spreadsheet) showing the grouping factor. (That's a simple lookup in the sorted BINS vector using the VLOOKUP function with its "range" option set to true.) SUMIF computes sums by group factor and COUNTIF counts by group factor. Their ratios are the bin means.
Here is what the formulas look like:
Only three formulas were actually entered and then copied down as needed:
=VLOOKUP(A2, Bins, 1, TRUE) computes the group for the value in cell A2. Bins is a name for the sorted array $(-3, -2, \ldots, 3)$ in column B.
=AVERAGE(B3:B4) computes the midpoint of the first bin. This was used as a horizontal plotting position in the scatterplot.
=SUMIF(Bin,"="&B3,NewValues)/COUNTIF(Bin, "="&B3) is where all the work is done. Bin refers to the group codes in column D and NewValues refers to the associated values in column C. The tricky parts are the constructs "="&B3: these form a text value instructing the data to be grouped by comparison to the number in cell B3, which is the first endpoint. Because this is a formula, copying it down automatically updates the B3 to B4, then B5, etc.
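The same sum-divided-by-count idea can be sketched outside Excel with pandas, here on the cookie jar example from the question. This is an illustration under assumptions: each bin is taken to cover values greater than the previous bin value and up to and including its own value, a lower edge of 0 is added for the first bin, and the column names are made up.

```python
import pandas as pd

# Jar types (column A) and cookie counts (column C) from the question.
df = pd.DataFrame({
    "jar_ml": [500, 500, 500, 250, 150, 500, 150, 250],
    "cookies": [3, 1, 0.5, 3, 1, 1, 2, 2],
})

# Bin values (column B), each treated as the upper edge of its bin.
upper_edges = [100, 150, 200, 250, 300, 350, 400, 450, 500]
edges = [0] + upper_edges  # assumed lower edge for the first bin

# Assign each jar to its bin; intervals are (previous edge, edge].
df["bin"] = pd.cut(df["jar_ml"], bins=edges, labels=upper_edges)

# Per-bin sum and count, mirroring SUMIF and COUNTIF; their ratio is the mean.
grouped = df.groupby("bin", observed=True)["cookies"]
bin_means = grouped.sum() / grouped.count()

means = {int(label): float(value) for label, value in bin_means.items()}
print(means)
```

Bins that received no values are left out of the result instead of producing a division by zero.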

Creating a scatter-plot with series from row values, and XY values from two other rows

I am doing a project that requires me to study the several conditions that affect GPS accuracy. After I collected a set of data and dumped it to Excel, I tried to plot a scatter graph, grouping the data into different series according to a value: in this case, I wanted to plot the Latitude and Longitude values as the XY scatter values, and separate the series by the number of satellites when the fix was obtained.
Timestamp Latitude Longitude #Satellites
133009.279 3839.3354 904.7395 0
133010.279 3839.3354 904.7395 0
133011.279 3839.3354 904.7395 0
133026 3845.9863 907.4513 4
133027 3845.986 907.4491 4
133028 3845.9851 907.448 4
133222 3845.9909 907.4866 4
133023.28 3845.9817 907.4429 5
133024.28 3845.9867 907.4549 5
133048 3845.9868 907.452 5
133205 3845.9929 907.4858 5
133206 3845.9927 907.486 5
133207 3845.9925 907.4862 5
133056 3845.9885 907.4569 6
133057 3845.9881 907.4578 6
133223 3845.9905 907.4868 6
133224 3845.9901 907.487 6
I have tried selecting the three rows, adding the series afterwards by selecting the appropriate row, and even pivot tables, but unfortunately these don't allow for scatter plots.
All this to no avail, but I am positive that the graph can be plotted. Does anyone have an idea?
PS: Manually selecting the series myself isn't an option, since there is a large amount of data. If I could select all of the data for one specific value in a row, though, that would let me select each series, and I think I would be able to take it from there.
Have a look at the XY charts from FusionCharts XT - http://www.fusioncharts.com/demos/gallery/#bubble-and-xycharts
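If leaving Excel is an option, the grouping into series can also be sketched in Python. The example below is an assumption-laden illustration, not part of the suggestion above: it uses a handful of rows from the question, renames the "#Satellites" column to Satellites, and assumes pandas and matplotlib are installed (matplotlib is only imported when plot_series is called).

```python
import pandas as pd

# A few rows copied from the table in the question.
df = pd.DataFrame({
    "Latitude": [3839.3354, 3845.9863, 3845.986, 3845.9817, 3845.9867, 3845.9885],
    "Longitude": [904.7395, 907.4513, 907.4491, 907.4429, 907.4549, 907.4569],
    "Satellites": [0, 4, 4, 5, 5, 6],
})

# One sub-table of points per satellite count; each becomes one series.
series = {
    n: group[["Latitude", "Longitude"]]
    for n, group in df.groupby("Satellites")
}


def plot_series(series):
    """Draw one scatter series per satellite count and return the figure."""
    import matplotlib.pyplot as plt  # lazy import; assumes matplotlib is available

    fig, ax = plt.subplots()
    for n, points in series.items():
        ax.scatter(points["Latitude"], points["Longitude"], label=f"{n} satellites")
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Longitude")
    ax.legend()
    return fig
```

Calling plot_series(series) and saving or showing the returned figure gives one colour per satellite count without selecting any series by hand.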
