GroupBy dataframe and find out max number of occurrences of another column - pandas-groupby

I have to use groupby() on a dataframe in python 3.x. Column name is Origin, then based upon the origin, I have to find out the destination with maximum occurrences.
Sample df is like:
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay origin dest
0 2013 1 1 517 515 2 830 819 11 EWR IAH
1 2013 1 1 533 529 4 850 830 20 LGA IAH
2 2013 1 1 542 540 2 923 850 33 JFK MIA
3 2013 1 1 544 545 -1 1004 1022 -18 JFK BQN
4 2013 1 1 554 600 -6 812 837 -25 LGA ATL
5 2013 1 1 554 558 -4 740 728 12 EWR ORD
6 2013 1 1 555 600 -5 913 854 19 EWR FLL
7 2013 1 1 557 600 -3 709 723 -14 LGA IAD
8 2013 1 1 557 600 -3 838 846 -8 JFK MCO
9 2013 1 1 558 600 -2 753 745 8 LGA ORD

You can use the following to find out the max number of occurrences of another column:
df.groupby(['origin'])['dest'].size().reset_index()
origin dest
0 EWR 3
1 JFK 3
2 LGA 4

you can use aggregate functions to make your life simpler and plot graphs onto it as well.
fun={'dest':{'Count':'count'}
df= df.groupby(['origin','dest']).agg(fun).reset_index()
df.columns=df.columns.droplevel(1)
df

Related

How to find missing index values in a pandas dataframe?

My dataframe is like this. I know what I lost some rows in data cleaning because len(df) was previously 500 and now it is 489.
I can see, for example, that the row 496 is missing.
all month day year
0 03/25/93 03 25 93
...
480 2013 1 1 2013
481 1974 1 1 1974
482 1990 1 1 1990
483 1995 1 1 1995
484 2004 1 1 2004
485 1987 1 1 1987
486 1973 1 1 1973
487 1992 1 1 1992
488 1977 1 1 1977
489 1985 1 1 1985
490 2007 1 1 2007
491 2009 1 1 2009
492 1986 1 1 1986
493 1978 1 1 1978
494 2002 1 1 2002
495 1979 1 1 1979
497 2008 1 1 2008
498 2005 1 1 2005
499 1980 1 1 1980
how can I find out which rows are missing?
If my question is a duplicate, please point me to the solution. thanks!
The easiest, if you have unique index values, is probably to use the difference on the index, i.e. you could simply do:
df_original.index.difference(df_cleaned.index)

How can I find the sum of certain columns and find avg of other columns in python?

enter image description here
I've combined 10 excel files each with 1yr of NFL passing stats and there are certain columns (Games played, Completions, Attempts, etc) that I have summed but I'd need (Passer rating, and QBR) that I'd like to see the avg for.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code, the actual dataset (or a sample of the dataset), etc, so that users can reproduce the error, issue, etc. No one is really going to take the time to manufacture your dataset from a photo (or I should say rarely, as I did do that...because I love working with sports data, and I could grab it realitively quickly).
But to get averages in stead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd
df = pd.DataFrame()
for season in range(2010, 2020):
url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
df = df.append(pd.read_html(url)[0], sort=False)
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '')
df['Player'] = df['Player'].str.strip()
strCols = ['Player','Tm', 'Pos', 'QBrec']
numCols = [ x for x in df.columns if x not in strCols ]
df[['QB_wins','QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)
df3 = df.groupby(['Player'],as_index=False).agg({'GS':'sum', 'TD':'sum', 'QBR':'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000

How to extract lines from a file when the second columns of a file matches the values in another file

I got two files.
file 1:
4
14
18
45
53
60
64
102
106
158
162
file2:
28 1 2
54 1 2
90 1 1
103 1 1
155 1 17
191 1 1
235 1 1
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
529 57 3
542 53 1
560 58 6
562 164 25
568 164 5
I want to extract the value from file2 if the second column of file two matches the value in file 1.
So the expected output will be:
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
I saw many of the solution online is using python or Perl, however, I want to use linux command to do this, any idea?
This should do it?
awk 'FNR==NR{a[$0]++};FNR!=NR{if($2 in a){print}}' file1 file2
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Explanation:
we hand awk both files (order is important in this case!).
as long as we read the first file (FNR==NR) we store each value in an array a[$1]++
when we reach the second file we just check if values from the second file's second column ($2) are in the array; if yes, we print them.

Concatenate two dataframes by column

I have 2 dataframes. First dataframe contain number of year and count with 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
Second dataframe have similar columns but filled with smaller number of years and filled count:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create third dataframe df3. For each row, if year in df1 and df2 are equal, then df3["count"] = df2["count"] else df3["count"] = df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found the solution to this error(Pandas join issue: columns overlap but no suffix specified) But after I run code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
Desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The issue if because you should place year as index. In addition, if you don't want to lose data, you should join on outer instead of left.
This is my code:
df = pd.DataFrame({
"year" : np.random.randint(1850, 2000, size=(100,)),
"qty" : np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
"year" : np.random.randint(1850, 2000, size=(100,)),
"qty" : np.random.randint(0, 10, size=(100,)),
})
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how = "outer", lsuffix='_left', rsuffix='_right')
df3 = df3.fillna(0)
At this step you have 2 columns with values from df1 or df2. In you merge rule, I don't get what you want. You said :
if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]
That means you want everytime df1["qty"] because of df1["qty"] == df2["qty"]. Am I right ?
Just in case. If you want a code to adjust you can use apply as follow :
def foo(x1, x2):
if x1 == x2:
return x2
else:
return x1
df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_left"]), axis=1)
df3.drop(["qty_left","qty_right"], axis = 1, inplace = True)
I hope it helps,
Nicolas

Adding rows that match a criteria in another column in Excel

This is a sample data
Polling_Booth INC SAD BSP PS_NO
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1
4 356 355 25 2
5 290 333 9 2
6 144 143 4 3
7 327 196 1 4
8 370 235 1 5
And this is what I'm trying to achieve
Polling_Booth INC SAD BSP PS_NO OP_INC OP_SAD OP_BSP
1 89 47 2 1
2 97 339 6 1
3 251 485 8 1 437 871 16
4 356 355 25 2
5 290 333 9 2 646 688 34
6 144 143 4 3 144 143 4
7 327 196 1 4 327 196 1
8 370 235 1 5 370 235 1
This is achieved adding up rows which has the same PS_NO, This is what I have tried
=if(E2=E3,sum(B2,B3),0) #same for all the rows
Any help would be much appreciated..Thanks
You could get it to look like your table by adding another condition to check if it's the last occurrence of the PS_No in column E and setting the result to an empty string if not
=IF(COUNTIF($E$2:$E2,$E2)=COUNTIF($E$2:$E$10,$E2),SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
If the data is sorted by PS_No, you can do it more easily by
=IF($E3<>$E2,SUMIF($E$2:$E$10,$E2,B$2:B$10),"")
which I think is what you were trying in your question

Resources