Concatenate two dataframes by column - python-3.x

I have 2 dataframes. The first dataframe contains years and a count column filled with 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The second dataframe has the same columns but covers fewer years and has the counts filled in:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create a third dataframe, df3. For each row, if the year in df1 also appears in df2, then df3["count"] = df2["count"]; otherwise df3["count"] = df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found a solution to this error (Pandas join issue: columns overlap but no suffix specified) and ran the code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But the output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
Desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]

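The issue is that join matches the on column of the calling dataframe against the index of the other dataframe, so you should set year as the index. In addition, if you don't want to lose data, you should use an outer join instead of a left join.
Here is my code: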
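import numpy as np
import pandas as pd
# Two sample dataframes with random years and quantities
df = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
# Use year as the index of both dataframes so that join aligns on it
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how="outer", lsuffix='_left', rsuffix='_right')
# Years present in only one dataframe get 0 in the other column
df3 = df3.fillna(0)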
At this point you have two columns, one with the values from df1 and one with the values from df2. I don't fully understand your merge rule. You said:
if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]
That means you always want df1["qty"], because when df1["qty"] == df2["qty"] the two values are identical anyway. Am I right?
Just in case, if you want code that you can adjust, you can use apply as follows:
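def foo(x1, x2):
    # Rule for combining the two values of one row; adjust as needed
    if x1 == x2:
        return x2
    else:
        return x1
# Compare the left (df) and right (df2) values row by row
df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_right"]), axis=1)
df3.drop(["qty_left", "qty_right"], axis=1, inplace=True)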
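For the rule as the question's desired output suggests it (take df2's count where the year exists in df2, otherwise keep df1's count), a minimal sketch could look like the following. The two small frames are hypothetical stand-ins for the question's data, and the sketch assumes each year appears at most once in df2.

```python
import pandas as pd

# Hypothetical small stand-ins for the question's df1 and df2
df1 = pd.DataFrame({"year": [1890, 1891, 1892, 1893, 1894], "count": 0})
df2 = pd.DataFrame({"year": [1893, 1891], "count": [1, 7]})

# Series indexed by year, used as a lookup table (assumes unique years in df2)
lookup = df2.set_index("year")["count"]

df3 = df1.copy()
# Years missing from df2 map to NaN and fall back to df1's own count
df3["count"] = df3["year"].map(lookup).fillna(df3["count"]).astype(int)
print(df3)
```

This keeps df1's row order and row count, so no outer join or suffix handling is needed.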
I hope it helps,
Nicolas

Related

I would like to find consecutive numbers in column A and column B in python (pandas)

I would like to find consecutive numbers in column A and Column B in python, Column A should be ascending but Column B is descending. I am attaching an example file.
Input file
nucleotide               Pos_A  Pos_B
Connection_Pos20_Pos102  20     102
Connection_Pos19_Pos102  19     102
Connection_Pos20_Pos101  20     101
Connection_Pos18_Pos102  18     102
Connection_Pos19_Pos101  19     101
Connection_Pos20_Pos100  20     100
Connection_Pos17_Pos102  17     102
Connection_Pos18_Pos101  18     101
Connection_Pos19_Pos100  19     100
Connection_Pos20_Pos99   20     99
Connection_Pos16_Pos102  16     102
Connection_Pos17_Pos101  17     101
Connection_Pos18_Pos100  18     100
Connection_Pos19_Pos99   19     99
Connection_Pos20_Pos98   20     98
Connection_Pos15_Pos102  15     102
Connection_Pos16_Pos101  16     101
Connection_Pos17_Pos100  17     100
Connection_Pos18_Pos99   18     99
Connection_Pos19_Pos98   19     98
Connection_Pos20_Pos97   20     97
Connection_Pos14_Pos102  14     102
Connection_Pos15_Pos101  15     101
Connection_Pos16_Pos100  16     100
Output:
nucleotide               Pos_A  Pos_B  Consecutive ID  Consecutive Number (Size)
Connection_Pos20_Pos102  20     102    101             1
Connection_Pos19_Pos102  19     102    100             2
Connection_Pos20_Pos101  20     101    100             2
Connection_Pos18_Pos102  18     102    99              3
Connection_Pos19_Pos101  19     101    99              3
Connection_Pos20_Pos100  20     100    99              3
Connection_Pos17_Pos102  17     102    98              4
Connection_Pos18_Pos101  18     101    98              4
Connection_Pos19_Pos100  19     100    98              4
Connection_Pos20_Pos99   20     99     98              4
Connection_Pos16_Pos102  16     102    97              5
Connection_Pos17_Pos101  17     101    97              5
Connection_Pos18_Pos100  18     100    97              5
Connection_Pos19_Pos99   19     99     97              5
Connection_Pos20_Pos98   20     98     97              5
Connection_Pos15_Pos102  15     102    96              6
Connection_Pos16_Pos101  16     101    96              6
Connection_Pos17_Pos100  17     100    96              6
Connection_Pos18_Pos99   18     99     96              6
Connection_Pos19_Pos98   19     98     96              6
Connection_Pos20_Pos97   20     97     96              6
Connection_Pos14_Pos102  14     102    95              7
Connection_Pos15_Pos101  15     101    95              7
Connection_Pos16_Pos100  16     100    95              7
Connection_Pos17_Pos99   17     99     95              7
Connection_Pos18_Pos98   18     98     95              7
Connection_Pos19_Pos97   19     97     95              7
Connection_Pos20_Pos96   20     96     95              7
For Consecutive ID, if Pos_B's shifted difference != 1, then we want to subtract 1, so we mark those indexes as -1 with mul(-1) and cumsum them:
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
For Consecutive Number, if Pos_A's shifted difference != -1, then we want to add 1, so we mark those indexes as 1 and cumsum again:
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()
Result:
nucleotide Pos_A Pos_B ID Number
0 Connection_Pos20_Pos102 20 102 101 1
1 Connection_Pos19_Pos102 19 102 100 2
2 Connection_Pos20_Pos101 20 101 100 2
3 Connection_Pos18_Pos102 18 102 99 3
4 Connection_Pos19_Pos101 19 101 99 3
5 Connection_Pos20_Pos100 20 100 99 3
6 Connection_Pos17_Pos102 17 102 98 4
7 Connection_Pos18_Pos101 18 101 98 4
8 Connection_Pos19_Pos100 19 100 98 4
9 Connection_Pos20_Pos99 20 99 98 4
10 Connection_Pos16_Pos102 16 102 97 5
11 Connection_Pos17_Pos101 17 101 97 5
12 Connection_Pos18_Pos100 18 100 97 5
13 Connection_Pos19_Pos99 19 99 97 5
14 Connection_Pos20_Pos98 20 98 97 5
15 Connection_Pos15_Pos102 15 102 96 6
16 Connection_Pos16_Pos101 16 101 96 6
17 Connection_Pos17_Pos100 17 100 96 6
18 Connection_Pos18_Pos99 18 99 96 6
19 Connection_Pos19_Pos98 19 98 96 6
20 Connection_Pos20_Pos97 20 97 96 6
21 Connection_Pos14_Pos102 14 102 95 7
22 Connection_Pos15_Pos101 15 101 95 7
23 Connection_Pos16_Pos100 16 100 95 7
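To make the two one-liners easier to follow, here is a sketch that splits them into named steps on the first six rows of the input. The intermediate names step_b and new_run are only illustrative, and iloc[0] is used in place of Pos_B[0] so that the sketch does not depend on the index labels.

```python
import pandas as pd

# First six rows of the question's input
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# Previous row minus current row; the first row has no predecessor, so NaN
step_b = df["Pos_B"].shift() - df["Pos_B"]
# True wherever Pos_B did not drop by exactly 1, i.e. a new run starts
new_run = step_b.ne(1)
# Each new run lowers the ID by one, counting down from the first Pos_B value
df["ID"] = df["Pos_B"].iloc[0] - new_run.cumsum()
# Same idea for Pos_A, which rises by exactly 1 inside a run
df["Number"] = (df["Pos_A"].shift() - df["Pos_A"]).ne(-1).cumsum()
print(df)
```

Because the comparison with NaN in the first row is True, the first row always opens run number 1.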
Alternatively, build a run counter for each column separately, then combine the two with groupby and ngroup:
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1
Out[452]:
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
10 5
11 5
12 5
13 5
14 5
15 6
16 6
17 6
18 6
19 6
20 6
21 7
22 7
23 7
24 7
25 7
26 7
27 7
dtype: int64

Pandas: Combine pandas columns that have the same column name

If we have the following df,
df
A A B B B
0 10 2 0 3 3
1 20 4 19 21 36
2 30 20 24 24 12
3 40 10 39 23 46
How can I combine the content of the columns with the same names?
e.g.
A B
0 10 0
1 20 19
2 30 24
3 40 39
4 2 3
5 4 21
6 20 24
7 10 23
8 Na 3
9 Na 36
10 Na 12
11 Na 46
I tried groupby and merge and both are not doing this job.
Any help is appreciated.
If column names are duplicated, you can use DataFrame.melt with concat:
df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
EDIT: a more general version that loops over the unique column names:
uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)
A B
0 10.0 0
1 20.0 19
2 30.0 24
3 40.0 39
4 2.0 3
5 4.0 21
6 20.0 24
7 10.0 23
8 NaN 3
9 NaN 36
10 NaN 12
11 NaN 46
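As an alternative sketch that does not rely on melt, the same stacking can be done by flattening each group of same-named columns with NumPy. This is a swapped-in technique rather than the answer's method; the names stacked, block and out are illustrative.

```python
import pandas as pd

# The question's frame, with duplicated column names
df = pd.DataFrame(
    [[10, 2, 0, 3, 3], [20, 4, 19, 21, 36], [30, 20, 24, 24, 12], [40, 10, 39, 23, 46]],
    columns=["A", "A", "B", "B", "B"],
)

stacked = {}
for name in df.columns.unique():
    # Boolean mask selects every column that carries this name
    block = df.loc[:, df.columns == name]
    # order="F" reads the block column by column, first column on top
    stacked[name] = pd.Series(block.to_numpy().ravel(order="F"))

# Shorter groups are padded with NaN when the Series are aligned
out = pd.concat(stacked, axis=1)
print(out)
```

As in the melt version, column A becomes float because of the NaN padding.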

How can I find the sum of certain columns and find avg of other columns in python?

I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but for others (passer rating and QBR) I'd like to see the average instead.
df3 = df3.groupby(['Player'],as_index=False).agg({'GS':'sum' ,'Cmp': 'sum', 'Att': 'sum','Cmp%': 'sum','Yds': 'sum','TD': 'sum','TD%': 'sum', 'Int': 'sum', 'Int%': 'sum','Y/A': 'sum', 'AY/A': 'sum','Y/C': 'sum','Y/G':'sum','Rate':'sum','QBR':'sum','Sk':'sum','Yds.1':'sum','NY/A': 'sum','ANY/A': 'sum','Sk%':'sum','4QC':'sum','GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the issue. Hardly anyone is going to take the time to rebuild your dataset from a photo (I did here, because I love working with sports data and could grab it relatively quickly).
To get averages instead of sums, use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd
# Download one passing table per season and stack them
frames = []
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    frames.append(pd.read_html(url)[0])
df = pd.concat(frames, sort=False)
# Drop the header rows that are repeated inside each table
df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
# Keep only letters, spaces and periods in player names
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()
strCols = ['Player','Tm', 'Pos', 'QBrec']
numCols = [ x for x in df.columns if x not in strCols ]
df[['QB_wins','QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)
# Sum the counting stats and average the rate stat
df3 = df.groupby(['Player'],as_index=False).agg({'GS':'sum', 'TD':'sum', 'QBR':'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
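If there are many columns, the aggregation dictionary can be built from two lists instead of being typed out by hand. The sketch below uses a made-up miniature table (players "A" and "B" with invented numbers), so only the pattern carries over, not the values.

```python
import pandas as pd

# Hypothetical miniature of the combined table: one row per player-season
stats = pd.DataFrame({
    "Player": ["A", "A", "B", "B"],
    "GS": [16, 14, 10, 12],
    "TD": [30, 25, 12, 18],
    "Rate": [100.0, 90.0, 80.0, 85.0],
    "QBR": [70.0, 60.0, 50.0, None],
})

sum_cols = ["GS", "TD"]       # counting stats: add up across seasons
mean_cols = ["Rate", "QBR"]   # rate stats: average across seasons

# One dict mapping every column to its aggregation
agg_map = {**{c: "sum" for c in sum_cols}, **{c: "mean" for c in mean_cols}}
career = stats.groupby("Player", as_index=False).agg(agg_map)
print(career)
```

Note that mean skips missing values, so a season without a QBR does not pull the average down. A plain mean of per-season rates also weights every season equally, regardless of games played.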

How to do subtraction in the cells of columns in python

I have this dataframe (df) in python:
Cumulative sales
0 12
1 28
2 56
3 87
I want to create a new column containing the number of new sales (N-(N-1)), as below:
Cumulative sales New Sales
0 12 12
1 28 16
2 56 28
3 87 31
You can do
df['new sale']=df.Cumulativesales.diff().fillna(df.Cumulativesales)
df
Cumulativesales new sale
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
Do this:
df['New_sales'] = df['Cumulative_sales'].diff()
df.fillna(df.iloc[0]['Cumulative_sales'], inplace=True)
print(df)
Output:
Cumulative_sales New_sales
0 12 12.0
1 28 16.0
2 56 28.0
3 87 31.0
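Both answers rename the column and end up with floats. A small sketch that keeps the question's original column name (which contains a space, so bracket access is needed) and casts the result back to integers might look like this; it assumes the cumulative column has no missing values.

```python
import pandas as pd

df = pd.DataFrame({"Cumulative sales": [12, 28, 56, 87]})

cum = df["Cumulative sales"]
# diff() leaves NaN in the first row; fill it with the first cumulative value
df["New Sales"] = cum.diff().fillna(cum).astype(int)
print(df)
```

Filling only the new column, rather than calling fillna on the whole dataframe, avoids touching NaN values in any other columns.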

Rolling Sum in Pandas

I need to find the sum of each row plus the 3 subsequent rows. Here is the code I'm using to accomplish this:
df_sum_row_and_3_subsequent_rows = df.iloc[::-1].rolling(4, min_periods=1).sum().sum(axis=1).iloc[::-1]
Is there a better way of doing this?
Here is the DataFrame that I'm using:
NBL NBT NBR SBL SBT SBR EBL EBT EBR WBL WBT WBR
DATETIME
2020-01-01 10:00:00 9 2 28 8 6 5 4 92 9 21 124 5
2020-01-01 10:15:00 13 0 24 12 2 7 5 91 7 20 123 11
2020-01-01 10:30:00 5 1 16 10 2 4 9 115 12 21 118 9
2020-01-01 10:45:00 10 5 25 9 2 6 5 114 6 25 128 13
2020-01-01 11:00:00 11 4 28 11 3 7 7 110 8 30 126 10
2020-01-01 11:15:00 0 0 0 0 0 0 0 0 0 0 0 0
2020-01-01 11:30:00 0 0 0 0 0 0 0 0 0 0 0 0
2020-01-01 11:45:00 8 5 24 12 10 12 14 130 18 42 154 17
2020-01-01 12:00:00 14 5 29 15 1 17 4 138 17 44 141 9
2020-01-01 12:15:00 12 4 45 13 3 13 13 147 27 47 134 13
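One possible alternative to reversing the frame twice is a forward-looking window, which pandas provides as pd.api.indexers.FixedForwardWindowIndexer (available since pandas 1.1). The sketch below uses a two-column excerpt of the data and compares the result with the reversed-rolling expression from the question.

```python
import pandas as pd

# Two columns and five rows taken from the question's table
df = pd.DataFrame(
    {"NBL": [9, 13, 5, 10, 11], "NBT": [2, 0, 1, 5, 4]},
    index=pd.date_range("2020-01-01 10:00", periods=5, freq="15min"),
)

# Total per 15-minute row, then sum each row with the 3 rows that follow it
row_totals = df.sum(axis=1)
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=4)
forward = row_totals.rolling(window=indexer, min_periods=1).sum()

# The question's approach, for comparison
reversed_way = df.iloc[::-1].rolling(4, min_periods=1).sum().sum(axis=1).iloc[::-1]
print(forward)
```

With min_periods=1 the last three rows are summed over the shorter windows that remain, which matches the behaviour of the original expression.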
