Concatenate two dataframes by column - python-3.x
I have two dataframes. The first dataframe contains a run of years with every count set to 0:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 0
4 1894 0
5 1895 0
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 0
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 0
18 1908 0
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 0
27 1917 0
28 1918 0
29 1919 0
.. ... ...
90 1980 0
91 1981 0
92 1982 0
93 1983 0
94 1984 0
95 1985 0
96 1986 0
97 1987 0
98 1988 0
99 1989 0
100 1990 0
101 1991 0
102 1992 0
103 1993 0
104 1994 0
105 1995 0
106 1996 0
107 1997 0
108 1998 0
109 1999 0
110 2000 0
111 2001 0
112 2002 0
113 2003 0
114 2004 0
115 2005 0
116 2006 0
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The second dataframe has the same columns but covers fewer years, and its counts are filled in:
year count
0 1970 1
1 1957 7
2 1947 19
3 1987 12
4 1979 7
5 1940 1
6 1950 19
7 1972 4
8 1954 15
9 1976 15
10 2006 3
11 1963 16
12 1980 6
13 1956 13
14 1967 5
15 1893 1
16 1985 5
17 1964 6
18 1949 11
19 1945 15
20 1948 16
21 1959 16
22 1958 12
23 1929 1
24 1965 12
25 1969 15
26 1946 12
27 1961 1
28 1988 1
29 1918 1
30 1999 3
31 1986 3
32 1981 2
33 1960 2
34 1974 4
35 1953 9
36 1968 11
37 1916 2
38 1955 5
39 1978 1
40 2003 1
41 1982 4
42 1984 3
43 1966 4
44 1983 3
45 1962 3
46 1952 4
47 1992 2
48 1973 4
49 1993 10
50 1975 2
51 1900 1
52 1991 1
53 1907 1
54 1977 4
55 1908 1
56 1998 2
57 1997 3
58 1895 1
I want to create a third dataframe, df3. For each row, if the year is present in both df1 and df2, then df3["count"] should take df2["count"]; otherwise it should take df1["count"].
I tried to use join to do this:
df_new = df2.join(df1, on='year', how='left')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But I got an error:
ValueError: columns overlap but no suffix specified: Index(['year'], dtype='object')
I found a solution to this error (Pandas join issue: columns overlap but no suffix specified) and ran the code with those changes:
df_new = df2.join(df1, on='year', how='left', lsuffix='_left', rsuffix='_right')
df_new['count'] = df_new['count'].fillna(0)
print(df_new)
But the output is not what I want:
count year
0 NaN 1890
1 NaN 1891
2 NaN 1892
3 NaN 1893
4 NaN 1894
5 NaN 1895
6 NaN 1896
7 NaN 1897
8 NaN 1898
9 NaN 1899
10 NaN 1900
11 NaN 1901
12 NaN 1902
13 NaN 1903
14 NaN 1904
15 NaN 1905
16 NaN 1906
17 NaN 1907
18 NaN 1908
19 NaN 1909
20 NaN 1910
21 NaN 1911
22 NaN 1912
23 NaN 1913
24 NaN 1914
25 NaN 1915
26 NaN 1916
27 NaN 1917
28 NaN 1918
29 NaN 1919
.. ... ...
29 1.0 1918
30 3.0 1999
31 3.0 1986
32 2.0 1981
33 2.0 1960
34 4.0 1974
35 9.0 1953
36 11.0 1968
37 2.0 1916
38 5.0 1955
39 1.0 1978
40 1.0 2003
41 4.0 1982
42 3.0 1984
43 4.0 1966
44 3.0 1983
45 3.0 1962
46 4.0 1952
47 2.0 1992
48 4.0 1973
49 10.0 1993
50 2.0 1975
51 1.0 1900
52 1.0 1991
53 1.0 1907
54 4.0 1977
55 1.0 1908
56 2.0 1998
57 3.0 1997
58 1.0 1895
[179 rows x 2 columns]
The desired output is:
year count
0 1890 0
1 1891 0
2 1892 0
3 1893 1
4 1894 0
5 1895 1
6 1896 0
7 1897 0
8 1898 0
9 1899 0
10 1900 1
11 1901 0
12 1902 0
13 1903 0
14 1904 0
15 1905 0
16 1906 0
17 1907 1
18 1908 1
19 1909 0
20 1910 0
21 1911 0
22 1912 0
23 1913 0
24 1914 0
25 1915 0
26 1916 2
27 1917 0
28 1918 1
29 1919 0
.. ... ...
90 1980 6
91 1981 2
92 1982 4
93 1983 3
94 1984 3
95 1985 5
96 1986 3
97 1987 12
98 1988 1
99 1989 0
100 1990 0
101 1991 1
102 1992 2
103 1993 10
104 1994 0
105 1995 0
106 1996 0
107 1997 3
108 1998 2
109 1999 3
110 2000 0
111 2001 0
112 2002 0
113 2003 1
114 2004 0
115 2005 0
116 2006 3
117 2007 0
118 2008 0
119 2009 0
[120 rows x 2 columns]
The issue is that you should set year as the index. In addition, if you don't want to lose data, you should use an outer join instead of a left join.
This is my code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df2 = pd.DataFrame({
    "year": np.random.randint(1850, 2000, size=(100,)),
    "qty": np.random.randint(0, 10, size=(100,)),
})
df = df.set_index("year")
df2 = df2.set_index("year")
df3 = df.join(df2["qty"], how="outer", lsuffix="_left", rsuffix="_right")
df3 = df3.fillna(0)
At this step you have two columns holding the values from df1 and df2. As for your merge rule, I don't get what you want. You said:

if df1["qty"] == df2["qty"] => df3["qty"] = df2["qty"]
if df1["qty"] != df2["qty"] => df3["qty"] = df1["qty"]

That means you always end up with df1["qty"], since when the two values are equal it makes no difference which one you take. Am I right?
Just in case: if you want code you can adjust, you can use apply as follows:

def foo(x1, x2):
    # keep df2's value when both agree, otherwise fall back to df1's
    if x1 == x2:
        return x2
    else:
        return x1

df3["count"] = df3.apply(lambda row: foo(row["qty_left"], row["qty_right"]), axis=1)
df3.drop(["qty_left", "qty_right"], axis=1, inplace=True)
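For the exact frames in the question (df1 holding the full year range with zero counts, df2 holding the observed counts), a shorter route could be DataFrame.update, which aligns on the index and overwrites only the matching rows. This is a minimal sketch, assuming df1 and df2 are named as in the question:

import pandas as pd

# align both frames on year, then overwrite df1's zeros where df2 has a count
df3 = df1.set_index("year")
df3.update(df2.set_index("year"))
df3 = df3.reset_index()
df3["count"] = df3["count"].astype(int)  # update upcasts the column to float
print(df3)

Because update ignores years that are missing from df2, all 120 rows of df1 survive, which matches the desired output.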
I hope it helps,
Nicolas
Related
I would like to find consecutive numbers in column A and column B in python (pandas)
I would like to find consecutive numbers in Column A and Column B in python; Column A should be ascending but Column B is descending. I am attaching an example file.

Input file:

nucleotide               Pos_A  Pos_B
Connection_Pos20_Pos102     20    102
Connection_Pos19_Pos102     19    102
Connection_Pos20_Pos101     20    101
Connection_Pos18_Pos102     18    102
Connection_Pos19_Pos101     19    101
Connection_Pos20_Pos100     20    100
Connection_Pos17_Pos102     17    102
Connection_Pos18_Pos101     18    101
Connection_Pos19_Pos100     19    100
Connection_Pos20_Pos99      20     99
Connection_Pos16_Pos102     16    102
Connection_Pos17_Pos101     17    101
Connection_Pos18_Pos100     18    100
Connection_Pos19_Pos99      19     99
Connection_Pos20_Pos98      20     98
Connection_Pos15_Pos102     15    102
Connection_Pos16_Pos101     16    101
Connection_Pos17_Pos100     17    100
Connection_Pos18_Pos99      18     99
Connection_Pos19_Pos98      19     98
Connection_Pos20_Pos97      20     97
Connection_Pos14_Pos102     14    102
Connection_Pos15_Pos101     15    101
Connection_Pos16_Pos100     16    100

Output:

nucleotide               Pos_A  Pos_B  Consecutive ID  Consecutive Number (Size)
Connection_Pos20_Pos102     20    102             101                          1
Connection_Pos19_Pos102     19    102             100                          2
Connection_Pos20_Pos101     20    101             100                          2
Connection_Pos18_Pos102     18    102              99                          3
Connection_Pos19_Pos101     19    101              99                          3
Connection_Pos20_Pos100     20    100              99                          3
Connection_Pos17_Pos102     17    102              98                          4
Connection_Pos18_Pos101     18    101              98                          4
Connection_Pos19_Pos100     19    100              98                          4
Connection_Pos20_Pos99      20     99              98                          4
Connection_Pos16_Pos102     16    102              97                          5
Connection_Pos17_Pos101     17    101              97                          5
Connection_Pos18_Pos100     18    100              97                          5
Connection_Pos19_Pos99      19     99              97                          5
Connection_Pos20_Pos98      20     98              97                          5
Connection_Pos15_Pos102     15    102              96                          6
Connection_Pos16_Pos101     16    101              96                          6
Connection_Pos17_Pos100     17    100              96                          6
Connection_Pos18_Pos99      18     99              96                          6
Connection_Pos19_Pos98      19     98              96                          6
Connection_Pos20_Pos97      20     97              96                          6
Connection_Pos14_Pos102     14    102              95                          7
Connection_Pos15_Pos101     15    101              95                          7
Connection_Pos16_Pos100     16    100              95                          7
Connection_Pos17_Pos99      17     99              95                          7
Connection_Pos18_Pos98      18     98              95                          7
Connection_Pos19_Pos97      19     97              95                          7
Connection_Pos20_Pos96      20     96              95                          7
For Consecutive ID, if Pos_B's shifted difference != 1, then we want to subtract 1, so we mark those indexes as -1 with mul(-1) and cumsum them:

df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]

For Consecutive Number, if Pos_A's shifted difference != -1, then we want to add 1, so we mark those indexes as 1 and cumsum again:

df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()

Result:

                 nucleotide  Pos_A  Pos_B   ID  Number
0   Connection_Pos20_Pos102     20    102  101       1
1   Connection_Pos19_Pos102     19    102  100       2
2   Connection_Pos20_Pos101     20    101  100       2
3   Connection_Pos18_Pos102     18    102   99       3
4   Connection_Pos19_Pos101     19    101   99       3
5   Connection_Pos20_Pos100     20    100   99       3
6   Connection_Pos17_Pos102     17    102   98       4
7   Connection_Pos18_Pos101     18    101   98       4
8   Connection_Pos19_Pos100     19    100   98       4
9   Connection_Pos20_Pos99      20     99   98       4
10  Connection_Pos16_Pos102     16    102   97       5
11  Connection_Pos17_Pos101     17    101   97       5
12  Connection_Pos18_Pos100     18    100   97       5
13  Connection_Pos19_Pos99      19     99   97       5
14  Connection_Pos20_Pos98      20     98   97       5
15  Connection_Pos15_Pos102     15    102   96       6
16  Connection_Pos16_Pos101     16    101   96       6
17  Connection_Pos17_Pos100     17    100   96       6
18  Connection_Pos18_Pos99      18     99   96       6
19  Connection_Pos19_Pos98      19     98   96       6
20  Connection_Pos20_Pos97      20     97   96       6
21  Connection_Pos14_Pos102     14    102   95       7
22  Connection_Pos15_Pos101     15    101   95       7
23  Connection_Pos16_Pos100     16    100   95       7
Do it one by one, then groupby with ngroup:

s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1

Out[452]:
0     1
1     2
2     2
3     3
4     3
5     3
6     4
7     4
8     4
9     4
10    5
11    5
12    5
13    5
14    5
15    6
16    6
17    6
18    6
19    6
20    6
21    7
22    7
23    7
24    7
25    7
26    7
27    7
dtype: int64
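As a quick sanity check, both answers can be run on a small hypothetical slice of the input (only the first six rows here); the computed columns match the expected output:

import pandas as pd

# hypothetical 6-row slice of the question's input
df = pd.DataFrame({
    "Pos_A": [20, 19, 20, 18, 19, 20],
    "Pos_B": [102, 102, 101, 102, 101, 100],
})

# first answer: mark breaks in the shifted differences, then cumsum
df["ID"] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
df["Number"] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()

# second answer: label each ascending-Pos_A / descending-Pos_B run, then ngroup
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df["out"] = df.groupby([s1, s2]).ngroup() + 1

print(df)
# ID:     101 100 100  99  99  99
# Number:   1   2   2   3   3   3
# out:      1   2   2   3   3   3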
Pandas: Combine pandas columns that have the same column name
If we have the following df:

    A   A   B   B   B
0  10   2   0   3   3
1  20   4  19  21  36
2  30  20  24  24  12
3  40  10  39  23  46

How can I combine the content of the columns with the same names? e.g.

     A   B
0   10   0
1   20  19
2   30  24
3   40  39
4    2   3
5    4  21
6   20  24
7   10  23
8   Na   3
9   Na  36
10  Na  12
11  Na  46

I tried groupby and merge, and neither does this job. Any help is appreciated.
If column names are duplicated you can use DataFrame.melt with concat:

df = pd.concat([df['A'].melt()['value'], df['B'].melt()['value']], axis=1, keys=['A','B'])
print (df)

       A   B
0   10.0   0
1   20.0  19
2   30.0  24
3   40.0  39
4    2.0   3
5    4.0  21
6   20.0  24
7   10.0  23
8    NaN   3
9    NaN  36
10   NaN  12
11   NaN  46

EDIT: a general solution for any set of duplicated column names:

uniq = df.columns.unique()
df = pd.concat([df[c].melt()['value'] for c in uniq], axis=1, keys=uniq)
print (df)

This prints the same output as above.
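For anyone who wants to reproduce this, the frame with duplicated column names can be built directly (a hypothetical construction; in practice such frames usually come from files with repeated headers):

import pandas as pd

df = pd.DataFrame(
    [[10, 2, 0, 3, 3],
     [20, 4, 19, 21, 36],
     [30, 20, 24, 24, 12],
     [40, 10, 39, 23, 46]],
    columns=['A', 'A', 'B', 'B', 'B'])  # duplicated names are allowed

# df['A'] then returns a two-column frame, which is why melt works above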
How can I find the sum of certain columns and find avg of other columns in python?
I've combined 10 Excel files, each with one year of NFL passing stats, and there are certain columns (games played, completions, attempts, etc.) that I have summed, but others (passer rating and QBR) that I'd like to see the average for.

df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Cmp%': 'sum', 'Yds': 'sum',
     'TD': 'sum', 'TD%': 'sum', 'Int': 'sum', 'Int%': 'sum', 'Y/A': 'sum',
     'AY/A': 'sum', 'Y/C': 'sum', 'Y/G': 'sum', 'Rate': 'sum', 'QBR': 'sum',
     'Sk': 'sum', 'Yds.1': 'sum', 'NY/A': 'sum', 'ANY/A': 'sum', 'Sk%': 'sum',
     '4QC': 'sum', 'GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the error or issue. No one is really going to take the time to manufacture your dataset from a photo (or I should say rarely, as I did do that here...because I love working with sports data, and I could grab it relatively quickly). To get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?

import pandas as pd

df = pd.DataFrame()
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    df = df.append(pd.read_html(url)[0], sort=False)

df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '')
df['Player'] = df['Player'].str.strip()

strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]

df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})

Output:

print (df3)
                 Player   GS   TD         QBR
0           A.J. Feeley    3    1   27.300000
1         A.J. McCarron    4    6         NaN
2         Aaron Rodgers  142  305   68.522222
3           Ace Sanders    4    1  100.000000
4          Adam Podlesh    0    0    0.000000
5         Albert Wilson    7    1   99.700000
6         Alex Erickson    6    0         NaN
7            Alex Smith  121  156   55.122222
8           Alex Tanney    0    1   42.900000
9          Alvin Kamara    9    0         NaN
10          Andrew Beck    6    0         NaN
11          Andrew Luck   86  171   62.766667
12          Andy Dalton  133  204   53.375000
13             Andy Lee    0    0    0.000000
14        Anquan Boldin   32    0   11.600000
15       Anthony Miller    4    0   81.200000
16      Antonio Andrews   10    1  100.000000
17        Antonio Brown   55    1   29.300000
18     Antonio Morrison    4    0         NaN
19    Antwaan Randle El    0    2  100.000000
20         Arian Foster   13    1  100.000000
21      Armanti Edwards    0    0   41.466667
22         Austin Davis   10   13   38.150000
23         B.J. Daniels    0    0         NaN
24       Baker Mayfield   29   49   53.200000
25   Ben Roethlisberger  130  236   66.833333
26        Bernard Scott    1    0    5.600000
27         Bilal Powell   12    0   17.700000
28          Billy Volek    0    0   89.400000
29       Blaine Gabbert   48   48   37.687500
..                  ...  ...  ...         ...
329           Tim Boyle    0    0         NaN
330         Tim Masthay    0    1    5.700000
331           Tim Tebow   16   17   42.733333
332         Todd Bouman    1    2   57.400000
333        Todd Collins    1    0    0.800000
334           Tom Brady  156  316   72.755556
335     Tom Brandstater    0    0    0.000000
336          Tom Savage    9    5   38.733333
337           Tony Pike    0    0    2.500000
338           Tony Romo   72  141   71.185714
339      Travaris Cadet    1    0         NaN
340     Travis Benjamin    8    0    1.700000
341        Travis Kelce   15    0    1.800000
342       Trent Edwards    3    2   98.100000
343           Tress Way    0    0         NaN
344      Trevone Boykin    0    1   66.400000
345      Trevor Siemian   25   30   40.750000
346          Troy Smith    6    5   38.500000
347          Tyler Boyd   14    0    2.800000
348          Tyler Bray    0    0    0.000000
349         Tyler Palko    4    2   56.600000
350       Tyler Thigpen    1    2   30.233333
351         Tyreek Hill   13    0    0.000000
352        Tyrod Taylor   46   54   51.242857
353         Vince Young   11   14   50.850000
354          Will Grier    2    0         NaN
355        Willie Snead    4    1  100.000000
356   Zach Mettenberger   10   12   24.600000
357         Zach Pascal   13    0         NaN
358           Zay Jones   15    0    0.000000
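As a side note, on pandas 0.25 or newer the same groupby can be written with named aggregation, which makes the sum/mean split easier to read. A sketch, assuming the df built above:

df3 = df.groupby('Player', as_index=False).agg(
    GS=('GS', 'sum'),     # games started: total across seasons
    TD=('TD', 'sum'),     # touchdowns: total across seasons
    QBR=('QBR', 'mean'),  # QBR: average across seasons
)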
How to do subtraction in the cells of columns in python
I have this dataframe (df) in python:

   Cumulative sales
0                12
1                28
2                56
3                87

I want to create a new column in which I would have the number of new sales (N - (N-1)), as below:

   Cumulative sales  New Sales
0                12         12
1                28         16
2                56         28
3                87         31
You can do:

df['new sale'] = df.Cumulativesales.diff().fillna(df.Cumulativesales)
df

   Cumulativesales  new sale
0               12      12.0
1               28      16.0
2               56      28.0
3               87      31.0
Do this:

df['New_sales'] = df['Cumulative_sales'].diff()
df.fillna(df.iloc[0]['Cumulative_sales'], inplace=True)
print(df)

Output:

   Cumulative_sales  New_sales
0                12       12.0
1                28       16.0
2                56       28.0
3                87       31.0
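Both answers rely on diff(), which is exactly the N - (N-1) subtraction from the question written as a method call. A spelled-out equivalent, assuming the original column name:

import pandas as pd

df = pd.DataFrame({"Cumulative sales": [12, 28, 56, 87]})

# N - (N-1): subtract the previous row; the first row has no predecessor,
# so fill its NaN with the original value and restore the integer dtype
prev = df["Cumulative sales"].shift()
df["New Sales"] = (df["Cumulative sales"] - prev).fillna(df["Cumulative sales"]).astype(int)
print(df)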
Rolling Sum in Pandas
I need to find the sum of each row and the 3 subsequent rows. Here is the code I'm using to accomplish this:

df_sum_row_and_3_subsequent_rows = df.iloc[::-1].rolling(4, min_periods=1).sum().sum(axis=1).iloc[::-1]

Is there a better way of doing this? Here is the DataFrame that I'm using:

                     NBL  NBT  NBR  SBL  SBT  SBR  EBL  EBT  EBR  WBL  WBT  WBR
DATETIME
2020-01-01 10:00:00    9    2   28    8    6    5    4   92    9   21  124    5
2020-01-01 10:15:00   13    0   24   12    2    7    5   91    7   20  123   11
2020-01-01 10:30:00    5    1   16   10    2    4    9  115   12   21  118    9
2020-01-01 10:45:00   10    5   25    9    2    6    5  114    6   25  128   13
2020-01-01 11:00:00   11    4   28   11    3    7    7  110    8   30  126   10
2020-01-01 11:15:00    0    0    0    0    0    0    0    0    0    0    0    0
2020-01-01 11:30:00    0    0    0    0    0    0    0    0    0    0    0    0
2020-01-01 11:45:00    8    5   24   12   10   12   14  130   18   42  154   17
2020-01-01 12:00:00   14    5   29   15    1   17    4  138   17   44  141    9
2020-01-01 12:15:00   12    4   45   13    3   13   13  147   27   47  134   13
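One possible alternative that avoids the double reversal, assuming pandas 1.0+ (which added forward-looking window indexers): since a rolling sum is linear, you can sum across the columns first and then take a forward window of 4 rows. A sketch, not benchmarked against the original:

import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# window covers each row plus the 3 rows after it
indexer = FixedForwardWindowIndexer(window_size=4)
result = df.sum(axis=1).rolling(indexer, min_periods=1).sum()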