Copy row of data from one pandas dataframe to another - python-3.x

A pandas newbie here. I imported Excel data into pandas and want to copy a subset of the data in a specific row (placeholder) from one DataFrame (Error_data1) to another DataFrame (Error_data2) wherever the 'placeholder' exists.
Here are the first 4 rows of Error_data1 (it has 150 rows):
index student Error1 Error2 Error3 Error4 Error5
0 Henry 2.5647 -0.2145 1.3524 2.0124 6.2013
1 John -0.0124 1.0365 3.2145 4.0211 -5.0124
2 Terry 1.1120 2.2154 -6.2013 1.2032 2.3321
3 Gerald 9.2105 1.0212 3.2548 3.6478 4.1020
Here are the first 5 rows of Error_data2 (it has 358 rows):
index Day Time student Error1 Error2 Error3 Error4 Error5
0 Mon 01:00 Terry
1 Tue 05:15 John
2 Wed 05:25 john
3 Wed 12:15 Gerald
4 Thur 11:00 Henry
Here is the code I tried:
for i in range(len(Error_data1)):
    if Error_data1['Student'][i] == Error_data2['Student'][i]:
        a = Error_data1.iloc[i,1:6]
        Error_data2.iloc[i,4:9] = a
I expect Error_data2 to look like this:
index Day Time student Error1 Error2 Error3 Error4 Error5
0 Mon 01:00 Terry 1.1120 2.2154 -6.2013 1.2032 2.3321
1 Tue 05:15 John -0.0124 1.0365 3.2145 4.0211 -5.0124
2 Wed 05:25 john -0.0124 1.0365 3.2145 4.0211 -5.0124
3 Wed 12:15 Gerald 9.2105 1.0212 3.2548 3.6478 4.1020
4 Thur 11:00 Henry 2.5647 -0.2145 1.3524 2.0124 6.2013

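You can merge the two DataFrames on the student name. Use Error_data2 as the left frame so that its rows and their order are kept, and drop its empty Error columns first so the merge does not produce duplicate _x/_y columns:
combined = Error_data2[['Day', 'Time', 'student']].merge(Error_data1, on='student', how='left')
Note that the match is case-sensitive, so 'john' will not match 'John' unless the names are normalized first.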
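A minimal sketch of the merge approach, using the sample rows from the question. The lowercase helper column `key` and the variable names `lookup` and `filled` are assumptions made for illustration; the helper column is only there so that 'john' and 'John' are treated as the same student.

```python
import pandas as pd

# Sample rows copied from the question.
error_data1 = pd.DataFrame({
    "student": ["Henry", "John", "Terry", "Gerald"],
    "Error1": [2.5647, -0.0124, 1.1120, 9.2105],
    "Error2": [-0.2145, 1.0365, 2.2154, 1.0212],
    "Error3": [1.3524, 3.2145, -6.2013, 3.2548],
    "Error4": [2.0124, 4.0211, 1.2032, 3.6478],
    "Error5": [6.2013, -5.0124, 2.3321, 4.1020],
})
error_data2 = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed", "Wed", "Thur"],
    "Time": ["01:00", "05:15", "05:25", "12:15", "11:00"],
    "student": ["Terry", "John", "john", "Gerald", "Henry"],
})

# Hypothetical helper column: a lowercase copy of the name used as the join key.
lookup = error_data1.assign(key=error_data1["student"].str.lower()).drop(columns="student")

# A left merge keeps every row of error_data2 in its original order.
filled = (
    error_data2.assign(key=error_data2["student"].str.lower())
    .merge(lookup, on="key", how="left")
    .drop(columns="key")
)
print(filled)
```

Students in Error_data2 that have no match in Error_data1 would end up with NaN in the Error columns.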

Related

Creating multiple named dataframes by a for loop

I have a database that contains 60,000+ rows of college football recruit data. From there, I want to create separate DataFrames, each containing the rows for just one value. This is what a sample of the DataFrame looks like:
,Primary Rank,Other Rank,Name,Link,Highschool,Position,Height,weight,Rating,National Rank,Position Rank,State Rank,Team,Class
0,1,,D.J. Williams,https://247sports.com/Player/DJ-Williams-49931,"De La Salle (Concord, CA)",ILB,6-2,235,0.9998,1,1,1,Miami,2000
1,2,,Brock Berlin,https://247sports.com/Player/Brock-Berlin-49926,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,0.9998,2,1,1,Florida,2000
2,3,,Charles Rogers,https://247sports.com/Player/Charles-Rogers-49984,"Saginaw (Saginaw, MI)",WR,6-4,195,0.9988,3,1,1,Michigan State,2000
3,4,,Travis Johnson,https://247sports.com/Player/Travis-Johnson-50043,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,0.9982,4,1,2,Florida State,2000
4,5,,Marcus Houston,https://247sports.com/Player/Marcus-Houston-50139,"Thomas Jefferson (Denver, CO)",RB,6-0,208,0.9980,5,1,1,Colorado,2000
5,6,,Kwame Harris,https://247sports.com/Player/Kwame-Harris-49999,"Newark (Newark, DE)",OT,6-7,320,0.9978,6,1,1,Stanford,2000
6,7,,B.J. Johnson,https://247sports.com/Player/BJ-Johnson-50154,"South Grand Prairie (Grand Prairie, TX)",WR,6-1,190,0.9976,7,2,1,Texas,2000
7,8,,Bryant McFadden,https://247sports.com/Player/Bryant-McFadden-50094,"McArthur (Hollywood, FL)",CB,6-1,182,0.9968,8,1,1,Florida State,2000
8,9,,Sam Maldonado,https://247sports.com/Player/Sam-Maldonado-50071,"Harrison (Harrison, NY)",RB,6-2,215,0.9964,9,2,1,Ohio State,2000
9,10,,Mike Munoz,https://247sports.com/Player/Mike-Munoz-50150,"Archbishop Moeller (Cincinnati, OH)",OT,6-7,290,0.9960,10,2,1,Tennessee,2000
10,11,,Willis McGahee,https://247sports.com/Player/Willis-McGahee-50179,"Miami Central (Miami, FL)",RB,6-1,215,0.9948,11,3,2,Miami,2000
11,12,,Antonio Hall,https://247sports.com/Player/Antonio-Hall-50175,"McKinley (Canton, OH)",OT,6-5,295,0.9946,12,3,2,Kentucky,2000
12,13,,Darrell Lee,https://247sports.com/Player/Darrell-Lee-50580,"Kirkwood (Saint Louis, MO)",WDE,6-5,230,0.9940,13,1,1,Florida,2000
13,14,,O.J. Owens,https://247sports.com/Player/OJ-Owens-50176,"North Stanly (New London, NC)",S,6-1,195,0.9932,14,1,1,Tennessee,2000
14,15,,Jeff Smoker,https://247sports.com/Player/Jeff-Smoker-50582,"Manheim Central (Manheim, PA)",PRO,6-3,190,0.9922,15,2,1,Michigan State,2000
15,16,,Marco Cooper,https://247sports.com/Player/Marco-Cooper-50171,"Cass Technical (Detroit, MI)",OLB,6-2,235,0.9918,16,1,2,Ohio State,2000
16,17,,Chance Mock,https://247sports.com/Player/Chance-Mock-50163,"The Woodlands (The Woodlands, TX)",PRO,6-2,190,0.9918,17,3,2,Texas,2000
17,18,,Roy Williams,https://247sports.com/Player/Roy-Williams-55566,"Permian (Odessa, TX)",WR,6-4,202,0.9916,18,3,3,Texas,2000
18,19,,Matt Grootegoed,https://247sports.com/Player/Matt-Grootegoed-50591,"Mater Dei (Santa Ana, CA)",OLB,5-11,205,0.9914,19,2,3,USC,2000
19,20,,Yohance Buchanan,https://247sports.com/Player/Yohance-Buchanan-50182,"Douglass (Atlanta, GA)",S,6-1,210,0.9912,20,2,1,Florida State,2000
20,21,,Mac Tyler,https://247sports.com/Player/Mac-Tyler-50572,"Jess Lanier (Hueytown, AL)",DT,6-6,320,0.9912,21,1,1,Alabama,2000
21,22,,Jason Respert,https://247sports.com/Player/Jason-Respert-55623,"Northside (Warner Robins, GA)",OC,6-3,300,0.9902,22,1,2,Tennessee,2000
22,23,,Casey Clausen,https://247sports.com/Player/Casey-Clausen-50183,"Bishop Alemany (Mission Hills, CA)",PRO,6-4,215,0.9896,23,4,4,Tennessee,2000
23,24,,Albert Means,https://247sports.com/Player/Albert-Means-55968,"Trezevant (Memphis, TN)",SDE,6-6,310,0.9890,24,2,1,Alabama,2000
24,25,,Albert Hollis,https://247sports.com/Player/Albert-Hollis-55958,"Christian Brothers (Sacramento, CA)",RB,6-0,190,0.9890,25,4,5,Georgia,2000
25,26,,Eric Moore,https://247sports.com/Player/Eric-Moore-55973,"Pahokee (Pahokee, FL)",OLB,6-4,226,0.9884,26,3,3,Florida State,2000
26,27,,Willie Dixon,https://247sports.com/Player/Willie-Dixon-55626,"Stockton Christian School (Stockton, CA)",WR,5-11,182,0.9884,27,4,6,Miami,2000
27,28,,Cory Bailey,https://247sports.com/Player/Cory-Bailey-50586,"American (Hialeah, FL)",S,5-10,175,0.9880,28,3,4,Florida,2000
28,29,,Sean Young,https://247sports.com/Player/Sean-Young-55972,"Northwest Whitfield County (Tunnel Hill, GA)",OG,6-6,293,0.9878,29,1,3,Tennessee,2000
29,30,,Johnnie Morant,https://247sports.com/Player/Johnnie-Morant-60412,"Parsippany Hills (Morris Plains, NJ)",WR,6-5,225,0.9871,30,5,1,Syracuse,2000
30,31,,Wes Sims,https://247sports.com/Player/Wes-Sims-60243,"Weatherford (Weatherford, OK)",OG,6-5,310,0.9869,31,2,1,Oklahoma,2000
31,33,,Jason Campbell,https://247sports.com/Player/Jason-Campbell-55976,"Taylorsville (Taylorsville, MS)",PRO,6-5,190,0.9853,33,5,1,Auburn,2000
32,34,,Antwan Odom,https://247sports.com/Player/Antwan-Odom-50168,"Alma Bryant (Irvington, AL)",SDE,6-7,260,0.9851,34,3,2,Alabama,2000
33,35,,Sloan Thomas,https://247sports.com/Player/Sloan-Thomas-55630,"Klein (Spring, TX)",WR,6-2,188,0.9847,35,6,5,Texas,2000
34,36,,Raymond Mann,https://247sports.com/Player/Raymond-Mann-60804,"Hampton (Hampton, VA)",ILB,6-1,233,0.9847,36,2,1,Virginia,2000
35,37,,Alphonso Townsend,https://247sports.com/Player/Alphonso-Townsend-55975,"Lima Central Catholic (Lima, OH)",DT,6-6,280,0.9847,37,2,3,Ohio State,2000
36,38,,Greg Jones,https://247sports.com/Player/Greg-Jones-50158,"Battery Creek (Beaufort, SC)",RB,6-2,245,0.9837,38,6,1,Florida State,2000
37,39,,Paul Mociler,https://247sports.com/Player/Paul-Mociler-60319,"St. John Bosco (Bellflower, CA)",OG,6-5,300,0.9833,39,3,7,UCLA,2000
38,40,,Chris Septak,https://247sports.com/Player/Chris-Septak-57555,"Millard West (Omaha, NE)",TE,6-3,245,0.9833,40,1,1,Nebraska,2000
39,41,,Eric Knott,https://247sports.com/Player/Eric-Knott-60823,"Henry Ford II (Sterling Heights, MI)",TE,6-4,235,0.9831,41,2,3,Michigan State,2000
40,42,,Harold James,https://247sports.com/Player/Harold-James-57524,"Osceola (Osceola, AR)",S,6-1,220,0.9827,42,4,1,Alabama,2000
For example, if I don't use a for loop, this line of code is what I use if I just want to create one dataframe:
recruits2022 = recruits_final[recruits_final['Class'] == 2022]
However, I want to have a named dataframe for each recruiting class.
In other words, recruits2000 would be a DataFrame of all rows that have a Class value equal to 2000, recruits2001 would be a DataFrame of all rows that have a Class value equal to 2001, and so forth.
This is what I tried recently, but I had no luck saving the DataFrames outside of the for loop.
databases = ['recruits2000', 'recruits2001', 'recruits2002', 'recruits2003', 'recruits2004',
             'recruits2005', 'recruits2006', 'recruits2007', 'recruits2008', 'recruits2009',
             'recruits2010', 'recruits2011', 'recruits2012', 'recruits2013', 'recruits2014',
             'recruits2015', 'recruits2016', 'recruits2017', 'recruits2018', 'recruits2019',
             'recruits2020', 'recruits2021', 'recruits2022', 'recruits2023']
for i in range(len(databases)):
    year = pd.to_numeric(databases[i][-4:], errors = 'coerce')
    db = recruits_final[recruits_final['Class'] == year]
    db.name = databases[i]
    print(db)
    print(db.name)
    print(year)
recruits2023
I would get this error instead of what I wanted
NameError Traceback (most recent call last)
<ipython-input-49-7cb5d12ab92f> in <module>()
29
30 # print(db.name)
---> 31 recruits2023
32
33
NameError: name 'recruits2023' is not defined
Is there something that I am missing to get this for loop to work? Any assistance is truly appreciated. Thanks in advance.
Just use a dictionary of DataFrames built with groupby:
dict_dfs = dict(tuple(df.groupby('Class')))
Access your individual DataFrames using
dict_dfs[2022]
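You overwrite the variable db on each iteration, and recruits2023 is never defined as a variable, so you can't refer to it like that. Setting db.name only attaches an attribute to the DataFrame object; it does not create a variable with that name.
You can use a dict to store your data:
recruits = {}
for year in recruits_final['Class'].unique():
    recruits[year] = recruits_final[recruits_final['Class'] == year]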
>>> recruits[2000]
Primary Rank Other Rank Name Link ... Position Rank State Rank Team Class
0 1 NaN D.J. Williams https://247sports.com/Player/DJ-Williams-49931 ... 1 1 Miami 2000
1 2 NaN Brock Berlin https://247sports.com/Player/Brock-Berlin-49926 ... 1 1 Florida 2000
2 3 NaN Charles Rogers https://247sports.com/Player/Charles-Rogers-49984 ... 1 1 Michigan State 2000
3 4 NaN Travis Johnson https://247sports.com/Player/Travis-Johnson-50043 ... 1 2 Florida State 2000
...
38 40 NaN Chris Septak https://247sports.com/Player/Chris-Septak-57555 ... 1 1 Nebraska 2000
39 41 NaN Eric Knott https://247sports.com/Player/Eric-Knott-60823 ... 2 3 Michigan State 2000
40 42 NaN Harold James https://247sports.com/Player/Harold-James-57524 ... 4 1 Alabama 2000
>>> recruits.keys()
dict_keys([2000])
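As a rough, self-contained sketch of the dictionary approach from both answers: the three 2000 rows come from the sample above, while the 2001 row is a made-up placeholder added only so that the result has more than one key.

```python
import pandas as pd

recruits_final = pd.DataFrame({
    "Name": ["D.J. Williams", "Brock Berlin", "Charles Rogers",
             "Placeholder Recruit"],  # last row is hypothetical, not real data
    "Team": ["Miami", "Florida", "Michigan State", "Placeholder Team"],
    "Class": [2000, 2000, 2000, 2001],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per recruiting class.
recruits = {year: group for year, group in recruits_final.groupby("Class")}

print(sorted(recruits))           # the class years available as keys
print(recruits[2000]["Name"])     # rows for the 2000 class only
```

A dict keyed by year also makes it easy to loop over every class later, which separately named variables such as recruits2000, recruits2001, ... would not.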

Create New DataFrame Columns Based on Year

I have a pandas DataFrame that contains NFL Quarterback Data from the 2015-2016 to the 2019-2020 Seasons. The DataFrame looks like this
Player Season End Year YPG TD
Tom Brady 2019 322.6 25
Tom Brady 2018 308.1 26
Tom Brady 2017 295.7 24
Tom Brady 2016 308.7 28
Aaron Rodgers 2019 360.4 30
Aaron Rodgers 2018 358.8 33
Aaron Rodgers 2017 357.9 35
Aaron Rodgers 2016 355.2 32
I want to be able to create new columns that contain the data for the year I select and for the three years before it. For example, if the year I select is 2019, the resulting DataFrame would be (SY stands for selected year):
Player Season End Year YPG SY YPG SY-1 YPG SY-2 YPG SY-3 TD
Tom Brady 2019 322.6 308.1 295.7 308.7 25
Aaron Rodgers 2019 360.4 358.8 357.9 355.2 30
This is how I am attempting to do it:
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']), 'YPG SY'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-1), 'YPG SY-1'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-2), 'YPG SY-2'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-3), 'YPG SY-3'] = NFL_Data['YPG']
However, when I run the code above, it doesn't fill out the columns appropriately. Most of the rows are 0. Am I approaching the problem the right way or is there a better way to attack it?
(Edited to include TD Column)
The first step is to pivot your DataFrame:
pivoted = df.pivot_table(index='Player', columns='Season End Year', values='YPG')
Which yields
Season End Year 2016 2017 2018 2019
Player
Aaron Rodgers 355.2 357.9 358.8 360.4
Tom Brady 308.7 295.7 308.1 322.6
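Then you can select the chosen year and the three years before it (here year = 2019):
pivoted.loc[:, range(year, year-4, -1)]
Season End Year 2019 2018 2017 2016
Player
Aaron Rodgers 360.4 358.8 357.9 355.2
Tom Brady 322.6 308.1 295.7 308.7
Or alternatively, as suggested by Quang, use a label slice, which includes both endpoints:
pivoted.loc[:, year:year-3:-1]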
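Putting the steps together, here is a sketch that goes from the long table in the question to the requested layout. The variable names (`nfl`, `wide`, `result`) are assumptions for illustration, and the TD column is taken from the selected year only, as in the expected output.

```python
import pandas as pd

nfl = pd.DataFrame({
    "Player": ["Tom Brady"] * 4 + ["Aaron Rodgers"] * 4,
    "Season End Year": [2019, 2018, 2017, 2016] * 2,
    "YPG": [322.6, 308.1, 295.7, 308.7, 360.4, 358.8, 357.9, 355.2],
    "TD": [25, 26, 24, 28, 30, 33, 35, 32],
})
year = 2019  # the selected year (SY)

# One row per player, one column per season.
pivoted = nfl.pivot_table(index="Player", columns="Season End Year", values="YPG")

# Keep SY, SY-1, SY-2, SY-3 in that order and rename the columns.
years = list(range(year, year - 4, -1))
wide = pivoted.loc[:, years]
wide.columns = ["YPG SY"] + [f"YPG SY-{k}" for k in range(1, 4)]

# Attach the TD value of the selected year.
td = nfl.loc[nfl["Season End Year"] == year].set_index("Player")["TD"]
result = wide.join(td).reset_index()
result.insert(1, "Season End Year", year)
print(result)
```

If a player has no row for one of the four seasons, the corresponding cell would be NaN rather than 0.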

group the columns by day and name and get the min value with their start and end using python pandas

I need to group the rows by day and name and get the min and max values, together with their start and end.
DataFrame:
day name value start end duration
Wednesday AAA 1 10/23/2019 2:46 10/23/2019 3:09 00:23
Wednesday AAA 1 10/23/2019 5:20 10/23/2019 5:44 00:24
Wednesday AAA 1 10/23/2019 6:51 10/23/2019 8:14 01:23
Wednesday AAA 17602 10/23/2019 12:35 10/23/2019 12:38 00:03
Wednesday AAA 1155 10/23/2019 15:50 10/23/2019 15:54 00:04
Code:
df.groupby(['day','name']).agg({'duration':[np.min,np.max],'start':[np.min,np.max],'end':[np.min,np.max],'value':[np.min,np.max]})
What I am getting:
day name duration_min duration_max duration_max_start duration_max_end duration_min_start duration_min_end value_min value_max
Wednesday AAA 00:03 01:23 10/23/2019 6:51 10/23/2019 3:09 10/23/2019 12:35 10/23/2019 15:54 1 17602
What I should be getting:
day name duration_min duration_max duration_max_start duration_max_end value_max duration_min_start duration_min_end value_min
Wednesday AAA 00:03 01:23 10/23/2019 6:51 10/23/2019 8:14 1 10/23/2019 12:35 10/23/2019 12:38 17602
What I want is the min and max per group, together with the start and end values that belong to them.
What you want is the attributes on the same row where duration min and max occur. What you wrote is the min and max of each individual column, whether they are on the same row or not.
Use idxmin & idxmax to find the row where min and max values occur, then merge with the original frame:
idx = df.groupby(['day','name'])['duration'].agg(['idxmin','idxmax'])
idx.merge(df.add_suffix('_min'), left_on='idxmin', right_index=True) \
   .merge(df.add_suffix('_max'), left_on='idxmax', right_index=True) \
   [['duration_min', 'duration_max', 'start_min', 'end_min', 'start_max', 'end_max', 'value_min', 'value_max']]
Result:
day | name | duration_min | duration_max | start_min | end_min | start_max | end_max | value_min | value_max
Wednesday | AAA | 00:03 | 01:23 | 2019-10-23 12:35:00 | 2019-10-23 12:38:00 | 2019-10-23 06:51:00 | 2019-10-23 08:14:00 | 17602 | 1
Rename the columns as needed.
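The same idea can be sketched end to end on the sample rows. This version is a variation on the merge shown above rather than a copy of it: it looks up the idxmin/idxmax rows with `df.loc` and joins them on the group keys. It assumes the duration strings are converted to timedeltas first (so that min and max compare durations rather than text), and the timestamps are written in ISO form for convenience.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Wednesday"] * 5,
    "name": ["AAA"] * 5,
    "value": [1, 1, 1, 17602, 1155],
    "start": pd.to_datetime(["2019-10-23 02:46", "2019-10-23 05:20", "2019-10-23 06:51",
                             "2019-10-23 12:35", "2019-10-23 15:50"]),
    "end": pd.to_datetime(["2019-10-23 03:09", "2019-10-23 05:44", "2019-10-23 08:14",
                           "2019-10-23 12:38", "2019-10-23 15:54"]),
    "duration": ["00:23", "00:24", "01:23", "00:03", "00:04"],
})
# "HH:MM" -> "HH:MM:00" so that pandas can parse it as a timedelta.
df["duration"] = pd.to_timedelta(df["duration"] + ":00")

keys = ["day", "name"]
grouped = df.groupby(keys)["duration"]

# idxmin/idxmax return the row label of the shortest/longest duration per group;
# df.loc then pulls the complete rows, so start, end and value stay together.
shortest = df.loc[grouped.idxmin()].set_index(keys).add_suffix("_min")
longest = df.loc[grouped.idxmax()].set_index(keys).add_suffix("_max")

result = shortest.join(longest)
print(result)
```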

Sum IFs of total count without recounting Multiple instances, only the closest date prior to the AS OF DATE

I need a formula that will SUM the count of, let's say, animal types as of a given AS OF DATE, WITHOUT adding in earlier counts for the same animal type: for each type, only the entry closest to (on or before) the AS OF DATE should count. Different animal types may be added or taken away, so the list is not fixed.
I would prefer not to do this in VBA or with a Pivot Table, but any help will be appreciated.
A B C
DATE ANIMAL TYPE COUNT
JAN 01 DOG 1
JAN 02 CAT 2
JAN 04 Fish 1
JAN 12 DOG 2
JAN 20 CAT 3
FEB 01 PIG 1
FEB 02 CAT 2
AS OF DATE TOTAL ANIMALS
JAN 03 3
JAN 13 5
JAN 21 6
FEB 01 7
FEB 02 6
So:
As of Jan 03, there were 3 animals total: 1 Dog and 2 Cats.
As of Jan 13, there were 5 animals total: 2 Dogs, 1 Fish and 2 Cats (NOT 6).
As of Jan 21, there were 6 animals total: 2 Dogs, 1 Fish and 3 Cats (NOT 9).
As of Feb 01, there were 7 animals total: 2 Dogs, 1 Fish, 1 Pig and 3 Cats (NOT 10).
So far this is what I have. By using a helper column to filter the Animal Types, I get a list without duplicates. Then I put that in a cell with Data Validation to pick the Type. Same for the Dates. However, I would like to drop the Type input and just choose the Date, and still be able to get a total.
Here is what works but not what I need.
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$G$2,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
I want to do away with the single-cell reference ($G$2) to a single Animal Type and replace it with a range, to get the latest count of animals for all Animal Types as of a certain date. Like this, but this does not work:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=(OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)),IF(TabData1[Date]<=$F$2,TabData1[Date]))))
To simplify (OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)) you can use $J$2:$J$5
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$J$2:$J$5,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
When evaluated, it looks like this:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF({"Dog";"Cat";"Fish";"Dog";"Cat";"Pig";"Cat";0;0;0;0;0;0;0;0;0}={"Cat";"Dog";"Fish";"Pig"},IF(TabData1[Date]<=$F$2,TabData1[Date]))))
Like I said, I want one formula that, for each Animal Type, finds the latest date on or before the date in a specified cell, takes the count for that date, and then sums those counts across all Animal Types.
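This is not the Excel formula being asked for, but the intended logic can be checked with a short pandas sketch: filter to rows on or before the AS OF DATE, keep the latest row per animal type, and sum those counts. The year 2024 is an arbitrary assumption (the question gives only month and day), and `total_as_of` is a hypothetical helper name.

```python
import pandas as pd

# Rows from the question; the year 2024 is assumed for illustration only.
log = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-12",
                            "2024-01-20", "2024-02-01", "2024-02-02"]),
    "animal": ["DOG", "CAT", "Fish", "DOG", "CAT", "PIG", "CAT"],
    "count": [1, 2, 1, 2, 3, 1, 2],
})

def total_as_of(log, as_of):
    """Sum the most recent count of each animal type on or before as_of."""
    seen = log[log["date"] <= pd.Timestamp(as_of)]
    # After sorting by date, .last() picks the latest entry for each type.
    latest = seen.sort_values("date").groupby("animal")["count"].last()
    return int(latest.sum())

for as_of in ["2024-01-03", "2024-01-13", "2024-01-21", "2024-02-01", "2024-02-02"]:
    print(as_of, total_as_of(log, as_of))
```

The totals produced this way match the AS OF DATE table in the question (3, 5, 6, 7, 6), which confirms that "latest entry per type, then sum" is the rule the formula has to implement.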

Merging and Adding Data in Excel Worksheets

I have 8 sheets of data (from Dec 2014 to July 2015, separated month wise). Each sheet contains monthly data (e.g. Dec 2014 sheet contains data of dec 2014 in three columns namely AC #, Name, Amount).
Dec 2014 Contains Data as Mentioned Below:
A/C # Name Dec 2014
A12 ABC 100
A13 CBA 200
A14 BCA 300
Whereas January 2015 contains data as below
A/C # Name Jan 2015
A12 ABC 5
A13 CBA 300
*A15 IJK 900*
All sheets contain mostly the same data, plus some additional rows for customers added in that month. E.g. January 2015 may contain an additional client A/C #, name and amount for January 2015, as marked above.
I want a consolidated sheet of data where all data is arranged as below:
A/C # Name Dec 2014 Jan 2015 Feb 2015 Mar 2015 Apr 2015
A12 ABC 100 5
A13 CBA 200 300
A14 BCA 300 0
A15 IJK 0 900
I would suggest connecting to the worksheets using ADODB. Then you can issue an SQL statement that will merge the records together.
This could be run from a VBScript, or from Excel.
For a similar strategy, see here.
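The ADODB/SQL code itself is not shown here. As an illustration of what the merged result should contain, the sketch below performs the same consolidation with a pandas outer merge instead of SQL; this is a swapped-in technique, not the answer's method. Only the two sample sheets from the question are included, and the frames are typed in by hand rather than read from a workbook.

```python
from functools import reduce

import pandas as pd

# The two sample sheets from the question, one DataFrame per month.
dec = pd.DataFrame({"A/C #": ["A12", "A13", "A14"],
                    "Name": ["ABC", "CBA", "BCA"],
                    "Dec 2014": [100, 200, 300]})
jan = pd.DataFrame({"A/C #": ["A12", "A13", "A15"],
                    "Name": ["ABC", "CBA", "IJK"],
                    "Jan 2015": [5, 300, 900]})
sheets = [dec, jan]  # further monthly frames would be appended here

# An outer merge keeps accounts that appear in any month.
consolidated = reduce(
    lambda left, right: left.merge(right, on=["A/C #", "Name"], how="outer"),
    sheets,
)

# Months in which an account has no entry become 0.
month_cols = ["Dec 2014", "Jan 2015"]
consolidated[month_cols] = consolidated[month_cols].fillna(0).astype(int)
consolidated = consolidated.sort_values("A/C #").reset_index(drop=True)
print(consolidated)
```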
