How to merge two DataFrames using matching indexes? - python-3.x
So I have two DataFrames: Historic and Applet.
Historic contains a list of all courses my school offered in the past, and Applet contains all courses that my school currently offers.
I want to merge the two DataFrames so that any items in my Applet DataFrame that don't exist in Historic are added, and any that do exist overwrite their copies in Historic (some courses may have updated information and should overwrite their historic entries with that information).
I'm currently using Historic.combine_first(Applet) to merge the two on their indexes. However, I want the duplicate entries to overwrite their Historic entries, not just create a duplicate entry.
Code:
def update2(self):
    historic = pd.read_csv('course_history.txt', header=None, sep='"', encoding='ISO-8859-1',
                           names=['Course_ID', 'Course_Title', 'Professor_Name', 'Meeting_Time',
                                  'Enrollment', 'Room', 'Year', 'Term', 'Credit'],
                           index_col=[0, 6, 7])
    winnet = pd.DataFrame(self.data, columns=['Course_ID', 'Course_Title', 'Professor_Name',
                                              'Meeting_Time', 'Enrollment', 'Room', 'Year',
                                              'Term', 'Credit'])
    winnet.set_index(['Course_ID', 'Year', 'Term'], inplace=True)
    historic3 = historic.combine_first(winnet)
Historic DataFrame:
Course_ID Year Term ...
AC 230 01 2020-21 May Accounting Systems Crouse, Justin D. ... ROOM NULL 1.00
AC 429 01 2020-21 May CPA Review Sommermeyer, Eric ... ROOM NULL 1.00
ART 150 01 2020-21 May 20th-Century Art, Media, & Design Fedeler, Barbara J. ... ROOM NULL 1.00
ART 208 01 2020-21 May Photography I Payne, Thomas R. ... ROOM NULL 1.00
PSY 222 01 2018-19 FA Cognitive Psychology Eslick Watkins, A ... ROOM NULL 1.00
Applet DataFrame:
Course_ID Year Term
PSY 101 01 2018-19 FA Introduction to Psychology Bane, C T H 9:35AM-11:15AM 40/44/0 LH 330 1.00
PSY 101 02 2018-19 FA Introduction to Psychology Eslick Watkins, A T H 1:00PM-2:40PM 40/43/0 SC 134 1.00
PSY 210 10 2018-19 FA Child Development Munir, S T H 9:35AM-11:15AM 30/10/0 LH 327 0.50
PSY 211 20 2018-19 FA Adolescent Development Munir, S T H 1:00PM-2:40PM 30/6/0 LH 330 0.50
PSY 222 01 2018-19 FA Cognitive Psychology Eslick Watkins, A T H 9:35AM-11:15AM 30/24/0 LH 324 1.00
You can use concat then drop_duplicates. Put Applet first and keep the first occurrence, so the Applet rows win over their Historic copies (note the keyword is keep, not method):
cols = [columns_to_judge_duplicates]
combined = pd.concat([Applet, Historic])
combined = combined.drop_duplicates(subset=cols, keep='first')
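Since both frames in the question are indexed by (Course_ID, Year, Term), the duplicates here are really index duplicates, so an index-based variant of the same idea may be closer to what's needed. A minimal sketch with toy data (the course rows and column choice are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for Historic and Applet, indexed by (Course_ID, Year, Term)
# as in the question. Only two columns are shown for brevity.
idx = pd.MultiIndex.from_tuples(
    [("PSY 222 01", "2018-19", "FA"), ("AC 429 01", "2020-21", "May")],
    names=["Course_ID", "Year", "Term"],
)
historic = pd.DataFrame(
    {"Course_Title": ["Cognitive Psychology", "CPA Review"],
     "Room": ["ROOM NULL", "ROOM NULL"]},
    index=idx,
)
applet = pd.DataFrame(
    {"Course_Title": ["Cognitive Psychology"], "Room": ["LH 324"]},
    index=idx[:1],  # the one course that currently runs, with updated info
)

# Applet goes first so its (updated) rows survive; later index duplicates
# coming from Historic are dropped.
combined = pd.concat([applet, historic])
combined = combined[~combined.index.duplicated(keep="first")]
```

Equivalently, `applet.combine_first(historic)` — with Applet as the caller rather than Historic — makes the Applet values win wherever both frames have a row for the same index.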
Related
Creating multiple named dataframes by a for loop
I have a database that contains 60,000+ rows of college football recruit data. From there, I want to create seperate dataframes where each one contains just one value. This is what a sample of the dataframe looks like: ,Primary Rank,Other Rank,Name,Link,Highschool,Position,Height,weight,Rating,National Rank,Position Rank,State Rank,Team,Class 0,1,,D.J. Williams,https://247sports.com/Player/DJ-Williams-49931,"De La Salle (Concord, CA)",ILB,6-2,235,0.9998,1,1,1,Miami,2000 1,2,,Brock Berlin,https://247sports.com/Player/Brock-Berlin-49926,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,0.9998,2,1,1,Florida,2000 2,3,,Charles Rogers,https://247sports.com/Player/Charles-Rogers-49984,"Saginaw (Saginaw, MI)",WR,6-4,195,0.9988,3,1,1,Michigan State,2000 3,4,,Travis Johnson,https://247sports.com/Player/Travis-Johnson-50043,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,0.9982,4,1,2,Florida State,2000 4,5,,Marcus Houston,https://247sports.com/Player/Marcus-Houston-50139,"Thomas Jefferson (Denver, CO)",RB,6-0,208,0.9980,5,1,1,Colorado,2000 5,6,,Kwame Harris,https://247sports.com/Player/Kwame-Harris-49999,"Newark (Newark, DE)",OT,6-7,320,0.9978,6,1,1,Stanford,2000 6,7,,B.J. 
Johnson,https://247sports.com/Player/BJ-Johnson-50154,"South Grand Prairie (Grand Prairie, TX)",WR,6-1,190,0.9976,7,2,1,Texas,2000 7,8,,Bryant McFadden,https://247sports.com/Player/Bryant-McFadden-50094,"McArthur (Hollywood, FL)",CB,6-1,182,0.9968,8,1,1,Florida State,2000 8,9,,Sam Maldonado,https://247sports.com/Player/Sam-Maldonado-50071,"Harrison (Harrison, NY)",RB,6-2,215,0.9964,9,2,1,Ohio State,2000 9,10,,Mike Munoz,https://247sports.com/Player/Mike-Munoz-50150,"Archbishop Moeller (Cincinnati, OH)",OT,6-7,290,0.9960,10,2,1,Tennessee,2000 10,11,,Willis McGahee,https://247sports.com/Player/Willis-McGahee-50179,"Miami Central (Miami, FL)",RB,6-1,215,0.9948,11,3,2,Miami,2000 11,12,,Antonio Hall,https://247sports.com/Player/Antonio-Hall-50175,"McKinley (Canton, OH)",OT,6-5,295,0.9946,12,3,2,Kentucky,2000 12,13,,Darrell Lee,https://247sports.com/Player/Darrell-Lee-50580,"Kirkwood (Saint Louis, MO)",WDE,6-5,230,0.9940,13,1,1,Florida,2000 13,14,,O.J. Owens,https://247sports.com/Player/OJ-Owens-50176,"North Stanly (New London, NC)",S,6-1,195,0.9932,14,1,1,Tennessee,2000 14,15,,Jeff Smoker,https://247sports.com/Player/Jeff-Smoker-50582,"Manheim Central (Manheim, PA)",PRO,6-3,190,0.9922,15,2,1,Michigan State,2000 15,16,,Marco Cooper,https://247sports.com/Player/Marco-Cooper-50171,"Cass Technical (Detroit, MI)",OLB,6-2,235,0.9918,16,1,2,Ohio State,2000 16,17,,Chance Mock,https://247sports.com/Player/Chance-Mock-50163,"The Woodlands (The Woodlands, TX)",PRO,6-2,190,0.9918,17,3,2,Texas,2000 17,18,,Roy Williams,https://247sports.com/Player/Roy-Williams-55566,"Permian (Odessa, TX)",WR,6-4,202,0.9916,18,3,3,Texas,2000 18,19,,Matt Grootegoed,https://247sports.com/Player/Matt-Grootegoed-50591,"Mater Dei (Santa Ana, CA)",OLB,5-11,205,0.9914,19,2,3,USC,2000 19,20,,Yohance Buchanan,https://247sports.com/Player/Yohance-Buchanan-50182,"Douglass (Atlanta, GA)",S,6-1,210,0.9912,20,2,1,Florida State,2000 20,21,,Mac Tyler,https://247sports.com/Player/Mac-Tyler-50572,"Jess Lanier 
(Hueytown, AL)",DT,6-6,320,0.9912,21,1,1,Alabama,2000 21,22,,Jason Respert,https://247sports.com/Player/Jason-Respert-55623,"Northside (Warner Robins, GA)",OC,6-3,300,0.9902,22,1,2,Tennessee,2000 22,23,,Casey Clausen,https://247sports.com/Player/Casey-Clausen-50183,"Bishop Alemany (Mission Hills, CA)",PRO,6-4,215,0.9896,23,4,4,Tennessee,2000 23,24,,Albert Means,https://247sports.com/Player/Albert-Means-55968,"Trezevant (Memphis, TN)",SDE,6-6,310,0.9890,24,2,1,Alabama,2000 24,25,,Albert Hollis,https://247sports.com/Player/Albert-Hollis-55958,"Christian Brothers (Sacramento, CA)",RB,6-0,190,0.9890,25,4,5,Georgia,2000 25,26,,Eric Moore,https://247sports.com/Player/Eric-Moore-55973,"Pahokee (Pahokee, FL)",OLB,6-4,226,0.9884,26,3,3,Florida State,2000 26,27,,Willie Dixon,https://247sports.com/Player/Willie-Dixon-55626,"Stockton Christian School (Stockton, CA)",WR,5-11,182,0.9884,27,4,6,Miami,2000 27,28,,Cory Bailey,https://247sports.com/Player/Cory-Bailey-50586,"American (Hialeah, FL)",S,5-10,175,0.9880,28,3,4,Florida,2000 28,29,,Sean Young,https://247sports.com/Player/Sean-Young-55972,"Northwest Whitfield County (Tunnel Hill, GA)",OG,6-6,293,0.9878,29,1,3,Tennessee,2000 29,30,,Johnnie Morant,https://247sports.com/Player/Johnnie-Morant-60412,"Parsippany Hills (Morris Plains, NJ)",WR,6-5,225,0.9871,30,5,1,Syracuse,2000 30,31,,Wes Sims,https://247sports.com/Player/Wes-Sims-60243,"Weatherford (Weatherford, OK)",OG,6-5,310,0.9869,31,2,1,Oklahoma,2000 31,33,,Jason Campbell,https://247sports.com/Player/Jason-Campbell-55976,"Taylorsville (Taylorsville, MS)",PRO,6-5,190,0.9853,33,5,1,Auburn,2000 32,34,,Antwan Odom,https://247sports.com/Player/Antwan-Odom-50168,"Alma Bryant (Irvington, AL)",SDE,6-7,260,0.9851,34,3,2,Alabama,2000 33,35,,Sloan Thomas,https://247sports.com/Player/Sloan-Thomas-55630,"Klein (Spring, TX)",WR,6-2,188,0.9847,35,6,5,Texas,2000 34,36,,Raymond Mann,https://247sports.com/Player/Raymond-Mann-60804,"Hampton (Hampton, 
VA)",ILB,6-1,233,0.9847,36,2,1,Virginia,2000 35,37,,Alphonso Townsend,https://247sports.com/Player/Alphonso-Townsend-55975,"Lima Central Catholic (Lima, OH)",DT,6-6,280,0.9847,37,2,3,Ohio State,2000 36,38,,Greg Jones,https://247sports.com/Player/Greg-Jones-50158,"Battery Creek (Beaufort, SC)",RB,6-2,245,0.9837,38,6,1,Florida State,2000 37,39,,Paul Mociler,https://247sports.com/Player/Paul-Mociler-60319,"St. John Bosco (Bellflower, CA)",OG,6-5,300,0.9833,39,3,7,UCLA,2000 38,40,,Chris Septak,https://247sports.com/Player/Chris-Septak-57555,"Millard West (Omaha, NE)",TE,6-3,245,0.9833,40,1,1,Nebraska,2000 39,41,,Eric Knott,https://247sports.com/Player/Eric-Knott-60823,"Henry Ford II (Sterling Heights, MI)",TE,6-4,235,0.9831,41,2,3,Michigan State,2000 40,42,,Harold James,https://247sports.com/Player/Harold-James-57524,"Osceola (Osceola, AR)",S,6-1,220,0.9827,42,4,1,Alabama,2000 For example, if I don't use a for loop, this line of code is what I use if I just want to create one dataframe: recruits2022 = recruits_final[recruits_final['Class'] == 2022] However, I want to have a named dataframe for each recruiting class. In other words, recruits2000 would be a dataframe for all rows that have a class value equal to 2000, recruits2001 would be a dataframe for all rows that have a class value to 2001, and so forth. This is what I tried recently, but have no luck saving the dataframe outside of the for loop. 
databases = ['recruits2000', 'recruits2001', 'recruits2002', 'recruits2003',
             'recruits2004', 'recruits2005', 'recruits2006', 'recruits2007',
             'recruits2008', 'recruits2009', 'recruits2010', 'recruits2011',
             'recruits2012', 'recruits2013', 'recruits2014', 'recruits2015',
             'recruits2016', 'recruits2017', 'recruits2018', 'recruits2019',
             'recruits2020', 'recruits2021', 'recruits2022', 'recruits2023']

for i in range(len(databases)):
    year = pd.to_numeric(databases[i][-4:], errors='coerce')
    db = recruits_final[recruits_final['Class'] == year]
    db.name = databases[i]
    print(db)
    print(db.name)
    print(year)

recruits2023

I would get this error instead of what I wanted:

NameError                                 Traceback (most recent call last)
<ipython-input-49-7cb5d12ab92f> in <module>()
     29
     30 # print(db.name)
---> 31 recruits2023
     32
     33

NameError: name 'recruits2023' is not defined

Is there something that I am missing to get this for loop to work? Any assistance is truly appreciated. Thanks in advance.
Use a dictionary of dataframes built with groupby:

dict_dfs = dict(tuple(df.groupby('Class')))

Access your individual dataframes using dict_dfs[2022].
You overwrite the variable db at each iteration, and recruits2023 is never defined as a variable, so you can't use it like that. You can use a dict to store your data:

recruits = {}
for year in recruits_final['Class'].unique():
    recruits[year] = recruits_final[recruits_final['Class'] == year]

>>> recruits[2000]
    Primary Rank  Other Rank            Name                                               Link  ...  Position Rank  State Rank            Team  Class
0              1         NaN   D.J. Williams     https://247sports.com/Player/DJ-Williams-49931  ...              1           1           Miami   2000
1              2         NaN    Brock Berlin    https://247sports.com/Player/Brock-Berlin-49926  ...              1           1         Florida   2000
2              3         NaN  Charles Rogers  https://247sports.com/Player/Charles-Rogers-49984  ...              1           1  Michigan State   2000
3              4         NaN  Travis Johnson  https://247sports.com/Player/Travis-Johnson-50043  ...              1           2   Florida State   2000
...
38            40         NaN    Chris Septak    https://247sports.com/Player/Chris-Septak-57555  ...              1           1        Nebraska   2000
39            41         NaN      Eric Knott      https://247sports.com/Player/Eric-Knott-60823  ...              2           3  Michigan State   2000
40            42         NaN    Harold James    https://247sports.com/Player/Harold-James-57524  ...              4           1         Alabama   2000

>>> recruits.keys()
dict_keys([2000])
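The groupby-to-dict approach from the answers can be sketched end to end with a tiny stand-in for recruits_final (the third player name is a placeholder, not real data):

```python
import pandas as pd

# Tiny stand-in for recruits_final; only the columns needed here.
recruits_final = pd.DataFrame({
    "Name": ["D.J. Williams", "Brock Berlin", "Placeholder Player"],
    "Class": [2000, 2000, 2001],
})

# One DataFrame per recruiting class, keyed by the class year.
dict_dfs = dict(tuple(recruits_final.groupby("Class")))

recruits_2000 = dict_dfs[2000]  # all rows with Class == 2000
```

This scales to any number of classes without maintaining a hand-written list of variable names, which is what made the original loop fail.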
Batch tracking in power query
I have a CSV that contains some production data. When loaded into Excel's Power Query it has a structure similar to this (material batches may contain remainders of old material batches as recycling material):

Mat_Batch  Date        Recyc_Batch  RawMaterial1  RawMaterial2  RawMaterial3  Amount1  Amount2  Amount3
123        01.11.2019               Fe            Cr            Ni            70       19       11
234        01.12.2019               Fe            Cr            Ni            71       18       11
345        01.02.2020  123          Fe            Cr            Ni            72       17       9
456        01.01.2020  234          Fe            Cr            Ni            70       19       11
567        01.02.2020               Fe            Cr            Ni            72       16       10
678        01.01.2020  456          Fe            Cr            Ni            70       19       11

Another CSV has the following content (it simply links a production batch to a material batch; production batches may contain more than one material batch):

Batch  Mat_Batch
abc    456
abc    567
bcd    345

Now I would like to use Power Query M to evaluate exactly which material batches were used to produce a part batch. E.g. batch "abc" was made from 456 + 567 + 234 (as recycling material in 456). As a first step, I filter the production batch table by a specific batch and join both tables via the resulting Mat_Batch column. As a second iteration I separate the Recyc_Batch column from the matched material batches and do a second join with a copy of my material batch table to gain all additional recycling materials that were used. But how could I do so an "infinite" number of times? The way I'm doing it, I have to create additional queries for each iteration, but I need a way to automatically repeat those joining steps until there is no more additional recycling material used.
Here is a query (Result) you can use (if I understood correctly):

let
    Quelle = Table.NestedJoin(tbl_Material, {"Mat_Batch"}, tbl_Production, {"Mat_Batch"}, "tbl_Production", JoinKind.LeftOuter),
    Combine_Sources = Table.ExpandTableColumn(Quelle, "tbl_Production", {"Batch"}, {"Batch"}),
    DeleteOtherColumns = Table.SelectColumns(Combine_Sources, {"Batch", "Mat_Batch", "Recyc_Batch"}),
    UnpivotOtherColumns = Table.UnpivotOtherColumns(DeleteOtherColumns, {"Batch"}, "Attribut", "Wert"),
    FilterRows = Table.SelectRows(UnpivotOtherColumns, each ([Batch] <> null)),
    SortRows = Table.Sort(FilterRows, {{"Batch", Order.Ascending}})
in
    SortRows

Best regards, Chris
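The "repeat until nothing new appears" part the question asks for is a fixed-point (transitive-closure) loop; Power Query can express it with a recursive function or List.Generate, but the underlying idea is easier to see in a short Python sketch built from the sample tables above (batch numbers taken from the question; the dict shapes are an assumption for illustration):

```python
# Map each material batch to the recycling batch it contains, taken from
# the Mat_Batch / Recyc_Batch columns in the question's first CSV.
recyc_of = {345: 123, 456: 234, 678: 456}

# Production batch -> material batches, from the second CSV.
prod = {"abc": [456, 567], "bcd": [345]}

def expand(mat_batches):
    """Follow Recyc_Batch links until no new material batches appear."""
    seen = set(mat_batches)
    frontier = list(mat_batches)
    while frontier:  # stop once an iteration adds nothing new
        nxt = [recyc_of[b] for b in frontier
               if b in recyc_of and recyc_of[b] not in seen]
        seen.update(nxt)
        frontier = nxt
    return seen

expand(prod["abc"])  # {456, 567, 234} — matches the question's example
```

The loop terminates because each pass only adds batches not yet seen; the same stopping condition is what a List.Generate-based M query would check.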
how can I use multiple operation in awk to edit text file
I have a text file like this small example:

chr10:103909786-103910082   147    148    24   BA
chr10:103909786-103910082   149    150    11   BA
chr10:103909786-103910082   150    151     2   BA
chr10:103909786-103910082   152    153     1   BA
chr10:103909786-103910082   274    275     5   CA
chr10:103909786-103910082   288    289    15   CA
chr10:103909786-103910082   294    295     4   CA
chr10:103909786-103910082   295    296    15   CA
chr10:104573088-104576021  2925   2926   134   CA
chr10:104573088-104576021  2926   2927    10   CA
chr10:104573088-104576021  2932   2933     2   CA
chr10:104573088-104576021    58     59     1   BA
chr10:104573088-104576021   689    690    12   BA
chr10:104573088-104576021   819    820    33   BA

In this file there are 5 tab-separated columns. The first column is the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.

1- In the 1st step I would like to filter out rows based on the 4th column: if the number in the 4th column is less than 10 and the 5th column of the same row is BA, that row is filtered out; likewise, if the number in the 4th column is less than 5 and the 5th column is CA, that row is filtered out.

2- In the 2nd step I want to sum up, per ID, the values in the 4th column separately for the BA rows and for the CA rows. The 1st column contains repeated values which represent the same ID, and each ID has both BA and CA in the 5th column.

3- The last step is to get the ratio CA/BA per ID, so in the output every ID appears only once.
The expected output for the small example would look like this:

1- After filtration:

chr10:103909786-103910082   147    148    24   BA
chr10:103909786-103910082   149    150    11   BA
chr10:103909786-103910082   274    275     5   CA
chr10:103909786-103910082   288    289    15   CA
chr10:103909786-103910082   295    296    15   CA
chr10:104573088-104576021  2925   2926   134   CA
chr10:104573088-104576021  2926   2927    10   CA
chr10:104573088-104576021   689    690    12   BA
chr10:104573088-104576021   819    820    33   BA

2- After summarizing each group (CA and BA):

chr10:103909786-103910082   147    148    35   BA
chr10:103909786-103910082   274    275    35   CA
chr10:104573088-104576021  2925   2926   144   CA
chr10:104573088-104576021   819    820    45   BA

3- The final output (this ratio is made using the values in the 4th column):

chr10:103909786-103910082   1
chr10:104573088-104576021   3.2

In the above lines, 1 = 35/35 and 3.2 = 144/45. I am trying to do that in awk:

awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt

I tried to follow the steps mentioned in the code but did not succeed. Do you know how to solve the problem?
If the records with the same ID are always consecutive, you can do that:

awk 'ID!=$1 {
         if (ID) { print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0 }
         ID=$1
     }
     $5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
     END { print ID, a["CA"]/a["BA"] }' file.txt

The first block tests whether the ID has changed; in that case, it prints the previous ID with its ratio and resets the sums. The second block keeps only the wanted records and accumulates the per-group sums. The END block prints the result for the last ID.
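To sanity-check the thresholds and the expected ratios, here is the same filter/sum/ratio logic in plain Python (the rows are the question's sample data):

```python
rows = [
    ("chr10:103909786-103910082", 147, 148, 24, "BA"),
    ("chr10:103909786-103910082", 149, 150, 11, "BA"),
    ("chr10:103909786-103910082", 150, 151, 2, "BA"),
    ("chr10:103909786-103910082", 152, 153, 1, "BA"),
    ("chr10:103909786-103910082", 274, 275, 5, "CA"),
    ("chr10:103909786-103910082", 288, 289, 15, "CA"),
    ("chr10:103909786-103910082", 294, 295, 4, "CA"),
    ("chr10:103909786-103910082", 295, 296, 15, "CA"),
    ("chr10:104573088-104576021", 2925, 2926, 134, "CA"),
    ("chr10:104573088-104576021", 2926, 2927, 10, "CA"),
    ("chr10:104573088-104576021", 2932, 2933, 2, "CA"),
    ("chr10:104573088-104576021", 58, 59, 1, "BA"),
    ("chr10:104573088-104576021", 689, 690, 12, "BA"),
    ("chr10:104573088-104576021", 819, 820, 33, "BA"),
]

totals = {}  # ID -> {"BA": sum, "CA": sum}
for rec_id, _, _, value, group in rows:
    # Same thresholds as the awk answer: keep BA >= 10 and CA >= 5.
    if (group == "BA" and value >= 10) or (group == "CA" and value >= 5):
        totals.setdefault(rec_id, {"BA": 0, "CA": 0})[group] += value

ratios = {rec_id: t["CA"] / t["BA"] for rec_id, t in totals.items()}
# ratios: 35/35 = 1.0 for the first ID, 144/45 = 3.2 for the second
```

This reproduces the expected output above, which confirms the awk one-liner's filter conditions and grouping.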
SAS Data organization
Dataset Sample I have a data set like the attached picture, where I want only the observations that have the same numsecur every year. How do I do this with SAS proc sql? Would this be easier to do in Stata? If so, what procedure can I use?
You look like a new user to Stack Overflow. Welcome. Your question is getting down-voted for at least three reasons:

1) It's not really clear what you want from your description of the problem and the data you're providing.
2) You haven't shown any attempts at what you've tried.
3) Providing your data as a picture is not great. If you're going to provide data, it's most helpful to provide it so it's easy for others to consume in their program. After all, you're asking for our help; make it easier for us to help you. If you included something like the following, we would just have to copy and paste to create your dataset to work with:

DATA test;
  INPUT ID YEAR EXEC SUM;
  DATALINES;
1573 1997 50 1080
1581 1997 51 300
1598 1996 54 80
1598 1998 54 80
1598 1999 54 80
1602 1996 55 112.6
1602 1997 55 335.965
;
RUN;

That being said, the following MAY give you what you're looking for, but it's only a guess, as I'm not sure this is really what you're asking:

proc sql noprint;
  create table testout as
  select *, count(*) as cnt
  from test
  group by sum
  having cnt > 1;
quit;

Are you asking to show all rows where the same SUM is used, or something else?
Assuming I understand your question correctly, you would like to keep the observations from the same company/individual only if the company has the same numsecur every year. So here is what I would try using Stata:

input ID YEAR EXEC SUM
1573 1997 50 1080 //
1581 1997 51 300 //
1598 1996 54 80 //
1598 1998 54 80 //
1598 1999 54 80 //
1602 1996 55 112.6 //
1602 1997 55 335.965 //
1575 1997 50 1080 //
1575 1998 51 1080 //
1595 1996 54 80 //
1595 1998 54 30 //
1595 1999 54 80 //
1605 1996 55 112.6 //
1605 1997 55 335.965 //
end

bysort ID SUM: gen drop = cond(_N==1, 0, _n)
drop if drop==0

The results show (based on my data):

        ID   YEAR   EXEC    SUM   drop
 1.   1575   1997     50   1080      1
 2.   1575   1998     51   1080      2
 3.   1595   1999     54     80      1
 4.   1595   1996     54     80      2
 5.   1598   1996     54     80      1
 6.   1598   1998     54     80      2
 7.   1598   1999     54     80      3
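The same "keep only IDs whose value never changes across years" idea can be sketched in pandas with the answer's sample rows (assuming SUM plays the role of numsecur here):

```python
import pandas as pd

# Sample rows reconstructed from the SAS answer's DATALINES block.
df = pd.DataFrame({
    "ID":   [1573, 1581, 1598, 1598, 1598, 1602, 1602],
    "YEAR": [1997, 1997, 1996, 1998, 1999, 1996, 1997],
    "SUM":  [1080, 300, 80, 80, 80, 112.6, 335.965],
})

# Keep IDs observed in more than one year whose SUM never changes.
same_sum = df.groupby("ID").filter(
    lambda g: len(g) > 1 and g["SUM"].nunique() == 1
)
# same_sum holds only the three rows for ID 1598 (SUM is 80 every year)
```

groupby().filter() keeps or drops whole groups at once, which matches the "per company, all years or nothing" requirement more directly than row-level conditions.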
Creating a Dynamic Range with a Macro
YearMth   Region   Employee   Item      Units   Unit Cost   Total
--------------------------------------------------------------------
2006-12   DC       Jones      Pen Set     700        1.99   1,393
2006-12   NY       Peterson   Binder       85       19.99   1,699
2006-12   DC       Howard     Pen Set      62        4.99     309
2006-12   DC       Gill       Pen          58       19.99   1,159
2006-12   NY       Anderson   Binder       10        4.99      50
2006-12   NY       Anderson   Pen Set      19        2.99      57

Using this data, how would I create a dynamic range using a macro? Thanks
Depends what you want to do! But here's an example:

ActiveWorkbook.Names.Add Name:="MyDynamicRange", RefersToR1C1:= _
    "=OFFSET(Sheet1!R2C2,0,0,COUNTA(Sheet1!R2C2:R200C2),1)"

This range will contain all your Regions.