How to Merge two Dataframe using matching indexes? - python-3.x

So I have two DataFrames: Historic and Applet.
Historic contains a list of all courses my school offered in the past and Applet is all courses that my school currently offers
I want to merge the two DataFrames so that any items in my Applet DataFrame that don't exist in Historic are added and any that do exist overwrite their copies in Historic (Some courses may have updated information and should overwrite their historic entries with that information..)
I'm currently using Historic.combine_first(Applet) to merge the two by on their Indexes. However, I want the duplicate entries to overwrite their Historic entries not just make a duplicate entry.
Code:
def update2(self):
historic = pd.read_csv('course_history.txt', header=None, sep='"', encoding = 'ISO-8859-1',
names=['Course_ID', 'Course_Title', 'Professor_Name','Meeting_Time','Enrollment','Room','Year','Term','Credit'],index_col=[0,6,7])
winnet = pd.DataFrame(self.data, columns =['Course_ID', 'Course_Title', 'Professor_Name','Meeting_Time','Enrollment','Room','Year','Term','Credit'] )
winnet.set_index(['Course_ID','Year','Term'], inplace=True)
historic3 = historic.combine_first(winnet)
Historic DataFrame:
Course_ID Year Term ...
AC 230 01 2020-21 May Accounting Systems Crouse, Justin D. ... ROOM NULL 1.00
AC 429 01 2020-21 May CPA Review Sommermeyer, Eric ... ROOM NULL 1.00
ART 150 01 2020-21 May 20th-Century Art, Media, & Design Fedeler, Barbara J. ... ROOM NULL 1.00
ART 208 01 2020-21 May Photography I Payne, Thomas R. ... ROOM NULL 1.00
PSY 222 01 2018-19 FA Cognitive Psychology Eslick Watkins, A ... ROOM NULL 1.00
Applet DataFrame:
Course_ID Year Term
PSY 101 01 2018-19 FA Introduction to Psychology Bane, C T H 9:35AM-11:15AM 40/44/0 LH 330 1.00
PSY 101 02 2018-19 FA Introduction to Psychology Eslick Watkins, A T H 1:00PM-2:40PM 40/43/0 SC 134 1.00
PSY 210 10 2018-19 FA Child Development Munir, S T H 9:35AM-11:15AM 30/10/0 LH 327 0.50
PSY 211 20 2018-19 FA Adolescent Development Munir, S T H 1:00PM-2:40PM 30/6/0 LH 330 0.50
PSY 222 01 2018-19 FA Cognitive Psychology Eslick Watkins, A T H 9:35AM-11:15AM 30/24/0 LH 324 1.00

You can use concat then drop_duplicates
cols = [columns_to_judge_duplicates]
combined = pd.concat([Applet, Historic])
combined = combined.drop_duplicates(subset=cols, method='first')

Related

Creating multiple named dataframes by a for loop

I have a database that contains 60,000+ rows of college football recruit data. From there, I want to create seperate dataframes where each one contains just one value. This is what a sample of the dataframe looks like:
,Primary Rank,Other Rank,Name,Link,Highschool,Position,Height,weight,Rating,National Rank,Position Rank,State Rank,Team,Class
0,1,,D.J. Williams,https://247sports.com/Player/DJ-Williams-49931,"De La Salle (Concord, CA)",ILB,6-2,235,0.9998,1,1,1,Miami,2000
1,2,,Brock Berlin,https://247sports.com/Player/Brock-Berlin-49926,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,0.9998,2,1,1,Florida,2000
2,3,,Charles Rogers,https://247sports.com/Player/Charles-Rogers-49984,"Saginaw (Saginaw, MI)",WR,6-4,195,0.9988,3,1,1,Michigan State,2000
3,4,,Travis Johnson,https://247sports.com/Player/Travis-Johnson-50043,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,0.9982,4,1,2,Florida State,2000
4,5,,Marcus Houston,https://247sports.com/Player/Marcus-Houston-50139,"Thomas Jefferson (Denver, CO)",RB,6-0,208,0.9980,5,1,1,Colorado,2000
5,6,,Kwame Harris,https://247sports.com/Player/Kwame-Harris-49999,"Newark (Newark, DE)",OT,6-7,320,0.9978,6,1,1,Stanford,2000
6,7,,B.J. Johnson,https://247sports.com/Player/BJ-Johnson-50154,"South Grand Prairie (Grand Prairie, TX)",WR,6-1,190,0.9976,7,2,1,Texas,2000
7,8,,Bryant McFadden,https://247sports.com/Player/Bryant-McFadden-50094,"McArthur (Hollywood, FL)",CB,6-1,182,0.9968,8,1,1,Florida State,2000
8,9,,Sam Maldonado,https://247sports.com/Player/Sam-Maldonado-50071,"Harrison (Harrison, NY)",RB,6-2,215,0.9964,9,2,1,Ohio State,2000
9,10,,Mike Munoz,https://247sports.com/Player/Mike-Munoz-50150,"Archbishop Moeller (Cincinnati, OH)",OT,6-7,290,0.9960,10,2,1,Tennessee,2000
10,11,,Willis McGahee,https://247sports.com/Player/Willis-McGahee-50179,"Miami Central (Miami, FL)",RB,6-1,215,0.9948,11,3,2,Miami,2000
11,12,,Antonio Hall,https://247sports.com/Player/Antonio-Hall-50175,"McKinley (Canton, OH)",OT,6-5,295,0.9946,12,3,2,Kentucky,2000
12,13,,Darrell Lee,https://247sports.com/Player/Darrell-Lee-50580,"Kirkwood (Saint Louis, MO)",WDE,6-5,230,0.9940,13,1,1,Florida,2000
13,14,,O.J. Owens,https://247sports.com/Player/OJ-Owens-50176,"North Stanly (New London, NC)",S,6-1,195,0.9932,14,1,1,Tennessee,2000
14,15,,Jeff Smoker,https://247sports.com/Player/Jeff-Smoker-50582,"Manheim Central (Manheim, PA)",PRO,6-3,190,0.9922,15,2,1,Michigan State,2000
15,16,,Marco Cooper,https://247sports.com/Player/Marco-Cooper-50171,"Cass Technical (Detroit, MI)",OLB,6-2,235,0.9918,16,1,2,Ohio State,2000
16,17,,Chance Mock,https://247sports.com/Player/Chance-Mock-50163,"The Woodlands (The Woodlands, TX)",PRO,6-2,190,0.9918,17,3,2,Texas,2000
17,18,,Roy Williams,https://247sports.com/Player/Roy-Williams-55566,"Permian (Odessa, TX)",WR,6-4,202,0.9916,18,3,3,Texas,2000
18,19,,Matt Grootegoed,https://247sports.com/Player/Matt-Grootegoed-50591,"Mater Dei (Santa Ana, CA)",OLB,5-11,205,0.9914,19,2,3,USC,2000
19,20,,Yohance Buchanan,https://247sports.com/Player/Yohance-Buchanan-50182,"Douglass (Atlanta, GA)",S,6-1,210,0.9912,20,2,1,Florida State,2000
20,21,,Mac Tyler,https://247sports.com/Player/Mac-Tyler-50572,"Jess Lanier (Hueytown, AL)",DT,6-6,320,0.9912,21,1,1,Alabama,2000
21,22,,Jason Respert,https://247sports.com/Player/Jason-Respert-55623,"Northside (Warner Robins, GA)",OC,6-3,300,0.9902,22,1,2,Tennessee,2000
22,23,,Casey Clausen,https://247sports.com/Player/Casey-Clausen-50183,"Bishop Alemany (Mission Hills, CA)",PRO,6-4,215,0.9896,23,4,4,Tennessee,2000
23,24,,Albert Means,https://247sports.com/Player/Albert-Means-55968,"Trezevant (Memphis, TN)",SDE,6-6,310,0.9890,24,2,1,Alabama,2000
24,25,,Albert Hollis,https://247sports.com/Player/Albert-Hollis-55958,"Christian Brothers (Sacramento, CA)",RB,6-0,190,0.9890,25,4,5,Georgia,2000
25,26,,Eric Moore,https://247sports.com/Player/Eric-Moore-55973,"Pahokee (Pahokee, FL)",OLB,6-4,226,0.9884,26,3,3,Florida State,2000
26,27,,Willie Dixon,https://247sports.com/Player/Willie-Dixon-55626,"Stockton Christian School (Stockton, CA)",WR,5-11,182,0.9884,27,4,6,Miami,2000
27,28,,Cory Bailey,https://247sports.com/Player/Cory-Bailey-50586,"American (Hialeah, FL)",S,5-10,175,0.9880,28,3,4,Florida,2000
28,29,,Sean Young,https://247sports.com/Player/Sean-Young-55972,"Northwest Whitfield County (Tunnel Hill, GA)",OG,6-6,293,0.9878,29,1,3,Tennessee,2000
29,30,,Johnnie Morant,https://247sports.com/Player/Johnnie-Morant-60412,"Parsippany Hills (Morris Plains, NJ)",WR,6-5,225,0.9871,30,5,1,Syracuse,2000
30,31,,Wes Sims,https://247sports.com/Player/Wes-Sims-60243,"Weatherford (Weatherford, OK)",OG,6-5,310,0.9869,31,2,1,Oklahoma,2000
31,33,,Jason Campbell,https://247sports.com/Player/Jason-Campbell-55976,"Taylorsville (Taylorsville, MS)",PRO,6-5,190,0.9853,33,5,1,Auburn,2000
32,34,,Antwan Odom,https://247sports.com/Player/Antwan-Odom-50168,"Alma Bryant (Irvington, AL)",SDE,6-7,260,0.9851,34,3,2,Alabama,2000
33,35,,Sloan Thomas,https://247sports.com/Player/Sloan-Thomas-55630,"Klein (Spring, TX)",WR,6-2,188,0.9847,35,6,5,Texas,2000
34,36,,Raymond Mann,https://247sports.com/Player/Raymond-Mann-60804,"Hampton (Hampton, VA)",ILB,6-1,233,0.9847,36,2,1,Virginia,2000
35,37,,Alphonso Townsend,https://247sports.com/Player/Alphonso-Townsend-55975,"Lima Central Catholic (Lima, OH)",DT,6-6,280,0.9847,37,2,3,Ohio State,2000
36,38,,Greg Jones,https://247sports.com/Player/Greg-Jones-50158,"Battery Creek (Beaufort, SC)",RB,6-2,245,0.9837,38,6,1,Florida State,2000
37,39,,Paul Mociler,https://247sports.com/Player/Paul-Mociler-60319,"St. John Bosco (Bellflower, CA)",OG,6-5,300,0.9833,39,3,7,UCLA,2000
38,40,,Chris Septak,https://247sports.com/Player/Chris-Septak-57555,"Millard West (Omaha, NE)",TE,6-3,245,0.9833,40,1,1,Nebraska,2000
39,41,,Eric Knott,https://247sports.com/Player/Eric-Knott-60823,"Henry Ford II (Sterling Heights, MI)",TE,6-4,235,0.9831,41,2,3,Michigan State,2000
40,42,,Harold James,https://247sports.com/Player/Harold-James-57524,"Osceola (Osceola, AR)",S,6-1,220,0.9827,42,4,1,Alabama,2000
For example, if I don't use a for loop, this line of code is what I use if I just want to create one dataframe:
recruits2022 = recruits_final[recruits_final['Class'] == 2022]
However, I want to have a named dataframe for each recruiting class.
In other words, recruits2000 would be a dataframe for all rows that have a class value equal to 2000, recruits2001 would be a dataframe for all rows that have a class value to 2001, and so forth.
This is what I tried recently, but have no luck saving the dataframe outside of the for loop.
databases = ['recruits2000', 'recruits2001', 'recruits2002', 'recruits2003', 'recruits2004',
'recruits2005', 'recruits2006', 'recruits2007', 'recruits2008', 'recruits2009',
'recruits2010', 'recruits2011', 'recruits2012', 'recruits2013', 'recruits2014',
'recruits2015', 'recruits2016', 'recruits2017', 'recruits2018', 'recruits2019',
'recruits2020', 'recruits2021', 'recruits2022', 'recruits2023']
for i in range(len(databases)):
year = pd.to_numeric(databases[i][-4:], errors = 'coerce')
db = recruits_final[recruits_final['Class'] == year]
db.name = databases[i]
print(db)
print(db.name)
print(year)
recruits2023
I would get this error instead of what I wanted
NameError Traceback (most recent call last)
<ipython-input-49-7cb5d12ab92f> in <module>()
29
30 # print(db.name)
---> 31 recruits2023
32
33
NameError: name 'recruits2023' is not defined
Is there something that I am missing to get this for loop to work? Any assistance is truly appreciated. Thanks in advance.
List use a dictionary of dataframes using groupby:
dict_dfs = dict(tuple(df.groupby('Class')))
Access you individual dataframes using
dict_dfs[2022]
You override variable db at each iteration and recruits2023 is not a variable so you can't use it like that:
You can use a dict to store your data:
recruits = {}
for year in recruits_final['Class'].unique():
recruits[year] = recruits_final[recruits_final['Class'] == year]
>>> recruits[2000]
Primary Rank Other Rank Name Link ... Position Rank State Rank Team Class
0 1 NaN D.J. Williams https://247sports.com/Player/DJ-Williams-49931 ... 1 1 Miami 2000
1 2 NaN Brock Berlin https://247sports.com/Player/Brock-Berlin-49926 ... 1 1 Florida 2000
2 3 NaN Charles Rogers https://247sports.com/Player/Charles-Rogers-49984 ... 1 1 Michigan State 2000
3 4 NaN Travis Johnson https://247sports.com/Player/Travis-Johnson-50043 ... 1 2 Florida State 2000
...
38 40 NaN Chris Septak https://247sports.com/Player/Chris-Septak-57555 ... 1 1 Nebraska 2000
39 41 NaN Eric Knott https://247sports.com/Player/Eric-Knott-60823 ... 2 3 Michigan State 2000
40 42 NaN Harold James https://247sports.com/Player/Harold-James-57524 ... 4 1 Alabama 2000
>>> recruits.keys()
dict_keys([2000])

Batch tracking in power query

I have a CSV that contains some production data. When loaded into Excels power query it has a structure similar to this (material batches may contain remainders of old material batches as recycling material):
Mat_Batch Date Recyc_Batch RawMaterial1 RawMaterial2 RawMaterial3 Amount1 Amount2 Amount3
123 01.11.2019 Fe Cr Ni 70 19 11
234 01.12.2019 Fe Cr Ni 71 18 11
345 01.02.2020 123 Fe Cr Ni 72 17 9
456 01.01.2020 234 Fe Cr Ni 70 19 11
567 01.02.2020 Fe Cr Ni 72 16 10
678 01.01.2020 456 Fe Cr Ni 70 19 11
Another CSV has the following content (it simply links a production batch to a material batch; production batches may contain more than one material batch):
Batch Mat_Batch
abc 456
abc 567
bcd 345
Now I would like to use power query m to evaluate which material batches exactly were used to produce a part batch. E.g. batch "abc" was made from 456 + 567 + 234 (as recycling material in 456).
As a first step, I filter the production batch table by a specific batch and join both tables via the resulting Mat_Batch column. As a second iteration I seperate the Recyc_Batch column from the matched material batches and do a second join with a copy of my material batch table to gain all additional recycling materials that where used. But how could I do so "infinite" times? The way I'm doing it I have to create additional queries for each iteration but I need a way to automatically repeat those joining steps until there is no more additional recycling material used.
here is a Query (Result) you can use (if I understood correct)
let
Quelle = Table.NestedJoin(tbl_Material, {"Mat_Batch"}, tbl_Production, {"Mat_Batch"}, "tbl_Production", JoinKind.LeftOuter),
Combine_Sources = Table.ExpandTableColumn(Quelle, "tbl_Production", {"Batch"}, {"Batch"}),
DeleteOtherColumns = Table.SelectColumns(Combine_Sources,{"Batch", "Mat_Batch", "Recyc_Batch"}),
UnpivotOtherColumns = Table.UnpivotOtherColumns(DeleteOtherColumns, {"Batch"}, "Attribut", "Wert"),
FilterRows = Table.SelectRows(UnpivotOtherColumns, each ([Batch] <> null)),
SortRows = Table.Sort(FilterRows,{{"Batch", Order.Ascending}})
in
SortRows
The result looks like that
Best regards Chris

how can I use multiple operation in awk to edit text file

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
in this file there are 5 tab separated columns. the first column is considered as ID. for example in the first row the whole "chr10:103909786-103910082" is ID.
1- in the 1st step I would like to filter out the rows based on the 4th column.
if the number in the 4th column is less than 10 and the same row but in the 5th column the group is BA, that row will be filtered out. also if the number in the 4th column is less than 5 and the same row but in the 5th column the group is CA, that row will be filtered out.
3- 3rd step:
I want to get the ratio of number in 4th column. in fact in the 1st column there are repeated values which represent the same ID. I want to get one ratio per ID, so in the output every ID will be repeated only once. each ID has both BA and CA in the 5th column. for each ID I should get 2 values for CA and BA separately and get the ration of CA/BA as the final value for each ID. to get one value as CA, I should add up all values in the 4th column which belong the same ID and classified as CA and to get one value as BA, I should add up all values in the 4th column which belong the same ID and classified as BA. the last step is to get the ration of CA/BA per ID. the expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output(this ratio is made using the values in 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
in the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk
awk -F "\t" '{ (if($4 < -10 & $5==BA)), (if($4 < -5 & $5==CA)) ; print $2 = BA/CA} file.txt > out.txt
I tried to follow the steps that mentioned in the code but did not succeed. do you know how to solve the problem?
If the records with the same ID are always consecutive, you can do that:
awk 'ID!=$1 {
if (ID) {
print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
}
ID=$1
}
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
END{ print ID, a["CA"]/a["BA"] }' file.txt
The first block tests if the ID has changed, in this case, it displays the previous ID and the ratio.
The second block filter unwanted records.
The END block displays the result for the last ID.

SAS Data organization

Dataset Sample
I have data set like the attached picture where I want only the observations that have same numsecur every year.
How do I do this in SAS proc sql function? Will this be easier to do in STATA? If so what procedure can I use?
You look like a new user to stackoverflow. Welcome. Your question is getting down voted for at least three reasons:
1) It's not really clear what you want from your description of the problem and the data
you're providing
2) You haven't shown any attempts at what you've tried
3) Providing your data as a picture is not great. It's most helpful if you're going
to provide data to provide it so it's easy for others to consume in their program.
After all, you're asking for our help make it easier for us to help you. If You
included something like the following we just have to copy and paste to create your
dataset to work with:
DATA test;
INPUT ID YEAR EXEC SUM;
DATALINES;
1573 1997 50 1080
1581 1997 51 300
1598 1996 54 80
1598 1998 54 80
1598 1999 54 80
1602 1996 55 112.6
1602 1997 55 335.965
;
RUN;
That being said the following MAY give you what you're looking for but it's only a guess as I'm not sure if this is really what you're asking:
proc sql no print;
create table testout as
select *,count(*) as cnt
from test
group by sum
having cnt > 1;
quit;
Are you asking: show all rows where the same SUM is used or something else?
Assuming I understand your question correctly, you would like to keep the observations from the same company/individual only if the company has the same numsecur every year. So, here is what I would try using STATA:
input ID YEAR EXEC SUM
1573 1997 50 1080 //
1581 1997 51 300 //
1598 1996 54 80 //
1598 1998 54 80 //
1598 1999 54 80 //
1602 1996 55 112.6 //
1602 1997 55 335.965 //
1575 1997 50 1080 //
1575 1998 51 1080 //
1595 1996 54 80 //
1595 1998 54 30 //
1595 1999 54 80 //
1605 1996 55 112.6 //
1605 1997 55 335.965 //
end
bysort ID SUM: gen drop=cond(_N==1, 0,_n)
drop if drop==0
The results show ( based on my data):
ID YEAR EXEC SUM drop
1. 1575 1997 50 1080 1
2. 1575 1998 51 1080 2
3. 1595 1999 54 80 1
4. 1595 1996 54 80 2
5. 1598 1996 54 80 1
6. 1598 1998 54 80 2
7. 1598 1999 54 80 3

Creating a Dynamic Range with a Macro

YearMth Region Employee Item Units Unit Cost Total
--------------------------------------------------------------------
2006-12 DC Jones Pen Set 700 1.99 1,393
2006-12 NY Peterson Binder 85 19.99 1,699
2006-12 DC Howard Pen Set 62 4.99 309
2006-12 DC Gill Pen 58 19.99 1,159
2006-12 NY Anderson Binder 10 4.99 50
2006-12 NY Anderson Pen Set 19 2.99 57
Using this data how would I create a Dynamic Range using a Macro.
Thanks
Depends what you want to do! But here's an example
ActiveWorkbook.Names.Add Name:="MyDynamicRange",
RefersToR1C1:= "=OFFSET(Sheet1!R2C2,0,0,COUNTA(Sheet1!R2C2:R200C2),1)"
This range will contain all your Regions.

Resources