Split one column into multiple columns by multiple delimiters in Pandas - python-3.x
Given a dataframe as follows:
player score
0 Sergio Agüero Forward — Manchester City 209.98
1 Eden Hazard Midfield — Chelsea 274.04
2 Alexis Sánchez Forward — Arsenal 223.86
3 Yaya Touré Midfield — Manchester City 197.91
4 Angel María Midfield — Manchester United 132.23
How could I split player into three new columns: name, position and team?
player score name position team
0 Sergio Agüero Forward — Manchester City 209.98 Sergio Forward Manchester City
1 Eden Hazard Midfield — Chelsea 274.04 Eden Midfield Chelsea
2 Alexis Sánchez Forward — Arsenal 223.86 Alexis Forward Arsenal
3 Yaya Touré Midfield — Manchester City 197.91 Yaya Midfield Manchester City
4 Angel María Midfield — Manchester United 132.23 Angel Midfield Manchester United
I have considered splitting it into two columns with df[['name_position', 'team']] = df['player'].str.split(pat= ' — ', expand=True), then splitting name_position into name and position. But is there a better solution?
Many thanks.
You can use str.extract as well if you want to do it in one go:
print(df["player"].str.extract(r"(?P<name>.*?)\s.*?\s(?P<position>[A-Za-z]+)\s—\s(?P<team>.*)"))
name position team
0 Sergio Forward Manchester City
1 Eden Midfield Chelsea
2 Alexis Forward Arsenal
3 Yaya Midfield Manchester City
4 Angel Midfield Manchester United
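To attach the extracted columns to the original dataframe, the result of str.extract can be joined back on the index. The following is a minimal sketch that rebuilds the sample data from the question and reuses the pattern above unchanged; it assumes every row follows the "first-name last-name position — team" layout.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "player": [
        "Sergio Agüero Forward — Manchester City",
        "Eden Hazard Midfield — Chelsea",
        "Alexis Sánchez Forward — Arsenal",
        "Yaya Touré Midfield — Manchester City",
        "Angel María Midfield — Manchester United",
    ],
    "score": [209.98, 274.04, 223.86, 197.91, 132.23],
})

# Same pattern as in the answer: named groups become column names.
pattern = r"(?P<name>.*?)\s.*?\s(?P<position>[A-Za-z]+)\s—\s(?P<team>.*)"

# str.extract returns a dataframe aligned on the index, so join() lines it up.
df = df.join(df["player"].str.extract(pattern))
print(df[["name", "position", "team"]])
```

Rows that do not match the pattern would get NaN in all three new columns, which makes unexpected input easy to spot.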
You can split a Python string on whitespace with str.split(). This breaks the text into words, and you can then pick the ones you need by index:
string = "Sergio Agüero Forward — Manchester City"
words = string.split()
name = words[0]
position = words[2]
# the team is everything after the dash, which may be more than one word
team = " ".join(words[4:])
For more complex patterns, you can use regex, which is a powerful string pattern finding tool.
Hope this helped :)
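The same word-splitting idea can be applied to the whole column with the pandas string methods, which is close to the two-step split the question describes. This is only a sketch under the assumption that the position is always the last word before the " — " delimiter and the name is the first word of the string.

```python
import pandas as pd

df = pd.DataFrame({
    "player": [
        "Sergio Agüero Forward — Manchester City",
        "Eden Hazard Midfield — Chelsea",
        "Angel María Midfield — Manchester United",
    ],
})

# Step 1: split once on the dash delimiter -> [name + position, team].
parts = df["player"].str.split(" — ", n=1, expand=True)
df["team"] = parts[1]

# Step 2: split the left part once from the right -> [full name, position].
name_pos = parts[0].str.rsplit(" ", n=1, expand=True)
df["position"] = name_pos[1]

# Keep only the first word of the full name, as in the expected output.
df["name"] = name_pos[0].str.split().str[0]
print(df)
```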
Related
Creating multiple named dataframes by a for loop
I have a database that contains 60,000+ rows of college football recruit data. From there, I want to create separate dataframes where each one contains just one value. This is what a sample of the dataframe looks like:
,Primary Rank,Other Rank,Name,Link,Highschool,Position,Height,weight,Rating,National Rank,Position Rank,State Rank,Team,Class
0,1,,D.J. Williams,https://247sports.com/Player/DJ-Williams-49931,"De La Salle (Concord, CA)",ILB,6-2,235,0.9998,1,1,1,Miami,2000
1,2,,Brock Berlin,https://247sports.com/Player/Brock-Berlin-49926,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,0.9998,2,1,1,Florida,2000
2,3,,Charles Rogers,https://247sports.com/Player/Charles-Rogers-49984,"Saginaw (Saginaw, MI)",WR,6-4,195,0.9988,3,1,1,Michigan State,2000
3,4,,Travis Johnson,https://247sports.com/Player/Travis-Johnson-50043,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,0.9982,4,1,2,Florida State,2000
4,5,,Marcus Houston,https://247sports.com/Player/Marcus-Houston-50139,"Thomas Jefferson (Denver, CO)",RB,6-0,208,0.9980,5,1,1,Colorado,2000
5,6,,Kwame Harris,https://247sports.com/Player/Kwame-Harris-49999,"Newark (Newark, DE)",OT,6-7,320,0.9978,6,1,1,Stanford,2000
6,7,,B.J. Johnson,https://247sports.com/Player/BJ-Johnson-50154,"South Grand Prairie (Grand Prairie, TX)",WR,6-1,190,0.9976,7,2,1,Texas,2000
7,8,,Bryant McFadden,https://247sports.com/Player/Bryant-McFadden-50094,"McArthur (Hollywood, FL)",CB,6-1,182,0.9968,8,1,1,Florida State,2000
8,9,,Sam Maldonado,https://247sports.com/Player/Sam-Maldonado-50071,"Harrison (Harrison, NY)",RB,6-2,215,0.9964,9,2,1,Ohio State,2000
9,10,,Mike Munoz,https://247sports.com/Player/Mike-Munoz-50150,"Archbishop Moeller (Cincinnati, OH)",OT,6-7,290,0.9960,10,2,1,Tennessee,2000
10,11,,Willis McGahee,https://247sports.com/Player/Willis-McGahee-50179,"Miami Central (Miami, FL)",RB,6-1,215,0.9948,11,3,2,Miami,2000
11,12,,Antonio Hall,https://247sports.com/Player/Antonio-Hall-50175,"McKinley (Canton, OH)",OT,6-5,295,0.9946,12,3,2,Kentucky,2000
12,13,,Darrell Lee,https://247sports.com/Player/Darrell-Lee-50580,"Kirkwood (Saint Louis, MO)",WDE,6-5,230,0.9940,13,1,1,Florida,2000
13,14,,O.J. Owens,https://247sports.com/Player/OJ-Owens-50176,"North Stanly (New London, NC)",S,6-1,195,0.9932,14,1,1,Tennessee,2000
14,15,,Jeff Smoker,https://247sports.com/Player/Jeff-Smoker-50582,"Manheim Central (Manheim, PA)",PRO,6-3,190,0.9922,15,2,1,Michigan State,2000
15,16,,Marco Cooper,https://247sports.com/Player/Marco-Cooper-50171,"Cass Technical (Detroit, MI)",OLB,6-2,235,0.9918,16,1,2,Ohio State,2000
16,17,,Chance Mock,https://247sports.com/Player/Chance-Mock-50163,"The Woodlands (The Woodlands, TX)",PRO,6-2,190,0.9918,17,3,2,Texas,2000
17,18,,Roy Williams,https://247sports.com/Player/Roy-Williams-55566,"Permian (Odessa, TX)",WR,6-4,202,0.9916,18,3,3,Texas,2000
18,19,,Matt Grootegoed,https://247sports.com/Player/Matt-Grootegoed-50591,"Mater Dei (Santa Ana, CA)",OLB,5-11,205,0.9914,19,2,3,USC,2000
19,20,,Yohance Buchanan,https://247sports.com/Player/Yohance-Buchanan-50182,"Douglass (Atlanta, GA)",S,6-1,210,0.9912,20,2,1,Florida State,2000
20,21,,Mac Tyler,https://247sports.com/Player/Mac-Tyler-50572,"Jess Lanier (Hueytown, AL)",DT,6-6,320,0.9912,21,1,1,Alabama,2000
21,22,,Jason Respert,https://247sports.com/Player/Jason-Respert-55623,"Northside (Warner Robins, GA)",OC,6-3,300,0.9902,22,1,2,Tennessee,2000
22,23,,Casey Clausen,https://247sports.com/Player/Casey-Clausen-50183,"Bishop Alemany (Mission Hills, CA)",PRO,6-4,215,0.9896,23,4,4,Tennessee,2000
23,24,,Albert Means,https://247sports.com/Player/Albert-Means-55968,"Trezevant (Memphis, TN)",SDE,6-6,310,0.9890,24,2,1,Alabama,2000
24,25,,Albert Hollis,https://247sports.com/Player/Albert-Hollis-55958,"Christian Brothers (Sacramento, CA)",RB,6-0,190,0.9890,25,4,5,Georgia,2000
25,26,,Eric Moore,https://247sports.com/Player/Eric-Moore-55973,"Pahokee (Pahokee, FL)",OLB,6-4,226,0.9884,26,3,3,Florida State,2000
26,27,,Willie Dixon,https://247sports.com/Player/Willie-Dixon-55626,"Stockton Christian School (Stockton, CA)",WR,5-11,182,0.9884,27,4,6,Miami,2000
27,28,,Cory Bailey,https://247sports.com/Player/Cory-Bailey-50586,"American (Hialeah, FL)",S,5-10,175,0.9880,28,3,4,Florida,2000
28,29,,Sean Young,https://247sports.com/Player/Sean-Young-55972,"Northwest Whitfield County (Tunnel Hill, GA)",OG,6-6,293,0.9878,29,1,3,Tennessee,2000
29,30,,Johnnie Morant,https://247sports.com/Player/Johnnie-Morant-60412,"Parsippany Hills (Morris Plains, NJ)",WR,6-5,225,0.9871,30,5,1,Syracuse,2000
30,31,,Wes Sims,https://247sports.com/Player/Wes-Sims-60243,"Weatherford (Weatherford, OK)",OG,6-5,310,0.9869,31,2,1,Oklahoma,2000
31,33,,Jason Campbell,https://247sports.com/Player/Jason-Campbell-55976,"Taylorsville (Taylorsville, MS)",PRO,6-5,190,0.9853,33,5,1,Auburn,2000
32,34,,Antwan Odom,https://247sports.com/Player/Antwan-Odom-50168,"Alma Bryant (Irvington, AL)",SDE,6-7,260,0.9851,34,3,2,Alabama,2000
33,35,,Sloan Thomas,https://247sports.com/Player/Sloan-Thomas-55630,"Klein (Spring, TX)",WR,6-2,188,0.9847,35,6,5,Texas,2000
34,36,,Raymond Mann,https://247sports.com/Player/Raymond-Mann-60804,"Hampton (Hampton, VA)",ILB,6-1,233,0.9847,36,2,1,Virginia,2000
35,37,,Alphonso Townsend,https://247sports.com/Player/Alphonso-Townsend-55975,"Lima Central Catholic (Lima, OH)",DT,6-6,280,0.9847,37,2,3,Ohio State,2000
36,38,,Greg Jones,https://247sports.com/Player/Greg-Jones-50158,"Battery Creek (Beaufort, SC)",RB,6-2,245,0.9837,38,6,1,Florida State,2000
37,39,,Paul Mociler,https://247sports.com/Player/Paul-Mociler-60319,"St. John Bosco (Bellflower, CA)",OG,6-5,300,0.9833,39,3,7,UCLA,2000
38,40,,Chris Septak,https://247sports.com/Player/Chris-Septak-57555,"Millard West (Omaha, NE)",TE,6-3,245,0.9833,40,1,1,Nebraska,2000
39,41,,Eric Knott,https://247sports.com/Player/Eric-Knott-60823,"Henry Ford II (Sterling Heights, MI)",TE,6-4,235,0.9831,41,2,3,Michigan State,2000
40,42,,Harold James,https://247sports.com/Player/Harold-James-57524,"Osceola (Osceola, AR)",S,6-1,220,0.9827,42,4,1,Alabama,2000
For example, if I don't use a for loop, this line of code is what I use if I just want to create one dataframe:
recruits2022 = recruits_final[recruits_final['Class'] == 2022]
However, I want to have a named dataframe for each recruiting class. In other words, recruits2000 would be a dataframe for all rows that have a class value equal to 2000, recruits2001 would be a dataframe for all rows that have a class value equal to 2001, and so forth.
This is what I tried recently, but I have had no luck saving the dataframe outside of the for loop.
databases = ['recruits2000', 'recruits2001', 'recruits2002', 'recruits2003', 'recruits2004', 'recruits2005', 'recruits2006', 'recruits2007', 'recruits2008', 'recruits2009', 'recruits2010', 'recruits2011', 'recruits2012', 'recruits2013', 'recruits2014', 'recruits2015', 'recruits2016', 'recruits2017', 'recruits2018', 'recruits2019', 'recruits2020', 'recruits2021', 'recruits2022', 'recruits2023']
for i in range(len(databases)):
    year = pd.to_numeric(databases[i][-4:], errors = 'coerce')
    db = recruits_final[recruits_final['Class'] == year]
    db.name = databases[i]
    print(db)
    print(db.name)
    print(year)
recruits2023
I would get this error instead of what I wanted:
NameError Traceback (most recent call last)
<ipython-input-49-7cb5d12ab92f> in <module>()
29
30 # print(db.name)
---> 31 recruits2023
32
33
NameError: name 'recruits2023' is not defined
Is there something that I am missing to get this for loop to work? Any assistance is truly appreciated. Thanks in advance.
Use a dictionary of dataframes built with groupby:
dict_dfs = dict(tuple(df.groupby('Class')))
Access your individual dataframes using dict_dfs[2022]
You overwrite the variable db at each iteration, and recruits2023 is not a variable, so you can't use it like that.
You can use a dict to store your data:
recruits = {}
for year in recruits_final['Class'].unique():
    recruits[year] = recruits_final[recruits_final['Class'] == year]
>>> recruits[2000]
Primary Rank Other Rank Name Link ... Position Rank State Rank Team Class
0 1 NaN D.J. Williams https://247sports.com/Player/DJ-Williams-49931 ... 1 1 Miami 2000
1 2 NaN Brock Berlin https://247sports.com/Player/Brock-Berlin-49926 ... 1 1 Florida 2000
2 3 NaN Charles Rogers https://247sports.com/Player/Charles-Rogers-49984 ... 1 1 Michigan State 2000
3 4 NaN Travis Johnson https://247sports.com/Player/Travis-Johnson-50043 ... 1 2 Florida State 2000
...
38 40 NaN Chris Septak https://247sports.com/Player/Chris-Septak-57555 ... 1 1 Nebraska 2000
39 41 NaN Eric Knott https://247sports.com/Player/Eric-Knott-60823 ... 2 3 Michigan State 2000
40 42 NaN Harold James https://247sports.com/Player/Harold-James-57524 ... 4 1 Alabama 2000
>>> recruits.keys()
dict_keys([2000])
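Both answers come down to keying the per-class dataframes in a dict rather than creating variables by name. A minimal sketch of that idea follows; the small recruits_final frame here is a stand-in with two rows from the question and two made-up placeholder rows (marked in the comments), and the "recruitsYYYY" key format is just one possible naming choice.

```python
import pandas as pd

# Stand-in for the real 60,000+ row frame.
recruits_final = pd.DataFrame({
    "Name": [
        "D.J. Williams",      # from the question's sample
        "Brock Berlin",       # from the question's sample
        "Placeholder A",      # hypothetical row, for illustration only
        "Placeholder B",      # hypothetical row, for illustration only
    ],
    "Team": ["Miami", "Florida", "Texas", "USC"],
    "Class": [2000, 2000, 2001, 2002],
})

# One dataframe per recruiting class, keyed by the name the question wanted.
recruits = {
    f"recruits{year}": group
    for year, group in recruits_final.groupby("Class")
}

print(sorted(recruits))
print(recruits["recruits2000"])
```

Looking up recruits["recruits2000"] then plays the role that the bare name recruits2000 was meant to play in the original loop.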
Remove custom stop words from pandas dataframe not working
I am trying to remove a custom list of stop words, but it's not working.
desc = pd.DataFrame(description, columns =['description'])
print(desc)
Which gives the following results:
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it doesn't seem to work:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
Which produces the following results:
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
As you can see, the stop words weren't removed. Any help you can provide would be greatly appreciated.
Handle the case (lowercase the text first) and simplify the pattern:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join(remove_words)
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
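If lowercasing the whole text is not wanted, a variation is to keep the original casing and make the match itself case-insensitive. The sketch below is an alternative to the answer above, not a restatement of it: it keeps the word boundaries from the question, passes re.IGNORECASE through the flags argument of Series.str.replace, and then tidies the leftover double spaces. The two sample rows are shortened stand-ins for the real descriptions.

```python
import re
import pandas as pd

# Shortened stand-in rows; the real frame has 9875 descriptions.
desc = pd.DataFrame({
    "description": [
        "The Kentucky Cannabis Company and Bluegrass Hemp",
        "Ohio County Sheriff",
    ]
})

remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]

# \b keeps whole words only; re.escape guards against special characters.
pat = r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b"

desc["new_desc"] = (
    desc["description"]
    .str.replace(pat, "", flags=re.IGNORECASE, regex=True)  # drop stop words
    .str.replace(r"\s+", " ", regex=True)                   # collapse gaps
    .str.strip()
)
print(desc)
```

Note that the original attempt targeted a column called string via dict(string={pat: ''}), while the dataframe's only column is description, so no column was ever matched.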
How to identify the cities where each person has lived at different times?
Here is a small sample of the dataset that I am currently working on.
FirstName  LastName  cities        occupation         time
---------------------------------------------------------------
Alice      Oumi      Queens        software engineer  1/1/2019
Alice      Oumi      New York      software engineer  12/3/2018
Sam        Charles   Santa Clara   Engineer           2/5/2017
Sam        Charles   Santa Monica  Engineer           8/9/2018
Sam        Charles   Santa Clara   Engineer           12/12/2019
Alice      Oumi      New York      software engineer  1/2/2017
As you can see above, the same person can appear as living in the same place at several different times. I want to clean this dataset so that it shows which places Alice and Sam lived in. For example, instead of having 2 rows of Alice living in New York, I only need one. Something similar to the following table:
FirstName  LastName  cities        FirstTime  SecondTime
---------------------------------------------------------------
Alice      Oumi      Queens        1/1/2019   NA
Alice      Oumi      New York      1/2/2017   12/3/2018
Sam        Charles   Santa Clara   2/5/2017   12/12/2019
Sam        Charles   Santa Monica  8/9/2018   NA
I am fairly new to Python and trying to learn. I have tried for loops with iterrows(), but that didn't work. What can I use to achieve this table? Thank you so much in advance.
You can do that as follows:
# number the times a person lived in the same city (with the same occupation)
df['sequence'] = df.groupby(['FirstName', 'LastName', 'cities', 'occupation']).cumcount() + 1
# now create the "pivot" table
result = df.set_index(['FirstName', 'LastName', 'cities', 'occupation', 'sequence']).unstack()
# rename the columns
result.columns = ['FirstTime', 'SecondTime']
# reset the index (it was just needed for "pivoting")
result.reset_index(inplace=True)
The result looks like:
Out[483]:
FirstName LastName cities occupation FirstTime SecondTime
0 Alice Oumi New York software engineer 12/3/2018 1/2/2017
1 Alice Oumi Queens software engineer 1/1/2019 NaN
2 Sam Charles Santa Clara Engineer 2/5/2017 12/12/2019
3 Sam Charles Santa Monica Engineer 8/9/2018 None NaN
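In the output above, Alice's New York dates appear in row order (12/3/2018 before 1/2/2017) rather than chronologically, whereas the table the question asks for has the earlier date first. One way to get that, sketched here under the assumption that the dates are in month/day/year format, is to parse the time column and sort by it before numbering the stays. The occupation column is left out for brevity.

```python
import pandas as pd

# Sample rows from the question, without the occupation column.
df = pd.DataFrame({
    "FirstName": ["Alice", "Alice", "Sam", "Sam", "Sam", "Alice"],
    "LastName": ["Oumi", "Oumi", "Charles", "Charles", "Charles", "Oumi"],
    "cities": ["Queens", "New York", "Santa Clara", "Santa Monica",
               "Santa Clara", "New York"],
    "time": ["1/1/2019", "12/3/2018", "2/5/2017", "8/9/2018",
             "12/12/2019", "1/2/2017"],
})

# Assumption: month/day/year. Parse so that sorting is chronological.
df["time"] = pd.to_datetime(df["time"], format="%m/%d/%Y")
df = df.sort_values("time")

keys = ["FirstName", "LastName", "cities"]

# Number each stay per person and city in chronological order.
df["sequence"] = df.groupby(keys).cumcount() + 1

# Move the sequence level into columns: one column per stay.
result = df.set_index(keys + ["sequence"])["time"].unstack()
result.columns = ["FirstTime", "SecondTime"]
result = result.reset_index()
print(result)
```

The hard-coded two column names only work while no person has more than two stays in the same city; with more, the rename step would need one name per sequence number.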
how to separate Unit/Suite/APT/# from an address in Excel
I have a database of 272,000 addresses, but some addresses have a unit/suite/STE/APT. See the example below:
16 BRIARWOOD COURT UNIT B MONTVALE, NJ 07645
100 CROWN COURT #471 EDGEWATER, NJ 07020
23-05 HIGH ST APT A FAIR LAWN, NJ 07410
15-01 BROADWAY STE 6 FAIR LAWN, NJ 07410
80 BROADWAY, SUITE 1A CRESSKILL, N.J. 07626
300 GORGE ROAD APT 11 CLIFFSIDE PARK, N.J. 07010
I would like to split the text into the next column when it comes across unit/suite/STE/APT. I want to separate these so I can use Advanced Filter with unique records and create a master find and replace to clean the list. Any formulas I can use for this would be helpful.
You can batch geocode your file on geocoder.ca. This is the result I got:
rawlocation Latitude Longitude Score StandardCivicNumber StandardAddress StandardCity StandardStateorProvinceAbbrv PostalZip Confidence
16 BRIARWOOD COURT UNIT B MONTVALE NJ 07645 41.035587 -74.06744 1 16 Briarwood Crt Montvale NJ 7677 0.7
100 CROWN COURT #471 EDGEWATER NJ 07020 40.822893 -73.978375 1 100 Crown Crt Edgewater NJ 07020-1137 0.8
23-05 HIGH ST APT A FAIR LAWN NJ 07410 40.940276 -74.120329 1 23 High St Fair Lawn NJ 07410-3574 0.8
15-01 BROADWAY STE 6 FAIR LAWN NJ 07410 40.920501 -74.091153 1 1 S Broadway Fair Lawn NJ 07410-5529 0.8
80 BROADWAY - 0
300 GORGE ROAD APT 11 CLIFFSIDE PARK N.J. 07010 40.814151 -73.990015 1 300 Gorge Rd Cliffside Park NJ 07010-2759 0.8
From the cleaned up version you can then street compare to extract additional entities.
Since not all addresses have a secondary number (such as APT C, or STE 312), I would recommend separating every time you come across a ZIP (5 digits) or a ZIP+4 (like 07010-2759). This will help you break that string into discrete addresses. If you then want to clean up the list by correcting small typos and standardizing abbreviations, etc, I recommend using an address validation and standardization service like Melissa Data, or SmartyStreets. SmartyStreets has tools for validating/cleansing large lists of addresses and even extracting addresses out of text. (Full disclosure) I'm a software developer for SmartyStreets.
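For readers who can process the list outside Excel, the unit split itself can be sketched with a regular expression. This is an illustration only, not an Excel formula and not the method of either answer: the helper name split_unit is made up, the designator list covers just the four forms shown in the question plus "#", and it assumes the unit identifier is a single token right after the designator.

```python
import re

# Designators from the question (UNIT, SUITE, STE, APT) plus the "#" form.
# An optional comma before the designator is swallowed, as in "80 BROADWAY, SUITE 1A".
UNIT_RE = re.compile(
    r",?\s+((?:UNIT|SUITE|STE|APT)\s+\S+|#\s*\S+)",
    re.IGNORECASE,
)

def split_unit(address):
    """Return (address without the unit, unit text); unit is '' if none found."""
    match = UNIT_RE.search(address)
    if match is None:
        return address.strip(), ""
    remainder = address[:match.start()] + address[match.end():]
    return remainder.strip(), match.group(1)

samples = [
    "16 BRIARWOOD COURT UNIT B MONTVALE, NJ 07645",
    "100 CROWN COURT #471 EDGEWATER, NJ 07020",
    "80 BROADWAY, SUITE 1A CRESSKILL, N.J. 07626",
]
for s in samples:
    print(split_unit(s))
```

Multi-token units (for example "APT 2 B") or designators outside the list would need the pattern extended, which is where a validation service as suggested above becomes more reliable.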
Multi Criterion Max If Statement
My dataset looks like this...
State  Close Date  Probability  Highest Prob/State
WA     12/31/2016  50%          FALSE
WA     12/19/2016  80%          FALSE
WA     10/15/2016  80%          TRUE
My objective is to build a formula to populate the right-most column. The formula should assess Close Dates and Probabilities within each state. First, it should select the highest probability, then it should select the nearest close date if there is a tie on probability (as in the example). For that record, it should read "TRUE". I assume this would include a MAX IF statement but haven't been able to get it to work.
Here is a more robust set of data I'm working with. It may actually be easier to first find the highest probability within each Region then select the minimum (oldest) date if there is a tie on probability. This too will serve my purposes.
Region Forecast Close Date Probability (%)
Okeechobee FL 6/27/2016 90
Okeechobee West FL 7/1/2016 40
Albany GA 3/11/2016 100
Emerald Coast FL 6/30/2016 60
Emerald Coast FL 10/1/2016 40
Cullman_Hartselle TN 4/30/2016 10
North MS 10/1/2016 25
Roanoke VA 8/31/2016 25
Roanoke VA 8/1/2016 40
Gardena CA 6/1/2016 80
Gardena CA 6/1/2016 80
Lomita-Harbor City 6/30/2016 60
Lomita-Harbor City 6/30/2016 0
Lomita-Harbor City 6/30/2016 40
Eastern NC 6/30/2016 60
Northwest NC 9/16/2016 10
Fort Collins_Greeley CO 3/1/2016 100
Northwest OK 6/30/2016 100
Southwest MO 7/29/2016 90
Northern NH-VT 3/1/2016 20
South DE 12/1/2016 0
South DE 12/1/2016 20
Kingston NY 12/30/2016 5
Longview WA 11/30/2016 5
North DE 12/1/2016 20
North DE 12/1/2016 0
Salt Lake City UT 8/31/2016 20
Idaho Panhandle 8/26/2016 0
Bridgeton_Salem NJ 7/1/2016 25
Bridgeton_Salem NJ 7/1/2016 65
Layton_Ogden UT 3/25/2016 5
Central OR 6/30/2016 10
The following array formula should work:
=(ABS(B2-$F$2)=MIN(IF(($A$2:$A$33=A2)*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33))),ABS($B$2:$B$33-$F$2))))*(C2=MAX(IF($A$2:$A$33=A2,$C$2:$C$33)))>0
Since this is an array formula, confirm it with Ctrl-Shift-Enter when exiting edit mode. If done properly, Excel will put {} around the formula.
Edit: added @tigeravatar's suggestion to avoid volatile functions.
I think this is OK now but needs to be checked against the more complete set of data provided by OP. It counts:
(1) Any rows with same state but higher probability
(2) Any rows with same state and probability, in the future (or present) and nearer to today's date
(3) Any rows with same state and probability, in the past and nearer to today's date.
If all these are zero, you should have the right one.
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,">="&$G$2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,">"&$G$2-IF($B2>=$G$2,DATEDIF($G$2,$B2,"d"),DATEDIF($B2,$G$2,"d")),$B$2:$B$100,"<"&$G$2)
=0
If the dates are all in the future, it can be simplified a lot:
=COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,">"&$C2)
+COUNTIFS($A$2:$A$100,$A2,$C$2:$C$100,$C2,$B$2:$B$100,"<"&$G$2+DATEDIF($G$2,$B2,"d"))
=0
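To check a formula's output against the data, the simpler rule the question says would also be acceptable (highest probability per state, then the earliest close date on a tie) can be sketched in a few lines of Python. This is not a translation of either formula above, which measure nearness to a reference date held in a cell; the function name flag_best is made up, and the dates are assumed to be month/day/year.

```python
from datetime import datetime

# The three-row example from the question: (state, close date, probability %).
rows = [
    ("WA", "12/31/2016", 50),
    ("WA", "12/19/2016", 80),
    ("WA", "10/15/2016", 80),
]

def flag_best(rows):
    """Return one boolean per row: True for the winning row of each state."""
    best = {}  # state -> (sort key, row index)
    for i, (state, close, prob) in enumerate(rows):
        # Higher probability wins; on a tie, the earlier date wins.
        key = (-prob, datetime.strptime(close, "%m/%d/%Y"))
        if state not in best or key < best[state][0]:
            best[state] = (key, i)
    winners = {i for _, i in best.values()}
    return [i in winners for i in range(len(rows))]

flags = flag_best(rows)
print(flags)
```

When two rows tie on both probability and date (as the duplicate Gardena rows in the larger data set do), this sketch flags only the first of them.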