Suppose I have the following DataFrame:
company                         money
jack & jill, Boston, MA 02215      51
jack & jill, MA 02215              49
Now, I know that these 2 rows refer to the same company, so I want to merge them and also sum the money:
company                         money
jack & jill, Boston, MA 02215     100
I don't care about the format of the company name, as long as the duplicates get merged and the money gets added.
How should I go about this? Is there a library out there that merges SIMILAR value rows and sums the corresponding quantitative value?
If the company column always follows the same pattern, i.e. the value before the first comma is the company name, you can use something like the below:
import pandas as pd

df = pd.DataFrame({'company': ['jack & jill, Boston, MA 02215', 'jack & jill, MA 02215', 'Google, New Jersey', 'Google'],
                   'money': [51, 49, 33, 22]})

# keep only the part before the first comma as the company name
df['company'] = df['company'].apply(lambda x: x.split(",")[0])
new_df = df.groupby(['company'])['money'].sum().reset_index()
print(new_df)
Output:
       company  money
0       Google     55
1  jack & jill    100
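If the names don't share a clean pattern like this, one standard-library option is fuzzy matching with difflib. Here is a rough sketch that folds each name into the first previously-seen name it closely resembles; the 0.8 cutoff is an assumption you would have to tune for your data:

import difflib
import pandas as pd

df = pd.DataFrame({'company': ['jack & jill, Boston, MA 02215', 'jack & jill, MA 02215'],
                   'money': [51, 49]})

seen = []
def canonicalize(name, cutoff=0.8):
    # reuse the closest already-seen name; otherwise register this one as canonical
    match = difflib.get_close_matches(name, seen, n=1, cutoff=cutoff)
    if match:
        return match[0]
    seen.append(name)
    return name

df['company'] = df['company'].apply(canonicalize)
print(df.groupby('company', as_index=False)['money'].sum())

For anything beyond a toy frame you would probably want a dedicated record-linkage library rather than this quadratic scan.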
I'm trying to compile a best 5 and worst 5 list. I have two columns: column B with the number score and column C with the name. I only want the list to include the names.
In my previous attempts the formula would get the top/bottom 5, but as soon as a duplicate score appeared, the first name with that value would just repeat.
Here is my data
26 Cal
55 John
55 Mike
100 Steve
26 Thomas
100 Jaden
100 Jack
95 Josh
87 Cole
75 Brett
I've managed to get the bottom 5 list formula correct. This formula works perfectly and includes all names of duplicate scores.
Example of what I get:
Cal
Thomas
John
Mike
Brett
=INDEX($C$56:$E$70,SMALL(IF($B$56:$B$70=SMALL($B$56:$B$70,ROWS(E$2:E2)),ROW($B$56:$B$70)-ROW($B$56)+1),SUM(IF($B$56:$B$70=SMALL($B$56:$B$70,ROWS(E$2:E2)),1,0))-SUM(IF($B$56:$B$70<=SMALL($B$56:$B$70,ROWS(E$2:E2)),1,0))+ROWS(E$2:E2)))
Here is the formula I've tried to get the top 5 - however I keep getting an error.
=INDEX($C$56:$E$70,LARGE(IF($B$56:$B$70=LARGE($B$56:$B$70,ROWS(E$2:E2)),ROW($B$56:$B$70)-ROW($B$56)+1),SUM(IF($B$56:$B$70=LARGE($B$56:$B$70,ROWS(E$2:E2)),1,0))-SUM(IF($B$56:$B$70<=LARGE($B$56:$B$70,ROWS(E$2:E2)),1,0))+ROWS(E$2:E2)))
Example of what I'm looking for
Steve
Jaden
Jack
Josh
Cole
You can set up two queries like this, one for each case (the first returns the top 5, the second the bottom 5):
=QUERY(B56:C70,"Select C order by B desc limit 5")
=QUERY(B56:C70,"Select C order by B limit 5")
Use the SORTN() function, like:
=SORTN(A1:B10,5,,1,1)
To keep only one column, wrap the SORTN() function in INDEX() and specify the column number. Try:
=INDEX(SORTN(A1:B10,5,,1,1),,2)
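The trailing 1,1 sorts ascending by column 1, which yields the bottom 5. Assuming the same A1:B10 layout, flipping the last argument to 0 (descending) should give the top 5:
=INDEX(SORTN(A1:B10,5,,1,0),,2)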
Trying to create a simple program that finds negative values in a pandas dataframe and combines them with their matching row. Basically I have data that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
...
The idea is that I need to match up fills and refunds, then combine them into a single row. So, in the example above we'd have one row that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 5 1.74
Also, if a refund doesn't match a fill row, I'm supposed to just delete it, which I've been accomplishing with .drop()
I'm imagining I can build a multiindex for this somehow, where each row that's a negative is marked as a refund and each fill row is marked as a fill. Then I just have some kind of for loop that goes through the list and attempts to match a certain number of times based on name/number of refunds.
Here's what I was trying:
pbm_negative_index = raw_pbm_data.loc['LastName','DrugName','RXNumber','ClientTotalCost']
names = pbm_negative_index = raw_pbm_data.loc[: , 'LastName']
unique_names = unique(pbm_negative_index)
for n in unique_names:
edf["Refund"] = edf["ClientTotalCost"].shift(1, fill_value=edf["ClientTotalCost"].head(1)) < 0
This obviously doesn't work and I'd like to use the indexing tools in Pandas to achieve a similar result.
Your specification reduces to two simple steps:
aggregate +ve & -ve matching rows
drop remaining -ve rows after aggregation
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
ADAMS2 Drug. 100001 -5 -1.95
"""), sep=r"\s+")
# aggregate
dfa = df.groupby(["LastName","DrugName","RxNumber"], as_index=False).agg({"Amount":"sum","ClientTotalCost":"sum"})
# drop remaining -ve amounts (refunds that never matched a fill)
dfa = dfa.drop(dfa.loc[dfa.Amount.lt(0)].index)
print(dfa)
  LastName DrugName  RxNumber  Amount  ClientTotalCost
0    ADAMS     Drug    100001       5             1.74
I know you can do this with a series, but I can't seem to do this with a dataframe.
I have the following:
name note age
0 jon likes beer on tuesdays 10
1 jon likes beer on tuesdays
2 steve tonight we dine in heck 20
3 steve tonight we dine in heck
I am trying to produce the following:
name note age
0 jon likes beer on tuesdays 10
1 jon likes beer on tuesdays 10
2 steve tonight we dine in heck 20
3 steve tonight we dine in heck 20
I know how to do this with string values using groupby and join, but that only works on strings. I'm having issues converting the entire age column to a string data type in the dataframe.
Any suggestions?
Use GroupBy.first with GroupBy.transform if you want to repeat the first value per group:
g = df.groupby('name')
df['note'] = g['note'].transform(' '.join)
df['age'] = g['age'].transform('first')
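Note that in your sample the note column is already identical within each group, so only the age line is actually needed there. A minimal runnable sketch of just that, assuming the blank ages are NaN rather than empty strings:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['jon', 'jon', 'steve', 'steve'],
                   'note': ['likes beer on tuesdays'] * 2 + ['tonight we dine in heck'] * 2,
                   'age': [10, np.nan, 20, np.nan]})

# GroupBy.first skips NaN, so transform('first') broadcasts 10/20 to every row of each group
df['age'] = df.groupby('name')['age'].transform('first')
print(df)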
If you need to process multiple columns - aggregating all numeric columns with first and all string columns with join - you can build a dictionary mapping column names to functions, pass it to GroupBy.agg, and finally use DataFrame.join:
import numpy as np

cols1 = df.select_dtypes(np.number).columns
cols2 = df.columns.difference(cols1).difference(['name'])
d1 = dict.fromkeys(cols2, lambda x: ' '.join(x))
d2 = dict.fromkeys(cols1, 'first')
d = {**d1, **d2}
df1 = df[['name']].join(df.groupby('name').agg(d), on='name')
I have a large data set and am looking for something that will split my Street Address into two columns: Street Number and Street Name.
I am trying to figure out how I can do this efficiently, since I first need to process the street address and then check whether the first token of the split contains a digit or not.
So far I have working code that looks like this. I created two functions: one extracts the street number from the street address, while the other replaces the first occurrence of that street number in the street address.
def extract_street_number(row):
    # a street "number" may contain letters (e.g. 1234C), so look for any digit
    if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]

def extract_street_name(address, streetnumber):
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address
Then I use the apply function to build the two columns:
df[street_number] = df.apply(lambda row: extract_street_number(row[address_col]), axis=1)
df[street_name] = df.apply(lambda row: extract_street_name(row[address_col], row[street_number]), axis=1)
I'm wondering if there is a more efficient way to do this? With the current routine I need to build the Street Number column first before I can process the Street Name column.
I'm thinking of something like building the two series in a single pass over the address column. The pseudocode is something like this; I just can't figure out how to code it in Python.
Pseudocode:
Split Address into two parts on the first space:
    street_data = address.split(" ", maxsplit=1)
If street_data[0] has digits, then fill the columns this way:
    df[street_number] = street_data[0]
    df[street_name] = street_data[1]
Else, if street_data[0] has no digits, fill them this way:
    df[street_number] = ""
    df[street_name] = street_data[0] + " " + street_data[1]
    # or just simply the whole address
    df[street_name] = address
By the way, this is a working sample of the data:
# In
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St', '513 Banks St', '5600 77 Center Dr', '1013 1/2 E Main St', '1234C Main St', '37-01 Fair Lawn Ave']})
# Out
Street_Number Street_Name
0 111 Rubin Center
1 Monroe St
2 513 Banks St
3 5600 77 Center Dr
4 1013 1/2 E Main St
5 1234C Main St
6 37-01 Fair Lawn Ave
TL;DR:
This can be achieved in three steps:
Step 1:
df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 2:
df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]
Step 3:
mask = df['Street Address'].str.contains("N/A")
df.loc[mask, 'Street Address'] = df.loc[mask, 'Address']
Explanation:
Added two more test cases to the dataframe for code flexibility (rows 7 and 8).
Step 1 - We separate the street numbers from the addresses here. This is done by taking the first element of the list produced by splitting the address string and assigning it to the Street Number column.
If the first element doesn't contain a number, 'N/A' is put in the Street Number column instead.
Step 2 - As the first element of the split string holds the street number, the second element has to be the street address, so it is assigned to the Street Address column.
Step 3 - Because of step 2, Street Address ends up as 'N/A' for addresses that do not contain a number, and that is what this step resolves.
Hence, we can solve this in three steps after hours of struggle put in.
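For what it's worth, the three steps can also be collapsed into a single pass so the split is only computed once. A minimal sketch of the same logic (a rewrite, not the exact code above; Series.where supplies the 'N/A' fallback):

import pandas as pd

df = pd.DataFrame({'Address': ['111 Rubin Center', 'Monroe St', '513 Banks St',
                               '5600 77 Center Dr', '1013 1/2 E Main St',
                               '1234C Main St', '37-01 Fair Lawn Ave']})

parts = df.Address.str.split(' ', n=1)
has_number = parts.str[0].apply(lambda s: any(c.isdigit() for c in s))

# first token becomes the number when it contains a digit; otherwise fall back
df['Street Number'] = parts.str[0].where(has_number, 'N/A')
df['Street Address'] = parts.str[1].where(has_number, df.Address)
print(df)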
A solution reflecting your pseudocode is below.
First, let's split "Address" and store the parts:
new = df["Address"].str.split(" ", n = 1, expand = True)
df["First Part"]= new[0]
df["Last Part"]= new[1]
Next, let's write down the conditions:
cond1 = df['First Part'].apply(str.isdigit)
cond2 = df['Last Part'].apply(str.isdigit)
Now check which rows meet the given conditions:
df.loc[cond1 & ~cond2, "Street"] = df.loc[cond1 & ~cond2, "Last Part"]
df.loc[cond1 & ~cond2, "Number"] = df.loc[cond1 & ~cond2, "First Part"]
df.loc[~cond1 & ~cond2, "Street"] = df.loc[~cond1 & ~cond2, ['First Part', 'Last Part']].apply(lambda x: x[0] + ' ' + x[1], axis = 1)
Finally, let's clean up those auxiliary columns:
df.drop(["First Part", "Last Part"], axis = 1, inplace=True)
df
Address Street Number
0 111 Rubin Center Rubin Center 111
1 Monroe St Monroe St NaN
2 513 Banks St Banks St 513
For completeness, a mock test frame with extra edge cases (addresses where the number is not leading):
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St',
'513 Banks St', 'Banks 513 St',
'Rub Cent 111']})
Unless I'm missing something, a bit of regex should solve your request:
# gets the number only if it starts the line
df['Street_Number'] = df.Address.str.extract(r'(^\d+)')
# splits only if a number is at the start of the line; strip the leftover leading space
df['Street_Name'] = df.Address.str.split(r'^\d+').str[-1].str.strip()
Address Street_Number Street_Name
0 111 Rubin Center 111 Rubin Center
1 Monroe St NaN Monroe St
2 513 Banks St 513 Banks St
3 Banks 513 St NaN Banks 513 St
4 Rub Cent 111 NaN Rub Cent 111
let me know where this falls flat
I'm having difficulty counting records in a file that have a unique ID and listing the number of rows associated with each specific ID.
For this file, the unique ID represents a specific family (column A). Each member of the family is in a different row with the same ID. I would like to count the number of family members(rows) in each unique family. I can have a few thousand rows so automating this would be wonderful. Thanks for any help!!
You can do this now automatically with Excel 2013.
If you have that version, then select your data to create a pivot table, and when you create your table, make sure the 'Add this data to the Data Model' tickbox is checked.
Then, when your pivot table opens, create your rows, columns and values normally. Then click the field you want to calculate the distinct count of and edit the Field Value Settings.
Finally, scroll down to the very last option and choose 'Distinct Count.'
This should update your pivot table values to show the data you're looking for.
So if I'm understanding you correctly, you have something like
A B C
Fam. ID LastName FirstName
1 Smith John
1 Smith Mary
1 Smith Johnnie Jr
2 Roe Rick
3 Doe Jane
3 Doe Sam
and you want a new column (say, D), with a count of members per family:
A B C D
Fam. ID LastName FirstName Fam. Cnt
1 Smith John 3
1 Smith Mary 3
1 Smith Johnnie Jr 3
2 Roe Rick 1
3 Doe Jane 2
3 Doe Sam 2
This will do it -- insert at D2 and drag down:
=COUNTIF(A:A,A2)
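And if you also want a single cell with the number of distinct families (what the pivot table's Distinct Count gives you), a common trick is the following, assuming the IDs sit in A2:A11 with no blanks:
=SUMPRODUCT(1/COUNTIF(A2:A11,A2:A11))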