Extracting the data between parentheses and putting the resulting value in another column - python-3.x

I would like to extract the data between parentheses from the dataframe below and put the resulting value in a new column. If there are no parentheses in the column data, the new value can be left empty.
Data
0 The city is far (RANDOM)
1 Omega Fatty Acid is good for health
2 Name of the fruit is (MANGO)
3 The producer had given man good films (GOOD)
4 This summer has a very good (Offer)

We can use str.extract with a regex capture group that matches everything between the parentheses (a raw string keeps the backslashes intact):
df['Newcol'] = df['Data'].str.extract(r'\((.*)\)')
Data Newcol
0 The city is far (RANDOM) RANDOM
1 Omega Fatty Acid is good for health NaN
2 Name of the fruit is (MANGO) MANGO
3 The producer had given man good films (GOOD) GOOD
4 This summer has a very good (Offer) Offer
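As a self-contained sketch of the approach above, the sample data can be rebuilt and run end to end. The pattern here is a slight variation, not the answer's exact regex: `[^)]*` stops at the first closing parenthesis, and `expand=False` returns a Series. The column name `Newcol_filled` is only an illustrative assumption, showing one way to get empty strings instead of NaN, as the question asks.

```python
import pandas as pd

df = pd.DataFrame({
    "Data": [
        "The city is far (RANDOM)",
        "Omega Fatty Acid is good for health",
        "Name of the fruit is (MANGO)",
        "The producer had given man good films (GOOD)",
        "This summer has a very good (Offer)",
    ]
})

# Capture the text inside the first pair of parentheses;
# rows without parentheses become NaN.
df["Newcol"] = df["Data"].str.extract(r"\(([^)]*)\)", expand=False)

# Hypothetical extra column: replace NaN with an empty string.
df["Newcol_filled"] = df["Newcol"].fillna("")

print(df)
```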

Related

Replace values in a column based on another dataframe

I have a table:
Name Profession Character
Ben cinematographer Nan
Scarlett actress Black Widow
Robert actor Iron Man
Chris actor Thor
Kevin producer Nan
I created a new data frame from the table above, with a column of the unique Profession values sorted in ascending order and an incremental ID column:
ID Job
1 actor
2 actress
3 cinematographer
4 producer
Now I need to replace the values in the Profession column of the original table with their corresponding ID from the new table.
Desired Output
Name Profession Character
Ben 3 Nan
Scarlett 2 Black Widow
Robert 1 Iron Man
Chris 1 Thor
Kevin 4 Nan
Code so far:
import pandas as pd
df = pd.read_csv(filename)
column = df['Profession'].unique()
new_df = pd.DataFrame(column, columns=['Job'])
new_df = new_df.sort_values(['Job'])
new_df = new_df.reset_index()
new_df.columns.values[0] = 'ID'
new_df['ID'] = new_df.index + 1
df.loc[df['Profession'] == new_df['Job'], 'Profession'] = new_df['ID']
The last line yields 'ValueError: Can only compare identically-labeled Series objects'
Try replace instead, where df1 is the original table and df2 is the ID/Job lookup table:
df1.Profession = df1.Profession.replace(df2.set_index('Job').ID)
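A minimal runnable sketch of the same lookup idea follows, rebuilt from the tables in the question. It uses `Series.map` rather than `replace`; this is a swapped-in alternative, not the answer's exact call, and the intermediate names `jobs` and `lookup` are assumptions made for illustration.

```python
import pandas as pd

# Original table from the question (missing characters stored as None).
df1 = pd.DataFrame({
    "Name": ["Ben", "Scarlett", "Robert", "Chris", "Kevin"],
    "Profession": ["cinematographer", "actress", "actor", "actor", "producer"],
    "Character": [None, "Black Widow", "Iron Man", "Thor", None],
})

# Lookup table: unique professions sorted ascending, numbered from 1.
jobs = sorted(df1["Profession"].unique())
df2 = pd.DataFrame({"ID": range(1, len(jobs) + 1), "Job": jobs})

# Series indexed by Job with ID as values, used as a mapping.
lookup = df2.set_index("Job")["ID"]
df1["Profession"] = df1["Profession"].map(lookup)

print(df1)
```

Unlike `replace`, `map` turns any value missing from the lookup into NaN, which can make unmatched professions easier to spot.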

MultiIndexing based on row values

Trying to create a simple program that finds negative values in a pandas dataframe and combines them with their matching row. Basically I have data that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
...
The idea is that I need to match up fills and refunds, then combine them into a single row. So, in the example above we'd have one row that looks like this:
LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 5 1.74
Also, if a refund doesn't match to a fill row, I'm supposed to just delete it, which I've been accomplishing with .drop()
I'm imagining I can build a multiindex for this somehow, where each row that's a negative is marked as a refund and each fill row is marked as a fill. Then I just have some kind of for loop that goes through the list and attempts to match a certain number of times based on name/number of refunds.
Here's what I was trying:
pbm_negative_index = raw_pbm_data.loc['LastName','DrugName','RXNumber','ClientTotalCost']
names = pbm_negative_index = raw_pbm_data.loc[: , 'LastName']
unique_names = unique(pbm_negative_index)
for n in unique_names:
    edf["Refund"] = edf["ClientTotalCost"].shift(1, fill_value=edf["ClientTotalCost"].head(1)) < 0
This obviously doesn't work and I'd like to use the indexing tools in Pandas to achieve a similar result.
Your specification reduces to two simple steps:
aggregate +ve & -ve matching rows
drop remaining -ve rows after aggregation
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
ADAMS2 Drug. 100001 -5 -1.95
"""), sep=r"\s+")
# aggregate: sum fills and refunds that share the same name, drug and Rx number
dfa = df.groupby(["LastName","DrugName","RxNumber"],as_index=False).agg({"Amount":"sum","ClientTotalCost":"sum"})
# drop remaining -ve amounts (refunds with no matching fill)
dfa = dfa.drop(dfa.loc[dfa.Amount.lt(0)].index)
LastName DrugName RxNumber Amount ClientTotalCost
0 ADAMS Drug 100001 5 1.74
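If unmatched refunds should be discarded before aggregating, rather than by dropping negative totals afterwards, one possible variant is to flag each row as a fill and keep only the groups that contain at least one fill. This is a sketch under that assumed interpretation of the requirement; the `BAKER` row and the names `keys`, `is_fill`, `has_fill` and `combined` are hypothetical.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""LastName DrugName RxNumber Amount ClientTotalCost
ADAMS Drug 100001 30 10.69
ADAMS Drug 100001 -25 -8.95
BAKER Drug 100002 -5 -1.95
"""), sep=r"\s+")

keys = ["LastName", "DrugName", "RxNumber"]

# Mark fills (positive amounts), then flag every row whose key group has a fill.
is_fill = df["Amount"].gt(0)
has_fill = is_fill.groupby([df[k] for k in keys]).transform("any")

# Keep only rows belonging to a group with a fill, then net fills against refunds.
combined = (
    df[has_fill]
    .groupby(keys, as_index=False)[["Amount", "ClientTotalCost"]]
    .sum()
)

print(combined)
```

With this variant, a group whose refunds exceed its fills is kept with a negative total instead of being dropped, so which version fits depends on how such cases should be handled.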

How to find row with specific column value only

I am trying to find the names that have only one specific column value and nothing else.
I have tried filtering the rows by the column value, but that isn't what I want: I want the names who only went to eat pizza.
So my code should return john only and not peter, since john only had pizza.
Your description is not clear. At first it looks like a simple .loc would be enough, but the sample data in your picture shows it is not that simple. You need to identify the names that have only one distinct Restaurant value, whether or not their rows are repeated. To do this, use nunique per name, check the result with eq(1), and assign it to a mask m. Finally, index the dataframe with m to get your desired output:
Your sample data:
In [512]: df
Out[512]:
Name Restaurant
0 john pizza
1 peter kfc
2 john pizza
3 peter pizza
4 peter kfc
5 peter pizza
6 john pizza
m = df.groupby('Name').Restaurant.transform('nunique').eq(1)
df[m]
Out[513]:
Name Restaurant
0 john pizza
2 john pizza
6 john pizza
If you want to show only one row per name, chain an additional .drop_duplicates():
df[m].drop_duplicates()
Out[515]:
Name Restaurant
0 john pizza
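The mask above keeps names with exactly one distinct Restaurant value, whatever that value is. A small sketch that also requires the value to be pizza, which is what the question asks for, could look like this; the names `only_one`, `is_pizza` and `pizza_only_names` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["john", "peter", "john", "peter", "peter", "peter", "john"],
    "Restaurant": ["pizza", "kfc", "pizza", "pizza", "kfc", "pizza", "pizza"],
})

# True for rows whose name has exactly one distinct restaurant.
only_one = df.groupby("Name")["Restaurant"].transform("nunique").eq(1)

# True for rows where the restaurant is pizza.
is_pizza = df["Restaurant"].eq("pizza")

# Names that visited pizza and nothing else.
pizza_only_names = df.loc[only_one & is_pizza, "Name"].unique().tolist()

print(pizza_only_names)
```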

Operation over several columns

I am wondering whether I can write a formula that operates over several columns. For example, I want to calculate the number of males in the school, and I have a table:
A B C
Class Sex Number
1 male 3
2 male 4
1 female 6
2 female 5
Right now I have to break the operations into parts:
=(B2="Male")*C2 - additional column and then
=SUMME(D2:D5)
I want to do it in one step. It seems like trivial functionality, but I cannot figure out how to do it in one formula.

Excel Vlookup Multiple Values

I am looking for a VLOOKUP formula that returns multiple matches using two lookup values. I am currently trying to use the concatenate method, but I haven't quite figured it out. The table needs to return all of the multiple matches, not just one. Currently, it's only returning the last match.
For example, let's say I have a list of multiple cities and states. The cities differ, but the states can repeat. I want to return the number of people in each city.
City State #OfPeople
Albany NY 10
Orlando FL 5
Tampa FL 3
Seattle WA 1
Queens NY 8
So I concatenated the city and state column.
Join City State #OfPeople
Albany-NY Albany NY 10
Orlando-FL Orlando FL 5
Tampa-FL Tampa FL 3
Seattle-WA Seattle WA 1
Queens-NY Queens NY 8
The purpose of this is to create an updated log of the people in each city as time progresses. I want to have a running grand total of people in a separate column. (I know this requires another formula. I'm just focused on returning multiple matches for now.) However, I don't want to overwrite the existing data. Hopefully, I explained this well. This is just an example of a larger project I'm working on. I need to be able to build on this list. That's why it's important that I be able to return matches multiple times.
Join City State #OfPeople Total
Albany-NY Albany NY 10 10
Orlando-FL Orlando FL 5 15
Tampa-FL Tampa FL 3 18
Seattle-WA Seattle WA 1 19
Queens-NY Queens NY 8 27
Any help would be greatly appreciated!
Considering you're trying to get grand totals based on multiple criteria, I would suggest using the SUMIFS() / COUNTIFS() functions rather than focusing on finding the matching row itself.
However, if you need a multi-criteria lookup for some reason, I believe the INDEX() + MATCH() combination can do the job.
The table needs to return all of the multiple matches not just one.
Currently, it's only returning the last match
You'll need to use SUMIFS() if there are multiple records for the same city/state combo in your people lookup.
=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...)
Let's assume that you have a cities tab and a people tab. Let's assume you have ten cities that you want to return the total amount of people from.
Cities Tab definition
City range: 'Cities'!A$1:A$10
State range: 'Cities'!B$1:B$10
People Tab definition
City range: 'People'!A$1:A$100
State range: 'People'!B$1:B$100
#OfPeople range: 'People'!C$1:C$100
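Drop this formula in the first row of your cities tab, then drag it down the entire range of cities. Each criteria range comes before its criterion, and the row references to the Cities tab are relative so that they move as you drag:
=SUMIFS('People'!C$1:C$100, 'People'!A$1:A$100, 'Cities'!A1, 'People'!B$1:B$100, 'Cities'!B1)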
