Return pieces of strings from separate pandas dataframes based on multi-conditional logic - python-3.x

I'm new to Python and trying to do some work with dataframes in pandas.
On the left side is a piece of the primary dataframe (df1), and on the right is a second one (df2). The goal is to fill in the df1['vd_type'] column with strings based on several pieces of conditional logic. I can make this work with nested np.where() functions, but as this gets deeper into the hierarchy, it gets too long to run at all, so I'm looking for a more elegant solution.
The English version of the logic is this:
For df1['vd_type']: if df1['shape'] == the first two characters in df2['vd_combo'] AND df1['vd_pct'] <= df2['combo_value'], then return the last 3 characters in df2['vd_combo'] on the line where both of these conditions are true. If no line in df2 satisfies both conditions, then return "vd4".
Thanks in advance!
EDIT #2: So I want to implement a 3rd condition based on another variable, with everything else the same, except in df1 there is another column 'log_vsc' with existing values, and the goal is to fill in an empty df1 column 'vsc_type' with one of 4 strings in the same scheme. The extra condition would be just that the 'vd_type' that we just defined would match the 'vd' column arising from the split 'vsc_combo'.
df3 = pd.DataFrame()
df3['vsc_combo'] = ['A1_vd1_vsc1','A1_vd1_vsc2','A1_vd1_vsc3','A1_vd2_vsc1','A1_vd2_vsc2' etc etc etc
df3['combo_value'] = [(number), (number), (number), (number), (number), etc etc
df3[['shape','vd','vsc']] = df3['vsc_combo'].str.split('_', expand = True)
def vsc_condition(row, df3):
    df_select = df3[(df3['shape'] == row['shape']) & (df3['vd'] == row['vd_type']) & (row['log_vsc'] <= df3['combo_value'])]
    if df_select.empty:
        return 'vsc4'
    else:
        return df_select['vsc'].iloc[0]
## apply vsc_type
df1['vsc_type'] = df1.apply( vsc_condition, args = ([df3]), axis = 1)
And this works!! Thanks again!

So your inputs look like this:
import pandas as pd
df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]} )
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
'combo_value':[0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]} )
If you are not against creating columns in df2 (you can delete them at the end if that is a problem), you can generate two columns, shape and vd, by splitting the column vd_combo:
df2[['shape','vd']] = df2['vd_combo'].str.split('_',expand=True)
Then you can create a function condition that you will use in apply such as:
def condition(row, df2):
    # row will be a row of df1 in apply
    # select only the rows of df2 that meet both conditions on shape and value
    df_select = df2[(df2['shape'] == row['shape']) & (row['vd_pct'] <= df2['combo_value'])]
    # if empty (no row meets both conditions), return 'vd4'
    if df_select.empty:
        return 'vd4'
    else:
        # otherwise return 'vd' from the first matching row; this is the one with
        # the smallest combo_value as long as df2 is sorted in ascending order
        return df_select['vd'].iloc[0]
Now you can create your column vd_type in df1 with:
df1['vd_type'] = df1.apply( condition, args =([df2]), axis=1)
df1 then looks like this:
  shape  vd_pct vd_type
0    A2    0.78     vd4
1    A1    0.33     vd1
2    B1    0.48     vd2
3    B1    0.38     vd1
4    A2    0.59     vd3
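Since the question mentions that the nested approach gets too slow, here is a minimal sketch of a vectorized alternative using pd.merge_asof instead of a row-wise apply. This is a swapped-in technique, not the method from the answer above. It assumes the rule is "take the smallest combo_value that is >= vd_pct within the same shape", which gives the same result as the apply version whenever df2 is sorted in ascending order within each shape. The names left, right and matched are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2',
                                 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
df2[['shape', 'vd']] = df2['vd_combo'].str.split('_', expand=True)

# merge_asof needs both frames sorted on the numeric key;
# reset_index() keeps the original row labels in a column named 'index'
left = df1.reset_index().sort_values('vd_pct')
right = df2[['shape', 'combo_value', 'vd']].sort_values('combo_value')

# direction='forward' picks, per shape, the first combo_value >= vd_pct
matched = pd.merge_asof(left, right,
                        left_on='vd_pct', right_on='combo_value',
                        by='shape', direction='forward')

# rows without a match get NaN, which becomes the fallback 'vd4';
# the assignment aligns on the original df1 index
df1['vd_type'] = matched.set_index('index')['vd'].fillna('vd4')
print(df1)
```

For the sample data this should reproduce the vd_type column shown above. The third condition from EDIT #2 could probably be handled the same way by passing by=['shape', 'vd'] after renaming columns to match, but I have not tried that here.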

Related

Pandas: Remove characters from cell in a FOR loop

I have a dataframe with a column labeled Amount which is a dollar amount. For some reason, some of the cells in this column are enclosed in quotation marks (ex: "$47.25").
I'm running this for loop and was wondering what is the best approach to remove the quotes.
for f in files:
    print(f)
    df = pd.read_csv(f, header = None, nrows=1)
    print(df)
    je = df.iloc[0,1]
    df2 = pd.read_csv(f,header = 6, dtype = {'Amount':float})
    df2.to_excel(w, sheet_name = je, index = False)
I have attempted to strip the " from the value using a for loop:
for cell in df2['Amount']:
    cell = cell.strip('"')
df2['Amount']=pd.to_numeric(df2['Amount'])
But I am getting:
ValueError: Unable to parse string "$-167.97" at position 0
Thank you in advance!
Given this toy dataframe:
import pandas as pd
df = pd.DataFrame(
{"Transaction": ["t1", "t2"], "Amount": ["$47.25", "'$-167.97'"]}
)
print(df)
# Outputs
  Transaction      Amount
0          t1      $47.25
1          t2  '$-167.97'
Instead of using a for loop, which should generally be avoided with dataframes, you could simply remove the quotation marks from the Amount column like this:
df["Amount"] = df["Amount"].str.replace("\'", "")
print(df)
# Outputs
  Transaction    Amount
0          t1    $47.25
1          t2  $-167.97
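Note that the ValueError in the question comes from the dollar sign as much as from the quotes, so pd.to_numeric would still fail on the output above. A minimal sketch that strips both kinds of quotation mark and the "$" in one pass before converting, assuming the column contains nothing else that is non-numeric (no thousands separators, for example):

```python
import pandas as pd

df = pd.DataFrame(
    {"Transaction": ["t1", "t2"], "Amount": ["$47.25", "'$-167.97'"]}
)

# the character class matches a double quote, a single quote or a dollar sign
cleaned = df["Amount"].str.replace("[\"'$]", "", regex=True)

# the remaining strings are plain numbers and can be converted to floats
df["Amount"] = pd.to_numeric(cleaned)
print(df.dtypes)
```

In the original loop this would mean reading Amount as a string first (dropping the dtype = {'Amount':float} argument) and converting afterwards.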

How can I compare two unsorted data frames and report the differences

I would like to compare two unsorted dataframes and report the differences by column and row ID coordinates and values.
The code first compares the CSVs; if they are not equal, it compares them again after a merge, and if they are still not equal, I know that the data differs in some way.
At this point I'm not sure how to identify the column and row coordinates, along with the value in each dataframe, where the data differs.
Here are the dataframes:
DATAFRAME 1 - EXP:
DATAFRAME 2 - ACT:
Here is my current code:
import pandas as pd
file1 = "c:\\exp.csv"
file2 = "c:\\act.csv"
exp = pd.read_csv(file1, encoding="ANSI")
act = pd.read_csv(file2, encoding="ANSI")
exp = exp.drop_duplicates(subset=None, keep='first', inplace=False)
act = act.drop_duplicates(subset=None, keep='first', inplace=False)
result = exp.equals(act)
if result:
    print("CSV's Match")
else:
    act = act.set_index('accounts')
    dataMerged = exp.merge(act, how='left')
    dataMergedAndSorted = dataMerged.sort_values(['accounts']).set_index('account numbers')
    actSorted = act.sort_values(['accounts'])
    if dataMergedAndSorted.equals(actSorted):
        print("The Merged, Sorted, and Compared Data Now Returns True : PASS")
    else:
        dataMergedAndSorted = dataMergedAndSorted.reset_index()
        actSorted = actSorted.reset_index()
Known Differences by Observation of the Data Frames & How to Report it:
Exp: Col=1,Row=4,Val=9101, Act: Col=1,Row=3,Val=FOO
Exp: Col=3,Row=6,Val=BAR, Act: Col=3,Row=5,Val=malfoy
The easiest way I have found to identify the differences between two unordered dataframes is as follows:
First, sort your DataFrames df1 and df2 by all columns and reset their indexes. Then you can use DataFrame.compare (available since pandas 1.1) to get the exact columns and rows that contain differences between them:
import pandas as pd
df1_comp = df1.sort_values(by = ['column_1', 'column_2', 'column_3']).reset_index(drop = True)
df2_comp = df2.sort_values(by = ['column_1', 'column_2', 'column_3']).reset_index(drop = True)
df_diff = df1_comp.compare(df2_comp)
This will return a dataframe (df_diff) containing only the columns and rows with differences between df1 and df2.
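As a small self-contained illustration of the sort-then-compare idea, here is a sketch with made-up data (the column names account and owner and the values are hypothetical, not taken from the question). As far as I know, compare expects both frames to have the same row and column labels, which is why the index is reset after sorting.

```python
import pandas as pd

# same rows in a different order, with one differing cell
exp = pd.DataFrame({"account": [3, 1, 2], "owner": ["c", "a", "b"]})
act = pd.DataFrame({"account": [1, 2, 3], "owner": ["a", "FOO", "c"]})

# sort both frames by every column so that matching rows line up
cols = list(exp.columns)
exp_sorted = exp.sort_values(by=cols).reset_index(drop=True)
act_sorted = act.sort_values(by=cols).reset_index(drop=True)

# 'self' holds the value from exp_sorted, 'other' the value from act_sorted
diff = exp_sorted.compare(act_sorted)
print(diff)
```

The row labels of diff are positions in the sorted frames, not in the original files. Also keep in mind that if the differing value sits in a column that drives the sort order, rows may no longer line up and more differences will be reported than actually exist.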

Apply style to a DataFrame using index/column from a list of tuples in Python/Pandas

I have a list of tuples that represent DataFrame index row number and a column name, in a form:
[(12, 'col3'), (16, 'col7'), ...].
I need to be able to find rows/column values that correspond to those tuple values in another dataframe and mark them red for example. Usually I use
df.style.apply(...)
from here: https://pandas.pydata.org/pandas-docs/stable/style.html and it works but in this case I am not sure how to map those tuple values with a dataframe in a function. Any help is much appreciated.
You can use a custom function that uses at to set the style values for the cells listed in tups:
tups = [(12, 'col3'), (16, 'col7'), ...]
def highlight(x):
    r = 'background-color: red'
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # rewrite values by selecting by tuples
    for i, c in tups:
        df1.at[i, c] = r
    return df1
df.style.apply(highlight, axis=None)
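If the tuples come from a different dataframe, some of them may point at labels that do not exist in the frame being styled. Here is a sketch of the same function with a guard for that case, using a small made-up frame (the data and the extra tuple (99, "col3") are hypothetical). The Styler call itself is left as a comment because it needs jinja2 installed.

```python
import pandas as pd

df = pd.DataFrame({"col3": [1, 2, 3], "col7": [4, 5, 6]}, index=[12, 16, 20])
tups = [(12, "col3"), (16, "col7"), (99, "col3")]  # (99, "col3") is not in df

def highlight(x, cells=tups):
    # frame of empty CSS strings with the same labels as the styled data
    styles = pd.DataFrame("", index=x.index, columns=x.columns)
    for i, c in cells:
        # skip tuples whose row or column label is missing from this frame
        if i in styles.index and c in styles.columns:
            styles.at[i, c] = "background-color: red"
    return styles

styles = highlight(df)
print(styles)
# to render: df.style.apply(highlight, axis=None)
```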

How to merge two dataframes and return data from another column in new column only if there is match?

I have two dataframes that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and only return the data from value column in a new column if there is a match?
new_merged_df
id  value  new_value
1
2   a      a
3   b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
  value new_value
1   NaN       NaN
2     a         a
3     b       NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
new_merged_df_outer = df1.merge(df2,how='outer',left_index=True,right_index=True)
new_merged_df_inner = df1.merge(df2,how='inner',left_index=True,right_index=True)
# rename returns a new DataFrame, so the result has to be assigned back
new_merged_df_inner = new_merged_df_inner.rename(columns={'value':'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner,how='left',left_index=True,right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join.
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
// assumes a SparkSession named spark; its implicits provide the $"..." syntax and encoders
import spark.implicits._
val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")
joined map {
  case (left, null) if left != null => MyClass3(left.id)
  case (null, right) if right != null => MyClass3(right.id, Some(right.value))
  case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter, which the documentation describes as follows:
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
This can be used to check if there is a match
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
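The same indicator approach can also be sketched with id as a regular column, which is closer to the frames shown in the question. This assumes both frames carry the key in a column named id, and it uses Series.where instead of the .loc assignment:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2]})
df2 = pd.DataFrame({"id": [2, 3], "value": ["a", "b"]})

# the outer merge keeps every id; _merge records where each row came from
merged = df1.merge(df2, on="id", how="outer", indicator=True)

# keep value only where the id exists in both frames, NaN elsewhere
merged["new_value"] = merged["value"].where(merged["_merge"] == "both")
merged = merged.drop(columns="_merge")
print(merged)
```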
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
   id value new_value
0   1   NaN       NaN
1   2     a         a
2   3     b       NaN

Python - unable to count occurences of values in defined ranges in dataframe

I'm trying to write code that analyses the values in a dataframe: if a value falls in a class, it is counted towards the total for that class, which is stored under a key in a dictionary. But the code is not working for me. I'm trying to create logarithmic classes and count the total number of values that fall in each one.
def bins(df):
    """Returns new df with values assigned to bins"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    for i in df:
        if 100<i and i<=1000:
            bins_dict[500]+=1,
        elif 1000<i and i<=10000:
            bins_dict[5000]+=1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):
    for i in df:
        for j in range:
            b=10**j
            while j==1:
                while i>100:
                    if i>=b:
                        j+=1,
                    elif i<b:
                        b = b/2,
                    print (i = b*(int(i/b)))
This code is returning the original dataframe.
My dataframe consists of only one column with values ranging between 100 and 10000000
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If I can get a dataframe output directly, that would be helpful too.
PS. I am very new to programming and I only know Python.
You can use pandas' cut function for this:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
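As for why the original loop does nothing useful: iterating with "for i in df" walks over the column labels of a DataFrame rather than its values, and the trailing comma in "+=1," turns the right-hand side into a tuple. If you want the dictionary with integer keys exactly as in the expected output, a variation of the code above could look like this. It is only a sketch: the bin edges start at 100 because the question says the values start there, and the names edges, labels and result are mine.

```python
import pandas as pd

df = pd.DataFrame({"Area": [1815, 907, 1815, 907, 907]})

# logarithmic class edges and the integer label for each class
edges = [100, 1000, 10000, 100000]
labels = [500, 5000, 50000]

# each value is assigned to the interval (lower, upper] it falls into
binned = pd.cut(df["Area"], bins=edges, labels=labels)

# sort=False keeps the classes in label order; empty classes are counted as 0
counts = binned.value_counts(sort=False)
result = {int(k): int(v) for k, v in counts.items()}
print(result)
```

If a dataframe is preferred over a dictionary, counts is already a Series, and counts.reset_index() should turn it into a two-column frame.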
