How to substring the column name in python - python-3.x

I have a column named 'comment1abc'
I am writing a piece of code where I want to check whether a column contains a certain string, 'abc':
df['col1'].str.contains('abc') == True
Now, instead of hard-coding 'abc', I want to run a substring-like operation on the column name 'comment1abc' (to be precise, the column name, not the column values) so that I can get the 'abc' part out of it. For example, the code below does a similar job:
x = 'comment1abc'
x[8:11]
But how do I implement that for a column name? I tried the code below, but it's not working:
for col in ['comment1abc']:
    df['col123'].str.contains('col.names[8:11]')
Any suggestions will be helpful.
Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'],
     'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
     'location': ['NY', 'NJ', 'PA', 'NY', None],
     'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'],
     'comment1abc': ['99', '99', '99', '99', '99'],
     'comment2abc': ['99', '99', '99', '99', '99']}
df1 = pd.DataFrame(data = f)
and sample code:
for col in ['comment1abc', 'comment2abc']:
    df1[col][df1['code'].str.contains('col.names[8:11]') == True] = '1'

I think the answer is as simple as this:
for col in ['comment1abc', 'comment2abc']:
    x = col[8:11]
    df1[col][df1['code'].str.contains(x) == True] = '1'
Trying to reference the column name from inside .str.contains() wasn't a good idea; better to slice the name into a plain string first and pass that variable.
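For reference, a minimal sketch of the same idea that slices the suffix out of each column name and uses .loc instead of chained indexing (the 8:11 slice is assumed to hold the suffix, as in the question):
import pandas as pd

f = {'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'],
     'comment1abc': ['99'] * 5,
     'comment2abc': ['99'] * 5}
df1 = pd.DataFrame(data=f)

for col in ['comment1abc', 'comment2abc']:
    suffix = col[8:11]                                 # 'abc', sliced from the column name
    mask = df1['code'].str.contains(suffix, na=False)  # na=False treats the None row as no match
    df1.loc[mask, col] = '1'                           # .loc avoids chained-assignment warnings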

Related

Change a dataframe column value based on the current value?

I have a pandas dataframe with several columns, and one of them contains string values. I need to change these strings to an acceptable value based on the current value. The dataframe is relatively large (40,000 x 32).
I've made a small function that takes the string to be changed as a parameter and then looks up what it should be changed to.
df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','Monday','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})

def lut(txt):
    my_lut = {'Script' : ['Script','Scrpt','MyScript'],
              'Weekday' : ['Sunday','Monday','Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key
    return 'Unknown'
The desired output should be:
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Weekday   Whatever
5  Unknown  Something
I can't figure out how to apply this to the dataframe.
I've struggled with this for some time now, so any input will be appreciated.
Check this out:
import pandas as pd

df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','sdfsd','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})

dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
       'Script': ['Script','Scrpt','MyScript']}

for k, v in dic.items():
    for item in v:
        df.loc[df.A == item, 'A'] = k

df.loc[~df.A.isin(dic.keys()), 'A'] = "Unknown"
Output:
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Unknown   Whatever
5  Unknown  Something
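Alternatively, since the question already defines a lookup function, a short sketch of applying it directly with Series.apply (using the lut function exactly as written in the question):
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

def lut(txt):
    # return the key whose alias list contains txt, else 'Unknown'
    my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
              'Weekday': ['Sunday', 'Monday', 'Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key
    return 'Unknown'

df['A'] = df['A'].apply(lut)   # map every value in column A through the lookup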

python3 - check element is actually in list

For example, I have an Excel header list like this:
excel_headers = [
'Name',
'Age',
'Sex',
]
and I have a dict to check against it:
headers = {'Name' : 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
I don't care whether headers has extra elements; I only care that every element of excel_headers is present in headers.
What I've tried:
lst = all(headers[idx][0] == header for idx, header in enumerate(excel_headers))
print(lst)
However, it always returns False.
Any help would be appreciated.
Another way to do it using sets would be to use set difference:
excel_headers = ['Name', 'Age', 'Sex']
headers = {'Name' : 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
diff = set(excel_headers) - set(headers)
hasAll = len(diff) == 0 # len 0 means every value in excel_headers is present in headers
print(diff) #this will give you unmatched elements
Just sort your lists; the results below show a before and after:
excel_headers = [
'Name',
'Age',
'Sex',
]
headers = ['Age', 'Name', 'Sex']

if excel_headers == headers: print("YES!")
else: print("NO!")

excel_headers.sort()
headers.sort()

if excel_headers == headers: print("YES!")
else: print("NO!")
Output:
NO!
YES!
Tip: this is a good use case for a set, since you're looking up elements by value to see if they exist. However, for small lists (<100 elements) the difference in performance isn't really noticeable, and using a list is fine.
excel_headers = ['Name', 'Age', 'Sex']
headers = {'Name' : 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
result = all(element in headers for element in excel_headers)
print(result) # --> True
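Equivalently, the same membership check can be written with sets; a small sketch (issubset accepts any iterable, and iterating a dict yields its keys):
excel_headers = ['Name', 'Age', 'Sex']
headers = {'Name': 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}

# True when every excel header appears among the dict's keys
result = set(excel_headers).issubset(headers)
print(result)  # --> True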

Trouble with a python loop

I'm having issues with a loop that I want to:
a. see if a value in a DF row is greater than a value from a list
b. if it is, concatenate the variable name and the value from the list as a string
c. if it's not, pass until the loop conditions are met.
This is what I've tried.
import pandas as pd
import numpy as np

df = {'level': ['21', '22', '23', '24', '25', '26', '27', '28', '29', '30'],
      'variable': 'age'}
df = pd.DataFrame.from_dict(df)
knots = [0, 25]
df = df.assign(key = np.nan)

for knot in knots:
    if df['key'].items == np.nan:
        if df['level'].astype('int') > knot:
            df['key'] = df['variable'] + "_" + knot.astype('str')
        else:
            pass
    else:
        pass
However, this only leaves the key column full of NaN values. I'm not sure why it isn't placing the concatenation.
You can do something like this inside the for loop; no if conditions are needed:
mask = df['level'].astype('int') > 25
df.loc[mask, 'key'] = df.loc[mask, 'variable'] + '_' + df.loc[mask, 'level']
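Putting that together with the loop from the question, a minimal sketch (assuming the intent is to tag each row with the largest knot its level exceeds, concatenating the knot value itself):
import pandas as pd
import numpy as np

df = pd.DataFrame({'level': ['21', '22', '23', '24', '25', '26', '27', '28', '29', '30'],
                   'variable': 'age'})
df['key'] = None          # start with an empty key column (object dtype so strings can be assigned)
knots = [0, 25]

for knot in knots:
    mask = df['level'].astype(int) > knot
    # later, larger knots overwrite earlier ones, so each row keeps the largest knot it exceeds
    df.loc[mask, 'key'] = df.loc[mask, 'variable'] + '_' + str(knot)

print(df)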

For every element in a list a, how to count how many times it appears in one specific column of another dataframe

For every element in the 'age' list of dict a, I need to count how many times it appears in one specific column of another dataframe in pandas.
For example, I have the dict below:
a={'age':[22,38,26],'no':[1,2,3]}
and I have another dataframe with a few columns
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
I would like to know how many times every element in dict a appears in the column 'age' in TableB. The result I expect is c={'age':[22,38,26],'count':[2,2,1]}
I have tried the apply function but it does not work; it comes back with a syntax error. I'm new to pandas, could anyone please help with that? Thank you!
def myfunction(y):
    seriesObj = TableB.apply(lambda x: True if y in list(x) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numofRows

c['age'] = a['age']
c['count'] = a['age'].apply(myfunction)
Use the value_counts method of pd.Series and to_dict of pd.DataFrame:
(pd.Series(TableB['age'])
   .value_counts()
   .loc[a['age']]
   .rename('count')
   .rename_axis('age')
   .reset_index()
   .to_dict(orient='list'))
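With the data from the question, this should return {'age': [22, 38, 26], 'count': [2, 2, 1]}, matching the expected result.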
You can use pandas.Series.value_counts() on the age column and select the results you're interested in. The following solution will also take into account possible missing values in your 'a' list.
a=[22,38,26,99]
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'John', 'Jane', 'Doe'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
tableB_df = pd.DataFrame(TableB)
counts_series = tableB_df['age'].value_counts()
counts_series_intersection = counts_series.loc[counts_series.index.intersection(a)]
counts_df = pd.DataFrame({'age': counts_series_intersection.index, 'count': counts_series_intersection.values})
Have a look at the following resources for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
You can just merge the data frames to filter out the values that don't appear in a, and then count the values.
import pandas as pd
a={'age':[22,38,26],'no':[1,2,3]}
TableB= {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'Jones', 'Davis', 'Smith'],
'age': [22,38,26,35,41,22,38],
'fare': [7.25, 71.83, 0 , 8.05,7,6.05,6],
'survived?': [False, True, True, False, True, False, True]}
df_a = pd.DataFrame(a)
df_tb = pd.DataFrame(TableB)
(pd.merge(df_tb, df_a, on='age')['age']
   .value_counts()
   .rename('count')
   .rename_axis('age')
   .reset_index()
   .to_dict(orient='list'))
{'age': [22, 38, 26], 'count': [2, 2, 1]}

turn three columns into dictionary python

Name = [list(['Amy', 'A', 'Angu']),
list(['Jon', 'Johnson']),
list(['Bob', 'Barker'])]
Other = [list(['Amy', 'Any', 'Anguish']),
list(['Jon', 'Jan']),
list(['Baker', 'barker'])]
import pandas as pd
df = pd.DataFrame({'Other': Other,
                   'ID': ['E123','E456','E789'],
                   'Other_ID': ['A123','A456','A789'],
                   'Name': Name,
                   })
ID Name Other Other_ID
0 E123 [Amy, A, Angu] [Amy, Any, Anguish] A123
1 E456 [Jon, Johnson] [Jon, Jan] A456
2 E789 [Bob, Barker] [Baker, barker] A789
I have the df as seen above. I want to turn the columns ID, Name and Other into a dictionary with the key being ID. I tried this, following python pandas dataframe columns convert to dict key and value:
todict = dict(zip(df.ID, df.Name))
Which is close to what I want
{'E123': ['Amy', 'A', 'Angu'],
'E456': ['Jon', 'Johnson'],
'E789': ['Bob', 'Barker']}
But I would like to get this output that includes values from Other column
{'E123': ['Amy', 'A', 'Angu','Amy', 'Any','Anguish'],
'E456': ['Jon', 'Johnson','Jon','Jan'],
'E789': ['Bob', 'Barker','Baker','barker']
}
And if I add the third column Other, it gives me an error:
todict = dict(zip(df.ID, df.Name, df.Other))
How do I get the output I want?
Why not just combine the Name and Other columns before creating the dict from the Name column?
df['Name'] = df['Name'] + df['Other']
dict(zip(df.ID, df.Name))
Gives
{'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
'E789': ['Bob', 'Barker', 'Baker', 'barker']}
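If you prefer not to modify the Name column in place, here is a small sketch of an alternative using a dict comprehension over the three columns (same df as defined in the question):
# concatenate the two list-valued columns per row, keyed by ID
todict = {id_: name + other
          for id_, name, other in zip(df.ID, df.Name, df.Other)}
print(todict)
# {'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
#  'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
#  'E789': ['Bob', 'Barker', 'Baker', 'barker']}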
