Trouble with a python loop - python-3.x

I'm having issues with a loop that I want to:
a. see if a value in a DF row is greater than a value from a list
b. if it is, concatenate the variable name and the value from the list as a string
c. if it's not, pass until the loop conditions are met.
This is what I've tried.
import pandas as pd
import numpy as np
df = {'level': ['21', '22', '23', '24', '25', '26', '27', '28', '29', '30'],
      'variable': 'age'}
df = pd.DataFrame.from_dict(df)
knots = [0, 25]
df.assign(key = np.nan)
for knot in knots:
    if df['key'].items == np.nan:
        if df['level'].astype('int') > knot:
            df['key'] = df['variable']+"_"+knot.astype('str')
        else:
            pass
    else:
        pass
However, this leaves the key column filled with NaN values. I'm not sure why the concatenated string isn't being assigned.

You can do something like this inside the for loop, with no need for any if conditions. Build a boolean mask for the rows above the current knot and assign only to those rows:
mask = df['level'].astype('int') > knot
df.loc[mask, 'key'] = df.loc[mask, 'variable'] + '_' + str(knot)
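For context on why the original loop never fills the column: df.assign(...) returns a new DataFrame and the result is discarded, df['key'].items is a method (so comparing it to np.nan is always False), and knot is a plain int with no .astype. A minimal end-to-end sketch of the mask-based approach follows. It assumes that a later, larger knot should overwrite the label set by an earlier one; the question does not state this explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    'level': ['21', '22', '23', '24', '25', '26', '27', '28', '29', '30'],
    'variable': 'age',
})
knots = [0, 25]

# Start with an empty object column so assigning strings needs no dtype change.
df['key'] = None

# Convert the string levels to integers once, outside the loop.
levels = df['level'].astype(int)

for knot in knots:
    mask = levels > knot  # boolean Series, one entry per row
    # Assumption: a later (larger) knot overwrites the label from an earlier one.
    df.loc[mask, 'key'] = df.loc[mask, 'variable'] + '_' + str(knot)

print(df)
```

With these knots, levels 21 to 25 end up as age_0 and levels 26 to 30 as age_25.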

Related

Change a dataframe column value based on the current value?

I have a pandas dataframe with several columns and in one of them, there are string values. I need to change these strings to an acceptable value based on the current value. The dataframe is relatively large (40.000 x 32)
I've made a small function that takes the string to be changed as a parameter and then looks up what it should be changed to.
df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','Monday','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})
def lut(txt):
    my_lut = {'Script' : ['Script','Scrpt','MyScript'],
              'Weekday' : ['Sunday','Monday','Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return(key)
            break
    return('Unknown')
The desired output should be:
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Weekday   Whatever
5  Unknown  Something
I can't figure out how to apply this to the dataframe.
I've struggled over this for some time now so any input will be appreciated
Regards,
Check this out:
import pandas as pd
df = pd.DataFrame({
    'A': ['Script','Scrpt','MyScript','Sunday','sdfsd','qwerty'],
    'B': ['Song','Blues','Rock','Classic','Whatever','Something']})
dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], 'Script': ['Script','Scrpt','MyScript']}
# Replace every known spelling with its category name.
for k, v in dic.items():
    for item in v:
        df.loc[df.A == item, 'A'] = k
# Anything that is still not a category name was not recognised.
df.loc[~df.A.isin(list(dic)), 'A'] = "Unknown"
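Output:
         A          B
0   Script       Song
1   Script      Blues
2   Script       Rock
3  Weekday    Classic
4  Unknown   Whatever
5  Unknown  Something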
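The lut function from the question can also be applied directly with df['A'] = df['A'].apply(lut). For a frame with 40,000 rows, a vectorised lookup may be preferable. The sketch below inverts the lookup table once and uses Series.map; the name reverse_lut is made up for illustration and is not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something'],
})

my_lut = {
    'Script': ['Script', 'Scrpt', 'MyScript'],
    'Weekday': ['Sunday', 'Monday', 'Tuesday'],
}

# Invert the table: every accepted spelling points at its category.
# (reverse_lut is a hypothetical helper name.)
reverse_lut = {alias: key for key, aliases in my_lut.items() for alias in aliases}

# map() yields NaN for values missing from the dict; fillna() labels those rows.
df['A'] = df['A'].map(reverse_lut).fillna('Unknown')

print(df)
```

This does one dictionary lookup per row instead of one scan of the column per accepted spelling.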

Python Pandas How to get rid of groupings with only 1 row?

In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race is not included. After grouping on a column, a group will sometimes contain only one value, so there is no other value to compute a margin against. I want to ignore those groups. Here is my current code:
import pandas as pd
data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)
def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']
winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')
avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning NaN if times does not have enough elements:
import numpy as np
def winning_margin(times):
    if len(times) <= 1:  # New code
        return np.nan    # New code (np.NaN was removed in NumPy 2.0)
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
Your code runs with this change and seems to produce sensible results. You can also drop the NaNs afterwards if you want, e.g. in this line:
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin).dropna()  # note the addition of .dropna()
You could get the winner and margin in one step:
def get_margin(x):
    if len(x) < 2:
        return np.nan
    i = x['TimeRanInSec'].idxmin()
    nl = x['TimeRanInSec'].nsmallest(2)
    margin = nl.max()-nl.min()
    return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1    [B, 2]
2    [C, 6]
3    [C, 5]
(the data has the 'First' indicator corresponding to the slower time in the data)
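Another option, closer to the question title, is to drop the single-row groups before aggregating rather than returning NaN and cleaning up afterwards. The sketch below uses GroupBy.filter for that; the variable name multi_runner and the nsmallest-based margin function are illustrative choices, not taken from the answers above.

```python
import pandas as pd

data = {
    'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
    'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
    'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
    'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75],
}
df = pd.DataFrame(data)

# Keep only the races that have at least two rows (race 4 is dropped).
multi_runner = df.groupby('RaceNumber').filter(lambda g: len(g) > 1)

def winning_margin(times):
    # Gap between the two fastest times; safe because every group has >= 2 rows.
    fastest_two = times.nsmallest(2)
    return fastest_two.iloc[1] - fastest_two.iloc[0]

winning_margins = (
    multi_runner.groupby('RaceNumber')['TimeRanInSec']
    .agg(winning_margin)
    .rename('margin')
)
print(winning_margins)
```

Filtering first means winning_margin never sees a one-element group, so it needs no special case.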

Why list remains unchanged Python

After iterating through a list to change each value to an integer, the list remains unchanged with all the values still being strings.
As a result, sorting does not get applied either
a = ['14', '22', '4', '52', '54', '59']
for ea in a:
    ea = int(ea)
a.sort()
print (a)
Output: '14', '22', '4', '52', '54', '59'
Should be : 4, 14, 22, 52, 54, 59
Your code is not changing the list itself. You are creating a new variable, converting it to an int, and throwing it away.
Use this instead
a = ['14', '22', '4', '52', '54', '59']
a = list(map(int, a)) #this converts the strings into integers and assigns the new list to a
a.sort() #this sorts it
print (a)
ea = int(ea) only rebinds the loop variable; it does not change the element within the list. Because the list itself is never modified (which you can see by printing it before sorting), the sort operation is doing its job correctly: it is sorting strings here, not integer values.
You could change your loop to provide the index and modify the original entries in the list by using the enumerate function as follows:
a = ['14', '22', '4', '52', '54', '59']
for index, ea in enumerate(a):
    a[index] = int(ea)
a.sort()
print(a)
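Two further variations, shown as a short sketch: sorted() with a generator converts and sorts in one step, and sorted() with key=int orders the strings numerically while keeping them as strings. The variable names numbers and strings_in_numeric_order are made up for this example.

```python
a = ['14', '22', '4', '52', '54', '59']

# Convert each string to an int and sort the resulting integers.
numbers = sorted(int(item) for item in a)

# Keep the elements as strings, but compare them by their integer value.
strings_in_numeric_order = sorted(a, key=int)

print(numbers)
print(strings_in_numeric_order)
```

Neither call modifies a; both return a new list.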

How to substring the column name in python

I have a column named 'comment1abc'
I am writing a piece of code where I want to check whether a column contains a certain string, 'abc':
df['col1'].str.contains('abc') == True
Now, instead of hard-coding 'abc', I want to use a substring-like operation on the column name 'comment1abc' (to be precise, the column name, not the column values) so that I can get the 'abc' part out of it. For example, the code below does a similar job:
x = 'comment1abc'
x[8:11]
But how do I implement that for a column name? I tried the code below, but it's not working.
for col in ['comment1abc']:
    df['col123'].str.contains('col.names[8:11]')
Any suggestion will be helpful.
Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'], 'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'location': ['NY', 'NJ', 'PA', 'NY', None], 'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'], 'comment1abc': ['99', '99', '99', '99', '99'], 'comment2abc': ['99', '99', '99', '99', '99']}
df1 = pd.DataFrame(data = f)
and sample code:
for col in ['comment1abc', 'comment2abc']:
    df1[col][df1['code'].str.contains('col.names[8:11]') == True] = '1'
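I think the answer would be as simple as this:
for col in ['comment1abc', 'comment2abc']:
    x = col[8:11]  # slice the column name itself, giving 'abc'
    df1.loc[df1['code'].str.contains(x, na=False), col] = '1'
Note that x is passed without quotes. Writing 'col.names[8:11]' in quotes makes .str.contains() search for that literal text rather than for the substring of the column name. Slice the name first, then pass the resulting string.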
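Here is a self-contained sketch using the sample dataframe from the question. It assumes the suffix should be matched as plain text rather than as a regular expression (hence regex=False) and that rows with a missing code should be left unchanged (hence na=False). The names suffix and mask are illustrative.

```python
import pandas as pd

f = {
    'name': ['john', 'tom', None, 'rock', 'dick'],
    'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
    'location': ['NY', 'NJ', 'PA', 'NY', None],
    'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'],
    'comment1abc': ['99', '99', '99', '99', '99'],
    'comment2abc': ['99', '99', '99', '99', '99'],
}
df1 = pd.DataFrame(data=f)

for col in ['comment1abc', 'comment2abc']:
    suffix = col[8:11]  # characters 8..10 of the column name, i.e. 'abc'
    # Assumption: match the suffix literally and treat missing codes as no match.
    mask = df1['code'].str.contains(suffix, regex=False, na=False)
    df1.loc[mask, col] = '1'

print(df1[['code', 'comment1abc', 'comment2abc']])
```

Using .loc with a boolean mask avoids the chained assignment df1[col][...] = '1', which pandas may apply to a temporary copy.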

Predicted Rank Values from Dict

Let's say I have a time dict
dict = {
    '9': {'auth': '9', 'duration': 154.92},
    '10': {'auth': '10', 'duration': 132.72},
    '4': {'auth': '4', 'duration': 144.59}
}
How can I get the "rank" of a new duration from the dict, let's say 133.92?
It should return dict index 1, since the top list is:
auth-10|duration:132.72
auth-4|duration:144.59
auth-9|duration:154.92
and 133.92 is bigger than 132.72, but less than 144.59 or 154.92.
I'm sorry if it's explained unclearly, but I tried my best.
EDIT:
I'm trying again:
I need a function that returns the "predicted/rank" for a custom duration from the sorted list (sorted by "duration/DESC").
So a duration of 160 would return the last place, which is 4 (index+1). A duration of 120 should return the first position, which is index 0 or 1st (index+1).
As @MartijnPieters mentioned, dictionaries lack indices, but they do have keys. The following function pred() (short for "predecessor") returns the key of the item whose duration is the largest duration strictly less than the passed duration (bisect_left places ties to the left). It returns None (which can be tested for) if the passed duration is no larger than the smallest duration in the dictionary:
from bisect import bisect_left
def pred(duration, entries):
    entries = list(entries.items())
    entries.sort(key = lambda x: x[1]['duration'])
    durations = [x[1]['duration'] for x in entries]
    j = bisect_left(durations,duration)
    if j == 0:
        return None
    else:
        return entries[j-1][0]
For example, if
d = {
    '9': {'auth': '9', 'duration': 154.92},
    '10': {'auth': '10', 'duration': 132.72},
    '4': {'auth': '4', 'duration': 144.59}
}
(by the way -- don't use dict as an identifier since it has a predefined meaning in Python) then:
>>> pred(133.92,d)
'10'
>>> pred(149.92,d)
'4'
>>> pred(159.92,d)
'9'
>>> pred(129.92,d)
>>>
Note that
>>> pred(129.92,d) == None
True
On Edit: Here is another variation on the same idea, one that returns an integer rank:
def rank(duration, entries):
    entries = list(entries.items())
    durations = sorted(x[1]['duration'] for x in entries)
    return bisect_left(durations,duration)
Then:
>>> rank(133.92,d)
1
>>> rank(129.92,d)
0
Final Edit: At the cost of some readability, here is a 1-liner:
def rank(duration, entries):
    return bisect_left(sorted(v['duration'] for v in entries.values()),duration)
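The edit in the question asks for a 1-based place (index+1), so that 160 gives 4 and 120 gives 1. A minimal sketch of that variant follows; the function name place is made up for this example, and it assumes a new duration equal to an existing one should share that entry's place.

```python
from bisect import bisect_left

d = {
    '9': {'auth': '9', 'duration': 154.92},
    '10': {'auth': '10', 'duration': 132.72},
    '4': {'auth': '4', 'duration': 144.59},
}

def place(duration, entries):
    """Return the 1-based position a new duration would take (hypothetical helper)."""
    durations = sorted(v['duration'] for v in entries.values())
    # bisect_left counts how many existing durations are strictly smaller.
    return bisect_left(durations, duration) + 1

print(place(120, d))     # faster than every entry
print(place(133.92, d))  # between 132.72 and 144.59
print(place(160, d))     # slower than every entry
```

Sorting on every call is fine for a handful of entries. For many lookups against the same data, the sorted list could be built once and reused.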
