Checking the range of a number and writing a value in a new column of a pandas dataframe - python-3.x

I need to iterate over column 'movies_rated', check the value against the conditions, and write a value in a newly created column 'expert_level'. When I test on a subset of data, it works. But when I run it against my whole dataset, it only gets filled with value 1.
for num in df_merge['movies_rated']:
    if num in range(20,31):
        df_merge['expert_level'] = 1
    elif num in range(31,53):
        df_merge['expert_level'] = 2
    elif num in range(53,99):
        df_merge['expert_level'] = 3
    elif num in range(99,202):
        df_merge['expert_level'] = 4
    else:
        df_merge['expert_level'] = 5
Here's a sample dataframe:
import pandas as pd

movies = [88,20,35,55,1203,99,2222,847]
name = ['angie','chris','pine','benedict','alice','spock','tony','xena']
df = pd.DataFrame(movies, name, columns=['movies_rated'])
Surely there's a less verbose way of doing this?

You could build an IntervalIndex and then apply pd.cut. I'm fairly sure this is a duplicate, but I can't find one right now that uses both closed='left' and .codes.
import numpy as np

bins = pd.IntervalIndex.from_breaks([0, 20, 31, 53, 99, 202, np.inf], closed='left')
df["expert_level"] = pd.cut(movies, bins).codes
which gives me
In [242]: bins
Out[242]:
IntervalIndex([[0.0, 20.0), [20.0, 31.0), [31.0, 53.0), [53.0, 99.0), [99.0, 202.0), [202.0, inf)],
              closed='left',
              dtype='interval[float64]')
and
In [243]: df
Out[243]:
          movies_rated  expert_level
angie               88             3
chris               20             1
pine                35             2
benedict            55             3
alice             1203             5
spock               99             4
tony              2222             5
xena               847             5
Note that I've set this up so that scores below 20 get a 0 value, so they can be distinguished from really high rankings. If you really want everything outside the bins to get 5, it'd be straightforward to remap 0 to 5, or just pass breaks of [20, 31, 53, 99, 202] and then map anything with a code of -1 (which means 'not binned') to 5.
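A quick sketch of that second option, assuming the same df and imports as above: pass only the inner breaks, then shift the 0-based codes up by one and send everything left unbinned (code -1) to 5.

import numpy as np
import pandas as pd

bins = pd.IntervalIndex.from_breaks([20, 31, 53, 99, 202], closed='left')
codes = pd.cut(df['movies_rated'], bins).cat.codes         # -1 means "not binned"
df['expert_level'] = np.where(codes == -1, 5, codes + 1)   # shift 0-based codes to 1..4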

I think np.select with the pandas Series method between is a good choice for you:
import numpy as np

conds = [df.movies_rated.between(20,30), df.movies_rated.between(31,52),
         df.movies_rated.between(53,98), df.movies_rated.between(99,202)]
choices = [1,2,3,4]
df['expert_level'] = np.select(conds, choices, 5)
>>> df
          movies_rated  expert_level
angie               88             3
chris               20             1
pine                35             2
benedict            55             3
alice             1203             5
spock               99             4
tony              2222             5
xena               847             5
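One note on the boundaries: between is inclusive on both ends by default, which is why the half-open range(20,31) from the question becomes between(20,30) here (and, by the same logic, the last condition arguably wants between(99, 201)). A quick check against the sample data:

df.movies_rated.between(20, 30)   # True for 20 and 30, False for 19 and 31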

you could do it with apply and a function:
def expert_level_check(num):
    if 20 <= num < 31:
        return 1
    elif 31 <= num < 53:
        return 2
    elif 53 <= num < 99:
        return 3
    elif 99 <= num < 202:
        return 4
    else:
        return 5

df['expert_level'] = df['movies_rated'].apply(expert_level_check)
It is slower to manually iterate over a DataFrame; I recommend reading this.

Related

I want to get/print df by range instead of head or tail

I can't find out or understand how to get the data I want by range.
I want to know how to get df['Close'] from x to y and then use .mean() to average it.
I have tried costomclose = df['Close'],range(dagartot,val)
but it gives me something else, like the head and tail of the df.
if len(df) >= 34:
    dagartot = len(df)
    valdagar = 5
    val = dagartot - valdagar
    costomclose = df['Close'], range(dagartot, val)
    print(costomclose)
edit:
<bound method NDFrame.tail of High Low ... Volume Adj Close
Date ...
2005-09-29 24.083300 23.583300 ... 74400.0 4.038682
2005-09-30 23.833300 23.500000 ... 148200.0 4.081495
2005-10-03 24.000000 23.333300 ... 27600.0 3.995869
2005-10-04 23.500000 23.416700 ... 132000.0 4.024417
2005-10-05 23.750000 23.500000 ... 15600.0 4.067230
... ... ... ... ... ...
2019-07-25 196.000000 193.050003 ... 355952.0 194.000000
2019-07-26 196.350006 194.000000 ... 320752.0 195.199997
2019-07-29 196.350006 193.550003 ... 301389.0 195.250000
2019-07-30 197.949997 194.850006 ... 233989.0 197.100006
2019-07-31 198.550003 195.600006 ... 323473.0 197.899994
[3479 rows x 6 columns]>
Here is an example of slicing out the middle of something based on the encounter index:
>>> s = pd.Series(list('abcdefghijklmnop'))
>>> s
Out[135]:
0 a
1 b
...
12 m
13 n
14 o
15 p
dtype: object
>>> s.iloc[6:9]
Out[136]:
6 g
7 h
8 i
dtype: object
This also works for DataFrames, e.g. df.iloc[0] returns the first row and df.iloc[5:8] returns those rows, end not included.
You can also slice by the actual index of the DataFrame, which is not necessarily a serially-counting sequence of integers, by using loc instead of iloc.
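For instance, a small sketch with a label index (hypothetical data, not from the question); note that loc slicing is inclusive of both endpoints, unlike iloc:

s2 = pd.Series(range(5), index=list('abcde'))
s2.loc['b':'d']   # rows labelled 'b', 'c' and 'd'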
Here is an example of slicing out the middle of a dataframe that stores the alphabet:
>>> df = pd.DataFrame([dict(num=i + 65, char=chr(i + 65)) for i in range(26)])
>>> df[(76 <= df.num) & (df.num < 81)]
num char
11 76 L
12 77 M
13 78 N
14 79 O
15 80 P
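Applied to the question, a rough sketch using the variables defined there (dagartot, valdagar, val) to average the last five closes:

dagartot = len(df)
valdagar = 5
val = dagartot - valdagar

costomclose = df['Close'].iloc[val:dagartot]   # the last valdagar rows of Close
print(costomclose)
print(costomclose.mean())                      # their average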

How to recategorize numeric values into new grouping using Pandas as a function, with no limit of conditions [duplicate]

I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient
I have a pandas dataframe:
It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, Days & Weeks. Any individual over 1 year old has an age unit of Y and my first grouping I want is <2y old so all I have to test for in Age Units is Y...
I want to create a new column AgeRange and populate with the following ranges:
<2
2 - 18
18 - 35
35 - 65
65+
so I wrote a function
def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'
I thought that if I passed in the dataframe as a whole, I would get back what I needed and could then create the column I wanted, something like this:
agedetails['age_range'] = ageRange(agedetails)
BUT when I try to run the first code to create the function I get:
File "<ipython-input-124-cf39c7ce66d9>", line 4
if complete.Age > 1 AND complete.Age < 18 return '2-18'
^
SyntaxError: invalid syntax
Clearly it is not accepting the AND, but I thought I heard in class that I could use AND like this? I must be mistaken, but then what would be the right way to do this?
After getting that error, I'm also not sure whether passing in the dataframe as a whole will throw an error as well; I'm guessing it probably will. In which case, how would I make that work?
I am looking to learn the best method, but part of the best method for me is keeping it simple, even if that means doing things in a couple of steps...
With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
Pandas: pd.cut
As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.
You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.
import numpy as np

bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']

df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)

print(df.dtypes)

# Age           int64
# Age_units    object
# AgeRange   category
# dtype: object
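Since the question notes that rows with Age_units of 'D' or 'W' are always under a year old, one possible follow-up (a sketch along the same lines, not strictly required by the answer) is to force those rows into the '<2' bucket afterwards:

# 'D' and 'W' rows are under a year old per the question, so put them in '<2'
df.loc[df['Age_units'] != 'Y', 'AgeRange'] = '<2'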
NumPy: np.digitize
np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.
Note that for boundary cases the lower bound is used for mapping to a bin.
import pandas as pd, numpy as np

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})

bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']

d = dict(enumerate(names, 1))
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))
Result

   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
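A quick check of that boundary behaviour, reusing the bins and d defined above: an age of exactly 18 lands in '18-35', not '2-18'.

print(np.digitize([17, 18], bins))                    # [2 3]
print([d[i] for i in np.digitize([17, 18], bins)])    # ['2-18', '18-35']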

how to replace a cell in a pandas dataframe

After forming the below python pandas dataframe (for example)
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
    ...:     print( i.Index, i.Name, i.Age )
    ...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem appears when I use a larger, different dataset and do the following:
First, I create a new column named NETELEMENT with a default value of "".
Then I would like to replace the default value "" with the string that the function lookup_netelement returns.
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print( i, lookup_netelement(i.PEER_SRC_IP) )
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong?
EDIT: for completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
I hope you are looking for where for conditional replacement, i.e.
def wow(x):
    return x ** 10

df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output:
Name Age new
0 Alex 10 10000000000
1 Bob 12 12
2 Clarke 13 13
3 Alex 15 576650390625
Based on your edit, you're trying to apply the function, i.e.
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: For your comment on sending two columns, use a lambda with axis=1, i.e.
def wow(x, y):
    return '{} {}'.format(x, y)

df.apply(lambda x: wow(x['Name'], x['Age']), 1)
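And to keep the result, you could assign it back to a new column (the column name combined here is just a placeholder):

# assumes the df and wow defined just above
df['combined'] = df.apply(lambda x: wow(x['Name'], x['Age']), axis=1)
print(df)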

creating lists from row data

My input data has the following format
id offset code
1 3 21
1 3 24
1 5 21
2 1 84
3 5 57
3 5 21
3 5 92
3 10 83
3 10 21
I would like the output in the following format
id offset code
1 [3,5] [[21,24],[21]]
2 [1] [[84]]
3 [5,10] [[21,57,92],[21,83]]
The code that I have been able to come up with is shown below
import random, pandas

random.seed(10000)

param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)

pd = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"] + 1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"])
})
pd["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
pd = pd.sort_values(["id", "offset", "code"]).reset_index(drop=True)

tmp1 = pd.groupby(by=["id"])["offset"].apply(lambda x: list(set(x))).reset_index()
tmp2 = pd.groupby(by=["id", "offset"])["code"].apply(lambda x: list(x)).reset_index().groupby(
    by=["id"], sort=True)["code"].apply(lambda x: list(x)).reset_index()
out = pandas.merge(tmp1, tmp2, on="id", sort=False)
It does give me the output that I want, but it is VERY slow when the dataframe is large. The dataframe that I have has over 40 million rows. In the example, uncomment the fourth param statement and you will see how slow it is.
Can you please help with making this run faster?
(df.groupby(['id', 'offset']).code.apply(list).reset_index()
   .groupby('id').agg(lambda x: x.tolist()))
Out[733]:
offset code
id
1 [3, 5] [[21, 24], [21]]
2 [1] [[84]]
3 [5, 10] [[57, 21, 92], [83, 21]]
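If you want the layout exactly as in the question, with id as an ordinary column rather than the index, a small variation on the same idea is:

out = (df.groupby(['id', 'offset']).code.apply(list)
         .reset_index()
         .groupby('id').agg(lambda x: x.tolist())
         .reset_index())
print(out)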

Missing value imputation in Python

I have two huge vectors item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clusters_to_items such that the element clusters_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I am running the following code.
# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # The interact_score is the required modified beta
    # This is used to do some work that is very fast
The problem is that this code has to run 150K times, and it is very slow. It will take 12 days to run, according to my estimate.
Edit: I believe I need a very different approach in which I can directly use item_clusters and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
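In other words, since alpha is a 0/1 indicator row for the cluster, the sum of beta * alpha is just the sum of beta over the items belonging to that cluster, which you can read straight off the sparse structure. Roughly, assuming the csr matrices described in the question:

idx = clust_to_items[clust, :].indices   # column indices of the items in this cluster
sum_row = beta[0, idx].sum()             # same value as alpha_beta.sum(1)[0, 0], without the full multiply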
