Pandas: MID & FIND Function - python-3.x

I have a column in my dataframe that shows different combinations of the values below. I know that I could use .str[:3] slicing and then convert the result to a number, but the differing string lengths are throwing me off. How would I do a MID(x,FIND(",",x,1)+1,10)-esque operation on this column to find the sentiment and subjectivity values?
String samples:
df['Output'] =
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.03958333333333333, subjectivity=0.5020833333333334)
Sentiment(polarity=0.16472802559759075, subjectivity=0.4024750611707134)
Error:
def senti(x):
    return TextBlob(x).sentiment
df['Output'] = df['stop'].apply(senti)
df.Output.str.split(',|=',expand=True).iloc[:,[1,3]]
IndexError: positional indexers are out-of-bounds
Outputs:
0 (0.0, 0.0)
1 (0.0028273809523809493, 0.48586309523809534)
2 (0.153726035868893, 0.5354359925788496)
3 (0.04357142857142857, 0.5319047619047619)
4 (0.07575757575757575, 0.28446969696969693)
...
92 (0.225, 0.39642857142857146)
93 (0.0, 0.0)
94 (0.5428571428571429, 0.6428571428571428)
95 (0.14393939393939395, 0.39999999999999997)
96 (0.35833333333333334, 0.5777777777777778)
Name: Output, Length: 97, dtype: object

df[['polarity', 'subjectivity']] = df.Output.str.split(r',|=|\)', expand=True).iloc[:, [1, 3]]
Result:
Output polarity subjectivity
0 Sentiment(polarity=0.0, subjectivity=0.0) 0.0 0.0
1 Sentiment(polarity=-0.03958333333333333, subje... -0.03958333333333333 0.5020833333333334
2 Sentiment(polarity=0.16472802559759075, subjec... 0.16472802559759075 0.4024750611707134

Try:
df['polarity']=df['Output'].str.extract(r"polarity=([-\.\d]+)")
df['subjectivity']=df['Output'].str.extract(r"subjectivity=([-\.\d]+)")
Outputs:
>>> df.iloc[:, -2:]
polarity subjectivity
0 0.0 0.0
1 -0.03958333333333333 0.5020833333333334
2 0.16472802559759075 0.4024750611707134
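The extracted columns above still hold strings. A minimal sketch of the same regex idea that also converts to floats is shown here; the named groups, the exponent handling and the astype(str) call (in case the column holds Sentiment objects rather than text) are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Toy frame reusing the sample strings from the question.
df = pd.DataFrame({"Output": [
    "Sentiment(polarity=0.0, subjectivity=0.0)",
    "Sentiment(polarity=-0.03958333333333333, subjectivity=0.5020833333333334)",
]})

# One optional sign, digits and dots, and an optional exponent such as e-05.
number = r"-?[\d.]+(?:e[-+]?\d+)?"
pattern = rf"polarity=(?P<polarity>{number}), subjectivity=(?P<subjectivity>{number})"

# astype(str) is a no-op for text and stringifies namedtuple-like objects.
parsed = df["Output"].astype(str).str.extract(pattern).astype(float)
df = df.join(parsed)
print(df[["polarity", "subjectivity"]])
```

Named groups become the column names of the extracted frame, so a single extract call fills both columns.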

Related

How to get function output to add columns to my Dataframe

I have a function that produces an output like so when I pass it a name:
W2V('aamir')
array([ 0.12135 , -0.99132 , 0.32347 , 0.31334 , 0.97446 , -0.67629 ,
0.88606 , -0.11043 , 0.79434 , 1.4788 , 0.53169 , 0.95331 ,
-1.1883 , 0.82438 , -0.027177, 0.70081 , 0.87467 , -0.095825,
-0.5937 , 1.4262 , 0.2187 , 1.1763 , 1.6294 , 0.91717 ,
-0.086697, 0.16529 , 0.19095 , -0.39362 , -0.40367 , 0.83966 ,
-0.25251 , 0.46286 , 0.82748 , 0.93061 , 1.136 , 0.85616 ,
0.34705 , 0.65946 , -0.7143 , 0.26379 , 0.64717 , 1.5633 ,
-0.81238 , -0.44516 , -0.2979 , 0.52601 , -0.41725 , 0.086686,
0.68263 , -0.15688 ], dtype=float32)
I have a data frame that has an index Name and a single column Y:
df1
Y
Name
aamir 0
aaron 0
... ...
zulema 1
zuzana 1
I wish to run my function on each value of Name and have it create columns like so:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
Name
aamir 0.12135 -0.99132 0.32347 0.31334 0.97446 -0.67629 0.88606 -0.11043 0.794340 1.47880 ... 0.647170 1.56330 -0.81238 -0.445160 -0.29790 0.52601 -0.41725 0.086686 0.68263 -0.15688
aaron -1.01850 0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250 -0.17397 -0.061715 0.55292 ... -0.144960 0.82696 -0.51106 -0.072066 0.43069 0.32686 -0.00886 -0.850310 -1.31530 0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema 0.56547 0.30961 0.48725 1.41000 -0.76790 0.39908 0.86915 0.68361 -0.019467 0.55199 ... 0.062091 0.62614 0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900 0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853 0.77911 0.44446 0.95019 0.40513 0.26643 0.075182 -1.34340 ... 1.102800 0.51495 1.06230 -1.587600 -0.44667 1.04600 -0.38978 0.741240 0.39457 0.22857
What I have done is real messy, but works:
names = df1.index.to_list()
Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)
I am hoping there is a way to use .apply() or similar but I have not found how to do this. I am looking for an efficient way.
Update:
I modified my function to do like so:
if isinstance(w, pd.core.series.Series):
    w = w.to_string()
Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply the numbers are totally different:
df1
Name Y
0 aamir 0
1 aaron 0
... ... ...
7942 zulema 1
7943 zuzana 1
df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.075014 0.824769 0.580976 0.493415 0.409894 0.142214 0.202602 -0.599501 -0.213184 -0.142188 ... 0.627784 0.136511 -0.162938 0.095707 -0.257638 0.396822 0.208624 -0.454204 0.153140 0.803400
1 0.073664 0.868665 0.574581 0.538951 0.394502 0.134773 0.233070 -0.639365 -0.194892 -0.110557 ... 0.722513 0.147112 -0.239356 -0.046832 -0.237434 0.321494 0.206583 -0.454038 0.251605 0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942 -0.002117 0.894570 0.834724 0.602266 0.327858 -0.003092 0.197389 -0.675813 -0.311369 -0.174356 ... 0.690172 -0.085517 -0.000235 -0.214937 -0.290900 0.361734 0.290184 -0.497177 0.285071 0.711388
7943 -0.047621 0.850352 0.729225 0.515870 0.439999 0.060711 0.226026 -0.604846 -0.344891 -0.128396 ... 0.557035 -0.048322 -0.070075 -0.265775 -0.330709 0.281492 0.304157 -0.552191 0.281502 0.750304
7944 rows × 50 columns
You can see that the first row is aamir and the first value (column 0) my function returns is 0.1213 (You can see this at the top of my post). Yet with apply that appears to be 0.075014
EDIT:
It appears it passes in Name aamir rather than aamir. How can I get it to just send the Name itself aamir?
Let's say we have some function which transforms a string into a vector of a fixed size, for example:
import numpy as np
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
Also a data frame is given with a meaningful index and junk data:
import pandas as pd
names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)
When we apply a function to a DataFrame with axis=1, the function is called once per row, and its argument is that row as a Series whose name attribute is the row's index label. So we could do something like this:
output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')
With result_type='expand', returned vectors will be transformed into columns, which is the required output.
P.S. As an option:
df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')
P.P.S. IMO The behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In case of the code:
df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
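To make the difference between the two calls concrete, here is a small sketch. toy_vector is a hypothetical, deterministic stand-in for W2V and the two-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def toy_vector(name: str) -> np.ndarray:
    """Hypothetical stand-in for W2V: two numbers derived from the name."""
    return np.array([len(name), ord(name[0])], dtype=float)

df = pd.DataFrame({"Y": [0, 1]}, index=pd.Index(["aamir", "zulema"], name="Name"))

# With axis=1 the callable receives each row as a Series, not a bare string.
arg_types = df.reset_index().drop("Y", axis=1).apply(
    lambda arg: type(arg).__name__, axis=1
)
print(arg_types.tolist())

# Passing the index label (row.name) hands the function a plain str instead.
expanded = df.apply(lambda row: toy_vector(row.name), axis=1, result_type="expand")
print(expanded)
```

If the names live in a regular column rather than the index, row["Name"] plays the same role as row.name.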
I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.
This first section generates toy versions of your W2V function and your dataframe.
from random import random
import pandas as pd
# substitute your W2V function for this:
n = 5
def W2V(name: str):
    return [random() for i in range(n)]
# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name':['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))
df1 is
Name Y
0 aamir 0
1 aaron 0
2 zulema 1
3 zuzana 1
You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.
df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
index=[i for i in range(n)]),
axis=1)
My random-valued df2 is
0 1 2 3 4
0 0.242761 0.415253 0.940213 0.074455 0.444372
1 0.935781 0.968155 0.850091 0.064548 0.737655
2 0.204053 0.845252 0.967767 0.352254 0.028609
3 0.853164 0.698195 0.292238 0.982009 0.402736
Then concatenate the new dataframe to the old one:
df3 = pd.concat([df1, df2], axis=1)
df3 is
Name Y 0 1 2 3 4
0 aamir 0 0.242761 0.415253 0.940213 0.074455 0.444372
1 aaron 0 0.935781 0.968155 0.850091 0.064548 0.737655
2 zulema 1 0.204053 0.845252 0.967767 0.352254 0.028609
3 zuzana 1 0.853164 0.698195 0.292238 0.982009 0.402736
Alternatively, you could do both steps in one line as:
df1 = pd.concat([df1,
df1.apply(lambda x: pd.Series(W2V(x['Name']),
index=[i for i in range(n)]),
axis=1)],
axis=1)
You can try something like this using map and np.vstack with a dataframe constructor then join:
df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))
Output:
Y 0 1 2 3 4 5 6 7 8 9
A 0 4 0 2 1 0 0 0 0 3 3
B 1 4 0 0 4 4 3 4 3 4 3
C 2 1 5 5 5 3 3 1 3 5 0
D 3 3 5 1 3 4 2 3 1 0 1
E 4 4 0 2 4 4 0 3 3 4 2
F 5 4 3 5 1 0 2 3 2 5 2
G 6 4 5 2 0 0 2 4 3 4 3
H 7 0 2 5 2 3 4 3 5 3 1
I 8 2 2 0 1 4 2 4 1 0 4
J 9 0 2 3 5 0 3 0 2 4 0
Using @Vitalizzare's function:
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
df = pd.DataFrame({'Y': np.arange(10)}, index = [*'ABCDEFGHIJ'])
I am going off the names being the axis, and there being a useless column called 0. I think this may be the solution, no way to know without your function or the names
df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')
I would do simply:
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
Example
To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random, and then draw a random vector size=50:
import numpy as np
import pandas as pd
from contextlib import contextmanager
@contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)
mask = (1 << 32) - 1
def w2v(name, size=50):
    fingerprint = int(pd.util.hash_array(np.array([name]))[0])
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)
For instance:
>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
0.31315156, 0.22802972, -0.96076167, 0.62577993, -0.59024811,
0.76365736, 0.93033898, -0.56155296, 0.4760905 , -0.92760642,
0.00177959, -0.22761559, 0.81929959, 0.21138229, -0.49882747,
-0.97637984, -0.19452496, -0.91354933, 0.70473533, -0.30394358,
-0.47092087, -0.0329302 , -0.93178517, 0.79118799, 0.98286834,
-0.16024194, -0.02793147, -0.52251214, -0.70732759, 0.10098142,
-0.24880249, 0.28930319, -0.53444863, 0.37887522, 0.58544068,
0.85804119, 0.67048213, 0.58389158, -0.19889071, -0.04281131,
-0.62506126, 0.42872395, -0.12821543, -0.52458052, -0.35493892])
Now, we use the expression given as solution:
df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
>>> newdf
0 1 2 3 4 5 6 ...
aamir 0.654469 -0.927651 -0.781886 -0.626838 -0.239468 0.313152 0.228030 ...
aaron -0.380524 -0.850608 -0.914642 -0.578885 0.177975 -0.633761 -0.736234 ...
zulema -0.250957 0.882491 -0.197833 -0.707652 0.754575 0.731236 -0.770831 ...
zuzana -0.641296 0.065898 0.466784 0.652776 0.391865 0.918761 0.022798 ...

Statistical Analysis with mixed numbers

I am working on a survey for a small project that requires users to respond to some questions by selecting from a set of radio values, such as Strongly Disagree, Disagree, Neutral, Agree and Strongly Agree. For these selections, the radio values are -1, -2, 0, 1, and 2, respectively. Finally, I need to perform some type of analysis on the data.
First, I used Python to try to normalize the values with a logarithm (the run shown below calls np.log1p):
import numpy as np
feel = [-1,-2,0,1,2]
for i in feel:
    print( np.log1p(i))
The results are not favorable:
-inf
nan
0.0
0.6931471805599453
1.0986122886681098
<ipython-input-35-830bb9e2f96e>:3: RuntimeWarning: divide by zero encountered in log1p
print( np.log1p(i))
<ipython-input-35-830bb9e2f96e>:3: RuntimeWarning: invalid value encountered in log1p
print( np.log1p(i))
If I use C# to repeat the Log10 normalization:
List<double> origin = new List<double> { -1,-2,0,1,2};
Program p = new Program();
var norm = 0.0;
var denorm = 0.0;
foreach(var item in origin){
    System.Console.WriteLine($"Number: {item}");
    norm = p.normalize(item); // 0.2
    System.Console.WriteLine($"Normalized: {norm}");
    denorm = p.denormalize(norm); //12
    System.Console.WriteLine($"Denormalized: {denorm}");
}
public double normalize(double value)
{
    var norm = Math.Log10(value);
    return norm;
}
public double denormalize(double value)
{
    var denorm = Math.Round(Math.Pow(10,value),14);
    return denorm;
}
I get:
Number: -1
Normalized: NaN
Denormalized: NaN
Number: -2
Normalized: NaN
Denormalized: NaN
Number: 0
Normalized: -∞
Denormalized: 0
Number: 1
Normalized: 0
Denormalized: 1
Number: 2
Normalized: 0.3010299956639812
Denormalized: 2
Is there a finite way to collect Survey data to then normalize and to finally run some analysis for an attitudinal approach?
The problem with np.log10 is that the base-10 logarithm is not defined for zero or negative numbers: 10^x = y has no real solution x when y <= 0. If you want or need to use that function in particular, add 3 to every option so that, instead of running from -2 to 2, the values run from 1 to 5.
import numpy as np
feel = [1,2,3,4,5]
for i in feel:
print(np.log10(i))
This outputs:
>>> 0.0
>>> 0.3010299956639812
>>> 0.47712125471966244
>>> 0.6020599913279624
>>> 0.6989700043360189
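The shift can also be applied to the original -2..2 codes in one vectorized step and undone afterwards. This is only a sketch: the variable names are illustrative, and whether a log transform is appropriate for ordinal survey answers is a separate question.

```python
import numpy as np

# Survey codes on the original -2..2 scale.
feel = np.array([-2, -1, 0, 1, 2])

# Shift into 1..5 so that every value has a finite base-10 logarithm.
shifted = feel + 3
logged = np.log10(shifted)

# Invert the transform: raise 10 to the logged value, then undo the shift.
restored = np.rint(10 ** logged).astype(int) - 3

print(logged)
print(restored)
```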

passing parameters in groupby aggregate function

I have a dataframe, referenced as df in the code, and I'm applying aggregate functions to multiple columns of each group. I also apply the user-defined lambda functions f4, f5, f6 and f7. Some of them, such as f4, f6 and f7, are very similar and differ only in the parameter value. Can I pass these parameters from the dictionary d, so that I only have to write one function instead of several?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f6,
'acc_rate':f7,
'bearing':['sum', f4],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write something like this:
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(0.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
The csv file of dataframe df is available at given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
It is possible, but not straightforward; this follows a solution by neilaronson.
The function can also be simplified by summing the True values of a boolean mask:
def f4(p):
    def ipf(x):
        return (x < p).sum()
        #your solution
        #return len(x[x < p])
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a second parameter to choose between a greater-than and a less-than comparison:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4, 'less'), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2, 'greater'),
'acc_rate':f4(.25, 'greater'),
'bearing':['sum', f4(10, 'greater')],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
#flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
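The closure approach can be tried on a tiny made-up frame. In this sketch count_where, the toy data and the generated column names are all hypothetical; giving each generated function a distinct __name__ is an added design choice so that two counters can be applied to the same column without a name clash.

```python
import pandas as pd

def count_where(threshold, op):
    """Build an aggregation function that counts values above or below threshold."""
    def counter(series):
        if op == "greater":
            return int((series > threshold).sum())
        if op == "less":
            return int((series < threshold).sum())
        raise ValueError("op has to be 'greater' or 'less'")
    # A distinct name per parameter set keeps the resulting column names unique.
    counter.__name__ = f"count_{op}_{threshold}"
    return counter

# Made-up data: two trips with a few velocity readings each.
toy = pd.DataFrame({
    "trip": [1, 1, 1, 2, 2],
    "velocity": [0.1, 3.0, 5.0, 0.3, 4.0],
})

result = toy.groupby("trip").agg(
    {"velocity": [count_where(3.4, "less"), count_where(0.2, "greater"), "sum"]}
)
# Flatten the MultiIndex columns, as in the answer above.
result.columns = result.columns.map("_".join)
print(result)
```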

checking range of number and writing a value in a new column in pandas dataframe

I need to iterate over the column 'movies_rated', check each value against the conditions, and write a value into a newly created column 'expert_level'. When I test on a subset of the data, it works. But when I run it against my whole dataset, the column only gets filled with the value 1.
for num in df_merge['movies_rated']:
    if num in range(20,31):
        df_merge['expert_level'] = 1
    elif num in range(31,53):
        df_merge['expert_level'] = 2
    elif num in range(53,99):
        df_merge['expert_level'] = 3
    elif num in range(99,202):
        df_merge['expert_level'] = 4
    else:
        df_merge['expert_level'] = 5
here's a sample dataframe.
movies = [88,20,35,55,1203,99,2222,847]
name = ['angie','chris','pine','benedict','alice','spock','tony','xena']
df = pd.DataFrame(movies,name,columns=['movies_rated'])
certainly there's a less verbose way of doing this?
You could build an IntervalIndex and then apply pd.cut. I'm sure this is a duplicate, but I can't find one right now which uses both closed='left' and .codes, though I'm sure it exists.
bins = pd.IntervalIndex.from_breaks([0, 20, 31, 53, 99, 202, np.inf], closed='left')
df["expert_level"] = pd.cut(movies, bins).codes
which gives me
In [242]: bins
Out[242]:
IntervalIndex([[0.0, 20.0), [20.0, 31.0), [31.0, 53.0), [53.0, 99.0), [99.0, 202.0), [202.0, inf)]
closed='left',
dtype='interval[float64]')
and
In [243]: df
Out[243]:
movies_rated expert_level
angie 88 3
chris 20 1
pine 35 2
benedict 55 3
alice 1203 5
spock 99 4
tony 2222 5
xena 847 5
Note that I've set this up so that scores below 20 get a 0 value, so they can be distinguished from really high rankings. If you really want everything outside the bins to get 5, it'd be straightforward to remap 0 to 5, or just pass breaks of [20, 31, 53, 99, 202], map anything with a code of -1 (which means 'not binned') to 5, and add 1 to the remaining codes, which then start at 0.
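The second variant, where everything outside the bins gets 5, might look like the following sketch. It reuses the sample data from the question; the np.where remapping is one possible way to do it, not the only one.

```python
import numpy as np
import pandas as pd

movies = [88, 20, 35, 55, 1203, 99, 2222, 847]
name = ['angie', 'chris', 'pine', 'benedict', 'alice', 'spock', 'tony', 'xena']
df = pd.DataFrame(movies, name, columns=['movies_rated'])

# Only the four "real" levels are binned; values outside get code -1.
bins = pd.IntervalIndex.from_breaks([20, 31, 53, 99, 202], closed='left')
codes = pd.cut(df['movies_rated'], bins).cat.codes

# Codes 0..3 become levels 1..4; unbinned values (-1) become level 5.
df['expert_level'] = np.where(codes == -1, 5, codes + 1)
print(df)
```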
I think np.select with the pandas function between is a good choice for you:
conds = [df.movies_rated.between(20,30), df.movies_rated.between(31,52),
         df.movies_rated.between(53,98), df.movies_rated.between(99,201)]
choices = [1,2,3,4]
df['expert_level'] = np.select(conds,choices, 5)
>>> df
movies_rated expert_level
angie 88 3
chris 20 1
pine 35 2
benedict 55 3
alice 1203 5
spock 99 4
tony 2222 5
xena 847 5
you could do it with apply and a function:
def expert_level_check(num):
    if 20<= num < 31:
        return 1
    elif 31<= num < 53:
        return 2
    elif 53<= num < 99:
        return 3
    elif 99<= num < 202:
        return 4
    else:
        return 5
df['expert_level'] = df['movies_rated'].apply(expert_level_check)
it is slower to manually iterate over a df, I recommend reading this

Missing value imputation in Python

I have two huge vectors item_clusters and beta. The element item_clusters [ i ] is the cluster id to which the item i belongs. The element beta [ i ] is a score given to the item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted the item_clusters to a matrix clusters_to_items such that the element clusters_to_items [ i ][ j ] = 1 if the cluster i contains item j, else 0. After that I am running the following code.
# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix
alpha_z = []
for clust in range(0, num_clusters):
alpha = clust_to_items[clust, :]
alpha_beta = beta.multiply(alpha)
sum_row = alpha_beta.sum(1)[0, 0]
num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
to_impute = sum_row / num_nonzero
Z = np.repeat(to_impute, beta.shape[1])
alpha_z = alpha.multiply(Z)
idx = beta.nonzero()
alpha_z[idx] = beta.data
interact_score = alpha_z.tolist()[0]
# The interact_score is the required modified beta
# This is used to do some work that is very fast
The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.
Edit: I believe, I need some very different idea in which I can directly use item_clusters, and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269
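As a variation on the fast_impute function above, the per-cluster sums and counts can also be accumulated with np.bincount instead of np.add.at. This is a sketch under the assumption that item_clusters holds integer ids in the range 0..num_clusters-1 and that beta is a dense 1-D array; the function name and toy data are illustrative.

```python
import numpy as np

def impute_bincount(item_clusters, beta, num_clusters):
    """Replace zero scores with the mean non-zero score of their cluster."""
    nonzero = beta != 0
    # Sum of scores per cluster (zeros contribute nothing to the sum).
    totals = np.bincount(item_clusters, weights=beta, minlength=num_clusters)
    # Number of non-zero scores per cluster.
    counts = np.bincount(item_clusters, weights=nonzero.astype(float),
                         minlength=num_clusters)
    # Clusters with no non-zero score yield NaN instead of raising a warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        means = totals / counts
    return np.where(nonzero, beta, means[item_clusters])

beta = np.array([0, 2, 3, 0, -1, 1])
item_clusters = np.array([0, 0, 0, 1, 1, 1])
imputed = impute_bincount(item_clusters, beta, num_clusters=2)
print(imputed)
```

In the toy data the non-zero scores of cluster 0 are 2 and 3, so its zero is replaced by 2.5; cluster 1 has -1 and 1, so its zero stays at a mean of 0.0.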
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.

Resources