How to get function output to add columns to my Dataframe - python-3.x

I have a function that produces an output like so when I pass it a name:
W2V('aamir')
array([ 0.12135 , -0.99132 , 0.32347 , 0.31334 , 0.97446 , -0.67629 ,
0.88606 , -0.11043 , 0.79434 , 1.4788 , 0.53169 , 0.95331 ,
-1.1883 , 0.82438 , -0.027177, 0.70081 , 0.87467 , -0.095825,
-0.5937 , 1.4262 , 0.2187 , 1.1763 , 1.6294 , 0.91717 ,
-0.086697, 0.16529 , 0.19095 , -0.39362 , -0.40367 , 0.83966 ,
-0.25251 , 0.46286 , 0.82748 , 0.93061 , 1.136 , 0.85616 ,
0.34705 , 0.65946 , -0.7143 , 0.26379 , 0.64717 , 1.5633 ,
-0.81238 , -0.44516 , -0.2979 , 0.52601 , -0.41725 , 0.086686,
0.68263 , -0.15688 ], dtype=float32)
I have a data frame that has an index Name and a single column Y:
df1
Y
Name
aamir 0
aaron 0
... ...
zulema 1
zuzana 1
I wish to run my function on each value of Name and have it create columns like so:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
Name
aamir 0.12135 -0.99132 0.32347 0.31334 0.97446 -0.67629 0.88606 -0.11043 0.794340 1.47880 ... 0.647170 1.56330 -0.81238 -0.445160 -0.29790 0.52601 -0.41725 0.086686 0.68263 -0.15688
aaron -1.01850 0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250 -0.17397 -0.061715 0.55292 ... -0.144960 0.82696 -0.51106 -0.072066 0.43069 0.32686 -0.00886 -0.850310 -1.31530 0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema 0.56547 0.30961 0.48725 1.41000 -0.76790 0.39908 0.86915 0.68361 -0.019467 0.55199 ... 0.062091 0.62614 0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900 0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853 0.77911 0.44446 0.95019 0.40513 0.26643 0.075182 -1.34340 ... 1.102800 0.51495 1.06230 -1.587600 -0.44667 1.04600 -0.38978 0.741240 0.39457 0.22857
What I have done is really messy, but it works:
names = df1.index.to_list()
Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)
I am hoping there is a way to use .apply() or similar but I have not found how to do this. I am looking for an efficient way.
Update:
I modified my function to do like so:
if isinstance(w, pd.core.series.Series):
    w = w.to_string()
Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply the numbers are totally different:
df1
Name Y
0 aamir 0
1 aaron 0
... ... ...
7942 zulema 1
7943 zuzana 1
df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.075014 0.824769 0.580976 0.493415 0.409894 0.142214 0.202602 -0.599501 -0.213184 -0.142188 ... 0.627784 0.136511 -0.162938 0.095707 -0.257638 0.396822 0.208624 -0.454204 0.153140 0.803400
1 0.073664 0.868665 0.574581 0.538951 0.394502 0.134773 0.233070 -0.639365 -0.194892 -0.110557 ... 0.722513 0.147112 -0.239356 -0.046832 -0.237434 0.321494 0.206583 -0.454038 0.251605 0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942 -0.002117 0.894570 0.834724 0.602266 0.327858 -0.003092 0.197389 -0.675813 -0.311369 -0.174356 ... 0.690172 -0.085517 -0.000235 -0.214937 -0.290900 0.361734 0.290184 -0.497177 0.285071 0.711388
7943 -0.047621 0.850352 0.729225 0.515870 0.439999 0.060711 0.226026 -0.604846 -0.344891 -0.128396 ... 0.557035 -0.048322 -0.070075 -0.265775 -0.330709 0.281492 0.304157 -0.552191 0.281502 0.750304
7944 rows × 50 columns
You can see that the first row is aamir and the first value (column 0) my function returns is 0.1213 (You can see this at the top of my post). Yet with apply that appears to be 0.075014
EDIT:
It appears apply passes in `Name    aamir` (a one-element Series) rather than the plain string `aamir`. How can I get it to send just the name itself, `aamir`?

Let's say we have some function which transforms a string into a vector of a fixed size, for example:
import numpy as np
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
Also a data frame is given with a meaningful index and junk data:
import pandas as pd
names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)
When we apply some function to a DataFrame along columns, i.e. axis=1, its argument is a row as a Series whose name is the index label of that row. So we could do something like this:
output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')
With result_type='expand', returned vectors will be transformed into columns, which is the required output.
P.S. As an option:
df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')
P.P.S. IMO the behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In the case of the code:
df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
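The mix-up is easy to reproduce with a toy probe function (a sketch; the `W2V` below only reports the type it receives, it is not the real word-vector function):

```python
import pandas as pd

def W2V(w):
    # Probe: report what kind of object we actually received.
    return type(w).__name__

df = pd.DataFrame({'Name': ['aamir', 'aaron']})

# Row-wise apply hands each row over as a one-element Series, not a str.
print(df.apply(W2V, axis=1).tolist())                 # ['Series', 'Series']

# Mapping over the index passes the plain string labels instead.
print(df.set_index('Name').index.map(W2V).tolist())   # ['str', 'str']
```

This is why the same name can yield different numbers depending on how it is passed: the function silently receives a different object.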

I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.
This first section generates toy versions of your W2V function and your dataframe.
# substitute your W2V function for this:
from random import random

n = 5
def W2V(name: str):
    return [random() for i in range(n)]
# substitute your 2-column dataframe for this:
# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name': ['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))
df1 is
Name Y
0 aamir 0
1 aaron 0
2 zulema 1
3 zuzana 1
You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.
df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                    index=[i for i in range(n)]),
                axis=1)
My random-valued df2 is
0 1 2 3 4
0 0.242761 0.415253 0.940213 0.074455 0.444372
1 0.935781 0.968155 0.850091 0.064548 0.737655
2 0.204053 0.845252 0.967767 0.352254 0.028609
3 0.853164 0.698195 0.292238 0.982009 0.402736
Then concatenate the new dataframe to the old one:
df3 = pd.concat([df1, df2], axis=1)
df3 is
Name Y 0 1 2 3 4
0 aamir 0 0.242761 0.415253 0.940213 0.074455 0.444372
1 aaron 0 0.935781 0.968155 0.850091 0.064548 0.737655
2 zulema 1 0.204053 0.845252 0.967767 0.352254 0.028609
3 zuzana 1 0.853164 0.698195 0.292238 0.982009 0.402736
Alternatively, you could do both steps in one line as:
df1 = pd.concat([df1,
                 df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                               index=[i for i in range(n)]),
                           axis=1)],
                axis=1)

You can try something like this using map and np.vstack with a dataframe constructor then join:
df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))
Output:
Y 0 1 2 3 4 5 6 7 8 9
A 0 4 0 2 1 0 0 0 0 3 3
B 1 4 0 0 4 4 3 4 3 4 3
C 2 1 5 5 5 3 3 1 3 5 0
D 3 3 5 1 3 4 2 3 1 0 1
E 4 4 0 2 4 4 0 3 3 4 2
F 5 4 3 5 1 0 2 3 2 5 2
G 6 4 5 2 0 0 2 4 3 4 3
H 7 0 2 5 2 3 4 3 5 3 1
I 8 2 2 0 1 4 2 4 1 0 4
J 9 0 2 3 5 0 3 0 2 4 0
Using @Vitalizzare's function:
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
df = pd.DataFrame({'Y': np.arange(10)}, index = [*'ABCDEFGHIJ'])

I am going off the names being the index, and there being a useless column called 0. I think this may be the solution, but there is no way to know without your function or the names:
df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')

I would do simply:
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
Example
To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random, and draw a random vector of size=50:
import numpy as np
import pandas as pd
from contextlib import contextmanager

@contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)

mask = (1 << 32) - 1

def w2v(name, size=50):
    fingerprint = int(pd.util.hash_array(np.array([name])))
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)
For instance:
>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
0.31315156, 0.22802972, -0.96076167, 0.62577993, -0.59024811,
0.76365736, 0.93033898, -0.56155296, 0.4760905 , -0.92760642,
0.00177959, -0.22761559, 0.81929959, 0.21138229, -0.49882747,
-0.97637984, -0.19452496, -0.91354933, 0.70473533, -0.30394358,
-0.47092087, -0.0329302 , -0.93178517, 0.79118799, 0.98286834,
-0.16024194, -0.02793147, -0.52251214, -0.70732759, 0.10098142,
-0.24880249, 0.28930319, -0.53444863, 0.37887522, 0.58544068,
0.85804119, 0.67048213, 0.58389158, -0.19889071, -0.04281131,
-0.62506126, 0.42872395, -0.12821543, -0.52458052, -0.35493892])
Now, we use the expression given as solution:
df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
>>> newdf
0 1 2 3 4 5 6 ...
aamir 0.654469 -0.927651 -0.781886 -0.626838 -0.239468 0.313152 0.228030 ...
aaron -0.380524 -0.850608 -0.914642 -0.578885 0.177975 -0.633761 -0.736234 ...
zulema -0.250957 0.882491 -0.197833 -0.707652 0.754575 0.731236 -0.770831 ...
zuzana -0.641296 0.065898 0.466784 0.652776 0.391865 0.918761 0.022798 ...
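As a side note on the design, NumPy's newer Generator API can avoid the save/restore of global state entirely. A sketch of the same idea (assuming numpy >= 1.17 for `default_rng`):

```python
import numpy as np
import pandas as pd

def w2v(name, size=50):
    # A dedicated Generator seeded from the name's hash keeps the draw
    # deterministic without touching np.random's global state.
    fingerprint = int(pd.util.hash_array(np.array([name]))[0]) & ((1 << 32) - 1)
    rng = np.random.default_rng(fingerprint)
    return rng.uniform(-1, 1, size)

v1, v2 = w2v('aamir'), w2v('aamir')
assert (v1 == v2).all()   # same name -> same vector, no context manager needed
```

The values differ from the `temp_seed` version (different RNG), but the determinism property is identical.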

Related

I want to get/print df by range instead of head or tail

I can't find or understand how to get the data I want by range. I want to know how to get df['Close'] from x to y and then .mean() to sum it up. I have tried costomclose = df['Close'],range(dagartot,val), but it gives me something else, like the head and tail of df.
if len(df) >= 34:
    dagartot = len(df)
    valdagar = 5
    val = dagartot - valdagar
    costomclose = df['Close'], range(dagartot, val)
    print(costomclose)
edit:
<bound method NDFrame.tail of High Low ... Volume Adj Close
Date ...
2005-09-29 24.083300 23.583300 ... 74400.0 4.038682
2005-09-30 23.833300 23.500000 ... 148200.0 4.081495
2005-10-03 24.000000 23.333300 ... 27600.0 3.995869
2005-10-04 23.500000 23.416700 ... 132000.0 4.024417
2005-10-05 23.750000 23.500000 ... 15600.0 4.067230
... ... ... ... ... ...
2019-07-25 196.000000 193.050003 ... 355952.0 194.000000
2019-07-26 196.350006 194.000000 ... 320752.0 195.199997
2019-07-29 196.350006 193.550003 ... 301389.0 195.250000
2019-07-30 197.949997 194.850006 ... 233989.0 197.100006
2019-07-31 198.550003 195.600006 ... 323473.0 197.899994
[3479 rows x 6 columns]>
Here is an example of slicing out the middle of something based on the encounter index:
>>> s = pd.Series(list('abcdefghijklmnop'))
>>> s
Out[135]:
0 a
1 b
...
12 m
13 n
14 o
15 p
dtype: object
>>> s.iloc[6:9]
Out[136]:
6 g
7 h
8 i
dtype: object
This also works for DataFrames, e.g. df.iloc[0] returns the first row and df.iloc[5:8] returns those rows, end not included.
You can also slice by the actual index of the DataFrame, which is not necessarily a serially-counting sequence of integers, by substituting loc for iloc.
Here is an example of slicing out the middle of a dataframe that stores the alphabet:
>>> df = pd.DataFrame([dict(num=i + 65, char=chr(i + 65)) for i in range(26)])
>>> df[(76 <= df.num) & (df.num < 81)]
num char
11 76 L
12 77 M
13 78 N
14 79 O
15 80 P
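For completeness, here is a small sketch contrasting the two: `.loc` slices by label and includes the end point, while `.iloc` slices by position and excludes it.

```python
import pandas as pd

s = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both endpoints included.
print(s.loc['b':'d'].tolist())   # [1, 2, 3]

# Position-based slice: end excluded, same rows here.
print(s.iloc[1:4].tolist())      # [1, 2, 3]
```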

Add columns to pandas data frame with for-loop

The code block below produces this table:
Trial Week Branch Num_Dep Tot_dep_amt
1 1 1 4 4200
1 1 2 7 9000
1 1 3 6 4800
1 1 4 6 5800
1 1 5 5 3800
1 1 6 4 3200
1 1 7 3 1600
. . . . .
. . . . .
1 1 8 5 6000
9 19 40 3 2800
Code:
trials = 10
dep_amount = []
branch = 41
total = []
week = 1
week_num = []
branch_num = []
dep_num = []
trial_num = []
weeks = 20
df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            temp = pd.DataFrame.from_records([{'Trial': a, 'Week': b, 'branch': c, 'Num_Dep': depnum, 'Tot_dep_amt': acc_dep}])
            df = pd.concat([df, temp])
df = df[['Trial', 'Week', 'branch', 'Num_Dep', 'Tot_dep_amt']]
df = df.reset_index()
df = df.drop('index', axis=1)
I would like to be able to break branches apart in the for-loop and instead have the resultant df represented with headers:
Trial Week Branch_1_Num_Dep Branch_1_Tot_dep_amount Branch_2_Num_ Dep .....etc
I know this could be done by generating the DF and performing an encoding, but for this task I would like it to be generated in the for loop if possible?
In order to achieve this with minimal changes to your code, you can do something like the following:
df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        records = {'Trial': a, 'Week': b}
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            records['Branch_{}_Num_Dep'.format(c)] = depnum
            records['Branch_{}_Tot_dep_amount'.format(c)] = acc_dep
        temp = pd.DataFrame.from_records([records])
        df = pd.concat([df, temp])
df = df.reset_index()
df = df.drop('index', axis=1)
Overall it seems that what you are doing can be done in more elegant and faster ways. I would recommend taking a look at vectorization as a concept (e.g. here).
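To make that recommendation concrete, here is a hedged sketch of the vectorized idea: draw every (trial, week, branch) cell in bulk instead of one scalar at a time. It only approximates the original (it rounds the branch total once instead of rounding each deposit to 200 first), so the distributions will differ slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
trials, weeks, branches = 9, 19, 40   # matches range(1, 10/20/41) above

index = pd.MultiIndex.from_product(
    [range(1, trials + 1), range(1, weeks + 1)], names=['Trial', 'Week'])

# One vectorized draw for every (trial, week, branch) cell at once.
num_dep = np.round(rng.normal(5, 2, size=(len(index), branches))).astype(int).clip(min=0)

# Approximate total: count times a single N(1200, 400) draw, rounded to 200.
tot_dep = np.round(num_dep * rng.normal(1200, 400, size=num_dep.shape) / 200) * 200

cols_num = [f'Branch_{c}_Num_Dep' for c in range(1, branches + 1)]
cols_tot = [f'Branch_{c}_Tot_dep_amount' for c in range(1, branches + 1)]

df = pd.concat([
    pd.DataFrame(num_dep, index=index, columns=cols_num),
    pd.DataFrame(tot_dep, index=index, columns=cols_tot),
], axis=1).reset_index()
```

The wide per-branch layout asked for in the question falls out directly, with no row-by-row `pd.concat` in a loop.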

passing parameters in groupby aggregate function

I have a dataframe, referenced as df in the code, and I'm applying aggregate functions on multiple columns of each group. I also applied user-defined lambda functions f4, f5, f6, f7. Some functions are very similar, like f4, f6 and f7, where only the parameter value is different. Can I pass these parameters from dictionary d, so that I have to write only one function instead of multiple functions?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with accelration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f5, 'sum', 'count', 'median', 'min'],
     'velocity_rate': f6,
     'acc_rate': f7,
     'bearing': ['sum', f4],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}
df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write a function like
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f5, 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2),
     'acc_rate': f4(0.25),
     'bearing': ['sum', f4(10)],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}
The csv file of dataframe df is available at given link for more clarity of data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
It is possible, though not easy; this solution is by neilaronson. The solution is also simplified by summing the True values of a boolean mask.
def f4(p):
    def ipf(x):
        # your solution: return len(x[x < p])
        return (x < p).sum()
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f4(3.4), 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2),
     'acc_rate': f4(.25),
     'bearing': ['sum', f4(10)],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}
df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter to choose greater or less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [f4(3.4, 'less'), 'sum', 'count', 'median', 'min'],
     'velocity_rate': f4(0.2, 'greater'),
     'acc_rate': f4(.25, 'greater'),
     'bearing': ['sum', f4(10, 'greater')],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}
df1 = df.groupby(['userid', 'trip_id', 'Transportation_Mode', 'segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
# MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print(df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
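As an alternative design (not from the answer above), the 'greater'/'less' string flag could be replaced by passing a comparison function from the standard operator module, which removes the if/elif chain entirely:

```python
import operator
import pandas as pd

def f4(p, op=operator.gt):
    # Count values for which op(value, p) is True.
    def ipf(x):
        return op(x, p).sum()
    ipf.__name__ = 'Frequency'
    return ipf

s = pd.Series([0.1, 0.3, 0.5, 3.0, 4.0])
print(f4(0.25)(s))              # greater than 0.25 -> 4
print(f4(3.4, operator.lt)(s))  # less than 3.4 -> 4
```

The same `f4(0.2)`, `f4(3.4, operator.lt)` calls slot straight into the aggregation dictionary d.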

how to replace a cell in a pandas dataframe

After forming the below python pandas dataframe (for example)
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
...: print( i.Index, i.Name, i.Age )
...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem appears when using a larger, different dataset, where I do the following. First, I create a new column named NETELEMENT with a default value of "". Then I try to replace the default value "" with the string that the function lookup_netelement returns:
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print(i, lookup_netelement(i.PEER_SRC_IP))
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong?
EDIT: for reasons of completeness the lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
Hope you are looking for where for conditional replacement, i.e.
def wow(x):
    return x ** 10

df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output :
Name Age new
0 Alex 10 10000000000
1 Bob 12 12
2 Clarke 13 13
3 Alex 15 576650390625
Based on your edit your trying to apply the function i.e
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: For your comment on sending two columns, use a lambda with axis=1, i.e.
def wow(x, y):
    return '{} {}'.format(x, y)

df.apply(lambda x: wow(x['Name'], x['Age']), 1)

creating lists from row data

My input data has the following format
id offset code
1 3 21
1 3 24
1 5 21
2 1 84
3 5 57
3 5 21
3 5 92
3 10 83
3 10 21
I would like the output in the following format
id offset code
1 [3,5] [[21,24],[21]]
2 [1] [[84]]
3 [5,10] [[21,57,92],[21,83]]
The code that I have been able to come up with is shown below
import random, pandas
random.seed(10000)
param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)
pd = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"] + 1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"])
})
pd["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
pd = pd.sort_values(["id", "offset", "code"]).reset_index(drop=True)
tmp1 = pd.groupby(by=["id"])["offset"].apply(lambda x: list(set(x))).reset_index()
tmp2 = pd.groupby(by=["id", "offset"])["code"].apply(lambda x: list(x)).reset_index().groupby(
    by=["id"], sort=True)["code"].apply(lambda x: list(x)).reset_index()
out = pandas.merge(tmp1, tmp2, on="id", sort=False)
It does give me the output that I want, but it is VERY slow when the dataframe is large. The dataframe that I have has over 40 million rows. In the example, uncomment the fourth param statement and you will see how slow it is.
Can you please help with making this run faster?
(df.groupby(['id','offset']).code.apply(list).reset_index()
.groupby('id').agg(lambda x: x.tolist()))
Out[733]:
offset code
id
1 [3, 5] [[21, 24], [21]]
2 [1] [[84]]
3 [5, 10] [[57, 21, 92], [83, 21]]
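Unpacked, the one-liner is two grouping passes; a small sketch on the sample data shows each step:

```python
import pandas as pd

df = pd.DataFrame({
    'id':     [1, 1, 1, 2, 3, 3, 3, 3, 3],
    'offset': [3, 3, 5, 1, 5, 5, 5, 10, 10],
    'code':   [21, 24, 21, 84, 57, 21, 92, 83, 21],
})

# Pass 1: collect the codes that share an (id, offset) pair into a list.
inner = df.groupby(['id', 'offset']).code.apply(list).reset_index()

# Pass 2: fold the per-offset rows into one list of offsets (and a list
# of lists of codes) per id.
out = inner.groupby('id').agg(lambda x: x.tolist())
print(out.loc[1, 'offset'])   # [3, 5]
print(out.loc[1, 'code'])     # [[21, 24], [21]]
```

Both passes are single grouped aggregations, which is where the speedup over the merge of tmp1 and tmp2 comes from.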
