How to convert bytes to a buffer in Python - python-3.x

I'm retrieving a blob from an MSSQL server, which prints as something like this:
b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(en-GB) /StructTreeRoot 30 0 R/MarkInfo<</Marked true>>/Metadata 820 0 R/ViewerPreferences 821 0 R>>\r\nendobj\r\n2 0 obj\r\n<</Type/Pages/Count 3/Kids[
3 0 R 18 0 R 25 0 R
] >>\r\nendobj\r\n3 0 obj\r\n<</Type/Page/Parent 2 0 R/Resources<</XObject<</Image5 5 0 R>>/ExtGState<</GS6 6 0 R/GS9 9 0 R>>/Font<</F1 7 0 R/F2 10 0 R/F3 12 0 R/F4 14 0 R/F5 16 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI
] >>/MediaBox[
0 0 595.44 841.68
] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<</Filter/FlateDecode/Length 7807>>\r\nstream\r\nx\x9c\xcd
]ms\xdc6\x92\xfe\xee*\xff\x07~\xba\x9a\xd9=Q\x04\xdf\xb9\xb5\xb7U\x8e\x9d\xb7\xdd8qb\xed\xa6R\xc9~P\xa4\x91\xadD\x96|\x92\xbc[\xfe\xf7\xd7\xddh\x80\x04\x89\x060\xc3\xf1\xdd\xa5\xca\x13I\xec\xe9~\xd8h4\x1a\x0f#\xf0\xf4\xd9\xfd\xe3\xf5\xd5\xf9\xc5c\xf6\xe7?\x9f>{|<\xbfx\xbb\xbb\xcc~>=\xbb{\xff\xcf\xd3\xb3\x8f\xefw\xa7\xaf\xce\xdf\\\xdf\x9e?^\xdf\xdd\x9e\xbe\xfe\xf0\xeb#\xfe\xe9\xab\xdd\xf9\xe5\xee\xfe/\x7f\xc9>{\xf1<;\xfd\xf2u\x9b\xbdyx\xfa\xe4\xbf\x9f>)\xb2"/
When I print the type of the above blob it shows bytes. I want to convert it to a buffer, which looks something like this: {"type":"Buffer","data":[37,80,68,70,45,49,46,55,13,10,37,181,181,181,181,13,10,49,32,48,32,111,98,106,13,10,60,60,47,84,121,112,101,47,67,97,116,97,108,111,103,47,80,97,103,101,115,32,50,32,48,32,82,47,76,97,110,103,40,101,110,45,71,66,41,32,47,83,116,114,117,99,116,84,114,101,101,82,111,111,116,32,51,48,32,48,32,82,47,77,97,114,107,73,110,102,111,60,60,47,77,97,114,107,101,100,32,116,114,117,101,62,62,47,77,101,116,97,100,97,116,97,32,56,50,48,32,48,32,82,47,86,105,101,119,101,114,80,114,101,102,101,114,101,110,99,101,115,32,56,50,49,32,48,32,82,62,62,13,10,101,110,100,111,98,106,13,10,50,32,48,32,111,98,106,13,10,60,60,47,84,121,112,101,47,80,97,103,101,115,47,67,111,117,110,116,32,51,47,75,105,100,115,91,32,51,32,48,32,82,32,49,56,32,48,32,82,32,50,53,32,48,32,82,93,32,62,62,13,10,101,110,100,111,98,106,13,10,51,32,48,32,111,98,106,13} Thanks in advance.

Python doesn't have the equivalent of a "Buffer" type, which is solely a Node.js invention. However, if you want to export the bytes object as JSON in a Node.js-compatible way, you could use the built-in json module:
>>> import json
>>> data = b'hello' # your data goes here
>>> json.dumps({ 'type': 'Buffer', 'data': list(data) })
'{"type": "Buffer", "data": [104, 101, 108, 108, 111]}'

Related

How to get function output to add columns to my Dataframe

I have a function that produces an output like so when I pass it a name:
W2V('aamir')
array([ 0.12135 , -0.99132 , 0.32347 , 0.31334 , 0.97446 , -0.67629 ,
0.88606 , -0.11043 , 0.79434 , 1.4788 , 0.53169 , 0.95331 ,
-1.1883 , 0.82438 , -0.027177, 0.70081 , 0.87467 , -0.095825,
-0.5937 , 1.4262 , 0.2187 , 1.1763 , 1.6294 , 0.91717 ,
-0.086697, 0.16529 , 0.19095 , -0.39362 , -0.40367 , 0.83966 ,
-0.25251 , 0.46286 , 0.82748 , 0.93061 , 1.136 , 0.85616 ,
0.34705 , 0.65946 , -0.7143 , 0.26379 , 0.64717 , 1.5633 ,
-0.81238 , -0.44516 , -0.2979 , 0.52601 , -0.41725 , 0.086686,
0.68263 , -0.15688 ], dtype=float32)
I have a data frame that has an index Name and a single column Y:
df1
Y
Name
aamir 0
aaron 0
... ...
zulema 1
zuzana 1
I wish to run my function on each value of Name and have it create columns like so:
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
Name
aamir 0.12135 -0.99132 0.32347 0.31334 0.97446 -0.67629 0.88606 -0.11043 0.794340 1.47880 ... 0.647170 1.56330 -0.81238 -0.445160 -0.29790 0.52601 -0.41725 0.086686 0.68263 -0.15688
aaron -1.01850 0.80951 0.40550 0.09801 0.50634 0.22301 -1.06250 -0.17397 -0.061715 0.55292 ... -0.144960 0.82696 -0.51106 -0.072066 0.43069 0.32686 -0.00886 -0.850310 -1.31530 0.71631
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zulema 0.56547 0.30961 0.48725 1.41000 -0.76790 0.39908 0.86915 0.68361 -0.019467 0.55199 ... 0.062091 0.62614 0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900 0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853 0.77911 0.44446 0.95019 0.40513 0.26643 0.075182 -1.34340 ... 1.102800 0.51495 1.06230 -1.587600 -0.44667 1.04600 -0.38978 0.741240 0.39457 0.22857
What I have done is really messy, but it works:
names = df1.index.to_list()
Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)
I am hoping there is a way to use .apply() or similar, but I have not found how to do this. I am looking for an efficient way.
Update:
I modified my function like so:
if isinstance(w, pd.core.series.Series):
    w = w.to_string()
Although this appears to work at first, the data is wrong. If I pass aamir to my function directly, you can see the result at the top of my post. Yet when I do it with apply, the numbers are totally different:
df1
Name Y
0 aamir 0
1 aaron 0
... ... ...
7942 zulema 1
7943 zuzana 1
df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
0 0.075014 0.824769 0.580976 0.493415 0.409894 0.142214 0.202602 -0.599501 -0.213184 -0.142188 ... 0.627784 0.136511 -0.162938 0.095707 -0.257638 0.396822 0.208624 -0.454204 0.153140 0.803400
1 0.073664 0.868665 0.574581 0.538951 0.394502 0.134773 0.233070 -0.639365 -0.194892 -0.110557 ... 0.722513 0.147112 -0.239356 -0.046832 -0.237434 0.321494 0.206583 -0.454038 0.251605 0.918388
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7942 -0.002117 0.894570 0.834724 0.602266 0.327858 -0.003092 0.197389 -0.675813 -0.311369 -0.174356 ... 0.690172 -0.085517 -0.000235 -0.214937 -0.290900 0.361734 0.290184 -0.497177 0.285071 0.711388
7943 -0.047621 0.850352 0.729225 0.515870 0.439999 0.060711 0.226026 -0.604846 -0.344891 -0.128396 ... 0.557035 -0.048322 -0.070075 -0.265775 -0.330709 0.281492 0.304157 -0.552191 0.281502 0.750304
7944 rows × 50 columns
You can see that the first row is aamir, and the first value (column 0) my function returns is 0.12135 (you can see this at the top of my post). Yet with apply it appears to be 0.075014.
EDIT:
It appears apply passes in the row (Name    aamir, a Series) rather than the string aamir. How can I get it to send just the name itself, aamir?
Let's say we have some function which transforms a string into a vector of a fixed size, for example:
import numpy as np
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
Also a data frame is given with a meaningful index and junk data:
import pandas as pd
names = pd.Index(['aamir','aaron','zulema','zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)
When we apply some function to a DataFrame along columns, i.e. axis=1, its argument is going to be a row as a Series, whose name attribute is the index of the row. So we could do something like this:
output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')
With result_type='expand', returned vectors will be transformed into columns, which is the required output.
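To attach the expanded columns back onto the original frame, the result can be joined on the shared index (a usage note; assumes the index has not been modified):
df = df.join(output)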
P.S. As an option:
df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')
P.P.S. IMO, the behavior you describe means that your function can operate not only on str, but also on some common sequence, for example on a Series of strings. In the case of the code:
df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')
the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, we can get a silent error, which in this case appears as different output data.
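A minimal guard inside the function avoids this silent mismatch (a sketch; W2V_safe and the use of iloc[0] are illustrative, not from the original code):
def W2V_safe(w):
    # apply(axis=1) hands over a one-element Series, not a string;
    # take the element itself rather than its string rendering (to_string()).
    if isinstance(w, pd.Series):
        w = w.iloc[0]
    return W2V(w)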
I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe.
This first section generates toy versions of your W2V function and your dataframe.
from random import random
import pandas as pd
# substitute your W2V function for this:
n = 5
def W2V(name: str):
    return [random() for i in range(n)]
# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name': ['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))
df1 is
Name Y
0 aamir 0
1 aaron 0
2 zulema 1
3 zuzana 1
You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate the column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.
df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                    index=[i for i in range(n)]),
                axis=1)
My random-valued df2 is
0 1 2 3 4
0 0.242761 0.415253 0.940213 0.074455 0.444372
1 0.935781 0.968155 0.850091 0.064548 0.737655
2 0.204053 0.845252 0.967767 0.352254 0.028609
3 0.853164 0.698195 0.292238 0.982009 0.402736
Then concatenate the new dataframe to the old one:
df3 = pd.concat([df1, df2], axis=1)
df3 is
Name Y 0 1 2 3 4
0 aamir 0 0.242761 0.415253 0.940213 0.074455 0.444372
1 aaron 0 0.935781 0.968155 0.850091 0.064548 0.737655
2 zulema 1 0.204053 0.845252 0.967767 0.352254 0.028609
3 zuzana 1 0.853164 0.698195 0.292238 0.982009 0.402736
Alternatively, you could do both steps in one line as:
df1 = pd.concat([df1,
                 df1.apply(lambda x: pd.Series(W2V(x['Name']),
                                               index=[i for i in range(n)]),
                           axis=1)],
                axis=1)
You can try something like this, using map and np.vstack with a DataFrame constructor, then join:
df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))
Output:
Y 0 1 2 3 4 5 6 7 8 9
A 0 4 0 2 1 0 0 0 0 3 3
B 1 4 0 0 4 4 3 4 3 4 3
C 2 1 5 5 5 3 3 1 3 5 0
D 3 3 5 1 3 4 2 3 1 0 1
E 4 4 0 2 4 4 0 3 3 4 2
F 5 4 3 5 1 0 2 3 2 5 2
G 6 4 5 2 0 0 2 4 3 4 3
H 7 0 2 5 2 3 4 3 5 3 1
I 8 2 2 0 1 4 2 4 1 0 4
J 9 0 2 3 5 0 3 0 2 4 0
Using @Vitalizzare's function:
def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)
df = pd.DataFrame({'Y': np.arange(10)}, index=[*'ABCDEFGHIJ'])
I am going off the names being the index, and there being a useless column called 0. I think this may be the solution, but there is no way to know without your function or the names:
df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')
I would do simply:
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
Example
To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string, use that hash as a (temporary) seed for np.random, and then draw a random vector of size=50:
import numpy as np
import pandas as pd
from contextlib import contextmanager
@contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)
mask = (1 << 32) - 1
def w2v(name, size=50):
    fingerprint = int(pd.util.hash_array(np.array([name])))
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)
For instance:
>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
0.31315156, 0.22802972, -0.96076167, 0.62577993, -0.59024811,
0.76365736, 0.93033898, -0.56155296, 0.4760905 , -0.92760642,
0.00177959, -0.22761559, 0.81929959, 0.21138229, -0.49882747,
-0.97637984, -0.19452496, -0.91354933, 0.70473533, -0.30394358,
-0.47092087, -0.0329302 , -0.93178517, 0.79118799, 0.98286834,
-0.16024194, -0.02793147, -0.52251214, -0.70732759, 0.10098142,
-0.24880249, 0.28930319, -0.53444863, 0.37887522, 0.58544068,
0.85804119, 0.67048213, 0.58389158, -0.19889071, -0.04281131,
-0.62506126, 0.42872395, -0.12821543, -0.52458052, -0.35493892])
Now, we use the expression given as the solution:
df = pd.DataFrame([0,0,1,1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)
>>> newdf
0 1 2 3 4 5 6 ...
aamir 0.654469 -0.927651 -0.781886 -0.626838 -0.239468 0.313152 0.228030 ...
aaron -0.380524 -0.850608 -0.914642 -0.578885 0.177975 -0.633761 -0.736234 ...
zulema -0.250957 0.882491 -0.197833 -0.707652 0.754575 0.731236 -0.770831 ...
zuzana -0.641296 0.065898 0.466784 0.652776 0.391865 0.918761 0.022798 ...

Aggregating the total count with Json Data

I have an API endpoint which has the details of confirmed / recovered / tested count for each state
https://data.covid19india.org/v4/min/data.min.json
I would like to aggregate the total count of confirmed / recovered / tested across each state. What is the easiest way to achieve this?
To write the final results with pandas, we can proceed by adding this to the code:
import pandas as pd
columns = ('Confirmed', 'Deceased', 'Recovered', 'Tested')
Panda = pd.DataFrame(data=StateWiseData).T  # T for transpose, so states become rows
print(Panda)
The output will be:
confirmed deceased recovered tested
AN 7557 129 7420 0
AP 2003342 13735 1975448 9788047
AR 52214 259 50695 398545
AS 584434 5576 570847 326318
BR 725588 9649 715798 17107895
CH 65066 812 64213 652657
CT 1004144 13553 989728 338344
DL 1437334 25079 1411881 25142853
DN 10662 4 10620 72410
GA 173221 3186 169160 0
GJ 825302 10079 815041 10900176
HP 211746 3553 206094 481328
HR 770347 9667 760004 3948145
JH 347730 5132 342421 233773
JK 324295 4403 318838 139552
KA 2939767 37155 2882331 9791334
KL 3814305 19494 3631066 3875002
LA 20491 207 20223 110068
LD 10309 51 10194 234256
MH 6424651 135962 6231999 8421643
ML 74070 1281 69859 0
MN 111212 1755 105751 13542
MP 792101 10516 781499 3384824
MZ 52472 200 46675 0
NL 29589 610 27151 116359
OR 1001698 7479 986334 2774807
PB 600266 16352 583426 2938477
PY 122934 1808 120330 567923
RJ 954023 8954 944917 5852578
SK 29340 367 27185 0
TG 654989 3858 644747 0
TN 2600885 34709 2547005 4413963
TR 82092 784 80150 607962
TT 0 0 0 0
UP 1709119 22792 1685954 23724581
UT 342749 7377 329006 2127358
WB 1543496 18371 1515789 0
Yes, my interpretation was incorrect earlier. We have to get the district totals and add them up.
import json
file = open('data.min.json')
dictionary = json.load(file)
stateCodes = ['AN', 'AP', 'AR', 'AS', 'BR', 'CH', 'CT', 'DL', 'DN', 'GA', 'GJ', 'HP', 'HR', 'JH', 'JK', 'KA', 'KL', 'LA', 'LD', 'MH', 'ML', 'MN', 'MP', 'MZ', 'NL', 'OR', 'PB', 'PY', 'RJ', 'SK', 'TG', 'TN', 'TR', 'TT', 'UP', 'UT', 'WB']
StateWiseData = {}
for state in stateCodes:
    StateInfo = dictionary[state]
    Confirmed = 0
    Recovered = 0
    Tested = 0
    Deceased = 0
    StateData = {}
    if "districts" in StateInfo:
        for District in StateInfo['districts']:
            DistrictInfo = StateInfo['districts'][District]['total']
            # some districts omit fields, so only add when the value
            # matches the accumulator's type (int)
            if 'confirmed' in DistrictInfo:
                if type(Confirmed) == type(DistrictInfo['confirmed']):
                    Confirmed += DistrictInfo['confirmed']
            if 'recovered' in DistrictInfo:
                if type(Recovered) == type(DistrictInfo['recovered']):
                    Recovered += DistrictInfo['recovered']
            if 'tested' in DistrictInfo:
                if type(Tested) == type(DistrictInfo['tested']):
                    Tested += DistrictInfo['tested']
            if 'deceased' in DistrictInfo:
                if type(Deceased) == type(DistrictInfo['deceased']):
                    Deceased += DistrictInfo['deceased']
    StateData['confirmed'] = Confirmed
    StateData['deceased'] = Deceased
    StateData['recovered'] = Recovered
    StateData['tested'] = Tested
    StateWiseData[state] = StateData
print(StateWiseData)
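For reference, a more compact version of the same aggregation (a sketch, assuming the data.min.json layout above; not from the original answer):
import json
with open('data.min.json') as f:
    data = json.load(f)
StateWiseData = {}
for state, info in data.items():
    totals = {'confirmed': 0, 'deceased': 0, 'recovered': 0, 'tested': 0}
    for district in info.get('districts', {}).values():
        district_total = district.get('total', {})
        for key in totals:
            value = district_total.get(key, 0)
            if isinstance(value, int):  # skip missing or non-numeric entries
                totals[key] += value
    StateWiseData[state] = totals
print(StateWiseData)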

Pandas operation, putting multiple results into a df column

This is quant work. I used my previous work to filter out some desired stocks (candidates) with technical analysis based on moving averages, MACD, KDJ, etc. Now I want to check my candidates' fundamental values, in this case ROE. Here is my code:
import os
import pandas as pd
root_path = '.\\__fundamentals__'
df = pd.DataFrame(pd.read_csv("C:\\candidates.csv", encoding='GBK'))  # 14 candidates this time
for code in list(df['code']):
    i = str(code).zfill(6)
    for root, dirs, files in os.walk(root_path):
        for csv in files:
            if csv.startswith('{}.csv'.format(i)):
                csv_path = os.path.join(root, csv)  # based on candidates, look up the DuPont values
                df2 = pd.DataFrame(pd.read_csv("{}".format(csv_path), encoding='GBK'))
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float) / 100  # 净资产收益率 = return on equity
                ROE = [df2['ROE'].mean().round(decimals=3)]
                df3 = pd.DataFrame({'ROE_Mean': ROE})
                print(df3)
Here is the output:
C:\Users\Mike_Leigh\.conda\envs\LEIGH\python.exe "P:/LEIGH PYTHON/Codes/Quant/analyze_stock.py"
ROE_Mean
0 -0.218
ROE_Mean
0 0.121
ROE_Mean
0 0.043
ROE_Mean
0 0.197
ROE_Mean
0 0.095
ROE_Mean
0 0.085
...
ROE_Mean
0 0.178
Process finished with exit code 0
My desired output would be like this:
ROE_Mean
0 -0.218
1 0.121
2 0.043
3 0.197
4 0.095
5 0.085
...
14 0.178
Would you please give me a hint on this? Thanks a lot, much appreciated!
Actually, I wasn't that bad at solving the issue myself.
First, make a list outside the loop (the very outside of the loop), in this case before df is created:
roe_avg = []
df = pd.DataFrame(pd.read_csv("C:\\candidates.csv", encoding='GBK'))
....
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float) / 100
                ROE_avg = df2['ROE'].mean().round(decimals=3)
                roe_avg.append(ROE_avg)
df['ROE_avg'] = roe_avg
print(df)
Output:
name code ROE_avg
1 仙鹤股份 603733 0.121
3 泸州老窖 568 0.197
4 兴蓉环境 598 0.095
...
15 濮阳惠成 300481 0.148
16 中科创达 300496 0.101
17 森霸传感 300701 0.178
Process finished with exit code 0
Thanks to @filippo.
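One more option for the single-column frame the question asked for: collect each per-candidate df3 in a list and concatenate once at the end (a sketch; the frames list is illustrative, not from the original code):
frames = []
....
                frames.append(df3)  # inside the innermost loop, after df3 is built
result = pd.concat(frames, ignore_index=True)
print(result)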

Pandas: MID & FIND Function

I have a column in my dataframe that shows different combinations of the values below. I know that I could use the .str[:3] function and then convert this to a value, but the differing string lengths are throwing me off. How would I do a MID(x,FIND(",",x,1)+1,10)-esque function on this column to find the sentiment and subjectivity values?
String samples:
df['Output'] =
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.03958333333333333, subjectivity=0.5020833333333334)
Sentiment(polarity=0.16472802559759075, subjectivity=0.4024750611707134)
Error:
from textblob import TextBlob
def senti(x):
    return TextBlob(x).sentiment
df['Output'] = df['stop'].apply(senti)
df.Output.str.split(',|=', expand=True).iloc[:, [1, 3]]
IndexError: positional indexers are out-of-bounds
Outputs:
0 (0.0, 0.0)
1 (0.0028273809523809493, 0.48586309523809534)
2 (0.153726035868893, 0.5354359925788496)
3 (0.04357142857142857, 0.5319047619047619)
4 (0.07575757575757575, 0.28446969696969693)
...
92 (0.225, 0.39642857142857146)
93 (0.0, 0.0)
94 (0.5428571428571429, 0.6428571428571428)
95 (0.14393939393939395, 0.39999999999999997)
96 (0.35833333333333334, 0.5777777777777778)
Name: Output, Length: 97, dtype: object
df[['polarity', 'subjectivity']] = df.Output.str.split(',|=|\)',expand=True).iloc[:,[1,3]]
Result:
Output polarity subjectivity
0 Sentiment(polarity=0.0, subjectivity=0.0) 0.0 0.0
1 Sentiment(polarity=-0.03958333333333333, subje... -0.03958333333333333 0.5020833333333334
2 Sentiment(polarity=0.16472802559759075, subjec... 0.16472802559759075 0.4024750611707134
Try:
df['polarity']=df['Output'].str.extract(r"polarity=([-\.\d]+)")
df['subjectivity']=df['Output'].str.extract(r"subjectivity=([-\.\d]+)")
Outputs:
>>> df.iloc[:, -2:]
polarity subjectivity
0 0.0 0.0
1 -0.03958333333333333 0.5020833333333334
2 0.16472802559759075 0.4024750611707134
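A likely cause of the IndexError above (an assumption; the answers do not state it): apply(senti) stores Sentiment objects rather than strings, so .str.split() yields NaN for every row and expand=True produces a single column, which makes iloc[:, [1, 3]] out of bounds. Casting to str first sidesteps this, and the extracted values can then be made numeric:
df['Output'] = df['Output'].astype(str)
df[['polarity', 'subjectivity']] = (
    df.Output.str.split(',|=|\)', expand=True)
      .iloc[:, [1, 3]]
      .astype(float)
)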

creating lists from row data

My input data has the following format
id offset code
1 3 21
1 3 24
1 5 21
2 1 84
3 5 57
3 5 21
3 5 92
3 10 83
3 10 21
I would like the output in the following format
id offset code
1 [3,5] [[21,24],[21]]
2 [1] [[84]]
3 [5,10] [[21,57,92],[21,83]]
The code that I have been able to come up with is shown below
import random, pandas
random.seed(10000)
param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)
pd = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"]+1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"])
})
pd["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
pd = pd.sort_values(["id", "offset", "code"]).reset_index(drop=True)
tmp1 = pd.groupby(by=["id"])["offset"].apply(lambda x: list(set(x))).reset_index()
tmp2 = pd.groupby(by=["id", "offset"])["code"].apply(lambda x: list(x)).reset_index().groupby(
    by=["id"], sort=True)["code"].apply(lambda x: list(x)).reset_index()
out = pandas.merge(tmp1, tmp2, on="id", sort=False)
It does give me the output that I want, but it is VERY slow when the dataframe is large. The dataframe that I have has over 40 million rows. In the example, uncomment the fourth param statement and you will see how slow it is.
Can you please help with making this run faster?
(df.groupby(['id','offset']).code.apply(list).reset_index()
.groupby('id').agg(lambda x: x.tolist()))
Out[733]:
offset code
id
1 [3, 5] [[21, 24], [21]]
2 [1] [[84]]
3 [5, 10] [[57, 21, 92], [83, 21]]
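A variant that may shave some time on large frames by replacing the Python lambdas with the built-in list aggregation (a sketch; not benchmarked at 40 million rows):
(df.groupby(['id', 'offset'])['code'].agg(list).reset_index()
   .groupby('id').agg({'offset': list, 'code': list}))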
