How to convert bytes to a Buffer in Python - python-3.x
I'm retrieving a blob from an MSSQL server, which looks something like this:
b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(en-GB) /StructTreeRoot 30 0 R/MarkInfo<</Marked true>>/Metadata 820 0 R/ViewerPreferences 821 0 R>>\r\nendobj\r\n2 0 obj\r\n<</Type/Pages/Count 3/Kids[
3 0 R 18 0 R 25 0 R
] >>\r\nendobj\r\n3 0 obj\r\n<</Type/Page/Parent 2 0 R/Resources<</XObject<</Image5 5 0 R>>/ExtGState<</GS6 6 0 R/GS9 9 0 R>>/Font<</F1 7 0 R/F2 10 0 R/F3 12 0 R/F4 14 0 R/F5 16 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI
] >>/MediaBox[
0 0 595.44 841.68
] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<</Filter/FlateDecode/Length 7807>>\r\nstream\r\nx\x9c\xcd
]ms\xdc6\x92\xfe\xee*\xff\x07~\xba\x9a\xd9=Q\x04\xdf\xb9\xb5\xb7U\x8e\x9d\xb7\xdd8qb\xed\xa6R\xc9~P\xa4\x91\xadD\x96|\x92\xbc[\xfe\xf7\xd7\xddh\x80\x04\x89\x060\xc3\xf1\xdd\xa5\xca\x13I\xec\xe9~\xd8h4\x1a\x0f#\xf0\xf4\xd9\xfd\xe3\xf5\xd5\xf9\xc5c\xf6\xe7?\x9f>{|<\xbfx\xbb\xbb\xcc~>=\xbb{\xff\xcf\xd3\xb3\x8f\xefw\xa7\xaf\xce\xdf\\\xdf\x9e?^\xdf\xdd\x9e\xbe\xfe\xf0\xeb#\xfe\xe9\xab\xdd\xf9\xe5\xee\xfe/\x7f\xc9>{\xf1<;\xfd\xf2u\x9b\xbdyx\xfa\xe4\xbf\x9f>)\xb2"/
When I print the type of the above blob it shows bytes. I want to convert it to a Buffer, which looks something like this:

{"type":"Buffer","data":[37,80,68,70,45,49,46,55,13,10,37,181,181,181,181,13,10,49,32,48,32,111,98,106,13,10,60,60,47,84,121,112,101,47,67,97,116,97,108,111,103,47,80,97,103,101,115,32,50,32,48,32,82,47,76,97,110,103,40,101,110,45,71,66,41,32,47,83,116,114,117,99,116,84,114,101,101,82,111,111,116,32,51,48,32,48,32,82,47,77,97,114,107,73,110,102,111,60,60,47,77,97,114,107,101,100,32,116,114,117,101,62,62,47,77,101,116,97,100,97,116,97,32,56,50,48,32,48,32,82,47,86,105,101,119,101,114,80,114,101,102,101,114,101,110,99,101,115,32,56,50,49,32,48,32,82,62,62,13,10,101,110,100,111,98,106,13,10,50,32,48,32,111,98,106,13,10,60,60,47,84,121,112,101,47,80,97,103,101,115,47,67,111,117,110,116,32,51,47,75,105,100,115,91,32,51,32,48,32,82,32,49,56,32,48,32,82,32,50,53,32,48,32,82,93,32,62,62,13,10,101,110,100,111,98,106,13,10,51,32,48,32,111,98,106,13}

Thanks in advance.
Python doesn't have an equivalent of the "Buffer" type, which is a Node.js invention. However, if you want to export the bytes object as JSON in a Node.js-compatible way, you can use the built-in json module:
>>> import json
>>> data = b'hello' # your data goes here
>>> json.dumps({ 'type': 'Buffer', 'data': list(data) })
'{"type": "Buffer", "data": [104, 101, 108, 108, 111]}'
Related
How to get function output to add columns to my Dataframe
I have a function that produces an output like so when I pass it a name:

W2V('aamir')
array([ 0.12135  , -0.99132  ,  0.32347  ,  0.31334  ,  0.97446  , -0.67629  ,
        0.88606  , -0.11043  ,  0.79434  ,  1.4788   ,  0.53169  ,  0.95331  ,
       -1.1883   ,  0.82438  , -0.027177 ,  0.70081  ,  0.87467  , -0.095825 ,
       -0.5937   ,  1.4262   ,  0.2187   ,  1.1763   ,  1.6294   ,  0.91717  ,
       -0.086697 ,  0.16529  ,  0.19095  , -0.39362  , -0.40367  ,  0.83966  ,
       -0.25251  ,  0.46286  ,  0.82748  ,  0.93061  ,  1.136    ,  0.85616  ,
        0.34705  ,  0.65946  , -0.7143   ,  0.26379  ,  0.64717  ,  1.5633   ,
       -0.81238  , -0.44516  , -0.2979   ,  0.52601  , -0.41725  ,  0.086686 ,
        0.68263  , -0.15688  ], dtype=float32)

I have a data frame that has an index Name and a single column Y:

df1
        Y
Name
aamir   0
aaron   0
...    ..
zulema  1
zuzana  1

I wish to run my function on each value of Name and have it create columns like so:

              0        1        2        3        4        5        6        7         8        9  ...        40       41       42        43       44       45       46        47       48       49
Name
aamir   0.12135 -0.99132  0.32347  0.31334  0.97446 -0.67629  0.88606 -0.11043  0.794340  1.47880  ...  0.647170  1.56330 -0.81238 -0.445160 -0.29790  0.52601 -0.41725  0.086686  0.68263 -0.15688
aaron  -1.01850  0.80951  0.40550  0.09801  0.50634  0.22301 -1.06250 -0.17397 -0.061715  0.55292  ... -0.144960  0.82696 -0.51106 -0.072066  0.43069  0.32686 -0.00886 -0.850310 -1.31530  0.71631
...         ...      ...      ...      ...      ...      ...      ...      ...       ...      ...  ...       ...      ...      ...       ...      ...      ...      ...       ...      ...      ...
zulema  0.56547  0.30961  0.48725  1.41000 -0.76790  0.39908  0.86915  0.68361 -0.019467  0.55199  ...  0.062091  0.62614  0.44548 -0.193820 -0.80556 -0.73575 -0.30031 -1.278900  0.24759 -0.55541
zuzana -1.49480 -0.15111 -0.21853  0.77911  0.44446  0.95019  0.40513  0.26643  0.075182 -1.34340  ...  1.102800  0.51495  1.06230 -1.587600 -0.44667  1.04600 -0.38978  0.741240  0.39457  0.22857

What I have done is real messy, but works:

names = df1.index.to_list()
Lst = []
for name in names:
    Lst.append(W2V(name).tolist())
wv_df = pd.DataFrame(index=names, data=Lst)
wv_df.index.name = "Name"
wv_df.sort_index(inplace=True)
df1 = df1.merge(wv_df, how='inner', left_index=True, right_index=True)

I am hoping there is a way to use .apply() or similar, but I have not found how to do this. I am looking for an efficient way.

Update: I modified my function to do like so:

if isinstance(w, pd.core.series.Series):
    w = w.to_string()

Although this appears to work at first, the data is wrong. If I pass aamir to my function you can see the result. Yet when I do it with apply, the numbers are totally different:

df1
        Name  Y
0      aamir  0
1      aaron  0
...      ... ..
7942  zulema  1
7943  zuzana  1

df3 = df1.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')

             0         1         2         3         4         5         6         7         8         9  ...        40        41        42        43        44        45        46        47        48        49
0     0.075014  0.824769  0.580976  0.493415  0.409894  0.142214  0.202602 -0.599501 -0.213184 -0.142188  ...  0.627784  0.136511 -0.162938  0.095707 -0.257638  0.396822  0.208624 -0.454204  0.153140  0.803400
1     0.073664  0.868665  0.574581  0.538951  0.394502  0.134773  0.233070 -0.639365 -0.194892 -0.110557  ...  0.722513  0.147112 -0.239356 -0.046832 -0.237434  0.321494  0.206583 -0.454038  0.251605  0.918388
...        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
7942 -0.002117  0.894570  0.834724  0.602266  0.327858 -0.003092  0.197389 -0.675813 -0.311369 -0.174356  ...  0.690172 -0.085517 -0.000235 -0.214937 -0.290900  0.361734  0.290184 -0.497177  0.285071  0.711388
7943 -0.047621  0.850352  0.729225  0.515870  0.439999  0.060711  0.226026 -0.604846 -0.344891 -0.128396  ...  0.557035 -0.048322 -0.070075 -0.265775 -0.330709  0.281492  0.304157 -0.552191  0.281502  0.750304

7944 rows × 50 columns

You can see that the first row is aamir and the first value (column 0) my function returns is 0.1213 (you can see this at the top of my post). Yet with apply that value appears as 0.075014.

EDIT: It appears it passes in "Name aamir" rather than "aamir". How can I get it to just send the Name itself, aamir?
Let's say we have some function which transforms a string into a vector of a fixed size, for example:

import numpy as np

def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)

Also a data frame is given with a meaningful index and junk data:

import pandas as pd

names = pd.Index(['aamir', 'aaron', 'zulema', 'zuzana'], name='Name')
df = pd.DataFrame(index=names).assign(Y=0)

When we apply some function to a DataFrame along columns, i.e. axis=1, its argument is going to be a row as a Series, whose name is the index of that row. So we could do something like this:

output = df.apply(lambda row: W2V(row.name), axis=1, result_type='expand')

With result_type='expand', returned vectors will be transformed into columns, which is the required output.

P.S. As an option:

df = pd.DataFrame.from_dict({n: W2V(n) for n in names}, orient='index')

P.P.S. IMO, the behavior you describe means that your function can operate not only on a str but also on some common sequence, for example on a Series of strings. In the case of the code:

df.reset_index().drop('Y', axis=1).apply(W2V, axis=1, result_type='expand')

the function W2V receives not "a name" as a string but pd.Series(["a name"]). If we do not check the type of the passed parameter inside the function, then we can get a silent error, which in this case appears as different output data.
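To make that failure loud instead of silent, a type guard at the top of the function is one option. A hypothetical sketch; W2V_safe and its error message are illustrative names, not part of the original code:

import numpy as np
import pandas as pd

def W2V_safe(name) -> np.ndarray:
    # Hypothetical guard: refuse anything that is not a plain string,
    # so an accidentally passed Series raises instead of silently
    # producing a different vector
    if not isinstance(name, str):
        raise TypeError("W2V expects a str, got {}".format(type(name).__name__))
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)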
I don't know if this is any better than the other suggestions, but I would use apply to create another n-column dataframe (where n is the length of the array returned by the W2V function) and then concatenate it to the original dataframe. This first section generates toy versions of your W2V function and your dataframe.

import pandas as pd
from random import random

# substitute your W2V function for this:
n = 5
def W2V(name: str):
    return [random() for i in range(n)]

# substitute your 2-column dataframe for this:
df1 = pd.DataFrame(data={'Name': ['aamir', 'aaron', 'zulema', 'zuzana'],
                         'Y': [0, 0, 1, 1]},
                   index=list(range(4)))

df1 is

     Name  Y
0   aamir  0
1   aaron  0
2  zulema  1
3  zuzana  1

You want to make a second dataframe that applies W2V to every name in the first dataframe. To generate your column numbers, I'm just using a list comprehension that generates [0, 1, ..., n-1], where n is the length of the array returned by W2V.

df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']), index=[i for i in range(n)]), axis=1)

My random-valued df2 is

          0         1         2         3         4
0  0.242761  0.415253  0.940213  0.074455  0.444372
1  0.935781  0.968155  0.850091  0.064548  0.737655
2  0.204053  0.845252  0.967767  0.352254  0.028609
3  0.853164  0.698195  0.292238  0.982009  0.402736

Then concatenate the new dataframe to the old one:

df3 = pd.concat([df1, df2], axis=1)

df3 is

     Name  Y         0         1         2         3         4
0   aamir  0  0.242761  0.415253  0.940213  0.074455  0.444372
1   aaron  0  0.935781  0.968155  0.850091  0.064548  0.737655
2  zulema  1  0.204053  0.845252  0.967767  0.352254  0.028609
3  zuzana  1  0.853164  0.698195  0.292238  0.982009  0.402736

Alternatively, you could do both steps in one line as:

df1 = pd.concat([df1, df1.apply(lambda x: pd.Series(W2V(x['Name']), index=[i for i in range(n)]), axis=1)], axis=1)
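A small simplification, for what it's worth: pd.Series accepts a range directly, so the list comprehension for the index isn't needed. Same behavior, just shorter:

df2 = df1.apply(lambda x: pd.Series(W2V(x['Name']), index=range(n)), axis=1)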
You can try something like this, using map and np.vstack with a dataframe constructor, then join:

df.join(pd.DataFrame(np.vstack(df.index.map(W2V)), index=df.index))

Output:

   Y  0  1  2  3  4  5  6  7  8  9
A  0  4  0  2  1  0  0  0  0  3  3
B  1  4  0  0  4  4  3  4  3  4  3
C  2  1  5  5  5  3  3  1  3  5  0
D  3  3  5  1  3  4  2  3  1  0  1
E  4  4  0  2  4  4  0  3  3  4  2
F  5  4  3  5  1  0  2  3  2  5  2
G  6  4  5  2  0  0  2  4  3  4  3
H  7  0  2  5  2  3  4  3  5  3  1
I  8  2  2  0  1  4  2  4  1  0  4
J  9  0  2  3  5  0  3  0  2  4  0

Using @Vitalizzare's function:

def W2V(name: str) -> np.ndarray:
    low, high, size = 0, 5, 10
    rng = np.random.default_rng(abs(hash(name)))
    return rng.integers(low, high, size, endpoint=True)

df = pd.DataFrame({'Y': np.arange(10)}, index=[*'ABCDEFGHIJ'])
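If you'd rather avoid Index.map, the same construction can be written with a plain list comprehension; a sketch, and the behavior should be identical:

vectors = np.vstack([W2V(name) for name in df.index])  # one row per index label
out = df.join(pd.DataFrame(vectors, index=df.index))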
I am going off the names being the index, and there being a useless column called 0. I think this may be the solution, but there's no way to know without your function or the names:

df.reset_index().drop(0, axis=1).apply(my_func, axis=1, result_type='expand')
I would do simply:

newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)

Example

To start with, let us make some function w2v(name). In the following, we compute a consistent hash of any string. Then we use that hash as a (temporary) seed for np.random, and then draw a random vector of size=50:

import numpy as np
import pandas as pd
from contextlib import contextmanager

@contextmanager
def temp_seed(seed):
    state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)

mask = (1 << 32) - 1

def w2v(name, size=50):
    fingerprint = int(pd.util.hash_array(np.array([name])))
    with temp_seed(fingerprint & mask):
        return np.random.uniform(-1, 1, size)

For instance:

>>> w2v('aamir')
array([ 0.65446901, -0.92765123, -0.78188552, -0.62683782, -0.23946784,
        0.31315156,  0.22802972, -0.96076167,  0.62577993, -0.59024811,
        0.76365736,  0.93033898, -0.56155296,  0.4760905 , -0.92760642,
        0.00177959, -0.22761559,  0.81929959,  0.21138229, -0.49882747,
       -0.97637984, -0.19452496, -0.91354933,  0.70473533, -0.30394358,
       -0.47092087, -0.0329302 , -0.93178517,  0.79118799,  0.98286834,
       -0.16024194, -0.02793147, -0.52251214, -0.70732759,  0.10098142,
       -0.24880249,  0.28930319, -0.53444863,  0.37887522,  0.58544068,
        0.85804119,  0.67048213,  0.58389158, -0.19889071, -0.04281131,
       -0.62506126,  0.42872395, -0.12821543, -0.52458052, -0.35493892])

Now, we use the expression given as the solution:

df = pd.DataFrame([0, 0, 1, 1], index=['aamir', 'aaron', 'zulema', 'zuzana'])
newdf = pd.DataFrame(df.index.to_series().apply(w2v).tolist(), index=df.index)

>>> newdf
               0         1         2         3         4         5         6  ...
aamir   0.654469 -0.927651 -0.781886 -0.626838 -0.239468  0.313152  0.228030  ...
aaron  -0.380524 -0.850608 -0.914642 -0.578885  0.177975 -0.633761 -0.736234  ...
zulema -0.250957  0.882491 -0.197833 -0.707652  0.754575  0.731236 -0.770831  ...
zuzana -0.641296  0.065898  0.466784  0.652776  0.391865  0.918761  0.022798  ...
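A note on the design: temp_seed exists only to leave the global np.random state untouched. With NumPy's newer Generator API the same effect falls out naturally, since each Generator carries its own private state. A sketch under that assumption (w2v_rng is an illustrative name, and the drawn values will differ from the legacy seeding above):

def w2v_rng(name, size=50):
    # Each call builds a private Generator seeded from the name's hash,
    # so the global np.random state is never touched
    fingerprint = int(pd.util.hash_array(np.array([name])))
    return np.random.default_rng(fingerprint & mask).uniform(-1, 1, size)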
Aggregating the total count with JSON data
I have an API endpoint which has the details of the confirmed / recovered / tested counts for each state: https://data.covid19india.org/v4/min/data.min.json I would like to aggregate the total count of confirmed / recovered / tested across each state. What is the easiest way to achieve these results?
To write the final results in pandas, we can proceed by adding this to the code:

import pandas as pd

columns = ('Confirmed', 'Deceased', 'Recovered', 'Tested')
Panda = pd.DataFrame(data=StateWiseData).T  # T for transpose
print(Panda)

The output will be:

    confirmed  deceased  recovered    tested
AN       7557       129       7420         0
AP    2003342     13735    1975448   9788047
AR      52214       259      50695    398545
AS     584434      5576     570847    326318
BR     725588      9649     715798  17107895
CH      65066       812      64213    652657
CT    1004144     13553     989728    338344
DL    1437334     25079    1411881  25142853
DN      10662         4      10620     72410
GA     173221      3186     169160         0
GJ     825302     10079     815041  10900176
HP     211746      3553     206094    481328
HR     770347      9667     760004   3948145
JH     347730      5132     342421    233773
JK     324295      4403     318838    139552
KA    2939767     37155    2882331   9791334
KL    3814305     19494    3631066   3875002
LA      20491       207      20223    110068
LD      10309        51      10194    234256
MH    6424651    135962    6231999   8421643
ML      74070      1281      69859         0
MN     111212      1755     105751     13542
MP     792101     10516     781499   3384824
MZ      52472       200      46675         0
NL      29589       610      27151    116359
OR    1001698      7479     986334   2774807
PB     600266     16352     583426   2938477
PY     122934      1808     120330    567923
RJ     954023      8954     944917   5852578
SK      29340       367      27185         0
TG     654989      3858     644747         0
TN    2600885     34709    2547005   4413963
TR      82092       784      80150    607962
TT          0         0          0         0
UP    1709119     22792    1685954  23724581
UT     342749      7377     329006   2127358
WB    1543496     18371    1515789         0
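If a nationwide total is also wanted on top of the per-state rows, summing the columns of that frame is one option; a small sketch, assuming the Panda frame built above:

print(Panda.sum())  # column-wise totals across all states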
Yes, my interpretation was incorrect earlier. We have to get the district totals and add them up.

import json

file = open('data.min.json')
dictionary = json.load(file)

stateCodes = ['AN', 'AP', 'AR', 'AS', 'BR', 'CH', 'CT', 'DL', 'DN', 'GA', 'GJ', 'HP', 'HR', 'JH', 'JK', 'KA', 'KL', 'LA', 'LD', 'MH', 'ML', 'MN', 'MP', 'MZ', 'NL', 'OR', 'PB', 'PY', 'RJ', 'SK', 'TG', 'TN', 'TR', 'TT', 'UP', 'UT', 'WB']

StateWiseData = {}
for state in stateCodes:
    StateInfo = dictionary[state]
    Confirmed = 0
    Recovered = 0
    Tested = 0
    Deceased = 0
    StateData = {}
    if "districts" in StateInfo:
        for District in StateInfo['districts']:
            DistrictInfo = StateInfo['districts'][District]['total']
            # the type checks guard against malformed (non-integer) entries
            if 'confirmed' in DistrictInfo:
                if type(Confirmed) == type(DistrictInfo['confirmed']):
                    Confirmed += DistrictInfo['confirmed']
            if 'recovered' in DistrictInfo:
                if type(Recovered) == type(DistrictInfo['recovered']):
                    Recovered += DistrictInfo['recovered']
            if 'tested' in DistrictInfo:
                if type(Tested) == type(DistrictInfo['tested']):
                    Tested += DistrictInfo['tested']
            if 'deceased' in DistrictInfo:
                if type(Deceased) == type(DistrictInfo['deceased']):
                    Deceased += DistrictInfo['deceased']
    StateData['confirmed'] = Confirmed
    StateData['deceased'] = Deceased
    StateData['recovered'] = Recovered
    StateData['tested'] = Tested
    StateWiseData[state] = StateData

print(StateWiseData)
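The same accumulation can be written more compactly with dict.get and isinstance, which avoids the repeated per-field blocks; a sketch, assuming the same data.min.json layout and the stateCodes list above:

totals = {}
for state in stateCodes:
    agg = {'confirmed': 0, 'deceased': 0, 'recovered': 0, 'tested': 0}
    # .get(...) returns an empty dict when a state has no district breakdown
    for district in dictionary[state].get('districts', {}).values():
        district_total = district.get('total', {})
        for key in agg:
            value = district_total.get(key, 0)
            if isinstance(value, int):  # skip malformed entries, as above
                agg[key] += value
    totals[state] = agg
print(totals)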
Pandas operation, putting multiple results into a df column
This is quant work. I used my previous work to filter out some desired stocks (candidates) with technical analysis based on moving averages, MACD, KDJ, etc. Now I want to check my candidates' fundamental values, in this case ROE. Here is my code:

root_path = '.\\__fundamentals__'
df = pd.DataFrame(pd.read_csv("C:\\candidates.csv", encoding='GBK'))  # 14 candidates this time
for code in list(df['code']):
    i = str(code).zfill(6)
    for root, dirs, files in os.walk(root_path):
        for csv in files:
            if csv.startswith('{}.csv'.format(i)):
                csv_path = os.path.join(root, csv)
                # based on candidates, look up the DuPont values
                df2 = pd.DataFrame(pd.read_csv("{}".format(csv_path), encoding='GBK'))
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float) / 100  # 净资产收益率 = return on equity
                ROE = [df2['ROE'].mean().round(decimals=3)]
                df3 = pd.DataFrame({'ROE_Mean': ROE})
                print(df3)

Here is the output:

C:\Users\Mike_Leigh\.conda\envs\LEIGH\python.exe "P:/LEIGH PYTHON/Codes/Quant/analyze_stock.py"
   ROE_Mean
0    -0.218
   ROE_Mean
0     0.121
   ROE_Mean
0     0.043
   ROE_Mean
0     0.197
   ROE_Mean
0     0.095
   ROE_Mean
0     0.085
...
   ROE_Mean
0     0.178

Process finished with exit code 0

My desired output would be like this:

    ROE_Mean
0     -0.218
1      0.121
2      0.043
3      0.197
4      0.095
5      0.085
...
14     0.178

Would you please give me a hint on this? Thanks a lot, much appreciated!
Actually, solving the issue myself wasn't that bad. First, make a list outside the loop, at the very outside of the loop, in this case before df:

roe_avg = []
df = pd.DataFrame(pd.read_csv("C:\\candidates.csv", encoding='GBK'))
....
                df2['ROE'] = df2['净资产收益率'].str.strip("%").astype(float) / 100
                ROE_avg = df2['ROE'].mean().round(decimals=3)
                roe_avg.append(ROE_avg)
df['ROE_avg'] = roe_avg
print(df)

Output:

      name    code  ROE_avg
1     仙鹤股份  603733    0.121
3     泸州老窖     568    0.197
4     兴蓉环境     598    0.095
...    ...     ...      ...
15    濮阳惠成  300481    0.148
16    中科创达  300496    0.101
17    森霸传感  300701    0.178

Process finished with exit code 0

Thanks to @filippo.
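One caveat with appending inside a nested loop: df['ROE_avg'] = roe_avg only lines up if exactly one CSV is found per candidate. A hypothetical variant that keys each result by code and merges stays correct even when a file is missing (the codes and values below are illustrative, taken from the output above):

import pandas as pd

# hypothetical records collected inside the loop, one dict per matched CSV:
# records.append({'code': code, 'ROE_avg': ROE_avg})
records = [{'code': 603733, 'ROE_avg': 0.121}, {'code': 300701, 'ROE_avg': 0.178}]
df = pd.DataFrame({'code': [603733, 300701, 300496]})
df = df.merge(pd.DataFrame(records), on='code', how='left')  # NaN where no CSV matched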
Pandas: MID & FIND Function
I have a column in my dataframe that shows different combinations of the values below. I know that I could use the .str[:3] function and then convert this to a value, but the differing string lengths are throwing me off. How would I do a MID(x,FIND(",",x,1)+1,10)-style function on this column to find the sentiment and subjectivity values?

String samples:

df['Output'] =
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.03958333333333333, subjectivity=0.5020833333333334)
Sentiment(polarity=0.16472802559759075, subjectivity=0.4024750611707134)

Error:

def senti(x):
    return TextBlob(x).sentiment

df['Output'] = df['stop'].apply(senti)
df.Output.str.split(',|=', expand=True).iloc[:, [1, 3]]

IndexError: positional indexers are out-of-bounds

Outputs:

0                                        (0.0, 0.0)
1     (0.0028273809523809493, 0.48586309523809534)
2           (0.153726035868893, 0.5354359925788496)
3         (0.04357142857142857, 0.5319047619047619)
4       (0.07575757575757575, 0.28446969696969693)
                         ...
92                     (0.225, 0.39642857142857146)
93                                       (0.0, 0.0)
94        (0.5428571428571429, 0.6428571428571428)
95      (0.14393939393939395, 0.39999999999999997)
96       (0.35833333333333334, 0.5777777777777778)
Name: Output, Length: 97, dtype: object
df[['polarity', 'subjectivity']] = df.Output.str.split(r',|=|\)', expand=True).iloc[:, [1, 3]]

Result:

                                              Output              polarity        subjectivity
0          Sentiment(polarity=0.0, subjectivity=0.0)                   0.0                 0.0
1  Sentiment(polarity=-0.03958333333333333, subje...  -0.03958333333333333  0.5020833333333334
2  Sentiment(polarity=0.16472802559759075, subjec...   0.16472802559759075  0.4024750611707134
Try:

df['polarity'] = df['Output'].str.extract(r"polarity=([-\.\d]+)")
df['subjectivity'] = df['Output'].str.extract(r"subjectivity=([-\.\d]+)")

Outputs:

>>> df.iloc[:, -2:]
               polarity        subjectivity
0                   0.0                 0.0
1  -0.03958333333333333  0.5020833333333334
2   0.16472802559759075  0.4024750611707134
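Since TextBlob's sentiment is a namedtuple rather than a string, the parsing step can be skipped entirely by expanding it before it is ever stringified; a sketch, assuming df['stop'] holds the raw text as in the question:

from textblob import TextBlob
import pandas as pd

# expand the (polarity, subjectivity) namedtuple straight into two columns
df[['polarity', 'subjectivity']] = df['stop'].apply(
    lambda text: pd.Series(TextBlob(text).sentiment))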
Creating lists from row data
My input data has the following format:

id  offset  code
1   3       21
1   3       24
1   5       21
2   1       84
3   5       57
3   5       21
3   5       92
3   10      83
3   10      21

I would like the output in the following format:

id  offset  code
1   [3,5]   [[21,24],[21]]
2   [1]     [[84]]
3   [5,10]  [[21,57,92],[21,83]]

The code that I have been able to come up with is shown below:

import random, pandas

random.seed(10000)
param = dict(nrow=100, nid=10, noffset=8, ncode=100)
#param = dict(nrow=1000, nid=10, noffset=8, ncode=100)
#param = dict(nrow=100000, nid=1000, noffset=50, ncode=5000)
#param = dict(nrow=10000000, nid=10000, noffset=100, ncode=5000)

pd = pandas.DataFrame({
    "id": random.choices(range(1, param["nid"] + 1), k=param["nrow"]),
    "offset": random.choices(range(param["noffset"]), k=param["nrow"])
})
pd["code"] = random.choices(range(param["ncode"]), k=param["nrow"])
pd = pd.sort_values(["id", "offset", "code"]).reset_index(drop=True)

tmp1 = pd.groupby(by=["id"])["offset"].apply(lambda x: list(set(x))).reset_index()
tmp2 = pd.groupby(by=["id", "offset"])["code"].apply(lambda x: list(x)).reset_index().groupby(
    by=["id"], sort=True)["code"].apply(lambda x: list(x)).reset_index()
out = pandas.merge(tmp1, tmp2, on="id", sort=False)

It does give me the output that I want, but it is VERY slow when the dataframe is large. The dataframe that I have has over 40 million rows. In the example, uncomment the fourth param statement and you will see how slow it is. Can you please help with making this run faster?
(df.groupby(['id', 'offset']).code.apply(list).reset_index()
   .groupby('id').agg(lambda x: x.tolist()))

Out[733]:
     offset                      code
id
1    [3, 5]          [[21, 24], [21]]
2       [1]                    [[84]]
3   [5, 10]  [[57, 21, 92], [83, 21]]
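On tens of millions of rows, two small tweaks may shave some time off the same approach: skip the group sorting and drop the lambda on the second groupby. A sketch, untimed, and note the result order may differ from the sorted version:

out = (df.groupby(['id', 'offset'], sort=False).code.apply(list).reset_index()
         .groupby('id', sort=False).agg(list))  # agg(list) replaces the lambda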