Column level parsing in pandas data frame - python-3.x

Currently I am working with 20M records with 5 columns. My data frame looks like -
tran_id id code
123 1 1759#1#83#0#1362#0.2600#25.7400#2.8600#1094#1#129.6#14.4
254 1 1356#0.4950#26.7300#2.9700
831 2 1354#1.78#35.244#3.916#1101#2#40#0#1108#2#30#0
732 5 1430#1#19.35#2.15#1431#3#245.62#60.29#1074#12#385.2#58.8#1109
141 2 1809#8#75.34#292.66#1816#4#24.56#95.44#1076#47#510.89#1110.61
Desired output -
id new_code
1 1759
1 1362
1 1094
1 1356
2 1354
2 1101
2 1108
5 1430
5 1431
5 1074
5 1109
2 1809
2 1816
2 1076
What I have done so far -
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dd= pd.DataFrame({'col' : d["code"].apply(lambda x: re.split('[# # ]', x))})
dd.head()
s = dd['col'].str[:]
dd= pd.DataFrame(s.values.tolist())
dd.head()
cols = range(len(list(dd)))
num_cols = len(list(dd))
new_cols = ['col' + str(i) for i in cols]
dd.columns = new_cols[:num_cols]
Just remember the size of the data is huge: 20 million rows, so I can't do any looping.
Thanks in advance

You can use Series.str.findall to extract the 4-digit integers between separators:
#https://stackoverflow.com/a/55096994/2901002
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
#alternative
#s = df['code'].str.replace('#', ' ').str.findall(r'(?<!\S)\d{4}(?!\S)')
Then create a new DataFrame by repeating id with numpy.repeat per str.len, and flatten the lists of codes with chain.from_iterable:
from itertools import chain

df = pd.DataFrame({
    'id': df['id'].values.repeat(s.str.len()),
    'new_code': list(chain.from_iterable(s.tolist()))
})
print (df)
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076

An alternative approach uses Series.str.extractall with a different regex pattern (note in the output below that it also captures 1110 from 1110.61, which the first pattern excludes):
(df.set_index('id').code.str.extractall(r'(?:[^\.]|^)(?P<new_code>\d{4})')
   .reset_index(0)
   .reset_index(drop=True)
)
[out]
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076
14 2 1110
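On pandas 0.25+, the same reshape can also be written with `Series.explode`; a minimal self-contained sketch (not from the original thread) using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'tran_id': [123, 254, 831, 732, 141],
    'id': [1, 1, 2, 5, 2],
    'code': [
        '1759#1#83#0#1362#0.2600#25.7400#2.8600#1094#1#129.6#14.4',
        '1356#0.4950#26.7300#2.9700',
        '1354#1.78#35.244#3.916#1101#2#40#0#1108#2#30#0',
        '1430#1#19.35#2.15#1431#3#245.62#60.29#1074#12#385.2#58.8#1109',
        '1809#8#75.34#292.66#1816#4#24.56#95.44#1076#47#510.89#1110.61',
    ],
})

# extract the 4-digit tokens delimited by '#' (or string boundaries),
# then explode one code per row while keeping the matching id
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
out = (df[['id']].assign(new_code=s)
         .explode('new_code')
         .reset_index(drop=True))
print(out)
```

`explode` keeps the index alignment with `id` for free, so no manual `repeat`/`chain` bookkeeping is needed; the original `repeat` approach avoids the per-row object overhead and may still be faster at 20M rows.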

Related

Create a new column to indicate dataframe split at least n-rows groups in Python [duplicate]

This question already has an answer here:
Pandas - Fill N rows for a specific column with a integer value and increment the integer there after
Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I hope to create a new column Group to indicate groups with a range of integers starting from 1; each group should have 3 rows, except for the last group, which may have fewer than 3 rows.
The final result will look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How could I achieve that with Pandas or Numpy? Thanks.
My trial code:
n = 3
for g, df in df.groupby(np.arange(len(df)) // n):
    print(df.shape)
You are close; instead of looping over a groupby, assign the integer-division result directly to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print (df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
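As a self-contained check of the integer-division trick (using only the Value column from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [11, 8, 10, 15, 110, 60, 100, 40]})

n = 3
# every consecutive block of n rows gets the same label; +1 makes labels start at 1
df['Group'] = np.arange(len(df)) // n + 1
print(df['Group'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Because 8 is not a multiple of 3, the last group automatically gets only 2 rows, as the question requires.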

Add workdays to pandas df date columns based of other column

Is there a way to increment a date field in a pandas data frame by the number of working days specified in another column?
import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({'Date': pd.date_range(start=dt.datetime(2020, 7, 1), end=dt.datetime(2020, 7, 10))})
df['Offset'] = np.random.randint(0, 10, len(df))
Date Offset
0 2020-07-01 9
1 2020-07-02 4
2 2020-07-03 0
3 2020-07-04 1
4 2020-07-05 9
5 2020-07-06 0
6 2020-07-07 1
7 2020-07-08 8
8 2020-07-09 9
9 2020-07-10 0
I would expect this to work, but it throws an error:
df['Date'] + pd.tseries.offsets.BusinessDay(n = df['Offset'])
TypeError: n argument must be an integer, got <class 'pandas.core.series.Series'>
pd.to_timedelta does not support working days.
Like I mentioned in my comment, you are trying to pass an entire Series as an integer. Instead, you want to apply the function row-wise:
df['your_answer'] = df.apply(lambda x:x['Date'] + pd.tseries.offsets.BusinessDay(n= x['Offset']), axis=1)
df
Date Offset your_answer
0 2020-07-01 9 2020-07-14
1 2020-07-02 7 2020-07-13
2 2020-07-03 3 2020-07-08
3 2020-07-04 2 2020-07-07
4 2020-07-05 7 2020-07-14
5 2020-07-06 7 2020-07-15
6 2020-07-07 7 2020-07-16
7 2020-07-08 2 2020-07-10
8 2020-07-09 1 2020-07-10
9 2020-07-10 0 2020-07-10
Line of code broken down:
# notice how this returns every value of that column
df.apply(lambda x:x['Date'], axis=1)
0 2020-07-01
1 2020-07-02
2 2020-07-03
3 2020-07-04
4 2020-07-05
5 2020-07-06
6 2020-07-07
7 2020-07-08
8 2020-07-09
9 2020-07-10
# same thing with `Offset`
df.apply(lambda x:x['Offset'], axis=1)
0 9
1 7
2 3
3 2
4 7
5 7
6 7
7 2
8 1
9 0
Since pd.tseries.offsets.BusinessDay(n=foo_bar) takes an integer and not a Series, we use the two columns together inside apply(); it is as if you looped each number in the Offset column through the offsets.BusinessDay() function.
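If the row-wise apply is too slow on a large frame, NumPy's np.busday_offset can do the same shift vectorized; a sketch (an alternative, not from the thread; note busday_offset assumes a Mon-Fri workweek and needs a roll policy for dates that fall on a weekend):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-07-01', '2020-07-02', '2020-07-03']),
    'Offset': [9, 4, 0],
})

# roll='forward' first moves any non-business start date to the next
# business day, then the offset (counted in business days) is applied
result = np.busday_offset(df['Date'].values.astype('datetime64[D]'),
                          df['Offset'].values, roll='forward')
df['shifted'] = pd.to_datetime(result)
print(df)
```

For the first row this matches the apply-based answer above: 2020-07-01 (a Wednesday) plus 9 business days lands on 2020-07-14.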

Why does the number of processes affect the output format when using multiprocessing.Pool(N)

I have taken this code and modified it slightly to suit Python 3.8.
The issue I am having is that the output initially has some sort of Wingdings-like characters in it; I suspect this is a newline character being incorrectly converted. See the output snippets below.
I couldn't work this out, so before spending too much time on it I tested the program with different numbers of processes. For some reason the output changes when I increase this.
import multiprocessing
from textwrap import dedent
from itertools import zip_longest

def process_chunk(d):
    # test function, change this later
    return d

def grouper(n, iterable, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

outfile = open("C:/Users/#####/Desktop/test.txt", "w+")

if __name__ == '__main__':
    # open input file
    test_data = open('D:/test.txt')
    # Create pool (p)
    p = multiprocessing.Pool(4)
    for chunk in grouper(1000, test_data):
        results = p.map(process_chunk, chunk)
        for r in results:
            outfile.write(f'{r}')
The samples below are from the end of the file, so I suspect the 'None' and 敮 output is just padding from the last chunk. Expected output:
5.2615 19.522 -0.968 3 134 120 124
5.9195 19.695 -0.828 49 197 192 170
6.0985 19.192 -0.984 0 150 137 130
5.2255 19.915 -0.939 3 92 92 81
6.3825 19.286 -1.166 5 100 99 92
5.8965 19.705 -0.411 67 211 209 205
With multiprocessing.Pool(4) (same output for N=2 to N=10)
5.9195 19.695 -0.828 49 197 192 170਍ഀ
6.0985 19.192 -0.984 0 150 137 130਍ഀ
5.2255 19.915 -0.939 3 92 92 81਍ഀ
6.3825 19.286 -1.166 5 100 99 92਍ഀ
5.8965 19.705 -0.411 67 211 209 205਍ഀ
潎
With multiprocessing.Pool(12) (same output for N=11 to N=24)
5 . 2 6 1 5 1 9 . 5 2 2 - 0 . 9 6 8 3 1 3 4 1 2 0 1 2 4
5 . 9 1 9 5 1 9 . 6 9 5 - 0 . 8 2 8 4 9 1 9 7 1 9 2 1 7 0
6 . 0 9 8 5 1 9 . 1 9 2 - 0 . 9 8 4 0 1 5 0 1 3 7 1 3 0
5 . 2 2 5 5 1 9 . 9 1 5 - 0 . 9 3 9 3 9 2 9 2 8 1
6 . 3 8 2 5 1 9 . 2 8 6 - 1 . 1 6 6 5 1 0 0 9 9 9 2
5 . 8 9 6 5 1 9 . 7 0 5 - 0 . 4 1 1 6 7 2 1 1 2 0 9 2 0 5
None

How to write Python code that does cumprod for forward 2 periods with groupby

I want to calculate a return, RET, which is the cumulative product over 2 periods (the current and the next period), within each groupby('id') group.
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 NaN
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 NaN
Actual Result:
id datadate trt1m1 RET
1 20051231 1 NaN
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 NaN
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the cumprod over the trailing 2 periods instead of the forward 2 periods.
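No answer is included in this excerpt. One way to get the forward product (a sketch, not from the thread) is to multiply each value by the next one within its group via a groupwise shift(-1), which leaves the last row of each group NaN, matching the expected result:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'trt1m1': [1, 2, 3, 4, 4, 5, 10, 11, 15, 20, 21],
})

# forward 2-period product: current value * next value, within each id;
# shift(-1) pulls the next row's value up, and is NaN at each group's end
df['RET'] = df['trt1m1'] * df.groupby('id')['trt1m1'].shift(-1)
print(df)
```

For a forward window longer than 2 periods, reversing each group, applying the trailing rolling product, and reversing back achieves the same effect.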

Transposing multi index dataframe in pandas

HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a data frame where HID and gen are indexes.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
It's called pivoting, i.e.:
df.reset_index().pivot('HID', 'gen', 'views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need HID as a column, add reset_index + rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
