Column level parsing in pandas data frame - python-3.x

Currently I am working with 20M records with 5 columns. My data frame looks like -
tran_id id code
123 1 1759#1#83#0#1362#0.2600#25.7400#2.8600#1094#1#129.6#14.4
254 1 1356#0.4950#26.7300#2.9700
831 2 1354#1.78#35.244#3.916#1101#2#40#0#1108#2#30#0
732 5 1430#1#19.35#2.15#1431#3#245.62#60.29#1074#12#385.2#58.8#1109
141 2 1809#8#75.34#292.66#1816#4#24.56#95.44#1076#47#510.89#1110.61
Desired output -
id new_code
1 1759
1 1362
1 1094
1 1356
2 1354
2 1101
2 1108
5 1430
5 1431
5 1074
5 1109
2 1809
2 1816
2 1076
What I have done so far -
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dd= pd.DataFrame({'col' : d["code"].apply(lambda x: re.split('[# # ]', x))})
dd.head()
s = dd['col'].str[:]
dd= pd.DataFrame(s.values.tolist())
dd.head()
cols = range(len(list(dd)))
num_cols = len(list(dd))
new_cols = ['col' + str(i) for i in cols]
dd.columns = new_cols[:num_cols]
Just remember the size of the data is huge: 20 million rows, so I can't do any looping.
Thanks in advance

You can use Series.str.findall to extract the 4-digit integers between separators:
#https://stackoverflow.com/a/55096994/2901002
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
#alternative
#s = df['code'].str.replace('#', ' ').str.findall(r'(?<!\S)\d{4}(?!\S)')
Then create a new DataFrame by repeating id with numpy.repeat per str.len, and flatten the lists of codes with chain.from_iterable:
from itertools import chain

df = pd.DataFrame({
    'id': df['id'].values.repeat(s.str.len()),
    'new_code': list(chain.from_iterable(s.tolist()))
})
print (df)
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076

An alternative approach uses Series.str.extractall with a different regex pattern (note in the output below that it also captures 1110 from 1110.61, which the first pattern excludes):
(df.set_index('id').code.str.extractall(r'(?:[^\.]|^)(?P<new_code>\d{4})')
   .reset_index(0)
   .reset_index(drop=True)
)
[out]
id new_code
0 1 1759
1 1 1362
2 1 1094
3 1 1356
4 2 1354
5 2 1101
6 2 1108
7 5 1430
8 5 1431
9 5 1074
10 5 1109
11 2 1809
12 2 1816
13 2 1076
14 2 1110
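On pandas 0.25+, the same reshape can also be written with `Series.explode`; a minimal self-contained sketch (not from the original thread) using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'tran_id': [123, 254, 831, 732, 141],
    'id': [1, 1, 2, 5, 2],
    'code': [
        '1759#1#83#0#1362#0.2600#25.7400#2.8600#1094#1#129.6#14.4',
        '1356#0.4950#26.7300#2.9700',
        '1354#1.78#35.244#3.916#1101#2#40#0#1108#2#30#0',
        '1430#1#19.35#2.15#1431#3#245.62#60.29#1074#12#385.2#58.8#1109',
        '1809#8#75.34#292.66#1816#4#24.56#95.44#1076#47#510.89#1110.61',
    ],
})

# extract the 4-digit tokens delimited by '#' (or string boundaries),
# then explode one code per row while keeping the matching id
s = df['code'].str.findall(r'(?<![^#])\d{4}(?![^#])')
out = (df[['id']].assign(new_code=s)
         .explode('new_code')
         .reset_index(drop=True))
print(out)
```

`explode` keeps the index alignment with `id` for free, so no manual `repeat`/`chain` bookkeeping is needed; the original `repeat` approach avoids the per-row object overhead and may still be faster at 20M rows.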

Related

Create a new column to indicate dataframe split at least n-rows groups in Python [duplicate]

This question already has an answer here:
Pandas - Fill N rows for a specific column with a integer value and increment the integer there after
Given a dataframe df as follows:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Value': [11, 8, 10, 15, 110, 60, 100, 40]})
Out:
Date Sym Value
0 2015-05-08 aapl 11
1 2015-05-07 aapl 8
2 2015-05-06 aapl 10
3 2015-05-05 aapl 15
4 2015-05-08 aaww 110
5 2015-05-07 aaww 60
6 2015-05-06 aaww 100
7 2015-05-05 aaww 40
I hope to create a new column Group to indicate groups with a range of integers starting from 1; each group should have 3 rows, except for the last group, which may have fewer than 3 rows.
The final result will look like this:
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
How could I achieve that with Pandas or Numpy? Thanks.
My trial code:
n = 3
for g, df in df.groupby(np.arange(len(df)) // n):
    print(df.shape)
You are close; instead of looping over a groupby, assign the integer-division result directly to a new column and add 1:
n = 3
df['Group'] = np.arange(len(df)) // n + 1
print (df)
Date Sym Value Group
0 2015-05-08 aapl 11 1
1 2015-05-07 aapl 8 1
2 2015-05-06 aapl 10 1
3 2015-05-05 aapl 15 2
4 2015-05-08 aaww 110 2
5 2015-05-07 aaww 60 2
6 2015-05-06 aaww 100 3
7 2015-05-05 aaww 40 3
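As a self-contained check of the integer-division trick (using only the Value column from the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [11, 8, 10, 15, 110, 60, 100, 40]})

n = 3
# every consecutive block of n rows gets the same label; +1 makes labels start at 1
df['Group'] = np.arange(len(df)) // n + 1
print(df['Group'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```

Because 8 is not a multiple of 3, the last group automatically gets only 2 rows, as the question requires.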

Add workdays to pandas df date columns based of other column

Is there a way to increment a date field in a pandas data frame by the number of working days specified in another column?
import datetime as dt
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({'Date': pd.date_range(start=dt.datetime(2020, 7, 1), end=dt.datetime(2020, 7, 10))})
df['Offset'] = np.random.randint(0, 10, len(df))
Date Offset
0 2020-07-01 9
1 2020-07-02 4
2 2020-07-03 0
3 2020-07-04 1
4 2020-07-05 9
5 2020-07-06 0
6 2020-07-07 1
7 2020-07-08 8
8 2020-07-09 9
9 2020-07-10 0
I would expect this to work, but it throws an error:
df['Date'] + pd.tseries.offsets.BusinessDay(n = df['Offset'])
TypeError: n argument must be an integer, got <class 'pandas.core.series.Series'>
pd.to_timedelta does not support working days.
Like I mentioned in my comment, you are trying to pass an entire Series as an integer. Instead, you want to apply the function row-wise:
df['your_answer'] = df.apply(lambda x:x['Date'] + pd.tseries.offsets.BusinessDay(n= x['Offset']), axis=1)
df
Date Offset your_answer
0 2020-07-01 9 2020-07-14
1 2020-07-02 7 2020-07-13
2 2020-07-03 3 2020-07-08
3 2020-07-04 2 2020-07-07
4 2020-07-05 7 2020-07-14
5 2020-07-06 7 2020-07-15
6 2020-07-07 7 2020-07-16
7 2020-07-08 2 2020-07-10
8 2020-07-09 1 2020-07-10
9 2020-07-10 0 2020-07-10
Line of code broken down:
# notice how this returns every value of that column
df.apply(lambda x:x['Date'], axis=1)
0 2020-07-01
1 2020-07-02
2 2020-07-03
3 2020-07-04
4 2020-07-05
5 2020-07-06
6 2020-07-07
7 2020-07-08
8 2020-07-09
9 2020-07-10
# same thing with `Offset`
df.apply(lambda x:x['Offset'], axis=1)
0 9
1 7
2 3
3 2
4 7
5 7
6 7
7 2
8 1
9 0
Since pd.tseries.offsets.BusinessDay(n=foo_bar) takes an integer and not a Series, we use the two columns together inside apply(); it is as if you looped each number in the Offset column through the offsets.BusinessDay() function.
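If the row-wise apply is too slow on a large frame, NumPy's np.busday_offset can do the same shift vectorized; a sketch (an alternative, not from the thread; note busday_offset assumes a Mon-Fri workweek and needs a roll policy for dates that fall on a weekend):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-07-01', '2020-07-02', '2020-07-03']),
    'Offset': [9, 4, 0],
})

# roll='forward' first moves any non-business start date to the next
# business day, then the offset (counted in business days) is applied
result = np.busday_offset(df['Date'].values.astype('datetime64[D]'),
                          df['Offset'].values, roll='forward')
df['shifted'] = pd.to_datetime(result)
print(df)
```

For the first row this matches the apply-based answer above: 2020-07-01 (a Wednesday) plus 9 business days lands on 2020-07-14.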

Why does the number of processes affect the output format when using multiprocessing.Pool(N)

I have taken this code and modified it slightly to suit Python 3.8.
The issue I am having is that the output initially has some sort of Wingdings-like characters in it; I suspect this is a newline character being incorrectly converted. See the output snippets below.
I couldn't work this out, so before spending too much time on it I tested the program with different numbers of processes. For some reason the output changes when I increase this.
import multiprocessing
from textwrap import dedent
from itertools import zip_longest

def process_chunk(d):
    # test function, change this later
    return d

def grouper(n, iterable, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

outfile = open("C:/Users/#####/Desktop/test.txt", "w+")

if __name__ == '__main__':
    # open input file
    test_data = open('D:/test.txt')
    # Create pool (p)
    p = multiprocessing.Pool(4)
    for chunk in grouper(1000, test_data):
        results = p.map(process_chunk, chunk)
        for r in results:
            outfile.write(f'{r}')
The samples below are from the end of the file, so I suspect the 'None' and 敮 output is just padding from the last chunk. Expected output:
5.2615 19.522 -0.968 3 134 120 124
5.9195 19.695 -0.828 49 197 192 170
6.0985 19.192 -0.984 0 150 137 130
5.2255 19.915 -0.939 3 92 92 81
6.3825 19.286 -1.166 5 100 99 92
5.8965 19.705 -0.411 67 211 209 205
With multiprocessing.Pool(4) (same output for N=2 to N=10)
5.9195 19.695 -0.828 49 197 192 170਍ഀ
6.0985 19.192 -0.984 0 150 137 130਍ഀ
5.2255 19.915 -0.939 3 92 92 81਍ഀ
6.3825 19.286 -1.166 5 100 99 92਍ഀ
5.8965 19.705 -0.411 67 211 209 205਍ഀ
潎
With multiprocessing.Pool(12) (same output for N=11 to N=24)
5 . 2 6 1 5 1 9 . 5 2 2 - 0 . 9 6 8 3 1 3 4 1 2 0 1 2 4
5 . 9 1 9 5 1 9 . 6 9 5 - 0 . 8 2 8 4 9 1 9 7 1 9 2 1 7 0
6 . 0 9 8 5 1 9 . 1 9 2 - 0 . 9 8 4 0 1 5 0 1 3 7 1 3 0
5 . 2 2 5 5 1 9 . 9 1 5 - 0 . 9 3 9 3 9 2 9 2 8 1
6 . 3 8 2 5 1 9 . 2 8 6 - 1 . 1 6 6 5 1 0 0 9 9 9 2
5 . 8 9 6 5 1 9 . 7 0 5 - 0 . 4 1 1 6 7 2 1 1 2 0 9 2 0 5
None

How to write Python code that does cumprod for forward 2 periods with groupby

I want to calculate a return, RET, which is the cumulative product over 2 periods (the current and the next period), within each groupby('id') group.
df['RET'] = df.groupby('id')['trt1m1'].rolling(2,min_periods=2).apply(lambda x:x.prod()).reset_index(0,drop=True)
Expected Result:
id datadate trt1m1 RET
1 20051231 1 2
1 20060131 2 6
1 20060228 3 12
1 20060331 4 16
1 20060430 4 20
1 20060531 5 NaN
2 20061031 10 110
2 20061130 11 165
2 20061231 15 300
2 20070131 20 420
2 20070228 21 NaN
Actual Result:
id datadate trt1m1 RET
1 20051231 1 NaN
1 20060131 2 2
1 20060228 3 6
1 20060331 4 12
1 20060430 4 16
1 20060531 5 20
2 20061031 10 NaN
2 20061130 11 110
2 20061231 15 165
2 20070131 20 300
2 20070228 21 420
The code I used calculates the cumprod over the trailing 2 periods instead of the forward 2 periods.
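No answer is included in this excerpt. One way to get the forward product (a sketch, not from the thread) is to multiply each value by the next one within its group via a groupwise shift(-1), which leaves the last row of each group NaN, matching the expected result:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'trt1m1': [1, 2, 3, 4, 4, 5, 10, 11, 15, 20, 21],
})

# forward 2-period product: current value * next value, within each id;
# shift(-1) pulls the next row's value up, and is NaN at each group's end
df['RET'] = df['trt1m1'] * df.groupby('id')['trt1m1'].shift(-1)
print(df)
```

For a forward window longer than 2 periods, reversing each group, applying the trailing rolling product, and reversing back achieves the same effect.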

Transposing multi index dataframe in pandas

HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a data frame where HID and gen are indexes.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
It's called pivoting, i.e.:
df.reset_index().pivot('HID', 'gen', 'views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need HID as a column, add reset_index + rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
