Expanding/Duplicating dataframe rows based on condition - python-3.x

I am an R user who has recently started using Python 3 for data management. I am struggling with a way to expand/duplicate data frame rows based on a condition. I also need to be able to expand rows in a variable way. I'll illustrate with this example.
I have this data:
df = pd.DataFrame([[1, 10], [1,15], [2,10], [2, 15], [2, 20], [3, 10], [3, 15]], columns = ['id', 'var'])
df
Out[6]:
id var
0 1 10
1 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
I would like to expand rows for both ID == 1 and ID == 3. I would also like to expand each ID == 1 row by 1 duplicate each, and I would like to expand each ID == 3 row by 2 duplicates each. The result would look like this:
df2
Out[8]:
id var
0 1 10
1 1 10
2 1 15
3 1 15
4 2 10
5 2 15
6 2 20
7 3 10
8 3 10
9 3 10
10 3 15
11 3 15
12 3 15
13 3 15
I've been trying to use np.repeat, but I am failing to think of a way that I can use both ID condition and variable duplication numbers at the same time. Index ordering does not matter here, only that the rows are duplicated appropriately. I apologize in advance if this is an easy question. Thanks in advance for any help and feel free to ask clarifying questions.

This should do it:
dup = {1: 1, 3: 2}  # id value -> number of extra copies to add
res = df.copy()
for k, v in dup.items():
    for i in range(v):
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        res = pd.concat([res, df.loc[df['id'] == k]], ignore_index=True)
res.sort_values(['id', 'var'], inplace=True)
res.reset_index(inplace=True, drop=True)
res
# id var
#0 1 10
#1 1 10
#2 1 15
#3 1 15
#4 2 10
#5 2 15
#6 2 20
#7 3 10
#8 3 10
#9 3 10
#10 3 15
#11 3 15
#12 3 15
P.S. your desired solution had 7 values for id 3 while your description implies 6 values.

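I think the code below gets the job done:
df_1 = df.loc[df.id == 1]
df_3 = df.loc[df.id == 3]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df1 = pd.concat([df] + [df_1] * 1, ignore_index=True)
pd.concat([df1] + [df_3] * 2, ignore_index=True).sort_values(by='id')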
id var
0 1 10
1 1 15
7 1 10
8 1 15
2 2 10
3 2 15
4 2 20
5 3 10
6 3 15
9 3 10
10 3 15
11 3 10
12 3 15
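Since the question mentions np.repeat, here is a minimal sketch of a loop-free variant built on Index.repeat. The names `extra` and `counts` are illustrative and not taken from the answers above, and the sketch assumes that rows whose id is missing from the mapping should appear exactly once.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 10], [1, 15], [2, 10], [2, 15], [2, 20], [3, 10], [3, 15]],
    columns=["id", "var"],
)

# id value -> number of extra copies (same meaning as `dup` in the first answer)
extra = {1: 1, 3: 2}

# total copies per row: the original plus its extras (0 extras if id is not listed)
counts = 1 + df["id"].map(extra).fillna(0).astype(int)

# repeat each index label `counts` times, then pull those rows out with .loc
df2 = df.loc[df.index.repeat(counts)].reset_index(drop=True)
print(df2)
```

Because the repeated labels stay in their original order, no sort is needed afterwards.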

Related

Calculate mean value by interval coordinates in pandas

I have a dataframe such as :
Name Position Value
A 1 10
A 2 11
A 3 10
A 4 8
A 5 6
A 6 12
A 7 10
A 8 9
A 9 9
A 10 9
A 11 9
A 12 9
and I would like to calculate the mean of Value for each interval of 3 positions,
and create a new df with the start and end coordinates of each interval (so each has length 3) and a Mean_value column.
Name Start End Mean_value
A 1 3 10.33 <---- here this is (10+11+10)/3 = 10.33
A 4 6 8.7
A 7 9 9.3
A 10 12 9
Does someone have an idea using pandas please ?
To aggregate every 3 rows (where they exist) within each Name group, first build a counter with GroupBy.cumcount and integer division, then pass it to groupby together with named aggregations:
g = df.groupby('Name').cumcount() // 3
df = (df.groupby(['Name', g])
        .agg(Start=('Position', 'first'),
             End=('Position', 'last'),
             Value=('Value', 'mean'))
        .droplevel(1)
        .reset_index())
print(df)
Name Start End Value
0 A 1 3 10.333333
1 A 4 6 8.666667
2 A 7 9 9.333333
3 A 10 12 9.000000
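To make the counter step concrete, here is a self-contained sketch that rebuilds the sample frame from the question and prints the intermediate block ids. The output column is named Mean_value here to match the question; that naming, and the `out` variable, are choices made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A"] * 12,
    "Position": range(1, 13),
    "Value": [10, 11, 10, 8, 6, 12, 10, 9, 9, 9, 9, 9],
})

# cumcount numbers the rows 0, 1, 2, ... within each Name group;
# integer division by 3 turns that into one block id per run of three rows
g = df.groupby("Name").cumcount() // 3
print(g.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

out = (df.groupby(["Name", g])
         .agg(Start=("Position", "first"),
              End=("Position", "last"),
              Mean_value=("Value", "mean"))
         .droplevel(1)
         .reset_index())
print(out.round(2))
```

Note that this groups by row count, not by the Position values themselves, so it relies on the rows being sorted and on no positions being missing.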

Replacing the first column values according to the second column pattern

How can I use regex to replace values in a DataFrame, here in column 5, according to a pattern in column 1? Column 5 consists only of ones for now. I would like to start changing it when the pattern 34444 appears in column 1. The program should then replace the ones with 11111, 22222, 33333, etc., one block per occurrence of the pattern, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way. But I doubt it would be really more efficient than a for-loop.
1. re.finditer
import pandas as pd
import numpy as np
import re
# present col1 as number-strings
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)
# when a pattern is found, increase latter elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1
# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer, no typecasting is involved, but an explicit for-loop has to be written. The same output is obtained. Please benchmark it yourself if efficiency becomes relevant, say, with tens of millions of rows.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: fewer than 5 elements left, so no full pattern fits
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if found, increase latter elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
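As a quick sanity check, the two approaches can be run side by side on column 1 of the sample file. This is a sketch: the function names `label_with_regex` and `label_with_loop` are made up for illustration, and the sample column is typed in by hand as a plain array.

```python
import re
import numpy as np

# column '1' from the sample file, as a plain array
arr1 = np.array([1, 1, 1, 1, 1, 3, 4, 4, 4, 4, 3, 4, 4, 4, 4])

def label_with_regex(arr):
    # Only valid while every value is a single digit, so that string
    # positions line up one-to-one with array positions.
    s = "".join(str(i) for i in arr)
    ans = np.ones(len(arr), dtype=int)
    for match in re.finditer("34444", s):
        ans[match.end():] += 1
    return ans

def label_with_loop(arr):
    ans = np.ones(len(arr), dtype=int)
    # the last index at which a 5-element pattern can start is len(arr) - 5
    for i in range(len(arr) - 4):
        if arr[i] == 3 and np.all(arr[i + 1:i + 5] == 4):
            ans[i + 5:] += 1
    return ans

print(label_with_regex(arr1).tolist())
print(label_with_loop(arr1).tolist())
```

The single-digit caveat matters for the regex route: a value such as 34 in the column would shift every later string position and could also produce false matches.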

How to copy values from other dataframe based on condition (same values of specific column)?

I have two dataframes (df1 and df2) and they look like this:
data1 = {'col1':[1,2,3,4,1,2,3,4,1,2,3,4], 'col2':np.arange(1,13)*2}
df1 = pd.DataFrame(data1)
data2 = {'x': [1,2,3,4], 'y': [10,20,40,5]}
df2 = pd.DataFrame(data2)
I would like to add a new column 'col3' to df1 with the values of df2['y'] where df1['col1'] is equal to df2['x']. So my df1 would look like:
col1 col2 col3
1 2 10
2 4 20
3 6 40
4 8 5
1 10 10
2 12 20
3 14 40
4 16 5
1 18 10
2 20 20
3 22 40
4 24 5
Could anyone help me?
Use map with a dictionary created from df2:
df1['col3'] = df1.col1.map(dict(df2[['x', 'y']].values))
or
df1['col3'] = df1.col1.map(dict(zip(df2.x, df2.y)))
Out[886]:
col1 col2 col3
0 1 2 10
1 2 4 20
2 3 6 40
3 4 8 5
4 1 10 10
5 2 12 20
6 3 14 40
7 4 16 5
8 1 18 10
9 2 20 20
10 3 22 40
11 4 24 5
Use a merge:
df1.merge(df2, how='left', left_on='col1', right_on='x') \
    [['col1', 'col2', 'y']] \
    .rename(columns={'y': 'col3'})
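The two routes can be compared on the data from the question. The sketch below passes a Series to map instead of a dict, which is a small variation on the answer above; the names `lookup`, `via_map` and `via_merge` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4] * 3, "col2": np.arange(1, 13) * 2})
df2 = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 40, 5]})

# Series.map also accepts a Series; its index is used as the lookup key
lookup = df2.set_index("x")["y"]
via_map = df1.assign(col3=df1["col1"].map(lookup))

# the merge route, dropping the helper key column afterwards
via_merge = (df1.merge(df2, how="left", left_on="col1", right_on="x")
                .drop(columns="x")
                .rename(columns={"y": "col3"}))

print(via_map)
print(via_merge)
```

With map, values of col1 that have no match in df2['x'] become NaN; a left merge behaves the same way, while the default inner merge would drop those rows instead.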

pd.Series(pred).value_counts() how to get the first column in dataframe?

I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I create a list I get only the second column:
c_list=list(pd.Series(pred).value_counts()), Out:
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I get ultimately a dataframe that looks like this including a new column for size% of total size?
df=
[class , size ,relative_size]
0 2084 , x%
-1 15 , y%
1 13 , etc.
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
You are very nearly there. Typing this in the blind as you didn't provide a sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()
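Since the question asks for a percentage, here is a minimal sketch that names the columns while resetting the index and scales the share to percent. The `pred` list is hypothetical, as the question did not include the real predictions.

```python
import pandas as pd

# hypothetical predictions; the real `pred` was not shown in the question
pred = [0, 0, 0, 0, 0, -1, -1, 1, 1, 3]

counts = pd.Series(pred).value_counts()

# name the index 'class' and the counts 'size' in one step
df = counts.rename_axis("class").reset_index(name="size")

# share of the total, expressed in percent
df["relative_size"] = 100 * df["size"] / df["size"].sum()
print(df)
```

Naming the columns through rename_axis and reset_index(name=...) avoids depending on the default column names, which differ between pandas versions.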

Understanding J From

In J:
a =: 2 3 $ 1 2 3 4 5 6
Gives:
1 2 3
4 5 6
Which is a 2 3 shaped array.
If I do:
0 1 { a
I (noting that 0 1 is a 2 shaped list) expected to have back:
1 2 3 4 5 6
But got the following instead:
1 2 3
4 5 6
Reading the documentation I was expecting the shape of the index to kinda govern the shape of the answer.
Can someone clarify what I am missing here?
Higher-dimensional arrays may help make this clear. An array with n dimensions has items with n-1 dimensions. When you select an item from ({) a three-dimensional array, your result is a two-dimensional array:
1 { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
When you select multiple items from an array, the items are assembled into a new array, using each atom of x to select an item of y. This might be where you picked up the idea that the shape of x affects the shape of the result.
2 1 0 2 { 'set'
test
$ 2 1 0 2
4
$ 'test'
4
The number of dimensions of the result is equal to the number of dimensions of x plus the number of dimensions of the items of y. So, if you have a two-dimensional x taking two-dimensional items from a three-dimensional y, you will have a four-dimensional result:
(2 2 $ 1 1 0 1) { i. 5 3 4
12 13 14 15
16 17 18 19
20 21 22 23
12 13 14 15
16 17 18 19
20 21 22 23
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
$ (2 2 $ 1 1 0 1) { i. 5 3 4
2 2 3 4
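For readers coming from Python, the same shape rule can be sketched with NumPy integer-array indexing along the first axis. This is only an analogy to J's From, not a description of how J works internally, and the variable names are illustrative.

```python
import numpy as np

y = np.arange(5 * 3 * 4).reshape(5, 3, 4)   # plays the role of  i. 5 3 4
x = np.array([[1, 1], [0, 1]])              # plays the role of  2 2 $ 1 1 0 1

# each atom of x selects one 3x4 item of y along the leading axis
result = y[x]
print(result.shape)   # (2, 2, 3, 4): shape of x followed by shape of an item of y

# the case from the question: a 2-element index list on a 2x3 array
a = np.arange(1, 7).reshape(2, 3)
print(a[[0, 1]].shape)      # (2, 3)
print(a[[0, 1]].ravel())    # flattening, like monadic Ravel (,) in J
```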
One final note: the monadic Ravel (,) will reduce the result to a list (one-dimensional array).
, 0 1 { 2 3 $ 1 2 3 4 5 6
1 2 3 4 5 6
, i. 2 2 2 2
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
From ({) selects the items of a noun. For 2 3 $ 1 2 3 4 5 6 the items are the two rows because items are the components that make up the noun.
[ a=. 2 3 $ 1 2 3 4 5
1 2 3
4 5 1
0 { a
1 2 3
If you just had 1 2 3 then the items would be the individual atoms.
[ b=. 1 2 3
1 2 3
0 { b
1
If you used 1 3 $ 1 2 3 then there is only one item and the result would be
[ c=. 1 3 $ 1 2 3
1 2 3
0 { c
1 2 3
The number of items can be found with Tally (#), and is the lead dimension of the Shape ($) of the noun.
$ a
2 3
$ b
3
$ c
1 3
# a
2
# b
3
# c
1
