I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here, possibly more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria are readability of the code first, then time complexity.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
#This function might change into something more complex
def conc(a,b):
    return str(a)+'_'+str(b)
conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()), columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is to use add and radd (this example joins with '-'; pass '_' to radd to match the required output):
df1.astype(str).add(df2.astype(str).radd('-'))
A B
0 1-10 5-50
1 2-20 4-40
2 3-30 3-30
3 4-40 2-20
4 5-50 1-10
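A pandas-only variant of the same idea can be sketched with the plain `+` operator, assuming both frames share the same index and columns. The frames below repeat the question's sample data; the `'_'` separator matches the required output.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'A': [10, 20, 30, 40, 50], 'B': [50, 40, 30, 20, 10]})

# Convert both frames to strings, then concatenate element-wise.
# Alignment is by index and column label, so both frames must match.
required = df1.astype(str) + '_' + df2.astype(str)
print(required)
```

This stays readable for simple string joins; if the combining function grows more complex, a vectorised function as in the question may still be the more practical option.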
So I have a DataFrame with two columns, one with label names (df['Labels']) and the other with int values (df['Volume']).
df = pd.DataFrame({'Labels':
['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
'Volume':[10,40,20,20,50,60,40,50,50,60,10,10,10,10,20,20,10,20,80,90,90,80,100]})
I would like to identify intervals where my labels change and then calculate the median on the column 'Volume' for each of these intervals. Later I should replace every value of column 'Volume' by the respective median of each interval.
In case of label A, I would like to have the median for both intervals.
Here is how my DataFrame should look:
df2 = pd.DataFrame({'Labels':['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
'Volume':[20,20,20,20,50,50,50,50,50,50,10,10,10,10,10,10,10,10,90,90,90,90,90]})
You want to groupby the blocks and transform median:
blocks = df['Labels'].ne(df['Labels'].shift()).cumsum()
df['group_median'] = df['Volume'].groupby(blocks).transform('median')
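To see why the `ne`/`shift`/`cumsum` chain yields one id per consecutive run, here is a small sketch on a shortened label series (the five-element `labels` series is made up for illustration):

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'B', 'B', 'A'])

# True wherever a label differs from the one before it (the first row is
# always True because it is compared against NaN).
changed = labels.ne(labels.shift())

# A running count of the changes gives each consecutive run its own id,
# so the second run of 'A' gets a different id from the first.
blocks = changed.cumsum()
print(blocks.tolist())  # [1, 1, 2, 2, 3]
```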
Use Series.shift() and Series.cumsum() to build the group keys, then groupby on them and use transform:
df['Volume']=df.groupby(df['Labels'].ne(df['Labels'].shift()).cumsum())['Volume'].transform('median')
print(df)
Labels Volume
0 A 20
1 A 20
2 A 20
3 A 20
4 B 50
5 B 50
6 B 50
7 B 50
8 B 50
9 B 50
10 A 10
11 A 10
12 A 10
13 A 10
14 A 10
15 A 10
16 A 10
17 A 10
18 C 90
19 C 90
20 C 90
21 C 90
22 C 90
I apologise if the title is confusing. It's hard for me to summarise this issue in one sentence.
I'm trying to automate some spreadsheets, but sadly using VBA is not an option (most people here get confused by them and end up avoiding those spreadsheets).
The problem: I have rows in one sheet with data for velocity and angle and I'm trying to get a value from this other table based on those parameters.
The issue is that this other table is based on ranges of values for both columns and rows.
    A          B    C      D       E
1              0    1-30   31-60   61-90
2   0 to 1     10   20     20      30
3   1.1 to 2   10   20     30      30
4   2.1 to 3   20   30     30      40
5   '>3        30   40     40      40
Column A is the velocity range and row 1 is the angle range.
So for example if I have a velocity of 1.5 m/s with an angle of 40°, I want to be able to get the result of 30.
My best idea is to create auxiliary or helper columns to indicate which velocity and angle range they belong to and then use a VLOOKUP MATCH combo.
Even though it's a simple solution, I just wanted to know if there is a more elegant solution available that comes to mind or if you think this is already elegant enough.
Thanks.
As Scott mentioned, using ranges as titles makes it difficult.
If you simply put in the minimums instead, you can make the following:
    A     B    C    D    E    F   G          H
1         0    1    31   61       Velocity   1.5
2   0     10   20   20   30       Angle      40
3   1.1   10   20   30   30       Result     30
4   2.1   20   30   30   40
5   3.1   30   40   40   40
Where H1 and H2 are your input cells.
H3 gives you the result with: =INDEX(B2:E5,MATCH(H1,A2:A5,1),MATCH(H2,B1:E1,1))
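For readers who want to check the lookup logic outside Excel, here is a rough Python sketch of what MATCH with match type 1 does: it finds the position of the largest value less than or equal to the lookup value in a sorted list. Python is used purely for illustration; the names `lookup`, `velocity_mins`, `angle_mins`, and `table` are hypothetical and mirror the sheet above.

```python
from bisect import bisect_right

velocity_mins = [0, 1.1, 2.1, 3.1]   # A2:A5
angle_mins = [0, 1, 31, 61]          # B1:E1
table = [                            # B2:E5
    [10, 20, 20, 30],
    [10, 20, 30, 30],
    [20, 30, 30, 40],
    [30, 40, 40, 40],
]

def lookup(velocity, angle):
    # bisect_right returns how many entries are <= the value, which is the
    # 1-based position MATCH(..., 1) would return; subtract 1 for 0-based.
    row = bisect_right(velocity_mins, velocity) - 1
    col = bisect_right(angle_mins, angle) - 1
    return table[row][col]

print(lookup(1.5, 40))  # 30
```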
I have a dataset on height that looks like the one below.
Height Phase
0 A
2 A
3 A
4 P
4 P
3 D
2 D
1 D
0 D
I want to create a second column called Phase, as above, that labels each height as Ascent (A), Peak (P), or Descent (D). I tried the IF function as IF(HeiPh="A",B3>=B2,IF(HeiPh="P",4,"D")), but I'm not getting the required result. I have a big dataset, and some heights repeat several times in a row, e.g. 0 2 2 3 4 5 5 5 5 6 and so on.
Try this:
=IF(A2=MAX(A:A),"P",IF(ROW(A2)<MATCH(MAX(A:A),A:A,0),"A","D"))
You can do this =IF(MAX($A$4:$A$13)=A4,"P",IFS(A5>=A4,"A",A5<A4,"D"))
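The logic of the first formula (peak where the height equals the maximum, ascent before the first peak row, descent otherwise) can be sketched in pandas for checking results on a large dataset. This is only an illustration under that reading of the formula; the variable names are made up and the heights are the sample data from the question.

```python
import numpy as np
import pandas as pd

height = pd.Series([0, 2, 3, 4, 4, 3, 2, 1, 0])

is_peak = height.eq(height.max())
# Position of the first row that reaches the maximum height.
first_peak_pos = is_peak.to_numpy().argmax()
before_peak = np.arange(len(height)) < first_peak_pos

# 'P' at the maximum, 'A' before the first peak row, 'D' everywhere else.
phase = np.where(is_peak, 'P', np.where(before_peak, 'A', 'D'))
print(phase.tolist())  # ['A', 'A', 'A', 'P', 'P', 'D', 'D', 'D', 'D']
```

Repeated heights during the climb (such as 2 2 or 5 5 5) are still labelled 'A' here, because the rule depends only on the row's position relative to the first peak.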
I am trying to randomly select a certain percentage of rows and columns in my dataframe and fit these features into a logistic regression over 10 iterations. My dependent variable is whether a team won (1) or lost (0).
If I have a df that looks something like this (data is made up):
Won Field Injuries Weather Fouls Players
1 2 3 1 2 8
0 3 2 0 1 5
1 4 5 3 2 6
1 3 2 1 4 5
0 2 3 0 1 6
1 4 2 0 2 8
...
For example, let's say I want to select 50% (but this could change). I want to randomly select 50% (or the closest amount to 50% if it's an odd number) of the columns (Field, Injuries, Weather, Fouls, Players) and 50% of the rows in those columns to place in my model.
Here is my current code, which currently selects all of the columns and rows and fits them to my model; I would like to use a random percentage instead:
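import numpy as np
import statsmodels.api as sm

z = []
for i in range(10):
    train_cols = df.columns[1:]  # every column except the dependent variable 'Won'
    logit = sm.Logit(df['Won'], df[train_cols])
    result = logit.fit()
    exp = np.exp(result.params)  # exponentiated coefficients
    z.append([i, exp])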
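One possible way to pick the random subset is `DataFrame.sample`, which accepts a `frac` argument and can sample along either axis. The sketch below covers only the sampling step; the helper name `sample_subset`, the `seed` argument, and the small frame built from the made-up rows above are assumptions for illustration, and the model fit is left as a comment.

```python
import pandas as pd

df = pd.DataFrame({
    'Won':      [1, 0, 1, 1, 0, 1],
    'Field':    [2, 3, 4, 3, 2, 4],
    'Injuries': [3, 2, 5, 2, 3, 2],
    'Weather':  [1, 0, 3, 1, 0, 0],
    'Fouls':    [2, 1, 2, 4, 1, 2],
    'Players':  [8, 5, 6, 5, 6, 8],
})

def sample_subset(df, frac, seed=None):
    """Return (y, X) with a random fraction of feature columns and rows."""
    # Sample columns from the features only, so 'Won' is never picked.
    features = df.drop(columns='Won').sample(frac=frac, axis=1, random_state=seed)
    # Sample rows from the chosen columns.
    X = features.sample(frac=frac, random_state=seed)
    # Keep the dependent variable aligned with the sampled rows.
    y = df.loc[X.index, 'Won']
    return y, X

for i in range(10):
    y, X = sample_subset(df, frac=0.5, seed=i)
    # The fit from the question would go here, e.g. sm.Logit(y, X).fit()
    print(i, list(X.columns), list(X.index))
```

With very small samples like this one the logistic fit itself may fail to converge, so the subset sizes are worth checking before fitting.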