I have the following data set.
ID Date description V1 V2 V3
1 31-Jan-2013 Des1 10 20 30
1 31-Jan-2013 Des2 20 30 20
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2013 Decc1 10 20 30
2 31-dec-2013 Decc2 20 30 20
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
I want to extract only the rows from the latest year for each ID.
Expected output:
ID Date description V1 V2 V3
1 31-jan-2014 Des1 56 30 20
1 31-jan-2014 des2 30 40 60
2 31-dec-2014 Decc1 56 30 20
2 31-dec-2014 decc2 30 40 60
Can anybody help me achieve this in pandas?
Thanks
Anubhav
Maybe groupby() could be used here.
data_u.set_index(['ID', 'Date'], inplace=True)
data_u.sort_index(inplace=True)
data_u.groupby(level=['ID', 'Date']).size()
This gives me the count of rows for each entry of the MultiIndex.
But I want to select the latest year for every ID. The data set has more than 500,000 records.
You could do the following:
df['Date'] = pd.to_datetime(df['Date'])
df[df.apply(lambda x : x['Date'] == df[(df['ID'] == x['ID'])]['Date'].max() , axis =1)]
Output
+---+----+------------+-------------+----+----+----+
| | ID | Date | description | V1 | V2 | V3 |
+---+----+------------+-------------+----+----+----+
| 2 | 1 | 2014-01-31 | Des1 | 56 | 30 | 20 |
| 3 | 1 | 2014-01-31 | des2 | 30 | 40 | 60 |
| 6 | 2 | 2014-12-31 | Decc1 | 56 | 30 | 20 |
| 7 | 2 | 2014-12-31 | decc2 | 30 | 40 | 60 |
+---+----+------------+-------------+----+----+----+
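Since the question mentions more than 500,000 records, the row-wise apply above may become slow, because it filters the whole frame once per row. A minimal vectorized sketch, assuming the same column names as in the question and that "latest" means the maximum Date per ID, could look like this (the sample frame is rebuilt here only to keep the example self-contained):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": ["31-Jan-2013", "31-Jan-2013", "31-jan-2014", "31-jan-2014",
             "31-dec-2013", "31-dec-2013", "31-dec-2014", "31-dec-2014"],
    "description": ["Des1", "Des2", "Des1", "des2",
                    "Decc1", "Decc2", "Decc1", "decc2"],
    "V1": [10, 20, 56, 30, 10, 20, 56, 30],
    "V2": [20, 30, 30, 40, 20, 30, 30, 40],
    "V3": [30, 20, 20, 60, 30, 20, 20, 60],
})

# Parse the day-month-year strings into real timestamps
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%Y")

# transform("max") broadcasts each ID's latest date back onto its rows,
# so the comparison is done once for the whole column instead of per row
latest_per_id = df.groupby("ID")["Date"].transform("max")
latest = df[df["Date"] == latest_per_id]
print(latest)
```

This keeps every row that shares the latest date of its ID, which matches the expected output in the question.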
Related
I am trying to use groupby to plot all features of a dataframe in separate subplots, one per serial number, while ignoring any feature 'Ft' whose data are all zeros for that serial number. For example, 'Ft1' should be ignored for S/N 'KLM10015' because all of its values are zeros. The dataframe has 5514 rows and 565 columns, but the solution should also work for dataframes of other sizes.
The x-axis of each subplot represents the date, the y-axis represents the feature values (Ft), and the title represents the serial number (S/N).
This is an example of the dataframe I have:
df =
S/N Ft1 Ft12 Ft17 ---- Ft1130 Ft1140 Ft1150
DATE
2021-01-06 KLM10015 0 12 14 ---- 17 52 47
2021-01-07 KLM10015 0 10 48 ---- 19 20 21
2021-01-11 KLM10015 0 0 45 ---- 0 19 0
2021-01-17 KLM10015 0 1 0 ---- 16 44 66
| | | | | | | |
| | | | | | | |
| | | | | | | |
2021-02-09 KLM10018 1 11 0 ---- 25 27 19
2021-12-13 KLM10018 12 0 19 ---- 78 77 18
2021-12-16 kLM10018 14 17 14 ---- 63 19 0
2021-07-09 KLM10018 18 0 77 ---- 65 34 98
2021-07-15 KLM10018 0 88 82 ---- 63 31 22
Code:
import matplotlib.pyplot as plt

# Feature columns to plot; the names must match the dataframe columns exactly
list_ID = ["Ft1", "Ft12", "Ft17", ......, "Ft1130", "Ft1140", "Ft1150"]

def plot_fun(dataframe):
    for item in list_ID:
        fig = plt.figure(figsize=(35, 20))
        ax1 = fig.add_subplot(221)
        ax2 = fig.add_subplot(222)
        ax3 = fig.add_subplot(223)
        ax4 = fig.add_subplot(224)
        # Each call draws one line per S/N for this feature on the given axes
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax1)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax2)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax3)
        dataframe.groupby('S/N')[item].plot(legend=True, ax=ax4)
        plt.show()

plot_fun(df)
I would really appreciate your help. Thanks a lot.
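One possible way to approach this, sketched under the assumption that each serial number gets its own subplot and that a feature is skipped when all of its values for that serial number are zero. The helper names nonzero_features and plot_by_serial are hypothetical, and the small sample frame only reuses a few rows and columns from the example above:

```python
import pandas as pd

def nonzero_features(group, id_col="S/N"):
    """Return the feature columns of one serial number that are not all zero."""
    features = group.drop(columns=[id_col])
    return [col for col in features.columns if (features[col] != 0).any()]

def plot_by_serial(dataframe, id_col="S/N"):
    """Draw one subplot per serial number, skipping all-zero features."""
    import matplotlib.pyplot as plt  # imported lazily; only plotting needs it

    groups = list(dataframe.groupby(id_col))
    fig, axes = plt.subplots(len(groups), 1, squeeze=False,
                             figsize=(12, 4 * len(groups)))
    for ax, (serial, group) in zip(axes[:, 0], groups):
        cols = nonzero_features(group, id_col)
        if cols:
            group[cols].plot(ax=ax, legend=True)  # x-axis is the DATE index
        ax.set_title(serial)
        ax.set_xlabel("Date")
    fig.tight_layout()
    plt.show()

# Small sample built from a few rows of the question's dataframe
df = pd.DataFrame(
    {
        "S/N": ["KLM10015"] * 4 + ["KLM10018"] * 2,
        "Ft1": [0, 0, 0, 0, 1, 12],
        "Ft12": [12, 10, 0, 1, 11, 0],
        "Ft17": [14, 48, 45, 0, 0, 19],
    },
    index=pd.to_datetime(
        ["2021-01-06", "2021-01-07", "2021-01-11", "2021-01-17",
         "2021-02-09", "2021-12-13"]
    ).rename("DATE"),
)

# Which features would be drawn for each serial number
selected = {serial: nonzero_features(group) for serial, group in df.groupby("S/N")}
print(selected)
```

Calling plot_by_serial(df) would then draw the figure. With several hundred feature columns a single subplot per serial number may get crowded, so splitting the features over several figures could be a reasonable variation.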
I want to create a new column in a pandas dataframe, based on specific requirements that involve other columns. For example, my dataframe df:
A | B
-----------
5 | 0
5 | 1
15 | 1
10 | 1
10 | 1
20 | 2
15 | 2
10 | 2
5 | 3
15 | 3
10 | 4
20 | 0
I want to create a new column C with the requirements below:
When the value of B = 0, then C = 0
Rows that share the same value in B get the same value in C. Within each run of equal B values, the rows are classified as start, middle, and end. For example, the run B = 1 has 1 start row, 2 middle rows, and 1 end row; the run B = 3 has 1 start row, 0 middle rows, and 1 end row. Each section is calculated as follows:
I specify a threshold = 10.
Let's look at values B = 1 :
Start :
C.loc[2] = min(threshold, A.loc[1]) + A.loc[2]
Middle :
C.loc[3] = A.loc[3]
C.loc[4] = A.loc[4]
End:
C.loc[5] = min(threshold, A.loc[6])
The value of C for every row in the run is the sum of the calculations above.
When a value of B is unique and not 0, for example B = 4:
C[10] = min(threshold, A.loc[9]) + min(threshold, A.loc[11])
I can solve points 0 and 3, but I'm struggling with point 2.
So, the final output will be:
A | B | C
--------------------
5 | 0 | 0
5 | 1 | 45
15 | 1 | 45
10 | 1 | 45
10 | 1 | 45
20 | 2 | 50
15 | 2 | 50
10 | 2 | 50
5 | 3 | 25
10 | 3 | 25
10 | 4 | 20
20 | 0 | 0
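One possible reading of these rules, offered as a sketch rather than a definitive solution: for each run of consecutive equal non-zero B values, C is the capped A just before the run, plus every A in the run except its last row, plus the capped A just after the run. This reading reproduces the expected column C above, including the single-row case B = 4. It assumes that a missing neighbour at the very top or bottom of the frame contributes 0, which the question does not specify.

```python
import pandas as pd

threshold = 10
df = pd.DataFrame({
    "A": [5, 5, 15, 10, 10, 20, 15, 10, 5, 15, 10, 20],
    "B": [0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 0],
})

# Label each run of consecutive equal B values
run = (df["B"] != df["B"].shift()).cumsum()

# Mark the first and last row of every run
is_first = ~run.duplicated(keep="first")
is_last = ~run.duplicated(keep="last")

# A capped at the threshold, shifted so each row sees its neighbours
# (assumption: a missing neighbour at the frame edge counts as 0)
capped = df["A"].clip(upper=threshold)
prev_capped = capped.shift(1, fill_value=0)
next_capped = capped.shift(-1, fill_value=0)

# Per-row contribution: own A unless the row ends the run,
# plus the capped neighbour before a start row and after an end row
contribution = (
    df["A"].where(~is_last, 0)
    + prev_capped.where(is_first, 0)
    + next_capped.where(is_last, 0)
)

# Sum the contributions per run, broadcast back, and force C = 0 where B = 0
df["C"] = contribution.groupby(run).transform("sum").where(df["B"] != 0, 0)
print(df)
```

Grouping on the run label instead of on B itself keeps two separate runs with the same B value (such as the two B = 0 rows) from being merged.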
I want to create a new column in python dataframe based on other column values in multiple rows.
For example, my python dataframe df:
A | B
------------
10 | 1
20 | 1
30 | 1
10 | 1
10 | 2
15 | 3
10 | 3
I want to create a variable C based on the values of variable A, grouped by the value of variable B across multiple rows. When variable B has the same value in rows i, i+1, ..., the value of C in those rows is the sum of variable A over them. In this case, my output dataframe will be:
A | B | C
--------------------
10 | 1 | 70
20 | 1 | 70
30 | 1 | 70
10 | 1 | 70
10 | 2 | 10
15 | 3 | 25
10 | 3 | 25
I have no idea what the best way to achieve this is. Can anyone help?
Thanks in advance
First, recreate the data:
import pandas as pd
A = [10,20,30,10,10,15,10]
B = [1,1,1,1,2,3,3]
df = pd.DataFrame({'A':A, 'B':B})
df
A B
0 10 1
1 20 1
2 30 1
3 10 1
4 10 2
5 15 3
6 10 3
Then I'll create a lookup Series from the df:
lookup = df.groupby('B')['A'].sum()
lookup
B
1    70
2    10
3    25
Name: A, dtype: int64
Then I'll use that lookup on the df using apply:
df.loc[:,'C'] = df.apply(lambda row: lookup[lookup.index == row['B']].values[0], axis=1)
df
A B C
0 10 1 70
1 20 1 70
2 30 1 70
3 10 1 70
4 10 2 10
5 15 3 25
6 10 3 25
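As a hedged alternative to the row-wise apply, the same lookup Series can be applied with Series.map, which matches each value of B against the index of lookup in one vectorized step. A minimal sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10, 20, 30, 10, 10, 15, 10],
    "B": [1, 1, 1, 1, 2, 3, 3],
})

# Sum of A for each distinct value of B, indexed by B
lookup = df.groupby("B")["A"].sum()

# map() looks up every B value in the index of `lookup`
df["C"] = df["B"].map(lookup)
print(df)
```

On large frames this usually avoids the overhead of calling a Python lambda once per row.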
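You can use groupby() to group the rows on B, then transform('sum') on A so that each group's sum is broadcast back to every row of that group. Passing the string 'sum' rather than the built-in sum avoids a deprecation warning in recent pandas versions.
df['C'] = df.groupby('B')['A'].transform('sum')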
I need to turn this excel sheet (one number per cell):
A B C D E F G
--------------------------------
1 | 1 2 3 4 5 6 7
2 | 8 9 10 11 12 13 14
3 | 15 16 17 18 19 20 21
into this (with all spaces between numbers, each row in one cell):
A
------------
| 1
1 | 2
| 3 4 5
| 6 7
—-line break—-
| 8
2 | 9
| 10 11 12
| 13 14
—-line break—-
| 15
3 | 16
| 17 18 19
| 20 21
Does anyone have any ideas or suggestions?
After a little playing around, I finally found a formula that worked for the above. The CHAR(10)s are the line breaks.
=(TRANSPOSE(A1)&CHAR(10)&TRANSPOSE(B1)&CHAR(10)&CONCATENATE(C1," ",D1," ",E1)&CHAR(10)&CONCATENATE(F1," ",G1)&CHAR(10))
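For readers who want to check the layout outside Excel, the same joining logic can be sketched in Python. This is only an illustration of what the formula builds, not a replacement for it; the function name format_row is hypothetical, and "\n" plays the role of CHAR(10).

```python
def format_row(values):
    """Join seven cell values into one multi-line string, like the formula does."""
    a, b, c, d, e, f, g = (str(v) for v in values)
    lines = [
        a,                     # first value on its own line
        b,                     # second value on its own line
        " ".join([c, d, e]),   # third to fifth values separated by spaces
        " ".join([f, g]),      # last two values separated by a space
    ]
    # The formula ends with CHAR(10), so a trailing line break is kept here too
    return "\n".join(lines) + "\n"

cell_text = format_row([1, 2, 3, 4, 5, 6, 7])
print(cell_text)
```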
I have a table full of numbers with row headings. I also have a separate list of numbers that are contained in the table. I would like to find the location of each number on the list within the table, and then use that cell location to return the corresponding row heading. I have demonstrated what I'm looking for below.
How do I go about doing this? I'm imagining some combination of index/match functions, or perhaps vlookup, but none of the formulas that I've tried have worked so far. I'm completely lost at this point, so any help will be appreciated.
Thanks in advance!
Imagine something like this:
Table:
- Category A 1 2 3 4 5
- Category B 6 7 8 9 10
- Category C 11 12 13 14 15
- Category D 16 17 18 19 20
- Category E 21 22 23 24 25
List:
22
5
10
4
18
6
14
2
Desired Outcome:
- 22 Category E
- 5 Category A
- 10 Category B
- 4 Category A
- 18 Category D
- 6 Category B
- 14 Category C
- 2 Category A
Step 1: Find the row that the matching value is in
You can find the matching row by using a combination of a boolean function and SUMPRODUCT:
SUMPRODUCT((dataRange=22)*ROW(dataRange))
(note that this assumes that the items are all unique; it will not work if you have more than one match)
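Step 2: Find the category for that row
OFFSET takes a row offset relative to its reference cell, not an absolute row number, so subtract the row of the reference cell:
OFFSET(categoryACell, rows - ROW(categoryACell), 0)
So the resulting function would be:
OFFSET(categoryACell, SUMPRODUCT((dataRange=22)*ROW(dataRange)) - ROW(categoryACell), 0)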
A | B | C | D | E | F
_________________________________________________________
1 || Category A | 1 | 2 | 3 | 4 | 5
2 || Category B | 6 | 7 | 8 | 9 | 10
3 || Category C | 11 | 12 | 13 | 14 | 15
4 || Category D | 16 | 17 | 18 | 19 | 20
5 || Category E | 21 | 22 | 23 | 24 | 25
6 ||
7 ||
8 ||
9 ||
10 || 22 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A10)*ROW(B1:F5)))
11 || 5 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A11)*ROW(B1:F5)))
12 || 10 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A12)*ROW(B1:F5)))
13 || 4 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A13)*ROW(B1:F5)))
14 || 18 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A14)*ROW(B1:F5)))
15 || 6 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A15)*ROW(B1:F5)))
16 || 14 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A16)*ROW(B1:F5)))
17 || 2 | =INDIRECT("A"&SUMPRODUCT((B1:F5=A17)*ROW(B1:F5)))
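The lookup that these formulas perform can also be sketched in Python, purely to illustrate the logic: invert the table into a value-to-category mapping, then look up each number on the list. As with the SUMPRODUCT approach, this assumes every number appears only once in the table. The variable names are hypothetical.

```python
# Table from the question: row heading -> numbers in that row
table = {
    "Category A": [1, 2, 3, 4, 5],
    "Category B": [6, 7, 8, 9, 10],
    "Category C": [11, 12, 13, 14, 15],
    "Category D": [16, 17, 18, 19, 20],
    "Category E": [21, 22, 23, 24, 25],
}

# Invert the table: each number points at its row heading
# (assumes the numbers are unique across the whole table)
value_to_category = {
    value: category
    for category, values in table.items()
    for value in values
}

wanted = [22, 5, 10, 4, 18, 6, 14, 2]
result = [(value, value_to_category[value]) for value in wanted]

for value, category in result:
    print(value, category)
```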