I'm new to Pandas. I have a data frame that looks something like this.
Name Storage Location Total Quantity
a    S1               100
a    S2               200
a    S3               300
a    S4               110
a    S5               200
b    S1               200
b    S2               300
b    S4               400
b    S5               150
c    S1               400
c    S5               500
I want to sum the "Total Quantity" grouped by Name, but only for the specific storage locations "S1", "S2" and "S3".
Name Total Quantity
a    600
b    500
c    400
My desired output would be something like the above.
Any help would be much appreciated. Thank you in advance!
You could use where to replace the unwanted Locations with NaN and use groupby + sum (since sum skips NaN by default):
out = df.where(df['Storage Location'].isin(['S1','S2','S3'])).groupby('Name', as_index=False)['Total Quantity'].sum()
Output:
Name Total Quantity
0 a 600.0
1 b 500.0
2 c 400.0
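To reproduce this end to end, here is a minimal sketch of the sample frame (dtypes assumed from the question):

import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
    'Storage Location': ['S1', 'S2', 'S3', 'S4', 'S5',
                         'S1', 'S2', 'S4', 'S5', 'S1', 'S5'],
    'Total Quantity': [100, 200, 300, 110, 200,
                       200, 300, 400, 150, 400, 500],
})

The sums come back as floats (600.0 rather than 600) because where() turns the excluded rows into NaN; the boolean-indexing answer below drops those rows first and keeps the integer dtype.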
Use:
In [2378]: out = df[df['Storage Location'].isin(['S1', 'S2', 'S3'])].groupby('Name')['Total Quantity'].sum().reset_index()
In [2379]: out
Out[2379]:
Name Total Quantity
0 a 600
1 b 500
2 c 400
I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter data from dfA such that a position from dfB falls in the interval of dfA (start_coordinate to end_coordinate) and the names match as well. For example, the position value in row 1 of dfB falls in the interval specified by row 1 of dfA and the corresponding Name value is the same, so I want this row. In contrast, row 3 of dfB also falls in the interval of row 1 of dfA, but the Name value is different, so I don't want that record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use numpy broadcasting because I run into memory errors. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
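If even the merged intermediate is too big, the same logic can be run chunk-wise over dfB (a sketch; the chunk size of 1,000,000 rows is an arbitrary assumption, tune it to available memory):

import pandas as pd

pieces = []
chunk_size = 1_000_000  # assumption: adjust to your RAM
for start in range(0, len(dfB), chunk_size):
    # merge and filter one slice of dfB at a time so the
    # intermediate merged frame stays small
    part = dfB.iloc[start:start + chunk_size].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    pieces.append(part)
dfC = pd.concat(pieces, ignore_index=True)  # then split into dfA_new/dfB_new as above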
Use pandasql (its sqldf function runs SQL over the dataframes in scope):
from pandasql import sqldf

sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
pd.sql("select df2.* from df1 inner join df2 on df2.name=df1.name and df2.position between df1.start_coordinate and df1.end_coordinate",globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb
I have the below dataframe,
Category Value
A 100
A -
B -
C 50
D 200
D 400
D -
As you can see, some of the values are the hyphen symbol '-'. I want to replace those hyphens with the mean of the corresponding category.
In the example, there are two entries for "A": one row with value 100 and the other with a hyphen, so the mean is 100 itself. For B, since there are no valid values, the mean is the mean of the entire column, which is (100+50+200+400)/4 = 187.5. For C, nothing changes, and for D, the hyphen is replaced by 300 (same logic as for "A").
Output:
Category Value
A 100
A 100
B 187.5
C 50
D 200
D 400
D 300
Try:
import numpy as np
import pandas as pd

df = df.replace("-", np.nan)
df["Value"] = pd.to_numeric(df["Value"])
avg = df["Value"].mean()

# Fill NaN with the per-category mean; a category with no valid values
# falls back to the overall column mean
df["Value"] = df["Value"].fillna(
    df.groupby("Category")["Value"].transform(
        lambda x: avg if x.isna().all() else x.mean()
    )
)
print(df)
Prints:
Category Value
0 A 100.0
1 A 100.0
2 B 187.5
3 C 50.0
4 D 200.0
5 D 400.0
6 D 300.0
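To run the snippet end to end, the sample frame can be rebuilt like this (a sketch; the hyphens are literal strings, which is why the replace + to_numeric step is needed first):

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'C', 'D', 'D', 'D'],
    'Value': [100, '-', '-', 50, 200, 400, '-'],
})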
I have a dataframe which looks like:
A B C
a 100 200
a NA 100
a 200 NA
a 100 100
b 200 200
b 100 200
b 200 100
b 200 100
I use the aggregate function on column B and column C as:
ag=data.groupby(['A']).agg({'B':'sum','C':'sum'}).reset_index()
Output:
A B C
a NULL NULL
b 700 600
Expected Output:
A B C
a 400 400
b 700 600
How can I modify my aggregate function so that NULL values are ignored?
Maybe you have already thought about this and it is not possible in your case, but you can replace the NA values with 0 in the dataframe before this operation. If you don't want to change the original dataframe, you can work on a copy.
ag=data.replace(np.nan,0).groupby(['A']).agg({'B':'sum','C':'sum'}).reset_index()
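A quick end-to-end check of that line with the sample data (a sketch; the NA cells are np.nan):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'B': [100, np.nan, 200, 100, 200, 100, 200, 200],
    'C': [200, 100, np.nan, 100, 200, 200, 100, 100],
})
ag = data.replace(np.nan, 0).groupby(['A']).agg({'B': 'sum', 'C': 'sum'}).reset_index()
print(ag)
#    A      B      C
# 0  a  400.0  400.0
# 1  b  700.0  600.0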
How can I drop a whole (city, district) group if the date value 2018/11/1 does not exist for that group in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the dates, and test whether at least one value per group is True with GroupBy.transform('any'); then filter with boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If you get an error caused by missing values in the mask, one possible idea is to replace missing values in the columns used for the groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
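A runnable sketch with the sample data, parsing date up front (the filter assumes a real datetime column):

import pandas as pd

df = pd.DataFrame({
    'city': ['a', 'a', 'a', 'b', 'b'],
    'district': ['c', 'c', 'c', 'd', 'd'],
    'date': pd.to_datetime(['2018/9/1', '2018/10/1', '2018/11/1',
                            '2018/9/1', '2018/10/1']),
    'value': [12, 4, 5, 3, 7],
})
filter_date = pd.Timestamp('2018-11-01').date()
out = df.groupby(['city', 'district']).filter(
    lambda x: (x['date'].dt.date == filter_date).any()
)
print(out)  # only the (a, c) group survives

Note that filter() invokes the lambda once per group, so with very many groups the transform('any') approach above tends to be faster.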
Help me with the following problem without adding any helper cells or changing the data:
"There are 8 cities in the country of Eight: A, B, C, D, E, F, G and H. Mr. Z decides to visit each city in his car, starting from A. His planned itinerary is A-->B-->C-->D-->E-->F-->G-->H.
The distance between the cities is given in Table I."
Table I: Distance

      A     B     C     D     E     F     G     H
A     0   200
B   200     0   350
C         350     0   500
D               500     0   250
E                     250     0   850
F                           850     0  1250
G                                1250     0   150
H                                       150     0
"Write a formula which takes the number of kms that Mr. Z has traveled from A as the input and displays the name of the city which is nearest to that point.
Mr. Z will enter the number of kms he has travelled from A in cell D30 and the name of the city nearest to the point will be displayed in E30"
Presumably a variation on the following will do (the problem doesn't say you can't hardcode the cumulative sums):
=INDEX({"A","B","C","D","E","F","G","H"},MATCH(MIN(ABS({0,200,550,1050,1300,2150,3400,3550}-D30)),ABS({0,200,550,1050,1300,2150,3400,3550}-D30),0))
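For reference, the hard-coded array is the running total of the Table I distances from A: B = 200; C = 200 + 350 = 550; D = 550 + 500 = 1050; E = 1050 + 250 = 1300; F = 1300 + 850 = 2150; G = 2150 + 1250 = 3400; H = 3400 + 150 = 3550.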