I am a Python newbie, and my goal is to take the name from each name row and shift it onto the rows that follow it.
import pandas as pd
import numpy as np
df = pd.DataFrame({"1": ['Alfred', 'car', 'bike','Alex','car'],
"2": [np.nan, 'Ford', 'Giant',np.nan,'Toyota'],
"3": [pd.NaT, pd.Timestamp("2018-01-01"),
pd.Timestamp("2018-07-01"),np.nan,pd.Timestamp("2021-01-01")]})
1 2 3
0 Alfred NaN NaT
1 car Ford 2018-01-01
2 bike Giant 2018-07-01
3 Alex NaN NaT
4 car Toyota 2021-01-01
My desired result looks like this:
df = pd.DataFrame({"transportation": ['car', 'bike','car'],
"Mark": ['Ford', 'Giant','Toyota'],
"BuyDate":[pd.Timestamp("2018-01-01"),
pd.Timestamp("2018-07-01"),pd.Timestamp("2021-01-01")],
"Name":['Alfred','Alfred','Alex']
})
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
I tried to search for a method but could not solve this.
Thanks for reading my post and helping.
Thanks mozway, jezrael and mcsoini for the help; it works, and I'm going to learn those different methods.
@Joseph Assaker
I have a question about your answer: when I run the code below, it shows an error. Am I missing something?
j = 0
for i in range(1, df.shape[0]):
    if df.loc[i][1] is np.nan:
        running_name = df.loc[i][0]
        continue
    new_df.loc[j] = list(df.loc[i]) + [running_name]
    j += 1
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14216/1012510729.py in <module>
4 running_name = df.loc[i][0]
5 continue
----> 6 new_df.loc[j] = list(df.loc[i]) + [running_name]
7 j += 1
NameError: name 'running_name' is not defined
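Note: the NameError comes from starting the loop at 1. range(1, df.shape[0]) skips row 0, which is the first name row, so running_name is never assigned before it is first read. Starting the loop at 0, as in the answer further below, avoids this; a minimal sketch:
j = 0
for i in range(df.shape[0]):        # include row 0 so the first name row is seen
    if pd.isna(df.iloc[i, 1]):      # a name row: remember the name, emit nothing
        running_name = df.iloc[i, 0]
        continue
    new_df.loc[j] = list(df.loc[i]) + [running_name]
    j += 1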
The idea is to forward fill the names, masked by the Mark column, into a Name column, and then filter the rows with the same mask:
df.columns = ["Transportation", "Mark", "BuyDate"]
m = df["Mark"].notna()
df["Name"] = df["transportation"].mask(m).ffill()
df = df[m].reset_index(drop=True)
print(df)
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
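In case Series.mask is unfamiliar: s.mask(cond) replaces the values where cond is True with NaN, so the line above keeps only the name rows before forward filling. A minimal sketch on the first column:
s = pd.Series(['Alfred', 'car', 'bike', 'Alex', 'car'])
m = pd.Series([False, True, True, False, True])  # True where Mark is present
print(s.mask(m))          # data rows become NaN, names survive
print(s.mask(m).ffill())  # names propagated down onto the data rows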
You can do this using a helper column and then a forward fill:
# rename columns
df.columns = ["transportation", "Mark", "BuyDate"]
# assumption: the rows where "Mark" is NaN defines the name for the following rows
df["is_name"] = df["Mark"].isna()
# create a new column which is NaN everywhere except for the name rows
df["name"] = np.where(df.is_name, df["transportation"], np.nan)
# do a forward fill to extend the names to all rows
df["name"] = df["name"].fillna(method="ffill")
# filter by non-name rows and drop the temporary is_name column
df = df.loc[~df.is_name].drop("is_name", axis=1)
print(df)
Out:
transportation Mark BuyDate name
1 car Ford 2018-01-01 Alfred
2 bike Giant 2018-07-01 Alfred
4 car Toyota 2021-01-01 Alex
You could use this pipeline:
m = df.iloc[:,1].notna()
(df.assign(Name=df.iloc[:, 0].mask(m).ffill())  # add new column
   .loc[m]                                      # keep only the rows with info
   # below: rework df to fit output
   .rename(columns={'1': 'transportation', '2': 'Mark', '3': 'BuyDate'})
   .reset_index(drop=True)
)
output:
transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
You can do this like so:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"1": ['Alfred', 'car', 'bike','Alex','car'],
... "2": [np.nan, 'Ford', 'Giant',np.nan,'Toyota'],
... "3": [pd.NaT, pd.Timestamp("2018-01-01"),
... pd.Timestamp("2018-07-01"),np.nan,pd.Timestamp("2021-01-01")]})
>>>
>>> df
1 2 3
0 Alfred NaN NaT
1 car Ford 2018-01-01
2 bike Giant 2018-07-01
3 Alex NaN NaT
4 car Toyota 2021-01-01
>>>
>>> new_df = pd.DataFrame(columns=['Transportation', 'Mark', 'BuyDate', 'Name'])
>>>
>>> j = 0
>>> for i in range(1, df.shape[0]):
... if df.loc[i][1] is np.nan:
... running_name = df.loc[i][0]
... continue
... new_df.loc[j] = list(df.loc[i]) + [running_name]
... j += 1
...
>>> new_df
Transportation Mark BuyDate Name
0 car Ford 2018-01-01 Alfred
1 bike Giant 2018-07-01 Alfred
2 car Toyota 2021-01-01 Alex
>>>
Given a small dataset df1 as follows:
city year quarter
0 sh 2019 q4
1 bj 2020 q3
2 bj 2020 q2
3 sh 2020 q4
4 sh 2020 q1
5 bj 2021 q1
I would like to create a quarterly date range from 2019-q2 to 2021-q1 as the column names, then check whether each city's year and quarter in df1 exists in df2.
If it exists, return y for that cell; otherwise, return NaN.
The final result will look like:
city 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN y y NaN y
1 sh NaN NaN y y NaN NaN y NaN
To create column names for df2:
pd.date_range('2019-04-01', '2021-04-01', freq = 'Q').to_period('Q')
How could I achieve this in Python? Thanks.
We can use crosstab on city and the string concatenation of the year and quarter columns:
new_df = pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
new_df:
col_0 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
city
bj 0 0 1 1 0 1
sh 1 1 0 0 1 0
We can convert to bool, replace False and True with the correct values, reindex to add the missing columns, and clean up the axes and index to get the exact output:
col_names = pd.date_range('2019-01-01', '2021-04-01', freq='Q').to_period('Q')
new_df = (
    pd.crosstab(df['city'], df['year'].astype(str) + '-' + df['quarter'])
      .astype(bool)                                  # counts to boolean
      .replace({False: np.nan, True: 'y'})           # fill values
      .reindex(columns=col_names.strftime('%Y-q%q')) # add missing columns
      .rename_axis(columns=None)                     # clean up axis name
      .reset_index()                                 # reset index
)
new_df:
city 2019-q1 2019-q2 2019-q3 2019-q4 2020-q1 2020-q2 2020-q3 2020-q4 2021-q1
0 bj NaN NaN NaN NaN NaN y y NaN y
1 sh NaN NaN NaN y y NaN NaN y NaN
DataFrame and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'city': ['sh', 'bj', 'bj', 'sh', 'sh', 'bj'],
    'year': [2019, 2020, 2020, 2020, 2020, 2021],
    'quarter': ['q4', 'q3', 'q2', 'q4', 'q1', 'q1']
})
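As a side note, the same table can also be built with assign and pivot instead of crosstab; a sketch reusing df and col_names from above (key and val are hypothetical helper names):
key = df['year'].astype(str) + '-' + df['quarter']
new_df = (
    df.assign(key=key, val='y')
      .drop_duplicates(['city', 'key'])             # pivot needs unique pairs
      .pivot(index='city', columns='key', values='val')
      .reindex(columns=col_names.strftime('%Y-q%q'))
      .rename_axis(columns=None)
      .reset_index()
)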
Below is an example dataframe using Pandas. Sorry about the table format.
I would like to take the text 'RF-543' in column 'Test Case' and populate it in the two empty rows above it.
Likewise, for the text 'RT-483' in column 'Test Case', I would like to populate it in the three empty rows above it.
I have over 900 rows of data with varying numbers of empty rows to be populated with the next non-empty value that follows. Any suggestions are appreciated.
Req#   Test Case
745    AB-003
348    AA-006
932
335
675    RF-543
348
988
444
124    RT-483
Regards,
Gus
If they are NaN, something like below works.
Creating a test dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A": ["apple", np.nan, np.nan, "banana", np.nan, "grape"]})
>>> df
A
0 apple
1 NaN
2 NaN
3 banana
4 NaN
5 grape
Use the bfill method (the modern equivalent of fillna(method='bfill'), which is deprecated):
>>> df["A"] = df["A"].bfill()
>>> df["A"]
0 apple
1 banana
2 banana
3 banana
4 grape
5 grape
Name: A, dtype: object
If the fields are empty:
>>> df = pd.DataFrame({"A": ["apple", "", "", "banana", "", "grape"]})
>>> df
A
0 apple
1
2
3 banana
4
5 grape
>>> df = df.apply(lambda x: x["A"] if x["A"] != "" else np.nan, axis=1)
>>> df
0 apple
1 NaN
2 NaN
3 banana
4 NaN
5 grape
Name: A, dtype: object
Then, use bfill() as in the first case.
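A minimal sketch of that remaining step:
>>> df["A"] = df["A"].bfill()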
Try something like this:
df.loc[df["Test Case"] == '', "Test Case"] = np.nan
df.bfill()
Output
Req# Test Case
0 745 AB-003
1 348 AA-006
2 932 RF-543
3 335 RF-543
4 675 RF-543
5 348 RT-483
6 988 RT-483
7 444 RT-483
8 124 RT-483
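Equivalently, as a single chain on just that column (a sketch, assuming numpy is imported as np):
df["Test Case"] = df["Test Case"].replace('', np.nan).bfill()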
Let's say I have dataframes that look like this:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
The logic I am trying to implement is something like this:
If all of column C = "NaN" then drop the entire column
Else if all of column C = "Member" drop the entire column
else do nothing
Any suggestions?
Edit: Added Expected Output
Expected Output if using on both data frames:
df_one
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
df_two:
a b
0 dave blue
1 bill red
2 sally green
Edit #2: Why am I doing this in the first place?
I am ripping text from a PDF file and placing it into CSV files using the Tabula library.
The data is not coming out in the way that I am hoping it would, so I am applying ETL concepts to move the data around.
The final outcome would be for management to be able to open the final result into a nicely formatted Excel file.
Some of the columns have part of the headers put into a separate row and things got shifted around for some reason while ripping the data out of the PDF.
The headers look something like this:
Team Type Member Contact
Count
What I am doing is checking an entire column for certain header values. If the entire column has a header value, I'm dropping the entire column.
The idea is to replace Member with missing values first, then test whether at least one value is non-missing with notna and any, and then add True for all the other columns with Series.reindex:
mask = (df[['c']].replace('Member', np.nan)
                 .notna()
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
print (mask)
Another idea is to chain both masks with & (bitwise AND):
mask = ((df[['c']].notna() & df[['c']].ne('Member'))
                 .any()
                 .reindex(df.columns, fill_value=True))
print (mask)
Last, filter the columns with DataFrame.loc:
df = df.loc[:, mask]
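For example, with df_two from the question (where column c holds only NaN and Member), the mask should come out along these lines:
a     True
b     True
c    False
dtype: bool
and df.loc[:, mask] then drops column c, matching the expected output.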
Here's an alternate approach to do this.
import pandas as pd
import numpy as np

c = ['a', 'b', 'c']
d = [
    ['dave', 'blue', np.nan],
    ['bill', 'red', np.nan],
    ['sally', 'green', 'Member'],
    ['Ian', 'Org', 'Paid']]
df1 = pd.DataFrame(d, columns=c)
df2 = df1.loc[df1['a'] != 'Ian']
print(df1)
print(df2)

if df1.c.replace('Member', np.nan).isnull().all():
    df1 = df1[df1.columns.drop(['c'])]
print(df1)

if df2.c.replace('Member', np.nan).isnull().all():
    df2 = df2[df2.columns.drop(['c'])]
print(df2)
Output of this is:
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
a b c
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 Ian Org Paid
a b
0 dave blue
1 bill red
2 sally green
My idea is simple; maybe it will help you. I want to make sure this is what you want: drop the whole column if it contains only NaN or 'Member', else do nothing.
So I need to check the column first (does it contain only NaN or 'Member'?). We change 'Member' to NaN and then test for all-null (or something else).
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['dave', 'bill', 'sally', 'ian'],
                   'B': ['blue', 'red', 'green', 'org'],
                   'C': [np.nan, np.nan, 'Member', 'Paid']})
df2 = df.drop(index=3)
print(df)
print(df2)
# df1
col = pd.Series([np.nan if x == 'Member' else x for x in df['C'].tolist()])
if col.isnull().all():
    df = df.drop(columns='C')

# df2
col = pd.Series([np.nan if x == 'Member' else x for x in df2['C'].tolist()])
if col.isnull().all():
    df2 = df2.drop(columns='C')

print(df)
print(df2)
A B C
0 dave blue NaN
1 bill red NaN
2 sally green Member
3 ian org Paid
A B
0 dave blue
1 bill red
2 sally green
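A shorter sketch of the same column check, using isin, which also matches NaN (assumes numpy imported as np):
if df2['C'].isin([np.nan, 'Member']).all():
    df2 = df2.drop(columns='C')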
The dataframe looks like this:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Brian', 'John', 'Adam'],
    'HomeAddr': [12, 32, 44],
    'Age': ['M', 'M', 'F'],
    'Genre': ['NaN', 'NaN', 'NaN']
})
The current output is:
Name HomeAddr Age Genre
0 Brian 12 M NaN
1 John 32 M NaN
2 Adam 44 F NaN
I would like to shift somehow content of HomeAddr and Age columns to columns+1. Below is a sample of the expected output.
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
I tried .shift(), but it doesn't work:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Brian', 'John', 'Adam'],
    'HomeAddr': [12, 32, 44],
    'Age': ['M', 'M', 'F'],
    'Genre': ['NaN', 'NaN', 'NaN']
})
df['HomeAddr'] = df['HomeAddr'].shift(-1)
print(df)
Name HomeAddr Age Genre
0 Brian 32.0 M NaN
1 John 44.0 M NaN
2 Adam NaN F NaN
Any ideas guys? Thank you!
Use DataFrame.shift, but it is necessary to convert the columns to strings first, to avoid losing values when shifting across columns of different dtypes; then convert the numeric column back:
df.loc[:, 'HomeAddr':] = df.loc[:, 'HomeAddr':].astype(str).shift(1, axis=1)
df['Age'] = pd.to_numeric(df['Age'])
print (df)
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
Another out-of-the-box solution:
df = df.drop('Genre', axis=1).rename(columns={'HomeAddr': 'Age', 'Age': 'Genre'})
df.insert(1, 'HomeAddr', np.nan)  # assumes numpy imported as np
print (df)
Name HomeAddr Age Genre
0 Brian NaN 12 M
1 John NaN 32 M
2 Adam NaN 44 F
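A more generic sketch, for any number of columns, shifts the values right through an intermediate object-dtype array so mixed dtypes survive, starting again from the original df (assumes numpy imported as np; cols, vals and shifted are hypothetical helper names):
cols = df.columns[1:]                    # every column after 'Name'
vals = df[cols].to_numpy(dtype=object)   # object dtype keeps mixed types intact
shifted = np.empty_like(vals)
shifted[:, 0] = np.nan                   # new leading column is empty
shifted[:, 1:] = vals[:, :-1]            # move everything one slot to the right
df[cols] = shifted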
For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
# copy the data above and run the code below to reproduce the dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and calculate the mean of the ratio over the rows that meet the following conditions: status is finished and start_date is in the range 2019-09 to 2019-10
result_count: group by id and address and count the rows that meet the following conditions: status is either finished or failed, and start_date is in the range 2019-09 to 2019-10
The desired output will look like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
In order to filter start_date within the range 2019-09 to 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To filter rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to use those to get the final result. Thanks for your help in advance.
I think you need GroupBy.agg, but because some groups are excluded entirely (like id=1), add them back with DataFrame.join against all unique id and address pairs in df2, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio', 'mean'),
                                        result_count=('ratio', 'size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
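Since fillna leaves result_count as floats, it can be cast back to integers if desired (a sketch):
df1['result_count'] = df1['result_count'].astype(int)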
Some helpers
def mean_ratio(idf):
    # filtering data
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull())]
    return np.round(idf['mean_ratio'].mean(), 2)

def result_count(idf):
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31'))]
    return idf.shape[0]
# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(mean_ratio).to_frame('mean_ratio')
result_count = g.apply(result_count).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)
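To match the flat shape of the desired output, the index can be reset at the end (a sketch; out is a hypothetical name):
out = pd.concat((mean_ratio, result_count), axis=1).reset_index()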