Searching for missing values

Exercise template (fill in the blanks):
# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']
# Look at the dtypes of the columns
print(____)
# Find how missing values are represented in the median, p25th, and p75th columns
print(recent_grads["median"].____)
# Replace missing values with NaN, using numpy's np.nan
for column in ____:
    recent_grads.loc[____ == '____', column] = ____

Solution:
# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']
# Take a look at the dtypes
print(recent_grads[columns].dtypes)
# Find how missing values are represented
print(recent_grads["median"].unique())
# Replace missing values with NaN
for column in columns:
    recent_grads.loc[recent_grads[column] == 'UN', column] = np.nan

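One step the exercise doesn't show: after replacing 'UN' with np.nan, the three columns still have object dtype, so numeric summaries will skip them. A minimal follow-up sketch, assuming the same recent_grads frame:
# Convert the cleaned columns to floats so describe() treats them as numeric
recent_grads[columns] = recent_grads[columns].astype(float)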

# Print .dtypes
print(recent_grads.dtypes)
# Output summary statistics
print(recent_grads.describe())
# Exclude data of type object
print(recent_grads.describe(exclude=["object"]))

Related

Is there any pandas DataFrame function with which I can iterate over the rows of certain columns?

I want to solve this kind of problem in Python:
tran_df['bad_debt'] = tran_df.apply(lambda x: 1 if (x['second_mortgage'] != 0 and x['home_equity'] != 0) else x['debt'], axis=1)
I want to be able to create a new column and iterate over the rows of specific columns.
In Excel it's really easy; I did:
IF(AND(col_name1<>0, col_name2<>0), 1, col_name5)
Any help will be much appreciated.
To iterate over rows for only certain columns:
for rowIndex, row in df[['col1', 'col2']].iterrows():  # iterate over the rows of the two columns
    print(rowIndex, row['col1'], row['col2'])
To create a new column:
df['new'] = 0  # initialise as 0
As a rule, iterating over rows in pandas is wrong. Use the np.where function from NumPy to select the right values for the rows:
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
Alternatively, first create the new column with an initial value, then use .loc to locate rows that match the condition and assign the new value:
tran_df['bad_debt'] = tran_df['debt']
tran_df.loc[(tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0), 'bad_debt'] = 1
Or:
tran_df['bad_debt'] = 1
tran_df.loc[(tran_df['second_mortgage'] == 0) | (tran_df['home_equity'] == 0), 'bad_debt'] = tran_df['debt']
Remember to put round brackets around each condition when combining them with the bitwise operators (& and |).
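For a concrete check, here is a small self-contained version of the same pattern (toy values, made up for illustration):
import numpy as np
import pandas as pd

tran_df = pd.DataFrame({'second_mortgage': [0, 1, 2],
                        'home_equity': [3, 0, 4],
                        'debt': [10, 20, 30]})
# 1 where both columns are nonzero, otherwise keep the value from 'debt'
tran_df['bad_debt'] = np.where(
    (tran_df['second_mortgage'] != 0) & (tran_df['home_equity'] != 0),
    1, tran_df['debt'])
print(tran_df['bad_debt'].tolist())  # [10, 20, 1]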

Split all column names by specific characters and take the last part as new column names in Pandas

I have a dataframe with column names like this:
id, xxx>xxx>x, yy>y, zzzz>zzz>zz>z, ...
I need to split each name on > and take the last part as the new column name: id, x, y, z, ....
I have used 'zzzz>zzz>zz>z'.rsplit('>', 1)[-1] to get z as the expected new name for the third column.
But when I use df.columns = df.columns.str.rsplit('>', 1)[-1] I get:
ValueError: Length mismatch: Expected axis has 13 elements, new values have 2 elements
How can I do this correctly?
Try:
names = pd.Index(['xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])
names = pd.Index([idx[-1] for idx in names.str.rsplit('>')])
print(names)
# Index(['x', 'y', 'z'], dtype='object')
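A more direct variant is to rewrite df.columns itself with a plain list comprehension. A minimal sketch on a toy frame (names without '>' pass through unchanged, so id keeps its name):
import pandas as pd

df = pd.DataFrame(columns=['id', 'xxx>xxx>x', 'yy>y', 'zzzz>zzz>zz>z'])
df.columns = [name.rsplit('>', 1)[-1] for name in df.columns]
print(df.columns)  # Index(['id', 'x', 'y', 'z'], dtype='object')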

Double a column's values when a different column equals a specified value

I want to double the value in the distance column for the rows that have 'one-way' in the hike_type column. I am iterating through the df and finding the proper rows, but I am having trouble getting the multiplication to stick.
This finds the proper rows, but the change doesn't take effect:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        row['distance'] * 2
This hasn't worked either:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        row['distance'] = row['distance'] * 2
For some reason, when I do the following, it prints what I want:
for index, row in df.iterrows():
    if row['hike_type'] == 'one-way':
        print(row['distance'] * 2)
The loops don't work because iterrows() yields a copy of each row, so assignments to row never reach the original frame. IIUC, what you want can be achieved with just one line, as below:
df['distance'] = np.where(df['hike_type'] == 'one-way', df['distance'].astype(int) * 2, df['distance'])
Or you can use df.loc with df.update as below:
df.update(df.loc[df['hike_type'] == 'one-way', 'distance'].astype(int) * 2)
Or:
df.update(df[df['hike_type'] == 'one-way']['distance'].astype(int) * 2)
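A short self-contained sketch (toy data, made-up values) of the same fix using .loc directly, without the astype cast:
import pandas as pd

df = pd.DataFrame({'hike_type': ['one-way', 'loop'], 'distance': [3.0, 5.0]})
df.loc[df['hike_type'] == 'one-way', 'distance'] *= 2
print(df)  # the one-way row now shows distance 6.0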

How does iloc[:,1:] work? Can anyone explain the [:,1:] parameters?

What is the meaning of the lines below? I'm especially confused about how iloc[:,1:] works, and also about data[:,:1]:
data = np.asarray(train_df_mv_norm.iloc[:,1:])
X, Y = data[:,1:], data[:,:1]
Here train_df_mv_norm is a dataframe.
Definition: pandas iloc
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
For example:
df.iloc[:3] # slice your object, i.e. first three rows of your dataframe
df.iloc[0:3] # same
df.iloc[0, 1] # index both axis. Select the element from the first row, second column.
df.iloc[:, 0:5] # first five columns of data frame with all rows
So train_df_mv_norm.iloc[:,1:] will select all rows, but the first column will be excluded.
Note that:
df.iloc[:,:1] selects all rows and the columns from 0 (included) to 1 (excluded), i.e. only the first column.
df.iloc[:,1:] selects all rows and all columns except the first one.
To complete the answer by KeyMaker00, I add that data[:,:1] means:
The first : takes all rows.
:1 is equal to 0:1: take columns starting from column 0, up to (but excluding) column 1.
So, to sum up, the second expression reads only the first column from data.
As your expression has the form:
<variable_list> = <expression_list>
each expression is assigned to the corresponding variable (X and Y).
To complete the answers above, here is what each variant returns, its shape, and how to use iloc with a column name:
df.iloc[:, 1:2]         # get column 1 as a DATAFRAME of shape (n, 1)
df.iloc[:, 1:2].values  # get column 1 as an NDARRAY of shape (n, 1)
df.iloc[:, 1].values    # get column 1 as an NDARRAY of shape (n,)
df.iloc[:, 1]           # get column 1 as a SERIES of shape (n,)
# iloc with the name of a column
df.iloc[:, df.columns.get_loc('my_col')]  # there may be more elegant methods
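A quick runnable check (toy array) of the two slices discussed above:
import numpy as np

data = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns
X, Y = data[:, 1:], data[:, :1]
print(X.shape, Y.shape)  # (4, 2) (4, 1): X drops the first column, Y keeps only it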

Comparing strings in same series (row) but different columns

I ran into this problem when comparing strings between two columns. What I want to do is: for each row, check whether the string in column A is included in column B, and if so, write 'Yes' in a new column C.
Column A contains NaN values (blank cells in the csv I imported).
I have tried:
df['C'] = df['B'].str.contains(df.loc['A'])
df.loc[df['A'].isin(df['B']), 'C'] = 'Yes'
Neither worked, as I couldn't find the right way to compare the strings.
This uses a list comprehension, so it may not be the fastest solution, but it works and is concise:
df['C'] = pd.Series(['Yes' if a in b else 'No' for a, b in zip(df['A'], df['B'])])
EDIT: If you want to keep the existing values in C instead of overwriting them with 'No', you can do it like this:
df['C'] = pd.Series(['Yes' if a in b else c for a, b, c in zip(df['A'], df['B'], df['C'])])
Another option, with np.where:
df = pd.DataFrame([['ab', 'abc'],
                   ['abc', 'ab']], columns=list('AB'))
df['C'] = np.where(df.apply(lambda x: x.A in x.B, axis=1), 'Yes', 'No')
df
You could also try regex (https://docs.python.org/2/library/re.html) if you already have code to identify each cell or value you have to work with.
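One caveat worth a sketch: the question mentions NaN values in column A, and a plain a in b comparison raises a TypeError when a is NaN. A minimal NaN-safe variant (toy data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ab', np.nan], 'B': ['abc', 'ab'], 'C': ['old', 'old']})
mask = df.apply(lambda x: isinstance(x.A, str) and x.A in x.B, axis=1)
df.loc[mask, 'C'] = 'Yes'
print(df)  # row 0 gets 'Yes'; the NaN row keeps its old C value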
