replacing a special character in a pandas dataframe - python-3.x

I have a dataset that uses '?' instead of 'NaN' for missing values. I could have gone through each column using replace, but the only problem is I have 22 columns. I am trying to create a loop to do it effectively, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col] == '?':
        adult[col] = adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' values with the fillna function, or to drop them with dropna. The second problem is that not all the columns are categorical, so the str accessor is also wrong. How can I easily deal with this situation?

If you're reading the data from a .csv or .xlsx file, you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise do what @MasonCaiby said and use adult.replace('?', float('nan'))
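For instance, a minimal sketch of the in-memory route (the column names and values here are invented). This also sidesteps the str accessor problem, since replace matches whole cell values across every column regardless of dtype:
import numpy as np
import pandas as pd

adult = pd.DataFrame({'age': [39, '?'], 'workclass': ['Private', '?']})
adult = adult.replace('?', np.nan)  # one call covers all 22 columns
adult = adult.dropna()              # or adult.fillna(...), per the original plan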

Related

Better way to read columns from excel file as variables in Python

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog) that will be in the same order each time. I would like to have each column as a variable in Python. I am implementing the method below, but would like to know if there is a better way to do it.
Also, I am writing the range manually (in the for loop), while I would like a robust way to know the length of each column in Excel (number of rows) and assign it.
# READ THE TEST DATA from Excel file
import xlrd

workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN. You can count how many there are with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
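Putting it together, a sketch of the pandas version of the loop above (path and column names taken from the question; this assumes the first sheet and a header row supplying those names):
import pandas as pd

# The header row supplies the column names; the row count is handled automatically
df = pd.read_excel(r"C:\Desktop\SawToothCalib\TestData.xlsx", sheet_name=0)
X_Test = df['X'].tolist()
Y_Test = df['Y'].tolist()
Row_Test = df['Row_Cog'].tolist()
Col_Test = df['Col_Cog'].tolist()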

Python: how can I get the mode from a month column that I extracted from a datetime column?

I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe, but you assign it to the variable df, then df will be a tuple that contains the actual dataframe, and not the dataframe itself.
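A minimal sketch of that failure mode (the loader function here is hypothetical):
import pandas as pd

def load_data():
    df = pd.DataFrame({'Start Time': ['2017-06-01 09:07:57', '2017-06-15 10:30:00']})
    return df, 'some metadata'   # returns more than just the DataFrame

df = load_data()                 # df is now a tuple...
# df['Start Time']               # ...so this raises: TypeError: tuple indices
                                 #    must be integers or slices, not str

df, meta = load_data()           # unpack instead
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['extracted_month'] = df['Start Time'].dt.month
print(df['extracted_month'].mode()[0])   # 6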

pd.to_numeric not working

I am facing a weird problem with pandas.
I do not know where I am going wrong. But when I create a new df, there seems to be no problem, as shown in the edit below.
Any idea why?
Edit :
sat = pd.read_csv("2012_SAT_Results.csv")
sat.head()

# converted columns to numeric types
sat.iloc[:, 2:] = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat.dtypes

sat_1 = sat.iloc[:, 2:].apply(pd.to_numeric, errors="coerce")
sat_1.head()
The fact that you can't apply to_numeric directly using .iloc appears to be a bug, but to get the same results that you're looking for (applying to_numeric to multiple columns at the same time), you could instead use:
df = pd.DataFrame({'a': ['1', '2'], 'b': ['3', '4']})
# If you're applying to entire columns
df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')
# If you want to apply to specific rows within columns
df.loc[df.index[1:], df.columns[1:]] = df.loc[df.index[1:], df.columns[1:]].apply(pd.to_numeric, errors='coerce')
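To see the difference, you can compare dtypes after each style of assignment (tiny invented frame; the exact .iloc behavior can vary between pandas versions):
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'], 'b': ['3', '4']})
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)   # 'b' may still report object: values were set in place

df[df.columns[1:]] = df[df.columns[1:]].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)   # 'b' is now int64: the column was replaced outright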

Hash each row of pandas dataframe column using apply

I'm trying to hash each value of a python 3.6 pandas dataframe column with the following algorithm on the dataframe-column ORIG:
HK_ORIG = base64.b64encode(hashlib.sha1(str(df.ORIG).encode("UTF-8")).digest())
However, the above-mentioned code does not hash each value of the column, so in order to hash each value of the df-column ORIG, I need to use the apply function. Unfortunately, I don't seem to be able to get this done.
I imagine it to look like the following code:
df["HK_ORIG"] = str(df['ORIG']).encode("UTF-8")).apply(hashlib.sha1)
I'm looking very much forward to your answers!
Many thanks in advance!
You can either create a named function and apply it - or apply a lambda function. In either case, do as much processing as possible within the dataframe.
A lambda-based solution:
df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(lambda x: base64.b64encode(hashlib.sha1(x).digest()))
A named function solution:
def hashme(x):
    return base64.b64encode(hashlib.sha1(x).digest())

df['ORIG'].astype(str).str.encode('UTF-8')\
    .apply(hashme)
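For reference, a small end-to-end run of the named-function version (the sample column values are invented):
import base64
import hashlib
import pandas as pd

df = pd.DataFrame({'ORIG': ['foo', 'bar']})

def hashme(x):
    return base64.b64encode(hashlib.sha1(x).digest())

# Assign the hashed Series back to the new column from the question
df['HK_ORIG'] = df['ORIG'].astype(str).str.encode('UTF-8').apply(hashme)
print(df)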

How to use Pandas to display specific columns from csv file?

I have a csv file with a number of columns in it. It is for students. I want to display only male students and their names. I used 1 for male students and 0 for female students. My code is:
import pandas as pd
data = pd.read_csv('normalizedDataset.csv')
results = pd.concat([data['name'], ['students']==1])
print results
I have got this error:
TypeError: cannot concatenate a non-NDFrame object
Can anyone help please. Thanks.
You can specify to read only certain column names of your data when you load your csv. Then use loc to locate all values where students equals 1.
data = pd.read_csv('normalizedDataset.csv', usecols=['name', 'students'])
data = data.loc[data.students == 1, :]
BTW, your original error is because you are trying to concatenate a dataframe with False.
>>> ['students']==1
False
No need to concat, you're stripping things away, not building.
Try:
data[data['students']==1]['name']
To provide clarity on why you were getting the error:
The second thing you were trying to concat was:
['students']==1
Which is not an NDFrame object. You'd want to replace that with:
data[data['students']==1]['students']
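A quick check of the corrected filter on a made-up frame:
import pandas as pd

data = pd.DataFrame({'name': ['Adam', 'Eve'], 'students': [1, 0]})
print(data[data['students'] == 1]['name'])   # only names where students == 1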
