Convert All Items in a Dataframe to Float - python-3.x

I am trying to convert all items in my dataframe to floats. The types vary at the moment. The following error persists: ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in Excel, and it is a Number. I tried .replace, .astype, and pd.to_numeric.
for i in weo['1980']:
    if i == float:
        print(i)
        i.replace(",",'')
        i.replace("--",np.nan)
    else:
        continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)

You can try using DataFrame.astype for the conversion, which is usually the recommended approach. As you already attempted in your question, you will first have to remove all the commas from the strings in column 1980, since they cause the error quoted in your question. Note that replace only matches whole cell values unless you pass regex=True:
weo['1980'] = weo['1980'].replace(',', '', regex=True)
weo['1980'] = weo['1980'].astype(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you, which will likely perform better:
pandas.read_excel(file, thousands=',')
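For a column that mixes real numbers, comma-formatted strings, and placeholders such as "--" (as the question describes), a minimal sketch of the cleanup could look like the following. The sample values are made up for illustration, and using pd.to_numeric with errors='coerce' instead of astype is my own assumption about what is wanted here: it turns anything unparseable into NaN rather than raising.

```python
import pandas as pd

# Hypothetical sample mimicking the mixed values described in the question.
weo = pd.DataFrame({"1980": ["116,584.54", "--", 12.5, "3,000", None]})

# Work on string representations so that .str methods apply to every cell,
# then strip the thousands separators.
as_text = weo["1980"].astype(str).str.replace(",", "", regex=False)

# errors="coerce" turns anything that is still not a number ("--", missing) into NaN.
weo["1980"] = pd.to_numeric(as_text, errors="coerce")

print(weo["1980"])
```

If silently turning bad cells into NaN is not acceptable, drop errors="coerce" so the first unparseable value raises and can be inspected.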

I kept running into type errors while working with dataframes. I now always use this to convert every value that can be converted to a float.
# Convert every column that can be converted to float.
# Errors were raised because the columns had dtype object.
df = df.apply(pd.to_numeric, errors='ignore')
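As far as I know, errors='ignore' has been deprecated in recent pandas releases (2.2 and later), so here is a hedged sketch of the same idea without it. The helper name to_numeric_where_possible is hypothetical, not a pandas function; it simply tries each column and leaves it alone if the conversion fails.

```python
import pandas as pd


def to_numeric_where_possible(frame):
    """Return a copy in which every fully convertible column is numeric."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # At least one value could not be parsed: keep the column as it was.
            pass
    return out


# Made-up example data: "a" converts, "b" does not, "c" is already numeric.
df = pd.DataFrame({"a": ["1", "2.5"], "b": ["x", "3"], "c": [1, 2]})
df = to_numeric_where_possible(df)
print(df.dtypes)
```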

Related

Convert an unknown data item to string in Python

I have certain data that need to be converted to strings. Example:
[ABCGHDEF-12345, ABCDKJEF-123235,...]
The example above is not a constant or a string by itself; it is taken from an Excel sheet (up to 30+ items per row). I want to convert these to strings. Since the data is undefined, explicitly converting them doesn't work. Is there a way to do this iteratively without manually placing double/single quotes around each data element?
What I want finally:
["ABCGHDEF-12345", "ABCDKJEF-123235",...]
To convert the string to list of strings you can try:
s = "[ABCGHDEF-12345, ABCDKJEF-123235]"
s = s.strip("[]").split(", ")
print(s)
Prints:
['ABCGHDEF-12345', 'ABCDKJEF-123235']

Adding labels to data in csv format for machine learning

I intend to build a model using sklearn to predict cuisines. However, one column in my data (Column B) raises ValueError: could not convert string to float: 'indian'
Please help if you can.
You are probably casting that column to a float somewhere in your code. If you're using sklearn and this column is the label you pass to it, sklearn will handle converting it into a numeric label representation. If you want to map each string label to an integer yourself, you can do it like this:
labels = sorted(set(df['Column B']))
label_mapper = dict(zip(labels, range(len(labels))))
df['Column B'] = df['Column B'].apply(lambda x: label_mapper[x])
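As an alternative sketch of the same string-to-integer mapping, pandas.factorize builds the codes and the list of unique labels in one call. The sample cuisines and the new column name "label" are made up for illustration.

```python
import pandas as pd

# Made-up sample standing in for the CSV's Column B.
df = pd.DataFrame({"Column B": ["indian", "thai", "indian", "mexican"]})

# sort=True assigns codes in sorted label order, so the mapping is reproducible.
codes, uniques = pd.factorize(df["Column B"], sort=True)
df["label"] = codes

# Keep the mapping around so predictions can be translated back to names.
label_mapper = {name: code for code, name in enumerate(uniques)}
print(label_mapper)
```

Writing the codes to a separate column keeps the original strings available for inspection.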

Getting an error when calculating Z score

I am trying to find the outliers in my dataset and remove them. So I did the following:
z_scores = stats.zscore(dataset_sex)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
new_df = dataset_sex[filtered_entries]
new_df.head()
but I got this error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
The error seems to generate from the first line of code (z_scores = stats.zscore(dataset_sex)). I don't understand why. How can I fix this?
This happens because some of the columns in your data contain strings (in Python terms, 'str').
To work out the z-score, each value has its column's mean subtracted and is then divided by the column's standard deviation. One of your columns probably holds strings like 'M' or 'F' for sex, or strings like '1,232.23' that were not converted to floats, and z-scoring does not work on those.
My first suggestion is to check that they are all numbers.
df.dtypes
will show you what types they are; then convert the non-numeric columns to numeric.
Post a little of the data (a couple of rows) and we can help you.
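In the meantime, here is a rough sketch of the check-then-convert idea. The data is invented (I don't know what dataset_sex really contains), and the z-score is computed by hand with pandas rather than with stats.zscore, using ddof=0, which is the default that scipy's zscore uses.

```python
import pandas as pd

# Invented stand-in for dataset_sex: one label column, one number stored as
# text with thousands separators, and one numeric column with an outlier.
dataset_sex = pd.DataFrame({
    "sex": ["M", "F"] * 10,
    "income": [f"{1000 + 10 * i:,.2f}" for i in range(20)],
    "age": list(range(20, 39)) + [400],
})
print(dataset_sex.dtypes)  # "sex" and "income" show up as text, not numbers

# Convert the text column that really holds numbers.
dataset_sex["income"] = pd.to_numeric(
    dataset_sex["income"].str.replace(",", "", regex=False), errors="coerce"
)

# Z-score only the numeric columns; "sex" is left out of the calculation.
numeric = dataset_sex.select_dtypes(include="number")
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

filtered_entries = (z_scores.abs() < 3).all(axis=1)
new_df = dataset_sex[filtered_entries]
print(len(dataset_sex), "->", len(new_df))
```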

Python: how can I get the mode from a month column that i extracted from a datetime column?

I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe, but you assign it to the variable df, then df will be a tuple that contains the actual dataframe, and not the dataframe itself.
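A small sketch of how that can happen. The load_data function, its return values, and the sample timestamps are hypothetical, since the code that builds df is not shown in the question.

```python
import pandas as pd


def load_data():
    """Hypothetical loader that returns a DataFrame plus one extra value."""
    frame = pd.DataFrame({
        "Start Time": [
            "2017-01-05 09:00:00",
            "2017-06-10 12:30:00",
            "2017-06-21 18:45:00",
        ]
    })
    return frame, "chicago"


# Assigning the whole return value makes df a tuple, not a DataFrame.
df = load_data()
try:
    df["Start Time"]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unpacking the return value gives the actual DataFrame.
df, city = load_data()
df["Start Time"] = pd.to_datetime(df["Start Time"])
df["extracted_month"] = df["Start Time"].dt.month
popular_month = df["extracted_month"].mode()[0]
print(popular_month)
```

Running print(type(df)) right after the assignment is a quick way to confirm whether this is what is going on.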

pandas reading data from column in as float or int and not str despite dtype setting

I have an issue with pandas (0.23.4) on Python 3.7 where data is read in as scientific notation instead of as a plain string, despite the dtype setting. Here is an example of the data being read in:
-------------------
codes
-------------------
001234544
00023455
123456789
A1253532
780E9000
00678E10
The problem is with lines 5 and 6 above: I think that because they contain 'E' characters, they are being turned into scientific notation.
My reader is setup as follows.
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', dtype=str)
Despite the dtype=str setting, it appears that pandas uses something called a "sniffer" that detects the data type automatically, so the values are changed back to what I assume is float or int and then rendered in scientific notation. One suggestion in another thread is to use a converters argument within read_csv, like the following:
pd.read_csv('my.csv', converters = {i: str for i in range(0, 100)})
I am curious whether this is a possible solution to my problem, but I have no idea how long that range should be, as it changes often. Is there any way to query the length of the column and feed it as a variable into that range call?
It looks like I can do something like len(accounts.index), but I can't do this until after the reader has read the file, so something like the following doesn't work:
accounts = pd.read_excel('gym_accounts.xlsx', sheet_name='Sheet1', converters = {i: str for i in range(0, gym_length)})
gym_length = len(accounts.index)
The length check comes after the data reader (I guess that's what you call it), so it obviously doesn't work.
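One possible way around the ordering problem, sketched under assumptions: the integer keys in converters refer to column positions, so the number needed is the column count, not len(accounts.index) (which counts rows). Reading only the header first with nrows=0 gives that count before the real read. The sketch uses read_csv on an in-memory sample so it runs without the Excel file; the extra "name" column is invented, and I am assuming the same nrows=0 idea carries over to read_excel.

```python
import io

import pandas as pd

# In-memory CSV standing in for the real file; the "name" column is made up.
raw = "codes,name\n001234544,a\n780E9000,b\n00678E10,c\n"

# Option 1: dtype=str keeps every value as text when reading CSV.
accounts = pd.read_csv(io.StringIO(raw), dtype=str)

# Option 2: read just the header to learn how many columns there are,
# then build one str converter per column position.
header = pd.read_csv(io.StringIO(raw), nrows=0)
converters = {i: str for i in range(len(header.columns))}
accounts_conv = pd.read_csv(io.StringIO(raw), converters=converters)

print(accounts["codes"].tolist())
print(accounts_conv["codes"].tolist())
```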
