Adding labels to data in csv format for machine learning - scikit-learn

I intend to make a model using sklearn to predict cuisines. I however have this column in my data (Column B) that brings me a ValueError: could not convert string to float: 'indian'
please help if you can.
csv file

You are probably trying to cast that column to a float somewhere in your code. If you're using sklearn, it will handle converting the label column specified into a numeric label representation. If you want to specify the string label name to integer label you can do it like this:
label_mapper = dict(zip(set(df['Column B']), len(set(df['Column B'])))
df['Column B'] = df['Column B'].apply(lambda x: label_mapper[x])

Related

Is there an python solution for mapping a (pandas data frame) with (unique values of Split a string column)

I have a data frame (df).
The Data frame contains a string column called: supported_cpu.
The (supported_cpu) data is a string type separated by a comma.
I want to use this data for the ML model.
enter image description here
I had to get unique values for the column (supported_cpu). The output is a (list) of unique values.
def pars_string(df,col):
#Separate the column from the string using split
data=df[col].value_counts().reset_index()
data['index']=data['index'].str.split(",")
# Create a list including all of the items, which is separated by column
df_01=[]
for i in range(data.shape[0]):
for j in data['index'][i]:
df_01.append(j)
# get unique value from sub_df
list_01=list(set(df_01))
# there are some leading or trailing spaces in the list_01 which need to be deleted to get unique value
list_02=[x.strip(' ') for x in list_01]
# get unique value from list_02
list_03=list(set(list_02))
return(list_03)
supported_cpu_list = pars_string(df=df,col='supported_cpu')
The output:
enter image description here
I want to map this output to the data frame to encode it for the ML model.
How could I store the output in the data frame? Note : Some row have a multi-value(more than one CPU)
Input: string type separated by a column
output: I did not know what it should be.
Input: string type separated by a column
output: I did not know what it should be.
I really recommend to anyone who's starting using pandas to read about vectorization and thinking in terms of columns (aka Series). This is the way it was build and it is the way in which its supposed to be used.
And from what I understand (I may be wrong) is that you want to get unique values from supported_cpu column. So you could use the Series methods on string to split that particular column, then flatten the resulting array using internal `chain
from itertools import chain
df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
unique_vals = set(chain(*df['supported_cpus'].tolist()))
unique_vals = (item for item in unique_vals if item)
Multi-values in some rows should be parsed to single values for later ML model training. The list can be converted to dataframe simply by pd.DataFrame(supported_cpu_list).

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocess this data is to create an imputer for every type of data needs to be fixed:
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
Here are some rows of my data:
There is a python package which can do this for you in a simple way, ctrl4ai
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system mermory if the size of the dataset is huge
Returns: Dataframe [with separate column for each categorical values]
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
If you have a numerical column, you can use some approaches to fill the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Lets see how it works for a mean for one column e.g.:
One method would be to use fillna from pandas:
X['Name'].fillna(X['Name'].mean(), inplace=True)
For categorical data please have a look here: Impute categorical missing values in scikit-learn

Convert All Items in a Dataframe to Float

I am trying to convert all items in my dataframe to a float. The types are varies at the moment. The following error persist -> ValueError: could not convert string to float: '116,584.54'
The file can be found at https://www.imf.org/external/pubs/ft/weo/2019/01/weodata/WEOApr2019all.xls
I checked the value in excel, it is a Number. I tried .replace, .astype, pd.to_numeric.
for i in weo['1980']:
if i == float:
print(i)
i.replace(",",'')
i.replace("--",np.nan)
else:
continue
Also, I have tried:
weo['1980'] = weo['1980'].apply(pd.to_numeric)
You can try using DataFrame.astype in order to conduct the conversion which is usually the recommended approach. As you already attempted in your question, you may have to remove all the comas form the string in column 1980 first as it may cause the same error as quoted in your question:
weo['1980'] = weo['1980'].replace(',', '')
weo['1980'] = weo['1980'].asytpe(float)
If you're reading your DataFrame from Excel using pandas.read_excel, you can also specify the thousands argument to do this conversion for you which will likely result in a higher performance:
pandas.read_excel(file, thousands=',')
I had types error all the time while playing with dataframes. I now always use this to convert all the values that can be converted into floats.
# Convert all columns that can be converted into float into float.
# Error were raised because their type was Object
df = df.apply(pd.to_numeric, errors='ignore')

How to convert Excel negative value to Pandas negative value

I am a beginner in python pandas. I am working on a data-set named fortune_company. Data set are like below.
In this data-set for Profits_In_Million column there are some negative value which is indicating by red color and parenthesis.
but in pandas it's showing like below screenshot
I was trying to convert the data type Profits_In_Million column using below code
import pandas as pd
fortune.Profits_In_Million = fortune.Profits_In_Million.str.replace("$","").str.replace(",","").str.replace(")","").str.replace("(","-").str.strip()
fortune.Profits_In_Million.astype("float")
But I am getting the below error. Please someone help me one that. How I can convert this string datatype to float.
ValueError: could not convert string to float: '-'
Assuming you have no control over the cell format in Excel, the converters kwarg of read_excel can be used:
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels, values are functions that take
one input argument, the Excel cell content, and return the transformed
content.
From read_excel's docs.
def negative_converter(x):
# a somewhat naive implementation
if '(' in x:
x = '-' + x.strip('()')
return x
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 $1000
# 1 -$1000
Note however that the values of this column are still strings and not numbers (int/float). You can quite easily implement the conversion in negative_converter (remove the the dollar sign, and most probably the comma as well), for example:
def negative_converter(x):
# a somewhat naive implementation
x = x.replace('$', '')
if '(' in x:
x = '-' + x.strip('()')
return float(x)
df = pd.read_excel('test.xlsx', converters={'Profits_In_Million': negative_converter})
print(df)
# Profits_In_Million
# 0 1000.0
# 1 -1000.0

Converting an array of strings containing range of integer values into an array of floats

Click here to see an image that contains a screenshot sample of the data.
I have a CSV file with a column for temperature range with values like "20-25" stored as string. I need to convert this to 22.5 as a float.
Need this to be done for the entire column of such values, not a single value.I want to know how this can be done in Python as i am very new to it.
Notice in the sample data image that there are NaN values as well in the records
Like said in the reactions split the array using "-" as argument.
Second, create a float array of it. Finally, take the average using numpy.
import numpy as np
temp_input = ["20-25", "36-40", "10-11", "23-24"]
# split and convert to float
# [t.split("-") for t in temp_input] is an inline iterator
tmp = np.array([t.split("-") for t in temp_input], dtype=np.float32)
# average the tmp array
temp_output = np.average(tmp, axis=1)
And here's a oneliner:
temp_output = [np.average(np.array(t.split('-'), dtype=np.float32)) for t in temp_input]

Resources