Float is not converting to integer pandas - python-3.x

I used this code to convert my float numbers into integers, but it does not work. Here are all the steps I have gone through so far:
Step 1: I converted timestamp1 and timestamp2 to datetime in order to subtract them and get the number of days:
a=pd.to_datetime(df['timestamp1'], format='%Y-%m-%dT%H:%M:%SZ')
b=pd.to_datetime(df['timestamp2'], format='%Y-%m-%dT%H:%M:%SZ')
df['delta'] = (b-a).dt.days
Step 2: I converted the strings into integers representing the day:
df['delta'] = pd.to_datetime(df['delta'], format='%Y-%m-%d', errors='coerce')
df['delta'] = df['delta'].dt.day
Step 3: I am trying to convert floats into integers.
categorical_feature_mask = df.dtypes==object
categorical_cols = df.columns[categorical_feature_mask].tolist()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)
However, it throws this error: TypeError: ('argument must be a string or number', 'occurred at index col1')

To convert a float column that contains NaN values to an integer, there are two things you can do:
Convert to a plain numpy int and replace the NaN values with an arbitrary value, like this:
df[col].fillna(0).astype("int32")
If you want to conserve NaN values use this:
df[col].astype("Int32")
Note the difference with the capital "I". For more information on this pandas feature, see the documentation on the nullable integer data type.
Why do you need to do this? Because by default, when a column has at least one NaN value, pandas treats the whole column as float; this is how NumPy behaves.
The same thing happens with strings: if you have at least one string value in the column, the whole column is labeled as object, which is why your first attempt failed.
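Here is a minimal sketch of the difference between the two options (sample values made up for illustration):
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# Plain numpy dtype: the NaN must be filled first, here with 0
print(s.fillna(0).astype("int32").tolist())  # [1, 2, 0]

# Pandas' nullable integer dtype: the missing value survives as <NA>
print(s.astype("Int32"))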

You can also convert columns from float to int like this; use errors='ignore' if the data contains null values (note that recent pandas versions deprecate errors='ignore' in astype):
df[column_name] = df[column_name].astype("Int64", errors="ignore")

Related

pandas converts long decimal string into "-inf"

I have a CSV with a float value represented as a long decimal string ("-1." followed by 342 zeros). Example below:
ID,AMOUNT
"id_1","-1.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
The issue is that, when reading into a pandas (0.25) DataFrame, the value automatically gets converted to a -inf:
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -inf
Name: AMOUNT, dtype: float64
If I change the value in the CSV to "-1.0", it works as expected. Weirdly, there seems to be a sweet spot for how long the string can be: if I manually truncate it to only 308 zeros, the value is read in correctly as -1.0:
# when value is "-1.0" or "-1." followed by 308 0's
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -1.0
Name: AMOUNT, dtype: float64
The ideal solution would be to ensure the value is truncated at the source before we process it. But in the meantime, what is the reason for this behavior, and/or is there a workaround for it?
We are currently using Python 3.6 and pandas 0.25.
One workaround is to read the column in as strings, then convert it with the built-in float function, which parses the long decimal string correctly:
df = pd.read_csv("/path/to/file.csv", dtype=str)
df['AMOUNT'] = df['AMOUNT'].apply(float)
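The cutoff near 308 zeros is suggestive, since the largest finite float64 is about 1.8e308; that points at an overflow in the C parser's fast string-to-double conversion (my reading of the behavior, not confirmed here). Under that assumption, read_csv's float_precision argument is worth trying:
import pandas as pd

# Ask the C parser for its slower round-trip converter, which handles very
# long decimal strings that the default fast path can turn into +/-inf
df = pd.read_csv("/path/to/file.csv", float_precision="round_trip")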

Data Cleaning with Pandas in Python

I am trying to clean a CSV file for data analysis. How do I convert TRUE/FALSE into 1 and 0?
When I searched Google, the suggestion was df.somecolumn = df.somecolumn.astype(int). However, this CSV file has 100 columns, and not every column is TRUE/FALSE (some are categorical, some are numerical). How do I write one sweeping piece of code that converts any column with TRUE/FALSE to 1 and 0, without typing 50 lines of df.somecolumn = df.somecolumn.astype(int)?
you can use select_dtypes to find the boolean columns and then cast just those (the result of select_dtypes cannot be assigned to directly):
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)
A slightly different approach.
First, the dtypes of a dataframe can be returned using df.dtypes, which gives a pandas Series that looks like this:
a int64
b bool
c object
dtype: object
Second, we could replace bool with an int type using replace:
df.dtypes.replace('bool', 'int8')
This gives:
a int64
b int8
c object
dtype: object
Finally, a pandas Series is essentially dict-like, so it can be passed to pd.DataFrame.astype.
We could write it as a one-liner:
df.astype(df.dtypes.replace('bool', 'int8'))
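A self-contained sketch of that one-liner on a toy frame (made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [True, False], 'c': ['x', 'y']})

# Map every bool column's dtype to int8 and hand the whole
# column -> dtype mapping back to astype in one call
df = df.astype(df.dtypes.replace('bool', 'int8'))
print(df.dtypes)  # a: int64, b: int8, c: object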
I would do it like this:
df.somecolumn = df.somecolumn.apply(lambda x: 1 if x == "TRUE" else 0)
If you want to iterate through all your columns and check whether they contain TRUE/FALSE values, you can do this (note that 'TRUE' in df[c] would check the index, not the values, hence isin):
for c in df:
    if df[c].isin(['TRUE', 'FALSE']).any():
        df[c] = df[c].apply(lambda x: 1 if x == 'TRUE' else 0)
Note that this approach is case-sensitive and won't work well if in the column the TRUE/FALSE values are mixed with others.
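An alternative sketch using map, which leaves anything that is not exactly 'TRUE' or 'FALSE' as NaN instead of silently turning it into 0:
import pandas as pd

df = pd.DataFrame({'flag': ['TRUE', 'FALSE', 'maybe']})

# Values missing from the mapping become NaN, making bad data visible
df['flag'] = df['flag'].map({'TRUE': 1, 'FALSE': 0})
print(df['flag'].tolist())  # [1.0, 0.0, nan]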

Pandas and datetime coercion. Can't convert whole column to Timestamp

So, I have an issue. Pandas keeps telling me that:
'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and a TypeError will be raised. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
I'd like to get rid of this warning
Until now I had a dataframe with some data that I was filtering and manipulating. At some point I have a column with dates in string format. I don't care about timezones etc.; it's all about day accuracy. I'm getting the warning mentioned above when I convert the strings to datetime, like below:
df['Some_date'] = pd.to_datetime(df['Some_date'], format='%m/%d/%Y')
So I tried to do something like that:
df['Some_date'] = pd.Timestamp(df['Some_date'])
But it fails, as pd.Timestamp doesn't accept a Series as an argument.
I'm looking for the quickest way to convert those strings to Timestamps.
=====================================
EDIT
I'm so sorry for the confusion. I'm getting my error at another place. It happens when I try to filter my data like this:
df = df[(df['Some_date'] > firstday)]
Where firstday is calculated based on datetime, like here:
import datetime
def get_dates_filter():
    lastday = datetime.date.today().replace(day=1) - datetime.timedelta(days=1)
    firstday = lastday.replace(day=1)
    return firstday, lastday
So the issue is probably comparing two different representations of dates.
Python date objects are still poorly supported in pandas; it is best to work with datetimes, even if they carry no time component.
If there are Python dates, you can convert them to strings before to_datetime:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
If you need to remove the times from the datetimes in the column, use:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str)).dt.floor('d')
Test:
rng = pd.date_range('2017-04-03', periods=3).date
df = pd.DataFrame({'Some_date': rng})
print (df)
Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'datetime.date'>
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
print (df)
Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
print (df['Some_date'].dtype)
datetime64[ns]
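Given the edit, the warning most likely comes from comparing a datetime64 column with a plain datetime.date, and the warning itself suggests the fix: wrap the date in pd.Timestamp. A minimal sketch with made-up dates:
import datetime
import pandas as pd

df = pd.DataFrame({'Some_date': pd.to_datetime(['01/05/2024', '02/10/2024'],
                                               format='%m/%d/%Y')})
firstday = datetime.date.today().replace(day=1)

# Wrapping the python date in pd.Timestamp avoids the coercion warning
df = df[df['Some_date'] > pd.Timestamp(firstday)]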

Cannot convert object as strings to int - Unable to parse string

I have a data frame with one column denoting a range of ages. The data type of the Age column is shown as string. I am trying to convert the string values to numeric so the model can interpret the features.
I tried the following to convert to 'int'.
df.Age = pd.to_numeric(df.Age)
I get the following error:
ValueError: Unable to parse string "0-17" at position 0
I also tried using the 'errors = coerce' parameter but it gave me a different error:
df.Age = pd.to_numeric(df.Age, errors='coerce').astype(int)
Error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
But there are no NA values in any column in my df
As an aside, the second error occurs because errors='coerce' turns every unparseable string like "0-17" into NaN, so the astype(int) then fails even though the original column had no missing values.
Age seems to be a categorical variable, so you should treat it as such. pandas has a neat category dtype which converts your labels to integers under the hood:
df['Age'] = df['Age'].astype('category')
Then you can access the underlying integers using the cat accessor:
codes = df['Age'].cat.codes # This returns integers
Also, you probably want to make Age an ordered categorical variable, for which there is a neat recipe in the docs:
from pandas.api.types import CategoricalDtype
age_category = CategoricalDtype([...your labels in order...], ordered=True)
df['Age'] = df['Age'].astype(age_category)
Then you can access the underlying codes in the same way and be sure that they reflect the order you entered for your labels.
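A short sketch of the ordered variant, with hypothetical age brackets (substitute your real labels in order):
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical bracket labels, ordered from youngest to oldest
age_category = CategoricalDtype(['0-17', '18-25', '26-35', '36-50', '51+'],
                                ordered=True)

df = pd.DataFrame({'Age': ['18-25', '0-17', '51+']})
df['Age'] = df['Age'].astype(age_category)
print(df['Age'].cat.codes.tolist())  # [1, 0, 4]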
At first glance, I would say it is because you are attempting to convert a string that contains more than just an integer. Your string is "0-17", which is not an integer; if it had been "17" or "0", the conversion would have worked:
val = int("0")
val = int("17")
I have no idea what your to_numeric method is, so I am not sure whether this answers your question.
Why don't you split the column?
a = df["age"].str.split("-", n=1, expand=True)
df['age_from'] = a[0]
df['age_to'] = a[1]
Here is what I got at the end:
         date    age
0  2018-04-15  12-20
1  2018-04-15   2-30
2  2018-04-18  5-46+

         date    age age_from age_to
0  2018-04-15  12-20       12     20
1  2018-04-15   2-30        2     30
2  2018-04-18  5-46+        5    46+

Replacing NaN Values in a Pandas DataFrame with Different Random Uniform Variables

I have a uniform distribution in a pandas dataframe column with a few NaN values I'd like to replace.
Since the data is uniformly distributed, I decided that I would like to fill the null values with random uniform samples drawn from a range of the column's min and max values. I used the following code to get the random uniform sample:
df_copy['ep'] = df_copy['ep'].fillna(value=np.random.uniform(3, 331))
Of course, using pd.DataFrame.fillna() replaces all existing NaNs with the same value. I would like each NaN to be a different value. I assume a for loop could get the job done, but I am unsure how to write such a loop to specifically handle these NaN values. Thanks for the help!
It looks like you are doing this on a Series (column), but the same implementation would work on a DataFrame:
Sample Data:
series = pd.Series(range(100))
series.loc[2] = np.nan
series.loc[10:15] = np.nan
Solution:
series = series.mask(series.isnull(), np.random.uniform(3, 331, size=series.shape))
Use boolean indexing with DataFrame.loc:
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
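A quick self-contained check of the boolean-indexing approach, with made-up sample data and a fixed seed for reproducibility:
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible illustration only
df_copy = pd.DataFrame({'ep': [10.0, np.nan, 250.0, np.nan]})

m = df_copy['ep'].isna()
# size=m.sum() draws one independent uniform sample per missing value
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
print(df_copy)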
