Cannot convert object as strings to int - Unable to parse string - python-3.x

I have a data frame with one column denoting ranges of ages. The data type of the Age column is shown as string. I am trying to convert the string values to numeric so the model can interpret the features.
I tried the following to convert to 'int'.
df.Age = pd.to_numeric(df.Age)
I get the following error:
ValueError: Unable to parse string "0-17" at position 0
I also tried using the errors='coerce' parameter, but it gave me a different error:
df.Age = pd.to_numeric(df.Age, errors='coerce').astype(int)
Error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
But there are no NA values in any column of my df.

Age seems to be a categorical variable, so you should treat it as such. (The second error is expected, by the way: errors='coerce' turns every unparseable string like "0-17" into NaN, which the subsequent astype(int) cannot handle.) pandas has a neat category dtype which converts your labels to integers under the hood:
df['Age'] = df['Age'].astype('category')
Then you can access the underlying integers using the cat accessor:
codes = df['Age'].cat.codes # This returns integers
Also you probably want to make Age an ordered categorical variable, for which you can also find a neat recipe in the docs.
from pandas.api.types import CategoricalDtype
age_category = CategoricalDtype([...your labels in order...], ordered=True)
df['Age'] = df['Age'].astype(age_category)
Then you can access the underlying codes in the same way and be sure that they will reflect the order you entered for your labels.
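For illustration, here is a minimal end-to-end sketch; the age buckets are hypothetical and must be replaced with the labels that actually occur in your data:
import pandas as pd
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({'Age': ['0-17', '18-25', '26-35', '18-25']})  # made-up sample
age_category = CategoricalDtype(['0-17', '18-25', '26-35'], ordered=True)
df['Age'] = df['Age'].astype(age_category)
codes = df['Age'].cat.codes  # 0, 1, 2, 1 -- integers respecting the label order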

At first glance, I would say it is because you are attempting to convert a string that does not contain only an integer. Your string is "0-17", which is not an integer. If it had been "17" or "0", the conversion would have worked.
val = int("0")
val = int("17")
I am not familiar with your to_numeric method, so I am not sure if I am answering your question.

Why don't you split the column?
a = df["age"].str.split("-", n=1, expand=True)  # one split, two columns
df['age_from'] = a[0]
df['age_to'] = a[1]
Here is what I got at the end! Before the split:
         date    age
0  2018-04-15  12-20
1  2018-04-15   2-30
2  2018-04-18  5-46+
and after:
         date    age age_from age_to
0  2018-04-15  12-20       12     20
1  2018-04-15   2-30        2     30
2  2018-04-18  5-46+        5    46+
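Note that age_from and age_to are still strings, and the last row carries a trailing "+". A minimal numeric conversion sketch (assuming pandas is imported as pd and the only non-digit character is that trailing "+"):
df['age_from'] = pd.to_numeric(df['age_from'])
df['age_to'] = pd.to_numeric(df['age_to'].str.rstrip('+'))  # "46+" -> 46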

Related

How to change dtype & apply mathematical calculations in np.where?

I have a dataframe like this:
df = pd.DataFrame()
df['yy'] = [2012,2011,2010]
df['mm'] = ['10','','8']
     yy  mm
0  2012  10
1  2011
2  2010   8
I want to multiply the values in column 'mm' by 2. However, all the values in the column are strings.
I tried it with np.where as follows:
df['X'] = np.where(df['mm']!='',df['mm'].astype(int) * 2,'')
However, it's not working and gives the following error:
ValueError: invalid literal for int() with base 10: ''.
It's clear from the error that the condition in the where doesn't act as a filter here: np.where evaluates both branches before selecting, so df['mm'].astype(int) is applied to all values and fails on the empty string ''.
Can anyone please suggest another way to achieve this? I don't want to use a for loop, as my actual df is too big and a for loop would take a lot of time.
Thanks in advance.
It's better if you replace the empty string with NaN first:
df['mm'] = df.mm.replace({'': np.nan}).fillna(0).astype(int) * 2
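If you'd rather keep the missing entry missing instead of mapping it to 0, a variation (a sketch using pd.to_numeric and the nullable Int64 dtype) is:
import pandas as pd
df = pd.DataFrame({'yy': [2012, 2011, 2010], 'mm': ['10', '', '8']})
# errors='coerce' turns '' into NaN; nullable Int64 then keeps it as <NA>
df['X'] = (pd.to_numeric(df['mm'], errors='coerce') * 2).astype('Int64')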

pandas converts long decimal string into "-inf"

I have a CSV with a float value represented as a long decimal string (a "-1." followed by 342 zeros). Example below:
ID,AMOUNT
"id_1","-1.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
The issue is that, when reading into a pandas (0.25) DataFrame, the value automatically gets converted to a -inf:
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -inf
Name: AMOUNT, dtype: float64
If I change the value in the CSV to a "-1.0", it works fine as expected. Weirdly, there seems to be a sweet spot regarding how long the string can be: if I manually truncate the string to only 308 zeros, it reads the value in correctly as -1.0:
# when value is "-1.0" or "-1." followed by 308 0's
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -1.0
Name: AMOUNT, dtype: float64
The ideal solution would be to ensure the value is truncated at the source before we process it, but in the meantime: what is the reason for this behavior, and/or is there a workaround for it?
We are currently using Python 3.6 and Pandas 0.25
One workaround might be to read the AMOUNT column in as strings, then convert with the built-in float function, which parses the long string correctly (pandas 0.25 predates the dedicated "string" dtype, so plain str is used here):
df = pd.read_csv("/path/to/file.csv", dtype={"AMOUNT": str})
df['AMOUNT'] = df['AMOUNT'].apply(float)
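As for the reason: the 308-zero sweet spot lines up with the largest exponent a double can represent (roughly 1.8e308), which suggests the default C parser's fast float conversion is what mangles the longer string. Assuming that diagnosis, another workaround may be read_csv's slower but exact float parser:
df = pd.read_csv('/path/to/file.csv', float_precision='round_trip')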

Pandas and datetime coercion. Can't convert whole column to Timestamp

So, I have an issue. Pandas keeps telling me that
'datetime.date' is coerced to a datetime. In the future pandas will not coerce, and a TypeError will be raised. To retain the current behavior, convert the 'datetime.date' to a datetime with 'pd.Timestamp'.
I'd like to get rid of this warning
So, until now I had a dataframe with some data that I was filtering and manipulating. At some point I have a column with dates in string format. I don't care about timezones etc.; it's all about day accuracy. I get the warning mentioned above when I convert the strings to datetime, like below:
df['Some_date'] = pd.to_datetime(df['Some_date'], format='%m/%d/%Y')
So I tried to do something like that:
df['Some_date'] = pd.Timestamp(df['Some_date'])
But it fails, as pd.Timestamp doesn't accept a Series as an argument.
I'm looking for a quickest way to convert those strings to Timestamp.
=====================================
EDIT
I'm sorry for the confusion; I'm getting the warning at another place. It happens when I try to filter my data like this:
df = df[(df['Some_date'] > firstday)]
where firstday is calculated based on datetime, like here:
import datetime
def get_dates_filter():
    lastday = datetime.date.today().replace(day=1) - datetime.timedelta(days=1)
    firstday = lastday.replace(day=1)
    return firstday, lastday
So probably the issue is comparing two different types of date representations.
In pandas, Python dates are still poorly supported; it is best to work with datetimes that have no time component.
If there are Python dates, you can convert them to strings before to_datetime:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
If you need to remove the times from datetimes in a column, use:
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str)).dt.floor('d')
Test:
rng = pd.date_range('2017-04-03', periods=3).date
df = pd.DataFrame({'Some_date': rng})
print (df)
   Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'datetime.date'>
df['Some_date'] = pd.to_datetime(df['Some_date'].astype(str))
print (df)
   Some_date
0 2017-04-03
1 2017-04-04
2 2017-04-05
print (type(df.loc[0, 'Some_date']))
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
print (df['Some_date'].dtype)
datetime64[ns]
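Given the asker's edit, the warning actually comes from comparing a datetime64 column with a plain datetime.date in the filter. A minimal sketch of a fix, reusing get_dates_filter from the question: wrap the date in pd.Timestamp before comparing.
firstday, lastday = get_dates_filter()
df = df[df['Some_date'] > pd.Timestamp(firstday)]  # Timestamp compares cleanly with datetime64[ns]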

Float is not converting to integer pandas

I used this code to convert my float numbers into integers; however, it does not work. Here are all the steps I have gone through so far:
Step 1: I converted timestamp1 and timestamp2 to datetime in order to subtract them and get the days:
a=pd.to_datetime(df['timestamp1'], format='%Y-%m-%dT%H:%M:%SZ')
b=pd.to_datetime(df['timestamp2'], format='%Y-%m-%dT%H:%M:%SZ')
df['delta'] = (b-a).dt.days
Step 2: Converted the strings into integers as the day:
df['delta'] = pd.to_datetime(df['delta'], format='%Y-%m-%d', errors='coerce')
df['delta'] = df['delta'].dt.day
Step 3: I am trying to convert floats into integers.
categorical_feature_mask = df.dtypes==object
categorical_cols = df.columns[categorical_feature_mask].tolist()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)
However, it throws an error TypeError: ('argument must be a string or number', 'occurred at index col1')
To convert a float column containing NaN values to an integer, there are two things you can do:
Convert to a native int and change the NaN values to an arbitrary value, like this:
df[col].fillna(0).astype("int32")
If you want to preserve NaN values, use this:
df[col].astype("Int32")
Note the difference with the capital "I". For further information on this pandas implementation, look at the docs: Nullable integer data type.
Why do you need to do that? Because by default pandas considers that when your column has at least one NaN value, the column is a float; this is how numpy behaves.
The same thing happens with strings: if you have at least one string value in the column, the whole column is labeled as object by pandas, which is why your first attempt failed.
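A quick illustration of the two options with a toy Series (a sketch; the values are made up):
import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, np.nan])
print(s.fillna(0).astype("int32"))  # NaN becomes 0, plain numpy int32
print(s.astype("Int32"))            # NaN preserved as <NA>, nullable Int32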
You can convert columns from float to int using this; use errors='ignore' if the data contains null values:
df[column_name] = df[column_name].astype("Int64", errors="ignore")

python having issues with large numbers used as IDs

I am putting together a script to analyze campaigns and report out. I'm building it in Python in order to make this easy the next time around. I'm running into issues with the IDs involved in my data; they are essentially really large numbers (no strings, no characters).
When pulling the data in from Excel I get floats like 7.000000e+16 when in reality the value is an integer like 70000000001034570. My problem is that I'm losing a ton of data: all kinds of unique IDs are getting collapsed into a couple of different floats. I realize this may be an issue with the read_csv function I use to pull these in, as this all comes from .csv files. I am not sure what to do, since converting to string gives me the same results as the float (only as a string datatype), and converting to int gives me the literal result of the scientific notation (i.e. 70000000000000000). Is there a datatype I can store these as, or a method I can use for preserving the data? I will have to merge on the IDs later with data pulled from a query, so ideally I would like a datatype that preserves them. The few lines of code below run but return only a handful of rows because of the issue I described.
high_lvl_df = pd.read_csv(r"mycsv.csv")
full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')
It may have to do with missing values.
Consider this CSV:
70000000001034570,2.
70000000001034571,3.
Then:
>>> pandas.read_csv('asdf.csv', header=None)
                   0    1
0  70000000001034570  2.0
1  70000000001034571  3.0
Gives you the expected result.
However, with:
70000000001034570,2.
,1.
70000000001034571,3.
You get:
>>> pandas.read_csv('asdf.csv', header=None)
              0    1
0  7.000000e+16  2.0
1           NaN  1.0
2  7.000000e+16  3.0
That is because integers do not have NaN values, whereas floats do have that as a valid value. Pandas, therefore, infers the column type is float, not integer.
You can use pandas.read_csv()'s dtype parameter to force type string, for example:
>>> pandas.read_csv('asdf.csv', header=None, dtype={0: str})
                   0    1
0  70000000001034570  2.0
1                NaN  1.0
2  70000000001034571  3.0
According to Pandas' documentation:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
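Since the quoted docs themselves mention 'Int64', another option is pandas' nullable integer dtype, which keeps full integer precision while still allowing missing rows (a sketch against the header-less example above, where the ID column is column 0):
pandas.read_csv('asdf.csv', header=None, dtype={0: 'Int64'})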
