pandas converts long decimal string into "-inf" - python-3.x

I have a CSV with a float value represented as a long decimal string ("-1." followed by 342 zeros). Example below:
ID,AMOUNT
"id_1","-1.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
The issue is that, when reading into a pandas (0.25) DataFrame, the value automatically gets converted to a -inf:
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -inf
Name: AMOUNT, dtype: float64
If I change the value in the CSV to a "-1.0", it works fine as expected. Weirdly, there seems to be a sweet spot regarding how long the string can be. If I manually truncate the string to only 308 0's, it reads in the value correctly as a -1.0:
# when value is "-1.0" or "-1." followed by 308 0's
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -1.0
Name: AMOUNT, dtype: float64
The ideal solution would be to ensure the value is truncated at the source before we process it, but in the meantime: what is the reason for this behavior, and is there a workaround for it?
We are currently using Python 3.6 and pandas 0.25.

One workaround might be to read in the column as strings, then convert it with the built-in float function, which parses the full string correctly.
df = pd.read_csv("/path/to/file.csv", dtype={'AMOUNT': str})
df['AMOUNT'] = df['AMOUNT'].astype(float)
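As for the reason: the default C parser in read_csv uses a fast but lossy float tokenizer, and 1e308 is roughly the largest exponent a float64 can hold, so the extra digits most likely overflow an intermediate value during parsing and spill over to -inf. If that is the case, another workaround (a sketch, untested against your exact file) is read_csv's float_precision option, which switches to a slower round-trip parser:
import pandas as pd

# 'round_trip' uses Python's own float parsing instead of the fast
# C tokenizer, so the long digit string should come out as -1.0
df = pd.read_csv('/path/to/file.csv', float_precision='round_trip')
print(df['AMOUNT'])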

Related

What is the difference between Null, NaN, and None in Python? [duplicate]

I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally there are cases where a cell is empty. In my opinion, the value read into that dictionary entry should be None, but instead nan is assigned. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.
Is my understanding correct, what IS the difference between None and nan? Why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
for k, v in my_dict.items():
    if np.isnan(v):
        ...
But this gives me an error saying that I cannot use this check for v. I guess that is because it expects an integer or float, not a string. If that is true, how can I check v for an "empty cell"/nan case?
NaN is used as a placeholder for missing data consistently in pandas; consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note: the "gotcha" that integer Series containing missing data are upcast to floats.
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.
import numpy as np
import pandas as pd

# without dtype=object, pandas converts the None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times fast: object==bad, float==good
That said, many operations still work just as well with None as with NaN (though they are perhaps not officially supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
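For example (a minimal sketch), pd.isnull treats both NaN and None as missing, and unlike np.isnan it also accepts strings, which is exactly the TypeError from the question:
import numpy as np
import pandas as pd

print(pd.isnull(np.nan))  # True
print(pd.isnull(None))    # True
print(pd.isnull('abc'))   # False; np.isnan('abc') would raise TypeError

# So the dictionary check from the question becomes:
# for k, v in my_dict.items():
#     if pd.isnull(v):
#         ...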
NaN can be used as a numerical value on mathematical operations, while None cannot (or at least shouldn't).
NaN is a numeric value, as defined in IEEE 754 floating-point standard.
None is an internal Python type (NoneType) and would be more like "inexistent" or "empty" than "numerically invalid" in this context.
The main "symptom" of that is that, if you perform, say, an average or a sum on an array containing NaN, even a single one, you get NaN as a result...
On the other hand, you cannot perform mathematical operations using None as an operand.
So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or inexistent values on computations. That would mean the algorithm should test each value to see if it is None.
NumPy has functions that keep NaN values from contaminating your results, such as nansum and nan_to_num for example.
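A quick illustration of those helpers:
import numpy as np

a = np.array([1.0, np.nan, 3.0])
print(np.sum(a))         # nan -- a single NaN contaminates the total
print(np.nansum(a))      # 4.0 -- NaN values are skipped
print(np.nan_to_num(a))  # [1. 0. 3.] -- NaN replaced with 0.0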
The function isnan() checks whether something is "Not A Number"; it returns True only for NaN values, so for example isnan(2) would return False.
The conditional myVar is not None checks whether the variable holds the value None, not whether it is defined.
Your numpy array uses isnan() because it is intended to be an array of numbers, and its elements are initialized to NaN; those elements are considered "empty".
Below are the differences:
nan belongs to the class float
None belongs to the class NoneType
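You can check both in the interpreter:
>>> type(float('nan'))
<class 'float'>
>>> type(None)
<class 'NoneType'>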
I found the below article very helpful:
https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31

How to add trailing zeros to a pandas dataframe column?

I have a pandas dataframe column that I would like to be formatted to two decimal places.
Such that:
10.1
Appears as:
10.10
How can I do that? I have already tried rounding.
This can be accomplished by mapping a string format object to the column of floats:
df.colName.map('{:.2f}'.format)
(Credit to exp1orer)
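Note that map returns a Series of strings, so assign the result back if you want to keep it (a small sketch, with colName standing in for your column):
import pandas as pd

df = pd.DataFrame({'colName': [10.1, 3.14159]})
df['colName'] = df.colName.map('{:.2f}'.format)
print(df)  # shows 10.10 and 3.14, now as strings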
You can use:
pd.options.display.float_format = '{:,.2f}'.format
Note that this only affects display: every float in your dataframes will be shown with two decimals, while the underlying values are unchanged.
To go back to normal:
pd.reset_option('display.float_format')
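If you only need the formatting temporarily, option_context scopes the setting to a with-block (a small sketch):
import pandas as pd

df = pd.DataFrame({'x': [10.1, 2.5]})
with pd.option_context('display.float_format', '{:,.2f}'.format):
    print(df)  # displays 10.10 and 2.50
print(df)      # default formatting again outside the block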
From pyformat.info
Padding numbers
For floating points the padding value represents the length of the complete output. In the example below we want our output to have at least 6 characters with 2 after the decimal point.
'{:06.2f}'.format(3.141592653589793)
The :06 is the length of your output regardless of how many digits are in your input. The .2 indicates you want 2 places after the decimal point. The f indicates you want a float output.
Output
003.14
If you are using Python 3.6 or later, you can use f-strings. Check out this other answer: https://stackoverflow.com/a/45310389/12229158
>>> a = 10.1234
>>> f'{a:.2f}'
'10.12'

Float is not converting to integer pandas

I used this code to convert my float numbers into integers, but it does not work. Here are all the steps I have gone through so far:
Step 1: I converted timestamp1 and timestamp2 to datetime in order to subtract them and get the days:
a=pd.to_datetime(df['timestamp1'], format='%Y-%m-%dT%H:%M:%SZ')
b=pd.to_datetime(df['timestamp2'], format='%Y-%m-%dT%H:%M:%SZ')
df['delta'] = (b-a).dt.days
Step 2: Converted the strings into integers as the day:
df['delta'] = pd.to_datetime(df['delta'], format='%Y-%m-%d', errors='coerce')
df['delta'] = df['delta'].dt.day
Step 3: I am trying to convert floats into integers.
categorical_feature_mask = df.dtypes==object
categorical_cols = df.columns[categorical_feature_mask].tolist()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))
df[categorical_cols].head(10)
However, it throws an error TypeError: ('argument must be a string or number', 'occurred at index col1')
To convert a float column that contains NaN values to an integer, there are two things you can do:
Convert to a plain int and replace the NaN values with an arbitrary value, like this:
df[col].fillna(0).astype("int32")
If you want to conserve NaN values use this:
df[col].astype("Int32")
Note the difference with the capital "I". For further information on this implementation made by Pandas just look at this: Nullable integer data type.
Why do you need to do that? Because by default, when your column has at least one NaN value, pandas considers the whole column to be float, since that is how NumPy behaves.
The same thing happens with strings: if you have at least one string value in your column, the whole column is labeled as object by pandas, which is why your first attempt failed.
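You can see both behaviors in a few lines (a minimal sketch; the nullable dtype needs pandas 0.24 or later):
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(s.dtype)                            # float64: one NaN upcasts everything
print(s.fillna(0).astype('int32').dtype)  # int32: NaN replaced by 0
print(s.astype('Int32').dtype)            # Int32: nullable, NaN kept as <NA>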
You can convert columns from float to int using this; errors='ignore' leaves the data unchanged if the conversion fails:
df[column_name] = df[column_name].astype("Int64", errors="ignore")

Cannot convert object as strings to int - Unable to parse string

I have a data frame with one column denoting a range of ages. The data type of the Age column is shown as string. I am trying to convert the string values to numeric so that the model can interpret the features.
I tried the following to convert to 'int'.
df.Age = pd.to_numeric(df.Age)
I get the following error:
ValueError: Unable to parse string "0-17" at position 0
I also tried using the errors='coerce' parameter, but it gave me a different error:
df.Age = pd.to_numeric(df.Age, errors='coerce').astype(int)
Error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
But there are no NA values in any column in my df
Age seems to be a categorical variable, so you should treat it as such. pandas has a neat category dtype which converts your labels to integers under the hood:
df['Age'] = df['Age'].astype('category')
Then you can access the underlying integers using the cat accessor method
codes = df['Age'].cat.codes # This returns integers
Also you probably want to make Age an ordered categorical variable, for which you can also find a neat recipe in the docs.
from pandas.api.types import CategoricalDtype
age_category = CategoricalDtype([...your labels in order...], ordered=True)
df['Age'] = df['Age'].astype(age_category)
Then you can access the underlying codes in the same way and be sure that they will reflect the order you entered for your labels.
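Putting it together with made-up age bins (the labels here are hypothetical; substitute your own):
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'Age': ['18-25', '0-17', '26-35']})
age_category = CategoricalDtype(['0-17', '18-25', '26-35'], ordered=True)
df['Age'] = df['Age'].astype(age_category)
print(df['Age'].cat.codes.tolist())  # [1, 0, 2], following the label order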
At first glance, I would say it is because you are attempting to convert a string that does not contain only an integer. Your string is "0-17", which is not an integer. If it had been "17" or "0", the conversion would have worked.
val = int("0")
val = int("17")
I have no idea what your to_numeric method is, so I am not sure if I am answering your question.
Why don't you split?
a = df["age"].str.split("-", n=2, expand=True)
df['age_from'] = a[0]
df['age_to'] = a[1]
Here is what I got at the end!
         date    age
0  2018-04-15  12-20
1  2018-04-15   2-30
2  2018-04-18  5-46+

         date    age age_from age_to
0  2018-04-15  12-20       12     20
1  2018-04-15   2-30        2     30
2  2018-04-18  5-46+        5    46+
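If you then need the bounds as numbers, you could strip the trailing "+" before converting (a sketch continuing from the snippet above):
df['age_from'] = pd.to_numeric(df['age_from'])
df['age_to'] = pd.to_numeric(df['age_to'].str.rstrip('+'))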

python having issues with large numbers used as IDs

I am putting together a script to analyze campaigns and report out. I'm building it in Python in order to make this easy the next time around. I'm running into issues with the IDs involved in my data; they are essentially really large numbers (no strings, no characters).
When pulling the data in from Excel I get floats like 7.000000e+16 when in reality it is an integer like 70000000001034570. My problem is that I'm losing a ton of data: all kinds of unique IDs are getting collapsed into a couple of distinct floats. I realize this may be an issue with the read_csv function I use to pull these in, since all of this comes from .csv files. I am not sure what to do, as converting to string gives me the same result as the float, only as a string datatype, and converting to int gives me the literal value of the scientific notation (i.e. 70000000000000000). Is there a datatype I can store these as, or a method I can use to preserve the data? I will have to merge on the IDs later with data pulled from a query, so ideally I would like to find a datatype that can preserve them. The few lines of code below run, but return only a handful of rows because of the issue I described.
high_lvl_df = pd.read_csv(r"mycsv.csv")
full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')
It may have to do with missing values.
Consider this CSV:
70000000001034570,2.
70000000001034571,3.
Then:
>>> pandas.read_csv('asdf.csv', header=None)
                   0    1
0  70000000001034570  2.0
1  70000000001034571  3.0
Gives you the expected result.
However, with:
70000000001034570,2.
,1.
70000000001034571,3.
You get:
>>> pandas.read_csv('asdf.csv', header=None)
              0    1
0  7.000000e+16  2.0
1           NaN  1.0
2  7.000000e+16  3.0
That is because integers do not have NaN values, whereas floats do have that as a valid value. Pandas, therefore, infers the column type is float, not integer.
You can use pandas.read_csv()'s dtype parameter to force type string, for example:
pandas.read_csv('asdf.csv', header=None, dtype={0: str})
                   0    1
0  70000000001034570  2.0
1                NaN  1.0
2  70000000001034571  3.0
According to Pandas' documentation:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
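Since you need to merge on these IDs later, a nullable integer dtype may be an alternative to plain strings; recent pandas versions can parse straight into it (a sketch, assuming a version whose read_csv supports extension dtypes):
import pandas as pd

# 'Int64' (capital I) keeps the IDs as exact integers and turns the
# blank cell into <NA> instead of forcing the column to float
df = pd.read_csv('asdf.csv', header=None, dtype={0: 'Int64'})
print(df[0].tolist())  # [70000000001034570, <NA>, 70000000001034571]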
