python having issues with large numbers used as IDs - python-3.x

I am putting together a script to analyze campaigns and report out. I'm building it in Python in order to make this easy the next time around. I'm running into issues with the IDs involved in my data: they are essentially really large numbers (no strings, no characters).
When pulling the data in from Excel I get floats like 7.000000e+16, when in reality the value is an integer like 70000000001034570. My problem is that I'm losing a ton of data: all kinds of unique IDs are getting collapsed into a couple of different floats. I realize this may be an issue with the read_csv function I use to pull these in, since all of this comes from .csv files. I am not sure what to do, as converting to string gives me the same result as the float, only as a string datatype, and converting to int gives me the literal value of the scientific notation (i.e. 70000000000000000). Is there a datatype I can store these as, or a method I can use, to preserve the data? I will have to merge on the IDs later with data pulled from a query, so ideally I would like to find a datatype that can preserve them. The few lines of code below run but return only a handful of rows because of the issue I described.
import pandas as pd

high_lvl_df = pd.read_csv(r"mycsv.csv")
# low_lvl_df comes from a separate query (not shown)
full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')

It may have to do with missing values.
Consider this CSV:
70000000001034570,2.
70000000001034571,3.
Then:
>>> pandas.read_csv('asdf.csv', header=None)
0 1
0 70000000001034570 2.0
1 70000000001034571 3.0
This gives you the expected result.
However, with:
70000000001034570,2.
,1.
70000000001034571,3.
You get:
>>> pandas.read_csv('asdf.csv', header=None)
0 1
0 7.000000e+16 2.0
1 NaN 1.0
2 7.000000e+16 3.0
That is because integers do not have NaN values, whereas floats do have that as a valid value. Pandas, therefore, infers the column type is float, not integer.
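You can reproduce the promotion with a tiny Series (illustration only; the values are made up):
>>> import pandas as pd
>>> pd.Series([1, 2]).dtype    # all integers, no missing values
dtype('int64')
>>> pd.Series([1, None]).dtype # a missing value promotes the column to float
dtype('float64')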
You can use pandas.read_csv()'s dtype parameter to force type string, for example:
>>> pandas.read_csv('asdf.csv', header=None, dtype={0: str})
0 1
0 70000000001034570 2.0
1 NaN 1.0
2 70000000001034571 3.0
According to Pandas' documentation:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
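Applied to the snippet from the question, a sketch (assuming 'fact' is the large-ID column, and that low_lvl_df's 'fact' column is likewise kept as strings) would be:
import pandas as pd

# keep the large IDs as strings so they are never routed through float
high_lvl_df = pd.read_csv(r"mycsv.csv", dtype={'fact': str})
low_lvl_df['fact'] = low_lvl_df['fact'].astype(str)  # make the merge keys the same dtype

full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')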

Related

Pandas Dataframe Integer Column Data Cleanup

I have a CSV file, which I read with pandas' read_csv function.
There is one column, which is supposed to have numbers only, but the data has some bad values.
Some rows (very few) have "alphanumeric" strings, a few rows are empty, while a few others have floating-point numbers. Also, for some reason, some numbers are also being read as strings.
I want to convert it in the following way:
Alphanumeric, None, empty (numpy.nan) should be converted to 0
Floating point should be typecasted to int
Integers should remain as they are
And obviously, numbers should be read as numbers only.
How should I proceed? I have no idea other than to read each row one by one and typecast it to int in a try-except block, assigning 0 if an exception is raised,
like:
def typecast_int(n):
    try:
        return int(n)
    except:
        return 0

for idx, row in df.iterrows():
    # write back via .at -- assigning to the row copy would not modify df
    df.at[idx, "number_column"] = typecast_int(row["number_column"])
But there are some issues with this approach. First, iterrows is bad performance-wise, my dataframe may have up to 700k to 1M records, and I have to process ~500 such CSV files. Second, it just doesn't feel right to do it this way.
I could do a tad better by using df.apply instead of iterrows, but that is not much different.
Given your 4 conditions, there's:
df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
.fillna(0)
.astype(int))
This first converts the column to numeric values only. If errors arise (e.g., due to alphanumeric strings), they get "coerce"d to NaN. Then we fill those NaNs with 0 and lastly cast everything to integers.
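As a quick check on made-up data covering the cases in the question:
import pandas as pd

df = pd.DataFrame({"number_column": ["12", "abc123", None, "", "7.5", 42]})

df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                    .fillna(0)
                    .astype(int))

print(df.number_column.tolist())   # [12, 0, 0, 0, 7, 42]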

pandas converts long decimal string into "-inf"

I have a CSV with a float value represented as a long decimal string (a -1 followed by 342 0's). Example below:
ID,AMOUNT
"id_1","-1.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
The issue is that, when reading into a pandas (0.25) DataFrame, the value automatically gets converted to a -inf:
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -inf
Name: AMOUNT, dtype: float64
If I change the value in the CSV to a "-1.0", it works fine as expected. Weirdly, there seems to be a sweet spot regarding how long the string can be. If I manually truncate the string to only 308 0's, it reads in the value correctly as a -1.0:
# when value is "-1.0" or "-1." followed by 308 0's
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -1.0
Name: AMOUNT, dtype: float64
The ideal solution would be to ensure that the value is truncated in the source itself before we process it. But in the meantime, what is the reason for this behavior, and/or is there a workaround for it?
We are currently using Python 3.6 and Pandas 0.25
One workaround is to read the column in as strings, then convert the values with Python's built-in float function, which parses the long decimal string correctly.
df = pd.read_csv("/path/to/file.csv", dtype=str)
df['AMOUNT'] = df['AMOUNT'].apply(float)
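Another possible workaround (hedged: this relies on read_csv's float_precision option, which switches from the default fast C float parser to Python's round-trip parser and should avoid the overflow):
df = pd.read_csv("/path/to/file.csv", float_precision="round_trip")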

What is the difference between Null Nan and None in python [duplicate]

I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally there are cases where a cell is empty. In my opinion, the value read into that dictionary entry should be None, but instead nan is assigned. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.
Is my understanding correct? What IS the difference between None and nan, and why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
for k, v in my_dict.items():
    if np.isnan(v):
        ...
But this gives me an error saying that I cannot use this check for v. I guess it is because an integer or float variable, not a string, is meant to be used. If this is true, how can I check v for an "empty cell"/nan case?
NaN is used as a placeholder for missing data consistently in pandas; consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note: the "gotcha" that integer Series containing missing data are upcast to floats.
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.
import numpy as np
import pandas as pd

# without forcing dtype=object, the None would be changed to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times fast: object==bad, float==good
That said, many operations may still work just as well with None vs NaN (but perhaps are not officially supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
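Unlike np.isnan, these also work on object data (strings, None), which is why the isnan check in the question blows up on non-numeric values. A small illustration:
import numpy as np
import pandas as pd

pd.isnull(np.nan)     # True
pd.isnull(None)       # True
pd.isnull("ABC123")   # False
np.isnan("ABC123")    # raises TypeError -- np.isnan does not accept strings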
NaN can be used as a numerical value on mathematical operations, while None cannot (or at least shouldn't).
NaN is a numeric value, as defined in IEEE 754 floating-point standard.
None is an internal Python type (NoneType) and would be more like "inexistent" or "empty" than "numerically invalid" in this context.
The main "symptom" of that is that, if you perform, say, an average or a sum on an array containing NaN, even a single one, you get NaN as a result...
On the other hand, you cannot perform mathematical operations using None as an operand.
So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or inexistent values on computations. That would mean the algorithm should test each value to see if it is None.
Numpy has some functions to avoid NaN values to contaminate your results, such as nansum and nan_to_num for example.
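For example:
import numpy as np

a = np.array([1.0, np.nan, 3.0])
np.sum(a)          # nan -- a single NaN contaminates the result
np.nansum(a)       # 4.0 -- NaN entries are ignored
np.nan_to_num(a)   # array([1., 0., 3.]) -- NaN replaced with 0.0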
The function isnan() checks whether something is "Not A Number"; it returns True for NaN and False for an actual number, so for example isnan(2) returns False.
The conditional myVar is not None tells you whether the variable refers to something other than None, not whether it holds a valid number.
Your numpy array uses isnan() because it is intended to be an array of numbers; elements that are initialized to NaN are considered "empty".
Below are the differences:
nan belongs to the class float
None belongs to the class NoneType
I found the below article very helpful:
https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31

Cannot convert object as strings to int - Unable to parse string

I have a data frame with one column denoting a range of ages. The data type of the Age column is shown as string. I am trying to convert the string values to numeric for the model to interpret the features.
I tried the following to convert to 'int'.
df.Age = pd.to_numeric(df.Age)
I get the following error:
ValueError: Unable to parse string "0-17" at position 0
I also tried using the errors='coerce' parameter, but it gave me a different error:
df.Age = pd.to_numeric(df.Age, errors='coerce').astype(int)
Error:
ValueError: Cannot convert non-finite values (NA or inf) to integer
But there are no NA values in any column in my df
Age seems to be a categorical variable, so you should treat it as such. pandas has a neat category dtype which converts your labels to integers under the hood:
df['Age'] = df['Age'].astype('category')
Then you can access the underlying integers using the cat accessor method
codes = df['Age'].cat.codes # This returns integers
Also you probably want to make Age an ordered categorical variable, for which you can also find a neat recipe in the docs.
from pandas.api.types import CategoricalDtype
age_category = CategoricalDtype([...your labels in order...], ordered=True)
df['Age'] = df['Age'].astype(age_category)
Then you can access the underlying codes in the same way and be sure that they will reflect the order you entered for your labels.
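For instance, with some made-up age brackets (only "0-17" appears in the question; the other labels are hypothetical, for illustration):
from pandas.api.types import CategoricalDtype
import pandas as pd

age_category = CategoricalDtype(["0-17", "18-25", "26-35", "36+"], ordered=True)

df = pd.DataFrame({"Age": ["18-25", "0-17", "36+"]})
df["Age"] = df["Age"].astype(age_category)
df["Age"].cat.codes.tolist()   # [1, 0, 3] -- codes follow the declared order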
At first glance, I would say it is because you are attempting to convert a string that does not contain only an integer. Your string is "0-17", which is not an integer. If it had been "17" or "0", the conversion would have worked.
val = int("0")
val = int("17")
I have no idea what your to_numeric method is, so I am not sure if I am answering your question.
Why don't you split?
a = df["age"].str.split("-", n=2, expand=True)
df['age_from'] = a[0]
df['age_to'] = a[1]
Here is what I got at the end!
date age
0 2018-04-15 12-20
1 2018-04-15 2-30
2 2018-04-18 5-46+
date age age_from age_to
0 2018-04-15 12-20 12 20
1 2018-04-15 2-30 2 30
2 2018-04-18 5-46+ 5 46+
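If you then need real numbers rather than strings, a possible follow-up (a sketch; it assumes the trailing "+" can simply be dropped) is:
df['age_from'] = pd.to_numeric(df['age_from'])
df['age_to'] = pd.to_numeric(df['age_to'].str.rstrip('+'))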

How to change stringified numbers in data frame into pure numeric values in R

I have the following data.frame:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
# Note that the salary below is in stringified format.
# In reality there are more such stringified numerical columns.
salary <- as.character(c(21000, 23400, 26800))
df <- data.frame(employee,salary)
The output is:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
$ salary : Factor w/ 3 levels "21000","23400",..: 1 2 3
What I want to do is convert the values from string into pure numbers straight from the df variable, while at the same time preserving the string name for employee.
I tried this but it won't work:
as.numeric(df)
At the end of the day I'd like to perform arithmetic on these numeric values from df, such as df2 <- log2(df), etc.
Ok, there's a couple of things going on here:
R has two different datatypes that look like strings: factor and character
You can't modify most R objects in place, you have to change them by assignment
The actual fix for your example is:
df$salary = as.numeric(as.character(df$salary))
If you try to call as.numeric on df$salary without converting it to character first, you'd get a somewhat strange result:
> as.numeric(df$salary)
[1] 1 2 3
When R creates a factor, it turns the unique elements of the vector into levels, and then represents those levels using integers, which is what you see when you try to convert to numeric.
