numpy.genfromtxt() can't read header - python-3.x

I am trying to use numpy.genfromtxt() to read a .csv file, but I can't make it read the header correctly.
By default the function doesn't skip the header, and since the values in each column are numbers, it seems to set the type of each entire column to float; the header row is then detected as missing values and comes back as NaN.
Here is my code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',')
print(dataset[0:5])
Here are the first 7 rows of my .csv:
patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,48,12.1,5.6,87.1,4.3
1,48,5.6,25.9,72.7,0.4
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
3,59,53.2,1.07,47.8,49.6
And here is the resulting array:
[[ nan nan nan nan nan nan]
[ 1. 48. 35.8 3.4 92.5 3.7]
[ 1. 48. 14.5 12.6 78.3 1.2]
[ 1. 48. 12.1 5.6 87.1 4.3]
[ 1. 48. 5.6 25.9 72.7 0.4]]
Process finished with exit code 0
I tried setting the encoding to 'UTF-8-sig' and playing around with the parameters, but to no avail. I also tried numpy.loadtxt(), but it doesn't work for me since there are missing values within the dataset.
The only solution that worked for me is to read the first row in a separate array and then concatenate them.
Is there a more elegant solution to reading the header as strings while preserving the float nature of the values? I am probably missing something trivial here.
Preferably using numpy or other package – I am not fond of creating for loops everywhere, aka reinventing the wheel while standing at the car park.
Thank you for any and all input.

That is feasible with numpy, or even with the standard lib (csv), but I would suggest looking at the pandas package (whose whole point is the handling of CSV-like data).
import pandas as pd
file_to_read = r'path/to/your/csv'
res = pd.read_csv(file_to_read)
print(res)
The "N/A" will get out as NaN (for more options, see parameters na_values and keep_default_na in the doc for pandas.read_csv).

A solution by commenter hpaulj did the job for me:
Using names=True and dtype=None (and possibly encoding=None) should produce a structured array. Look at its shape and dtype. Or use the skip_header parameter and accept floats.
Also, for anyone starting with numpy who, like me, hasn't read the full documentation:
the column names are not stored in the array itself, but in its .dtype.names. Because I didn't look there, I didn't realize that the code with names=True was already working.
The working code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',',
                           encoding='UTF-8-sig',
                           dtype=None,
                           names=True)
print(dataset[0:7])
print(dataset.dtype.names)
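For completeness, a short sketch of how those column names can then be used (the field names come from the CSV header above; this continues from the dataset loaded by the code just shown):
# dataset.dtype.names is ('patient_nr', 'Age', 'Native_CD45', 'LYM', 'Macr', 'NEU');
# fields of the structured array are accessed by name
ages = dataset['Age']
print(ages[:5])                    # the Age column on its own
print(dataset['patient_nr'][:5])   # likewise for patient_nr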

Related

pandas converts long decimal string into "-inf"

I have a CSV with a float value represented as a long decimal string (a -1 followed by 342 0's). Example below:
ID,AMOUNT
"id_1","-1.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
The issue is that, when reading into a pandas (0.25) DataFrame, the value automatically gets converted to a -inf:
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -inf
Name: AMOUNT, dtype: float64
If I change the value in the CSV to a "-1.0", it works fine as expected. Weirdly, there seems to be a sweet spot regarding how long the string can be. If I manually truncate the string to only 308 0's, it reads in the value correctly as a -1.0:
# when value is "-1.0" or "-1." followed by 308 0's
>>> pd.read_csv('/path/to/file.csv')['AMOUNT']
0 -1.0
Name: AMOUNT, dtype: float64
The ideal solution would be to ensure the value is truncated at the source before we process it, but in the meantime: what is the reason for this behavior, and/or is there a workaround for it?
We are currently using Python 3.6 and pandas 0.25.
One workaround might be to read the column in as strings and then convert it with the built-in float function, which parses the long decimal string correctly.
df = pd.read_csv("/path/to/file.csv", dtype=str)  # read everything as text first
df['AMOUNT'] = df['AMOUNT'].apply(float)          # Python's float() handles the long string
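A possible alternative worth a try (I have not verified this against pandas 0.25 specifically) is read_csv's float_precision option. The problem appears to be that the fast default float converter in the C parser overflows on the very long string, while Python's own float parser, and the 'round_trip' converter that uses it, do not:
import pandas as pd

# Python's own parser copes with the long string, which is why converting
# the text column with float() works:
print(float("-1." + "0" * 342))   # -1.0

# Ask read_csv to use the slower round-trip converter for the whole file
df = pd.read_csv('/path/to/file.csv', float_precision='round_trip')
print(df['AMOUNT'])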

What is the difference between Null Nan and None in python [duplicate]

I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally a cell is empty. In my opinion, the value read into that dictionary entry should be None, but instead nan is assigned. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.
Is my understanding correct, what IS the difference between None and nan? Why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
for k, v in my_dict.iteritems():
    if np.isnan(v):
But this gives me an error saying that I cannot use this check for v. I guess that is because the check is meant for an integer or float variable, not a string. If this is true, how can I check v for an "empty cell"/nan case?
NaN is used as a placeholder for missing data consistently in pandas; consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note: the "gotcha" that integer Series containing missing data are upcast to floats.
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.
import numpy as np
import pandas as pd

# without forcing dtype it changes None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times fast: object==bad, float==good
That said, many operations may still work just as well with None as with NaN (though they are perhaps not officially supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
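A minimal sketch of that check applied to a dictionary like the one described in the question (the keys and values here are made up):
import pandas as pd

my_dict = {'a': '12X', 'b': float('nan'), 'c': None}

# pd.isnull treats both NaN and None as missing, and is safe on strings
for k, v in my_dict.items():
    if pd.isnull(v):
        print(k, 'is empty')
# b is empty
# c is empty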
NaN can be used as a numerical value in mathematical operations, while None cannot (or at least shouldn't).
NaN is a numeric value, as defined in the IEEE 754 floating-point standard.
None is an internal Python type (NoneType) and would be more like "inexistent" or "empty" than "numerically invalid" in this context.
The main "symptom" of that is that, if you perform, say, an average or a sum on an array containing even a single NaN, you get NaN as a result...
On the other hand, you cannot perform mathematical operations using None as an operand.
So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or inexistent values on computations. That would mean the algorithm should test each value to see if it is None.
Numpy has some functions to keep NaN values from contaminating your results, such as nansum and nan_to_num for example.
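For example (a small illustration, not taken from the original answer):
import numpy as np

a = np.array([1.0, np.nan, 3.0])

print(a.sum())           # nan  -- the NaN propagates
print(np.nansum(a))      # 4.0  -- NaN values are ignored
print(np.nan_to_num(a))  # [1. 0. 3.]  -- NaN replaced by 0.0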
The function isnan() checks whether something is "Not a Number"; for example, isnan(2) would return False because 2 is a regular number.
The conditional myVar is not None checks whether the variable has been set to something other than None.
Your numpy array uses isnan() because it is intended to be an array of numbers; it initializes missing elements to NaN, and these elements are considered "empty".
Below are the differences:
nan belongs to the class float
None belongs to the class NoneType
I found the below article very helpful:
https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31

Replacing NaN Values in a Pandas DataFrame with Different Random Uniform Variables

I have a uniform distribution in a pandas dataframe column with a few NaN values I'd like to replace.
Since the data is uniformly distributed, I decided that I would like to fill the null values with random uniform samples drawn from a range of the column's min and max values. I used the following code to get the random uniform sample:
df_copy['ep'] = df_copy['ep'].fillna(value=np.random.uniform(3, 331))
Of course, using pd.DataFrame.fillna() replaces all existing NaNs with the same value. I would like each NaN to be a different value. I assume that a for loop could get the job done, but I'm unsure how to write such a loop to specifically handle these NaN values. Thanks for the help!
It looks like you are doing this on a Series (column), but the same implementation would work on a DataFrame:
Sample Data:
import numpy as np
import pandas as pd

series = pd.Series(range(100), dtype=float)  # float dtype so NaN can be assigned
series.loc[2] = np.nan
series.loc[10:15] = np.nan
Solution:
series.mask(series.isnull(), np.random.uniform(3, 331, size=series.shape))
Use boolean indexing with DataFrame.loc:
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
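Put together as a self-contained sketch (the column name and the 3–331 range come from the question; the sample data and the seed are made up so the example is reproducible):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded only so the example is reproducible

df_copy = pd.DataFrame({'ep': [10.0, np.nan, 250.0, np.nan, 42.0]})

m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = rng.uniform(3, 331, size=m.sum())

print(df_copy)   # each former NaN now holds a different draw from U(3, 331)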

python having issues with large numbers used as IDs

I am putting together a script to analyze campaigns and report out. I'm building it in Python in order to make this easy the next time around. I'm running into issues with the IDs involved in my data; they are essentially really large numbers (no strings, no characters).
When pulling the data in from Excel I get floats like this (7.000000e+16) when in reality it is an integer like so (70000000001034570). My problem is that I'm losing a ton of data and all kinds of unique IDs are getting converted to a couple of different floats. I realize this may be an issue with the read_csv function I use to pull these in, since all of this comes from .csv files. I am not sure what to do, as converting to string gives me the same results as the float, only as a string datatype, and converting to int gives me the literal result of the scientific notation (i.e. 70000000000000000). Is there a datatype I can store these as, or a method I can use, to preserve the data? I will have to merge on the IDs later with data pulled from a query, so ideally I would like to find a datatype that preserves them. The few lines of code below run but return only a handful of rows because of the issue I described.
high_lvl_df = pd.read_csv(r"mycsv.csv")
full_df = low_lvl_df.merge(right=high_lvl_df, on='fact', how='outer')
full_df.to_csv(r'fullmycsv.csv')
It may have to do with missing values.
Consider this CSV:
70000000001034570,2.
70000000001034571,3.
Then:
>>> pandas.read_csv('asdf.csv', header=None)
                   0    1
0  70000000001034570  2.0
1  70000000001034571  3.0
Gives you the expected result.
However, with:
70000000001034570,2.
,1.
70000000001034571,3.
You get:
>>> pandas.read_csv('asdf.csv', header=None)
              0    1
0  7.000000e+16  2.0
1           NaN  1.0
2  7.000000e+16  3.0
That is because integers do not have NaN values, whereas floats do have that as a valid value. Pandas, therefore, infers the column type is float, not integer.
You can use pandas.read_csv()'s dtype parameter to force type string, for example:
>>> pandas.read_csv('asdf.csv', header=None, dtype={0: str})
                   0    1
0  70000000001034570  2.0
1                NaN  1.0
2  70000000001034571  3.0
According to Pandas' documentation:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
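If the IDs need to stay numeric for the later merge rather than becoming text, the 'Int64' nullable integer dtype mentioned in the quoted documentation is another option to try on a reasonably recent pandas; it keeps the 17-digit IDs exact and stores the empty cell as a missing value instead of pushing the whole column to float:
import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype from the documentation
# quoted above: the IDs stay exact integers and the empty cell becomes a
# missing value instead of forcing the column to float.
df = pd.read_csv('asdf.csv', header=None, dtype={0: 'Int64'})
print(df[0].dtype)   # Int64
print(df[0])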

Counter python 3

I read in a data set (https://outcomestat.baltimorecity.gov/Transportation/100EBaltimoreST/k7ux-mv7u/about) with pandas.read_csv() with no modifying args.
In the stolenVehicleFlag column there are 0s, 1s, and NaNs.
The NaNs return False when compared to np.nan or np.NaN.
The column is typed numpy.float64, so I tried casting the np.nans to that type from the plain float they normally are, but the comparison still returns False.
I also tried using a Counter to roll them up, but each NaN gets its own count of 1.
Any ideas on how this is happening and how to deal with it?
Any ideas on how this is happening and how to deal with it?
I'm not sure what you are expecting to do, but maybe this could help. If you want to get rid of these NaN values, then, with "df" as your dataframe, use:
df.dropna()
This will help you with the NaN values.
You can check pandas.DataFrame.dropna for more information.
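A short sketch of why the == comparison fails and how pandas can roll the NaNs up (the 0/1/NaN values mirror the stolenVehicleFlag column from the question):
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, np.nan])

# NaN is never equal to anything, not even itself, so == always gives False
print(s == np.nan)              # all False, including the NaN rows
print(s.isna())                 # True exactly where values are missing

# value_counts can group all the NaNs into a single count, unlike Counter
print(s.value_counts(dropna=False))

# or simply drop them, as suggested above
print(s.dropna())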
