I read in a [data set](https://outcomestat.baltimorecity.gov/Transportation/100EBaltimoreST/k7ux-mv7u/about) with pandas.read_csv() with no modifying arguments.
In the stolenVehicleFlag column there are 0, 1, and NaN.
The NaNs return False when compared to np.nan or np.NaN.
The column is typed numpy.float64, so I tried casting the np.nans to that type from the plain Python float they normally are, but the comparison still returns False.
I also tried using a Counter to roll them up, but each NaN gets its own count of 1.
Any ideas on how this is happening and how to deal with it?
I'm not sure what you are expecting to do, but maybe this could help if you want to get rid of these NaN values. Considering "df" is your dataframe, use:
df.dropna()
This will remove the rows containing NaN values.
You can check for more information here : pandas.DataFrame.dropna
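As a sketch of why the == comparisons and the Counter both fail (the column name is taken from the question, the data is made up): NaN compares unequal to everything, including itself, so equality-based lookups never match. pandas' isna()/value_counts(dropna=False) sidestep the comparison entirely:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV column from the question.
df = pd.DataFrame({"stolenVehicleFlag": [0.0, 1.0, np.nan, np.nan]})

# NaN is never equal to anything, not even itself -- this is why
# `v == np.nan` fails and why Counter gives each NaN its own bucket.
print(np.nan == np.nan)                          # False

# Detect and count missing values with isna()/value_counts instead.
print(df["stolenVehicleFlag"].isna().sum())      # 2
print(df["stolenVehicleFlag"].value_counts(dropna=False))
```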
Related
I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally a cell is empty. In my opinion, the value read into that dictionary entry should be None, but instead nan is assigned. Surely None is more descriptive of an empty cell, as it has a null value, whereas nan just says that the value read is not a number.
Is my understanding correct, what IS the difference between None and nan? Why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
for k, v in my_dict.items():
    if np.isnan(v):
        ...
But this gives me an error saying that I cannot use this check for v. I guess it is because the check is meant for an integer or float variable, not a string. If this is true, how can I check v for an "empty cell"/nan case?
NaN is used as a placeholder for missing data consistently in pandas; consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note: the "gotcha" that integer Series containing missing data are upcast to floats.
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.
# without forcing dtype it changes None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times fast: object==bad, float==good
That said, many operations may still work just as well with None as with NaN (but they are perhaps not supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
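A minimal sketch of that check applied to a dictionary like the one in the question (keys and values are made up), showing pd.isnull succeeding where np.isnan raises on strings:

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary mixing strings and a missing cell read as NaN.
my_dict = {"a": "XY12", "b": np.nan, "c": 3.5}

# pd.isnull handles strings, floats and None alike,
# whereas np.isnan("XY12") would raise a TypeError.
empty_keys = [k for k, v in my_dict.items() if pd.isnull(v)]
print(empty_keys)   # ['b']
```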
NaN can be used as a numerical value on mathematical operations, while None cannot (or at least shouldn't).
NaN is a numeric value, as defined in IEEE 754 floating-point standard.
None is an internal Python type (NoneType) and would be more like "inexistent" or "empty" than "numerically invalid" in this context.
The main "symptom" of that is that, if you perform, say, an average or a sum on an array containing NaN, even a single one, you get NaN as a result...
On the other hand, you cannot perform mathematical operations using None as an operand.
So, depending on the case, you could use None as a way to tell your algorithm not to consider invalid or inexistent values on computations. That would mean the algorithm should test each value to see if it is None.
Numpy has some functions to prevent NaN values from contaminating your results, such as nansum and nan_to_num for example.
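For example (a small sketch of the contamination and the NaN-aware alternatives):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

print(a.sum())           # nan -- a single NaN contaminates the plain sum
print(np.nansum(a))      # 4.0 -- NaN is skipped
print(np.nanmean(a))     # 2.0 -- mean over the non-NaN values
print(np.nan_to_num(a))  # [1. 0. 3.] -- NaN replaced by 0.0
```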
The function isnan() checks whether something is "Not A Number"; it returns True for NaN and False for an ordinary number, so for example isnan(2) returns False.
The conditional myVar is not None checks whether the variable holds something other than None.
A numpy array works with isnan() because it is intended to be an array of numbers; if its elements are initialized to NaN, those elements are considered "empty".
Below are the differences:
nan belongs to the class float
None belongs to the class NoneType
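These two class memberships are easy to verify directly:

```python
import numpy as np

print(type(np.nan))               # <class 'float'>
print(type(None))                 # <class 'NoneType'>
print(isinstance(np.nan, float))  # True
```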
I found the below article very helpful:
https://medium.com/analytics-vidhya/dealing-with-missing-values-nan-and-none-in-python-6fc9b8fb4f31
It is super weird. I was going to drop the NaN rows for features with less than 5% missing. After I dropped them, I wanted to see whether it worked, and I surprisingly found out that I couldn't drop the NaN values for these variables; there are even more NaN values now??
Please tell me where I am wrong.
Thank you so much.
Why the code doesn't work: train_set has more columns than drop_list. You tell pandas to delete the rows of train_set[drop_list], but these rows are still kept in train_set because it contains more columns. Basically, train_set merges the data from the columns outside of drop_list with the modified columns, and the missing sub-rows are imputed with zeros.
How to fix it? Use masks. You can determine what rows are deleted:
mask = training_set[drop_list.pop(0)].isna()
for col in drop_list:
    mask = mask & training_set[col].isna()
training_set = training_set[~mask]  # keep only the rows not marked for deletion
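Assuming drop_list simply holds the column names to check, the same result can usually be had with pandas' built-in dropna (a sketch with made-up data, not the asker's train_set):

```python
import numpy as np
import pandas as pd

training_set = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": ["x", "y", "z"],
})
drop_list = ["A", "B"]

# Drop only the rows where *every* column in drop_list is NaN,
# which mirrors the explicit AND-ed mask above.
result = training_set.dropna(subset=drop_list, how="all")
print(result.index.tolist())   # [0, 2]
```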
I am trying to remove all-NaN rows from a dataframe which I get via pd.read_excel("test.xlsx", sheet_name="Sheet1"). I have tried both df = df.dropna(how='all') and df.dropna(how='all', inplace=True); neither can remove the last empty row, which I printed with df.tail(1):
         a    b  c
3463   NaN  NaN
I noticed the value in column c is not null but an empty string. Could someone help me deal with this issue? Thank you.
Maybe you want replace empty values to missing before:
df = df.replace(r'^\s+$', np.nan, regex=True).dropna(how='all')
Regex ^\s+$ means:
^ is start of string
\s+ is one or more whitespaces
$ means end of string
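A small self-contained sketch of that pipeline (hypothetical data; note that ^\s*$ with a star would additionally match zero-length strings, which ^\s+$ does not):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan],
                   "b": [2.0, np.nan],
                   "c": ["x", "   "]})

# Whitespace-only cells become NaN, then rows that are entirely NaN drop.
cleaned = df.replace(r"^\s*$", np.nan, regex=True).dropna(how="all")
print(len(cleaned))   # 1 -- only the first row survives
```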
Here NaN is also a value, and an empty string is also treated as part of the row.
In case of NaN, you must drop or replace with something:
dropna()
If you use this function, then whenever pandas finds NaN in a row it will remove the whole row (with the default how='any'), no matter what other values are present besides the NaN.
fillna() to fill some values instead of NaN
In your case :
df['C'] = df['C'].fillna(value="Any value")
Note: it is important to specify the column in which you want to fill values, otherwise it will update the whole dataframe wherever NaN appears.
Now if there is an empty string, try this:
df[df['C']==""]="Anyvalue"
I have not tried this, but my assumption about the above is:
Let's break it down:
a. df['C']==""
This will return boolean values.
b. df[df['C']==""]="Anyvalue"
Wherever pandas finds True, the value "Anyvalue" will get applied.
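A hedged sketch of both steps in runnable form: fillna's keyword is value (not values) and it returns a new Series unless assigned back, and restricting the empty-string assignment with .loc avoids overwriting whole rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C": ["x", np.nan, ""]})

# Fill the NaN in column C only (assign back, or use inplace=True).
df["C"] = df["C"].fillna(value="Any value")

# Replace empty strings in column C only; a plain df[mask] = ...
# would touch every column of the matching rows.
df.loc[df["C"] == "", "C"] = "Any value"

print(df["C"].tolist())   # ['x', 'Any value', 'Any value']
```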
I am trying to use numpy.genfromtxt() to read a .csv file, but I can't make it read the header correctly.
By default the function doesn't skip the header, and since the values in each column are numbers, it sets the variable type to float (for the entire column); it then detects the header row as a missing value and returns NaN.
Here is my code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',')
print(dataset[0:5])
Here is first 7 rows of my .csv:
patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,48,12.1,5.6,87.1,4.3
1,48,5.6,25.9,72.7,0.4
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
3,59,53.2,1.07,47.8,49.6
And here is the resulting array:
[[ nan nan nan nan nan nan]
[ 1. 48. 35.8 3.4 92.5 3.7]
[ 1. 48. 14.5 12.6 78.3 1.2]
[ 1. 48. 12.1 5.6 87.1 4.3]
[ 1. 48. 5.6 25.9 72.7 0.4]]
Process finished with exit code 0
I tried setting the encoding to 'UTF-8-sig' and playing around with the parameters, but to no avail. I also tried numpy.loadtxt(), but it doesn't work for me since there are missing values within the dataset.
The only solution that worked for me is to read the first row into a separate array and then concatenate the two.
Is there a more elegant solution to reading the header as strings while preserving the float nature of the values? I am probably missing something trivial here.
Preferably using numpy or other package – I am not fond of creating for loops everywhere, aka reinventing the wheel while standing at the car park.
Thank you for any and all input.
That is feasible with numpy, or even with the standard lib (csv), but I would suggest looking at the pandas package (whose whole point is the handling of CSV-like data).
import pandas as pd
file_to_read = r'path/to/your/csv'
res = pd.read_csv(file_to_read)
print(res)
The "N/A" will get out as NaN (for more options, see parameters na_values and keep_default_na in the doc for pandas.read_csv).
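A minimal illustration with an inline CSV (the real file path replaced by a StringIO buffer):

```python
import io
import pandas as pd

csv_text = "patient_nr,Age,LYM\n1,48,3.4\n1,49,N/A\n"

# "N/A" is in read_csv's default na_values list, so it parses as NaN
# while the header row stays a proper header.
res = pd.read_csv(io.StringIO(csv_text))
print(res["LYM"].isna().tolist())   # [False, True]
```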
A solution by commenter hpaulj did the job for me:
Using names=True and dtype=None (and possibly encoding=None) should produce a structured array. Look at its shape and dtype. Or use the skip_header parameter, and accept floats.
Also, for anyone starting with numpy and not reading the full documentation like me: the names of the columns are not stored in the array itself, but in its .dtype.names. Because I didn't look there, I didn't see that the code worked with names=True.
The working code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',',
                           encoding='UTF-8-sig',
                           dtype=None,
                           names=True)
print(dataset[0:7])
print(dataset.dtype.names)
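To illustrate what the resulting structured array looks like without the original file, here is the same call on a small inline buffer (hypothetical data):

```python
import io
import numpy

csv_text = "patient_nr,Age,LYM\n1,48,3.4\n2,49,5.6\n"

# names=True consumes the header line and stores the column names
# on the array's dtype; columns are then addressed by name.
dataset = numpy.genfromtxt(io.StringIO(csv_text),
                           delimiter=',',
                           dtype=None,
                           names=True)
print(dataset.dtype.names)   # ('patient_nr', 'Age', 'LYM')
print(dataset['LYM'])        # [3.4 5.6]
```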
I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this:
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why sometimes I get zeros and other times I get NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the same indices; otherwise, for unmatched indices, you get NaNs:
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)), index=subset.index)
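A tiny reproduction of the alignment effect (made-up frame with a non-default index):

```python
import numpy
import pandas as pd

subset = pd.DataFrame({"x": [10, 20]}, index=[5, 7])

# The new DataFrame gets the default index 0..n-1; assignment aligns
# on the index, and the unmatched labels 5 and 7 come out as NaN.
subset["bad"] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))

# Passing the target index makes the labels match, so the zeros survive.
subset["good"] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)),
                              index=subset.index)

print(subset["bad"].isna().all())    # True
print((subset["good"] == 0).all())   # True
```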