Dropna is surprisingly not working? - python-3.x

It is super weird. I was going to drop the NaN values for features with less than 5% missing. After I dropped them, I wanted to check whether it had worked, and I surprisingly found out that I couldn't drop the NaN values for these variables, and there are even more NaN values now??
Please tell me where I am wrong.
Thank you so much.

Why the code doesn't work: your training set has more columns than drop_list. You tell pandas to drop the rows of training_set[drop_list], but the original training_set still keeps those rows because it contains other columns as well. Basically, pandas has to merge the modified columns back with the columns outside of drop_list, and the rows you dropped come back as missing values, which is why you see even more NaN than before.
How to fix it? Use a mask. You can mark which rows should be deleted and then keep the rest:
# True for every row that has a NaN in any of the drop_list columns
mask = training_set[drop_list[0]].isna()
for col in drop_list[1:]:
    mask = mask | training_set[col].isna()
# keep only the rows without NaN in those columns
training_set = training_set[~mask]
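For completeness, the same result can usually be reached with dropna's subset argument. A minimal sketch on made-up data (the column names here are assumptions, not the asker's real features):
import numpy as np
import pandas as pd

# hypothetical training set: 'a' and 'b' stand in for the low-missingness features in drop_list
training_set = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                             'b': [4.0, 5.0, np.nan],
                             'c': [np.nan, 7.0, 8.0]})
drop_list = ['a', 'b']

# drop any row that has a NaN in one of the drop_list columns; column 'c' is not considered
training_set = training_set.dropna(subset=drop_list)
print(training_set)  # only the first row survives; it may still hold NaN in 'c'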

Related

Expanding a DataFrame from a col full of Dictionaries

I have a big data frame, where one of the columns (event_properties) has one dictionary per row (with different keys).
I would like to expand the data frame with one column per unique key, with a None value in the rows where that particular key does not exist.
I tried something like:
p = df.copy()
y = df.event_properties.values.tolist()
event_properties = pd.DataFrame(y)
df = pd.concat([p, event_properties])
But there is something wrong with this code because if I do:
df[['event_properties', 'author']][:2]
I visualize something like:
event_properties author
120 {'author': 'Larsson, Stieg', 'product_id': '6a... NaN
128 {'author': 'Larsson, Stieg', 'product_id': '40... NaN
and this is clearly wrong.
I am quite annoyed by this totally unexpected behaviour; can anyone clarify why this is happening and how I should fix it?
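For reference, the row-wise stacking shown above is what pd.concat does by default (axis=0). A minimal sketch of the column-wise version, using a made-up two-row frame in place of the real data:
import pandas as pd

# hypothetical frame mimicking the question: one dict per row, non-default index
df = pd.DataFrame({'event_properties': [{'author': 'Larsson, Stieg', 'product_id': '6a'},
                                        {'product_id': '40'}]},
                  index=[120, 128])

# build the expanded columns with the SAME index as df, then concatenate column-wise;
# without axis=1 (and a matching index) the two frames are stacked on top of each other
expanded = pd.DataFrame(df['event_properties'].tolist(), index=df.index)
out = pd.concat([df, expanded], axis=1)
print(out[['event_properties', 'author']])  # row 128 has no 'author' key, so it stays NaN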

Pandas: get first datetime-in and last datetime-out in one row

First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using pandas (it's super handy for what I've tried and achieved so far).
I have this data, handed to me like this (I don't have access to the source), 20k rows or more sometimes. The 'in' and 'out' columns may have one or more entries per date, so after an 'in' the next entry could be an 'out' or another 'in', which leaves me a blank cell; that's the problem (see first image).
I want to extract the first datetime-in into one column and the last datetime-out into another, with both in the same row (see second image); the data comes in a CSV file. I am currently doing this work manually with LibreOffice Calc (yes, really).
So far I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my first language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list; we will need that later.
in_column = df["in"].tolist() # "in" is a Python keyword, so I suggest renaming that column
I treat NaT the same as NaN (null) in this case.
Now we have to find what rows to keep, which we do by going through the in column and only keeping the rows after a NaN (and the first one):
filtered_df = []
tracker = False
for index, element in enumerate(in_column):
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):  # catches both NaT and NaN in the 'in' column
        tracker = True
    filtered_df.append(False)
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
out_column = [d for d in out_column if pd.notna(d)]  # pd.notna handles both NaT and NaN
Last but not least you overwrite your old out column with the new one:
df["out"] = out_column

Replacing NaN Values in a Pandas DataFrame with Different Random Uniform Variables

I have a uniform distribution in a pandas dataframe column with a few NaN values I'd like to replace.
Since the data is uniformly distributed, I decided that I would like to fill the null values with random uniform samples drawn from a range of the column's min and max values. I used the following code to get the random uniform sample:
df_copy['ep'] = df_copy['ep'].fillna(value=np.random.uniform(3, 331))
Of course, using pd.DataFrame.fillna() this way replaces all existing NaNs with the same value. I would like each NaN to be a different value. I assume that a for loop could get the job done, but I am unsure how to create such a loop to specifically handle these NaN values. Thanks for the help!
It looks like you are doing this on a Series (column), but the same implementation would work on a DataFrame:
Sample Data:
series = pd.Series(range(100))
series.loc[2] = np.nan
series.loc[10:15] = np.nan
Solution:
series.mask(series.isnull(), np.random.uniform(3, 331, size=series.shape))
Use boolean indexing with DataFrame.loc:
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
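As a quick end-to-end check, here is a small sketch of the loc-based fix on a made-up 'ep' column (the values are arbitrary):
import numpy as np
import pandas as pd

# hypothetical frame with a few NaNs in the 'ep' column
df_copy = pd.DataFrame({'ep': [12.0, np.nan, 250.0, np.nan, 97.0]})

m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
print(df_copy)  # each former NaN now holds its own draw from the uniform(3, 331) range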

Adding zero columns to a dataframe

I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this.
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why I sometimes get zeros and other times get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now, when I do similar things to another dataframe, I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the same index; otherwise you get NaN for the indices that do not match:
subset['IPNotional']=pd.DataFrame(numpy.zeros(shape=(len(subset),1)), index=subset.index)
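A minimal reproduction of the index-alignment issue, with a made-up subset (the 'trade' column and the index values are assumptions):
import numpy
import pandas as pd

# hypothetical subset with a non-default index, e.g. left over from filtering a larger frame
subset = pd.DataFrame({'trade': ['a', 'b', 'c']}, index=[5, 7, 9])

# without an explicit index the zeros frame is labelled 0..2, so assignment aligns on
# labels, none of them match, and the new column ends up all NaN
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
print(subset['IRNotional'].tolist())  # [nan, nan, nan]

# passing index=subset.index makes the labels match and the zeros land where expected
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
print(subset['IPNotional'].tolist())  # [0.0, 0.0, 0.0]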

Counter python 3

I read in a data set (https://outcomestat.baltimorecity.gov/Transportation/100EBaltimoreST/k7ux-mv7u/about) with pandas.read_csv() with no modifying args.
In the stolenVehicleFlag column there are 0, 1, and NaN.
The NaNs return False when compared to np.nan or np.NaN.
The column is typed numpy.float64, so I tried casting the np.nans to that type from the plain float type they normally are, but the comparison still returns False.
I also tried using a Counter to roll them up, but each NaN gets its own count of 1.
Any ideas on how this is happening and how to deal with it?
I'm not sure exactly what you are expecting to do, but maybe this could help. If you want to get rid of these NaN values, considering "df" is your dataframe, use:
df = df.dropna()  # dropna returns a new frame, so reassign it (or pass inplace=True)
This will help you with the NaN values.
You can check for more information here: pandas.DataFrame.dropna
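For the counting part of the question: NaN never compares equal to anything, including itself, which is why the equality checks fail and why hash-based counting can treat each NaN as its own key. A small sketch of the usual workarounds (the sample values are made up, not the real column):
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, np.nan], name='stolenVehicleFlag')

print(np.nan == np.nan)              # False -- equality can never detect NaN
print(pd.isna(s).sum())              # 2 -- count missing values with pd.isna instead
print(s.value_counts(dropna=False))  # keeps NaN as a single group in the tally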
