Pandas Series - Trouble Dropping First Value - python-3.x

I'm having trouble dropping the first NaN value in my Series.
I took the difference between the original Series and the Series shifted by one period. This is how I calculated it:
x[c] = x[c] - x[c].shift(periods=1)
When I try to drop the first value using these methods:
x[c].drop(labels=[0])
x[c].dropna()
x[c].iloc[1:]
none of them works when I reassign the result:
# these are not used all together, but separately
x[c] = x[c].dropna()
x[c] = x[c].drop(labels=['1981-09-29'])
x[c] = x[c][1:]
print(x[c])
Date
1981-09-29 NaN
1981-09-30 -0.006682
1981-10-01 -0.014575
1981-10-02 -0.004963
1981-10-05 -0.004963
However, when I call the drop or dropna function in a print statement, it works!
print(x[c].dropna())
Date
1981-09-30 -0.006682
1981-10-01 -0.014575
1981-10-02 -0.004963
1981-10-05 -0.004963
1981-10-06 -0.005514
Any method is fine; I just want to get rid of the first element in my Series.
Please help.

The dataframe has multiple Series, and if I reassigned just one of them, the NaN remained: all Series in a dataframe must have the same length, so a row can't be dropped from one column alone. Therefore I need to call dropna on the whole dataframe after the calculation is performed.
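The self-answer above can be sketched on a toy DataFrame (the column names and values here are made up for illustration): differencing produces NaN in the first row of every column, and a single `dropna()` on the frame removes that row everywhere at once.

```python
import pandas as pd

# Toy frame standing in for x; columns "a" and "b" are hypothetical
x = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 12.0, 9.0, 11.0]},
    index=pd.to_datetime(["1981-09-29", "1981-09-30", "1981-10-01", "1981-10-02"]),
)

# Difference every column at once; the first row becomes NaN in all of them.
# x - x.shift(periods=1) is equivalent to x.diff()
x = x - x.shift(periods=1)

# Drop the all-NaN first row from the whole frame, not column by column
x = x.dropna()
print(len(x))  # 3
```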

Try this? Note that dropna(inplace=True) mutates the object and returns None, so don't assign its result back to x[c]:
x[c].dropna(inplace=True)
print(x[c])

If you know it is the first element you don't want, the simplest way is to use series.iloc[1:] for a Series or dataframe.iloc[1:, :] for a DataFrame.
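A minimal sketch of the positional slice on a Series (values taken from the question's output):

```python
import pandas as pd

s = pd.Series(
    [float("nan"), -0.006682, -0.014575],
    index=["1981-09-29", "1981-09-30", "1981-10-01"],
)

# Positional slice: everything except the first row.
# A Series has one axis, so iloc takes a single slice.
trimmed = s.iloc[1:]
print(trimmed.index[0])  # 1981-09-30
```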

Related

Delete rows in Dataframe using Pandas

I have a dataset with 250,000 samples. The column "CHANNEL" has 7 missing values. I want to delete those 7 rows. Here is my code:
mask = df_train["CHANNEL"].notnull()
df_train = df_train[mask]
I checked the shape by
df_train.shape
It correctly outputs 249993 rows. However, when I tried to output the entire dataset, it still shows index labels from 0 to 249999.
I also checked the number of missing values in each column of df_train, and each of them is zero. This problem matters because I want to do concatenation later and some issues arise. I am not sure if I missed some points when using the above commands. I would appreciate any suggestions and comments!
Try using dropna()
df_train = df_train.dropna()
You may see that the last row still has the index 249999; that's just because the original index hasn't changed. To reset the index of the new data frame without the missing values, you can use reset_index():
df_train = df_train.dropna()
df_train = df_train.reset_index(drop=True)
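A small sketch of the two-step fix on a toy frame (column names here are made up): dropna() keeps the old labels, and reset_index(drop=True) renumbers them from zero.

```python
import pandas as pd

df_train = pd.DataFrame({"CHANNEL": [1.0, None, 3.0], "OTHER": [4, 5, 6]})

df_train = df_train.dropna()                # rows 0 and 2 survive; index is [0, 2]
df_train = df_train.reset_index(drop=True)  # index becomes [0, 1]
print(list(df_train.index))  # [0, 1]
```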

Python: how can I get the mode from a month column that i extracted from a datetime column?

I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe, but you assign it to the variable df, then df will be a tuple that contains the actual dataframe, and not the dataframe itself.
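The failure mode described above can be reproduced with a hypothetical loader function (`load_data` and its return values are invented for illustration); unpacking the tuple restores the expected behaviour.

```python
import pandas as pd

# Hypothetical function that returns the frame plus something else
def load_data():
    df = pd.DataFrame({"extracted_month": [6, 6, 7]})
    return df, "some metadata"

df = load_data()     # df is now a tuple, so df['extracted_month'] would raise
                     # TypeError: tuple indices must be integers or slices, not str
df, _ = load_data()  # unpack the tuple to get the actual DataFrame
print(df["extracted_month"].mode()[0])  # 6
```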

replacing a special character in a pandas dataframe

I have a dataset that has '?' instead of NaN for missing values. I could have gone through each column using replace, but the only problem is I have 22 columns. I am trying to write a loop to do it efficiently, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col] == '?':
        adult[col] = adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' then use the fillna function or to drop them with dropna. The second problem is that not all the columns are categorical so the str function is also wrong. How can I easily deal with this situation?
If you're reading the data from a .csv or .xlsx file you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise do what @MasonCaiby said and use adult = adult.replace('?', float('nan')) — note that replace returns a new frame, so assign the result back.
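A minimal sketch of the whole-frame replacement (the column names and values are made up): one replace call covers all 22 columns, string or numeric, with no loop.

```python
import numpy as np
import pandas as pd

adult = pd.DataFrame({"workclass": ["Private", "?"], "age": [39, 50]})

# Replace the placeholder everywhere at once; real NaN then works
# with isna(), fillna(), and dropna()
adult = adult.replace("?", np.nan)
print(adult["workclass"].isna().sum())  # 1
```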

Iterating through dataframe columns and using .apply() gives KeyError

So I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x: (x-df[x].mean())/df[x].std())
I don't think it's a good idea to use the mean and std functions inside the apply: they would be recalculated for every single row. (The KeyError itself comes from the lambda's parameter x shadowing the loop variable, so df[x] inside the lambda indexes the frame with a cell value instead of a column name.) Instead, calculate them once at the top of the loop and use them in the apply function, like below:
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y: (y-mean)/std)
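Since the operation is plain arithmetic, apply can be dropped entirely: pandas subtracts and divides whole columns at once, which is both shorter and faster. A sketch on a toy frame (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Freeze the column list so the new *_norm columns aren't iterated over
for x in list(df.columns):
    # Vectorized: the whole column is normalized in one expression
    df[x + "_norm"] = (df[x] - df[x].mean()) / df[x].std()

print(round(df["a_norm"].iloc[2], 4))  # 1.0
```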

Outer Join Two Pandas Dataframes [duplicate]

This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 1 year ago.
I'm not sure where I am astray but I cannot seem to reset the index on a dataframe.
When I run test.head(), the dataframe is a slice, so the index still carries the original row labels rather than a clean 0-based range.
What I'd like to do is to reset the index for this dataframe, so I run test.reset_index(drop=True).
That looks like a new index, but it's not: running test.head again, the index is still the same. Attempting to use lambda/apply or iterrows() creates problems with the dataframe.
How can I really reset the index?
reset_index by default does not modify the DataFrame; it returns a new DataFrame with the reset index. If you want to modify the original, use the inplace argument: df.reset_index(drop=True, inplace=True). Alternatively, assign the result of reset_index by doing df = df.reset_index(drop=True).
BrenBarn's answer works.
The following, found via this thread, also worked; it isn't troubleshooting so much as an articulation of how to reset the index:
test = test.reset_index(drop=True)
As an extension of in code veritas's answer... instead of doing del at the end:
test = test.reset_index()
del test['index']
You can set drop to True.
test = test.reset_index(drop=True)
I would add to in code veritas's answer:
If you already have an index column specified, then you can save the del, of course. In my hypothetical example:
df_total_sales_customers = pd.DataFrame(
    {'Sales': total_sales_customers['Sales'],
     'Customers': total_sales_customers['Customers']},
    index=total_sales_customers.index)
df_total_sales_customers = df_total_sales_customers.reset_index()
