is it possible to manually assign the value of the dummy variable? - python-3.x

I have a data set for automotive sales and i want to change the feature 'aspiration' which contains two unique values 'std' & 'turbo' to categorical values using pd.get_dummies. using the code below;
dummy_variable_2 = pd.get_dummies(df['aspiration'])
It is automatically assigning 0 to 'std' & 1 to 'turbo'.
I would like to change to 'std' to 1 & 'turbo' to 0.

The return of pd.get_dummmies is a dataframe, which contains one column for each unique value in the dataframe. Whereby, in each column only the values of the corresponding unique value are set to one.
In your case, the dataframe contains two columns. One column is named turbo and one column std. If you want the column where the values of std are set to one, you have to do following:
df = pd.DataFrame({"aspiration":["std", "turbo", "std", "std", "std", "turbo"]})
dummies = pd.get_dummies(df)
std= dummies["aspiration_std"]
In this example, the variable dummy looks like:
std turbo
0 1 0
1 0 1
2 1 0
3 1 0
4 1 0
5 0 1
and std looks like:
0 1
1 0
2 1
3 1
4 1
5 0

Related

Is there a way to convert my column of incrementing integers separated by zero to the number of intervals encountered so far in a pandas datafram?

I'm working in pandas and I have a column in my dataframe filled by 0s and incrementing integers starting at one. I would like to add another column of integers but that column would be a counter of how many intervals separated by zero we have encountered to this point. For example my data would like like
Index
1
2
3
0
1
2
0
1
and I would like it to look like
Index IntervalCount
1 1
2 1
3 1
0 1
1 2
2 2
0 2
1 2
Is it possible to do this with vectorized operation or do I have to do this iteratively? Note, it's not important that it be a new column could also overwrite the old one.
You can use cumsum function.
df["IntervalCount"] = (df["Index"] == 1).cumsum()

How to extract row before and after when flag change from 0 to 1

I have one dataframe , i want to extract 2 rows before flag change from 0 to one and get row where value 'B' is minimum , also extract two rows after flag 1 and get row with minimum value of 'B'
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,0,0,0]})
df_out=pd.DataFrame({'A':[4,1],
'B':[4,1],
'flag':[0,1]})
To find indices of both rows of interest, run:
ind1 = df[df.flag.shift(-1).eq(0) & df.flag.shift(-2).eq(1)].index[0]
ind2 = df[df.index > ind1].B.idxmin()
For your data sample the result is 2 and 6.
Then, to retrieve rows with these indices, run:
df.loc[[ind1, ind2]]
The result is:
A B flag
2 4 4 0
6 1 1 1

Take a column from a Dataframe and normalize all of the other columns against it?

I've got a Dataframe like this:
df = pd.DataFrame(np.reshape(np.arange(0,9), (3,3)))
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
I'd like to normalize two of the columns against a reference column. For example, if I chose df[0] as my reference column, then df[1] and df[2] would also have a mean of 3 and a standard deviation of 3.
What's the best way to do this?
You can shift and scale the values in each column by the mean and standard deviation of the reference column ref:
ref = 0
means = df.mean()
stds = df.std()
(df - means + means[ref]) / stds * stds[ref]

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to groupby the "id" column and grab the number of observations in the "value" column, and return a new column in the original dataset that counts the number of times the "value" observation occurs within each id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.
Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.
group by id and and value then count value.
data.groupby(['id' , 'value'])['id'].transform('count')

Trying to ignore zero values in Excel column

Expected OutputIn an Excel spreadsheet I am working on there are two columns of interest, column B and column E. In column B there are some 0 values and these are getting carried over to the column E based on the loop that I am running with respect to column D. I want to write a Python script to ignore these 0's and pick the next highest value based on their frequencies into column E.
12NC ModifiedSOCwrt12NC SOC
0 232270463903 0 0
1 232270463903 0 0
2 232270463903 0 0
3 232270463903 0 0
4 232270463903 0 RC0603FR-0738KL
5 232270463903 0 RC0603FR-0738KL
6 232270463903 0 RC0603FR-0738KL
I want to run a loop which picks non-zero values from SOC (column B) and carries it over to ModifiedSOCwrt12NC (column E) based on unique values in Column D.
For example, Column B has values = [0, RCK2] in multiple rows which are based on unique values in column D. So the current loop picks the maximum occurrences of values in column B and fills it into column E. If there is a tie between occurrences of 0 and RCK2, it picks 0 as per the ASCII standard (which I don't want to happen). I want the code to pick RCK2 and fill those in column E.
Since your Data is not accessible, I have created a test data similar to one below -
We can read data in pandas -
import pandas as pd
df = pd.read_excel("ExcelTemplate.xlsx")
df
Index SOC Index2 12NC
0 YXGMY 0 ZJIZX 23445
1 NQHQC 0 JKJKT 23445
2 MWTLY 0 EFCYD 23445
3 RPQFE AC VLOJZ 23445
4 GPLUQ AC AKKKG 23445
5 WGYYM AC DSMLO 23445
6 XGTAQ 0 ZHGWS 45667
7 AMWDT 0 YROLO 45667
following code will do the summarization -
First summarize data on 12NC and SOC and take count
Sort by 12NC, count and SOC, with highest count first
Take the first value of SOC for each 12NC
Merge with original Data to create column E
Export back to Excel
df1 = df.groupby(['12NC', 'SOC'])['Index'].count().reset_index()
df = df.merge(df1[df1['SOC']!=0].sort_values(by=['12NC', 'Index', 'SOC'], ascending=[True, False, True])\
.drop_duplicates(subset=['12NC'], keep='first')[['12NC', 'SOC']].\
rename(index=str, columns={'SOC': 'ModifiedSOCwrt12NC'}),\
on = ['12NC'], how='left')
df.to_excel("ExcelTemplate_modifies.xlsx", index=False)

Resources