Looping over rows in pandas - python-3.x

My data frame's first column contains IDs, as follows:
ID
A123
A234
A456
A123
A234
Now I need to create a new column, Indicator, which holds 1 for every ID that is repeated and 0 otherwise.
Desired Output:
ID Indicator
A123 1
A234 1
A456 0
A123 1
A234 1

This is a pretty simple operation in pandas once you get the hang of it, so you may want to invest some time in a tutorial. What you need to do is call the convenient duplicated() method of the ID column, which is an instance of pandas.core.series.Series. So:
import pandas as pd
df = pd.DataFrame(["A123", "A234", "A456", "A123", "A234"], columns=["ID"])
df.ID.duplicated()
0 False
1 False
2 False
3 True
4 True
Name: ID, dtype: bool
It returns a Series of boolean values. You can take that new Series and call its apply() method, which builds a new Series from the return values of the function you pass in. So to turn each boolean into 0 or 1, all you need to do is apply int:
df.ID.duplicated().apply(int)  # or df["ID"].duplicated().apply(int)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: int64
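Note that duplicated() by default marks only the second and later occurrences, so the result above (0, 0, 0, 1, 1) differs from the desired output in the question, which flags every occurrence of a repeated ID (1, 1, 0, 1, 1). Passing keep=False marks them all:
# keep=False marks *every* occurrence of a repeated ID, matching the
# desired output in the question
df["Indicator"] = df.ID.duplicated(keep=False).apply(int)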
There are lots of other convenience functions on Series. If you need to do something more complicated, you can apply() a custom function, e.g.
def custom_function(value):
    return str(int(value))
df.ID.duplicated().apply(custom_function)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: object
You can also use the DataFrame's own apply() to run a function across all rows or columns, selected with the axis argument.
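For example (a small sketch of my own, not part of the original answer):
# axis=0 (the default) passes each column to the function;
# axis=1 passes each row.
df.apply(lambda col: col.nunique(), axis=0)
df.apply(lambda row: "-".join(row.astype(str)), axis=1)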

Related

Python and Pandas, find rows that contain value, target column has many sets of ranges

I have a messy dataframe where I am trying to "flag" the rows that contain a certain number in the ids column. The values in this column represent inclusive ranges: for example, "row 4" contains the following numbers:
2409, 2410, 2411, 2412, 2413, 2414, 2377, 2378, 1478, 1479, 1480, 1481, 1482, 1483, 1484. And in "row 0" and "row 1" one of the ranges runs backwards (1931, 1930, 1929).
If I want to know which rows have sets that contain "2340" and "1930", for example, how would I do this? I think a loop is needed, since sometimes I will need to query more than just two numbers. Using Python 3.8.
Example Dataframe
import pandas as pd

x = ['1331:1332,1552:1551,1931:1928,1965:1973,1831:1811,1927:1920',
'1331:1332,1552:1551,1931:1929,180:178,1966:1973,1831:1811,1927:1920',
'2340:2341,1142:1143,1594:1593,1597:1596,1310,1311',
'2339:2341,1142:1143,1594:1593,1597:1596,1310:1318,1977:1974',
'2409:2414,2377:2378,1478:1484',
'2474:2476',
]
y = [6.48,7.02,7.02,6.55,5.99,6.39,]
df = pd.DataFrame(list(zip(x, y)), columns =['ids', 'val'])
display(df)
Desired Output Dataframe
I would write a function that performs two steps:
Given the ids_string that contains the range of ids, list all the ids as ids_num_list
Check if the query_id is in the ids_num_list
def check_num_in_ids_string(ids_string, query_id):
    # Convert ids_string to a set of all individual ids
    ids_range_list = ids_string.split(',')
    ids_num_list = set()
    for ids_range in ids_range_list:
        if ':' in ids_range:
            # Sort numerically so reversed ranges like '1931:1928' expand
            # correctly (a plain string sort can misorder numbers of
            # different lengths, e.g. '99' vs '100')
            lower, upper = sorted(ids_range.split(':'), key=int)
            num_list = list(range(int(lower), int(upper) + 1))
            ids_num_list.update(num_list)
        else:
            ids_num_list.add(int(ids_range))
    # Check if the query number is in the set
    if int(query_id) in ids_num_list:
        return 1
    else:
        return 0

# Example usage
query_id_list = ['2340', '1930']
for query_id in query_id_list:
    df[f'n{query_id}'] = (
        df['ids']
        .apply(lambda x: check_num_in_ids_string(x, query_id))
    )
which returns what you require:
ids val n2340 n1930
0 1331:1332,1552:1551,1931:1928,1965:1973,1831:1... 6.48 0 1
1 1331:1332,1552:1551,1931:1929,180:178,1966:197... 7.02 0 1
2 2340:2341,1142:1143,1594:1593,1597:1596,1310,1311 7.02 1 0
3 2339:2341,1142:1143,1594:1593,1597:1596,1310:1... 6.55 1 0
4 2409:2414,2377:2378,1478:1484 5.99 0 0
5 2474:2476 6.39 0 0
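A few direct calls confirm the range handling, including reversed ranges (example calls of my own, not from the original answer):
print(check_num_in_ids_string('2339:2341,1310:1318', '2340'))  # 1
print(check_num_in_ids_string('1931:1928', '1930'))            # 1 -- reversed range still expands correctly
print(check_num_in_ids_string('2474:2476', '1930'))            # 0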

is it possible to manually assign the value of the dummy variable?

I have a data set for automotive sales and I want to convert the feature 'aspiration', which contains the two unique values 'std' and 'turbo', into categorical values using pd.get_dummies, with the code below:
dummy_variable_2 = pd.get_dummies(df['aspiration'])
It is automatically assigning 0 to 'std' & 1 to 'turbo'.
I would like to change to 'std' to 1 & 'turbo' to 0.
The return of pd.get_dummies is a dataframe that contains one column for each unique value in the input; in each column, the rows matching that column's unique value are set to one.
In your case, the dataframe contains two columns, one named aspiration_turbo and one named aspiration_std. If you want the column where the values of std are set to one, you have to do the following:
df = pd.DataFrame({"aspiration":["std", "turbo", "std", "std", "std", "turbo"]})
dummies = pd.get_dummies(df)
std= dummies["aspiration_std"]
In this example, the variable dummies looks like:
   aspiration_std  aspiration_turbo
0               1                 0
1               0                 1
2               1                 0
3               1                 0
4               1                 0
5               0                 1
and std looks like:
0 1
1 0
2 1
3 1
4 1
5 0
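If the goal is simply one 0/1 column with your own encoding (std → 1, turbo → 0), a plain map avoids get_dummies altogether; a small sketch of my own, where the aspiration_flag column name is made up for illustration:
# Assign your own values directly instead of relying on get_dummies
df["aspiration_flag"] = df["aspiration"].map({"std": 1, "turbo": 0})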

Removing repetitive/duplicate occurrences in Excel using Python

I am trying to remove the repetitive/duplicate names in the NAME column. I just want to keep the first occurrence of each repeated name, using a Python script.
This is my input excel:
And I need output like this:
This isn't removing duplicates per se; you're just blanking out duplicate keys in one column. I would handle this as follows:
create a mask that returns a True/False boolean for whether each row is equal to the row above.
assuming your dataframe is called df
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
explanation:
what we are doing above is the following:
first we select a single column, or in pandas terminology a Series; we then apply .ne (not equal to), which in effect is !=.
let's see this in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']
now, let's create a dataframe similar to yours.
df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})  # being lazy with your age and group variables
so, if we did df['NAME'].ne('Omar'), which is the same as (df['NAME'] != 'Omar'), we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
so, with that out of the way, we want to see if the name in row 1 (remember python is a 0-index language so row 1 is actually the 2nd physical row) is equal to the row above.
we do this by calling .shift().
what this basically does is shift the rows down by a given number of positions; let's call this number n.
if we call df['NAME'].shift(1):
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
we can see here that Rekha has moved down one row.
so putting that all together,
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
we assign this to a variable called mask (you could call it whatever you want).
we then use .loc, which lets you access your dataframe by labels or by a boolean array; in this instance an array.
however, we only want to access the rows where the boolean is False, so we use ~, which inverts the logic of our array:
  NAME    DEFAULT    CLASS   AGE  GROUP
1 Rekha              third     1      4
3 Jaya               fifth     1      1
6 Nita               fifth     1      2
7 Nita               fourth    1      4
all we need to do now is change these rows to blanks, as your initial requirement asked, and we are left with:
  NAME    DEFAULT    CLASS   AGE  GROUP
0 Rekha              forth     2      2
1                    third     1      4
2 Jaya    c-default  foruth    3      3
3                    fifth     1      1
4 Sushma             fourth    3      1
5 Nita    c-default  third     4      2
6                    fifth     1      2
7                    fourth    1      4
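so, putting the whole thing together on the sample frame, the fix is just the two lines from the top of the answer:
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask, 'NAME'] = ''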
hope that helps!

Pandas groupby value and return observation count to dataset

I have a dataset like the following:
id value
a 0
a 0
a 0
a 0
a 1
a 2
a 2
a 2
b 0
b 0
b 1
b 2
b 2
I want to group by the "id" column and count the observations in the "value" column, returning a new column in the original dataset that counts the number of times that row's "value" occurs within its id.
An example of the output I'm looking for is represented in column "output":
id value output
a 0 4
a 0 4
a 0 4
a 0 4
a 1 1
a 2 3
a 2 3
a 2 3
b 0 2
b 0 2
b 1 1
b 2 2
b 2 2
When grouping on id "a", there are 4 observations of 0, which is provided in the column "output" for each row that contains id of "a" and value of 0.
I have tried applications of groupby and apply, to no avail. Any suggestions would be very helpful. Thank you.
Update: I figured out a solution for anyone who also faces this problem, and it works well.
grouped = df.groupby(['id','value'])
df['output'] = grouped['value'].transform('count')
This will return the count of observations under each bucket and return that count to each observation that meets that criteria, as shown in the "output" column above.
Group by id and value, then count value:
df['output'] = df.groupby(['id', 'value'])['id'].transform('count')
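For completeness, a self-contained sketch of my own that rebuilds the sample data and applies the transform:
import pandas as pd

df = pd.DataFrame({
    'id':    ['a'] * 8 + ['b'] * 5,
    'value': [0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 2],
})

# Broadcast each (id, value) group's size back onto its rows
df['output'] = df.groupby(['id', 'value'])['value'].transform('count')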

Iterating over columns and comparing each row value of that column to another column's value in Pandas

I am trying to iterate through a range of 3 columns (named 0, 1, 2). In each iteration I want to compare each row-wise value of that column to another column called Flag (a row-wise comparison for equality) in the same frame. I then want to return the matching field.
I want to check if the values match.
Maybe there is an easier approach: concatenate those columns into a single list, then iterate through that list and see if there are any matches to that extra column? I am not very well versed in pandas or NumPy yet.
I'm trying to think of something efficient as well as I have a large data set to perform this on.
Most of this is pretty free thought so I am just trying lots of different methods
Some attempts so far using the iterate over each column method:
##Sample Data
df = pd.DataFrame([['123','456','789','123'],['357','125','234','863'],['168','298','573','298'], ['123','234','573','902']])
df = df.rename(columns = {3: 'Flag'})
##Loop to find matches
i = 0
while i <= 2:
    df['Matches'] = df[i].equals(df['Flag'])
    i += 1
My thought process is to iterate over each column named 0 - 2, check to see if the row-wise values match between 'Flag' and the columns 0-2. Then return if they matched or not. I am not entirely sure which would be the best way to store the match result.
Maybe utilizing a different structured approach would be beneficial.
I provided a sample frame that should have some matches if I can execute this properly.
Thanks for any help.
You can use iloc in combination with eq, then flag a row if any of its columns match, using .any:
m = df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
df['indicator'] = m
0 1 2 Flag indicator
0 123 456 789 123 True
1 357 125 234 863 False
2 168 298 573 298 True
3 123 234 573 902 False
Breaking it down: the element-wise comparison (run before the indicator column is added) returns a boolean frame you can also use for boolean indexing:
df.iloc[:, :-1].eq(df['Flag'], axis=0)
0 1 2
0 True False False
1 False False False
2 False True False
3 False False False
Then if we chain it with any:
df.iloc[:, :-1].eq(df['Flag'], axis=0).any(axis=1)
0 True
1 False
2 True
3 False
dtype: bool
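If you also need to return the matching field itself, i.e. which column(s) equal Flag in each row, one possible sketch of my own (the matches column name is made up; the value columns are selected explicitly so the new indicator column is not swept up by iloc):
eq = df[[0, 1, 2]].eq(df['Flag'], axis=0)
# For each row, collect the labels of the columns whose value equals Flag
df['matches'] = eq.apply(lambda row: list(row.index[row]), axis=1)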
