Count the negative values in a column per group and put them in another column using groupby - python-3.x

I want to count the number of negative values in the delay column for each customer using groupby:
merged_inner['delayed payments']=merged_inner.groupby('Customer Name')['delay'].apply(lambda x: x[x < 0].count())
But the delayed payments column shows only null values.

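I believe the problem here is that you are trying to put the result back into the same dataframe you called .groupby on. The grouped result is indexed by Customer Name, but the dataframe is not, so the assignment finds no matching index labels and the new column is filled with NaN.
Consider the following minimal example: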
df = pd.DataFrame({
'Customer Name':['a', 'b','c','a', 'c','a','b','a'],
'Delay':[1, 2, -3, 0, -1,-2, -3,2]
})
You can filter the negative rows first and then take the size of each group:
df.loc[df['Delay']<0].groupby('Customer Name')['Delay'].size()
Output:
Customer Name
a 1
b 1
c 2
Name: Delay, dtype: int64
You can get a dataframe using:
df.loc[df['Delay']<0].groupby('Customer Name')['Delay'].size().reset_index(name='delayed_payment')
Output:
Customer Name delayed_payment
0 a 1
1 b 1
2 c 2
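If the goal is to keep the counts on the original rows, as in the question, one option is to flag the negative delays and broadcast the per-customer sum back with transform. The sketch below reuses the sample data from this answer; the column name delayed payments comes from the question, and the names per_customer, misaligned and is_negative are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer Name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'a'],
    'Delay': [1, 2, -3, 0, -1, -2, -3, 2],
})

# Aggregated result: indexed by customer name, not by the row labels 0..7.
per_customer = df.groupby('Customer Name')['Delay'].apply(lambda x: (x < 0).sum())

# Assigning it directly aligns on index labels; none match, so every row is NaN.
misaligned = df.assign(**{'delayed payments': per_customer})

# transform returns one value per original row, so the assignment lines up.
is_negative = df['Delay'].lt(0).astype(int)
df['delayed payments'] = is_negative.groupby(df['Customer Name']).transform('sum')
```

Unlike the filtered version, customers without any negative delay get 0 here instead of being dropped.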

Related

The correct way to get the size of a list

I have a dataframe with one column and one row like:
list_id
12,13,14,16
The list_id column type is Object:
df['list_id'].dtypes => Object
When I try to get the number of elements in this column
l = list(df.list_id)
len(l) => 1
Or
len(df['list_id']) => 1
Why am I getting 1 instead of 4?
I want to get the count of elements as 4. What can I do?
Example
data = {'list_id': {0: '12,13,14,16'}}
df = pd.DataFrame(data)
Code
df['list_id'].str.split(',').str.len()
Result
0 4
Name: list_id, dtype: int64
If you want only the number 4 rather than a Series, use the following code:
df['list_id'].str.split(',').str.len()[0]
If you have actual lists in the list_id column:
df['list_id'].str.len()
Note that if the column holds strings, this returns the number of characters in each string rather than the number of items.
If you have strings, you can count the number of ',' and add 1 (except if the string is empty):
df['list_id'].str.count(',') + df['list_id'].ne('')
Output:
0 4
Name: list_id, dtype: int64
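To see how the approaches differ, here is a small sketch. The three sample rows, including the empty string, are made up for illustration; only the first comes from the question.

```python
import pandas as pd

df = pd.DataFrame({'list_id': ['12,13,14,16', '7', '']})

# len() on the column counts rows, not the items inside each cell.
n_rows = len(df['list_id'])

# Splitting first gives a list per row; an empty string still splits to [''].
split_counts = df['list_id'].str.split(',').str.len()

# Counting separators and adding 1 only for non-empty strings handles '' as 0.
comma_counts = df['list_id'].str.count(',') + df['list_id'].ne('')
```

The two per-row counts agree except on the empty string, where the split version reports 1 and the comma version reports 0.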

Search column name with multiple conditions pandas

I have a query to retrieve all columns with date in their name, as below
date_raw_cols = [col for col in df_raw.columns if 'date' in col]
That also picks up columns containing updated, which I want to exclude. I've also tried a regex filter, as below, with the same problem of returning updated
df_dates = df_raw.filter(regex='date', axis='columns')
How do I combine conditions to filter column names? I.e.,
where the column name contains date but not update, and could be date1, _date, date_
Instead of searching for 'date' in a column name, you can be more explicit:
# Assume example df_raw
>>> df_raw
date date1 prev_date update
0 1 2 3 200
1 4 2 5 300
2 5 5 3 100
>>> date_raw_cols = [col for col in df_raw.columns if col == 'date']
>>> print(date_raw_cols)
['date']
EDIT: If your question fully covers your data at hand, you can add an extra condition in the list comprehension with len() < 6, which keeps only column names shorter than 6 characters. This way you don't have to explicitly deal with underscores or digits. Note that it also drops longer names such as prev_date.
>>> df_raw
date date1 prev_date update _date date_
0 1 2 3 200 a g
1 4 2 5 300 s h
2 5 5 3 100 v a
>>> date_raw_cols = [col for col in df_raw.columns if 'date' in col and len(col) < 6]
>>> print(date_raw_cols)
['date', 'date1', '_date', 'date_']
Try the following regex:
\b(\w*(?=[^a-z]date)|(?=date[^a-z]))\w*\b
It will find all words that contain "date" bounded by digits or punctuation marks:
re.findall(r'\b\w*(?=[^a-z]date)\w*\b|\b(?=date[^a-z])\w*\b',
'date1 date date_1 update new_date 234_date 33date datetime ')
['date1', '', 'date', '', 'date_1', 'new_date', '234_date', '33date', '']
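For filtering column names rather than free text, a simpler pattern may be enough. This is only a sketch: it assumes that "date but not update" means date must not have a lowercase letter directly before or after it. The sample column names, including datetime, are illustrative.

```python
import re
import pandas as pd

df_raw = pd.DataFrame(
    columns=['date', 'date1', 'prev_date', 'update', '_date', 'date_', 'datetime']
)

# Lookbehind/lookahead: 'date' must not touch a lowercase letter on either side.
pattern = r'(?<![a-z])date(?![a-z])'

# Works in the list comprehension from the question ...
date_raw_cols = [col for col in df_raw.columns if re.search(pattern, col)]

# ... and with DataFrame.filter, which applies re.search to each column label.
df_dates = df_raw.filter(regex=pattern, axis='columns')
```

Under this assumption update and datetime are excluded, while prev_date is kept because an underscore precedes date.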

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, the row at index 2 would be removed because the row at index 0 has the same information in columns A and B, and an X in column C.
As this data is slightly large, I hope to avoid iterating over rows, if possible. Ignore Index is the closest thing I've found to the built-in drop_duplicates().
If there is no X in column C, then a row should only be deduplicated when C is identical as well.
If several rows have matching A and B and more than one distinct value of C containing an X, the following would be expected.
df = pd.DataFrame(columns=["A","B","C"],
data = [[1,2,"0X0"],
[1,2,"X00"],
[1,2,"0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 that is True for the first occurrence of each (A, B) pair. Then use Series.str.contains + Series.duplicated on column C to create a boolean mask m2 that is True where C contains the string X and that value of C has not appeared before. Finally, use these masks to filter the rows in df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
Does the column "C" always have X as the last character of each value? You could try creating a column D with 1 if column C has an X or 0 if it does not. Then just sort the values using sort_values and finally use drop_duplicates with keep='last'
import pandas as pd
df = pd.DataFrame(columns = ["A","B","C"],
data = [[1,2,"00X"],
[1,3,"010"],
[1,2,"002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This is assuming you also want to drop duplicates in case there is no X in the 'C' column among the duplicates of columns A and B
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1
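One thing to watch with a first-occurrence mask such as ~duplicated() is that the result can depend on row order, for example when the row without X comes before the row with X. A possible order-independent sketch, assuming the rules stated in the question, is below. The helper name dedup_keep_x and its marker parameter are hypothetical.

```python
import pandas as pd

def dedup_keep_x(df, marker='X'):
    # True where C contains the marker (plain substring match, no regex).
    has_x = df['C'].str.contains(marker, regex=False)
    # True for every row whose (A, B) group has at least one marker row.
    group_has_x = has_x.groupby([df['A'], df['B']]).transform('any')
    # Keep marker rows, plus all rows of groups that have no marker at all.
    keep = has_x | ~group_has_x
    # Remaining exact duplicates (same A, B and C) are collapsed.
    return df[keep].drop_duplicates()

df1 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "00X"], [1, 3, "010"], [1, 2, "002"]])
df2 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "0X0"], [1, 2, "X00"], [1, 2, "0X0"]])
# Same (A, B) pair, but the row without X comes first.
df3 = pd.DataFrame(columns=["A", "B", "C"],
                   data=[[1, 2, "002"], [1, 2, "00X"]])

out1, out2, out3 = dedup_keep_x(df1), dedup_keep_x(df2), dedup_keep_x(df3)
```

This stays vectorized, which matters at 100 million rows, though it has not been benchmarked here.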

Count from a dataframe, then sort based on that count [duplicate]

I have a dataset
category
cat a
cat b
cat a
I'd like to be able to return something like (showing unique values and frequency)
category freq
cat a 2
cat b 1
Use value_counts() as @DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
You can also use groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add frequency back to the original dataframe use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]
If you want to apply to all columns you can use:
df.apply(pd.value_counts)
This will apply a column based aggregation function (in this case value_counts) to each of the columns.
df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces you can use
df['category'].value_counts()
df.apply(pd.value_counts).fillna(0)
value_counts - returns an object containing counts of unique values
apply - counts the frequency in every column. If you set axis=1, you get the frequency in every row
fillna(0) - makes the output tidier by replacing NaN with 0
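As a small worked example of this pattern, the sketch below uses a made-up two-column frame. It calls Series.value_counts through a lambda because, as far as I know, the top-level pd.value_counts is deprecated in recent pandas releases.

```python
import pandas as pd

df = pd.DataFrame({'x': list('aab'), 'y': list('abb')})

# One value_counts per column; values missing from a column become NaN,
# which fillna(0) replaces before casting the counts back to integers.
counts = df.apply(lambda col: col.value_counts()).fillna(0).astype(int)
```

The result has one row per distinct value and one column per original column.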
In 0.18.1 groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts() sorted values (in descending order, i.e. largest value first) are returned by default.
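To get the two-column category/freq table from the question, already sorted by frequency, a minimal sketch is to name the index and reset it. The column names category and freq are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({'category': ['cat a', 'cat b', 'cat a']})

# value_counts sorts by count (descending); rename_axis names the index so
# reset_index can turn it into a 'category' column next to 'freq'.
freq = (
    df['category']
    .value_counts()
    .rename_axis('category')
    .reset_index(name='freq')
)
```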
Using list comprehension and value_counts for multiple columns in a df
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
https://stackoverflow.com/a/28192263/786326
As everyone said, the faster solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df
If your DataFrame has values of the same type, you can also set return_counts=True in numpy.unique().
values, counts = np.unique(df.values, return_counts=True)
np.bincount() could be faster if your values are non-negative integers.
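A short sketch of both NumPy calls on a made-up integer array. No timing claim is made here.

```python
import numpy as np

values = np.array([3, 1, 3, 2, 3, 1])

# np.unique returns the sorted distinct values and how often each occurs.
uniq, counts = np.unique(values, return_counts=True)

# np.bincount counts occurrences of 0, 1, 2, ... up to the maximum value,
# so position i holds the count of the integer i.
bin_counts = np.bincount(values)
```

Note that bincount also reports a count of 0 for integers that never occur, here the value 0.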
You can also do this with pandas by casting your columns to categories first (dtype="category"), e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more :):
client hotel currency ota user_country
count 852845 852845 852845 852845 852845
unique 2554 17477 132 14 219
top 2198 13202 USD Hades US
freq 102562 8847 516500 242734 340992
Without any libraries, you could do this instead:
def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}
I believe this should work fine for the columns of any DataFrame.
def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        # pair each column name with its number of unique values
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return pd.DataFrame(column_list_df, columns=["Feature", "Value_count"])
The function "column_list" goes through the column names and counts the unique values in each column.
@metatoaster has already pointed this out.
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
Timers
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
Cheers!
The following code creates frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value "300" appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]
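Note that .loc raises a KeyError when the value never occurs in the column. A hedged alternative is Series.get with a default, sketched here on made-up scores. The frame and column names follow the answer above.

```python
import pandas as pd

smaller_dat1 = pd.DataFrame({'Total_score': [300, 250, 300, 100]})
valuec = smaller_dat1['Total_score'].value_counts()

# get() looks the value up by label and falls back to 0 if it is absent.
n_300 = valuec.get(300, 0)
n_999 = valuec.get(999, 0)

# Counting one value directly avoids building the whole frequency table.
direct = (smaller_dat1['Total_score'] == 300).sum()
```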
n_values = data.income.value_counts()
First unique value count (by position)
n_at_most_50k = n_values.iloc[0]
Second unique value count (by position)
n_greater_50k = n_values.iloc[1]
n_values
Output:
<=50K 34014
>50K 11208
Name: income, dtype: int64
Output of n_greater_50k, n_at_most_50k:
(11208, 34014)
your data:
|category|
cat a
cat b
cat a
solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()

Group by and Count Function returns NaNs [duplicate]

I am using .size() on a groupby result in order to count how many items are in each group.
I would like to save the result to a new column without manually editing the column names array. How can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
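A minimal sketch of both steps on a small made-up frame. The column names A and B follow the question; sizes and flat are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})

# size() gives a Series indexed by (A, B); to_frame names its single column.
sizes = df.groupby(['A', 'B']).size().to_frame('size')

# reset_index turns the group keys back into ordinary columns.
flat = sizes.reset_index()
```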
You need transform with size, which keeps the length of df the same as before:
Notice:
Here it is necessary to select one column after groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is used. All columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set the column name while aggregating, the length of df is obviously NOT the same as before:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
df['size'] = df.groupby(['A','B'])['A'].transform('size')
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
Let's say n is the name of the dataframe and cst is the column whose items are repeated.
The code below puts the count in the next column
from collections import Counter
cstn=Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns=['name','cnt']
n['cnt']=n['cst'].map(cstlist.loc[:, ['name','cnt']].set_index('name').iloc[:,0].to_dict())
Hope this will work
