How to get data in groupby like SQL HAVING with pandas - python-3.x

I have data like below.
id, name, password, note, num
1, hoge, xxxxxxxx, aaaaa, 2
2, hoge, xxxxxxxx, bbbbb, 1
3, moge, yyyyyyyy, ccccc, 2
4, zape, zzzzzzzz, ddddd, 3
I would like to make a DataFrame by grouping on the same name and password. In this case, rows 1,hoge and 2,hoge are treated as the same data. Then I would like to get the total 3
from the num column.
I tried the following:
df1 = pd.read_csv("sample.csv")
df2 = df1.groupby(['name','password']).count()
print(df2[df2['note'] > 1])
It goes like this.
name, password, note, num
hoge, xxxxxxxx, 2, 2
How can I get the sum of the num values?

I believe you need GroupBy.size, or count (which excludes NaN rows), with transform to create a new Series the same size as the original DataFrame, so you can filter and then sum:
s = df1.groupby(['name','password'])['note'].transform('size')
s = df1.groupby(['name','password'])['note'].transform('count')
out = df1.loc[s > 1, 'num'].sum()
print (out)
3
If you want to count only duplicated rows, filter by DataFrame.duplicated, specifying the columns to check for duplicates:
out = df1.loc[df1.duplicated(['name','password'], keep=False), 'num'].sum()
print (out)
3
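As a self-contained sketch (building the sample rows inline instead of reading sample.csv, purely for illustration):
import pandas as pd

# The four sample rows from the question, built inline rather than read from sample.csv.
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['hoge', 'hoge', 'moge', 'zape'],
    'password': ['xxxxxxxx', 'xxxxxxxx', 'yyyyyyyy', 'zzzzzzzz'],
    'note': ['aaaaa', 'bbbbb', 'ccccc', 'ddddd'],
    'num': [2, 1, 2, 3],
})

# Keep only groups with more than one row, then sum num over them.
s = df1.groupby(['name', 'password'])['note'].transform('size')
print(df1.loc[s > 1, 'num'].sum())  # 3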

Related

How to convert an alphanumeric column (object dtype) to int?

I have a dataframe (df) with 5 columns. 4 of the columns are dtype: object, and one is dtype: int. For simplicity, let's say Columns 1-4 are objects and Column 5 is int dtype. I'm interested in converting Column 1 from an object dtype to an integer. It has a format of randomly created alphanumeric combinations, like 0000000-1111111-aaaaaaaaa-bbbbbbb to zzzz99-abc1234-jfkslnfnsl120-204875987, with a total of 5000 unique values.
Here is what I have tried so far. I've tried straight datatype conversions like
df.column1.astype('int')
df.column1.astype(theano.config.floatX)
But I get errors about how the conversion isn't possible that way.
I've also tried creating a new column and mapping integer values for each unique value in Column1 to use as a work-around, but I haven't had any luck. The code looked something like this:
np_arange = np.arange(1, 5000, 1)
df.int_column = df.column1.map(np_arange)
or
num_range = range(1, 5000, 1)
df.int_column = df.column1.map(num_range)
Here I get errors saying that the numpy arrays aren't callable, but I can't think of any other way to get around this. Does anyone have any ideas for how I could complete this?
Edit: The dataframe looks something like this (except more columns and rows):
df = pd.DataFrame({
    'Column1': ['00000-aaaa-1111-bbbbn', 'zzzz-1820-2222-vvvv', '4124-ce69-11f5-0293'],
    'Column2': [76, 25, 89],
    'Column3': ['MW', 'NA', 'BL'],
    'Column4': ['Car', 'Truck', 'Bike'],
    'Column5': ['OH', 'WE', 'SC']
})
And I need either another column where every '0000-aaaa-1111-bbbb' value in Column1 gets a corresponding 1 in the new column and every 'zzzz-1820-2222-vvvv' value gets a 2, or some other way to convert the alphanumeric combinations to integers.
Combine select_dtypes and factorize:
df.update(df.select_dtypes(exclude='number').apply(lambda s: pd.factorize(s)[0]+1))
Output:
   Column1  Column2  Column3  Column4  Column5
0        1       76        1        1        1
1        2       25        2        2        2
2        3       89        3        3        3
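If only Column1 needs an integer code (rather than every object column), here is a minimal sketch using pd.factorize directly; the column name int_column is just an illustrative choice:
import pandas as pd

df = pd.DataFrame({
    'Column1': ['00000-aaaa-1111-bbbb', 'zzzz-1820-2222-vvvv', '4124-ce69-11f5-0293'],
    'Column2': [76, 25, 89],
})

# factorize returns 0-based integer codes plus the array of unique values.
codes, uniques = pd.factorize(df['Column1'])
df['int_column'] = codes + 1  # start numbering at 1
print(df[['Column1', 'int_column']])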

pandas df: change values in column A only for rows that are unique in column B

It sounds like a trivial question and I'd expected to find a quick answer, but didn't have much success.
I have the dataframe population and columns A and B. I want to change the value in B to 1 only to those rows with unique value in column A (currently all rows in B hold the value 0).
I tried:
small_towns = population['A'].value_counts() == 1
population[population['A'] in small_towns]['B']=1
and got: 'Series' objects are mutable, thus they cannot be hashed
I also tried:
population.loc[population['A'].value_counts() == 1, population['B']] = 1
and got the same error, with an additional pandas.core.indexing.IndexingError:
Any ideas?
Thanks in advance,
Ben
We can use Series.duplicated with keep=False;
this returns a Series with True for all duplicated values and False for the rest. Negating it marks the rows that are unique in A, where we can put 1 using DataFrame.loc[]:
population.loc[~population['A'].duplicated(keep=False), 'B'] = 1
#population.loc[~population.duplicated(subset = 'A', keep=False), 'B'] = 1
We can also use Series.where or Series.mask
population['B'] = population['B'].where(population['A'].duplicated(keep=False), 1)
#population['B'] = population['B'].mask(~population['A'].duplicated(keep=False), 1)
But if you want to create column B with 1s and 0s from scratch, you can simply do:
population['B'] = (~population['A'].duplicated(keep=False)).astype(int)
or
population['B'] = np.where(population['A'].duplicated(keep=False), 0, 1)
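A minimal sketch with made-up data, assuming A holds the values to check for uniqueness and B starts at 0 as in the question:
import pandas as pd

population = pd.DataFrame({'A': ['x', 'y', 'y', 'z'], 'B': 0})

# Set B to 1 only on rows whose value in A appears exactly once.
population.loc[~population['A'].duplicated(keep=False), 'B'] = 1
print(population)
#    A  B
# 0  x  1
# 1  y  0
# 2  y  0
# 3  z  1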

I want to count the negative values in a column and put them in another column using groupby

I want to count the number of negative values in the delay column using groupby:
merged_inner['delayed payments'] = merged_inner.groupby('Customer Name')['delay'].apply(lambda x: x[x < 0].count())
but the delayed payments column is showing null.
I believe the problem here is that you are trying to put the results back into the same dataframe you did the .groupby on, which does not have Customer Name as its index, so the grouped result cannot align with it.
Consider the following minified example:
df = pd.DataFrame({
    'Customer Name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'a'],
    'Delay': [1, 2, -3, 0, -1, -2, -3, 2]
})
You can then try:
df.loc[df['Delay'] < 0].groupby('Customer Name')['Delay'].size()
Output:
Customer Name
a 1
b 1
c 2
Name: Delay, dtype: int64
You can get a dataframe back using:
df.loc[df['Delay']<0].groupby('Customer Name')['Delay'].size().reset_index(name='delayed_payment')
Output:
Customer Name delayed_payment
0 a 1
1 b 1
2 c 2
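If the goal is to attach the count to every row of the original dataframe (which is what the assignment in the question attempts), one option is to broadcast it back with transform. A sketch on the minified example above; applying the same idea to merged_inner is an assumption:
import pandas as pd

df = pd.DataFrame({
    'Customer Name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'a'],
    'Delay': [1, 2, -3, 0, -1, -2, -3, 2]
})

# Count negative delays per customer and align the result to every row.
df['delayed payments'] = (df['Delay'] < 0).groupby(df['Customer Name']).transform('sum')
print(df)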

Using xlswrite in MATLAB

I am working with three datasets in MATLAB, e.g.,
Dates:
There are D dates, each a char array, saved together in a cell array.
{'01-May-2019','02-May-2019','03-May-2019'....}
Labels:
There are 100 labels, each a string, saved together in a cell array.
{'A','B','C',...}
Values:
[0, 1, 2,...]
This is one row of the Values matrix of size D×100.
I would like the following output in Excel:
date labels Values
01-May-2019 A 0
01-May-2019 B 1
01-May-2019 C 2
until the same date has repeated itself 100 times. Then the next date is added (also repeated 100 times) on the subsequent rows, along with the 100 labels in the second column and the values from the 2nd row of the Values matrix, transposed, in the third column. This repeats until the date length D is reached.
For the first date, I used:
c_1 = {datestr(datenum(dates(1))*ones(100,1))}
c_2 = labels
c_3 = num2cell(Values(1,:)')
xlswrite('test.xls',[c_1, c_2, c_3])
but, unfortunately, this seemed to put everything in one column, i.e., the date, then the labels, then the 1st row of the values array. I need these to be in three columns.
Also, I think the above needs to be in a for loop over each day that I am considering. I tried using the table function, but didn't have much luck with it.
How can I solve this efficiently?
You can use repmat and reshape to build your columns and (optionally) add them to a table for exporting.
For example:
dates = {'01-May-2019','02-May-2019'};
labels = {'A','B', 'C'};
values = [0, 1, 2];
n_dates = numel(dates);
n_labels = numel(labels);
dates_repeated = reshape(repmat(dates, n_labels, 1), [], 1);
labels_repeated = reshape(repmat(labels, n_dates, 1).', [], 1);
values_repeated = reshape(repmat(values, n_dates, 1).', [], 1);
full_table = table(dates_repeated, labels_repeated, values_repeated);
Gives us the following table:
>> full_table
full_table =
6×3 table
    dates_repeated    labels_repeated    values_repeated
    ______________    _______________    _______________
    '01-May-2019'     'A'                0
    '01-May-2019'     'B'                1
    '01-May-2019'     'C'                2
    '02-May-2019'     'A'                0
    '02-May-2019'     'B'                1
    '02-May-2019'     'C'                2
Which should export to a spreadsheet with writetable as desired.
What we're doing with repmat and reshape is "stacking" the values and then converting them into a single column:
>> repmat(dates, n_labels, 1)
ans =
3×2 cell array
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
{'01-May-2019'} {'02-May-2019'}
We transpose the labels and values so they get woven together correctly (e.g. [0, 1, 0, 1] vs [0, 0, 1, 1]), as reshape is column-major.
If you don't want the intermediate table, you can use num2cell to create a cell array from values, so you can concatenate all 3 cell arrays together for xlswrite (or writecell, added in R2019a, since which xlswrite is no longer recommended):
values_repeated = num2cell(reshape(repmat(values, n_dates, 1).', [], 1));
full_array = [dates_repeated, labels_repeated, values_repeated];

Python - How to dynamically exclude a column name from a list of columns of a Panda Dataframe

So far I am able to get the list of all column names present in the dataframe or to get a specific column names based on its datatype, starting letters, etc...
Now my requirement is to get the whole list of column names, or a sublist, and to exclude one column from it (i.e. the target variable / label column; this is part of machine learning, so I am using the terms used in machine learning).
Please note I am not speaking about the data present in those columns. I am just taking the column names and want to exclude a particular column by its name.
Please see below example for better understanding :
# Get all the column names from a Dataframe
df.columns
Index(['transactionID', 'accountID', 'transactionAmountUSD',
'transactionAmount', 'transactionCurrencyCode',
'accountAge', 'validationid', 'LABEL'],
dtype='object')
# Get only the Numeric Variables (Columns with numeric values in it)
df._get_numeric_data().columns
Index(['transactionAmountUSD', 'transactionAmount', 'accountAge', 'LABEL'],
dtype='object')
Now, in order to get the remaining column names, I am subtracting the results of the two commands above:
string_cols = list(set(list(df.columns))-set(df._get_numeric_data().columns))
OK, everything goes well until I hit this.
I have found out that the LABEL column, though it has numeric values, should not be present in the list of numeric variables. It should be excluded.
That is, I want to exclude a particular column (not by its index in the list but by its name explicitly).
I tried similar statements like the following ones but in vain. Any inputs on this will be helpful
set(df._get_numeric_data().columns) - set(df.LABEL)
set(df._get_numeric_data().columns) - set(df.LABEL.column)
set(df._get_numeric_data().columns) - set(df['LABEL'])
I am sure I am missing a very basic thing but not able to figure it out.
First of all, you can exclude all numeric columns much more simply with
df.select_dtypes(exclude=[np.number])
transactionID accountID transactionCurrencyCode validationid
0 a a a a
1 a a a a
2 a a a a
3 a a a a
4 a a a a
Second of all, there are many ways to drop a column. See this post
df._get_numeric_data().drop('LABEL', axis=1)
transactionAmountUSD transactionAmount accountAge
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
If you really wanted the columns, use pd.Index.difference
df._get_numeric_data().columns.difference(['LABEL'])
Index(['accountAge', 'transactionAmount', 'transactionAmountUSD'], dtype='object')
Setup
df = pd.DataFrame(
    [['a', 'a', 1, 1, 'a', 1, 'a', 1]] * 5,
    columns=[
        'transactionID', 'accountID', 'transactionAmountUSD',
        'transactionAmount', 'transactionCurrencyCode',
        'accountAge', 'validationid', 'LABEL']
)
Pandas' Index supports set operations, so to exclude one column from the column index you can just write something like
import pandas as pd
df = pd.DataFrame(columns=list('abcdef'))
print(df.columns.difference({'b'}))
which will return to you
Index(['a', 'c', 'd', 'e', 'f'], dtype='object')
I hope this is what you want :)
Considering the LABEL column as your output and the other columns as your input features, you can try this:
feature_names = [x for x in df._get_numeric_data().columns if x not in ['LABEL']]
input = df[feature_names]
output= df['LABEL']
Hope this helps.
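Putting the pieces together, here is a small sketch of a typical feature/target split, using the public select_dtypes instead of the private _get_numeric_data and reusing the Setup frame from above:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['a', 'a', 1, 1, 'a', 1, 'a', 1]] * 5,
    columns=[
        'transactionID', 'accountID', 'transactionAmountUSD',
        'transactionAmount', 'transactionCurrencyCode',
        'accountAge', 'validationid', 'LABEL']
)

# Numeric columns minus the target column, selected by name.
feature_names = df.select_dtypes(include=[np.number]).columns.difference(['LABEL'])
X = df[feature_names]
y = df['LABEL']
print(list(feature_names))  # ['accountAge', 'transactionAmount', 'transactionAmountUSD']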
