How to add a new column to a pandas DataFrame by searching for keywords given in a list?

I want to add a new column to the DataFrame based on keywords identified in an existing column.
This is the current data (DataFrame name = df):
Topic Count
0 This is Python 39
1 This is SQL 6
2 This is Paython Pandas 98
3 import tkinter 81
4 Learning Python 94
5 SQL Working 85
6 Pandas and Work 67
7 This is Pandas 30
8 Computer 20
9 Mobile Work 55
10 Smart Mobile 69
My desired output is below:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
How to identify the Groups column value
The groups are assigned based on keywords found in the Topic column.
If a particular word is found in Topic, the corresponding group name needs to be assigned to that row.
List of keywords to look for in the Topic column:
Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']
I have tried below code for it:
df['Groups'] = [
'Python Group' if "Python" in x
else 'Python Group' if "Pandas" in x
else 'Python Group' if "tkinter" in x
else 'SQL Group' if "SQL" in x
else 'Devices Group' if "Computer" in x
else 'Devices Group' if "Mobile" in x
else '000'
for x in df['Topic']]
print(df)
The code above gives me the desired output, but I want something shorter and faster: the DataFrame has more than 2 million records, and writing 1,000+ lines of conditions to define the groupings is impractical.
Is there a way to use the lists of keywords directly against the Topic column?
OR
Is there a custom function that can help with this iterative process?
Code 2: another attempt, tried after consulting Stack Overflow experts:
d = pd.read_excel('Map.xlsx').to_dict('list')
keyword_groups = {x:k for k, v in d.items() for x in v}
pat = '({})'.format('|'.join(keyword_groups)) #This line is giving an error
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
.map(keyword_groups)
.fillna('000'))
The Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-543675c0b403> in <module>
3
4 keyword_groups = {x:k for k, v in d.items() for x in v}
----> 5 pat = '({})'.format('|'.join(keyword_groups))
6 pat
TypeError: sequence item 5: expected str instance, float found
Thanks for your help.

One way could be to consider maintaining your groups and keywords in a dict:
d = {'Python_Group': ['Python','Pandas','tkinter'],
'SQL_Group': ['SQL', 'Select'],
'Devices_Group': ['Computer','Mobile']}
From here, you could easily reverse this to a "keyword: Group" dict.
keyword_groups = {x:k for k, v in d.items() for x in v}
# {'Python': 'Python_Group',
# 'Pandas': 'Python_Group',
# 'tkinter': 'Python_Group',
# 'SQL': 'SQL_Group',
# 'Select': 'SQL_Group',
# 'Computer': 'Devices_Group',
# 'Mobile': 'Devices_Group'}
Then you can use Series.str.extract to find these keywords using regex and map them to the correct group. Use fillna to catch any non-matching groups.
pat = '({})'.format('|'.join(keyword_groups))
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
.map(keyword_groups)
.fillna('000'))
[out]
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
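Regarding the TypeError in the question's second attempt: a likely cause (an assumption, since Map.xlsx is not shown) is that the Excel columns have different lengths, so read_excel pads the shorter ones with NaN, which is a float and cannot be joined into the pattern. Below is a minimal sketch that skips those missing values. A small in-memory DataFrame stands in for the Excel file, and re.escape is added as a precaution in case a keyword contains regex metacharacters.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel('Map.xlsx'): shorter columns are
# padded with NaN (a float), which is what breaks '|'.join in the question.
mapping = pd.DataFrame({
    'Python_Group': ['Python', 'Pandas', 'tkinter'],
    'SQL_Group': ['SQL', 'Select', np.nan],
    'Devices_Group': ['Computer', 'Mobile', np.nan],
})
d = mapping.to_dict('list')

# Reverse to "keyword: group", skipping the NaN padding.
keyword_groups = {x: k for k, v in d.items() for x in v if pd.notna(x)}

# Escape each keyword so it is matched literally by the regex.
pat = '({})'.format('|'.join(re.escape(str(x)) for x in keyword_groups))

df = pd.DataFrame({
    'Topic': ['This is Python', 'SQL Working', 'Smart Mobile', 'Unrelated'],
    'Count': [39, 85, 69, 1],
})
df['Groups'] = (df['Topic'].str.extract(pat, expand=False)
                .map(keyword_groups)
                .fillna('000'))
print(df)
```

If the real spreadsheet contains non-string cells other than blanks (numbers, dates), the str(x) conversion in the pattern keeps the join from failing, but the mapping keys would then need the same conversion.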

You can do this using np.select. np.select takes three parameters: a list of conditions, a list of results, and the default value used when no condition matches.
import numpy as np
Python_Group = ['Python','Pandas','tkinter']
SQL_Group = ['SQL', 'Select']
Devices_Group = ['Computer','Mobile']
# one boolean mask per group: True where Topic contains any of the group's keywords
conditions = [
    df['Topic'].str.contains('|'.join(Python_Group)),
    df['Topic'].str.contains('|'.join(SQL_Group)),
    df['Topic'].str.contains('|'.join(Devices_Group)),
]
results = [
    "Python_Group",
    "SQL_Group",
    "Devices_Group",
]
df['Groups'] = np.select(conditions, results, '000')
#output:
Topic Count Groups
0 This is Python 39 Python_Group
1 This is SQL 6 SQL_Group
2 This is Paython Pandas 98 Python_Group
3 import tkinter 81 Python_Group
4 Learning Python 94 Python_Group
5 SQL Working 85 SQL_Group
6 Pandas and Work 67 Python_Group
7 This is Pandas 30 Python_Group
8 Computer 20 Devices_Group
9 Mobile Work 55 Devices_Group
10 Smart Mobile 69 Devices_Group
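One detail worth knowing about np.select: when a row satisfies several conditions, the first matching condition in the list wins, so the order of the groups matters. The sketch below illustrates this with a hypothetical four-row DataFrame; it also builds the conditions from a dict instead of writing one line per group, and passes case=False and na=False to str.contains (both are optional choices made here, not part of the answer above) so that lowercase keywords match and missing topics fall through to the default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a row matching two groups and a missing value.
df = pd.DataFrame({'Topic': ['Python with SQL', 'select rows', None, 'Mobile Work']})

groups = {
    'Python_Group': ['Python', 'Pandas', 'tkinter'],
    'SQL_Group': ['SQL', 'Select'],
    'Devices_Group': ['Computer', 'Mobile'],
}

# One boolean mask per group, in dict order.
conditions = [
    df['Topic'].str.contains('|'.join(words), case=False, na=False)
    for words in groups.values()
]

# The first True condition per row decides the result; '000' is the fallback.
df['Groups'] = np.select(conditions, list(groups), default='000')
print(df)
```

Here 'Python with SQL' lands in Python_Group only because that group is listed first.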

Related

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so can be just one key/value pair or multiple key/value pairs - think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key, we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
# ^^ list starts ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd
temp_dict = {
'key': [[pd.DataFrame()]]
}
product_match = 'key'
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))
temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Assuming the DataFrame at that location is actually empty, append is unnecessary: simply updating the value at the key to be the new DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key': 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65}
Or if list of list is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Maybe you have an empty list at dict[key]?
Remember that "append" list method (unlike Pandas dataframe one) only receives one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
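Note for newer pandas versions: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the append calls in this thread raise AttributeError there. A minimal sketch of the same update using pd.concat follows, reusing the temp_dict layout from the sample program above. Filtering out empty frames before concatenating is a choice made here to sidestep dtype surprises, and it assumes at least one of the two frames has rows.

```python
import numpy as np
import pandas as pd

temp_dict = {'key': [[pd.DataFrame()]]}
product_match = 'key'

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

existing = temp_dict[product_match][0][0]

# Skip empty frames; pd.concat raises ValueError if the list ends up empty.
frames = [frame for frame in (existing, df) if not frame.empty]

# Like append, concat returns a new DataFrame, so assign it back explicitly.
temp_dict[product_match][0][0] = pd.concat(frames, ignore_index=True)
print(temp_dict)
```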

Creating a list from series of pandas

I'm trying to create a list from 3 different series, where each element has the form "({A} {B} {C})": A is the 1st element from series 1, B is the 1st element from series 2, and C is the 1st element from series 3, and so on, so that the resulting list contains 600 elements.
List 1 List 2 List 3
u_p0 1 v_p0 2 w_p0 7
u_p1 21 v_p1 11 w_p1 45
u_p2 32 v_p2 25 w_p2 32
u_p3 45 v_p3 76 w_p3 49
... .... ....
u_p599 56 v_p599 78 w_599 98
Now I want the output list as follows
(1 2 7)
(21 11 45)
(32 25 32)
(45 76 49)
.....
These are the 3 series I created from a dataframe
r1=turb_1.iloc[qw1] #List1
r2=turb_1.iloc[qw2] #List2
r3=turb_1.iloc[qw3] #List3
For the output I think Python's string formatting will be useful, but I'm not quite sure how to proceed.
turb_3= ["({A} {B} {C})".format(A=i,B=j,C=k) for i in r1 for j in r2 for k in r3]
Any kind of help will be useful.
Use pandas.DataFrame.itertuples with str.format:
# Sample data
print(df)
col1 col2 col3
0 1 2 7
1 21 11 45
2 32 25 32
3 45 76 49
fmt = "({} {} {})"
# index=False drops the row label; name=None yields plain tuples
[fmt.format(*tup) for tup in df[["col1", "col2", "col3"]].itertuples(index=False, name=None)]
Output:
['(1 2 7)', '(21 11 45)', '(32 25 32)', '(45 76 49)']
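If the three series are kept separate, as in the question, zip is another option. The question's attempt nests three for loops, which produces every combination of the three series (600 × 600 × 600 elements) rather than pairing them position by position. A minimal sketch, with short hypothetical series standing in for r1, r2 and r3:

```python
import pandas as pd

# Hypothetical stand-ins for the r1, r2, r3 series from the question.
r1 = pd.Series([1, 21, 32, 45])
r2 = pd.Series([2, 11, 25, 76])
r3 = pd.Series([7, 45, 32, 49])

# zip pairs the series element by element, by position.
turb_3 = ["({} {} {})".format(a, b, c) for a, b, c in zip(r1, r2, r3)]
print(turb_3)
```

zip ignores the index labels and stops at the shortest series, so this assumes all three have the same length and order.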

Creating an aggregate columns in pandas dataframe

I have a pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ORDER':["A", "A", "B", "B"], 'var1':[2, 3, 1, 5],'a1_bal':[1,2,3,4], 'a1c_bal':[10,22,36,41], 'b1_bal':[1,2,33,4], 'b1c_bal':[11,22,3,4], 'm1_bal':[15,2,35,4]})
df
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal
0 A 2 1 10 1 11 15
1 A 3 2 22 2 22 2
2 B 1 3 36 33 3 35
3 B 5 4 41 4 4 4
I want to create new columns as below:
a1_final_bal = sum(a1_bal, a1c_bal)
b1_final_bal = sum(b1_bal, b1c_bal)
m1_final_bal = m1_bal (since we only have an m1_bal field and no m1c_bal, it will remain as it is)
I don't want to hardcode this step because there might be more such columns, such as "c_bal", "m2_bal", "m2c_bal", etc.
My final data should look something like this:
ORDER var1 a1_bal a1c_bal b1_bal b1c_bal m1_bal a1_final_bal b1_final_bal m1_final_bal
0 A 2 1 10 1 11 15 11 12 15
1 A 3 2 22 2 22 2 24 24 2
2 B 1 3 36 33 3 35 39 36 35
3 B 5 4 41 4 4 4 45 8 4
You could try something like this. I am not sure if it's exactly what you are looking for, but I think it should work.
dfforgroup = df.set_index(['ORDER','var1']) #Creates MultiIndex
dfforgroup.columns = dfforgroup.columns.str[:2] #Takes first two letters of remaining columns
#groups columns by their first two letters and sums the columns up
df2 = (dfforgroup.groupby(dfforgroup.columns,axis=1).sum()
       .reset_index().drop(columns=['ORDER','var1']).add_suffix('_final_bal'))
df = pd.concat([df,df2],axis=1) #concatenates new columns to original df
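An alternative sketch that does not rely on the prefix being exactly two characters, and avoids groupby(..., axis=1), which recent pandas versions deprecate. It assumes (this is not stated in the question) that every balance column is named either <prefix>_bal or <prefix>c_bal; the regex strips an optional trailing "c" before "_bal" to recover the prefix.

```python
import re

import pandas as pd

df = pd.DataFrame({'ORDER': ["A", "A", "B", "B"], 'var1': [2, 3, 1, 5],
                   'a1_bal': [1, 2, 3, 4], 'a1c_bal': [10, 22, 36, 41],
                   'b1_bal': [1, 2, 33, 4], 'b1c_bal': [11, 22, 3, 4],
                   'm1_bal': [15, 2, 35, 4]})

# Collect the columns belonging to each prefix, e.g. 'a1' -> ['a1_bal', 'a1c_bal'].
groups = {}
for col in df.columns:
    match = re.fullmatch(r'(.+?)c?_bal', col)
    if match:
        groups.setdefault(match.group(1), []).append(col)

# Row-wise sum of each group's columns into '<prefix>_final_bal'.
for prefix, cols in groups.items():
    df[prefix + '_final_bal'] = df[cols].sum(axis=1)

print(df)
```

A prefix that itself ends in "c" would be ambiguous under this naming scheme, so the pattern may need adjusting for real column names.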

Sum all the column values in a given DataFrame and display the output in a new DataFrame

I have tried the below code:
import pandas as pd
dataframe = pd(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = daeframet.sum(axis=0)
print (sum_column)
I am getting the below error
TypeError: 'module' object is not callable
The error is coming from calling the module pd as a function. It's difficult to know which function you should be calling from pandas without knowing what C1 is, but if it is a dictionary or a pandas data frame, try:
import pandas as pd
# common to abbreviate dataframe as df
df = pd.DataFrame(C1, columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = df.sum(axis=0)
print(sum_column)
Using sum will only return a Series and not a DataFrame. There are many ways you can do this; let's try using select_dtypes and the to_frame() method:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({'class' : ['first','second','third','fourth','fifth'],
'School A' : np.random.randint(1,50,5),
'School B' : np.random.randint(1,50,5),
'School C' : np.random.randint(1,50,5),
'School D' : np.random.randint(1,50,5),
'School E' : np.random.randint(1,50,5)})
print(df)
class School A School B School C School D School E
0 first 36 10 49 16 14
1 second 15 9 31 40 12
2 third 48 37 17 17 2
3 fourth 39 40 8 28 48
4 fifth 17 28 13 45 31
new_df = (df.select_dtypes(include='int').sum(axis=0).to_frame()
.reset_index().rename(columns={0 : 'Total','index' : 'School'}))
print(new_df)
School Total
0 School A 155
1 School B 124
2 School C 118
3 School D 146
4 School E 107
Edit
It seems there are some typos in your code. The corrected version:
import pandas as pd
dataframe = pd.DataFrame(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = dataframe.sum(axis=0)
print (sum_column)
will return the sum as a series, and also sum the text columns by way of string concatenation :
class firstsecondthirdfourthfifth
School A 155
School B 124
School C 118
School D 146
School E 107
dtype: object

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
0 1 2
0 A B C
1 0 34 88
2 2 45 200
3 3 47 65
4 4 32 140
5 None None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
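Use read_csv with StringIO and the index_col parameter to set the first column as the index:
# pd.compat.StringIO is not available in current pandas; use the standard library's io.StringIO
from io import StringIO
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
df = pd.read_csv(StringIO(input_string),sep=';', index_col=0)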
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
Your solution should be changed to use split with its default parameter (arbitrary whitespace), pass all the lists except the first to DataFrame along with the columns parameter, and, if the first column is needed as the index, add DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
For general solution use first value of first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can move the name A from the index name to the columns name instead:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A B C
0 34 88
2 45 200
3 47 65
4 32 140
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A',inplace = True)
df
output
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
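If the goal is only to display the table without the row labels, rather than to turn column A into the index, another option is to keep A as a regular column and hide the index when printing. A minimal sketch using DataFrame.to_string(index=False); this reads the question's desired output as a display issue, which is an assumption:

```python
from io import StringIO

import pandas as pd

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""

# Parse the semicolon-separated text; A stays an ordinary column.
df = pd.read_csv(StringIO(input_string), sep=';')

# Render the frame as text without the 0..n row labels.
text = df.to_string(index=False)
print(text)
```

The DataFrame still has its default integer index internally; only the printed representation omits it.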
