How to combine multiple cells into a single text cell - python-3.x

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'item': [1, 1, 2, 2],
                   'user': [1, 2, 2, 1],
                   'appraisal': [4, 2, 1, 3],
                   'feedback': ['good', 'bad', 'bad', 'well']})
names = ['item', 'user', 'appraisal', 'feedback']
df = df[names]
df
I want to get a dataframe as below
   item  appraisal  feedback
0     1          3  good bad
1     2          2  bad well
where 'item' is 'item' from df, 'appraisal' is the average of df.appraisal per item, and 'feedback' is the combined df.feedback cells per item.
I can get two of the variables:
df1 = df.groupby('item')['appraisal'].mean()
But how do I combine the text cells? I could make a pivot_table with item/user and "feedback" as the value and then concatenate the cells user1 + user2 + ...,
but the real dataset has many unique values and I don't think that's the best approach.
Thanks for the help.

You can use the GroupBy.agg() method:
In [4]: df.groupby('item').agg({'appraisal':'mean','feedback':' '.join})
Out[4]:
appraisal feedback
item
1 3 good bad
2 2 bad well
or if you need a "flat" DF, use as_index=False as #John Galt has recommended:
In [5]: df.groupby('item', as_index=False).agg({'appraisal':'mean','feedback':' '.join})
Out[5]:
item appraisal feedback
0 1 3 good bad
1 2 2 bad well
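For reference, here is a minimal self-contained sketch of the same approach (assuming you also want to keep item as a regular column and use a comma separator instead of a space):

import pandas as pd

df = pd.DataFrame({'item': [1, 1, 2, 2],
                   'user': [1, 2, 2, 1],
                   'appraisal': [4, 2, 1, 3],
                   'feedback': ['good', 'bad', 'bad', 'well']})

# Mean of the numeric column, joined text for the string column, per item
result = df.groupby('item', as_index=False).agg({'appraisal': 'mean',
                                                 'feedback': ', '.join})
print(result)
#    item  appraisal   feedback
# 0     1        3.0  good, bad
# 1     2        2.0  bad, well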

Related

Convert lists present in each column to its respective datatypes

I have a sample dataframe as given below.
import numpy as np
import pandas as pd
data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to remove those lists and preserve the datatypes of the items inside them for all columns.
The final output should look something like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        df[c] = df[c].apply(', '.join)
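Note that in the sample data above 'Sex' and 'Interest' contain np.nan, and ', '.join raises a TypeError when it hits a float instead of a list. A hedged variant of the same loop that simply leaves missing values untouched (a sketch, assuming NaN should be preserved):

# Process each column, skipping missing values so the length test and
# the join never see a NaN float
for c in list_cols:
    non_null = df[c].dropna()
    if (non_null.str.len() == 1).all():
        # Unwrap single-item lists; NaN rows stay NaN
        df[c] = df[c].str[0]
    else:
        # Join multi-item lists; NaN rows stay NaN
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)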

Pandas : Reorganization of a DataFrame [duplicate]

This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 2 years ago.
I'm looking for a way to clean the following data:
I would like to output something like this:
with the tokenized words in the first column and their associated labels in the other.
Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?
Thank you in advance for your help or advice.
Given the 1st table, it's simply a matter of splitting the first column and repeating the 2nd column:
import pandas as pd
data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])
print(df1)
df2 = pd.concat([pd.Series(row['col2'], row['col1'].split(' '))
                 for _, row in df1.iterrows()]).reset_index()
df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})
print(df2)
The output:
col1 col2
0 foo bar O
1 George B PERSON
2 President TITLE
col1 col2
0 foo O
1 bar O
2 George PERSON
3 B PERSON
4 President TITLE
As for splitting the 1st column, you want to look at the split method, which supports regular expressions and should allow you to handle the various language delimiters:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
If the 1st table is not given, there is no way to do this in one go with pandas, since pandas has no built-in NLP capabilities.
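As a sketch of an alternative (based on the linked duplicate, and assuming the same df1 as above), str.split combined with DataFrame.explode, available since pandas 0.25, gives the same result without iterrows:

# Split col1 into lists of words, then give each word its own row;
# explode repeats the col2 label for every word automatically
df2 = (df1.assign(col1=df1['col1'].str.split(' '))
          .explode('col1')
          .reset_index(drop=True))
print(df2)
#         col1    col2
# 0        foo       O
# 1        bar       O
# 2     George  PERSON
# 3          B  PERSON
# 4  President   TITLE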

Is there a way to compare the values of a Pandas DataFrame with the values of a second DataFrame?

I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(note that the DF2 values at indexes 2 and 4 are the final part of the DF1 values at indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the Python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for your patience, I'm new to Python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
# Output:
# 0     True
# 1    False
# 2    False
# 3     True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
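Since the requirement is specifically "ends in" rather than "contains anywhere", a hedged variant of the same one-liner anchors the joined pattern at the end of the string (reusing df and df2 from above; re.escape guards against regex metacharacters in the keywords):

import re

# Build a pattern like '(?:old|fin|new|bar)$' so only suffix matches count
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
print(df[~df['col1'].str.contains(pattern, regex=True, na=False)])
#      col1
# 1  acksyn
# 2  foobaz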
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True

Can we automate data filters for Excel using Pandas?

I want to apply filters to a spreadsheet using Python. Which module is more useful, Pandas or another one?
Filtering within your pandas dataframe can be done with loc (in addition to some other methods). What I THINK you're looking for is a way to export dataframes to Excel and apply a filter within Excel.
XlsxWriter (by John McNamara) satisfies pretty much all xlsx/pandas use cases and has great documentation here --> https://xlsxwriter.readthedocs.io/.
Auto-filtering is an option :) https://xlsxwriter.readthedocs.io/worksheet.html?highlight=auto%20filter#worksheet-autofilter
I am not sure if I understand your question right. Maybe the combination of pandas and
qgrid might help you.
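A minimal sketch of the qgrid suggestion (assuming a Jupyter notebook; qgrid renders the DataFrame as an interactive grid with per-column filter controls):

import pandas as pd
import qgrid

df = pd.DataFrame({'name': ['Joe', 'Bob', 'Alice'],
                   'dept': ['Marketing', 'IT', 'Marketing']})
# Display an interactive, filterable grid widget in the notebook
qgrid.show_grid(df)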
Simple filtering in pandas can be accomplished using the .loc DataFrame method.
In [4]: data = ({'name': ['Joe', 'Bob', 'Alice', 'Susan'],
...: 'dept': ['Marketing', 'IT', 'Marketing', 'Sales']})
In [5]: employees = pd.DataFrame(data)
In [6]: employees
Out[6]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
3 Susan Sales
In [7]: marketing = employees.loc[employees['dept'] == 'Marketing']
In [8]: marketing
Out[8]:
name dept
0 Joe Marketing
2 Alice Marketing
You can also use .loc with .isin to select multiple values in the same column
In [9]: marketing_it = employees.loc[employees['dept'].isin(['Marketing', 'IT'])]
In [10]: marketing_it
Out[10]:
name dept
0 Joe Marketing
1 Bob IT
2 Alice Marketing
You can also pass multiple conditions to .loc using an and (&) or or (|) statement to select values from multiple columns
In [11]: joe = employees.loc[(employees['dept'] == 'Marketing') & (employees['name'] == 'Joe')]
In [12]: joe
Out[12]:
name dept
0 Joe Marketing
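The same filters can also be written with DataFrame.query, which some people find more readable when chaining conditions (a minimal sketch using the employees frame above):

# Equivalent to the .loc examples above
marketing = employees.query("dept == 'Marketing'")
joe = employees.query("dept == 'Marketing' and name == 'Joe'")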
Here is an example of adding an autofilter to a worksheet exported from Pandas using XlsxWriter:
import pandas as pd
# Create a Pandas dataframe by reading some data from a space-separated file.
df = pd.read_csv('autofilter_data.txt', sep=r'\s+')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_autofilter.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object. We also turn off the
# index column at the left of the output dataframe.
df.to_excel(writer, sheet_name='Sheet1', index=False)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Get the dimensions of the dataframe.
(max_row, max_col) = df.shape
# Make the columns wider for clarity.
worksheet.set_column(0, max_col - 1, 12)
# Set the autofilter.
worksheet.autofilter(0, 0, max_row, max_col - 1)
# Add an optional filter criteria. The placeholder "Region" in the filter
# is ignored and can be any string that adds clarity to the expression.
worksheet.filter_column(0, 'Region == East')
# It isn't enough to just apply the criteria. The rows that don't match
# must also be hidden. We use Pandas to figure out which rows to hide.
for row_num in df.index[df['Region'] != 'East'].tolist():
    worksheet.set_row(row_num + 1, options={'hidden': True})
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Output:
The data used in this example is here.
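If you don't have the sample data file, here is a minimal self-contained sketch of the same autofilter idea (the inline data is hypothetical, standing in for autofilter_data.txt):

import pandas as pd

# Hypothetical data standing in for autofilter_data.txt
df = pd.DataFrame({'Region': ['East', 'West', 'East', 'North'],
                   'Sales': [100, 200, 150, 300]})

with pd.ExcelWriter('pandas_autofilter.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    worksheet = writer.sheets['Sheet1']
    max_row, max_col = df.shape
    # Add the dropdown filter arrows over the data range (header row included)
    worksheet.autofilter(0, 0, max_row, max_col - 1)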

Using List Comprehension with Pandas Series and Dataframes

I have written the below code that accepts a pandas series (dataframe column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v)
    return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
    df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?
It is possible, but then you need concat to rebuild the DataFrame, because you get a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
But maybe you only need replace with a dictionary:
df = df.replace(repl_dict)
df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply it to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict,inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
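One caveat with regex=True: the dictionary keys are treated as regular expressions, so keys containing metacharacters such as . or + should be escaped first, e.g. (a small sketch):

import re

# Escape each key so it is matched literally rather than as a regex pattern
safe_dict = {re.escape(k): v for k, v in repl_dict.items()}
df2.replace(safe_dict, inplace=True, regex=True)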
The whole point of pandas is to not use for loops... it's optimized to use the built-in methods for DataFrames and Series...
