Using List Comprehension with Pandas Series and DataFrames

I have written the code below, which accepts a pandas Series (a DataFrame column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v)
    return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
    df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?

It is possible, but you then need concat to rebuild the DataFrame, because the comprehension produces a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
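For illustration, a minimal runnable sketch of that concat approach, using a small made-up DataFrame (the column names and sample values here are assumptions, not from the question):
import pandas as pd

def phrase_replace(repl_dict, str_series):
    # Apply each replacement to the Series in turn
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v, regex=False)
    return str_series

df = pd.DataFrame({'a': ['apple', 'banana'], 'b': ['orange', 'apple']})
repl_dict = {'an': 'foo', 'pp': 'zz'}

# The comprehension yields one replaced Series per column; concat rebuilds the DataFrame
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
print(df)
#           a        b
# 0     azzle  orfooge
# 1  bfoofooa    azzle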
But maybe all you need is replace with a dictionary:
df = df.replace(repl_dict)

df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict, inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
The whole point of pandas is to avoid explicit for loops; it is optimized around the built-in methods for DataFrames and Series.
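As an aside (a sketch, not taken from the answers above): the loop inside phrase_replace can usually be collapsed into a single vectorized call, because Series.replace accepts a dict of patterns when regex=True:
import pandas as pd

s = pd.Series(['apple', 'banana', 'orange'])
repl_dict = {'an': 'foo', 'pp': 'zz'}

# One call instead of looping over the dict; with regex=True the keys
# are treated as patterns and replaced wherever they occur.
out = s.replace(repl_dict, regex=True)
print(out.tolist())  # ['azzle', 'bfoofooa', 'orfooge']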

Related

How to only keep an item of a list within pandas dataframe

This is what my dataframe looks like:
df = {"a": [[1,2,3], [4,5,6]],
"b": [[11,22,33], [44,55,66]],
"c": [[111,222,333], [444,555,666]]}
df = pd.Dataframe(df)
For each cell, I want to keep only the item at the index I choose.
I've tried this to keep the first item of each list:
df = df.apply(lambda x:x[0])
but it doesn't work.
Can anyone enlighten me on this? Thanks.
If you need the first value for all columns, apply the lambda function elementwise with DataFrame.applymap:
df = df.applymap(lambda x:x[0])
print(df)
a b c
0 1 11 111
1 4 44 444
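If the position to keep is not always 0, a generalized sketch might look like this (the helper name pick_item is made up here; in pandas 2.1+ DataFrame.map is the preferred spelling of applymap):
import pandas as pd

df = pd.DataFrame({"a": [[1, 2, 3], [4, 5, 6]],
                   "b": [[11, 22, 33], [44, 55, 66]],
                   "c": [[111, 222, 333], [444, 555, 666]]})

def pick_item(frame, i):
    # Keep only the i-th element of every list-valued cell
    return frame.applymap(lambda x: x[i])

print(pick_item(df, 1))
#    a   b    c
# 0  2  22  222
# 1  5  55  555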

Sort values in a dataframe by a column and take second one only if equal

I've created a dataframe using random values using the following code:
from numpy.random import random
import pandas as pd

values = random(5)
values_1 = random(5)
col1 = list(values / values.sum())
col2 = list(values_1)
df = pd.DataFrame({'col1': col1, 'col2': col2})
df.sort_values(by=['col2','col1'], ascending=[False,False]).reset_index(inplace=True)
The dataframe created in my case looks like this:
As you can see, the dataframe is not sorted in descending order by 'col2'. What I want to achieve is that it first sorts by 'col2' and if any 2 rows have same values for 'col2', then it should sort by 'col1' as well. Any suggestions? Any help would be appreciated.
Your solution is almost working, but reset_index(inplace=True) modifies the temporary result of sort_values in place (and returns None), so df itself is never updated.
A possible solution is to add ignore_index=True, so reset_index is not necessary.
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame({'col1': np.random.random(5), 'col2': np.random.random(5)})
df = df.sort_values(by=['col2','col1'], ascending=False, ignore_index=True)
print(df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Or, if you want to use inplace, add it only to sort_values, together with ignore_index=True:
df.sort_values(by=['col2','col1'], ascending=False, ignore_index=True, inplace=True)
print (df)
col1 col2
0 0.499058 0.897657
1 0.049974 0.896963
2 0.685408 0.721135
3 0.113384 0.647452
4 0.009359 0.486988
Your logic is correct but you've missed an inplace=True inside sort_values. Due to this, the sorting does not actually take place in your dataframe. Replace it with this:
df.sort_values(by=['col2','col1'],ascending=[False,False],inplace=True)
df.reset_index(inplace=True,drop=True)
You also want to do the sort with inplace=True, not only the reset_index().
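For completeness, a small sketch (not from either answer) of the same fix written without inplace at all, assigning the chained result back:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame({'col1': np.random.random(5), 'col2': np.random.random(5)})

# Chain the calls and reassign instead of relying on inplace
df = (df.sort_values(by=['col2', 'col1'], ascending=[False, False])
        .reset_index(drop=True))
print(df)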

Convert lists present in each column to their respective datatypes

I have a sample dataframe as given below.
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds lists. I want to remove those lists and preserve the datatypes of the values inside them for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values such as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        df[c] = df[c].apply(', '.join)
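Note that the join step assumes every cell is a list; if some cells are np.nan (as in row D of the question's data), a hedged variant that only touches actual lists could look like this (the helper name unwrap is an assumption, not part of the original answer):
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)

def unwrap(cell):
    # Leave non-list cells (e.g. NaN) untouched
    if not isinstance(cell, list):
        return cell
    # Single-item lists become the item, longer lists a joined string
    return cell[0] if len(cell) == 1 else ', '.join(map(str, cell))

list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
for c in list_cols:
    df[c] = df[c].apply(unwrap)
print(df)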

Is there a way to compare the values of a Pandas DataFrame with the values of a second DataFrame?

I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(see DF2 indexes 2 and 4 are the final part in DF1 indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the Python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I can even try a solution after joining the 2 CSVs but I think I would need to do the same thing in the end
Thanks for patience, I'm new to python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
# Output:
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
Use pd.Series.str.contains.
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True
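One caveat: str.contains matches anywhere in the string, while the question asks specifically about endings. A hedged sketch that anchors the pattern at the end (and escapes the keywords in case they contain regex metacharacters) could be:
import re
import pandas as pd

df = pd.DataFrame({'col1': ['foobar', 'acksyn', 'foobaz', 'ackfin']})
df2 = pd.DataFrame({'col1': ['old', 'fin', 'new', 'bar']})

# Build a pattern like "(?:old|fin|new|bar)$" so only suffixes match
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'

# Keep only the rows that do NOT end with any df2 keyword
print(df[~df['col1'].str.contains(pattern)])
#      col1
# 1  acksyn
# 2  foobaz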

Split rows with same ID into different columns python

I have a dataframe with repeated values sharing the same id number, but I want to split the repeated rows into columns.
import pandas as pd

data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns=['id', 'k'])
print(df)
The resulting dataframe would have n_k columns (n = the number of repeated rows for an id). Each repeated id gets an individual column, and when an id has no repeated rows, it gets a 0 in the new column.
data_merged = {'id': [10450015, 16690019, 16510069], '1_k': [4.4, 4.1, 3.7], '2_k': [0, 4.0, 0]}
print(data_merged)
Try assigning a column index reference using DataFrame.assign and groupby.cumcount, then reshape with DataFrame.pivot_table. Finally, use a list comprehension to rename the columns:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
            .pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id
10450015  4.4  0.0
16510069  3.7  0.0
16690019  4.1  4.0
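If you want id back as an ordinary column, as in the data_merged example above, a small follow-up sketch (reusing the df_new built by the answer):
# Turn the id index back into a regular column
df_flat = df_new.reset_index()
print(df_flat)
#          id  1_k  2_k
# 0  10450015  4.4  0.0
# 1  16510069  3.7  0.0
# 2  16690019  4.1  4.0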
