Append strings to pandas dataframe column with conditional - python-3.x

I'm trying to achieve the following: check whether each key in the dictionary occurs in the string from the Layers column. If it does, append the corresponding dictionary value to a new column in the pandas dataframe.
For example, if both BR and EWKS are contained in the layer name, the new column should contain BRIDGE-EARTHWORKS.
Dataframe
import pandas as pd
mapping = {'IDs': [1244, 35673, 37863, 76373, 234298],
           'Layers': ['D-BR-PILECAPS-OUTLINE 2',
                      'D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY',
                      'D-SUBG-OTHER',
                      'D-COMP-PAVE-CONC2',
                      'D-EWKS-HINGE']}
df = pd.DataFrame(mapping)
Dictionary
d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}
My code thus far is:
for i in df.Layers:
    for x in d1.keys():
        first_key = list(d1)[0]
        first_val = list(d1.values())[0]
        print(first_key, first_val)
        if first_key in i:
            df1 = df1.append(first_val, ignore_index=True)
            # df.apply(first_val)
Note: I'm thinking it may be easier to do the comprehension at the mapping step, before creating the dataframe. I'm still rather new to Python, so any tips are appreciated.
Thanks!

Use Series.str.extractall to get all matched keys, map them through the dictionary with Series.map, and finally join the values for each original row with a groupby aggregation:
pat = r'({})'.format('|'.join(d1.keys()))
df['new'] = df['Layers'].str.extractall(pat)[0].map(d1).groupby(level=0).agg('-'.join)
print(df)
      IDs                               Layers             new
0    1244              D-BR-PILECAPS-OUTLINE 2          Bridge
1   35673  D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY  Bridge-Terrain
2   37863                         D-SUBG-OTHER             NaN
3   76373                    D-COMP-PAVE-CONC2             NaN
4  234298                         D-EWKS-HINGE      Earthworks
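Note that the regex pattern matches a key anywhere in the string, so a key such as BR would also match inside a longer token. As a minimal alternative sketch, assuming layer names are made of tokens separated by hyphens or spaces (the helper name `label_layer` is hypothetical, not part of pandas), whole tokens can be matched with plain Python:

```python
import re
import pandas as pd

d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}
df = pd.DataFrame({
    "IDs": [1244, 35673, 37863, 76373, 234298],
    "Layers": ["D-BR-PILECAPS-OUTLINE 2",
               "D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY",
               "D-SUBG-OTHER",
               "D-COMP-PAVE-CONC2",
               "D-EWKS-HINGE"],
})

def label_layer(layer):
    """Join the dictionary values of every whole token found in `layer`."""
    # Assumption: tokens are separated by hyphens and/or whitespace.
    tokens = re.split(r"[-\s]+", layer)
    matches = [d1[tok] for tok in tokens if tok in d1]
    # Return None when nothing matches, so the cell shows up as missing.
    return "-".join(matches) if matches else None

df["new"] = [label_layer(layer) for layer in df["Layers"]]
print(df)
```

Values are joined in the order the tokens appear in the layer name, which matches the order produced by the extractall approach.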

Related

How to average groups of columns

Given the following pandas dataframe:
I am trying to get to point b (shown in image 2): I want to use the row 'class' to identify column names and average the columns that share the same class. I have been trying to use setdefault to build a dictionary, but without much luck. The final result I am aiming for is shown in fig 2.
Since this is a representative example (the actual dataframe is huge), please let me know of a loop-based example if possible.
Any help or pointers in the right direction is immensely appreciated.
Imports and Test DataFrame
import pandas as pd
from string import ascii_lowercase # for test data
import numpy as np # for test data
np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'
                a           b           c           d           e           f
Class
0      941.455743  641.602705  684.610467  588.562066  543.887219  368.070913
1      766.625774  305.012427  442.085972  110.443337  438.373785  752.615799
2      291.626250  885.722745  996.691261  486.568378  349.410194  151.412764
3      891.947611  773.542541  780.213921  489.000349  532.862838  189.855095
4      958.551868  882.662907   86.499676  243.609553  279.726092  215.662172
Create a DataFrame of column pair means
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]
# zip the two lists into pairs
# zip creates tuples, but pandas requires list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))
# in a list comprehension iterate through each column pair, get the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)
# in a list comprehension create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]
# display(df_means)
            a & b       c & d       e & f
Class
0      791.529224  636.586267  455.979066
1      535.819101  276.264655  595.494792
2      588.674498  741.629819  250.411479
3      832.745076  634.607135  361.358966
4      920.607387  165.054615  247.694132
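If the class of each column is known up front, the averaging can also be written as a grouping instead of explicit column pairs. A hedged sketch, assuming a hypothetical dict `col_class` that maps each column name to its class label, and using small made-up numbers rather than the random data above:

```python
import pandas as pd

# Small illustrative frame (values are made up for the example).
df = pd.DataFrame({
    "a": [1, 2], "b": [3, 4],
    "c": [10, 20], "d": [30, 40],
    "e": [5, 5], "f": [7, 9],
})

# Assumption: the class of every column is available as a mapping.
col_class = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "z", "f": "z"}

# Transpose so columns become rows, group the rows by class label,
# average each group, then transpose back.
class_means = df.T.groupby(col_class).mean().T
print(class_means)
```

This works for groups of any size, not only pairs, because membership comes from the mapping rather than from column position.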
Try this:
df['A B'] = df[['A', 'B']].mean(axis=1)

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
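which seems to be on the right track, but the underlying data in the DataFrame is all NaN.
Thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex looks up the new tuple labels among the existing columns, finds none of them, and fills with NaN, whereas set_axis simply relabels the existing columns), and sorting the tuples by the original column order ensures that each label is assigned to the correct column (this is what the sorted(..., key=...) line does).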
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1    cities            countries             planets
level2       nyc    london    canada     chile     earth
0       0.028033  0.540977 -0.056096  1.675698 -0.328630
1       1.170465 -1.003825  0.882126  0.453294 -1.127752
2      -0.187466 -0.192546  0.269802 -1.225172 -0.548491
3       2.272900 -0.085427  0.029242 -2.258696  1.034485
4      -1.243871 -1.660432 -0.051674  2.098602 -2.098941
5      -0.820820 -0.289754  0.019348  0.176778  0.395959
6       1.346459 -0.260583  0.212008 -1.071501  0.945545
7       0.673351  1.133616  1.117379 -0.531403  1.467604
8       0.332187 -3.541103 -0.222365  1.035739 -0.485742
9      -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
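The same relabeling can be sketched without sorting, by building a lookup from the second-level name to its group and walking the existing columns in order. This assumes every second-level name appears only once in coltuples; the names `group_of` and `df_multi` are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((3, 5)),
                  columns=["nyc", "london", "canada", "chile", "earth"])
coltuples = [("cities", "nyc"), ("countries", "canada"), ("countries", "usa"),
             ("countries", "chile"), ("planets", "earth"), ("planets", "mars"),
             ("cities", "sf"), ("cities", "london")]

# Map each second-level name to its top-level group.
# Assumption: second-level names are unique across coltuples.
group_of = {name: group for group, name in coltuples}

# Build the new labels in the existing column order, so labels and data
# stay aligned; extraneous pairs in coltuples are simply never looked up.
new_cols = pd.MultiIndex.from_tuples(
    [(group_of[col], col) for col in df.columns],
    names=["level1", "level2"],
)
df_multi = df.set_axis(new_cols, axis=1)
print(df_multi)
```

A column missing from coltuples would raise a KeyError here, which makes a violated assumption visible rather than silently mislabeling data.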

Pandas Dataframe - sort list element by date, when date is substring of element

I want to sort the data in each cell of the SESSIONS column by date (YYYY_MM_DD), where the date is embedded in the strings that make up each list. A cell can hold any number of sessions and can also be empty. Each cell of the SESSIONS column holds a list of sessions (like "li" below, which I use as a test example).
This is how it works correctly outside the df (2019_04_20 ends up last):
import re
from datetime import datetime
li = ['WE233JP_2015_03_03__13_31_21','WE238JP_2019_04_20__16_40_59','WE932LT_2017_10_12__08_35_49']
li.sort(key = lambda x: datetime.strptime(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x), '%Y_%m_%d'))
print(li)
When I try to apply it to the df with the code below (2 attempts):
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(datetime.strptime(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x), '%Y_%m_%d')))
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(re.sub(r'^([^_]+)_(.+)__(.+)', r'\2', x)))
In both cases I get an error: TypeError: expected string or bytes-like object
Simple non-date sorting like below works OK:
df['sessions'] = df.sessions.fillna('NULL').sort_values().apply(lambda x: sorted(x))
Any suggestions on how to sort the df by the extracted part of the string, formatted as a date?
Let's try Series.map with a custom sort key function.
Sample `df`:
sessions
0 [WE233JP_2015_03_03__13_31_21, WE238JP_2019_04_20__16_40_59, WE932LT_2017_10_12__08_35_49]
1 NaN
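import re
import pandas as pd
# key: parse the YYYY_MM_DD part between the first "_" and the "__" as a date
sort_func = lambda x: pd.to_datetime(re.findall(r'^[^_]+_(.+)__.+', x)[0],
                                     format='%Y_%m_%d', errors='coerce')
# sort only the cells that hold a list; leave NaN cells untouched
df['sorted_sessions'] = df.sessions.map(lambda y: sorted(y, key=sort_func)
                                        if isinstance(y, list) else y)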
Out[1455]:
sessions \
0 [WE233JP_2015_03_03__13_31_21, WE238JP_2019_04_20__16_40_59, WE932LT_2017_10_12__08_35_49]
1 NaN
sorted_sessions
0 [WE233JP_2015_03_03__13_31_21, WE932LT_2017_10_12__08_35_49, WE238JP_2019_04_20__16_40_59]
1 NaN
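Because the date part is zero-padded in YYYY_MM_DD order, sorting it as a plain string gives the same order as sorting by parsed date. A minimal sketch under that assumption, with hypothetical helper names `date_key` and `sort_sessions`:

```python
import re
import pandas as pd

# Assumption: every session string contains "_YYYY_MM_DD__" exactly as in the sample.
DATE_RE = re.compile(r"_(\d{4}_\d{2}_\d{2})__")

def date_key(session):
    """Return the YYYY_MM_DD part, or '' so unparseable items sort first."""
    match = DATE_RE.search(session)
    return match.group(1) if match else ""

def sort_sessions(cell):
    """Sort list cells by embedded date; pass NaN and other values through."""
    return sorted(cell, key=date_key) if isinstance(cell, list) else cell

df = pd.DataFrame({"sessions": [
    ["WE233JP_2015_03_03__13_31_21",
     "WE238JP_2019_04_20__16_40_59",
     "WE932LT_2017_10_12__08_35_49"],
    float("nan"),
]})
df["sorted_sessions"] = df["sessions"].map(sort_sessions)
print(df["sorted_sessions"].iloc[0])
```

Checking `isinstance(cell, list)` avoids the TypeError from the question, which comes from passing a whole list (or a NaN) to a function that expects a single string.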

Splitting Multiple values inside a Pandas Column into Separate Columns

I have a dataframe with a column whose cells contain the names and values of two different columns, as follows:
How do I transform it into separate columns?
So far, I have tried the following:
Using df[col].apply(pd.Series): it didn't work, since the data in the column is not in dictionary format.
Splitting the column on the semicolon (";"): this is not a good idea, since the dataframe might have any number of columns depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
       .str.replace("Date:", "")
       .str.replace("Bucket:", "")
       .str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
So, maybe something like this:
Set up the data frame
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Parse the dataframe, splitting each row on the semicolon and then on the colon
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1], splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1], splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns=['Date', 'Bucket'])
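For an unknown number of name:value pairs, each cell can be parsed into a dict and the dicts handed to the DataFrame constructor, which derives the column names itself. A minimal sketch, assuming every item contains a colon and names do not repeat within a cell (`parse_pairs` is a hypothetical helper):

```python
import pandas as pd

d = {"ClusterName": ["Date:20191010;Bucket:All",
                     "Date:20191010;Bucket:some",
                     "Date:20191010;Bucket:All"]}
df = pd.DataFrame(d)

def parse_pairs(text):
    """Turn 'k1:v1;k2:v2' into {'k1': 'v1', 'k2': 'v2'}."""
    # Split on the first colon only, so values may themselves contain colons.
    return dict(item.split(":", 1) for item in text.split(";"))

# One dict per row; keys become column names, missing keys become NaN.
df_split = pd.DataFrame([parse_pairs(t) for t in df["ClusterName"]],
                        index=df.index)
print(df_split)
```

The values stay strings here; converting Date to a real date type would be a separate step.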

List iterations and regex, what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header row (column names) contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200), or create a copy of the DataFrame, so that the column names contain just the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
Is the output reliable in terms of sort order? As of now, it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# r.append(m) gives me a list of lists (each list has one item),
# so the comprehension below flattens it
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
'Group1/most common/G24',
'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution uses rsplit and selects the last value with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
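On the ordering question: a Python list is iterated in order, and a renaming function is applied to each column label in place, so the result keeps the original column order. A small sketch using DataFrame.rename with a function, returning a renamed copy (the helper `to_gcode` is hypothetical, and it assumes the G-code sits at the end of each name):

```python
import re
import pandas as pd

df = pd.DataFrame(columns=["Group1/most common/G95",
                           "Group1/most common/G24",
                           "Group3/plastics/G132"])

# Assumption: the code is a "G" followed by digits at the very end of the name.
G_CODE = re.compile(r"G\d+$")

def to_gcode(name):
    """Return the trailing G-code, or the name unchanged if none is found."""
    match = G_CODE.search(name)
    return match.group(0) if match else name

# rename returns a new DataFrame; the original column names are untouched.
renamed = df.rename(columns=to_gcode)
print(list(renamed.columns))
```

Leaving unmatched names unchanged avoids ending up with missing column labels when a header does not follow the pattern.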
