How to average groups of columns - python-3.x

Given the following pandas dataframe:
I am trying to get to point b (shown in image 2), where I want to use the row 'class' to identify column names and average the columns that share a class. I have been trying to use setdefault to build a dictionary, but I am not having much luck. I aim to achieve the final result shown in fig 2.
Since this is only a representative example (the actual dataframe is huge), please show a loop-based approach if possible.
Any help or pointers in the right direction are immensely appreciated.
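Since the original frame only appears in the linked images, here is a minimal sketch of the class-row idea with made-up data (the column names, values, and class labels below are all hypothetical): transpose the frame, group the transposed rows by a labelling Series, average, and transpose back.
import pandas as pd

# hypothetical stand-in for the frame in the images: three data columns
# and a Series assigning each column to a class
df = pd.DataFrame({'col1': [1.0, 2.0],
                   'col2': [3.0, 4.0],
                   'col3': [5.0, 6.0]})
col_class = pd.Series({'col1': 'A', 'col2': 'A', 'col3': 'B'})

# transpose so columns become rows, group by the class labels,
# average within each class, then transpose back
means = df.T.groupby(col_class).mean().T
print(means)
#      A    B
# 0  2.0  5.0
# 1  3.0  6.0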

Imports and Test DataFrame
import pandas as pd
from string import ascii_lowercase # for test data
import numpy as np # for test data
np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'
a b c d e f
Class
0 941.455743 641.602705 684.610467 588.562066 543.887219 368.070913
1 766.625774 305.012427 442.085972 110.443337 438.373785 752.615799
2 291.626250 885.722745 996.691261 486.568378 349.410194 151.412764
3 891.947611 773.542541 780.213921 489.000349 532.862838 189.855095
4 958.551868 882.662907 86.499676 243.609553 279.726092 215.662172
Create a DataFrame of column pair means
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]
# zip the two lists into pairs
# zip creates tuples, but pandas requires list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))
# in a list comprehension iterate through each column pair, get the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)
# in a list comprehension create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]
# display(df_means)
a & b c & d e & f
Class
0 791.529224 636.586267 455.979066
1 535.819101 276.264655 595.494792
2 588.674498 741.629819 250.411479
3 832.745076 634.607135 361.358966
4 920.607387 165.054615 247.694132
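An equivalent route, as a sketch, is to group adjacent columns positionally with groupby on the transposed frame (np is imported above; this assumes the pairs are adjacent, as here):
grouper = np.arange(df.shape[1]) // 2   # 0,0,1,1,2,2 pairs up adjacent columns
df_pair_means = df.T.groupby(grouper).mean().T
df_pair_means.columns = [' & '.join(pair) for pair in col_pairs]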

Try This
df['a b'] = df[['a', 'b']].mean(axis=1)
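Since the question asks for a loop, the same idea generalizes; a sketch, assuming an even number of columns already in pair order:
# loop over adjacent (even, odd) column pairs and add each pair's mean
for left, right in zip(df.columns[0::2], df.columns[1::2]):
    df[f'{left} {right}'] = df[[left, right]].mean(axis=1)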

Related

Dictionary to Dataframe with list as value

I am trying to create a dataframe from two dictionaries with matching keys with the values in neighboring columns.
I have two dictionaries:
pl = {'seq1' : ['actgcta', 'cggctatcg'], 'seq2': ['cgatcgatca'], 'seq3': ['cgatcagt', 'cgataataat']}
pr = {'seq1' : ['cagtatacga', 'attacgat', 'atcgactagt'], 'seq2': ['cgatcgatca'], 'seq3': ['cgatcagt']}
I am trying to create a dataframe that looks like this (please forgive the crude figure):
seq1 |['actgcta','cggctatcg'] |['cagtatacga', 'attacgat', 'atcgactagt']
--------------------------------------------------------------------------
seq2 |['cgatcgatca'] |['cgatcgatca']
--------------------------------------------------------------------------
seq3 |['cgatcagt', 'cgataataat'] |['cgatcagt']
I have tried working with pd.DataFrame, pd.DataFrame.from_dict, and various orient arguments, but have had no success.
One way using pandas.concat:
df = pd.concat(map(pd.Series, [pl, pr]), axis=1)
Output:
0 1
seq1 [actgcta, cggctatcg] [cagtatacga, attacgat, atcgactagt]
seq2 [cgatcgatca] [cgatcgatca]
seq3 [cgatcagt, cgataataat] [cgatcagt]
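If named columns are preferred over the default 0 and 1, the keys argument to concat labels them (the names left and right are just placeholders):
df = pd.concat(map(pd.Series, [pl, pr]), axis=1, keys=['left', 'right'])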
Unfortunately, to load the dictionaries into dataframes, your lists must be of the same length, which they are not. So we must first wrap each value in a single-element list. Then we can merge the dataframes with simple concatenation:
import pandas as pd
pl = {'seq1' : ['actgcta', 'cggctatcg'], 'seq2': ['cgatcgatca'], 'seq3': ['cgatcagt', 'cgataataat']}
pr = {'seq1' : ['cagtatacga', 'attacgat', 'atcgactagt'], 'seq2': ['cgatcgatca'], 'seq3': ['cgatcagt']}
for key in pl.keys():
    pl[key] = [pl[key]]
for key in pr.keys():
    pr[key] = [pr[key]]
df = pd.DataFrame(pl)
df1 = pd.DataFrame(pr)
df2 = pd.concat([df, df1])
print(df2.transpose())
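A related sketch that avoids mutating the input dictionaries: wrap each dict in a Series, so every value stays a single list, and let pandas align the rows on the shared keys (the column names left and right are placeholders):
import pandas as pd

pl = {'seq1': ['actgcta', 'cggctatcg'], 'seq2': ['cgatcgatca'],
      'seq3': ['cgatcagt', 'cgataataat']}
pr = {'seq1': ['cagtatacga', 'attacgat', 'atcgactagt'],
      'seq2': ['cgatcgatca'], 'seq3': ['cgatcagt']}

# each dict becomes one object-dtype column; rows align on the keys
df = pd.DataFrame({'left': pd.Series(pl), 'right': pd.Series(pr)})
print(df)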

Append strings to pandas dataframe column with conditional

I'm trying to achieve the following: check whether each key in the dictionary appears in the string in the Layers column and, if it does, append the corresponding dictionary value to a new column in the pandas dataframe.
For example, if BR and EWKS are contained within the layer, then the new column will contain BRIDGE-EARTHWORKS.
Dataframe
mapping = {'IDs': [1244, 35673, 37863, 76373, 234298],
           'Layers': ['D-BR-PILECAPS-OUTLINE 2',
                      'D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY',
                      'D-SUBG-OTHER',
                      'D-COMP-PAVE-CONC2',
                      'D-EWKS-HINGE']}
df = pd.DataFrame(mapping)
Dictionary
d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}
My code thus far is:
for i in df.Layers:
    for x in d1.keys():
        first_key = list(d1)[0]
        first_val = list(d1.values())[0]
        print(first_key, first_val)
        if first_key in i:
            df1 = df1.append(first_val, ignore_index=True)
            # df.apply(first_val)
Note: I'm thinking it may be easier to do the mapping as a comprehension prior to creating the dataframe. I'm rather new to Python still, so any tips are appreciated.
Thanks!
Use Series.str.extractall to find all matched keys, then map them through the dictionary with Series.map, and finally aggregate with a join:
pat = r'({})'.format('|'.join(d1.keys()))
df['new'] = df['Layers'].str.extractall(pat)[0].map(d1).groupby(level=0).agg('-'.join)
print (df)
IDs Layers new
0 1244 D-BR-PILECAPS-OUTLINE 2 Bridge
1 35673 D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY Bridge-Terrain
2 37863 D-SUBG-OTHER NaN
3 76373 D-COMP-PAVE-CONC2 NaN
4 234298 D-EWKS-HINGE Earthworks
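As a non-regex alternative, a plain list comprehension with substring checks gives the same result on this data (a sketch; note that a short key like BR would also match inside longer tokens):
import numpy as np

# join the mapped values for every key found in the layer string;
# an empty join is falsy, so unmatched rows fall through to NaN
df['new'] = ['-'.join(v for k, v in d1.items() if k in layer) or np.nan
             for layer in df['Layers']]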

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want set_axis rather than reindex, and sorting by the original column order ensures each label is assigned to the correct column (done in the sorted(..., key=...) step). reindex looks the new tuple labels up among the existing columns and, finding no matches, fills everything with NaN; set_axis simply relabels the columns in place.
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
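An alternative sketch that sidesteps the sorting step: build a second-level-to-first-level lookup dict from coltuples and construct the MultiIndex directly in the frame's existing column order:
# map each second-level name to its first-level group
lookup = {lvl2: lvl1 for lvl1, lvl2 in coltuples}
df.columns = pd.MultiIndex.from_arrays(
    [[lookup[c] for c in df.columns], df.columns],
    names=['level1', 'level2'])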

Using loops to call multiple pandas dataframe columns

I am new to python3 and trying to run chi-squared tests on columns in a pandas dataframe. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like a loop that handles all column pairs at once.
I succeed doing this if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv(r'count.csv')
for n in [0, 2, 4, 6, 8, 10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop seems not to run at all: I get no error, but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from a list of tuples, then assigning column names as indicators for the observed and expected frequencies (subsetting even/odd columns with indexed notation):
# create a data frame from a list of tuples,
# then assign column names
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0, 11, 2)],
                              columns=['chi_sq_stat', 'p_value'])
                   .assign(obs_freq=df.columns[::2],
                           exp_freq=df.columns[1::2])
                )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
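If you also want to record which column pair produced each statistic, one option is a dict keyed by the observed column's name (a sketch, assuming the same even/odd pairing; the 1-D df.iloc[:, n] indexing makes chisquare return scalars instead of length-1 arrays):
results = {}
for n in range(0, df.shape[1], 2):
    stat, p = chisquare(df.iloc[:, n], df.iloc[:, n + 1])
    results[df.columns[n]] = {'chi_sq_stat': stat, 'p_value': p}
print(results)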
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print their results, I managed to find a solution, although not as elegant as the answer above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan

List iterations and regex: what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header (column names) row contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this? Is the output reliable in terms of sort order? As of now it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# re.findall returns a list, so r is a list of one-item lists;
# the comprehension below flattens it
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit, selecting the last values with str[-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
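A follow-up sketch, if an explicit rename mapping is preferred over overwriting df.columns directly (this assumes every column name contains a G-code; re.search would return None otherwise):
import re

# build an old-name -> G-code mapping, then rename
code_map = {c: re.search(r'G\d+', c).group(0) for c in df.columns}
df = df.rename(columns=code_map)
print(df.columns.tolist())  # ['G95', 'G24', 'G132']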
