create MultiIndex columns based on "lookup" - python-3.x

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but all of the underlying data in the resulting DataFrame is NaN.
thanks in advance!

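Two things to notice: you want set_axis rather than reindex, and sorting the tuples by the original column order ensures that each label is assigned to the correct column (this is what the sorted(..., key=...) call does). reindex looks the new labels up among the existing ones, and since none of the tuples exist in the flat column index, every column comes back as NaN; set_axis simply relabels the columns by position.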
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
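As an alternative to sorting the tuples, the lookup can be turned into a dictionary and the new labels built directly in the DataFrame's own column order. This is only a sketch: the names lookup, new_cols and result are made up for illustration, and it assumes every second-level label appears at most once in coltuples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5),
                  columns=['nyc', 'london', 'canada', 'chile', 'earth'])
coltuples = [('cities', 'nyc'), ('countries', 'canada'), ('countries', 'usa'),
             ('countries', 'chile'), ('planets', 'earth'), ('planets', 'mars'),
             ('cities', 'sf'), ('cities', 'london')]

# Map each second-level label to its top-level group
# (assumes second-level labels are unique in coltuples).
lookup = {level2: level1 for level1, level2 in coltuples}

# Build the tuples in df's own column order, so labels and data stay aligned.
new_cols = pd.MultiIndex.from_tuples(
    [(lookup[col], col) for col in df.columns],
    names=['level1', 'level2'],
)

# Relabel a copy; the underlying values are untouched.
result = df.copy()
result.columns = new_cols
```

Extraneous pairs such as ('planets', 'mars') are ignored automatically because only the existing columns are looked up.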

Related

Append strings to pandas dataframe column with conditional

I'm trying to achieve the following: check if each key value in the dictionary is in the string from column layers. If it meets the conditional, to append the value from the dictionary to the pandas dataframe.
For example, if BR and EWKS is contained within the layer, then in the new column there will be BRIDGE-EARTHWORKS.
Dataframe
mapping = {'IDs': [1244, 35673, 37863, 76373, 234298],
'Layers': ['D-BR-PILECAPS-OUTLINE 2',
'D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY',
'D-SUBG-OTHER',
'D-COMP-PAVE-CONC2',
'D-EWKS-HINGE']}
df = pd.DataFrame(mapping)
Dictionary
d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}
My code thus far is:
for i in df.Layers:
    for x in d1.keys():
        first_key = list(d1)[0]
        first_val = list(d1.values())[0]
        print(first_key,first_val)
        if first_key in i:
            df1 = df1.append(first_val, ignore_index = True)
            # df.apply(first_val)
Note I'm thinking it may be easier to do the comprehension at the mapping step prior to creating the dataframe.. I'm rather new to python still so any tips are appreciated.
Thanks!
Use Series.str.extractall to find all matched keys, map them through the dictionary with Series.map, and finally aggregate each row's matches with join:
pat = r'({})'.format('|'.join(d1.keys()))
df['new'] = df['Layers'].str.extractall(pat)[0].map(d1).groupby(level=0).agg('-'.join)
print (df)
IDs Layers new
0 1244 D-BR-PILECAPS-OUTLINE 2 Bridge
1 35673 D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY Bridge-Terrain
2 37863 D-SUBG-OTHER NaN
3 76373 D-COMP-PAVE-CONC2 NaN
4 234298 D-EWKS-HINGE Earthworks
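Since the question mentions doing the work with a comprehension, here is a plain-Python sketch of the same idea. The helper name describe is made up for illustration. Unlike the extractall version, the joined values follow the dictionary's key order rather than the order in which the keys occur in the string, and a row with no match gets None.

```python
import pandas as pd

df = pd.DataFrame({
    'IDs': [1244, 35673, 37863, 76373, 234298],
    'Layers': ['D-BR-PILECAPS-OUTLINE 2',
               'D-BR-STEEL-OUTLINE 2D-TERR-BOUNDARY',
               'D-SUBG-OTHER',
               'D-COMP-PAVE-CONC2',
               'D-EWKS-HINGE'],
})
d1 = {"BR": "Bridge", "EWKS": "Earthworks", "KERB": "Kerb", "TERR": "Terrain"}

def describe(layer):
    """Join the values of every key that occurs as a substring of layer."""
    joined = '-'.join(value for key, value in d1.items() if key in layer)
    return joined or None  # None when no key matched

df['new'] = [describe(layer) for layer in df['Layers']]
```

Both approaches use substring matching, so a key such as BR would also match inside a longer token like BRIDGE.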

How to average groups of columns

Given the following pandas dataframe:
I am trying to get to point b (shown in image 2). Where I want to use row 'class' to identify column names and average columns with the same class. I have been trying to use setdefault to create a dictionary but I am not having much luck. I aim to achieve the final result shown in fig 2.
Since this is a representative example (the actual dataframe is huge), please let me know of a loop based example if possible.
Any help or pointers in the right direction is immensely appreciated.
Imports and Test DataFrame
import pandas as pd
from string import ascii_lowercase # for test data
import numpy as np # for test data
np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'
a b c d e f
Class
0 941.455743 641.602705 684.610467 588.562066 543.887219 368.070913
1 766.625774 305.012427 442.085972 110.443337 438.373785 752.615799
2 291.626250 885.722745 996.691261 486.568378 349.410194 151.412764
3 891.947611 773.542541 780.213921 489.000349 532.862838 189.855095
4 958.551868 882.662907 86.499676 243.609553 279.726092 215.662172
Create a DataFrame of column pair means
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]
# zip the two lists into pairs
# zip creates tuples, but pandas requires list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))
# in a list comprehension iterate through each column pair, get the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)
# in a list comprehension create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]
# display(df_means)
a & b c & d e & f
Class
0 791.529224 636.586267 455.979066
1 535.819101 276.264655 595.494792
2 588.674498 741.629819 250.411479
3 832.745076 634.607135 361.358966
4 920.607387 165.054615 247.694132
Try This
df['A B'] = df[['A', 'B']].mean(axis=1)
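If the class of each column is known, the averaging can also be expressed as a group-by over the columns. The images from the question are not available, so the data and the col_class mapping below are hypothetical; this is a sketch of the technique, not a reproduction of the asker's frame.

```python
import pandas as pd

# Small stand-in frame; the real data would be much larger.
df = pd.DataFrame([[1.0, 3.0, 10.0, 20.0],
                   [2.0, 4.0, 30.0, 50.0]],
                  columns=list('abcd'))

# Hypothetical class label for each column.
col_class = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y'}

# Transpose so the columns become the index, group the rows by class,
# average each group, then transpose back.
class_means = df.T.groupby(col_class).mean().T
```

Here class_means has one column per class (x and y), each holding the row-wise mean of the original columns in that class.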

pandas dataframe manipulation without using loop

Please find the input and expected output below. For each store id and period id, 11 items should be present; if any item is missing, add it and fill that row with 0,
without using a loop.
Any help is highly appreciated.
input
Expected Output
You can do this:
Sample df:
df = pd.DataFrame({'store_id':[1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962,1160962, 1160962],
'period_id':[1025,1025,1025,1025,1025,1025,1026,1026,1026,1026,1026],
'item_x':[1,4,5,6,7,8,1,2,5,6,7],
'z':[1,4,5,6,7,8,1,2,5,6,7]})
Solution:
num = range(1,12)
def f(x):
    return x.reindex(num, fill_value=0)\
            .assign(store_id=x['store_id'].mode()[0], period_id = x['period_id'].mode()[0])
df.set_index('item_x').groupby(['store_id','period_id'], group_keys=False).apply(f).reset_index()
You can do:
from itertools import product
pdindex=product(df.groupby(["store_id", "period_id"]).groups, range(1,12))
pdindex=pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])
df=df.set_index(["store_id", "period_id", "Item"])
res=pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns]=df
res=res.fillna(0).reset_index()
Note that this only works if you don't have any Item outside the range [1, 11].
This is a simplification of @GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension in place of map.
It also uses reindex() rather than creating a new DataFrame from the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples([
        group + (item,)
        for group, item in itertools.product(
            df.groupby(["store_id", "period_id"]).groups,
            range(1, 12),
        )],
        names=["store_id", "period_id", "Item_x"]
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
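The answers above spell the item column differently (item_x, Item, Item_x). As a self-contained sketch, the reindex approach can be run end to end against the sample frame from the first answer, assuming the column is called item_x as it is there. The names keys, groups, full_index and result are illustrative only.

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "store_id": [1160962] * 11,
    "period_id": [1025] * 6 + [1026] * 5,
    "item_x": [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
    "z": [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
})

keys = ["store_id", "period_id"]

# Every (store_id, period_id) pair that actually occurs in the data.
groups = df[keys].drop_duplicates().itertuples(index=False, name=None)

# Cross each existing pair with items 1..11 to get the full target index.
full_index = pd.MultiIndex.from_tuples(
    [group + (item,) for group, item in itertools.product(groups, range(1, 12))],
    names=keys + ["item_x"],
)

# Rows missing from the original are created and filled with 0.
result = (
    df.set_index(keys + ["item_x"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
```

With two store/period pairs in the sample, result has 22 rows, and the added rows carry 0 in the z column.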

How can I select specific values from list and plot a seaborn boxplot?

I have a list (length 300) of lists (each length 1000). I want to sort the list of 300 by the median of each list of 1000, and then plot a seaborn boxplot of the top 10 (i.e. the 10 lists with the greatest median).
I am able to plot the entire list of 300 but don't know where to go from there.
I can plot a range of the points, but how do I plot, for example, data[3], data[45] and data[129] all in the same plot?
ax = sns.boxplot(data = data[0:50])
I can also work out which items in the list are in the top 10 by doing this (but I realise this is not the most elegant way!)
array_median = np.median(data, axis=1)
np_sortedarray = np.sort(np.array(array_median))
sort_panda = pd.DataFrame(array_median)
TwoL = sort_panda.reset_index()
TwoL.sort_values(0)
Ultimately I want a boxplot with 10 boxes, showing the list items that have the greatest median values.
Example of data: list of 300 x 1000
[[1.236762285232544,
1.2303414344787598,
1.196462631225586,
...1.1787045001983643,
1.1760116815567017,
1.1614983081817627,
1.1546586751937866],
[1.1349891424179077,
1.1338907480239868,
1.1239897012710571,
1.1173863410949707,
...1.1015456914901733,
1.1005324125289917,
1.1005228757858276],
[1.0945734977722168,
...1.091795563697815]]
I modified your example data a bit just to make it easier.
import seaborn as sns
import pandas as pd
import numpy as np
data = [[1.236762285232544, 1.2303414344787598, 1.196462631225586, 1.1787045001983643, 1.1760116815567017, 1.1614983081817627, 1.1546586751937866],
[1.1349891424179077, 1.1338907480239868, 1.1239897012710571, 1.1173863410949707, 1.1015456914901733, 1.1005324125289917, 1.1005228757858276]]
To sort your data, since it is in list format and not numpy arrays, you can use the sorted function with a key to tell it to perform an operation on each list in your list, which is how the function will sort. Setting reverse = True tells it to sort highest to lowest.
sorted_data = sorted(data, key = lambda x: np.median(x), reverse = True)
To select the top n lists, add [:n] to the end of the previous statement.
To plot in Seaborn, it's easiest to convert your data to a pandas.DataFrame.
df = pd.DataFrame(sorted_data).T
That makes a DataFrame with 10 columns (or 2 in this example). We can rename the columns to make each dataset clearer.
df = df.rename(columns={k: f'Data{k+1}' for k in range(len(sorted_data))}).reset_index()
And to plot 2 (or 10) boxplots in one plot, you can reshape the dataframe to have 2 columns, one for the data and one for the dataset number (ID) (credit here).
df = pd.wide_to_long(df, stubnames = ['Data'], i = 'index', j = 'ID').reset_index()[['ID', 'Data']]
And then you can plot it.
sns.boxplot(x='ID', y = 'Data', data = df)
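To fetch the top 10 elements, take the indices of the 10 largest medians. This assumes data is a NumPy array (not a list of lists) and median holds the per-row medians, e.g. np.median(data, axis=1):
idx = (-median).argsort()[:10]
data[idx]
With a NumPy array you can also select particular rows like this:
data[[3, 45, 129]]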
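Putting the selection steps together, a minimal sketch might look like the following. The random data is a stand-in for the real 300 x 1000 list, and the names arr, medians, top_idx and top_df are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: 300 lists of 1000 values, each centred on a different mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=np.arange(300)[:, None] / 100, size=(300, 1000)).tolist()

# Convert to an array so rows can be selected with an index array.
arr = np.asarray(data)
medians = np.median(arr, axis=1)

# Positions of the 10 rows with the greatest medians, largest first.
top_idx = np.argsort(medians)[::-1][:10]

# One column per selected list, labelled with its position in the original list.
top_df = pd.DataFrame(arr[top_idx].T, columns=[f"list {i}" for i in top_idx])
```

A wide DataFrame like top_df can then be passed straight to seaborn with sns.boxplot(data=top_df), which draws one box per column.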

List iterations and regex, what is the better way to remove the text I don't need?

We handle data from volunteers; the data is entered into a form using ODK. When the data is downloaded, the header row (column names) contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
Is the output reliable in terms of sort order? As of now it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# the comprehension below is to flatten it
# r.append(m) gives me a list of lists (each list has one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
'Group1/most common/G24',
'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit and select last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
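On the ordering question: a loop or comprehension over a list visits the items in list order, so the results line up with the original names. A plain-Python sketch that keeps a column name unchanged when no G-code is found could look like this; the names G_CODE and g_code are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])

# 'G' followed by one or more digits; 'Group1' does not match
# because its 'G' is followed by a letter.
G_CODE = re.compile(r'G\d+')

def g_code(name):
    """Return the first G-code in name, or name unchanged if none is found."""
    match = G_CODE.search(name)
    return match.group(0) if match else name

# rename applies the function to each column label in order
# and returns a new DataFrame.
df = df.rename(columns=g_code)
```

Falling back to the original name avoids ending up with missing column labels if a header does not follow the expected pattern.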
