How to convert tuples in a column of a pandas DataFrame into repeating rows and columns? - python-3.x

I have a dataframe which contains the following data (only 3 samples are provided here):
data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
# Create DataFrame
df = pd.DataFrame(data)
Basically, each row contains a list of tuples of the top 10 words along with their frequencies for each Department.
I want to create a dataframe where the department name is repeated and each row contains a word from the tuple in one column and its frequency count in the other, so that it looks something like this:
Department  Word            Counts
D1          cat             6
D1          project         6
D1          dog             6
D1          develop         4
D1          smooth          4
D1          efficient       4
D1          administrative  4
D1          procedure       4
D1          establishment   3
D1          matter          3
D2          management      21
D2          satisfaction    12
D2          within          9
D2          budget          9
D2          township        9
Is there any way to achieve this type of conversion?

I'd suggest doing the wrangling before loading into a dataframe, working directly with the data dictionary:
import numpy as np
import pandas as pd

length = [len(entry) for entry in data['TopWords']]
department = {'Department': np.repeat(data['Department'], length)}
(pd
 .DataFrame([ent for entry in data['TopWords'] for ent in entry],
            columns=['Word', 'Counts'])
 .assign(**department)
)
Word Counts Department
0 cat 6 D1
1 project 6 D1
2 dog 6 D1
3 develop 4 D1
4 smooth 4 D1
5 efficient 4 D1
6 administrative 4 D1
7 procedure 4 D1
8 establishment 3 D1
9 matter 3 D1
10 management 21 D2
11 satisfaction 12 D2
12 within 9 D2
13 budget 9 D2
14 township 9 D2
15 site 9 D2
16 periodic 9 D2
17 admin 9 D2
18 maintain 9 D2
19 guest 6 D2
20 manage 2 D3
21 ir 2 D3
22 mines 2 D3
23 implimentation 2 D3
24 clrg 2 D3
25 act 2 D3
26 implementations 2 D3
27 office 2 D3
28 maintenance 2 D3
29 administration 2 D3

First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str:
res = (
    df.explode('TopWords', ignore_index=True)
      .assign(Word=lambda df: df['TopWords'].str[0],
              Counts=lambda df: df['TopWords'].str[1])
      .drop(columns='TopWords')
)
Output:
>>> res
Department Word Counts
0 D1 cat 6
1 D1 project 6
2 D1 dog 6
3 D1 develop 4
4 D1 smooth 4
5 D1 efficient 4
6 D1 administrative 4
7 D1 procedure 4
8 D1 establishment 3
9 D1 matter 3
10 D2 management 21
11 D2 satisfaction 12
12 D2 within 9
13 D2 budget 9
14 D2 township 9
15 D2 site 9
16 D2 periodic 9
17 D2 admin 9
18 D2 maintain 9
19 D2 guest 6
20 D3 manage 2
21 D3 ir 2
22 D3 mines 2
23 D3 implimentation 2
24 D3 clrg 2
25 D3 act 2
26 D3 implementations 2
27 D3 office 2
28 D3 maintenance 2
29 D3 administration 2
As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster to wrangle it before loading it into a DataFrame.
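A variant of the same explode-based idea: instead of two .str accessors, the exploded tuple column can be split into both new columns in one step by rebuilding a DataFrame from it. A minimal sketch on a trimmed, two-department version of the data:

```python
import pandas as pd

data = {'Department': ['D1', 'D2'],
        'TopWords': [[('cat', 6), ('dog', 6)],
                     [('management', 21), ('admin', 9)]]}
df = pd.DataFrame(data)

# One row per (word, count) tuple
res = df.explode('TopWords', ignore_index=True)
# Split each tuple into two columns in a single assignment
res[['Word', 'Counts']] = pd.DataFrame(res['TopWords'].tolist(), index=res.index)
res = res.drop(columns='TopWords')
```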
Another way of doing it is with a nested loop:
data = {'Department': ['D1', 'D2', 'D3'],
        'TopWords': [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                     [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                     [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
records = []
for idx, top_words_list in enumerate(data['TopWords']):
    for word, count in top_words_list:
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Count': count
        }
        records.append(rec)
res = pd.DataFrame(records)

In addition to @sammywemmy's answer, the following approach does not need the numpy package; however, due to the double loop, it might not be as performant on large datasets.
d = {"Department": [], "Words": [], "Count": []}
for idx, department in enumerate(data["Department"]):
    for word, count in data["TopWords"][idx]:
        d["Department"].append(department)
        d["Words"].append(word)
        d["Count"].append(count)
print(pd.DataFrame(d))
@Rodalm made this code more readable by using enumerate. Previously I had used a plain range().

One option using a dictionary comprehension:
(df
 .drop(columns='TopWords')
 .join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                  for k, x in enumerate(df['TopWords'])})
         .droplevel(1))
)
output:
Department Word Counts
0 D1 cat 6
0 D1 project 6
0 D1 dog 6
0 D1 develop 4
0 D1 smooth 4
0 D1 efficient 4
0 D1 administrative 4
0 D1 procedure 4
0 D1 establishment 3
0 D1 matter 3
1 D2 management 21
1 D2 satisfaction 12
1 D2 within 9
1 D2 budget 9
1 D2 township 9
1 D2 site 9
1 D2 periodic 9
1 D2 admin 9
1 D2 maintain 9
1 D2 guest 6
2 D3 manage 2
2 D3 ir 2
2 D3 mines 2
2 D3 implimentation 2
2 D3 clrg 2
2 D3 act 2
2 D3 implementations 2
2 D3 office 2
2 D3 maintenance 2
2 D3 administration 2
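A related sketch: pd.concat also accepts the department names themselves as dict keys, which avoids the positional join entirely. This is a variant of the answer above, shown on a trimmed two-department version of the data:

```python
import pandas as pd

data = {'Department': ['D1', 'D2'],
        'TopWords': [[('cat', 6), ('dog', 6)],
                     [('management', 21), ('admin', 9)]]}
df = pd.DataFrame(data)

# The dict keys become an index level named 'Department'
out = (pd.concat({dept: pd.DataFrame(words, columns=['Word', 'Counts'])
                  for dept, words in zip(df['Department'], df['TopWords'])},
                 names=['Department', None])
         .reset_index('Department')
         .reset_index(drop=True))
```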

Related

Align value counts of two dataframes side by side

If I have two dataframes, df1 (data for the current day) and df2 (data for the previous day),
where both dataframes have 40 columns and all columns are of object data type,
how do I compare the top 3 value_counts for both dataframes, ideally side by side, like the following:
          df1            df2
Column a  Value count 1  Value count 1
          Value count 2  Value count 2
          Value count 3  Value count 3
Column b  Value count 1  Value count 1
          Value count 2  Value count 2
          Value count 3  Value count 3
The main idea is to check for data anomalies between the data for the two days.
I only know that, for each column in each dataframe, I must do something like this:
df1.Column.value_counts().head(3)
But this doesn't show the combined results the way I want. Please help!
You can compare them if the column names are the same in both DataFrames: first use a lambda function with Series.value_counts to take the top 3 and create a default index in both DataFrames, then join them with concat and, for the expected order, add DataFrame.stack:
np.random.seed(2022)
df1 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df2 = pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c')
df11 = df1.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df22 = df2.apply(lambda x: x.value_counts().head(3).reset_index(drop=True))
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
Or use DataFrame.compare:
df = (df11.compare(df22, keep_equal=True)
          .rename(columns={'self':'df1', 'other':'df2'})
          .stack(0)
          .sort_index(level=1))
print (df)
df1 df2
0 c0 7 8
1 c0 6 8
2 c0 6 6
0 c1 8 9
1 c1 7 7
2 c1 7 7
0 c2 9 7
1 c2 7 7
2 c2 7 6
0 c3 9 7
1 c3 7 7
2 c3 7 6
0 c4 11 14
1 c4 7 8
2 c4 7 7
EDIT: To add categories, use f-strings to join the indices and values of the Series in a list comprehension:
np.random.seed(2022)
df1 = 'Cat1' + pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df2 = 'Cat2' + pd.DataFrame(np.random.randint(10, size=(50,5))).add_prefix('c').astype(str)
df11 = df1.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df22 = df2.apply(lambda x: [f'{a} - {b}' for a, b in x.value_counts().head(3).items()])
df = pd.concat([df11, df22], axis=1, keys=('df1','df2')).stack().sort_index(level=1)
print (df)
df1 df2
0 c0 Cat18 - 7 Cat29 - 8
1 c0 Cat11 - 6 Cat24 - 8
2 c0 Cat19 - 6 Cat23 - 6
0 c1 Cat17 - 8 Cat24 - 9
1 c1 Cat10 - 7 Cat26 - 7
2 c1 Cat14 - 7 Cat20 - 7
0 c2 Cat13 - 9 Cat28 - 7
1 c2 Cat11 - 7 Cat25 - 7
2 c2 Cat19 - 7 Cat26 - 6
0 c3 Cat15 - 9 Cat20 - 7
1 c3 Cat18 - 7 Cat24 - 7
2 c3 Cat13 - 7 Cat27 - 6
0 c4 Cat12 - 11 Cat25 - 14
1 c4 Cat13 - 7 Cat20 - 8
2 c4 Cat15 - 7 Cat26 - 7
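Since the random frames above make the mechanics hard to trace, here is a small deterministic sketch of the same concat + stack pattern (the frame contents are made up for illustration, with top-2 instead of top-3 counts):

```python
import pandas as pd

df1 = pd.DataFrame({'c0': list('aabbb'), 'c1': list('xxxyz')})
df2 = pd.DataFrame({'c0': list('abbcc'), 'c1': list('yyzzz')})

# Top-2 counts per column, with a default 0..n-1 index so rows align
t1 = df1.apply(lambda s: s.value_counts().head(2).reset_index(drop=True))
t2 = df2.apply(lambda s: s.value_counts().head(2).reset_index(drop=True))

# Join side by side, then stack the column level so the frames sit next to each other
side = pd.concat([t1, t2], axis=1, keys=('df1', 'df2')).stack().sort_index(level=1)
```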

Python Dataframe rename column compared to value

With Pandas, I'm trying to rename the unnamed columns in a dataframe using the values in the first row of data.
My dataframe:
id  store  unnamed: 1  unnamed: 2  windows  unnamed: 3  unnamed: 4
0   B1     B2          B3          B1       B2          B3
1   2      c           12          15       15          14
2   4      d           35          14       14          87
My wanted result:
id  store_B1  store_B2  store_B3  windows_B1  windows_B2  windows_B3
0   B1        B2        B3        B1          B2          B3
1   2         c         12        15          15          14
2   4         d         35        14          14          87
I don't know how to match the column names with the values in my data. Thanks for your help. Regards
You can use df.columns.where to make unnamed: columns NaN, then convert it to a Series and use ffill:
df.columns = (pd.Series(df.columns.where(~df.columns.str.startswith('unnamed:'))).ffill()
              + np.where(~df.columns.isin(['id', 'col2']),
                         ('_' + df.iloc[0].astype(str)).tolist(),
                         ''))
Output:
>>> df
id store_B1 store_B2 store_B3 windows_B1 windows_B2 windows_B3
0 0 B1 B2 B3 B1 B2 B3
1 1 2 c 12 15 15 14
2 2 4 d 35 14 14 87
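The one-liner packs several steps together; a step-by-step sketch of the same idea may be easier to follow. The frame below is a hypothetical reconstruction of the example (the id column is left out for brevity):

```python
import pandas as pd

df = pd.DataFrame([['B1', 'B2', 'B3', 'B1', 'B2', 'B3'],
                   [2, 'c', 12, 15, 15, 14]],
                  columns=['store', 'unnamed: 1', 'unnamed: 2',
                           'windows', 'unnamed: 3', 'unnamed: 4'])

# Step 1: turn the 'unnamed:' placeholders into NaN, then forward-fill
# so each column inherits the last real name to its left
base = pd.Series(df.columns.where(~df.columns.str.startswith('unnamed:'))).ffill()

# Step 2: suffix each name with the value from the first data row
df.columns = base + '_' + df.iloc[0].astype(str).to_numpy()
```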

how to sort a pandas dataframe according to elements of list [duplicate]

I have the following example dataframe:
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, but the shuffle there is random. Correct me if I'm wrong.
Shuffle DataFrame rows
If the values are unique both in the list and in the c1 column, use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
A general solution that works with duplicates both in the list and in the column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
# create a df from the list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
# helper column to enumerate duplicate values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
# merge together and drop the helper g column
df = list_df.merge(df).drop('g', axis=1)
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
5 3 c
merge
You can create a dataframe with the column specified in the wanted order then merge.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates are not wanted, care must be taken to handle them prior to reordering.
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
searchsorted
This is less robust, but will work if df.c1 is:
- already sorted
- a one-to-one mapping
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a
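A self-contained sketch of the searchsorted route under those two conditions (sorted, one-to-one):

```python
import pandas as pd

# c1 is already sorted and maps one-to-one onto c2
df = pd.DataFrame({'c1': [1, 2, 3, 4, 5], 'c2': list('abcde')})
c1 = [3, 2, 5, 4, 1]

# For sorted data, searchsorted finds each list value's row position
out = df.iloc[df.c1.searchsorted(c1)]
```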

Conditionally highlighting totals that don't match a dynamically summed range

Objective:
I am looking to use conditional formatting to highlight cells in rows that are not equal to the sum of a dynamic range.
Problem:
While the formula I have created seems to work when pasted into cells, it does not give the same results when entered as a conditional formula.
Example:
Here is a space delimited example to be pasted into "A1":
Allo d1 d2 d3 d4 d5
Total 10 10 10 10 10
A 9 9 10 10 9
B 0 0 0 0 0
C 0 1 0 0 0
Total 12 12 12 12 12
B 0 5 0 3 4
C 12 7 8 8 8
Total 12 12 12 12 12
A 0 0 0 0 0
B 0 0 0 0 0
C 0 5 0 3 4
D 12 7 8 8 8
I wrote this formula, which shows TRUE and FALSE correctly when pasted into "H2" and dragged right and down to "L13". When I apply this formula to the data range "B2:F13" as a conditional-formatting rule, it does not behave as I'd expect.
=IF($A2="TOTAL", B2 <> SUM(INDIRECT(ADDRESS(ROW(B3),COLUMN(B3),4)&":"&ADDRESS(ROW(B2)+IFERROR(MATCH("TOTAL",$A3:$A$13,0)-1,ROW($A$13)-ROW($A2)),COLUMN(B2),4))))
Below, the formula is broken out in an easier-to-read way. Is my formula flawed? How can I accomplish what I am trying to do? I appreciate your thoughts.
=IF($A2="TOTAL",
B2 <> SUM(
INDIRECT( ADDRESS( ROW(B3),
COLUMN(B3),
4) &":"&
ADDRESS( ROW(B2) + IFERROR(
MATCH( "TOTAL", $A3:$A$13, 0)-1,
ROW($A$13)-ROW($A2)),
COLUMN(B2),4))))
Use OFFSET instead:
=IF($A2="TOTAL",B2<> SUM(OFFSET(B3,0,0,IFERROR(MATCH("TOTAL",$A3:$A$13,0)-1,ROWS($A3:$A$13)),1)))

Adding values to list using vlookup

I have a big list like:
Actual (sheet "A"):    Expected (sheet "B"):
A  Values              A  Values
1  5                   1  5
3  12                  2  0
5  11                  3  12
                       4  0
                       5  11
I used the formula =vlookup(A2, A!A2:A!B3, 2, FALSE) in B2 of sheet B and it returns "#N/A".
In sheet B, cell B2, use this: =IFERROR(VLOOKUP(A2,A!$A:$B,2,0),"No Match")
