Convert grouped dataframe to dictionary with apply/lambda

I have a df like this:
A B C
0 11 one 5
1 11 two 7
2 11 three 9
3 22 one 11
4 22 two 13
I'd like to convert this df to a dictionary like this:
{11: [(one, 5), (two, 7), (three, 9)],
 22: [(one, 11), (two, 13)]}
Here is what I tried:
df_dict = df.groupby('A').apply(lambda y: {(x.B, x.C) for i, x in y.iterrows()})
The actual results I got from this code:
11 {(one, 5), (two, 7), (three, 9)}
22 {(one, 11), (two, 13)}

Try:
df.groupby('A').apply(lambda x: list(zip(x.B, x.C))).to_dict()
{11: [('one', 5), ('two', 7), ('three', 9)], 22: [('one', 11), ('two', 13)]}
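Incidentally, the original attempt was close: the {...} comprehension builds a set, which is unordered and prints with braces. A minimal sketch of the same idea with a list comprehension keeps the order, though iterrows() is slower than the zip approach above:
# List comprehension instead of set comprehension preserves row order
df_dict = df.groupby('A').apply(lambda y: [(x.B, x.C) for _, x in y.iterrows()]).to_dict()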


How to convert tuples in a column of a pandas dataframe into repeating rows and columns?

I have a dataframe which contains the following data (only 3 samples are provided here):
import pandas as pd

data = {'Department' : ['D1', 'D2', 'D3'],
        'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                      [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                      [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
# Create DataFrame
df = pd.DataFrame(data)
Basically, each row contains a list of tuples with the top 10 words and their frequencies for each Department.
I want to create a dataframe where the department name is repeated and each row contains one word from the tuples in one column and its frequency count in another, so that it looks something like this:
Department Word Counts
D1 cat 6
D1 project 6
D1 dog 6
D1 develop 4
D1 smooth 4
D1 efficient 4
D1 administrative 4
D1 procedure 4
D1 establishment 3
D1 matter 3
D2 management 21
D2 satisfaction 12
D2 within 9
D2 budget 9
D2 township 9
Is there any workaround for this type of conversion?
I'd suggest you do the wrangling before loading into a dataframe, with the data dictionary:
import numpy as np

length = [len(entry) for entry in data['TopWords']]
department = {'Department': np.repeat(data['Department'], length)}
(pd
 .DataFrame([ent for entry in data['TopWords'] for ent in entry],
            columns=['Word', 'Counts'])
 .assign(**department)
)
Word Counts Department
0 cat 6 D1
1 project 6 D1
2 dog 6 D1
3 develop 4 D1
4 smooth 4 D1
5 efficient 4 D1
6 administrative 4 D1
7 procedure 4 D1
8 establishment 3 D1
9 matter 3 D1
10 management 21 D2
11 satisfaction 12 D2
12 within 9 D2
13 budget 9 D2
14 township 9 D2
15 site 9 D2
16 periodic 9 D2
17 admin 9 D2
18 maintain 9 D2
19 guest 6 D2
20 manage 2 D3
21 ir 2 D3
22 mines 2 D3
23 implimentation 2 D3
24 clrg 2 D3
25 act 2 D3
26 implementations 2 D3
27 office 2 D3
28 maintenance 2 D3
29 administration 2 D3
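One cosmetic note on the output above: .assign appends Department as the last column. A hypothetical out binding (the name is not in the original answer) plus plain column indexing would match the order asked for in the question:
# Bind the chained expression to a name, then reorder the columns
out = (pd.DataFrame([ent for entry in data['TopWords'] for ent in entry],
                    columns=['Word', 'Counts'])
         .assign(**department))
out = out[['Department', 'Word', 'Counts']]  # plain indexing to reorder columns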
First, use DataFrame.explode to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign + Series.str:
res = (
    df.explode('TopWords', ignore_index=True)
      .assign(Word=lambda df: df['TopWords'].str[0],
              Counts=lambda df: df['TopWords'].str[1])
      .drop(columns='TopWords')
)
Output:
>>> res
Department Word Counts
0 D1 cat 6
1 D1 project 6
2 D1 dog 6
3 D1 develop 4
4 D1 smooth 4
5 D1 efficient 4
6 D1 administrative 4
7 D1 procedure 4
8 D1 establishment 3
9 D1 matter 3
10 D2 management 21
11 D2 satisfaction 12
12 D2 within 9
13 D2 budget 9
14 D2 township 9
15 D2 site 9
16 D2 periodic 9
17 D2 admin 9
18 D2 maintain 9
19 D2 guest 6
20 D3 manage 2
21 D3 ir 2
22 D3 mines 2
23 D3 implimentation 2
24 D3 clrg 2
25 D3 act 2
26 D3 implementations 2
27 D3 office 2
28 D3 maintenance 2
29 D3 administration 2
As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster if you wrangle it before loading it into a DataFrame.
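A variant of the same explode idea that splits the tuples in one shot, assuming every tuple has exactly two elements (plain pandas throughout):
# Explode, then turn the tuple column into two real columns in one assignment
tmp = df.explode('TopWords', ignore_index=True)
tmp[['Word', 'Counts']] = pd.DataFrame(tmp['TopWords'].tolist())  # tuple -> two columns
res = tmp.drop(columns='TopWords')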
Another way of doing it, using a nested loop:
data = {'Department' : ['D1', 'D2', 'D3'],
        'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
                      [('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
                      [('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
records = []
for idx, top_words_list in enumerate(data['TopWords']):
    for word, count in top_words_list:
        rec = {
            'Department': data['Department'][idx],
            'Word': word,
            'Count': count
        }
        records.append(rec)
res = pd.DataFrame(records)
In addition to @sammywemmy's answer, the following approach does not need the numpy package; however, due to the double loop, it might not be as performant on large datasets.
d = {"Department": [], "Words": [], "Count": []}
for idx, department in enumerate(data["Department"]):
for word, count in data["TopWords"][idx]:
d["Department"].append(department)
d["Words"].append(word)
d["Count"].append(count)
print(pd.DataFrame(d))
@Rodalm made this code more readable by using enumerate(). Previously I had used a simple range().
One option using a dictionary comprehension:
(df
 .drop(columns='TopWords')
 .join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
                  for k, x in enumerate(df['TopWords'])})
         .droplevel(1))
)
output:
Department Word Counts
0 D1 cat 6
0 D1 project 6
0 D1 dog 6
0 D1 develop 4
0 D1 smooth 4
0 D1 efficient 4
0 D1 administrative 4
0 D1 procedure 4
0 D1 establishment 3
0 D1 matter 3
1 D2 management 21
1 D2 satisfaction 12
1 D2 within 9
1 D2 budget 9
1 D2 township 9
1 D2 site 9
1 D2 periodic 9
1 D2 admin 9
1 D2 maintain 9
1 D2 guest 6
2 D3 manage 2
2 D3 ir 2
2 D3 mines 2
2 D3 implimentation 2
2 D3 clrg 2
2 D3 act 2
2 D3 implementations 2
2 D3 office 2
2 D3 maintenance 2
2 D3 administration 2

Take the mean of the top 10% of values per date

I have a dataframe that consists of a day column and a score column, with many values per day. I need to get the mean of the top 10% of values per day, i.e. the output should be a day column and the mean of that day's top 10% of values.
Here is an example dataset:
{'Date': [datetime.date(2021, 4, 1)] * 30,
 'value': [3.35, 1.85, 1.3, 1.85, 1.85, 1.17, 1.17, 2.8, 1.43, 2.54,
           1.22, 2.54, 1.17, 1.17, 2.71, 5.98, 1.39, 1.48, 16.46, 1.43,
           8.39, 33.99, 2.54, 11.8, 2.13, 2.24, 2.92, 1.35, 1.54, 2.52]}
Should be pretty simple.
Assuming you're using pandas and this is a pandas dataframe called df with columns date and value.
First, create a demo dataframe and import the required packages (you would presumably load your own table as a dataframe):
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
#If you need to save results as a DataFrame later on
res = pd.DataFrame(columns = ['date','top_10p_mean'])
Filter based on dates
Basically, get the list of distinct dates and iterate through them, collecting each date's values in a list:
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
Sort filtered set by values
Sort the list in reverse order to have the highest values at the top (omit the reverse=True part to keep the original order):
    temp.sort(reverse=True)
Take mean of top 10% values
This will calculate the number of items in the top 10% of the list (the count is rounded up to the next integer), take those values, and calculate their mean.
Further explanation of the functions for beginners:
First, "round_up_to_next_integer(total_number_of_items(in_list) * 10%)",
then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])".
    avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
Print it or save in a new DataFrame
Print the results and append them to the previously created empty DataFrame:
    print('Mean value on ' + str(date) + ' = ' + str(avg))
    res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
So in total it should work something like this -
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
df
Out[]:
date value
0 2021-04-01 12
1 2021-04-01 32
2 2021-04-01 12
3 2021-04-01 23
4 2021-04-01 12
5 2021-04-02 14
6 2021-04-02 15
7 2021-04-02 54
8 2021-04-02 43
9 2021-04-02 64
10 2021-04-02 21
11 2021-04-02 15
res = pd.DataFrame(columns = ['date','top_10p_mean'])
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
    temp.sort(reverse=True)
    print(temp)  # Just to show what it looks like
    avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
    print('\nMean value on ' + str(date) + ' = ' + str(avg) + '\n')
    res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
Out[]:
[32, 23, 12, 12, 12]
Mean value on 2021-04-01 = 32
[64, 54, 43, 21, 15, 15, 14]
Mean value on 2021-04-02 = 64
res
Out[]:
date top_10p_mean
0 2021-04-01 32
1 2021-04-02 64
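For larger tables, here is a hedged, loop-free sketch of the same per-day computation using groupby plus Series.nlargest (both standard pandas); it also sidesteps DataFrame.append, which was removed in pandas 2.0:
import math
import pandas as pd

def top_10p_mean(s):
    # Number of values in the top 10%, rounded up so at least one remains
    k = math.ceil(len(s) * 0.1)
    return s.nlargest(k).mean()

res = (df.groupby('date')['value']
         .apply(top_10p_mean)
         .reset_index(name='top_10p_mean'))
On the demo data this gives the same 32 and 64 as the loop above.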
df.nlargest is what you want.
First determine how many values correspond to 10% by running (df being your dataframe):
highest10p = math.ceil(0.1 * len(df))  # nlargest expects an integer count
Then you can select the largest 10% of values in the value column by using:
df.nlargest(highest10p, 'value')
So if you want the mean, use the .mean() function on that column:
df.nlargest(highest10p, 'value')['value'].mean()
Note that this takes the top 10% over the whole frame; to compute it per day, combine it with a groupby on the date column as in the other answer.

Combine a MultiIndex dataframe with an Int64Index dataframe

I have one dataframe with a MultiIndex:
result =
MultiIndex([(1, 'HK_MN001'),
            (2, 'HK_MN001'),
            (3, 'HK_MN002'),
            (4, 'HK_MN003'),
            (5, 'HK_MN004'),
            (6, 'HK_MN005'),
            (7, 'HK_MN005'),
            (8, 'HK_MN005')],
           names=['ID1', 'ID2'])
Another dataframe with index:
photo_df:
Int64Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int64', name='ID1')
I want to concatenate both dataframes, but it gives me an error.
Code:
result = pd.concat([result,photo_df], axis = 1,sort=False)
Error:
NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
The photo_df dataframe is:
PhotoID raw_photo
0 1 HK_MN001_DSC_2160_161014Ushio.JPG
1 2 HK_MN001_DSC_2308_161014Ushio.JPG
2 3 HK_MN002_DSC_2327_161014Ushio.JPG
3 4 HK_MN003_DSC_1474_181015Ushio.jpg
4 5 HK_MN004_DSC_1491_181015Ushio.jpg
5 6 HK_MN005_DSC_1506_181015Ushio.JPG
6 7 HK_MN005_DSC_1527_181015Ushio.JPG
7 8 HK_MN005_DSC_1528_181015Ushio.jpg
Required output dataframe (if possible, drop the ID1 index):
I think you need to create a matching index in both DataFrames:
photo_df = photo_df.set_index('PhotoID', drop=False)
photo_df.columns = pd.MultiIndex.from_product([photo_df.columns, ['']])
print (photo_df)
PhotoID raw_photo
PhotoID
1 1 HK_MN001_DSC_2160_161014Ushio.JPG
2 2 HK_MN001_DSC_2308_161014Ushio.JPG
3 3 HK_MN002_DSC_2327_161014Ushio.JPG
4 4 HK_MN003_DSC_1474_181015Ushio.jpg
5 5 HK_MN004_DSC_1491_181015Ushio.jpg
6 6 HK_MN005_DSC_1506_181015Ushio.JPG
7 7 HK_MN005_DSC_1527_181015Ushio.JPG
8 8 HK_MN005_DSC_1528_181015Ushio.jpg
# second level ID2 becomes a column
result = pd.concat([result.reset_index(level=1), photo_df], axis=1, sort=False)
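As a hedged alternative, assuming it is acceptable to keep ID2 as an ordinary column rather than an index level, and starting from the original photo_df (PhotoID still a column), you could flatten result down to ID1 and join on the shared integer key; set_index and join are standard pandas:
flat = result.reset_index(level='ID2')          # ID2 moves from the index to a column
out = flat.join(photo_df.set_index('PhotoID'))  # aligns result's ID1 with PhotoID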

Extract tuple parts to create another two tuples

I have this dataset:
duplicates id userid timestamp_date
0 (007, us1, us2, 6, 7, 1) b us1 1
1 (001, us1, us2, 1, 9, 8) b us2 7
2 (009, us1, us2, 1, 28, 27) b us1 8
3 (007, us1, us2, 6, 7, 1) c us2 9
4 (009, us2, us1, 1, 29, 28) c us4 10
d = pd.DataFrame({'duplicates': [("007", "us1", "us2", 6, 7, 1), ("001", "us1", "us2", 1, 9, 8), ("009", "us1", "us2", 1, 28, 27), ("007", "us1", "us2", 6, 7, 1), ("009", "us2", "us1", 1, 29, 28)],
                  'id': ["b", "b", "b", "c", "c"],
                  'userid': ["us1", "us2", "us1", "us2", "us4"],
                  'timestamp_date': [1, 7, 8, 9, 10]})
And I want to extract the tuples in the following way:
tuple(a, b, c, d, e, f) -> tuple(a, b, null, e) and tuple(a, c, d, f).
So the result should be:
duplicates id
0 (007, us1, null, 7) b
1 (007, us2, 6, 1) b
2 (001, us1, null, 9) b
3 (001, us2, 1, 8) b
4 (009, us1, null, 28) b
5 (009, us2, 1, 27) b
6 (007, us1, null, 7) c
7 (007, us2, 6, 1) c
8 (009, us2, null, 29) c
9 (009, us1, 1, 28) c
e = pd.DataFrame({'duplicates': [("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("001", "us1", None, 9), ("001", "us2", 1, 8),
                                 ("009", "us1", None, 28), ("009", "us2", 1, 27),
                                 ("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("009", "us2", None, 29), ("009", "us1", 1, 28)],
                  'id': ["b", "b", "b", "b", "b", "b", "c", "c", "c", "c"]})
I don't like to post questions without code, but I really have no idea where to start, and I couldn't find this in other questions. I tried to use zip with apply(), but I don't think that's the right way, because I couldn't even make the runtime errors stop appearing.
You can use .apply() to split each tuple into a list of two tuples and then .explode():
d = (d.assign(duplicates=d['duplicates'].apply(lambda x: [(x[0], x[1], None, x[4]),
                                                          (x[0], x[2], x[3], x[5])]))
      .explode('duplicates')
      .drop(columns=['userid', 'timestamp_date']))
print(d)
Prints:
duplicates id
0 (007, us1, None, 7) b
0 (007, us2, 6, 1) b
1 (001, us1, None, 9) b
1 (001, us2, 1, 8) b
2 (009, us1, None, 28) b
2 (009, us2, 1, 27) b
3 (007, us1, None, 7) c
3 (007, us2, 6, 1) c
4 (009, us2, None, 29) c
4 (009, us1, 1, 28) c
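If you also want the fresh 0-9 index shown in the question's expected output (explode keeps the source row labels, hence the repeated 0, 0, 1, 1, ... above), one more standard pandas call finishes the job:
d = d.reset_index(drop=True)  # replace the repeated labels with a clean RangeIndex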

How can I get the weight of an undirected edge in networkx?

import networkx as nx

G = nx.Graph()
connections = [(0, 1, 4), (0, 7, 8), (1, 7, 11), (1, 2, 8), (2, 8, 2), (7, 8, 7),
               (7, 6, 1), (8, 6, 6), (2, 5, 4), (6, 5, 2), (2, 3, 7), (3, 5, 14),
               (3, 4, 9), (5, 4, 10)]
G.add_weighted_edges_from(connections)
In this code, how can I get the weight between two nodes, e.g. 5 and 4?
For one edge:
G.edges[5,4]['weight']
> 10
For all edges of one node:
G.edges(5, data=True)
> EdgeDataView([(5, 2, {'weight': 4}), (5, 6, {'weight': 2}), (5, 3, {'weight': 14}), (5, 4, {'weight': 10})])
For all edges:
for u, v, w in G.edges(data=True):
    print(u, v, w['weight'])
> 0 1 4
> 0 7 8
> 1 7 11
> 1 2 8
> 7 8 7
> 7 6 1
> 2 8 2
> 2 5 4
> 2 3 7
> 8 6 6
> 6 5 2
> 5 3 14
> 5 4 10
> 3 4 9
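For completeness, two equivalent lookups, both part of the standard networkx API:
G[5][4]['weight']                    # adjacency-style access, also returns 10
nx.get_edge_attributes(G, 'weight')  # the whole mapping: {(0, 1): 4, (0, 7): 8, ...}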
