Extract tuple parts to create another two tuples - python-3.x

I have this dataset:
   duplicates                  id  userid  timestamp_date
0  (007, us1, us2, 6, 7, 1)    b   us1     1
1  (001, us1, us2, 1, 9, 8)    b   us2     7
2  (009, us1, us2, 1, 28, 27)  b   us1     8
3  (007, us1, us2, 6, 7, 1)    c   us2     9
4  (009, us2, us1, 1, 29, 28)  c   us4     10
d = pd.DataFrame({'duplicates': [("007", "us1", "us2", 6, 7, 1), ("001", "us1", "us2", 1, 9, 8),
                                 ("009", "us1", "us2", 1, 28, 27), ("007", "us1", "us2", 6, 7, 1),
                                 ("009", "us2", "us1", 1, 29, 28)],
                  'id': ["b", "b", "b", "c", "c"],
                  'userid': ["us1", "us2", "us1", "us2", "us4"],
                  'timestamp_date': [1, 7, 8, 9, 10]})
And I want to split each tuple in the following way:
tuple(a, b, c, d, e, f) -> tuple(a, b, null, e) and tuple(a, c, d, f).
So the result should be:
   duplicates            id
0  (007, us1, null, 7)   b
1  (007, us2, 6, 1)      b
2  (001, us1, null, 9)   b
3  (001, us2, 1, 8)      b
4  (009, us1, null, 28)  b
5  (009, us2, 1, 27)     b
6  (007, us1, null, 7)   c
7  (007, us2, 6, 1)      c
8  (009, us2, null, 29)  c
9  (009, us1, 1, 28)     c
e = pd.DataFrame({'duplicates': [("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("001", "us1", None, 9), ("001", "us2", 1, 8),
                                 ("009", "us1", None, 28), ("009", "us2", 1, 27),
                                 ("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("009", "us2", None, 29), ("009", "us1", 1, 28)],
                  'id': ["b", "b", "b", "b", "b", "b", "c", "c", "c", "c"]})
I don't like to post questions without code, but I really have no idea where to start, and I couldn't find anything in other questions. I tried to use zip with apply(), but I don't think that is the way, because I couldn't even make the runtime errors stop appearing.

You can use .apply() to split each tuple into a list of two tuples and then .explode():
d = (d.assign(duplicates=d['duplicates'].apply(lambda x: [(x[0], x[1], None, x[4]), (x[0], x[2], x[3], x[5])]))
      .explode('duplicates')
      .drop(columns=['userid', 'timestamp_date']))
print(d)
Prints:
duplicates id
0 (007, us1, None, 7) b
0 (007, us2, 6, 1) b
1 (001, us1, None, 9) b
1 (001, us2, 1, 8) b
2 (009, us1, None, 28) b
2 (009, us2, 1, 27) b
3 (007, us1, None, 7) c
3 (007, us2, 6, 1) c
4 (009, us2, None, 29) c
4 (009, us1, 1, 28) c
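Note that .explode() keeps the original index, which is why the index values repeat in pairs above. If you want a fresh 0-9 range like in your expected output, a reset_index step will do it:
d = d.reset_index(drop=True)  # renumber rows 0..9 after the explode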

Related

Export list of dictionaries from Python to Excel

I have a list called 'Aff' that consists of dictionaries. It looks like this:
Aff=[{('J', 0, 1): 36, ('J', 1, 1): 36, ('J', 2, 1): 42}, {('I', 0, 1): 36, ('I', 1, 1): 30}, {('H', 0, 1): 36, ('H', 1, 1): 36, ('H', 2, 1): 42}]
and I want to get this structure in Excel:
Num  letter  NV  Position  Q
1    J       0   1         36
1    J       1   1         36
1    J       2   1         42
2    I       0   1         36
2    I       1   1         36
etc.
Your dictionary keys being tuples forces you to change the structure of your data. IIUC, Num is the index (+1) of the dictionary in the list.
Once you have flattened your data, you can use DataFrame.to_excel():
import pandas as pd

Aff = [{('J', 0, 1): 36, ('J', 1, 1): 36, ('J', 2, 1): 42},
       {('I', 0, 1): 36, ('I', 1, 1): 30},
       {('H', 0, 1): 36, ('H', 1, 1): 36, ('H', 2, 1): 42}]

arr = []
for num, d in enumerate(Aff):
    for k, v in d.items():
        arr.append([num + 1] + list(k) + [v])

df = pd.DataFrame(arr, columns=['Num', 'letter', 'NV', 'Position', 'Q'])
df.to_excel('output.xlsx', index=False)
print(df) would output:
Num letter NV Position Q
0 1 J 0 1 36
1 1 J 1 1 36
2 1 J 2 1 42
3 2 I 0 1 36
4 2 I 1 1 30
5 3 H 0 1 36
6 3 H 1 1 36
7 3 H 2 1 42
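As a side note, writing .xlsx files requires an engine such as openpyxl to be installed (pip install openpyxl). And if you prefer a more compact construction, the same flattening can be written as a list comprehension; a minimal sketch:
rows = [[num + 1, *key, value]
        for num, d in enumerate(Aff)
        for key, value in d.items()]
df = pd.DataFrame(rows, columns=['Num', 'letter', 'NV', 'Position', 'Q'])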

Compute a new pandas column for the number of times a date intersects a list of date ranges

I have actually solved the problem, but I am looking for advice on a more elegant / pandas-oriented solution.
I have a pandas dataframe of LinkedIn followers with a date field. The data looks like this:
Date Sponsored followers Organic followers Total followers
0 2021-05-30 0 105 105
1 2021-05-31 0 128 128
2 2021-06-01 0 157 157
3 2021-06-02 0 171 171
4 2021-06-03 0 133 133
I have a second dataframe that contains the start and end dates for paid social campaigns. From it I created a list of tuples, where the first element of each tuple is the start date and the second is the end date. I converted these dates to datetime.date objects, as shown:
[(datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
(datetime.date(2021, 7, 12), datetime.date(2021, 7, 13)),
(datetime.date(2021, 7, 13), datetime.date(2021, 7, 14)),
(datetime.date(2021, 7, 14), datetime.date(2021, 7, 15)),
(datetime.date(2021, 7, 16), datetime.date(2021, 7, 18)),
(datetime.date(2021, 7, 19), datetime.date(2021, 7, 21)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 8, 9), datetime.date(2021, 8, 12)),
(datetime.date(2021, 8, 12), datetime.date(2021, 8, 15)),
(datetime.date(2021, 9, 3), datetime.date(2021, 9, 7)),
(datetime.date(2021, 10, 22), datetime.date(2021, 11, 21)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 10)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 2)),
(datetime.date(2021, 11, 3), datetime.date(2021, 11, 4)),
(datetime.date(2021, 11, 5), datetime.date(2021, 11, 8)),
(datetime.date(2021, 11, 9), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 12), datetime.date(2021, 11, 16)),
(datetime.date(2021, 11, 11), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 25), datetime.date(2021, 11, 27)),
(datetime.date(2021, 11, 26), datetime.date(2021, 11, 28)),
(datetime.date(2021, 12, 8), datetime.date(2021, 12, 11))]
In order to create a new column in my main dataframe (which is a count of how many campaigns fall on any given day), I loop through each row in my dataframe, and then through each element in my list, using the following code:
is_campaign = []
for date in df['Date']:
    count = 0
    for date_range in campaign_dates:
        if date_range[0] <= date <= date_range[1]:
            count += 1
    is_campaign.append(count)
df['campaign'] = is_campaign
Which gives the following result:
df[df['campaign']!=0]
Date Sponsored followers Organic followers Total followers campaign
39 2021-07-08 0 160 160 1
40 2021-07-09 17 166 183 1
43 2021-07-12 0 124 124 1
44 2021-07-13 16 138 154 2
45 2021-07-14 22 158 180 2
... ... ... ... ... ...
182 2021-11-28 31 202 233 1
192 2021-12-08 28 357 385 1
193 2021-12-09 29 299 328 1
194 2021-12-10 23 253 276 1
195 2021-12-11 25 163 188 1
Any advice on how this could be done in a more efficient way, and specifically using pandas functionality would be appreciated.
My idea would be to use your second DataFrame alone to count the number of campaigns by date, and finally put the numbers into your first DataFrame. In this way you only go through your list of date-ranges once (or twice if you also take the counting step into account).
Expand your list of date ranges into a flat list of dates. Note that a date that occurs N times represents N campaigns on that date.
dates = [
    start_date + datetime.timedelta(days=day)
    for start_date, end_date in date_ranges
    for day in range((end_date - start_date).days + 1)
]
Then do the counting.
from collections import Counter
date_counts = Counter(dates)
Finally, put the numbers in.
df1['campaign'] = df1['Date'].map(pd.Series(date_counts))
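One caveat: dates that never appear in the counter will map to NaN (which also makes the column float). Assuming you want zeros for campaign-free days, you can fill and cast afterwards:
df1['campaign'] = df1['Date'].map(pd.Series(date_counts)).fillna(0).astype(int)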

Take the mean of the top 10% values according to the date

I have a dataframe that consists of a day column and a score column, with many values per day. I need the mean of the top 10% of values per day, so the output should be a day column and the mean of the top 10% of values for that day.
Here is an example dataset:
{'Date': [datetime.date(2021, 4, 1)] * 30,
 'value': [3.35, 1.85, 1.3, 1.85, 1.85, 1.17, 1.17, 2.8, 1.43, 2.54,
           1.22, 2.54, 1.17, 1.17, 2.71, 5.98, 1.39, 1.48, 16.46, 1.43,
           8.39, 33.99, 2.54, 11.8, 2.13, 2.24, 2.92, 1.35, 1.54, 2.52]}
Should be pretty simple -
Assuming you're using pandas and this is a pandas dataframe called df with columns date and value.
Create a demo dataframe and import the required packages (you would probably import your own table as a dataframe):
import pandas as pd
import math
import statistics

df = pd.DataFrame({'date': ['2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01',
                            '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02',
                            '2021-04-02', '2021-04-02'],
                   'value': [12, 32, 12, 23, 12, 14, 15, 54, 43, 64, 21, 15]})
# If you need to save results as a DataFrame later on
res = pd.DataFrame(columns=['date', 'top_10p_mean'])
Filter based on dates
Basically, get the distinct dates and iterate through them, collecting each day's values in a list:
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
Sort filtered set by values
Sort the list in reverse order to have the highest values at the top (or omit the reverse=True part to keep the order as is):
    temp.sort(reverse=True)
Take mean of top 10% values
This calculates the number of items in the top 10% of the list (the index is rounded up to the next integer), takes those values, and computes the mean.
Further explanation for beginners:
First "round_up_to_next_integer(total_number_of_items(in_list) * 10%)"
Then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])"
    avg = statistics.mean(temp[0:math.ceil(len(temp) * 0.1)])
Print it or save in a new DataFrame
Print the results and append them to the previously created empty DataFrame. Note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead:
    print('Mean value on ' + str(date) + ' = ' + str(avg))
    res = pd.concat([res, pd.DataFrame([{'date': date, 'top_10p_mean': avg}])], ignore_index=True)
So in total it should work something like this -
import pandas as pd
import math
import statistics

df = pd.DataFrame({'date': ['2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01',
                            '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02',
                            '2021-04-02', '2021-04-02'],
                   'value': [12, 32, 12, 23, 12, 14, 15, 54, 43, 64, 21, 15]})
df
Out[]:
date value
0 2021-04-01 12
1 2021-04-01 32
2 2021-04-01 12
3 2021-04-01 23
4 2021-04-01 12
5 2021-04-02 14
6 2021-04-02 15
7 2021-04-02 54
8 2021-04-02 43
9 2021-04-02 64
10 2021-04-02 21
11 2021-04-02 15
res = pd.DataFrame(columns=['date', 'top_10p_mean'])
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
    temp.sort(reverse=True)
    print(temp)  # Just to show what it looks like
    avg = statistics.mean(temp[0:math.ceil(len(temp) * 0.1)])
    print('\nMean value on ' + str(date) + ' = ' + str(avg) + '\n')
    # DataFrame.append was removed in pandas 2.0; pd.concat is the drop-in replacement
    res = pd.concat([res, pd.DataFrame([{'date': date, 'top_10p_mean': avg}])], ignore_index=True)
Out[]:
[32, 23, 12, 12, 12]
Mean value on 2021-04-01 = 32
[64, 54, 43, 21, 15, 15, 14]
Mean value on 2021-04-02 = 64
res
Out[]:
date top_10p_mean
0 2021-04-01 32
1 2021-04-02 64
df.nlargest is what you want.
First determine how many values correspond to 10% by running (df being your dataframe, with import math for the rounding):
highest10p = math.ceil(0.1 * len(df))  # nlargest needs an integer
then you can select that many largest values in the value column by using
df.nlargest(highest10p, 'value')
So if you want the mean, use .mean() on the value column:
df.nlargest(highest10p, 'value')['value'].mean()
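Note that this computes the top 10% over the whole frame. To get it per day, nlargest can be combined with groupby; a minimal sketch along the same lines:
import math
top10_mean = (df.groupby('date')['value']
                .apply(lambda s: s.nlargest(math.ceil(0.1 * len(s))).mean())
                .reset_index(name='top_10p_mean'))
print(top10_mean)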

How can I get the weight of an undirected edge in networkx?

import networkx as nx

G = nx.Graph()
connections = [(0, 1, 4), (0, 7, 8), (1, 7, 11), (1, 2, 8), (2, 8, 2), (7, 8, 7),
               (7, 6, 1), (8, 6, 6), (2, 5, 4), (6, 5, 2), (2, 3, 7), (3, 5, 14),
               (3, 4, 9), (5, 4, 10)]
G.add_weighted_edges_from(connections)
In this code, how can I get the weight between two nodes, e.g. 5 and 4?
For one edge (note that the (5, 4) edge was added with weight 10):
G.edges[5, 4]['weight']
> 10
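Equivalently, you can use adjacency subscripting or get_edge_data:
G[5][4]['weight']
> 10
G.get_edge_data(5, 4)
> {'weight': 10}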
For all edges of one node:
G.edges(5, data=True)
> EdgeDataView([(5, 2, {'weight': 4}), (5, 6, {'weight': 2}), (5, 3, {'weight': 14}), (5, 4, {'weight': 10})])
For all edges:
for u, v, w in G.edges(data=True):
    print(u, v, w['weight'])
> 0 1 4
> 0 7 8
> 1 7 11
> 1 2 8
> 7 8 7
> 7 6 1
> 2 8 2
> 2 5 4
> 2 3 7
> 8 6 6
> 6 5 2
> 5 3 14
> 5 4 10
> 3 4 9

Transform integer value patterns in a column to a group

DataFrame
df = pd.DataFrame({'occurance': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
                   'value': [45, 3, 2, 12, 14, 32, 1, 1, 6, 4, 9, 32, 78, 96, 12, 6, 3]})
df
Expected output
df = pd.DataFrame({'occurance': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
                   'value': [45, 3, 2, 12, 14, 32, 1, 1, 6, 4, 9, 32, 78, 96, 12, 6, 3],
                   'group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4, 100, 5, 5, 5, 5]})
df
I need to transform the dataframe into the output. I am after a rule that treats 1 as the start of a new group, where a group consists of a single 1 followed by n zeroes. If the group criteria are not met, the row should be grouped as 100.
I tried something along the lines of:
bs = df[df.occurance.eq(1).any(1) & df.occurance.shift(-1).eq(0).any(1)].squeeze()
bs
Even when broken down, this could only boolean-select the group starts and nothing more.
Any help?
Create a mask comparing each 1 with the next value (a 1 immediately followed by another 1 breaks the group rule), filter those rows out of occurance, take the cumulative sum with Series.cumsum, and finally fill the removed positions with 100 via Series.reindex:
m = df.occurance.eq(1) & df.occurance.shift(-1).eq(1)
df['group'] = df.loc[~m, 'occurance'].cumsum().reindex(df.index, fill_value=100)
print(df)
occurance value group
0 1 45 1
1 0 3 1
2 0 2 1
3 0 12 1
4 1 14 2
5 0 32 2
6 0 1 2
7 0 1 2
8 0 6 2
9 0 4 2
10 1 9 3
11 0 32 3
12 1 78 100
13 1 96 4
14 0 12 4
15 0 6 4
16 0 3 4
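To make the mask logic easier to follow, here is a small check of the intermediate steps (assuming the df above):
m = df.occurance.eq(1) & df.occurance.shift(-1).eq(1)
print(m[m].index.tolist())  # [12] -- the lone row where a 1 is immediately followed by another 1
print(df.loc[~m, 'occurance'].cumsum().tail())  # rows 13-16 all end up in group 4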
