How can I get the weight of an undirected edge in networkx?

import networkx as nx

G = nx.Graph()
connections = [(0, 1, 4), (0, 7, 8), (1, 7, 11), (1, 2, 8), (2, 8, 2), (7, 8, 7),
               (7, 6, 1), (8, 6, 6), (2, 5, 4), (6, 5, 2), (2, 3, 7), (3, 5, 14),
               (3, 4, 9), (5, 4, 10)]
G.add_weighted_edges_from(connections)
In this code, how can I get the weight of the edge between two nodes, e.g. between 5 and 4?

For one edge:
G.edges[5, 4]['weight']
> 10
For all edges of one node:
G.edges(5, data=True)
> EdgeDataView([(5, 2, {'weight': 4}), (5, 6, {'weight': 2}), (5, 3, {'weight': 14}), (5, 4, {'weight': 10})])
For all edges:
for u, v, w in G.edges(data=True):
    print(u, v, w['weight'])
> 0 1 4
> 0 7 8
> 1 7 11
> 1 2 8
> 7 8 7
> 7 6 1
> 2 8 2
> 2 5 4
> 2 3 7
> 8 6 6
> 6 5 2
> 5 3 14
> 5 4 10
> 3 4 9
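For a single edge there are also two equivalent accessors in the standard networkx API, shown here against the graph built above:
G[5][4]['weight']                # adjacency-style lookup -> 10
G.get_edge_data(5, 4)['weight']  # via the edge's attribute dict -> 10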

Related

Compute new pandas column for the number of times a date intersects a list of date ranges

I have actually solved the problem, but I am looking for advice on a more elegant, pandas-oriented solution.
I have a pandas dataframe of linkedin followers with a date field. The data looks like this:
Date Sponsored followers Organic followers Total followers
0 2021-05-30 0 105 105
1 2021-05-31 0 128 128
2 2021-06-01 0 157 157
3 2021-06-02 0 171 171
4 2021-06-03 0 133 133
I have a second dataframe that contains the start and end dates for paid social campaigns. From it I created a list of tuples, where the first element is the start date and the second is the end date, converting the values to datetime.date objects:
[(datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
(datetime.date(2021, 7, 12), datetime.date(2021, 7, 13)),
(datetime.date(2021, 7, 13), datetime.date(2021, 7, 14)),
(datetime.date(2021, 7, 14), datetime.date(2021, 7, 15)),
(datetime.date(2021, 7, 16), datetime.date(2021, 7, 18)),
(datetime.date(2021, 7, 19), datetime.date(2021, 7, 21)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 8, 9), datetime.date(2021, 8, 12)),
(datetime.date(2021, 8, 12), datetime.date(2021, 8, 15)),
(datetime.date(2021, 9, 3), datetime.date(2021, 9, 7)),
(datetime.date(2021, 10, 22), datetime.date(2021, 11, 21)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 10)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 2)),
(datetime.date(2021, 11, 3), datetime.date(2021, 11, 4)),
(datetime.date(2021, 11, 5), datetime.date(2021, 11, 8)),
(datetime.date(2021, 11, 9), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 12), datetime.date(2021, 11, 16)),
(datetime.date(2021, 11, 11), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 25), datetime.date(2021, 11, 27)),
(datetime.date(2021, 11, 26), datetime.date(2021, 11, 28)),
(datetime.date(2021, 12, 8), datetime.date(2021, 12, 11))]
In order to create a new column in my main dataframe (a count of how many campaigns fall on any given day), I loop through each row of the dataframe and then through each element of my list using the following code:
is_campaign = []
for date in df['Date']:
    count = 0
    for date_range in campaign_dates:
        if date_range[0] <= date <= date_range[1]:
            count += 1
    is_campaign.append(count)
df['campaign'] = is_campaign
Which gives the following result:
df[df['campaign']!=0]
Date Sponsored followers Organic followers Total followers campaign
39 2021-07-08 0 160 160 1
40 2021-07-09 17 166 183 1
43 2021-07-12 0 124 124 1
44 2021-07-13 16 138 154 2
45 2021-07-14 22 158 180 2
... ... ... ... ... ...
182 2021-11-28 31 202 233 1
192 2021-12-08 28 357 385 1
193 2021-12-09 29 299 328 1
194 2021-12-10 23 253 276 1
195 2021-12-11 25 163 188 1
Any advice on how this could be done in a more efficient way, and specifically using pandas functionality would be appreciated.
My idea would be to use your second DataFrame alone to count the number of campaigns per date, and finally put the numbers into your first DataFrame. This way you only go through your list of date ranges once (or twice, counting the counting step).
Expand your list of date ranges into a list of dates. Note that a date occurring N times represents N campaigns on that date.
dates = [
    start_date + datetime.timedelta(day)
    for start_date, end_date in date_ranges
    for day in range((end_date - start_date).days + 1)
]
Then do the counting.
from collections import Counter
date_counts = Counter(dates)
Finally, put the numbers in.
df1['campaign'] = df1['Date'].map(pd.Series(date_counts))
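Putting the three steps together, a minimal end-to-end sketch, assuming df['Date'] holds datetime.date values so the keys line up with the Counter (campaign_dates is abbreviated here):
import datetime
from collections import Counter
import pandas as pd

campaign_dates = [(datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
                  (datetime.date(2021, 7, 12), datetime.date(2021, 7, 13))]
df = pd.DataFrame({'Date': [datetime.date(2021, 7, d) for d in range(7, 15)]})

# Expand each inclusive range into its individual dates and count them.
date_counts = Counter(
    start + datetime.timedelta(day)
    for start, end in campaign_dates
    for day in range((end - start).days + 1)
)

# Counter defines __missing__, so Series.map fills dates with no
# campaign with 0 rather than NaN (documented pandas behavior).
df['campaign'] = df['Date'].map(date_counts)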

Take the mean of the top 10% values according to the date

I have a dataframe with a date column and a score column, with many values per day. I need the mean of the top 10% of values for each day, i.e. an output with one row per day and the mean of that day's top 10% of values.
An example dataset:
{'Date':
[
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1),
datetime.date(2021, 4, 1)],
'value': [
3.35,
1.85,
1.3,
1.85,
1.85,
1.17,
1.17,
2.8,
1.43,
2.54,
1.22,
2.54,
1.17,
1.17,
2.71,
5.98,
1.39,
1.48,
16.46,
1.43,
8.39,
33.99,
2.54,
11.8,
2.13,
2.24,
2.92,
1.35,
1.54,
2.52]}
Should be pretty simple.
Assuming you're using pandas and this is a pandas dataframe called df with columns date and value.
First, import the required packages and create a demo dataframe (in practice you would load your own table as a dataframe):
import pandas as pd
import math
import statistics

df = pd.DataFrame({'date': ['2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01',
                            '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02',
                            '2021-04-02', '2021-04-02'],
                   'value': [12, 32, 12, 23, 12, 14, 15, 54, 43, 64, 21, 15]})

# If you need to save results as a DataFrame later on
res = pd.DataFrame(columns=['date', 'top_10p_mean'])
Filter based on dates
Get the distinct dates and iterate through them, collecting each day's values in a list:
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
Sort filtered set by values
Sort the list in descending order so the highest values come first (without reverse=True the sort is ascending and you would end up taking the smallest values instead):
    temp.sort(reverse=True)
Take mean of top 10% values
This calculates the number of items in the top 10% of the list (rounded up to the next integer), takes those values, and computes their mean.
Further explanation of the functions for beginners -
First "round_up_to_next_integer(total_number_of_items(in_list) * 10%)"
Then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])"
    avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
Print it or save in a new DataFrame
Print the results and append them to the previously created empty DataFrame (note that DataFrame.append was removed in pandas 2.0; see the variant after the full example below):
    print('Mean value on ' + str(date) + ' = ' + str(avg))
    res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
So in total it should work something like this -
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01', '2021-04-01',
                            '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02', '2021-04-02',
                            '2021-04-02', '2021-04-02'],
                   'value': [12, 32, 12, 23, 12, 14, 15, 54, 43, 64, 21, 15]})
df
Out[]:
date value
0 2021-04-01 12
1 2021-04-01 32
2 2021-04-01 12
3 2021-04-01 23
4 2021-04-01 12
5 2021-04-02 14
6 2021-04-02 15
7 2021-04-02 54
8 2021-04-02 43
9 2021-04-02 64
10 2021-04-02 21
11 2021-04-02 15
res = pd.DataFrame(columns=['date', 'top_10p_mean'])
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
    temp.sort(reverse=True)
    print(temp)  # Just to show what it looks like
    avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
    print('\nMean value on ' + str(date) + ' = ' + str(avg) + '\n')
    res = res.append({'date': date, 'top_10p_mean': avg}, ignore_index=True)
Out[]:
[32, 23, 12, 12, 12]
Mean value on 2021-04-01 = 32
[64, 54, 43, 21, 15, 15, 14]
Mean value on 2021-04-02 = 64
res
Out[]:
date top_10p_mean
0 2021-04-01 32
1 2021-04-02 64
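Since DataFrame.append was removed in pandas 2.0, here is a sketch of the same loop that collects plain dicts and builds the result frame once at the end, which is the generally recommended pattern anyway:
rows = []
for date in df['date'].unique():
    temp = sorted(df['value'][df['date'] == date], reverse=True)
    rows.append({'date': date,
                 'top_10p_mean': statistics.mean(temp[:math.ceil(len(temp) * 0.1)])})
res = pd.DataFrame(rows)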
df.nlargest is what you want.
First determine how many values correspond to 10% (nlargest needs an integer count, so round up with math.ceil, importing math first), df being your dataframe:
highest10p = math.ceil(0.1 * len(df))
Then you can select that many largest values in the value column by using
df.nlargest(highest10p, 'value')
So if you want the mean of those values, use the .mean() function on the value column:
df.nlargest(highest10p, 'value')['value'].mean()
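Note that this takes the top 10% over the whole frame; for the per-day means the question asks for, the same idea can be grouped by date. A sketch (math.ceil keeps at least one value per day):
import math
top10_mean = (df.groupby('date')['value']
                .apply(lambda s: s.nlargest(math.ceil(0.1 * len(s))).mean()))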

Extract tuple parts to create another two tuples

I have this dataset:
duplicates id userid timestamp_date
0 (007, us1, us2, 6, 7, 1) b us1 1
1 (001, us1, us2, 1, 9, 8) b us2 7
2 (009, us1, us2, 1, 28, 27) b us1 8
3 (007, us1, us2, 6, 7, 1) c us2 9
4 (009, us2, us1, 1, 29, 28) c us4 10
d = pd.DataFrame({'duplicates': [("007", "us1", "us2", 6, 7, 1), ("001", "us1", "us2", 1, 9, 8),
                                 ("009", "us1", "us2", 1, 28, 27), ("007", "us1", "us2", 6, 7, 1),
                                 ("009", "us2", "us1", 1, 29, 28)],
                  'id': ["b", "b", "b", "c", "c"],
                  'userid': ["us1", "us2", "us1", "us2", "us4"],
                  'timestamp_date': [1, 7, 8, 9, 10]})
And I want to split each tuple in the following way:
tuple(a, b, c, d, e, f) -> tuple(a, b, null, e) and tuple (a, c, d, f).
So the result should be:
duplicates id
0 (007, us1, null, 7) b
1 (007, us2, 6, 1) b
2 (001, us1, null, 9) b
3 (001, us2, 1, 8) b
4 (009, us1, null, 28) b
5 (009, us2, 1, 27) b
6 (007, us1, null, 7) c
7 (007, us2, 6, 1) c
8 (009, us2, null, 29) c
9 (009, us1, 1, 28) c
e = pd.DataFrame({'duplicates': [("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("001", "us1", None, 9), ("001", "us2", 1, 8),
                                 ("009", "us1", None, 28), ("009", "us2", 1, 27),
                                 ("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("009", "us2", None, 29), ("009", "us1", 1, 28)],
                  'id': ["b", "b", "b", "b", "b", "b", "c", "c", "c", "c"]})
I don't like to post questions without code, but I really have no idea where to start, and I couldn't find this in other questions. I tried zip with apply(), but I couldn't even get the runtime errors to stop, so I don't think that's the way.
You can use .apply() to split each tuple into a list of two tuples and then .explode():
d = (d.assign(duplicates=d['duplicates'].apply(
          lambda x: [(x[0], x[1], None, x[4]), (x[0], x[2], x[3], x[5])]))
      .explode('duplicates')
      .drop(columns=['userid', 'timestamp_date']))
print(d)
Prints:
duplicates id
0 (007, us1, None, 7) b
0 (007, us2, 6, 1) b
1 (001, us1, None, 9) b
1 (001, us2, 1, 8) b
2 (009, us1, None, 28) b
2 (009, us2, 1, 27) b
3 (007, us1, None, 7) c
3 (007, us2, 6, 1) c
4 (009, us2, None, 29) c
4 (009, us1, 1, 28) c
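Note that explode() repeats each source row's index (0, 0, 1, 1, ...), which is why the index above has duplicates; to get the clean 0..n-1 index shown in the expected output, chain a reset:
d = d.reset_index(drop=True)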

ST_Contains does not work correctly when filtering

I have the following table and data.
create table test (
    id bigserial not null,
    geo geometry not null
);
insert into test(geo)
values ('MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))'),
       ('POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))'),
       ('POLYGON ((2 2, 5 2, 5 5, 2 5, 2 2))');
select * from test;
id|geo |
--|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
5|MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))|
6|POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0)) |
7|POLYGON ((2 2, 5 2, 5 5, 2 5, 2 2)) |
The following query (Q) should return all rows:
select *
from test t
where st_contains('MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))',
                  geo);
id|geo |
--|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
5|MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))|
6|POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0)) |
but id 7 is missing from the result, even though the following check returns true:
select
st_contains('MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))',
'POLYGON ((2 2, 5 2, 5 5, 2 5, 2 2))');
What is wrong with query Q above?
The input geometry is invalid, so the result is undefined, as per the doc:
So ST_Contains(A,B) implies ST_Within(B,A), except in the case of invalid geometries, where the result is always false or not defined.
You can check validity directly:
WITH test(geo) as (
values ('MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))'),
('POLYGON ((0 0, 5 0, 5 5, 0 5, 0 0))'),
('POLYGON ((2 2, 5 2, 5 5, 2 5, 2 2))'))
select st_isvalid(geo), st_isvalidreason(geo) from test;
st_isvalid | st_isvalidreason
------------+-------------------------------------------
f | Too few points in geometry component[0 7]
t | Valid Geometry
t | Valid Geometry
That being said, you may want to read the docs on ST_Contains and ST_Covers carefully, as there are subtleties when the geometries share an edge.
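If repairing the data is an option, here is a sketch (in the same SQL) using ST_MakeValid, available since PostGIS 2.0, to fix both sides before testing containment; verify first that the repaired shapes match your intent:
select *
from test t
where st_contains(
          st_makevalid('MULTIPOLYGON (((0 0, 0 0, 0 7, 0 7, 0 0)), ((0 0, 0 7, 7 7, 7 0, 0 0)), ((0 0, 7 0, 7 0, 0 0, 0 0)), ((7 7, 7 7, 7 0, 7 0, 7 7)), ((0 7, 0 7, 7 7, 7 7, 0 7)), ((0 0, 7 0, 7 7, 0 7, 0 0)))'::geometry),
          st_makevalid(t.geo));  -- repair the stored geometry too, since the row-5 multipolygon is itself invalid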

convert grouped dataframe to dictionary with apply/lambda

I have a df like this
A B C
0 11 one 5
1 11 two 7
2 11 three 9
3 22 one 11
4 22 two 13
I'd like to convert this df to a dictionary like this:
{11: [(one, 5), (two, 7), (three, 9)]
22: [(one, 11), (two, 13)]}
df_dict = df.groupby('A').apply(lambda y: {(x.B, x.C) for i, x in y.iterrows()})
The actual results I got from this code:
11 {(one, 5), (two, 7), (three, 9)}
22 {(one, 11), (two, 13)}
Try:
df.groupby('A').apply(lambda x: list(zip(x.B, x.C))).to_dict()
{11: [('one', 5), ('two', 7), ('three', 9)], 22: [('one', 11), ('two', 13)]}
Here zip(x.B, x.C) pairs each B with its C within the group and list() keeps the pairs in order; .to_dict() then turns the resulting Series (indexed by A) into the desired dictionary. Your original attempt used a set comprehension ({...}), which is why you got unordered sets instead of lists, and it never called .to_dict().
