Compute new pandas column for the number of times a date intersects a list of date ranges - python-3.x

I have actually solved the problem, but I am looking for advice for a more elegant / pandas-orientated solution.
I have a pandas dataframe of LinkedIn followers with a date field. The data looks like this:
Date Sponsored followers Organic followers Total followers
0 2021-05-30 0 105 105
1 2021-05-31 0 128 128
2 2021-06-01 0 157 157
3 2021-06-02 0 171 171
4 2021-06-03 0 133 133
I have a second dataframe that contains the start and end dates for paid social campaigns. What I have done is create a list of tuples from this dataframe, where the first element of each tuple is the start date and the second is the end date. I converted these dates to datetime.date objects (a sketch of the conversion follows the list below):
[(datetime.date(2021, 7, 8), datetime.date(2021, 7, 9)),
(datetime.date(2021, 7, 12), datetime.date(2021, 7, 13)),
(datetime.date(2021, 7, 13), datetime.date(2021, 7, 14)),
(datetime.date(2021, 7, 14), datetime.date(2021, 7, 15)),
(datetime.date(2021, 7, 16), datetime.date(2021, 7, 18)),
(datetime.date(2021, 7, 19), datetime.date(2021, 7, 21)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 7, 30), datetime.date(2021, 8, 2)),
(datetime.date(2021, 8, 9), datetime.date(2021, 8, 12)),
(datetime.date(2021, 8, 12), datetime.date(2021, 8, 15)),
(datetime.date(2021, 9, 3), datetime.date(2021, 9, 7)),
(datetime.date(2021, 10, 22), datetime.date(2021, 11, 21)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 10)),
(datetime.date(2021, 10, 29), datetime.date(2021, 11, 2)),
(datetime.date(2021, 11, 3), datetime.date(2021, 11, 4)),
(datetime.date(2021, 11, 5), datetime.date(2021, 11, 8)),
(datetime.date(2021, 11, 9), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 12), datetime.date(2021, 11, 16)),
(datetime.date(2021, 11, 11), datetime.date(2021, 11, 12)),
(datetime.date(2021, 11, 25), datetime.date(2021, 11, 27)),
(datetime.date(2021, 11, 26), datetime.date(2021, 11, 28)),
(datetime.date(2021, 12, 8), datetime.date(2021, 12, 11))]
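(For reference, a minimal sketch of how such a list could be built, assuming a hypothetical second dataframe df2 with datetime64 columns named start and end:)
# Hypothetical column names; adjust to match the campaign dataframe.
campaign_dates = list(zip(df2['start'].dt.date, df2['end'].dt.date))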
In order to create a new column in my main dataframe (a count of how many campaigns fall on any given day), I loop through each row in my dataframe, and then through each element in my list, using the following code:
is_campaign = []
for date in df['Date']:
    count = 0
    for date_range in campaign_dates:
        if date_range[0] <= date <= date_range[1]:
            count += 1
    is_campaign.append(count)
df['campaign'] = is_campaign
Which gives the following result:
df[df['campaign']!=0]
Date Sponsored followers Organic followers Total followers campaign
39 2021-07-08 0 160 160 1
40 2021-07-09 17 166 183 1
43 2021-07-12 0 124 124 1
44 2021-07-13 16 138 154 2
45 2021-07-14 22 158 180 2
... ... ... ... ... ...
182 2021-11-28 31 202 233 1
192 2021-12-08 28 357 385 1
193 2021-12-09 29 299 328 1
194 2021-12-10 23 253 276 1
195 2021-12-11 25 163 188 1
Any advice on how this could be done in a more efficient way, and specifically using pandas functionality would be appreciated.

My idea would be to use your second DataFrame alone to count the number of campaigns by date, and finally put the numbers into your first DataFrame. In this way you only go through your list of date-ranges once (or twice if you also take the counting step into account).
Expand your list of date-ranges into a list of dates. Note that a date occurring N times represents N campaigns on that date.
dates = [
    start_date + datetime.timedelta(days=day)
    for start_date, end_date in date_ranges
    for day in range((end_date - start_date).days + 1)
]
Then do the counting.
from collections import Counter
date_counts = Counter(dates)
Finally, put the numbers in (dates with no campaign map to NaN, so fill those with 0):
df1['campaign'] = df1['Date'].map(pd.Series(date_counts)).fillna(0).astype(int)
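Alternatively, a fully vectorized sketch using numpy broadcasting (assuming df1['Date'] holds datetime.date values and campaign_dates is the list of (start, end) tuples from the question):
import numpy as np

starts = np.array([s for s, _ in campaign_dates], dtype='datetime64[D]')
ends = np.array([e for _, e in campaign_dates], dtype='datetime64[D]')
dates = np.array(df1['Date'].tolist(), dtype='datetime64[D]')

# Compare every date against every range at once and count the hits per date.
df1['campaign'] = ((starts <= dates[:, None]) & (dates[:, None] <= ends)).sum(axis=1)
This trades a len(df1) x len(campaign_dates) boolean intermediate for doing all the comparisons in one array operation.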

Take the mean of the top 10% values according to the date

I have a dataframe that consists of a day column and a score column, with many values per day. I need the mean of the top 10% of values per day, i.e. the output should be a day column and the mean of that day's top 10% of values.
Here is an example dataset:
`{'Date': [datetime.date(2021, 4, 1)] * 30,
 'value': [3.35, 1.85, 1.3, 1.85, 1.85, 1.17, 1.17, 2.8, 1.43, 2.54,
           1.22, 2.54, 1.17, 1.17, 2.71, 5.98, 1.39, 1.48, 16.46, 1.43,
           8.39, 33.99, 2.54, 11.8, 2.13, 2.24, 2.92, 1.35, 1.54, 2.52]}`
Should be pretty simple.
Assuming you're using pandas and this is a pandas DataFrame called df with columns date and value.
First, create a demo DataFrame and import the required packages (in practice you would import your own table as a DataFrame):
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
                   'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
#If you need to save results as a DataFrame later on
res = pd.DataFrame(columns = ['date','top_10p_mean'])
Filter based on dates
Basically getting a list of different dates and iterating through them to get values in a list
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
Sort filtered set by values
Sorting the list in descending order so the highest values come first (without reverse=True, sort() would put the highest values at the end instead):
    temp.sort(reverse=True)
Take mean of top 10% values
This will calculate the number of items in the top 10% of the list (the count is rounded up to the next integer), take those values and calculate the mean.
Further explanation of the functions for beginners -
First "round_up_to_next_integer(total_number_of_items(in_list) * 10%)"
Then "give_me_mean_of(list_items[from_index_0 : the_number_I_got_from_the_percentage_calculation])"
    avg = statistics.mean(temp[0:math.ceil(len(temp)*0.1)])
Print it or save in a new DataFrame
Printing the results and appending each row to the previously created empty DataFrame (note: DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
    print('Mean value on ' + str(date) + ' = ' + str(avg))
    res = pd.concat([res, pd.DataFrame([{'date': date, 'top_10p_mean': avg}])], ignore_index=True)
So in total it should work something like this -
import pandas as pd
import math
import statistics
df = pd.DataFrame({'date': ['2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-01','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02','2021-04-02'],
                   'value': [12,32,12,23,12,14,15,54,43,64,21,15]})
df
Out[]:
date value
0 2021-04-01 12
1 2021-04-01 32
2 2021-04-01 12
3 2021-04-01 23
4 2021-04-01 12
5 2021-04-02 14
6 2021-04-02 15
7 2021-04-02 54
8 2021-04-02 43
9 2021-04-02 64
10 2021-04-02 21
11 2021-04-02 15
res = pd.DataFrame(columns=['date', 'top_10p_mean'])
for date in df['date'].unique():
    temp = list(df['value'][df['date'] == date])
    temp.sort(reverse=True)
    print(temp)  # Just to show what it looks like
    avg = statistics.mean(temp[0:math.ceil(len(temp) * 0.1)])
    print('\nMean value on ' + str(date) + ' = ' + str(avg) + '\n')
    res = pd.concat([res, pd.DataFrame([{'date': date, 'top_10p_mean': avg}])],
                    ignore_index=True)  # DataFrame.append was removed in pandas 2.0
Out[]:
[32, 23, 12, 12, 12]
Mean value on 2021-04-01 = 32
[64, 54, 43, 21, 15, 15, 14]
Mean value on 2021-04-02 = 64
res
Out[]:
date top_10p_mean
0 2021-04-01 32
1 2021-04-02 64
df.nlargest is what you want.
First determine how many values correspond to 10% by running (df being your dataframe; nlargest expects an integer, so round up):
import math
highest10p = math.ceil(0.1 * len(df))
then you can select the top 10% largest values in the value column by using
df.nlargest(highest10p, 'value')
So if you want the mean of those values, use the .mean() function on that column:
df.nlargest(highest10p, 'value')['value'].mean()
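To get the per-day result in one step, here is a hedged sketch combining groupby with nlargest (same df and column names as above):
import math
res = (df.groupby('date')['value']
         .apply(lambda s: s.nlargest(math.ceil(0.1 * len(s))).mean())
         .reset_index(name='top_10p_mean'))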

Extract tuple parts to create another two tuples

I have this dataset:
duplicates id userid timestamp_date
0 (007, us1, us2, 6, 7, 1) b us1 1
1 (001, us1, us2, 1, 9, 8) b us2 7
2 (009, us1, us2, 1, 28, 27) b us1 8
3 (007, us1, us2, 6, 7, 1) c us2 9
4 (009, us2, us1, 1, 29, 28) c us4 10
d = pd.DataFrame({'duplicates': [("007", "us1", "us2", 6, 7, 1), ("001", "us1", "us2", 1, 9, 8), ("009", "us1", "us2", 1, 28, 27), ("007", "us1", "us2", 6, 7, 1), ("009", "us2", "us1", 1, 29, 28)],
                  'id': ["b", "b", "b", "c", "c"],
                  'userid': ["us1", "us2", "us1", "us2", "us4"],
                  "timestamp_date": [1, 7, 8, 9, 10]})
And I want to extract the tuples in the following way:
tuple(a, b, c, d, e, f) -> tuple(a, b, null, e) and tuple(a, c, d, f).
So the result should be:
duplicates id
0 (007, us1, null, 7) b
1 (007, us2, 6, 1) b
2 (001, us1, null, 9) b
3 (001, us2, 1, 8) b
4 (009, us1, null, 28) b
5 (009, us2, 1, 27) b
6 (007, us1, null, 7) c
7 (007, us2, 6, 1) c
8 (009, us2, null, 29) c
9 (009, us1, 1, 28) c
e = pd.DataFrame({'duplicates': [("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("001", "us1", None, 9), ("001", "us2", 1, 8),
                                 ("009", "us1", None, 28), ("009", "us2", 1, 27),
                                 ("007", "us1", None, 7), ("007", "us2", 6, 1),
                                 ("009", "us2", None, 29), ("009", "us1", 1, 28)],
                  'id': ["b", "b", "b", "b", "b", "b", "c", "c", "c", "c"]})
I don't like to post questions without code, but I really have no idea where to start, and I couldn't find this in other questions. I tried to use zip with apply(), but I don't think that's the way, because I couldn't even make the runtime errors stop appearing.
You can use .apply() to split the tuple to list of two tuples and then .explode():
d = (d.assign(duplicates=d['duplicates'].apply(lambda x: [(x[0], x[1], None, x[4]), (x[0], x[2], x[3], x[5])]))
     .explode('duplicates')
     .drop(columns=['userid', 'timestamp_date']))
print(d)
Prints:
duplicates id
0 (007, us1, None, 7) b
0 (007, us2, 6, 1) b
1 (001, us1, None, 9) b
1 (001, us2, 1, 8) b
2 (009, us1, None, 28) b
2 (009, us2, 1, 27) b
3 (007, us1, None, 7) c
3 (007, us2, 6, 1) c
4 (009, us2, None, 29) c
4 (009, us1, 1, 28) c
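Note that explode() repeats the original index (hence the 0, 0, 1, 1, ... above); if you want fresh sequential row numbers as in the desired output, reset it afterwards:
d = d.reset_index(drop=True)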

How can I get the weight of an undirected edge in networkx?

import networkx as nx
G=nx.Graph()
connections = [(0, 1, 4), (0, 7, 8), (1, 7, 11), (1, 2, 8), (2, 8, 2), (7, 8, 7),
               (7, 6, 1), (8, 6, 6), (2, 5, 4), (6, 5, 2), (2, 3, 7), (3, 5, 14),
               (3, 4, 9), (5, 4, 10)]
G.add_weighted_edges_from(connections)
In this code, how can I get the weight between two nodes (e.g. between nodes 5 and 4)?
For one edge:
G.edges[5,4]['weight']
> 10
For all edges of one node:
G.edges(5, data=True)
> EdgeDataView([(5, 2, {'weight': 4}), (5, 6, {'weight': 2}), (5, 3, {'weight': 14}), (5, 4, {'weight': 10})])
For all edges:
for u, v, w in G.edges(data=True):
    print(u, v, w['weight'])
> 0 1 4
> 0 7 8
> 1 7 11
> 1 2 8
> 7 8 7
> 7 6 1
> 2 8 2
> 2 5 4
> 2 3 7
> 8 6 6
> 6 5 2
> 5 3 14
> 5 4 10
> 3 4 9
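For completeness, two equivalent accessors from the standard networkx API:
G[5][4]['weight']        # adjacency-style subscripting -> 10
G.get_edge_data(5, 4)    # the edge's attribute dict -> {'weight': 10}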

Creating a vector containing the previous 10 row-column values for each pandas row

I am trying to create a vector of the previous 10 values from a pandas column and insert it back into the pandas data frame as a list in a cell.
The below code works but I need to do this for a dataframe of over 30 million rows so it will take too long to do it in a loop.
Can someone please help me convert this to a numpy function that I can apply? I would also like to be able to apply this function in a groupby.
import pandas as pd
df = pd.DataFrame(list(range(1,20)),columns = ['A'])
df.insert(0,'Vector','')
df['Vector'] = df['Vector'].astype(object)
for index, row in df.iterrows():
    df['Vector'].iloc[index] = list(df['A'].iloc[(index-10):index])
I have tried in multiple ways but have not been able to get it to work. Any help would be appreciated.
IIUC
df['New']=[df.A.tolist()[max(0,x-10):x] for x in range(len(df))]
df
Out[123]:
A New
0 1 []
1 2 [1]
2 3 [1, 2]
3 4 [1, 2, 3]
4 5 [1, 2, 3, 4]
5 6 [1, 2, 3, 4, 5]
6 7 [1, 2, 3, 4, 5, 6]
7 8 [1, 2, 3, 4, 5, 6, 7]
8 9 [1, 2, 3, 4, 5, 6, 7, 8]
9 10 [1, 2, 3, 4, 5, 6, 7, 8, 9]
10 11 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
11 12 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
12 13 [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
13 14 [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
14 15 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
15 16 [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
16 17 [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
17 18 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
18 19 [9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
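For very large frames, here is a sketch of the same idea wrapped in a plain-numpy function, which also makes it usable per group (the grouping column 'key' below is hypothetical):
import numpy as np
import pandas as pd

def trailing_windows(values, n=10):
    # List of the previous n values for each position (shorter near the start).
    a = np.asarray(values)
    return [a[max(0, i - n):i].tolist() for i in range(len(a))]

df['New'] = trailing_windows(df['A'])

# Per group, assuming a hypothetical grouping column 'key':
# df['New'] = df.groupby('key', group_keys=False)['A'].apply(
#     lambda s: pd.Series(trailing_windows(s), index=s.index))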

Image.frombytes not writing squares

I have a numpy array:
[[12 13 12 5 6 5 14 4 6 11 11 10 8 11 8 11 7 8 0 0 0]
[ 5 14 4 6 11 11 10 8 11 8 11 8 11 8 11 7 8 0 0 0 0]
[ 5 14 4 6 11 10 10 8 11 8 11 8 11 8 11 8 11 7 8 0 0]
[ 5 14 4 6 11 11 10 7 8 0 0 0 0 0 0 0 0 0 0 0 0]
[ 5 14 4 6 11 11 10 8 11 8 11 8 11 8 11 8 11 8 11 7 8]
[ 5 14 4 6 11 10 8 11 10 8 11 10 8 11 10 7 8 0 0 0 0]
[ 5 14 4 6 11 10 10 8 11 8 11 7 8 0 0 0 0 0 0 0 0]
[ 5 14 4 6 11 11 10 1 11 1 11 7 8 0 0 0 0 0 0 0 0]
[ 5 14 4 6 11 10 10 1 11 1 11 1 11 7 8 0 0 0 0 0 0]
[ 5 14 4 6 11 10 10 8 11 8 11 8 11 7 8 0 0 0 0 0 0]
[ 5 14 4 6 11 10 8 11 10 8 11 10 8 11 10 8 11 7 7 0 0]]
And a colors dictionary:
{0: (0, 0, 0), 1: (17, 17, 17), 2: (34, 34, 34), 3: (51, 51, 51), 4: (68, 68, 68), 5: (85, 85, 85), 6: (102, 102, 102), 7: (119, 119, 119), 8: (136, 136, 136), 9: (153, 153, 153), 10: (170, 170, 170), 11: (187, 187, 187), 12: (204, 204, 204), 13: (221, 221, 221), 14: (238, 238, 238)}
And I'm trying to pass the array through the dictionary, then write those colors in 10x10 blocks to a .png file. So far I have:
rows = []
for row in arr:
    for j in range(10):
        for col in row:
            for i in range(10):
                rows.extend(colors[col])
rows = bytes(rows)
img = Image.frombytes('RGB', (110, 120), rows)
img.save("generated.png")
But the resulting image has lines instead of the 10x10 blocks I was trying to write. It seems to me as though the blocks are shifted somehow, but I can't figure out how to un-shift them. Why is this behavior happening?
I believe you only need to change the size parameter to obtain the result you want. Replacing this line should correct the error:
# img = Image.frombytes('RGB', (110, 120), rows)
img = Image.frombytes('RGB', (210, 110), rows)
Size should be a 2-Tuple of the width and height of the image in pixels. The rows list you are creating is an image that is (210,110) pixels. You are drawing that to an image that is (110,120) pixels. This causes the image to break to a new row every 110 pixels.
Here is a working example:
from PIL import Image
array = [
[12, 13, 12, 5, 6, 5, 14, 4, 6, 11, 11, 10, 8, 11, 8, 11, 7, 8, 0, 0, 0],
[5, 14, 4, 6, 11, 11, 10, 8, 11, 8, 11, 8, 11, 8, 11, 7, 8, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 10, 10, 8, 11, 8, 11, 8, 11, 8, 11, 8, 11, 7, 8, 0, 0],
[5, 14, 4, 6, 11, 11, 10, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 11, 10, 8, 11, 8, 11, 8, 11, 8, 11, 8, 11, 8, 11, 7, 8],
[5, 14, 4, 6, 11, 10, 8, 11, 10, 8, 11, 10, 8, 11, 10, 7, 8, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 10, 10, 8, 11, 8, 11, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 11, 10, 1, 11, 1, 11, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 10, 10, 1, 11, 1, 11, 1, 11, 7, 8, 0, 0, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 10, 10, 8, 11, 8, 11, 8, 11, 7, 8, 0, 0, 0, 0, 0, 0],
[5, 14, 4, 6, 11, 10, 8, 11, 10, 8, 11, 10, 8, 11, 10, 8, 11, 7, 7, 0, 0],
]
colors = {
0: (0, 0, 0),
1: (17, 17, 17),
2: (34, 34, 34),
3: (51, 51, 51),
4: (68, 68, 68),
5: (85, 85, 85),
6: (102, 102, 102),
7: (119, 119, 119),
8: (136, 136, 136),
9: (153, 153, 153),
10: (170, 170, 170),
11: (187, 187, 187),
12: (204, 204, 204),
13: (221, 221, 221),
14: (238, 238, 238)
}
rows = []
for row in array:
    for _ in range(10):
        for col in row:
            for _ in range(10):
                rows.extend(colors[col])
rows = bytes(rows)
img = Image.frombytes('RGB', (210, 110), rows)
img.save("generated.png")
