Scraping and sorting dates from a website using Python - python-3.x

I am trying to sort dates from some results on a website. I found the dates inside <span class="f"> tags. Unfortunately, I cannot extract this information using the code below. I would like to know what is wrong in the code and how I can extract the dates and sort them in ascending/descending order.
What I already did is to collect the information (first 20 results) from the website into an array. The array urls[] is collecting information (sentences) published in different periods (in terms of months, days, minutes...). You could think of posts on Facebook or results in Google.
urls=[]
for url in search(' " life " ', stop=20):
    urls.append(url) # this creates a list of results (sentences). For each of them I would like to report the date when it was published
soup = BeautifulSoup(url)
for url in urls:
    url = soup.find_all('span', {'class':'f'})
    # <span class="f">2 days ago - </span>
    print(url)
I should expect results such as, for example,
"Yesterday I went out with my friends" 2 days ago the oldest result
"I played basketball for several years" 20 hours ago ....
.... 19 hours ago ....
.... 5 hours ago ....
...
for each sentence. So I should have two arrays, one for sentences and one for their date respectively, in order to plot them.
Could you please suggest how to do that?
Thanks

This requires several steps:
first, extract only the duration from each URL by removing the span tags. You can do this with replace(), split() or you could use regular expressions.
sort the durations into different categories (days, hours, etc)
in each category, sort the durations in reverse order (eg. 2 hours ago should come before 1 hour ago)
finally, join the categories (days, hours etc.) into a single string in the correct order (days should come before hours).
Here's a working implementation. Note that you can extend it to also support minutes, months etc.
elements = [
'<span class="f">21 hours ago - </span>',
'<span class="f">20 hours ago - </span>',
'<span class="f">2 days ago - </span>',
'<span class="f">1 day ago - </span>']
# extract the durations (eg. 21 hours ago) and store them in times list
times = [elem.replace('<span class="f">','').replace(' - </span>','') for elem in elements]
# categorize the times into days and hours
days = [time for time in times if "day" in time]
hours = [time for time in times if "hour" in time]
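# sort each category in reverse order, comparing the leading number numerically
# (a plain string sort would put "9 days ago" before "10 days ago")
days.sort(key=lambda t: int(t.split()[0]), reverse=True)
hours.sort(key=lambda t: int(t.split()[0]), reverse=True)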
# join categories into a string, such that each time is on a new line
output = '\n'.join(days) + '\n' + '\n'.join(hours)
print(output)
Output:
2 days ago
1 day ago
21 hours ago
20 hours ago
Demo: https://repl.it/#glhr/55552138
Another more scalable approach is to use a dictionary to convert every duration into a certain number of minutes, store those numeric durations into a separate list, and sort the original list of strings based on the numeric list:
elements = [
'<span class="f">21 hours ago - </span>',
'<span class="f">20 hours ago - </span>',
'<span class="f">2 days ago - </span>',
'<span class="f">1 day ago - </span>']
# extract the durations (eg. 21 hours ago) and store them in times list
times = [elem.replace('<span class="f">','').replace(' - </span>','') for elem in elements]
minutes_per_duration = {"hours": 60, "hour": 60, "minute": 1, "minutes": 1, "day": 1440, "days": 1440}
duration_values = []
for time in times:
    duration = time.split(" ")[1] # eg. hours
    number = int(time.split(" ")[0]) # eg. 21
    minutes = minutes_per_duration[duration] # eg. 60 (for hours)
    total = minutes * number # 21 * 60 = 1260
    duration_values.append(total)
# sort times based on calculated duration values
output = '\n'.join([t for _, t in sorted(zip(duration_values, times), reverse=True)])
print(output)
Output:
2 days ago
1 day ago
21 hours ago
20 hours ago
In your code, you can implement it like this:
def durationSpansToSortedList(elements):
    # extract the durations (eg. 21 hours ago) and store them in times list
    times = [elem.replace('<span class="f">','').replace(' - </span>','') for elem in elements]
    minutes_per_duration = {"hours": 60, "hour": 60, "minute": 1, "minutes": 1, "day": 1440, "days": 1440}
    duration_values = []
    for time in times:
        duration = time.split(" ")[1] # eg. hours
        number = int(time.split(" ")[0]) # eg. 21
        minutes = minutes_per_duration[duration] # eg. 60 (for hours)
        total = minutes * number # 21 * 60 = 1260
        duration_values.append(total)
    # sort times based on calculated duration values
    # return the sorted times together with the (unsorted) duration values
    return [[t for _, t in sorted(zip(duration_values, times), reverse=True)], duration_values]
urls=[]
for url in search(' " life " ', stop=20):
    urls.append(url) # this creates a list of results (sentences). For each of them I would like to report the date when it was published
spanElements = []
sentenceElements = []
for url in urls:
    # NOTE: BeautifulSoup parses markup, so url must hold the result's HTML here, not just its address
    soup = BeautifulSoup(url, "html.parser")
    spanElements.append(str(soup.find_all('span', {'class':'f'})[0]))
    sentenceElements.append(url)
sortedDurations, duration_values = durationSpansToSortedList(spanElements)
print("Sorted durations:", sortedDurations,"\n")
sortedSentences = [s for _, s in sorted(zip(duration_values, sentenceElements), reverse=True)]
print("Sorted sentences:", sortedSentences)
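As a more general sketch of the same idea, each relative time can be parsed into a datetime.timedelta, which compares and sorts naturally. The helper name parse_age, the unit table and the sample sentences below are assumptions made for illustration; months and years are left out on purpose because they do not have a fixed length.

```python
import re
from datetime import timedelta

# Map the unit word (singular) to the matching timedelta keyword argument.
UNIT_TO_KWARG = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}

def parse_age(text):
    """Turn text such as '<span class="f">2 days ago - </span>' into a timedelta."""
    match = re.search(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text)
    if match is None:
        raise ValueError(f"no relative time found in {text!r}")
    number, unit = int(match.group(1)), match.group(2)
    return timedelta(**{UNIT_TO_KWARG[unit]: number})

# Hypothetical (sentence, span) pairs standing in for the scraped results.
results = [
    ("I played basketball for several years", '<span class="f">20 hours ago - </span>'),
    ("Yesterday I went out with my friends", '<span class="f">2 days ago - </span>'),
    ("Sample sentence", '<span class="f">45 minutes ago - </span>'),
]

# Oldest first: the largest age comes first.
oldest_first = sorted(results, key=lambda pair: parse_age(pair[1]), reverse=True)
sentences = [sentence for sentence, _ in oldest_first]
ages = [parse_age(span) for _, span in oldest_first]
```

This yields the two parallel lists the question asks for (sentences and their ages), and ages can be converted to numbers for plotting with total_seconds().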

Related

Get difference between two week days that are in string

Problem Statement:
I am developing a custom job scheduler that needs to run on given days. It takes a start date and an end date as strings, and the third parameter is a list of weekdays on which the job should run.
The start day can differ from the given days, but the first job should run on the next valid day.
Suppose the start date is 2022-09-07 (a Wednesday) but the given frequency days are ["Monday", "Friday", "Saturday"]. I need to run my first job on the coming Friday, and for this I need to calculate the difference between the start date and the first valid day (in this case, Friday).
How can I do this in Python so that my first job runs on a valid day (which can be in any position in the given list of frequency days)? After one job completes, I also need to get the next valid day. I did some work, but unfortunately it is not working. Here is what I did:
sorted_week_days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
start_date = "2022-09-07"
valid_frequency_days = ["Monday", "Tuesday", "Friday"] # It can be any days in sorted order
start_date_object = datetime.datetime.strptime(start_date, "%Y-%m-%d")
given_start_day = start_date_object.strftime("%A")
if given_start_day not in valid_frequency_days:
    # Need help to implement logic to get date for valid day
You should use the datetime.weekday() method to pull out the day of the week for days of interest. Assuming that you have dates similar to the format you show above, it is easy to convert, and also just use the day index for your "allowable start days" (Monday=0).
Then you can jig up a little function to look for the next start date in your sorted list and figure out how many days you need to wait.
Example below does that and also "rolls over" the weekend as needed.
Code:
from datetime import datetime, timedelta
from bisect import bisect_left
start_date = "2022-09-09"
valid_start_dates = [1, 4] # It can be any days in sorted order
start_date_object = datetime.strptime(start_date, "%Y-%m-%d")
d=start_date_object.weekday()
print(f'the numbered day of the week is: {d}')
def days_till_start(day, valid_start_days):
    idx = bisect_left(valid_start_days, day)
    if idx >= len(valid_start_days): # wrap around to next start
        return valid_start_days[0] + 7 - day
    elif day == valid_start_days[idx]:
        return 0
    else:
        return valid_start_days[idx] - day
print(days_till_start(d, valid_start_dates))
start_dates = ['2022-09-05', '2022-09-06', '2022-09-07', '2022-09-08', '2022-09-09', '2022-09-10']
start_wkdys = [datetime.strptime(d, "%Y-%m-%d").weekday() for d in start_dates]
for d in start_wkdys:
    print(f'day index is: {d}')
    print(f'next start date is {days_till_start(d, valid_start_dates)} away')
    print()
Output:
the numbered day of the week is: 4
0
day index is: 0
next start date is 1 away
day index is: 1
next start date is 0 away
day index is: 2
next start date is 2 away
day index is: 3
next start date is 1 away
day index is: 4
next start date is 0 away
day index is: 5
next start date is 3 away
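To go from "days to wait" to actual run dates, including the next valid day after each job, a small sketch along the same lines could look like this. The function names next_run_date and run_dates are hypothetical, and the weekday list uses the same Monday=0 indexing as datetime.weekday(); it does not need to be sorted here.

```python
from datetime import date, timedelta

def next_run_date(start, valid_weekdays):
    """Return the first date on or after `start` whose weekday is allowed."""
    if not valid_weekdays:
        raise ValueError("valid_weekdays must not be empty")
    # (target - today) % 7 is the number of days to wait for each allowed weekday.
    wait = min((wd - start.weekday()) % 7 for wd in valid_weekdays)
    return start + timedelta(days=wait)

def run_dates(start, end, valid_weekdays):
    """Yield every allowed date from `start` to `end`, inclusive."""
    current = next_run_date(start, valid_weekdays)
    while current <= end:
        yield current
        # Look for the next valid day strictly after the one just used.
        current = next_run_date(current + timedelta(days=1), valid_weekdays)

# Monday=0, Friday=4, Saturday=5, as in the question's example.
first = next_run_date(date(2022, 9, 7), [0, 4, 5])
schedule = list(run_dates(date(2022, 9, 7), date(2022, 9, 13), [0, 4, 5]))
```

The modulo takes care of rolling over the weekend, so no separate wrap-around branch is needed.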

how can I use the epoch

I need to print out “Your birthday is 31 March 2001 (a years, b days, c hours, d minutes and e seconds ago).”
I created the inputs:
birth_day = int(input("your birth day?"))
birth_month = int(input("your birth month?"))
birth_year = int(input("your birth year?"))
and I understand that I can use
print("your birthday is"+(birth_day)+(birth_month)+(birth_year)) to print out the first sentence, but I have a problem with the second part: (a years, b days, c hours, d minutes and e seconds ago).
I guess I have to use "the epoch"
and some variables like the ones below:
year_sec=365*60*60*24
day_sec=60*60*24
hour_sec=60*60
min_sec=60
calculate how many seconds of the date since 1 January 1970 00:00:00 UTC:
import datetime, time
t = datetime.datetime(2001, 3, 31, 0, 0)
time.mktime(t.timetuple())
985960800.0
Could anyone help me solve this problem, please?
Thanks a lot
EDIT: See this answer in the thread kaya3 mentioned above for a more consistently reliable way of doing the same thing. I'm leaving my original answer below since it's useful to understand how to think about the problem, but just be aware that my answer below might mess up in tricky situations due to the quirks of the Gregorian calendar, in particular:
Every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years if they are exactly divisible by 400. For example, the years 1700, 1800, and 1900 are not leap years, but the years 1600 and 2000 are.
ORIGINAL ANSWER:
You can try using the time module:
import time
import datetime
def main(ask_for_hour_and_minute, convert_to_integers):
    year, month, day, hour, minute = ask_for_birthday_info(ask_for_hour_and_minute)
    calculate_time_since_birth(year, month, day, hour, minute, convert_to_integers)
def ask_for_birthday_info(ask_for_hour_and_minute):
    birthday_year = int(input('What year were you born in?\n'))
    birthday_month = int(input('What month were you born in?\n'))
    birthday_day = int(input('What day were you born on?\n'))
    if ask_for_hour_and_minute is True:
        birthday_hour = int(input('What hour were you born?\n'))
        birthday_minute = int(input('What minute were you born?\n'))
    else:
        birthday_hour = 0 # set to 0 as default
        birthday_minute = 0 # set to 0 as default
    return (birthday_year, birthday_month, birthday_day, birthday_hour, birthday_minute)
def calculate_time_since_birth(birthday_year, birthday_month, birthday_day, birthday_hour, birthday_minute, convert_to_integers):
    year = 31557600 # seconds in a year (365.25 days)
    day = 86400 # seconds in a day
    hour = 3600 # seconds in an hour
    minute = 60 # seconds in a minute
    # provide user info to datetime.datetime()
    birthdate = datetime.datetime(birthday_year, birthday_month, birthday_day, birthday_hour, birthday_minute)
    # time.mktime() returns seconds since the epoch as a float (local time)
    birthdate_tuple = time.mktime(birthdate.timetuple())
    # figure out how many seconds ago birth was
    seconds_since_birthday = time.time() - birthdate_tuple
    # start calculations
    years_ago = seconds_since_birthday // year
    days_ago = seconds_since_birthday // day % 365
    hours_ago = seconds_since_birthday // hour % 24
    minutes_ago = seconds_since_birthday // minute % 60
    seconds_ago = seconds_since_birthday % minute
    # convert calculated values to integers if convert_to_integers is True
    if convert_to_integers is True:
        years_ago = int(years_ago)
        days_ago = int(days_ago)
        hours_ago = int(hours_ago)
        minutes_ago = int(minutes_ago)
        seconds_ago = int(seconds_ago)
    # print calculations
    print(f'Your birthday was {years_ago} years, {days_ago} days, {hours_ago} hours, {minutes_ago} minutes, {seconds_ago} seconds ago.')
# to ask for just the year, month, and day
main(False, False)
# to ask for just the year, month, and day AND convert the answer to integer values
main(False, True)
# to ask for the year, month, day, hour, and minute
main(True, False)
# to ask for the year, month, day, hour, and minute AND convert the answer to integer values
main(True, True)
Tried to use descriptive variable names so the variables should make sense, but the operators might need some explaining:
10 // 3 # the // operator divides the numerator by the denominator and REMOVES the remainder, so answer is 3
10 % 3 # the % operator divides the numerator by the denominator and RETURNS the remainder, so the answer is 1
After understanding the operators, the rest of the code should make sense. For clarity, let's walk through it
Create birthdate by asking user for their information in the ask_for_birthday_info() function
Provide the information the user provided to the calculate_time_since_birth() function
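Convert birthdate to a Unix timestamp (seconds since the epoch) with time.mktime() and store it in birthdate_tuple (despite the name, this is a float, not a tuple)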
Figure out how many seconds have passed since the birthday and store it in seconds_since_birthday
Figure out how many years have passed since the birthday by dividing seconds_since_birthday by the number of seconds in a year
Figure out how many days have passed since the birthday by dividing seconds_since_birthday by the number of seconds in a day and keeping only the most recent 365 days (that's the % 365 in days_ago)
Figure out how many hours have passed since the birthday by dividing seconds_since_birthday by the number of seconds in a hour and keeping only the most recent 24 hours (that's the % 24 in hours_ago)
Figure out how many minutes have passed since the birthday by dividing seconds_since_birthday by the number of seconds in a minute and keeping only the most recent 60 minutes (that's the % 60 in minutes_ago)
Figure out the leftover seconds by taking the remainder of seconds_since_birthday after dividing by the number of seconds in a minute (that's the % minute in seconds_ago)
Then, we just need to print the results:
print(f'Your birthday was {years_ago} years, {days_ago} days, {hours_ago} hours, {minutes_ago} minutes, {seconds_ago} seconds ago.')
# if you're using a version of python before 3.6, use something like
print('Your birthday was ' + str(years_ago) + ' years, ' + str(days_ago) + ' days, ' + str(hours_ago) + ' hours, ' + str(minutes_ago) + ' minutes, ' + str(seconds_ago) + ' seconds ago.')
Finally, you can add some error checking to make sure that the user enters valid information, so that if they say they were born in month 15 or month -2, your program would tell the user they provided an invalid answer. For example, you could do something like this AFTER getting the birthday information from the user, but BEFORE calling the calculate_time_since_birth() function:
if not (1 <= month <= 12):
    print('ERROR! You provided an invalid month!')
    return
if not (1 <= day <= 31):
    # note this isn't a robust check, if user provides February 30 or April 31, that should be an error - but this won't catch that
    # you'll need to make it more robust to catch those errors
    print('ERROR! You provided an invalid day!')
    return
if not (0 <= hour <= 23):
    print('ERROR! You provided an invalid hour!')
    return
if not (0 <= minute <= 59):
    print('ERROR! You provided an invalid minute!')
    return
if not (0 <= second <= 59): # only relevant if you also ask the user for seconds
    print('ERROR! You provided an invalid second!')
    return
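For the leap-year caveat mentioned in the edit at the top, one possible calendar-aware sketch counts whole years by anniversaries and lets datetime subtraction handle the rest. The names time_since and replace_year are hypothetical, the datetimes are assumed to be naive local times, and treating a 29 February birthday as 28 February in non-leap years is an assumption, not the only reasonable convention.

```python
from datetime import datetime

def replace_year(moment, year):
    """Move `moment` to `year`; 29 February falls back to 28 February (an assumption)."""
    try:
        return moment.replace(year=year)
    except ValueError:
        return moment.replace(year=year, day=28)

def time_since(birth, now):
    """Return (years, days, hours, minutes, seconds) elapsed between two datetimes."""
    years = now.year - birth.year
    anniversary = replace_year(birth, birth.year + years)
    # Step back one year if this year's anniversary has not been reached yet.
    if anniversary > now:
        years -= 1
        anniversary = replace_year(birth, birth.year + years)
    remainder = now - anniversary  # a timedelta shorter than one year
    hours, rest = divmod(remainder.seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return years, remainder.days, hours, minutes, seconds

# Example with fixed inputs so the result does not depend on the current time.
elapsed = time_since(datetime(2001, 3, 31), datetime(2020, 4, 1, 1, 2, 3))
```

In a real program the second argument would be datetime.now(); a fixed value is used here only to keep the example reproducible.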

Non-standard Julian day time stamp

I have a timestamp in a non-standard format; it's a concatenation of a number of elements. I'd like to convert at least the last part of the string into hours/minutes/seconds/decimal seconds so I can calculate the time gap between them (typically of the order of 2-5 seconds).
I have looked at this link but it assumes a 'proper' Julian time. How to convert Julian date to standard date?
My time stamp looks like this
1380643373
It is set up as ddd hh mm ss.s
This timestamp represents the 138th day, 06:43:37.3
Is there a datetime method of working with this or do I need to strip out the various parts (hh,mm,ss.s) and concatenate them in some way? As I am only interested in the seconds, if I can just extract them I could deal with that by adding 60 if the second timestamp is smaller than the first - i.e event passes over the minute change boundary.
If you're only interested in seconds, you can do:
timestamp = 1380643373
seconds = (timestamp % 1000) / 10 # Gives 37.3
timestamp % 1000 gives you the last three digits of timestamp. Then you divide that by 10 to get seconds.
If it's a string, you can take the last three characters by slicing it.
timestamp = "1380643373"
seconds = int(timestamp[-3:]) / 10 # Gives 37.3
It's pretty easy to convert the timestamp to a datetime using the divmod() function repeatedly:
import datetime
base_date = datetime.datetime(2000, 1, 1, 0, 0, 0) # Midnight on Jan 1 2000
timestamp = 1380643373
timestamp, seconds = divmod(timestamp, 1000) # Gives 1380643, 373
seconds = seconds / 10 # Gives 37.3
timestamp, minutes = divmod(timestamp, 100) # Gives 13806, 43
days, hours = divmod(timestamp, 100) # Gives 138, 6
tdelta = datetime.timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds) # Gives datetime.timedelta(days=138, seconds=24217, microseconds=300000)
new_date = base_date + tdelta
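Since the goal is the gap between two stamps, it may be simpler to skip the base date and subtract the timedeltas directly; this also handles minute, hour and day boundaries without adding 60 by hand. A minimal sketch, with stamp_to_timedelta and gap_seconds as hypothetical helper names, and assuming both stamps fall in the same year:

```python
from datetime import timedelta

def stamp_to_timedelta(stamp):
    """Split a ddd hh mm ss.s stamp (int or digit string) into a timedelta."""
    rest, tenths = divmod(int(stamp), 1000)  # last three digits: seconds in tenths
    rest, minutes = divmod(rest, 100)
    days, hours = divmod(rest, 100)
    # Keep the tenths as whole milliseconds to avoid floating-point rounding.
    return timedelta(days=days, hours=hours, minutes=minutes, milliseconds=tenths * 100)

def gap_seconds(earlier, later):
    """Return the number of seconds between two stamps."""
    return (stamp_to_timedelta(later) - stamp_to_timedelta(earlier)).total_seconds()

gap = gap_seconds(1380643373, 1380643401)            # 06:43:37.3 -> 06:43:40.1
across_minute = gap_seconds("1380643598", "1380644003")  # 06:43:59.8 -> 06:44:00.3
```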

Calculating time to earn a specific amount as interest

I cannot figure out the approach to this, as the principal amount changes after every year (if calculated annually, which should be the easiest). The eventual goal is to calculate the exact number of years, months and days needed to earn, say, 150000 as interest on a deposit of 1000000 at an interest rate of, say, 6.5%. I have tried but cannot seem to figure out how to increment the year/month/day in the loop. I don't mind if this is downvoted because I have not posted any code (well, my attempts are wrong). This is not as simple as it might seem to beginners here.
It is a pure maths question. Compound interest is calculated as follows:
Ptotal = Pinitial*(1+rate/100)^time
where Ptotal is the new total. rate is usually given as a percentage, so divide by 100; time is in years. You are interested in the difference, though, so use
interest = Pinitial*(1+rate/100)^time - Pinitial
instead, which is in Python:
def compound_interest(P,rate,time):
    interest = P*(1+rate/100)**time - P
    return interest
A basic inversion of this to yield time, given P, r, and target instead, is
time = log((target+Pinitial)/Pinitial)/log(1+rate/100)
and this will immediately return the number of years. Converting the fraction to days is simple – an average year has 365.25 days – but for months you'll have to approximate.
At the bottom, the result is fed back into the standard compound interest formula to show it indeed returns the expected yield.
import math
def reverse_compound_interest(P,rate,target):
    time = math.log((target+P)/P)/math.log(1+rate/100)
    return time
timespan = reverse_compound_interest(2500000, 6.5, 400000)
print ('time in years',timespan)
years = math.floor(timespan)
months = math.floor(12*(timespan - years))
days = math.floor(365.25*(timespan - years - months/12))
print (years,'y',months,'m',days,'d')
print (compound_interest(2500000, 6.5, timespan))
will output
time in years 2.356815854829652
2 y 4 m 8 d
400000.0
Can we do better? Yes. datetime allows arbitrary numbers added to the current date, so assuming you start earning today (now), you can immediately get your date of $$$:
from datetime import datetime,timedelta
# ... original script here ...
timespan *= 31556926 # the number of seconds in a year
print ('time in seconds',timespan)
print (datetime.now() + timedelta(seconds=timespan))
which shows for me (your target date will differ):
time in years 2.356815854829652
time in seconds 74373863.52648607
2022-08-08 17:02:54.819492
You could do something like
def how_long_till_i_am_rich(investment, profit_goal, interest_rate):
    profit = 0
    days = 0
    daily_interest = interest_rate / 100 / 365
    while profit < profit_goal:
        days += 1
        profit += (investment + profit) * daily_interest
    years = days // 365
    months = days % 365 // 30
    days = days - (months * 30) - (years * 365)
    return years, months, days
years, months, days = how_long_till_i_am_rich(2500000, 400000, 8)
print(f"It would take {years} years, {months} months, and {days} days")
OUTPUT
It would take 1 years, 10 months, and 13 days
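For the figures in the question itself (a deposit of 1000000, a target of 150000 in interest, 6.5% compounded annually), the logarithm formula can be applied directly. This is only a sketch: the function names years_to_earn and split_years are hypothetical, and the month and day split uses the same approximation as above (12 equal months, 365.25 days per year), so it is not calendar-exact.

```python
import math

def years_to_earn(principal, rate_percent, target_interest):
    """Years of annual compounding needed for the interest to reach target_interest."""
    growth = (principal + target_interest) / principal
    return math.log(growth) / math.log(1 + rate_percent / 100)

def split_years(timespan):
    """Approximate a fractional number of years as (years, months, days)."""
    years = math.floor(timespan)
    months = math.floor(12 * (timespan - years))
    days = math.floor(365.25 * (timespan - years - months / 12))
    return years, months, days

timespan = years_to_earn(1_000_000, 6.5, 150_000)
years, months, days = split_years(timespan)

# Feed the result back into the compound interest formula as a sanity check.
earned = 1_000_000 * (1 + 6.5 / 100) ** timespan - 1_000_000
print(f"{timespan:.4f} years, roughly {years} y {months} m {days} d, earning {earned:.2f}")
```

Note that the loop-based version above compounds daily rather than annually, so the two approaches answer slightly different questions and will not give identical durations.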

ValueError: day is out of range for month python

I am writing code that lets users write down dates and times for things they have on. It takes in the date on which it starts, the start time and finish time. It also allows the user to specify if they want it to carry over into multiple weeks (every Monday for a month for example)
I am using a for loop to do this, and because of the different months having different days I obviously want (if the next Monday for example is in the next month) it to have the correct date.
This is my code:
for i in range(0 , times):
    day = day
    month = month
    fulldateadd = datetime.date(year, month, day)
    day = day + 7
    if month == ( '01' or '03' or '05' or '07' or '10'or '12'):
        if day > 31:
            print(day)
            day = day - 31
            print(day)
            month = month + 1
    elif month == ( '04' or '06'or '09' or '11'):
        if day > 30:
            print(day)
            day = day - 30
            print(day)
            month = month + 1
    elif month == '02':
        if day > 29:
            print(day)
            day = day - 29
            print(day)
            month = month + 1
When running this and testing to see if it goes correctly into the new month I get the error:
File "C:\Users\dansi\AppData\Local\Programs\Python\Python36-32\gui test 3.py", line 73, in addtimeslot
fulldateadd = datetime.date(year, month, day)
ValueError: day is out of range for month
Where have I gone wrong?
It's hard to be completely accurate without seeing some of the previous code (for example, where do day, month, year, and times come from?), but here's how you might be able to use timedelta in your code:
fulldateadd = datetime.date(year, month, day)
for i in range(times):
    fulldateadd = fulldateadd + datetime.timedelta(7)
A timedelta instance represents a period of time, rather than a specific absolute time. By default, a single integer passed to the constructor represents a number of days. So timedelta(7) gives you an object that represents 7 days.
timedelta instances can then be used with datetime or date instances using basic arithmetic. For example, date(2016, 12, 31) + timedelta(1) would give you date(2017, 1, 1) without you needing to do anything special.
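As for what went wrong in the original loop, two things appear to combine: an expression like ('01' or '03' or '05') evaluates to just '01', and month has to be an integer for datetime.date() to accept it, so it never equals any of those strings. None of the rollover branches run, and day keeps growing until it passes the end of the month. A minimal sketch of the timedelta approach that collects every weekly occurrence (weekly_dates is a hypothetical helper name, and the start date is only an example):

```python
import datetime

def weekly_dates(start, times):
    """Return `times` dates spaced one week apart, beginning with `start`."""
    one_week = datetime.timedelta(weeks=1)
    return [start + i * one_week for i in range(times)]

# Four consecutive Mondays, rolling over into the next month and year.
slots = weekly_dates(datetime.date(2016, 12, 19), 4)

# If a month check is still needed elsewhere, test membership with `in`
# on integers instead of chaining string literals with `or`.
month = 12
is_31_day_month = month in (1, 3, 5, 7, 8, 10, 12)
```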

Resources