from bs4 import BeautifulSoup
import requests
import smtplib
import time

def live_news():
    source = requests.get(
        "https://economictimes.indiatimes.com/news/politics-and-nation/coronavirus-cases-in-india-live-news-latest-updates-april6/liveblog/75000925.cms"
    ).text
    soup = BeautifulSoup(source, "lxml")
    livepage = soup.find("div", class_="pageliveblog")
    each_story = livepage.find("div", class_="eachStory")
    news_time = each_story.span.text
    new_news = each_story.div.text[8:]
    print(f"{news_time}\n{new_news}")

while True:
    live_news()
    time.sleep(300)
What I'm trying to do is scrape the latest news updates from a news website. I want to print only the latest story along with its time, not the entire list of headlines.
With the above code I can get the latest news update, and the program sends a request to the server every 5 minutes (that's the delay I've given). The problem is that it will print the same previously printed news again after 5 minutes if no new update has been posted on the page. I don't want the program to print the same news again; instead I would like to add a condition so that every 5 minutes it checks whether there is a new update or it is still the same previous news. If there is a new update it should print it, otherwise it should not.
The solution I can think of is an if-statement. On the first run of your code, a variable check_last_time is empty, and when you call live_news() it gets assigned the value of news_time.
After that, each time live_news() is called, it first checks whether the current news_time differs from check_last_time, and only if it does will it print the new story:
# Initialise the variable outside of the function
check_last_time = None

def live_news():
    global check_last_time  # needed so the assignment below updates the outer variable
    ...
    ...
    # Check whether the times don't match
    if news_time != check_last_time:
        print(f"{news_time}\n{new_news}")
        # Update the variable with the new time
        check_last_time = news_time
I have found the answer myself. I feel kinda stupid, it's very simple: all you need is an additional file to store the value. Since variable values get reset between executions, you need an extra file to read/write the data whenever you need it.
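For example, here is a minimal sketch of that idea, assuming the last seen timestamp lives in a plain text file; the file name and helper functions are illustrative, not from the original post:
import os

STATE_FILE = "last_seen.txt"  # hypothetical file that persists the last printed timestamp

def read_last_time():
    # Return the previously stored timestamp, or None on the very first run
    if not os.path.exists(STATE_FILE):
        return None
    with open(STATE_FILE) as f:
        return f.read().strip()

def write_last_time(news_time):
    # Overwrite the stored timestamp with the latest one we printed
    with open(STATE_FILE, "w") as f:
        f.write(news_time)

# Inside live_news(), after news_time and new_news have been scraped:
# if news_time != read_last_time():
#     print(f"{news_time}\n{new_news}")
#     write_last_time(news_time)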
I am developing a Python application to send some images via WhatsApp, but when I try to attach the image the typed file path comes out broken. Does anyone know what happens?
import time

import pyautogui
from bs4 import BeautifulSoup

# driver is an already-logged-in Selenium webdriver instance
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Open the attachment menu (the paper-clip icon)
AbreAnexo = driver.find_element_by_css_selector('span[data-icon="clip"]')
AbreAnexo.click()

# Open the image picker
AbreImagem = driver.find_element_by_css_selector('button[class="Ijb1Q"]')
AbreImagem.click()

# Type the file path into the native file dialog and confirm twice
pyautogui.typewrite("C:\\Users\\f_teicar\\Documents\\Lanchonete\\001-Cardapio.png", interval=0.02)
time.sleep(5)
pyautogui.press('enter')
time.sleep(5)
pyautogui.press('enter')
The expected output is for it to type C:\Users\f_teicar\Documents\Lanchonete\001-Cardapio.png,
but the actual output is it car\Documents\Lanchonete\001-Cardapio.png.
Especially on external websites, it takes a while for the page to load. This means the next step (or part of it) might be ignored because the page isn't ready to receive further operations from the Selenium client.
time.sleep(n), where n is the number of seconds to wait, is a quick way of waiting for the page to load, but if loading takes a bit longer than the time you specify it will fail, and if the page loads much faster it will waste time. So I use a function to wait for the page, like this:
from contextlib import contextmanager

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

MAX_WAIT = 30  # seconds; upper bound on the load time (value not in the original snippet)

@contextmanager
def wait_for_page_load(timeout=MAX_WAIT):
    """Wait for a new page that isn't the old page."""
    old_page = driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(driver, timeout).until(staleness_of(old_page))
To call the function, use something like
with wait_for_page_load():
    AbreImagem.click()
where the second line is anything that causes a new page to load. Note that this procedure depends on the presence of the <html> tag in the old page, which is usually pretty reliable.
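In the same spirit, an explicit wait on the element you are about to click is usually better than a fixed sleep; here is a minimal sketch assuming standard Selenium expected conditions (the CSS selector is the one from the question, the 30-second timeout is arbitrary):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 s for the attachment (paper-clip) icon to be clickable, then click it
clip = WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, 'span[data-icon="clip"]'))
)
clip.click()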
I have successfully written code to web scrape an HTTPS text page:
https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt
This page is automatically updated every 60 seconds. I have used BeautifulSoup4 to do so. Here are my two questions: 1) How do I set up a loop to re-scrape the page every 60 seconds? 2) Since there are no HTML tags associated with the page, how can I scrape only a specific line of data?
I was thinking that I might have to save the scraped page as a CSV file and then use the saved page to extract the data I need. However, I'm hoping this can all be done without saving the page to my local machine, ideally with some Python package that handles it for me.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read().decode()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
I would like to automatically scrape the first line of data every 60 seconds. Here is an example first line of data:
2019 03 30 1233 58572 45180 9.94e-09 1.00e-09
The header that goes with this data is
YR MO DA HHMM Day Day Short Long
Ultimately I would like to use PyAutoGUI to trigger a CCD imaging application to start a sequence of images when the 'Short' and/or 'Long' X-ray flux reaches e-04 or greater.
Every tool has its place.
BeautifulSoup is a wonderful tool, but the .txt suffix on that URL is a big hint that this isn't quite the HTML input which bs4 was designed for.
I recommend a simpler approach for this fairly simple input.
from itertools import filterfalse

def is_comment(line):
    # Comment/header lines in this file start with ':' or '#'
    return (line.startswith(':')
            or line.startswith('#'))

lines = list(filterfalse(is_comment, sauce.split('\n')))
Now you can split each line on whitespace to convert it to CSV or a pandas DataFrame.
Or you can just use lines[0] to access the first line.
For example, you might parse it out in this way:
yr, mo, da, hhmm, jday, sec, short, long = map(float, lines[0].split())
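Putting the pieces together, here is a minimal sketch of the 60-second polling loop, including the flux-threshold check mentioned in the question; the 1e-4 threshold value and the print placeholder for the CCD trigger are assumptions, not part of the original:
import time
import urllib.request
from itertools import filterfalse

URL = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
THRESHOLD = 1e-4  # assumed flux level at which imaging should start

def is_comment(line):
    return line.startswith((':', '#'))

def first_reading():
    # Fetch the file, drop comment/header lines, and parse the first data row
    text = urllib.request.urlopen(URL).read().decode()
    lines = [l for l in filterfalse(is_comment, text.split('\n')) if l.strip()]
    yr, mo, da, hhmm, jday, sec, short_flux, long_flux = map(float, lines[0].split())
    return short_flux, long_flux

while True:
    short_flux, long_flux = first_reading()
    if short_flux >= THRESHOLD or long_flux >= THRESHOLD:
        print("Flux at or above threshold - start the CCD imaging sequence here")
    time.sleep(60)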
I've been trying to scrape data for a project from The New York Times for the last 7 hours, and yes, it has to be done without the API. It's been a war of attrition, but this code, which looks like it should work, keeps returning NaNs; am I missing something simple? Towards the bottom of the page is every story contained within the front page: little cards that each have an image, 3 article titles, and their corresponding links. My code either doesn't grab a thing, partially grabs it, or grabs something completely wrong. There should be about 35 cards with 3 links apiece, for 105 articles. I've gotten it to recognize 27 cards, with a lot of NaNs instead of strings and none of the individual articles.
import csv, requests, re, json
from bs4 import BeautifulSoup

handle = 'http://www.'
location = 'ny'
ping = handle + location + 'times.com'
pong = requests.get(ping, headers={'User-agent': 'Gordon'})
soup = BeautifulSoup(pong.content, 'html.parser')

# upper cards attempt
for i in soup.find_all('div', {'class': 'css-ki19g7 e1aa0s8g0'}):
    print(i.a.get('href'))
    print(i.a.text)
    print('')

# lower cards attempt
count = 0
for i in soup.find_all('div', {"class": "css-1ee8y2t assetWrapper"}):
    try:
        print(i.a.get('href'))
        count += 1
    except:
        pass

print('current card pickup: ', count)
print('the goal card pickup:', 35)
Everything clickable uses "css-1ee8y2t assetWrapper", but when I find_all I'm only getting 27 of them. I wanted to start from css-guaa7h and work my way down, but it only returns NaNs. Other promising but fruitless divs are:
div class="css-2imjyh" data-testid="block-Well" data-block-tracking-id="Well"
div class="css-a11566"
div class="css-guaa7h”
div class="css-zygc9n"
div data-testid="lazyimage-container" # for images
Current attempt:
h3 class="css-1d654v4">Politics
My hope is running out; why is just getting a first job harder than working hard labor?
I checked their website and it's using AJAX to load the articles as you scroll down. You'll probably have to use Selenium. Here's an answer that might help you do that: https://stackoverflow.com/a/21008335/7933710
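For instance, here is a rough sketch of that approach, assuming a working Chrome driver; the number of scrolls, the sleep duration, and the assetWrapper selector from the question are starting points you may need to tune:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.nytimes.com')

# Scroll to the bottom a few times so the lazily loaded cards get rendered
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the AJAX requests time to finish

soup = BeautifulSoup(driver.page_source, 'html.parser')
cards = soup.find_all('div', {'class': 'css-1ee8y2t assetWrapper'})
print('cards found after scrolling:', len(cards))

driver.quit()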
I have to extract some attributes (in my example there is only one: a text description of apps) from webpages. The problem is the time!
Using the following code, going to a page, extracting one part of the HTML, and saving it takes about 1.2-1.8 seconds per page. That's a lot of time. Is there a way to make it faster? I have a lot of pages; x could be as large as 200000.
I'm using Jupyter.
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()  # assumed: http.request(...) in the original matches urllib3

Description = []
for x in range(len(M)):  # M is the list of page URLs
    response = http.request('GET', M[x])
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    Description.append(t)
Thank you
You should consider inspecting the page a bit. If the page relies on a REST API, you could get the content you need directly from the API. That is a much more efficient way than extracting it from the HTML.
To consume the API you should check out the Requests library for Python, as in the sketch below.
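A minimal sketch of that idea; the endpoint URL and the JSON field name are purely hypothetical placeholders, since they depend entirely on the site you are scraping:
import requests

# Hypothetical JSON endpoint discovered in the browser's network tab
API_URL = "https://example.com/api/apps/12345"

resp = requests.get(API_URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Hypothetical field name; use whatever key actually holds the description
description = data.get("description")
print(description)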
I would try carving this up into multiple processes, per my comment. You can put your code into a function and use multiprocessing like this:
from multiprocessing import Pool

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()  # same assumption as above: http.request(...) comes from urllib3

def web_scrape(url):
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    return t

if __name__ == '__main__':
    # M is your list of urls
    M=["https:// ... , ... , ... ]
    p = Pool(5)  # 5, or however many processes you think is appropriate (start with how many cores you have, maybe)
    description = p.map(web_scrape, M)
    p.close()
    p.join()
    description = list(description)  # if you need it to be a list
What is happening is that your list of URLs gets distributed to multiple processes that run your scrape function. The results then all get consolidated at the end and wind up in description. This should be much faster than processing each URL one at a time as you are doing currently.
For more details: https://docs.python.org/2/library/multiprocessing.html
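Because the work is I/O-bound (mostly waiting on HTTP responses), a thread pool is a lighter-weight alternative worth trying too; here is a short sketch using the standard library's concurrent.futures, reusing the web_scrape function and URL list M from above:
from concurrent.futures import ThreadPoolExecutor

# Reuses web_scrape(url) as defined above; M is the list of URLs
with ThreadPoolExecutor(max_workers=20) as executor:
    description = list(executor.map(web_scrape, M))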
This question is a follow-up to a previous question.
The snippet of code below almost works: it runs without error, yet gives back a None value for results_list. This means it is accessing the file (I think) but just can't extract anything from it.
I have a file, sample.wav, living publicly here: https://storage.googleapis.com/speech_proj_files/sample.wav
I am trying to access it by specifying source_uri='gs://speech_proj_files/sample.wav'.
I don't understand why this isn't working. I don't think it's a permissions problem; my session is instantiated fine. The code chugs for a second, yet always comes up with no result. How can I debug this? Any advice is much appreciated.
from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

results_list = audio_sample.async_recognize(language_code='en-US')
Ah, that's my fault from the last question. That's the async_recognize command, not the sync_recognize command.
That library has three recognize commands. sync_recognize reads the whole file and returns the results. That's probably the one you want. Remove the letter "a" and try again.
Here's an example Python program that does this: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe.py
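In other words, based on the snippet in the question, the synchronous call would look roughly like this (a sketch only; the attribute names mirror the ones used later in this thread):
# audio_sample is the object created with speech_client.sample(...) in the question
results_list = audio_sample.sync_recognize(language_code='en-US')

for result in results_list:
    print(result.transcript)
    print(result.confidence)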
FYI, here's a summary of the other types:
async_recognize starts a long-running, server-side operation to transcribe the whole file. You can make further calls to the server with the operation.poll() method to see whether it has finished and, when it is complete, get the results via operation.results.
The third type is streaming_recognize, which sends you results continually as they are processed. This can be useful for long files where you want some results immediately, or if you're continuously uploading live audio.
I finally got something to work:
import time

from google.cloud import speech

speech_client = speech.Client()
sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate=44100,
)

retry_count = 100
operation = sample.async_recognize(language_code='en-US')
while retry_count > 0 and not operation.complete:
    retry_count -= 1
    time.sleep(10)
    operation.poll()  # API call

print(operation.complete)
print(operation.results[0].transcript)
print(operation.results[0].confidence)
Then something like:
for op in operation.results:
    print(op.transcript)