I have successfully written code to web-scrape an HTTPS text page:
https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt
This page is automatically updated every 60 seconds. I have used BeautifulSoup4 to do so. Here are my two questions: 1) How do I loop so the page is re-scraped every 60 seconds? 2) Since there are no HTML tags associated with the page, how can I scrape only a specific line of data?
I was thinking that I might have to save the scraped page as a CSV file and then use the saved file to extract the data I need. However, I'm hoping this can all be done without saving the page to my local machine, and that there is some Python package that can do it for me.
import bs4 as bs
import urllib.request

# In Python 3, urlopen lives in urllib.request
sauce = urllib.request.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
soup = bs.BeautifulSoup(sauce, 'lxml')
print(soup)
I would like to automatically scrape the first line of data every 60 seconds. Here is an example first line of data:
2019 03 30 1233 58572 45180 9.94e-09 1.00e-09
The header that goes with this data is
YR MO DA HHMM Day Day Short Long
Ultimately I would like to use PyAutoGUI to trigger a CCD imaging application to start a sequence of images when the 'Short' and/or 'Long' X-ray flux reaches 1e-04 or greater.
Every tool has its place.
BeautifulSoup is a wonderful tool, but the .txt suffix on that URL is a big hint that this isn't quite the HTML input which bs4 was designed for.
I recommend a simpler approach for this fairly simple input.
from itertools import filterfalse

def is_comment(line):
    # Comment lines in this data file start with ':' or '#'
    return (line.startswith(':')
            or line.startswith('#'))

# sauce comes back as bytes from urlopen().read(), so decode before splitting
lines = list(filterfalse(is_comment, sauce.decode().split('\n')))
Now you can split each line on whitespace to build a CSV file or a pandas DataFrame.
Or you can just use lines[0] to access the first line.
For example, you might parse it out in this way:
yr, mo, da, hhmm, jday, sec, short, long = map(float, lines[0].split())
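To cover the first question (re-scraping every 60 seconds) and the eventual trigger, here is a minimal sketch built on the parsing above; the 1e-04 threshold comes from the question, while the imaging-sequence call is only a hypothetical placeholder for the PyAutoGUI step:

import time
import urllib.request
from itertools import filterfalse

URL = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
THRESHOLD = 1e-04  # assumed trigger level, from "e-04 or greater" in the question

def is_comment(line):
    return line.startswith(':') or line.startswith('#')

def first_data_line():
    # Fetch the page and return the first non-comment line, following the approach above
    text = urllib.request.urlopen(URL).read().decode()
    return next(filterfalse(is_comment, text.split('\n')))

while True:
    yr, mo, da, hhmm, jday, sec, short, long_ = map(float, first_data_line().split())
    if short >= THRESHOLD or long_ >= THRESHOLD:
        print("Flux at or above 1e-04 - trigger the imaging sequence here")
        # e.g. call a hypothetical start_imaging_sequence() built with PyAutoGUI
    time.sleep(60)  # wait 60 seconds before re-scraping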
I have very little experience but a lot of persistence. One of my hobbies is football and I help out at a local team. I'm a big fan of statistics, and unfortunately the only way to collect data from the local football federation is by web scraping. I have seen that Python with the BeautifulSoup package can help, but I cannot identify the tags, as I believe the data is inside a table.
My goal is to automatically collect the information to build a database with players, fixtures, teams, referees,... and compute stats such as how many times a player has been in the starting line-up, when teams are most likely to score a goal, when they are most likely to concede one,...
I have two links for a reference.
The first is for the general fixtures of a given group.
The second is the detail page for a given match.
Any clues on where to start with the parsing would be great.
You should first fetch the page with a native Python library like requests. Then, depending on the page, use bs4 (BeautifulSoup) to find what you are looking for inside it. Next, transform and refine that information into variables (lists or DataFrames, based on your needs), and finally save those variables into a dataset. Here is some example code:
import requests
from bs4 import BeautifulSoup #assuming you already have bs4 installed
page = requests.get('http://www.DESTINATIONSITE.com/BLAHBLAH')  # requests needs the scheme (http:// or https://)
soup = BeautifulSoup(page.text, 'html.parser')
soup_str = str(soup.text)
Up to this point, you have the text of the entire page as a string. Now you can use regular expressions (or any other Python code) to separate the information you want from soup_str and store it in variable(s).
For the rest of the process, if you want to save that data, I suggest you look into libraries like pandas.
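As a rough illustration of that last step (the column names and values below are invented placeholders, not fields taken from the federation's site):

import pandas as pd

# Hypothetical lists produced by the regex/parsing step above
players = ["Player A", "Player B"]
teams = ["Team X", "Team Y"]

df = pd.DataFrame({"player": players, "team": teams})
df.to_csv("fixtures.csv", index=False)  # persist the scraped data for later stats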
I have made some progress:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# send request
url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# read acta
acta_text = []
acta_text_element = soup.find_all(class_='acta-table')
for item in acta_text_element:
    acta_text.append(item.text)
When I print the items I get many \n characters.
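If the stray newlines are the only problem, one option (assuming the plain text of each table is all that's needed) is to let BeautifulSoup strip and join the text for you:

acta_text = []
for item in acta_text_element:
    # strip=True trims whitespace; separator=' ' keeps adjacent cell values apart
    acta_text.append(item.get_text(separator=' ', strip=True))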
Mahdi_J, thanks for the initial response. I have already started on the approach:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c,"html.parser")
What I need is some support on where to start with the parsing. From the URL I need to collect the players for both teams, the goals, and the bookings, and save them elsewhere. What I'm having trouble with is how to parse. I have very little skill and need a starting point for the parsing.
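As a starting point, and assuming the match details sit inside the acta-table elements found earlier, one way to begin is simply to walk each table row by row and inspect what comes out; the exact layout of players, goals and bookings on fcf.cat may differ, so treat this only as a pattern to adapt:

for table in soup.find_all(class_='acta-table'):
    for row in table.find_all('tr'):
        # Each row's cells may hold player names, goals or bookings
        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        if cells:
            print(cells)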
from bs4 import BeautifulSoup
import requests
import smtplib
import time
def live_news():
    source = requests.get(
        "https://economictimes.indiatimes.com/news/politics-and-nation/coronavirus-cases-in-india-live-news-latest-updates-april6/liveblog/75000925.cms"
    ).text
    soup = BeautifulSoup(source, "lxml")

    livepage = soup.find("div", class_="pageliveblog")
    each_story = livepage.find("div", class_="eachStory")

    news_time = each_story.span.text
    new_news = each_story.div.text[8::]
    print(f"{news_time}\n{new_news}")

while True:
    live_news()
    time.sleep(300)  # wait 5 minutes between requests
So basically what I'm trying to do is scrape the latest news updates from a news website. What I want is to print only the latest news item along with its time, not the entire list of headlines.
With the above code I can get the latest news update, and the program sends a request to the server every 5 minutes (that's the delay I've given). The problem is that it prints the same, previously printed news again after 5 minutes if nothing new has been added to the page. I don't want the program to print the same news again; instead I would like to add a condition so that every 5 minutes it checks whether there is a new update or just the same previous news. If there is a new update it should print it, otherwise it should not.
The solution I can think of is an if-statement. On the first run of your code, a variable check_last_time is empty, and when you call live_news() it gets assigned the value of news_time.
After that, each time live_news() is called, it first checks whether the current news_time is the same as check_last_time, and if it is not, it prints the new story:
# Initialise the variable outside of the function
check_last_time = None

def live_news():
    global check_last_time  # needed so the assignment below updates the module-level variable
    ...
    ...
    # Check whether the current time differs from the stored one
    if news_time != check_last_time:
        print(f"{news_time}\n{new_news}")
        # Update the variable with the new time
        check_last_time = news_time
I have found the answer myself. I feel a bit silly because it's very simple: all you need is an additional file to store the value. Since variable values get reset between executions, you need an extra file to read/write the data whenever you need it.
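A minimal sketch of that idea, assuming a small text file such as last_time.txt is acceptable for holding the last seen timestamp between runs (the filename is just an example):

import os

STATE_FILE = "last_time.txt"  # example file for persisting the last seen time

def read_last_time():
    # Return the previously stored time, or None on the very first run
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return f.read().strip()
    return None

def write_last_time(news_time):
    with open(STATE_FILE, "w") as f:
        f.write(news_time)

# Inside live_news(), print (and store) only when the time has changed:
# if news_time != read_last_time():
#     print(f"{news_time}\n{new_news}")
#     write_last_time(news_time)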
When you open the URL below in a browser,
http://www.kianfunds2.com/%D8%A7%D8%B1%D8%B2%D8%B4-%D8%AF%D8%A7%D8%B1%D8%A7%DB%8C%DB%8C-%D9%87%D8%A7-%D9%88-%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF-%D9%88%D8%A7%D8%AD%D8%AF-%D9%87%D8%A7
you see a purple icon named "Copy". When you select this icon ("Copy"), you get a complete table that you can paste into Excel. How can I get this table as input in Python?
My code is below, and it shows nothing:
import requests
from bs4 import BeautifulSoup
url = "http://www.kianfunds2.com/" + "ارزش-دارایی-ها-و-تعداد-واحد-ها"
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")
table = soup.find("a", class_="dt-button buttons-copy buttons-html5")
I don't want to use Selenium, because it takes a lot of time. Please use Beautiful Soup.
To me it seems pretty unnecessary to use any sort of web scraping here. Since you can download the data as a file anyway, it isn't worth going through the parsing you would need to reproduce that data via scraping.
Instead you could just download the data and read it into a pandas DataFrame. You will need to have pandas installed; if you have Anaconda, you might already have it on your computer, otherwise you might need to download Anaconda and install pandas:
conda install pandas
More Information on Installing Pandas
With pandas you can read in the data directly from the excel-sheet:
import pandas as pd
df = pd.read_excel("dataset.xlsx")
pandas.read_excel documentation
If this causes difficulties, you can still convert the Excel sheet to a CSV and use pd.read_csv. Note that you'll want to use the correct encoding.
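For instance, if the sheet is exported as CSV, it might be read along these lines (the filename and encoding are placeholders to adjust):

import pandas as pd

# utf-8-sig copes with a possible BOM from Excel's CSV export
df = pd.read_csv("dataset.csv", encoding="utf-8-sig")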
If you want to use BeautifulSoup for some reason, you might want to look into how to parse tables.
For a normal table, you would want to identify the content you want to scrape correctly. The table on that specific website has an id which is "arzeshdarayi". It is also the only table on that page, so you can also use the <table>-Tag to select it.
table = soup.find("table", id="arzeshdarayi")
# or, equivalently, with a CSS selector (select returns a list):
table = soup.select("#arzeshdarayi")
The table on the website you provided has only a static header; the data is rendered by JavaScript, so BeautifulSoup won't be able to retrieve it. However, you can fetch the JSON object that the JavaScript works with and, once again, read it in as a DataFrame:
import requests
import pandas as pd

r = requests.get("http://www.kianfunds2.com/json/gettables.ashx?get=arzeshdarayi")
data = r.json()  # the JSON payload that the page's JavaScript renders into the table
df = pd.DataFrame.from_dict(data)
If you really want to scrape it, you will need some sort of browser simulation, so the JavaScript is evaluated before you access the HTML.
This answer recommends Requests-HTML, a very high-level approach to web scraping that brings together Requests and BeautifulSoup-style parsing and renders JavaScript. Your code would look somewhat like this:
import requests_html as request

session = request.HTMLSession()
url = "http://www.kianfunds2.com/ارزش-دارایی-ها-و-تعداد-واحد-ها"
r = session.get(url)

# Render the website including JavaScript
# (uses Chromium, which is downloaded on first execution)
r.html.render(sleep=1)

# Find the table by its id and take only the first result
table = r.html.find("#arzeshdarayi")[0]

# Loop through the table rows, extracting the heading and data cells
for items in table.find("tr"):
    data = [item.text for item in items.find("th,td")[:-1]]
    print(data)
I found several questions related to my question but none helped me.
I have to download .hdf files for a few years, with at least 3 files per month.
I tried to loop in the shell, but it would take 1000 years to download all the files, and besides, it would saturate my bandwidth.
When I access the page I need to find only the link that I want to download. This link has many variations (date, time, Julian day, and seconds), but it starts and ends the same way every year.
In the link below I show the site:
view-source:https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2010.02.02/
Where it varies:
https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/ year . month . day/
Below I show the link that I need to get.
MOD09A1.A2010033.h13v12.005.2010042232737.hdf
However, this file name varies greatly.
MOD09A1. varies .h13v12.005. varies .hdf
One of my problems is how to make Python find just the parts that do not vary and fill in the parts that do. Another is how to download an .hdf file. And finally, how do I supply a login and password (which are required to download the file)? I'm new to Python and I believe I can do this, but if not I'll unfortunately have to download everything by hand.
My script is:
import requests
import wget
import os
from urllib.parse import urlparse
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2013.08.21/'
resp = requests.get(url)
html = resp.text
soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
t = tags[471]  # index of the <a> tag holding the file name I want
down = t.attrs['href']
full_path = url + down

r = requests.post(full_path, auth=('lucas', 'xxxx'), verify=True)
r.status_code

# Write the response to a local file named after the link (full_path is a URL, not a valid local path)
with open(down, 'wb') as fd:
    for chunk in r.iter_content():
        fd.write(chunk)
But my r.status_code is still 401: it is not authenticating, and the .hdf file is not downloading.
Thanks!
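One way to isolate just the parts of the file name that do not vary is a regular expression over the page's links. This is only a sketch under the assumption that the fixed pieces are MOD09A1., the tile h13v12, the collection 005 and the .hdf suffix; it does not address the USGS login (the 401) at all:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2013.08.21/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Match only the non-varying parts; the acquisition date and processing timestamp are wildcards
pattern = re.compile(r'^MOD09A1\.A\d{7}\.h13v12\.005\.\d+\.hdf$')

hdf_links = [a['href'] for a in soup.find_all('a', href=True)
             if pattern.match(a['href'])]
print(hdf_links)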
How can I get the frequency of a specified word in a Wikipedia article without storing the whole article and then processing it? For example, how many times does the word "India" occur in this article: https://simple.wikipedia.org/wiki/India
Here's a simple-minded example that reads the web page line by line. But there is no guarantee the HTML is broken into lines. (It is in this case, over 1300 of them.)
import re
import urllib.request
from collections import Counter
URL = 'https://simple.wikipedia.org/wiki/India'
counter = Counter()
with urllib.request.urlopen(URL) as source:
    for line in source:
        words = re.split(r"[^A-Z]+", line.decode('utf-8'), flags=re.I)
        counter.update(words)

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))
OUTPUT
> python3 test.py
India: 547
Indian: 75
Indians: 11
>
This also counts terms if they appear in the HTML structure of the page, not just the content.
If you want to focus on content, consider the Pywikibot python library which uses the preferred MediaWiki API to extract content, though it appears to be based on a "complete page at a time" model which you noted you're trying to avoid. Regardless, that module's documentation points to a list of similar, but more advanced, packages that you might want to review.
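If hitting the MediaWiki API directly is an option, the TextExtracts endpoint can return the article as plain text, which avoids counting words that only appear in the page's HTML structure. This is a sketch with two caveats: it assumes the extracts API behaves as described for this page, and it still fetches the article text in one response (and may truncate very long pages):

import re
import requests
from collections import Counter

API = 'https://simple.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'prop': 'extracts',
    'explaintext': 1,   # plain text instead of HTML
    'titles': 'India',
    'format': 'json',
}

pages = requests.get(API, params=params).json()['query']['pages']
text = next(iter(pages.values()))['extract']

counter = Counter(re.split(r"[^A-Z]+", text, flags=re.I))
for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))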