I might have some Korean letter encoding issue. - python-3.x

I am using Python 3.6 and PyCharm 2016.2 and trying to crawl a web site.
In the category "보험사고이력 정보 : 내차 피해" ("insurance accident history: damage to my own car", the fifth table on the page), I tried to crawl the data whenever one of the p tags has "- 사고일자" ("accident date") in its contents.
Below is my code. It keeps returning nothing.
Please help.
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse
import re

popup_insurance = "http://www.bobaedream.co.kr/mycar/popup/mycarChart_B.php?car_number=35%EB%91%908475&tbl=cyber&cno=651451"
res = urllib.request.urlopen(popup_insurance)
html = res.read()
soup_insurance = BeautifulSoup(html, 'html.parser')
insurance_content_table = soup_insurance.find_all('table')
elem = soup_insurance.find("p", text="보험사고이력 정보 : 내차 피해")
while elem.string != "보험사고이력 정보 : 타차 가해":
    if "사고일자" in elem.next_sibling:
        print(elem.next_sibling)
    elem = elem.next_sibling
    if elem is None:
        break

You should loop through elem.next_sibling; NavigableStrings can be odd sometimes:
from bs4 import BeautifulSoup
import urllib.request

popup_insurance = "http://www.bobaedream.co.kr/mycar/popup/mycarChart_B.php?car_number=35%EB%91%908475&tbl=cyber&cno=651451"
res = urllib.request.urlopen(popup_insurance)
html = res.read()
soup_insurance = BeautifulSoup(html, 'html.parser')
insurance_content_table = soup_insurance.find_all('table')
elem = soup_insurance.find("p", text="보험사고이력 정보 : 내차 피해")
while elem.string != "보험사고이력 정보 : 타차 가해":
    for string in elem.next_sibling:
        if "사고일자" in string:
            print(elem.next_sibling.string.strip())
    elem = elem.next_sibling
    if elem is None:  # ran past the last sibling
        break
I am assuming (since you did not provide the expected output) that you wanted the accident date / repair cost bit.
This is nowhere near perfect or even elegant; I'm almost sure it can be done with just the for loop.
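In that spirit, here is a minimal single-for-loop sketch of the same idea (untested against the live page; note also that newer BeautifulSoup releases prefer string= over the older text= argument). It walks every p tag after the first heading and stops at the second one:

start = soup_insurance.find("p", string="보험사고이력 정보 : 내차 피해")
for p in start.find_all_next("p"):
    # Stop when we reach the next section's heading.
    if p.get_text(strip=True) == "보험사고이력 정보 : 타차 가해":
        break
    if "사고일자" in p.get_text():  # "- 사고일자" marks an accident-date row
        print(p.get_text(strip=True))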

Related

blacklist href in python to remove junk sites

I want it to print every site that isn't blacklisted (that's what the code below tries to do), but it doesn't work.
If you change the pass in the last if statement to print(site), then it prints everything in the blacklist, yet it won't print everything that isn't blacklisted, which is my goal.
import requests
from bs4 import BeautifulSoup
from lxml import html, etree
import sys
import re
import fnmatch

url = ("http://stackoverflow.com")
blacklist = ['*stackoverflow.com*', '*stackexchange.com*']
r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.select('a[href*="http"]'):
    site = (link.get('href'))
    site = str(site)
    for filtering in blacklist:
        if fnmatch.fnmatch(site, filtering):
            pass
        else:
            print(site)
You want something like:
import requests
from bs4 import BeautifulSoup
import fnmatch

url = "http://stackoverflow.com"
blacklist = ['*stackoverflow.com*', '*stackexchange.com*']
r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.select('a[href*="http"]'):
    site = str(link.get('href'))
    # Skip the link if it matches any blacklist pattern.
    if any([fnmatch.fnmatch(site, filtering) for filtering in blacklist]):
        continue
    print(site)
The issue happens here (old code):

for filtering in blacklist:
    if fnmatch.fnmatch(site, filtering):
        pass
    else:
        print(site)
While you're iterating there, a blacklisted site matches one pattern but not the other, so the else branch still fires on the non-matching pattern and the site always gets printed.
There are multiple solutions; mine was to use any() to check whether the match result is True at least once, and if it is, continue the loop and don't print :D
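A small follow-up on that any() call: passing a generator expression (no square brackets) lets any() short-circuit, so fnmatch stops at the first matching pattern instead of always scanning the whole blacklist. A minimal sketch:

import fnmatch

blacklist = ['*stackoverflow.com*', '*stackexchange.com*']

def is_blacklisted(site):
    # Generator form short-circuits on the first match.
    return any(fnmatch.fnmatch(site, pattern) for pattern in blacklist)

print(is_blacklisted("https://stackoverflow.com/questions"))  # True
print(is_blacklisted("https://example.com"))                  # False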

Scraping without specific strings in Python3

I'm trying to scrape only emoji in Python 3. I used the startswith method with an if statement, but the result still contains some Unicode escapes, and those emoji's HTML tags seem to be the same as the others'. I have no idea why some emoji are converted into Unicode escapes. Could you give me any advice, or is there any way to remove these from the list?
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
import re
import os

list0 = []
site_url = "https://www.emojiall.com/zh-hant/categories/A"
get_url = requests.get(site_url)
soup = BeautifulSoup(get_url.text, "lxml")
for script in soup(["span"]):
    script.extract()
emojis = soup.select('.emoji_font')
words = soup.select('.emoji_name_truncate')
for emoji0 in emojis:
    emoji1 = emoji0.getText()
    if not repr(emoji1).startswith(r'\U'):
        list0.append(emoji1)
    else:
        continue
print(list0)
I updated the startswith check and it works well now: repr() wraps the string in quotes, so the prefix to test for is "'\U" (including the opening quote), not "\U".
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
import re
import os

list0 = []
site_url = "https://www.emojiall.com/zh-hant/categories/A"
get_url = requests.get(site_url)
soup = BeautifulSoup(get_url.text, "lxml")
for script in soup(["span"]):
    script.extract()
emojis = soup.select('.emoji_font')
words = soup.select('.emoji_name_truncate')
for emoji0 in emojis:
    emoji1 = emoji0.getText()
    if not repr(emoji1).startswith(r"'\U"):
        list0.append(emoji1)
    else:
        continue
print(list0)
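For what it's worth, what the repr() trick is really detecting is non-printable characters: repr() only falls back to \U-style escapes for characters that str.isprintable() rejects. So a sketch of a more direct version of the same filter would be:

for emoji0 in emojis:
    emoji1 = emoji0.getText()
    if emoji1.isprintable():  # printable chars never repr() as \U escapes
        list0.append(emoji1)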

Can't find specific table using BeautifulSoup

I have been using BeautifulSoup to scrape the pricing information from
"https://www.huaweicloud.com/pricing.html#/ecs"
I want to extract the table information from that website, but I get nothing.
I am using Windows 10, the latest BeautifulSoup, Requests, and Python 3.7.
import requests
from bs4 import BeautifulSoup

url = 'https://www.huaweicloud.com/pricing.html#/ecs'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
After running soup.find_all('table'), it returns an empty list: []
I know this is not the answer to your question, but this might help you. The page builds its pricing tables with JavaScript, so the static HTML that requests downloads contains no table elements at all. This is the code I came up with using Selenium & BeautifulSoup. You just have to specify the location of chromedriver, and the script is good to go.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.huaweicloud.com/pricing.html#/ecs'
driver = webdriver.Chrome("location of chrome driver")
driver.get(str(url))
driver.find_element_by_id("calculator_tab0").click()
time.sleep(3)  # give the JavaScript time to render the tables
html_source = driver.page_source
soup = BeautifulSoup(html_source, features="lxml")
table_all = soup.findAll("table")
output_rows = []
for table in table_all[:2]:
    for table_row in table.findAll('tr'):
        thead = table_row.findAll('th')
        columns = table_row.findAll('td')
        _thead = []
        for th in thead:
            _thead.append(th.text)
        output_rows.append(_thead)
        _row = []
        for column in columns:
            _row.append(column.text)
        output_rows.append(_row)
output_rows = [x for x in output_rows if x != []]  # drop empty rows
df = pd.DataFrame(output_rows)
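As a side note, here is a minimal sketch that swaps the fixed time.sleep(3) for an explicit wait (same Selenium 3-style API as above): it blocks only until a table actually appears, up to a 10-second cap, instead of sleeping blindly:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome("location of chrome driver")
driver.get("https://www.huaweicloud.com/pricing.html#/ecs")
driver.find_element_by_id("calculator_tab0").click()
# Wait until at least one <table> is present in the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)
html_source = driver.page_source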

Not able to use BeautifulSoup to get span content of Nasdaq100 future

from bs4 import BeautifulSoup
import re
import requests

url = 'www.barchart.com/futures/quotes/NQU18'
r = requests.get("https://" + url)
data = r.text
soup = BeautifulSoup(data)
price = soup.find('span', {'class': 'last-change',
                           'data-ng-class': "highlightValue('priceChange')"}).text
print(price)
Result:
[[ item.priceChange ]]
That is not the span content; the result should be the price. Where am I going wrong?
(The question originally included a screenshot of the span tag here, plus a second screenshot with a follow-up question: how can I get the trade time?)
That [[ item.priceChange ]] is an un-rendered Angular template placeholder (note the data-ng-class attribute); the real value is filled in by JavaScript in the browser. Use price = soup.find('span', {'class': 'up'}).text instead to get the +X.XX value:
from bs4 import BeautifulSoup
import requests
url = 'www.barchart.com/futures/quotes/NQU18'
r = requests.get("https://" + url)
data = r.text
soup = BeautifulSoup(data, "lxml")
price = soup.find('span', {'class': 'up'}).text
print(price)
Output currently is:
+74.75
The tradeTime you seek seems not to be present in the page source, since it's dynamically generated through JavaScript. You can, however, find it elsewhere if you're a little clever, and use the json library to parse the JSON data from a certain script element:
import json
trade_time = soup.find('script', {"id": 'barchart-www-inline-data'}).text
json_data = json.loads(trade_time)
print(json_data["NQU18"]["quote"]["tradeTime"])
This outputs:
2018-06-14T18:14:05
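In case the page layout shifts, here is a slightly more defensive sketch of the same JSON lookup (the script tag id and the NQU18/quote/tradeTime keys are taken from the answer above):

import json

script_tag = soup.find('script', {"id": 'barchart-www-inline-data'})
if script_tag is not None:
    json_data = json.loads(script_tag.text)
    # Chained .get() calls avoid KeyErrors if the structure changes.
    trade_time = json_data.get("NQU18", {}).get("quote", {}).get("tradeTime")
    print(trade_time)  # e.g. 2018-06-14T18:14:05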
If these don't solve your problem, then you will have to resort to something like Selenium that can run JavaScript to get what you're looking for:

from selenium import webdriver

driver = webdriver.Chrome()
url = "https://www.barchart.com/futures/quotes/NQU18"
driver.get(url)
result = driver.find_element_by_xpath('//*[@id="main-content-column"]/div/div[1]/div[2]/span[2]/span[1]')
print(result.text)
Currently the output is:
-13.00

Can't print tag 'content' anymore

I had a perfectly well-working scraper for TripAdvisor that met all my needs. Then I tried to use it after a four-day break and something went wrong; I quickly realized that TA had changed some of the tags. I made the appropriate changes, but I still couldn't get it working as before. I want to grab the value of the 'content' attribute within an element.
This is the element:
<div class="prw_rup prw_common_bubble_rating bubble_rating" data-prwidget-init="" data-prwidget-name="common_bubble_rating"><span alt="5 of 5 bubbles" class="ui_bubble_rating bubble_50" content="5" property="ratingValue" style="font-size:18px;"></span></div>
and here is the code:
for bubs in data.findAll('div', {'class': "prw_rup prw_common_bubble_rating bubble_rating"}):
    print([img["content"] for img in bubs.select("img[content]")])
but now it only gives me an empty [] instead of the content, which is '5'.
Anybody know what may have changed?
Here is the rest of my code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

theurl = "https://www.tripadvisor.com/Hotels-g147364-c3-Cayman_Islands-Hotels.html"
thepage = urllib
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
base_url = "https://www.tripadvisor.com"
urls = []
init_info = []
init_data = open('/Users/paribaker/Desktop/scrapping/TripAdvisor/Inv/speccaydata.txt', 'w')

for link in soup.findAll('a', href=re.compile('/Hotel_Review')):
    urls.append(base_url + (link.get('href')).strip("#REVIEWS"))

def remove_duplicates(urls):
    output = []
    seen = set()
    for line in urls:
        if line not in seen:
            output.append(line)
            seen.add(line)
    return output

urls2 = remove_duplicates(urls)

for url in urls2:
    try:
        driver = webdriver.Chrome()
        driver.get(url)
        element = driver.find_element_by_id("taplc_prodp13n_hr_sur_review_filter_controls_0_filterLang_ALL").click()
        print("succesfull")
        moreinfo = driver.page_source
        moresoup = BeautifulSoup(moreinfo, "html.parser")
        driver.close()
        #moreinfo = urllib
        #moreinfo = urllib.request.urlopen(url)
        #moresoup = BeautifulSoup(moreinfo,"html.parser")
    except:
        print("none")
    for data in moresoup.findAll('div', {"class": "heading_2014 hr_heading"}):
        try:
            for title in data.findAll('h1', {'id': "HEADING"}):
                init_info.append(title.text.strip("\n") + ",\t")
            for add_data in data.findAll('span', {'class': 'format_address'}):
                print((add_data.find('span', {'class': 'street-address'}).text + ",\t"))
                init_info.append(add_data.find('span', {'class': 'street-address'}).text + ",\t")
                init_info.append(add_data.find('span', {'class': 'locality'}).text + ",\t")
                init_info.append(add_data.find('span', {'class': 'country-name'}).text + ",\t")
            for reviews in data.findAll('a', {'class': 'more taLnk'}):
                init_info.append(reviews.text).strip("\n")
                init_info.append(", \t")
            #init_info.append([img["alt"] for img in stars.select("img[alt]")])
            #init_info.append([img["content"] for img in stars.select("img[content]")])
        except:
            init_info.append("N/A" + ", /t")
The element with the content="5" attribute is a span, not an img.
Does this get what you want?
for bubs in data.findAll('div', {'class': "prw_rup prw_common_bubble_rating bubble_rating"}):
    print([elem["content"] for elem in bubs.select("span[content]")])
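Equivalently, if it helps, a one-pass sketch that selects the rating spans straight off the page soup and skips the intermediate div loop (assuming the moresoup variable from the code above):

# Grab every bubble-rating value on the page in one pass.
ratings = [span["content"]
           for span in moresoup.select("div.bubble_rating span[content]")]
print(ratings)  # e.g. ['5', '4', ...]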
