Python 3: How to get the English title from a URL? - python-3.x

I use this code:
import urllib.request
fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
x = 'alt'
for item in mystr.split():
    if x in item:
        print(item.strip())
I get the Thai words from this code, but I don't know how to get the English words. Thanks.

If you want to get words from a table, you should use a parsing library such as BeautifulSoup4. Here is an example of how you can parse this page (I'm using requests to fetch the data and BeautifulSoup to parse it):
First, use the dev tools in your browser to identify the table with the content you want to parse. The table with translations has the servicesT class attribute, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that you need to get all rows that contain translations for Thai words. If you look at the page's source, you will notice that the first few <tr> rows contain only headers, so we will skip them. Then we get all <td> elements from each row (in that table there are always 3 <td> elements) and fetch the words from them (in this table the words are actually nested in <span> and <a> tags).
table_rows = table.find_all('tr')
# Skip the first 3 rows because they do not
# contain the information we need
for tr in table_rows[3:]:
    # Find all <td> elements in the row
    row_columns = tr.find_all('td')
    if len(row_columns) >= 2:
        # Get the tag with the Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get the tag with the English word
        english_word_tag = row_columns[1].find('span')
        # Fall back to None when a tag is missing, so the names are always defined
        thai_word = thai_word_tag.text if thai_word_tag else None
        english_word = english_word_tag.text if english_word_tag else None
        # Print the fetched words
        print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I've also noticed that the rows in the table do not always have translations, so keep that in mind when scraping the data. You can also use the Requests-HTML library to parse the data (it supports pagination, which is present in the table on the page you want to scrape).
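The point about rows without translations can be handled explicitly. Below is a minimal sketch, assuming a simplified table inlined as a string: the markup is made up for illustration and only imitates the servicesT structure described above, and extract_pairs is a hypothetical helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# the real page may differ.
SAMPLE_HTML = """
<table class="servicesT">
  <tr><th>Thai</th><th>English</th></tr>
  <tr><td><span><a href="#">แมว</a></span></td><td><span>cat</span></td></tr>
  <tr><td><span><a href="#">หมา</a></span></td><td></td></tr>
</table>
"""


def extract_pairs(html, header_rows=1):
    """Return (thai, english) tuples; english is None when it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="servicesT")
    pairs = []
    for tr in table.find_all("tr")[header_rows:]:
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue
        thai_tag = cells[0].select_one("span > a")
        english_tag = cells[1].find("span")
        if thai_tag is None:
            continue  # no Thai word in this row, nothing to pair
        english = english_tag.get_text(strip=True) if english_tag else None
        pairs.append((thai_tag.get_text(strip=True), english))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Returning None for a missing translation keeps every Thai word in the result, so the caller can decide whether to drop or keep incomplete rows.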

Related

How to select specific values of an element in Python 3

This is my first question so it may be quite basic.
I've managed to identify and select the element, but I cannot extract specific values like "IDinmobiliarias" from it.
data = soup.select('#PropJSON')
print(data)
When I do this, I get this output:
[<input id="PropJSON" type="hidden" value='{"id":"186226916","IDinmobiliarias":"108","IDoperaciones":"1","tipoPropiedad":"2","IDdepartamentos":"10","IDzonas":"13","IDpais":"1","refered":1,"particular":"0","temporario":0,"proyecto":0,"destaque":1,"IDmoneda":"1","monto":"1595000","precio_en_usd":1595000,"monedaISO":"USD"}'/>]
How can I extract the "108" for example?
I've tried different things without success.
select returns a list. You can iterate over that list and read the value attribute of each tag by accessing it like a dictionary. Once you have that string, parse it with json, and then you can select any element you like from the result.
from bs4 import BeautifulSoup
import json
html = """<input id="PropJSON" type="hidden" value='{"id":"186226916","IDinmobiliarias":"108","IDoperaciones":"1","tipoPropiedad":"2","IDdepartamentos":"10","IDzonas":"13","IDpais":"1","refered":1,"particular":"0","temporario":0,"proyecto":0,"destaque":1,"IDmoneda":"1","monto":"1595000","precio_en_usd":1595000,"monedaISO":"USD"}'/>"""
soup = BeautifulSoup(html, features="lxml")
data = soup.select('#PropJSON')
for input_tag in data:
    json_string = json.loads(input_tag['value'])
    print(json_string['IDinmobiliarias'])
OUTPUT
108

Beautifulsoup span class is returning a blank string

I am trying to print out different things from a Norwegian weather site with BeautifulSoup.
I manage to print out everything I want except one thing, which describes how the weather will be in the next hour.
This contains the text I want to get:
<span class="nowcast-description" data-reactid="59">har opphold nå, det holder seg tørt den neste timen</span>
And I am trying to print it with this:
cond = soup.find(class_='nowcast-description').get_text()
[Screenshot: inspected elements from storm.no/ski]
These are the values I am printing:
soup = bs4.BeautifulSoup(html, "html.parser")
loc = soup.find(class_='info-text').get_text()
cond = soup.find(class_='nowcast-description').get_text()
temp = soup.find(class_='temperature').get_text()
wind = soup.find(class_='indicator wind').get_text()
I also tested this line:
cond = soup.select("span.nowcast-description")
but that gives me everything except the text I want from the line.
Site link: https://www.storm.no/ski
I get:
Ski Akershus, 131 moh.
""
2°
3 m/s
It is retrieved dynamically from a script tag. You can regex out the object containing all the forecasts and parse it with the hjson library, because its keys are unquoted. You need to install hjson, then do the following:
import requests, hjson, re
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.storm.no/ski', headers=headers)
p = re.compile(r'window\.__dehydratedState = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(r.text)[0])
print(data['app-container']['current']['forecast']['nowcastDescription'])
You could also regex out the description directly, but using hjson means you have access to all the other data.
It's because the text under nowcast-description is generated dynamically. If you dump the loaded page:
print(soup.prettify())
You will find only this:
<span class="nowcast-description" data-reactid="59">
</span>
On rough analysis, it seems that the content of this span is loaded from the field nowcastDescription, which is part of window.__dehydratedState.
Because that state is a JSON-like object embedded in the page, you can try to extract the field from it.
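A minimal, offline sketch of that extraction idea follows. It assumes a made-up page source in which the state object is strict JSON with quoted keys, so the standard json module is enough; the real storm.no payload is reported above to use unquoted keys, which is why the other answer reaches for hjson.

```python
import json
import re

# Hypothetical page source: a script tag assigning a JSON object to a global.
# The real payload uses unquoted keys, which the json module cannot parse.
SAMPLE_HTML = """
<span class="nowcast-description"></span>
<script>
window.__dehydratedState = {"app-container": {"current": {"forecast":
  {"nowcastDescription": "dry for the next hour"}}}};
</script>
"""

# Capture everything from the opening brace up to the closing brace
# that is directly followed by a semicolon
pattern = re.compile(r"window\.__dehydratedState\s*=\s*(\{.*?\});", re.DOTALL)
match = pattern.search(SAMPLE_HTML)
state = json.loads(match.group(1))
description = state["app-container"]["current"]["forecast"]["nowcastDescription"]
print(description)
```

The non-greedy pattern stops at the first "};" it meets, so it would cut the object short if that sequence appeared inside a string value; treat it as a starting point rather than a robust parser.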

Why is the .get('href') returning "None" on a bs4.element.tag?

I'm pulling together a dataset to do analysis on. The goal is to parse a table on a SEC webpage and pull out the link in a row that has the text "SC 13D" in it. This needs to be repeatable so I can automate it across a large list of links I have in a database. I know this code is not the most Pythonic, but I hacked it together to get what I need out of the table, except for the link in the table row. How can I extract the href value from the table row?
I tried doing a .findAll on 'tr' instead of 'td' in the table (Line 15) but couldn't figure out how to search for "SC 13D" and pop the element from the list of table rows, as I can when I perform the .findAll('td'). I also tried to get just the anchor tag with the link in it using .get('a') instead of .get('href') (included in the code, line 32), but it also returns "None".
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'https://www.sec.gov/Archives/edgar/data/1050122/000101143807000336/0001011438-07-000336-index.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table',{'summary':'Document Format Files'})
rows = table.findAll("td")
i = 0
pos = 0
for row in rows:
    if "SC 13D" in row:
        pos = i
        break
    else:
        i = i + 1
linkpos = pos - 1
linkelement = rows[linkpos]
print(linkelement.get('a'))
print(linkelement.get('href'))
The expected result is the link in linkelement being printed. The actual result is "None".
It is because your a tag is inside your td tag.
You just have to do:
linkelement = rows[linkpos]
a_element = linkelement.find('a')
print(a_element.get('href'))
Switch your .get to .find.
You want to find the <a> tag and print its href attribute:
print(linkelement.find('a')['href'])
Or use .get on the tag itself:
print(linkelement.a.get('href'))
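The question also asks how to search by table row rather than by cell. A minimal sketch of that approach, assuming a trimmed-down table inlined as a string: the rows and links below are invented for illustration, and find_link is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down version of the filing index table.
SAMPLE_HTML = """
<table summary="Document Format Files">
  <tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>
  <tr><td>1</td><td>SCHEDULE 13D</td><td><a href="/Archives/doc1.htm">doc1.htm</a></td><td>SC 13D</td></tr>
  <tr><td>2</td><td>EXHIBIT</td><td><a href="/Archives/doc2.htm">doc2.htm</a></td><td>EX-99</td></tr>
</table>
"""


def find_link(html, doc_type="SC 13D"):
    """Return the href of the first row that has a cell equal to doc_type."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"summary": "Document Format Files"})
    for tr in table.find_all("tr"):
        cell_texts = [td.get_text(strip=True) for td in tr.find_all("td")]
        if doc_type in cell_texts:
            anchor = tr.find("a")
            # Guard against rows that match but carry no link
            return anchor.get("href") if anchor else None
    return None


link = find_link(SAMPLE_HTML)
print(link)
```

Working row by row removes the need for the position counter: once the matching row is found, the link is looked up inside that same row.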

How to print the table from a website using python script?

Here is my python script so far.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'my_company_website'
#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each product
containers = page_soup.findAll("div",{"class":"navigator-content"})
print (containers)
After this, in the browser's inspector, the markup looks like this:
<div class ="issue-table-container">
<div>
<table id ="issuetable" class>
<thead>...</thead>
<tbody>...</tbody> (This contains all the information I want to print)
</table>
How can I print the table and export it to CSV?
For each of the containers you should grab the table [1], then find the body of the table and iterate over its rows [2], and compile a line for your CSV file from the table cells (td) [3]:
# "issues.csv" is a placeholder file name; use whatever output path you need
with open("issues.csv", "w") as csvfile:
    for container in containers:
        table = container.find(id="issuetable")  # [1]
        # If you are sure of the structure, or the tables have unique ids and
        # there is only one table per container, you can also do:
        # table = container.table  # [1]
        for tr in table.tbody.find_all("tr"):  # [2]
            line = ""
            for td in tr.find_all("td"):  # [3]
                # Add the text in the td to the line, followed by the
                # separator of your choice (a comma in this case)
                line += td.text + ","
            # Write the line without its trailing comma, then a newline
            csvfile.write(line[:-1] + "\n")
There are different ways of navigating the soup tree, depending on your needs and on how flexible your script needs to be.
Have a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and check out the find / find_all sections.
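As one of those alternatives, the standard csv module can build each line and takes care of quoting cells that themselves contain commas. This is a minimal sketch under assumptions: the markup is invented to follow the structure shown in the question, and an in-memory buffer stands in for the output file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup following the structure shown in the question.
SAMPLE_HTML = """
<div class="navigator-content">
  <div class="issue-table-container">
    <table id="issuetable">
      <thead><tr><th>Key</th><th>Summary</th></tr></thead>
      <tbody>
        <tr><td>ABC-1</td><td>Login fails, sometimes</td></tr>
        <tr><td>ABC-2</td><td>Update docs</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
buffer = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(buffer)

for container in page_soup.find_all("div", {"class": "navigator-content"}):
    table = container.find(id="issuetable")
    for tr in table.tbody.find_all("tr"):
        # One CSV row per table row, one field per cell
        writer.writerow([td.get_text(strip=True) for td in tr.find_all("td")])

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, replace the buffer with open(path, "w", newline="") where path is your own output location.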
Good luck!
/Teo

(Python)- How to store text extracted from HTML table using BeautifulSoup in a structured python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table row and print its text:
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to an xls file and operate on the elements (for example, if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text
text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get
n
That means the list takes in individual characters and not the full words.
Please let me know what I am doing wrong.
That is because your Ear_yield_text is a string rather than a list, so iterating over it yields single characters. Assuming that the text contains newlines, you can do this directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield, you will get this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
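Since the goal is to compare the values, the split strings still need converting to numbers. A minimal sketch, assuming the row text looks like the output shown in the question (the sample string below is typed in by hand, not fetched from any page):

```python
# Text as it might come back from the table row (assumed sample).
row_text = "Earnings Yield\n0.01\n-0.59\n-0.33\n-1.23\n-0.11\n"

# Split on line breaks and drop empty or whitespace-only entries
parts = [line.strip() for line in row_text.splitlines() if line.strip()]

label = parts[0]
values = [float(item) for item in parts[1:]]

print(label, values)
print(values[0] > values[1])
```

Keeping the label separate from the numeric values means comparisons such as values[0] > values[1] work on floats rather than on strings.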
