I have a list of authors and a list of book titles, and I'm using an API to retrieve the description and rating of each book. I can't work out how to iterate over the lists and store the descriptions in a new list.
The Goodreads API lets me look up a book's information by title and author (for accuracy) or by title only. What I want is:
Loop over the lists of titles and authors, take the first title and first author, try to retrieve the description, and save it in a third list.
If it is not found, try to retrieve the description using only the title and save that to the list.
If it is still not found, add a standard error message.
I've tried the code below, but I'm not able to iterate over the entire list and save the results.
import requests
from bs4 import BeautifulSoup

# Running the API with the author's name and the book's title
book_url_w_author = 'https://www.goodreads.com/book/title.xml?author='+edit_authors[0]+'&key='+api_key+'&title='+edit_titles[0]
# Running the API with only the book's title
book_url_n_author = 'https://www.goodreads.com/book/title.xml?key='+api_key+'&title='+edit_titles[0]
# parse book url with author
html_w_author = requests.get(book_url_w_author).text
soup_w_author = BeautifulSoup(html_w_author, "html.parser")
# parse book url without author
html_n_author = requests.get(book_url_n_author).text
soup_n_author = BeautifulSoup(html_n_author, "html.parser")
# Retrieving the books' descriptions
description = []
try:
    # fetch description for url with author and title and add it to descriptions list
    for desc_w_author in soup_w_author.book.description:
        description.append(desc_w_author)
except:
    # fetch description for url with only title and add it to descriptions list
    for desc_n_author in soup_n_author.book.description:
        description.append(desc_n_author)
else:
    # return and inform that no description was found
    description.append('Description not found')
Expected:
description = [description1, description2, description3, ....]
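One way to get the loop the question describes is to move the lookup into a function and call it once per title/author pair. The sketch below is only that, a sketch: `fetch_description` is a hypothetical callable (not part of the Goodreads API) that is assumed to return the description text or `None`, and the dictionary-backed `fake_fetch` is a stand-in so the fallback logic can be run without any network access.

```python
def collect_descriptions(titles, authors, fetch_description,
                         not_found="Description not found"):
    """Return one description per title, falling back step by step."""
    descriptions = []
    for title, author in zip(titles, authors):
        # 1) most specific lookup: title and author
        text = fetch_description(title, author)
        # 2) fall back to a title-only lookup
        if not text:
            text = fetch_description(title, None)
        # 3) still nothing: store the standard message
        descriptions.append(text if text else not_found)
    return descriptions


# Stand-in lookup backed by a dict instead of the real API (assumption).
_fake_api = {
    ("Dune", "Frank Herbert"): "A desert planet epic.",
    ("Emma", None): "A comedy of manners.",
}


def fake_fetch(title, author):
    return _fake_api.get((title, author))


descriptions = collect_descriptions(
    ["Dune", "Emma", "Unknown Book"],
    ["Frank Herbert", "J. Austen", "Nobody"],
    fake_fetch,
)
print(descriptions)
```

In real use, `fetch_description` would wrap the `requests.get(...)` and `BeautifulSoup(...)` calls from the question, build the URL from the current `title` and `author` rather than from index `[0]`, and return `None` when the response has no `book` or `description` element.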
I'm hoping to query the PubMed API based on a list of paper IDs and return the title, abstract and content.
So far I have been able to get the title and abstract by ID with the following:
from metapub import PubMedFetcher
pmids = [2020202, 1745076, 2768771, 8277124, 4031339]
fetch = PubMedFetcher()
title_list = []
abstract_list = []
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    abstract = article.abstract  # str
    abstract_list.append(abstract)
    title = article.title  # str
    title_list.append(title)
Alternatively, I can get the full paper content, but then the query is based on keywords rather than IDs:
from pymed import PubMed
email = 'myemail@gmail.com'
pubmed = PubMed(tool="PubMedSearcher", email=email)
## PUT YOUR SEARCH TERM HERE ##
search_term = "test"
results = pubmed.query(search_term, max_results=300)
articleList = []
articleInfo = []
for article in results:
    # The object can be either a PubMedBookArticle or a PubMedArticle.
    # Convert it to a dictionary with the available function
    articleDict = article.toDict()
    articleList.append(articleDict)
# Generate a list of dict records holding all article details that could be fetched from the PubMed API
for article in articleList:
    # Sometimes article['pubmed_id'] contains several IDs; take the first one, which is the article's own PubMed ID
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append article info to the list
    articleInfo.append({u'pubmed_id':pubmedId,
                        u'title':article['title'],
                        u'keywords':article['keywords'],
                        u'journal':article['journal'],
                        u'abstract':article['abstract'],
                        u'conclusions':article['conclusions'],
                        u'methods':article['methods'],
                        u'results': article['results'],
                        u'copyrights':article['copyrights'],
                        u'doi':article['doi'],
                        u'publication_date':article['publication_date'],
                        u'authors':article['authors']})
Any help is appreciated!
I have a list and want to extract a particular value from each item in it. Below is my list
I want to extract the 'src' link from the above list
example:
(src="https://r-cf.bstatic.com/xdata/images/hotel/square600/244245064.webp?k=8699eb2006da453ae8fe257eee2dcc242e70667ef29845ed85f70dbb9f61726a&o="). My final aim is to extract only the link. I have 20 records in the list, so I need to extract 20 links from it.
My code (I stored the list in 'aas')
links = []
for i in aas:
    link = re.search('CONCLUSION: (.*?)([A-Z]{2,})', i).group(1)
    links.append(link)
I am getting an error: "expected string or bytes-like object"
Any suggestions?
As per the Beautiful Soup documentation, you can access a tag’s attributes by treating the tag like a dictionary, like so:
for img in img_list:
    print(img["src"])
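The error in the question, "expected string or bytes-like object", is what `re.search` raises when it is handed something that is not a string, which would be the case if the list holds Beautiful Soup tags. If you would rather keep a regex, a minimal sketch is to convert each item with `str()` first. The pattern and the `extract_src` name are illustrative assumptions, and the sample items are plain strings standing in for the real list.

```python
import re

# Matches the value of a double-quoted src attribute (illustrative pattern).
SRC_PATTERN = re.compile(r'src="([^"]+)"')


def extract_src(items):
    """Collect the src value from each item that has one."""
    links = []
    for item in items:
        # str() turns a tag object (or anything else) into text for the regex
        match = SRC_PATTERN.search(str(item))
        if match:
            links.append(match.group(1))
    return links


# Hand-written stand-ins for the 20 records in the question (assumption).
sample = [
    '<img alt="a" src="https://example.com/one.webp?k=1&o="/>',
    '<img alt="b" src="https://example.com/two.webp"/>',
    '<div>no image here</div>',
]
links = extract_src(sample)
print(links)
```

The dictionary-style access shown above is usually the more robust choice, since it does not depend on how the attribute happens to be quoted or ordered in the markup.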
Extract text under a specific class and store it in respective lists
[Image: table from which the data is to be extracted]
I am trying to extract data from "https://www.airlinequality.com/airline-reviews/vietjetair/page/1/". I am able to extract the summary, review and user info, but not the tabular data. The tabular data needs to be stored in respective lists, and different user reviews have different numbers of ratings. The code below shows a couple of things I tried; all of them return empty lists.
Extracted the review using XPath:
review = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="text_content "]')
The following are some XPaths which return an empty list. Here I am trying to extract the text corresponding to "Type Of Traveller":
tot = driver.find_elements_by_xpath('//div[@class="tc_mobile active"]//div[@class="review-stats"]//table[@class="review-ratings"]//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot1 = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="review-stats"]//table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot2 = driver.find_elements_by_xpath('//div//div/table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class = "review-value "]')
This code should do what you want. All the code is doing at a basic level is following the DOM structure and then iterating over each element at that layer.
It extracts the values into a dictionary for each review and then appends that to a results list:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.airlinequality.com/airline-reviews/vietjetair/page/1/")
review_tables = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//table[@class="review-ratings"]//tbody')  # Gets all the review tables
results = list()  # A list of all rating results
for review_table in review_tables:
    review_rows = review_table.find_elements_by_xpath('./tr')  # Gets each row from the table
    rating = dict()  # Holds the rating result
    for row in review_rows:
        review_elements = row.find_elements_by_xpath('./td')  # Gets each element from the row
        if review_elements[1].text == '12345':  # Logic to extract star rating as int
            rating[review_elements[0].text] = len(review_elements[1].find_elements_by_xpath('./span[@class="star fill"]'))
        else:
            rating[review_elements[0].text] = review_elements[1].text
    results.append(rating)  # Add rating to results list
Sample entry of review data in results list:
{
    "Date Flown": "January 2019",
    "Value For Money": 2,
    "Cabin Staff Service": 3,
    "Route": "Ho Chi Minh City to Bangkok",
    "Type Of Traveller": "Business",
    "Recommended": "no",
    "Seat Comfort": 3,
    "Cabin Flown": "Economy Class",
    "Ground Service": 1
}
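To check the row-walking logic without launching a browser, the same traversal can be tried on a hand-written snippet. This is a sketch under assumptions: the markup below is invented to resemble the table structure the answer walks through (it is not copied from the site), `parse_ratings` is a hypothetical name, and it uses `xml.etree.ElementTree` from the standard library instead of Selenium, so star cells are detected by the presence of `<span>` children rather than by comparing the cell text to '12345'.

```python
import xml.etree.ElementTree as ET

# Invented markup that mimics one review table (assumption, not the real page).
SNIPPET = """
<div class="tc_mobile">
  <table class="review-ratings">
    <tbody>
      <tr>
        <td class="review-rating-header">Type Of Traveller</td>
        <td class="review-value">Business</td>
      </tr>
      <tr>
        <td class="review-rating-header">Seat Comfort</td>
        <td class="review-rating-stars">
          <span class="star fill">1</span>
          <span class="star fill">2</span>
          <span class="star fill">3</span>
          <span class="star">4</span>
          <span class="star">5</span>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""


def parse_ratings(markup):
    """Return one dict per review table, mapping header text to value."""
    root = ET.fromstring(markup)
    results = []
    for tbody in root.findall(".//table[@class='review-ratings']/tbody"):
        rating = {}
        for row in tbody.findall("./tr"):
            header, value = row.findall("./td")
            if value.findall("./span"):
                # Star cell: the rating is the number of filled stars
                rating[header.text] = len(value.findall("./span[@class='star fill']"))
            else:
                rating[header.text] = value.text
        results.append(rating)
    return results


ratings = parse_ratings(SNIPPET)
print(ratings)
```

Separately, Selenium 4.3 and later removed the `find_elements_by_xpath` helpers in favour of `find_elements(By.XPATH, ...)`, so the Selenium code may need that adjustment depending on the installed version.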
I have a list of URLs that point to different anime on myanimelist.net. For each anime, I want to get the text of the genres listed on its page and add it to a list of strings (one element per anime, not 5 separate elements if an anime has 5 genres listed).
Here is the HTML code for an anime on myanimelist.net. I essentially want to get the genre text at the top of the image and put it in a list, so for the image shown its entry in the list would be ["Mystery, Police, Psychological, Supernatural, Thriller, Shounen"], and for each URL in my list another string containing the genres for that anime is appended to the list.
This is the main part of my code
driver = webdriver.Firefox()
flist = [url1, url2, url3] #List of urls
genres = []
for item in flist:
    driver.get(item)  # Opens each url
    elem = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div/div[16]").text
    genres.append(elem)
The code works for some anime and not for others. The position of the element differs between anime, so instead of the genres I sometimes get the studio that produced the anime, etc.
What I want is to match "Genres:" in the span and get the genres listed after it, as shown in my image above. I can't seem to find anything similar to what I'm looking for (though I might not be phrasing my searches right, and I lack experience with XPath).
driver.get('https://myanimelist.net/anime/35760/Shingeki_no_Kyojin_Season_3')
# Select the genre links inside the div whose text contains 'Genres'
links = driver.find_elements_by_xpath("//div[contains(string(), 'Genres')]/a[contains(@href,'genre')]")
genres = []
for link in links:
    title = link.get_attribute("title")
    genres.append(title)
print(genres)
genresString = ",".join(genres)
print(genresString)
Sample Output:
['Action', 'Military', 'Mystery', 'Super Power', 'Drama', 'Fantasy', 'Shounen']
Action,Military,Mystery,Super Power,Drama,Fantasy,Shounen
I am using Biopython with Python 3.x to search the PubMed database. I get the search results correctly, but next I need to extract the journal names (full names, not just abbreviations) of all the search results. Currently I am using the following code:
from Bio import Entrez
from Bio import Medline
Entrez.email = "my_email@gmail.com"
handle = Entrez.esearch(db="pubmed", term="search_term", retmax=20)
record = Entrez.read(handle)
handle.close()
idlist = record["IdList"]
records = list(records)
for record in records:
    print("source:", record.get("SO", "?"))
So this works fine, but record.get("SO", "?") returns only the abbreviation of the journal (for example, N Engl J Med, not New England Journal of Medicine). From my experience with manual PubMed searches, you can search using either the abbreviation or the full name and PubMed will handle them in the same way, so I wondered whether there is also some parameter to get the full name?
So this works fine, but record.get("SO", "?") returns only the abbreviation of the journal
No it doesn't. It won't even run due to this line:
records = list(records)
as records isn't defined. And even if you fix that, all you get back from:
idlist = record["IdList"]
is a list of numbers like: ['17510654', '2246389'] that are intended to be passed back via an Entrez.efetch() call to get the actual data. So when you do record.get("SO", "?") on one of these number strings, your code blows up (again).
First, the "SO" field abbreviation is defined to return Journal Title Abbreviation (TA) as part of what it returns. You likely want "JT" Journal Title instead as defined in MEDLINE/PubMed Data Element (Field) Descriptions. But neither of these has anything to do with this lookup.
Here's a rework of your code to get the article title and the title of the journal that it's in:
from Bio import Entrez
Entrez.email = "my_email@gmail.com"  # change this to be your email address
handle = Entrez.esearch(db="pubmed", term="cancer AND wombats", retmax=20)
record = Entrez.read(handle)
handle.close()
for identifier in record['IdList']:
    pubmed_entry = Entrez.efetch(db="pubmed", id=identifier, retmode="xml")
    result = Entrez.read(pubmed_entry)
    article = result['PubmedArticle'][0]['MedlineCitation']['Article']
    print('"{}" in "{}"'.format(article['ArticleTitle'], article['Journal']['Title']))
OUTPUT
> python3 test.py
"Of wombats and whales: telomere tales in Madrid. Conference on telomeres and telomerase." in "EMBO reports"
"Spontaneous proliferations in Australian marsupials--a survey and review. 1. Macropods, koalas, wombats, possums and gliders." in "Journal of comparative pathology"
>
Details can be found in the document: MEDLINE PubMed XML Element Descriptions
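The nested lookup in the loop above can be wrapped in a small helper so that a record without the expected keys does not stop the whole run. This is a sketch under assumptions: the sample dictionary is hand-written to mirror only the keys used above (it is not a real `Entrez.read` result), and `titles_from_result` is a hypothetical name.

```python
def titles_from_result(result, default="?"):
    """Return (article title, journal title) from a parsed efetch result."""
    articles = result.get("PubmedArticle", [])
    if not articles:
        # Nothing under 'PubmedArticle': fall back to the default for both
        return default, default
    article = articles[0].get("MedlineCitation", {}).get("Article", {})
    title = article.get("ArticleTitle", default)
    journal = article.get("Journal", {}).get("Title", default)
    return title, journal


# Hand-built structure mirroring the keys used in the answer (assumption).
sample = {
    "PubmedArticle": [
        {
            "MedlineCitation": {
                "Article": {
                    "ArticleTitle": "Of wombats and whales: telomere tales in Madrid. "
                                    "Conference on telomeres and telomerase.",
                    "Journal": {"Title": "EMBO reports"},
                }
            }
        }
    ]
}

found = titles_from_result(sample)
missing = titles_from_result({"PubmedArticle": []})
print('"{}" in "{}"'.format(*found))
print(missing)
```

Inside the loop, `titles_from_result(result)` would replace the direct indexing, and the "?" default matches the placeholder the question's `record.get("SO", "?")` call already uses.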