Biopython and retrieving a journal's full name - python-3.x

I am using Biopython with Python 3.x to run searches against the PubMed database. I get the search results correctly, but next I need to extract all the journal names (full names, not just abbreviations) from the search results. Currently I am using the following code:
from Bio import Entrez
from Bio import Medline
Entrez.email = "my_email#gmail.com"
handle = Entrez.esearch(db="pubmed", term="search_term", retmax=20)
record = Entrez.read(handle)
handle.close()
idlist = record["IdList"]
records = list(records)
for record in records:
    print("source:", record.get("SO", "?"))
So this works fine, but record.get("SO", "?") returns only the abbreviation of the journal (for example, N Engl J Med, not New England Journal of Medicine). From my experience with manual PubMed searches, you can search using either the abbreviation or the full name and PubMed handles them the same way, so I figured there might also be some parameter to get the full name?

So this works fine, but record.get("SO", "?") returns only the abbreviation of the journal
No it doesn't. It won't even run due to this line:
records = list(records)
as records isn't defined. And even if you fix that, all you get back from:
idlist = record["IdList"]
is a list of numbers like: ['17510654', '2246389'] that are intended to be passed back via an Entrez.efetch() call to get the actual data. So when you do record.get("SO", "?") on one of these number strings, your code blows up (again).
First, the "SO" field abbreviation is defined to return Journal Title Abbreviation (TA) as part of what it returns. You likely want "JT" Journal Title instead as defined in MEDLINE/PubMed Data Element (Field) Descriptions. But neither of these has anything to do with this lookup.
Here's a rework of your code to get the article title and the title of the journal that it's in:
from Bio import Entrez
Entrez.email = "my_email#gmail.com" # change this to be your email address
handle = Entrez.esearch(db="pubmed", term="cancer AND wombats", retmax=20)
record = Entrez.read(handle)
handle.close()
for identifier in record['IdList']:
    pubmed_entry = Entrez.efetch(db="pubmed", id=identifier, retmode="xml")
    result = Entrez.read(pubmed_entry)
    article = result['PubmedArticle'][0]['MedlineCitation']['Article']
    print('"{}" in "{}"'.format(article['ArticleTitle'], article['Journal']['Title']))
OUTPUT
> python3 test.py
"Of wombats and whales: telomere tales in Madrid. Conference on telomeres and telomerase." in "EMBO reports"
"Spontaneous proliferations in Australian marsupials--a survey and review. 1. Macropods, koalas, wombats, possums and gliders." in "Journal of comparative pathology"
>
Details can be found in the document: MEDLINE PubMed XML Element Descriptions
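If you would rather keep the Medline parser that the question already imports, a batch Entrez.efetch with rettype="medline" works too; in that format the "JT" key holds the full journal title and "TA" the abbreviation. A minimal sketch, reusing the example search from above (the email is a placeholder):
from Bio import Entrez
from Bio import Medline

Entrez.email = "my_email#gmail.com"  # change this to be your email address

handle = Entrez.esearch(db="pubmed", term="cancer AND wombats", retmax=20)
record = Entrez.read(handle)
handle.close()

# Fetch every hit in one MEDLINE-format request and parse it lazily
fetch_handle = Entrez.efetch(db="pubmed", id=",".join(record["IdList"]),
                             rettype="medline", retmode="text")
for medline_record in Medline.parse(fetch_handle):
    # "TA" is the abbreviated journal title, "JT" the full one
    print("journal:", medline_record.get("JT", "?"))
fetch_handle.close()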

Related

How to get more than 200 records of a custom object's fields from the Salesforce API using Python

from simple_salesforce import Salesforce
import pandas as pd
# username = 'USER_NAME'
# password = 'PASSWORD'
# security_token = 'SECURITY_TOKEN'
uname = 'USER_NAME'
passwd = 'PASSWORD'
token = 'SECURITY_TOKEN'
sfdc_query = Salesforce(username = uname,password=passwd,security_token=token)
object_list = []
for x in sfdc_query.describe()["sobjects"]:
    object_list.append(x["name"])
object_list = ['ag1__c'] # my custom database in sales force
obj = ", ".join(object_list)
soql = 'SELECT FIELDS(CUSTOM) FROM ag1__c LIMIT 200' # CUSTOM
sfdc_rec = sfdc_query.query_all(soql)
sfdc_df = pd.DataFrame(sfdc_rec['records'])
sfdc_df
Here I am trying to get all the records from my custom object in Salesforce, which has 1044 rows, and I want to extract all of them.
I have tried a lot of things but it's not working. Please help me out with this; it would be a great help to me.
Thanks
ERROR: -
SalesforceMalformedRequest: Malformed request https://pujatiles-dev-ed.develop.my.salesforce.com/services/data/v52.0/query/?q=SELECT+FIELDS%28CUSTOM%29+FROM+ag1__c. Response content: [{'message': 'The SOQL FIELDS function must have a LIMIT of at most 200', 'errorCode': 'MALFORMED_QUERY'}]
If you need more rows, you need to list all the fields you want and mention them explicitly in the SELECT. The SOQL FIELDS() trick is limited to 200 records, period. If you don't know what fields are there, simple_salesforce should have a "describe" operation for you. And then the query has to be under 100K characters.
As for "size in MB": you can't query that. You can fetch record counts per table (it's a special REST call, not a query: https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_record_count.htm).
And then... for most objects it's count * 2 kB. There are a few exceptions (CampaignMember uses 3 kB; EmailMessage uses however many kilobytes the actual email had), but that's a good rule of thumb.
https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/dome_limits.htm is interesting too; see DataStorageMB and FileStorageMB.
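A sketch of that explicit-field approach with simple_salesforce, assuming the ag1__c custom object from the question and placeholder credentials; describe() supplies the field names so no FIELDS() function (and its 200-row cap) is needed:
from simple_salesforce import Salesforce
import pandas as pd

sf = Salesforce(username='USER_NAME', password='PASSWORD', security_token='SECURITY_TOKEN')

# describe() lists every field on the object, so they can be named explicitly
field_names = [f['name'] for f in sf.ag1__c.describe()['fields']]

# An explicit SELECT has no FIELDS() cap; query_all() follows Salesforce's
# pagination, so all 1044 rows should come back in one DataFrame
soql = 'SELECT {} FROM ag1__c'.format(', '.join(field_names))
records = sf.query_all(soql)['records']
sfdc_df = pd.DataFrame(records).drop(columns='attributes')

# Record counts per object come from a separate REST resource, not a SOQL query
counts = sf.restful('limits/recordCount', params={'sObjects': 'ag1__c'})
print(len(sfdc_df), counts)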

Problem with loop inside a try-except-else

I have a list of authors and books and I'm using an API to retrieve the descriptions and ratings of each of them. I'm not able to iterate the list and store the descriptions in a new list.
The Goodreads API lets me look up a book's information by sending the title and author (for accuracy), or only the title. What I want is:
Loop the list of titles and authors, get the 1st title and 1st author and try to retrieve the description and save it in a third list.
If not found, try to retrieve the description using only the title and save it to the list.
If not found yet, add a standard error message.
I've tried the code below, but I'm not able to iterate the entire list and save the results.
#Running the API with author's name and book's title
book_url_w_author = 'https://www.goodreads.com/book/title.xml?author='+edit_authors[0]+'&key='+api_key+'&title='+edit_titles[0]
#Running the API with only book's title
book_url_n_author = 'https://www.goodreads.com/book/title.xml?'+'&key='+api_key+'&title='+edit_titles[0]
# parse book url with author
html_w_author = requests.get(book_url_w_author).text
soup_w_author = BeautifulSoup(html_w_author, "html.parser")
# parse book url without author
html_n_author = requests.get(book_url_n_author).text
soup_n_author = BeautifulSoup(html_n_author, "html.parser")
#Retrieving the books' descriptions
description = []
try:
    #fetch description for url with author and title and add it to descriptions list
    for desc_w_author in soup_w_author.book.description:
        description.append(desc_w_author)
except:
    #fetch description for url with only title and add it to descriptions list
    for desc_n_author in soup_n_author.book.description:
        description.append(desc_n_author)
else:
    #return and inform that no description was found
    description.append('Description not found')
Expected:
description = [description1, description2, description3, ....]
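For reference, a minimal sketch of the loop described above, using nested try/except so the fallback and the error message only run when the previous lookup fails (edit_authors, edit_titles and api_key are assumed to be defined as in the question):
import requests
from bs4 import BeautifulSoup

description = []
for author, title in zip(edit_authors, edit_titles):
    url_w_author = 'https://www.goodreads.com/book/title.xml?author=' + author + '&key=' + api_key + '&title=' + title
    url_n_author = 'https://www.goodreads.com/book/title.xml?key=' + api_key + '&title=' + title
    try:
        # 1st attempt: title plus author
        soup = BeautifulSoup(requests.get(url_w_author).text, "html.parser")
        description.append(soup.book.description.text)
    except AttributeError:
        try:
            # 2nd attempt: title only
            soup = BeautifulSoup(requests.get(url_n_author).text, "html.parser")
            description.append(soup.book.description.text)
        except AttributeError:
            # nothing found either way
            description.append('Description not found')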

Update SQLite with Python: InterfaceError: Error binding parameter 0, and NoneType is not subscriptable

I've scraped some websites and stored the html info in a sqlite database. Now, I want to extract and store the email addresses. I'm able to successfully extract and print the id and emails. But, I keep getting TypeError: "'NoneType' object is not subscriptable" and "sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type" when I try to update the database with these new email addresses.
I've verified that the data types I'm using in the update statement match my database (id is an int and email is a str). I've googled a bunch of different examples and mucked around with the syntax a lot.
I also tried removing the WHERE clause in the update statement but got the same errors.
import sqlite3
import re
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
x = cur.execute('SELECT id, html FROM Pages WHERE html is NOT NULL and email is NULL ORDER BY RANDOM()').fetchone()
#print(x)#for testing purposes
for row in x:
    row = cur.fetchone()
    id = row[0]
    html = row[1]
    email = re.findall(b'[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+', html)
    #print(email)#testing purposes
    if not email:
        email = 'no email found'
    print(id, email)
    cur.execute('''UPDATE pages SET email = ? WHERE id = ? ''', (email, id))
conn.commit
I want the update statement to update the database with the extracted email addresses for the appropriate row.
There are a few things going on here.
First off, you don't want to do this:
for row in x:
    row = cur.fetchone()
If you want to iterate over the results returned by the query, you should consider something like this:
for row in cur.fetchall():
    id = row[0]
    html = row[1]
    # ...
To understand the rest of the errors you are seeing, let's take a look at them step by step.
TypeError: "'NoneType' object is not subscriptable":
This is likely generated here:
row = cur.fetchone()
id = row[0]
Cursor.fetchone returns None if the executed query doesn't match any rows or if there are no rows left in the result set. The next line, then, is trying to do None[0] which would raise the error in question.
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type:
re.findall returns a list of non-overlapping matches, not an individual match. There's no support for binding a Python list to a sqlite3 text column type. To fix this, you'll need to get the first element from the matched list (if it exists) and then pass that as your email parameter in the UPDATE.
.findall() returns a list.
You want to iterate over that list:
for email in re.findall(..., str(html)):
    print(id, email)
    cur.execute(...)
Not sure what's going on with that b'[a-z...' expression.
Recommend you use a raw string instead: r'[a-z...'.
It handles regex \ backwhacks nicely.
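Pulling those fixes together, a minimal sketch of the scrape-and-update loop, assuming the Pages table from the question (id, html, email columns) and keeping the question's '#'-for-'@' obfuscation in the pattern:
import sqlite3
import re

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

cur.execute('SELECT id, html FROM Pages WHERE html IS NOT NULL AND email IS NULL')
for page_id, html in cur.fetchall():
    # str(html) so a plain raw-string pattern can be matched against the stored bytes
    matches = re.findall(r'[a-z0-9.\-+_]+#[a-z0-9.\-+_]+\.[a-z]+', str(html))
    # take the first match if there is one; a list can't be bound to a text column
    email = matches[0] if matches else 'no email found'
    print(page_id, email)
    cur.execute('UPDATE Pages SET email = ? WHERE id = ?', (email, page_id))

conn.commit()  # note the parentheses: commit is a method that must be called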

Trouble with parsing JSON msg

I am calling a Quandl API for data and getting a JSON message back, which I am having trouble parsing before sending it to a database. My parsing code is clearly not reading the JSON correctly.
Via the below code, I am getting the following (truncated for simplicity) JSON:
{"datatable":{"data":[["AAPL","MRY","2018-09-29",265595000000],["AAPL","MRY","2017-09-30",229234000000],["AAPL","MRY","2016-09-24",215639000000],["AAPL","MRY","2015-09-26",233715000000],["AAPL","MRY","2014-09-27",182795000000],["AAPL","MRY","2013-09-28",170910000000],["AAPL","MRT","2018-09-29",265595000000],["AAPL","MRT","2018-06-30",255274000000],["AAPL","MRT","2018-03-31",247417000000],["AAPL","MRT","2017-12-30",239176000000],["AAPL","MRT","2017-09-30",229234000000],["AAPL","MRT","2017-07-01",223507000000],["AAPL","MRT","2017-04-01",220457000000],["AAPL","MRT","2016-12-31",218118000000],["AAPL","MRT","2016-09-24",215639000000],["AAPL","MRT","2016-06-25",220288000000],["AAPL","MRT","2016-03-26",227535000000],["AAPL","MRT","2015-12-26",234988000000],["AAPL","MRT","2015-09-26",233715000000],["AAPL","MRT","2015-06-27",224337000000],["AAPL","MRT","2015-03-28",212164000000],["AAPL","MRT","2014-12-27",199800000000],["AAPL","MRT","2014-09-27",182795000000],["AAPL","MRT","2014-06-28",178144000000],["AAPL","MRT","2014-03-29",176035000000],"columns":[{"name":"ticker","type":"String"},{"name":"dimension","type":"String"},{"name":"datekey","type":"Date"},{"name":"revenue","type":"Integer"}]},"meta":{"next_cursor_id":null}}
import quandl, requests
from flask import request
from cs50 import SQL
db = SQL("sqlite:///formula.db")
data = requests.get(f"https://www.quandl.com/api/v3/datatables/SHARADAR/SF1.json?ticker=AAPL&qopts.columns=ticker,dimension,datekey,revenue&api_key=YOURAPIKEY")
responses = data.json()
print(responses)
for response in responses:
    ticker=str(response["ticker"])
    dimension=str(response["dimension"])
    datekey=str(response["datekey"])
    revenue=int(response["revenue"])
    db.execute("INSERT INTO new(ticker, dimension, datekey, revenue) VALUES(:ticker, :dimension, :keydate, :revenue)", ticker=ticker, dimension=dimension, datekey=datekey, revenue=revenue)
I'm getting the following error message (which I have run into in the past and successfully addressed), so I strongly believe I am not reading the JSON correctly:
File "new2.py", line 12, in
ticker=str(response["ticker"])
TypeError: string indices must be integers
I want to be able to loop through the json and be able to isolate specific data to then populate a database.
For your response structure, you have a nested dict object:
datatable
    data
        list of lists of data
So, this will happen:
responses = data.json()
datatable = responses['datatable'] # will get you the information mapped to the 'datatable' key
datatable_data = datatable['data'] # will get you the list mapped to the 'data' key
Now, datatable_data is a list of lists, right? And lists can only be accessed by index, not by string keys.
So, let's say you want the first response:
first_response = datatable_data[0]
That will result in:
first_response = ["AAPL","MRY","2018-09-29",265595000000]
which you can now access by index:
for idx, val in enumerate(first_response):
    print(f'{idx}\t{val}')
which will print out
0 AAPL
1 MRY
2 2018-09-29
3 265595000000
So, with all this information, you need to alter your program to ensure you're accessing the 'data' key in the response, and then iterate over the list of lists.
So, something like this:
data = responses['datatable']['data']
for record in data:
    ticker, dimension, datekey, revenue = record  # unpack the list into named variables
    db.execute(...)
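Put together with the question's cs50 setup, a minimal sketch (the API key is a placeholder and the 'new' table is assumed to exist as in the question):
import requests
from cs50 import SQL

db = SQL("sqlite:///formula.db")

url = ("https://www.quandl.com/api/v3/datatables/SHARADAR/SF1.json"
       "?ticker=AAPL&qopts.columns=ticker,dimension,datekey,revenue&api_key=YOURAPIKEY")
responses = requests.get(url).json()

# Only the list under 'datatable' -> 'data' holds the rows; each row is a plain list
for ticker, dimension, datekey, revenue in responses['datatable']['data']:
    db.execute("INSERT INTO new(ticker, dimension, datekey, revenue) "
               "VALUES(:ticker, :dimension, :datekey, :revenue)",
               ticker=str(ticker), dimension=str(dimension),
               datekey=str(datekey), revenue=int(revenue))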

Inconsistent results from a feeds search through the API

I want to visualize on a world map all the feeds from the user 'airqualityegg'. In order to do this I wrote the following script in Python:
import json
import urllib
import csv

list=[]
for page in range(7):
    url = 'https://api.xively.com/v2/feeds?user=airqualityegg&per_page=100page='+str(page)
    rawData=urllib.urlopen(url)
    #Loads the data in json format
    dataJson = json.load(rawData)
    print dataJson['totalResults']
    print dataJson['itemsPerPage']
    for entry in dataJson['results']:
        try:
            list2=[]
            list2.append(entry['id'])
            list2.append(entry['creator'])
            list2.append(entry['status'])
            list2.append(entry['location']['lat'])
            list2.append(entry['location']['lon'])
            list2.append(entry['created'])
            list.append(list2)
        except:
            print 'failed to scrape a row'

def escribir():
    abrir = open('all_users2_andy.csv', 'w')
    wr = csv.writer(abrir, quoting=csv.QUOTE_ALL)
    headers = ['id','creator', 'status','lat', 'lon', 'created']
    wr.writerow(headers)
    for item in list:
        row=[item[0], item[1], item[2], item[3], item[4], item[5]]
        wr.writerow(row)
    abrir.close()

escribir()
I have included a call to 7 pages because the total number of feeds posted by this user is 684 (as you can see when entering 'https://api.xively.com/v2/feeds?user=airqualityegg' directly in the browser).
The csv file that resulted from running this script contains duplicated rows, which might be explained by the fact that every time a call is made to a page the order of results varies. Thus, the same row can be included in the results of different calls. For this reason I get fewer unique results than I should.
Do you know why the results included in different pages might not be unique?
Thanks,
María
You can try passing order=created_at (see the docs).
The problem is that by default order=updated_at, hence the chances are that results will appear in a different order on each page.
You should also consider using the official Python library.
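A minimal sketch of the corrected request loop, in the same Python 2 style as the question; it passes order=created_at and also restores the '&' that the original URL is missing before the page parameter:
import json
import urllib

results = []
for page in range(7):
    url = ('https://api.xively.com/v2/feeds?user=airqualityegg'
           '&per_page=100&page=' + str(page) +
           '&order=created_at')  # stable ordering, so pages no longer shuffle between calls
    data = json.load(urllib.urlopen(url))
    results.extend(data['results'])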
