Beautiful Soup Scraping - python-3.x

I'm having issues with previously working code that no longer functions correctly.
My Python code scrapes a website using Beautiful Soup and extracts event data (date, event, link).
My code pulls all of the events located in the tbody; each event is stored in a <tr class="Box">. The issue is that my scraper seems to stop after it reaches this row: <tr style="box-shadow: none;">. That row is a section containing 3 advertisements for events that I don't want to scrape, and after reaching it the code stops pulling event data from the <tr class="Box"> rows. Is there a way to skip this tr style / ignore future cases?
import pandas as pd
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import warnings

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = bs.BeautifulSoup(source, 'html.parser')

# ---Get Event Data---
test1 = []
table = soup.find('tbody')
table_rows = table.find_all('tr')  # find table rows (tr)
for tr in table_rows:
    data = tr.find_all('td')  # find table data
    row = [td.text for td in data]
    if len(row) > 2:  # Excludes rows with only event name/link, but no data.
        test1.append(row)
test1

The remaining results are loaded dynamically via JavaScript, which is why you don't see them in the initial HTML. You can use this example to load more pages:
import requests
from bs4 import BeautifulSoup

url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}

for params["page"] in range(1, 4):  # <-- increase number of pages here
    print("Page {}..".format(params["page"]))
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        print(tds)
Prints:
Page 1..
['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
['Mon, 05 - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']
... etc.
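Note that the tr[class="box"] selector only matches rows whose class attribute is exactly box, so the advertisement row (<tr style="box-shadow: none;">) is skipped automatically. If you would rather collect the events into a table than print them, here is a minimal sketch building on the loop above (pandas and the column filter from the question are the only additions):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
params = {"page": 1, "ajax": 1}
headers = {"X-Requested-With": "XMLHttpRequest"}

all_rows = []
for params["page"] in range(1, 4):
    soup = BeautifulSoup(
        requests.get(url, headers=headers, params=params).content,
        "html.parser",
    )
    for tr in soup.select('tr[class="box"]'):
        tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
        if len(tds) > 2:  # same filter as the question: skip rows without full data
            all_rows.append(tds)

# Column labels are a guess based on the printed rows above.
df = pd.DataFrame(all_rows)
print(df.head())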

Related

I am trying to make a web scraper that checks all the courses in a Coursera specialisation and then gives a list of them.

The problem is that there is a "SHOW MORE" button on the webpage; because of that, the script can't access all the courses.
from bs4 import BeautifulSoup
import requests

res = requests.get('https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses')
txt = res.text
status = res.status_code
# print(txt)
# print(status)  # 200 is the code for success

page = requests.get('https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses')
soup = BeautifulSoup(page.content, 'lxml')

# Display the title of the specialisation
specialization_title = soup.find('h1')
print(specialization_title.text)
print("\n")

# Display the courses inside the specialisation
number_of_courses = soup.find('h2', class_='headline-4-text bold m-b-1')
print(number_of_courses.text)
print("\n")

course_cards = soup.find_all('h3', class_='headline-3-text bold m-t-1 m-b-2')
for course in course_cards:
    print(course.text)
The data is stored inside the page as a JavaScript object. You can use the re/json modules to decode it:
import re
import json
import requests
from textwrap import shorten

url = "https://www.coursera.org/specializations/digital-manufacturing-design-technology#courses"
html_doc = requests.get(url).text

data = json.loads(
    re.search(r"window.__APOLLO_STATE__ = (.*});", html_doc).group(1)
)

# uncomment to print all data:
# print(json.dumps(data, indent=4))

i = 1
for k, v in data.items():
    if "SDPCourse:" in k and "." not in k:
        print(
            "{:<3} {:<8} {:<60} {}".format(
                i,
                v["averageInstructorRating"],
                v["name"],
                shorten(v["description"], 40),
            )
        )
        i += 1
Prints:
1 4.6 Digital Manufacturing & Design This course will expose you to the [...]
2 4.56 Digital Thread: Components This course will help you [...]
3 4.76 Digital Thread: Implementation There are opportunities throughout [...]
4 4.41 Advanced Manufacturing Process Analysis Variability is a fact of life in [...]
5 4.56 Intelligent Machining Manufacturers are increasingly [...]
6 4.55 Advanced Manufacturing Enterprise Enterprises that seek to become [...]
7 4.55 Cyber Security in Manufacturing The nature of digital [...]
8 4.54 MBSE: Model-Based Systems Engineering This Model-Based Systems [...]
9 4.68 Roadmap to Success in Digital Manufacturing & Design Learners will create a roadmap to [...]
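If you want to keep the course list instead of just printing it, here is a minimal sketch reusing the decoded data dict above (pandas and the output file name are assumptions, not part of the original answer):
import pandas as pd

# Reuses the `data` dict decoded from window.__APOLLO_STATE__ above.
courses = [
    {
        "name": v["name"],
        "rating": v["averageInstructorRating"],
        "description": v["description"],
    }
    for k, v in data.items()
    if "SDPCourse:" in k and "." not in k
]

df = pd.DataFrame(courses)
df.to_csv("courses.csv", index=False)  # hypothetical output file name
print(df[["name", "rating"]])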

Using a for loop for web scraping - cannot "pass" certain data

The following code is supposed to scrape the rating and the date the rating was posted.
The issue here is that employees answer the negative reviews, and the dates of their posts are scraped as well. So when I scrape the site, there's an uneven number of ratings and dates (20 ratings vs 24 dates), as four of the dates belong to the answers given by employees.
In the code I try to "pass" every time the class "ugc-brand-response" (which marks the employee answers) shows up, and otherwise just continue - but none of the data gets stored, not even the first couple of reviews.
I've learned so much from reading other peoples questions and answers. Thanks for this awesome community.
import requests
import time
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"
url_get = requests.get(url, headers=headers)
print(url_get.status_code)

soup = BeautifulSoup(url_get.content, 'lxml')

rating_n_date = []

for rating in soup.find_all(attrs={"class": "c-review-average"}):
    rating_n_date.append(rating.text)

for date in soup.findAll(attrs={"class": "submission-date"}):
    if "class" == "ugc-brand-response" in date:
        pass
    else:
        continue
    rating_n_date.append(date.text)

time.sleep(2)
print(rating_n_date)
Here's the HTML for a review, including the rating and submission date that I do want:
<li class="review-item" tabindex="-1"><div class="row"><div class="hidden-xs hidden-sm col-md-3"><div class="undefined ugc-author v-fw-medium body-copy-lg">Jimmy</div><ul class=" ugc-badge-list"><li class="visible-xs-inline-block visible-sm-inline-block visible-md-block visible-lg-block"><span class="c-overlay-wrapper"><span class="overlayTrigger"><button aria-expanded="false" aria-controls="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" aria-owns="ugc-badge-overlay-bf28b82b-76f5-3c85-897e-598a91bbd8a8-0" data-track="Custom"><div class="ugc-my-bby-badge"><img alt="My Best Buy® Member" src="https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-mybby-core.svg"></div></button></span><span></span></span></li></ul></div><div class="col-xs-12 col-md-9"><div class="c-ratings-reviews v-medium"><p class="sr-only">Rating: 2 out of 5 stars</p><span class="c-stars c-stars-medium" alt="40%" aria-hidden="true"><span class="unfilled"></span><span class="filled" style="width:40%"></span></span><span class="c-reviews"><span class="c-review-average" aria-hidden="true">2</span></span></div><h3 id="review-id-bf28b82b-76f5-3c85-897e-598a91bbd8a8" class="ugc-review-title c-section-title heading-5 v-fw-medium ">A disappointment: low volum, weak bass, distorts</h3><div class="disclaimer">Posted <time class="submission-date" title="Apr 28, 2019 11:29 PM">3 months ago</time></div>
Here's the code I don't want - the employee answer:
<ul class="ugc-brand-response-list"><li><div class="row"><div class="col-sm-12 col-md-9 col-md-offset-3"><div class="ugc-brand-response"><h4 class=" c-section-title body-copy-lg v-fw-medium ">Brand response</h4><p class="body-copy-lg">Jabra</p><div class="disclaimer"><time class="submission-date" title="Apr 29, 2019 8:46 AM">3 months ago</time></div><div class="ugc-brand-response-body body-copy-lg"><p class="pre-white-space">
Hello Jimmy - We were sorry to learn that the Jabra Elite 85h did not meet your expectations. As the Elite 85h is a relatively new product, it is very important that you update the firmware in the headphones as often as necessary to keep up-to-date. We are constantly improving all aspects of the Elite 85h through firmware updates. If you have any specific questions or concers, we invite you to contact us directly by completing the web form at https://www.jabra.com/ServiceMenu/contact/ContactJabraSupport/ContactJabraSupportConsumer, or by giving us a call - we love to help! Thank you.
<img src="https://s3.amazonaws.com/stratos-logos/logos/Jabra.jpg" alt="Jabra" title="Jabra" style="display: block !important; margin-top: 2em !important; border: 1px solid #ccc !important; padding: 2px !important; background-color: white !important;">
<!--[if ReviewResponse]><![endif]--></p></div></div></div></div></li></ul>
It will never skip over the ugc-brand-response blocks, because you explicitly pull everything with the class attribute submission-date.
You have also misunderstood continue. When you use continue, it doesn't mean what you might think it means, in the sense of "continue on with the code". What it really means is: "stop right here, don't run the rest of the loop body, go to the next item." So in the way you have it in your code, when it DOESN'T find class == "ugc-brand-response", it goes to else, which says continue. So it never appends to your list, which is why your data isn't stored/appended.
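A tiny standalone loop (not from the original post) that illustrates the difference:
for item in ["a", "b", "c"]:
    if item == "b":
        pass        # does nothing; execution falls through to the print below
    else:
        continue    # skips the rest of this iteration, so nothing is printed
    print(item)     # only ever reached for "b"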
What you could do is go to the parent tags and pull each whole review "block", which is found with the class attribute "col-xs-12 col-md-9", and then go into each of those and pull out the rating and submission date together using find (find gets the 1st occurrence of whatever you are looking for, which means it won't grab the date of the employee replies), and then store those in a list. I then just threw it into a dataframe/table.
import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://www.bestbuy.com/site/reviews/jabra-elite-85h-wireless-noise-canceling-over-the-ear-headphones-black/6335100?variant=A"
url_get = requests.get(url, headers=headers)
print(url_get.status_code)

soup = BeautifulSoup(url_get.content, 'lxml')

rating_list = []
date_list = []

for ratings in soup.find_all(attrs={"class": "col-xs-12 col-md-9"}):
    rating = ratings.find('span', {'class': 'c-review-average'}).text
    submission_date = ratings.find('time', {'class': 'submission-date'}).text
    rating_list.append(rating)
    date_list.append(submission_date)

data = {'Rating': rating_list, 'Date': date_list}
df = pd.DataFrame(data)
Output:
print (df)
Rating Date
0 5 3 months ago
1 5 2 months ago
2 4 3 months ago
3 3 3 months ago
4 4 3 months ago
5 4 3 months ago
6 5 3 months ago
7 4 3 months ago
8 5 3 months ago
9 4 1 week ago
10 4 3 months ago
11 2 3 months ago
12 5 3 months ago
13 4 1 day ago
14 4 1 month ago
15 4 3 months ago
16 4 3 weeks ago
17 2 3 months ago
18 3 3 weeks ago
19 5 3 months ago
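If you also want the review title, it sits inside the same per-review block (the h3 with class ugc-review-title in the HTML pasted above). A hedged extension of the loop, assuming every review block is structured like that example:
reviews = []
for block in soup.find_all(attrs={"class": "col-xs-12 col-md-9"}):
    title_tag = block.find('h3', {'class': 'ugc-review-title'})
    reviews.append({
        'Rating': block.find('span', {'class': 'c-review-average'}).text,
        'Date': block.find('time', {'class': 'submission-date'}).text,
        'Title': title_tag.text if title_tag else None,  # guard in case a block lacks a title
    })

df = pd.DataFrame(reviews)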

How to pull out data when it shows <!--Content End--> after the JavaScript?

I've just started to learn Python. I want to extract grass temperature data every three hours for academic purposes. The website is below:
https://www.weather.gov.hk/wxinfo/ts/display_graph_grass_e.htm?kp&
I tried to use BeautifulSoup to pull out the data.
Before I reach the wanted data there is a <!--Content End--> after the JavaScript, and I can't scrape anything behind it. Why does that happen, and is there a solution for it?
The data is stored as JavaScript arrays in the HTML page. We can use re and ast.literal_eval to retrieve it:
import re
import requests
from ast import literal_eval
url = 'https://www.weather.gov.hk/wxinfo/ts/display_graph_grass_e.htm?kp&'
html_text = requests.get(url).text
station_code = literal_eval(re.findall(r'StationCode\s*=.*?(\(.*?\))', html_text)[0])
station_name = literal_eval(re.findall(r'stnname\s*=.*?(\(.*?\))', html_text)[0])
station_height = literal_eval(re.findall(r'stn_height\s*=.*?(\(.*?\))', html_text)[0])
grass_temp = literal_eval(re.findall(r'grasstemp\s*=.*?(\(.*?\))', html_text)[0])
min_since_17 = literal_eval(re.findall(r'minSince17\s*=.*?(\(.*?\))', html_text)[0])
min_hour = literal_eval(re.findall(r'minHour\s*=.*?(\(.*?\))', html_text)[0])
min_minute = literal_eval(re.findall(r'minMinute\s*=.*?(\(.*?\))', html_text)[0])
rows = [*zip(station_code, station_name, station_height, grass_temp, min_since_17, min_hour, min_minute)]
headers = ['Station Code', 'Station Name', 'Station Height', 'Grass Temp', 'Min_since_17', 'Min Hour', 'Min Minute']
print(''.join('{: <20}'.format(d) for d in headers))
for row in rows:
    print(''.join('{: <20}'.format(d) for d in row))
Prints:
Station Code Station Name Station Height Grass Temp Min_since_17 Min Hour Min Minute
kp King's Park 65 25.4 25.3 23 35
tkl Ta Kwu Ling 15 25.4 24.8 17 00
tms Tai Mo Shan 955 21.3 21.3 07 19
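Since the goal is a reading every three hours, one possible way to accumulate the values over time is to run the script on a schedule (for example via cron) and append each run, with a timestamp, to a CSV. A sketch reusing the lists extracted above (the log file name is hypothetical):
import csv
from datetime import datetime

timestamp = datetime.now().isoformat(timespec='minutes')
with open('grass_temp_log.csv', 'a', newline='') as f:  # hypothetical log file
    writer = csv.writer(f)
    for code, name, temp in zip(station_code, station_name, grass_temp):
        writer.writerow([timestamp, code, name, temp])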

Trying to scrape and segregate into Headings and Contents. The problem is that both have the same class and tags - how to segregate?

I am trying to scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into 2 parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard coding, how can I distinguish them and extract them into 2 columns in Excel?
In the picture (https://ibb.co/8X5xY9C), or on the website linked above, the bold text (except the alphabet letters such as 'A' and the later 'back to top' links) represents a Heading, and the non-bold explanation just below it represents the Content (the content even contains 'li' and 'ul' blocks later in the page, which should come under the respective Heading).
#Code to Start With
from bs4 import BeautifulSoup
import requests
url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html";
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and came up with a better solution. Rather than "crowd" my initial solution, I chose to add a 2nd solution here:
So, thinking about it again and following my logic of splitting the HTML by the headlines (essentially breaking it up wherever we find <strong> tags), I chose to convert the HTML to strings using .prettify(), split on those specific strings/tags, and read the pieces back into BeautifulSoup to pull the text. From what I can see it hasn't missed anything, but you'll have to search through the dataframe to double-check:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            content = ' '.join([text.strip() for text in content_split if text != ''])
            results[headline] = content
        except:
            continue

df = pd.DataFrame(results.items(), columns=['Headings', 'Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
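Since the question asks for two columns in Excel rather than a CSV, the same dataframe can also be written with to_excel (this assumes an Excel writer engine such as openpyxl is installed; the file name is just an example):
# Requires an Excel writer engine, e.g. pip install openpyxl
df.to_excel('C:/test.xlsx', index=False)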
EDIT: SEE THE OTHER SOLUTION PROVIDED ABOVE
It's tricky. I essentially tried to grab the headings, then use those to grab all the text that follows each heading and precedes the next heading. The code below is a little messy and requires some cleaning up, but hopefully it gets you to a point you can work with, or moves you in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i + 1]
        else:
            next_headline = ''

        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()

        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()

        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break

        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()

df = pd.DataFrame(results.items())

How to download pubmed articles and read them?

I'm having trouble saving PubMed articles and reading them. I've seen on this page that there are some special file types, but none of them worked for me. I want to save them in a way that lets me keep using the keys to get the data. I don't know if that's possible if I save it as a text file. My code is this one:
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible for browsing the biological databases.

from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''

class DownloadArticles():
    def __init__(self):
        Entrez.email = 'myemail@gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method 4: read data in text form.'''
    def saveArticlesFilesInXMLMode(self, dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/" + ids + ".fasta"
        # if not os.path.exists(directory):
        #     os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())
I'm getting this error: ValueError: No records found in handle
Please, can someone help me?
Now my code is like this. I am trying to write a function to save in .fasta like you did, and one to read the .fasta files like in the answer above.
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")
Entrez.email = 'myemail@gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path = "/run/media/Dropbox/codigos/Codes/" + dbName

save_Articles_Files(dbName, idNum, rettypeName)
But my function is not working. I need some help, please!
You're mixing up two concepts.
1) Entrez.efetch() is used to access NCBI. In your case you are downloading an article from Pubmed. The result that you get from net_handle.read() looks like:
PMID- 26837606
OWN - NLM
STAT- In-Process
DA - 20160203
LR - 20160210
IS - 2045-2322 (Electronic)
IS - 2045-2322 (Linking)
VI - 6
DP - 2016 Feb 03
TI - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG - 20315
LID - 10.1038/srep20315 [doi]
AB - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted
genome modification in eukaryotic organisms from yeast to human cell lines. Its
successful application in several plant species promises enormous potential for
basic and applied plant research. However, extensive studies are still needed to
assess this system in other important plant species, to broaden its fields of
application and to improve methods. Here we showed that the CRISPR/Cas9 system is
efficient in petunia (Petunia hybrid), an important ornamental plant and a model
for comparative research. When PDS was used as target gene, transgenic shoot
lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
readily generated and identified in the first generation. A sequential
transformation strategy--introducing Cas9 and sgRNA expression cassettes
sequentially into petunia--can be used to make targeted mutations with short
indels or chromosomal fragment deletions. Our results present a new plant species
amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
exploitation.
FAU - Zhang, Bin
AU - Zhang B
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Xia
AU - Yang X
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Chunping
AU - Yang C
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Li, Mingyang
AU - Li M
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Guo, Yulong
AU - Guo Y
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
LA - eng
PT - Journal Article
PT - Research Support, Non-U.S. Gov't
DEP - 20160203
PL - England
TA - Sci Rep
JT - Scientific reports
JID - 101563288
SB - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.
2) SeqIO.read() is used to read and parse FASTA files. This is a format that is used to store sequences. A sequence in FASTA format is represented as a series of lines. The first line in a FASTA file starts with a ">" (greater-than) symbol. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
As you can see, the result that you get back from Entrez.efetch() (which I pasted above) doesn't look like a FASTA file. So SeqIO.read() gives the error that it can't find any sequence records in the file.
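Since what you saved from Entrez.efetch() is a MEDLINE-format text record (and you already import Bio.Medline), one option is to parse it with Medline.read() instead of SeqIO.read() and then access the fields by their MEDLINE keys. A minimal sketch under those assumptions (the local file name is hypothetical):
from Bio import Entrez, Medline

Entrez.email = 'myemail@gmail.com'  # placeholder address

# Fetch the article in MEDLINE format and save it as plain text.
net_handle = Entrez.efetch(db='pubmed', id='26837606', rettype='medline', retmode='text')
path = '26837606.txt'  # hypothetical local file name
with open(path, 'w') as out_handle:
    out_handle.write(net_handle.read())
net_handle.close()

# Parse the saved file with Bio.Medline (SeqIO expects FASTA, hence the ValueError).
with open(path) as handle:
    record = Medline.read(handle)

# The record behaves like a dict keyed by MEDLINE field tags (PMID, TI, AB, AU, ...).
print(record['PMID'])
print(record['TI'])
print(record.get('AB', '')[:200])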
