I'm trying to create a metadata scraper to enrich my e-book collection, but am experiencing some problems. I want to create a dict (or whatever gets the job done) to store the index (only while testing), the path and the series name. This is the code I've written so far:
from bs4 import BeautifulSoup
def get_opf_path():
    opffile=variables.items
    pathdict={'index':[],'path':[],'series':[]}
    safe=[]
    x=0
    for f in opffile:
        x+=1
        pathdict['path']=f
        pathdict['index']=x
        with open(f, 'r') as fi:
            soup=BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name')=='calibre:series':
                    pathdict['series']=meta.get('content')
        safe.append(pathdict)
        print(pathdict)
    print(safe)
This code is able to go through all the OPF files and get the series, index and path; I'm sure of this, since the console output (shown in image 1) is correct.
However, when I try to store pathdict in safe, no matter where I put the safe.append(pathdict) call, the output is not what I expect (see the other screenshots).
What do I have to do so that safe=[] holds the data shown in image 1?
I have tried everything I could think of, but nothing worked.
Any help is appreciated.
I believe this is the correct way:
from bs4 import BeautifulSoup
def get_opf_path():
    opffile = variables.items
    pathdict = {'index':[], 'path':[], 'series':[]}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
        print(pathdict)
        safe.append(pathdict.copy())
    print(safe)
For two main reasons:
When you do:
pathdict['series'] = meta.get('content')
you are overwriting the previous value of pathdict['series'] on every file, so this is the point where you should save the result.
You also need to append a copy of it; if you don't, the entries already stored in the list will change as well. When you store the dict you are really storing a reference to it (in this case, a reference to the variable pathdict).
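Here is a minimal, self-contained sketch of that difference (a toy loop, not the OPF code above):

record = {'series': None}
without_copy, with_copy = [], []

for name in ['A', 'B', 'C']:
    record['series'] = name
    without_copy.append(record)        # stores a reference to the same dict
    with_copy.append(record.copy())    # stores an independent snapshot

print(without_copy)  # [{'series': 'C'}, {'series': 'C'}, {'series': 'C'}]
print(with_copy)     # [{'series': 'A'}, {'series': 'B'}, {'series': 'C'}]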
Note
If you want to print the elements of the list on separate lines, you can do something like this:
print(*safe, sep="\n")
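For example, with a toy list (just to illustrate the unpacking, not the real contents of safe):

entries = [{'index': 1, 'series': 'A'}, {'index': 2, 'series': 'B'}]
print(*entries, sep="\n")
# {'index': 1, 'series': 'A'}
# {'index': 2, 'series': 'B'}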
I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time
def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile('\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views. It's returned in a list so I need to get that item out
    cleaned_views = re.findall('\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work for passing the value of the dict (the link) into the function. Inside the function, I just made sure to load the correct CSV file at the beginning:
# Existing csv files
df = pd.read_csv(k + '.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't understand that, since it works just fine in the code written above. This is the part giving me an error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? This dictionary method seems pretty messy (I only have 2 links I want to check, but rather than just duplicating the function I wanted to experiment a bit more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a CSV and then trying to read that CSV back later. When I saved it, I didn't use index=False with df.to_csv(), so there was an extra column! When I was just testing with the dictionary, I was reusing the in-memory df, and even though I was saving it to a CSV, the script kept using that df to do the actual adding of rows.
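A minimal sketch of the fix, assuming one CSV per dictionary key (the file name and sample row here are just illustrative):

import pandas as pd

def append_row(csv_name, row):
    # Read the existing file, or start fresh if it doesn't exist yet.
    try:
        df = pd.read_csv(csv_name)
    except FileNotFoundError:
        df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
    # The row now matches the columns because no index column was written into the file.
    df.loc[len(df)] = row
    # index=False keeps pandas from writing the index as an extra column.
    df.to_csv(csv_name, index=False)

append_row('link1.csv', ['01-01', '12:00', 1234])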
I'm new to Python and I'm trying to build a program that downloads and extracts zip files from various websites. I've pasted the two programs I've written to do this. The first program is a "child" program named "urls", which I import into the second program. I'm trying to iterate through each of the urls, and within each url iterate through each data file, and finally check whether any entry in the "keywords" list is part of the file name, and if yes, download and extract that file. I'm getting stuck on the part where I need to loop through the list of "keywords" to check against the file names I want to download. Would you be able to help? I appreciate any of your suggestions or guidance. Thank you. Andy
**Program #1 called "urls":**
urls = [
    "https://www.dentoncad.com/content/data-extracts/1-appraisal-data-extracts/1-2019/1-preliminary/2019-preliminary" \
    "-protax-data.zip",
    "http://www.dallascad.org/ViewPDFs.aspx?type=3&id=//DCAD.ORG\WEB\WEBDATA\WEBFORMS\DATA%20PRODUCTS\DCAD2020_" \
    "CURRENT.ZIP"
]
keywords = [
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]
**Program #2 (primary program):**
import requests
import zipfile
import os
import urls
def main():
    print_header()
    dwnld_zfiles_from_web()

def print_header():
    print('---------------------------------------------------------------------')
    print(' DOWNLOAD ZIP FILES FROM THE WEB APP')
    print('---------------------------------------------------------------------')
    print()

def dwnld_zfiles_from_web():
    file_num = 0
    dest_folder = "C:/Users/agbpi/OneDrive/Desktop/test//"
    # loop through each url within the url list, assigning it a unique file number each iteration
    for url in urls.urls:
        file_num = file_num + 1
        url_resp = requests.get(url, allow_redirects=True, timeout=5)
        if url_resp.status_code == 200:
            saved_archive = os.path.basename(url)
            with open(saved_archive, 'wb') as f:
                f.write(url_resp.content)
            # for match in urls.keywords:
            print("Extracting...", url_resp.url)
            with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
                zip_files = z.namelist()
                # print(zip_files)
                for content in zip_files:
                    while urls.keywords in content:
                        z.extract(path=dest_folder, member=content)
                # while urls.keywords in zip_files:
                #     for content in zip_files:
                #         z.extract(path=dest_folder, member=content)
    print("Finished!")

if __name__ == '__main__':
    main()
Okay, updated answer based on updated question.
Your code is fine until this part:
with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
    zip_files = z.namelist()
    # print(zip_files)
    for content in zip_files:
        while urls.keywords in content:
            z.extract(path=dest_folder, member=content)
Issue 1
You already have the zip file name in saved_archive, but you try to open something else as a zip file. Why 'file{0}'.format(str(file_num))? You should simply use with zipfile.ZipFile(saved_archive, "r") as z:
Issue 2
while is related to if, but it does not work as a filter (which seems to be what you wanted). What while does is check whether the condition after the while keyword is truthy and, if so, execute the indented block; as soon as the first falsy evaluation occurs, execution moves on past the loop. So if your condition evaluated to [True, False, True] over three items, the first would run the indented code, the second would end the loop, and the third would never be checked because of the earlier exit. On top of that, the condition itself is invalid, which leads to:
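A tiny standalone sketch of that difference (not part of the original code):

flags = [True, False, True]

# 'if' inside a for loop acts as a filter: every item gets checked.
for flag in flags:
    if flag:
        print('if saw', flag)      # runs for the 1st and 3rd item

# 'while' stops at the first falsy value and never looks further.
i = 0
while flags[i]:
    print('while saw', flags[i])   # runs only for the 1st item
    i += 1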
Issue 3
urls.keywords is a list and content is a str. Asking whether a list is in a string will never make sense; it is like ['apple', 'banana'] in 'b', and 'b' cannot contain such members. You could reverse the logic, but keep in mind that 'b' in ['apple', 'banana'] is False, while 'banana' in ['apple', 'banana'] is True.
Which means, in your case, that this condition: '_SalesExport.txt' in urls.keywords will be False! Why? Because urls.keywords is:
[
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]
and SalesExport is not _SalesExport.txt.
To achieve a partial match check, you need to compare the list items (strings) against a string: "SalesExport" in "_SalesExport.txt" is True, but "SalesExport" in ["_SalesExport.txt"] is False, because "SalesExport" is not a member of that list.
There are three things you could do:
update your keywords list to exact file names so that content in urls.keywords can work (this means that if there is a directory structure inside the zip file, you must include it too)
for content in zip_files:
    if content in urls.keywords:
        z.extract(path=dest_folder, member=content)
implement a for loop inside a for loop
for content in zip_files:
    for kw in urls.keywords:
        if kw in content:
            z.extract(path=dest_folder, member=content)
use a list comprehension (with a generator expression inside any)
matches = [x for x in zip_files if any(y for y in urls.keywords if y in x)]
for m in matches:
    z.extract(path=dest_folder, member=m)
Finally, a recommendation:
Timeouts
Be careful with
url_resp = requests.get(url, allow_redirects=True, timeout=5)
"timeout" controls two things: the connection timeout and the read timeout. Since the response may take longer than 5 seconds, you may want a longer read timeout. You can specify timeout as a tuple (connect timeout, read timeout), so a better parameter would be:
url_resp = requests.get(url, allow_redirects=True, timeout=(5, 120))
I'm trying to fetch data inside a (big) script tag within HTML. By using Beautifulsoup I can approach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically:
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"] even more specific; the last entry in the list (116,2)
Following 1 or 2 did not get the job done.
What I've done so far
import requests
from bs4 import BeautifulSoup

base = 'https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url = requests.get(base)
soup = BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying, since the indexing has to be changed every time new numbers are added. What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to catch the last number of the extracted list (i.e. 116,2).
You could regex out the JavaScript object holding that item and then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
I am trying to automate the production of PDFs by reading data from a pandas data frame and writing it to a page of an existing PDF form using PyPDF2 and reportlab. The main meat of the program is here:
def pdfOperations(row, bp):
    packet = io.BytesIO()
    can = canvas.Canvas(packet, pagesize=letter)
    createText(row, can)
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    textPage = new_pdf.getPage(0)
    secondPage = bp.getPage(1)
    secondPage.mergePage(textPage)
    assemblePDF(frontPage, secondPage, row)
    del packet, can, new_pdf, textPage, secondPage

def main():
    df = openData()
    bp = readPDF()
    frontPage = bp.getPage(0)
    for ind in df.index:
        row = df.loc[ind]
        pdfOperations(row, bp)
This works fine for the first row of data and the first PDF generated, but for the subsequent ones all the text is overwritten, i.e. the second PDF contains text from both the first iteration and the second. I thought garbage collection would take care of all the in-memory changes, but that does not seem to be happening. Does anyone know why?
I even tried forcing the objects to be deleted after the function has run its course, but no luck...
You read bp only once, before the loop. Then in the loop you obtain its second page via getPage(1) and merge stuff onto it. But since it is always the same object (bp), each iteration merges onto the same page, so all the merges done before add up.
While I can't find any way to create a "deepcopy" of a page in PyPDF2's docs, it should work to just create a new bp object for each iteration.
Somewhere in readPDF you presumably open your template PDF as a binary stream and pass it to PdfFileReader. Instead, you could read the file's data into a variable:
with open(filename, "rb") as f:
    bp_bin = f.read()
And from that, create a new PdfFileReader instance for each loop iteration:
for ind in df.index:
    row = df.loc[ind]
    # wrap the raw bytes in a BytesIO stream so PdfFileReader gets a file-like object
    bp = PdfFileReader(io.BytesIO(bp_bin))
    pdfOperations(row, bp)
This should "reset" the secondPage everytime without any additional file I/O overhead. Only the parsing is done again each time, but depending on the file size and contents, maybe the time that takes is low and you can live with that.
I am new to python and this is my first practice code with Beautifulsoup. I have not learned creative solutions to specific data extract problems yet.
This program prints just fine, but there is some difficulty in extracting to the CSV: it takes the first elements but leaves all the others behind. I can only guess there might be some whitespace, a delimiter, or something else that causes the code to halt extraction after the initial text.
I was trying to get the CSV extraction to write each item on its own row, but obviously floundered. Thank you for any help and/or advice you can provide.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

price_page = 'http://www.harryrosen.com/footwear/c/boots'
page = urlopen(price_page)
soup = BeautifulSoup(page, 'html.parser')
product_data = soup.findAll('ul', attrs={'class': 'productInfo'})

for item in product_data:
    brand_name = item.contents[1].text.strip()
    shoe_type = item.contents[3].text.strip()
    shoe_price = item.contents[5].text.strip()
    print(brand_name)
    print(shoe_type)
    print(shoe_price)
    with open('shoeprice.csv', 'w') as shoe_prices:
        writer = csv.writer(shoe_prices)
        writer.writerow([brand_name, shoe_type, shoe_price])
Here is one way to approach the problem:
collect the results into a list of dictionaries with a list comprehension
write the results to a CSV file via the csv.DictWriter and a single .writerows() call
The implementation:
data = [{
    'brand': item.li.get_text(strip=True),
    'type': item('li')[1].get_text(strip=True),
    'price': item.find('li', class_='price').get_text(strip=True)
} for item in product_data]

with open('shoeprice.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=['brand', 'type', 'price'])
    writer.writerows(data)
If you want to also write the CSV headers, add the writer.writeheader() call before the writer.writerows(data).
Note that you could have as well used the regular csv.writer and a list of lists (or tuples), but I like the explicitness and the increased readability of using dictionaries in this case.
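For comparison, here is a rough sketch of that plain csv.writer variant (same product_data as above; the newline='' argument is just standard csv-module practice):

rows = [[item.li.get_text(strip=True),
         item('li')[1].get_text(strip=True),
         item.find('li', class_='price').get_text(strip=True)]
        for item in product_data]

with open('shoeprice.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['brand', 'type', 'price'])  # header row
    writer.writerows(rows)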
Also note that I've improved the locators used in the loop - I don't think using the .contents list and getting product children by indexes is a good and reliable idea.
with open('shoeprice.csv', 'w') as shoe_prices:
    writer = csv.writer(shoe_prices)
    for item in product_data:
        brand_name = item.contents[1].text.strip()
        shoe_type = item.contents[3].text.strip()
        shoe_price = item.contents[5].text.strip()
        print(brand_name, shoe_type, shoe_price, sep='\n')
        writer.writerow([brand_name, shoe_type, shoe_price])
Move the file open out to the outer level, before the loop, so you do not need to open the file on every iteration.