I want to read the whole PDF content. I have used PyPDF2 and iterated over the pages with a for loop using numPages. My problem is that I am only able to get the whole document text inside the for loop, but I want to use it outside the loop. What should I do? My code looks like this:
import PyPDF2

sample_pdf = open(r'/home/user/Desktop/123.pdf', mode='rb')
pdfdoc = PyPDF2.PdfFileReader(sample_pdf)
x = ''
for i in range(pdfdoc.numPages):
    current_page = pdfdoc.getPage(i)
    text = current_page.extractText()
    x = text
I am getting each page's content in the variable text, but in the variable x I am only getting the last page's content.
Have you tried using a list?
import PyPDF2

sample_pdf = open(r'/home/user/Desktop/123.pdf', mode='rb')
pdfdoc = PyPDF2.PdfFileReader(sample_pdf)
x = []
for i in range(pdfdoc.numPages):
    current_page = pdfdoc.getPage(i)
    text = current_page.extractText()
    x.append(text)
Concatenating the strings gives me the expected result:
import PyPDF2

sample_pdf = open(r'/home/user/Desktop/123.pdf', mode='rb')
pdfdoc = PyPDF2.PdfFileReader(sample_pdf)
x = ''
for i in range(pdfdoc.numPages):
    current_page = pdfdoc.getPage(i)
    text = current_page.extractText()
    x += str(text)
print(x)
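Note that recent releases of the library (now published as pypdf) deprecate PdfFileReader, getPage, and extractText. A minimal sketch of the same loop against the newer API, assuming the same file path:

from pypdf import PdfReader

# PdfReader replaces the deprecated PdfFileReader; iterating reader.pages
# replaces range(numPages)/getPage(i), and extract_text replaces extractText.
reader = PdfReader(r'/home/user/Desktop/123.pdf')
x = ''.join(page.extract_text() or '' for page in reader.pages)
print(x)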
I wanted to download the data of all my runs at once from TensorBoard, but it seems there is no way to download all of them in one click. Does anyone know a solution to this problem?
This answer may lead you to yours: https://stackoverflow.com/a/73409436/11657898. It covers a single file, but it's ready to be put into a loop.
I came up with this to solve my problem. First, you'll need to run TensorBoard on localhost and then scrape the data it serves from the browser endpoints.
import pandas as pd
import requests
from csv import reader
import os

def URLs(fold, trial):
    # CSV export endpoints of a TensorBoard instance running on localhost:6006.
    URLs_dict = {
        'train_accuracy' : f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_accuracy&run=fold{fold}%5C{trial}%5Cexecution0%5Ctrain&format=csv',
        'val_accuracy' : f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_accuracy&run=fold{fold}%5C{trial}%5Cexecution0%5Cvalidation&format=csv',
        'val_loss' : f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_loss&run=fold{fold}%5C{trial}%5Cexecution0%5Cvalidation&format=csv',
        'train_loss' : f'http://localhost:6006/data/plugin/scalars/scalars?tag=epoch_loss&run=fold{fold}%5C{trial}%5Cexecution0%5Ctrain&format=csv'
    }
    return URLs_dict

def tb_data(log_dir, mode, fold, num_trials):
    trials = os.listdir(log_dir)
    fdf = {}
    for i, trial in enumerate(trials[:num_trials]):
        # Download the scalar data for this trial as CSV text.
        r = requests.get(URLs(fold, trial)[mode], allow_redirects=True)
        data = r.text
        data_csv = reader(data.splitlines())
        data_csv = list(data_csv)
        df = pd.DataFrame(data_csv)
        # Promote the first CSV row (Wall time, Step, Value) to column headers.
        headers = df.iloc[0]
        df = pd.DataFrame(df.values[1:], columns=headers)
        if i == 0:
            fdf['Step'] = df['Step']
        fdf[f'trial {trial}'] = df['Value']
    fdf = pd.DataFrame(fdf)
    return fdf
P.S.: It might need a little tweaking for a different directory layout.
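For reference, a hypothetical call might look like this (the log directory name, mode, fold, and trial count are all assumptions about your setup):

# Assumes TensorBoard is serving runs from 'logs' on port 6006 and that
# run names follow the fold{fold}\{trial}\execution0 pattern above.
df = tb_data(log_dir='logs', mode='val_accuracy', fold=1, num_trials=5)
df.to_csv('fold1_val_accuracy.csv', index=False)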
Is there any method to get the size of a file, and possibly the file type, in Pyrogram (not Python itself)?
I haven't used Python in a while, but I think you can use the os module for this.
import os
size = os.path.getsize('path/to/file')
print(size)
Try this:
if message.document:
    name = message.document.file_name
    file_size = message.document.file_size
    file_type = message.document.mime_type
    f_id = message.document.file_id
elif message.video:
    name = message.video.file_name
    file_size = message.video.file_size
    file_type = message.video.mime_type
    f_id = message.video.file_id
The same goes for photos.
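Wired into a handler, it could look something like this (a sketch; the session name and the combined filter are assumptions, not code from the question):

from pyrogram import Client, filters

app = Client("my_account")  # hypothetical session name

@app.on_message(filters.document | filters.video)
async def show_file_info(client, message):
    # Document and Video expose the same size/type attributes.
    media = message.document or message.video
    print(media.file_name, media.file_size, media.mime_type, media.file_id)

app.run()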
I am trying to load a list from a .txt file and then execute multiple tasks on every single entry. Unfortunately, the tasks are executed on only one entry instead of all of them.
I load the list from the .txt file with this function:
def load_dir_file():
    directory = os.path.dirname(__file__)
    filename = os.path.join(directory, "law_dir")
    with open(filename, "r", encoding="utf-8") as fin:
        dir_file = fin.readlines()
    return dir_file
This is the code that executes those tasks:
def create_html():
    dir_lst = load_dir_file()
    for dir_link_dirty in dir_lst:
        dir_link = dir_link_dirty.replace('"', "").replace(",", "").replace("\n", "")
        dir_link_code = urllib.request.urlopen(dir_link)
        bs_dir_link_code = BeautifulSoup(dir_link_code, "html5lib")
        h2_a_tag = bs_dir_link_code.h2.a
        html_link = str(dir_link) + "/" + str(h2_a_tag["href"])
        print(dir_lst)
        return html_link
The .txt file currently looks like this:
"https://www.gesetze-im-internet.de/ao_1977",
"https://www.gesetze-im-internet.de/bbg_2009",
"https://www.gesetze-im-internet.de/bdsg_2018"
I am new to programming and am probably missing some very basic points here. Any recommendations on how to improve in general would be much appreciated.
Based on your comment above, it sounds like you want to return a list of HTML links, not just one. To do that, you need the function to build a list and return it. You have a lot going on in create_html, so for illustration purposes I split that function into two: create_html_link_list and create_html_link.
import urllib.request

from bs4 import BeautifulSoup

def create_html_link(dir_link_dirty):
    dir_link = dir_link_dirty.replace('"', "").replace(",", "").replace("\n", "")
    dir_link_code = urllib.request.urlopen(dir_link)
    bs_dir_link_code = BeautifulSoup(dir_link_code, "html5lib")
    h2_a_tag = bs_dir_link_code.h2.a
    html_link = str(dir_link) + "/" + str(h2_a_tag["href"])
    return html_link

def create_html_link_list():
    dir_lst = load_dir_file()
    html_link_list = [
        create_html_link(dir_link_dirty)
        for dir_link_dirty in dir_lst
    ]
    return html_link_list
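Usage then reduces to a single call (a trivial sketch):

# Each entry is the cleaned directory link joined with the first <h2><a> href.
html_links = create_html_link_list()
for link in html_links:
    print(link)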
I am running a script in Python 3 using Selenium, and I am getting the output I expected. Now I want to save that output to a text, CSV, or JSON file. When I try to run my script and save the result to a file, I get an error at with open('bangkok_vendor.txt','a') as wt:
TypeError: 'NoneType' object is not callable
This means the loop in the program runs only once and does not store data in the file bangkok_vendor.txt. Normal Python scraper programs wouldn't have any problem storing data, but this is my first time using Selenium. Can you please help me with a solution? Thanks.
I am running this script from my terminal, and the output is what I want to save to a file:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import csv
import requests

contents = []
filename = 'link_business_filter.csv'

def copy_json():
    with open("bangkok_vendor.text", 'w') as wt:
        for x in script2:
            wt.writer(x)
        wt.close()

with open(filename, 'rt') as f:
    data = csv.reader(f)
    for row in data:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)
    print(link)
    browser = webdriver.Chrome('chromedriver')
    open = browser.get(link)
    source = browser.page_source
    data = bs(source, "html.parser")
    body = data.find('body')
    script = body
    x_path = '//*[@id="react-root"]/section/main/div'
    script2 = browser.find_element_by_xpath(x_path)
    script3 = script2.text
    #script2.send_keys(keys.COMMAND + 't')
    browser.close()
    print(script3)
You need to pass script2 as a parameter to the copy_json function and call it when you extract the data from the page.
Change the write mode to append, otherwise the file will be reset every time you call copy_json.
Don't overwrite built-in functions like open, otherwise you won't be able to open a file to write data once you move on to the second iteration.
I refactored your code a bit:
import csv

from selenium import webdriver

LINK_CSV = 'link_business_filter.csv'
SAVE_PATH = 'bangkok_vendor.txt'

def read_links():
    links = []
    with open(LINK_CSV) as f:
        reader = csv.reader(f)
        for row in reader:
            links.append(row[0])
    return links

def write_data(data):
    with open(SAVE_PATH, mode='a') as f:
        f.write(data + "\n")

if __name__ == '__main__':
    browser = webdriver.Chrome('chromedriver')
    links = read_links()
    for link in links:
        browser.get(link)
        # You may have to wait a bit here
        # until the page is loaded completely
        html = browser.page_source
        # Not sure what you're trying to do with body
        # soup = BeautifulSoup(html, "html.parser")
        # body = soup.find('body')
        x_path = '//*[@id="react-root"]/section/main/div'
        main_div = browser.find_element_by_xpath(x_path)
        text = main_div.text
        write_data(text)
    # close the browser after every link is processed
    browser.quit()
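If the pages are slow to load, the wait hinted at in the comment above can be made explicit with Selenium's WebDriverWait (a sketch; the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the target element appears in the DOM,
# then return it; raises TimeoutException otherwise.
wait = WebDriverWait(browser, 10)
main_div = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="react-root"]/section/main/div'))
)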
So here is my problem. I am trying to output my scraping results in a GUI using tkinter in Python. The code I use works in the shell, but when I use it with tkinter it doesn't. Here is my code:
import sys
from tkinter import *
from urllib.request import urlopen
import re

def stockSearch():
    searchTerm = userInput.get()
    url = "http://finance.yahoo.com/q?s="+searchTerm+"&q1=1"
    htmlfile = urlopen(url)
    htmltext = str(htmlfile.read())
    regex = '<span id="yfs_l84_'+searchTerm+'">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    outputStock = str(["The price of ", searchTerm, "is ", price])
    sLabel2 = Label(sGui, text=outputStock).pack()

sGui = Tk()
userInput = StringVar()

sGui.geometry("450x450+200+200")
sGui.title("Stocks")

sLabel = Label(sGui, text="Stocks List", fg="black")
sLabel.pack()

sButton = Button(sGui, text="LookUp", command=stockSearch)
sButton.place(x=200, y=400)

uEntry = Entry(sGui, textvariable=userInput).pack()

sGui.mainloop()
If I input a search for Google (GOOG), for example, I get back:
"The price of GOOG is []"
However, if I use the same code but print the result in a shell instead of using tkinter, I get the price as it should be.
Any ideas, anyone?
It appears your code isn't properly handling case. If you search for "goog" the value shows up. The problem is this line:
regex = '<span id="yfs_l84_'+searchTerm+'">(.+?)</span>'
If you type "GOOG", the regex becomes:
<span id="yfs_l84_GOOG">(.+?)</span>
However, the html that is returned doesn't have that pattern. Doing a case-insensitive search should solve that problem:
pattern = re.compile(regex, flags=re.IGNORECASE)
Also, there's no need to create a new Label every time -- you can create the label once and then change the text each time you do a lookup.
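A sketch of that fix, reusing the names from the question (create the label once at startup, then update its text on each lookup):

sLabel2 = Label(sGui, text="")
sLabel2.pack()

def stockSearch():
    ...
    # update the existing label instead of packing a new one every time
    sLabel2.config(text=outputStock)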