How to extract the title of each page from a PDF using Python - python-3.x

I want to extract the title of each page of a PDF, but my PDFs do not have a uniform or predefined title size (the title size varies from page to page). I tried the following code, but it is not giving me the expected output; instead, it extracts the whole text of the page.
import PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter

filenames = ['Test2.pdf']
# filenames = ['sample-pdf-download-10-mb.pdf', 'sample-pdf-file.pdf', 'sample-pdf-with-images.pdf']
pdf_Writer = PdfFileWriter()
for filename in filenames:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText()
        print(count, "= ", pageObj.extractText().title())
Also, how can I extract highlighted text from a PDF?
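Since PyPDF2 exposes no font information, one heuristic worth trying is PyMuPDF (the fitz module): read each page's text spans and treat the text set in the largest font on that page as its title. This is only a sketch of that heuristic, not a guaranteed solution, and 'Test2.pdf' stands in for your file:

import fitz  # PyMuPDF

doc = fitz.open('Test2.pdf')  # stand-in for your file
for page_number, page in enumerate(doc, start=1):
    # Collect every text span on the page along with its font size
    spans = [span
             for block in page.get_text("dict")["blocks"] if block.get("type") == 0
             for line in block["lines"]
             for span in line["spans"]]
    if not spans:
        continue
    # Heuristic: the title is whatever text uses the largest font on the page
    max_size = max(span["size"] for span in spans)
    title = " ".join(span["text"] for span in spans if span["size"] == max_size)
    print(page_number, "=", title.strip())

Whether this works depends on the layout; if titles share a font size with other text, you may need to restrict the search to the top of the page.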

Related

.csv to .arff function on Python

I'm trying to write a conversion function from CSV to ARFF; right now I have this:
def csv2arff(csv_path, arff_path=None):
    with open(csv_path, 'r') as fr:
        attributes = []
        if arff_path is None:
            arff_path = csv_path[:-4] + '_prueba.arff'  # *.csv -> *.arff
        write_sw = False
        with open(arff_path, 'w') as fw:
            fw.write('#relation base_datos_modelo_3_limpia \n')
            firstline = fr.readlines()[0].rstrip()
            fw.write(firstline)
and that gives me:
#relation base_datos_modelo_3_limpia
DVJ_Valgus_KneeMedialDisplacement_D_discr,BMI,AgeGroup,ROM-PADF-KE_D,DVJ_Valgus_FPPA_D_discr,TrainFrequency,DVJ_Valgus_FPPA_ND_discr,Asym_SLCMJLanding-pVGRF(10percent)_discr,Asym-ROM-PHIR(≥8)_discr,Asym_TJ_Valgus_FPPA(10percent)_discr,TJ_Valgus_FPPA_ND_discr,Asym-ROM-PHF-KE(≥8)_discr,TJ_Valgus_FPPA_D_discr,Asym_SLCMJ-Height(10percent)_discr,Asym_YBTpl(10percent)_discr,Position,Asym-ROM-PADF-KE(≥8º)_discr,DVJ_Valgus_KneeMedialDisplacement_ND_discr,DVJ_Valgus_Knee-to-ankle-ratio_discr,Asym-ROM-PKF(≥8)_discr,Asym-ROM-PHABD(≥8)_discr,Asym-ROM-PHF-KF(≥8)_discr,Asym-ROM-PHER(≥8)_discr,AsymYBTanterior10percentdiscr,Asym-ROM-PHABD-HF(≥8)_discr,Asym-ROM-PHE(≥8)_discr,Asym(>4cm)-DVJ_Valgus_Knee;edialDisplacement_discr,Asym_SLCMJTakeOff-pVGRF(10percent)_discr,Asym-ROM-PHADD(≥8)_discr,Asym-YBTcomposite(10percent)_discr,Asym_SingleHop(10percent)_discr,Asym_YBTpm(10percent)_discr,Asym_DVJ_Valgus_FPPA(10percent)_discr,Asym_SLCMJ-pLFT(10percent)_discr,DominantLeg,Asym-ROM-PADF-KF(≥8)_discr,ROM-PHER_ND,CPRDmentalskills,POMStension,STAI-R,ROM-PHER_D,ROM-PHIR_D,ROM-PADF-KF_ND,ROM-PADF-KF_D,Age_at_PHV,ROM-PHIR_ND,CPRDtcohesion,Eperience,ROM-PHABD-HF_D,MaturityOffset,Weight,ROM-PHADD_ND,Height,ROM-PHADD_D,Age,POMSdepressio,ROM-PADF-KE_ND,POMSanger,YBTanterior_Dnorm,YBTanterior_NDnorm,POMSvigour,Soft-Tissue_injury_≥4days
So I want to put "#attribute" before each attribute and change each "," to "\n", but I don't know how to do it. I tried writing a function to replace the "," but it didn't work. Any ideas?
Thank you guys.
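For the literal transformation you describe, a minimal sketch (note that standard ARFF actually uses "@attribute", not "#attribute", and each attribute also needs a type):

# Turn the comma-separated header into one attribute declaration per column
for name in firstline.split(','):
    fw.write('@attribute ' + name + ' NUMERIC\n')

A more robust route is to let a library handle the ARFF syntax, as below.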
Try the liac-arff library.
Here is an example for converting the UCI iris dataset from ARFF to CSV and then back to ARFF:
import csv
import arff

# arff -> csv
content = arff.load(open('./iris.arff', 'r'))
with open('./out.csv', 'w') as fp:
    writer = csv.writer(fp)
    header = []
    for n, t in content['attributes']:
        header.append(n)
    writer.writerow(header)
    writer.writerows(content['data'])

# csv -> arff
with open('./out.csv', 'r') as fp:
    reader = csv.reader(fp)
    header = None
    data = []
    for row in reader:
        if header is None:
            header = row
        else:
            data.append(row)

content = {}
content['relation'] = "from my csv file"
content['attributes'] = []
for n in header:
    if n == "class":
        content['attributes'].append((n, ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']))
    else:
        content['attributes'].append((n, 'NUMERIC'))
content['data'] = data

with open('./out.arff', 'w') as fp:
    arff.dump(content, fp)
NB: For the last stage, we need to specify the nominal class values, which you could determine by scanning the data.
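That scan could look like this minimal sketch, assuming the class label is the last column of each row:

# Determine the nominal class values by scanning the parsed rows
# (assumes the class label is the last CSV column)
class_values = sorted({row[-1] for row in data})
# ...then use class_values in place of the hard-coded Iris list above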

Import txt file and filter with space

I'm writing a script to track my orders from a website. I want to import the order numbers from a txt file, and the script should repeat itself as long as there are order numbers. I wrote code where the script imports this txt file and chooses a random order number, but the script puts all the order numbers together and doesn't separate them. How can I fix this?
this is my code:
import random

f = open("Order#.txt", "r")
OrderNR = f.read()
words = OrderNR.split()
Repeat = len(words)
for i in range(Repeat):
    randomlist = OrderNR
    Orderrandom = random.choice(randomlist)
    Mainlink = 'https://footlocker.narvar.com/footlocker/tracking/startrack?order_number=' + Orderrandom
Instead of using f.read(), try using f.readlines().
# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()
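Applied to your order file, a minimal sketch (keeping your tracking URL) might look like:

import random

with open("Order#.txt", "r") as f:
    # One order number per line; strip whitespace and skip blank lines
    order_numbers = [line.strip() for line in f if line.strip()]

Orderrandom = random.choice(order_numbers)
Mainlink = 'https://footlocker.narvar.com/footlocker/tracking/startrack?order_number=' + Orderrandom

random.choice then picks one whole order number instead of a single character of the concatenated string.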
Try PANDAS
import pandas as pd
df = pd.read_csv('Order#.txt', delimiter='\t')
print(df)
With this you can see the TXT file in table format.

'charmap' codec can't encode character '\u0432' in position 0: character maps to <undefined>

I want to extract the caption corresponding to every video_id from the JSON file. The file is inside test_videodatainfo.json.zip.
import json
import csv

with open('D:/Final Year Project/test_videodatainfo/test_videodatainfo.json') as json_file:
    data = json.load(json_file)
    capdata = data['sentences']
    data_file = open('D:/Final Year Project/test_videodatainfo/data_file.csv', 'w')
    csv_writer = csv.writer(data_file)
    count = 0
    for cap in capdata:
        if count == 0:
            header = cap.keys()
            print(cap.keys())
            csv_writer.writerow(header)
            count += 1
        csv_writer.writerow(cap.values())
    data_file.close()
The script then fails with the 'charmap' codec error from the title (shown in a screenshot, not reproduced here).
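The error itself means a Cyrillic character ('\u0432') is being written to a file opened with Windows' default cp1252 codec. A likely fix (an assumption, untested against this data) is to open the output file with an explicit UTF-8 encoding; newline='' is also what the csv docs recommend:

# Open the CSV with an explicit UTF-8 encoding so non-Latin characters
# such as '\u0432' can be written; newline='' avoids blank rows on Windows
data_file = open('D:/Final Year Project/test_videodatainfo/data_file.csv', 'w',
                 encoding='utf-8', newline='')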

Formatting a Python generated CSV

I'm making a web scraper in Python.
I'd like to remove the blank rows from the generated CSV, add a header row saying "Car make", "Car Model", "Price", and remove the [] around the names in the generated CSV.
imports go here...

source = requests.get(' website link goes here...').text
soup = bs(source, 'html.parser')

csv_file = open('pyScraper_1.3_Export', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['brand_Names', 'Prices'])
csv_file.close()

# gives us the make and model of all cars
Names = []
Prices_Cars = []
for var1 in soup.find_all('h3', class_ = 'brandModelTitle'):
    car_Names = var1.text  # var1.span.text
    test_Split = car_Names.split("\n")
    full_Names = test_Split[1:3]
    # make = test_Split[1:2]
    # model = test_Split[2:3]
    Names.append(full_Names)

# prices
for Prices in soup.find_all('span', class_ = 'f20 bold fieldPrice'):
    Prices = Prices.span.text
    Prices = re.sub("^\s+|\s+$", "", Prices, flags=re.UNICODE)  # removing whitespace before the prices
    Prices_Cars.append(Prices)

csv_file = open('pyScraper_1.3_Export.csv', 'a')
csv_writer = csv.writer(csv_file)
i = 0
while i < len(Prices_Cars):
    csv_writer.writerow([Names[i], Prices_Cars[i]])
    i = i + 1
csv_file.close()
Here is the screenshot of the generated CSV: https://i.stack.imgur.com/m7Xw1.jpg
To remove additional newlines:
csv_file = open('pyScraper_1.3_Export.csv', 'a', newline='')
("If csvfile is a file object, it should be opened with newline=''.", https://docs.python.org/3/library/csv.html#csv.writer)
To add headers:
you are actually adding headers, but to a file named pyScraper_1.3_Export (note: no .csv extension), which may be a typo. Just change the code at about line 6 to
csv_file = open('pyScraper_1.3_Export.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["Car make", "Car Model", "Price"])
csv_file.close()
As for removing nested list, unpack Names[i] with * operator:
csv_writer.writerow([*Names[i], Prices_Cars[i]])
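Putting the three fixes together, the writing section could look like this sketch (based only on the changes above, not tested against the site):

import csv

with open('pyScraper_1.3_Export.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Car make", "Car Model", "Price"])
    # Names holds [make, model] pairs, so unpack each pair with *
    for name, price in zip(Names, Prices_Cars):
        csv_writer.writerow([*name, price])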

Extracting words from pdf using python 3?

We are extracting words from resumes in PDF format.
ONE WAY OF DOING IT!
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('resume1.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Output:
2
Mostrecentversionalwaysavailableat
nlp.stanford.edu/
˘
rkarthik/cv.html
KarthikRaghunathan
Mobile:
+1-650-384-5782
Email:
kr
f
csDOTstanfordDOTedu
g
Homepage:
nlp.stanford.edu/
˘
rkarthik
ResearchInterests
Intelligence,NaturalLanguageProcessing,Human-RobotInteraction
EducationStanfordUniversity
,California2008onwards
MasterofScienceinComputerScienceCurrentGPA:3.91/4.00
NationalInstituteofTechnology(NIT)
,Calicut,India2004-2008
BachelorofTechnologyinComputerScienceandEngineeringCGPA:9.14/10.00
SoftwareSkills
ProgrammingLanguages
:C,C
++
,Perl,Java,C
#
,MATLAB,Lisp,SQL,MDX,Intelx86
assembly
Speech/NLP/AITools
:HMMToolkit(HTK),CMUSphinxAutomaticSpeechRecogni-
tionSystem,FestivalSpeechSynthesisSystem,VoiceXML,BerkeleyAligner,Giza++,Moses
StatisticalMachineTranslationToolkit,RobotOperatingSystem(ROS)
OtherTools
:L
A
T
E
X,LEX,YACC,Vim,Eclipse,MicrosoftVisualStudio,MicrosoftSQLServer
ManagementStudio,TestNGJavaTestingPlatform,SVN
OperatingSystems
:Linux,Windows,DOS
WorkExperienceMicrosoftCorporationSoftwareDevelopmentEngineerIntern
Redmond,WAJune2009-Sept2009
WorkedwiththeRevenue&RelevanceTeamatMicrosoftadCenteronthe
adCenterMarket-
placeScorecard
project,aimedatdevelopingastandardreliablesetofmetricsthatmeasure
thecompany'sperformanceintheonlineadvertisingmarketplaceandaidinmakinginformed
decisionstomaximizethemarketplacevalue.Alsoinitiatedtheonastatisticallearning
modelthatectivelypredictschangesintheadvertisers'biddingbehaviorwithtime.
StanfordNaturalLanguageProcessingGroupGraduateResearchAssistant
StanfordUniversity,CASept2008onwards
WorkingonStanford'sstatisticalmachinetranslation(SMT)system(aspartoftheDARPA
GALEProgram)undertheguidanceofProf.ChristopherManning.LedStanford'sfor
theGALEPhase3Chinese-EnglishMTevaluationaspartoftheIBM-Rosettateam.
MicrosoftResearch(MSR)LabIndiaResearchIntern
Bangalore,IndiaApr2007-Jul2007
Investigatedthetoleranceofstatisticalmachinetranslationsystemstonoiseinthetraining
corpus,particularlythekindofnoisethataccompaniesautomaticextractionofparallelcorpora
fromcomparablecorpora.AlsoworkedonthedesignofanonlinegameforNLPdataacquisition.
InternationalInstituteofInformationTechnology(IIIT)SummerIntern
Hyderabad,IndiaApr2006-Jun2006
Workedontherapidprototypingofrestricteddomainspokendialogsystems(SDS)forIndian
languages.Developedthe
IIITReceptionist
,aSDSinTamil,TeluguandEnglishlanguages,
whichfunctionedasanautomaticreceptionistforIIIT.
CourseProjectsNormalizationoftextinSMSmessagesusinganSMTsystem
Apr2009-Jun2009
Developedasystemforconvertingtextspeak(languageusedinSMScommunication)toproper
EnglishusingtheMosesstatisticalmachinetranslationsystem.
STAIRspokendialogproject
Jan2009-Apr2009
DevelopedaspokendialoginterfacetotheStanfordAIRobot(STAIR)forgivinginstructions
forfetchingtasks,undertheguidanceofProf.DanJurafskyandProf.AndrewNg.
The words are not extracted as individual keywords; this haphazard run-together text appears instead.
Another way of doing it.
import PyPDF2
# import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'sample.pdf'
# open allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This if statement exists to check if the above library returned words.
# It's done because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returns as False, we run the OCR library textract to convert
# scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our PDF file.
# Now, we will clean our text variable and return it as a list of keywords.
print(text)
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
# print(tokens)
# we'll create a new list which contains punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',', ' ']
# We initialize the stopwords variable, which is a list of words like "The",
# "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# We create a list comprehension which only returns words that are
# NOT IN stop_words and NOT IN punctuation
keywords = [word for word in tokens if word not in stop_words and word not in string.punctuation]
print(keywords)
Output: the same run-together text as above, and on top of that the textract module cannot be found.
Question: Can anyone correct the code or suggest a new approach to help with this work?
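The run-together words come from PyPDF2's extractText(), which often drops the spaces between words. One alternative worth trying (a suggestion, not tested on these resumes) is pdfminer.six, whose high-level API tends to preserve word boundaries; textract itself can usually be installed with pip install textract:

# pip install pdfminer.six
from pdfminer.high_level import extract_text
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# 'sample.pdf' stands in for the resume file
text = extract_text('sample.pdf')

stop_words = stopwords.words('english')
keywords = [word for word in word_tokenize(text)
            if word not in stop_words and word not in string.punctuation]
print(keywords)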
