Extracting title from pdf using pypdf2 not working - python-3.x

I'm trying to extract the title of PDF files using pyPDF2. The output is either none or a wrong title. I tried using PDFminer as well, still the same result. I tried using 3 different pdf files. Is there a better way to extract the title with better accuracy?
This is the code I used:
from PyPDF2 import PdfFileReader
def get_pdf_title(pdf_file_path):
pdf_reader = PdfFileReader(open(pdf_file_path, "rb"))
return pdf_reader.getDocumentInfo().title
title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)

Your code is working, at least for me on python 3.5.2. Check in the PDF properties that he indeed has a title.
PDF's title is part of its metadata, that needs to be set. It is not mandatory, not related to its content (other than by the will of the person writing it), nor with its filename.
If you use your snippet on a file with no title, it's output will be an empty string.

Related

How to scrape a pdf file to make a DF

I need to build a database with several data. Most of those data is contained in PDF Files. Those PDF files are all the same, but change only on the data. (for example, one of the files i have to work in: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e)
I have been trying to extract the data with PyPDF, tabula, pdfminer (even tried with textract but it didn't work through Anaconda) and other stuff, but i didn't get what i want.
Then i tried to transform those pdf files in txt files and then mining it, but didn't get anything. Also tried with regex but didn't understand how to use it, although the code doesn't show errors when running:
import re
import sys
recording = False
your_file = "D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = 'apellidos:'
stop_pattern = '1.2'
output_section = []
for line in open(your_file).readlines():
if recording is False:
if re.search(start_pattern, line) is not None:
recording = True
output_section.append(line.strip())
elif recording is True:
if re.search(stop_pattern, line) is not None:
recording = False
sys.exit()
output_section.append(line.strip())
print("".join(output_section))
As you can see in the upper link i left, pdf files have different sections. I need to get the info that's inside those sections. For example, one of the fields in my database it's going to be "Nombre y apellido" (name and lastname). It's contained between "apellidos:" and "1.2".
what should i do? Can i work directly from PDF format? Or should i work in txt files? And then, what should i use to get the info? (Python 3.XX; Anaconda)
Thanks

Importing only a specific part of the docx in Python

I am trying to extract the majority of my docx file when I am importing it to the Python. The best would be if I could tell my code which paragraphs I need or what part of the text I am going to use.
Can anyone help me with that?
I have tried this code:
import docx
doc = docx.Document('A.docx')
print(len(doc.paragraphs))
print (doc.paragraphs[2].text)
but the problem with this is that whenever I hit enter it thinks that a new paragraph has started.

Vader Sentiment with multiple PDF

I have recently merged 20 pdf in 1 pdf via adobe. I have import the pdf in python with this code.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file = open ('/Users/cj/Desktop/PEI.pdf','rb')
newfile=open('rjtjj.txt','w')
pdf_reader= PdfFileReader (pdf_file)
pdf_writer= PdfFileWriter()
print(pdf_reader.numPages)
n=pdf_reader.getNumPages()
for i in range(0, n-1):
# pdf_writer.addPage(pdf_reader.getPage(i))
gft=pdf_reader.getPage(i)
newfile.write(gft.extractText())
pdf_file.close()
newfile.close()
I'm trying to use Vadersentiment to analyse the pdf. What i want to do is analyse individually the 20 pdf that are merged into 1.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
for line in f.read().split("\n"):
vs=analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire pdf. I am new to this, i would really appreciate your help.
Thank you
Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.
Postscript's forth interpreter is Turing-complete, so some PDF documents are "hard" to parse. You didn't post your PDF so we can only guess at the issue. You might try using poppler's pdftotext command line utility instead. Ubuntu calls the package "poppler-utils"; on mac you would use brew install poppler. Running through pdf2ps & ps2ascii will sometimes offer different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced the PDF and settle on supplying the same information in a revised format.

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and and then automatically renames the files it recognizes from a dictionary. PyPDF2 however only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2
# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
# Read in file
pdfReader = PyPDF2.PdfFileReader(f)
# Check number of pages
number_of_pages = pdfReader.numPages
print(number_of_pages)
# Get first page
pageObj = pdfReader.getPage(0)
# Extract text from page 1
text = pageObj.extractText()
print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse. It Recognized 140 out of 800 files and after enhancing just 110.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but also throws in a lot of errors. If possible I like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Anyone any suggestions? Your help is very much appreciated!
I had the same issue, i fixed it using another python library called slate
Fortunately, i found a fork that works in Python 3.6.5
import slate3k as slate
with open(file.pdf,'rb') as f:
extracted_text = slate.PDF(f)
print(extracted_text)

Python: Universal XML parser

I'm trying to make simple Python 3 program to read weather information from XML web source, convert it into Python-readable object (maybe dictionary) and process it (for example visualize multiple observations into graph).
Source of data is national weather service's (direct translation) xml file at link provided in code.
What's different from typical XML parsing related question in Stack Overflow is that there are repetitive tags without in-tag identificator (<station> tags in my example) and some with (1st line, <observations timestamp="14568.....">). Also I would like to try parse it straight from website, not local file. Of course, I could create local temporary file too.
What I have so far, is simply loading script, that gives string containing xml code for both forecast and latest weather observations.
from urllib.request import urlopen
#Read 4-day forecast
forecast= urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/forecast.php").read().decode("iso-8859-1")
#Get current weather
observ=urlopen("http://www.ilmateenistus.ee/ilma_andmed/xml/observations.php").read().decode("iso-8859-1")
Shortly, I'm looking for as universal as possible way to parse XML to Python-readable object (such as dictionary/JSON or list) while preserving all of the information in XML-file.
P.S I prefer standard Python 3 module such as xml, which I didn't understand.
Try xmltodict package for simple conversion of XML structure to Python dict: https://github.com/martinblech/xmltodict

Resources