VADER sentiment with multiple PDFs - python-3.x

I recently merged 20 PDFs into one PDF via Adobe. I imported the PDF into Python with this code:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file = open('/Users/cj/Desktop/PEI.pdf', 'rb')
newfile = open('rjtjj.txt', 'w')

pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()

n = pdf_reader.getNumPages()
print(n)

# range(0, n-1) would skip the last page; range(n) covers every page
for i in range(n):
    # pdf_writer.addPage(pdf_reader.getPage(i))
    page = pdf_reader.getPage(i)
    newfile.write(page.extractText())

pdf_file.close()
newfile.close()
I'm trying to use vaderSentiment to analyze the PDF. What I want to do is analyze each of the 20 PDFs that were merged into one individually.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

with open('rjtjj.txt', 'r') as f:
    for line in f.read().split("\n"):
        vs = analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire PDF. I am new to this, so I would really appreciate your help.
Thank you

Your problem really isn't about VADER sentiment analysis -- it is about correctly extracting text from a PDF.
PostScript's Forth-like interpreter is Turing-complete, so some PDF documents are "hard" to parse. You didn't post your PDF, so we can only guess at the issue. You might try poppler's pdftotext command-line utility instead. Ubuntu calls the package "poppler-utils"; on macOS you would use brew install poppler. Running the file through pdf2ps & ps2ascii will sometimes produce different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced it and agree on receiving the same information in a different format.
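If the text does extract cleanly, one way to score the merged document piece by piece is to run the analyzer per page rather than on one dumped text file. A minimal sketch, reusing the PEI.pdf path from the question and assuming each page of the merged PDF corresponds to one of the original 20 documents (which may not hold if the originals had multiple pages):

from PyPDF2 import PdfFileReader
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

with open('/Users/cj/Desktop/PEI.pdf', 'rb') as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    for i in range(pdf_reader.getNumPages()):
        # Extract one page's text and score it as a single unit
        text = pdf_reader.getPage(i).extractText()
        scores = analyzer.polarity_scores(text)
        print('page', i, scores)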

Related

convert .pptx and .xlsx files to pdf using libreoffice cli

I'm using the LibreOffice CLI to convert two kinds of files I have:
.XLSX to .PDF;
.PPTX to .PDF;
To do this I'm using the command below:
import subprocess
import re

args = ['libreoffice', '--headless', '--convert-to',
        'pdf', '--outdir', 'output', 'my_file.pptx']
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=25)
re.search('-> (.*?) using filter', process.stdout.decode())
It is converting the files to PDF, but unfortunately it is not keeping the same layout.
(Screenshots omitted: continuation of the previous page with its footer, and the header of the next page.)
As the images showed, the headers have different sizes, and the header also contains a stray "x000a" value.
Is there an alternative that fixes these kinds of problems?
Regards,
Leonardo
You should not expect a 100% replica of Microsoft Office output from LibreOffice. Anyway, if you think this is a bug, you can report it to the LibreOffice Bugzilla with a good description and sample documents that reproduce it:
https://bugs.documentfoundation.org/
There are faster ways to convert a document to PDF if you can keep LibreOffice running in the background. For example, see this project, which is written in Python:
Python script to automate document conversions using LibreOffice/OpenOffice.org
https://github.com/mirkonasato/pyodconverter
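Note also that the CLI accepts several input files in one invocation, which avoids paying LibreOffice's startup cost once per document. A minimal sketch of a batch run, where the docs/ folder name is an assumption for illustration:

import glob
import subprocess

# Collect every source document from a hypothetical docs/ folder
files = glob.glob('docs/*.pptx') + glob.glob('docs/*.xlsx')

# One libreoffice process converts all of them, writing the PDFs to output/
args = ['libreoffice', '--headless', '--convert-to', 'pdf', '--outdir', 'output'] + files
subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=300)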

How to convert a scanned PDF file to Editable PDF file with python?

I just need to know whether we can convert a scanned PDF file to an editable PDF file using Python. I know of a couple of libraries, such as pytesseract and pyocr. Guidance in this regard would be highly appreciated. Thanks
A scanned PDF document (a number of images combined into one PDF) is saved in PDF format, but in such a file you cannot select a single letter of the text, or even search for it.
I also faced the same problem, and handled it with three lines of code.
It converts scanned PDF files into PDF documents with selectable, searchable text.
Hope it works for you!
import ocrmypdf

def scannedPdfConverter(file_path, save_path):
    # skip_text=True leaves pages that already contain text untouched
    ocrmypdf.ocr(file_path, save_path, skip_text=True)
    print('File converted successfully!')
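A hypothetical call, with both file names chosen purely for illustration:

scannedPdfConverter('scanned_input.pdf', 'searchable_output.pdf')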
import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
Referenced from here
You should look before you ask.

Importing only a specific part of the docx in Python

I am trying to extract most of my docx file when importing it into Python. Ideally I could tell my code which paragraphs I need, or which part of the text I am going to use.
Can anyone help me with that?
I have tried this code:
import docx
doc = docx.Document('A.docx')
print(len(doc.paragraphs))
print (doc.paragraphs[2].text)
but the problem with this is that every time I hit Enter in the document, python-docx treats it as the start of a new paragraph.
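That is expected: python-docx maps every paragraph mark (each press of Enter) to one entry in doc.paragraphs, including empty ones. A minimal sketch of narrowing the list down to the text you care about, where the index range and the 'Introduction' heading are assumptions for illustration:

import docx

doc = docx.Document('A.docx')

# Drop the empty paragraphs created by blank lines
paragraphs = [p for p in doc.paragraphs if p.text.strip()]

# Option 1: select paragraphs by position
wanted = paragraphs[2:5]

# Option 2: collect the paragraphs that follow a specific heading
collected = []
found = False
for p in doc.paragraphs:
    if p.style.name.startswith('Heading') and p.text == 'Introduction':
        found = True
        continue
    if found and p.style.name.startswith('Heading'):
        break  # the next heading ends the section
    if found:
        collected.append(p.text)

print('\n'.join(collected))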

Extracting title from pdf using pypdf2 not working

I'm trying to extract the titles of PDF files using PyPDF2. The output is either None or a wrong title. I tried PDFMiner as well, with the same result, on three different PDF files. Is there a better way to extract the title with better accuracy?
This is the code I used:
from PyPDF2 import PdfFileReader

def get_pdf_title(pdf_file_path):
    pdf_reader = PdfFileReader(open(pdf_file_path, "rb"))
    return pdf_reader.getDocumentInfo().title

title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)
Your code is working, at least for me on Python 3.5.2. Check in the PDF properties that it indeed has a title.
A PDF's title is part of its metadata, which has to be set explicitly. It is not mandatory, and it is related neither to the content (other than by the will of the person writing it) nor to the filename.
If you use your snippet on a file with no title, its output will be empty.
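If the metadata is empty, one common workaround (a heuristic, not something the metadata API provides) is to fall back on the first line of text on the first page, which often holds the visual title. A minimal sketch:

from PyPDF2 import PdfFileReader

def get_pdf_title(pdf_file_path):
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        title = pdf_reader.getDocumentInfo().title
        if title:
            return title
        # Fall back to the first non-empty line on page one;
        # this can easily pick up a header or author line instead
        first_page = pdf_reader.getPage(0).extractText()
        for line in first_page.splitlines():
            if line.strip():
                return line.strip()
    return None

print(get_pdf_title('C:/PythonPrograms/Test.pdf'))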

PyPDF2 returns only empty lines for some files

I am working on a script that "reads" PDF files and then automatically renames the ones it recognizes from a dictionary. PyPDF2, however, only returns empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2

# File name
file = 'sample.pdf'

# Open file
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace print(text) with repr(text) for the files it cannot read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse: it recognized 140 out of 800 files before enhancing and just 110 after.
The PDFs are machine-readable/searchable, because I am able to copy and paste text into Notepad. I tested some files with pdfminer and it does show some text, but it also throws a lot of errors. If possible I would like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue and fixed it using another Python library called slate.
Fortunately, I found a fork, slate3k, that works on Python 3.6.5:
import slate3k as slate

with open('file.pdf', 'rb') as f:
    # slate.PDF returns the extracted text, one entry per page
    extracted_text = slate.PDF(f)
print(extracted_text)
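If slate3k is not an option, pdfminer.six (the maintained Python 3 fork of the pdfminer mentioned in the question) also exposes a one-call high-level API. A minimal sketch, with sample.pdf standing in for one of the problem files:

from pdfminer.high_level import extract_text

# Extracts the text of every page into one string
text = extract_text('sample.pdf')
print(repr(text[:200]))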
