Convert PDF to Excel/CSV/XLSX

My intention is to convert the PDF text into an Excel/CSV file as follows:
PDF file: (Source File)
#_________________________________________________________________________
appliance
n. 1. See server appliance. 2. See information appliance. 3. A device with a single or limited ......
appliance server
n. 1. An inexpensive computing .....2. See server appliance.
application
n. A program designed ......
#________________________________________________________________________
Excel File : (Target File)
#________________________________________________________________________
appliance , n. , 1. See server appliance ,
appliance server , n. , 1. An inexpensive co ,
application , n. , A program designed ...... ,
#________________________________________________________________________
I have converted the PDF into text and am trying to split it with "," and then convert the text file into a CSV file. But I am stuck after converting the PDF to a text file.
import os
from os import path
import PyPDF2
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies that the provided absolute path exists.
    '''
    abs_path = input(prompt)
    while not path.exists(abs_path):
        print("\nThe specified path does not exist.\n")
        abs_path = input(prompt)
    return abs_path

print("\n")
folder = check_path("Provide absolute path for the folder: ")

# Collect the paths of all PDF files in the folder (including subfolders)
pdf_files = []
for root, dirs, files in os.walk(folder):
    for filename in files:
        if filename.endswith('.pdf'):
            pdf_files.append(os.path.join(root, filename))

for pdf_path in pdf_files:
    head, tail = os.path.split(pdf_path)
    txt_path = os.path.join(head, tail.replace(".pdf", ".txt"))
    content = ""
    # Load PDF into PyPDF2
    pdf = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
    # Iterate pages, extracting the text of each one
    for page_number in range(pdf.getNumPages()):
        content += pdf.getPage(page_number).extractText() + "\n"
    print(strftime("%H:%M:%S"), " pdf -> txt ")
    # Write the extracted text next to the original PDF
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(content)
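A minimal sketch of the text-to-CSV step I am stuck on might look like this (file names are placeholders, and it assumes every entry is exactly two lines in the extracted text: the headword, then a definition line such as "n. 1. ..."):
import csv

# Placeholder file names; assumes entries alternate between a headword line
# and a definition line that starts with the part of speech, e.g. "n. 1. ..."
with open('dictionary.txt', encoding='utf-8') as src, \
        open('dictionary.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst)
    lines = [line.strip() for line in src if line.strip()]
    for headword, definition in zip(lines[0::2], lines[1::2]):
        pos, _, rest = definition.partition(' ')  # e.g. "n." and the remaining text
        writer.writerow([headword, pos, rest])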

It may be worth converting the PDF to CSV first, then manipulating the CSV to the layout you would like afterwards.
This API can be used with Python to convert one or multiple PDFs to CSV: https://pdftables.com/pdf-to-excel-api.
To convert a single PDF:
import pdftables_api
c = pdftables_api.Client('my-api-key')
c.csv('input.pdf', 'output.csv')
or to convert multiple PDFs:
import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.csv(os.path.join(file_path, file), file + '.csv')

Related

PDF parser in pdfs with multiple images and formats with python and tabula (open to other options)

So first off, what I'm trying to do: create a PDF parser that will take ONLY tables out of any given PDF. I currently have some PDFs that are parts manuals, which contain an image of the part and then a table with details of the parts, and I want to scrape and parse the table data from the PDF into a CSV or similar Excel-style file (CSV, XLS, etc.).
What I've tried/am trying: I am currently using Python 3 and tabula (I have no preference for either of these and am open to other options), with a .py program that is able to scrape all the data of any PDF or directory of PDFs; however, it takes EVERYTHING, including the image file code, which has a bunch of 0 1 NaN (adding examples at the bottom). I was thinking of writing a filter function that removes these, but that feels like overkill, and I was wondering/hoping there is a way to filter out the images with tabula or another library? (Side note: I've also attempted camelot, however the module is not importing correctly even when it is in my pip freeze, and this happens on both my Mac M1 and Mac M2, so I'm assuming there is no ARM support.)
If anyone could help me, or guide me towards a library or method for iterating through all pages in a PDF and JUST grabbing the tables for export to CSV, that would be AMAZING!
current main file:
from tabula.io import read_pdf
from traceback import print_tb
import pandas as pd
from tabulate import tabulate
import os

def parser(fileName, count):
    print("\nFile Number: ", count, "\nNow parsing file: ", fileName)
    df = read_pdf(fileName, pages="all")  # address of pdf file
    for i in range(len(df)):
        df[i].to_excel("./output/test" + str(i) + ".xlsx")
    print(tabulate(df))
    print_tb(df)

def reader(type):
    filecount = 1
    if type == 'f':
        file = input("\nFile(f) type selected\nplease enter full file name with path (ex. Users/Name/directory1/filename.pdf): ")
        parser(file, filecount)
    elif type == 'd':
        # directory selected
        location = input("\nPlease enter directory path, if in the same folder just enter a period(.): ")
        print("Opening directory: ", location)
        # loop through and parse directory
        for filename in os.listdir(location):
            f = os.path.join(location, filename)
            # checking if it is a file
            if os.path.isfile(f):
                parser(f, filecount)
                filecount += 1
            else:
                print('\n\n ERROR, path given does not contain a file or is not a directory type..')
    else:
        print("Error: please select directory(d) or file(f)")

fileType = input("\n-----> Hello!\n----> Would you like to parse a directory(d) or file(f)? ").lower()
reader(fileType)
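On the filtering idea mentioned above, a rough sketch (not tested against these PDFs; the 0.5 threshold is an arbitrary assumption to tune) is to drop the DataFrames that read_pdf returns which are mostly NaN, since the image residue tends to come out as very sparse tables:
# Sketch: keep only the DataFrames from read_pdf() that are mostly non-empty.
def keep_real_tables(frames, max_nan_ratio=0.5):
    kept = []
    for frame in frames:
        if frame.empty:
            continue
        nan_ratio = frame.isna().to_numpy().mean()  # fraction of NaN cells
        if nan_ratio <= max_nan_ratio:
            kept.append(frame)
    return kept

# Possible usage inside parser():
# df = keep_real_tables(read_pdf(fileName, pages="all"))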

Saving text files to a .npy file

I have many text files in a directory with numerical extensions (example: signal_data1.9995100000000001, signal_data1.99961, etc.).
The content of the files are as given below
signal_data1.9995100000000001
-1.710951390504200198e+00
5.720409824754981720e-01
2.730176313110273423e+00
signal_data1.99961
-6.710951390504200198e+01
2.720409824754981720e-01
6.730176313110273423e+05
I just want to arrange the above files into a single .npy file as
-1.710951390504200198e+00,5.720409824754981720e-01, 2.730176313110273423e+00
-6.710951390504200198e+01,2.720409824754981720e-01, 6.730176313110273423e+05
So, I want to implement the same procedure for many files of a directory.
I tried a loop as follows:
import numpy as np
import glob

for file in glob.glob('./signal_*'):
    np.savez('data', file)
However, it does not give what I want as depicted above. So here I need help. Thanks in advance.
Here is another way of achieving it:
import os

dirPath = './data/'  # folder where you store your data
with os.scandir(dirPath) as entries:
    output = ""
    for entry in entries:  # read each file in your folder
        dataFile = open(dirPath + entry.name, "r")
        dataLines = dataFile.readlines()
        dataFile.close()
        for line in dataLines:
            output += line.strip() + " "  # clear all unnecessary characters & append
        output += '\n'  # after each file, break the line

writeFile = open("a.npy", "w")  # save it
writeFile.write(output)
writeFile.close()
You can use np.loadtxt() and np.save():
import glob
import numpy as np

a = np.array([np.loadtxt(f) for f in sorted(glob.glob('./signal_*'))])
np.save('data.npy', a)
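To check the result, the array can be loaded back (a quick sketch; the shape shown assumes every signal file holds the same number of values):
import numpy as np

a = np.load('data.npy')
print(a.shape)  # e.g. (number_of_files, values_per_file)
print(a[0])     # the first file's values as one row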

How to extract text from a PDF file using Python; I have never done this and am not getting the DOM of the PDF file

This is my PDF file: "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Can someone help me extract this? Searching on SO, I got some clues about extracting text using these libraries: PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, TextStringObject, but I am not getting how to use them.
Someone please help me extract this with some explanation, so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you're writing your script.
Start up your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
#write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'
#open allows you to read the file
pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
#This if statement exists to check if the above library returned words.
#It's done because PyPDF2 cannot read scanned files.
if text != "":
    text = text
#If the above returns as False, we run the OCR library textract to
#convert scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our PDF file.
# Type print(text) to see what it contains. It likely contains a lot of spaces,
# possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
#The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
#We initialize the stopwords variable, which is a list of words like "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
#We create a list comprehension which only returns a list of words that are NOT IN stop_words and NOT IN punctuations
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)
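For example, a quick sketch of the spreadsheet idea, writing the keywords list from above to a CSV file (the file name is made up):
import csv

with open('keywords.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['keyword'])
    for word in keywords:
        writer.writerow([word])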

Retrieve exif data from thousands of images fast - optimising function

I've written a script that retrieves specific fields of EXIF data from thousands of images in a directory (including subdirectories) and saves the info to a csv file:
import os
from PIL import Image
from PIL.ExifTags import TAGS
import csv
from os.path import join

####SET THESE!###
imgpath = 'C:/x/y' #Path to folder of images
csvname = 'EXIF_data.csv' #Name of saved csv
###

def get_exif(fn):
    ret = {}
    i = Image.open(fn)
    info = i._getexif()
    for tag, value in info.items():
        decoded = TAGS.get(tag, tag)
        ret[decoded] = value
    return ret

exif_list = []
path_list = []
filename_list = []
DTO_list = []
MN_list = []

for root, dirs, files in os.walk(imgpath, topdown=True):
    for name in files:
        if name.endswith('.JPG'):
            pat = join(root, name)
            pat = pat.replace(os.sep, "/")
            exif = get_exif(pat)
            path_list.append(pat)
            filename_list.append(name)
            DTO_list.append(exif['DateTimeOriginal'])
            MN_list.append(exif['MakerNote'])

zipped = zip(path_list, filename_list, DTO_list, MN_list)

with open(csvname, "w", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(('Paths', 'Filenames', 'DateAndTime', 'MakerNotes'))
    for row in zipped:
        writer.writerow(row)
However, it is quite slow. I've attempted to optimise the script for performance + readability by using list and dictionary comprehensions.
import os
from os import walk #Necessary for recursive mode
from PIL import Image #Opens images and retrieves exif
from PIL.ExifTags import TAGS #Convert exif tags from digits to names
import csv #Write to csv
from os.path import join #Join directory and filename for path

####SET THESE!###
imgpath = 'C:/Users/au309263/Documents/imagesorting_testphotos/Finse/FINSE01' #Path to folder of images. The script searches subdirectories as well
csvname = 'PLC_Speedtest2.csv' #Name of saved csv
###

def get_exif(fn): #Opens an image, retrieves the exif data, converts the exif tags from digits to names and puts the data into a dictionary
    i = Image.open(fn)
    info = i._getexif()
    ret = {TAGS.get(tag, tag): value for tag, value in info.items()}
    return ret

Paths = [join(root, f).replace(os.sep, "/") for root, dirs, files in walk(imgpath, topdown=True) for f in files if f.endswith(('.JPG', '.jpg'))] #Creates list of paths for images
Filenames = [f for root, dirs, files in walk(imgpath, topdown=True) for f in files if f.endswith(('.JPG', '.jpg'))] #Creates list of filenames for images
ExifData = list(map(get_exif, Paths)) #Runs the get_exif function on each of the images specified in the Paths list. list() converts the map object to a list.
MakerNotes = [i['MakerNote'] for i in ExifData] #Creates list of MakerNotes from exif data for images
DateAndTime = [i['DateTimeOriginal'] for i in ExifData] #Creates list of Date and Time from exif data for images
zipped = zip(Paths, Filenames, DateAndTime, MakerNotes) #Combines the four lists to be written into a csv.

with open(csvname, "w", newline='') as f: #Writes a csv-file with the exif data
    writer = csv.writer(f)
    writer.writerow(('Paths', 'Filenames', 'DateAndTime', 'MakerNotes'))
    for row in zipped:
        writer.writerow(row)
However, this has not changed the performance.
I've timed the specific regions of the code and found that specifically opening each image and getting the exif data from each image in the get_exif function is what takes time.
To make the script faster, I'm wondering:
1) Is it possible to optimise the performance of the function? 2) Is it possible to retrieve EXIF data without opening the image? 3) Is list(map(fn, x)) the fastest way of applying the function?
If I read the docs the right way, PIL.Image.open() does not only extract the EXIF data from the file but also reads and decodes the entire image, which is probably the bottleneck here. The first thing I would do is change to a library or routine that only works on the EXIF data and does not care about the image content. ExifRead or piexif might be worth a try.
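For example, a minimal sketch with ExifRead (assuming pip install exifread; the example path is made up, and the tag names follow ExifRead's labelling, which differs from PIL's):
import exifread

def get_exif_fast(path):
    # Reads only the metadata blocks; the image pixels are never decoded.
    with open(path, 'rb') as f:
        return exifread.process_file(f)

tags = get_exif_fast('C:/x/y/example.JPG')  # hypothetical path
print(tags.get('EXIF DateTimeOriginal'))
# Print list(tags) once to see exactly what your camera provides;
# MakerNote fields in particular vary by manufacturer.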

Custom filetype in Python 3

How do I start creating my own filetype in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like other formats such as zip, apk, jar, etc.; they are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules of CPython, without external modules.
I know that this can take long to explain and do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json

data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive with a JSON file (that's easy to read back in in many languages) for data. You can add further files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zipfile.
Use this:
import base64
import gzip
import ast

def save(data):
    data = "[{!r}]".format(data).encode()  # repr() so strings, bytes, etc. survive literal_eval on load
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with a file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This might look like it could be opened with an archive program, but it cannot, because it is base64 encoded and an archive manager would have to decode it first to access it.
Also, you can store any type of variable in it!
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be appropriate for your question, but I think it may help you.
I faced a similar problem... but ended up with something like creating a zip file and then renaming the zip file extension to my custom file format... But it can still be opened with WinRAR.
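A minimal sketch of that rename approach (the folder name and the custom extension are made up for illustration):
import os
import shutil

# Pack a folder into a zip archive, then give it a custom extension.
shutil.make_archive('mydata', 'zip', 'folder_to_pack')  # creates mydata.zip
os.replace('mydata.zip', 'mydata.myext')                # rename to the custom filetype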
