Reading images from pdf and extract Text from it - python-3.x

Problem Statement: I have a pdf which contains n number of pages and each page has 1 image whose text I need to read and perform some operation.
What I tried: I have to do this in python, and the only library I found with the best result is pytesserac.
I am pasting the sample code which I tried
fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
if 3.3 < zoom < 3.5:
zoom_config += ' --oem 3 --psm 4'
elif 0 != page_number_list[0]:
zoom_config += ' --psm 6'
full_text, page_length = '', kw['doc'].pageCount
if recursion and index >= 10:
return fn.get('most_correct') or fn.get(page_number_list[0])
mat = fitz.Matrix(zoom, zoom) # increase resolution
for page_no in page_number_list:
page = kw['doc'].loadPage(page_no) # number of page
pix = page.getPixmap(matrix=mat)
with Image.open(io.BytesIO(pix.getImageData())) as img:
text_of_each_page = str(pytesseract.image_to_string(img, config='%s' % zoom_config)).strip()
fn[page_no] = text_of_each_page
full_text = '\n'.join((full_text, text_of_each_page, '\n'))
_logger.critical(f"full text in load immage {full_text}")
args = (full_text, page_number_list)
load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
if recursion and load:
return self.load_image
return full_text
The issue: My pdf is having dates like 1/13, 1/7 the library is reading them as 143, 1n and in some places, it is reading 17 as 1). Also after the text, it is also giving some symbols like { & . , = randomly whereas in pdf these things are not even there.
For accuracy
1. I tried converting the image to .tiff format but it didn't work for me.
2. Tried adjusting the resolution of the image.

You can use pdftoppm tool for converting you images really fast as it provides you to use multi-threading feature by just passing thread_count=(no of threads).
You can refer to this link for more info on this tool. Also better images can increase the accuracy of tesseract.

Related

How to modify TIF file's EXIF data

I am trying to modify existing metadata within python 3. More specifically I have GPS coordinates and altitude in a my metadata, and I need to modify it.
I'm using piexif mudule, and I ancounter two problems.
First, I managed to change Altitude, using
exif_dict['GPS'][piexif.GPSIFD.GPSAltitude] = (140, 1)
and it works.
But I can't understand how to change Latitude and Longtitude? as they consist of three fields, like ((53, 1), (291191, 10000), (0, 1)).
The second problem occurs when I try to save tiff file with modified metadata. If I save it as TIFF file:
img.save(fname_2, 'tiff', exif=exif_bytes),
the fname_2 file is created, but it's metadata isn't changed. If Isave as JPEG -
img.save(fname_2, 'jpeg', exif=exif_bytes)
- metadata changes, but the file is compressed from 289 MB to 15 MB, that makes it impossible to use it for my purposes.
Has anyone managed to do this? It sounds like it would be very simple, but I can't seem to work it out.
import piexif
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000
fname_1='D:\EZG\Codding\photo\iiq/eee.tif'
fname_2='D:\EZG\Codding\photo\iiq/eee_change.tif'
img = Image.open(fname_1)
exif_dict = piexif.load(fname_1)
latitide = exif_dict['GPS'][piexif.GPSIFD.GPSLatitude]
longtitude = exif_dict['GPS'][piexif.GPSIFD.GPSLongitude]
altitude = exif_dict['GPS'][piexif.GPSIFD.GPSAltitude]
print(latitide)
print(longtitude)
print(altitude)
exif_dict['GPS'][piexif.GPSIFD.GPSAltitude] = (140, 1)
exif_bytes = piexif.dump(exif_dict)
img.save(fname_2, 'tiff', exif=exif_bytes)
the fname_2 file is created, but it's metadata isn't changed
Based on other questions and answers on SO it seems that the values are encoded as fractions:
((53, 1), (291191, 10000), (0, 1))
is 53 degrees 291191/10000 = 29.1191 minutes North (0 == N; 1 == S)
You may also want to check this answer, as there is a better package to edit GPS coordinates in photo metadata.

How to download a sentinel images from google earth engine using python API in tfrecord

While trying to download sentinel image for a specific location, the tif file is generated by default in drive but its not readable by openCV or PIL.Image().Below is the code for the same. If I use the file format as tfrecord. There are no Images downloaded in the drive.
starting_time = '2018-12-15'
delta = 15
L = -96.98
B = 28.78
R = -97.02
T = 28.74
cordinates = [L,B,R,T]
my_scale = 30
fname = 'sinton_texas_30'
llx = cordinates[0]
lly = cordinates[1]
urx = cordinates[2]
ury = cordinates[3]
geometry = [[llx,lly], [llx,ury], [urx,ury], [urx,lly]]
tstart = datetime.datetime.strptime(starting_time, '%Y-%m-%d') tend =
tstart+datetime.timedelta(days=delta)
collSent = ee.ImageCollection('COPERNICUS/S2').filterDate(str(tstart).split('')[0], str(tend).split(' ')[0]).filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20)).map(mask2clouds)
medianSent = ee.Image(collSent.reduce(ee.Reducer.median())) cropLand = ee.ImageCollection('USDA/NASS/CDL').filterDate('2017-01-01','2017-12-31').first()
task_config = {
'scale': my_scale,
'region': geometry,
'fileFormat':'TFRecord'
}
f1 = medianSent.select(['B1_median','B2_median','B3_median'])
taskSent = ee.batch.Export.image(f1,fname+"_Sent",task_config)
taskSent.start()
I expect the output to be readable in python so I can covert into numpy. In case of file format 'tfrecord', I expect the file to be downloaded in my drive.
I think you should think about the following things:
File format
If you want to open your file with PIL or OpenCV, and not with TensorFlow, you would rather use GeoTIFF. Try with this format and see if things are improved.
Saving to drive
Normally saving to your Drive is the default behavior. However, you can try to force writing to your drive:
ee.batch.Export.image.toDrive(image=f1, ...)
You can further try to setup a folder, where the images should be sent to:
ee.batch.Export.image.toDrive(image=f1, folder='foo', ...)
In addition, the Export data help page and this tutorial are good starting points for further research.

Python3 decode removes white spaces when should be kept

I'm reading a binary file that has a code on STM32. I placed deliberate 2 const strings in the code, that allows me to read SW version and description from a given file.
When you open a binary file with hex editor or even in python3, you can see correct form. But when run text = data.decode('utf-8', errors='ignore'), it removes a zeros from the file! I don't want this, as I keep EOL characters to properly split and extract string that interest me.
(preview of the end of the data variable)
Svc\x00..\Src\adc.c\x00..\Src\can.c\x00defaultTask\x00Task_CANbus_receive\x00Task_LED_Controller\x00Task_LED1_TX\x00Task_LED2_RX\x00Task_PWM_Controller\x00**SW_VER:GN_1.01\x00\x00\x00\x00\x00\x00MODULE_DESC:generic_module\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00**Task_SuperVisor_Controller\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x06\x07\x08\t\x00\x00\x00\x00\x01\x02\x03\x04..\Src\tim.c\x005!\x00\x08\x11!\x00\x08\x01\x00\x00\x00\xaa\xaa\xaa\xaa\x01\x01\nd\x00\x02\x04\nd\x00\x00\x00\x00\xa2J\x04'
(preview of text, i.e. what I receive after decode)
r g # IDLE TmrQ Tmr Svc ..\Src\adc.c ..\Src\can.c
defaultTask Task_CANbus_receive Task_LED_Controller Task_LED1_TX
Task_LED2_RX Task_PWM_Controller SW_VER:GN_1.01
MODULE_DESC:generic_module
Task_SuperVisor_Controller ..\Src\tim.c 5! !
d d J
with open(path_to_file, "rb") as binary_file:
# Read the whole file at once
data = binary_file.read()
text = data.decode('utf-8', errors='ignore')
# get index of "SW_VER:" sting in the file
sw_ver_index = text.rfind("SW_VER:")
# SW_VER found
if sw_ver_index is not -1:
# retrive the value, e.g. "SW_VER:WB_2.01" will has to start from position 7 and finish at 14
sw_ver_value = text[sw_ver_index + 7:sw_ver_index + 14]
module.append(tuple(('DESC:', sw_ver_value)))
else:
# SW_VER not found
module.append(tuple(('DESC:', 'N/A')))
# get index of "MODULE_DESC::" sting in the file
module_desc_index = text.rfind("MODULE_DESC:")
# MODULE_DESC found
if module_desc_index is not -1:
module_desc_substring = text[module_desc_index + 12:]
module_desc_value = module_desc_substring.split()
module.append(tuple(('DESC:', module_desc_value[0])))
print(module_desc_value[0])
As you can see my white characters are gone, while they should be present

How to save data in excel file MATLAB?

I want to save pitch,yaw and roll data in excel file for all frames. Eg: if i have 200 frames then i want to save 200 frames information in excel file. I have tried but my code only stores one frame data.exceldata
fitting_model='models/Chehra_f1.0.mat';
load(fitting_model);
mov=VideoReader('7_a.avi'); %Read video file and create an object
c=mov.NumberOfFrames;
for k=1:c
a = read(mov, k);
img=im2double(a);
disp(['Detecting Face in ']);
faceDetector = vision.CascadeObjectDetector(); % detect face in an image
bbox = step(faceDetector, img); %create boundary box around face
test_init_shape = InitShape(bbox,refShape); %initialize facial points in variable
test_init_shape = reshape(test_init_shape,49,2);
if size(img,3) == 3
test_input_image = im2double(rgb2gray(img));
else
test_input_image = im2double((img));
end
disp(['Fitting']);
MaxIter=6;
test_points = Fitting(test_input_image,test_init_shape,RegMat,MaxIter);
load('3D_Shape_Model.mat');
n=49;
test_image=img;
imshow(test_image);hold on;
% % Compute 3D Head Pose
if(n==49)
test_shape=test_points;
[pitch,yaw,roll] = ComputePose(PDM_49,test_shape(:));
filename='framesdata.xlsx';
header = {'Pitch', 'yaw ','roll'};
new_data = num2cell([pitch(:), yaw(:), roll(:)]);
output = [header; new_data];
xlswrite(filename,output);
end
plot(test_shape(:,1),test_shape(:,2),'b*');
title([num2str(i),' : Pitch = ',num2str(pitch),' ; Yaw = ',num2str(yaw),' ; Roll = ',num2str(roll)]);
set(gcf,'units','normalized','outerposition',[0 0 1 1]);
pause(0.5);
close all;
end
As #excaza stated, you will need to move the xlswrite command out of your loop or specify the cells you are writing. Please see the xlswrite Doc for more information. The correct syntax would be :
xlswrite(filename,A,xlRange)
The following is the example they provide:
filename = 'testdata.xlsx';
A = {'Time','Temperature'; 12,98; 13,99; 14,97};
sheet = 2;
xlRange = 'E1';
xlswrite(filename,A,sheet,xlRange)
You will just need to provide xlswrite the address to start writing data.

How to find data between two values

I am trying to find the data between two values. I am using this code in a GUI programme, the starting_value and ending_value which you can see in the code below are selected from 2 listboxes in a previous part of the code.
% --- Executes on button press in CalculateIntensity.
function CalculateIntensity_Callback(hObject, eventdata, handles)
% hObject handle to CalculateIntensity (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)
% Trapz function
starting_value = getappdata(0,'StartValue');
ending_value = getappdata(0,'EndValue');
StartingValue = str2mat(starting_value)
EndingValue = str2mat(ending_value)
A = getappdata(0,'XYarray')
data_found = A(A(:,[1,2]) > StartingValue & A(:,[1,2]) < EndingValue)
I found help on:
http://www.mathworks.com/matlabcentral/answers/8556-how-to-find-vector-elements-between-two-values-efficiently
However the
data_found = A(A(:,[1,2]) > StartingValue & A(:,[1,2]) < EndingValue)
part of the code wont work for me, I think starting_value and ending_value are strings so I tried converting it to a matrix but I get the error:
Error using <
Matrix dimensions must agree.
Error in MichelleLaycockGUImainwindow>CalculateIntensity_Callback (line 119)
data_found = A(A(:,[1,2]) > StartingValue & A(:,[1,2]) < EndingValue)
an example of data used is:
A =
1.0e+03 *
0.1660 1.1570
0.1664 0.4650
0.1668 0
0.1672 1.0200
0.1676 1.0110
0.1680 1.0200
0.1684 1.0640
0.1688 1.1100
0.1692 1.0370
0.1696 1.0050
0.1700 1.0750
0.1704 1.0850
0.1708 1.1310
0.1712 1.0630
0.1716 1.0370
0.1719 1.1070
0.1724 1.1450
I'm not really sure where I'm going wrong, any help would be greatly appreciated as it's all I need to complete my work. Thanks in advance!
As some of the values in my data vary rather than just decreasing or increasing the greater than or equal to method I was originally trying to use did not work. So rather than using the loop to get the data between the two points selected I went with a different method.
I use[~,indx1]=ismember(StartingValue,A,'rows') this finds the row number of the selected data and then I use this information to extract the data between and including the selected data.
Here is the full code I used to do this:
starting_value = getappdata(0,'StartValue');
ending_value = getappdata(0,'EndValue');
StartingValue = str2num(starting_value);
EndingValue = str2num(ending_value);
A = getappdata(0,'XYarray');
[~,indx1]=ismember(StartingValue,A,'rows');
[~,indx2]=ismember(EndingValue,A,'rows');
arrayfortrapz = A(indx1:indx2,1:2); %array of data including and between selected data
I hope this of some help to anybody that may run into a similar issue.

Resources