How to get confidence of each line using pytesseract - python-3.x

I have successfully set up Tesseract and can translate images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use that; I've spent hours attempting to set it up with no luck, so I need a way to do this using pytesseract.

After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
So what I did was save it as a dataframe and then use pandas to group by block_num, since OCR groups each line into a block. I also removed all rows with no confidence value (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
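For reference, here is how the pieces above fit together end to end; a minimal sketch, where the file name receipt.png is just a placeholder:
import pytesseract
from PIL import Image

# Word-level OCR results as a pandas DataFrame
data = pytesseract.image_to_data(Image.open('receipt.png'), output_type='data.frame')

# Drop rows that carry no confidence value (-1)
data = data[data.conf != -1]

# Words per block and the mean word confidence per block
lines = data.groupby('block_num')['text'].apply(list)
conf = data.groupby('block_num')['conf'].mean()

for block_num, words in lines.items():
    print(round(conf[block_num], 1), ' '.join(str(w) for w in words))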

@Srikar Appalaraju is right. Take the following example image:
Now use the following code:
text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()
Notice that all five rows have the same block_num, so if we group by that column alone, all 5 words (texts) will be grouped together. But that's not what we want: we only want to group the first 3 words, which belong to the first line. To do that properly (in a generic manner) for a large enough image, we need to group by all 4 columns page_num, block_num, par_num and line_num simultaneously in order to compute the confidence for the first line, as shown in the following code snippet:
lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
            .apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()

line_conf = []
for i in range(len(lines)):
    if lines[i].strip():
        line_conf.append((lines[i], round(confs[i], 3)))
with the following desired output:
[('Ying Thai Kitchen', 91.667),
('2220 Queen Anne AVE N', 88.2),
('Seattle WA 98109', 90.333),
('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
('‘uw .yingthaikitchen.com', 40.0),
('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
('Order#:17 Table 2', 94.0),
('Date: 7/4/2013 7:28 PM', 86.25),
('Server: Jack (1.4)', 83.0),
('44 Ginger Lover $9.50', 89.0),
('[Pork] [24#]', 43.0),
('Brown Rice $2.00', 95.333),
('Total 2 iten(s) $11.50', 89.5),
('Sales Tax $1.09', 95.667),
('Grand Total $12.59', 95.0),
('Tip Guide', 95.0),
('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
('Thank you very much,', 90.75),
('Cone back again', 92.667)]
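The same (line, confidence) pairs can also be built in a single groupby pass using pandas named aggregation (pandas 0.25 or later); a sketch reusing the filtered text DataFrame from above:
summary = (
    text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])
        .agg(line=('text', lambda words: ' '.join(map(str, words))),
             conf=('conf', 'mean'))
        .reset_index()
)

line_conf = [(row.line, round(row.conf, 3))
             for row in summary.itertuples()
             if row.line.strip()]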

The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is
text.groupby(['block_num','par_num','line_num'])['text'].apply(list)
We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?
Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: word number of the detected text or item
All four columns above are interconnected. If an item starts a new line, its word number starts counting again from 0 rather than continuing from the previous line's last word number. The same goes for line_num, par_num and block_num.
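To check how much this matters for a given image, you can count how many distinct (par_num, line_num) pairs each block contains; if any block holds more than one, grouping by block_num alone would merge those lines. A small sketch on the filtered text DataFrame from above:
# Number of physical lines (distinct par_num/line_num pairs) per block
lines_per_block = text.groupby('block_num').apply(
    lambda g: len(g[['par_num', 'line_num']].drop_duplicates()))

# Blocks that contain more than one line would be merged by a block_num-only grouping
print(lines_per_block[lines_per_block > 1])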

Related

Does anyone know how to pull averages from a text file for several different people?

I have a text file (Player_hits.text) that I am trying to pull player batting averages from. Similar to lines 179-189 I want to find an average. However, I do not want to find the average for the entire team. Instead, I want to find the average of every individual player on the team.
For instance, the text file is set up as such:
Player_hits.txt
In this file a 1 defines a hit and a 0 means the player did not get a hit. I am trying to pull an individual average for both players. (Alex = 0.500, Riley = 0.666)
If someone could help, that would be greatly appreciated!
Thanks!
Link to original code on repl.it: Baseball Stat-Tracking
JSONDecodeError Image
The json.decoder.JSONDecodeError is raised because json.loads() does not accept each line (e.g. "[1, 'Riley']\n") as valid JSON, since single quotes are not valid JSON string delimiters. You can use ast to evaluate each line as a Python literal instead, storing it as a list element such as [1, 'Riley'] in your list p_hits.
Then the second part is that you can convert the list to a dataframe and group by the 'name' column. So jim has the right idea, but there are errors in that too (i.e. colmuns should be columns, and the items in the list need to be strings, ['hit', 'name'], not undeclared variables).
import pandas as pd
import ast

p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = ast.literal_eval(line)
        p_hits.append(l)

df = pd.DataFrame(p_hits, columns=['hit', 'name'])
Output (with an example dataset I made):
print(df.groupby(['name']).mean())
            hit
name
Matt   0.714286
Riley  0.285714
Todd   0.500000
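As a side note, to see concretely why ast.literal_eval succeeds where json.loads fails here, a quick check on a sample line (reconstructed from the error shown in the question):
import ast
import json

sample = "[1, 'Riley']\n"   # one line of Player_hits.txt, as best as can be read from the error

print(ast.literal_eval(sample))   # [1, 'Riley']

try:
    json.loads(sample)
except json.JSONDecodeError as exc:
    print(exc)   # fails: single quotes are not valid JSON string delimiters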
import pandas as pd
import json

p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = json.loads(line)
        p_hits.append(l)

df = pd.DataFrame.from_records(p_hits, colmuns=[hit, name])
df.groupby(['name']).mean()

Append specific line in text file as item 1 of value list in dictionary and subsequent lines as 1 string to item 2 of the same list

I have a text file to import into a dictionary, but I am having trouble getting the program to identify the correct lines as item 1 and item 2 of the list in the dictionary.
The format of the text file is like this (there are no empty lines between the lines of a record; there is only a line break at the end of each record):
ProductA
2020-08-03 16:26:21
This painting was done by XNB.
The artist seeks to portray the tragedies caused by event XYZ.
The painting weighs 2kg.
####blank line#####
ProductB
2020-08-03 16:26:21
This painting is done by ONN.
It was stolen during world war 2.
Decades later, it was discovered in the black market of country XYZ.
It was bought for 2 million dollars by ABC.
###blank line###
Desired outcome in dictionary:
{ 'ProductA' : ['2020-08-03 16:26:21', 'This painting was done by XNB.The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB':['2020-08-03 16:26:21','This painting is done by ONN.It was stolen during world war 2.Decades later, it was discovered in the black market of country XYZ.It was bought for 2 million dollars by ABC.']}
where item_2 is a single string that is combined from line 3 onwards till the end of the information where it meets a blank line.
Problem: I don't know how to code the logic in such a way that the program will be able to properly assign everything where I want it.
header = ""
header = True
for line in records:
    data = line.splitlines()
    if line != '\n':  # check for line break which indicates a new record
        if Header:
            # code which will assign the 1st line of each record as a key to the dictionary
        else:
            # This is where I need help.
            # Code which will assign the 2nd line as item_1 and then assign the 3rd line onwards, till the end of the record, as item_2 in a single string.
            # item_2 may have a different number of lines being combined into 1 string for each record.
            # I tried to form a rough idea of how the logic might look in the code below, but I feel that something is missing and I got a bit confused.
            for line in list:  # results in TypeError, 'type' object is not iterable.
                dict[line[1]] = dict[header].append(line[1].strip("\n"))
                # Since the outer if has already done its job of identifying the 1st line of the record, this line seeks to assign the next line (line 2 in the text file), which I think the program would interpret as line[1], to item 2.
                dict[line[2:]] = dict[header].append(line[2:].strip("\n"))
                # Assign the 3rd line of the text file onwards as a single string, which is item_2 in the list of values for the dictionary.
    else:
        # code which resets the boolean for header
Try this:
with open('data.txt') as fp:
    data = fp.read().split('\n\n')

res = {}
for x in data:
    k, v = x.strip().split('\n', 1)
    v = v.split('\n')
    res[k] = [v[0], ' '.join(v[1:])]

print(res)
Output:
{'ProductA': ['2020-08-03 16:26:21', 'This painting was done by XNB. The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB': ['2020-08-03 16:26:21', 'This painting is done by ONN. It was stolen during world war 2. Decades later, it was discovered in the black market of country XYZ. It was bought for 2 million dollars by ABC.']}
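If you would rather keep the line-by-line loop structure from the question, here is a sketch of the same logic (reading the same data.txt):
res = {}
key = None
with open('data.txt') as records:
    for raw in records:
        line = raw.strip()
        if not line:            # blank line: the current record is finished
            key = None
        elif key is None:       # first line of a record becomes the dictionary key
            key = line
            res[key] = []
        elif not res[key]:      # second line is the timestamp (item_1)
            res[key] = [line, '']
        else:                   # third line onwards: build item_2 as one string
            res[key][1] = (res[key][1] + ' ' + line).strip()

print(res)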

Identify numbers, in a large data string, that are prefixed to an alphabet upto 2 positions in between other characters

I have a string containing thousands of lines of this data without line breaks (only a few lines are shown here, with line breaks added for readability):
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
Format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces or brackets.
Problem: The data I want to collect is the age.
However, I can't seem to find a way to single out the age, since it is surrounded by a lot of other numbers in the data. I have tried various iterations of count, limiting it to 1 to 99, separating the data, etc., and failed.
My idea: Since the gender is always either 'M' or 'F', and the two digits before the gender are the age, isolating the two digits before the gender seems like an ideal solution.
xxM
xxF
My goal: I would like to collect all the xx numbers, irrespective of gender, and store them in a list. How do I go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
ages = [found[-3:-1] for found in re.findall('[0-9]+[M,F]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample, but if a district name starts with 'M' or 'F', the entry number will be collected as well.
A workaround is to match exactly seven digits (if the patient number is always 5 digits and the age is generally 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[M,F]', input_str, re.I)]
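An equivalent sketch that captures the two age digits directly, under the same assumption of a 5-digit patient number:
import re

input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'

# \d{5} consumes the assumed 5-digit patient number, (\d{2}) captures the age,
# and [MF] anchors the match on the gender letter
ages = re.findall(r'\d{5}(\d{2})[MF]', input_str)
print(ages)  # ['62', '65']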
With the structure you gave, I've built a dict of regular expressions to match the components, then put the results back into a dict.
There are ways I can imagine this will not work:
if age < 10 it is only 1 digit, so you will pick up a digit of the patient number
there may be strings that don't match the regular expressions, which will give odd results
It's the most structured way I can think to go...
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
md = {
    "entrynum": "([0-9]+)",
    "district": "([A-Z,a-z]+)",
    "patnum_age": "([0-9]+)",
    "sex": "([M,F])",
    "remainder": "(.*)$"
}

data_dict = {list(md.keys())[i]: tk
             for i, tk in
             enumerate([tk for tk in re.split("".join(md.values()), data) if tk != ""])}

print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}

How can I create a dictionary for a large amount to text and list the most frequent word?

I am new to coding and I am trying to create a dictionary from a large body of text, and I would also like the most frequent word to be shown.
For example, if I had a block of text such as:
text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''
I know the steps I would like the code to take. I want words that differ only in capitalisation to be counted together, so Hi and hi would count as Hi = 2.
I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.
I don't know how to approach such a large amount of text, the examples I have seen are for a much smaller amount of words.
I have tried to remove white space and also create a loop but I am stuck and unsure if I am going the right way about coding this problem.
a.replace(" ", "")
# this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, I have no idea what this means!
print(a.replace)  # this is what I tried to write to remove white spaces
I am unsure of how to create the dictionary.
To count the word frequency would I do something like:
frequency = {}
for value in my_dict.values():
    if value in frequency:
        frequency[value] = frequency[value] + 1
    else:
        frequency[value] = 1
What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.
Then I wanted to have the code show the word that occurs the most.
This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.
text = "..." # text here.
frequency = {}
for word in text.split(" "):
if word not in frequency.keys():
frequency[word] = 1
else:
frequency[word] += 1
print(frequency)
This only splits the text up at each ' ' and counts the number of each occurrence.
If you want to get only the words, you may have to remove the ',' and other characters which you do not wish to have in your dictionary.
To remove characters such as ',', do:
text = text.replace(",", "")
Hope this helps and happy coding.
First, to remove all non-alphabet characters aside from ', we can use a regex.
After that, we go through the list of words and use a dictionary:
import re

d = {}
text = text.split(" ")  # turns it into a list
text = [re.findall("[a-zA-Z']", text[i]) for i in range(len(text))]
# each word is split into characters, with non-alphabet/non-apostrophe characters removed
text = ["".join(text[i]) for i in range(len(text))]
# puts each word back together
# there may be a better way to do the two steps above. If so, please tell.
for word in text:
    if word in d.keys():
        d[word] += 1
    else:
        d[word] = 1
d.pop("")
# not sure why, but when testing I got one key ""
You can use regex and Counter from collections :
import re
from collections import Counter
text = "This cat is not a cat, even if it looks like a cat"
# Extract words with regex, ignoring symbols and space
words = re.compile(r"\b\w+\b").findall(text.lower())
count = Counter(words)
# {'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1}
# To get the most frequent
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
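Counter can also report the most frequent entries directly with most_common(), which returns (word, count) pairs sorted by frequency:
print(count.most_common(1))  # [('cat', 3)]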

Extracting the last digit of a number

Suppose this is the data:
2011/03/06,17:24:17.100,EUR/USD,1.40200,3000000
I want to extract the last digit of the price, even if it is zero. Here is the code:
library(data.table)
library(stringr)
types<-list(date="TEXT",time="TEXT",curr="TEXT",price="TEXT",volume="INTEGER")
mydata=read.table("data.out", sep="," , header=FALSE)
setnames(mydata, names(types))
str_sub(mydata$price,start=-1)
This is the result: "2". It ignores the zeros. I read the price as text, so I should get "0".
EDIT, Thanks to jlhoward:
mydata<-read.table(...) is converting the price and volume columns to numeric automatically. I used mydata<- read.table(...,colClasses="character"), problem solved.
You haven't stated where your data comes from. If you use read.csv on a file, you'll have the items separated into columns immediately. You can even enter them from the keyboard, e.g.:
read.csv(stdin(),header=FALSE)
0: 300,hello,40.2
1:
# V1 V2 V3
# 1 300 hello 40.2
Then you could do foo <- dat$V3 %% 10, as in my comment above, to get the last digit.
Here is the method I talked about in the comment.
line = "2011/03/06,17:24:17.100,EUR/USD,1.40200,3000000"
sp = line.split(",")
sp
['2011/03/06', '17:24:17.100', 'EUR/USD', '1.40200', '3000000']
price = sp[-2]
price
'1.40200'
last = price[-1]
last
'0'
