Extract medical marker names, values and units from analysed image? - python-3.x

I am using Amazon Textract to analyse anonymized blood tests.
Each report consists of markers, their values, units, and reference intervals.
I want to extract them into a dictionary like this:
{"globulin": [2.8, "g/dL", [1.0, 4.0]], "cholesterol": [161, "mg/dL", [120, 240]], .... }
Here is an example of such OCR-produced text:
Name:
Date Perfermed
$/6/2010
DOBESevState:
Date Collected:
05/03/201004.00 PN
Date Lac Meat: 05/03/2010 10.45 A
Eraminer:
PTM
Date Received: $/7/2010 12:13.11A
Tukit No.
8028522035
Abeormal
Normal
Range
CARDLAC RISK
CHOLESTEROL
161.00
120.00 240.00 mg/dL
CHOLESTEROLHDL RATIO
2.39
1.250 5.00
HIGH DENSITY LIPOPROTEINCHDL)
67.30
35.00 75.00 me/dL
LOW DENSITY LIPOPROTEIN (LDL)
78.70
60.00 a 190.00 midI.
TRIGLYCERIDES
75.00
10.00 a 200.00 made
CHEMISTRIES
ALBUMIN
4.40
3.50 5.50 pidl
ALKALINE PHOSPHATASE
49.00
30.00 120.00 UAL
BLOOD UREA NITROGEN (BUN)
17.00
6.00 2500 meidL
CREATININE
0,85
060 1.50 matdL
FRUCTOSAMINE
182
1.20 1.79 mmoV/l
GAMMA GLUTAMYUTRANSFERASE
9.00
2.00 65.00 UIL
GLOBULIN
2.80
1.00 4.00 gidL.
GLUCOSE
61.00
70.00 125.00 me/dl.
HEMOGLOBIN AIC
5.10
3.00 6.00 %
SGOT (AST)
25.00
0.00 41.00 UM
SOPI (ALT)
22.00
0.00 45.00 IMI
TOTAL BILIRUBIN
0.52
0.10 1.20 mmeldi.
TOTAL PROTEIN
720
6.00 8.50 gidl.
1. This sample lab report shows both normal and abnormal results. as well as
acceptable reference ranges for each testing category.
Please advise on the best way to extract this information. I have tried Amazon Comprehend Medical; it does the job, but not for all images.
I also tried spaCy-based NER:
https://github.com/NLPatVCU/medaCy
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

This might not be a good application of NLP, as the text isn't any sort of natural language. Rather, it is structured data that can be extracted using rules, and writing rules is definitely one way to go about this.
You can first do a fuzzy match of the category headers in the OCR results, namely "CARDIAC RISK" and "CHEMISTRIES", to partition the text into its respective categories.
If you are sure that each entry takes exactly 3 lines (name, value, range/unit), you can then split each category on newlines and extract the fields from there.
Here's some sample code I ran on the data you provided. It requires the fuzzyset package, which you can get by running python3 -m pip install fuzzyset. Since some entries don't have units, I modified your desired output format slightly and made units a list so it can easily be empty; it also keeps any stray letters found in the third line (OCR artifacts such as the "a" in "60.00 a 190.00").
from fuzzyset import FuzzySet

### Load data
with open("ocr_result.txt") as f:
    data = f.read()
lines = data.split("\n")

### Create fuzzy set
CATEGORIES = ("CARDIAC RISK", "chemistries")
fs = FuzzySet(lines)

### Get the line ranges of each category
cat_ranges = [0] * (len(CATEGORIES) + 1)
for i, cat in enumerate(CATEGORIES):
    match = fs.get(cat)[0]
    match_idx = lines.index(match[1])
    cat_ranges[i] = match_idx
last_idx = lines.index(fs.get("sample lab report")[0][1])
cat_ranges[-1] = last_idx

### Read lines in each category
def _to_float(s: str) -> float:
    """Attempt to convert a string value to float."""
    try:
        f = float(s)
    except ValueError:
        if "," in s:
            # OCR sometimes reads the decimal point as a comma (e.g. "0,85")
            f = float(s.replace(",", "."))
        else:
            raise ValueError(f"Cannot convert {s} to float.")
    return f

result = {}
for i, cat in enumerate(CATEGORIES):
    result[cat] = {}
    # Ignore the line of the category itself
    s = slice(cat_ranges[i] + 1, cat_ranges[i + 1])
    lines_in_cat = lines[s]
    if len(lines_in_cat) % 3 != 0:
        raise ValueError("Something's wrong")
    for j in range(0, len(lines_in_cat), 3):
        _name = lines_in_cat[j]
        # Convert value to float
        _value = _to_float(lines_in_cat[j + 1])
        # Process line 3 to get range and unit
        _range = []
        _unit = []
        for v in lines_in_cat[j + 2].split(" "):
            # the first two numeric tokens are the reference range;
            # everything else is treated as (possibly garbled) unit text
            if v[0].isdigit() and len(_range) < 2:
                _range.append(_to_float(v))
            else:
                _unit.append(v)
        result[cat][_name] = [_value, _unit, _range]

print(result)
Output:
{'CARDIAC RISK': {'CHOLESTEROL': [161.0, ['mg/dL'], [120.0, 240.0]], 'CHOLESTEROLHDL RATIO': [2.39, [], [1.25, 5.0]], 'HIGH DENSITY LIPOPROTEINCHDL)': [67.3, ['me/dL'], [35.0, 75.0]], 'LOW DENSITY LIPOPROTEIN (LDL)': [78.7, ['a', 'midI.'], [60.0, 190.0]], 'TRIGLYCERIDES': [75.0, ['a', 'made'], [10.0, 200.0]]}, 'chemistries': {'ALBUMIN': [4.4, ['pidl'], [3.5, 5.5]], 'ALKALINE PHOSPHATASE': [49.0, ['UAL'], [30.0, 120.0]], 'BLOOD UREA NITROGEN (BUN)': [17.0, ['meidL'], [6.0, 2500.0]], 'CREATININE': [0.85, ['matdL'], [60.0, 1.5]], 'FRUCTOSAMINE': [182.0, ['mmoV/l'], [1.2, 1.79]], 'GAMMA GLUTAMYUTRANSFERASE': [9.0, ['UIL'], [2.0, 65.0]], 'GLOBULIN': [2.8, ['gidL.'], [1.0, 4.0]], 'GLUCOSE': [61.0, ['me/dl.'], [70.0, 125.0]], 'HEMOGLOBIN AIC': [5.1, ['%'], [3.0, 6.0]], 'SGOT (AST)': [25.0, ['UM'], [0.0, 41.0]], 'SOPI (ALT)': [22.0, ['IMI'], [0.0, 45.0]], 'TOTAL BILIRUBIN': [0.52, ['mmeldi.'], [0.1, 1.2]], 'TOTAL PROTEIN': [720.0, ['gidl.'], [6.0, 8.5]]}}
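The units in that output are still OCR-garbled ('gidL.', 'me/dL', ...). If you need them normalized, one option (not part of the code above, just a sketch) is to fuzzy-match each extracted unit string against a hand-maintained whitelist of units you expect on these reports:
from fuzzyset import FuzzySet

# Hypothetical whitelist -- extend it with whatever units your reports use.
KNOWN_UNITS = FuzzySet(["mg/dL", "g/dL", "U/L", "mmol/L", "%"])

def normalize_unit(raw: str) -> str:
    """Map an OCR-garbled unit string to the closest known unit, if any."""
    match = KNOWN_UNITS.get(raw)
    # fs.get() returns None when nothing clears the similarity cutoff
    return match[0][1] if match else raw

print(normalize_unit("gidL."))  # likely 'g/dL'; inspect the score before trusting it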

Related

Append lines of data to a Pandas Dataframe that is not associated with the existing dataframe

How to go about adding lines of text to an existing Pandas Dataframe?
I have saved a pandas dataframe via this command:
predictionsdf = pd.DataFrame(predictions, columns=['File_name', 'Actual_class', 'pred_class', 'Boom'])
The saved data looks like this:
I wanted to add lines of text like this:
Total # of Boom detection = 1 from 100 files
Percentage of Boom detection from plastic bag pop = 1.0 %
Time: 0.43 mins
at the bottom of the dataframe.
Can you tell me how to go about appending these lines to the bottom of the dataframe?
Thanks!
I'm not sure I understand what you are trying to do here, but with the following toy dataframe and lines of text:
import pandas as pd

pd.set_option("max_colwidth", 100)

df = pd.DataFrame(
    {
        "File_name": [
            "15FT_LabCurtain_S9_pt5GAL_TCL.wav",
            "15FT_LabCurtain_S9_pt6GAL_TCL.wav",
            "15FT_LabCurtain_S9_pt7GAL_TCL.wav",
        ],
        "Actual_class": ["plastic_bag", "plastic_bag", "plastic_bag"],
        "pred_class": ["plastic_bag", "plastic_bag", "plastic_bag"],
        "Boom": [0, 0, 1],
    }
)

lines = (
    "Total # of Boom detection = 1 from 100 files",
    "Percentage of Boom detection from plastic bag pop = 1.0 %",
    "Time: 0.43 mins",
)
You could try this:
for line in lines:
    df.loc[df.shape[0] + 1, "File_name"] = line
df = df.fillna("")
print(df)
# Output
File_name Actual_class pred_class Boom
0 15FT_LabCurtain_S9_pt5GAL_TCL.wav plastic_bag plastic_bag 0.0
1 15FT_LabCurtain_S9_pt6GAL_TCL.wav plastic_bag plastic_bag 0.0
2 15FT_LabCurtain_S9_pt7GAL_TCL.wav plastic_bag plastic_bag 1.0
4 Total # of Boom detection = 1 from 100 files
5 Percentage of Boom detection from plastic bag pop = 1.0 %
6 Time: 0.43 mins
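Note that appending row-by-row via .loc introduces NaNs and converts the Boom column to float, as seen above. If that matters, a sketch of an alternative, reusing the toy df and lines defined above (before the loop), is to build the footer rows as their own frame and concatenate:
import pandas as pd

# Footer rows: text in "File_name", empty strings in the other columns.
footer = pd.DataFrame({"File_name": list(lines)}).reindex(
    columns=df.columns, fill_value=""
)
df = pd.concat([df, footer], ignore_index=True)
print(df)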

Set the extracted text in a column As a Single String in Pytesseract

So I extracted a string from an image with 3 columns.
The extracted text is:
SUBJECT GRADE FINALGRADE CREDITS
ADVANCED CALCULUS 1 1.54 A 3
I want to put a separator between these items so that it looks like this:
SUBJECT, GRADE, FINALGRADE, CREDITS
ADVANCED CALCULUS 1, 1.54, A, 3
We can achieve the solution in two steps:
Specify the starting keyword.
Split the line using space as the separator.
If we look at the example provided in the comment, we don't need any image preprocessing, since there are no artifacts in the image.
Assume I want to comma-separate the row starting with "state".
Specify the starting keyword:
start_word = line.split(" ")[0]
Split the line using space as the separator:
if start_word == "state":
    line = line.split(" ")
Now, for each word in the line, we can add a comma and a space:
for word in line:
    result += word + ", "
But we need to remove the last two characters, otherwise the result will end with "2000, ":
result = result[:-2]
print(result)
Result:
state, 1983, 1987, 1988, 1993, 1994, 1999, 2000
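(Equivalently, the loop plus the trailing-character trim can be replaced by a single join over the words of the current line:)
result = ", ".join(line.split())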
Code:
import cv2
import pytesseract
img = cv2.imread("15f8U.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(gry, 255,
cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY, 11, 2)
txt = pytesseract.image_to_string(gry)
txt = txt.split("\n")
result = ""
for line in txt:
start_word = line.split(" ")[0]
if start_word == "state":
line = line.split(" ")
for word in line:
result += word + ", "
result = result[:-2]
print(result)
continue
if line != '' or line != "":
print(line)
Result:
Table 1: WAGE SAMPLE STATISTICS, by year and state (1983-2000)
Logged mean wages
in year
state, 1983, 1987, 1988, 1993, 1994, 1999, 2000
Andhra Pradesh 5.17 5.49 5.53 6.28 6.24 5.77 5.80
Gujarat 9 6.04 5.92 6.64 6.58 6.09 6.04
Haryana 12 6.25 6.43 6.80 6.60 6.54 6.74
Manipur 54 6.31 6.73 7.15 7.09 6.90 6.83
Orissa 5.24 5.90 5.96 6.16 6.26 5.57 5.58
Tamil Nadu 5.19 5.67 5.68 6.31 633 6.02 5.97
Uttar Pradesh 5.55 6.06 3 6.61 2 6.00 6.07
Mizoram 6.43 5.44 6.03 681 6.76 8 7
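If splitting on spaces proves fragile (note the garbled numbers above, and row labels that themselves contain spaces, e.g. "Andhra Pradesh" or "Tamil Nadu"), pytesseract can also return per-word bounding boxes via image_to_data, so columns can be grouped by x-coordinate instead of guessed from spaces. A minimal sketch:
import cv2
import pytesseract
from pytesseract import Output

img = cv2.imread("15f8U.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

# Each recognized word comes with its left/top/width/height, so you can
# bucket words into columns by their x-position instead of splitting text.
for word, left, top in zip(data["text"], data["left"], data["top"]):
    if word.strip():
        print(left, top, word)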

Why do I get 'positional index out of bound' error only for a few inputs?

I am working on a project where I return search results for a particular topic from a dataset of scraped news articles. However, for a few of the inputs I get 'IndexError: positional index out of bounds', while for others the code works just fine. I even tried limiting the number of outputs and printing the indexes of the rows that are to be fetched, just to make sure that '.iloc' isn't given bad values, but the error still happens.
Data:
Code:
import re
from collections import OrderedDict
# `data` (the scraped-articles DataFrame) and `stop_words` are defined elsewhere in the notebook

def search_index(c):
    global input_str
    c = c.lower()
    c = re.sub("((\S+)?(http(S)?)(\S+))|((\S+)?(www)(\S+))|((\S+)?(\#)(\S+)?)", " ", c)
    c = re.sub('[^a-zA-Z0-9\n]', ' ', c)
    d = list(c.split())
    input_str = [word for word in d if word not in stop_words]
    print(input_str)
    e = OrderedDict()
    f = []
    temp = []
    for index, content in data.iterrows():
        count = 0
        points = 0
        for i in input_str:
            if i in (content['word_count']).keys():
                count += 1  # how many words from the input match the content
                points += content['word_count'][i]  # how often those words occur in the content corpus
        if len(input_str) <= 3:
            if count >= 1:
                e[index] = {'count': count, 'points': points}
        elif 3 < len(input_str) <= 5:
            if count >= 2:
                e[index] = {'count': count, 'points': points}
        elif len(input_str) > 5:
            if count >= 3:
                e[index] = {'count': count, 'points': points}
    # the lambda function sorts the dictionary first on 'count', then on 'points'
    for key, val in sorted(e.items(), key=lambda kv: (kv[1]['count'], kv[1]['points']), reverse=True):
        f.append(key)
    print('Total number of results: ', len(f))
    if len(f) > 50:
        temp = f[:20]
        print(temp)
        print('Top 20 results:\n')
        a = data.iloc[temp, [0, 1, 2, 3]].copy()
    else:
        a = data.iloc[f, [0, 1, 2, 3]].copy()
    print(a)

def user_ask():
    b = input('Enter the topic you''re interested in:')
    articles = search_index(b)
    print(articles)

user_ask()
Output: For this input I am getting the required output:
Enter the topic youre interested in:Joe Biden
['joe', 'biden']
Total number of results: 2342
[2337, 3314, 4164, 3736, 3750, 3763, 4246, 3386, 3392, 13369, 3006, 4401,
4089, 3787, 4198, 3236, 4432, 4097, 4179, 4413]
Top 20 results:
Link \
2467 https://abcnews.go.com/Politics/rep-max-rose-c...
3471 https://abcnews.go.com/International/dalai-lam...
4343 https://abcnews.go.com/US/georgia-legislation-...
3910 https://abcnews.go.com/Politics/temperatures-c...
3924 https://abcnews.go.com/Business/cheap-fuel-pul...
3937 https://abcnews.go.com/US/puerto-ricans-demand...
4425 https://abcnews.go.com/Politics/trump-biden-is...
3543 https://abcnews.go.com/Business/record-number-...
3549 https://abcnews.go.com/US/michigan-state-stude...
17774 https://abcnews.go.com/Politics/bernie-sanders...
3152 https://abcnews.go.com/Politics/note-gop-aids-...
4583 https://abcnews.go.com/Politics/polls-show-tig...
4268 https://abcnews.go.com/International/students-...
3962 https://abcnews.go.com/Politics/heels-arizona-...
4377 https://abcnews.go.com/Politics/north-carolina...
3388 https://abcnews.go.com/Lifestyle/guy-fieri-lau...
4614 https://abcnews.go.com/Politics/persistence-he...
4276 https://abcnews.go.com/Politics/congressional-...
4358 https://abcnews.go.com/US/nursing-home-connect...
4595 https://abcnews.go.com/US/hurricane-sally-upda...
Title \
2467 Rep. Max Rose calls on Trump to up COVID-19 ai...
3471 The Dalai Lama's simple advice to navigating C...
4343 Georgia lawmakers pass bill that gives court t...
3910 Temperatures and carbon dioxide are up, regula...
3924 Has cheap fuel pulled the plug on electric veh...
3937 Puerto Ricans demand state of emergency amid r...
4425 Trump vs. Biden on the issues: Foreign policy
3543 Record number of women CEOs on this year's For...
3549 All Michigan State students asked to quarantin...
17774 Bernie Sanders, Danny Glover Attend Game 7 of ...
3152 The Note: GOP aids Trump in programming around...
4583 Trump adviser predicts Sunbelt sweep, misleads...
4268 2 students allegedly caught up in Belarus crac...
3962 On heels of Arizona Senate primary, Republican...
4377 North Carolina to be a crucial battleground st...
3388 Guy Fieri has helped raise over $22M for resta...
4614 Little girls will have to wait 4 more years, W...
4276 Congressional Black Caucus to propose policing...
4358 Nursing home in Connecticut transferring all r...
4595 Sally slams Gulf Coast with life-threatening f...
Content Category
2467 New York Rep. Max Rose joined “The View” Monda... Politics
3471 As millions of people around the world continu... International
4343 They've done their time behind bars and been o... US
3910 Every week we'll bring you some of the climate... Politics
3924 Electric vehicles have always been a tough sel... Business
3937 As Puerto Rico struggles to recover from multi... US
4425 American foreign policy for over half a centur... Politics
3543 A record high number of female CEOs are at the... Business
3549 All local Michigan State University students h... US
17774 — -- Bernie Sanders capped Memorial Day off by... Politics
3152 The TAKE with Rick Klein\nPresident Donald Tru... Politics
4583 Facing polls showing a competitive race in as ... Politics
4268 A U.S. student studying at New York’s Columbia... International
3962 What's sure to be one of the most expensive an... Politics
4377 North Carolina, home of the upcoming business ... Politics
3388 Guy Fieri should add donations to his triple d... Lifestyle
4614 Four years ago, a major political party nomina... Politics
4276 The Congressional Black Caucus is at work on a... Politics
4358 All residents at a Connecticut nursing home ar... US
4595 Sally made landfall near Gulf Shores, Alabama,... US
None
For this input it is returning an error:
Enter the topic youre interested in:Joe
['joe']
Total number of results: 2246
[4246, 4594, 3763, 3736, 4448, 2337, 3431, 3610, 3636, 4089, 13369, 15363,
7269, 21077, 3299, 4372, 4413, 7053, 15256, 1305]
Top 20 results:
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-31-ff543d46b951> in <module>
----> 1 user_ask()
<ipython-input-27-31af284a01b4> in user_ask()
4 if int(a) == 0:
5 b = input('Enter the topic you''re interested in:')
----> 6 articles = search_index(b)
7 print(articles)
8
<ipython-input-25-4a5261a1e717> in search_index(c)
50 print(temp)
51 print('Top 20 results:\n')
---> 52 a = data.iloc[temp,[0,1,2,3]].copy()
53 else:
54 a = data.iloc[f,[0,1,2,3]].copy()
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1759 except (KeyError, IndexError, AttributeError):
1760 pass
-> 1761 return self._getitem_tuple(key)
1762 else:
1763 # we by definition only have the 0th axis
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
2064 def _getitem_tuple(self, tup: Tuple):
2065
-> 2066 self._has_valid_tuple(tup)
2067 try:
2068 return self._getitem_lowerdim(tup)
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
700 raise IndexingError("Too many indexers")
701 try:
--> 702 self._validate_key(k, i)
703 except ValueError:
704 raise ValueError(
c:\users\henis\appdata\local\programs\python\python37\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
2006 # check that the key does not exceed the maximum size of the index
2007 if len(arr) and (arr.max() >= len_axis or arr.min() < -len_axis):
-> 2008 raise IndexError("positional indexers are out-of-bounds")
2009 else:
2010 raise ValueError(f"Can only index by location with a
[{self._valid_types}]")
IndexError: positional indexers are out-of-bounds
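A likely cause, though I can't confirm it against your data: iterrows() yields index labels, while .iloc expects integer positions. If data's index is not a contiguous RangeIndex (e.g. rows were dropped during cleaning), a collected label such as 21077 can exceed len(data), and .iloc raises exactly this error; indexing with .loc (or calling data.reset_index(drop=True) up front) avoids it. A toy reproduction:
import pandas as pd

df = pd.DataFrame({"a": range(5)})
df = df.drop([1, 3])   # labels are now 0, 2, 4 but len(df) == 3
labels = [4]           # a label collected via iterrows()

print(df.loc[labels])  # fine: label-based lookup
try:
    print(df.iloc[labels])
except IndexError as e:
    print(e)           # positional indexers are out-of-bounds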

Missing value imputation in Python

I have two huge vectors, item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clusters_to_items such that clusters_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I ran the following code.
# beta           (1 x 1.3M) csr_matrix
# num_clusters = 1000
# item_clusters  (1 x 1.3M) numpy.array
# clust_to_items (1000 x 1.3M) csr_matrix

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # The interact_score is the required modified beta
    # This is used to do some work that is very fast
The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.
Edit: I believe I need a very different approach, one in which I can use item_clusters directly and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
import numpy as np

def fast_impute(num_clusters, item_clusters, beta):
    # get per-cluster item counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get per-cluster totals (zero scores contribute nothing to the sum)
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get per-cluster number of zero scores
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations: keep non-zero scores, fill zeros with the cluster mean
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
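The reason for np.add.at rather than plain fancy-indexed +=: with repeated indices, counts[item_clusters] += 1 is buffered and counts each duplicate index only once, whereas np.add.at accumulates unbuffered:
import numpy as np

counts = np.zeros(3)
idx = np.array([0, 0, 1])
counts[idx] += 1
print(counts)            # [1. 1. 0.] -- the duplicate index 0 was counted once

counts = np.zeros(3)
np.add.at(counts, idx, 1)
print(counts)            # [2. 1. 0.] -- duplicates accumulate correctly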
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first element of the row sums, so you're doing a couple of million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
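Concretely (a sketch, not the asker's code): for 1 x N row vectors, the row sum of the elementwise product is just a dot product, which scipy computes without materializing the product matrix:
import numpy as np
from scipy.sparse import random as sparse_random

# Toy 1 x N rows standing in for the question's 1 x 1.3M vectors.
beta = sparse_random(1, 1000, density=0.1, format="csr")
alpha = sparse_random(1, 1000, density=0.1, format="csr")

slow = beta.multiply(alpha).sum(1)[0, 0]  # builds the product, then sums it
fast = beta.dot(alpha.T)[0, 0]            # same number, straight dot product
assert np.isclose(slow, fast)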

Python File Writing Format Issue

The Code
def rainfallInInches():
    file_object = open('rainfalls.txt')
    list_of_cities = []
    list_of_rainfall_inches = []
    for line in file_object:
        cut_up_line = line.split()
        city = cut_up_line[0]
        rainfall_mm = int(line[len(line) - 3:])
        rainfall_inches = rainfall_mm / 25.4
        list_of_cities.append(city)
        list_of_rainfall_inches.append(rainfall_inches)
    inch_index = 0
    desired_file = open("rainfallInInches.txt", "w")
    for city in list_of_cities:
        desired_file.writelines(str((city, "{0:0.2f}".format(list_of_rainfall_inches[inch_index]))))
        inch_index += 1
    desired_file.close()
rainfalls.txt
Manchester 37
Portsmouth 9
London 5
Southampton 12
Leeds 20
Cardiff 42
Birmingham 34
Edinburgh 26
Newcastle 11
rainfallInInches.txt
This is the unwanted output
('Manchester', '1.46')('Portsmouth', '0.35')('London',
'0.20')('Southampton', '0.47')('Leeds', '0.79')('Cardiff',
'1.65')('Birmingham', '1.34')('Edinburgh', '1.02')('Newcastle',
'0.43')
My program takes the data from 'rainfalls.txt' which has rainfall information in mm and converts the mm to inches then writes this new information into a new file 'rainfallInInches.txt'.
I've gotten this far except I can't figure out how to format 'rainfallInInches.txt' to make it look like 'rainfalls.txt'.
Bear in mind that I am a student, which you probably gathered by my hacky code.
You could separate the parsing of the input file, the conversion from mm to inches, and the final formatting for writing:
#!/usr/bin/env python
# read input
rainfall_data = []  # city, rainfall pairs
with open('rainfalls.txt') as file:
    for line in file:
        if line.strip():  # non-blank
            city, rainfall = line.split()  # no comments in the input
            rainfall_data.append((city, float(rainfall)))

def mm_to_inches(mm):
    """Convert *mm* to inches."""
    return mm * 0.039370

# write output
with open('rainfallInInches.txt', 'w') as file:
    for city, rainfall_mm in rainfall_data:
        file.write("{city} {rainfall:.2f}\n".format(city=city,
                                                    rainfall=mm_to_inches(rainfall_mm)))
rainfallInInches.txt:
Manchester 1.46
Portsmouth 0.35
London 0.20
Southampton 0.47
Leeds 0.79
Cardiff 1.65
Birmingham 1.34
Edinburgh 1.02
Newcastle 0.43
If you feel confident that each step is correct in isolation then you could combine the steps:
#!/usr/bin/env python
def mm_to_inches(mm):
"""Convert *mm* to inches."""
return mm * 0.039370
with open('rainfalls.txt') as input_file, \
open('rainfallInInches.txt', 'w') as output_file:
for line in input_file:
if line.strip(): # non-blank
city, rainfall_mm = line.split() # no comments
output_file.write("{city} {rainfall:.2f}\n".format(city=city,
rainfall=mm_to_inches(float(rainfall_mm))))
It produces the same output.
First, it is better to change your parser to split the string by whitespace; then you don't need complex logic to pull out the numbers. After that, to print correctly, change your output to:
file.write("{} {:.2f}\n".format(city, list_of_rainfall_inches[inch_index]))
