The Code
def rainfallInInches():
    file_object = open('rainfalls.txt')
    list_of_cities = []
    list_of_rainfall_inches = []
    for line in file_object:
        cut_up_line = line.split()
        city = cut_up_line[0]
        rainfall_mm = int(line[len(line) - 3:])
        rainfall_inches = rainfall_mm / 25.4
        list_of_cities.append(city)
        list_of_rainfall_inches.append(rainfall_inches)
    inch_index = 0
    desired_file = open("rainfallInInches.txt", "w")
    for city in list_of_cities:
        desired_file.writelines(str((city, "{0:0.2f}".format(list_of_rainfall_inches[inch_index]))))
        inch_index += 1
    desired_file.close()
rainfalls.txt:
Manchester 37
Portsmouth 9
London 5
Southampton 12
Leeds 20
Cardiff 42
Birmingham 34
Edinburgh 26
Newcastle 11
rainfallInInches.txt (the unwanted output):
('Manchester', '1.46')('Portsmouth', '0.35')('London',
'0.20')('Southampton', '0.47')('Leeds', '0.79')('Cardiff',
'1.65')('Birmingham', '1.34')('Edinburgh', '1.02')('Newcastle',
'0.43')
My program takes the data from 'rainfalls.txt', which has rainfall information in mm, converts the mm to inches, and then writes this new information into a new file, 'rainfallInInches.txt'.
I've gotten this far, except I can't figure out how to format 'rainfallInInches.txt' to make it look like 'rainfalls.txt'.
Bear in mind that I am a student, which you probably gathered from my hacky code.
You could separate the parsing of the input file, the conversion from mm to inches, and the final formatting for writing:
#!/usr/bin/env python
# read input
rainfall_data = []  # (city, rainfall) pairs
with open('rainfalls.txt') as file:
    for line in file:
        if line.strip():  # non-blank
            city, rainfall = line.split()  # no comments in the input
            rainfall_data.append((city, float(rainfall)))

def mm_to_inches(mm):
    """Convert *mm* to inches."""
    return mm * 0.039370

# write output
with open('rainfallInInches.txt', 'w') as file:
    for city, rainfall_mm in rainfall_data:
        file.write("{city} {rainfall:.2f}\n".format(city=city,
                                                    rainfall=mm_to_inches(rainfall_mm)))
rainfallInInches.txt:
Manchester 1.46
Portsmouth 0.35
London 0.20
Southampton 0.47
Leeds 0.79
Cardiff 1.65
Birmingham 1.34
Edinburgh 1.02
Newcastle 0.43
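As a quick sanity check of the conversion step in isolation (an interactive session, assuming mm_to_inches is defined as above):

>>> round(mm_to_inches(37), 2)  # Manchester's 37 mm
1.46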
If you feel confident that each step is correct in isolation then you could combine the steps:
#!/usr/bin/env python
def mm_to_inches(mm):
    """Convert *mm* to inches."""
    return mm * 0.039370

with open('rainfalls.txt') as input_file, \
     open('rainfallInInches.txt', 'w') as output_file:
    for line in input_file:
        if line.strip():  # non-blank
            city, rainfall_mm = line.split()  # no comments
            output_file.write("{city} {rainfall:.2f}\n".format(city=city,
                                                               rainfall=mm_to_inches(float(rainfall_mm))))
It produces the same output.
First, it is better to change your parser to split each line on whitespace; that way you don't need complex logic to pull the numbers out.
After this, to print correctly, change your output to:
desired_file.write("{0} {1:0.2f}\n".format(city, list_of_rainfall_inches[inch_index]))
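Putting both suggestions together, a minimal sketch of the rewritten function (untested, but keeping the file names and variable names from the question):

def rainfallInInches():
    list_of_cities = []
    list_of_rainfall_inches = []
    with open('rainfalls.txt') as file_object:
        for line in file_object:
            # split on whitespace instead of slicing fixed character positions
            city, rainfall_mm = line.split()
            list_of_cities.append(city)
            list_of_rainfall_inches.append(int(rainfall_mm) / 25.4)
    with open('rainfallInInches.txt', 'w') as desired_file:
        for inch_index, city in enumerate(list_of_cities):
            # one 'City value' pair per line, rounded to 2 decimal places
            desired_file.write("{0} {1:0.2f}\n".format(city, list_of_rainfall_inches[inch_index]))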
Related
I want to extract patterns from a text file and create a pandas dataframe.
Each line inside the text file looks like this:
2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89
I want to extract the following patterns:
+12-34, 1.11, a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89, where cols={id,res,a,b,p,f,r,pb,pbb,prr,du}.
I have written the following code to extract the patterns and create the dataframe. The file is around 500MB and contains a huge number of rows.
import glob
import re

import pandas as pd

files = glob.glob(path_torawfolder + "*.txt")

lines = []
for fle in files:
    with open(fle) as f:
        lines += f.readlines()

df = pd.DataFrame()
for l in lines:
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    dct = {k: [v] for k, v in feature_dict.items()}
    series = pd.DataFrame(dct)
    #print(series)
    df = pd.concat([df, series], ignore_index=True)
Any suggestions to optimize the code and reduce the processing time, please?
Thanks!
A bit of improvement: in the previous code there were a few unnecessary conversions from dict to DataFrame.
dicts = []

def create_dataframe():
    df = pd.DataFrame()
    for l in lines:
        feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
        feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
        feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
        feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
        dicts.append(feature_dict)
    df = pd.DataFrame(dicts)
    return df
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def create_dataframe():
     9         1        551.0    551.0      0.0      df = pd.DataFrame()
    10   1697339     727220.0      0.4      1.7      for l in lines:
    11   1697338    1706328.0      1.0      4.0          feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    12   1697338   20857891.0     12.3     49.1          feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    13
    14   1697338    1987874.0      1.2      4.7          feature_dict["ctry_provider"] = (l.split("+")[-1]).split(" =")[0]
    15   1697338    9142820.0      5.4     21.5          feature_dict["acpa_codes"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    16   1697338    1039880.0      0.6      2.4          dicts.append(feature_dict)
    17
    18         1    7025303.0  7025303.0     16.5      df = pd.DataFrame(dicts)
    19         1          2.0        2.0      0.0      return df
This improvement reduced the computation time to a few minutes. Any more suggestions for optimizing further, e.g. with dask or parallel computing?
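One possible direction, sketched here under the assumption that the lines can be parsed independently of each other (untested on the real data): factor the per-line parsing into a pure function and fan it out over worker processes with the standard library's multiprocessing.

import re
from multiprocessing import Pool

import pandas as pd

def parse_line(l):
    # same parsing logic as above, factored out so it can run in a worker process
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    return feature_dict

def create_dataframe_parallel(lines, processes=4):
    # chunksize keeps inter-process overhead low for millions of small items;
    # on platforms that spawn processes, call this under `if __name__ == "__main__":`
    with Pool(processes) as pool:
        dicts = pool.map(parse_line, lines, chunksize=10000)
    return pd.DataFrame(dicts)

Since the profile shows about half of the runtime inside re.findall, it may also be worth checking whether a single combined regex per line can replace the two separate calls.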
I have 2 txt files with names and scores. For example:
File 1         File 2         Desired Output
Name   Score   Name   Score   Name   Score
Michael   20   Michael   30   Michael   50
Adrian    40   Adrian    50   Adrian    90
Jane      60                  Jane      60
I want to sum the scores for matching names and print the result. I tried pairing names and scores in two different dictionaries and then merging the dictionaries, but I can't keep the same name with two different scores, so I'm stuck here. I've written something like the following:
d1 = dict()
d2 = dict()

with open('data1.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d1[test[i]] = test[i + 1]
        i += 2
    del d1['Name']

with open('data2.txt', "r") as f:
    test = [i for line in f for i in line.split()]
    i = 0
    while i < len(test) - 1:
        d2[test[i]] = test[i + 1]
        i += 2
    del d2['Name']

z = dict(d2.items() | d1.items())
Using a dictionary comprehension should get you what you are after. I have assumed the contents of the files are:
File1.txt:
Name Score
Michael 20
Adrian 40
Jane 60
File2.txt:
Name Score
Michael 30
Adrian 50
Then you can get a total as:
with open("file1.txt", "r") as file_in:
next(file_in) # skip header
file1_data = dict(row.split() for row in file_in if row)
with open("file2.txt", "r") as file_in:
next(file_in) # skip header
file2_data = dict(row.split() for row in file_in if row)
result = {
key: int(file1_data.get(key, 0)) + int(file2_data.get(key, 0))
for key
in set(file1_data).union(file2_data) # could also use file1_data.keys()
}
print(result)
This should give you a result like:
{'Michael': 50, 'Jane': 60, 'Adrian': 90}
Use defaultdict
from collections import defaultdict

name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        for line in f:
            name, score = line.split()
            name_scores[name] += int(score)
edit: You'll probably have to skip any header line and maybe clean up trailing white spaces, but the gist of it is above.
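Incorporating that edit, a fuller sketch (assuming the 'Name Score' header shown in the question's files):

from collections import defaultdict

name_scores = defaultdict(int)
files = ('data1.txt', 'data2.txt')
for file in files:
    with open(file, 'r') as f:
        next(f)  # skip the 'Name Score' header line
        for line in f:
            if line.strip():  # ignore blank lines
                name, score = line.split()
                name_scores[name] += int(score)

print(dict(name_scores))  # {'Michael': 50, 'Adrian': 90, 'Jane': 60}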
I am using Amazon Textract to analyse anonymous blood tests.
Each report consists of markers, their values, units, and reference intervals.
I want to extract them into a dictionary like this:
{"globulin": [2.8, gidL, [1.0, 4.0]], "cholesterol": [161, mg/dL, [120, 240]], .... }
Here is an example of such OCR-produced text:
Name:
Date Perfermed
$/6/2010
DOBESevState:
Date Collected:
05/03/201004.00 PN
Date Lac Meat: 05/03/2010 10.45 A
Eraminer:
PTM
Date Received: $/7/2010 12:13.11A
Tukit No.
8028522035
Abeormal
Normal
Range
CARDLAC RISK
CHOLESTEROL
161.00
120.00 240.00 mg/dL
CHOLESTEROLHDL RATIO
2.39
1.250 5.00
HIGH DENSITY LIPOPROTEINCHDL)
67.30
35.00 75.00 me/dL
LOW DENSITY LIPOPROTEIN (LDL)
78.70
60.00 a 190.00 midI.
TRIGLYCERIDES
75.00
10.00 a 200.00 made
CHEMISTRIES
ALBUMIN
4.40
3.50 5.50 pidl
ALKALINE PHOSPHATASE
49.00
30.00 120.00 UAL
BLOOD UREA NITROGEN (BUN)
17.00
6.00 2500 meidL
CREATININE
0,85
060 1.50 matdL
FRUCTOSAMINE
182
1.20 1.79 mmoV/l
GAMMA GLUTAMYUTRANSFERASE
9.00
2.00 65.00 UIL
GLOBULIN
2.80
1.00 4.00 gidL.
GLUCOSE
61.00
70.00 125.00 me/dl.
HEMOGLOBIN AIC
5.10
3.00 6.00 %
SGOT (AST)
25.00
0.00 41.00 UM
SOPI (ALT)
22.00
0.00 45.00 IMI
TOTAL BILIRUBIN
0.52
0.10 1.20 mmeldi.
TOTAL PROTEIN
720
6.00 8.50 gidl.
1. This sample lab report shows both normal and abnormal results. as well as
acceptable reference ranges for each testing category.
Please advise on the best way to extract this information. I have tried Amazon Comprehend Medical; it does the job, but not for all images.
I have also tried spaCy: https://github.com/NLPatVCU/medaCy,
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
This might not be a good application of NLP, as the text isn't any sort of natural language. Rather, it is structured data that can be extracted using rules, and writing rules is definitely one way to go about this.
You can first do a fuzzy match of the category headings in the OCR results, namely "CARDIAC RISK" and "CHEMISTRIES", to partition the string into its respective categories.
If you are sure that each entry takes exactly 3 lines, you can then partition the lines in each category into groups of three and pull the marker name, value, and range/unit line out of each group.
Here's some sample code I ran on the data you provided. It requires the fuzzyset package, which you can get by running python3 -m pip install fuzzyset. Since some entries don't have units, I modified your desired output format slightly and made units a list so it can easily be empty; it also collects stray letters found in the third line.
from fuzzyset import FuzzySet

### Load data
with open("ocr_result.txt") as f:
    data = f.read()
lines = data.split("\n")

### Create fuzzy set
CATEGORIES = ("CARDIAC RISK", "chemistries")
fs = FuzzySet(lines)

### Get the line ranges of each category
cat_ranges = [0] * (len(CATEGORIES) + 1)
for i, cat in enumerate(CATEGORIES):
    match = fs.get(cat)[0]
    match_idx = lines.index(match[1])
    cat_ranges[i] = match_idx
last_idx = lines.index(fs.get("sample lab report")[0][1])
cat_ranges[-1] = last_idx

### Read lines in each category
def _to_float(s: str) -> float:
    """Attempt to convert a string value to float."""
    try:
        f = float(s)
    except ValueError:
        if "," in s:
            s = s.replace(",", ".")
            f = float(s)
        else:
            raise ValueError(f"Cannot convert {s} to float.")
    return f

result = {}
for i, cat in enumerate(CATEGORIES):
    result[cat] = {}
    # Ignore the line of the category itself
    s = slice(cat_ranges[i] + 1, cat_ranges[i + 1])
    lines_in_cat = lines[s]
    if len(lines_in_cat) % 3 != 0:
        raise ValueError("Something's wrong")
    for j in range(0, len(lines_in_cat), 3):
        _name = lines_in_cat[j]
        _value = lines_in_cat[j + 1]
        _line_3 = lines_in_cat[j + 2].split(" ")
        # Convert value to float
        _value = _to_float(_value)
        # Process line 3 to get range and unit
        _range = []
        _unit = []
        for v in _line_3:
            if v[0].isdigit() and len(_range) < 2:
                _range.append(_to_float(v))
            else:
                _unit.append(v)
        _l = [_value, _unit, _range]
        result[cat][_name] = _l
print(result)
Output:
{'CARDIAC RISK': {'CHOLESTEROL': [161.0, ['mg/dL'], [120.0, 240.0]], 'CHOLESTEROLHDL RATIO': [2.39, [], [1.25, 5.0]], 'HIGH DENSITY LIPOPROTEINCHDL)': [67.3, ['me/dL'], [35.0, 75.0]], 'LOW DENSITY LIPOPROTEIN (LDL)': [78.7, ['a', 'midI.'], [60.0, 190.0]], 'TRIGLYCERIDES': [75.0, ['a', 'made'], [10.0, 200.0]]}, 'chemistries': {'ALBUMIN': [4.4, ['pidl'], [3.5, 5.5]], 'ALKALINE PHOSPHATASE': [49.0, ['UAL'], [30.0, 120.0]], 'BLOOD UREA NITROGEN (BUN)': [17.0, ['meidL'], [6.0, 2500.0]], 'CREATININE': [0.85, ['matdL'], [60.0, 1.5]], 'FRUCTOSAMINE': [182.0, ['mmoV/l'], [1.2, 1.79]], 'GAMMA GLUTAMYUTRANSFERASE': [9.0, ['UIL'], [2.0, 65.0]], 'GLOBULIN': [2.8, ['gidL.'], [1.0, 4.0]], 'GLUCOSE': [61.0, ['me/dl.'], [70.0, 125.0]], 'HEMOGLOBIN AIC': [5.1, ['%'], [3.0, 6.0]], 'SGOT (AST)': [25.0, ['UM'], [0.0, 41.0]], 'SOPI (ALT)': [22.0, ['IMI'], [0.0, 45.0]], 'TOTAL BILIRUBIN': [0.52, ['mmeldi.'], [0.1, 1.2]], 'TOTAL PROTEIN': [720.0, ['gidl.'], [6.0, 8.5]]}}
I'm trying to clean up a text file from a URL using pandas, and my idea is to split it into separate columns, add 3 more columns, and export to a csv.
I have tried cleaning up the file (I believe it is separated by " ") and so far to no avail.
# script to check and clean text file for 'aberporth' station
import io

import pandas as pd
import requests

# api-endpoint for current weather
URLH = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"

with requests.session() as s:
    # sending get for history text file
    r = s.get(URLH)
    df1 = pd.read_csv(io.StringIO(r.text), sep=" ", skiprows=5, error_bad_lines=False)
    df2 = pd.read_csv(io.StringIO(r.text), nrows=1)
    # df1['location'] = df2.columns.values[0]
    # _, lat, _, lon = df2.index[0][1].split()
    # df1['lat'], df1['lon'] = lat, lon
    df1.dropna(how='all')
    df1.to_csv('Aberporth.txt', sep='|', index=True)
What makes it worse is that the file itself has uneven columns, and somewhere around line 944 it gains one more column, which is why I have to skip bad lines. At this point I'm a bit lost as to how I should proceed, and whether I should look at something beyond pandas.
You don't really need pandas for this. The built-in csv module does just fine.
The data comes in fixed-width format (which is not the same as "delimited format"):
Aberporth
Location: 224100E 252100N, Lat 52.139 Lon -4.570, 133 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
yyyy  mm   tmax   tmin     af   rain    sun
           degC   degC   days     mm  hours
1942   2    4.2   -0.6    ---   13.8   80.3
1942   3    9.7    3.7    ---   58.0  117.9
...
So we can either split it at predefined indexes (which we would have to count & hard-code, and which probably are subject to change), or we can split on "multiple spaces" using regex, in which case it does not matter where the exact column positions are:
import requests
import re
import csv

def get_values(url):
    resp = requests.get(url)
    for line in resp.text.splitlines():
        values = re.split(r"\s+", line.strip())
        # skip all lines that do not have a year as first item
        if not re.match(r"^\d{4}$", values[0]):
            continue
        # replace all '---' by None
        values = [None if v == '---' else v for v in values]
        yield values

url = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"

with open('out.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(get_values(url))
You can do writer.writerow(['yyyy','mm','tmax','tmin','af','rain','sun']) to get a header row if you need one.
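For example, with the header row written before the data (column names taken from the file's own header line):

with open('out.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['yyyy', 'mm', 'tmax', 'tmin', 'af', 'rain', 'sun'])  # header row
    writer.writerows(get_values(url))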
I have a file like
2.0 4 3
0.5 5 4
-0.5 6 1
-2.0 7 7
.......
the actual file is pretty big
which I want to read and extend with a couple of columns: the first added column would be column 4 = column 2 * column 3, and the second would be column 5 = column 2 / column 1 + column 4, so the result should be
2.0 4 3 12 14
0.5 5 4 20 30
-0.5 6 1 6 -6
-2.0 7 7 49 45.5
.....
which I want to write in a different file.
with open('test3.txt', encoding='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            float_list = [float(i) for i in line.split()]
            print(float_list)
But so far I just have this. I am only able to create the list; I'm not sure how to perform arithmetic on it and create the new columns. I think I am completely off here. I am just a beginner in Python. Any help will be greatly appreciated. Thanks!
I would reuse your formulae, but shift the indexes, since they start at 0 in Python.
I would extend the list of floats read from each line with the new computations, and write the line back, space-separated (converting back to str in a list comprehension).
So the inner part of the loop can be written as follows:
with open('test3.txt', encoding='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            column = [float(i) for i in line.split()]  # your code
            column.append(column[1] * column[2])  # add column
            column.append(column[1]/column[0] + column[3])  # add another column
            wf.write(" ".join([str(x) for x in column]) + "\n")  # write joined strings, separated by spaces
Something like this - see comments in code
with open('test3.txt', encoding='latin1') as rf:
    with open('test4.txt', 'w') as wf:
        for line in rf:
            float_list = [float(i) for i in line.split()]
            # calculate two new columns
            float_list.append(float_list[1] * float_list[2])
            float_list.append(float_list[1]/float_list[0] + float_list[3])
            # convert all values to text
            text_list = [str(i) for i in float_list]
            # concatenate all elements and write line
            wf.write(' '.join(text_list) + '\n')
Try the following.
map() is used to convert each element of the list to float; at the end it is used again to convert each float to str so we can concatenate them. Note that in Python 3 map() returns an iterator, so it is wrapped in list() before appending:

with open('out.txt', 'w') as out:
    with open('input.txt', 'r') as f:
        for line in f:
            my_list = list(map(float, line.split()))
            my_list.append(my_list[1] * my_list[2])
            my_list.append(my_list[1] / my_list[0] + my_list[3])
            my_list = map(str, my_list)
            out.write(' '.join(my_list) + '\n')