Creating nested dictionary from text file in python3

I have a text file containing thousands of blocks like the ones below. For processing, I need to convert it into a dictionary.
Text file Pattern
[conn.abc]
domain = abc.com
id = Mike
token = jkjkhjksdhfkjshdfhsd
[conn.def]
domain = efg.com
id = Tom
token = hkjhjksdhfks
[conn.ghe]
domain = ghe.com
id = Jef
token = hkjhadkjhskhfskdj7979
Another sample data
New York
domain = Basiclink.com
token = eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyIsImtpZCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyJ9.eyJhdWQiOiJodHRwczovL21zLmNvbS9zbm93
method = http
username = abc#comp.com
Toronto
domain = hollywoodlink.com
token = eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyIsImtpZCI6Im5PbzNaRHJPRFhFSzFqS1doWHNsSFJfS1hFZyJ9.eyJhdWQiOiJodHRwczovL21zLmNvbS9zbm93Zmxha2UvsfdsdcHJvZGJjcy1lYXN0LXVzLTIiLCJpc3MiOiJodHRwczovL3N0cy53aW5kb3dzLm5ldC9lMjliODE
method = http
username = abc#comp.com
I would like to convert it into the following:
d1 = {'conn.abc': {'domain': 'abc.com', 'id': 'Mike', 'token': 'jkjkhjksdhfkjshdfhsd'},
      'conn.def': {'domain': 'efg.com', 'id': 'Tom', 'token': 'hkjhjksdhfks'},
      'conn.ghe': {'domain': 'ghe.com', 'id': 'Jef', 'token': 'hkjhadkjhskhfskdj7979'}}
Thanks

Since each block in the input file can have a varying number of data lines, this code does not assume a fixed block size.
Assumptions:
Each key (e.g. conn.abc) starts with an opening square bracket and ends with a closing square bracket, for example [conn.abc].
Each inner dictionary key and value are separated by =
If the header can be either [key] or key, use the first line below in place of the commented-out one.
elif '=' not in line:
#elif line[0] == '[' and line[-1] == ']':
Code for this is:
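with open('abc.txt', 'r') as f:
    d1 = {}
    dkey, dtemp = None, {}
    for line in f:
        line = line.strip()
        if line == '':
            continue
        elif line[0] == '[' and line[-1] == ']':
            # New [header]: store the previous block, then start a fresh one
            if dkey is not None:
                d1[dkey] = dtemp
            dkey = line[1:-1]
            dtemp = {}
        else:
            # Split on the first '=' only, so values may contain '='
            line_key, line_value = line.split('=', 1)
            dtemp[line_key.strip()] = line_value.strip()
    # Store the last block
    if dkey is not None:
        d1[dkey] = dtemp
print(d1)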
If the input file is:
[conn.abc]
domain = abc.com
id = Mike
token = jkjkhjksdhfkjshdfhsd
[conn.def]
domain = efg.com
id = Tom
dummy = Test
token = hkjhjksdhfks
[conn.ghe]
domain = ghe.com
id = Jef
token = hkjhadkjhskhfskdj7979
The output will be as follows:
{'conn.abc': {'domain': 'abc.com', 'id': 'Mike', 'token': 'jkjkhjksdhfkjshdfhsd'},
'conn.def': {'domain': 'efg.com', 'id': 'Tom', 'dummy': 'Test', 'token': 'hkjhjksdhfks'},
'conn.ghe': {'domain': 'ghe.com', 'id': 'Jef', 'token': 'hkjhadkjhskhfskdj7979'}}
Note that I added dummy = Test as an extra key/value pair for conn.def, so the output contains that additional entry.
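As a sketch of the '=' not in line variant applied to the second sample format (headers such as New York without brackets), the same logic can be wrapped in a function. The name parse_blocks is hypothetical, the sample values are abbreviated, and the code assumes a header line never contains an = sign.

```python
def parse_blocks(lines):
    """Build {header: {key: value}} from an iterable of text lines.

    Assumption: any non-empty line without '=' is a header, with or
    without surrounding square brackets.
    """
    result = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if '=' not in line:
            # Header line: drop optional brackets and start a new block
            if line.startswith('[') and line.endswith(']'):
                line = line[1:-1]
            current = result.setdefault(line, {})
        elif current is not None:
            # Split on the first '=' only, so values may contain '='
            key, value = line.split('=', 1)
            current[key.strip()] = value.strip()
    return result


sample = """\
New York
domain = Basiclink.com
method = http
[conn.abc]
domain = abc.com
id = Mike
"""
print(parse_blocks(sample.splitlines()))
```

Key/value lines that appear before the first header are silently skipped here; raising an error instead may be preferable if that would indicate a malformed file.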

You can use the standard configparser module:
import configparser
config = configparser.ConfigParser()
config.read("name_of_your_file.txt")
Then you can work with config much like a standard dictionary:
for name_of_section, section in config.items():
    for name_of_value, val in section.items():
        print(name_of_section, name_of_value, val)
Prints:
conn.abc domain abc.com
conn.abc id Mike
conn.abc token jkjkhjksdhfkjshdfhsd
conn.def domain efg.com
conn.def id Tom
conn.def token hkjhjksdhfks
conn.ghe domain ghe.com
conn.ghe id Jef
conn.ghe token hkjhadkjhskhfskdj7979
Or:
print(config["conn.abc"]["domain"])
Prints:
abc.com
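If a plain nested dictionary is needed, as in the question, the parsed sections can be copied into one. This is a minimal sketch that uses read_string on an inline sample instead of reading a file. Note that ConfigParser lowercases option names by default, and that values containing % may need ConfigParser(interpolation=None).

```python
import configparser

sample = """\
[conn.abc]
domain = abc.com
id = Mike
token = jkjkhjksdhfkjshdfhsd
[conn.def]
domain = efg.com
id = Tom
token = hkjhjksdhfks
"""

config = configparser.ConfigParser()
# read_string parses text directly; config.read("file.txt") does the same for a file
config.read_string(sample)

# sections() leaves out the implicit DEFAULT section
d1 = {name: dict(config[name]) for name in config.sections()}
print(d1)
```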

Related

I have to make a code taking a name and turning it into a last name, first initial and middle initial format

I have to write code that takes a name and turns it into a last name, first initial, middle initial format. My code works when there is a middle name but breaks when no middle name is provided. Is there a way to ignore a missing middle name and just default to last name and first initial? I'm super new to python3, so I'm sorry if my code is uber bad.
Here's my code:
your_name = input()
broken_name = your_name.split()
first_name = broken_name[0]
middle_name = broken_name[1]
last_name = broken_name[2]
middle_in = middle_name[0]
first_in = first_name[0]
print(last_name+',',first_in+'.'+middle_in+'.' )
You can use an if/else statement to check how long broken_name is.
Try:
your_name = input()
broken_name = your_name.split()
if len(broken_name) > 2:
    first_name = broken_name[0]
    middle_name = broken_name[1]
    last_name = broken_name[2]
    middle_in = middle_name[0]
    first_in = first_name[0]
    # print() already puts a space between its arguments
    print(last_name+',', first_in+'. '+middle_in+'.')
else:
    first_name = broken_name[0]
    last_name = broken_name[1]
    first_in = first_name[0]
    print(last_name+',', first_in+'.')
Another option:
def get_pretty_name_str(names):
    splited_names = names.split()
    last_name = splited_names.pop(-1)
    result = f'{last_name},'
    while splited_names:
        n = splited_names.pop(0)
        result += f'{n}.'
    return result
print(get_pretty_name_str('First Middle Last')) # Last,First.Middle.
print(get_pretty_name_str('First Last')) # Last,First.
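For the initials format asked about in the question, here is a minimal sketch that works with or without middle names. The function name format_name is hypothetical, and it assumes the last whitespace-separated word is the last name.

```python
def format_name(full_name):
    """Return 'Last, F.' or 'Last, F.M.' depending on how many names are given."""
    parts = full_name.split()
    if not parts:
        return ''
    if len(parts) == 1:
        # Only one name given: nothing to abbreviate
        return parts[0]
    *given, last = parts
    initials = ''.join(f'{name[0]}.' for name in given)
    return f'{last}, {initials}'


print(format_name('First Middle Last'))  # Last, F.M.
print(format_name('First Last'))         # Last, F.
```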

Regex Error and Improvement Driving Licence Data Extraction

I am trying to extract the Name, License No., Date of Issue and Validity from an image I processed using Pytesseract. I am quite confused by regex, but I went through some documentation and code on the web.
I got till here:
import pytesseract
import cv2
import re
from PIL import Image
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta
def driver_license(filename):
    """
    This function will handle the core OCR processing of images.
    """
    i = cv2.imread(filename)
    newdata=pytesseract.image_to_osd(i)
    angle = re.search(r'(?<=Rotate: )\d+', newdata).group(0)
    angle = int(angle)
    i = Image.open(filename)
    if angle != 0:
        #with Image.open("ro2.jpg") as i:
        rot_angle = 360 - angle
        i = i.rotate(rot_angle, expand="True")
        i.save(filename)
    i = cv2.imread(filename)
    # Convert to gray
    i = cv2.cvtColor(i, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    i = cv2.dilate(i, kernel, iterations=1)
    i = cv2.erode(i, kernel, iterations=1)
    txt = pytesseract.image_to_string(i)
    print(txt)
    text = []
    data = {
        'firstName': None,
        'lastName': None,
        'age': None,
        'documentNumber': None
    }
    c = 0
    print(txt)
    #Splitting lines
    lines = txt.split('\n')
    for lin in lines:
        c = c + 1
        s = lin.strip()
        s = s.replace('\n','')
        if s:
            s = s.rstrip()
            s = s.lstrip()
            text.append(s)
            try:
                if re.match(r".*Name|.*name|.*NAME", s):
                    name = re.sub('[^a-zA-Z]+', ' ', s)
                    name = name.replace('Name', '')
                    name = name.replace('name', '')
                    name = name.replace('NAME', '')
                    name = name.replace(':', '')
                    name = name.rstrip()
                    name = name.lstrip()
                    nmlt = name.split(" ")
                    data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                    data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]-\d{13}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]-\d{13}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace('-', '')
                    if not data['firstName']:
                        name = lines[c]
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.search(r"[a-zA-Z][a-zA-Z]\d{2} \d{11}", s):
                    data['documentNumber'] = re.search(r'[a-zA-Z][a-zA-Z]\d{2} \d{11}', s)
                    data['documentNumber'] = data['documentNumber'].group().replace(' ', '')
                    if not data['firstName']:
                        name = lines[c]
                        name = re.sub('[^a-zA-Z]+', ' ', name)
                        name = name.rstrip()
                        name = name.lstrip()
                        nmlt = name.split(" ")
                        data['firstName'] = " ".join(nmlt[:len(nmlt)-1])
                        data['lastName'] = nmlt[-1]
                if re.match(r".*DOB|.*dob|.*Dob", s):
                    yob = re.sub('[^0-9]+', ' ', s)
                    yob = re.search(r'\d\d\d\d', yob)
                    data['age'] = datetime.datetime.now().year - int(yob.group())
            except:
                pass
    print(data)
I need to extract the Validity and Issue Date as well, but I'm not getting anywhere near it. Also, I have seen that regex can shorten code a lot, so is there a better way to do this?
My input data is a string somewhat like this:
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
: SHRI DARSHAN SINGH GILL
DOB: 10/05/1966 BG: U
Address :
104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034
Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010
(Holder's Sig natu re)
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
Or like this:
in
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
2
Licence No. : DL-0320170595326 () WN
Name : AZAZ AHAMADSIDDIQUIE
s/w/D : SALAHUDDIN ALI
____... DOB: 26/12/1992 BG: O+
\ \ Address:
—.~J ~—; ROO NO-25 AMK BOYS HOSTEL, J.
— NAGAR, DELHI 110025
Auth to Drive Date of Issue
M.CYL. 12/12/2017
4 wt 4
Iseue Date: 12/12/2017 a
falidity(NT) < 2037
Validity(T) +: NA /
Inv CarrNo : NA te sntian sana
Note: In the second example you wouldn't get the validity; I will optimise the OCR later. Any proper guide that explains regex in a simpler way would be welcome.
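You can use this pattern: (?<=KEY\s*:\s*)\b[^\n]+ and replace KEY with the label of the field you want (Issue Date, Licence No., and so on).
The lookbehind in this pattern is variable-width, which the standard re module does not support, so you need the third-party regex library.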
Code:
import regex
text1 = """
Transport Department Government of NCT of Delhi
Licence to Drive Vehicles Throughout India
Licence No. : DL-0820100052000 (P) R
N : PARMINDER PAL SINGH GILL
: SHRI DARSHAN SINGH GILL
DOB: 10/05/1966 BG: U
Address :
104 SHARDA APPTT WEST ENCLAVE
PITAMPURA DELHI 110034
Auth to Drive Date of Issue
M.CYL. 24/02/2010
LMV-NT 24/02/2010
(Holder's Sig natu re)
Issue Date : 20/05/2016
Validity(NT) : 19/05/2021 : c
Validity(T) : NA Issuing Authority
InvCarrNo : NA NWZ-I, WAZIRPUR
"""
for key in (r'Issue Date', r'Licence No\.', r'N', r'Validity\(NT\)'):
    print(regex.findall(fr"(?<={key}\s*:\s*)\b[^\n]+", text1, regex.IGNORECASE))
Output:
['20/05/2016']
['DL-0820100052000 (P) R']
['PARMINDER PAL SINGH GILL']
['19/05/2021 : c']
You can also use re with a single regex based on alternation that will capture your keys and values:
import re
text = "Transport Department Government of NCT of Delhi\nLicence to Drive Vehicles Throughout India\n\nLicence No. : DL-0820100052000 (P) R\nN : PARMINDER PAL SINGH GILL\n\n: SHRI DARSHAN SINGH GILL\n\nDOB: 10/05/1966 BG: U\nAddress :\n\n104 SHARDA APPTT WEST ENCLAVE\nPITAMPURA DELHI 110034\n\n\n\nAuth to Drive Date of Issue\nM.CYL. 24/02/2010\nLMV-NT 24/02/2010\n\n(Holder's Sig natu re)\n\nIssue Date : 20/05/2016\nValidity(NT) : 19/05/2021 : c\nValidity(T) : NA Issuing Authority\nInvCarrNo : NA NWZ-I, WAZIRPUR"
search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
reg = r"\b({})\s*:\W*(.+)".format( "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True)) )
print(re.findall(reg, text, re.IGNORECASE))
Output:
[('Licence No.', 'DL-0820100052000 (P) R'), ('N', 'PARMINDER PAL SINGH GILL'), ('Issue Date', '20/05/2016'), ('Validity(NT)', '19/05/2021 : c')]
The regex is
\b(Validity\(NT\)|Licence\ No\.|Issue\ Date|N)\s*:\W*(.+)
Details:
map(re.escape, search_phrases) - escapes all special chars in your search phrases to be used as literal texts in a regex (else, . will match any chars, ? won't match a ? char, etc.)
sorted(..., key=len, reverse=True) - sorts the search phrases by length in descending order (to get longer matches first)
"|".join(...) - creates an alternation pattern, a|b|c|...
r"\b({})\s*:\W*(.+)".format( ... ) - creates the final regex.
Regex details
\b - a word boundary (NOTE: replace with (?m)^ if your matches occur at the beginning of a line)
(Validity\(NT\)|Licence\ No\.|Issue\ Date|N) - Group 1: one of the search phrases
\s* - zero or more whitespaces
: - a colon
\W* - zero or more non-word chars
(.+) - (capturing) Group 2: one or more chars other than line break chars, as many as possible.
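Building on the alternation regex above, the (key, value) pairs can be turned into a dictionary and the date fields parsed into date objects. This is a sketch under assumptions: the OCR text is shortened, dates are expected in dd/mm/yyyy form, and first_date is a hypothetical helper.

```python
import re
from datetime import datetime

text = (
    "Licence No. : DL-0820100052000 (P) R\n"
    "N : PARMINDER PAL SINGH GILL\n"
    "DOB: 10/05/1966 BG: U\n"
    "Issue Date : 20/05/2016\n"
    "Validity(NT) : 19/05/2021 : c\n"
)

search_phrases = ['Issue Date', 'Licence No.', 'N', 'Validity(NT)']
keys = "|".join(sorted(map(re.escape, search_phrases), key=len, reverse=True))
reg = r"\b({})\s*:\W*(.+)".format(keys)

# findall returns (key, value) tuples, which dict() accepts directly
fields = dict(re.findall(reg, text, re.IGNORECASE))


def first_date(value):
    """Return the first dd/mm/yyyy date found in value, or None if there is none."""
    m = re.search(r"\d{2}/\d{2}/\d{4}", value)
    return datetime.strptime(m.group(), "%d/%m/%Y").date() if m else None


issue_date = first_date(fields.get('Issue Date', ''))
validity = first_date(fields.get('Validity(NT)', ''))
print(issue_date, validity)
```

Searching for the date inside the captured value, rather than parsing the whole value, tolerates OCR noise such as the trailing ": c" after the validity date.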

Validate WTF Flask Form Input Values

I have the following form in my Flask app. I'd like to ensure that the input value is actually an integer, and that if the value entered in token is greater than some number k, an error message is shown on the screen. The IntegerField doesn't seem to enforce integer values; e.g., if the user enters 2.3, it passes that to my function, which fails because it expects an integer.
Can this kind of validation happen in the form, or do I need to program it inside my Flask app once the value is passed from the form to the server?
class Form(FlaskForm):
    token = IntegerField('Token Size', [DataRequired()], default = 2)
    submit = SubmitField('Submit')
EDIT
Per the comment below, updating this with my revised Form and the route
class Form(FlaskForm):
    token = IntegerField('Token Size', validators=[DataRequired(), NumberRange(min=1, max=10, message='Something')], default = 2)
    ngram_method = SelectField('Method', [DataRequired()],
                               choices=[('sliding', 'Sliding Window Method'),
                                        ('adjacent', 'Adjacent Text Method')])
    rem_stop = BooleanField('Remove Stop Words', render_kw={'checked': True})
    rem_punc = BooleanField('Remove Punctuation', default = True)
    text2use = SelectField('Text To Use for Word Tree', [DataRequired()],
                           choices=[('clean', 'Clean/Processed Text'),
                                    ('original', 'Original Text String')])
    pivot_word = TextField('Pivot Word for Word Tree', [DataRequired()])
    submit = SubmitField('Submit')
And the route in which the form is used
@word_analyzer.route('/text', methods=('GET', 'POST'))
def text_analysis():
    form = Form()
    result = '<table></table>'
    ngrams = '<table></table>'
    orig_text = '<table></table>'
    text = ""
    if request.method == 'POST':
        tmp_filename = tempfile.gettempdir()+'\\input.txt'
        if request.files:
            txt_upload = request.files.get('text_file')
            if txt_upload:
                f = request.files['text_file']
                f.save(tmp_filename)
        if os.path.exists(tmp_filename):
            file = open(tmp_filename, 'r', encoding="utf8")
            theText = [line.rstrip('\n') for line in file]
            theText = str(theText)
            token_size = form.token.data
            stops = form.rem_stop.data
            punc = form.rem_punc.data
            ngram_method = form.ngram_method.data
            text_result = text_analyzer(theText, token_size = token_size, remove_stop = stops, remove_punctuation = punc, method = ngram_method)
            result = pd.DataFrame.from_dict(text_result, orient='index', columns = ['Results'])[:-3].to_html(classes='table table-striped table-hover', header = "true", justify = "center")
            ngrams = pd.DataFrame.from_dict(text_result['ngrams'], orient='index', columns = ['Frequency']).to_html(classes='table table-striped table-hover', header = "true", justify = "center")
            if form.pivot_word.data == None:
                top_word = json.dumps(text_result['Top Word'])
            else:
                top_word = json.dumps(form.pivot_word.data)
            if form.text2use.data == 'original':
                text = json.dumps(text_result['original_text'])
            else:
                text = json.dumps(text_result['clean_text'])
            if form.validate_on_submit():
                return render_template('text.html', results = [result], ngrams = [ngrams], form = form, text=text, top_word = top_word)
    return render_template('text.html', form = form, results = [result],ngrams = [ngrams], text=text, top_word='')
Use the NumberRange validator from wtforms.validators. You can pass optional min and max values along with an error message.
Update
# Form Class
class Form(FlaskForm):
    token = FloatField('Token Size', validators=[DataRequired(), NumberRange(min=1, max=10, message='Something')])
# Route
if form.validate_on_submit():
    print(form.token.data)
Here is an example that should work. Make sure your form class field looks similar, and that your route calls form.validate_on_submit() before it uses the form data.

Unknown column added in user input form

I have a simple data entry form that writes the inputs to a csv file. Everything seems to be working ok, except that there are extra columns being added to the file in the process somewhere, seems to be during the user input phase. Here is the code:
import pandas as pd
#adds all spreadsheets into one list
Batteries= ["MAT0001.csv","MAT0002.csv", "MAT0003.csv", "MAT0004.csv",
            "MAT0005.csv", "MAT0006.csv", "MAT0007.csv", "MAT0008.csv"]
#User selects battery to log
choice = (int(input("Which battery? (1-8):")))
def choosebattery(c):
    done = False
    while not done:
        if(c in range(1,9)):
            return Batteries[c]
            done = True
        else:
            print('Sorry, selection must be between 1-8')
cfile = choosebattery(choice)
cbat = pd.read_csv(cfile)
#Collect Cycle input
print ("Enter Current Cycle")
response = None
while response not in {"Y", "N", "y", "n"}:
    response = input("Please enter Y or N: ")
cy = response
#Charger input
print ("Enter Current Charger")
response = None
while response not in {"SC-G", "QS", "Bosca", "off", "other"}:
    response = input("Please enter one: 'SC-G', 'QS', 'Bosca', 'off', 'other'")
if response == "other":
    explain = input("Please explain")
    ch = response + ":" + explain
else:
    ch = response
#Location
print ("Enter Current Location")
response = None
while response not in {"Rack 1", "Rack 2", "Rack 3", "Rack 4", "EV001", "EV002", "EV003", "EV004", "Floor", "other"}:
    response = input("Please enter one: 'Rack 1 - 4', 'EV001 - 004', 'Floor' or 'other'")
if response == "other":
    explain = input("Please explain")
    lo = response + ":" + explain
else:
    lo = response
#Voltage
done = False
while not done:
    choice = (float(input("Enter Current Voltage:")))
    modchoice = choice * 10
    if(modchoice in range(500,700)):
        vo = choice
        done = True
    else:
        print('Sorry, selection must be between 50 and 70')
#add inputs to current battery dataframe
log = pd.DataFrame([[cy,ch,lo,vo]],columns=["Cycle", "Charger", "Location", "Voltage"])
clog = pd.concat([cbat,log], axis=0)
clog.to_csv(cfile, index = False)
pd.read_csv(cfile)
And I receive:
Out[18]:
Charger Cycle Location Unnamed: 0 Voltage
0 off n Floor NaN 50.0
Where is the "Unnamed" column coming from?
There's an 'unnamed' column coming from your csv. The most likely reason is that the lines in your input csv files end with a comma (i.e. your separator), so pandas interprets that as an additional (nameless) column. If that's the case, remove the trailing separator from each line. For example, if your files are separated by commas, change:
Column1,Column2,Column3,
val_11, val12, val12,
...
Into:
Column1,Column2,Column3
val_11, val12, val12
...
Alternatively, try specifying the index column explicitly when reading the file (for example, with read_csv's index_col parameter). I believe some of the confusion stems from pandas concat reordering your columns.
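The trailing-separator effect can be checked without pandas. The sketch below uses the standard csv module on two made-up inline samples (the column names are borrowed from the question): a trailing comma produces one extra, empty header field, which is the kind of nameless column pandas typically labels Unnamed: N. The helper name header_fields is hypothetical.

```python
import csv
import io

# Made-up samples: the first one ends every line with the separator
with_trailing = "Cycle,Charger,Location,Voltage,\nn,off,Floor,50.0,\n"
without_trailing = "Cycle,Charger,Location,Voltage\nn,off,Floor,50.0\n"


def header_fields(csv_text):
    """Return the fields of the first row of csv_text."""
    return next(csv.reader(io.StringIO(csv_text)))


for label, data in (("trailing comma", with_trailing), ("clean", without_trailing)):
    fields = header_fields(data)
    print(label, len(fields), fields)
```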

Can't avoid stop words in tokens list

I'm normalizing text from wiki, and one of the tasks is to delete stopwords (item) from the text tokens. But I can't do it; to be more exact, I can't filter out some of the items.
Code:
# coding: utf8
import os
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
import win_unicode_console
win_unicode_console.enable()
stop_words_plus = ['il', 'la']
text_tags = ['doc', 'https', 'br', 'clear', 'all']
it_sw = corpus.stopwords.words('italian') + text_tags + stop_words_plus
it_path = os.listdir('C:\\Users\\1\\projects\\i')
lom_path = 'C:\\Users\\1\\projects\\l'
it_corpora = []
lom_corpora = []
def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw and token.isalpha() and len(token) > 1:
            token = token.lower()
            norm_tokens.append(token)
    return norm_tokens
for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path, encoding='utf8')
        raw_text = text_file.read()
        norm_tokens = normalize(raw_text)
        it_corpora += norm_tokens
print(FreqDist(it_corpora).most_common(10))
Output:
[('anni', 1140), ('il', 657), ('la', 523), ('gli', 287), ('parte', 276), ('stato', 276), ('due', 269), ('citta', 254), ('nel', 248), ('decennio', 242)]
As you can see, I need to get rid of the words 'il' and 'la'. I added them to the list (it_sw), and they are there (I've checked). Then, in the function normalize, I try to filter them out with if token not in it_sw, but it doesn't work and I have no idea what's wrong.
You convert your token to lower case after finding that it is not in it_sw. Is it possible that some of your tokens have upper case characters? In this case you could adjust your for loop slightly:
for token in tokens:
    token = token.lower()
    if token not in it_sw and token.isalpha() and len(token) > 1:
        norm_tokens.append(token)
By the way, I'm not sure whether the performance of your code is important, but if it is, you might get much better performance by checking for the presence of the tokens in a set instead of a list. Just change your definition of it_sw to:
it_sw = set(corpus.stopwords.words('italian') + text_tags + stop_words_plus)
You could also change it_corpora into a set, but that would require a few more small changes.
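Both suggestions can be tried in isolation with the sketch below. It uses hypothetical stand-ins so it runs without nltk: a tiny hand-written stop-word set, and str.split() in place of word_tokenize.

```python
# Hypothetical stand-ins: a tiny stop-word set and str.split() instead of
# nltk's Italian stop-word list and word_tokenize
stop_words = {'il', 'la', 'gli'}


def normalize(raw_text, stop_words):
    """Lowercase each token first, then drop stop words and non-alphabetic tokens."""
    norm_tokens = []
    for token in raw_text.split():
        token = token.lower()
        if token not in stop_words and token.isalpha() and len(token) > 1:
            norm_tokens.append(token)
    return norm_tokens


print(normalize('Il decennio e La citta', stop_words))  # ['decennio', 'citta']
```

With the original order of operations, 'Il' and 'La' would pass the stop-word check because of their capital letters and only then be lowercased, which is how 'il' and 'la' end up in the frequency list.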
