Converting Input data to dictionary (no single delimiter) - python-3.x

I am trying to convert an file.txt into dictionary. I know if the delimiter is only used one time, Then the code is as follows:
dict = {}
with open('file.txt') as input_file:
for line in input_file:
entry = line.split(":")
dict[entry[0].strip()] = entry[1].strip()
However, how do you turn a input file into a dictionary with no clear delimiter?
file.txt:
cats****5
doggie**6
ox******7
output:
dict = {'cats':5, 'doggie':6, 'ox':7}
Thank you for your help :)

You can simply split on your delimeter as before, but take the first and last field:
for line in input_file:
entry = line.split("*")
dict[entry[0].strip()] = entry[-1].strip()
Negative indices fetch elements from the back of the list - the index -1 is the last element, -2 is the second-to-last element, and so on.
You can also use unpacking, which allows for self-documenting variable naming:
for line in input_file:
key, *_, value = line.split("*")
dict[key.strip()] = value.strip()
Here, *_ consumes an arbitrary number of values - but not the first or last, since key and value are before and after it and both consume exactly one value. The symbol * denotes the arbitrary size, while _ is a regular name that is just conventionally used for unused values.
If your delimiter also appears in the value, splitting is not robust. Use a regular expression to define the grammar of your delimiter, and capture key and value. For example, if your delimiter is . and you expect float values, the following works:
import re
kv_pattern = re.compile(r'^(.+?)\.+(.+?)$')
# ^ ^ ^ capture shortest match for any character sequence
# ^ ^ longest match of delimiter sequence
# ^ capture shortest match for any character sequence
data = {}
input_data = ['cats....5.0', 'doggie...6', 'ox.......7.']
for line in input_data:
key, value = kv_pattern.match(line).groups()
data[key.strip()] = value.strip()

Related

Find and append characters of a String by matching with a List in Python 2.7

There is a list contains with character sequences such below:
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF',...]
and a string can be defined as below:
a_str = 'CAFCMMVFCMMECMMFFCCAF'
The problem is to match the longest character sequence of seq_list in a_str from left to right iteratively, and then a character('|') should be appended if it's found.
For example,
a_str begins with 'C' but the actual character sequence is 'CAF' because 'CAF' has the longer sequence than 'C',
so that it should be achieved such below:
a_str = 'CAF|CMMVFCMMECMMFFCCAF' #actual sequence match
'C|AFCMMVFCMMECMMFFCCAF' #false sequence match
Then, remaining a_str_r should be like this a_str_r = 'CMMVFCMMECMMFFCCAF' after a character '|' has been appended. So that the iterative process has to start over again by matching the longest sequence from the list until the end of the string, and the final result should be like this:
a_str = 'CAF|CMMVF|CMM|E|CMMFF|C|CAF|'
This was one of the attempts for this problem, and still couldn't get right!
a_str_r = []
for each in seq_list:
for i in a_str:
if each in i:
a_str_r.append(i+'|')
return a_str_r
You want to search for leftmost longest match. That is a standout for a regular expression search.
import re
seq_list = ['C','CA','CAF','CMMVF','E','CMM','CMMF','CMMFF']
# Sort to put longer match strings before shorter ones
sseq_list = sorted(seq_list, key=lambda a: len(a), reverse=True)
# Turn list into a regular expression string
sseq_re = '|'.join(sseq_list)
# Compile regular expression string
rx = rx = re.compile(sseq_re)
# Put pipe characters between the matches
print '|'.join(rx.findall('CAFCMMVFCMMECMMFFCCAF'))

Get only one word from line

How can I take only one word from a line in file and save it in some string variable?
For example my file has line "this, line, is, super" and I want to save only first word ("this") in variable word. I tried to read it character by character until I got on "," but I when I check it I got an error "Argument of type 'int' is not iterable". How can I make this?
line = file.readline() # reading "this, line, is, super"
if "," in len(line): # checking, if it contains ','
for i in line:
if "," not in line[i]: # while character is not ',' -> this is where I get error
word += line[i] # add it to my string
You can do it like this, using split():
line = file.readline()
if "," in line:
split_line = line.split(",")
first_word = split_line[0]
print(first_word)
split() will create a list where each element is, in your case, a word. Commas will not be included.
At a glance, you are on the right track but there are a few things wrong that you can decipher if you always consider what data type is being stored where. For instance, your conditional 'if "," in len(line)' doesn't make sense, because it translates to 'if "," in 21'. Secondly, you iterate over each character in line, but your value for i is not what you think. You want the index of the character at that point in your for loop, to check if "," is there, but line[i] is not something like line[0], as you would imagine, it is actually line['t']. It is easy to assume that i is always an integer or index in your string, but what you want is a range of integer values, equal to the length of the line, to iterate through, and to find the associated character at each index. I have reformatted your code to work the way you intended, returning word = "this", with these clarifications in mind. I hope you find this instructional (there are shorter ways and built-in methods to do this, but understanding indices is crucial in programming). Assuming line is the string "this, line, is, super":
if "," in line: # checking that the string, not the number 21, has a comma
for i in range(0, len(line)): # for each character in the range 0 -> 21
if line[i] != ",": # e.g. if line[0] does not equal comma
word += line[i] # add character to your string
else:
break # break out of loop when encounter first comma, thus storing only first word

remove the item in string

How do I remove the other stuff in the string and return a list that is made of other strings ? This is what I have written. Thanks in advance!!!
def get_poem_lines(poem):
r""" (str) -> list of str
Return the non-blank, non-empty lines of poem, with whitespace removed
from the beginning and end of each line.
>>> get_poem_lines('The first line leads off,\n\n\n'
... + 'With a gap before the next.\nThen the poem ends.\n')
['The first line leads off,', 'With a gap before the next.', 'Then the poem ends.']
"""
list=[]
for line in poem:
if line == '\n' and line == '+':
poem.remove(line)
s = poem.remove(line)
for a in s:
list.append(a)
return list
split and strip might be what you need:
s = 'The first line leads off,\n\n\n With a gap before the next.\nThen the poem ends.\n'
print([line.strip() for line in s.split("\n") if line])
['The first line leads off,', 'With a gap before the next.', 'Then the poem ends.']
Not sure where the + fits in as it is, if it is involved somehow either strip or str.replace it, also avoid using list as a variable name, it shadows the python list.
lastly strings have no remove method, you can .replace but since strings are immutable you will need to reassign the poem to the the return value of replace i.e poem = poem.replace("+","")
You can read all non-empty lines like this:
list_m = [line if line not in ["\n","\r\n"] for line in file];
Without looking at your input sample, I am assuming that you simply want your white spaces to be removed. In that case,
for x in range(0, len(list_m)):
list_m[x] = list_m[x].replace("[ ](?=\n)", "");

Split by the delimiter that comes first, Python

I have some unpredictable log lines that I'm trying to split.
The one thing I can predict is that the first field always ends with either a . or a :.
Is there any way I can automatically split the string at whichever delimiter comes first?
Look at the index of the . and : characters in the string using the index() function.
Here’s a simple implementation:
def index_default(line, char):
"""Returns the index of a character in a line, or the length of the string
if the character does not appear.
"""
try:
retval = line.index(char)
except ValueError:
retval = len(line)
return retval
def split_log_line(line):
"""Splits a line at either a period or a colon, depending on which appears
first in the line.
"""
if index_default(line, ".") < index_default(line, ":"):
return line.split(".")
else:
return line.split(":")
I wrapped the index() function in an index_default() function because if the line doesn’t contain a character, index() throws a ValueError, and I wasn’t sure if every line in your log would contain both a period and a colon.
And then here’s a quick example:
mylines = [
"line1.split at the dot",
"line2:split at the colon",
"line3:a colon preceded. by a dot",
"line4-neither a colon nor a dot"
]
for line in mylines:
print split_log_line(line)
which returns
['line1', 'split at the dot']
['line2', 'split at the colon']
['line3', 'a colon preceded. by a dot']
['line4-neither a colon nor a dot']
Check the indexes for both both characters, then use the lowest index to split your string.

How do I read a delimited file with strings/numbers with Octave?

I am trying to read a text file containing digits and strings using Octave. The file format is something like this:
A B C
a 10 100
b 20 200
c 30 300
d 40 400
e 50 500
but the delimiter can be space, tab, comma or semicolon. The textread function works fine if the delimiter is space/tab:
[A,B,C] = textread ('test.dat','%s %d %d','headerlines',1)
However it does not work if delimiter is comma/semicolon. I tried to use dklmread:
dlmread ('test.dat',';',1,0)
but it does not work because the first column is a string.
Basically, with textread I can't specify the delimiter and with dlmread I can't specify the format of the first column. Not with the versions of these functions in Octave, at least. Has anybody ever had this problem before?
textread allows you to specify the delimiter-- it honors the property arguments of strread. The following code worked for me:
[A,B,C] = textread( 'test.dat', '%s %d %d' ,'delimiter' , ',' ,1 )
I couldn't find an easy way to do this in Octave currently. You could use fopen() to loop through the file and manually extract the data. I wrote a function that would do this on arbitrary data:
function varargout = coltextread(fname, delim)
% Initialize the variable output argument
varargout = cell(nargout, 1);
% Initialize elements of the cell array to nested cell arrays
% This syntax is due to {:} producing a comma-separated
[varargout{:}] = deal(cell());
fid = fopen(fname, 'r');
while true
% Get the current line
ln = fgetl(fid);
% Stop if EOF
if ln == -1
break;
endif
% Split the line string into components and parse numbers
elems = strsplit(ln, delim);
nums = str2double(elems);
nans = isnan(nums);
% Special case of all strings (header line)
if all(nans)
continue;
endif
% Find the indices of the NaNs
% (i.e. the indices of the strings in the original data)
idxnans = find(nans);
% Assign each corresponding element in the current line
% into the corresponding cell array of varargout
for i = 1:nargout
% Detect if the current index is a string or a num
if any(ismember(idxnans, i))
varargout{i}{end+1} = elems{i};
else
varargout{i}{end+1} = nums(i);
endif
endfor
endwhile
endfunction
It accepts two arguments: the file name, and the delimiter. The function is governed by the number of return variables that are specified, so, for example, [A B C] = coltextread('data.txt', ';'); will try to parse three different data elements from each row in the file, while A = coltextread('data.txt', ';'); will only parse the first elements. If no return variable is given, then the function won't return anything.
The function ignores rows that have all-strings (e.g. the 'A B C' header). Just remove the if all(nans)... section if you want everything.
By default, the 'columns' are returned as cell arrays, although the numbers within those arrays are actually converted numbers, not strings. If you know that a cell array contains only numbers, then you can easily convert it to a column vector with: cell2mat(A)'.

Resources