Always getting the same result when checking what the function string.match returns in Lua - string

I'm new to Lua and to this forum. I'm testing whether a string contains only alphanumeric characters.
For this I use the string.match function and test what it returns.
Here is my code:
function ReadFistMacAddressFile (_FilePath)
    local file = open(_FilePath, "rb") -- r read mode and b binary mode
    if not file then
        file:close()
        appli.Test_failed("Can't open the file".. _FilePath .. ".\n\n Verify the path.")
        return nil
    end
    print("------------------------")
    print("-------------------------")
    local size = sizeFile (file)
    print(" size = " .. size)
    print("")
    print("")
    if size == 0 then
        file:close()
        appli.Test_failed("Mac Adress file's empty")
        return nil
    end
    local lignes = ReadLine (file, 1)
    print(" lignes = " .. lignes)
    print("")
    print("")
    local NoneAlphaNumericFind = nil
    NoneAlphaNumericFind = string.match(lignes,"[^%w]")
    print(" NoneAlphaNumericFind = " .. NoneAlphaNumericFind)
    print("")
    print("")
    if NoneAlphaNumericFind == nil or NoneAlphaNumericFind == '' then
        file:close()
        return lignes
    else
        file:close()
        appli.Test_failed("Mac adress contain the characteres: ".. NoneAlphaNumericFind)
        return nil
    end
end
I open the MAC address file, read the first line, and test whether the string contains only alphanumeric characters.
The problem is that I end up in the else branch in every case. I saw in the Lua docs that string.match returns nil if it doesn't find the pattern, so I don't understand why it doesn't work.
Here is the case where there are only alphanumeric characters
Here is the case where there is a non-alphanumeric character
Thank you in advance for your help.
I've tried every solution that I found on this forum and none of them work.

Your file is opened in binary mode (rb), so the CRLF line terminator is not converted to LF.
Lua's file:lines() and file:read() functions strip only the LF automatically.
So the non-alphanumeric CR character is still in the returned line, and it always matches the [^%w] pattern.
BTW, Lua has the %x pattern to match a hexadecimal digit.
You can use [^%x] or simply %X to find a character that is not in the set 0-9A-Fa-f.

Related

Why does using "+=" to append to a List[str] result in an unexpected newline character, while "c = c + a" results in c being empty?

I'm working on this problem on LeetCode:
https://leetcode.com/problems/read-n-characters-given-read4/
The question reads:
Given a file and assume that you can only read the file using a given
method read4, implement a method to read n characters.
Method read4:
The API read4 reads 4 consecutive characters from the file, then
writes those characters into the buffer array buf4.
The return value is the number of actual characters read.
Note that read4() has its own file pointer, much like FILE *fp in C.
Definition of read4:
Parameter: char[] buf4
Returns: int
Note: buf4[] is destination not source, the results from read4 will be
copied to buf4[]
...
Method read:
By using the read4 method, implement the method read that reads n
characters from the file and store it in the buffer array buf.
Consider that you cannot manipulate the file directly.
The return value is the number of actual characters read.
Definition of read:
Parameters: char[] buf, int n
Returns: int
Note: buf[] is destination not source, you will need to write the
results to buf[]
I put together the following simple solution:
"""
The read4 API is already defined for you.
    @param buf4, a list of characters
    @return an integer
    def read4(buf4):
# Below is an example of how the read4 API can be called.
file = File("abcdefghijk") # File is "abcdefghijk", initially file pointer (fp) points to 'a'
buf4 = [' '] * 4 # Create buffer with enough space to store characters
read4(buf4) # read4 returns 4. Now buf = ['a','b','c','d'], fp points to 'e'
read4(buf4) # read4 returns 4. Now buf = ['e','f','g','h'], fp points to 'i'
read4(buf4) # read4 returns 3. Now buf = ['i','j','k',...], fp points to end of file
"""
class Solution:
    def read(self, buf, n):
        """
        :type buf: Destination buffer (List[str])
        :type n: Number of characters to read (int)
        :rtype: The number of actual characters read (int)
        """
        buf4 = ['']*4
        c = 1
        while n > 0 and c > 0:
            c = read4(buf4)
            if c:
                if n >= c:
                    buf += buf4
                elif n < c:
                    buf += buf4[:n]
            n -= c
        return len(buf)
When I use "+=" to add the contents of buf4 to buf I get a newline character in my output, as in the following example:
"
abc"
If I instead write buf = buf + buf4, I get just the newline character, like so:
"
"
Does anyone know what might be going on here? I know I could solve this problem by using for loops instead. I'm just curious to know what's going on.
I found this article that explains that "+=" and "c = c + b" use different special methods:
Why does += behave unexpectedly on lists?
However I don't think this explains the unexpected newline character. Does anyone know where this newline character is coming from?
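The difference that article describes can be seen in isolation with a minimal sketch. The function names below are hypothetical, and this does not reproduce the LeetCode judge or explain the newline; it only shows that `+=` mutates the list the caller passed in, while `buf = buf + chunk` builds a new list and rebinds the local name, leaving the caller's list untouched.

```python
def append_in_place(buf, chunk):
    # list.__iadd__ extends the existing list object, so the caller sees it.
    buf += chunk
    return len(buf)


def append_by_rebinding(buf, chunk):
    # list.__add__ creates a new list; only the local name `buf` points to it.
    buf = buf + chunk
    return len(buf)


caller_buf = ['x']
append_in_place(caller_buf, ['a', 'b'])
print(caller_buf)  # ['x', 'a', 'b']

caller_buf = ['x']
append_by_rebinding(caller_buf, ['a', 'b'])
print(caller_buf)  # ['x']
```

If the judge inspects the buffer object it handed to read, the second variant would leave that buffer unchanged, which would be consistent with no characters appearing in the output.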

Having Issues Concatenating Strings into list without \n - Python3

I am currently having some issues trying to append strings into a new list. However, when I get to the end, my list looks like this:
['MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH\nIIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT\nCGLANMFAYHTLGYEDLDELQKEPQPLVFVIELLQVDAPSDYQRETWNLSNHEKMKAVPV\nLHGEGNRLFKLGRYEEASSKYQEAIICLRNLQTKEKPWEVQWLKLEKMINTLILNYCQCL\nLKKEEYYEVLEHTSDILRHHPGIVKAYYVRARAHAEVWNEAEAKADLQKVLELEPSMQKA\nVRRELRLLENRMAEKQEEERLRCRNMLSQGATQPPAEPPTEPPAQSSTEPPAEPPTAPSA\nELSAGPPAEPATEPPPSPGHSLQH\n']
I'd like to remove the newlines somehow. I looked at other questions here, and most suggest using .rstrip; however, after adding that to my code I get the same output. What am I missing? Apologies if this question has been asked before.
My input looks like this (first 3 lines):
sp|Q9NZN9|AIPL1_HUMAN Aryl-hydrocarbon-interacting protein-like 1 OS=Homo sapiens OX=9606 GN=AIPL1 PE=1 SV=2
MDAALLLNVEGVKKTILHGGTGELPNFITGSRVIFHFRTMKCDEERTVIDDSRQVGQPMH
IIIGNMFKLEVWEILLTSMRVHEVAEFWCDTIHTGVYPILSRSLRQMAQGKDPTEWHVHT
from sys import argv
protein = argv[1] #fasta file
sequence = '' #string linker
get_line = False #False = not the sequence
Uniprot_ID = []
sequence_list =[]
with open(protein) as pn:
    for line in pn:
        line.rstrip("\n")
        if line.startswith(">") and get_line == False:
            sp, u_id, name = line.strip().split('|')
            Uniprot_ID.append(u_id)
            get_line = True
            continue
        if line.startswith(">") and get_line == True:
            sequence.rstrip('\n')
            sequence_list.append(sequence) #add the amino acids onto the list
            sequence = '' #resets the str
        if line != ">" and get_line == True: #if the first line is not a fasta ID and is it a sequence?
            sequence += line
print(sequence_list)
Per documentation, rstrip removes trailing characters – the ones at the end. You probably misunderstood others' use of it to remove \ns because typically those would only appear at the end.
To replace a character with something else in an entire string, use replace instead.
These commands do not modify your string! They return a new string, so if you want to change something 'in' a current string variable, assign the result back to the original variable:
>>> line = 'ab\ncd\n'
>>> line.rstrip('\n')
'ab\ncd' # note: this is the immediate result, which is not assigned back to line
>>> line = line.replace('\n', '')
>>> line
'abcd'
When I asked this question I hadn't taken the time to read the documentation and understand my code. After looking, I realized two things:
My code wasn't actually getting what I was interested in.
For the specific question I asked, I could have simply used line.strip() to remove the '\n'.
sequence = '' #string linker
get_line = False #False = not the sequence
uni_seq = {}
"""this block of code takes a uniprot FASTA file and creates a
dictionary with the key as the uniprot id and the value as a sequence"""
with open (protein) as pn:
    for line in pn:
        if line.startswith(">"):
            if get_line == False:
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                get_line = True
            else:
                uni_seq[u_id] = sequence
                sequence_list.append(sequence)
                sp, u_id, name = line.strip().split('|')
                Uniprot_ID.append(u_id)
                sequence = ''
        else:
            if get_line == True:
                sequence += line.strip() # removes the newline space
uni_seq[u_id] = sequence
sequence_list.append(sequence)
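The same idea can be sketched as a small function that takes any open text handle, which makes it easy to try without a file on disk. This is only an illustration under a few assumptions: the name read_fasta is hypothetical, header lines are assumed to start with ">" and to carry the ID as the second "|"-separated field, and the sample records are shortened or made up for the demonstration.

```python
import io


def read_fasta(handle):
    """Map each ID to its sequence, with the line breaks removed."""
    parts_by_id = {}
    current_id = None
    for line in handle:
        line = line.strip()  # strip() returns a new string, so assign it back
        if not line:
            continue
        if line.startswith(">"):
            # Header such as ">sp|Q9NZN9|AIPL1_HUMAN ..."; field 2 is the ID.
            current_id = line.split("|")[1]
            parts_by_id[current_id] = []
        elif current_id is not None:
            parts_by_id[current_id].append(line)
    return {uid: "".join(parts) for uid, parts in parts_by_id.items()}


# Shortened / made-up records, for illustration only.
sample = io.StringIO(
    ">sp|Q9NZN9|AIPL1_HUMAN shortened example\n"
    "MDAALLLNVEG\n"
    "IIIGNMFKLEV\n"
    ">sp|X00000|DEMO_HUMAN hypothetical record\n"
    "ACDEFG\n"
)
print(read_fasta(sample))
# {'Q9NZN9': 'MDAALLLNVEGIIIGNMFKLEV', 'X00000': 'ACDEFG'}
```

Collecting the stripped lines in a list and joining once at the end avoids the separate "flush the last record" step after the loop.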

Python3: counting and extracting letters out of a text

I'm doing some processing tasks on a medium-sized (1.7 MB) Persian text corpus. I want to make lists of three sets of characters in the text:
letters of the alphabet,
white space (including newline, tab, space, non-breaking space, etc.), and
punctuation.
I wrote this:
# -*- coding: utf8 -*-
TextObj = open ('text.txt', 'r', encoding = 'UTF8')
import string
LCh = LSpc = LPunct = []
TotalCh = TotalPunct = TotalSpc = 0
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
#TempSet variable holds alphabets of Persian language.
ReadObj = TextObj.read ()
for Char in ReadObj:
    if Char in TempSet: #This's supposed to count & extract alphabets only.
        TotalCh += 1
        LCh.append (Char)
    elif Char in string.punctuation: #This's supposed to count puncts.
        TotalPunct += 1
        LPunct.append (Char)
    elif Char in ('', '\n', '\t'): #This counts & extracts spacey things.
        TotalSpc += 1
        LSpc.append (Char)
    else: #This'll ignore anything else.
        continue
But when I try:
print (LPunct)
print (LSpc)
I tried this code on both Linux and Windows 7. On both, the result is not what I expected at all: the punctuation and whitespace lists both contain Persian letters.
Another question:
How can I improve the condition elif Char in ('', '\n', '\t'): so that it covers every kind of whitespace character?
You've bound all three names to the same list object, so every append goes into one shared list!
Don't do this:
LCh = LSpc = LPunct = []
Do this:
LCh = []
LSpc = []
LPunct = []
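A minimal sketch of why the chained assignment misbehaves (the short names a, b and c are just for illustration): all three names refer to one list object, so an append through any of them shows up under all of them.

```python
# Chained assignment: one list object, three names.
a = b = c = []
a.append('x')
print(b, c)         # ['x'] ['x']
print(a is b is c)  # True

# Separate list literals: three independent objects.
a, b, c = [], [], []
a.append('x')
print(b, c)         # [] []
print(a is b)       # False
```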
The string module has a whitespace constant built in:
elif Char in string.whitespace:
    TotalSpc += 1
    LSpc.append (Char)
In your example you didn't actually put a space in your '' string, which may also be causing it to fail. Shouldn't this be ' '?
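One caveat worth checking for this corpus: string.whitespace only lists the ASCII whitespace characters, so it does not contain the non-breaking space mentioned in the question. A small sketch, assuming str.isspace() is an acceptable definition of "space family" for your purposes (the helper name is hypothetical):

```python
import string


def is_unicode_space(ch):
    # str.isspace() follows the Unicode database, not just the ASCII set.
    return ch.isspace()


for ch in [' ', '\n', '\t', '\u00a0']:  # '\u00a0' is NO-BREAK SPACE
    print(repr(ch), ch in string.whitespace, is_unicode_space(ch))
# ' ' True True
# '\n' True True
# '\t' True True
# '\xa0' False True
```

Replacing the membership test with Char.isspace() would therefore also catch the non-breaking space.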
Also, take into account the other answer here, this code is not very pythonic.
I'd write it like this:
# -*- coding: utf8 -*-
import fileinput
import string
persian_chars = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
filename = 'text.txt'
persian_list = []
punctuation_list = []
whitespace_list = []
ignored_list = []
for line in fileinput.input(filename):
    for ch in line:
        if ch in persian_chars:
            persian_list.append(ch)
        elif ch in string.punctuation:
            punctuation_list.append(ch)
        elif ch in string.whitespace:
            whitespace_list.append(ch)
        else:
            ignored_list.append(ch)
total_persian, total_punctuation, total_whitepsace = \
    map(len, [persian_list, punctuation_list, whitespace_list])
First of all, the more Pythonic way to deal with files is to open them with a with statement, which closes the file at the end of the block.
Secondly, since you want to count the special characters in your text and keep them separately, you can use a dictionary with the list names as keys and the matching characters in a list as values. Then use the len function to get the counts.
And finally, to check membership in whitespace you can use the string.whitespace constant.
import string
TempSet = 'ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی'
# Create the lists up front so that append() never hits a missing key.
result_dict = {'TempSet': [], 'LPunct': [], 'LSpc': []}
with open ('text.txt', 'r', encoding = 'UTF8') as TextObj :
    ReadObj = TextObj.read ()
for ch in ReadObj :
    if ch in TempSet:
        result_dict['TempSet'].append(ch)
    elif ch in string.punctuation:
        result_dict['LPunct'].append(ch)
    elif ch in string.whitespace:
        result_dict['LSpc'].append(ch)
TotalSpc = len(result_dict['LSpc'])

Python read char in text until whitespace

I need to create a generator function that will read a word on request char by char from a text file. I'm aware of .split(), but I specifically need char by char until white space.
word = []
with open("text.txt", "r") as file:
    char = file.read(1)
    for char in file:
        if char != " ":
            word.append(char)
    file.close()
print(word)
This code does not exactly do what I want :( I don't have much experience in programming...
EDIT: I have a code like this now:
def generator():
    word = " "
    with open("text.txt", "r") as file:
        file.read(1)
        for line in file:
            for char in line:
                if char != " ":
                    word += char
    return(word)
def main():
    print(generator())
if __name__ == '__main__':
    main()
Now it does pretty much what I want: it collects the characters one by one, but it does not stop at the whitespace " ", so it prints out the whole text without any spaces. How can I make it stop before the whitespace and jump out of the function?
The following demonstrates a generator function that yields the individual words from the file as delimited by white-space, reading but one character at a time:
def generator(path):
    word = ''
    with open(path) as file:
        while True:
            char = file.read(1)
            if char.isspace():
                if word:
                    yield word
                    word = ''
            elif char == '':
                if word:
                    yield word
                break
            else:
                word += char
# Instantiate the word generator.
words = generator('text.txt')
# Print the very first word.
print(next(words))
# Print the remaining words.
for word in words:
    print(word)
If the text.txt file contains:
First word on first line.
Second line.
New paragraph.
then the above script outputs:
First
word
on
first
line.
Second
line.
New
paragraph.
It should be noted that the above generator function is unnecessarily complicated, due to the dubitable constraint that the file be read one character at a time. The much more "pythonic" implementation, yielding the same results, would be this:
def generator(path):
    with open(path) as file:
        for line in file:
            for word in line.split():
                yield word
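If the one-character-at-a-time constraint has to stay, the read loop can still be shortened a little. The following is a sketch rather than a drop-in replacement: the name words_from is hypothetical, and it takes an already open handle (here an io.StringIO) instead of a path so that it can be tried without a file on disk.

```python
import io


def words_from(handle):
    """Yield whitespace-delimited words, reading one character at a time."""
    word = ''
    # iter(callable, sentinel) keeps calling read(1) until it returns ''.
    for char in iter(lambda: handle.read(1), ''):
        if char.isspace():
            if word:
                yield word
                word = ''
        else:
            word += char
    if word:  # flush the last word when the text does not end in whitespace
        yield word


sample = io.StringIO("First word on first line.\nSecond line.\n\nNew paragraph.")
words = words_from(sample)
print(next(words))  # First
print(list(words))
# ['word', 'on', 'first', 'line.', 'Second', 'line.', 'New', 'paragraph.']
```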
for char in file:
This line reads a whole line, not a character.
So you need to do something like this:
for line in file:
    for char in line:

Having trouble with str.find()

I'm trying to use the str.find() and it keeps raising an error, what am I doing wrong?
import codecs
def countLOC(inFile):
    """ Receives a file and then returns the amount
    of actual lines of code by not counting commented
    or blank lines """
    LOC = 0
    for line in inFile:
        if line.isspace():
            continue
        comment = line.find('#')
        if comment > 0:
            for letter in range(comment):
                if not letter.whitespace:
                    LOC += 1
                    break
    return LOC
if __name__ == "__main__":
    while True:
        file_loc = input("Enter the file name: ").strip()
        try:
            source = codecs.open(file_loc)
        except:
            print ("**Invalid filename**")
        else:
            break
    LOC_count = countLOC(source)
    print ("\nThere were {0} lines of code in {1}".format(LOC_count,source.name))
Error
File "C:\Users\Justen-san\Documents\Eclipse Workspace\countLOC\src\root\nested\linesOfCode.py", line 12, in countLOC
comment = line.find('#')
TypeError: expected an object with the buffer interface
Use the built-in function open() instead of codecs.open().
You're running afoul of the difference between non-Unicode (Python 3 bytes, Python 2 str) and Unicode (Python 3 str, Python 2 unicode) string types. Python 3 won't convert automatically between non-Unicode and Unicode like Python 2 will. Using codecs.open() without an encoding parameter returns an object which yields bytes when you read from it.
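The type mismatch can be reproduced without any file, as a minimal sketch (the sample line is made up, and the exact wording of the error message varies between Python 3 versions): calling find with a str argument on a bytes object raises TypeError, while matching types work.

```python
raw_line = b"x = 1  # set x\n"   # what a binary-mode file yields

try:
    raw_line.find('#')           # str needle, bytes haystack
except TypeError as exc:
    print("TypeError:", exc)

print(raw_line.find(b'#'))       # 7: bytes needle works on bytes

text_line = raw_line.decode('utf-8')
print(text_line.find('#'))       # 7: str needle works on str
```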
Also, your countLOC function won't work:
for letter in range(comment):
    if not letter.whitespace:
        LOC += 1
        break
That for loop will iterate over the numbers from zero to one less than the position of '#' in the string (letter = 0, 1, 2...); whitespace isn't a method of integers, and even if it were, you're not calling it.
Also, you're never incrementing LOC if the line doesn't contain #.
A "fixed" but otherwise faithful (and inefficient) version of your countLOC:
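def countLOC(inFile):
    LOC = 0
    for line in inFile:
        if line.isspace():
            continue
        comment = line.find('#')
        if comment > 0:
            # Count the line only if something other than whitespace precedes '#'.
            for letter in line[:comment]:
                if not letter.isspace():
                    LOC += 1
                    break
        elif comment < 0:
            # find() returned -1: no '#' at all, so this is a line of code.
            LOC += 1
    return LOC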
How I might write the function:
def count_LOC(in_file):
    loc = 0
    for line in in_file:
        line = line.lstrip()
        if len(line) > 0 and not line.startswith('#'):
            loc += 1
    return loc
Are you actually passing an open file to the function? Maybe try printing type(file) and type(line), as there's something fishy here -- with an open file as the argument, I just can't reproduce your problem! (There are other bugs in your code but none that would cause that exception). Oh btw, as best practice, DON'T use names of builtins, such as file, for your own purposes -- that causes incredible amounts of confusion!
