How to account for unexpected data when trying to split values - python-3.x

I have the following code snippet which is part of a larger chunk of code to extract image filenames from links.
for a in soup.find_all('a', href=True):
    url = a['href']
    path, file = url.rsplit('/', 1)
    name, ext = file.rsplit('.', 1)
It works very well, however on occasion the data (which comes from an external source) will have errors.
Specifically, the last line in the snippet above will throw an error that:
name, ext = file.rsplit('.', 1)
ValueError: not enough values to unpack (expected 2, got 1)
What is the best way to ignore this error (or lines containing input not as expected) and continue on to the next entry?
I would have thought a try and catch is the right approach here, but upon googling how to do that with this type of error I did not find anything.
Is it possible to use a try block to catch this type of error? If not, why not, and what is the better approach?

Assuming all you need is to ignore the error, this try/except style should work for you:
for item in ['a.b.c', 'a.b', 'a', 'a.b.c']:
    try:
        path, file = item.rsplit('.', 1)
        print("%s, %s" % (path, file))
    except ValueError:
        print("error with %s" % item)
        continue
    print("more work here!")
which gives the output:
a.b, c
more work here!
a, b
more work here!
error with a
a.b, c
more work here!
Of course, this may not be the best solution to use, depending on the greater context of what you are trying to do. Is it safe to just ignore the files with no extensions?
In particular, you should generally try to sanitize incoming data as much as possible before processing it, though this is a relatively trivial example and it's likely that sanitizing the data would be just as expensive as doing this particular split. Put another way, user input being dirty isn't really an "exceptional" condition.
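As a rough sketch of how the try/except could look in the asker's setting, the function below applies both splits to a list of URLs and skips any entry that does not unpack cleanly. The function name split_names and the example URLs are made up for illustration, and a plain list of strings stands in for the hrefs that soup.find_all would yield.

```python
def split_names(urls):
    """Return (name, ext) pairs, skipping URLs that lack a '/' or a '.'."""
    results = []
    for url in urls:
        try:
            path, file = url.rsplit('/', 1)
            name, ext = file.rsplit('.', 1)
        except ValueError:
            # Either split produced fewer than two parts: skip this entry.
            continue
        results.append((name, ext))
    return results


# Hypothetical input data, not taken from the question.
urls = [
    'http://example.com/img/cat.jpg',
    'http://example.com/img/README',         # no extension
    'nofolder.png',                          # no '/'
    'http://example.com/img/dog.small.png',
]
pairs = split_names(urls)
print(pairs)
```

Keeping both splits inside one try block means either kind of malformed entry is skipped in the same way; if the two cases need different handling, separate try blocks would be clearer.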

I would not use a try-except in this case, since you have no use for the except part: you're not going to process the file if you do encounter an error. Feel free to read up on try-except; there are plenty of questions on Stack Overflow about it to help you decide what will work best for you.
It sounds like you don't understand the error. It occurs because you have a filename without an extension, so rsplit returns a list with only one value. For example:
file = 'babadabooey'
print(file.rsplit('.', 1))
Out: ['babadabooey']
So if you try to unpack that into two values, you're going to get an error. I assume, most of the time you are expecting something like
file = 'babadabooey.exe'
print(file.rsplit('.', 1))
Out: ['babadabooey', 'exe']
So if you try to unpack that value into two values, you're fine. I would proceed with an if statement, so that you only try to split when '.' is in the file variable.
if '.' in file:
    name, ext = file.rsplit('.', 1)
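Another way to avoid the unpacking error, offered here only as a sketch, is str.rpartition, which always returns three parts, so the unpacking itself cannot fail. The helper name split_ext is hypothetical, and returning None for a missing extension is just one possible convention.

```python
def split_ext(file):
    """Split a filename at the last '.', returning (name, ext).

    ext is None when the filename contains no '.' at all.
    """
    name, dot, ext = file.rpartition('.')
    if not dot:
        # No separator found: rpartition put the whole string in `ext`.
        return file, None
    return name, ext


print(split_ext('babadabooey'))
print(split_ext('babadabooey.exe'))
```

The caller can then decide what to do with entries whose extension is None, for example skip them with continue.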

Related

Ask a question about a strange Python3 list index error

I am new to Python 3 and I want to get the suffixes of all the files in a directory using:
dir_files = set(map(lambda f: f.split(sep='.')[1], os.listdir()))
but it fails with an error:
IndexError: list index out of range
However, if I change [1] to [0], I get all the filenames correctly.
Why is that? Please help me.
If you want the suffixes of all documents, you should take a different approach. With this one, your program will fail for filenames like these:
my_doc - This file has no suffix, so split returns the single-element list ['my_doc'], and indexing it with [1] raises an IndexError.
my.doc.txt - Since the name contains more than one '.', split returns ['my', 'doc', 'txt']. Here your code gives doc as the suffix even though the real suffix is txt.
There are more problems, because os.listdir() also lists directories and hidden files, but I won't go into them since I don't know the details of your task.
This is one possible solution that will work in most cases (not all cases):
dir_files = set(map(lambda f: os.path.splitext(f)[1][1:].lower(), os.listdir())) - {''}
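To see how that one-liner treats the problem cases, here is a small sketch that runs the same os.path.splitext logic on a fixed list of names instead of os.listdir(). The function name suffixes and the sample filenames are illustrative only.

```python
import os


def suffixes(names):
    """Collect lower-cased suffixes (without the dot) from filenames."""
    found = {os.path.splitext(name)[1][1:].lower() for name in names}
    # Names without a suffix contribute an empty string; drop it.
    return found - {''}


# Hypothetical filenames covering the cases discussed above.
names = ['my_doc', 'my.doc.txt', 'photo.JPG', '.bashrc']
print(suffixes(names))
```

Note that os.path.splitext treats a leading dot as part of the name, so a hidden file such as .bashrc yields no suffix here.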

Regex .match() not finding matches even though it should

I'm trying to scan a .txt file in a node.js script, and scan its contents for certain pieces of data. The lines I'm interested in getting look mostly like this:
DIBH91643 5/10/2019 108,75
SIR108811 5/10/2019 187,50
SIR108845 5/10/2019 63,75
So I've been trying to match them with a regex without success. Using a regex testing site, I've even confirmed that it should find the matches I'm looking for, but it always returns null when I call data.match(regex). I'm probably missing something basic here, but I can't figure it out for the life of me. This is the code I'm using (in its entirety, since there isn't much):
var fs = require('fs');
let regex = /\w*?(\d+)\s+(\d+\/\d+\/\d+)\s+(\-{0,1}\d+\,\d+)/g;
let ihateregex = /91/g;
fs.readFile('pathToFile/fileToRead.txt', {encoding: 'utf-8'}, (err, data) => {
    var result = data.match(regex);
    console.log(result);
});
As shown, even an attempt with a simple pattern that is definitely inside the file still returns null. I have looked into other answers here for similar problems, and they all point to deleting bytes from the beginning of the file. I have used vim -b to delete the first 2 bytes - which did look out of place and furthermore printing the entire data with console.log() did actually show 2 weird characters in the beginning of the file, but I get the exact same error.
I can't figure out what I'm missing here.
Try the following regex:
/^[A-Z]*(\d+)\s+(\d+\/\d+\/\d+)\s+(-?\d+,\d+)/gm
Improvements compared to your regex:
^ - start from the start of line,
[A-Z]* instead of \w*? - note that \w matches also digits,
removed \ in front of - and , (neither needs escaping here),
? instead of {0,1},
added m option (I assume that you want to process all rows, not the first only).
To process the matches I used the following code, using rextester.com, so
instead of e.g. console.log(...) it contains print(...):
let data = 'DIBH91643 5/10/2019 108,75\nSIR108811 5/10/2019 187,50\nSIR108845 5/10/2019 63,75'
print("Data: ")
print(data)
let re = /^[A-Z]*(\d+)\s+(\d+\/\d+\/\d+)\s+(-?\d+,\d+)/gm
print("Result: ")
while ((matches = re.exec(data)) != null) {
    print(matches[1], '_', matches[2], '_', matches[3])
}
For a working example see https://rextester.com/PZU21213
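For readers who want to check the pattern outside JavaScript, the following is a rough Python equivalent. It assumes the same three sample lines; re.MULTILINE plays the role of the m flag, and re.findall stands in for the repeated exec calls.

```python
import re

# Same sample rows as in the JavaScript example above.
data = (
    'DIBH91643 5/10/2019 108,75\n'
    'SIR108811 5/10/2019 187,50\n'
    'SIR108845 5/10/2019 63,75'
)

# MULTILINE makes ^ match at the start of every line, like the m flag.
pattern = re.compile(r'^[A-Z]*(\d+)\s+(\d+/\d+/\d+)\s+(-?\d+,\d+)', re.MULTILINE)

matches = pattern.findall(data)
for number, date, amount in matches:
    print(number, '_', date, '_', amount)
```

In Python the forward slash needs no escaping, since it is not a regex delimiter there.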
So I've finally figured out what went wrong and I feel extremely stupid for taking so long to figure it out. One thing I've failed to mention even though I should have is that the file I'm reading is one created by an OCR program. An OCR program which, apparently, added an invisible char between each character in the text file, that I only saw when I switched to php (fopen(), fgets(), fclose()) and looked at the source of the page I made.
Once I copied the contents of fileToRead.txt into a newly created fileToRead2.txt (simple copy-paste), it worked perfectly.
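The post does not say which invisible character the OCR program inserted, so the following is only a sketch of one way to look for and strip such characters before matching. It is written in Python, the helper name strip_invisible is made up, and the sample string with NUL characters between letters is an assumption about what the file might have contained.

```python
def strip_invisible(text):
    """Drop non-printable characters, keeping newlines and tabs."""
    return ''.join(ch for ch in text if ch.isprintable() or ch in '\n\t')


# Hypothetical damaged input: a NUL character after every visible one.
damaged = 'D\x00I\x00B\x00H\x009\x001\x00'
print(repr(damaged))            # repr() makes the hidden characters visible
cleaned = strip_invisible(damaged)
print(cleaned)
```

Printing the repr of the raw data is often the quickest way to spot this kind of problem, because escapes such as \x00 or \u200b show up explicitly. If every other character turns out to be a NUL, the file may simply be in a different encoding than the one it was read with, in which case reading it with the right encoding is a cleaner fix than stripping.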

Python3 - "ValueError: not enough values to unpack (expected 3, got 1)"

I'm very new to Python and programming overall, so if I seem to struggle to understand you, please bear with me.
I'm reading "Learn Python 3 the Hard Way", and I'm having trouble with exercise 23.
I copied the code to my text editor and ended up with this:
import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<====>", cooked_string)

languages = open("languages.txt", encoding = "utf-8")
main(languages, input_encoding, error)
When I tried to run it I got the following error message:
Traceback (most recent call last):
  File "pag78.py", line 3, in <module>
    script, input_encoding, error = sys.argv
ValueError: not enough values to unpack (expected 3, got 1)
which I am having difficulties understanding in this context.
I googled the exercise to compare it with something other than the book page and, if I'm not missing something, I copied it correctly. For example, see this code here for the same exercise.
Obviously something is wrong with this code, and I'm not able to identify what it is.
Any help would be greatly appreciated.
When you run the program, you have to enter your arguments into the command line. So run the program like this:
python ex23.py utf-8 strict
Copy and paste all of that into your terminal to run the code. This exercise uses argv like others do. It says this in the chapter, just a little bit later. I think you jumped the gun on running the code before you got to the explanation.
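To make the failure mode easier to see, here is a small sketch that checks the number of arguments before unpacking and exits with a usage message instead of a ValueError. The function name unpack_args and the wording of the message are illustrative; the lists passed in stand in for what sys.argv would hold.

```python
def unpack_args(argv):
    """Unpack [script, input_encoding, error], or exit with a usage hint."""
    if len(argv) != 3:
        raise SystemExit("usage: python ex23.py <input_encoding> <error>")
    script, input_encoding, error = argv
    return input_encoding, error


# What sys.argv looks like for `python ex23.py utf-8 strict`.
good = unpack_args(['ex23.py', 'utf-8', 'strict'])
print(good)

# What sys.argv looks like when the script is run with no arguments.
try:
    unpack_args(['ex23.py'])
except SystemExit as exc:
    message = str(exc)
    print(message)
```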
Let's record this in an answer for the sake of posterity. In short, the immediate problem lies not so much in the script itself as in how it is being called. No positional arguments were given, but two were expected, to be assigned to input_encoding and error.
This line:
script, input_encoding, error = sys.argv
takes the list of arguments passed to the script (sys.argv) and unpacks it, that is, assigns its items' values to the variables on the left. This assumes that the number of variables to unpack into matches the number of items in the list on the right.
sys.argv contains the name of the script called, followed by the additional arguments passed to it, one item each.
This construct is actually a very simple way to ensure that the expected number of arguments is provided, even though the resulting error is perhaps not the most obvious.
Later on, you certainly should check out argparse for handling of passed arguments. It is comfortable and quite powerful.
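As a minimal sketch of what the argparse version might look like, the parser below declares the same two positional arguments. The argument names mirror the variables in the exercise, the description text is made up, and the argument list is passed in explicitly so the example runs without a command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments the exercise expects."""
    parser = argparse.ArgumentParser(description="Print encoded language names.")
    parser.add_argument("input_encoding", help="e.g. utf-8")
    parser.add_argument("error", help="error handling scheme, e.g. strict")
    # With argv=None, argparse reads sys.argv[1:] itself.
    return parser.parse_args(argv)


args = parse_args(["utf-8", "strict"])
print(args.input_encoding, args.error)
```

When an argument is missing, argparse prints a usage message naming it and exits, which tends to be easier to read than the unpacking error.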
I started reading LPTHW a couple of weeks ago and got the same error as 'micaldras'. The error arises because you have probably clicked the file link and opened an Internet Explorer window. From there (I guess) you copied the text into a Notepad file and saved it.
I did that as well and got the same errors. I then downloaded the file directly from the indicated link (right-click on the file and choose Save Target As). This saves the file exactly as Zed intended, and the program now runs.

Comparing strings in python 2.7

This is my code:
for films in filmlist:
    with codecs.open('peliculas.txt', encoding='utf8', mode='r') as lfile:
        filmsDone = lfile.read()
        filmsDoneList = filmsDone.split(',')
    if films not in filmsDoneList:
        with codecs.open('peliculas.txt', encoding='utf8', mode='a+') as lfile:
            lfile.write(films.strip() + ',')
It will never recognize the last item of the list.
I have printed filmsDoneList and the last item in PyCharm looks like this: u'X Men.Primera Generacion'. I have printed films and they look like this: X Men.Primera Generacion'
So I have no idea where the problem is. Thanks in advance.
@Rafa, to explain better what I meant in the comments, I had to write a full answer so that I could attach code and screenshots.
Let's say the peliculas.txt file has the following format:
You can import such a file in Python with the following 3 commands:
fileIN=open('peliculas.txt','r')
filmsDoneList=fileIN.readlines()
fileIN.close()
So you basically open the file, import each line thanks to readlines() and then close the file because its contents are available in filmsDoneList. The latter has the following contents (in PyCharm):
Obviously this list is quite long and does not fit in my screen, but you get the point.
You can now get rid of that annoying newline tag '\r\n' by means of the following loop:
for id in range(len(filmsDoneList)):
    filmsDoneList[id]=filmsDoneList[id].strip()
and now filmsDoneList has the form:
much better now, innit?
Now, let's say you want to add the following films:
newFilms=['The Exorcist','Back to the Future','Aliens','Back to the Future']
To make your code more robust, I have added Back to the Future twice. Basically you can get rid of duplicates in newFilms by means of the set() function. This will convert newFilms in a set with duplicates removed, but we will convert it back to a list thanks to this command:
newFilms=list(set(newFilms))
and now newFilms has the form:
Now that everything has been sorted, it's time to check if items in newFilms already are in filmsDoneList which, recall, is the contents of peliculas.txt.
Reopen peliculas.txt as follows:
fileOUT=open('peliculas.txt','a')
the 'a' tag means "append", so basically everything you write will be added to the file without removing anything from it.
And the main loop goes:
for film in newFilms:
    if film in filmsDoneList:
        pass
    else:
        fileOUT.write(film+'\n')
The pass means "do nothing". The write command also appends the newline character to the movie title: this keeps the previous format of 1 title per line. At the end of this loop you should close fileOUT.
The resulting peliculas.txt is
and, as you can see, Back to the Future was in newFilms but wasn't appended to the end of this file because it was already in it. The Exorcist and Aliens, on the other hand, have been appended at the bottom of the file.
If your file has titles separated by commas, this approach is still valid. However you must add
filmsDoneList=filmsDoneList[0].split(',')
after the first for loop. Also, in the write call (in the last for loop) you might want to replace the newline character with a comma.
I reckon this approach is cleaner, will fix the problem you've been having, and avoids repeatedly opening and closing the file inside a loop. Hope this helps!
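Putting the steps together, a compact sketch could look like the following. It assumes Python 3 rather than 2.7 and the one-title-per-line format; the function name append_new_films and the temporary file are there only so the example runs on its own. Titles are stripped before comparison, because stray whitespace makes an otherwise identical title look new.

```python
import os
import tempfile


def append_new_films(path, new_films):
    """Append titles not yet in the file; return the titles actually added."""
    with open(path, encoding='utf-8') as f:
        done = {line.strip() for line in f if line.strip()}
    added = []
    with open(path, 'a', encoding='utf-8') as f:
        for film in new_films:
            title = film.strip()  # compare the same form that gets written
            if title and title not in done:
                f.write(title + '\n')
                done.add(title)   # also guards against duplicates in new_films
                added.append(title)
    return added


# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'peliculas.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Back to the Future\nX Men.Primera Generacion\n')
    added = append_new_films(
        path,
        ['The Exorcist', 'Back to the Future', 'Aliens',
         'Back to the Future', ' X Men.Primera Generacion'],
    )
    with open(path, encoding='utf-8') as f:
        final = f.read().splitlines()

print(added)
print(final)
```

Using a set for the existing titles keeps each membership check cheap and, unlike list(set(newFilms)), preserves the order in which new titles were given.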

Same for loop, giving out two different results using .write()

This is my first time asking a question, so let me know if I am doing something wrong (post-wise).
I am trying to create a function that writes into a .txt file, but I seem to get two very different results between calling it from within a module and writing the same loop in the shell directly. The code is as follows:
def function(para1, para2): #para1 is a string that I am searching for within para2. para2 is a list of strings
    with open("str" + para1 + ".txt", 'a', encoding = 'utf-8') as file:
        #opens a file with certain naming convention
        n = 0
        for word in para2:
            if word == para1:
                file.write(para2[n-1]+'\n')
                print(para2[n-1]) #intentionally included as part of debugging
            n+=1
function("targetstr", targettext)
#targetstr is the phrase I am looking for, targettext is the tokenized text I am
#looking through. This is in the form of a list of strings, that is the output of
#another function, and has already been 'declared' as a variable
When I define this function in the shell, I get the correct words appearing. However, when I call this same function through a module (in the shell), nothing appears in the shell, and the text file shows a bunch of numbers (e.g. 's93161) and no new lines.
I have even gone to the extent of including a print statement right after the declaration of the function in the module, and commented out everything but the print statement, and yet nothing appears in the shell when I call it. However, the numbers still appear in the text file.
I am guessing that there is a problem with how I have defined the parameters or how I am passing the arguments when I call the function.
As a reference, here is the desired output:
‘She
Ashley
there
Kitty
Coates
‘Let
let
that
PS: Sorry if this is not very clear, as I have very limited knowledge of Python terminology.
I have found the solution to the issue. It turns out that I need to close the shell and restart everything before the changes made to the function in the module are picked up. Thanks to those who took a look at the issue, and those who tried to help.
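Restarting works because an imported module is cached for the lifetime of the session. As an alternative to restarting, importlib.reload can re-execute a module that has changed on disk. The sketch below shows this with a throwaway module; the module name demo_mod and its contents are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_mod.py"  # hypothetical module
    module_path.write_text("def greet():\n    return 'old'\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    import demo_mod

    first = demo_mod.greet()

    # Edit the module on disk, as one would in an editor.
    module_path.write_text("def greet():\n    return 'new version'\n")

    # Without this call, the session keeps using the old definition.
    importlib.reload(demo_mod)
    second = demo_mod.greet()

    sys.path.remove(tmp)

print(first, second)
```

Names imported with from module import function keep pointing at the old object even after a reload, so calling through the module (demo_mod.greet()) or restarting the shell remains the safer habit.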
