Proper way to use HTMLParser.getpos()? - python-3.x

What's the proper way to make use of Python 3's html.parser's getpos() method?
I used the following example to explore a subset of html.parser methods:
https://docs.python.org/3/library/html.parser.html#examples
My copy-and-pasted demo program works. But now I want to use the html.parser's getpos() method to acquire a tag's line number and offset.
After numerous experiments, including trying to add a separate def getpos() method to the class given in the example (nothing at all was output), the only way I've been able to make getpos() return its line number and offset tuple is by inserting one line of (what seems to me to be) clumsy and ugly code per line 4 of the following snippet:
from html.parser import HTMLParser
...
class FlareTopicParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Following line inserted by me into class's examples.
        print(" Line, offset ==", HTMLParser.getpos(self))
        # This working code from examples per
        # https://docs.python.org/3/library/html.parser.html#examples
        print(" Start tag:", tag)
        for attr in attrs:
            print(" attr:", attr)
That works -- to give but one example, for the zero-indented start tag on line 5 of the HTML input file it prints:
Line, offset == (5, 0)
But the HTMLParser.getpos(self) construction in line 4 of the example code seems (to this only-occasional Python 3 coder) clumsy and wrong.
What's the correct, or if you will better, way to use getpos()?

There is no need to override getpos in your parser; it is inherited from HTMLParser, so you can call it on self. I suggest rewriting line 4 as follows:
(line, column) = self.getpos()
print("line %d column %d" % (line, column))
With such a call to getpos() you can also use line or column independently.

Here's the way to use getpos():
row, col = parser.getpos()
html.splitlines()[row-1][col:col+100]
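To tie the two answers together, here is a minimal sketch of a parser that records the position of every start tag through self.getpos(). The class name PositionParser, the positions attribute and the sample markup are made up for illustration; only HTMLParser, handle_starttag, feed, close and getpos come from the standard library.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical parser that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() is inherited, so it is called on self.
        # The line number is 1-based, the offset is 0-based.
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))


# Made-up sample markup
source = "<html>\n  <body>\n<p class='x'>hi</p>\n  </body>\n</html>"

parser = PositionParser()
parser.feed(source)
parser.close()

for tag, line, offset in parser.positions:
    # Show the text found at the reported position
    print(tag, line, offset, source.splitlines()[line - 1][offset:offset + 20])
```

Because the line number starts at 1 while Python lists start at 0, the lookup into splitlines() uses line - 1, as in the answer above.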

Related

python: How to read a file and store each line using map function?

I'm trying to reconvert a program that I wrote but getting rid of all for loops.
The original code reads a file with thousands of lines that are structured like:
Ex. 2 lines of a file:
As you can see, the first line starts with LPPD;LEMD and the second line starts with DAAE;LFML. I'm only interested in the very first and second element of each line.
The original code I wrote is:
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
import time
# START
# #time=0
start = time.time()
# Defining default program argument
if len(sys.argv)==1:
    fileName = "file.txt"
else:
    fileName = sys.argv[1]
takeOffAirport = []
landingAirport = []
# Reading file
lines = 0 # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            landingAirport.append(destination)
            takeOffAirport.append(origin)
            lines += 1
except IOError:
    print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPOT_NAME':[num_takeOffs, num_landings]
    airports_dict[key].append(value)
# Sum key values and add it as another value
for key, value in airports_dict.items():
    #'AIRPOT_NAME':[num_totalMovements, num_takeOffs, num_landings]
    airports_dict[key] = [sum(value),value]
# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
                       key=lambda kv:kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)
# Print results
print("\nAIRPORT"+ "\t\t#TOTAL_MOVEMENTS"+ "\t#TAKEOFFS"+ "\t#LANDINGS")
for k in airports_dict:
    print(k,"\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])
# #time=1
end = time.time()- start
print("\nAlgorithm execution time: %0.5f" % end)
print("Total number of lines read in the file: %u\n" % lines)
airports_dict.clear()
takeOffAirport.clear()
landingAirport.clear()
My goal is to simplify the program using map, reduce and filter. So far I have sorted out the creation of the two independent lists, one with the first element of each file line and another with the second element of each file line, by using:
# Creates two independent lists with the first and second element from each line
takeOff_Airport = list(map(lambda sub: (sub[0].split(';')[0]), lines))
landing_Airport = list(map(lambda sub: (sub[0].split(';')[1]), lines))
I was hoping to find a way to open the file and achieve the exact same result as the original code by opening the file through a map() function, so I could pass its lines to the maps defined above, takeOff_Airport and landing_Airport.
So if we have a file as such
line 1
line 2
line 3
line 4
and we do like this
open(file_name).read().split('\n')
we get this
['line 1', 'line 2', 'line 3', 'line 4', '']
Is this what you wanted?
Edit 1
I feel this is somewhat redundant, but since map applies a function to each element of an iterable, we have to put our file name in a list, and of course define our function:
def open_read(file_name):
    return open(file_name).read().split('\n')
print(list(map(open_read, ['test.txt'])))
This gets us
>>> [['line 1', 'line 2', 'line 3', 'line 4', '']]
So first off, calling split('\n') on each line is silly; the line is guaranteed to have at most one newline, at the end, and nothing after it, so you'd end up with a bunch of ['all of line', ''] lists. To avoid the empty string, just strip the newline. This won't leave each line wrapped in a list, but frankly, I can't imagine why you'd want a list of one-element lists containing a single string each.
So I'm just going to demonstrate using map+strip to get rid of the newlines, using operator.methodcaller to perform the strip on each line:
from operator import methodcaller
def readFile(fileName):
    try:
        with open(fileName) as file:
            return list(map(methodcaller('strip', '\n'), file))
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
Sadly, since your file is context managed (a good thing, just inconvenient here), you do have to listify the result; map is lazy, and if you didn't listify before the return, the with statement would close the file, and pulling data from the map object would die with an exception.
To get around that, you can implement it as a trivial generator function, so the generator context keeps the file open until the generator is exhausted (or explicitly closed, or garbage collected):
def readFile(fileName):
    try:
        with open(fileName) as file:
            yield from map(methodcaller('strip', '\n'), file)
    except IOError:
        print ("\n\033[0;31mIoError: could not open the file:\033[00m %s" %fileName)
yield from will introduce a tiny amount of overhead over directly iterating the map, but not much, and now you don't have to slurp the whole file if you don't want to; the caller can just iterate the result and get a stripped line on each iteration without pulling the whole file into memory. It does have the slight weakness that opening the file will be done lazily, so you won't see the exception (if there is any) until you begin iterating. This can be worked around, but it's not worth the trouble if you don't really need it.
I'd generally recommend the latter implementation as it gives the caller flexibility. If they want a list anyway, they just wrap the call in list and get the list result (with a tiny amount of overhead). If they don't, they can begin processing faster, and have much lower memory demands.
Mind you, this whole function is fairly odd; replacing IOErrors with prints and (implicitly) returning None is hostile to API consumers (they now have to check return values, and can't actually tell what went wrong). In real code, I'd probably just skip the function and insert:
with open(fileName) as file:
    for line in map(methodcaller('strip', '\n'), file):
        # do stuff with line (with newline pre-stripped)
inline in the caller; maybe define strip_newline = methodcaller('strip', '\n') globally to use a friendlier name. It's not that much code, and I can't imagine that this specific behavior is needed in that many independent parts of your file, and inlining it removes the concerns about when the file is opened and closed.
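As a rough sketch of how the generator approach could feed the rest of the original program without explicit for loops over the file, the example below maps each stripped line to its first two ';'-separated fields and counts them. The helper names read_stripped and first_two_fields, the temporary file and the three sample lines are assumptions made for illustration; the question's real data was not shown.

```python
import os
import tempfile
from collections import Counter
from operator import methodcaller


def read_stripped(file_name):
    """Lazily yield each line of file_name without its trailing newline."""
    with open(file_name) as file:
        yield from map(methodcaller('strip', '\n'), file)


def first_two_fields(line):
    """Return (origin, destination) from a ';'-separated line."""
    origin, destination = line.split(';')[:2]
    return origin, destination


# Made-up sample data in the "ORIGIN;DESTINATION;..." layout
sample = "LPPD;LEMD;x\nDAAE;LFML;y\nLPPD;LFML;z\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)

try:
    # list() exhausts the generator, which also closes the file
    pairs = list(map(first_two_fields, read_stripped(tmp.name)))
    takeoffs = Counter(origin for origin, _ in pairs)
    landings = Counter(destination for _, destination in pairs)
finally:
    os.remove(tmp.name)

print(takeoffs)
print(landings)
```

Unlike the readFile versions above, this sketch lets an OSError propagate to the caller instead of printing a message.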

IndentationError within the timeit function

I want to measure the processing time of part of my code, so I used the timeit function. However, it raises an IndentationError from inside timeit.
Here is my code;
for stem, result in zip(stem_dirs, result_dirs):
    code_to_measure = '''
    print(stem, '\n', result)
    subprocess.call(['python', './a.py', "--dir_in", stem, "--dir_out", result])
    '''
    proccess_time = timeit.timeit(code_to_measure)
    print(proccess_time)
Here is the error I get;
Traceback (most recent call last):
  File "code_test.py", line 115, in <module>
    proccess_time = timeit.timeit(code_to_measure)
  File "/usr/local/lib/python3.6/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/usr/local/lib/python3.6/timeit.py", line 123, in __init__
    compile(stmtprefix + stmt, dummy_src_name, "exec")
  File "<timeit-src>", line 3
    print(stem, '
    ^
IndentationError: unexpected indent
However, the timeit function in the code below still runs properly;
# importing the required module
import timeit
# code snippet to be executed only once
mysetup = "from math import sqrt"
# code snippet whose execution time is to be measured
mycode = '''
def example():
    mylist = []
    for x in range(100):
        mylist.append(sqrt(x))
'''
# timeit statement
print(timeit.timeit(setup = mysetup,
                    stmt = mycode,
                    number = 10000))
Here is the output of the code;
0.002189640999858966
I am not too sure how to solve the issue. Please advise me if you have any suggestion or solutions on this issue.
Thank you so much in advance.
Bit of a late reply but I ran into the same issue.
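timeit does accept multi-line statement strings, but the statements inside the string must not carry extra leading indentation; here the lines inside the triple-quoted string are indented, which is what raises the IndentationError. In addition, '\n' inside a non-raw string becomes a real line break in the generated source. A simple way to avoid both problems is to put the statements on one line, separated with a ;.
For your code it would look something like this:
for stem, result in zip(stem_dirs, result_dirs):
    code_to_measure = f"print({stem!r}, '\\n', {result!r}); subprocess.call(['python', './a.py', '--dir_in', {stem!r}, '--dir_out', {result!r}])"
    proccess_time = timeit.timeit(code_to_measure, setup="import subprocess", number=1)
    print(proccess_time)
(The variables are inserted via the format string because the timed statement does not see your local variables; setup imports subprocess for the same reason, and number=1 keeps the command from being run the default 1,000,000 times.)
The reason why the second timeit call in your question runs is that it does not actually execute the statements in the function. All it does is create the function, which also explains why it is so ridiculously fast.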
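Two other ways to avoid building source text by hand are sketched below: passing a callable to timeit, and dedenting a multi-line string with textwrap.dedent while handing the needed names over through the globals parameter. The function work and the values of stem and result are placeholders invented for this sketch; they stand in for the subprocess call in the question.

```python
import textwrap
import timeit


def work(stem, result):
    """Placeholder for the code to be measured."""
    return len(stem) + len(result)


stem, result = "in_dir", "out_dir"

# Option 1: pass a callable; no source string is involved at all.
elapsed_callable = timeit.timeit(lambda: work(stem, result), number=1000)

# Option 2: keep a readable multi-line string, but remove the common
# leading indentation before handing it to timeit, and supply the names
# the statements need through the globals argument.
code_to_measure = textwrap.dedent('''
    total = work(stem, result)
    assert total == 13
''')
elapsed_string = timeit.timeit(
    code_to_measure,
    globals={"work": work, "stem": stem, "result": result},
    number=1000,
)

print(elapsed_callable, elapsed_string)
```

For a slow operation such as launching a subprocess, number=1 (or a small value) is usually the sensible choice.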
The two ways to indent code are to use spaces (the standard is 4 spaces per indentation level) or to use tabs. Make sure you are not mixing them; stick to one of them.
I might be able to help you more and tell you exactly what the problem is if you can share your code with me as a .py file.

An Elegant Solution to Python's Multiline String?

I was trying to log the completion of a scheduled event I set up to run in Django. To keep my code presentable, instead of putting the string on a single line I used a multiline string and passed it to the logger inside a management Command class method, as shown:
# the usual imports...
# ....
import textwrap
logger = logging.getLogger(__name__)
class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        # some codes here
        # ....
        final_statement = f'''\
            this is the final statements \
            with multiline string to have \
            a neater code.'''
        dedented_text = textwrap.dedent(final_statement)
        logger.info(dedented_text.replace('  ',''))
I have tried a few methods I found; however, most quick and easy methods still left a big chunk of spaces in the terminal output, as shown here:
this is the final statement with multiline string to have a neater code.
So I came up with a creative solution to my problem, using:
dedented_text.replace('  ','')
This replaces every run of two spaces with nothing, so the normal single spaces between words are kept. It finally produced:
this is the final statement with multiline string to have a neater code.
Is this an elegant solution, or did I miss something on the internet?
You could use a regex to simply remove all whitespace after a newline. Additionally, wrapping it into a function leads to less repetitive code, so let's do that.
import re
def single_line(string):
    return re.sub(r"\n\s+", "", string)
final_statement = single_line(f'''
    this is the final statements
    with multiline string to have
    a neater code.''')
print(final_statement)
Alternatively, if you wish to avoid this particular problem (and don't mind the development overhead), you could store the strings in a file, such as JSON, so you can quickly edit prompts while keeping your code clean.
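Note that deleting the newline together with the indentation joins the last word of one line to the first word of the next unless each line ends with a space. A minimal sketch of a variant that puts exactly one space between the joined lines is shown below; the name single_line_spaced and the sample text are made up for illustration.

```python
def single_line_spaced(text):
    """Join the non-empty lines of text with single spaces."""
    stripped = (line.strip() for line in text.splitlines())
    return " ".join(line for line in stripped if line)


final_statement = single_line_spaced('''
    this is the final statement
    with multiline string to have
    a neater code.''')
print(final_statement)
```

Spaces inside a line are left untouched, so only the line breaks and the surrounding indentation are normalized.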
Thanks to Neil's suggestion, I have come up with a more elegant solution: a function that replaces every run of two spaces with nothing.
def single_line(string):
    return string.replace('  ','')
final_statement = '''\
    this is a much neater
    final statement
    to present my code
    '''
print(single_line(final_statement))
Adapting Neil's solution, I have dropped the regex import. That's one line less of code!
Also, making it a function improves readability, as the whole print statement just reads like English: "Print single line final statement".
Any better idea?
The issue with both Neil’s and Wong Siwei’s answers is they don’t work if your multiline string contains lines more indented than others:
my_string = """\
  this is my
  string and
    it has various
      indentation
  levels"""
What you want in the case above is to remove the two-spaces indentation, not every space at the beginning of a line.
The solution below should work in all cases:
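import re
def dedent(s):
    indent_level = None
    # re.MULTILINE makes ^ match at the start of every line, not only the first
    for m in re.finditer(r"^ +", s, flags=re.MULTILINE):
        line_indent_level = len(m.group())
        if indent_level is None or indent_level > line_indent_level:
            indent_level = line_indent_level
    if not indent_level:
        return s
    # Remove only the common indentation and keep the newlines
    return re.sub(r"^ {%d}" % indent_level, "", s, flags=re.MULTILINE)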
It first scans the whole string to find the lowest indentation level then uses that information to dedent all lines of it.
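For comparison, the standard library's textwrap.dedent applies the same idea: it removes only the whitespace that is common to the start of every line, so deeper levels of indentation survive. The sample string below is made up for illustration.

```python
import textwrap

my_string = """\
  this is my
  string and
    it has various
      indentation
  levels"""

# Only the common two-space margin is removed.
dedented = textwrap.dedent(my_string)
print(dedented)
```

The relative indentation of "it has various" and "indentation" is kept, which is the behavior asked for above.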
If you only care about making your code easier to read, you may instead use C-like strings "concatenation":
my_string = (
    "this is my string"
    " and I write it on"
    " multiple lines"
)
print(repr(my_string))
# => 'this is my string and I write it on multiple lines'
You may also want to make it explicit with +s:
my_string = "this is my string" + \
            " and I write it on" + \
            " multiple lines"

How to add the line number at the beginning of each line in a file

So.. I need to read a file and add the line number at the beginning of each line. Just as the title. How do you do it?
For example, if the content of the file was:
This
is
a
simple
test
file
I should turn these 6 lines into
1. This
2. is
3. a
4. simple
5. test
6. file
Keep the original content, but just adding the line number at the beginning.
My code looks like this so far:
def add_numbers(filename):
f = open(filename, "w+")
line_number = 1
for line in f.readlines():
number_added = str(line_number) + '. ' + f.readline(line)
line_number += 1
return number_added
But it doesn't really show anything as the result. I have no clues how to do it. Any help?
A few problems I see in your code:
Your indentation is not correct. Everything below the def add_numbers(): should be indented one level.
Opening the file with "w+" truncates it, so there is nothing left to read.
It is good practice to close a file handle at the end of your method.
A similar question to yours was asked here. Looking at the various solutions posted there, using fileinput seems like your best bet because it allows you to edit your file in-place.
import fileinput
def add_numbers(filename):
    line_number = 1
    for line in fileinput.input(filename, inplace=True):
        # line already ends with a newline, so suppress print's own
        print("{}. {}".format(line_number, line), end="")
        line_number += 1
Also note that I use format to combine the two strings instead of adding them together, because this handles different variable types more easily. A good explanation of the use of format can be found here.
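The fileinput approach can be tried safely on a throwaway file. The sketch below is one possible variant that uses enumerate for the counter and a with block so the file is closed automatically; the temporary file and its contents are set up only for the demonstration.

```python
import fileinput
import os
import tempfile


def add_numbers(filename):
    """Prefix every line of filename with '<n>. ', editing the file in place."""
    with fileinput.input(filename, inplace=True) as lines:
        for line_number, line in enumerate(lines, start=1):
            # With inplace=True, print() writes into the file. The line
            # already ends with a newline, so print's own is suppressed.
            print("{}. {}".format(line_number, line), end="")


# Demonstration on a temporary file holding the six sample lines
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("This\nis\na\nsimple\ntest\nfile\n")

try:
    add_numbers(tmp.name)
    with open(tmp.name) as numbered:
        result = numbered.read()
finally:
    os.remove(tmp.name)

print(result)
```

Since the file is rewritten in place, it is worth testing on a copy before running this on data you care about.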

Read from URL & Process data using list comprehension

I am new to Python and I am trying to read data from a URL. Basically, I read the historical stock data, get the closing price from each line and save the closing prices into a list. The closing price is available at the 4th index (5th column) of each line. I want to do all of this within a list comprehension.
Code snippet:
from urllib.request import urlopen
URL = "http://ichart.yahoo.com/table.csv?s=AAPL&a=3&b=1&c=2016&d=9&e=30&f=2016"
def downloadClosingPrice():
    urlHandler = urlopen(URL)
    next(urlHandler)
    return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]
closingPriceList = downloadClosingPrice()
The above code works fine. I am able to read and fetch the required data. However, just out of curiosity, can the list comprehension be written in a simpler or easier way?
Thanks...
I did try out various ways and this is how I could do the same using different forms of list comprehension:
return [float(line.decode("utf8").split(",")[4]) for line in urlHandler if line]
# return [float(line.decode("utf8").split(",")[4]) for line in urlHandler.readlines() if line]
# return [float(line.split(",")[4]) for line in urlHandler.read().decode("utf8").splitlines() if line]
The first one is better because it reads the file line by line which saves memory. And of course it's simpler and easier to understand.
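The line-by-line form can be tried without a network connection, because the object returned by urlopen is iterated like any other binary file. In the sketch below an io.BytesIO object stands in for the response; the function name closing_prices and the two data rows are invented for illustration and are not real quotes.

```python
import io


def closing_prices(byte_lines):
    """Return the 5th column of each CSV line as float, skipping the header."""
    next(byte_lines)  # skip the header row
    return [
        float(line.decode("utf8").split(",")[4])
        for line in byte_lines
        if line.strip()
    ]


# Made-up sample standing in for the HTTP response body
fake_response = io.BytesIO(
    b"Date,Open,High,Low,Close,Volume\n"
    b"2016-10-28,113.87,115.21,113.45,113.72,37861700\n"
    b"2016-10-27,115.39,115.86,114.10,114.48,34562000\n"
)

prices = closing_prices(fake_response)
print(prices)
```

The filter uses line.strip() rather than line because a line read from a binary stream still contains its newline, so a blank line would otherwise pass the test and fail in float().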
