What is the problem with len and the output - python-3.x

I'm a beginner working through exercises in Python, and I have a problem with this one:
Book Titles
You have been asked to make a special book categorization program, which assigns each book a special code based on its title.
The code is equal to the first letter of the book, followed by the number of characters in the title.
For example, for the book "Harry Potter", the code would be: H12, as it contains 12 characters (including the space).
You are provided a books.txt file, which includes the book titles, each one written on a separate line.
Read the title one by one and output the code for each book on a separate line.
For example, if the books.txt file contains:
Some book
Another book
Your program should output:
S9
A12
Recall the readlines() method, which returns a list containing the lines of the file.
Also, remember that all lines, except the last one, contain a \n at the end, which should not be included in the character count.
I understand what I should do, but my output does not match the expected format (S9, A12).
This is my code:
file = open("/usercode/files/books.txt", "r")
for i in file.readlines():
    print(i[0])
    print(len(i))
file.close()
my output is:
H
13
T
17
P
20
G
18
Expected Output
H12
T16
P19
G18

You missed the part of the instructions where it says "remember that all lines, except the last one, contain a \n at the end, which should not be included in the character count."
I'd suggest stripping off the newline, e.g. print(len(i.strip('\n'))).
To get them all on the same line, just combine the prints, and use an empty sep:
for i in file:
    i = i.strip('\n')
    print(i[0], len(i), sep='')
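Putting the pieces together, the whole exercise might look like the sketch below. The helper name book_codes is hypothetical, and it accepts any iterable of lines so it can be tried without the books.txt file. Skipping blank lines is an assumption, not something the exercise states.

```python
def book_codes(lines):
    """Return one code per title: first letter followed by the title length."""
    codes = []
    for line in lines:
        title = line.rstrip("\n")  # drop the trailing newline, if any
        if title:  # assumption: blank lines are not titles
            codes.append(f"{title[0]}{len(title)}")
    return codes


if __name__ == "__main__":
    # Sample data taken from the exercise text; the last line has no "\n".
    sample = ["Some book\n", "Another book"]
    for code in book_codes(sample):
        print(code)
```

With the real file, the open file object can be passed directly, for example book_codes(file), since iterating over a file yields its lines.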

Related

Regex: Match x if not surrounded with y [duplicate]

I'm very new to programming and regex, so I apologise if this has been asked before (I didn't find it, though).
I want to use Python to summarise word frequencies in a literal text. Let's assume the text is formatted like
Chapter 1
blah blah blah
Chapter 2
blah blah blah
....
Now I read the text as a string, and I want to use re.findall to get every word in this text, so my code is
wordlist = re.findall(r'\b\w+\b', text)
But the problem is that it matches all these Chapters in each chapter title, which I don't want to include in my stats. So I want to ignore what matches Chapter\s*\d+. What should I do?
Thanks in advance, guys.
Solutions
You could remove all Chapter+space+digits first :
wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
If you want to use just one search, you can use a negative lookahead to skip the word "Chapter" whenever it is followed by a number, and require every word to begin with a letter so that the chapter number itself is skipped too:
wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
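As a small illustration of how the two patterns above differ, the sketch below runs both on a made-up sample string (the text is an assumption, not the asker's data). Because the lookahead version requires a leading letter, it also drops ordinary tokens that start with a digit.

```python
import re

# Made-up sample text for illustration only.
text = "Chapter 1\nthe 3 bears\nChapter 2\nblah blah\n"

# Two-step approach: delete the chapter headings, then collect every word.
two_step = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*', '', text))

# Single-pass approach: the negative lookahead rejects "Chapter <number>",
# and [A-Za-z] rejects any token that starts with a digit.
lookahead = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', text)

print(two_step)   # ['the', '3', 'bears', 'blah', 'blah']
print(lookahead)  # ['the', 'bears', 'blah', 'blah']
```

If numbers inside the body text should count as words, the two-step version is the one that keeps them.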
If performance is an issue, loading a huge string and parsing it with a Regex wouldn't be the correct method anyway. Just read the file line by line, toss any line that matches r'^Chapter\s*\d+' and parse each remaining line separately with r'\b\w+\b' :
import re
with open("huge_file.txt", "r") as f:
    lines = f.readlines()
wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')  # matches a chapter heading line
words = re.compile(r'\b\w+\b')
for line in lines:
    if not chapter.match(line):
        wordlist.extend(words.findall(line))
print(len(wordlist))
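Since the original goal was word frequencies, the line-by-line idea can also be sketched with collections.Counter, iterating over the lines one at a time instead of building a full word list. The function name count_words and the sample text are assumptions made for illustration, and no claim is made here about its speed.

```python
import io
import re
from collections import Counter

CHAPTER = re.compile(r'^Chapter\s*\d+')  # a chapter heading at the start of a line
WORD = re.compile(r'\b\w+\b')


def count_words(lines):
    """Count words in an iterable of lines, skipping chapter headings."""
    counts = Counter()
    for line in lines:
        if not CHAPTER.match(line):
            counts.update(WORD.findall(line))
    return counts


if __name__ == "__main__":
    # io.StringIO stands in for an open file; iterating it yields lines.
    sample = io.StringIO("Chapter 1\nblah blah foo\nChapter 2\nfoo\n")
    print(count_words(sample).most_common())
```

An open file object can be passed in place of the StringIO, so only one line is held in memory at a time.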
Performance
I wrote a small ruby script to write a huge file :
all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten
File.open('huge_file.txt','w+') do |txt|
  newline=true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand<0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand<0.10
      txt.puts
      newline = true
    end
  end
end
The resulting file has more than 50 million words and is about 483MB big :
Chapter 154
schoolyard trashcan's holly's continuations
Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's
Chapter 142
spender's
vests
Ladoga
Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's
The two-step process took 12.2s to extract the wordlist on average, the lookahead method took 13.5s and Wiktor's answer also took 13.5s. The lookahead method I first wrote used re.IGNORECASE, and it took around 18s.
There's basically no difference in performance between all the Regexen methods when reading the whole file.
What surprised me though is that the readlines script took around 20.5s, and didn't use much less memory than the other scripts. If you have any idea how to improve the script, please comment!
Match what you do not need and capture what you need, and use this technique with re.findall that only returns captured values:
re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s)
Details:
\bChapter\s*\d+\b - Chapter as a whole word followed with 0+ whitespaces and 1+ digits
| - or
\b(\w+)\b - match and capture into Group 1 one or more word chars
To avoid getting empty values in the resulting list, filter it:
import re
s = "Chapter 1: Black brown fox 45"
print(list(filter(None, re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s))))  # ['Black', 'brown', 'fox', '45']

How do I get the computer to separate a conjoined string into separate items in a list depending on what it detects?

This is a follow up from a question I asked yesterday which I got brilliant responses for but now I have more problems :P
(How do I get python to detect a right brace, and put a space after that?)
Say I have this string that's in a txt document which I make Python read
!0->{100}!1o^{72}->{30}o^{72}->{30}o^{72}->{30}o^{72}->{30}o^{72}->{30}
I want to separate this conjoined string into individual components that can be indexed after detecting a certain symbol.
If it detects !0, it's considered as one index.
If it detects ->{100}, that is also considered as another part of the list.
It separates all of them into different parts until the computer prints out:
!0, ->{100}, !1, o^{72}, ->{30}
Starting from yesterday's code, I tried a plethora of things.
I tried this technique, which separates anything with '}' perfectly but has a hard time separating !0
text = "(->{200}o^{90}->{200}o^{90}->{200}o^{90}!0->{200}!1o^{90})" #this is an example string
my_string = ""
for character in text:
    my_string += character
    if character == "}":
        my_string+= "," #up until this point, Guimonte's code perfectly splits "}"
    elif character == "0": #here is where I tried to get it to detect !0. it splits that, but places ',' on all zeroes
        my_string+= ","
print(my_string)
The output:
(->{20,0,},o^{90,},->{20,0,},o^{90,},->{20,0,},o^{90,},!0,->{20,0,},!1o^{90,},)
I want the output to instead be:
(->{200}, o^{90}, ->{200}, o^{90}, ->{200}, o^{90}, !0, ->{200}, !1, o^{90})
It separates !0 but it also messes with the other symbols.
I'm starting to approach a checkmate scenario. Is there any way I can get it to split !0 and !1 as well as the right brace?
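One possible approach, sketched here under an assumed token grammar, is to describe each token with a regular expression rather than inserting commas character by character. The assumption is that a token is either "!" followed by digits, or a run of other characters ending in a {...} group; parentheses are treated as wrappers and skipped. The name split_tokens is hypothetical.

```python
import re

# Assumed grammar: "!" plus digits, or any run of characters other than
# "!", braces and parentheses, followed by a {...} group.
TOKEN = re.compile(r'!\d+|[^!{}()]+\{[^{}]*\}')


def split_tokens(text):
    """Return the tokens of `text` as a list, in order of appearance."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    text = "(->{200}o^{90}->{200}o^{90}->{200}o^{90}!0->{200}!1o^{90})"
    tokens = split_tokens(text)
    print("(" + ", ".join(tokens) + ")")
```

Because the result is a list, each component can be indexed directly, for example split_tokens(text)[6] for the !0 in the example string.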

Different behaviour shown when running the same code for a file and for a list

I have observed unusual behaviour when slicing the words in a file versus the words in a list. The results are quite different.
For example, I have a file 'words.txt' with the following content:
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
When I write the code below, I expect to get a list of words with the last letter omitted.
with open("words.txt", "r") as fo:
    for l in fo:
        print(l[:-1])
But instead I get the result below. No string slicing seems to take place and the words are the same as before.
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
But if I write the code below, I get what I want:
lis = ["POPE", "POPS", "ROPE", "POKE", "COPE", "PAPE", "NOPE", "POLE"]
for i in lis:
    print(i[:-1])
I am able to delete the last letter of each of the words as expected.
POP
POP
ROP
POK
COP
PAP
NOP
POL
So why do I see two different results for the same operation [:-1]?
Lines read from a file end with \n, whereas the strings in your list have no line endings.
Your actual file contents are as follows:
POPE\n
POPS\n
ROPE\n
POKE\n
COPE\n
PAPE\n
NOPE\n
POLE\n
Hence print(l[:-1]) is actually trimming the line ending, i.e. \n. print() then adds its own newline, so the output looks unchanged.
To verify this, declare an empty list before the loop, append each line to it, and print it. You will find that every line contains the \n:
stuff = []
with open("words.txt", "r") as fo:
    for line in fo:
        stuff.append(line)
print(stuff)
This will print ['POPE\n', 'POPS\n', 'ROPE\n', 'POKE\n', 'COPE\n', 'PAPE\n', 'NOPE\n', 'POLE\n']
If I am not mistaken, you want to carry out the slicing operation on the file contents themselves. Look into the strip() method to remove the line ending first.
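A minimal sketch of that suggestion: remove the line ending first, then slice. The helper name drop_last_letter is hypothetical, and it takes any iterable of lines so it behaves the same for a file object and for a list.

```python
def drop_last_letter(lines):
    """Strip the trailing newline from each line, then drop its last character."""
    result = []
    for line in lines:
        word = line.rstrip("\n")  # remove the line ending, if present
        result.append(word[:-1])  # now the slice removes the last letter
    return result


if __name__ == "__main__":
    # Lines as they would come from the file, including a final line without "\n".
    for word in drop_last_letter(["POPE\n", "POPS\n", "POLE"]):
        print(word)
```

Using rstrip("\n") rather than a second [:-1] keeps the result correct even when the last line of the file has no trailing newline.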

How to decode a text file by extracting alphabet characters and listing them into a message?

So we were given an assignment to write a program that sorts through a long message filled with special characters (e.g. [, {, %, $, *), with only a few alphabet characters scattered throughout, to reveal a special message.
I've been searching on this site for a while and haven't found anything specific enough that would work.
I put the text file into a pastebin if you want to see it
https://pastebin.com/48BTWB3B
Anywho, this is what I've come up with for code so far
code = open('code.txt', 'r')
lettersList = code.readlines()
lettersList.sort()
for letters in lettersList:
    print(letters)
It prints the code.txt out but into short lists, essentially cutting it into smaller pieces. I want it to find and sort out the alphabet characters into a list and print the decoded message.
This is something you can do pretty easily with regex.
import re
with open('code.txt', 'r') as filehandle:
    contents = filehandle.read()
letters = re.findall("[a-zA-Z]+", contents)
If you want to condense the list into a single string, you can use a join:
single_str = ''.join(letters)
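To see the idea without the original file, here is a small sketch on a made-up string (the sample input and the function name extract_letters are assumptions for illustration; the real code.txt is not reproduced here).

```python
import re


def extract_letters(contents):
    """Keep only ASCII letters, in their original order, as one string."""
    return "".join(re.findall(r"[a-zA-Z]+", contents))


if __name__ == "__main__":
    # Made-up noisy input; letters are hidden among special characters.
    print(extract_letters("$h%e[l{l*o\n{w%o$r*l[d"))
```

Note that the letters must stay in their original order, so sorting the lines first (as in the question's code) would scramble the message.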

printing sentence from a word search

As an exercise, I've copied and saved Edgar Rice Burroughs's Tarzan novel into a text file (named tarzan.txt), searched it for "row", and printed out the corresponding lines with the code below.
Is it difficult to modify this code so that it searches for the word "row" rather than instances of these letters appearing in another word AND prints the sentences that contain this word rather than simply the lines it appears in? Thanks.
a="tarzan.txt"
with open (a) as f_obj:
    contents=f_obj.readlines()
for line in contents:
    if "row" in line:
        print(line)
import re
a="tarzan.txt"
with open (a) as f_obj:
    contents=f_obj.readlines()
# enumerate gives the right index even when identical lines repeat
for index, line in enumerate(contents):
    if re.search(r'\brow\b',line): ####### search for 'row' as a whole word in line
        print(index) ####### print the (0-based) index of the line
Here \b matches a word boundary.
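For the second half of the question, printing whole sentences rather than lines, a rough sketch follows. It assumes a sentence ends at ".", "!" or "?" followed by whitespace, which is a naive rule that will mis-split abbreviations such as "Mr."; the function name sentences_with_word and the sample text are hypothetical.

```python
import re


def sentences_with_word(text, word):
    """Return the sentences of `text` that contain `word` as a whole word."""
    # Join wrapped lines so a sentence spanning several lines stays together.
    flat = " ".join(text.split())
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', flat)
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    return [s for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    sample = "He saw a row of trees.\nThe arrow flew\nfar. They row the boat!"
    for sentence in sentences_with_word(sample, "row"):
        print(sentence)
```

For the novel, the whole file would be read with f_obj.read() instead of readlines() and passed in as text.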

Resources