Python 3 - correct way to write to .html without TypeError - python-3.x

# save-webpage.py (To write first 100 characters of html source into 'simple.html')
import urllib.request, io, sys
f = urllib.request.urlopen('https://news.google.com')
webContent = f.read(100)
#g = io.open('simple.html', 'w', encoding='UTF-8')
g = io.open('simple.html', 'w')
#g.write(webContent)
g.write(webContent.decode("UTF-8"))
g.close()
2019-01-11: See above for corrected working code after answers were received. Thanks guys.
Original question:
Upon execution, the file, simple.html, is created with 0 bytes.
Along with an error:
TypeError: must be str, not bytes.
Please help. I've gone about this several ways but to no avail. Thank you in advance!

g.write(webContent.decode("utf-8"))

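File objects opened in text mode expect str (Unicode text), but f.read() returns bytes.
This line opens the file in text mode; the encoding argument only tells the file object how to encode the str you write:
g = io.open('simple.html', 'w', encoding='UTF-8')
Either decode the bytes before writing them, or open the file in binary mode ('wb') and write the bytes unchanged.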
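
A minimal sketch of the two options, assuming the response body is UTF-8. The bytes literal is a hypothetical stand-in for the result of f.read(100) so that the example runs without a network call, and the output file names are made up for illustration:

```python
import os
import tempfile

# Hypothetical stand-in for the bytes returned by f.read(100)
web_content = b"<!doctype html><html><head><title>caf\xc3\xa9</title>"

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "simple_text.html")
binary_path = os.path.join(out_dir, "simple_binary.html")

# Option 1: text mode. Decode the bytes to str; the file object encodes on write.
with open(text_path, "w", encoding="utf-8") as g:
    g.write(web_content.decode("utf-8"))

# Option 2: binary mode. Write the bytes unchanged, no decoding needed.
with open(binary_path, "wb") as g:
    g.write(web_content)

# Both files end up with the same bytes on disk.
with open(text_path, "rb") as a, open(binary_path, "rb") as b:
    text_bytes = a.read()
    binary_bytes = b.read()
```

Reading a fixed number of bytes can cut a multi-byte character in half, in which case decode raises UnicodeDecodeError. Binary mode sidesteps that, which may make it the safer choice when only a prefix of the page is saved.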

Related

Python 3: Persist strings without b'

I am confused. This talk explains, that you should only use unicode-strings in your code. When strings leave your code, you should turn them into bytes. I did this for a csv file:
import csv
with open('keywords.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p.encode("utf-8"), ', '.join(keywords).encode("utf-8")])
This leads to an annoying effect, where b' is added in front of every string, this didn't happen for me in python 2.7. When not encoding the strings before writing them into the csv file, the b' is not there, but don't I need to turn them into bytes when persisting? How do I write bytes into a file without this b' annoyance?
Stop trying to encode the individual strings, instead you should specify the encoding for the entire file:
import csv
with open('keywords.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='\"')
    for (p, keywords) in ml_data:
        writer.writerow([p, ', '.join(keywords)])
The reason your code goes wrong is that writerow expects strings but you're passing bytes, so it converts them with str(), which for bytes gives the b'...' form. If you pass it strings and use the encoding parameter when you open the file, the strings will be encoded correctly for you.
See the csv documentation examples. One of these shows you how to set the encoding.
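
A small runnable sketch of that advice. The contents of ml_data and the temporary file path are hypothetical, chosen only to show that non-ASCII text reaches the file as UTF-8 without any b'...' wrapper:

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for ml_data from the question
ml_data = [("caf\u00e9", ["drink", "coffee"]), ("tea", ["drink"])]

path = os.path.join(tempfile.mkdtemp(), "keywords.csv")

# Pass str values to writerow; the file object handles the UTF-8 encoding.
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter="\t", quotechar='"')
    for p, keywords in ml_data:
        writer.writerow([p, ", ".join(keywords)])

# Inspect the raw bytes that were persisted.
with open(path, "rb") as raw:
    data = raw.read()
```

The strings stay str everywhere inside the program; the conversion to bytes happens once, at the file boundary.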

Python 3 split('\n')

How do I split a text string according to an explicit newline ('\n')?
Unfortunately, instead of a properly formatted csv file, I am dealing with a long string of text with "\n" where the newline would be. (example format: "A0,B0\nA1,B1\nA2,B2\nA3,B3\n ...") I thought a simple bad_csv_list = text.split('\n') would give me a list of the two-valued cells (example split ['A0,B0', 'A1,B1', 'A2,B2', 'A3,B3', ...]). Instead I end up with one cell and "\n" gets converted to "\\n". I tried copy-pasting a section of the string and using split('\n') and it works as I had hoped. The print statement for the file object tells me the following:
<_io.TextIOWrapper name='stats.csv' mode='r' encoding='cp1252'>
...so I suspect the problem is with the cp1252 encoding? Of note tho: Notepad++ says the file I am working with is "UTF-8 without BOM"... I've looked in the docs and around SO and tried importing io and codec and prepending the open statement and declaring encoding='utf8' but I am at a loss and I don't really grok text encoding. Maybe there is a better solution?
from sys import argv
# import io, codecs
filename = argv[1]
file_object = open(filename, 'r')
# file_object = io.open(filename, 'r', encoding='utf8')
# file_object = codecs.open(filename, 'r', encoding='utf8')
file_contents = file_object.read()
file_list = file_contents.split('\n')
print("1.) Here's the name of the file: {}".format(filename))
print("2.) Here's the file object info: {}".format(file_object))
print("3.) Here's all the files contents:\n{}".format(file_contents))
print("4.) Here's a list of the file contents:\n{}".format(file_list))
Any help would be greatly appreciated, thank you.
If it helps to explain what I am dealing with, here's the contents of the stats.csv file:
Albuquerque,749\nAnaheim,371\nAnchorage,828\nArlington,503\nAtlanta,1379\nAurora,425\nAustin,408\nBakersfield,542\nBaltimore,1405\nBoston,835\nBuffalo,1288\nCharlotte-Mecklenburg,647\nCincinnati,974\nCleveland,1383\nColorado Springs,455\nCorpus Christi,658\nDallas,675\nDenver,615\nDetroit,2122\nEl Paso,423\nFort Wayne,362\nFort Worth,587\nFresno,543\nGreensboro,563\nHenderson,168\nHouston,992\nIndianapolis,1185\nJacksonville,617\nJersey City,734\nKansas City,1263\nLas Vegas,784\nLexington,352\nLincoln,397\nLong Beach,575\nLos Angeles,481\nLouisville Metro,598\nMemphis,1750\nMesa,399\nMiami,1172\nMilwaukee,1294\nMinneapolis,992\nMobile,522\nNashville,1216\nNew Orleans,815\nNew York,639\nNewark,1154\nOakland,1993\nOklahoma City,919\nOmaha,594\nPhiladelphia,1160\nPhoenix,636\nPittsburgh,752\nPlano,130\nPortland,517\nRaleigh,423\nRiverside,443\nSacramento,738\nSan Antonio,503\nSan Diego,413\nSan Francisco,704\nSan Jose,363\nSanta Ana,401\nSeattle,597\nSt. Louis,1776\nSt. Paul,722\nStockton,1548\nTampa,616\nToledo,1171\nTucson,724\nTulsa,990\nVirginia Beach,169\nWashington,1177\nWichita,742
And the result from the split('\n'):
['Albuquerque,749\\nAnaheim,371\\nAnchorage,828\\nArlington,503\\nAtlanta,1379\\nAurora,425\\nAustin,408\\nBakersfield,542\\nBaltimore,1405\\nBoston,835\\nBuffalo,1288\\nCharlotte-Mecklenburg,647\\nCincinnati,974\\nCleveland,1383\\nColorado Springs,455\\nCorpus Christi,658\\nDallas,675\\nDenver,615\\nDetroit,2122\\nEl Paso,423\\nFort Wayne,362\\nFort Worth,587\\nFresno,543\\nGreensboro,563\\nHenderson,168\\nHouston,992\\nIndianapolis,1185\\nJacksonville,617\\nJersey City,734\\nKansas City,1263\\nLas Vegas,784\\nLexington,352\\nLincoln,397\\nLong Beach,575\\nLos Angeles,481\\nLouisville Metro,598\\nMemphis,1750\\nMesa,399\\nMiami,1172\\nMilwaukee,1294\\nMinneapolis,992\\nMobile,522\\nNashville,1216\\nNew Orleans,815\\nNew York,639\\nNewark,1154\\nOakland,1993\\nOklahoma City,919\\nOmaha,594\\nPhiladelphia,1160\\nPhoenix,636\\nPittsburgh,752\\nPlano,130\\nPortland,517\\nRaleigh,423\\nRiverside,443\\nSacramento,738\\nSan Antonio,503\\nSan Diego,413\\nSan Francisco,704\\nSan Jose,363\\nSanta Ana,401\\nSeattle,597\\nSt. Louis,1776\\nSt. Paul,722\\nStockton,1548\\nTampa,616\\nToledo,1171\\nTucson,724\\nTulsa,990\\nVirginia Beach,169\\nWashington,1177\\nWichita,742']
Why does it ADD a \ ?
dOh!!! ROYAL FACE PALM! I just wrote all this out and then realized that all I needed to do was escape the backslash in the split argument, because the file contains a literal backslash followed by n rather than real newlines:
file_list = file_contents.split('\\n')
I'm gonna post this anyways so y'all can have a chuckle ^_^
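
For anyone landing here with the same problem, a short sketch of what is going on, using a shortened copy of the data. The extra backslash in the printed list is only how Python displays a literal backslash inside a string; nothing is added to the data itself:

```python
# The file holds a backslash followed by "n", not real line breaks.
file_contents = r"Albuquerque,749\nAnaheim,371\nAnchorage,828"

# '\n' is a single newline character, which never occurs in this text.
unsplit = file_contents.split('\n')

# '\\n' (or r'\n') is the two-character sequence that is actually there.
rows = file_contents.split('\\n')

# Each row can then be split into its two cells.
cells = [row.split(',') for row in rows]
```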

a bytes-like object is required, not 'str': typeerror in compressed file

I am searching for a substring in a compressed file using the following Python script. I am getting "TypeError: a bytes-like object is required, not 'str'". Can anyone please help me fix this?
from re import *
import re
import gzip
import sys
import io
import os
seq={}
with open(sys.argv[1],'r') as fh:
    for line1 in fh:
        a=line1.split("\t")
        seq[a[0]]=a[1]
abcd="AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
print(a[0],"\t",seq[a[0]])
count={}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] +=1
for c in count:
    print(c,"\t",count[c])
fh.close()
gz_file.close()
f.close()
and input files are
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
The second file is a compressed text file. The line "if abcd in line:" raises the error.
The "BufferedReader" class gives you byte strings, not text strings, and you can't directly compare the two kinds of object in Python 3.
Since these strings just use a few ASCII characters and are not actually text, you can work with byte strings throughout your code.
So, whenever you "open" a file (not gzip.open), open it in binary mode (i.e.
open(sys.argv[1],'rb') instead of 'r' to open the file)
Also prefix your hardcoded string with a b so that Python uses a byte string internally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG". This will avoid a similar error on your if abcd in line, though the error message should be different than the one you presented.
Alternatively, use everything as text. This gives you more methods to work with the strings (Python 3's byte strings are somewhat crippled) and better presentation of the data when printing, and it should not be much slower. In that case, instead of the changes suggested above, include an extra line to decode each line fetched from your data file:
with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")
        for b in seq:
(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison, just the fixed abcd value, but I suppose you can fix that once you get rid of the errors.)
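
A runnable sketch of both approaches, under some assumptions: the reads and the temporary .gz file are made-up stand-ins for the two input files, and the text variant uses gzip.open in "rt" mode rather than calling decode on each line, which is a swapped-in equivalent. It also compares each sequence from seq instead of the fixed abcd value, and starts every counter at 0:

```python
import gzip
import os
import tempfile

# Hypothetical inputs standing in for sys.argv[1] and sys.argv[2]
seq = {"TruSeq2_SE": "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"}
reads = [
    "ACGTAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAA",
    "TTTTGGGGCCCCAAAA",
    "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG",
]
gz_path = os.path.join(tempfile.mkdtemp(), "reads.txt.gz")
with gzip.open(gz_path, "wt", encoding="latin1") as out:
    out.write("\n".join(reads) + "\n")

# Approach 1: stay in bytes. gzip.open defaults to binary mode ("rb").
byte_counts = {name: 0 for name in seq}
with gzip.open(gz_path) as f:
    for line in f:
        for name, adapter in seq.items():
            if adapter.encode("ascii") in line:  # bytes in bytes
                byte_counts[name] += 1

# Approach 2: stay in text. Mode "rt" decodes each line to str.
text_counts = {name: 0 for name in seq}
with gzip.open(gz_path, "rt", encoding="latin1") as f:
    for line in f:
        for name, adapter in seq.items():
            if adapter in line:  # str in str
                text_counts[name] += 1
```

Initialising the counters matters because count[b] += 1 on an empty dict raises KeyError. The with blocks also close the files, so no explicit close() calls are needed afterwards.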

F.write doesn't work

import os,sys
import time
from colorama import Fore,Back,Style,init
init(autoreset=True)
appdata_path = os.path.join(os.getenv("APPDATA"), os.pardir)
subpath = "Local/sieosp/filesav2292.sav"
f = open(os.path.join(appdata_path, subpath), "r+")
lines=f.readlines()
a1=int (lines[116])
a2=int (lines[120])
a3=int (lines[124])
b4=int (lines[128])
c5=int (lines[132])
d6=int (lines[136])
e7=int (lines[140])
d8=int (lines[144])
d9=int (lines[148])
d10=int (lines[152])
d11=int (lines[156])
d12=int (lines[160])
total=int (a1+a2+a3+b4+c5+d6+e7+d8+d9+d10+d11+d12)
if (total)==(12):
    print("You already own every character")
else:
    with f:
        userinputvalue=int (input("Type 1 if you want to unlock every character,or 0 if you would like to close this \n"))
        if(userinputvalue)==1:
            lines[156]=f.write("1\n")
            lines[116]=f.write("1\n")
            lines[120]=f.write("1\n")
            lines[124]=f.write("1\n")
            lines[128]=f.write("1\n")
            lines[132]=f.write("1\n")
            lines[136]=f.write("1\n")
            lines[140]=f.write("1\n")
            lines[144]=f.write("1\n")
            lines[148]=f.write("1\n")
            lines[152]=f.write("1\n")
            lines[160]=f.write("1\n")
        else:
            print("Closing")
            time.sleep(1)
So this should work, right? I don't know why f.write doesn't write 1 to my file. Am I using it very wrong? I searched around Google for more info but didn't understand a thing :/ I also tried using f.write the same way as f.readlines, but no luck. Thanks.
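It looks like you don't open the file in write mode, only in read mode.
f = open(os.path.join(appdata_path, subpath), "r+")
Change the "r" to a "w". Be aware that "w" truncates the file as soon as it is opened, so the existing contents have to be read beforehand.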
You have opened the file with "r+", so the file is writable. The problem is that with "r+" you have to manage the file pointer yourself: after f.readlines() it is at the end of the file, so whatever you write is appended there.
To move it, use f.seek(offset, from_what), as described in the Python tutorial section Input and Output.
For example, this code overwrites the beginning of the file:
f = open("File/Path/file.txt", "r+")
f.seek(0,0)
f.write("something")
f.close()
You also write lines[N] = f.write("something"). Be careful with that, because f.write returns the number of characters written, not the characters themselves ;)
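
Putting the two points together, one possible way to change individual lines is to edit the list returned by readlines() and then write the whole list back from the start of the file. This is only a sketch: the ten-line file and the line indexes are hypothetical, not the real save-file layout.

```python
import os
import tempfile

# Hypothetical save file: 10 lines, each holding "0"
path = os.path.join(tempfile.mkdtemp(), "example.sav")
with open(path, "w") as f:
    f.write("0\n" * 10)

unlock_indexes = [2, 5, 8]  # hypothetical line indexes to set to 1

with open(path, "r+") as f:
    lines = f.readlines()   # the pointer is now at the end of the file
    for i in unlock_indexes:
        lines[i] = "1\n"    # change the list, not the file
    f.seek(0)               # go back to the start
    f.writelines(lines)     # write every line back
    f.truncate()            # drop leftovers if the new content is shorter

# Read the file again to check the result.
with open(path) as f:
    result = f.read().splitlines()
```

Rewriting the whole file avoids computing byte offsets for each line, which would otherwise be needed to seek to line 116, 120 and so on.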

VIM: deleting non-roman characters

I'm working with a document that contains both Roman and Asian characters, and I want to put each kind into its own file while keeping the original structure. Is it possible?
Thanks
Might be easier in Python. Here's a script that reads a text file and creates two output files: one with low-ASCII and one with everything else. If you have Python support compiled in Vim, the following should also be usable from within Vim (with minimal changes).
import codecs
mixedInput = codecs.open('mixed.txt', 'r', 'utf-8')
lowAsciiOutput = codecs.open('lowAscii.txt', 'w', 'utf-8')
otherOutput = codecs.open('other.txt', 'w', 'utf-8')
for rawline in mixedInput:
    line = rawline.rstrip()
    for c in line:
        if ord(c) < 2**7:
            lowAsciiOutput.write(c)
        else:
            otherOutput.write(c)
    otherOutput.write('\n')
    lowAsciiOutput.write('\n')
mixedInput.close()
lowAsciiOutput.close()
otherOutput.close()
example input file (mixed.txt):
欢迎来到Mifos管理区域
Does that do what you want?
Also saved as a gist: https://gist.github.com/855545
