Cutting Plain HTML files with RegEX - python-3.x

I am using this code to extract a part of my locally stored HTML files and save the shortened new document into a .txt file.
import glob
import os
import re
def extractor():
os.chdir(r"F:\Test") # the directory containing your html
for file in glob.iglob("*.html"): # iterates over all files in the directory ending in .html
with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
contents = f.read()
extract = re.compile(r'(Start).*?End', re.I | re.S)
cut = extract.sub('', contents)
if re.search(extract, contents) is not None:
out.write(cut)
out.close()
extractor()
It works fine for most of my files however for a few files I do have some encoding issues and get:
Traceback (most recent call last):
File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 16, in <module>
extractor()
File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 14, in extractor
out.write(cut)
File "C:\Users\6930p\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 241205-241210: character maps to <undefined>
Anyone an idea what's the problem? I thought by using encoding="utf8" I won't have any problems with encoding...
Any help appreciated!

Ok, it has been an issue with encoding="utf8". It forgot to encode my new created .txt file with "utf8". Code is updated and works!

Related

Python: copy file tree to a text file

I'm trying to create a text file with a tree of all files / dirs from a place that I choose using os.chdir(). My approach is to print the tree and to save all prints to the text file. The problem is that it doesn't copy the printed tree and the file is blank.
What am I doing wrong?
And is there a way to write this kind of data to the file without to actually print it?
My code:
import os
import sys
f = open("tree.txt", "w")
os.chdir("c:\\Users\Daniel\Desktop")
sys.stdout = f
os.system("tree /f")
f.close()
Edit
I was able to get the file tree from the clipboard after executing the command, however it gives me and eror when it tried to write to the txt file.
code:
import os
import tkinter
with open("tree.txt", "w") as f:
os.system("tree /f |clip")
root = tkinter.Tk()
tree = root.clipboard_get()
print(tree)
f.write(tree)
eror:
Traceback (most recent call last):
File "c:\Users\Daniel\Desktop\Tick\code_test\files.py", line 9, in <module>
f.write(tree)
File "C:\Users\Daniel\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2502' in position 80: character maps to <undefined>
solution
So I found the problem, I needed to use codec to be able write unicode to the text file. Now it works very well
code:
import os
import tkinter
import codecs
with codecs.open("tree.txt", "w", "utf8") as f:
os.chdir("c:\\Users")
os.system("tree /f |clip")
root = tkinter.Tk()
tree = root.clipboard_get()
f.write(tree)
Method check_output from subprocess module can help you to catch program output:
import subprocess
f = open("tree.txt", "wb")
tree_output = subprocess.check_output('tree /f', shell=True, cwd=r'c:\Users\Daniel\Desktop')
f.write(tree_output)
f.close()
Or with context manager:
import subprocess
with open("tree.txt", "wb") as f:
f.write(subprocess.check_output('tree /f', shell=True, cwd=r'c:\Users\Daniel\Desktop'))
Option wb is required because check_output returns bytes not a str. If you want to process output like a string - call tree_output.decode() first.

UnicodeDecodeError: charmap' codec can't decode byte 0x8f in position 756

I'm unable to retrieve the data from a Microsoft Excel document. I've tried using encoding 'Latin-1' or 'UTF-8' but when it gives me hundreds of \x00's in the terminal. Is there any way I can retrieve the data and output it to a text file?
This is what I'm running on the terminal and the error I get:
PS C:\Users\Andy-\Desktop> python.exe SRT411-Lab2.py Lab2Data.xlsx
Traceback (most recent call last):
File "SRT411-Lab2.py", line 9, in
lines = file.readlines()
File "C:\ProgramFiles\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1776.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 756: character maps to <\undefined>
Any help is greatly appreciated!
#!/usr/bin/python3
import sys
filename = sys.argv[1]
print(filename)
file = open(filename, 'r')
lines = file.readlines()
file.close()
print(lines)
I'd probably convert the excel file to csv file and use pandas to parse it

python3 shutil error copying from config list

Im trying to copy a file using shutil by reading in a config.dat file.
This file is simply:
/home/admin/Documents/file1
/home/admin/Documents/file2
Code does work but it will copy file1 okay but then misses file2 because it see a \n there, which im guessing is because of the new line.
#!/usr/bin/python3
import shutil
data = open("config.dat")
filelist = data.read()
src = filelist
dest = '/home/admin/Documents/backup/'
shutil.copy(src, dest)
Error code im getting :
Traceback (most recent call last):
File "./testing.py", line 18, in <module>
shutil.copy(src, dest)
File "/usr/lib/python3.4/shutil.py", line 229, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/usr/lib/python3.4/shutil.py", line 108, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory:
'/home/admin/Documents/file1\n/home/admin/Documents/file2'
I would like the copy of those files to run based on the files from the config.dat folder but it detects a '\n'. Is there a way to fix this?
thanks for the help
Use split('\n') to get a list of files and the iterate that list. You probably want to throw in the file.strip() to get rid of trailing whitespace and empty lines.
import shutil
dest = '/home/admin/Documents/backup/'
with open('config.dat') as data:
filelist = data.read().split('\n')
for file in filelist:
if file:
shutil.copy(file.strip(), dest)
Or, if you don't need the filelist after this
with open('config.dat') as data:
for file in data:
if file:
shutil.copy(file.strip(), dest)

UnicodeEncodeError: 'ascii' codec can't encode character in python writing to csv [duplicate]

I am trying to write characters with double dots (umlauts) such as ä, ö and Ö. I am able to write it to the file with data.encode("utf-8") but the result b'\xc3\xa4\xc3\xa4\xc3\x96' is not nice (UTF-8 as literal characters). I want to get "ääÖ" as written stored to a file.
How can I write data with umlaut characters to a CSV file in Python 3?
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
data=resultFile
a.writerows(data)
Traceback:
File "<ipython-input-280-73b1f615929e>", line 5, in <module>
a.writerows(data)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 15: ordinal not in range(128)
Add a parameter encoding to the open() and set it to 'utf8'.
import csv
data = "ääÖ"
with open("test.csv", 'w', encoding='utf8') as fp:
a = csv.writer(fp, delimiter=";")
a.writerows(data)
Edit: Removed the use of io library as open is same as io.open in Python 3.
This solution should work on both python2 and 3 (not needed in python3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
data="ääÖ"
with open("test.csv", "w") as fp:
a = csv.writer(fp, delimiter=";")
a.writerows(data)
Credits to:
Working with utf-8 encoding in Python source

UnicodeDecodeError Python 3.5.1 Email Script

I am attempting to send an email + attachment to an SMS gateway email. However I currently am getting a Unicode Decode: Error'Charmap' codec can't Decode Byte 0x8d in position 60
I'm not sure how to go about fixing this and would be interested in your advice. Bellow is my code and the Full Error.
import smtplib, os
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
msg = MIMEMultipart()
msg['Subject'] = 'Cuteness'
msg['From'] = 'sample#outlook.com'
msg['To'] = '111111111#messaging.sprintpcs.com'
msg.preamble = "Would you pet me? I'd Pet me so hard..."
here = os.getcwd()
file = open('cutecat.png')#the png shares directory with actual script
for here in file: #Error appears to be in this section
with open(file, 'rb') as fp:
img = MIMImage(fp.read())
msg.attach(img)
s = smtplib.SMTP('Localhost')
s.send_message(msg)
s.quit()
""" Traceback (most recent call last):
File "C:\Users\Thomas\Desktop\Test_Box\msgr.py", line 16, in <module>
for here in file:
File "C:\Users\Thomas\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 60: character maps to <undefined>"""
You're trying to open the file twice. First you have:
file = open('cutecat.png')
The default mode to open files is to read them in text mode. That is generally not what you want to do with a binary file like a PNG file.
And then you do:
for here in file:
with open(file, 'rb') as fp:
img = MIMImage(fp.read())
msg.attach(img)
You get an exception in the first line because Python is trying to decode the contents of a binary file as text and fails. The chances of this happening are quite high. It is unlikely that a binary file is also a valid text file in your standard encoding.
But even if that would have worked, for every line in the file you try to open the file again? This makes no sense!
Were you just copy/pasting from the examples, especially the third one? You should note that this example is incomplete. The variable pngfiles used in that example (and which should be a sequence of file names) is not defined.
Try this instead:
with open('cutecat.png', 'rb') as fp:
img = MIMImage(fp.read())
msg.attach(img)
Or if you want to include multiple files:
pngfiles = ('cutecat.png', 'kitten.png')
for filename in pngfiles:
with open(filename, 'rb') as fp:
img = MIMImage(fp.read())
msg.attach(img)

Resources