A bytes-like object is required, not 'str': TypeError in compressed file - python-3.x

I am searching for a substring in a compressed file using the following Python script, and I am getting "TypeError: a bytes-like object is required, not 'str'". Can anyone help me fix this?
from re import *
import re
import gzip
import sys
import io
import os

seq = {}
with open(sys.argv[1], 'r') as fh:
    for line1 in fh:
        a = line1.split("\t")
        seq[a[0]] = a[1]
        abcd = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
        print(a[0], "\t", seq[a[0]])
count = {}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] += 1
for c in count:
    print(c, "\t", count[c])
fh.close()
gz_file.close()
f.close()
The first input file contains:
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
The second file is a gzip-compressed text file. The error is raised at the line "if abcd in line:".

The "BufferedReader" class gives you bytestrings, not text strings - you can directly compare both objects in Python3 -
Since these strings just use a few ASCII characters and are not actually text, you can work all the way along with byte strings for your code.
So, whenever you "open" a file (not gzip.open), open it in binary mode (i.e.
open(sys.argv[1],'rb') instead of 'r' to open the file)
And also prefix your hardcoded string with a b so that Python uses a binary string inernally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG" - this will avoid a similar error on your if abcd in line - though the error message should be different than the one you presented.
Alternativally, use everything as text - this can give you more methods to work with the strings (Python3's byte strigns are somewhat crippled) presentation of data when printing, and should not be much slower - in that case, instead of the changes suggested above, include an extra line to decode the line fetched from your data-file:
with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")
        for b in seq:
(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison, just the fixed abcd value, but I suppose you can fix that once you get rid of the errors.)
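Putting the byte-string changes together, here is a minimal sketch of a fixed version. Two assumptions beyond what is said above: the inner comparison searches for each sequence stored in seq (per the logic note just made), and count is pre-filled with zeros so that += cannot raise a KeyError:

import gzip
import sys

# Build the adapter table (name <TAB> sequence), reading as bytes.
seq = {}
with open(sys.argv[1], 'rb') as fh:
    for line1 in fh:
        a = line1.split(b"\t")
        seq[a[0]] = a[1].strip()

# Count how many lines of the gzipped file contain each sequence.
count = {name: 0 for name in seq}     # pre-filled: no KeyError on +=
with gzip.open(sys.argv[2]) as f:     # gzip.open yields bytes by default
    for line in f:
        for name, adapter in seq.items():
            if adapter in line:       # bytes-in-bytes comparison works
                count[name] += 1

for name, n in count.items():
    print(name.decode(), "\t", n)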

Related

Reading from CSV file without tokenizing words into letters and numbers into digits

I am downloading a CSV file and then reading it using the csv module. For some reason, words and numbers get tokenized into letters and single digits. However, there is an exception with "1 Mo", "3 Mo", etc.
I am getting csv file from here:
url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/daily-treasury-rates.csv/2022/all?type=daily_treasury_yield_curve&field_tdr_date_value=2022&page&_format=csv"
I use Python 3.10 and the code looks as follows:
from urllib.request import urlopen
import csv

response = urlopen(url)
content = response.read().decode('utf-8')
csv_data = csv.reader(content, delimiter=',')
for row in csv_data:
    print(row)
Here is what I am getting:
['D']
['a']
['t']
['e']
['','']
['1 Mo']
['','']
['2 Mo']
['','']
['3 Mo']
['','']
...
['30 Yr']
[]
['1']
['1']
['/']
['0']
['8']
['/']
...
I tried different delimiters, but it does not help.
P.S. When I simply save the CSV file to disk and then open it, everything works properly. But I don't want to have this extra step.
Check out the documentation for csv.reader:
csv.reader(csvfile, dialect='excel', **fmtparams)
...csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called -- file objects and list objects are both suitable...
Notice that your variable content is a string, not a file. In Python, strings are iterable, but iterating one yields a single character at a time, not the next line. You probably want to convert your long CSV string into a list of lines, so that __next__() (when it is called internally by the reader) returns the next line instead of the next character. This is also why your code mysteriously works when you save the CSV to a file first: an open file iterator returns the next line of input each time __next__() is invoked.
To accomplish this, replace your content = ... line with the following:
content = response.read().decode('utf-8').split("\n")
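For reference, a minimal sketch of the whole corrected script; splitlines() is used here instead of split("\n") only to avoid a trailing empty row, and the url value is the one from the question:

from urllib.request import urlopen
import csv

url = ("https://home.treasury.gov/resource-center/data-chart-center/"
       "interest-rates/daily-treasury-rates.csv/2022/all"
       "?type=daily_treasury_yield_curve&field_tdr_date_value=2022"
       "&page&_format=csv")

response = urlopen(url)
# One list entry per line, so csv.reader gets a row per __next__() call.
content = response.read().decode('utf-8').splitlines()
csv_data = csv.reader(content, delimiter=',')
for row in csv_data:
    print(row)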

Converting a string to a dictionary from an opened file

A text file contains a dictionary, as below:
{
"A":"AB","B":"BA"
}
Below is the code of the Python file:
with open('devices_file') as d:
    print(d["A"])
The result should print AB.
As @rassar and @Ivrf suggested in the comments, you can use ast.literal_eval() as well as json.loads() to achieve this. Both code snippets output AB.
Solution with ast.literal_eval():
import ast

with open("devices_file", "r") as d:
    content = d.read()
    result = ast.literal_eval(content)
    print(result["A"])
Solution with json.loads():
import json

with open("devices_file") as d:
    content = json.load(d)
    print(content["A"])
Python documentation for ast.literal_eval() and json.load().
Also: I noticed that you're not using the correct syntax in the code snippet in your question. Indented lines should be indented with 4 spaces, and by convention there is no whitespace between print and its opening parenthesis.
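One practical difference between the two approaches, shown as a small sketch (the text value is made up for illustration): json.loads() requires strict JSON with double-quoted strings, while ast.literal_eval() accepts any valid Python literal.

import ast
import json

text = "{'A': 'AB', 'B': 'BA'}"      # valid Python literal, but not valid JSON

print(ast.literal_eval(text)["A"])   # prints AB
# json.loads(text) would raise json.JSONDecodeError here;
# it works once the quotes are JSON-style:
print(json.loads(text.replace("'", '"'))["A"])   # prints AB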

Seek from end of file in Python 3

One of the changes in Python 3 has been to remove the ability to seek from the end of the file in normal text mode. What is the generally accepted alternative to this?
For example, in Python 2.7 I would enter file.seek(-3, 2).
I've read a bit about why they did this, so please don't just link to a PEP. I know that opening the file in 'rb' mode would allow me to seek, but then my text file is read in the wrong format.
In Python 2, the file data wasn't being decoded while reading. Seeking backwards and multi-byte encodings don't mix well (you can't know where the next character starts), which is why it is disabled for Python 3.
You can still seek on the underlying buffer object, via the TextIOBase.buffer attribute, but then you'll have to reattach a new io.TextIOWrapper, as the current wrapper will no longer know where it is:
import io

file.buffer.seek(-3, 2)
file = io.TextIOWrapper(
    file.buffer, encoding=file.encoding, errors=file.errors,
    newline=file.newlines)
I've copied across any encoding and line handling information to the io.TextIOWrapper() object.
Take into account that the decoding could break for UTF-16, UTF-32, UTF-8 and other multi-byte codecs, since the seek may land in the middle of a character.
Demo:
>>> import io
>>> with open('demo.txt', 'w') as out:
...     out.write('Demonstration\nfor seeking from the end')
...
38
>>> with open('demo.txt') as inf:
...     print(inf.readline())
...     inf.buffer.seek(-3, 2)
...     inf = io.TextIOWrapper(inf.buffer)
...     print(inf.readline())
...
Demonstration
35
end
You could wrap this up in a utility function:
import io

def textio_seek(fobj, amount, whence=0):
    fobj.buffer.seek(amount, whence)
    return io.TextIOWrapper(
        fobj.buffer, encoding=fobj.encoding, errors=fobj.errors,
        newline=fobj.newlines)
and use this as:
with open(somefile) as file:
    # ...
    file = textio_seek(file, -2, 2)
    # ...
Using the file object as a context manager still works, as the original file object reference remains attached to the underlying buffer object and can thus still be used to close the file.
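As a footnote to the multi-byte warning above, a small sketch of how a relative seek can break UTF-8 decoding (the file name and contents are made up for the demonstration):

import io

with open('utf8_demo.txt', 'w', encoding='utf-8') as out:
    out.write('naïve text')    # the 'ï' is encoded as two bytes in UTF-8

with open('utf8_demo.txt', encoding='utf-8') as f:
    f.buffer.seek(3)           # byte 3 is the second byte of 'ï'
    f = io.TextIOWrapper(f.buffer, encoding='utf-8')
    f.read()                   # raises UnicodeDecodeError: invalid start byte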

Python code to read first 14 characters, uniquify based on them, and remove duplicates

I have a list of more than 10k strings that look like different versions of this: HN5ML6A02FL4UI_3 (14 numbers or letters, then _1 to _6), where some are duplicates except for the _1 to _6 suffix.
I am trying to find a way to list these and remove the entries whose 14-character prefix (the part before the _1 to _6) is a duplicate.
Example of part of the list:
HN5ML6A02FL4UI_3
HN5ML6A02FL4UI_1
HN5ML6A01BDVDN_6
HN5ML6A01BDVDN_1
HN5ML6A02GVTSV_3
HN5ML6A01CUDA2_1
HN5ML6A01CUDA2_5
HN5ML6A02JPGQ9_5
HN5ML6A02JI8VU_1
HN5ML6A01AJOJU_5
I have tried scripts using regular expressions, such as var n = /\d+/.exec(info)[0]; (from answers posted to my previous question), and I also used a modified version of the code from: How can I strip the first 14 characters in a list element using Python?
More recently I used this script, and I am still not getting the correct output:
import os, re

def trunclist('rhodopsins_play', 'hope4'):
    with open('rhodopsins_play','r') as f:
        newlist=[]
        trunclist=[]
        for line in f:
            if line.strip().split('_')[0] not in trunclist:
                newlist.append(line)
                trunclist.append(line.split('_')[0])
    print newlist, trunclist
    # write newlist to file, with carriage returns
    with open('hope4','w') as out:
        for line in newlist:
            out.write(line)
My inputfile.txt contains more than 10k lines of data that look like the list above, where the important part is the characters in front of the '_' (underscore); the output should be a file uniquified on that prefix (e.g. ABCD12356_1).
Can someone help? Thank you for your help.
Run this script, which is similar to the above; it splits at the '_'. This worked on the file:
def trunclist(inputfile, outputfile):
    with open(inputfile, 'r') as f:
        newlist = []
        trunclist = []
        for line in f:
            if line.strip().split('_')[0] not in trunclist:
                newlist.append(line)
                trunclist.append(line.split('_')[0])
    print(newlist, trunclist)
    # write newlist to file, with carriage returns
    with open(outputfile, 'w') as out:
        for line in newlist:
            out.write(line)
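A small variant of the same idea, in case the input grows: membership tests on a list scan the whole list, so with 10k+ lines a set of already-seen prefixes is noticeably faster. A sketch, using the file names from the question:

def trunclist_fast(inputfile, outputfile):
    seen = set()                        # prefixes already written
    with open(inputfile) as f, open(outputfile, 'w') as out:
        for line in f:
            prefix = line.strip().split('_')[0]
            if prefix not in seen:      # O(1) with a set, O(n) with a list
                seen.add(prefix)
                out.write(line)

trunclist_fast('rhodopsins_play', 'hope4')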

Extracting source code from an HTML file using Python 3.1 urllib.request

I'm trying to obtain data from an HTML file using regular expressions, by implementing the following code:
import urllib.request
import re

def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in IDLE, I noticed that uf.read() indeed returns the HTML source the first time I invoke it, but from then on it returns b''. Is there any way to get around this?
uf.read() will only read the contents once; after that you have to close and reopen the connection to read it again. This is true for any kind of stream. That, however, is not the problem.
The problem is that reading from any kind of binary source, such as a file or a web page, returns the data as a bytes object unless you specify an encoding. But your regexp is not specified as a bytes pattern, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
That should work. Another option is to decode the text so it, too, is a unicode str:
# In Python 3, the response headers expose the charset via get_content_charset()
encoding = uf.headers.get_content_charset()
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
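If you do go the lxml route, here is a minimal sketch (lxml is a third-party package; urlname is the same variable as in the question, and the two-column <tr><td>...</td><td>...</td></tr> structure is assumed to match what the regexp was after):

import urllib.request
from lxml import html   # third-party: pip install lxml

uf = urllib.request.urlopen(urlname)
tree = html.fromstring(uf.read())    # lxml accepts bytes and sniffs the encoding
for row in tree.xpath('//tr'):
    cells = [td.text_content().strip() for td in row.xpath('td')]
    if len(cells) == 2:              # the same two-column rows the regexp targeted
        print(cells)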
