I have a vector "char" of type |S1, as in the example below:
masked_array(data=[b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' '],
mask=False,
fill_value=b'N/A',
dtype='|S1')
I want to get the string in it, in this example 'EU3768'
This example is taken from a netcdf file. Library used is netCDF4.
Further question: Why is there a b in front of all single letters?
Thanks for your help :)
First of all, let's answer the most basic question: what is the meaning of the b in front of each letter? The b simply indicates that the value is a bytes object rather than a str. The characters are stored as encoded bytes (plain ASCII here, which is also valid utf-8), so to get a string back they must be decoded. With that as a preamble, the following code will do the trick.
I am assuming that you can extract data from the masked_array. Then perform the following operations:
# Convert the list of bytes to a list of strings
ds = list(map(lambda x: x.decode('utf-8'), data))
# Convert the list of strings to a single string and strip any trailing spaces
sd = ''.join(ds).strip()
This could of course be performed in a single line of code as follows:
sd = ''.join(list(map(lambda x: x.decode('utf-8'), data))).strip()
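As an alternative sketch, assuming every element is a plain bytes object as in the question, the bytes can be joined first and decoded once instead of decoding each element separately. The helper name chars_to_str is hypothetical, and a plain list stands in for the masked array:

```python
def chars_to_str(chars, encoding="utf-8"):
    """Join single-byte elements, decode once, and strip padding spaces."""
    return b"".join(chars).decode(encoding).strip()


# Sample values copied from the question; a list stands in for the array.
data = [b'E', b'U', b'3', b'7', b'6', b'8', b' ', b' ', b' ', b' ']
print(chars_to_str(data))  # EU3768
```

Decoding once avoids one decode call per character, which matters when there are many rows.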
As an answer to your follow-up question, you might be able to let NumPy do some of the work by just working with the underlying bytes. For example, I can create a large number of similarly shaped objects via:
import numpy as np
from string import ascii_letters, digits
letters = np.array(list(ascii_letters + digits), dtype='S1')
v = np.random.choice(letters, (100_000, 10))
The first three elements of this look like:
[[b'W' b'B' b'W' b'4' b'O' b'B' b'A' b'4' b'Q' b'n']
[b'I' b'I' b'T' b'u' b'K' b'K' b'U' b'a' b'r' b'r']
[b'V' b'f' b'n' b'U' b'G' b'0' b'j' b'R' b'm' b'C']]
I can then convert these back to strings via some byte-level shenanigans:
[bytes.decode(s) for s in np.frombuffer(v, dtype='S10')]
The first three look like:
['WBW4OBA4Qn', 'IITuKKUarr', 'VfnUG0jRmC']
which hopefully makes sense. This takes ~20ms which is quicker than a version which goes through Python:
[b''.join(r).decode() for r in v]
taking ~200ms. This is still much faster than the version of code you posted, so maybe you could be accessing netcdf more efficiently.
I know there are already at least two topics that explain how map() works but I can't seem to understand its workings in a specific case I encountered.
I was working on the following Python exercise:
Write a program that computes the net amount of a bank account based on a
transaction log from console input. The transaction log format is
shown as following:
D 100
W 200
D means deposit while W means withdrawal. Suppose the following input
is supplied to the program:
D 300
D 300
W 200
D 100
Then, the output should be:
500
One of the answers offered for this exercise was the following:
total = 0
while True:
    s = input().split()
    if not s:
        break
    cm,num = map(str,s)
    if cm=='D':
        total+=int(num)
    if cm=='W':
        total-=int(num)
print(total)
Now, I understand that map applies a function (str) to an iterable (s), but what I'm failing to see is how the program identifies what is a number in the s string. I assume str converts each letter/number/etc in a string type, but then how does int(num) know what to pick as a whole number? In other words, how come this code doesn't produce some kind of TypeError or ValueError, because the way I see it, it would try and make an integer of (for example) "D 100"?
First,
cm,num = map(str,s)
could be simplified as
cm,num = s
since s is already a list of strings made of 2 elements (if the input is correct). No need to convert strings that are already strings. s is just unpacked into 2 variables.
the way I see it, it would try and make an integer of (for example) "D 100"?
No, it cannot, since num is the second element of the list produced by split().
If the input is "D 100", then s is ['D', '100'], so cm is 'D' and num is '100'.
Then, since num represents an integer, int(num) converts num to its integer value.
The above code is completely devoid of error checking (number of parameters, parameter "types"), but with correct input it works.
And map is completely useless in that particular example too.
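A minimal sketch of the same logic with the missing checks added, assuming the log arrives as a list of lines rather than from input(). The function name net_amount and the error messages are illustrative, not part of the original exercise:

```python
def net_amount(lines):
    """Add deposits (D) and subtract withdrawals (W) from 'D 100'-style lines."""
    total = 0
    for line in lines:
        parts = line.split()
        if not parts:
            break  # a blank line ends the log, as in the original loop
        if len(parts) != 2:
            raise ValueError(f"expected '<D|W> <amount>', got {line!r}")
        cm, num = parts        # plain unpacking, no map() needed
        amount = int(num)      # raises ValueError if num is not an integer
        if cm == 'D':
            total += amount
        elif cm == 'W':
            total -= amount
        else:
            raise ValueError(f"unknown transaction type {cm!r}")
    return total


print(net_amount(["D 300", "D 300", "W 200", "D 100"]))  # 500
```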
The reason is the .split() call in s = input().split(). It creates a list of the values D and 100 (that is, ['D', '100']), because split() splits on whitespace by default. The map function then applies str to both 'D' and '100'.
The map call is not really required, because both values are already of type str (strings) when they come from input().
The second question is how int(num) knows how to convert a string. This has to do with the second (implicit) argument, base. Just as .split() has a default separator to split on, int has a default base to convert from.
The full call is equivalent to int(num, base=10). So as long as num consists of the digits 0-9 (optionally with a leading sign and surrounding whitespace), int can convert it in base 10. Note that a decimal point is not accepted: int('1.5') raises ValueError. For more examples, check out the documentation for the built-in int.
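A few quick checks of how int() treats different strings may help here. The wrapper try_int is a hypothetical helper used only to make the failures visible without a traceback:

```python
def try_int(text, base=10):
    """Return int(text, base), or None if int() rejects the string."""
    try:
        return int(text, base)
    except ValueError:
        return None


print(try_int("100"))      # 100
print(try_int("  -42 "))   # -42 (sign and surrounding whitespace are fine)
print(try_int("ff", 16))   # 255 (explicit base)
print(try_int("1.5"))      # None (a decimal point is rejected)
print(try_int("D 100"))    # None (the unsplit line is rejected)
```

The last line shows why the split matters: int() only ever sees '100', never the whole "D 100" line.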
I'm converting strings to floats using float(x). However for some reason, one of the strings is "71.2\x0060". I've tried following this answer, but it does not remove the bytes character
>>> s = "71.2\x0060"
>>> "".join([x for x in s if ord(x) < 127])
'71.2\x0060'
Other methods I've tried are:
>>> s.split("\\x")
['71.2\x0060']
>>> s.split("\x")
ValueError: invalid \x escape
I'm not sure why this string is not formatted correctly, but I'd like to get as much precision from this string and move on.
Going off of wim's comment, the answer might be this:
>>> s.split("\x00")
['71.2', '60']
So I should do:
>>> float(s.split("\x00")[0])
71.2
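The filter from the question keeps the offending character because "\x00" is a single NUL character with ord 0, which is below 127. A small sketch of the split approach, with leading_float as a hypothetical helper name:

```python
def leading_float(text):
    """Parse the part of text before the first NUL character as a float."""
    head, _, _ = text.partition("\x00")
    return float(head)


s = "71.2\x0060"
print(len(s))            # 7: '7', '1', '.', '2', NUL, '6', '0'
print(ord(s[4]))         # 0, so a test like ord(x) < 127 lets it through
print(leading_float(s))  # 71.2
```

Whether the digits after the NUL belong to the number is not clear from the data, so this sketch simply discards them.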
Unfortunately the POSIX group \p{XDigit} does not exist in the re module. To remove the hex control characters with regular expressions anyway, you can try the following.
import re
re.sub(r'[\x00-\x1F]', r'', '71.2\x0060') # or:
re.sub(r'\\x[0-9a-fA-F]{2}', r'', r'71.2\x0060')
Output:
'71.260'
'71.260'
r means raw. Take a look at the control characters up to hex 1F in the ASCII table: https://www.torsten-horn.de/techdocs/ascii.htm
While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(Mind that <0x01> represents a non-editable entity. It's not recognized here.)
import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s=p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
or something along these lines in hopes to get a grasp of it while iterating through the whole string:
for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass
Nothing seems to get the job done.
I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)
Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.
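If several different control characters turn up (the question mentions both <0x00> and <0x01>), a character class can replace them all in one pass. This is a sketch under the assumption that tab, newline and carriage return should be kept; the sample string and the name replace_control_chars are made up for illustration:

```python
import re

# C0 control characters except tab (\x09), newline (\x0a) and CR (\x0d)
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def replace_control_chars(text, replacement='.'):
    """Replace each control character in text with replacement."""
    return CONTROL_CHARS.sub(replacement, text)


sample = "sich z\x01 B. irgendeine Größe\x00"
print(replace_control_chars(sample))  # sich z. B. irgendeine Größe.
```

Umlauts are untouched because their code points lie far above the 0x00-0x1F range.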
So I am trying to XOR two strings together but am unsure if I am doing it correctly when the strings are different length.
The method I am using is as follows.
def xor_two_str(a,b):
    xored = []
    for i in range(max(len(a), len(b))):
        xored_value = ord(a[i%len(a)]) ^ ord(b[i%len(b)])
        xored.append(hex(xored_value)[2:])
    return ''.join(xored)
I get output like so.
abc XOR abc: 000
abc XOR ab: 002
ab XOR abc: 5a
space XOR space: 0
I know something is wrong and I will eventually want to convert the hex value to ascii so am worried the foundation is wrong. Any help would be greatly appreciated.
Your code looks mostly correct (assuming the goal is to reuse the shorter input by cycling back to the beginning), but your output has a minor problem: it's not fixed width per character, so you could get the same output from two pairs of characters with a small (< 16) difference as from a single pair of characters with a large difference.
Assuming you're only working with "bytes-like" strings (all inputs have ordinal values below 256), you'll want to pad your hex output to a fixed width of two with leading zeroes, changing:
xored.append(hex(xored_value)[2:])
to:
xored.append('{:02x}'.format(xored_value))
which saves a temporary string (hex + slice makes the longer string then slices off the prefix, when format strings can directly produce the result without the prefix) and zero-pads to a width of two.
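To make the ambiguity concrete, here is a small sketch comparing the two formats on hand-picked XOR results. The helper names unpadded and padded are illustrative only:

```python
def unpadded(values):
    """Hex digits with no padding, as in the original code."""
    return ''.join(hex(v)[2:] for v in values)


def padded(values):
    """Two hex digits per value, zero-padded."""
    return ''.join('{:02x}'.format(v) for v in values)


# Two small values and one larger value collide without padding...
print(unpadded([0x1, 0x2]), unpadded([0x12]))  # 12 12
# ...but stay distinguishable with a fixed width of two.
print(padded([0x1, 0x2]), padded([0x12]))      # 0102 12
```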
There are other improvements possible for more Pythonic/performant code, but that should be enough to make your code produce usable results.
Side-note: When running your original code, xor_two_str('abc', 'ab') and xor_two_str('ab', 'abc') both produced the same output, 002, which is what you'd expect (since xor-ing is commutative, and you cycle the shorter input, reversing the arguments to any call should produce the same results). Not sure why you think it produced 5a. My fixed code just makes the outputs 000000, 000002, 000002, and 00; padded properly, but otherwise unchanged from your results.
As far as other improvements to make, manually converting character by character, and manually cycling the shorter input via remainder-and-indexing is a surprisingly costly part of this code, relative to the actual work performed. You can do a few things to reduce this overhead, including:
Convert from str to bytes once, up-front, in bulk (runs in roughly one seventh the time of the fastest character by character conversion)
Determine up front which string is shortest, and use itertools.cycle to extend it as needed, and zip to directly iterate over paired byte values rather than indexing at all
Together, this gets you:
from itertools import cycle
def xor_two_str(a,b):
    # Convert to bytes so we iterate by ordinal, determine which is longer
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = []
    for x, y in zip(long, cycle(short)):
        xored_value = x ^ y
        xored.append('{:02x}'.format(xored_value))
    return ''.join(xored)
or to make it even more concise/fast, we just make the bytes object without converting to hex (and just for fun, use map+operator.xor to avoid the need for Python level loops entirely, pushing all the work to the C layer in the CPython reference interpreter), then convert to hex str in bulk with the (new in 3.5) bytes.hex method:
from itertools import cycle
from operator import xor
def xor_two_str(a,b):
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = bytes(map(xor, long, cycle(short)))
    return xored.hex()
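Since the question mentions eventually converting the hex value back to ASCII, here is a sketch of the round trip using bytes.fromhex. It assumes the second argument acts as a repeating key and is non-empty; xor_bytes is a hypothetical helper that works on bytes directly rather than on str:

```python
from itertools import cycle
from operator import xor


def xor_bytes(data, key):
    """XOR data with key, repeating key as needed (key must be non-empty)."""
    return bytes(map(xor, data, cycle(key)))


hex_text = xor_bytes(b'abc', b'ab').hex()
print(hex_text)                                 # 000002

# XOR-ing with the same key again undoes the operation.
raw = bytes.fromhex(hex_text)
print(xor_bytes(raw, b'ab').decode('latin-1'))  # abc
```

The fixed two-digit width is what makes bytes.fromhex able to parse the output unambiguously.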
I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A. In other words, all A's will be replaced with T's, all C's with G's, etc.
I had previously used the replace() function to replace all occurrences of 'T' with 'U' and was hoping that the replace function would take a list of characters to replace with another list of characters but I haven't been able to make it work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")
It seems that replace doesn't yet do translations quite like, say, tr in Bash. So instead, here are a couple of approaches using a dictionary mapping (the BioJulia package also appears to make similar use of dictionaries):
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([compliments[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
    compliments[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character in turn, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place, rather you need to build a new string.
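For comparison only, the same mapping idea can be sketched in Python (not Julia), where a translation table plays the role of the dictionary. It also includes the reversal step that the original problem asks for; reverse_complement is a hypothetical name:

```python
# Translation table mapping each nucleotide to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every nucleotide, then reverse the resulting string."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

As in Julia, Python strings are immutable, so both steps build new strings rather than modifying the input.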