I'm getting completely confused with string encodings in Python. I read a number of other answers, but none show what is really going on in the last three lines of the code below:
filename = "/path/to/file.txt" #textfile contains only the string "\bigcommand"
with open(filename,'r') as f:
    file = list(f)
val = file[0] #val = '\\bigcommand\n'
valnew = val.encode('unicode-escape') #valnew = b'\\\\bigcommand\\n'
valnewnew = str(valnew,'utf-8') #valnewnew = '\\\\bigcommand\\n'
Why is the valnew variable suddenly a bytestring? I thought it would be the same as before - but just with the escape characters doubled?
Is there a shorter way to do this, than the convoluted last three lines, in order to get the output of valnewnew?
This will get you the output of valnewnew:
val = file[0].encode('unicode-escape').decode()
with open('t', 'r') as f:
    file = list(f)
    val = file[0].encode('unicode-escape').decode() # value: '\\\\bigcommand\\n'
When you encode a string in Python 3.x, you're encoding the string into bytes, which then need to be decoded to get a string back as a result.
If you give some insight into what you're trying to do, I can try to expand.
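For illustration, here is a minimal sketch (using the literal value from your question instead of a file) that shows which type you have at each step:
val = '\\bigcommand\n'                      # what file[0] gives you: a str
step1 = val.encode('unicode-escape')        # bytes: b'\\\\bigcommand\\n'
step2 = step1.decode()                      # str again: '\\\\bigcommand\\n'
print(type(val), type(step1), type(step2))  # <class 'str'> <class 'bytes'> <class 'str'>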
So, while reading a CSV file into Python, some of the variables have the following structure:
'"variable"'
I stored them in listed tuples.
Now, some of these variables have to be compared to each other as they are numeric.
But I can't seem to find a way to compare them to each other. For example:
counter = 0
if '"120000"' < '"130000"':
    counter += 1
However, the counter remains at 0.
Any advice on how to work with these types of data structures?
I tried converting them to integers, but this gives me a ValueError.
The original file has the following layout:
Date,"string","string","string","string","integer"
I read the file as follows:
with open(dataset, mode="r") as flight_information:
    flight_information_header = flight_information.readline()
    flight_information = flight_information.read()
    flight_information = flight_information.splitlines()
    flight_information_list = []
    for lines in flight_information:
        lines = lines.split(",")
        flight_information_tuple = tuple(lines)
        flight_information_list.append(flight_information_tuple)
For people in the future, the following solved my problem:
Since tuples are immutable, I now remove the "" around my numerical values while loading the CSV file:
Example:
with open(dataset, mode="r") as flight_information:
    flight_information_header = flight_information.readline()
    flight_information = flight_information.read()
    flight_information = flight_information.splitlines()
    flight_information_list = []
    for lines in flight_information:
        lines = lines.replace('"', '').split(",")
        flight_information_tuple = tuple(lines)
        flight_information_list.append(flight_information_tuple)
Note this line in particular:
lines = lines.replace('"', '').split(",")
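As a side note: once the quotes are gone, you can also convert the numeric columns with int() (this is what was raising the ValueError on the quoted values) and compare them as numbers. A minimal sketch with made-up values:
a, b = '"120000"', '"130000"'       # hypothetical quoted values from the CSV
a_num = int(a.replace('"', ''))     # 120000
b_num = int(b.replace('"', ''))     # 130000
print(a_num < b_num)                # True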
I have a text file. I want to remove the punctuation and save it as a new file, but it is not removing anything. Any idea why?
code:
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # Print string without punctuation
    print(string)

file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
punctuation(l)

with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
Why is it not removing any punctuation?
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # return string without punctuation
    return string

file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
file_no_punc = punctuation(file_no_punc)

with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
Explanation:
I changed only punctuation(l) to file_no_punc = punctuation(file_no_punc) and print(string) to return string
1) What is l in punctuation(l)?
2) You are calling punctuation() - which works correctly - but you do not use its return value.
3) Because it is not currently returning a value, just printing it ;-) (see the short sketch below)
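To illustrate point 3 with a throwaway example (the function names here are made up):
def shout_print(s):
    print(s.upper())            # prints, but implicitly returns None

def shout_return(s):
    return s.upper()            # hands a value back to the caller

result1 = shout_print("hi")     # prints HI
result2 = shout_return("hi")
print(result1)                  # None
print(result2)                  # HI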
Please note that I made only the minimal change to make it work. You might want to post it to our code review site, to see how it could be improved.
Also, I would recommend that you get a good IDE. In my opinion, you cannot beat PyCharm community edition. Learn how to use the debugger; it is your best friend. Set breakpoints, run the code; it will stop when it hits a breakpoint; you can then examine the values of your variables.
Taking out the file reading/writing, you could remove the punctuation from a string like this:
table = str.maketrans("", "", r"!()-[]{};:'\"\,<>./?##$%^&*_~")
# # or maybe even better
# import string
# table = str.maketrans("", "", string.punctuation)
file_with_punc = r"abc!()-[]{};:'\"\,<>./?##$%^&*_~def"
file_no_punc = file_with_punc.lower().translate(table)
# abcdef
where I use str.maketrans and str.translate.
Note that Python strings are immutable. There is no way to change a given string; every operation you perform on a string will return a new instance.
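A tiny sketch of that last point:
s = "a,b,c"
t = s.replace(",", "")   # replace() returns a new string
print(s)                 # a,b,c  (the original is unchanged)
print(t)                 # abc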
I have this file with some lines that contain some unicode literals like:
"b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
I want to remove those \xe2\x80\x99-like characters.
I can remove them if I declare a string that contains these characters but my solutions don't work when reading from a CSV file. I used pandas to read the file.
SOLUTIONS TRIED
1. Regex
2. Decoding and Encoding
3. Lambda
Regex Solution
line = "b'Who\xe2\x80\x99s he?\n\nA fan rushed the field to join the Cubs\xe2\x80\x99 celebration after Jake Arrieta\xe2\x80\x99s no-hitter."
code = (re.sub(r'[^\x00-\x7f]',r'', line))
print (code)
LAMBDA SOLUTION
stripped = lambda s: "".join(i for i in s if 31 < ord(i) < 127)
code2 = stripped(line)
print(code2)
ENCODING SOLUTION
code3 = (line.encode('ascii', 'ignore')).decode("utf-8")
print(code3)
HOW FILE WAS READ
df = pandas.read_csv('file.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(stripped(row['text']))
    print(re.sub(r'[^\x00-\x7f]', r'', row['text']))
    print(row['text'].encode('ascii', 'ignore').decode("utf-8"))
SUGGESTED METHOD
df = pandas.read_csv('file.csv', encoding="utf-8")
for index, row in df.iterrows():
    en = row['text'].encode()
    print(type(en))
    newline = en.decode('utf-8')
    print(type(newline))
    print(repr(newline))
    print(newline.encode('ascii', 'ignore'))
    print(newline.encode('ascii', 'replace'))
Your string is valid UTF-8, so it can be directly converted to a Python string.
You can then encode it to ASCII with str.encode(), which can ignore non-ASCII characters with 'ignore'.
Also possible: 'replace'
line_raw = b'Who\xe2\x80\x99s he?'
line = line_raw.decode('utf-8')
print(repr(line))
print(line.encode('ascii', 'ignore'))
print(line.encode('ascii', 'replace'))
'Who’s he?'
b'Whos he?'
b'Who?s he?'
To come back to your original question, your 3rd method was correct. It was just in the wrong order.
code3 = line.decode("utf-8").encode('ascii', 'ignore')
print(code3)
To finally provide a working pandas example, here you go:
import pandas
df = pandas.read_csv('test.csv', encoding="utf-8")
for index, row in df.iterrows():
    print(row['text'].encode('ascii', 'ignore'))
There is no need to do decode('utf-8'), because pandas does that for you.
Finally, if you have a python string that contains non-ascii characters, you can just strip them by doing
text = row['text'].encode('ascii', 'ignore').decode('ascii')
This converts the text to ascii bytes, strips all the characters that cannot be represented as ascii, and then converts back to text.
You should look up the difference between Python 3 strings and bytes; that should clear things up for you, I hope.
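As a quick, self-contained illustration of that str/bytes distinction (using the apostrophe from your data):
text = 'Who\u2019s he?'        # str: text with a real ’ character
raw = text.encode('utf-8')     # bytes: b'Who\xe2\x80\x99s he?'
back = raw.decode('utf-8')     # str again
print(type(text), type(raw), type(back))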
I have a csv file which contains data in two columns. The data is in decimal format.
I am trying to convert the data into hexadecimal format, then concatenate it.
I am able to convert and concatenate when data in column2 is non-zero.
For example: Column1 = 52281 and Column2 = 49152, then I am able to get CC39C000 (hex(52281) = CC39 and hex(49152) = C000).
However, if data in Column2 is zero:
Column1 = 52281 and Column2 = 0, then I get CC390 instead of CC390000.
Following is my code snippet:
file = open(inputFile, 'r')
reader = csv.reader(file)
for line in reader:
    col1, col2 = int(line[0]), int(line[1])
    newstr = '{:x}'.format(col1) + '{:x}'.format(col2)
When the data in column2 is 0, I am expecting to get 0000.
How can I modify my code to achieve this?
If you have
a=52281
b=0
You can convert to hex and calculate the longest string to fill with zeros:
hex_a = hex(a)[2:]  # [2:] removes the leading 0x; you might want to use [3:] if you have negative numbers
hex_b = hex(b)[2:]
longest=max(len(hex_a), len(hex_b))
Then you can fill with 0 with the zfill method:
print(hex_a.zfill(longest) + hex_b.zfill(longest))
If you only need 4 characters, you can do zfill(4).
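For instance, with the values from your question (the fixed width of 4 assumes every value fits in 16 bits):
a, b = 52281, 0
print(hex(a)[2:].zfill(4) + hex(b)[2:].zfill(4))   # cc390000

# Equivalent one-liner with a format spec, if 4 hex digits are always enough:
print('{:04x}{:04x}'.format(a, b))                 # cc390000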
Trying to adapt your code (which is hard to test, because I do not have access to the file):
file = open(inputFile, 'r')
reader = csv.reader(file)
for line in reader:
    col1, col2 = int(line[0]), int(line[1])
    hex_col1 = hex(col1)[2:] if col1 >= 0 else hex(col1)[3:]
    hex_col2 = hex(col2)[2:] if col2 >= 0 else hex(col2)[3:]
    longest = max(len(hex_col1), len(hex_col2))
    newstr = hex_col1.zfill(longest) + hex_col2.zfill(longest)
2331,0,13:30:08,25.35,22.05,23.8,23.9,23.5,23.7,5455,350,23.65,132,23.6,268,23.55,235,23.5,625,23.45,459,23.7,83,23.75,360,23.8,291,23.85,186,23.9,331,0,1,25,1000,733580089,name,,,
I got a line like this; how could I cut it? I only need the first 9 variables, like this:
2331,0,13:30:08,25.35,22.05,23.8,23.9,23.5,23.7,5455
The original data is saved as a .txt file. Could I rewrite the original one and save it?
Use either the csv module or straight file I/O with the string split function.
For example:
import csv
with open('some.txt', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row[:9])
or if everything is on a single line and you don't want to use a csv interface
with open('some.txt', 'r') as f:
    line = f.read()
    print(line.split(",")[:9])
If you have a file called "content.txt":
f = open("content.txt","r")
contentFile = f.read();
output = contentFile.split(",")[:9]
output = ",".join(output)
f.close()
f = open("content.txt","wb")
f.write(output)
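The same idea can also be written with with-blocks, so the file handles are closed automatically even if something fails; a minimal sketch:
with open("content.txt", "r") as f:
    output = ",".join(f.read().split(",")[:9])

with open("content.txt", "w") as f:
    f.write(output)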
If all your values are stored in an Array, you can slice like this:
arrayB = arrayA[:9]
To get your values into an array, you could split your string at every ",":
arrayA = inputString.split(",")