Unable to process strings in files with different encoding

I am trying to process the strings present in a particular file. The file is written in English. The problem arises when the encoding of the file differs from UTF-8: a file encoded as UTF-16-LE does not behave as expected. My main goal is to manipulate the strings read from the file; for example, strings.TrimSpace() only works correctly with the UTF-8 file.
I am aware that Go only supports UTF-8 by default, so any alternate approach would be helpful.
Personal Question
I would also like to point out that many newer programming languages process strings irrespective of their encoding, so why does Go only support UTF-8? If there were at least an alternative way to pass the encoding format to the reader, that would already help.
What I tried
I tried the unicode/utf8 and unicode/utf16 packages from the standard library.
Code
(main.go)
Sample code to show the difference:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"strings"
)

func processFile(src string) {
	data, err := ioutil.ReadFile(src)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("--- original source ---")
	fmt.Println(string(data))
	fmt.Println(http.DetectContentType(data))
	fmt.Println("\n--- modified source ---")
	for _, val := range strings.Split(string(data), "\n") {
		fmt.Println(strings.TrimSpace(val))
	}
}

func main() {
	processFile("./utf-16-english.txt")
	processFile("./utf-8-english.txt")
}
File-1
(utf-8-english.txt)
Hello
This is
Sample
Document
File-2
(utf-16-english.txt)
Hello
This is
Sample
Document
EDIT
It seems the only way to process the strings reliably is to convert them to UTF-8 first; kindly refer to the accepted answer.
As suggested in the comments, I have written the program's output to the respective files. The special symbols are no longer present, and the string processing works fine once the input is UTF-8.

You have to decode the UTF-16-encoded file. Decoding converts the input to UTF-8, after which you can use the string libraries to process it.
You can use something like this:
import "unicode/utf16"
func processFile(src string, decode func(in[]byte) string) {
data, _ := ioutil.ReadFile(src)
fmt.Println("--- original source ---")
fmt.Println(decode(data))
fmt.Println("\n--- modified source ---")
for _, val := range strings.Split(decode(data), "\n") {
fmt.Println(strings.TrimSpace(val))
}
}
func main() {
processFile("./utf-16-english.txt",func(in []byte) string {
return string(utf16.Decode(in)) })
processFile("./utf-8-english.txt",func(in []byte) string {
return string(in)})
}
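If you can take a dependency on the golang.org/x/text module, there is also the alternative the question asks about: passing the encoding to the reader itself. A decoder from x/text/encoding/unicode can be wrapped around the file with a transform.Reader, so the rest of the code only ever sees UTF-8. A minimal sketch (file name taken from the question):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"

	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

func main() {
	f, err := os.Open("./utf-16-english.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// Decode UTF-16 little-endian to UTF-8 on the fly, honouring a BOM if present.
	dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
	scanner := bufio.NewScanner(transform.NewReader(f, dec))
	for scanner.Scan() {
		fmt.Println(strings.TrimSpace(scanner.Text()))
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

The same pattern works for any other encoding in x/text (for example charmap.Windows1252): decode at the boundary, then process everything as UTF-8 inside.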

Related

How to parse a configuration file (kind of a CSV format) using Lua

I am using Lua on a small ESP8266 chip, trying to parse a text string that looks like the one below. I am very new to Lua, and have tried many similar scripts found on this forum.
data="
-- String to be parsed\r\n
Tiempo1,20\r\n
Tiempo2a,900\r\n
Hora2b,27\r\n
Tiempo2b,20\r\n
Hora2c,29\r\n
Tiempo2c,18\r\n"
My goal would be to parse the string, and return all the configuration pairs (name/value).
If needed, I can modify the syntax of the config file because it is created by me.
I have been trying something like this:
var1,var2 = data:match("([Tiempo2a,]), ([^,]+)")
But it is returning nil, nil. I think I am going about this the wrong way.
Thank you very much for any help.
You need to use gmatch and, when capturing the values, exclude the non-printable characters (\r\n) at the end of each line, either with a class like [^%c]+ or, since the values are numeric, with %d+:
local data=[[
-- String to be parsed
Tiempo1,20
Tiempo2a,900
Hora2b,27
Tiempo2b,20
Hora2c,29
Tiempo2c,18]]
local t = {}
-- capture the name (word characters, shortest match) and the value
-- (everything up to the next control character)
for k, v in data:gmatch("(%w-),([^%c]+)") do
	t[#t+1] = { k, v }
	print(k, v)
end

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
I can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Anything I can do to advise RequestParser with the string encoding?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure RequestParser() tries to be too smart and can't handle the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes in as UTF-8. So when it sees a non-Unicode input field (among other fields that are Unicode!), it tries to load it as Unicode and fails. As a result, all characters come out as U+FFFD (the replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load the raw data using get_data(), decode it using cp1251, and parse it with a simple regexp.
import re
from flask import request

raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin-1, and even after some source diving I have no earthly clue where that happens. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the stream it gets handed, i.e. the argument to .load(). If that argument is a byte stream or a file-like object, then its encoding is determined based on the BOM, defaulting to UTF-8. But again: that happens at the stream level, not for individual scalar content after interpreting escapes. Since you hand .load() Unicode (as this is Python 3), that "stream" needs no further decoding. (Although irrelevant for this question: that is done in the reader.py:Reader methods stream and determine_encoding.)
The hex escapes (of the form \xAB) just put a specific hex value into the type the loader uses to construct the scalar, i.e. the value for key 'b', which is a normal Python 3 str, i.e. Unicode in one of its internal representations. That you get the â in your output is because of how your Python is configured to display str types.
So you won't "find" the place where ruamel.yaml decodes that byte sequence, because it is already assumed to be Unicode.
So the thing to do is double-decode your double-quoted scalars (you only have to address those, as plain, single-quoted, and literal/folded scalars cannot contain such hex escapes). There are various points at which you can try to do that, but I think constructor.py:RoundTripConstructor.construct_scalar and scalarstring.py:DoubleQuotedScalarString are the best candidates. The former might take some digging to find, but the latter is actually the type you get if you inspect that string after loading, once you add the option to preserve quotes:
import ruamel.yaml

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
Knowing that, you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
    __slots__ = ()
    style = '"'

    def __new__(cls, value, anchor=None):
        # type: (Text, Any) -> Any
        return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
encoding (you might have to put in additional checks to not double encode all
double quoted scalars0:
import ruamel.yaml

def my_new(cls, value, anchor=None):
    # type information only needed if using mypy
    # value is of type 'str': encode to bytes "without conversion" (latin-1),
    # then decode those bytes as UTF-8
    value = value.encode('latin_1').decode('utf-8')
    return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)

ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])

Remove duplicate line in text file

I have been searching around, but have not been able to find a script that performs all of the tasks below:
1) go through all text files in a folder
2) remove duplicate lines/rows from each text file (the text is already sorted, so the sorting part can be skipped)
3) save and overwrite the text files
Unfortunately, all the results I found only remove lines from one specific file and save the output under another file name.
I will then set up a scheduled task to run this script.
I don't have any scripting knowledge, only a little experience with batch scripts. Your help and guidance would be much appreciated.
Unfortunately, all the results I found only remove lines from one specific file and save the output under another file name.
I think you have your answer right here. I don't know which language you're writing in, but typically in this scenario I would do something like the following (see the sketch after this list):
Open file A
Read lines
Sort lines
Remove duplicate lines
Save as file B
Close file A
Rename file A to _backup or _original (not strictly necessary, but a good safeguard against data loss)
Rename file B to file A
Again, I don't know which language you're writing in, etc.; there really isn't enough detail here to answer the question any further.
The key point, though, is simply to delete your original file and rename your new file to the original name.
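As a rough illustration of those steps, here is a minimal sketch in Go (the file name input.txt is a placeholder; since the question says the text is already sorted, duplicates are adjacent and a one-line look-back is enough):

package main

import (
	"io/ioutil"
	"log"
	"os"
	"strings"
)

func main() {
	const src = "input.txt" // placeholder; point this at a real file

	// Open file A and read its lines.
	data, err := ioutil.ReadFile(src)
	if err != nil {
		log.Fatal(err)
	}
	// The input is already sorted, so keep a line only if it
	// differs from the previous one.
	var out []string
	prev := ""
	for i, line := range strings.Split(string(data), "\n") {
		if i == 0 || line != prev {
			out = append(out, line)
		}
		prev = line
	}
	// Save as file B, keep file A as a backup, then rename B to A.
	tmp := src + ".tmp"
	if err := ioutil.WriteFile(tmp, []byte(strings.Join(out, "\n")), 0644); err != nil {
		log.Fatal(err)
	}
	if err := os.Rename(src, src+".bak"); err != nil {
		log.Fatal(err)
	}
	if err := os.Rename(tmp, src); err != nil {
		log.Fatal(err)
	}
}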
I wrote and commented a little script in Go for you. It might help in your case if you know how to run it; if not, some quick research will help you.
package main

import (
	"io/ioutil"
	"log"
	"strings"
)

func main() {
	// get all files in the current directory
	files, err := ioutil.ReadDir(".")
	// check error
	if err != nil {
		log.Fatal(err)
	}
	// go through all the files
	for _, file := range files {
		// only process txt files (you can change this)
		if strings.HasSuffix(file.Name(), ".txt") {
			// read the whole file
			data, err := ioutil.ReadFile(file.Name())
			if err != nil {
				log.Println(err)
				continue
			}
			// split on a space; use "\n" instead to deduplicate whole lines
			lines := strings.Split(string(data), " ")
			// remove the duplicates from the lines slice (func we created below)
			RemoveDuplicates(&lines)
			// overwrite the original file with the deduplicated content
			if err := ioutil.WriteFile(file.Name(), []byte(strings.Join(lines, " ")), 0600); err != nil {
				log.Println(err)
			}
		}
	}
}

// RemoveDuplicates filters the slice in place, keeping only the first
// occurrence of each value.
func RemoveDuplicates(lines *[]string) {
	found := make(map[string]bool)
	j := 0
	for i, x := range *lines {
		if !found[x] {
			found[x] = true
			(*lines)[j] = (*lines)[i]
			j++
		}
	}
	*lines = (*lines)[:j]
}
Your file: hello hello yes no
Returned result: hello yes no
If you run this program in the directory containing all your files, it will remove the duplicates.
Hope it fits your needs.

Extracting source code from html file using python3.1 urllib.request

I'm trying to obtain data from an HTML file using regular expressions, with the following code:
import re
import urllib.request

def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
which returns an error:
File "extract.py", line 33, in extract_words
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Upon experimenting further in IDLE, I noticed that uf.read() indeed returns the HTML source code the first time I invoke it, but from then on it returns b''. Is there any way to get around this?
uf.read() will only read the contents once. After that you have to close the response and reopen it to read again; this is true for any kind of stream. That is, however, not the problem here.
The problem is that reading from any kind of binary source, such as a file or a webpage, returns the data as a bytes object unless you specify an encoding. But your regexp is not specified as bytes, it's specified as a unicode str.
The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:
match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
That should work. Another option is to decode the text so that it, too, is a unicode str:
encoding = uf.headers.get_content_charset()  # charset declared in the Content-Type header
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
(Also, to extract data from HTML, I would say that lxml is a better option).
