This is the first line of my .svg image:
<?xml version="1.0" encoding="UTF-8"?>
This is inside the head part of the .php file:
<meta charset="utf-8">
This is checked in the Notepad++ Encoding menu:
Encode in UTF-8 without BOM
or (I also tried):
Encode in UTF-8
Still, my image does not display the Unicode character.
I am looking for a way to replace a HTML tag with another, but keep the text.
I have a big HTML file, which contains:
<span class="desc e-font-family-cond">fork</span>
I want to replace <span> tag with <strong> tag:
<strong>fork</strong>
Tool doesn't really matter, but I am looking for a CLI way to do it.
I am not looking for a HTML processor, because input is a text file with some HTML code in it (not a clean/valid HTML) and I am manually working with the output (copy, modify, use later in its final place). I just want to save some time with the replace.
I would use GNU sed for this task following way, let file.txt content be
<span class="desc e-font-family-cond">fork</span>
then
sed -e 's/<span[^>]*>/<strong>/g' -e 's/<\/span>/<\/strong>/g' file.txt
output
<strong>fork</strong>
Explanation: the first expression replaces each opening span tag (including any attributes) with <strong>; the second replaces each closing </span> with </strong>.
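For comparison, the same two substitutions can be sketched in Python with the standard re module. This is only an illustration of the technique above, not a replacement for the sed command; the function name span_to_strong is made up for this example, and like the sed version it is a plain text substitution, not an HTML parser.

```python
import re


def span_to_strong(text: str) -> str:
    """Replace <span ...> / </span> with <strong> / </strong>, keeping the inner text."""
    # Opening tag: drop any attributes, like the first sed expression.
    text = re.sub(r"<span[^>]*>", "<strong>", text)
    # Closing tag, like the second sed expression.
    return re.sub(r"</span>", "</strong>", text)


print(span_to_strong('<span class="desc e-font-family-cond">fork</span>'))
```

Because the pattern is the same as in the sed command, it shares the same limits: a tag split across lines with a ">" inside an attribute value would not be handled.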
Consider using Python and a tool like BeautifulSoup to handle HTML. Trying to parse HTML with other tools like sed or awk can lead to terrible places.
As an example:
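from bs4 import BeautifulSoup
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup('<li><span class="desc e-font-family-cond">fork</span>', 'html.parser')
for span in soup.find_all('span'):
    span.name = 'strong'  # rename the tag; its text and attributes are kept
html_string = str(soup)
print(html_string)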
That's lightweight and pretty simple and the html is handled properly with a library that is specifically built to parse it.
Don't use AWK for processing HTML files. If you can turn your HTML file into an XHTML file, you can use xsltproc for an XML transformation as follows:
trans.xsl file:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="utf-8"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="span[@class='desc e-font-family-cond']">
<strong><xsl:apply-templates/></strong>
</xsl:template>
</xsl:stylesheet>
CLI command for invoking xsltproc, which has to be installed, obviously:
xsltproc trans.xsl file.html
The standard output of this command is the corrected HTML file as you want to have it.
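If xsltproc is not available, roughly the same transformation can be sketched with Python's standard xml.etree.ElementTree. This is a hedged illustration, not an equivalent of the stylesheet in every respect: the function name retag_spans is invented here, the input must already be well-formed XML, and the sketch assumes the elements carry no XML namespace (with an xmlns declaration the tag would be "{namespace}span" instead of "span").

```python
import xml.etree.ElementTree as ET


def retag_spans(xhtml: str) -> str:
    """Turn <span class="desc e-font-family-cond"> elements into bare <strong> elements."""
    root = ET.fromstring(xhtml)
    for element in root.iter("span"):
        if element.get("class") == "desc e-font-family-cond":
            element.tag = "strong"
            # The stylesheet emits <strong> without attributes, so drop them here too.
            element.attrib.clear()
    return ET.tostring(root, encoding="unicode")


print(retag_spans('<li><span class="desc e-font-family-cond">fork</span></li>'))
```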
Using sed:
sed 's,<\(\/\)\?span\(\s\)\?,<\1strong\2,g'
$ echo '<span class="desc e-font-family-cond">fork</span>' | sed 's,<\(\/\)\?span\(\s\)\?,<\1strong\2,g'
<strong class="desc e-font-family-cond">fork</strong>
I have non-XML-compliant documents (XHTML pages) with improperly closed img, br, and hr tags.
I need to close the img, hr, and br tags properly, with '/>'.
I tried xmlstarlet; it does the job, but it alters the XML declaration header.
My original code is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
<head>
<title> </title>
<link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
If I run the command xmlstarlet fo --recover --html file.xhtml,
the output is incorrect; it has two declaration lines:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html>
<?xml version="1.0" encoding="UTF-8" standalone="no"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
<head>
<title> </title>
<link rel="stylesheet" type="text/css" href="style.css"/>
</head>
<body>
If I run xmlstarlet fo --omit-decl --recover --html file.xhtml,
the output is also incorrect, as the declaration needs to be the first line:
<!DOCTYPE html>
<?xml version="1.0" encoding="UTF-8" standalone="no"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
<head>
<title> </title>
<link rel="stylesheet" type="text/css" href="style.css"/>
</head>
<body>
So I need to do post-processing: swap the first and second lines. What bash command can help here? Please specify the command syntax for batch processing files and editing in place.
P.S. Why does xmlstarlet put two question marks at the end of the declaration? ("no"??>)
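I suggest appending | sed -n '1{h;d};2{p;g};p' to the command: line 1 is held back, line 2 is printed followed by the held line 1, and all remaining lines are printed unchanged.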
This might work for you (GNU sed):
sed -zE 's/(.*)\n(.*)/\2\n\1/m' file
This slurps the file into memory and swaps the contents of lines 1 and 2.
N.B. The m flag stops .* from matching across newlines, so each .* captures the contents of a single line.
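For batch processing with in-place editing, the same swap can also be sketched in Python. This is a minimal sketch under stated assumptions: the function names swap_first_two_lines and swap_all and the "*.xhtml" pattern are made up for this example, and files are handled as bytes so that no encoding is guessed.

```python
from pathlib import Path


def swap_first_two_lines(path: Path) -> None:
    """Swap lines 1 and 2 of a file in place; shorter files are left untouched."""
    lines = path.read_bytes().splitlines(keepends=True)
    if len(lines) < 2:
        return
    # If line 2 is the last line and lacks a newline, add one before it moves up.
    if not lines[1].endswith((b"\n", b"\r")):
        lines[1] += b"\n"
    lines[0], lines[1] = lines[1], lines[0]
    path.write_bytes(b"".join(lines))


def swap_all(directory: Path, pattern: str = "*.xhtml") -> None:
    """Apply the swap to every file in the directory that matches the pattern."""
    for path in sorted(directory.glob(pattern)):
        swap_first_two_lines(path)
```

Calling swap_all(Path(".")) would rewrite every matching file in the current directory, so it is worth trying on copies first.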
I have two XML files containing a "ß" ("scharfes S" in German), starting with:
<?xml version="1.0" encoding="utf-16" standalone="yes"?>
and
I used the following code to read the utf-8 file:
import xmltodict
with open('file.xml', encoding='utf-8') as file:
    f = file.read()
xml = xmltodict.parse(f)
and this code for the utf-16 file:
with open('file.xml', encoding='utf-16') as file:
    f = file.read()
xml = xmltodict.parse(f)
For the UTF-16 file I get this error: UnicodeError: UTF-16 stream does not start with BOM.
Changing everything to:
import os
with open('file.xml', encoding='utf-16') as file:
    file.seek(1, os.SEEK_SET)
    f = file.read()
xml = xmltodict.parse(f)
where I tried different offsets (e.g. seek(1, ...), seek(2, ...), ...) doesn't help.
Then I checked the encoding with:
alias vic="vim -c 'execute \"silent \!echo \" . &fileencoding | q'"
vic file.xml
> latin-1
(Therefore I changed encoding='utf-16' to encoding='latin-1'.)
But now I get errors about the "ß" in the code, e.g. when trying "utf-16-le":
"'utf-16-le' codec can't decode bytes in position 12734-12735: illegal encoding"
Does anyone know what the problem is here? Or more generally: how can I read XML files in Python with utf-8 or utf-16 encoding without getting BOM errors or errors about the character "ß"?
Thank you in advance!
If I create a UTF-16LE file:
$ echo 'Character is: ß' | iconv -t utf-16le >f.txt
and examine it with a hex dump:
$ xxd f.txt
00000000: 4300 6800 6100 7200 6100 6300 7400 6500 C.h.a.r.a.c.t.e.
00000010: 7200 2000 6900 7300 3a00 2000 df00 0a00 r. .i.s.:. .....
and then read it in Python:
>>> open('f.txt', encoding='utf-16LE').read()
'Character is: ß\n'
then I get the expected results.
Your file is not correctly encoded with the encoding that you're declaring.
can't decode bytes in position 12734-12735: illegal encoding
Create a much smaller sample file, or generate one as suggested above and look for differences.
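To look for those differences, it can help to see the raw bytes around the reported position. The following is a rough sketch of that idea; the helper name find_decode_error and the context size are assumptions made for this example.

```python
from typing import Optional, Tuple


def find_decode_error(data: bytes, encoding: str, context: int = 8) -> Optional[Tuple[int, bytes]]:
    """Return (offset, nearby bytes) for the first decoding failure, or None if all is well."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the codec could not decode.
        start = max(err.start - context, 0)
        return err.start, data[start:err.end + context]
    return None


# "ß" stored as a single Latin-1 byte is not valid UTF-8.
print(find_decode_error("Stra\u00dfe".encode("latin-1"), "utf-8"))
```

For a real file, the bytes would come from reading it in binary mode, for example open('file.xml', 'rb').read(), and the returned bytes can then be compared with the hex dump of a known-good sample.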
If you find yourself messing with the file encoding manually when handling XML files, you're doing something wrong.
Fundamental rule: Never read XML files with open() in text mode.
Use an XML parser to load the file. The parser will sort out the encoding for you automatically. That's the whole point of having an XML declaration like <?xml version="1.0" encoding="utf-16"?> at the top of the file.
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
If you want to use xmltodict, open the file in binary mode (rb):
import xmltodict
with open('file.xml', 'rb') as f:
    xml = xmltodict.parse(f)
Here, xmltodict will give the file to an XML parser internally, which again will sort out the encoding for you.
If the above mangles characters or even throws errors, your XML file is broken. Fix the producer of the file. If you've edited the file manually, double check that your text editor's encoding settings match the XML declaration.
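As a small self-check of the point above, the sketch below builds the same tiny document in UTF-8 and in UTF-16 and hands the raw bytes to the parser, without ever decoding them by hand. The helper name build_sample and the sample word are made up for this example; Python's "utf-16" codec writes a BOM when encoding, which is what a correctly produced UTF-16 file would contain.

```python
import xml.etree.ElementTree as ET


def build_sample(encoding: str, text: str = "Stra\u00dfe") -> bytes:
    """Build a tiny XML document whose declaration matches its actual byte encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return f"{declaration}<word>{text}</word>".encode(encoding)


for encoding in ("utf-8", "utf-16"):
    raw = build_sample(encoding)
    # The parser receives bytes and works out the encoding itself.
    root = ET.fromstring(raw)
    print(encoding, root.text)
```

If a file that claims encoding="utf-16" fails at this step, that points to the file itself being inconsistent rather than to the reading code.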
I'm issuing the command xdg-mime install nv-custom.xml - using this file:
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
<mime-type type="text/x-customtest">
<comment></comment>
<glob weight="60" pattern="*...someUnicodeHere"/>
</mime-type>
</mime-info>
Is it possible to use Unicode in the glob's pattern field?
I tried Ctrl + Shift + U followed by a code point, which enters the character fine, but then the file has no effect, whereas it works just fine with "regular" text.
How can I add a custom xmlns in the output when I convert an asciidoc file with AsciiDoctor?
I'd like to add xmlns:xi="http://www.w3.org/2001/XInclude" in the top book tag.
The current implementation seems to generate:
<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<book xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
<info>
<title>title</title>
</info>
</book>
from this:
= title
:lang: en
When I run:
$ asciidoctor -b docbook5 -d book -o out.xml source.txt
There is a built-in attribute xmlns, but it seems to be for DocBook 4.5.
The reason I want to use XInclude is to include some XML files from Docinfo files and Passthrough Blocks.
With a bit of research inside the asciidoctor code it quickly became clear that the part you'd like to modify is fairly static.
See asciidoctor/converter/docbook5.rb Line 44 for more info.
The best approach is to create a postprocessor extension which modifies the output. The example below is just to show a possible implementation.
Create a file with the following content and call it docbook_postprocessor.rb.
class Docbook5XiPostprocessor < Asciidoctor::Extensions::Postprocessor
  def process document, output
    if document.basebackend? 'docbook'
      input_regex = %r{^(<.*xmlns:xl="http://www.w3.org/1999/xlink") (version="5.0".*>)}
      replacement = %(\\1 xmlns:xi="http://www.w3.org/2001/XInclude" \\2)
      output = output.sub(input_regex, replacement)
    end
    output
  end
end
Asciidoctor::Extensions.register do
  postprocessor Docbook5XiPostprocessor
end
Note: For the sake of brevity, the above extension is placed in the same directory as the asciidoctor source file, called source.adoc.
Then run the asciidoctor command with the -r ./docbook_postprocessor.rb parameter.
$ asciidoctor -r ./docbook_postprocessor.rb -b docbook5 -d book -o - source.adoc
<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<book
xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink"
xmlns:xi="http://www.w3.org/2001/XInclude"
version="5.0"
xml:lang="en">
<info>
<title>test</title>
<date>2020-12-19</date>
</info>
</book>
* The above output has been slightly reformatted to eliminate the scrollbar.
Creating a Ruby gem from the above code for easier distribution is left as an exercise for the reader.