I want to use https://pypi.org/project/pyclibrary/ to parse some .h files.
Some of those .h files are unfortunately not UTF-8 encoded - Notepad++ tells me they are "ANSI" encoded (and as they originate on Windows, I guess that means CP-1252? Not sure ...)
Anyway, I can reduce the problem to this example:
mytest.h:
/*******************************************************
Just a test header file
© Copyright myself
*******************************************************/
#ifndef _MY_TEST_
#define _MY_TEST_
#endif
The tricky part here is the copyright character - and just to make sure, here is a hexdump of this:
$ hexdump -C mytest.h
00000000 2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |/***************|
00000010 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000030 2a 2a 2a 2a 2a 2a 2a 2a 0d 0a 4a 75 73 74 20 61 |********..Just a|
00000040 20 74 65 73 74 20 68 65 61 64 65 72 20 66 69 6c | test header fil|
00000050 65 0d 0a a9 20 43 6f 70 79 72 69 67 68 74 20 6d |e... Copyright m|
00000060 79 73 65 6c 66 0d 0a 2a 2a 2a 2a 2a 2a 2a 2a 2a |yself..*********|
00000070 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00000090 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2f 0d |**************/.|
000000a0 0a 0d 0a 23 69 66 6e 64 65 66 20 5f 4d 59 5f 54 |...#ifndef _MY_T|
000000b0 45 53 54 5f 0d 0a 23 64 65 66 69 6e 65 20 5f 4d |EST_..#define _M|
000000c0 59 5f 54 45 53 54 5f 0d 0a 23 65 6e 64 69 66 0d |Y_TEST_..#endif.|
000000d0 0a |.|
000000d1
And then I try this Python script:
mytest.py
#!/usr/bin/env python3
import sys, os
from pyclibrary import CParser
myhfile = "mytest.h"
c_parser = CParser([myhfile])
print(c_parser)
When I run this, I get:
$ python3 mytest.py
Traceback (most recent call last):
File "mytest.py", line 7, in <module>
c_parser = CParser([myhfile])
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 443, in __init__
self.load_file(f, replace)
File "/usr/lib/python3.8/site-packages/pyclibrary/c_parser.py", line 678, in load_file
self.files[path] = fd.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 83: invalid start byte
... and I guess "byte 0xa9 in position 83" is the copyright character. So, the way I see it:
I don't really have an option to choose the file encoding in pyclibrary - but I don't want to hack pyclibrary either.
I don't really want to edit the .h files either, just to make them UTF-8 compatible.
... and so, the only thing I can think of is to change the default encoding Python uses when opening files to ANSI/CP-1252/whatever, only for the call c_parser = CParser([myhfile]), and then restore the default UTF-8.
Is this possible to do somehow? I have seen Changing default encoding of Python? - but most of the answers there seem to imply that you should change the default encoding once, at the start of the script - I cannot find any references to changing the default encoding temporarily and then restoring the original UTF-8 default afterwards.
OK, I think I got it - I found this thread: Windows Python: Changing encoding using the locale module - and was inspired to try the locale module; note that I'm working in MSYS2 bash on Windows, and as such I use the MSYS2 Python 3. So the file is now:
mytest.py
#!/usr/bin/env python3
import sys, os
import locale
from pyclibrary import CParser
import pprint
myhfile = "mytest.h"
print( locale.getlocale() ) # ('en_US', 'UTF-8')
#pprint.pprint(locale.locale_alias)
locale.setlocale( locale.LC_ALL, 'en_US.ISO8859-1' )
c_parser = CParser([myhfile])
print(c_parser)
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
print( locale.getlocale() ) # ('en_US', 'UTF-8')
... and running this produces:
$ python3 mytest.py
('en_US', 'UTF-8')
============== types ==================
{}
============== variables ==================
{}
============== fnmacros ==================
{}
============== macros ==================
{'_MY_TEST_': ''}
============== structs ==================
{}
============== unions ==================
{}
============== enums ==================
{}
============== functions ==================
{}
============== values ==================
{'_MY_TEST_': None}
('en_US', 'UTF-8')
Well - this looks OK to me ...
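One refinement worth considering: wrapping the locale switch in a context manager guarantees the UTF-8 default comes back even if CParser raises. This is only a sketch layered on the script above - it relies on the same mechanism (open() falling back to the locale's preferred encoding), and temporary_locale is a helper name of my own, not from any library:
import locale
from contextlib import contextmanager
from pyclibrary import CParser

@contextmanager
def temporary_locale(name):
    """Switch LC_ALL for the duration of the with-block, then restore the old setting."""
    saved = locale.setlocale(locale.LC_ALL)          # current setting, returned as a string
    locale.setlocale(locale.LC_ALL, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_ALL, saved)       # restored even if CParser raises

with temporary_locale('en_US.ISO8859-1'):
    c_parser = CParser(["mytest.h"])
print(c_parser)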
Assume I visit the link somerandomwebsite.com/a.pdf and download the file a.pdf. Now assume that the host replaces a.pdf with a new version of the same file under the same name, so that the previous link now leads me to download a different file.
Is there a way for me to prove that the file I downloaded was indeed downloaded from that link at a given time?
File Attribute
This is by no means a proof you can use to convince someone else, but if your browser, platform, and file system support it, you may find an xattr on the downloaded file that tells you the URL.
On macOS:
$ xattr -l -p com.apple.metadata:kMDItemWhereFroms Downloads/logo-stackoverflow.svg
com.apple.metadata:kMDItemWhereFroms:
00000000 62 70 6C 69 73 74 30 30 A1 01 5F 10 47 68 74 74 |bplist00.._.Ghtt|
00000010 70 73 3A 2F 2F 73 74 61 63 6B 6F 76 65 72 66 6C |ps://stackoverfl|
00000020 6F 77 2E 64 65 73 69 67 6E 2F 61 73 73 65 74 73 |ow.design/assets|
00000030 2F 69 6D 67 2F 6C 6F 67 6F 73 2F 73 6F 2F 6C 6F |/img/logos/so/lo|
00000040 67 6F 2D 73 74 61 63 6B 6F 76 65 72 66 6C 6F 77 |go-stackoverflow|
00000050 2E 73 76 67 08 0A 00 00 00 00 00 00 01 01 00 00 |.svg............|
00000060 00 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 54 |.....T|
00000076
On Linux:
$ getfattr -d logo-stackoverflow.svg
# file: logo-stackoverflow.svg
user.xdg.origin.url="https://stackoverflow.design/assets/img/logos/so/logo-stackoverflow.svg"
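If you want to read that attribute programmatically, Python can query it directly on Linux (os.getxattr is Linux-only; the file name is just the example from above):
import os

# Read the extended attribute that records where the file was downloaded from (Linux only).
url = os.getxattr("logo-stackoverflow.svg", "user.xdg.origin.url")
print(url.decode())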
Wayback Machine
You might find the URL was archived by a service such as the Internet Archive's Wayback Machine. For example: https://web.archive.org/web/20201101014003/https://stackoverflow.design/assets/img/logos/so/logo-stackoverflow.svg
Timestamping Authority (TSA)
For a convincing proof, you might rely on a third party to access the URL and provide a cryptographic signature over the contents, including a timestamp. For example, freetsa.org provides a "URL screenshot online" service you can use to get a signed PDF showing the accessed website.
I am learning the internals of git, and this is a tree in my repository:
git cat-file 88e38705fdbd3608cddbe904b67c731f3234c45b -p
100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt
When I use Ruby's zlib with:
require "zlib"
puts Zlib::Inflate.inflate(STDIN.read)
and pipe the output through hexdump -C:
cat .git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b | rinflate | hexdump -C
this is the output:
00000000 74 72 65 65 20 37 34 00 31 30 30 36 34 34 20 68 |tree 74.100644 h|
00000010 65 6c 6c 6f 2e 74 78 74 00 ce 01 36 25 03 0b a8 |ello.txt...6%...|
00000020 db a9 06 f7 56 96 7f 9e 9c a3 94 46 4a 31 30 30 |....V......FJ100|
00000030 36 34 34 20 77 6f 72 6c 64 2e 74 78 74 00 cc 62 |644 world.txt..b|
00000040 8c cd 10 74 2b ae a8 24 1c 59 24 df 99 2b 5c 01 |...t+..$.Y$..+\.|
00000050 9f 71 |.q|
00000052
However, when I use NodeJS:
const zlib = require("zlib");
const fs = require("fs");
fs.writeFileSync("/dev/stdout", zlib.inflateSync(fs.readFileSync("/dev/stdin")).toString());
I get this output:
00000000 74 72 65 65 20 37 34 00 31 30 30 36 34 34 20 68 |tree 74.100644 h|
00000010 65 6c 6c 6f 2e 74 78 74 00 ef bf bd 01 36 25 03 |ello.txt.....6%.|
00000020 0b ef bf bd db a9 06 ef bf bd 56 ef bf bd 7f ef |..........V.....|
00000030 bf bd ef bf bd ef bf bd ef bf bd 46 4a 31 30 30 |...........FJ100|
00000040 36 34 34 20 77 6f 72 6c 64 2e 74 78 74 00 ef bf |644 world.txt...|
00000050 bd 62 ef bf bd ef bf bd 10 74 2b ef bf bd ef bf |.b.......t+.....|
00000060 bd 24 1c 59 24 df 99 2b 5c 01 ef bf bd 71 |.$.Y$..+\....q|
Why this difference? And how can I make NodeJS and Ruby output the same thing?
In JavaScript, a string is a sequence of Unicode characters encoded in UTF-16. You can't store non-text content in a JavaScript string, since it doesn't provide a way to store data in any other encoding.
However, Git tree objects are binary and contain a cryptographic hash in binary format (usually SHA-1), so they aren't going to have text content and can't be stored in a JavaScript string. If you try to do so anyway, you're going to get invalid byte values replaced by U+FFFD, the replacement character, which is encoded in UTF-8 as 0xef 0xbf 0xbd, corrupting the data.
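As a quick illustration (Python here, but Node's Buffer.toString("utf8") replaces invalid sequences the same way), decoding the bytes that follow "hello.txt" with replacement and re-encoding them reproduces exactly the pattern in the NodeJS dump:
raw = b"\xce\x01\x36\x25"                                  # bytes right after "hello.txt\0" in the tree
mangled = raw.decode("utf-8", errors="replace").encode("utf-8")
print(mangled.hex(" "))                                    # ef bf bd 01 36 25 - matching the NodeJS output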
If you don't call toString(), your data is stored in some sort of binary buffer object and has exactly the bytes that zlib decoded.
Ruby, on the other hand, has an encoding per string and can store binary strings with the encoding ASCII-8BIT (also known as BINARY). So the Ruby code here works just fine.
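For comparison, here is a minimal Python equivalent (my own sketch, not from the question) that stays in bytes from start to finish; piping its output through hexdump -C gives the same dump as the rinflate pipeline:
import sys, zlib

# Read the loose object in binary mode and inflate it without ever decoding to text,
# so the 20-byte SHA-1 values inside the tree entries survive untouched.
with open(".git/objects/88/e38705fdbd3608cddbe904b67c731f3234c45b", "rb") as fh:
    sys.stdout.buffer.write(zlib.decompress(fh.read()))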
I need to test whether a program I'm writing parses the gzip header correctly, including the FEXTRA, FNAME, and FCOMMENT fields. Yet it seems that gzip doesn't support creating archives with the FEXTRA and FCOMMENT fields -- only FNAME. Are there any existing tools that can set all three?
The Perl module IO::Compress::Gzip optionally lets you set the three fields you are interested in. (Fair disclosure: I am the author of the module.)
Here is some sample code that sets FNAME to "filename", FCOMMENT to "This is a comment" and creates an FEXTRA field with a single subfield with ID "ab" and value "cde".
use IO::Compress::Gzip qw(gzip $GzipError);
gzip \"payload" => "/tmp/test.gz",
Name => "filename",
Comment => "This is a comment",
ExtraField => [ "ab" => "cde"]
or die "Cannot create gzip file: $GzipError" ;
And here is a hexdump of the file it created.
00000000 1f 8b 08 1c cb 3b 3a 5a 00 03 07 00 61 62 03 00 |.....;:Z....ab..|
00000010 63 64 65 66 69 6c 65 6e 61 6d 65 00 54 68 69 73 |cdefilename.This|
00000020 20 69 73 20 61 20 63 6f 6d 6d 65 6e 74 00 2b 48 | is a comment.+H|
00000030 ac cc c9 4f 4c 01 00 15 6a 2c 42 07 00 00 00 |...OL...j,B....|
0000003f
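If Perl is not convenient, the same kind of file can be built by hand in Python. The standard gzip module does not expose FEXTRA or FCOMMENT, so the sketch below packs the RFC 1952 header fields manually (gzip_with_fields is my own helper name, not a library function):
import struct, time, zlib

def gzip_with_fields(payload, name, comment, extra_id, extra_data):
    """Build one gzip member with the FEXTRA, FNAME and FCOMMENT flags set (RFC 1952)."""
    flg = 0x04 | 0x08 | 0x10                                   # FEXTRA | FNAME | FCOMMENT
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, flg,
                         int(time.time()), 0, 255)             # magic, CM, FLG, MTIME, XFL, OS
    subfield = extra_id + struct.pack("<H", len(extra_data)) + extra_data
    header += struct.pack("<H", len(subfield)) + subfield      # XLEN, then a single subfield
    header += name.encode("latin-1") + b"\x00"                 # FNAME, NUL-terminated
    header += comment.encode("latin-1") + b"\x00"              # FCOMMENT, NUL-terminated
    co = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)   # raw deflate stream, no zlib wrapper
    body = co.compress(payload) + co.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + body + trailer

with open("/tmp/test.gz", "wb") as fh:
    fh.write(gzip_with_fields(b"payload", "filename", "This is a comment", b"ab", b"cde"))
Decompressing with gzip -dc /tmp/test.gz should then print the payload back.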
This will generate an alert:
alert tcp any any <> any any (msg:"Test_A"; sid:3000001; rev:1;)
This will not:
alert tcp any any <> any any (msg:"Test_B"; content:"badurl.com"; http_header; sid:3000002; rev:1;)
I have tried fast_pattern:only; metadata:service http; nocase; http_header; and others. I cannot get it to work at this generic level. Any ideas why the content option does not work? The packet does contain a URL.
Updated from the comments
0000 9c d2 4b 7d 96 60 3c 15 c2 dc 48 fa 08 00 45 00 ..K}.`<. ..H...E.
0010 01 5c ac 2c 40 00 40 06 cf f5 c0 a8 c8 1e 41 fe .\.,@.@. ......A.
0020 f2 b4 dc 41 00 50 d0 e7 97 d0 ae b8 f9 ba 80 18 ...A.P.. ........
0030 ff ff da 1f 00 00 01 01 08 0a 34 03 84 d8 b7 cc ........ ..4.....
0040 3f 04 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 ?.GET / HTTP/1.1
0050 0d 0a 48 6f 73 74 3a 20 6d 79 64 6f 6d 61 69 6e ..Host: mydomain
0060 2e 63 6f 6d 0d 0a 55 73 65 72 2d 41 67 65 6e 74 .com..Us er-Agent
The rule that you have provided will never fire against the example packet you have provided. You have used content:"POST"; with an http_method modifier, but you are attempting to match a packet that is a GET request.
I think that the right content modifier should be http_uri, not http_header. Unless you are trying to capture the Host POST parameter.
I am unable to unzip a file on Linux (CentOS). I am getting the following error:
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
As you mention jar in your comments, we can consider this a programming question ;-)
First of all, you should try to validate your file. If available, you can even compare the provided checksum and/or the file size with what the download location publishes.
To verify the zip file at a low level, you can use this command:
hexdump -C -n 100 file.zip
This will show you the first 100 bytes of the zip's structure, which will look similar to this:
00000000 50 4b 03 04 0a 00 00 00 00 00 88 43 65 47 11 7a |PK.........CeG.z|
00000010 39 1e 15 00 00 00 15 00 00 00 0e 00 1c 00 66 69 |9.............fi|
00000020 6c 65 31 69 6e 7a 69 70 2e 74 78 74 55 54 09 00 |le1inzip.txtUT..|
00000030 03 0f 05 3b 56 2f 05 3b 56 75 78 0b 00 01 04 e8 |...;V/.;Vux.....|
00000040 03 00 00 04 e8 03 00 00 54 68 69 73 20 69 73 20 |........This is |
00000050 61 20 66 69 6c 65 0a 1b 5b 31 37 7e 0a 50 4b 03 |a file..[17~.PK.|
00000060 04 0a 00 00 |....|
The first two bytes of the file have to be PK; if not, the file is invalid. Some bytes later you will find the name of the first file stored in the archive. In this example it is file1inzip.txt.
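The same check can be scripted; Python's zipfile module looks for exactly the end-of-central-directory record that the error message complains about (the path below is a placeholder for your archive):
import zipfile

path = "file.zip"                          # placeholder - point this at your archive

if not zipfile.is_zipfile(path):           # scans for the end-of-central-directory signature
    print("Not a zip file, or a truncated / multi-part archive")
else:
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()                 # CRC-checks every member, returns the first bad name
        if bad:
            print("First corrupt member:", bad)
        else:
            print("Archive looks OK")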