CTRL-UP hex sequence - keyboard

What is the hex sequence for a Ctrl-Up key press?
For example, Up is '0x1b, 0x5b, 0x41'.

I wrote a simple Python 3 script to find out the answer:
while True:
    # print each character of the typed line as a hex byte
    print(" ".join("{:x}".format(ord(x)) for x in input()))
I had to turn off the standard "Mission Control" shortcut for the Ctrl-Up keystroke, but after doing that I get the following in iTerm2:
$ python3 test.py
^[[A
1b 5b 41
^[[1;5A
1b 5b 31 3b 35 41
The first two lines represent an Up key press, while the last two represent Ctrl-Up. You can use this same program to find out the character sequences for any other keystroke.
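If your shell's line editing swallows or reinterprets some sequences before input() sees them, you can read raw bytes instead (no Enter needed). Here's a sketch assuming a POSIX terminal; press Ctrl-C to quit:
import sys, termios, tty

# Put the terminal in raw mode so escape sequences arrive untouched.
fd = sys.stdin.fileno()
old = termios.tcgetattr(fd)
try:
    tty.setraw(fd)
    while True:
        ch = sys.stdin.buffer.read(1)
        if ch == b"\x03":  # Ctrl-C
            break
        print("{:02x}".format(ch[0]), end=" ", flush=True)
finally:
    termios.tcsetattr(fd, termios.TCSADRAIN, old)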

Related


How to delimit file with "\t\n" on a Mac

I have a document whose lines are separated by "\t\n". Records are separated either by "\t", OR by "\n".
Normally, this should be a straightforward awk query:
BEGIN {
    RS='\t\n';
}
{
    print;
    print "Next entry:";
}
However, on a Mac, regular expressions do not seem to be supported (maybe I'm not doing something right?) So I tried, RS="\t\n"; however, this is interpreted as RS='\t | \n'. Similar problems running awk from the command line:
awk 1 RS='\t\n' ORS='abc' input > output
replaces the \t's, but leaves the \n's be.
Next try: using tr. This obviously fails for sequences of more than one character, since \t and \n are both used individually in the rows.
Next:
sed -e '/\t\n/s//NextEntry:/g' input > output
However, it doesn't work, although entering any plain ASCII character sequence instead of \t\n does.
Read the manual. It says that \t is not supported in sed strings. Fair enough:
sed -e '/\x9\xa/s//abc/' input > output
Still doesn't work. Idea: use tr to replace \t and \n by characters unused in the input file, use sed to change them to what I want, and then tr to change the remaining characters back to what they should be.
tr: Illegal byte sequence
Turns out that the 0xf6 character makes tr fail outright.
Went through the suggestions in Sed not recognizing \t instead it is treating it as 't' why?. Those might work for replacement strings (except the "Pasting tab into command prompt via CTRL+V" suggestion; the shell just rejected that paste), but they did not seem to help in my case.
Maybe it's because it's a Mac? Maybe it's because that's the text I'm looking for, not replacing with? Maybe it's the combination with \n?
Any other suggestions?
UPDATE:
I found the thread How can I replace a newline (\n) using sed?. Apparently, I am unable even to replace a \n with the string "abc" using the suggestions in that thread.
EDIT: Hex head of source file:
5a 20 4e 4f 09 0a 41 53 20 4f 46 20 30 31 2d 30
34 2d 30 35 20 45 4d 50 4c 4f 59 45 45 0a 47 52
4f 55 50 09 48 49 52 45 20 44 41 54 45 09 53 41
4c 41 52 59 09 4a 4f 42 20 54 49 54 4c 45 09 0a
4a 4f 42 20 4c 45 56 45 4c 0a 53 45 52 49 45 53
09 41 50 50 54 20 54 59 50 45 09 0a 50 41 59 20
53 54 41 54 55 53 0a f6
Unfortunately, BSD awk, as also used on macOS, doesn't support multi-character record separators (RS) at all (in line with POSIX); only a single, literal character is supported. (GNU awk, by contrast, accepts a regex RS.)
BSD sed, as also used on macOS, supports only \n in regexes; any other escapes, including hex ones (e.g., \x09), are not supported.
See this answer of mine for a comprehensive comparison of GNU and BSD sed.
Assuming that your sed command works in principle, you can use an ANSI C-quoted string ($'\t') to splice a literal tab character into your sed script (assumes bash, ksh, or zsh):
sed -e ':a' -e '$!{N;ba' -e '}' -e '/'$'\t''\n/s//NextEntry:/g'
Note that, in order to replace newlines, you must instruct sed to read the entire file into memory first, which is what -e ':a' -e '$!{N;ba' -e '}' does (the BSD Sed-compatible form of the common GNU sed idiom :a;$!{N;ba}).
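If juggling BSD versus GNU quirks gets tiresome, the same replacement is a few lines of Python. A sketch, assuming the file fits in memory (file names are placeholders):
# Replace every literal tab-plus-newline with a marker string.
with open("input") as f:
    text = f.read()
with open("output", "w") as f:
    f.write(text.replace("\t\n", "NextEntry:"))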

What encoding/compression algorithm is this?

I'm trying to reverse engineer a binary file format, but it has no magic bytes, and no specific extension. I can influence only a single aspect of the file: a short string. By trying different strings, I was able to figure out how data is stored in the file. It seems that the whole file uses some sort of simple encoding. I hope that finding the exact encoding allows me to narrow down my search for the file format. I know the file is generated by a Windows program written in C++.
Now, after much trial-and-error, I found that some sections of the file are encoded in runs. Each run starts with a byte that indicates how many bytes will follow and where the data is retrieved.
000ddddd (1 byte): take the following (ddddd)+1 bytes from the encoded data.
111····· ···ddddd ···bbbbb (3 bytes): go back (bbbbb)+1 bytes in the decoded data, and take the next (ddddd)+9 bytes from it.
ddd····· ··bbbbbb (2 bytes): go back (bbbbbb)+1 bytes in the decoded data, and take the next (ddd)+2 bytes from it.
Here's an example:
This is the start of the file, with the UTF-16 string abracadabra encoded in it:
. . . a . b . r . . c . . d . € .
0C 20 03 04 61 00 62 00 72 20 05 00 63 20 03 00 64 20 03 80 0D
To decode the string:
0C number of Unicode chars: 12 (11 chars + \0)
20 03 . . . ??
04 next 5
61 00 a .
62 00 b .
72 r
20 05 . a . back 6, take 3
00 next 1
63 c
20 03 . a . back 4, take 3
00 next 1
64 d
20 03 . a . back 4, take 3
80 0D b . r . a . back 14, take 6
This results in (UTF-16):
a . b . r . a . c . a . d . a . b . r . a .
61 00 62 00 72 00 61 00 63 00 61 00 64 00 61 00 62 00 72 00 61 00
However, I have no clue as to what encoding/compression algorithm this might be. It looks like some variant of LZ that doesn't use a dictionary (like LZ77), but so far I haven't been able to find any algorithm that matches this description. I'm also not sure whether the entire file is encoded like this, or only portions of it.
Do you know this encoding? Or do you have any hints for things I might look for in the file to identify the encoding?
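For reference, the three rules above are enough for a toy decoder. Here's a Python sketch (my reconstruction from the observations, not a known spec; back-references reaching past the start of the output are read as zero bytes, which matches the unexplained "??" bytes at the start of the example):
def decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        if c & 0xE0 == 0x00:              # 000ddddd: literal run
            n = (c & 0x1F) + 1
            out += data[i + 1:i + 1 + n]
            i += 1 + n
        elif c & 0xE0 == 0xE0:            # 111..... ...ddddd ...bbbbb: long back-reference
            n = (data[i + 1] & 0x1F) + 9
            back = (data[i + 2] & 0x1F) + 1
            for _ in range(n):            # byte by byte, so overlapping copies work
                out.append(out[-back] if back <= len(out) else 0)
            i += 3
        else:                             # ddd..... ..bbbbbb: short back-reference
            n = ((c >> 5) & 0x07) + 2
            back = (data[i + 1] & 0x3F) + 1
            for _ in range(n):
                out.append(out[-back] if back <= len(out) else 0)
            i += 2
    return bytes(out)

# The example body (after the 0C length byte) decodes to "abracadabra"
# in UTF-16LE, preceded by the three mystery zero bytes:
print(decode(bytes.fromhex("200304610062007220050063200300642003800d")).hex(" "))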
After your edit I think it's LZF, with the following differences from your observations:
The magic header and indication of compressed vs uncompressed has been removed in your example (not too surprising if it's embedded in a file).
You took the block length as one byte, but it is two bytes and big-endian, so the prior 0x00 is part of the length, which still works.
Could be NTFS compression, which is LZNT1. This idea is supported by the platform and the apparent 2-byte structure, along with the byte-alignment of the actual data.
The following elements are specific to this algorithm.
Chunks: Segments of data that are compressed, uncompressed, or that denote the end of the buffer.
Chunk header: The header for a compressed or uncompressed chunk of data.
Flag bytes: A bit flag whose bits, read from low order to high order, specify the formats of the data elements that follow. For example, bit 0 corresponds to the first data element, bit 1 to the second, and so on. If the bit corresponding to a data element is set, the element is a 2-byte compressed word; otherwise, it is a 1-byte literal value.
Flag group: A flag byte followed by zero or more data elements, each of which is a single literal byte or a 2-byte compressed word.
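To make the flag-byte description concrete, here's a small Python sketch that walks one flag group (illustration only; splitting each 2-byte compressed word into displacement and length is position-dependent in LZNT1 and omitted here):
def walk_flag_group(buf, pos):
    # One flag group: a flag byte, then up to 8 data elements.
    # Bit i of the flag byte (low to high) says whether element i
    # is a 2-byte compressed word (set) or a literal byte (clear).
    flags = buf[pos]
    pos += 1
    elements = []
    for bit in range(8):
        if pos >= len(buf):
            break
        if flags & (1 << bit):
            word = int.from_bytes(buf[pos:pos + 2], "little")
            elements.append(("copy", word))
            pos += 2
        else:
            elements.append(("literal", buf[pos]))
            pos += 1
    return elements, pos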

How to search and replace a specific hex code in an automated way

I have a 100M row file that has some encoding problems -- was "originally" EBCDIC, saved as US-ASCII, now UTF-8. I don't know much more about its heritage, sorry -- I've just been asked to analyze the content.
The "cents" character from EBCDIC is "hidden" in this file in random places, causing all sorts of errors. Here is more on this bugger: cents character in hex
Converting this file using iconv -f foo -t UTF-8 -c is not working -- the cents character prevails.
When I use a hex editor, I can find occurrences of 0xC2 0xA2 (c2a2). But in a BIG file, this isn't ideal. sed doesn't work at the hex level, so... Not sure about tr; I've only really used it for carriage return / newline.
What linux utility / command can I use to find and delete this character reasonably quickly on very big files?
2 parts:
1 -- a utility / command to find / count the number of these occurrences (octal \242)
2 -- a command to replace it (this works: tr '\242' ' ' < source > output)
How the text appears on my ubuntu terminal:
1019EQ?IT DEPT GENERATED
With xxd, how it looks at hex level (ascii to the side looks the same as above):
0000000: 3130 3139 4551 a249 5420 4445 5054 2047 454e 4552 4154 4544 0d0a
With xxd, how it looks with "show ebcdic" -- here, just showing the EBCDIC side:
......s.....&....+........
So hex "a2" is the culprit. I'm now trying xxd -E foo | grep a2 to count the instances up.
Adding output from od -ctxl, rather than xxd, for those interested:
0000000 1 0 1 9 E Q 242 I T D E P T G
31 30 31 39 45 51 a2 49 54 20 44 45 50 54 20 47
0000020 E N E R A T E D \r \n
45 4e 45 52 41 54 45 44 0d 0a
When you say the file was converted, what do you mean? Do you mean the binary file was simply dumped from an IBM 360 to another ASCII-based computer, or was the file itself converted to ASCII when it was transferred?
The question is whether the file is actually in a well-encoded state or not. The other question is how you want the file encoded.
On my Mac (which uses UTF-8 by default, just like Linux systems), I have no problem using sed to get rid of the ¢ character:
Here's my file:
$ cat test.txt
This is a test --¢-- TEST TEST
$ od -ctx1 test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - ¢ ** - - T E S T T E S T \n
2d c2 a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000040
You can see that cat has no problems printing out that ¢ character. And, you can see in the od dump the c2a2 encoding of the ¢ character.
$ sed 's/¢/$/g' test.txt > new_test.txt
$ cat new_test.txt
This is a test --$-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - $ - - T E S T T E S T \n
2d 24 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here, sed has no problems changing that ¢ into a $ sign. The dump now shows that the test file is strictly ASCII. The two-byte UTF-8 encoding of ¢ has become a nice clean single-byte $.
It looks like sed can handle your issue.
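That said, on a 100M-row file you may want to stream instead of letting sed or an editor slurp the whole thing. A Python sketch (file names are placeholders; if your file holds the bare 0xa2 byte, as your xxd dump suggests, use b"\xa2" as the pattern instead):
count = 0
with open("source", "rb") as src, open("output", "wb") as dst:
    for line in src:                          # binary mode: splits on \n only
        count += line.count(b"\xc2\xa2")      # UTF-8 cent sign
        dst.write(line.replace(b"\xc2\xa2", b" "))
print(count, "occurrences replaced")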
If you want to use this file on a Windows system, you can convert the file to the standard Windows Code Page 1252:
$ iconv -f utf8 -t cp1252 test.txt > new_test.txt
$ cat new_test.txt
This is a test --?-- TEST TEST
$ od -ctx1 new_test.txt
0000000 T h i s i s a t e s t -
54 68 69 73 20 69 73 20 61 20 74 65 73 74 20 2d
0000020 - 242 - - T E S T T E S T \n
2d a2 2d 2d 20 54 45 53 54 20 54 45 53 54 0a
0000037
Here's the file now in Code Page 1252, just the way Windows likes it! Note that the ¢ is now a single byte, octal 242 (hex a2).
So, what exactly is the issue? Do you need the file restricted to the 127 characters of pure ASCII? Do you need the file encoded so Windows machines can work on it? Are you having problems entering the ¢ character?
Let me know. I'm not from the government, and yet I'm here to help you.

Shell script printing contents of variable containing output of a command removes newline characters [duplicate]

I'm writing a shell script which will store the output of a command in a variable, process the output, and later echo the results. Here's what I've got:
stuff=$(diff -u pens tape)
# process the output
echo $stuff
The problem is, the output I get from running the script is this:
--- pens 2009-09-27 10:29:06.000000000 -0400 +++ tape 2009-09-18 16:45:08.000000000 -0400 @@ -1,4 +1,2 @@ -highlighter -marker -pencil -POSIX +masking +duct
Whereas I was expecting this:
--- pens 2009-09-27 10:29:06.000000000 -0400
+++ tape 2009-09-18 16:45:08.000000000 -0400
@@ -1,4 +1,2 @@
-highlighter
-marker
-pencil
-POSIX
+masking
+duct
It looks like the newline characters are being removed somehow. How do I get them to stay in?
If you want to preserve the newlines, enclose the variable in double quotes:
echo "$stuff"
When you write it without the double quotes, the shell splits $stuff into words at whitespace (blanks, tabs, and newlines), and echo then joins those words with single spaces; upon experimentation, it seems that form feeds, carriage returns, and backspaces are not counted as whitespace.
Demonstrating the interpretation of control characters as whitespace: ASCII 8 is backspace, 9 is tab, 10 is newline (LF), 11 is vertical tab, 12 is form feed, and 13 is carriage return. The first command generates a sequence of characters separated by the various control characters. The second command echoes the result with the original characters preserved - see the hex dump. The third command echoes the result with the shell splitting the words; you can see that the tab and newline were replaced by blanks (0x20).
$ x=$(./ascii 64 65 8 66 67 9 68 69 10 70 71 11 72 73 12 74 75 13 76 77)
$ echo "$x" | odx
0x0000: 40 41 08 42 43 09 44 45 0A 46 47 0B 48 49 0C 4A @A.BC.DE.FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$ echo $x | odx
0x0000: 40 41 08 42 43 20 44 45 20 46 47 0B 48 49 0C 4A @A.BC DE FG.HI.J
0x0010: 4B 0D 4C 4D 0A K.LM.
0x0015:
$
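(The ascii and odx programs above are the answerer's own helpers: ascii emits one raw byte per decimal argument, and odx is a hex dump along the lines of od -A x -t x1z. A rough Python stand-in for ascii:)
import sys

# Write one raw byte per decimal argument,
# e.g. "python3 ascii.py 64 65 8" emits 0x40 0x41 0x08.
sys.stdout.buffer.write(bytes(int(arg) for arg in sys.argv[1:]))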
