Search for ill-encoded characters in a file on Linux

I have a lot of huge CSV files, and some of them contain ill-encoded characters: in vi, I see things like "<8f>" or "<8e>", for example.
At first, I wanted to search and replace (:%s) all of these characters, but that would be a very long process because I would have to do it every time I handle a file, and I'm never sure whether new characters have appeared.
Is it possible to detect such characters, so that I can extract the lines containing them?
Perhaps a simple command exists that takes a file as an argument and creates a file containing only the problematic lines.
I'm not sure I have explained this very well...
Thanks in advance!

You could use :g/char/p in vim to print all the lines of a given file that contain char, or use the grep utility from the shell:
grep -lr 'char1\|char2\|char3' .
This will output all the files in a directory that contain any of the chars you have listed (-r makes the search recursive, and -l lists only the filenames rather than the matching lines). Drop -l to print the matching lines themselves.
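If the files are supposed to be UTF-8 (an assumption; the question does not say which encoding is expected), the problematic lines can also be found without listing the bad characters in advance, by trying to decode each line and keeping the ones that fail. A minimal Python sketch, where iter_bad_lines and extract_bad_lines are hypothetical helper names:

```python
def iter_bad_lines(lines, encoding="utf-8"):
    """Yield (line_number, raw_line) for every line that fails to decode."""
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            yield lineno, raw


def extract_bad_lines(src_path, dst_path, encoding="utf-8"):
    """Copy only the undecodable lines of src_path into dst_path; return their count."""
    count = 0
    # Binary mode keeps the bytes untouched and streams huge files line by line.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _lineno, raw in iter_bad_lines(src, encoding):
            dst.write(raw)
            count += 1
    return count
```

Calling extract_bad_lines("data.csv", "bad_lines.csv") (both file names are placeholders) would leave only the problematic lines in bad_lines.csv. A stray byte such as 0x8f is not valid on its own in UTF-8, so lines containing it are reported; if the files use a different encoding, pass it as the encoding argument.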

Related

Data hidden in jpg

I am currently looking for hidden data in a jpg file, but I have no idea how to proceed.
There is a jpg file containing text in a format I have never seen before:
-ne \xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01\x01\x01\x00\x60\x00\x60\x00\x00\xff\xdb\x00\x43\x00\x06\x04\x04\x05\x04\x04\x06\x05\x05\x05\x06\x06\x06\x07\x09\x0e\x09\x09\x08\x08\x09\x12\x0d\x0d\x0a\x0e\x15\x12\x16\x16\x15\x12\x14\x14\x17\x1a\x21\x1c\x17\x18\x1f\x19\x14\x14\x1d\x27\x1d\x1f\x22\x23\x25\x25\x25\x16\x1c\x29\x2c\x28\x24\x2b\x21\x24\x25\x24\xff\xdb\x00\x43\x01\x06\x06\x06\x09\x08\x09\x11\x09\x09\x11\x24\x18\x14\x18\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\x24\xff\xc0\x00\x11\x08\x01\x8e\x03\x4e\x03\x01\x22\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1f\x00\x00\x01\x05\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x10\x00\x02\x01\x03\x03\x02\x04\x03\x05\x05\x04\x04\x00\x00\x01\x7d\x01\x02\x03\x00\x04\x11\x05\x12\x21\x31\x41\x06\x13\x51\x61\x07\x22\x71\x14\x32\x81\x91\xa1\x08\x23
-ne \x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82\x09\x0a\x16\x17\x18\x19\x1a\x25\x26\x27\x28\x29\x2a\x34\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49\x4a\x53\x54\x55\x56\x57\x58\x59\x5a\x63\x64\x65\x66\x67\x68\x69\x6a\x73\x74\x75\x76\x77\x78\x79\x7a\x83\x84\x85\x86\x87\x88\x89\x8a\x92\x93\x94\x95\x96\x97\x98\x99\x9a\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xff\xc4\x00\x1f\x01\x00\x03\x01\x01\x01\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\xff\xc4\x00\xb5\x11\x00\x02\x01\x02\x04\x04\x03\x04\x07\x05\x04\x04\x00\x01\x02\x77\x00\x01\x02\x03\x11\x04\x05\x21\x31\x06\x12\x41\x51\x07\x61\x71\x13\x22\x32\x81\x08\x14\x42\x91\xa1\xb1\xc1\x09\x23\x33\x52\xf0\x15\x62\x72\xd1\x0a\x16\x24\x34\xe1\x25\xf1\x17\x18\x19\x1a\x26\x27\x28\x29\x2a\x35\x36\x37\x38\x39\x3a\x43\x44\x45\x46\x47\x48\x49
This is just the beginning of the file; there are at least a hundred such lines.
The file type reported by the file command is: file.jpg: ASCII text, with very long lines
I tried some of the common tools for identifying patterns or hidden data, like exiftool, strings, and xxd, but I found nothing.
If you have any idea what to do, it would be very much appreciated.
If it's a CTF challenge, there are some common ways to find the flag.
First, try to find the flag in the file metadata, such as the file's description field.
You can also try the tool stegsolve.jar.
For more advanced cases, where the stego data is hidden using some mathematical calculation, give this tool a try: zsteg
Perhaps I'm misunderstanding the problem here, but if your file actually starts with a backslash character followed by the characters x, f, f, \, x, d, 8 and so on, then what you're looking at is the binary content of a JPG file that has been converted into ASCII text.
If so, you need to convert this back into binary data. For example, on Linux or macOS, you could do this by entering the following on the command line:
echo -ne '\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...' > img.jpg
echo -ne '\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...' >> img.jpg
(Note: > sends the results to a new file, and >> appends to the end of the file)
Or alternatively in Python:
with open("img.jpg", "wb") as f:
    f.write(b'\xff\xd8\xff\xe0\x00\x10\x4a\x46\x49\x46\x00\x01...etc...')
    f.write(b'\x42\xb1\xc1\x15\x52\xd1\xf0\x24\x33\x62\x72\x82...etc...')
    # and so on for all the other lines
Either way, you should end up with a file called img.jpg containing the image you're after.
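Pasting a hundred lines by hand is tedious, so the same conversion can be scripted. The sketch below assumes that every line of the text file consists of an optional leading "-ne " followed only by \xNN escapes, as in the excerpt from the question. The names decode_escaped_line and convert are hypothetical, and any characters that are not part of an escape are copied through unchanged.

```python
import re

# Matches one textual escape such as \xff and captures its two hex digits.
HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def decode_escaped_line(line: bytes) -> bytes:
    """Turn one line of textual \\xNN escapes into the bytes it describes."""
    line = line.rstrip(b"\r\n")
    if line.startswith(b"-ne "):
        line = line[4:]  # drop the leftover echo options
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), line)


def convert(src_path, dst_path):
    """Decode every line of src_path and write the result to dst_path."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(decode_escaped_line(line))
```

For example, convert("file.jpg", "img.jpg") would read the text file from the question and write the binary image, after which the file command should report a JPEG if the assumption about the format holds.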

How to add file to .gitignore with \n in the filename

I want to add a file named file
name (the file name contains \n) to .gitignore.
I tried:
/file
name
/file\nname
file\
name
but had no luck.
Try
/file*name
* should match any characters.
Please tell me this is a homework assignment.
The .gitignore file uses glob(7)-style wildcard patterns, so embedding a '*' should work. Linux really doesn't care how a filename is spelled: any character other than '/' and the NUL byte can go into a filename. On a command line, be sure to use single quotes (not double quotes, mind you) so the shell doesn't get confused.
Reading filenames from the output of ls(1) and similar tools will split the name at the \n character, because anything that processes that output line by line treats a newline as the end of a line of text. The '/' and '\n' characters are treated specially at many levels of the software stack, such as parsing the pathname of a file, splitting a buffer into lines, displaying the filename, or letting awk(1), sed(1), and the like scan a list of filenames.
But I agree with another poster: this is a bad idea. No, I take that back, it's a horrible idea. Long-term maintainability is going to be a nightmare.
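The line-splitting problem is easy to reproduce. In this Python sketch (it assumes Linux or another system whose filesystem allows newlines in names; split_listing and demo are hypothetical helpers), the directory holds a single file, yet a newline-separated listing of it reads back as two names:

```python
import os
import tempfile


def split_listing(directory):
    """Join the directory's names with newlines, as a newline-separated listing
    would contain them, then split the text back into lines."""
    listing = "\n".join(sorted(os.listdir(directory)))
    return listing.splitlines()


def demo():
    """Create one file with a newline in its name and list it both ways."""
    with tempfile.TemporaryDirectory() as directory:
        open(os.path.join(directory, "file\nname"), "w").close()
        return os.listdir(directory), split_listing(directory)
```

Here demo() returns the real directory contents, one name containing a newline, together with what a line-oriented reader would see, the two separate names file and name.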

How to echo/print actual file contents on a unix system

I would like to see the actual file contents without it being formatted to print. For example, to show:
\n0.032,170\n0.034,290
Instead of:
0.032,170
0.34,290
Is there a command to echo the file's actual data in bash? I've tried using head, cat, more, etc. but all those seem to echo the "print-formatted" text. For example:
$ cat example.csv
0.032,170
0.34,290
How can I print the actual characters within the file?
This reads as if you misunderstand what the "actual characters in the file" are. You will not find the characters \ and n in that file, only a line feed, which is a single, specific character. So utilities like cat really do output exactly the characters in the file.
Putting it the other way around: if you really had those two characters literally in the file, then a utility like cat would actually output them. I just checked that, just to be sure.
You can easily check that yourself if you open the file in a hex editor. There you will see the byte 0A (decimal 10), which is the line feed character. You will not see the pair of characters \ and n anywhere in that file.
Many programming languages and shell environments use escape sequences like \n in string definitions and interpret them as control characters that could not be typed otherwise. So maybe that is where your impression comes from that your files should contain those two characters.
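The difference between the escape sequence and the character it stands for can be seen directly. A small Python illustration (hex_dump is a hypothetical helper; the sample data is taken from the question):

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)


# "\n" in a normal string literal is one character: the line feed (0x0a).
line_feed = "\n"
# In a raw string, r"\n" is two characters: a backslash (0x5c) and the letter n (0x6e).
backslash_n = r"\n"

print(hex_dump(line_feed.encode("ascii")))    # 0a
print(hex_dump(backslash_n.encode("ascii")))  # 5c 6e
print(hex_dump(b"0.032,170\n"))               # 30 2e 30 33 32 2c 31 37 30 0a
```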
To display newlines as \n, you might try:
awk 1 ORS='\\n' input-file
This is not the "actual characters in the file", as \n is merely a conventional method of displaying a newline, but this does seem to be what you want.
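The same display can be produced in Python. Like the awk command, this sketch only changes how the data is shown, not what the file contains; show_newlines is a hypothetical name.

```python
def show_newlines(text: str) -> str:
    """Replace each line feed with the two visible characters backslash and n."""
    return text.replace("\n", "\\n")


sample = "0.032,170\n0.34,290\n"
print(show_newlines(sample))  # 0.032,170\n0.34,290\n
```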

How to strip binary characters from a file?

I've got a file that contains lines that look like this in vim:
^[[0;32msalt-2016.3.2-1.el6.noarch^[[0;0m^M
which look like this in more:
salt-2016.3.2-1.el6.noarch
I would like to produce a copy of this file that only contains the displayed characters as more shows them. I tried piping it through dos2unix but it refuses to do anything, complaining that "dos2unix: Binary symbol 0x1B found at line 2".
Probably I could achieve what I want with some sed statements, but I'm wondering whether there is a linux/unix utility that will take output from more or cat and produce a file that contains only the whitespace and text as displayed?
There's something called ansifilter which does exactly this. I tested it out on my file and it works.
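Where installing ansifilter is not an option, the specific sequences shown in the question can be removed with a regular expression. The following Python sketch does not claim to reproduce how ansifilter works: it only strips CSI sequences (ESC [ followed by parameters and a final byte, which covers colour codes such as ^[[0;32m) and converts CRLF line endings to LF. The name strip_ansi is hypothetical.

```python
import re

# ESC [ + parameter bytes (0-9:;<=>?) + intermediate bytes (space to /)
# + one final byte (@ to ~).
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove CSI escape sequences, then turn CRLF line endings into LF."""
    without_escapes = CSI_SEQUENCE.sub("", text)
    return without_escapes.replace("\r\n", "\n")
```

Applied to the line from the question, strip_ansi leaves only salt-2016.3.2-1.el6.noarch followed by a newline. Other kinds of escape sequences would need additional patterns.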

Delete some lines from text using Linux command

I know how to match text using regex patterns but not how to manipulate them.
I have used grep to match and extract lines from a text file, but I want to remove those lines from the text. How can I achieve this without having to write a python or bash shell script?
I have searched on Google and was recommended to use sed, but I am new to it and don't know how it works.
Can anyone point me in the right direction or help me achieve this goal?
The -v option to grep inverts the search, reporting only the lines that don't match the pattern.
Since you know how to use grep to find the lines to be deleted, using grep -v and the same pattern will give you all the lines to be kept. You can write that to a temporary file and then copy or move the temporary file over the original.
grep -v pattern original.file > tmp.file
mv tmp.file original.file
You can also use sed, as shown in shellfish's answer.
There are multiple possible refinements of the grep solution, but for most people most of the time, what is shown is more or less adequate. It would be a good idea to use a per-process intermediate file name, preferably a random one such as the mktemp command gives you. You can also add code to remove the intermediate file on an interrupt, suppress interrupts while moving the file back, use copy and remove instead of move if the original file has multiple hard links or is a symlink, and so on. The sed command more or less works around these issues for you, but it is not cognizant of multiple hard links or symlinks.
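The question asks to avoid writing a script, and the two commands above are enough for that. For anyone who does need the refinements, here is a rough Python sketch of the same idea with a randomly named temporary file that is cleaned up on failure. The function name delete_matching_lines is hypothetical, the pattern uses Python's regular expression syntax rather than grep's (for instance, grep's \+ is written + in Python), and, like mv, the final rename replaces a symlink and breaks hard links.

```python
import os
import re
import shutil
import tempfile


def delete_matching_lines(path, pattern):
    """Rewrite path without the lines that match pattern (a bytes regex)."""
    regex = re.compile(pattern)
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates a uniquely named temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            for line in src:
                if not regex.search(line):
                    tmp.write(line)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # rename over the original
    except BaseException:
        os.unlink(tmp_path)              # do not leave the temporary file behind
        raise
```

A call such as delete_matching_lines("original.file", rb"pattern") corresponds to the grep -v and mv pair shown above.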
Create the pattern which matches the lines using grep. Then create a sed script as follows:
sed -i '/pattern/d' file
Explanation:
-i: edit the input file in place (overwrite it), thus removing the lines matching pattern.
pattern: the pattern you created for grep, e.g. ^a*b\+.
d: the sed command for delete; it deletes the lines matching the pattern.
file: the input file, given as a relative or absolute path.
For more information see man sed.
