How to grep the content of a zipped non-standard text file - Linux

On my Windows 10 PC, I have installed the Ubuntu app. There I'd like to grep the contents of a group of zip files, but let's start with just one zip file. My zip file contains two files: a crash dump and an error log (a text file) containing some information. I'm particularly interested in information within that error log file, conceptually something like:
<grep_inside> zipfile.zip "Access violation"
So far, this is my best result:
unzip -c zipfile.zip error.log
This prints the error log file, but as a garbled hex-like dump, which makes it impossible to run grep on it.
As proposed on different websites, I've also tried the following commands: vim, view, zcat, zless and zgrep; none of them works, for different reasons.
Some further investigation
This question is not a duplicate of this post, as suggested. I believe the issue is caused by the encoding of the log file, as you can see in the following results of other basic Linux commands after unzipping the error log file:
emacs error.log
... caused an Access Violation (0xc0000005)
cat error.log
. . . c a u s e d a n A c c e s s V i o l a t i o n ( 0 x c 0 0 0 0 0 0 5 )
Apparently the error.log file is not being recognised as a simple text file:
file error.log
error.log: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

In this post on grepping non-standard text files, I found the answer:
unzip -c zipfile.zip error.log | grep -a "A.c.c.e.s.s"
Now I have something to start from.
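Alternatively (a sketch, assuming the log really is UTF-16LE as file reported): convert the stream to UTF-8 with iconv before grepping, so you can search for the plain string instead of the dotted pattern. The sample file built below is a hypothetical stand-in for the unzipped log:

```shell
# Build a small UTF-16LE stand-in for the unzipped error log
# (hypothetical one-line content matching the question):
printf 'caused an Access Violation (0xc0000005)\n' \
    | iconv -f UTF-8 -t UTF-16LE > error.log

# Convert to UTF-8 on the fly, then grep normally. With the zip archive,
# the first stage would instead be: unzip -p zipfile.zip error.log
iconv -f UTF-16LE -t UTF-8 error.log | grep "Access Violation"
```

unzip -p writes the raw member to standard output (like -c but without the header line), which makes it convenient for piping into iconv.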
Thanks, everyone, for your cooperation.

Related

How to convert text to UTF-8 encoding within a text file

When I use a text editor to see the actual content, I see
baliÄ<8d>ky 0 b a l i ch k i
and when I use cat to see it, I see
baličky 0 b a l i ch k i
How can I make it so, it is actually baličky in the text editor as well?
I've tried numerous commands such as iconv -f UTF-8 -t ISO-8859-15, iconv -f ISO-8859-15 -t UTF-8, and recode utf8..l9.
None of them works. It's still baliÄ<8d>ky instead of baličky. This is a Czech word. If I do a simple sed command (/Ä<8d>/č), it works, but I have so many other characters like this that fixing them manually is impractical at this point.
Any suggestions?
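A note worth adding here: since cat already shows baličky correctly, the bytes in the file are most likely valid UTF-8, and the editor is simply decoding them as Latin-1 (č is U+010D, whose UTF-8 bytes c4 8d render as Ä plus the stray <8d> under Latin-1). A quick sketch, using a hypothetical sample.txt, to verify that before re-encoding anything:

```shell
# Write the UTF-8 bytes for "baličky" (č = U+010D = bytes c4 8d):
printf 'bali\xc4\x8dky\n' > sample.txt

# Valid UTF-8 passes through iconv without error:
iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null && echo "valid UTF-8"

# Inspect the raw bytes (304 215 is octal for c4 8d):
od -c sample.txt
```

If that check passes, the fix is the editor's display encoding, not iconv; re-encoding an already-correct UTF-8 file only corrupts it further.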

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to extract a specific SNP at a position for a subset of the samples. I have tried using bcftools to no avail. (If anyone can identify what went wrong with that, I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz), but it returns the following error.)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double-check your command line for bcftools view.
The error message 'The output type "something" is not recognised' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option, as in -O something. Based on the error message you are getting, it seems that you might have put a file name there.
Check that you don't have your input and output file names the wrong way around: the -o (lower-case o) option specifies the output file name, and the file name at the end of the command line is the input. With your file names, the corrected command would look something like this (the output name here is just a placeholder):
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -O z -o subset.vcf.gz 722g.990.SNP.INDEL.chrAll.vcf.gz
Also, you write that you created an empty file for the output. You don't need to do that; bcftools will create the output file.
I don't have much experience with bcftools, but generically, if you want to use awk to manipulate a gzipped file, you can pipe into it so the file is only decompressed as needed; you can also pipe the result directly through gzip so it, too, stays compressed, e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also, zcat is shorthand for gzip -cd: -c writes to standard output, -d decompresses.
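Putting that together for the region query from the question (a sketch: the tiny VCF built here is made-up stand-in data, and the awk condition assumes tab-separated CHROM and POS columns as in a real VCF):

```shell
# Build a tiny gzipped stand-in for the real 300 GB file:
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\nchr9\t55252805\trs1\nchr1\t100\trs2\n' \
    | gzip -c > largeFile.vcf.gz

# Keep header lines, plus chr9 records in the region of interest,
# recompressing the result on the way out:
gzip -cd largeFile.vcf.gz \
    | awk -F'\t' '/^#/ { print; next }
                  $1 == "chr9" && $2 >= 55252802 && $2 <= 55252810' \
    | gzip -c > subset.vcf.gz

gzip -cd subset.vcf.gz
```

The header lines (starting with #) are passed through unchanged so the subset remains a valid VCF.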
As a side note, if you are performing operations on just part of a large file, you may also find the excellent tool less useful: it can view your large file while loading only the needed parts. The -S option is particularly useful for wide formats with many columns, as it stops line wrapping, as is -N for showing line numbers.
less -S largefile.vcf.gz
Quit the view with q; g takes you to the top of the file.

File names look the same but are different after copying

My file names look the same, but they are not.
I copied many_img/ from Debian1 to OS X, then from OS X to Debian2 (for maintenance purposes), using rsync -a -e ssh at each step to preserve everything.
If I do ls many_img/img1/* I get visually the same output on Debian1 and Debian2:
prévisionnel.jpg
But somehow, ls many_img/img1/* | od -c gives different results:
On Debian1:
0000000 p r 303 251 v i s i o n n e l . j p
0000020 g \n
On Debian2:
0000000 p r e 314 201 v i s i o n n e l . j
0000020 p g \n
Thus my web app on Debian2 cannot match the pictures on the file system with the file names in the database.
I thought maybe I needed to change the file name encoding, but it looks like it's already UTF-8 on every OS:
convmv --notest -f iso-8859-15 -t utf8 many_img/img1/*
Returns:
Skipping, already UTF-8
Is there a command to get all 40 thousand of my file names on Debian2 back to how they are on Debian1 (without transferring everything again)?
I am confused: is this a file name encoding problem, or something else?
I finally found the command-line conversion tools I was looking for (thanks @Mark for setting me on the right track!).
OK, I didn't know OS X encodes file names under the hood with a different UTF-8 normalization.
It appears OS X uses Unicode Normalization Form D (NFD),
while Linux OSes use Unicode Normalization Form C (NFC).
The HFS+ file system encodes every single file name character in UTF-16.
Unicode characters are decomposed on OS X versus precomposed on Linux.
é, for instance (Latin small letter e with acute), is technically a single character (U+00E9) on Linux,
and is decomposed into a base letter "e" (U+0065) plus a combining acute accent (U+0301) in its decomposed form (NFD) on OS X.
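The difference is easy to see at the byte level; this sketch just prints the two UTF-8 encodings of the same visual name (hex bytes via od):

```shell
# NFC (Linux): é is the single code point U+00E9, UTF-8 bytes c3 a9
printf 'pr\xc3\xa9visionnel.jpg' | od -An -tx1

# NFD (OS X): plain e (U+0065) followed by combining acute U+0301,
# UTF-8 bytes cc 81
printf 'pre\xcc\x81visionnel.jpg' | od -An -tx1
```

Both byte sequences are valid UTF-8, which is why convmv reports "Skipping, already UTF-8": the problem is normalization, not the encoding itself.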
Now about conversion tools:
This command, executed from the Linux side, will convert file names from NFD to NFC:
convmv --notest --nfc -f utf8 -t utf8 /path/to/my/file
This command, executed from OS X, will rsync over ssh with on-the-fly NFD-to-NFC conversion:
rsync -a --iconv=utf-8-mac,utf-8 -e ssh path/to/my/local/directory/* user@destinationip:/remote/path/
I tested the two methods and they work like a charm.
Note:
The --iconv option is only available with rsync v3, whereas OS X provides an old 2.6.9 version by default, so you'll need to update it first.
Typically, to check and upgrade:
rsync --version
brew install rsync
echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
The first filename contains the single character é while the second contains a simple e followed by the combining character ́ (COMBINING ACUTE ACCENT). They're both valid Unicode, they're just normalized differently. It appears the OS normalized the filename as it created the file.

Linux compare diff / meld

I have this strange issue. I have created an algorithm that compresses inverted files. I have the original file (in my example it's 198.3 MB) and the decompressed file (which is 198.0 MB); file sizes are as reported by Nautilus. I ran meld and it reports the files as identical. The format of both files is exactly the same. What is wrong?!
Example (I ran sdiff -s and got this, the exact same data):
170832 | 170832
170833 | 170833
170834 | 170834
170835 | 170835
170836 | 170836
How are these not identical according to sdiff?
Use e.g. od -c to analyze the lines that are reported as different.
Every character is displayed, including \r, \t and the like, so you can see exactly where the differences are.
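A minimal sketch of what that looks like, with two made-up files that print identically but differ in their line endings:

```shell
printf 'line1\r\n' > winfile.txt    # CRLF line ending
printf 'line1\n'   > unixfile.txt   # LF line ending

od -c winfile.txt    # shows  l  i  n  e  1  \r  \n
od -c unixfile.txt   # shows  l  i  n  e  1  \n
```

A stray carriage return like this is invisible to cat and meld-style visual inspection but makes every affected line compare as different.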

Strange diff behaviour

I have a file called this.txt that has this content:
a
b
c
d
Which I generate using: ls /home > this.txt
Then I create a file called that.txt that has this content:
a
c
d
f
Which I generate using: ssh -p 1111 root@176.178.1.8 'ls /home' > that.txt
When I compare both using diff this.txt that.txt I get normal results.
Then I generate the file that2.txt using an expect script (to avoid typing the password for the ssh connection), with this content:
a
c
d
f
Using cat I compare (visually) both files and they are the same, but when I use diff this.txt that2.txt I get results that make no sense (it says that nothing from this.txt is in that2.txt).
Also, if I use diff that.txt that2.txt I get the same nonsensical result.
Maybe it is because I'm using two different interpreters (expect and bash) and the files are encoded differently? Any ideas?
PS: hopefully I explained myself; I'm not a native English speaker and this is my first question.
I’d assume you have files with either blanks at the ends of lines or different end-of-line markers, possibly both. Please compare the outputs of od -c that.txt and od -c that2.txt. Also, it may be worth checking the file sizes.
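To illustrate with hypothetical stand-ins for that.txt and that2.txt: if od -c reveals trailing \r characters (CRLF endings from the other tool), stripping them makes diff agree:

```shell
# that2.txt as it might come back through expect, with CRLF endings:
printf 'a\r\nc\r\nd\r\nf\r\n' > that2.txt
printf 'a\nc\nd\nf\n'         > that.txt

diff that.txt that2.txt > /dev/null || echo "differ"

# Strip the carriage returns and compare again:
tr -d '\r' < that2.txt > that2.unix.txt
diff that.txt that2.unix.txt && echo "identical"
```

GNU diff also accepts --strip-trailing-cr to do the same comparison without modifying the files.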
Oh, and I should add that you do not need to put your password into an expect script: ssh can work with public key pairs, a much safer alternative that is not really hard to set up. Check man ssh-keygen for a start.
