Finding the number of occurrences in a large text file in Linux


I have a 17 GB text file that I cannot load in vim. I researched the solutions provided here, but I do not understand them very well, and I am not good with Linux or Perl.
I understand I would have to use grep or something similar.
grep -oP "/^2" file
I have got as far as this command, but I cannot work out how to output the number of occurrences without printing all the matching lines to the screen.
I would like to find the number of lines in the file that start with the digit 2 and print that number to the shell.

If you want to continue using PCRE:
grep -cP ^2 file
Using grep's "basic regular expressions":
grep -c ^2 file
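As a quick check of what -c reports, here is a minimal sketch on a tiny made-up file (the name sample.txt and its contents are assumptions for illustration only). The same command should apply unchanged to the 17 GB file, since grep reads its input as a stream rather than loading it whole.

```shell
# Build a small sample file; sample.txt is a made-up name for this sketch.
printf '%s\n' '2 apples' '12 pears' '25 plums' 'none' '2' > sample.txt

# -c prints only the number of matching lines, never the lines themselves.
# Quoting the pattern keeps the shell from interpreting the ^ character.
count=$(grep -c '^2' sample.txt)
echo "$count"
```

Only lines whose first character is 2 are counted, so "12 pears" does not contribute.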

Related

Read only the first n lines [sublime text]

I've got some files that are too big to open directly in Sublime Text. Is there any way to open only the first n lines? Something like head in bash? Thanks

If you're on Linux or Mac, or have Cygwin, Git Bash, or similar installed on a Windows machine, check out the split utility, which is part of the coreutils package. It does exactly what it says: it splits input into separate files. It is configurable via command-line options, like every Unix utility. For example, if you wanted to split your input file into separate 10,000-line files starting with notsobigfile and using numeric suffixes ending with .txt, you would run
split -d -l 10000 --additional-suffix=".txt" reallybigfile.txt notsobigfile
and it would output files named notsobigfile00.txt, notsobigfile01.txt, etc. If this would generate more than 100 files (00 through 99), just add -a x, where x is the number of suffix digits (the default is 2).
For all the possible options, just read the man page:
man split
If you only want to output the first part of the file, check out the options for the -n/--number flag.
To figure out how many lines your input file has, run the word counting utility using the lines option:
wc -l reallybigfile.txt
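If only the first n lines are wanted, the head utility mentioned in the question can write them to a smaller file directly. This is a different technique from split, shown here as a minimal sketch; the file names demo_big.txt and firstpart.txt are made up for the example.

```shell
# Stand-in for the big file: 25,000 numbered lines (demo_big.txt is a made-up name).
seq 1 25000 > demo_big.txt

# Copy only the first 10,000 lines into a smaller file that an editor can open.
head -n 10000 demo_big.txt > firstpart.txt

# Confirm how many lines ended up in the smaller file.
lines=$(wc -l < firstpart.txt)
echo "$lines"
```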

Linux Header Removal from a ppm file

Does anybody know the command to remove the header from a ppm file in Linux? I've tried this already
head -n 4 Example.ppm > header.txt
tail -n 5+ Example.ppm > body.bin
It tells me that "Tail" could not be found.

Most ppm files use newlines in the header so your first command is fine. However, the rest of the file is binary, so:
head -n 4 Example.ppm > header.txt
filesize=$(wc -c < header.txt)
dd if=Example.ppm of=body.bin bs=1 skip=$filesize
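Copying one byte at a time with dd can be slow on large images. A possible alternative is tail -c, which can start output at a given byte. The sketch below assumes, as above, that the header occupies exactly the first four lines; the file names demo.ppm, demo_header.txt and demo_body.bin are made up, and the stand-in file holds six placeholder bytes instead of real pixel data.

```shell
# Make a tiny stand-in PPM: four header lines, then six raw bytes of "pixel" data.
printf 'P6\n# demo\n2 1\n255\n' > demo.ppm
printf 'ABCDEF' >> demo.ppm

# Save the header and measure its size in bytes (reading from stdin avoids
# wc printing the file name after the number).
head -n 4 demo.ppm > demo_header.txt
header_bytes=$(wc -c < demo_header.txt)

# tail -c +N starts output at byte N (1-based), so add 1 to skip the header.
tail -c +"$((header_bytes + 1))" demo.ppm > demo_body.bin

body_bytes=$(wc -c < demo_body.bin)
echo "$body_bytes"
```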

You should have /bin/tail if you have /bin/head; both are in the coreutils RPM package.
The format of a ppm(5) file (http://netpbm.sourceforge.net/doc/ppm.html) is awkward to use with the line-based head/tail/sed family. The documentation describes fields separated by whitespace that is not necessarily a line break.
You will need to: 1) Ignore comments from '#' to end of line; and 2) process the remainder one field (not column, not line) at a time. Using awk(1) could be an option here.
Check the documentation (http://netpbm.sourceforge.net/doc/directory.html) for a list of conversion programs. You may find one that converts the PPM file into a form better suited to whatever usage is your ultimate goal.

Grep show filename and found line for binary files (PDF)

I have a folder with lots of PDF files. I need the names of the files whose content matches, together with the specific text in them: Rotate 270, which defines a page rotation. Grep's -anH arguments and the /dev/null method do not seem to work, and pdftotext and pdfgrep cannot help either, because what I need is not visible or searchable text on the page.
I can either get the "Binary file aaa.pdf matches" or the line like this (which is not a text visible on a page!):
<</Filter/FlateDecode/Length 61>>stream4 595.19995]/MediaBox[0 0 841.92004 595.19995]/Parent 5 0 R/Resources<</ProcSet[/PDF/Text/ImageB/ImageC/ImageI]/XObject<</img3 11 0 R>>>>/Rotate 270/Type/Page>>
I suspect there is a way to strip the non-printable bytes before grep gets them, or to split off the filename before the grep step and reassemble it after grep has found the line, or maybe sed has an easy way to achieve this?
How do I get both filename and found line, approximately like grep does on regular text files?

I don't have a pdf file with that string inside but you can try
identify -verbose somefile.pdf | grep 'Rotate 270'
identify is part of the ImageMagick package.
You can also try a brute force method :-)
strings somefile.pdf | grep 'Rotate 270'
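To get the filename next to the matched text without the rest of the binary line, grep's -o option can be combined with -a and -H. The following is a minimal sketch using two made-up stand-in files rather than real PDFs. It assumes the /Rotate entry is stored uncompressed, as in the sample line from the question; if the page dictionary sits inside a compressed object stream, this approach would not find it.

```shell
# Create two stand-in files; real PDFs would take their place.
printf '%%PDF-1.4\n<</Rotate 270/Type/Page>>\n' > rotated_demo.pdf
printf '%%PDF-1.4\n<</Rotate 0/Type/Page>>\n' > upright_demo.pdf

# -a treats binary input as text, -o prints only the matched part,
# -H prefixes each match with the name of the file it came from.
matches=$(grep -aoH '/Rotate 270' rotated_demo.pdf upright_demo.pdf)
echo "$matches"
```

For a whole folder, the two file names could be replaced with a glob such as *.pdf.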

"grep" offset of ascii string from binary file

I'm generating binary data files that are simply a series of records concatenated together. Each record consists of a (binary) header followed by binary data. Within the binary header is an ascii string 80 characters long. Somewhere along the way, my process of writing the files got a little messed up and I'm trying to debug this problem by inspecting how long each record actually is.
This seems extremely related, but I don't understand perl, so I haven't been able to get the accepted answer there to work. The other answer points to bgrep which I've compiled, but it wants me to feed it a hex string and I'd rather just have a tool where I can give it the ascii string and it will find it in the binary data, print the string and the byte offset where it was found.
In other words, I'm looking for some tool which acts like this:
tool foobar filename
or
tool foobar < filename
and its output is something like this:
foobar:10
foobar:410
foobar:810
foobar:1210
...
i.e. the string which matched and the byte offset in the file where the match started. In this example case, I can infer that each record is 400 bytes long.
Other constraints:
ability to search by regex is cool, but I don't need it for this problem
My binary files are big (3.5 GB), so I'd like to avoid reading the whole file into memory if possible.

grep --byte-offset --only-matching --text foobar filename
The --byte-offset option prints the offset of each matching line.
The --only-matching option makes it print offset for each matching instance instead of each matching line.
The --text option makes grep treat the binary file as a text file.
You can shorten it to:
grep -oba foobar filename
It works in the GNU version of grep, which comes with Linux by default. It won't work in BSD grep (which comes with macOS by default).
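Since the goal in the question is to infer record lengths, the offsets can be fed to awk to print the gap between consecutive matches. This is a minimal sketch assuming GNU grep; the file records_demo.bin is a made-up stand-in built from three 400-byte records with foobar at offset 10 in each.

```shell
# Build a stand-in binary file: three 400-byte records of NUL bytes,
# each containing the ASCII string "foobar" at offset 10.
: > records_demo.bin
for i in 1 2 3; do
  head -c 10 /dev/zero >> records_demo.bin
  printf 'foobar' >> records_demo.bin
  head -c 384 /dev/zero >> records_demo.bin
done

# grep -oba prints "offset:match" for every occurrence; keep only the offsets.
offsets=$(grep -oba 'foobar' records_demo.bin | cut -d: -f1)

# Print the distance between each offset and the previous one.
gaps=$(printf '%s\n' "$offsets" | awk 'NR > 1 { print $1 - prev } { prev = $1 }')
echo "$gaps"
```

A gap that differs from the others points at the record where the writer went wrong.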

You could use strings for this:
strings -a -t x filename | grep foobar
Tested with GNU binutils.
For example, where in /bin/ls does --help occur:
strings -a -t x /bin/ls | grep -- --help
Output:
14938 Try `%s --help' for more information.
162f0 --help display this help and exit

I wanted to do the same task. Though strings | grep worked, I found gsar was the very tool I needed.
http://tjaberg.com/
The output looks like:
>gsar.exe -bic -sfoobar filename.bin
filename.bin: 0x34b5: AAA foobar BBB
filename.bin: 0x56a0: foobar DDD
filename.bin: 2 matches found

Why can't I detect this file?

I have a file in a directory, say test.php, whose contents are below:
< ? php $XZKsyG=’as’;
I want to pick up the file test.php with a search based on its content. So from the directory containing it I do:
grep 'php \$[a-zA-Z]*=.as.;'
However I get no result...what am I doing wrong?
Thanks

It works for me:
$ cat file
< ? php $XZKsyG=’as’;
$ grep 'php \$[a-zA-Z]*=.as.;' file
< ? php $XZKsyG=’as’;
Are you sure the contents of the file are exactly what you showed us?
Try cat -A file or od -c file to see whether the file really looks the way you think it does.
(Note that you don't need to escape the $ character; in a basic regular expression it's only a metacharacter at the end of the pattern. But escaping it should be OK.)
EDIT :
The characters around the as in your file are not ASCII apostrophes; they're Unicode RIGHT SINGLE QUOTATION MARK characters (U+2019). If the file is stored in UTF-8, each of them is represented as a 3-byte sequence. The grep command works for me because my locale setting, "en_US.UTF-8", is such that a UTF-8 character is matched by a single . in a regexp, even if it has a multi-byte representation. I suspect your locale is byte-oriented, so that each of these characters would be matched by ... (three dots, one per byte).
Probably the simplest solution is to edit the file to use ASCII apostrophes.
You might also want to play around with your locale settings. Try the grep command with $LANG set to "en_US.UTF-8".
What's the output of the locale command?
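The locale effect can be reproduced with a small sketch. It assumes a reasonably recent GNU grep and forces the byte-oriented C locale with LC_ALL=C; the file name quotes_demo.php is made up, and the quotation marks are written as octal byte escapes so the bytes are exactly the UTF-8 encoding of U+2019.

```shell
# Write the sample line with UTF-8 right single quotation marks (U+2019),
# each encoded as the three bytes E2 80 99 (octal 342 200 231).
printf '< ? php $XZKsyG=\342\200\231as\342\200\231;\n' > quotes_demo.php

# In the byte-oriented C locale one dot matches one byte, so a single dot
# cannot cover a 3-byte quote. grep -c exits non-zero on zero matches.
one_dot=$(LC_ALL=C grep -c 'php \$[a-zA-Z]*=.as.;' quotes_demo.php) || true

# Three dots per quote cover the whole 3-byte sequence.
three_dots=$(LC_ALL=C grep -c 'php \$[a-zA-Z]*=...as...;' quotes_demo.php) || true

echo "one dot: $one_dot, three dots: $three_dots"
```

Running the first pattern again with LC_ALL set to an installed UTF-8 locale should let the single dot match, which is the behaviour described above.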

That works fine for me, though you may want to look into those "funny" single quotes you have around as:
pax$ cat testfile
< ? php $XZKsyG='as';
pax$ grep 'php \$[a-zA-Z]*=.as.;' testfile
< ? php $XZKsyG='as';
Failing that, there are some things you can look at. Some of these may sound silly, but I'm really just covering all bases.
Are you sure the file contains only what you think it does? Executing od -xcb file will give you a hex dump of it for better checking.
Are you sure you're accessing the right file, in the right directory?
Have you done something silly like aliasing grep to be something else?
That's if you're looking for a file containing that string. If instead you're looking for a file named like that, you can use something like:
ls -1 | grep 'php \$[a-zA-Z]*=.as.;'
The ls -1 command gives you one file per line, and piping that through grep will filter out those not matching the pattern.
I suppose I should mention that I'm not really a big fan of file names with spaces in them, but I'm violently opposed to file names made up of PHP scripts :-)
