How to extract the first x-megabyte from a large file in unix/linux? - linux

I have a large file, and I am only interested in the first couple of megabytes at its head.
How do I extract the first x megabytes from a large file in unix/linux and put them into a separate file?
(I know the split command can split a file into many pieces, and with a bash script I can erase the pieces I don't want. I would prefer an easier way.)

head works with binary files, and the syntax is neater than dd's.
head -c 2M input.file > output.file
Tail works the same way if you want the end of a file.
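For instance, a sketch of grabbing the last 2 MB instead (this assumes GNU tail, which accepts the same size suffixes as GNU head; the file names are just placeholders):
tail -c 2M input.file > tail_output.file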

E.g.
dd if=largefile count=6 bs=1M > largefile.6megsonly
The 1M spelling assumes GNU dd. Otherwise, you could do
dd if=largefile count=$((6*1024)) bs=1024 > largefile.6megsonly
This again assumes bash-style arithmetic evaluation.
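If neither GNU dd nor shell arithmetic is available, a minimal sketch with the numbers precomputed (6144 = 6 * 1024) should behave the same:
dd if=largefile of=largefile.6megsonly bs=1024 count=6144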

On a Mac (Catalina), the head and tail commands don't seem to accept size suffixes like m (mega) or g (giga) in either upper or lower case, but they will take a plain integer byte count, like this one for 50 MB:
head -c50000000 inputfile.txt > outputfile.txt
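If you would rather not count the zeros by hand, shell arithmetic can produce the same byte count; a small sketch for 50 MB (here taken as 50 * 1000 * 1000 bytes):
head -c $((50 * 1000 * 1000)) inputfile.txt > outputfile.txt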

Try the dd command. You can run man dd to get the main ideas.

Related

How to split large file to small files with prefix in Linux/Bash

I have a file in Linux called test. Now I want to split test into, say, 10 small files.
The test file has more than 1000 table names. I want the small files to have an equal number of lines; the last file may or may not have the same number of table names.
What I want to know is whether we can add a prefix to the split files while invoking the split command in the Linux terminal.
Sample:
test_xaa test_xab test_xac and so on...
Is this possible in Linux?
I was able to solve this with the following statement:
split -l $(($(wc -l < test.txt )/10 + 1)) test.txt test_x
With this I was able to get the desired result.
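To sanity-check the result, a quick sketch that counts the lines in each piece (the names test_xaa, test_xab, ... come from the test_x prefix):
wc -l test_x*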
I would've sworn split did this on its own, but to my surprise, it does not.
To get your prefix, try something like this:
for x in /path/to/your/x*; do
    mv "$x" "$(dirname "$x")/your_prefix_$(basename "$x")"
done

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count, -m 100, after source.txt. The problem is when I don't specify it, or when I specify a huge number. I could previously write to files with grep (not egrep) using huge numbers similar to what I'm trying now, and that was successful. I also checked the last modified date and time while the command was executing, and it seems there is no modification happening in the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc., and you can run the entire command on each one of those files (perhaps changing the redirection at the end to append: >> dest.txt).
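A minimal sketch of that per-chunk loop, reusing the pattern from the question:
for chunk in split_source.*; do
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$chunk" | sort | uniq | sed -e 's/www.//' >> dest.txt
done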
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem introduced by sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed; otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want (a sketch of the reordered pipeline follows below).
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
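A minimal sketch of the reordered pipeline, keeping the pattern from the question and stripping the www. prefix (dot escaped and anchored to the start of the line) before deduplicating:
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sed -e 's/^www\.//' | sort | uniq > dest.txt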

sort across multiple files in linux

I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ gigs. Each line in each file contains a 40-byte string. The strings are fairly well ordered already (about 1 in 10 steps is a decrease in value instead of an increase).
I would like the lines ordered. (in-place if possible?) This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file made from smaller files that linux will treat as a single file.
What I know I can do:
I can sort each file individually, then read into file1.txt to find the values larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join, and then sort... but this is a pain, and it assumes no values from file2.txt belong in file0.txt (an assumption that is highly unlikely to hold in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to check-only: sort -c "$file" || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works:
sort reads a line from each file.
It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
Repeat step 2 until there are no more lines in any file.
At this point, the output should be a perfectly sorted file.
Profit!
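A tiny sketch with two made-up, already-sorted files shows the merge in action:
printf 'AAA\nDDD\n' > a.txt
printf 'BBB\nCCC\n' > b.txt
sort -m a.txt b.txt    # prints AAA, BBB, CCC, DDD, one per line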
It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" < "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
I believe that this is your best bet, using stock linux utilities:
sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes).
The -m option to sort lets it know the input files are already sorted, so it can be smart.
mmap() the 3 files, as all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.
If the files are sorted individually, then you can use sort -m file*.txt to merge them together - read the first line of each file, output the smallest one, and repeat.

grepping for a large binary value from an even larger binary file

As the title suggests I would like to grep a reasonably large (about 100MB) binary file, for a binary string - this binary string is just under 5K.
I've tried grep using the -P option, but this only seems to return matches when the pattern is only a few bytes - when I go up to about 100 bytes it no longer finds any matches.
I've also tried bgrep. That worked well originally; however, when I needed to extend the pattern to the length I have now, I just get "invalid/empty search string" errors.
The irony is that in Windows I can use HxD to search the file and it finds it in an instant. What I really need, though, is a Linux command-line tool.
Thanks for your help,
Simon
Say we have a couple of big binary data files. For a big one that shouldn't match, we create a 100MB file whose contents are all NUL bytes.
dd ibs=1 count=100M if=/dev/zero of=allzero.dat
For the one we want to match, create a hundred random megabytes.
#! /usr/bin/env perl
use warnings;
binmode STDOUT or die "$0: binmode: $!";
for (1 .. 100 * 1024 * 1024) {
print chr rand 256;
}
Execute it as ./mkrand >myfile.dat.
Finally, extract a known match into a file named pattern.
dd skip=42 count=10 if=myfile.dat of=pattern
I assume you want only the files that match (-l) and want your pattern to be treated literally (-F or --fixed-strings). I suspect you may have been running into a length limit with -P.
You may be tempted to use the --file=PATTERN-FILE option, but grep interprets the contents of PATTERN-FILE as newline-separated patterns, so in the likely case that your 5KB pattern contains newlines, you'll hit an encoding problem.
So hope your system's ARG_MAX is big enough and go for it. Be sure to quote the contents of pattern. For example:
$ grep -l --fixed-strings "$(cat pattern)" allzero.dat myfile.dat
myfile.dat
Try using grep -U which treats files as binary.
Also, how are you specifying the search pattern? It might just need escaping to survive shell parameter expansion.
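For example, a minimal sketch that keeps the pattern in a variable and double-quotes it so the shell does not expand or split it (file names follow the example above; -- stops grep from treating a pattern starting with a dash as an option):
pattern="$(cat pattern)"
grep -U --fixed-strings -- "$pattern" myfile.dat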
As the string you are searching for is pretty long, you could benefit from an implementation of the Boyer-Moore search algorithm, which is very efficient when the search string is long:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
The wiki also has links to some sample code.
You might want to look at a simple Python script.
match = (b"..."
         b"..."
         b"...")  # some byte-string literal of immense proportions
with open("some_big_file", "rb") as source:
    block = source.read(len(match))   # initial window, same length as the pattern
    while block != match:
        byte = source.read(1)         # slide the window forward one byte
        if not byte:
            break                     # end of file, no match found
        block = block[1:] + byte
This might work reliably under Linux as well as Windows.

Compare 2 files with shell script

I was trying to find the way for knowing if two files are the same, and found this post...
Parsing result of Diff in Shell Script
I used the code in the first answer, but I think it's not working, or at least I can't get it to work properly...
I even tried making a copy of a file and comparing both (copy and original), and I still get the answer as if they were different, when they shouldn't be.
Could someone give me a hand, or explain what's happening?
Thanks so much;
peixe
Are you trying to compare if two files have the same content, or are you trying to find if they are the same file (two hard links)?
If you are just comparing two files, then try:
diff "$source_file" "$dest_file" # without -q
or
cmp "$source_file" "$dest_file" # without -s
in order to see the supposed differences.
You can also try md5sum:
md5sum "$source_file" "$dest_file"
If both files return the same checksum, then they are identical.
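Once that works interactively, a minimal sketch of the scripted check, relying on cmp's exit status (variable names as above):
if cmp -s "$source_file" "$dest_file"; then
    echo "files are identical"
else
    echo "files differ"
fi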
comm is a useful tool for comparing files.
The comm utility will read file1 and file2, which should be ordered in the current collating sequence, and produce three text columns as output: lines only in file1; lines only in file2; and lines in both files.
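Note that comm expects its inputs to be sorted. A short sketch that prints only the lines common to both files (assuming a shell with process substitution, such as bash):
comm -12 <(sort file1) <(sort file2)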
