Editing large text files on Linux (5-10 GB) - text-editor

Basically, I need a file of a specified format and of a large size (around 10 GB). To get this, I am copying the contents of my original file into the same file multiple times to increase its size. I don't care about the contents of the file as long as they have the required format.
Initially, I tried to do this using gedit, which failed miserably after a few hundred MB. I'm looking for an editor which will help me do this, or maybe a suggestion on alternate ways.

You could make 2 files and repeatedly append them to each other:
cp file1 file2
# each pass appends the files to each other, so the sizes grow very quickly;
# lower the loop count, or stop early, once the file is big enough
for x in `seq 1 200`; do
  cat file1 >> file2
  cat file2 >> file1
done
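If you want to stop once the file reaches a target size instead of running a fixed number of passes, here is a minimal sketch (it assumes GNU stat and a roughly 10 GB target; file names as above):
target=$((10 * 1024 * 1024 * 1024))   # ~10 GB in bytes (assumed target)
cp file1 file2
while [ "$(stat -c %s file1)" -lt "$target" ]; do
  cat file1 >> file2
  cat file2 >> file1
done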

In Windows, from the command line:
copy file1.txt+file2.txt file3.txt
concatenates file1 and file2 and places the result in file3; repeat, or add more +arguments, until you get the size you need.
For Unix,
cat file1.txt file2.txt >> file3.txt
concatenates file1 and file2 and appends the result to file3; repeat, or add more input files, until you get the size you need.
There are probably many other ways to do this in Unix.
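For example, a small sketch that writes a fixed number of copies of one file into another (the repeat count of 10 is only an illustration):
# write ten copies of file1.txt into file3.txt
for i in $(seq 10); do cat file1.txt; done > file3.txt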

Related

Pasting many files to a single large file

I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them together to make one large file.
To that end I did something like
paste *.txt > largefile.txt
but the above command reads the .txt files in an arbitrary order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks, and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised, e.g. with ulimit -n 99999.
Without raising the soft limit, use a temporary file that accumulates the results of each "round" of at most ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
paste accumulator.txt "$@" > accumulator.txt.sav;
mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
Instead of relying on the wildcard * to enumerate all the files in a directory, if your file names follow a sequential numeric pattern you can list all the files explicitly, in order, and concatenate them into one large file. The output order of * expansion can differ between environments, and it may not work as you expect.
Below is a simple example:
$ for i in $(seq 20); do echo "$i" > "$i.txt"; done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt, each containing its own number
$ cat {1..20}.txt
# show the contents of all files in order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them into one large file named 1_20.txt
In bash, or any other shell, glob expansions are done in lexicographical order. When files are numbered, this sadly means that 11.txt < 1.txt < 2.txt. This weird ordering comes from the fact that, lexicographically, 1 < . (the <dot> character ".").
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that a file may be missing (e.g. 1234.txt). In that case you can do
shopt -s extglob; paste ?({1..2000}.txt)
The pattern ?(pattern) matches zero or one occurrence of the given pattern. So this will exclude the missing files but keep the order.
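A small illustration of the extglob form; nullglob is added here (my own assumption, not part of the answer above) so that patterns for missing files expand to nothing instead of being passed to paste literally:
$ touch {1..4}.txt && rm 3.txt      # create 1.txt, 2.txt, 4.txt; 3.txt is deliberately missing
$ shopt -s extglob nullglob
$ echo ?({1..4}.txt)                # prints: 1.txt 2.txt 4.txt (numeric order, gap skipped)
$ paste ?({1..4}.txt) > largefile.txt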

Is there a way to compare N files at once, and only leave lines unique to each file?

Background
I have five files that I am trying to make unique relative to each other. In other words, I want to make it so that the lines of text in each file have no commonality with each other.
Attempted solution
So far, I have been able to run the grep -vf command comparing one file with the other 4, like so:
grep -vf file2.txt file1.txt
grep -vf file3.txt file1.txt
...
This makes it print out the lines in file1 that are not in file2, nor in file3, etc. However, this becomes cumbersome, because to truly reduce each file to the lines of text found only in that file, I would have to run every combination of files through the grep -vf command. Given that this sounds cumbersome to me, I wanted to know...
Question
What is the command (or series of commands) in Linux to find the lines of text in each file that are mutually exclusive to all the other files?
You could just do:
awk '!a[$0]++ { out=sprintf("%s.out", FILENAME); print > out}' file*
This will write the lines that are unique across the files to a corresponding file.out. Each line is written to the output file of the input file in which it first appears, and subsequent duplicates of that same line are suppressed.
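A quick demonstration with two tiny files (names and contents are hypothetical):
$ printf 'a\nb\nc\n' > file1.txt
$ printf 'b\nd\n' > file2.txt
$ awk '!a[$0]++ { out=sprintf("%s.out", FILENAME); print > out }' file*.txt
$ cat file1.txt.out      # a, b, c  (b first appears here)
$ cat file2.txt.out      # d        (b was already seen in file1.txt)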

Fastest way to replace string in first row of huge file in linux command line?

I have a huge plain-text file (~500 GB) on a Linux machine. I want to replace a string in the header line (the first row of the file), but all the methods I know seem slow and inefficient.
example file:
foo apple cat
1 2 2
2 3 4
3 4 6
...
expected file output:
bar apple cat
1 2 2
2 3 4
3 4 6
...
sed:
sed -i '1s/foo/bar/g' file
-i can change the file in place, but this command generates a temporary file on disk and uses it to replace the original one. The extra I/O wastes time.
vim:
ex -c '1s/foo/bar/g' -c 'wq' file
vim doesn't generate a temporary file, but it loads the whole file into memory, which also wastes a lot of time.
Is there a better solution that only reads the first row into memory and writes it back to the original file? I know that the Linux head command can extract the first row very fast.
Could you please try the following awk command and let me know if it helps; I couldn't test it, as I don't have a huge file of around 500 GB. awk itself shouldn't create any temporary file behind the scenes, since it is not doing an in-place substitution on Input_file; the temporary file below is created explicitly and then renamed over the original.
awk 'FNR==1{$1="bar";print;next} 1' Input_file > temp_file && mv temp_file Input_file
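Since "foo" and "bar" happen to be the same length, another option in principle is to overwrite just those bytes in place, so the rest of the 500 GB file is never rewritten. This is only a hedged sketch: it assumes the header really begins with "foo" and that the replacement is exactly the same number of bytes:
# overwrite the first 3 bytes of the file in place; conv=notrunc keeps the rest of the file intact
printf 'bar' | dd of=Input_file bs=1 count=3 conv=notrunc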

How to copy data from file to another file starting from specific line

I have two files, data.txt and results.txt. Assuming there are 5 lines in data.txt, I want to copy all of these lines and paste them into results.txt starting from line number 4.
Here is a sample below:
Data.txt file:
stack
ping
dns
ip
remote
Results.txt file:
# here are some text
# please do not edit these lines
# blah blah..
this is the 4th line that data should go on.
I've tried sed with various combinations, but I couldn't make it work, and I'm not sure it fits this purpose anyway.
sed -n '4p' /path/to/file/data.txt > /path/to/file/results.txt
The above code copies line 4 only, which isn't what I'm trying to achieve. As I said above, I need to copy all lines from data.txt and paste them into results.txt, but it has to start from line 4 without modifying or overriding the first 3 lines.
Any help is greatly appreciated.
EDIT:
I want to overwrite the copied data starting from line number 4 in results.txt. So, I want to leave the first 3 lines without modifications and overwrite the rest of the file with the data copied from data.txt.
Here's a way that works well from cron. Less chance of losing data or corrupting the file:
# preserve first lines of results
head -3 results.txt > results.TMP
# append new data
cat data.txt >> results.TMP
# rename output file atomically in case of system crash
mv results.TMP results.txt
You can use process substitution to give cat a FIFO which it will be able to read from. Write the output to a temporary file and rename it, so that results.txt is not truncated before it has been read:
cat <(head -3 results.txt) data.txt > results.tmp && mv results.tmp results.txt
# write to a temporary file first, so results.txt is not truncated before head has read it
head -n 3 /path/to/file/results.txt > /path/to/file/results.tmp
cat /path/to/file/data.txt >> /path/to/file/results.tmp
mv /path/to/file/results.tmp /path/to/file/results.txt
If you can use awk:
awk 'NR!=FNR || NR<4' results.txt data.txt
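Here the condition keeps only the first three lines while NR==FNR (i.e. while reading results.txt) and then prints every line of data.txt. The result goes to stdout, so to actually update the file you would redirect to a temporary file and rename it, something like:
awk 'NR!=FNR || NR<4' results.txt data.txt > results.tmp && mv results.tmp results.txt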

I did not understand the use of the patch command in Linux

I really do not understand the use of the patch command.
I have file1 with 1 2 3
and file2 with 1 2 4
diff -u file1 file2 > out.patch
patch -b file1 out.patch
Now file1 will have 1 2 4 ... Is it a copy of file2, or what?
What is happening here, or what is the use of the patch command?
man patch says
patch takes a patch file patchfile containing a difference listing
produced by the diff program and applies those differences to one or
more original files, producing patched versions.
Normally the patched versions are put in place of the originals.
Backups can be made; see the -b or --backup option.
So, in your case, diff -u file1 file2 records the difference between the two files (here, the last line changing from 3 to 4); the patch command then applies that difference to the original file.
Is it a copy of file2 or what?
It is not a copy; rather, patch applies the recorded difference to the original file, which then ends up with the same content as file2.
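A small end-to-end illustration (assuming each number in the question's files sits on its own line):
printf '1\n2\n3\n' > file1
printf '1\n2\n4\n' > file2
diff -u file1 file2 > out.patch   # record how to turn file1 into file2
patch -b file1 out.patch          # apply it; the original is backed up as file1.orig
diff file1 file2                  # no output: file1 now matches file2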
