Splitting bulk text files every n lines - Linux

I have a folder that contains multiple text files. I'm trying to split every text file at 10,000 lines per file while keeping the base file name, i.e. if filename1.txt contains 20,000 lines, the output will be filename1-1.txt (10,000 lines) and filename1-2.txt (10,000 lines).
I tried split -10000 filename1.txt, but this does not keep the base filename and I have to repeat the command for each text file in the folder. I also tried for f in *.txt; do split -10000 $f.txt; done, which didn't work either.
Any idea how I can do this? Thanks.

for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done
Or, written over multiple lines:
for f in filename*.txt
do
    split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"
done
How it works:
-d tells split to use numeric suffixes
-a1 tells split to use a single-digit suffix.
-l10000 tells split to split every 10,000 lines.
--additional-suffix=.txt tells split to add .txt to the end of the names of the new files.
"$f" tells split the name of the file to split.
"${f%.txt}-" tells split the prefix name to use for the split files.
Example
Suppose that we start with these files:
$ ls
filename1.txt filename2.txt
Then we run our command:
$ for f in filename*.txt; do split -d -a1 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done
When this is done, we now have the original files and the new split files:
$ ls
filename1-0.txt filename1-1.txt filename1.txt filename2-0.txt filename2-1.txt filename2.txt
Using older, less featureful forms of split
If your split does not offer --additional-suffix, then consider:
for f in filename*.txt
do
    split -d -a1 -l10000 "$f" "${f%.txt}-"
    for g in "${f%.txt}-"*
    do
        mv "$g" "$g.txt"
    done
done

No need for shell loops; one awk command handles all the files:
awk 'FNR%10000==1{if(FNR==1)c=0; close(out); out=FILENAME; sub(/\.txt$/,"-"++c".txt",out)} {print > out}' *.txt
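With two 20,000-line inputs, the expected result is (a sketch; note that the counter starts at 1, matching the filename1-1.txt naming asked for in the question):
$ ls
filename1-1.txt filename1-2.txt filename1.txt filename2-1.txt filename2-2.txt filename2.txt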

--suffix-length=3
If split is going to create more than 9 pieces from a single input file, the single-digit -a1 suffix runs out; replace it with something like --suffix-length=3 (i.e. -a3).
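For instance, sticking with the GNU split invocation from the first answer (a sketch):
for f in filename*.txt; do split -d -a3 -l10000 --additional-suffix=.txt "$f" "${f%.txt}-"; done
This names the pieces filename1-000.txt, filename1-001.txt, and so on, leaving room for up to 1000 pieces per input file.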

Related

Pasting many files to a single large file

I have many text files in a directory, like 1.txt 2.txt 3.txt 4.txt ... 2000.txt, and I want to paste them together into one large file.
To that end I did something like
paste *.txt > largefile.txt
but the above command does not read the .txt files in numeric order, so I need to read the files sequentially and paste them as 1.txt 2.txt 3.txt ... 2000.txt.
Please suggest a better solution for pasting many files.
Thanks, and looking forward to hearing from you.
Sort the file names numerically yourself then.
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
When dealing with many files, you may hit the open-file limit from ulimit -n. On my system ulimit -n is 1024, but this is a soft limit and can be raised with just ulimit -n 99999.
Without raising the soft limit, go with a temporary file that accumulates the result over each "round" of ulimit -n files, like:
touch accumulator.txt
... | xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
  paste accumulator.txt "$@" > accumulator.txt.sav;
  mv accumulator.txt.sav accumulator.txt
' _
cat accumulator.txt
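Put together with the numeric sort from above, the whole thing might look like this (a sketch; the scratch file is named accumulator.out here so it is not picked up by the *.txt glob itself):
touch accumulator.out
printf "%s\n" *.txt | sort -n |
  xargs -d '\n' -n $(($(ulimit -n) - 1)) sh -c '
    paste accumulator.out "$@" > accumulator.out.sav;
    mv accumulator.out.sav accumulator.out
  ' _
cat accumulator.out
Note that the first round pastes against an empty accumulator, so each line gains a leading delimiter; this mirrors the snippet above.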
Instead of using the wildcard * to enumerate all the files in the directory, if your file names are numbered sequentially you can list them in order explicitly and concatenate them into one large file. The order in which * expands may not be what you expect, since it is lexicographic rather than numeric.
Below is a simple example
$ for i in `seq 20`; do echo $i > $i.txt; done
# create 20 test files, 1.txt, 2.txt, ..., 20.txt, containing the numbers 1 to 20 respectively
$ cat {1..20}.txt
# show the contents of all the files in the order 1.txt, 2.txt, ..., 20.txt
$ cat {1..20}.txt > 1_20.txt
# concatenate them into a large file named 1_20.txt
In bash, and in any other shell, glob expansions are sorted lexicographically. With numbered files this can give the unfortunate ordering 11.txt < 1.txt < 2.txt, because names are compared character by character (in the locale's collation) rather than by numeric value.
So here are a couple of ways to operate on your files in order:
rename all your files:
for i in *.txt; do mv "$i" "$(printf "%05d.txt" "${i%.*}")"; done
paste *.txt
use brace-expansion:
Brace expansion is a mechanism that allows for the generation of arbitrary strings. For integers you can use {n..m} to generate all numbers from n to m or {n..m..s} to generate all numbers from n to m in steps of s:
paste {1..2000}.txt
The downside here is that a file may be missing (e.g. 1234.txt), and paste will then complain about it. To skip missing files you can do
shopt -s extglob; paste ?({1..2000}.txt)
The extended pattern ?(pattern) matches zero or one occurrence of pattern, so this drops the missing files while keeping the order.
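A small demonstration (a sketch; note that I also enable nullglob here so that a pattern with no match expands to nothing rather than being passed through literally):
$ shopt -s extglob nullglob
$ touch 1.txt 3.txt      # 2.txt deliberately missing
$ echo ?({1..3}.txt)
1.txt 3.txt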

Matching file content with other filenames to extract and merge contents

I have two directories.
In directory_1, I have many .txt files
The content of these files (for example file1.txt) is a list of names, one per line:
file1.txt
--
rer_098
dfrkk9
In directory_2, I have many files, two of them are ‘rer_098’ and ‘dfrkk9’.
The contents of these files are as follows:
rer_098
--
>123_nbd
sasert
>456_nbd
ffjko
dfrkk9
--
>789_nbd
figyi
>012_nbd
jjjygk
Now in a separate output directory (directory_3), for this above example, I want output files like:
file1.txt
--
>123_nbd
sasert
>456_nbd
ffjko
>789_nbd
figyi
>012_nbd
jjjygk
and so on for file2.txt
Thanks!
This might work for you (GNU parallel):
parallel 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt
This uses two invocations of parallel: the first traverses dir_1 and pipes its output into a second parallel, which cats the listed files and writes the result to dir_3, keeping the original file name from the first invocation.
N.B. The -I option renames the parameter delimiter of the inner parallel from the default {} to ## so that it does not clash with the outer one.
Pretty easy to do with just shell. Something like
for fullname in directory_1/*.txt; do
    file=$(basename "$fullname")
    while read -r line; do
        cat "directory_2/$line"
    done <"$fullname" >"directory_3/$file"
done
for file in directory_1/*.txt; do
    awk 'NR==FNR{ARGV[ARGC++]="directory_2/"$0; next} 1' "$file" > "directory_3/${file##*/}"
done
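To make the ARGV mechanism explicit, here is the same loop written out with comments (a sketch of the identical logic):
for file in directory_1/*.txt; do
    awk '
        NR==FNR {                            # first pass: the list file itself
            ARGV[ARGC++] = "directory_2/"$0  # queue each listed name as a further input file
            next                             # do not fall through to the print rule
        }
        1                                    # the queued files: print every line
    ' "$file" > "directory_3/${file##*/}"
done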

bash: check if multiple files in a directory contain strings from a list

Folks,
I have a text file which contains multiple lines with one string per line:
str1
str2
str3
etc..
I would like to read every line of this file and then search for those strings inside multiple files located in a different directory.
I am not quite sure how to proceed.
Thanks very much for your help.
awk 'NR==FNR{a[$0];next} { for (word in a) if ($0 ~ word) print FILENAME, $0 }' fileOfWords /wherever/dir/*
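For example (a sketch; words.txt, dir/ and their contents are just made-up stand-ins for fileOfWords and /wherever/dir):
$ mkdir -p dir
$ printf 'foo\nbar\n' > words.txt
$ printf 'a line with foo\nnothing here\n' > dir/one.txt
$ awk 'NR==FNR{a[$0];next} {for (word in a) if ($0 ~ word) print FILENAME, $0}' words.txt dir/*
dir/one.txt a line with foo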
for wrd in $(cut -d, -f1 < testfile.txt); do grep "$wrd" dir/files*; done
Use Grep's --file Option
According to grep(1):
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
The -H and -n flags will print the filename and line number of each match. So, assuming you store your patterns in /tmp/foo and want to search all files in /tmp/bar, you could use something like:
# Find regular files with GNU find and grep them all using a pattern
# file.
find /tmp/bar -type f -exec grep -Hnf /tmp/foo {} +
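If the lines in the list are literal strings rather than regular expressions, it may be safer to also pass -F (fixed-string matching) so that characters like . or * are not treated as metacharacters:
find /tmp/bar -type f -exec grep -HnFf /tmp/foo {} +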
while read -r str
do
    echo "$str"
    grep "$str" /path/to/other/files
done < inputfile

How to append contents of multiple files into one file

I want to copy the contents of five files into one file as is. I tried doing it using cp for each file, but that overwrites the contents copied from the previous file. I also tried
paste -d "\n" 1.txt 0.txt
and it did not work.
I want my script to add a newline at the end of each text file.
e.g. files 1.txt, 2.txt, 3.txt: put the contents of 1, 2, 3 into 0.txt.
How do I do it?
You need the cat (short for "concatenate") command, with shell redirection (>) into your output file:
cat 1.txt 2.txt 3.txt > 0.txt
Another option, for those of you who still stumble upon this post like I did, is to use find -exec:
find . -type f -name '*.txt' -exec cat {} + >> output.file
In my case, I needed a more robust option that would look through multiple subdirectories so I chose to use find. Breaking it down:
find .
Look within the current working directory.
-type f
Only interested in files, not directories, etc.
-name '*.txt'
Whittle down the result set by name
-exec cat {} +
Execute cat on the results; the + makes find pass as many file names as possible to each cat invocation rather than spawning one cat per file (thx #gniourf_gniourf)
>> output.file
As explained in other answers, append the cat-ed contents to the end of an output file.
If you only want files of a certain type (here .txt), do something like this:
cat /path/to/files/*.txt >> finalout.txt
If all your files are named similarly you could simply do:
cat *.log >> output.log
If all your files are in a single directory you can simply do
cat * > 0.txt
The files 1.txt, 2.txt, ... will go into 0.txt
for i in {1..3}; do cat "$i.txt" >> 0.txt; done
I found this page because I needed to join 952 files into one, and this approach works much better when you have many files. The loop runs over however many numbers you need and cats each one, using >> to append it onto the end of 0.txt.
Edit:
as brought up in the comments:
cat {1..3}.txt >> 0.txt
or
cat {0..3}.txt >> all.txt
Another option is sed:
sed r 1.txt 2.txt 3.txt > merge.txt
Or...
sed h 1.txt 2.txt 3.txt > merge.txt
Or...
sed -n p 1.txt 2.txt 3.txt > merge.txt # -n is mandatory here
Or without redirection ...
sed wmerge.txt 1.txt 2.txt 3.txt
Note that the last line also writes merge.txt (not wmerge.txt!). You can use w"merge.txt" to avoid confusion with the file name, and -n for silent output.
Of course, you can also shorten the file list with wildcards. For instance, in case of numbered files as in the above examples, you can specify the range with braces in this way:
sed -n w"merge.txt" {1..3}.txt
If your files contain headers and you want to remove them from the output file, you can use:
for f in *.txt; do sed '2,$!d' "$f" >> 0.out; done
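An equivalent sketch using tail instead of sed, since tail -n +2 prints from the second line onward:
for f in *.txt; do tail -n +2 "$f" >> 0.out; done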
All of the (text-) files into one
find . | xargs cat > outfile
xargs makes the output lines of find . the arguments of cat.
find has many options, like -name '*.txt' or -type.
You should check them out if you want to use this in your pipeline.
If the original files contain non-printable characters, they may be lost when using the cat command. Using 'cat -v', the non-printables will be converted to visible character strings, but the output file would still not contain the actual non-printable characters from the original file. With a small number of files, an alternative might be to open the first file in an editor (e.g. vim) that handles non-printing characters. Then maneuver to the bottom of the file and enter ":r second_file_name". That will pull in the second file, including non-printing characters. The same can be done for additional files. When all files have been read in, enter ":w". The end result is that the first file will now contain what it did originally, plus the content of the files that were read in.
Send multiple files to one file (textall.txt):
cat *.txt > textall.txt
If you want to append contents of 3 files into one file, then the following command will be a good choice:
cat file1 file2 file3 | tee -a file4 > /dev/null
It will combine the contents of all files into file4, throwing console output to /dev/null.

Splitting a file and its lines under Linux/bash

I have a rather large file (150 million lines of 10 chars). I need to split it in 150 files of 2 million lines, with each output line being alternatively the first 5 characters or the last 5 characters of the source line.
I could do this in Perl rather quickly, but I was wondering if there was an easy solution using bash.
Any ideas?
Homework? :-)
I would think that a simple pipe with sed (to split each line into two) and split (to split things up into multiple files) would be enough.
The man command is your friend.
Added after confirmation that it is not homework:
How about
sed 's/\(.....\)\(.....\)/\1\n\2/' input_file | split -l 2000000 - out-prefix-
?
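Just the sed half on a two-line sample, assuming GNU sed (which interprets \n in the replacement as a newline):
$ printf 'abcdefghij\nklmnopqrst\n' | sed 's/\(.....\)\(.....\)/\1\n\2/'
abcde
fghij
klmno
pqrst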
I think that something like this could work:
out_file=1
out_pairs=0
while read -r line; do
    if [ "$out_pairs" -ge 1000000 ]; then
        out_file=$((out_file + 1))
        out_pairs=0
    fi
    echo "${line%?????}" >> "out${out_file}"
    echo "${line#?????}" >> "out${out_file}"
    out_pairs=$((out_pairs + 1))
done < "$in_file"
Not sure if it's simpler or more efficient than using Perl, though.
First five chars of each line variant, assuming that the large file called x.txt, and assuming it's OK to create files in the current directory with names x.txt.* :
split -l 2000000 x.txt x.txt.out && (for splitfile in x.txt.out*; do outfile="${splitfile}.firstfive"; echo "$splitfile -> $outfile"; cut -c 1-5 "$splitfile" > "$outfile"; done)
Why not just use the native Linux split utility?
split -d -l 999999 input_filename
This will output new split files with file names like x00 x01 x02 ...
For more info, see the manual:
man split
