I am not sure how to do this at all.
I have two text files, FILE1 and FILE2.
I would like to run a for loop for each file at the same time and display the
contents next to each other.
For example,
for i in $(cat FILE1); do echo $i; done
for j in $(cat FILE2); do echo $j; done
I would like to combine these two commands, so I can run both files at the same time and have an output like $i $j
Solution 1
Use the paste command
paste FILE1 FILE2
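As a quick illustration with two hypothetical three-line files (by default paste separates the columns with a tab; -d ' ' switches that to a space):
printf 'a\nb\nc\n' > FILE1
printf '1\n2\n3\n' > FILE2
paste -d ' ' FILE1 FILE2
# prints:
# a 1
# b 2
# c 3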
Solution 2
You can do this if the two files have the same number of lines.
#!/bin/bash
t=$(wc -l < FILE1)
for i in $(seq 1 "$t"); do
    # line $i of each file, printed side by side
    echo "$(head -n "$i" FILE1 | tail -n 1) $(head -n "$i" FILE2 | tail -n 1)"
done
You can extend it to handle an unequal number of lines, as sketched below.
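A minimal sketch of that extension, assuming you want blank fields once the shorter file runs out (head returns its last line when $i is past the end, hence the explicit blanking):
#!/bin/bash
t1=$(wc -l < FILE1)
t2=$(wc -l < FILE2)
max=$(( t1 > t2 ? t1 : t2 ))
for i in $(seq 1 "$max"); do
    l1=$(head -n "$i" FILE1 | tail -n 1)
    l2=$(head -n "$i" FILE2 | tail -n 1)
    [ "$i" -gt "$t1" ] && l1=""    # past the end of FILE1
    [ "$i" -gt "$t2" ] && l2=""    # past the end of FILE2
    echo "$l1 $l2"
done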
You shouldn't be using for loops at all; see Bash FAQ 001. Instead, use two read commands in a single while loop.
while IFS= read -r line1 && IFS= read -r line2 <&3; do
printf '%s | %s\n' "$line1" "$line2"
done < FILE1 3< FILE2
Each read command reads from a separate file descriptor. In this version, the loop will exit when the shorter of the two files is exhausted.
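A quick demonstration with hypothetical unequal-length files, showing the loop stopping after the shorter file is exhausted:
printf 'a\nb\nc\n' > FILE1
printf '1\n2\n' > FILE2
while IFS= read -r line1 && IFS= read -r line2 <&3; do
    printf '%s | %s\n' "$line1" "$line2"
done < FILE1 3< FILE2
# prints:
# a | 1
# b | 2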
There are two different questions being asked here. Other answers address how to display the contents of the two files in two columns. Running two loops simultaneously (which is the wrong way to address the first problem) can be done by running each of them asynchronously:
for i in ${seqi?}; do ${cmdi?}; done &
for j in ${seqj?}; do ${cmdj?}; done &
wait
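A concrete, hypothetical illustration of that pattern; note that the interleaving of the two loops' output is not deterministic:
for i in 1 2 3; do echo "first loop:  $i"; sleep 1; done &
for j in a b c; do echo "second loop: $j"; sleep 1; done &
wait    # block until both background loops have finished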
Although you could also implement paste -d ' ' file1 file2 with something like:
while read -r line_from_file1; p=$?; read -r line_from_file2 <&3 || test "$p" = 0; do
    echo "$line_from_file1" "$line_from_file2"
done < file1 3< file2
Another option, in bash v4+, is to read the two files into two arrays, then echo the array elements side by side:
# Load each file into its own array
readarray -t f1 < file1
readarray -t f2 < file2
# Print elements of both arrays side-by-side
for ((i=0; i<${#f1[@]}; i++)); do echo "${f1[i]}" "${f2[i]}"; done
Or change the echo to printf if you want the columns to line up:
printf "%-20s %-20s\n" "${f1[i]}" "${f2[i]}"
I'm not suggesting you do this if your files are 100s of megabytes.
Related
I have a number of files that look something like this:
418_S32_L003_R1_001.fastq.gz
418_S32_L003_R2_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
418_S1_L002_R2_001.fastq.gz
419_S32_L003_R1_001.fastq.gz
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R1_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
The first number is different for each set of four files.
Samples that start with the same number should be combined together if they have the same value for *R1* or *R2*.
So, these two samples should be concatenated:
418_S32_L003_R1_001.fastq.gz
418_S1_L002_R1_001.fastq.gz
And these two should be concatenated:
419_S32_L003_R2_001.fastq.gz
419_S1_L002_R2_001.fastq.gz
And this should be repeated for all files within the directory.
Is there a good way to do this in bash other than manually concatenating like this:
cat 418_S32_L003_R1_001.fastq.gz 418_S1_L002_R1_001.fastq.gz > 418_R1.fastq.gz
You can loop over the files and append each one to a target file derived from its own name.
for file in *.fastq.gz; do
    IFS='_' read -ra array <<< "$file"          # e.g. 418 S32 L003 R1 001.fastq.gz
    name="${array[0]}_${array[3]}.fastq.gz"     # e.g. 418_R1.fastq.gz
    cat "$file" >> "$name"
done
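As a hedged suggestion (not part of the original answer), you could dry-run the same name parsing first, to check that the targets look right before appending anything:
# Hypothetical dry run: print what would be appended where, without writing anything
for file in *.fastq.gz; do
    IFS='_' read -ra array <<< "$file"
    echo "cat $file >> ${array[0]}_${array[3]}.fastq.gz"
done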
a=0
for i in *; do
    for j in *; do
        if [ "$(echo "$j" | cut -d _ -f 1)" = "$(echo "$i" | cut -d _ -f 1)" ]; then
            cat "$i" "$j" > "$a"
            a=$((a+1))
        fi
    done
done
This might work for you (GNU parallel):
parallel --dry-run -N4 --plus cat {1} {4} \> {1%_.*}_R1.{1+..} ::: *R[12]*
This will print out the intended cat commands; check the results and, if they look right, remove the --dry-run option.
I have a file example.txt with about 3000 lines with a string in each line. A small file example would be:
>cat example.txt
saudifh
sometestPOIFJEJ
sometextASLKJND
saudifh
sometextASLKJND
IHFEW
foo
bar
I want to check all repeated lines in this file and output them. The desired output would be:
>checkRepetitions.sh
found two equal lines: index1=1 , index2=4 , value=saudifh
found two equal lines: index1=3 , index2=5 , value=sometextASLKJND
I made a script checkRepetitions.sh:
#!/bin/bash
size=$(cat example.txt | wc -l)
for i in $(seq 1 $size); do
i_next=$((i+1))
line1=$(cat example.txt | head -n$i | tail -n1)
for j in $(seq $i_next $size); do
line2=$(cat example.txt | head -n$j | tail -n1)
if [ "$line1" = "$line2" ]; then
echo "found two equal lines: index1=$i , index2=$j , value=$line1"
fi
done
done
However, this script is very slow; it takes more than 10 minutes to run, whereas in Python it takes less than 5 seconds. I tried to store the file in memory by doing lines=$(cat example.txt) and then line1=$(cat $lines | cut -d',' -f$i), but this is still very slow...
When you do not want to use awk (a good tool for the job, parsing the input only once),
you can run through the lines several times. Sorting is expensive, but this solution avoids the loops you tried.
grep -Fnxf <(uniq -d <(sort example.txt)) example.txt
With uniq -d <(sort example.txt) you find all lines that occur more than once. grep then searches for these complete lines (-x) as fixed strings rather than regular expressions (-F), reading the patterns from a file (-f) and printing the line number of each match (-n).
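For the example.txt above, this should print something like:
1:saudifh
3:sometextASLKJND
4:saudifh
5:sometextASLKJND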
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why your script is so slow.
$ cat tst.awk
{ val2hits[$0] = val2hits[$0] FS NR }
END {
    for (val in val2hits) {
        numHits = split(val2hits[val],hits)
        if ( numHits > 1 ) {
            printf "found %d equal lines:", numHits
            for ( hitNr=1; hitNr<=numHits; hitNr++ ) {
                printf " index%d=%d ,", hitNr, hits[hitNr]
            }
            print " value=" val
        }
    }
}
$ awk -f tst.awk file
found 2 equal lines: index1=1 , index2=4 , value=saudifh
found 2 equal lines: index1=3 , index2=5 , value=sometextASLKJND
To give you an idea of the performance difference, here is a bash script written to be as efficient as possible and an equivalent awk script:
bash:
$ cat tst.sh
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac
# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0
while IFS= read -r line; do
    (( ++lineNum ))
    if [[ ${lines[$line]} ]]; then
        printf 'Content previously seen on line %s also seen on line %s: %s\n' \
            "${lines[$line]}" "$lineNum" "$line"
    fi
    lines[$line]=$lineNum
done < "$1"
$ time ./tst.sh file100k > ou.sh
real 0m15.631s
user 0m13.806s
sys 0m1.029s
awk:
$ cat tst.awk
lines[$0] {
printf "Content previously seen on line %s also seen on line %s: %s\n", \
lines[$0], NR, $0
}
{ lines[$0]=NR }
$ time awk -f tst.awk file100k > ou.awk
real 0m0.234s
user 0m0.218s
sys 0m0.016s
There are no differences in the output of both scripts:
$ diff ou.sh ou.awk
$
The above is using 3rd-run timing to avoid caching issues and being tested against a file generated by the following awk script:
awk 'BEGIN{for (i=1; i<=10000; i++) for (j=1; j<=10; j++) print j}' > file100k
When the input file had zero duplicate lines (generated by seq 100000 > nodups100k) the bash script executed in about the same amount of time as it did above while the awk script executed much faster than it did above:
$ time ./tst.sh nodups100k > ou.sh
real 0m15.179s
user 0m13.322s
sys 0m1.278s
$ time awk -f tst.awk nodups100k > ou.awk
real 0m0.078s
user 0m0.046s
sys 0m0.015s
To demonstrate a relatively efficient (within the limits of the language and runtime) native-bash approach, which you can see running in an online interpreter at https://ideone.com/iFpJr7:
#!/bin/bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: bash 4.0 required" >&2; exit 1;; esac
# initialize an associative array, mapping each string to the last line it was seen on
declare -A lines=( )
lineNum=0
while IFS= read -r line; do
    lineNum=$(( lineNum + 1 ))
    if [[ ${lines[$line]} ]]; then
        printf 'found two equal lines: index1=%s, index2=%s, value=%s\n' \
            "${lines[$line]}" "$lineNum" "$line"
    fi
    lines[$line]=$lineNum
done <example.txt
Note the use of while read to iterate line-by-line, as described in BashFAQ #1: How can I read a file line-by-line (or field-by-field)?; this permits us to open the file only once and read through it without needing any command substitutions (which fork off subshells) or external commands (which need to be individually started up by the operating system every time they're invoked, and are likewise expensive).
The other part of the improvement here is that we're reading the whole file only once -- implementing an O(n) algorithm -- as opposed to running O(n^2) comparisons as the original code did.
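Run against the example.txt from the question, this should print:
found two equal lines: index1=1, index2=4, value=saudifh
found two equal lines: index1=3, index2=5, value=sometextASLKJND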
I have one file called uniq.txt (20,000 lines).
head uniq.txt
1
103
10357
1124
1126
I have another file called all.txt (106,371,111 lines)
head all.txt
cg0001 ? 1 -0.394991215660192
cg0001 AB 103 -0.502535661820095
cg0002 A 10357 -0.563632386999913
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
cg0004 A 10357 -0.563632386999913
cg0003 AB 103 -0.64926706504459
I would like to make new 20,000 files from all.txt matching each line pattern of uniq.txt. For example,
head 1.newfile.txt
cg0001 ? 1 -0.394991215660192
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095
head 103.newfile.txt
cg0001 AB 103 -0.502535661820095
cg0003 AB 103 -0.64926706504459
head 10357.newfile.txt
cg0002 A 10357 -0.563632386999913
cg0004 A 10357 -0.563632386999913
Is there any way that I can make new 20,000 files really fast?
My current script takes 1 min to make one new file. I guess it's scanning all.txt file every time it makes a new file.
You can try it with awk. Ideally you don't need >> in awk, but since you have stated there would be 20,000 files, we don't want to exhaust the system's resources by keeping too many files open.
awk '
NR==FNR { names[$0]++; next }
($3 in names) { file=$3".newfile.txt"; print $0 >>(file); close (file) }
' uniq.txt all.txt
This will first scan the uniq.txt file into memory creating a lookup table of sorts. It will then read through the all.txt file and start inserting entries into corresponding files.
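For the sample uniq.txt and all.txt shown in the question, the result should match the expected output, for example:
$ head 1.newfile.txt
cg0001 ? 1 -0.394991215660192
cg0003 ? 1 -0.394991215660444
cg0004 ? 1 -0.502535661820095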
This uses a while loop. It may or may not be the quickest way, but give it a try:
lines_to_files.sh
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
num=$(echo "$line" | awk '{print $3}')
echo "$line" >> "/path/to/save/${num}_newfile.txt"
done < "$1"
usage:
$ ./lines_to_files.sh all.txt
This should create a new file for each line in your all.txt file based on the third column. As it reads each line it will add it to the appropriate file. Keep in mind that if you run the script multiple times, it will keep appending to files that already contain data from previous runs.
An explanation of the while loop used above for reading the file can be found here:
↳ https://stackoverflow.com/a/10929511/499581
You can read each line into a Bash array, then append to the file named after the number in column three (array index 2):
#!/bin/bash
while read -ra arr; do
echo "${arr[@]}" >> "${arr[2]}".newfile.txt
done < all.txt
This creates space separated output. If you prefer tab separated, it depends a bit on your input data: if it is tab separated as well, you can just set IFS to a tab to get tab separated output:
IFS=$'\t'
while read -ra arr; do
echo "${arr[*]}" >> "${arr[2]}".newfile.txt
done < all.txt
Notice the change in printing the array: the * is now actually required (see the small demo below).
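A tiny illustration of the difference (a hypothetical array, not part of the original answer):
arr=(a b c)
IFS=$'\t'
echo "${arr[@]}"    # each element expands to a separate word; echo joins its arguments with spaces
echo "${arr[*]}"    # elements are joined with the first character of IFS, a tab here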
Or, if the input data is not tab separated (or we don't know), we can set IFS in a subshell in each loop:
while read -ra arr; do
( IFS=$'\t'; echo "${arr[*]}" >> "${arr[2]}".newfile.txt )
done < all.txt
I'm not sure which is more expensive, spawning a subshell or a few parameter assignments, but I suspect it's the subshell. To avoid spawning it, we can set and reset IFS in each iteration instead:
while read -ra arr; do
old_ifs="$IFS"
IFS=$'\t'
echo "${arr[*]}" >> "${arr[2]}".newfile.txt
IFS="$old_ifs"
done < all.txt
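Another option, offered only as a sketch and not part of the original answer: scope IFS with local inside a function, which avoids both the subshell and the manual save/restore:
write_rows() {
    local IFS=$'\t'    # the IFS change is confined to this function
    while read -ra arr; do
        echo "${arr[*]}" >> "${arr[2]}".newfile.txt
    done < all.txt
}
write_rows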
OP asked for fast ways. This is the fastest I've found.
sort -S 4G -k3,3 all.txt |
awk '{if(last!=$3){close(file); file=$3".newfile.txt"; last=$3} print $0 > file}'
Total time was 2m4.910s vs 10m4.058s for the runner-up. Note that it uses 4 GB of memory (possibly faster if more, definitely slower if less) and that it ignores uniq.txt.
Results for full-sized input files (100,000,000-line all.txt, 20,000-line uniq.txt):
sort + awk write          (me)               ~800,000 input lines/second
awk append                (@jaypal-singh)    ~200,000 input lines/second
bash append               (@benjamin-w)       ~15,000 input lines/second
bash append + extra awk   (@lll)               ~2,000 input lines/second
Here's how I created the test files:
seq 1 20000 | sort -R | sed 's/.*/cg0001\tAB\t&\t-0.502535661820095/' > tmp.txt
seq 1 5000 | while read i; do cat tmp.txt; done > all.txt
seq 1 20000 | sort -R > uniq.txt
PS: Apologies for the flaw in my original test setup.
I'm obviously missing something simple, and I know the problem is that it's creating a blank output, which is why it can't compare. However, if someone could shed some light on this it would be great; I haven't isolated it.
Ultimately, I'm trying to compare the md5sum from a list stored in a txt file, to that stored on the server. If errors, I need it to report that. Here's the output:
root@vps [~/testinggrounds]# cat md5.txt | while read a b; do
> md5sum "$b" | read c d
> if [ "$a" != "$c" ] ; then
> echo "md5 of file $b does not match"
> fi
> done
md5 of file file1 does not match
md5 of file file2 does not match
root@vps [~/testinggrounds]# md5sum file*
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
root@vps [~/testinggrounds]# cat md5.txt
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
Not directly answering your question, but md5sum(1):
-c, --check
read MD5 sums from the FILEs and check them
Like:
$ ls
1.txt 2.txt md5.txt
$ cat md5.txt
d3b07384d113edec49eaa6238ad5ff00 1.txt
c157a79031e1c40f85931829bc5fc552 2.txt
$ md5sum -c md5.txt
1.txt: OK
2.txt: OK
The problem that you are having is that your inner read is executed in a subshell. In bash, a subshell is created when you pipe a command. Once the subshell exits, the variables $c and $d are gone. You can use process substitution to avoid the subshell:
while read -r -u3 sum filename; do
read -r cursum _ < <(md5sum "$filename")
if [[ $sum != $cursum ]]; then
printf 'md5 of file %s does not match\n' "$filename"
fi
done 3<md5.txt
The redirection 3<md5.txt causes the file to be opened as file descriptor 3. The -u 3 option to read causes it to read from that file descriptor. The inner read still reads from stdin.
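A minimal demonstration of the subshell behaviour being described here:
echo "hello world" | read c d
echo "c=$c"    # prints "c=" in bash; the read ran in a subshell, so $c never made it back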
I'm not going to argue; I simply try to avoid a second read inside the loop.
#! /bin/bash
cat md5.txt | while read -r sum file
do
    actual_sum=$(md5sum "$file" | awk '{print $1}')
    if [ "$sum" != "$actual_sum" ]
    then
        echo "md5 of file $file does not match"
    else
        echo "$file is fine"
    fi
done
I have a file with a list of addresses; it looks like this (ADDRESS_FILE):
0xf012134
0xf932193
.
.
0fx12923a
I have another file with a list of numbers; it looks like this (NUMBERS_FILE):
20
40
.
.
12
I want to cut the first 20 lines from ADDRESS_FILE and put them into a new file,
then cut the next 40 lines from ADDRESS_FILE, and so on...
I know that a series of sed commands like the ones given below does the job:
sed -n 1,20p ADDRESS_FILE > temp_file_1
sed -n 21,60p ADDRESS_FILE > temp_file_2
.
.
sed -n somenumber,endoffilep ADDRESS_FILE > temp_file_n
But I want to do this automatically using shell scripting, changing the number of lines to cut on each sed execution.
How can I do this?
Also, on a general note, which text processing commands in Linux are most useful in such cases?
Assuming your line numbers are in a file called lines, sorted etc., try:
#!/bin/sh
j=0
count=1
while read -r i; do
    # each $i is the cumulative end line number of the next chunk
    sed -n "$((j+1)),${i}p" ADDRESS_FILE > filename.$count
    j=$i
    count=$((count+1))
done < lines
Note. The above doesn't assume a consistent number of lines to split on for each iteration.
Since you've additionally asked for a general utility, try split. However this splits on a consistent number of lines, and is perhaps of limited use here.
Here's an alternative that reads directly from the NUMBERS_FILE:
n=0; i=1
while read; do
sed -n ${i},+$(( REPLY - 1 ))p ADDRESS_FILE > temp_file_$(( n++ ))
(( i += REPLY ))
done < NUMBERS_FILE
size=$(wc -l < ADDRESS_FILE)
i=1
n=1
while [ "$n" -le "$size" ]
do
    sed -n "$n,$((n+19))p" ADDRESS_FILE > temp_file_$i
    i=$((i+1))
    n=$((n+20))
done
or just
split -l20 ADDRESSS_FILE temp_file_
(thanks Brian Agnew for the idea).
An ugly solution which works with a single sed invocation; it can probably be made less horrible.
This generates a tiny sed script to split the file:
#!/bin/bash
sum=0
count=0
sed -n -f <(while read -r n ; do
              echo "$((sum+1)),$((sum += n)) w temp_file_$((count++))"
            done < NUMBERS_FILE) ADDRESS_FILE
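With a NUMBERS_FILE starting 20, 40, ... as in the question, the generated sed script would begin:
1,20 w temp_file_0
21,60 w temp_file_1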