Finding a line that shows in a file only once - linux

Assume I have a file with 100 lines. Many of the lines repeat themselves within the file, and only one line does not.
I want to find the line that appears only once. Is there a command for that, or do I have to build some complicated loop like the one below?
My code so far:
#!/bin/bash
filename="repeat_lines.txt"
var="$(wc -l < $filename)"
echo "length:" $var
#cp ex4.txt ex4_copy.txt
for ((index=1; index <= var; index++)); do
    one="$(head -n $index $filename | tail -1)"
    counter=0
    for ((index2=1; index2 <= var; index2++)); do
        two="$(head -n $index2 $filename | tail -1)"
        if [ "$one" == "$two" ]; then
            counter=$((counter+1))
        fi
    done
    echo "$one is $counter times in the text"
done

If I understood your question correctly, then
sort repeat_lines.txt | uniq -u should do the trick.
e.g. for file containing:
a
b
a
c
b
it will output c.
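Note that uniq only compares adjacent lines, which is why the sort is needed first; run on the unsorted file, uniq -u would print every line, since none of the duplicates sit next to each other. A quick sketch of the expected behaviour, assuming the example above is saved as repeat_lines.txt:
$ uniq -u repeat_lines.txt      # unsorted: every line looks unique
a
b
a
c
b
$ sort repeat_lines.txt | uniq -u
c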
For further reference, see the sort and uniq man pages.

You've got a reasonable answer that uses standard shell tools sort and uniq. That's probably the solution you want to use, if you want something that is portable and doesn't require bash.
But an alternative would be to use functionality built into your bash shell. One method might be to use an associative array, which is a feature of bash 4 and above.
$ cat file.txt
a
b
c
a
b
$ declare -A lines
$ while read -r x; do ((lines[$x]++)); done < file.txt
$ for x in "${!lines[#]}"; do [[ ${lines["$x"]} -gt 1 ]] && unset lines["$x"]; done
$ declare -p lines
declare -A lines='([c]="1" )'
What we're doing here is:
declare -A creates the associative array. This is the bash 4 feature I mentioned.
The while loop reads each line of the file, and increments a counter that uses the content of a line of the file as the key in the associative array.
The for loop steps through the array, deleting any element whose counter is greater than 1.
declare -p prints the details of an array in a predictable, re-usable format. You could alternately use another for loop to step through the remaining array elements (of which there might be only one) in order to do something with them.
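For example, a minimal sketch of that alternative, reusing the lines array built above:
$ for x in "${!lines[@]}"; do echo "unique line: $x"; done
unique line: c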
Note that this solution, while fine for small files (say, up to a few thousand lines), may not scale well for very large files of, say, millions of lines. Bash isn't the fastest at reading input this way, and one must be cognizant of memory limits when using arrays.
The sort alternative has the benefit of memory optimization using files on disk for extremely large files, at the expense of speed.
If you're dealing with files of only a few hundred lines, then it's hard to predict which solution will be faster. In the end, the form of output may dictate your choice of solution. The sort | uniq pipe generates a list to standard output. The bash solution above generates the same list as keys in an array. Otherwise, they are functionally equivalent.

Related

Behavior of Arrays in bash scripting and zsh shell (Start Index 0 or 1?)

I need an explanation of the following behavior of arrays in shell scripting:
Imagine the following is given:
arber#host ~> ls
fileA fileB script.sh
Now I can run the following commands:
arber#host ~> ARR=($(ls -d file*))
arber#host ~> echo ${ARR[0]} # start index 0
arber#host ~> echo ${ARR[1]} # start index 1
fileA
arber#host ~> echo ${ARR[2]} # start index 2
fileB
But when I do this via script.sh it behaves differently (start index = 0):
arber#host ~> cat script.sh
#!/bin/bash
ARR=($(ls -d file*))
# get length of an array
aLen=${#ARR[@]}
# use for loop read all items (START INDEX 0)
for (( i=0; i<${aLen}; i++ ));
do
echo ${ARR[$i]}
done
Here the result:
arber#host ~> ./script.sh
fileA
fileB
I use Ubuntu 18.04 LTS and zsh. Can someone explain this?
TL;DR:
bash array indexing starts at 0 (always)
zsh array indexing starts at 1 (unless option KSH_ARRAYS is set)
To always get consistent behaviour, use:
${array[@]:offset:length}
Explanation
For code which works in both bash and zsh, you need to use the offset:length syntax rather than the [subscript] syntax.
Even for zsh-only code, you'll still need to do this (or use emulate -LR zsh) since zsh's array subscripting basis is determined by the KSH_ARRAYS option.
Eg, to reference the first element in an array:
${array[@]:0:1}
Here, array[@] is all the elements, 0 is the offset (which always is 0-based), and 1 is the number of elements desired.
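A quick sketch showing the same expression in action (the array name arr is just for illustration); it prints the same thing in both shells:
arr=(a b c)
echo "${arr[@]:0:1}"   # prints "a" in both bash and zsh
echo "${arr[@]:1:2}"   # prints "b c" in both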
Arrays in Bash are indexed from zero, and in zsh they're indexed from one.
But you don't need the indices for a simple use case such as this. Looping over ${array[@]} works in both:
files=(file*)
for f in "${files[#]}"; do
echo "$f"
done
In zsh you could also use $files instead of "${files[@]}", but that doesn't work in Bash. (And there's the slight difference that it drops empty array elements, but you won't get any from file names.)
Also, don't use $(ls file*): it will break if you have filenames with spaces (see WordSplitting on the BashGuide), and it is completely useless to begin with.
The shell is perfectly capable of generating filenames by itself. That's actually what will happen there, the shell finds all files with names matching file*, passes them to ls, and ls just prints them out again for the shell to read and process.

Trying to scrub 700 000 data against 15 million data

I am trying to scrub 700,000 records from a single file against roughly 15 million records spread across multiple files.
Example: one file of 700,000 records, call it A; a pool of multiple files holding the 15 million records, call it B.
I want pool B to end up containing none of the data from file A.
Below is the shell script I am using. It works, but it is taking a massive amount of time, more than 8 hours, to finish the scrubbing.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution that occurred to me is to break the 700k records down into smaller files of 50K each and send them across the 5 available servers, with pool A also available on each server.
Each server would then work on 2 of the smaller files.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
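If the carriage returns come from a Windows-generated CSV, one way to strip them up front (a sketch, assuming a temporary copy is acceptable) is:
tr -d '\r' < abhinav.csv1 > abhinav.csv1.clean && mv abhinav.csv1.clean abhinav.csv1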
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to back up the files. If your sed does not support the -i option, you will need to do the redirection explicitly inside that loop instead:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on an antique HP-UX, that should be safe, but you may be able to increase the size of the scripts further, which will reduce the number of times you have to edit each file and so speed up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to find the upper limit. I'm not sure whether doing all 700,000 lines at once is feasible, but it should be fastest if you can do it that way.
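To make the generated script concrete: assuming abhinav.csv1 contains values such as foo@example.com and bar@example.com (made-up examples), names.tmp and every sed-script.* chunk will hold one deletion command per suppression value:
/foo@example.com,/d
/bar@example.com,/d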

Line from bash command output stored in variable as string

I'm trying to find a solution to a problem analogous to this one:
#command_A
A_output_Line_1
A_output_Line_2
A_output_Line_3
#command_B
B_output_Line_1
B_output_Line_2
Now I need to compare A_output_Line_2 and B_output_Line_1 and echo "Correct" if they are equal and "Not Correct" otherwise.
I guess the easiest way to do this is to copy a line of output into some variable, and then, after executing the two commands, simply compare the variables and echo something.
I need to implement this in a bash script, and any information on how to get a certain line of output stored in a variable would help me put the pieces together.
Also, it would be cool if anyone could tell me not only how to copy/store a whole line, but also just a word or a sequence, like line 1, bytes 4-12, stored as a string in a variable.
I am not a complete beginner, but I'm also nowhere near an advanced Linux bash user. Thanks for any help in advance, and sorry for my bad English!
An easier way might be to use diff, no?
Something like:
command_A > command_A.output
command_B > command_B.output
diff command_A.output command_B.output
This will work for comparing multiple strings.
But, since you want to know about single lines (and words in the lines) here are some pointers:
# first line of output of command_A
command_A | head -n 1
The -n 1 option says to print only the first line (the default is 10).
# second line of output of command_A
command_A | head -n 2 | tail -n 1
that will take the first two lines of the output of command_A and then the last of those two lines. Happy times :)
You can now store this information in a variable:
export output_A=$(command_A | head -n 2 | tail -n 1)
export output_B=$(command_B | head -n 1)
And then compare it:
if [ "$output_A" == "$output_B" ]; then echo 'Correct'; else echo 'Not Correct'; fi
To just get parts of a string, try looking into cut or (for more powerful stuff) sed and awk.
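For instance, to grab characters 4-12 of the first line into a variable (a sketch; the 4-12 range just mirrors the example in the question):
snippet=$(command_A | head -n 1 | cut -c 4-12)
echo "$snippet"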
Also, just learing a good general purpose scripting language like python or ruby (even perl) can go a long way with this kind of problem.
Use the IFS (internal field separator) to separate on newlines and store the outputs in an array.
#!/bin/bash
IFS='
'
array_a=( $(./a.sh) )
array_b=( $(./b.sh) )
if [ "${array_a[1]}" = "${array_b[0]}" ]; then
echo "CORRECT"
else
echo "INCORRECT"
fi
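On bash 4 and newer, a variant of the same idea using mapfile avoids changing IFS and doesn't glob-expand the output (a sketch, reusing the ./a.sh and ./b.sh placeholders):
#!/bin/bash
mapfile -t array_a < <(./a.sh)
mapfile -t array_b < <(./b.sh)
if [ "${array_a[1]}" = "${array_b[0]}" ]; then
    echo "CORRECT"
else
    echo "INCORRECT"
fi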

How to convert the script from using the read command to the cut command?

Here is the test sample:
test_catalog,test_title,test_type,test_artist
And I can use the following script to split the text above on commas and set the variables respectively:
IFS=","
read cdcatnum cdtitle cdtype cdac < $temp_file
(PS: $temp_file is the path of the test sample file.)
And if I want to replace read with the cut command, any ideas?
There are many solutions:
line=$(head -1 "$temp_file")
echo $line | cut -d, ...
or
cut -d, ... <<< "$line"
or you can tell BASH to copy the line into an array:
typeset IFS=,
ARRAY=( $(head -1 "$temp_file") )
# use it
echo "${ARRAY[0]}" # test_catalog
echo "${ARRAY[1]}" # test_title
...
I prefer the array solution because it gives you a distinct data type and clearly communicates your intent. The echo/cut solution is also somewhat slower.
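For reference, a sketch of the echo/cut variant spelled out for the four fields (the field numbers are assumed from the sample line above):
line=$(head -1 "$temp_file")
cdcatnum=$(echo "$line" | cut -d, -f1)   # test_catalog
cdtitle=$(echo "$line" | cut -d, -f2)    # test_title
cdtype=$(echo "$line" | cut -d, -f3)     # test_type
cdac=$(echo "$line" | cut -d, -f4)       # test_artist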
[EDIT] On the other hand, the read command splits the line into individual variables which gives each value a name. Which is more readable: ${ARRAY[0]} or $cdcatnum?
If you move columns around, you will just need to rearrange the arguments to the read command - if you use arrays, you will have to update all the array indices which you will get wrong.
Also read makes it much more simple to process the whole file:
while read cdcatnum cdtitle cdtype cdac ; do
....
done < "$temp_file"
man cut ?
But seriously, if you have something that works, why do you want to change it?
Personally, I'd probably use awk or perl to manipulate CSV files in linux.

Bash script to copy numbered files in reverse order

I have a sequence of files:
image001.jpg
image002.jpg
image003.jpg
Can you help me with a bash script that copies the images in reverse order so that the final result is:
image001.jpg
image002.jpg
image003.jpg
image004.jpg <-- copy of image003.jpg
image005.jpg <-- copy of image002.jpg
image006.jpg <-- copy of image001.jpg
The text after the arrows is not part of the file names.
Why do I need it? I am creating a video from a sequence of images and would like the video to play "forwards" and then "backwards" (looping the resulting video).
You can use printf to print a number with leading 0s.
$ printf '%03d\n' 1
001
$ printf '%03d\n' 2
002
$ printf '%03d\n' 3
003
Throwing that into a for loop yields:
MAX=6
for ((i=1; i<=MAX; i++)); do
cp $(printf 'image%03d.jpg' $i) $(printf 'image%03d.jpg' $((MAX-i+1)))
done
I think that I'd use an array for this... that way, you don't have to hard code a value for $MAX.
image=( image*.jpg )
MAX=${#image[*]}
for i in ${image[*]}
do
num=${i:5:3} # grab the digits
compliment=$(printf '%03d' $(echo $MAX-$num | bc))
ln $i copy_of_image$compliment.jpg
done
I used 'bc' for arithmetic because bash interprets leading 0s as an indicator that the number is octal, and the parameter expansion in bash isn't powerful enough to strip them without jumping through hoops. I could have done that in sed, but as long as I was calling something outside of bash, it made just as much sense to do the arithmetic directly.
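For reference, this is the octal behaviour being described, along with the 10# prefix that forces base 10 inside an arithmetic expansion; without it, a value like 008 is rejected as an invalid octal constant. A quick sketch:
$ echo $(( 010 ))      # a leading zero makes bash read the number as octal
8
$ echo $(( 10#010 ))   # the 10# prefix forces base 10
10
$ num=008
$ echo $(( 10#$num + 1 ))
9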
I suppose that Kuegelman's script could have done something like this:
MAX=$(ls image*.jpg | wc -l)
That script has bigger problems though, because it's overwriting half of the images:
cp image001.jpg image006.jpg # wait wait!!! what happened to image006.jpg???
Also, once you get above 007, you run into the octal problem.
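For what it's worth, a sketch of a variant that avoids both problems by only ever writing the new file names (N is assumed to be the number of original images):
N=3   # number of original images
for ((i=1; i<=N; i++)); do
    # copies image003 -> image004, image002 -> image005, image001 -> image006
    cp "$(printf 'image%03d.jpg' $((N-i+1)))" "$(printf 'image%03d.jpg' $((N+i)))"
done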
