Need to remove lines from syslog that contain a certain duplicated string

Hey guys, wondering if anyone can help with this little dilemma.
I'm trying to remove lines from a syslog text file that have duplicate strings:
Mar 10 06:51:11[http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360]
then a few lines down
Mar 10 06:52:03 [http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360
Both lines have the same u: number, but the issue is that I need to remove the duplicates and leave just one, and the file has multiple duplicates of different u: numbers and is 14,000 lines long.
Can anyone tell me whether I can use awk, sed, or sort for something like this, i.e. removing lines in which a certain part of the string is a duplicate?
I basically need to de-dupe, but the problem is that only one little part of the string is the indicator.
Any help is appreciated! Thanks.

There is probably a better way to do this, but here's my first stab at it:
First, create a new file, call it uvalues.txt.
Read the file line by line; for each line, grep for "u:" and store the result in $u.
If $u exists in uvalues.txt, ignore this line.
If $u does not exist in uvalues.txt, write this line to another file and write $u to uvalues.txt.
Repeat.
Code would be something like this:
#!/bin/bash
touch uvalues.txt
while IFS= read -r l; do
    # pull the number after "u:" out of the line (empty if the line has none)
    uvalue=$(echo "$l" | grep -o 'u:[0-9]*' | head -n 1 | cut -d':' -f2)
    # if uvalue is not empty, check it against our temp file
    if [ -n "$uvalue" ]; then
        existing_value=$(grep "$uvalue" uvalues.txt)
        # if it is empty, it means it's not a duplicate
        if [ -z "$existing_value" ]; then
            echo "$l" >> save.txt
            echo "$uvalue" >> uvalues.txt
        fi
    fi
done < file.txt
rm uvalues.txt
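If awk is an option, a single pass keyed on the u: value should be far faster than the loop above. This is only a sketch and assumes the relevant field always looks like u:NUMBER:
awk '{
    if (match($0, /u:[0-9]+/)) {
        # key on the digits after "u:"; print the line only the first time we see that key
        key = substr($0, RSTART + 2, RLENGTH - 2)
        if (!(key in seen)) { seen[key] = 1; print }
    } else {
        print    # lines without a u: value are kept as-is
    }
}' file.txt > save.txt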

Related

Using columns in bash

I've used the column command to split some of my output into 3 different columns. The problem is with the final column: the filetype output is being split into a 4th and 5th column because of the spaces.
Can somebody tell me how to change my code so that output stays under the Filetype column?
list_files()
{
    if [ "$(ls -A ~/.junkdir)" ]
    then
        filesdir=/home/student/.junkdir/*
        echo "Listing files in Junk Directory"
        output="FILENAME SIZE(BYTES) TYPE \n\n---------------- ---------------- ------------------- "
        for listed_file in $filesdir
        do
            file_name=$(basename "file $listed_file" | cut -d ' ' -f1)
            file_size=$(du --bytes $listed_file | awk '{print $1}')
            file_type=$(file $listed_file | cut -d ' ' -f2-)
            output="$output\n${file_name} ${file_size} ${file_type}\n"
        done
        echo -ne $output | column -t
    else
        echo 'Junk directory is empty'
    fi
}
The output at the moment..
Listing files in Junk Directory
FILENAME SIZE(BYTES) TYPE
---------------- ---------------- -------------------
files.txt 216 ASCII text
forLoop 401 Bourne-Again shell script,
ASCII text executable
I rarely give full solution, but it seems you are really stuck.
list_files2()
{
    filesdir=/home/student/.junkdir/*
    printf "FILENAME\1SIZE(BYTES)\1TYPE\1\n\n----------------\1----------------\1-------------------\n"
    for listed_file in $filesdir
    do
        file_name=$(basename "file $listed_file" | cut -d ' ' -f1)
        file_size=$(du --bytes $listed_file | awk '{print $1}')
        file_type=$(file $listed_file | cut -d ' ' -f2-)
        printf "%s\1%s\1%s\n" "${file_name}" "${file_size}" "${file_type}"
    done
}

list_files()
{
    if [ "$(ls -A ~/.junkdir)" ]
    then
        echo "Listing files in Junk Directory"
        list_files2 | column -t -s $'\1'
    else
        echo 'Junk directory is empty'
    fi
}
I slightly reorganized your code and made some other changes as well. I will explain what I did.
$'\1' is the 0x01 character. I originally proposed $'\0', but oddly my version of column misbehaves when that character appears in the input. In general it's a bad idea in shell scripting to assume a blank separator; in your case you got caught by spaces, which is understandable, because whitespace has so many overloaded meanings that you cannot keep it out of human-readable text. The solution is to use an exotic character such as 0x00 or 0x01 as the separator instead: it will almost never show up as part of the text, so it is very safe to use, and 0x00 in particular is a common separator in portable shell scripting.
Do not concatenate strings the way you did. There are a couple of problems with it:
one is that you keep concatenating even though you don't really need the intermediate result;
another biteback is: what if your text contains something like \n that should be taken literally? echo -e is not going to distinguish that, and that's yet another SO question down the road.
Using printf is generally preferable in shell, though I use echo a lot myself as well. Here the benefit of using printf is evident.
I don't have your files, so I didn't try it, though I would expect it to work. Let me know if there are glitches here and there.
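As a quick standalone illustration of the separator idea (the sample row below is made up), note how the spaces inside the file type stay in a single column:
printf 'forLoop\1401\1Bourne-Again shell script, ASCII text executable\n' | column -t -s $'\1'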
Perhaps you can try
output="$output\n${file_name}\t${file_size}\t${file_type}\n"
...
echo -ne $output

Line from bash command output stored in variable as string

I'm trying to find a solution to a problem analogous to this one:
#command_A
A_output_Line_1
A_output_Line_2
A_output_Line_3
#command_B
B_output_Line_1
B_output_Line_2
Now I need to compare A_output_Line_2 and B_output_Line_1 and echo "Correct" if they are equal and "Not Correct" otherwise.
I guess the easiest way to do this is to copy a line of output in some variable and then after executing the two commands, simply compare the variables and echo something.
This I need to implement in a bash script and any information on how to get certain line of output stored in a variable would help me put the pieces together.
Also, it would be cool if anyone can tell me not only how to copy/store a line, but probably just a word or sequence like : line 1, bytes 4-12, stored like string in a variable.
I am not a complete beginner but also not anywhere near an advanced Linux bash user. Thanks for any help in advance and sorry for my bad English!
An easier way might be to use diff, no?
Something like:
command_A > command_A.output
command_B > command_B.output
diff command_A.output command_B.output
This will work for comparing multiple strings.
But, since you want to know about single lines (and words in the lines) here are some pointers:
# first line of output of command_A
command_A | head -n 1
The -n 1 option says only to use the first line (default is 10 I think)
# second line of output of command_A
command_A | head -n 2 | tail -n 1
that will take the first two lines of the output of command_A and then the last of those two lines. Happy times :)
You can now store this information in a variable:
export output_A=`command_A | head -n 2 | tail -n 1`
export output_B=`command_B | head -n 1`
And then compare it:
if [ "$output_A" == "$output_B" ]; then echo 'Correct'; else echo 'Not Correct'; fi
To just get parts of a string, try looking into cut or (for more powerful stuff) sed and awk.
Also, just learning a good general purpose scripting language like python or ruby (even perl) can go a long way with this kind of problem.
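For the "line 1, bytes 4-12" part of the question, here is a small sketch combining the tools above (command_A stands for whatever command you are running):
# take the first line of output, then characters 4 through 12 of that line
fragment=$(command_A | head -n 1 | cut -c4-12)
echo "$fragment"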
Use the IFS (internal field separator) to separate on newlines and store the outputs in an array.
#!/bin/bash
IFS='
'
array_a=( $(./a.sh) )
array_b=( $(./b.sh) )
if [ "${array_a[1]}" = "${array_b[0]}" ]; then
echo "CORRECT"
else
echo "INCORRECT"
fi
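On bash 4 and later, mapfile (a.k.a. readarray) avoids fiddling with IFS; here is an untested sketch using the same a.sh/b.sh placeholders:
#!/bin/bash
# read each command's output into an array, one element per line
mapfile -t array_a < <(./a.sh)
mapfile -t array_b < <(./b.sh)
if [ "${array_a[1]}" = "${array_b[0]}" ]; then
    echo "CORRECT"
else
    echo "INCORRECT"
fi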

Simple sed substitution

I have a text file with a list of files with the structure ABC123456A or ABC123456AA. What I would like to do is check whether the file ABC123456ZZP also exists, i.e. I want to substitute the letter(s) after ABC123456 with ZZP.
Can I do this using sed?
Like this?
X=ABC123456 ; echo ABC123456AA | sed -e "s,\(${X}\).*,\1ZZP,"
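To run that substitution over every name in the list and test for the resulting file, something along these lines should work (assuming the list of names is in list.txt and the files live in the current directory):
# keep the first 9 characters of each name, append ZZP, then test for the file
sed -e 's/^\(.\{9\}\).*/\1ZZP/' list.txt | while IFS= read -r name; do
    [ -f "$name" ] && echo "$name exists!"
done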
You could use sed as wilx suggests but I think a better option would be bash.
while read file; do
    base=${file:0:9}
    [[ -f ${base}ZZP ]] && echo "${base}ZZP exists!"
done < file
This will loop over each line in file,
then base is set to the first 9 characters of the line (leading/trailing whitespace stripped by read),
then it checks whether a file exists with ZZP appended to base and prints a message if it does.
Look:
$ str="ABC123456AA"
$ echo "${str%[[:alpha:]][[:alpha:]]*}"
ABC123456
so do this:
while IFS= read -r tgt; do
    tgt="${tgt%[[:alpha:]][[:alpha:]]*}ZZP"
    [[ -f "$tgt" ]] && printf "%s exists!\n" "$tgt"
done < file
It will still fail for file names that contain newlines, so let us know if you have that situation, but unlike the other posted solutions it will work for file names with other than 9 key characters, and for file names containing spaces, commas, backslashes, globbing characters, etc., and it is efficient.
Since you said now that you only need the first 9 characters of each line and you were happy with piping every line to sed, here's another solution you might like:
cut -c1-9 file |
while IFS= read -r tgt; do
[[ -f "${tgt}ZZP" ]] && printf "%sZZP exists!\n" "$tgt"
done
It'd be MUCH more efficient and more robust than the sed solution, and similar in both contexts to the other shell solutions.

Error with a script in bash

I have a little error with a script I wrote in bash and I can't figure out what I'm doing wrong.
Note that I'm using this script for thousands of calculations and this error happened only a few times (like 20 or so), but it still happened.
What the script does is this: basically it takes as input a web page that I got from a site with the utility w3m, and it counts all the occurrences of the words in it... Afterwards it orders them from the most common to the ones that occur only once.
this is the code:
#!/bin/bash
# counts the numbers of words from specific sites #
# writes in a file the occurrences ordered from the most common #
touch check # file used to analyze the occurrences
touch distribution # final file ordered
page=$1 # the web page that needs to be analyzed
occurrences=$2 # temporary file for the occurrences
dictionary=$3 # dictionary used for another purpose (ignore this)
# write the words one by column
cat $page | tr -c [:alnum:] "\n" | sed '/^$/d' > check
# loop to analyze the words
cat check | while read words
do
word=${words}
strlen=${#word}
# ignores blacklisted words or small ones
if ! grep -Fxq $word .blacklist && [ $strlen -gt 2 ]
then
# if the word isn't in the file
if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
then
echo "$word: 1" | cat >> $occurrences
# else if it is already in the file, it calculates the occurrences
else
old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
### HERE IS THE ERROR, EITHER THE LET OR THE SED ###
let "new=old+1"
sed -i "s/^$word: $old$/$word: $new/g" $occurrences
fi
fi
done
# orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution
# ignore this, not important
grep -w "1" distribution | awk -F ":" '{print $1}' > temp_dictionary
for line in `cat temp_dictionary`
do
if ! grep -Fxq $line $dictionary
then
echo $line >> $dictionary
fi
done
rm check
rm temp_dictionary
this is the error: (I'm translating it, so it could be different in English)
./wordOccurrences line:30 let:x // where x is a number, usually 9 or 10 (but also 11, 13, etc)
1: syntax error in the expression (the error token is 1)
sed: expression -e #1, character y: command 's' not terminated // where y is another number (this one is also usually 9 or 10), with y being different from x
EDIT:
Talking with kev it looks like it's a newline problem
I added an echo between let and sed to print the sed and it worked perfectly for like 5 to 10 minutes until that error. Usually the sed without error looked like this:
s/^CONSULENTI: 6$/CONSULENTI: 7/g
but when I got the error it was like this:
s/^00145: 1
1$/00145: 4/g
how to fix this?
If you get a newline in $old, it means awk prints two lines, so there is a duplicate in $occurrences.
The script seems complicated for counting words, and it is not efficient because it launches many processes and processes the file in a loop;
maybe you can do something similar with
sort | uniq -c
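A rough sketch of that approach, producing the same "word: count" format as your script (blacklist handling left out):
tr -cs '[:alnum:]' '\n' < "$page" |
    awk 'length > 2' |
    sort | uniq -c | sort -rn |
    awk '{print $2": "$1}' > distribution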
You should also consider that your case-insensitivity is not consistent throughout the program. I created a page with just "foooo" in it and ran the program, then created one with "Foooo" in it and ran the program again. The 'old=`awk...' line sets 'old' to the empty string because awk is matching case sensitively. This results in the occurrences file not being updated. The subsequent sed and possibly some of the greps are also case sensitive.
This may not be the only error since it doesn't explain the error message you saw, but it is an indication that the same word with different capitalization will be handled erroneously by your script.
The following would separate the words, lowercase them, and then remove the ones smaller than three characters:
tr -cs '[:alnum:]' '\n' <foo | tr '[:upper:]' '[:lower:]' | egrep -v '^.{0,2}$'
Using this at the front of your script would mean that the rest of the script would not have to be case insensitive to be correct.

Quick unix command to display specific lines in the middle of a file?

Trying to debug an issue with a server and my only log file is a 20GB log file (with no timestamps even! Why do people use System.out.println() as logging? In production?!)
Using grep, I've found an area of the file that I'd like to take a look at, line 347340107.
Other than doing something like
head -<$LINENUM + 10> filename | tail -20
... which would require head to read through the first 347 million lines of the log file, is there a quick and easy command that would dump lines 347340100 - 347340200 (for example) to the console?
Update: I totally forgot that grep can print the context around a match... this works well. Thanks!
I found two other solutions if you know the line number but nothing else (no grep possible):
Assuming you need lines 20 to 40,
sed -n '20,40p;41q' file_name
or
awk 'FNR>=20 && FNR<=40' file_name
When using sed it is more efficient to quit processing after having printed the last line than continue processing until the end of the file. This is especially important in the case of large files and printing lines at the beginning. In order to do so, the sed command above introduces the instruction 41q in order to stop processing after line 41 because in the example we are interested in lines 20-40 only. You will need to change the 41 to whatever the last line you are interested in is, plus one.
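For the line numbers in the original question, that would look something like:
sed -n '347340100,347340200p;347340201q' filename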
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
with GNU-grep you could just say
grep --context=10 ...
No there isn't, files are not line-addressable.
There is no constant-time way to find the start of line n in a text file. You must stream through the file and count newlines.
Use the simplest/fastest tool you have to do the job. To me, using head makes much more sense than grep, since the latter is way more complicated. I'm not saying "grep is slow", it really isn't, but I would be surprised if it's faster than head for this case. That'd be a bug in head, basically.
What about:
tail -n +347340107 filename | head -n 100
I didn't test it, but I think that would work.
I prefer just going into less and
typing 50% to go halfway through the file,
43210G to go to line 43210
:43210 to do the same
and stuff like that.
Even better: hit v to start editing (in vim, of course!), at that location. Now, note that vim has the same key bindings!
You can use the ex command, a standard Unix editor (part of Vim now), e.g.
display a single line (e.g. 2nd one):
ex +2p -scq file.txt
corresponding sed syntax: sed -n '2p' file.txt
range of lines (e.g. 2-5 lines):
ex +2,5p -scq file.txt
sed syntax: sed -n '2,5p' file.txt
from the given line till the end (e.g. 5th to the end of the file):
ex +5,p -scq file.txt
sed syntax: sed -n '5,$p' file.txt
multiple line ranges (e.g. 2-4 and 6-8 lines):
ex +2,4p +6,8p -scq file.txt
sed syntax: sed -n '2,4p;6,8p' file.txt
Above commands can be tested with the following test file:
seq 1 20 > file.txt
Explanation:
+ or -c followed by the command - execute the (vi/vim) command after file has been read,
-s - silent mode, also uses current terminal as a default output,
q followed by -c is the command to quit editor (add ! to do force quit, e.g. -scq!).
I'd first split the file into few smaller ones like this
$ split --lines=50000 /path/to/large/file /path/to/output/file/prefix
and then grep on the resulting files.
If the line number you want to read is 100:
head -100 filename | tail -1
Get ack
Ubuntu/Debian install:
$ sudo apt-get install ack-grep
Then run:
$ ack --lines=$START-$END filename
Example:
$ ack --lines=10-20 filename
From $ man ack:
--lines=NUM
Only print line NUM of each file. Multiple lines can be given with multiple --lines options or as a comma separated list (--lines=3,5,7). --lines=4-7 also works.
The lines are always output in ascending order, no matter the order given on the command line.
sed will need to read the data too in order to count the lines.
The only way a shortcut would be possible is if there were context/order in the file to operate on. For example, if log lines were prepended with a fixed-width time/date, you could use the look unix utility to binary search through the file for particular dates/times.
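Concretely, something like this might work (the timestamp and file name below are made up):
# look binary-searches for lines starting with the given prefix,
# which only works because the log is already sorted by that prefix
look "2013-07-16 21:0" server.log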
Use
x=`cat -n <file> | grep <match> | awk '{print $1}'`
Here you will get the line number where the match occurred.
Now you can use the following command to print 100 lines
awk -v var="$x" 'NR>=var && NR<=var+100{print}' <file>
or you can use "sed" as well
sed -n "${x},${x+100}p" <file>
With sed -e '1,N d; M q' you'll print lines N+1 through M. This is probably a bit better than grep -C as it doesn't try to match lines to a pattern.
Building on Sklivvz' answer, here's a nice function one can put in a .bash_aliases file. It is efficient on huge files when printing stuff from the front of the file.
function middle()
{
    startidx=$1
    len=$2
    endidx=$(($startidx+$len))
    filename=$3
    awk "FNR>=${startidx} && FNR<=${endidx} { print NR\" \"\$0 }; FNR>${endidx} { print \"END HERE\"; exit }" $filename
}
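Usage would be something like this (the file name is made up):
# print lines 347340100 through 347340200 of server.log, each prefixed with its line number
middle 347340100 100 server.log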
To display a line from a <textfile> by its <line#>, just do this:
perl -wne 'print if $. == <line#>' <textfile>
If you want a more powerful way to show a range of lines with regular expressions -- I won't say why grep is a bad idea for doing this, it should be fairly obvious -- this simple expression will show you your range in a single pass which is what you want when dealing with ~20GB text files:
perl -wne 'print if m/<regex1>/ .. m/<regex2>/' <filename>
(tip: if your regex has / in it, use something like m!<regex>! instead)
This would print out <filename> starting with the line that matches <regex1> up until (and including) the line that matches <regex2>.
It doesn't take a wizard to see how a few tweaks can make it even more powerful.
Last thing: perl, since it is a mature language, has many hidden enhancements that favor speed and performance. With this in mind, it is the obvious choice for such an operation, since it was originally developed for handling large log files, text, databases, etc.
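The .. flip-flop also accepts plain line numbers (numeric operands are compared against $., the current line number), so for the range in the question a one-pass sketch would be:
# print lines 347340100 through 347340200, then stop reading the rest of the 20GB file
perl -wne 'print if 347340100 .. 347340200; exit if $. >= 347340200' filename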
print line 5
sed -n '5p' file.txt
sed -n '5{p;q;}' file.txt # stops reading after line 5
print everything other than line 5
sed '5d' file.txt
And here is my creation, put together using Google:
#!/bin/bash
# removeline.sh
# removes a line from one file and, optionally, appends it to another, i.e. moves it xD

usage() { # Function: Print a help message.
    echo "Usage: $0 -l LINENUMBER -i INPUTFILE [ -o OUTPUTFILE ]"
    echo "line is removed from INPUTFILE"
    echo "line is appended to OUTPUTFILE"
}

exit_abnormal() { # Function: Exit with error.
    usage
    exit 1
}

while getopts l:i:o:b flag
do
    case "${flag}" in
        l) line=${OPTARG};;
        i) input=${OPTARG};;
        o) output=${OPTARG};;
    esac
done

if [ -f tmp ]; then
    echo "Temp file tmp exists. Delete it yourself :)"
    exit
fi

if [ -f "$input" ]; then
    re_isanum='^[0-9]+$'
    if ! [[ $line =~ $re_isanum ]] ; then
        echo "Error: LINENUMBER must be a positive, whole number."
        exit 1
    elif [ $line -eq "0" ]; then
        echo "Error: LINENUMBER must be greater than zero."
        exit_abnormal
    fi
    if [ ! -z $output ]; then
        sed -n "${line}p" $input >> $output
    fi
    if [ ! -z $input ]; then
        # deleting the line here; together with the append above, this moves it to the other file
        sed "${line}d" $input > tmp && cp tmp $input
    fi
fi

if [ -f tmp ]; then
    rm tmp
fi
You could try this command:
egrep -n "*" <filename> | egrep "<line number>"
Easy with perl! If you want to get lines 1, 3 and 5 from a file, say /etc/passwd:
perl -e 'while(<>){if(++$l~~[1,3,5]){print}}' < /etc/passwd
I am surprised only one other answer (by Ramana Reddy) suggested to add line numbers to the output. The following searches for the required line number and colours the output.
file=FILE
lineno=LINENO
wb="107"; bf="30;1"; rb="101"; yb="103"
cat -n ${file} | { GREP_COLORS="se=${wb};${bf}:cx=${wb};${bf}:ms=${rb};${bf}:sl=${yb};${bf}" grep --color -C 10 "^[[:space:]]\\+${lineno}[[:space:]]"; }
