I have a log file containing lines about different users, and I'm tailing this file in real time. I want to filter it so that only the lines related to a user I specify (e.g. 1234) are displayed. The log entries look like this:
ID:101 Username=1234
ID:102 Username=1234
ID:999 UNWANTED LINE (because this ID was not assigned to user 1234)
ID:102 some log entry regarding the same user
ID:123 UNWANTED LINE (because this ID was not assigned to user 1234)
ID:102 some other text
ID:103 Username=1234
ID:103 blablabla
A dynamic ID is assigned to a user in a line like "ID:101 Username=1234". Any subsequent lines that start with that ID pertain to the same user and will need to be displayed. I need a dynamic tail that would get all IDs related to the specified user (1234) and filter the previous lines as follows:
ID:101 Username=1234
ID:102 Username=1234
ID:102 some log entry regarding the same user
ID:102 some other text
ID:103 Username=1234
ID:103 blablabla
I need to first filter the lines where "Username=1234" is found, then extract the "ID:???" from that line, then tail all lines that contain "ID:???". When another line with "Username=1234" is found, extract the new ID and use it to display the subsequent lines with this new ID.
I am able to chain greps to filter on the ID when I use cat, but it doesn't work when I chain them after a tail. And even if it did, how would I "watch" for a new value of ID and dynamically update my grep pattern?
Thanks in advance!
This is a task that Awk can handle with ease (and it could be handled with Perl or Python too).
awk '$2 == "Username=1234" { ids[$1]++; } $1 in ids { print }' data
The first pattern/action pair records, in the array ids, the ID:xxx value of any entry whose $2 is Username=1234. The second pattern/action pair checks whether the ID:xxx entry is listed in ids; if so, it prints the line. The Username=1234 lines satisfy both criteria (at least, once the entry has been added to the array).
How do I use it so it can act like tail (i.e. print the new lines as they're added to data)?
tail -f logfile | awk …
You'd omit the name of the data file from the awk part of the command, of course. The only thing to watch out for is that tail doesn't hang waiting to fill the pipe buffer. It probably won't be a problem, but you might have to look hard at the options to tail if lines appear in the awk input more slowly than you expected.
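Putting the two together, the streaming version is just the same program fed from tail (a sketch; logfile stands in for your actual log file):

tail -f logfile | awk '$2 == "Username=1234" { ids[$1]++ } $1 in ids { print }'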
I realized that ID:XXX doesn't necessarily always come at position $1... is there a way to match the ID with a regular expression regardless of its position in the line ($1, $2, ...)?
Yes:
awk '$2 == "Username=1234" { ids[$1]++; }
{ for (i = 1; i <= NF; i++) if ($i in ids) { print; break }' data
The second rule matches every line and, for each field in the line, checks whether that field is present in the ids array. If it is, it prints the line and breaks out of the loop (you could use next instead of break in this context, though the two are not equivalent in general).
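Note that the first rule still assumes the ID is in $1 on the Username=1234 lines. If the ID can move around there too, you could scan for it as well; a sketch, assuming IDs always have the form ID: followed by digits:

awk '/(^| )Username=1234( |$)/ {
         for (i = 1; i <= NF; i++)
             if ($i ~ /^ID:[0-9]+$/) ids[$i]++   # record the ID field wherever it sits
     }
     { for (i = 1; i <= NF; i++)
           if ($i in ids) { print; next }
     }' data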
Related
I have a long list of URLs stored in a text file, which I will go through and download. But before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The unique elements in the URL (aside from the domain and path) are the first 2 parameters in the query string. So, for example, my text file would look like this:
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12
If a unique URL is defined up to the second query-string parameter (key), then lines 1 and 4 are duplicates. I would like to completely remove the duplicate lines, not even keeping one of them. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.
How can I achieve this using basic command line tools?
To shorten the code from the other answer:
awk -F\& 'FNR == NR { url[$1,$2]++; next } url[$1,$2] == 1' urls.txt urls.txt
Using awk:
$ awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++; next } url[$1,$2,$3] == 1' urls.txt urls.txt
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
This reads the file twice: the first time to count how many times the bits you're interested in occur, the second time to print only the lines whose bits showed up once.
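Spelled out with comments, the same two-pass idiom reads:

awk -F'[?&]' '
    FNR == NR { url[$1,$2,$3]++; next }   # pass 1: tally each (path, id, key) triple
    url[$1,$2,$3] == 1                    # pass 2: print lines whose triple occurred exactly once
' urls.txt urls.txt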
I need to analyze logs, and my end user has to be able to see them in a formatted way, as shown below. The nature of my logs is that the key variables appear in different positions rather than at fixed columns, because the log formats come from various applications.
"thread":"t1","key1":"value1","key2":"value2",......"key15":"value15"
I have a way to split and cut this to analyze only particular keys, using the following:
cat file.txt | grep 'value1' | cut -d',' -f2,7,8-
This is the command I have so far. The requirement is to grep all logs which have 'key1' as 'value1'; this value1 will most likely be unique among all lines, so I am using a plain grep (if required, I can grep for the key and value strings together). The main problem I am facing is the cut part: I want to pick only key2, key7, and key8 from these lines, but they might not appear in those column positions; key2 might be at column 3 or 4, or even after key7/key8. So I want to pick fields based on the key name and get exactly
"key2":"value2", "key7":"value7", "key8:value8"
The end user is not particularly picky about the order in which they appear; they only need these keys from each line to be displayed.
Can someone help me? I tried piping into awk/grep again, but they still match the entire line, not individual columns.
My input is
{"#timestamp":"2021-08-05T06:38:48.084Z","level":"INFO","thread":"main","logger":"className1","message":"Message 1"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"main","logger":"className2","message":"Message 2"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"thead1","logger":"className2","message":"Message 2"}
I basically want to find only the "thread":"main" lines and, for each matched line, print only the keys and values of "logger" and "message", since the other keys and values are irrelevant to me. There are 15 to 16 keys in my file, and the key positions can be swapped: "message" could appear first and "logger" second in some log files. Of course, these keys are just an example; the real keys I am trying to find are not just "logger" and "message".
There are log analysis tools, but this is a pretty old system, and the logs are not real-time ones; I am analyzing and displaying files that are years old.
Not sure I really understand your specification, but the following awk script could be a starting point:
$ cat foo.awk
BEGIN {
    k["\"key1\""] = 1; k["\"key7\""] = 1; k["\"key8\""] = 1;
}
/"key1":"value1"/ {
    s = "";
    for (i = 1; i <= NF; i += 2)
        if ($i in k)
            s = s (s ? "," : "") $i ":" $(i+1);
    print s;
}
$ awk -F',|:' -f foo.awk foo.txt
"key1":"value1","key7":"value7","key8":"value8"
Explanation:
awk is called with the -F',|:' option, so the field separator in each record is either a comma or a colon.
In the BEGIN section we declare an associative array (k) of the selected keys, including the surrounding double quotes.
The rest of the awk script applies to each record containing "key1":"value1".
Variable s is used to prepare the output string; it is initialized to "".
For each odd field (the keys) we check if it is in k. If it is, we concatenate to s:
a comma if s is not empty,
the key field,
a colon,
the following even field (the value).
We print s.
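For the JSON-style input shown above, the same idea can be adapted. Here is a sketch, assuming the values themselves contain no commas; since the timestamp values contain colons, the separator here is the comma alone, and each field is split on its first colon instead (file.txt stands in for your log file):

$ cat threads.awk
BEGIN {
    FS = ","
    want["\"logger\""] = 1; want["\"message\""] = 1
}
/"thread":"main"/ {
    sub(/^\{/, ""); sub(/\}$/, "")       # drop the outer JSON braces
    s = ""
    for (i = 1; i <= NF; i++) {
        key = $i
        sub(/:.*/, "", key)              # keep only the part before the first colon
        if (key in want)
            s = s (s ? "," : "") $i
    }
    print s
}
$ awk -f threads.awk file.txt
"logger":"className1","message":"Message 1"
"logger":"className2","message":"Message 2"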
I have a file with almost 5*(10^6) lines of integer numbers, so my file is fairly big.
The question is all about extracting specific lines, filtering them by a condition.
For example, I'd like to:
1. Extract the first N lines without reading the entire file.
2. Extract the lines with numbers less than or equal to X (or with >=, <, >).
3. Extract the lines satisfying some condition on the number (a math predicate).
Is there a clever way to perform these tasks (using sed or awk or cat or head)?
Thanks in advance.
To extract the first $NUMBER lines,
head -n $NUMBER filename
Assuming every line contains just a number (although it will also work if the first token is one), task 2 can be solved like this:
awk '$1 >= 1234 && $1 < 5678' filename
And in the same spirit, task 3 is just the extension:
awk 'condition' filename
It would have helped if you had specified what the condition is supposed to be, though. As it stands, you'll have to read the awk documentation to find out how to code it. Again, the number will be represented by $1.
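For instance, if the condition were "the number is even and greater than 100" (a made-up predicate, since none was specified), it would read:

awk '$1 % 2 == 0 && $1 > 100' filename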
I don't think I can explain anything about the head call; it really is just what it says on the tin. As for the awk lines: awk, like sed, works line-wise. It fetches lines in a loop and applies your code to each line. This code takes the form
condition1 { action1 }
condition2 { action2 }
# and so forth
For every line awk fetches, the conditions are checked in the order they appear, and the associated action to each condition is performed if the condition is true. It would, for example, have been possible to extract the first $NUMBER lines of a file with awk like this:
awk -v number="$NUMBER" '1 { print } NR == number { exit }' filename
where 1 is synonymous with true (as in C) and NR is the line number. The -v command-line option initializes the awk variable number to $NUMBER. If no action is specified, the default action is { print }, which prints the whole line. So
awk 'condition' filename
is shorthand for
awk 'condition { print }' filename
...which prints every line where the condition holds.
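Combining both ideas can help with a 5-million-line file: if the numbers happen to be sorted in ascending order (an assumption, not something stated in the question), you can stop reading as soon as the condition can no longer hold:

awk '$1 >= 1000 && $1 <= 2000; $1 > 2000 { exit }' filename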
I have a flat file separated by | that I want to update using information already inside the flat file. I want to fill the third field using information from the first and second fields. When using the first field for comparison, I want to ignore its last two digits; the second field must match exactly. I do not want to create a new flat file; I want to update the existing file. I researched a way to pull the first two fields out of the file, but I do not know if that will even be helpful for the goal I am trying to achieve. To sum up: I want to compare the first and second fields against the other lines in the file in order to fill in the third field that is missing on some of the lines.
awk -F'|' -v OFS='|' '{sub(/[0-9 ]+$/,"",$1)}1 {print $1 "\t" $2}' tstfile
first field|second field|third field
Original input:
t1ttt01|/a1
t1ttt01|/b1
t1ttt01|/c1
t1ttt03|/a1|1
t1ttt03|/b1|1
t1ttt03|/c1|1
l1ttt03|/a1|3
l1ttt03|/b1|3
l1ttt03|/c1|3
What it should do:
t1ttt03|/a1|1 = t1ttt01|/a1
when comparing t1ttt|/a1| = t1ttt|/a1
Therefore
t1ttt01|/a1 becomes t1ttt01|/a1|1
What I want the Output to look like:
t1ttt01|/a1|1
t1ttt01|/b1|1
t1ttt01|/c1|1
t1ttt03|/a1|1
t1ttt03|/b1|1
t1ttt03|/c1|1
l1ttt03|/a1|3
l1ttt03|/b1|3
l1ttt03|/c1|3
One way with awk:
awk '
# set the input and output field separator to "|"
BEGIN{FS=OFS="|"}
# Do this action when the number of fields on a line is 3, for the first file
# only. The action is to strip the trailing digits from the first field and use
# the result together with the second field as a key. The value stored is field 3.
NR==FNR&&NF==3{sub(/[0-9]+$/,"",$1);a[$1$2]=$3;next}
# For the second file, if the number of fields is 2, store the line in a
# variable called line. Check whether field 1 (without the trailing digits)
# combined with field 2 is present in our array. If so, print the line
# followed by "|" followed by the value from the array.
NF==2{line=$0;sub(/[0-9]+$/,"",$1);if($1$2 in a){print line OFS a[$1$2]};next}1
' file file
Test:
$ cat file
t1ttt01|/a1
t1ttt01|/b1
t1ttt01|/c1
t1ttt03|/a1|1
t1ttt03|/b1|1
t1ttt03|/c1|1
l1ttt03|/a1|3
l1ttt03|/b1|3
l1ttt03|/c1|3
$ awk 'BEGIN{FS=OFS="|"}NR==FNR&&NF==3{sub(/[0-9]+$/,"",$1);a[$1$2]=$3;next}NF==2{line=$0;sub(/[0-9]+$/,"",$1);if($1$2 in a){print line OFS a[$1$2]};next}1' file file
t1ttt01|/a1|1
t1ttt01|/b1|1
t1ttt01|/c1|1
t1ttt03|/a1|1
t1ttt03|/b1|1
t1ttt03|/c1|1
l1ttt03|/a1|3
l1ttt03|/b1|3
l1ttt03|/c1|3
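The command above writes to standard output. Since the requirement is to update the existing file rather than create a new one, redirect to a temporary file and move it back over the original (awk itself has no in-place mode, although GNU awk 4.1+ offers -i inplace):

awk 'BEGIN{FS=OFS="|"}NR==FNR&&NF==3{sub(/[0-9]+$/,"",$1);a[$1$2]=$3;next}NF==2{line=$0;sub(/[0-9]+$/,"",$1);if($1$2 in a){print line OFS a[$1$2]};next}1' file file > file.tmp && mv file.tmp file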
I have a file like this file, and I am trying to verify one field of each line, and add some wording if that field has a duplicate earlier in the file.
\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;;
In my example Line #3 and #4 have the same physical path.
I would to have a script that could compare third field for example /FS3_150a/FILE12/BU/GB/BUSINTEG against the same file,
and if it found the exact match to print something like "same physical path as Line #" for both cases,
\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;Same Physical Path as Line #4
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same Physical Path as Line #3
This code tackles a simplified version of your problem: it tags each line whose field 3 duplicates that of an earlier line. It doesn't handle tagging a line that has subsequent duplicates.
awk -F';' '{ tag = ""
if (field3[$3] != 0) tag = " Same physical path as line " field3[$3]
else field3[$3] = NR
printf "%s%s\n", $0, tag
}' "$#"
There are probably other ways to organize it, but the key point is to use the associative array field3 to keep track of which names have been seen in field 3 and the line number at which a given name was first seen. This assumes you're processing a single file of input. Look up FNR etc. if you must process multiple files (but you'd have to decide whether the same name can appear in different files or not).
It works almost as desired on the data given:
\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same physical path as line 3
The difficulty with producing the tag on line 3 is that it requires predicting the future, which is hard. To do that, you'd have to slurp the entire file into memory, keeping tabs on the line numbers where a given value in field 3 appears (in general, that could be an extensive list of line numbers), and then iterate through the data, tagging appropriately. That is very much harder to do; I'd prefer Perl to awk for that job, though it is probably feasible to organize the data correctly in awk too.
Were it me, I'd be OK with the 90% of the job done; lines with duplicates are identified. If you want the last 10% done, expect it to take the other 90% of the time planned for the first phase.
Here's one way using GNU awk. It is a little hackish, YMMV. Run like:
awk -f script.awk file.txt{,}
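(The {,} is shell brace expansion: file.txt{,} expands to file.txt file.txt, so awk reads the file twice, once to build the array and once to print.)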
Contents of script.awk:
BEGIN {
FS = ";"
}
FNR==NR {
    array[$3] = array[$3] "#" NR
    next
}

{
    if ($3 in array && array[$3] ~ /#.*#/) {
        copy = array[$3]
        sub("#" FNR, "", copy)
        printf "%s Same Physical Path as Line %s\n", $0, copy
    }
    else {
        print
    }
}
Results:
\\FILE04\BUET-PCO;\\SERVER24\DFS\SHARED\CORP\ET\PROJECT CONTROL OFFICE;/FS7_150a/FILE04/BU-D/PROJECT CONTROL OFFICE;10000bytes;9888;;;
\\FILE12\BUAG-GOLDMINE$;\\SERVER24\DFS\SHARED\CAN\AGENCY\GOLDMINE;/FS3_150a/FILE12/BU/AGENCY/GOLDMINE;90000bytes;98834;;;
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;;; Same Physical Path as Line #4
\\FILE12\BUGB-BUSINTEG$;\\SERVER24\DFS\SHARED\CAN\GB\BUSINTEG;/FS3_150a/FILE12/BU/GB/BUSINTEG;50000bytes;988822;other stuff;; Same Physical Path as Line #3