Input file isn't being read correctly (awk scripting, Linux)

Hi, I have two input files.
input1.txt:
id above
id below
id still
id getting
input2.txt:
id above
value above the sky
id below
value under the chair
I'm trying to run an awk command, but its output comes up empty.
awk -f find.awk input1.txt input2.txt
I know my awk script works, because when I feed it two other txt files all the output is correct and visible.
The difference between the two versions of input2.txt is this:
when I open the failing one in Notepad on a Windows machine, the whole file shows up as one long string, while in any other text editor it is formatted on separate lines.
Example of input2.txt as it appears in Notepad:
id above value above the sky id below value under the chair
I can't just reparse this input by id, because my real txt file has more data ... which is inconsistent, so I can't just search for a fixed string or regular expression.
find.awk
FNR==NR { id[$0]; next }
$0 in id { f=1 }
f
NF==0 { f=0 }
Any idea why my awk isn't working?

Run "dos2unix" on your input files before running awk on them. man dos2unix.

Related

Is there a linux command that can cut and pick columns that match string patterns?

I need to analyze logs, and my end user has to be able to see them in a formatted way, as shown below. The catch is that the key variables appear at different positions rather than in fixed columns, because these log formats come from various applications.
"thread":"t1","key1":"value1","key2":"value2",......"key15":"value15"
I have a way to split and cut this so that I analyze only particular keys, using the following:
cat file.txt | grep 'value1' | cut -d',' -f2,7,8-
This is the command I have so far. The requirement is to grep all logs that have 'key1' equal to 'value1'; this value1 is most likely unique across the logs, so I grep for it directly (if required, I can grep for the key together with the value string). The main problem comes after the cut: I want to pick only key2, key7 and key8 from those lines, but they do not necessarily appear in those column positions; key2 might even be at column 3 or 4, or after key7/key8. So I want to pick fields based on the key name and get exactly
"key2":"value2", "key7":"value7", "key8:value8"
The end user is not particularly picky about the order in which they appear; they only need these keys from each matching line to be displayed.
Can someone help me? I tried piping into awk/grep again, but they still match against the entire line rather than individual columns.
My input is
{"#timestamp":"2021-08-05T06:38:48.084Z","level":"INFO","thread":"main","logger":"className1","message":"Message 1"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"main","logger":"className2","message":"Message 2"}
{"#timestamp":"2021-08-05T06:38:48.092Z","level":"DEBUG","thread":"thead1","logger":"className2","message":"Message 2"}
I basically want to find only the "thread":"main" lines and, for each matching line, print only the keys and values of "logger" and "message", since the other keys and values are irrelevant to me. There are more than 15 or 16 keys in my file, and their positions can be swapped: "message" could appear first and "logger" second in some log files. Of course, these keys are just an example; the real keys I am trying to find are not only "logger" and "message".
There are log analysis tools, but this is a pretty old system; the logs are not real-time, and the files I am analyzing and displaying are years old.
Not sure I really understand your specification, but the following awk script could be a starting point:
$ cat foo.awk
BEGIN {
    k["\"key1\""] = 1; k["\"key7\""] = 1; k["\"key8\""] = 1;
}
/"key1":"value1"/ {
    s = "";
    for(i = 1; i <= NF; i+=2)
        if($i in k)
            s = s (s ? "," : "") $i ":" $(i+1);
    print s;
}
$ awk -F',|:' -f foo.awk foo.txt
"key1":"value1","key7":"value7","key8":"value8"
Explanation:
awk is called with the -F',|:' option so that the field separator in each record is a comma or a colon.
In the BEGIN section we declare an associative array (k) of the selected keys, including the surrounding double quotes.
The rest of the awk script applies to each record containing "key1":"value1".
Variable s is used to prepare the output string; it is initialized to "".
For each odd field (the keys) we check if it is in k. If it is, we concatenate to s:
a comma if s is not empty,
the key field,
a colon,
the following even field (the value).
We print s.
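Applied to the JSON-style lines you showed, the same idea would select on "thread":"main" and keep only "logger" and "message". This is only a sketch under those assumptions: because the timestamp value itself contains colons, the loop checks every field against the wanted keys instead of stepping over the odd ones, and bar.awk / logs.txt are placeholder names:
$ cat bar.awk
{ gsub(/[{}]/, "") }                  # drop the surrounding braces; awk then re-splits the fields
/"thread":"main"/ {
    s = "";
    for(i = 1; i <= NF; i++)
        if($i == "\"logger\"" || $i == "\"message\"")
            s = s (s ? "," : "") $i ":" $(i+1);
    print s;
}
$ awk -F',|:' -f bar.awk logs.txt
"logger":"className1","message":"Message 1"
"logger":"className2","message":"Message 2"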

How to extract two part-numerical values from a line in shell script

I have multiple text files in this format. I would like to extract lines matching this pattern "pass filters and QC".
File1:
Before main variant filters, 309 founders and 0 nonfounders present.
0 variants removed due to missing genotype data (--geno).
9302015 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
7758518 variants and 309 people pass filters and QC.
Calculating allele frequencies... done.
I was able to grep the line, but when I try to assign it to a variable it just doesn't work.
grep 'people pass filters and QC' File1
line="$(echo grep 'people pass filters and QC' File1)"
I am new to shell script and would appreciate if you could help me do this.
I want to create a tab separated file with just
"File1" "7758518 variants" "309 people"
GNU awk
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\" \"%s\" \"%s\"\n", FILENAME, m[1], m[2]}
' File1
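If you have several such files, the same script can be pointed at all of them at once; FILENAME is updated per file, so each row is labelled correctly, and since you asked for a tab-separated file, the separator below is switched to a tab. File2, File3 and summary.tsv are only example names:
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\"\t\"%s\"\t\"%s\"\n", FILENAME, m[1], m[2]}
' File1 File2 File3 > summary.tsv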
You are almost there; just remove the double quotes and the echo from your command:
line=$(grep 'people pass filters and QC' File1)
Now view the value stored in the variable:
echo $line
And if your file structure is always the same, i.e. it will always be in this form: 7758518 variants and 309 people pass filters and QC, you can use awk to pick selected columns from the output. The complete command would then look like this:
OIFS=$IFS;IFS=$'\n';for i in $line;do echo $i;echo '';done | awk -F "[: ]" '{print $1"\t"$2" "$3"\t"$5" "$6}';IFS=$OIFS
Explanation:
IFS means internal field separator; we set it to the newline character because we need to iterate over whole lines in the for loop.
But before that, we save its original value in another variable, OIFS, so we can restore it later.
We use a for loop to iterate through all the matched lines, and awk to select the 1st, 2nd, 3rd, 5th and 6th columns as per your requirement.
But please note, if your file structure varies, we may need a different technique to extract the "7758518 variants" and "309 people" parts.
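For comparison, the same table can be produced without the IFS/for-loop wrapper by letting grep print the file name itself (it does so when given more than one file, or always with the -H option) and piping straight into awk. This assumes the sentence structure stays exactly as shown above:
grep -H 'people pass filters and QC' File1 |
awk -F '[: ]' '{print $1"\t"$2" "$3"\t"$5" "$6}'
File1	7758518 variants	309 people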

How to make while read faster (how to use grep instead)

I have a file named "compare" and a file named "final_contigs_c10K.fa"
I want to eliminate lines AND THE NEXT LINE from "final_contigs_c10K.fa" that contain specific strings listed in "compare".
compare looks like this:
k119_1
k119_3
...
and the number of lines of compare is 26364.
final_contigs_c10K.fa looks like:
>k119_1
AAAACCCCC
>k119_2
CCCCC
>k119_3
AAAAAAAA
...
I want to turn final_contigs_c10K.fa into this format:
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
...
I tried this code, and it seems to work fine, but it takes too much time. I think that is because compare has 26364 lines, which is far more than in the other files I had tested the code on.
while read line; do sed -i -e "/$line/ { N; d; }" final_contigs_c10K.fa; done < compare
Is there a way to make this command faster?
Using awk
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
This writes the output to stdout, i.e. it won't make any changes to the original files.
Explained:
$ awk '
NR==FNR {                 # process the first file
    a[">" $1]             # hash into a, adding the > while at it
    next                  # move on to the next record
}                         # everything below this applies to the second file
$1 in a { p=3 }           # if the current record is in the compare file, set p
--p>0                     # print the matching record and the one after it
' compare final_contigs_c10K.fa # mind the file order
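To actually replace final_contigs_c10K.fa rather than printing to stdout, redirect to a temporary file and move it back once awk has finished; filtered.fa is just an example name:
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa > filtered.fa &&
  mv filtered.fa final_contigs_c10K.fa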

How can I tail a file with a dynamic grep pattern?

I have a log file containing lines about different users, and I'm tailing this file in real time. I want to filter out the lines that are only related to a user that I specify, ex: 1234. The log entries look like this:
ID:101 Username=1234
ID:102 Username=1234
ID:999 UNWANTED LINE (because this ID was not assigned to user 1234)
ID:102 some log entry regarding the same user
ID:123 UNWANTED LINE (because this ID was not assigned to user 1234)
ID:102 some other text
ID:103 Username=1234
ID:103 blablabla
A dynamic ID is assigned to a user in a line like "ID:101 Username=1234". Any subsequent lines that start with that ID pertain to the same user and will need to be displayed. I need a dynamic tail that would get all IDs related to the specified user (1234) and filter the previous lines as follows:
ID:101 Username=1234
ID:102 Username=1234
ID:102 some log entry regarding the same user
ID:102 some other text
ID:103 Username=1234
ID:103 blablabla
I need to first filter the lines where "Username=1234" is found, then extract the "ID:???" from that line, then tail all lines that contain "ID:???". When another line with "Username=1234" is found, extract the new ID and use it to display the subsequent lines with this new ID.
I am able to chain greps to filter out the ID when I use cat, but it doesn't work when I chain them after a tail. But even if I could, how do I "watch" for a new value of ID and dynamically update my grep pattern?
Thanks in advance!
This is a task that Awk can handle with ease (and it could be handled with Perl or Python too).
awk '$2 == "Username=1234" { ids[$1]++; } $1 in ids { print }' data
The first pattern/action pair records the ID:xxx value for an entry where $2 is Username=1234 in the array ids. The second pattern/action pair looks whether the ID:xxx entry is listed in ids; if so, it prints the line. The Username=1234 lines satisfy both criteria (at least, after the entry is added to the array).
How do I use it so it can act like tail (i.e. print the new lines as they're added to data)?
tail -f logfile | awk …
You'd omit the name of the data file from the awk part of the command, of course. The only thing to watch for is that tail doesn't hang waiting to fill the pipe buffer. It probably won't be a problem, but you might have to look hard at the options to tail if lines take longer to appear in the awk input than you expect.
I realized that ID:XXX doesn't necessarily always come at position $1... is there a way to match the ID with a regular expression regardless of its position in the line ($1, $2, ...)?
Yes:
awk '$2 == "Username=1234" { ids[$1]++; }
     { for (i = 1; i <= NF; i++) if ($i in ids) { print; break } }' data
The second line matches every line and, for each field in the line, checks whether that field is present in the ids array. If it is, it prints the line and breaks out of the loop (you could use next instead of break in this context, though the two are not equivalent in general).
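If the ID:xxx token can also sit at an arbitrary position on the Username=1234 line itself, the same field loop can be used when recording the ID too. A sketch only, assuming IDs always have the form ID: followed by digits and that you feed the script from tail -f as above:
tail -f logfile | awk '
/Username=1234/ { for (i = 1; i <= NF; i++) if ($i ~ /^ID:[0-9]+$/) ids[$i] }
{ for (i = 1; i <= NF; i++) if ($i in ids) { print; break } }'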

How to extract values from text using multiple (nested) delimiters

On a day-to-day basis I need to extract bits of text from logs and other text data in various mixed formats. Is there a utility (like awk, grep, etc.) I could use to quickly perform the task without having to resort to writing long bash/perl/python scripts?
Example 1: For input text below
mylog user=UserName;password=Password;other=information
I would like to extract the user name and password values. The pseudo-utility would preferably look like this (a la awk):
cat input-text.txt | magic --delimit-by=";" --then-by="="
'{print "The username is $values[0][1] and password is $values[1][1]"}'
Where the input string, delimited by ;, is placed in the $values array, and each value in that array is further delimited by = to form a nested array.
Even better, would be nice to have something like this:
cat input-text.txt | magic --map-entry-sep=";" --map-key-val-sep="="
'{print "The username is $[user] and password is $[password]"}'
Where the result of parsing is converted into a map for easy lookup by key.
Example 2: Would be nice to parse triple nested elements too. Consider input text like
mylog mylist=one,two,three;other=information
I would like to now extract the 2nd element of list mylist using something like:
cat input-text.txt | magic --delimit-by=";" --then-by="=" --and-then-by=","
'{print "The second element of mylist is: $values[0][1][1]}'
Of course, I would rather use some kind of JSON parser and convert the input data into its respective object/map/list format for easier extraction, but that's not possible because I am working with data in different formats.
I usually combine awk, grep, cut and sed with several pipes and extract each value (column) of interest one at a time, but that is tedious and requires merging the separate columns back together later. Usually I need all extracted columns in CSV format for further processing in Excel.
Would be grateful for any suggestions or comments.
$ echo 'mylog user=UserName;password=Password;other=information' |
awk -F '[ ;]' -v keysep="=" \
'{
    for (i=1; i<=NF; i++) {
        split($i, t, keysep);
        a[t[1]] = t[2]
    };
    print "The username is " a["user"] " and password is " a["password"]
}'
The username is UserName and password is Password
$ echo 'mylog mylist=one,two,three;other=information' | awk -F "[ =,;]" '{print $4}'
two
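The two examples can be combined for the triple-nested case: build the key/value map first, then split the value of interest again on commas. A sketch reusing the sample line above:
$ echo 'mylog mylist=one,two,three;other=information' |
  awk -F '[ ;]' -v keysep="=" '{
      for (i = 1; i <= NF; i++) {      # build the key -> value map as before
          split($i, t, keysep)
          a[t[1]] = t[2]
      }
      split(a["mylist"], list, ",")    # split the chosen value again on commas
      print "The second element of mylist is: " list[2]
  }'
The second element of mylist is: two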
