I haven't used my Linux skills in a while and I'm struggling with extracting certain lines out of a CSV log file.
The file is structured as:
code,client_id,local_timestamp,operation_code,error_code,etc
I want to extract only those lines of the file with a specific code and a client_id greater than 0.
For example, if I have the lines:
message_received,1,134,20,0,xxx
message_ack,0,135,10,1,xxx
message_received,0,140,20,1,xxx
message_sent,1,150,30,0,xxx
I only want to extract those lines having code message_received and client_id > 0, resulting in just the first line:
message_received,1,134,20,0,xxx
I want to use awk somewhat like:
awk '/message_received,[[:digit:]]>0/' mylog.csv, which I know isn't quite correct... but how do I achieve this in a one-liner?
This is probably what you want:
awk -F, '($1=="message_received") && ($2>0)' mylog.csv
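Run against the sample lines in the question, that keeps only the record whose first field is message_received and whose second field is greater than 0:
message_received,1,134,20,0,xxx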
If not, edit your question to clarify.
I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in one.
I have a file (file1) in this form, which contains about 15000 lines:
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field (the id) of each row.
My second file (file2) is this; I just need to retrieve the 3rd line:
count
-------
131
(1 row)
I would therefore like to assemble the id from file1 with the 3rd line of file2 to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a purely awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader. (Hint: in awk, FNR == NR is true while reading the first file given and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
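For what it's worth, one possible shape of that awk-only version, as a rough sketch based on the FNR == NR hint (untested against the real data, so treat it as a starting point):
# while reading file2 (FNR == NR), remember its 3rd line; while reading file1, print id;count
awk -F';' 'FNR == NR { if (FNR == 3) count = $1; next } { print $1 ";" count }' file2 file1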
I have a record, for example:
EMP_ID|EMP_NAME|AGE|SALARAy
123456|XXXXXXXXX|30|10000000
Is there a way I can split the record into multiple records? Example output should look like:
EMP_ID|Attributes
123456|XXXXXXX
123456|30
123456|10000000
I want to split the same record into multiple records. Here EMP_ID is my unique column, and I want to loop over the remaining 3 columns and create 3 records, like EMP_ID|EMP_NAME, EMP_ID|AGE, EMP_ID|SALARY. I may have some more columns as well, but for the sample I have provided 3 columns along with the employee id.
Please help me with any suggestions.
With bash:
record='123456|XXXXXXXXX|30|10000000'
IFS='|' read -ra fields <<<"$record"
for ((i=1; i < "${#fields[@]}"; i++)); do
    printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
done
123456|XXXXXXXXX
123456|30
123456|10000000
For the whole file:
{
    IFS= read -r header
    while IFS='|' read -ra fields; do
        for ((i=1; i < "${#fields[@]}"; i++)); do
            printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
        done
    done
} < filename
Records with fields separated by a delimiter character such as | can be manipulated with basic Unix command-line tools such as awk. For example, with your input records in a file records.txt:
awk -F\| 'NR>1{for(i=2;i<=NF;i++){print $1"|"$(i)}}' records.txt
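For instance, with the sample record above in records.txt, this prints:
123456|XXXXXXXXX
123456|30
123456|10000000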
I recommend reading an awk tutorial and playing around with it. Related command-line tools worth learning include grep, sort, wc, uniq, head, tail, and cut. If you regularly do data processing of delimiter-separated files, you will likely need them on a daily basis. As soon as your data format gets more complex (e.g. CSV with the possibility of the delimiter character also appearing inside field values), you need more specific tools; for instance, see this question on CSV tools, or jq for processing JSON. Still, knowledge of the basic Unix command-line tools will save you a lot of time.
I am trying to separate a file into two different files based on whether a line contains a certain string. If a line contains "ITS", this line and the line right after it will be written to the file ITS.txt; if a line contains "V34", then this line and the line right after it will be written to the file V34.txt.
My awk code is
awk '/ITS/{print>"ITX.txt";getline;print>"ITX.txt";}; /V34/{print>"V34.txt";getline;print>"V34.txt";}' seqs.fna
It works well, but I am wondering whether there is a more efficient way to do it. The file is large:
seqs.fna (9-10 GB)
>16S.V34.S7.5_1
ACGGGAGGCAGCAGTAGGGAATCTTCC
>PCR.ITS.S8.14_2
CATTTAGAGGAAGTAAAAGTCGTAACA
>PCR.ITS.S7.11_3
CATTTAGAGGAAGTACAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTTTTGAAGGCTACAC
>16S.V34.S8.6_4
ACGGGCGGCAGCAGTAGGGAAT
>16S.V34.S8.13_5
ACGGGCGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGT
>16S.V34.S7.14_6
ACGGGGGGCAGCAGTAGGGAATCTTCCACAATGGGTGCAAACCTGATGGAGCAATGCCG
>16S.V34.S8.4_7
ACGGGAGGCAGCAGTAGGGAATCTTCCACAAT
>16S.V34.S8.14_8
CGTAGAGATGTGGAGGAACACCAGTGGCGAAG
>16S.V34.S8.8_9
CTGGGATAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTTGTAGTC
>16S.V34.S7.3_10
GGTCTGTAATTGACGCTGAGGTTCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCGGGTAGTC
getline has a few very specific uses and this would not be one of them. See http://awk.freeshell.org/AllAboutGetline. If you rewrote your script without getline you'd solve the problem yourself but given the input file you posted, this is all you need:
awk -F'.' '/^>/{out=$2".txt"} {print > out}' seqs.fna
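With the sample above, for instance, ITS.txt would then contain just the ITS headers and their sequences:
>PCR.ITS.S8.14_2
CATTTAGAGGAAGTAAAAGTCGTAACA
>PCR.ITS.S7.11_3
CATTTAGAGGAAGTACAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTTTTGAAGGCTACAC
while everything else goes to V34.txt.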
To learn how to use awk correctly, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using Linux sort) by the first (numeric), second (numeric) and fourth columns (string ordering, since it is a timestamp value). The file looks like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner whenever the first column or the second column value changes. The final result for the example file should be 3 text files like this:
First file:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
Second file:
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
Third file:
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be overkill if it can be done with a script. Also, is it possible for the created filenames to use those two column values, something like 77_141.txt for the first file, 77_171.txt for the second file and 78_115.txt for the third one?
awk is very handy for this kind of problem. This can be one approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print > something} prints the lines into the file something.
$1"_"$2".txt" instead of something, set the output file as $1"_"$2, being $1 the first field based on the | separator. That is, 77, 78... And same for $2, being 141, 171...
How do we build a normalized table from a denormalized text file?
Thanks for your replies/time.
We need to build a normalized DB table from a denormalized text file. We have explored a couple of options, such as the Unix shell and PostgreSQL. I am looking to learn better ideas for a resolution from this community.
The input text file has comma-delimited records of varying length. The content may look like this:
XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222, 333333333333, 44444444, 5555555, 666666
EEEEEEEE,WWWWWW,QQQQQQQ,PPPPPPPP
We would like to normalize it as follows (split and pair):
XXXXXXXXXX , YYYYYYYYYY
TTTTTTTTTTT, UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222
333333333333, 44444444
5555555, 666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
Do we need to go with a pre-process-and-load approach?
If yes, what is the best way to pre-process?
Is there any single SQL/function approach to get the above?
Thanks for helping.
Using GNU awk (due to the regex RS):
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" file
XXXXXXXXXX,YYYYYYYYYY
TTTTTTTTTTT,UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111,22222222222
333333333333,44444444
5555555,666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
{$1=$1} cleans up and removes extra spaces.
NR%2==1 {printf "%s,",$0} prints the odd-numbered records followed by a comma.
NR%2==0 prints the even-numbered records followed by a newline.
RS="[,\n]" sets the record separator to , or newline.
Here is an update on what I did on the Linux server:
sed -i 's/\,,//g' inputfile <------ clean up a lot of trailing commas
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" inputfile <----Jotne's idea
dos2unix -q -n inputfile outputfle <------ to remove ^M in some records
The output file is then ready to be processed as comma-delimited data.
Any thoughts on improving the above steps further?
Thanks for helping.