Linux - How to remove certain lines from a file based on a field value - linux

I want to remove certain lines from a tab-delimited file and write output to a new file.
a b c 2017-09-20
a b c 2017-09-19
es fda d 2017-09-20
es fda d 2017-09-19
The 4th column is a date. I want to keep only the lines whose 4th column is "2017-09-19" (lines 2 and 4) and write them to a new file. The new file should have the same format as the original file.
How do I write the Linux command for this example?
Note: The search must apply to the 4th field only, because the real data has other fields that may contain the same value as the 4th field.

With awk:
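awk -F'\t' '$4=="2017-09-19"' file > newfile
-F'\t': split fields on tabs only. Matching lines are printed unchanged, so OFS (the output field separator, a space by default) does not need to be set.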
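As a small worked example of matching on the 4th field only, the sketch below builds a sample file in which the target date also appears in column 2 of one row, which is the case the question's note worries about. The directory variable workdir, the variable name want, and the decoy row are assumptions made for illustration; they are not part of the original data.

```shell
# Sample data: tab-delimited. The first row carries the target date in
# column 2 only, so a whole-line match would wrongly keep it.
workdir=$(mktemp -d)
printf 'a\t2017-09-19\tc\t2017-09-20\n'  > "$workdir/file"
printf 'a\tb\tc\t2017-09-19\n'          >> "$workdir/file"
printf 'es\tfda\td\t2017-09-20\n'       >> "$workdir/file"
printf 'es\tfda\td\t2017-09-19\n'       >> "$workdir/file"

# -F'\t' splits on tabs only; -v passes the date in as an awk variable.
# Matching lines are printed unmodified, so the output keeps the input format.
awk -F'\t' -v want='2017-09-19' '$4 == want' "$workdir/file" > "$workdir/newfile"

cat "$workdir/newfile"
```

Only the two rows whose 4th field equals the date should end up in newfile; the decoy row is dropped because its 4th field is 2017-09-20.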

Use grep to filter:
grep '2017-09-19' file.txt > filtered_file.txt
This is not perfect, since the string 2017-09-19 is not required to appear in the 4th column, but if your file looks like the example, it'll work.

Sed solution:
sed -nr "/^([^\t]*\t){3}2017-09-19/p" input.txt >output.txt
This is:
-n - don't print every line automatically
-r - use extended regular expressions
/regexp/p - print lines that match the regular expression regexp
^ - beginning of the line
(regexp){3} - repeat regexp 3 times
[^\t] - any character except a tab
\t - tab character
* - repeat the preceding item zero or more times
2017-09-19 - the text to search for
That is, skip 3 columns separated by a tab from the beginning of the line, and then check that the value of column 4 coincides with the required value.

awk '/2017-09-19/' file >newfile
cat newfile
a b c 2017-09-19
es fda d 2017-09-19

Related

How to edit the lines in text file in Linux - format the date to YYYY-MM-DD and then grep the line by time period

Can anyone help me convert the dates in this text file from YYYYMMDD to YYYY-MM-DD using a bash script or the Linux command line? I am not sure how to start editing 23 million lines!
I have a text file in YYYYMMDD format:
3515034013|50008|20140601|20240730
and I want to turn it into a text file in YYYY-MM-DD format (only the 3rd and 4th fields need to change, for 23 million lines):
3515034013|50008|2014-06-01|2024-07-30
I want to convert the text file from YYYYMMDD to YYYY-MM-DD format and then, which is the end goal, extract specific lines based on a time period.
The 3rd field is the start date and the 4th field is the end date. After formatting both as YYYY-MM-DD, I want to select lines by date. For example, I need:
(01). Lines whose end date (4th field) is before today, i.e. 2022-08-06: all the old lines.
(02). Lines whose end date (4th field) falls within 2 years from now, i.e. between 2022-08-06 and 2024-08-06.
Please note: there are more than 23 million lines to edit and analyze based on the date.
How should I approach this problem? Which method is more time-efficient: awk, sed, or line-by-line editing in Bash?
$ awk '
BEGIN { FS=OFS="|" }
{
for ( i=3; i<=4; i++ ) {
$i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7)
}
print
}
' file
3515034013|50008|2014-06-01|2024-07-30
Here is a way to do it with sed. It has the same restrictions as steffen's answer: | is the field separator, and all dates have the same format, i.e. leading zeros in the month and day parts.
sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g'
Here is what the regular expression does:
^(.*[|]) captures the first part of the string, from the line start (^) to a |, into \1. This captures the first two columns, because the rest of the expression must match the rest of the line up to the line end.
([0-9]{4})([0-9]{2})([0-9]{2})[|] captures the parts of the first date field into \2 to \4; note the [|].
([0-9]{4})([0-9]{2})([0-9]{2})$ does the same for the second date column, anchored at the line end ($), and captures the parts into \5 to \7; note the $.
The replacement part \1\2-\3-\4|\5-\6-\7 inserts - between the year, month, and day.
The capturing into \n happens because of the (...) parentheses in the regular expression.
Here's one way to change the format with awk:
awk '{$3=substr($3,1,4) "-" substr($3,5,2) "-" substr($3,7,2); $4=substr($4,1,4) "-" substr($4,5,2) "-" substr($4,7,2); print}' FS='|' OFS='|'
It should work given that
| is only used for field separation
all dates have the same format
You can redirect the transformed lines to a new file or change the file in place. Of course you can do the same with sed or ed. I'd go for awk because you can extract your specific lines to an extra file in the same run.
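A minimal sketch of doing the reformatting and the date selection in a single awk pass is shown below. It compares the raw YYYYMMDD values numerically before rewriting them. The output file names (expired.txt, within_two_years.txt), the variable names, and the second and third sample rows are assumptions for illustration; the cut-off dates are fixed to the ones mentioned in the question rather than taken from the system clock.

```shell
workdir=$(mktemp -d)
cat > "$workdir/file" <<'EOF'
3515034013|50008|20140601|20240730
3515034014|50009|20150101|20200315
3515034015|50010|20200101|20301231
EOF

today=20220806   # fixed to match the question; $(date +%Y%m%d) gives the real date
limit=20240806   # two years after "today"

awk -v today="$today" -v limit="$limit" -v out="$workdir" '
BEGIN { FS = OFS = "|" }
{
    end = $4 + 0                      # raw YYYYMMDD end date, as a number
    for (i = 3; i <= 4; i++)          # rewrite fields 3 and 4 as YYYY-MM-DD
        $i = substr($i, 1, 4) "-" substr($i, 5, 2) "-" substr($i, 7, 2)
    if (end < today + 0)
        print > (out "/expired.txt")            # end date before today
    else if (end <= limit + 0)
        print > (out "/within_two_years.txt")   # end date from today up to the limit
}' "$workdir/file"

cat "$workdir/expired.txt" "$workdir/within_two_years.txt"
```

Because YYYYMMDD values sort the same way numerically and chronologically, no date parsing is needed, which matters when the file has millions of lines.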
This might work for you (GNU sed):
sed -E 's/^([^|]*\|[^|]*\|....)(..)(..\|....)(..)/\1-\2-\3-\4-/' file
Pattern match and insert - where desired.
Or if the file is only 4 columns:
sed -E 's/(..)(..\|....)(..)(..)$/-\1-\2-\3-\4/' file

Append string to column on command line

I have a 3 column file. I would like to append a fourth column which is just one word repeated on every line. I tried the following
paste file.tsv <(echo 'new_text') > new_file.tsv
But the text 'new_text' only appears on the first line, not every line.
How can I get 'new_text' to appear on every line?
Thanks
sed '1,$ s/$/;ABC/' infile > outfile
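This appends ";ABC" to the end of every line ("$" matches the line end). For a tab-separated file, use a tab followed by your own text in place of ";ABC".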
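The paste attempt in the question puts the text on the first line only because echo produces a single line, and paste joins its inputs line by line. One possible alternative, sketched here with awk, prints each input line followed by a tab and the constant text. The sample rows and the variable names workdir and text are assumptions for illustration.

```shell
workdir=$(mktemp -d)
printf 'a\tb\tc\nd\te\tf\n' > "$workdir/file.tsv"

# OFS is set to a tab, so "print $0, text" emits: whole line, tab, text.
awk -v text='new_text' 'BEGIN { OFS = "\t" } { print $0, text }' \
    "$workdir/file.tsv" > "$workdir/new_file.tsv"

cat "$workdir/new_file.tsv"
```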

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This prints the lines of your CSV file whose first name appears in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This prints the lines whose first name is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
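Since the stated goal is to separate the records into two categories, the two commands above can probably be combined into one pass that writes each record to one of two files. The sketch below assumes hypothetical output names females.csv and males.csv, and the second customer row is invented sample data.

```shell
workdir=$(mktemp -d)
printf 'Heather\nIrene\nJane\n' > "$workdir/female.names.txt"
cat > "$workdir/file.csv" <<'EOF'
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
Jane,Doe,12,High Street,Boston,MA,jane#doe.com,01/01/1960
EOF

awk -F, -v out="$workdir" '
NR == FNR { names[$0]; next }     # first file: remember every name as an array key
{
    # second file: pick the target file from the first field, print the whole record
    target = out "/" (($1 in names) ? "females.csv" : "males.csv")
    print > target
}' "$workdir/female.names.txt" "$workdir/file.csv"

cat "$workdir/females.csv" "$workdir/males.csv"
```

NR==FNR is true only while awk is reading the first file, which is how the name list gets loaded before any customer record is tested.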
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
BEGIN {
while (<ARGV>) {
chomp;
$names{$_} = 1;
}
}
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose, you have a file having the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines which include the first field having the first and last letters as vowels. If you did a simple grep you would get both of the lines but the following will give you the first line only which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines in which the third field has a vowel as its first and last letter. Similarly, if you did a simple grep you would get both of the lines, but the following will give you only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The value in the first curly braces, {1,}, specifies that the preceding character, which ranges from 0 to z according to the ASCII table, occurs one or more times. After that comes the field separator, a space in this case. Change the value within the second curly braces, {0} or {2}, to the desired field number minus 1. Then use a regular expression to state your criteria.
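For comparison, the same per-field test can be sketched with awk, which addresses the field directly instead of counting separators in the regular expression. This is an alternative to the egrep approach above, not a restatement of it: here the pattern is anchored to the whole field, so the vowel really must be the field's last character. The variable names are assumptions for illustration.

```shell
workdir=$(mktemp -d)
printf 'abe 123 bdb 532\nxyz 593 iau 591\n' > "$workdir/test.txt"

# Field 1 starts and ends with a vowel
first=$(awk '$1 ~ /^[aeiou].*[aeiou]$/' "$workdir/test.txt")

# Field 3 starts and ends with a vowel
third=$(awk '$3 ~ /^[aeiou].*[aeiou]$/' "$workdir/test.txt")

echo "$first"
echo "$third"
```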

Increment numbers within string using awk and sed

I have a text file that has about 500 rows of information.
I am adding a few strings to the beginning of each line separated by a comma (Excel recognizes it as another column).
I have this code so far:
sed -e "2,$s#^# =HYPERLINK(B2,C2), https://otrs.city.pittsburgh.pa.us/index.pl?Action=AgentTicketZoom;TicketID=#" C:\Users\hd\Desktop\newaction.txt > C:\Users\hd\Desktop\test.txt
I have a few columns I want to add. One column adds a link built from a previous column (easy enough).
The other is a formula (string) in the first column, =HYPERLINK(B2,C2), and I want to increment the 2s to 3s, 4s, and so on.
Example:
=HYPERLINK(B2,C2)
=HYPERLINK(B3,C3)
=HYPERLINK(B4,C4)
=HYPERLINK(B5,C5)
=HYPERLINK(B6,C6)
It is my second day coding with sed and awk.
Is there any way I can make this happen using awk and sed?
This Perl one-liner:
perl -pe "BEGIN{$i = 2} s#^#=HYPERLINK(B${i},C${i})#; $i++" "input.txt"
will add =HYPERLINK(B2,C2) to the front of each line and increment the numbers each time.
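Since the question asks about awk, here is a rough awk equivalent. It assumes, based on the 2,$ address in the question's sed command, that line 1 is a header to leave untouched and that the formula on line N should reference row N. The sample rows and file names are invented for illustration.

```shell
workdir=$(mktemp -d)
printf 'Ticket,Link\n101,x\n102,y\n103,z\n' > "$workdir/input.txt"

# NR is the current line number, so line 2 gets B2/C2, line 3 gets B3/C3, ...
awk 'NR >= 2 { $0 = "=HYPERLINK(B" NR ",C" NR ")," $0 } { print }' \
    "$workdir/input.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

Using the line number directly avoids keeping a separate counter; if the formula should start at 2 on the first line instead, NR+1 without the NR >= 2 condition would do that.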

CSV grep but keep the header

I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output be like
A,B,C
1,2,3
1,2,6
I am working under linux
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
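As a concrete instance of the quantifier form, the sketch below fills in M from a shell variable for column n=3 and value 6, using the sample file from the question. The variable names n, value, m, and workdir are assumptions for illustration. The pattern is assembled from single-quoted pieces so the shell does not touch the ! or the backslashes, and a ; is placed before the closing brace, which non-GNU sed implementations tend to require.

```shell
workdir=$(mktemp -d)
printf 'A,B,C\n1,2,3\n4,4,4\n1,2,6\n3,6,9\n' > "$workdir/file"

n=3            # column to test (1-based)
value=6        # value the column should start with
m=$((n - 1))   # number of "field plus comma" groups to skip

# Expands to: sed -n '1p;1!{/^\([^,]*,\)\{2\}6/p;}'
result=$(sed -n '1p;1!{/^\([^,]*,\)\{'"$m"'\}'"$value"'/p;}' "$workdir/file")

echo "$result"
```

Like the original pattern, this matches any value that begins with the given text (60 would match as well), so it is only a sketch for data shaped like the example.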
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line, for anything else, just use awk - the code will be clearer and MUCH easier to enhance in future if necessary.
