How to edit the lines in text file in Linux - format the date to YYYY-MM-DD and then grep the line by time period - linux

Can anyone help me convert the dates in this text file from YYYYMMDD to YYYY-MM-DD using a bash script or the Linux command line? I am not sure how to start editing 23 million lines.
I have a text file with dates in YYYYMMDD format:
3515034013|50008|20140601|20240730
and I want a text file with dates in YYYY-MM-DD format (only the 3rd and 4th fields need to change, across 23 million lines):
3515034013|50008|2014-06-01|2024-07-30
I want to convert the file from YYYYMMDD to YYYY-MM-DD, and the end goal is then to pull specific lines from the converted file by time period.
The 3rd field is the start date and the 4th field is the end date. For example, I need:
(01). Lines whose end date (4th field) is before today, i.e. 2022-08-06 (all the old lines)
(02). Lines whose end date (4th field) falls within 2 years from now, i.e. between 2022-08-06 and 2024-08-06
Please note: there are more than 23 million lines to edit and analyze based on the date.
How should I approach this? Which method is the most time efficient: awk, sed, or Bash line-by-line editing?

$ awk '
BEGIN { FS=OFS="|" }
{
for ( i=3; i<=4; i++ ) {
$i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7)
}
print
}
' file
3515034013|50008|2014-06-01|2024-07-30

Here is a way to do it with sed. It has the same restrictions as steffen's answer: | is the field separator, and all dates have the same format, i.e. leading zeros in the month and day parts.
sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g'
Here is what the regular expression does:
^(.*[|]) captures the first part of the string from linestart (^) to a | into \1, this captures the first two columns, because the remaining part of the re matches the remaining part of the line up until lineend!
([0-9]{4})([0-9]{2})([0-9]{2})[|] captures the first date field parts into \2 to \4, notice the [|]
([0-9]{4})([0-9]{2})([0-9]{2})$ does the same for the second date column anchored at lineend ($) and captures the parts into \5 to \7, notice the $
the replacement part \1\2-\3-\4|\5-\6-\7 inserts - at the different places
the capturing into \n happens because of the use of (...) parens in the regular expression.
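Because the substitution only fires on lines that match the whole pattern, it may be worth counting the lines that do not fit before converting all 23 million. The following is a minimal sketch, assuming exactly four |-separated fields with 8-digit dates in the last two; the sample lines are made up for illustration.

```shell
# Three made-up lines: one well-formed, one with a 7-digit date, one with an extra field.
# grep -v inverts the match and -c counts, so this reports lines that are
# NOT "field|field|8 digits|8 digits".
bad=$(printf '%s\n' \
    '3515034013|50008|20140601|20240730' \
    '3515034014|50009|2014061|20240730' \
    '3515034015|50010|20140601|20240730|extra' |
    grep -cvE '^[^|]*[|][^|]*[|][0-9]{8}[|][0-9]{8}$')

echo "non-conforming lines: $bad"
```

On the real file the same grep can be pointed at the file name instead of reading from the pipe. Any non-zero count means some lines would pass through sed unchanged.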

Here's one way to change the format with awk:
awk '{$3=substr($3,1,4) "-" substr($3,5,2) "-" substr($3,7,2); $4=substr($4,1,4) "-" substr($4,5,2) "-" substr($4,7,2); print}' FS='|' OFS='|'
It should work given that
| is only used for field separation
all dates have the same format
You can pipe the transformed lines to a new file or change it in place. Of course you can do the same with sed or ed. I'd go for awk because you'd be able to extract your specific lines just in the same run to an extra file.
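As a rough sketch of that single-pass idea: YYYYMMDD values sort in date order, so the end date can be compared before it is reformatted. The file names, the sample lines and the two cut-off values below are assumptions chosen to match the dates in the question, not part of the original answers.

```shell
cd "$(mktemp -d)" || exit 1      # scratch directory so the sketch touches no real files

# Made-up sample in the question's format.
printf '%s\n' \
    '3515034013|50008|20140601|20240730' \
    '3515034014|50009|20100101|20200315' \
    '3515034015|50010|20150101|20300101' > sample.txt

today=20220806                   # or: today=$(date +%Y%m%d)
limit=20240806                   # two years later

awk -v today="$today" -v limit="$limit" '
BEGIN { FS = OFS = "|" }
{
    end = $4                     # keep the raw YYYYMMDD end date for comparison
    for (i = 3; i <= 4; i++)
        $i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7,2)
    print > "all_formatted.txt"  # every line, reformatted
    if (end < today) {
        print > "expired.txt"            # end date already passed
    } else if (end <= limit) {
        print > "next_two_years.txt"     # ends between today and the limit
    }
}' sample.txt
```

With this sample, the 2020 line lands in expired.txt, the 2024 line in next_two_years.txt, and the 2030 line only in all_formatted.txt. Reading the file once matters at this size, since each extra pass means scanning 23 million lines again.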

This might work for you (GNU sed):
sed -E 's/^([^|]*\|[^|]*\|....)(..)(..\|....)(..)/\1-\2-\3-\4-/' file
Pattern match and insert - where desired.
Or if the file is only 4 columns:
sed -E 's/(..)(..\|....)(..)(..)$/-\1-\2-\3-\4/' file

Related

Change date format from dd/mm/yyyy to yyyy-mm-dd in a file using shell scripting

I have a source file with 18 columns in which columns 10 , 11 and 15 are in the format dd/mm/yyyy and all these needs to be converted to yyyy-mm-dd and written to target file along with other columns.
I am aware of date formatting functions on Variables but do not know how to apply the same on few columns in a file.
I don’t have a machine available to test, but consider using awk with a little function since you are doing the same thing 3 times. It will look something like this:
awk '
function dodate(d,   a) {        # "a" is a local array; "in" is reserved in awk, so use "d"
split(d, a, "/")                 # split the dd/mm/yyyy date into elements of array "a"
return a[3] "-" a[2] "-" a[1]
}
{ $10=dodate($10); $11=dodate($11); $15=dodate($15); print }' yourFile
Reference for awk functions, and split.
If the fields on each line are separated by commas, tell awk that with:
awk -F, ...
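A variant of the same idea, shown here only as a sketch: the column numbers are passed in as a list so the script does not hard-code them. It assumes comma-separated input; the sample line, the cols variable and the ymd function name are made up for illustration.

```shell
# One made-up line with 18 comma-separated columns; 10, 11 and 15 hold dd/mm/yyyy dates.
line='c1,c2,c3,c4,c5,c6,c7,c8,c9,29/09/2017,02/10/2017,c12,c13,c14,31/12/2018,c16,c17,c18'

result=$(printf '%s\n' "$line" | awk -F, -v OFS=, -v cols='10,11,15' '
function ymd(d,   p) {                      # p is a local array
    if (split(d, p, "/") != 3) return d     # leave anything unexpected untouched
    return p[3] "-" p[2] "-" p[1]
}
BEGIN { n = split(cols, c, ",") }           # c[1..n] holds the column numbers
{
    for (k = 1; k <= n; k++)
        $(c[k]) = ymd($(c[k]))
    print
}')

printf '%s\n' "$result"
```

Setting OFS alongside -F matters here: assigning to a field makes awk rebuild the line with OFS, which defaults to a space.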
You could use awk to solve it.
You have 3 columns that contain dates (10, 11 and 15); here I assume a sample string whose field separator is | and whose 4th column contains the date:
aa|bb|cc|29/09/2017|dd|ee|ff
Use gensub (a gawk string-manipulation function) to rearrange the date into YYYYMMDD, then pass it to the date command through getline to get the expected format.
The command is
echo 'aa|bb|cc|29/09/2017|dd|ee|ff' | awk -F\| 'BEGIN{OFS="|"}{$4=gensub(/([0-9]{1,2})\/([0-9]{1,2})\/([0-9]{4})/,"\\3\\2\\1","g",$4); "date --date=\""$4"\" +\"%F\"" | getline a; $4=a; print $0}'
output is
aa|bb|cc|2017-09-29|dd|ee|ff
Hope to help you.
If you have the dateutils package installed, you can use dateutils.dconv
cat file | dateutils.dconv -S -i "%d/%m/%Y"
-i specify input date format
-S sed mode, process only the matched string and copy the rest
Input File
aa|bb|cc|29/09/2017|dd|ee|ff|02/10/2017|gg
Output
aa|bb|cc|2017-09-29|dd|ee|ff|2017-10-02|gg
I'd use the date command:
while read -r fmtDate
do
date -d "${fmtDate}" "+%Y-%m-%d"
done

Linux - How to remove certain lines from a files based on a field value

I want to remove certain lines from a tab-delimited file and write output to a new file.
a b c 2017-09-20
a b c 2017-09-19
es fda d 2017-09-20
es fda d 2017-09-19
The 4th column is Date, basically I want to keep only lines that has 4th column as "2017-09-19" (keep line 2&4) and write to a new file. The new file should have same format as the raw file.
How to write the linux command for this example?
Note: The search criteria should be on the 4th field as I have other fields in the real data and possibly have same value as 4th field.
With awk:
awk 'BEGIN{OFS="\t"} $4=="2017-09-19"' file
OFS: output field separator, a space by default
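Since the real data is tab-delimited and other fields may contain the same date, a small sketch with an explicit tab separator may help. The sample rows and the want variable are made up; the first row deliberately carries the wanted date in column 1 only.

```shell
# -F'\t' splits on tabs only, so fields containing spaces stay intact.
# The line is printed unmodified, so the tabs in the output are preserved.
kept=$(printf '%s\n' \
    "$(printf '2017-09-19\tx\ty\t2017-09-20')" \
    "$(printf 'a\tb\tc\t2017-09-19')" \
    "$(printf 'es\tfda\td\t2017-09-20')" \
    "$(printf 'es\tfda\td\t2017-09-19')" |
    awk -F'\t' -v want='2017-09-19' '$4 == want')

printf '%s\n' "$kept"
```

Only the two rows whose 4th field equals the date are kept; the first row is dropped even though the date appears elsewhere in it.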
Use grep to filter:
cat file.txt | grep '2017-09-19' > filtered_file.txt
This is not perfect, since the string 2017-09-19 is not required to appear in the 4th column, but if your file looks like the example, it'll work.
Sed solution:
sed -nr "/^([^\t]*\t){3}2017-09-19/p" input.txt >output.txt
this is:
-n - don't output every line
-r - extended regular expression
/regexp/p - print line that contains regular expression regexp
^ - begin of line
(regexp){3} - repeat regexp 3 times
[^\t] - any character except tab
\t - tab character
* - repeat characters multiple times
2017-09-19 - search text
That is, skip 3 columns separated by a tab from the beginning of the line, and then check that the value of column 4 coincides with the required value.
awk '/2017-09-19/' file >newfile
cat newfile
a b c 2017-09-19
es fda d 2017-09-19

Increment numbers within string using awk and sed

I have a text file that has about 500 rows of information.
I am adding a few strings to the beginning of each line separated by a comma (Excel recognizes it as another column).
I have this code so far:
sed -e "2,$s#^# =HYPERLINK(B2,C2), https://otrs.city.pittsburgh.pa.us/index.pl?Action=AgentTicketZoom;TicketID=#"** C:\Users\hd\Desktop\newaction.txt > C:\Users\hd\Desktop\test.txt
I have a few columns I want to add. One column adds a link built from a previous column (easy enough).
The first column will hold a formula (string), =HYPERLINK(B2,C2), and I want to increment the 2's to 3's, 4's and so on.
Example:
=HYPERLINK(B2,C2)
=HYPERLINK(B3,C3)
=HYPERLINK(B4,C4)
=HYPERLINK(B5,C5)
=HYPERLINK(B6,C6)
It is my second day coding with sed and awk.
Is there any way I can make this happen using awk and sed?
This Perl one-liner:
perl -pe "BEGIN{$i = 2} s#^#=HYPERLINK(B${i},C${i})#; $i++" "input.txt"
will add =HYPERLINK(B2,C2) to the front of each line and increment the numbers each time.
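Since the question asked about awk, here is a rough awk equivalent. It is a sketch that assumes every input line should receive the prefix and that numbering starts at 2, as in the Perl version; the sample rows are made up.

```shell
# NR is the current line number, so NR + 1 yields 2, 3, 4, ...
out=$(printf '%s\n' 'row one' 'row two' 'row three' |
    awk '{ n = NR + 1; print "=HYPERLINK(B" n ",C" n ")" $0 }')

printf '%s\n' "$out"
```

The single quotes around the awk program are for a Unix-style shell; a Windows command prompt would need different quoting.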

Linux Script to find string containing specific formatting & manipulate the data

I need to create a linux script to search for lines in a file that are formatted like this:
text:text:text:text:number:number
so 6 text/number strings divided by 5 colons
For example:
2f0d:011a0000:07f8:0002:1:0
I want to treat the colon as the column divider
e.g.
Column1:Column2:Column3:Column4:Column5:Column6
I then want to rearrange the data like so:
Column1:Column3:Column4:Column2 discarding column5 & column6
For example:
2f0d:07f8:0002:011a0000
I then want to replace the colons with underscores, remove leading zeros from each column and convert to UPPERCASE
For example:
2F0D_7F8_2_11A0000
End Result
in file1, an entry like this
2f0d:011a0000:07f8:0002:1:0
E4+1
p:BSkyB,C:0000
will be converted to this:
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
Please note also, there are 100's if not 1000s of these 3 line entries in file1
kent$ awk -F: -v OFS="_" 'NF==6{for(i=1;i<=4;i++){sub(/^0*/,"",$i);$i=toupper($i)};print $1,$3,$4,$2;next}7' file
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
you may want to know that, in awk:
sub(pat, rep,input) will do replacement;
toupper(string) will change string into upper case (yes, there is tolower() too)
print $1,$2 will print col1 and col2 separated by OFS
A command much more important than the one-liner above:
man gawk
a solution using sed:
sed -r 's/^0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):[a-f0-9]+:[a-f0-9]+$/\U\1_\3_\4_\2/'
With sed:
sed -r 's/^0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:digit:]]+):0*([[:digit:]]+)$/\U\1_\3_\4_\2/' foo

Replacing a column of data in text files with Linux command

I have several text files whose lines are tab-delimited.
The second column contains incorrect data.
How do I change everything in the second column to a specific text string?
awk 'BEGIN { FS = OFS = "\t" } { $2 = "<STRING>"; print }' <FILENAME>
cat INFILE | perl -ne '$ln=$_;@x=split(/","/); @a=split(/","/, $ln,8);@b=splice(@a,0,7); $l=join("\",\"", @b); $r=join("\",\"", splice(@x,8)); print "$l\",\"10\",\"$r"'
This is an example that changes the 10th column to "10". I prefer this as I don't have to count the matching parenthesis like in the sed technique.
A simple and cheap hack:
cat INFILE | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/' > OUTFILE
testing it:
echo -e 'one\ttwo\tthree\none\ttwo\tthree' | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/'
takes in
one two three
one two three
and produces
one REPLACEMENT three
one REPLACEMENT three
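For several tab-delimited files, the awk approach can be wrapped in a loop. This is a minimal sketch: the file names, the .fixed suffix and the replacement text are made up, and each result is written to a new file rather than over the original.

```shell
cd "$(mktemp -d)" || exit 1          # scratch directory for the sketch

# Two made-up input files; the second has a space inside its first field.
printf 'one\twrong\tthree\n' > a.tsv
printf 'x y\twrong\tz\n'     > b.tsv

for f in a.tsv b.tsv; do
    # FS and OFS are both tab, so spaces inside fields survive
    # and the rebuilt line stays tab-delimited.
    awk -v new='REPLACEMENT' 'BEGIN { FS = OFS = "\t" } { $2 = new; print }' "$f" > "$f.fixed"
done
```

Unlike the greedy sed pattern above, this does not depend on the file having exactly three columns.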

Resources