Increment numbers within string using awk and sed - linux

I have a text file that has about 500 rows of information.
I am adding a few strings to the beginning of each line separated by a comma (Excel recognizes it as another column).
I have this code so far:
sed -e "2,$s#^# =HYPERLINK(B2,C2), https://otrs.city.pittsburgh.pa.us/index.pl?Action=AgentTicketZoom;TicketID=#" C:\Users\hd\Desktop\newaction.txt > C:\Users\hd\Desktop\test.txt
I want to add two columns. One column adds a link in front of an existing column (easy enough).
The other is a formula (string) in the first column, =HYPERLINK(B2,C2), and I want to increment the 2's to 3's, 4's and so on for each following row.
Example:
=HYPERLINK(B2,C2)
=HYPERLINK(B3,C3)
=HYPERLINK(B4,C4)
=HYPERLINK(B5,C5)
=HYPERLINK(B6,C6)
It is my second day coding with sed and awk.
Is there any way I can make this happen using awk and sed?

This Perl one-liner:
perl -pe "BEGIN{$i = 2} s#^#=HYPERLINK(B${i},C${i})#; $i++" "input.txt"
will add =HYPERLINK(B2,C2) to the front of each line and increment the numbers each time.
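Since the question asks for awk, here is a minimal awk sketch of the same idea. It assumes the first line is a header that should stay untouched (as the 2,$ address in the sed attempt suggests), so the line number NR is also the spreadsheet row number. The sample rows and the variable names src and out are made up for illustration.

```shell
# Hypothetical sample: a header line followed by three data rows.
src=$(mktemp)
printf '%s\n' 'header,row' '101,first' '102,second' '103,third' > "$src"

# Leave line 1 alone; prefix every later line with a formula whose
# row number is the current line number (NR).
out=$(awk 'NR == 1 { print; next }
           { printf "=HYPERLINK(B%d,C%d),%s\n", NR, NR, $0 }' "$src")

printf '%s\n' "$out"
rm -f "$src"
```

Passing the original line as a printf argument rather than as part of the format string keeps any % characters in the data from being interpreted.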

Related

Removing leading 0 from third column

I'm trying to remove the first 0 from the third column in my CSV file
tel.csv -
test,01test,01234567890
test,01test,09876054321
I have been trying to use the following with no luck -
cat tel.csv | sed 's/^0*//'
Something like:
sed 's/^\([^,]*\),\([^,]*\),0\(.*\)$/\1,\2,\3/' file.csv
Or awk
awk 'BEGIN{FS=OFS=","}{sub(/^0/, "", $3)}1' file.csv
Assumptions:
3rd column consists of only numbers (0-9)
3rd column could have multiple leading 0's
Adding a row with a 3rd column that has multiple leading 0's:
$ cat tel.csv
test,01test,01234567890
test,01test,09876054321
test,02test,00001234567890
One awk idea:
$ awk 'BEGIN{FS=OFS=","}{$3=$3+0}1' tel.csv
test,01test,1234567890
test,01test,9876054321
test,02test,1234567890
Where: adding 0 to a number ($3+0) has the side effect of removing leading 0's.
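If the third column holds phone numbers or other long digit strings, treating it as a number may not be what you want. A possible string-based variant, sketched here under the same assumptions (comma-separated, only digits in the third column), strips any run of leading zeros without doing arithmetic. The temporary file and the variable names tel and stripped are only for illustration.

```shell
# Recreate the sample tel.csv, including the row with several leading zeros.
tel=$(mktemp)
printf '%s\n' \
  'test,01test,01234567890' \
  'test,01test,09876054321' \
  'test,02test,00001234567890' > "$tel"

# sub() removes one or more zeros at the start of field 3, as text.
stripped=$(awk 'BEGIN { FS = OFS = "," } { sub(/^0+/, "", $3) } 1' "$tel")

printf '%s\n' "$stripped"
rm -f "$tel"
```

Note that a field consisting only of zeros would become empty with this pattern.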
If the third field is the last field, as it is in the sample lines:
sed 's/,0\([^,]*\)$/,\1/' file

How to edit the lines in text file in Linux - format the date to YYYY-MM-DD and then grep the line by time period

Can anyone help me convert the dates in this text file from YYYYMMDD to YYYY-MM-DD using a bash script or the Linux command line? I am not sure how to start editing 23 million lines.
I have a text file in YYYYMMDD format:
3515034013|50008|20140601|20240730
and I want to turn it into a text file in YYYY-MM-DD format (only the 3rd and 4th fields need to change, for 23 million lines):
3515034013|50008|2014-06-01|2024-07-30
I want to convert the file from YYYYMMDD to YYYY-MM-DD format and then extract specific lines from it based on a time period, which is the end goal.
The 3rd field is the start date and the 4th field is the end date. For example, I need:
(01). Lines whose end date (4th field) is before today, i.e. 2022-08-06: all the old lines
(02). Lines whose end date (4th field) falls within 2 years from now, i.e. between 2022-08-06 and 2024-08-06
Please note: there are more than 23 million lines to edit and analyze by date.
How should I approach this problem? Which method is the most time-efficient: awk, sed, or Bash line-by-line editing?
$ awk '
    BEGIN { FS=OFS="|" }
    {
        for ( i=3; i<=4; i++ ) {
            $i = substr($i,1,4) "-" substr($i,5,2) "-" substr($i,7)
        }
        print
    }
' file
3515034013|50008|2014-06-01|2024-07-30
Here is a way to do it with sed. It has the same restrictions as steffen's answer: | is the field separator, and all dates have the same format, i.e. leading zeros in the month and day parts.
sed -E 's/^(.*[|])([0-9]{4})([0-9]{2})([0-9]{2})[|]([0-9]{4})([0-9]{2})([0-9]{2})$/\1\2-\3-\4|\5-\6-\7/g'
Here is what the regular expression does:
^(.*[|]) captures everything from the start of the line (^) up to a | into \1. Because .* is greedy and the rest of the expression must match the rest of the line up to its end, \1 ends up holding the first two columns.
([0-9]{4})([0-9]{2})([0-9]{2})[|] captures the parts of the first date field into \2 to \4; notice the [|].
([0-9]{4})([0-9]{2})([0-9]{2})$ does the same for the second date column, anchored at the end of the line ($), and captures the parts into \5 to \7; notice the $.
The replacement \1\2-\3-\4|\5-\6-\7 inserts - at the required places.
The capturing into \n happens because of the (...) parentheses in the regular expression.
Here's one way to change the format with awk:
awk '{$3=substr($3,1,4) "-" substr($3,5,2) "-" substr($3,7,2); $4=substr($4,1,4) "-" substr($4,5,2) "-" substr($4,7,2); print}' FS='|' OFS='|'
It should work given that
| is only used for field separation
all dates have the same format
You can redirect the transformed lines to a new file or change the file in place. Of course you can do the same with sed or ed. I'd go for awk because it lets you extract your specific lines to an extra file in the same run.
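As a rough sketch of doing the formatting and the extraction in one awk pass: YYYYMMDD values order the same way numerically as chronologically, so the end date can be compared before it is reformatted. The reference dates are fixed here so the example is reproducible; in practice they could come from date +%Y%m%d (and, with GNU date, date -d "+2 years" +%Y%m%d). The two extra sample rows, the output file names, and the helper function fmt are assumptions made for illustration.

```shell
# Sample input: the row from the question plus two made-up rows.
input=$(mktemp)
outdir=$(mktemp -d)
printf '%s\n' \
  '3515034013|50008|20140601|20240730' \
  '3515034014|50009|20100101|20200115' \
  '3515034015|50010|20200101|20301231' > "$input"

today=20220806   # fixed "today" for the example
limit=20240806   # two years later

awk -v today="$today" -v limit="$limit" \
    -v old="$outdir/expired.txt" -v soon="$outdir/next2y.txt" '
  BEGIN { FS = OFS = "|" }
  # Turn YYYYMMDD into YYYY-MM-DD.
  function fmt(d) { return substr(d, 1, 4) "-" substr(d, 5, 2) "-" substr(d, 7, 2) }
  {
    enddate = $4 + 0                  # keep the raw end date as a number
    $3 = fmt($3); $4 = fmt($4)        # reformat both date fields
    if (enddate < today + 0)       print > old    # already ended
    else if (enddate <= limit + 0) print > soon   # ends within two years
  }
' "$input"

expired=$(cat "$outdir/expired.txt")
upcoming=$(cat "$outdir/next2y.txt")
printf 'expired:  %s\nupcoming: %s\n' "$expired" "$upcoming"
rm -rf "$input" "$outdir"
```

Reading the 23 million lines once and writing each category to its own file avoids a second pass with grep.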
This might work for you (GNU sed):
sed -E 's/^([^|]*\|[^|]*\|....)(..)(..\|....)(..)/\1-\2-\3-\4-/' file
Pattern match and insert - where desired.
Or if the file is only 4 columns:
sed -E 's/(..)(..\|....)(..)(..)$/-\1-\2-\3-\4/' file

Uniqing a delimited file based on a subset of fields

I have data such as below:
1493992429103289,207.55,207.5
1493992429103559,207.55,207.5
1493992429104353,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
Due to the nature of the last two columns, their values change throughout the day and are repeated regularly. By grouping the way outlined in my desired output (below), I am able to see each time their values changed (with the epoch time in the first column). Is there a way to achieve the desired output shown below:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
So I consolidate the data by the second two columns. However, the consolidation is not completely unique (as can be seen by 207.55, 207.5 being repeated)
I have tried:
uniq -f 1
However, the output gives only the first line and does not go on through the list.
The awk solution below never outputs a value pair that has already occurred, and so gives the output shown below the awk code:
awk '!x[$2 $3]++'
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
I do not wish to sort the data by the second two columns. However, since the first is epoch time, it may be sorted by the first column.
You can't set the delimiter with uniq; fields have to be separated by white space. With the help of tr you can work around that:
tr ',' ' ' <file | uniq -f1 | tr ' ' ','
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
You can use an Awk statement as below,
awk 'BEGIN{FS=OFS=","} s != $2 || t != $3 {print} {s=$2;t=$3}' file
which produces the output as you need.
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
The idea is to store the second and third column values in variables s and t respectively and print the line only if the current line's values differ from those of the previous line.
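The same "compare with the previous line" idea can also be sketched with a single combined key, which may be easier to extend to more columns. This is only an illustration using the sample data from the question; the variable names data, changes, key and prev are arbitrary.

```shell
# Sample data from the question.
data=$(mktemp)
printf '%s\n' \
  '1493992429103289,207.55,207.5' \
  '1493992429103559,207.55,207.5' \
  '1493992429104353,207.55,207.5' \
  '1493992429104491,207.6,207.55' \
  '1493992429110551,207.55,207.5' > "$data"

# Build a key from columns 2 and 3; print a line when it is the first one
# or when its key differs from the key of the line before it.
changes=$(awk -F, '{ key = $2 FS $3 }
                   NR == 1 || key != prev { print }
                   { prev = key }' "$data")

printf '%s\n' "$changes"
rm -f "$data"
```

Unlike awk '!x[$2 $3]++', this only remembers the previous line, so a value pair that comes back later is printed again.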
I found an answer which is not as elegant as Inian's but satisfies my purpose.
Since my first column is always epoch time in microseconds and its number of characters never changes, I can use the following uniq command:
uniq -s 17
You can try to compare the current line with the previous line manually (with a loop).
prev_line=""
# start at the first line
i=1
# drop the first column, which is not part of the comparison
sed 's#^[0-9][0-9]*,##' ./data_file > ./transform_data_file
# start with an empty list of line numbers to delete
: > ./line_to_be_suppress
# for every line of the file without its first column
while IFS= read -r current_line
do
    # if the previous line is the same as the current line
    if [ "x$prev_line" = "x$current_line" ]
    then
        # record the line number to delete later
        echo "$i" >> ./line_to_be_suppress
    fi
    # remember the current line as the previous line
    prev_line=$current_line
    # increment the current line number
    i=$(( i + 1 ))
done < ./transform_data_file
# delete the recorded lines, last first so earlier line numbers stay valid
for line_to_suppress in $(tac ./line_to_be_suppress) ; do sed -i "${line_to_suppress}d" ./data_file ; done
rm line_to_be_suppress
rm transform_data_file
Since your first field seems to have a fixed length of 17 characters (16 digits plus the , delimiter), you could use the -s option of uniq, which would be more efficient for larger files:
uniq -s 17 file
Gives this output:
1493992429103289,207.55,207.5
1493992429104491,207.6,207.55
1493992429110551,207.55,207.5
From man uniq:
-f num
Ignore the first num fields in each input line when doing comparisons.
A field is a string of non-blank characters separated from adjacent fields by blanks.
Field numbers are one based, i.e., the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons.
If specified in conjunction with the -f option, the first chars characters after
the first num fields will be ignored. Character numbers are one based,
i.e., the first character is character one.

CSV grep but keep the header

I have a CSV file that looks like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output to be like
A,B,C
1,2,3
1,2,6
I am working under linux
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
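One thing to watch for with these patterns is that ,2 also matches values that merely start with 2, such as 25. A possible stricter variant, sketched here with extended regular expressions (sed -E), requires the 2 to be followed by a comma or the end of the line. The extra 7,25,1 row and the variable names csv, M and matches are made up for the example.

```shell
# Sample file from the question plus one row whose B value only starts with 2.
csv=$(mktemp)
printf '%s\n' 'A,B,C' '1,2,3' '4,4,4' '1,2,6' '3,6,9' '7,25,1' > "$csv"

# M = number of columns to skip (column number minus one); column B -> 1.
M=1

# Print the header, then every other line whose (M+1)th field is exactly 2.
matches=$(sed -n -E '1p;1!{/^([^,]*,){'"$M"'}2(,|$)/p;}' "$csv")

printf '%s\n' "$matches"
rm -f "$csv"
```

The 7,25,1 row is not printed because the character after the 2 is neither a comma nor the end of the line.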
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk. The code will be clearer and MUCH easier to enhance in future if necessary.

Replacing a column of data in text files with Linux command

I have several text files whose lines are tab-delimited.
The second column contains incorrect data.
How do I change everything in the second column to a specific text string?
awk 'BEGIN { FS = OFS = "\t" } { $2 = "<STRING>"; print }' <FILENAME>
cat INFILE | perl -ne '$ln=$_;@x=split(/","/); @a=split(/","/, $ln,8);@b=splice(@a,0,7); $l=join("\",\"", @b); $r=join("\",\"", splice(@x,8)); print "$l\",\"10\",\"$r"'
This is an example for a file whose fields are separated by ","; it changes the 8th column to "10". I prefer this as I don't have to count the matching parentheses like in the sed technique.
A simple and cheap hack:
cat INFILE | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/' > OUTFILE
testing it:
echo -e 'one\ttwo\tthree\none\ttwo\tthree' | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/'
takes in
one two three
one two three
and produces
one REPLACEMENT three
one REPLACEMENT three
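Because .* is greedy, the hack above only targets the second column when a line has exactly three columns. A possible variant for lines with any number of columns anchors the match at the start of the line and matches "anything but a tab" instead. This is a sketch; it keeps a literal tab in a shell variable (here called tab) so it does not depend on sed understanding \t.

```shell
# A literal tab character, so the sed expression works without \t support.
tab=$(printf '\t')

# Keep column 1, replace whatever follows up to the next tab (column 2).
replaced=$(printf 'one\ttwo\tthree\tfour\n' |
  sed "s/^\([^$tab]*\)$tab[^$tab]*/\1${tab}REPLACEMENT/")

printf '%s\n' "$replaced"
```

The remaining columns are left untouched because the pattern stops at the second tab.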
