Linux script to find strings with specific formatting & manipulate the data

I need to create a linux script to search for lines in a file that are formatted like this:
text:text:text:text:number:number
so six text/number fields divided by five colons
For example:
2f0d:011a0000:07f8:0002:1:0
I want to treat the colon as a column divider
e.g.
Column1:Column2:Column3:Column4:Column5:Column6
I then want to rearrange the data like so:
Column1:Column3:Column4:Column2, discarding Column5 & Column6
For example:
2f0d:07f8:0002:011a0000
I then want to replace each colon with an underscore, remove leading zeros from each column & convert to UPPERCASE
For example:
2F0D_7F8_2_11A0000
End Result
in file1, an entry like this
2f0d:011a0000:07f8:0002:1:0
E4+1
p:BSkyB,C:0000
will be converted to this:
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
Please note also that there are hundreds if not thousands of these 3-line entries in file1.

kent$ awk -F: -v OFS="_" 'NF==6{for(i=1;i<=4;i++){sub(/^0*/,"",$i);$i=toupper($i)};print $1,$3,$4,$2;next}7' file
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
you may want to know that, in awk:
sub(regexp, replacement, target) does the replacement in place;
toupper(string) converts string to upper case (yes, there is tolower() too);
print $1,$2 prints col1 and col2 separated by OFS.
A command much more important than the above one-liner:
man gawk
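For example, a quick toy run showing all three pieces together (the sample string here is made up):
kent$ echo '007:abc' | awk -F: -v OFS=_ '{sub(/^0*/, "", $1); $2 = toupper($2); print $1, $2}'
7_ABC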

a solution using sed:
sed -r 's/^0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):[a-f0-9]+:[a-f0-9]+$/\U\1_\3_\4_\2/' file
(\U uppercases the rest of the replacement; it and -r are GNU sed extensions.)
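For example, running it over the sample line:
kent$ echo '2f0d:011a0000:07f8:0002:1:0' | sed -r 's/^0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):[a-f0-9]+:[a-f0-9]+$/\U\1_\3_\4_\2/'
2F0D_7F8_2_11A0000
Lines that do not match the six-field pattern pass through unchanged.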

With sed:
sed -r 's/^0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:digit:]]+):0*([[:digit:]]+)$/\U\1_\3_\4_\2/' foo

Related

Change date format from dd/mm/yyyy to yyyy-mm-dd in a file using shell scripting

I have a source file with 18 columns in which columns 10, 11 and 15 are in the format dd/mm/yyyy; all of these need to be converted to yyyy-mm-dd and written to the target file along with the other columns.
I am aware of date formatting functions on variables but do not know how to apply them to a few columns in a file.
I don't have a machine available to test, but consider using awk with a little function, since you are doing the same thing 3 times. It will look something like this:
awk '
function dodate(d,    a) {
    split(d, a, "/")            # split dd/mm/yyyy into elements of array "a"
    return a[3] "-" a[2] "-" a[1]
}
{ $10 = dodate($10); $11 = dodate($11); $15 = dodate($15); print }' yourFile
Reference for awk functions, and split.
If the fields on each line are separated by commas, tell awk that with:
awk -F, -v OFS=, ...
(-F sets the input separator; setting OFS keeps the commas in the output when fields are modified.)
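A toy run of that approach, using column 2 instead of 10/11/15 so the sample stays short (the input line is made up):
echo 'a,01/02/2003,b' | awk -F, -v OFS=, '
function dodate(d,    a) { split(d, a, "/"); return a[3] "-" a[2] "-" a[1] }
{ $2 = dodate($2); print }'
which prints a,2003-02-01,b.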
Maybe you could use the awk command to solve it.
As you have 3 cols containing dates (cols 10, 11, 15), here I assume a sample string whose field separator is |, with the date in the 4th col:
aa|bb|cc|29/09/2017|dd|ee|ff
use awk's string-manipulation function gensub to rearrange the date, then run it through the date command via getline to format it into the expected syntax.
the command is
echo 'aa|bb|cc|29/09/2017|dd|ee|ff' | awk -F\| 'BEGIN{OFS="|"}{$4=gensub(/([0-9]{1,2})\/([0-9]{1,2})\/([0-9]{4})/,"\\3\\2\\1","g",$4); "date --date=\""$4"\" +\"%F\"" | getline a; $4=a; print $0}'
output is
aa|bb|cc|2017-09-29|dd|ee|ff
Hope this helps.
If you have the dateutils package installed, you can use dateutils.dconv
cat file | dateutils.dconv -S -i "%d/%m/%Y"
-i specify input date format
-S sed mode, process only the matched string and copy the rest
Input File
aa|bb|cc|29/09/2017|dd|ee|ff|02/10/2017|gg
Output
aa|bb|cc|2017-09-29|dd|ee|ff|2017-10-02|gg
I'd use the date command, but note that GNU date parses slash-separated dates as mm/dd/yyyy, so split the dd/mm/yyyy value first:
while IFS=/ read -r d m y
do
    date -d "${y}-${m}-${d}" "+%Y-%m-%d"
done
(This converts a stream of bare dates on stdin; pulling columns 10, 11 and 15 out of the file is a separate step.)
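For example, with a bare date on stdin:
echo 29/09/2017 | while IFS=/ read -r d m y; do date -d "$y-$m-$d" "+%Y-%m-%d"; done
2017-09-29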

linux: extract pattern from file

I have a big tab-delimited .txt file with 4 columns
col1 col2 col3 col4
name1 1 2 ens|name1,ccds|name2,ref|name3,ref|name4
name2 3 10 ref|name5,ref|name6
... ... ... ...
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4
So for this example I would like to have as output
ref|name3
ref|name4
ref|name5
ref|name6
I thought of using 'sed' for this, but I don't know where to start.
I think awk is better suited for this task:
$ awk '{for (i=1;i<=NF;i++){if ($i ~ /ref\|/){print $i}}}' FS='( )|(,)' infile
ref|name3
ref|name4
ref|name5
ref|name6
FS='( )|(,)' sets a multi-pattern FS so the record is split on both blank spaces and commas; the loop then prints every column that matches the ref pattern.
Now I want to extract from this file everything that starts with
'ref|'. This pattern is only present in col4
If you are sure that the pattern is only present in col4, you could use grep:
grep -o 'ref|[^,]*' file
output:
ref|name3
ref|name4
ref|name5
ref|name6
One solution I had was to first use awk to only get the 4th column, then use sed to convert commas into newlines, and then use grep (or awk again) to get the ones that start with ref:
awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
This might work for you (GNU sed):
sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file
Surround the required strings with newlines, then print only those lines that begin with the required string.

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john@mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 female names, all in plain text, all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option, but in that case I don't know how to read the female names from the file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks!
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
Would print the lines of your csv file whose first name appears in female.names.txt. NR==FNR is only true while the first file (the names list) is being read, so those lines just populate the array a; after that, a record is printed when its first column is a key of a.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
Would output lines not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes each name in the list of female names into the regular expression ^name, (e.g. Jane becomes ^Jane,), so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the pattern file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
# -n: loop over input records; -a: autosplit each record into @F; -F,: split on commas
use strict;
our (%names, @F);
BEGIN {
    # read the names list (the first file argument) before the main loop starts
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}
# print the whole record if its first field is one of the known names
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose you have the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field has a vowel as both its first and last letter. If you did a simple grep you would get both of the lines, but the following will give you the first line only, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines whose third field has a vowel as both its first and last letter. Similarly, a simple grep would return both lines, but the following will give you the second line only, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The {1,} in the first set of curly braces specifies that the preceding bracket expression, which covers the ASCII range from 0 to z, must occur one or more times; after it comes the field separator, a space in this case. Change the value within the second set of curly braces ({0} or {2} above) to the desired field number minus 1. Then use a regular expression to state your criteria.

CSV grep but keep the header

I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output to be like
A,B,C
1,2,3
1,2,6
I am working under Linux.
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other then the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
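For example, to match a 2 in column 3 (n=3, so M=2):
sed -n '1p;1!{/^\([^,]*,\)\{2\}2/p}' file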
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
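A minimal sketch of that approach, assuming the Text::CSV module is installed (the file name file.csv is made up):
perl -MText::CSV -e '
  my $csv = Text::CSV->new({ binary => 1, eol => "\n" });  # binary copes with embedded commas/newlines in quoted fields
  my $n = 0;
  while (my $row = $csv->getline(\*STDIN)) {
      # keep the header row, plus any row whose B column (index 1) is 2
      $csv->print(\*STDOUT, $row) if ++$n == 1 || $row->[1] eq "2";
  }' < file.csv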
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line, for anything else, just use awk - the code will be clearer and MUCH easier to enhance in future if necessary.

Replacing a column of data in text files with Linux command

I have several text files whose lines are tab-delimited.
The second column contains incorrect data.
How do I change everything in the second column to a specific text string?
awk 'BEGIN { FS = OFS = "\t" } { $2 = "<STRING>"; print }' <FILENAME>
(FS and OFS must both be set to a tab; otherwise awk splits on any whitespace and rejoins the fields with single spaces.)
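A quick check of that behaviour:
echo -e 'one\ttwo\tthree' | awk 'BEGIN { FS = OFS = "\t" } { $2 = "<STRING>"; print }'
one <STRING> three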
cat INFILE | perl -ne '$ln=$_; @x=split(/","/); @a=split(/","/, $ln, 8); @b=splice(@a,0,7); $l=join("\",\"", @b); $r=join("\",\"", splice(@x,8)); print "$l\",\"10\",\"$r"'
This is an example that changes the 10th column to "10". I prefer this as I don't have to count the matching parentheses as in the sed technique.
A simple and cheap hack:
cat INFILE | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/' > OUTFILE
testing it:
echo -e 'one\ttwo\tthree\none\ttwo\tthree' | sed 's/\(.*\)\t\(.*\)\t\(.*\)/\1\tREPLACEMENT\t\3/'
takes in
one two three
one two three
and produces
one REPLACEMENT three
one REPLACEMENT three
