linux: extract pattern from file

I have a big tab delimited .txt file of 4 columns
col1 col2 col3 col4
name1 1 2 ens|name1,ccds|name2,ref|name3,ref|name4
name2 3 10 ref|name5,ref|name6
... ... ... ...
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4.
So for this example I would like to have as output:
ref|name3
ref|name4
ref|name5
ref|name6
I thought of using 'sed' for this, but I don't know where to start.

I think awk is better suited for this task:
$ awk '{for (i=1;i<=NF;i++){if ($i ~ /ref\|/){print $i}}}' FS='( )|(,)' infile
ref|name3
ref|name4
ref|name5
ref|name6
FS='( )|(,)' sets a multiple FS, so the record is split into columns on both , and blank spaces; the loop then prints any column that matches the ref pattern.
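If the real file is tab-delimited as the question says, a bracket expression covering blanks, tabs and commas should do the same job (a minimal variant, assuming GNU awk so the \t in -F is understood):
awk -F'[ \t,]+' '{for (i=1;i<=NF;i++) if ($i ~ /^ref\|/) print $i}' infile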

Now I want to extract from this file everything that starts with
'ref|'. This pattern is only present in col4
If you are sure that the pattern is only present in col4, you could use grep:
grep -o 'ref|[^,]*' file
output:
ref|name3
ref|name4
ref|name5
ref|name6

One solution I had was to first use awk to only get the 4th column, then use sed to convert commas into newlines, and then use grep (or awk again) to get the ones that start with ref:
awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
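A variant of the same pipeline, using tr for the comma-to-newline step and an explicit tab separator, since the file is described as tab-delimited:
awk -F'\t' '{print $4}' data.txt | tr ',' '\n' | grep '^ref'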

This might work for you (GNU sed):
sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file
This surrounds each required string with newlines, prints only those pattern-space lines that begin with the required string, then deletes up to the first newline and repeats.

combine two csv files based on common column using awk or sed [duplicate]

I have two CSV files which share a common column, with duplicates in one file. How can I merge both CSV files using awk or sed?
CSV file 1
5/1/20,user,mark,Type1 445566
5/2/20,user,ally,Type1 445577
5/1/20,user,joe,Type1 445588
5/2/20,user,chris,Type1 445566
CSV file 2
Type1 445566,Name XYZ11
Type1 445577,Name AAA22
Type1 445588,Name BBB33
Type1 445566,Name XYZ11
What I want is:
5/1/20,user,mark,Type1 445566,Name XYZ11
5/2/20,user,ally,Type1 445577,Name AAA22
5/1/20,user,joe,Type1 445588,Name BBB33
5/2/20,user,chris,Type1 445566,Name XYZ11
So is there a bash command in Linux/Unix to achieve this? Can we do this using awk or sed?
Basically, I need to match column 4 of CSV file 1 with column 1 of CSV file 2 and merge both CSVs.
I tried the following command:
paste -d, <(cut -d, -f 1-2 ./test1.csv | sed 's/$/,Type1/') test2.csv
Got Result:
5/1/20,user,Type1,Type1 445566,Name XYZ11
If you are able to install the join utility, this command works:
join -t, -o 1.1,1.2,1.3,2.1,2.2 -1 4 -2 1 file1.csv file2.csv
Explanation:
-t, identify the field separator as comma (',')
-o 1.1,1.2,1.3,2.1,2.2 format the output to be "file1col1, file1col2, file1col3, file2col1, file2col2"
-1 4 join by column 4 in file1
-2 1 join by column 1 in file2
For additional usage information for join, reference the join manpage.
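Note that join expects both inputs to be sorted on the join field; if yours are not, they can be sorted on the fly (a sketch using bash process substitution; with duplicate keys, join prints one output line per matching pair):
join -t, -o 1.1,1.2,1.3,2.1,2.2 -1 4 -2 1 <(sort -t, -k4,4 file1.csv) <(sort -t, -k1,1 file2.csv)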
Edit: You specifically asked for the solution using awk or sed so here is the awk implementation:
awk -F"," 'NR==FNR {a[$1] = $2; next} {print $1","$2","$3","$4"," a[$4]}' \
file2.csv \
file1.csv
Explanation:
-F"," Delimit by the comma character
NR==FNR Read the first file argument (notice in the above solution that we're passing file2 first)
{a[$1] = $2; next} In the current file, save the contents of Column2 in an array that uses Column1 as the key
{print $1","$2","$3","$4"," a[$4]} Read file1 and using Column4, match the value to the key's value from the array. Print Column1, Column2, Column3, Column4, and the key's value.
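Running it against the sample files should reproduce the requested output:
$ awk -F"," 'NR==FNR {a[$1] = $2; next} {print $1","$2","$3","$4"," a[$4]}' file2.csv file1.csv
5/1/20,user,mark,Type1 445566,Name XYZ11
5/2/20,user,ally,Type1 445577,Name AAA22
5/1/20,user,joe,Type1 445588,Name BBB33
5/2/20,user,chris,Type1 445566,Name XYZ11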
The two example input files seem to be already appropriately sorted, so you just have to put them side by side, and paste is good for this; however you want to remove some ,-separated columns from file1, and you can use cut for that; but you also want to insert another (constant) column, and sed can do it. A possible command is this:
paste -d, <(cut -d, -f 1-2 file1 | sed 's/$/,abcd/') file2
Actually sed can do the whole processing of file1, and the output can be piped into paste, which uses - to capture it from the standard input:
sed -E 's/^(([^,]+,){2}).*/\1abcd/' file1 | paste -d, - file2
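For the sample file1 the intermediate sed output is the first two fields plus the constant column, which paste then glues to file2:
$ sed -E 's/^(([^,]+,){2}).*/\1abcd/' file1
5/1/20,user,abcd
5/2/20,user,abcd
5/1/20,user,abcd
5/2/20,user,abcd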

Linux bash script: how to search on a column but return full row?

I have a tab-delimited file with data like this:
col1 col2 col3
I wrote a bash script that allows the file to be searched using this code:
echo -en "Search term: "
read search
data=`cat data.data | egrep -i "$search"`
This works great for searching the entire file, but I'm now wanting to search only on a specific column (which the user can choose).
I am aware of the cut command and can search on a column using this:
cat data.data | cut -f$col | egrep -i "$search"
But then only that column is outputted, so if I use this method then I somehow need to get the rest of the row back.
How can I search on a column in the file, but return the full rows for the results?
You can pass two variables to awk: the column number and the search term.
awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
If the value of $col is 2, then $2 in awk will correspond to the second column. The ~ operator is used to do a regular expression pattern match. The line will be printed if the column matches the regular expression.
Testing it out:
$ cat data.data
col1 col2 col3
$ col=2
$ search=l2
$ awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
col1 col2 col3
$ search=l3
$ awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
# no output
If you want to do case-insensitive pattern matching, you have two options: convert everything to upper or lower case (tolower($col) ~ tolower(search)), or if you are using GNU awk, set the IGNORECASE variable:
$ search=L2
$ awk -vIGNORECASE=1 -vcol="$col" -vsearch="$search" '$col ~ search' data.data
col1 col2 col3
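The tolower() form works in any POSIX awk, not just gawk:
$ awk -vcol="$col" -vsearch="$search" 'tolower($col) ~ tolower(search)' data.data
col1 col2 col3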
awk is easier for this:
data=$(awk -v col="$col" -v term="$term" 'toupper($col)==toupper(term)' file)
col - column number
term - search term
You could also pass field separator with -F if needed.
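For example, for a tab-delimited file it could look like this (a sketch):
data=$(awk -F'\t' -v col="$col" -v term="$term" 'toupper($col)==toupper(term)' file)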

Linux Script to find string containing specific formatting & manipulate the data

I need to create a linux script to search for lines in a file that are formatted like this:
text:text:text:text:number:number
so 6 text/number strings divided by 5 colons
For example:
2f0d:011a0000:07f8:0002:1:0
I want to treat the colon as a column divider
e.g.
Column1:Column2:Column3:Column4:Column5:Column6
I then want to rearrange the data like so:
Column1:Column3:Column4:Column2 discarding column5 & column6
For example:
2f0d:07f8:0002:011a0000
I then want to replace the colons with underscores, remove leading zeros from each column & convert to UPPERCASE
For example:
2F0D_7F8_2_11A0000
End Result
in file1, an entry like this
2f0d:011a0000:07f8:0002:1:0
E4+1
p:BSkyB,C:0000
will be converted to this:
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
Please note also, there are hundreds if not thousands of these 3-line entries in file1
kent$ awk -F: -v OFS="_" 'NF==6{for(i=1;i<=4;i++){sub(/^0*/,"",$i);$i=toupper($i)};print $1,$3,$4,$2;next}7' file
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000
you may want to know that, in awk:
sub(pat, rep, input) will do replacement;
toupper(string) will change string into upper case (yes, there is tolower() too)
print $1,$2 will print col1 and col2 separated by OFS
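a quick illustration of those three pieces together:
$ echo "007:abc" | awk -F: -v OFS="_" '{sub(/^0*/,"",$1); print toupper($1), toupper($2)}'
7_ABC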
the command that is much more important than the above one-liner:
man gawk
a solution using sed:
sed -r 's/^0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):0*([a-f0-9]+):[a-f0-9]+:[a-f0-9]+$/\1_\3_\4_\2/'
With sed:
sed -r 's/^0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:digit:]]+):0*([[:digit:]]+)$/\U\1_\3_\4_\2/' foo
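Applied to the sample entry, that gives the requested result while leaving the other lines alone (the \U uppercase conversion is a GNU sed feature):
$ sed -r 's/^0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:alnum:]]+):0*([[:digit:]]+):0*([[:digit:]]+)$/\U\1_\3_\4_\2/' foo
2F0D_7F8_2_11A0000
E4+1
p:BSkyB,C:0000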

LINUX: Using cat to remove columns in CSV - some have commas in the data

I need to remove some columns from a CSV. Easy.
The problem is I have two columns with full text that actually has commas in them as a part of the data. My cols are enclosed with quotes and the cat is counting the commas in the text as columns. How can I do this so the commas enclosed with quotes are ignored?
example:
"first", "last", "dob", "some long sentence, it has commas in it,", "some data", "foo"
I want to print only columns 1-4 and 6
You will save yourself a lot of aggravation by writing a short Perl script that uses Parse::CSV http://metacpan.org/pod/Parse::CSV
I am sure there is a Python way of doing this too.
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
Example:
http://ideone.com/g2gZmx
How it works:
Look at the line:
"a,b","c,d","e,f"
We know that each field is wrapped in double quotes, so we can split this line on "," (quote, comma, quote):
cat file | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
and the fields will be:
"a,b c,d e,f"
But we have an annoying " at the start and end of the line. So we remove them with sed:
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
And the fields will be
a,b c,d e,f
Then we can simply take the second field with awk '{print $2}'.
Read about regexp field splitting in awk: http://www.gnu.org/software/gawk/manual/html_node/Regexp-Field-Splitting.html
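To print the columns the question actually asks for (1-4 and 6), the same field-splitting idea extends naturally; a sketch that assumes every field is quoted exactly as in the sample and that no field contains a "," sequence of its own:
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"; OFS="\", \""} {print "\"" $1, $2, $3, $4, $6 "\""}'
output:
"first", "last", "dob", "some long sentence, it has commas in it,", "foo"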

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? And what if the row has only two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
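A quick check with GNU sed on two short tab-separated lines:
$ printf '1\t2\n3\t4\t5\n' | sed 's/\t[^\t]*//2g'
1 2
3 4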
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
or in a much simpler way in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
