bash: grep exact matches based on the first column - linux

I have a .txt file like below:
9342432_A1 9342432 1 0 0 0
4392483_A2 4392483 2 0 0 0
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
For example, I want to generate a subset with the IDs 4324321_A3 and 9342432 (based on the first column!).
I tried the following command to find the exact matches:
grep -E '4324321_A3|9342432' file
But when I use this line, I end up with a dataset like this:
9342432_A1 9342432 1 0 0 0
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
The problem is that the line that matches a part of the ID (9342432_A1) shouldn't be there.
Can anyone help me with this?
I would like to end up with this:
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0

It matches
9342432_A1 9342432 1 0 0 0
because grep searches the whole line: 9342432 occurs both as a prefix of 9342432_A1 in the first column and again in the second column.
You need to update the command so that grep only checks the beginning of the line, that is, anchor each pattern with ^. Note that ^9342432 on its own would still match 9342432_A1, so combine the anchor with -w, which requires the match to end at a word boundary:
$ grep -wE '^4324321_A3|^9342432' file
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0
The -w flag also prevents matching a line like
4324321_A3something 4324321 1 0 0 0

Your regex doesn't check whether the ID is at the start of the line. Include a ^ at the beginning of your regular expression to tell it to match only IDs at the start of the line, and group the alternatives using ():
grep -E '^(4324321_A3|9342432)\b' <file>
\b is a word-boundary assertion, which forces the pattern to match only whole words.
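Applied to the sample file (assuming it is named file, as above), this should print only the exact first-column matches:
$ grep -E '^(4324321_A3|9342432)\b' file
4324321_A3 4324321 1 0 0 0
9342432 9342432 2 0 0 0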

When you need to match on a specific field (or column) of your files, it can be better to use a tool like awk instead of grep. You can write something like this:
awk '$1 == "STRING_TO_MATCH"' txtfile.txt
This also works on columns other than the first (just use $2 for the second column, $3 for the third, and so on). awk accepts regular expressions as well, just like grep.
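For the two IDs in the question, a minimal sketch (assuming the sample file is named file) could look like either of these:
awk '$1 == "4324321_A3" || $1 == "9342432"' file
awk '$1 ~ /^(4324321_A3|9342432)$/' file
The first compares the whole first field for equality; the second anchors a regex to the full field, so a partial match like 9342432_A1 is excluded either way.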

Include a ^ at the beginning of each grep pattern, and a literal space right after the pattern, so that 9342432 cannot match as a prefix of 9342432_A1.
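A sketch of what that looks like for the sample data:
grep -E '^4324321_A3 |^9342432 ' file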

Add a start of line anchor at the beginning and a word boundary at the end of each pattern
grep -E '^4324321_A3\b|^9342432\b'

deleting part after a specific character and that character

I'm working in my command line/bash on a large file with millions of rows. I'm analyzing the data with software that requires the rsIDs to be less than 40 characters. I found the too-long IDs like this:
awk 'length($2)>40' 1000G_All_chr_merged.bim > IDtoolong.bim
head IDtoolong.bim
1 rs540674385;rs540674385;rs540674385;rs576523156 0 4439107 AAG AAGGAGG
1 rs561687032;rs546685337;rs528205989;rs370782231 0 4804685 GCACACA GCA
1 rs561021122;rs542858700;rs527502051;rs560257256;rs545143128 0 6210427 AGG GGAAT
1 rs529037702;rs561824298;rs539915961;rs528175459 0 12122415 CCCATCCAT AT
1 rs571308260;rs549871057;rs537509991;rs587738155 0 12611561 CAAA CAAAA
1 rs553093917;rs553093917;rs534535365;rs570185860 0 16657917 AAAT AAATAAT
How can I run through the second column and delete the first semicolon, ;, and anything after that?
I tried this:
awk '{sub(/;.*/,"", $2)}' 1000G_All_chr_merged.bim > adjusted_IDlength.bim
I also tried sed, but found myself ruining the file at one point. Any help is appreciated!
I'm guessing that by "ruining the file" you mean changing the white space between fields. If that's the problem, the following won't do that:
$ sed 's/;[^[:space:]]*//' file
1 rs540674385 0 4439107 AAG AAGGAGG
1 rs561687032 0 4804685 GCACACA GCA
1 rs561021122 0 6210427 AGG GGAAT
1 rs529037702 0 12122415 CCCATCCAT AT
1 rs571308260 0 12611561 CAAA CAAAA
1 rs553093917 0 16657917 AAAT AAATAAT
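As a side note, the awk attempt in the question produced no output because it has no print action. A sketch that fixes it (the trailing 1 prints each line):
awk '{sub(/;.*/, "", $2)} 1' 1000G_All_chr_merged.bim > adjusted_IDlength.bim
Be aware that assigning to $2 makes awk rebuild the record with single-space separators, which is exactly the whitespace change the sed command above avoids.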

How can I replace a specific character in a file when its position changes, in the bash command line or a script?

I have the following file:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
The character "3" that I need to change is bolded and italicized. The value of this character is dynamic, but always a single digit. I have tried a few things using sed but I can't come up with a way to account for the character changing position due to additional characters being added before that position.
This character is always at the same position from the END of the line, but not from the beginning. Meaning, the content to the left of this character may change and it may be longer, but this is always the 11th character and 6th digit from the end. It is easy to devise a way to cut it, or find it using tail, but I can't devise a way to replace it.
To be clear, the single digit character in question will always be replaced with another single digit character.
With GNU awk
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
$ gawk -i inplace -v new=9 'NF {$(NF-5) = new} 1' file
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 9 1 1 1 1 1
Where:
NF {$(NF-5) = new} means, when the line is not empty, replace the 6th-last field with the new value (9).
1 means print every record.
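If your awk doesn't have -i inplace (it's a gawk extension), the usual portable pattern is a temporary file, sketched here:
awk -v new=9 'NF {$(NF-5) = new} 1' file > tmp && mv tmp file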
awk '{ $(NF-5) = ($(NF - 5) + 8) % 10; print }'
Given your input data, it produces;
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 1 1 1 1 1 1
The 3 has been mapped to 1 (3 + 8 = 11, and 11 % 10 = 1). Pick whatever expression suits you for assigning the new value; the magic is $(NF - 5), which picks the field five positions before the last (the sixth from the end).
Would you try the following:
replace="x" # or whatever you want to replace
sed 's/\(.\)\(.\{10\}\)$/'"$replace"'\2/' file
The left portion of the sed command \(.\)\(.\{10\}\)$ matches a character, followed by ten characters, then anchored by the end of line.
Then the 1st character is replaced with the specified character and the following ten characters are reused.
I'm going to assume that the number you're looking for is the same distance from the end, regardless of what comes before it:
rev ~/test.txt | awk '{$6=<value to replace>} 1' | rev
Because rev reverses the characters of each line, this only works when the replacement is a single character. Putting the assignment in an action block (with 1 to print) also avoids the trap where using the assignment as a bare pattern would drop the line whenever the new value is 0.
Using the bash shell, which should be the last option:
rep=9
read -ra var <<< '2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'
var[-6]=$rep    # negative subscripts need bash 4.3+
printf '%s ' "${var[@]}"; echo
If it is in a file:
rep=9
read -ra var < file.txt
var[-6]=$rep
printf '%s ' "${var[@]}"; echo
Not the shortest or fastest way, but it can be done...

first column copy under empty line

Hello!
Following is my input:
RXOTG-136 VENEN6 0
          VENEN6 1
          VENEN7 0
          VENEN7 1
RXOTG-137 TIVIK6 0
          TIVIK6 1
RXOTG-138 KESTA1 0
          KESTA1 1
          KESTA2 0
          KESTA2 1
          KESTA3 0
          KESTA3 1
RXOTG-139 KESTA4 0
          KESTA4 1
To copy the first column value down onto the indented lines, I used the following command
awk 'NF==1{a=$1; next}{ print val}'
but it does not give me the required output, which is:
RXOTG-136 VENEN6 0
RXOTG-136 VENEN6 1
RXOTG-136 VENEN7 0
RXOTG-136 VENEN7 1
RXOTG-137 TIVIK6 0
RXOTG-137 TIVIK6 1
RXOTG-138 KESTA1 0
RXOTG-138 KESTA1 1
RXOTG-138 KESTA2 0
RXOTG-138 KESTA2 1
RXOTG-138 KESTA3 0
RXOTG-138 KESTA3 1
RXOTG-139 KESTA4 0
RXOTG-139 KESTA4 1
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file
You need to store the first field somewhere; the trailing 1 prints every line. The spacing will change because reassigning $1 makes awk rebuild the record, so you can pipe through column -t to line the fields up:
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file | column -t
The following simple awk may also help:
awk '!/^ /{val=$1} /^ /{$1=val OFS $1} 1' Input_file | column -t
This might work for you (GNU sed):
sed -r '1h;1b;s/^/\n/;G;:a;/\n\s(.*\n)(.)(.*\S+\s+\S+$)/s//\2\n\1\3/;ta;s/\n//;s/\n.*//;h' file
Print the first line after making a copy in the hold space. For each subsequent line, prepend a newline and append the previous line. Copy a character at a time from the previous line to the front of the current line until either there are no more spaces at the front of the current line or only two fields of the previous line remain. Then remove the introduced newlines and the remains of the previous line, copy the current line to the hold space ready for the next cycle, and print it.

How to split a column which has multiple dots using Linux command line

I have a file which looks like this:
chr10:100013403..100013414,- 0 0 0 0
chr10:100027943..100027958,- 0 0 0 0
chr10:100076685..100076699,+ 0 0 0 0
I want output to be like:
chr10 100013403 100013414 - 0 0 0 0
chr10 100027943 100027958 - 0 0 0 0
chr10 100076685 100076699 + 0 0 0 0
So, I want the first column split into tab-separated fields at the delimiters ":", "," and "..".
I have used awk -F":|," '$1=$1' OFS="\t" file to separate first column. But, I am still struggling with .. characters.
I tried awk -F":|,|.." '$1=$1' OFS="\t" file but this doesn't work.
.. should be escaped.
awk -F':|,|\\.\\.' '$1=$1' OFS="\t" file
It is important to remember that when you assign a string constant as the value of FS, it undergoes normal awk string processing. For example, with Unix awk and gawk, the assignment FS = "\.." assigns the character string .. to FS (the backslash is stripped). This creates a regexp meaning “fields are separated by occurrences of any two characters.” If instead you want fields to be separated by a literal period followed by any single character, use FS = "\\..".
https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html
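A quick sketch to see the difference from the shell (the doubled backslashes are needed because the -F value undergoes awk string processing, as described above):
$ echo 'a..b' | awk -F'\\.\\.' '{print NF, $1, $2}'
2 a b
$ echo 'a..b' | awk -F'..' '{print NF}'
3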
If your Input_file is the same as the sample shown, the following may help too:
awk '{gsub(/:|\.+|\,/,"\t");} 1' Input_file
Here gsub globally substitutes : or a run of dots (\.+ swallows both dots at once) or , with a TAB, and then 1 prints the edited/non-edited line. I hope this helps.

How to delete the first column (which is in fact row names) from a data file in Linux?

I have a data file with many thousands of columns and rows. I want to delete the first column, which is in fact the row counter. I used this command in Linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody know why it does not work and what I should do?
This is what my input file looks like:
col1 col2 col3 col4 ...
 1 0 0 0 1
 2 0 1 0 1
 3 0 1 0 0
 4 0 0 0 0
 5 0 1 1 1
 6 1 1 1 0
 7 1 0 0 0
 8 0 0 0 0
 9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row, not the first column.
Could anybody guide me?
The idiomatic use of cut would be
cut -f2- input > output
if your delimiter is tab ("\t").
Or, simply with awk magic (this works for both space and tab delimiters):
awk '{$1=""}1' input | awk '{$1=$1}1' > output
The first awk deletes field 1 but leaves a delimiter behind; the second awk removes that delimiter. The default output delimiter is space; if you want tab, add -v OFS="\t" to the second awk.
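If you'd rather avoid piping through awk twice, a single-pass sketch that prints fields 2 through NF explicitly:
awk '{for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? OFS : ORS)}' input > output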
UPDATED
Based on your updated input the problem is the initial spaces that cut treats as multiple columns. One way to address is to remove them first before feeding to cut
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above which will work in this case as well.
@karakfa I had CSV files, so I used the "," separator (you can replace it with yours):
cut -d"," -f2- input.csv > output.csv
Then, I used a loop to go over all the files inside the directory:
# files are in the directory tmp/
for f in tmp/*
do
name=`basename $f`
echo "processing file : $name"
# keep all columns except the first one of each csv file
cut -d"," -f2- $f > new/$name
# files with the same names are stored in the directory new/
done
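As a minor style note, the basename call can be replaced by shell parameter expansion, which avoids spawning a process per file; a compact sketch of the same loop:
for f in tmp/*
do
name=${f##*/}    # strip the directory part, like basename
echo "processing file : $name"
cut -d"," -f2- "$f" > "new/$name"
done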
You can use cut command with --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
As @karakfa notes, it looks like the leading whitespace is causing your issues.
Here's a sed oneliner to do the job (that will account for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any number 0 - 9
As noted, the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt
You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
This assumes single-character digit and space columns. Note that because -o prints only the match, any line that doesn't end in four digit columns (such as the header row) is dropped. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.
