How to replace parts of lines of a certain form? - linux

I have a large file where each line is of the form
b d
where b and d are numbers. I'd like to change all lines of the form
b -1
to
b 1
where b is an arbitrary number (i.e. it should remain unchanged).
For a concrete example, the file
0.2 0.5
0.1 -1
0 -1
0.3 0.6
should become
0.2 0.5
0.1 1
0 1
0.3 0.6
Is there an easy way to achieve this using, say, sed or a similar tool?
Edit. It suffices to remove all -'s from the file. Thanks to @Cyrus for this observation. That particular problem has now been solved; however, the general question of how to do replacements of this kind with a more general pattern remains open. Answers are still welcome.

Try this:
tr -d '-' < old_file > new_file
or replace all -1 in column 2 by 1:
awk '$2==-1 {$2=1} 1' old_file > new_file
or with GNU sed:
sed 's/ -1$/ 1/' old_file > new_file
If you want to edit your file with GNU sed "in place" use sed's option -i:
sed -i 's/ -1$/ 1/' file
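For example, assuming the sample above is saved as old_file, the awk variant reproduces the expected output:
$ awk '$2==-1 {$2=1} 1' old_file
0.2 0.5
0.1 1
0 1
0.3 0.6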

Related

how to convert floating number to integer in linux

I have a file that look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert it to look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0 0 0 0 0 0 0 0 0 0 0
I basically want to convert the 0.0 or 1.0 or 2.0 to 0,1,2
I tried to use this command but it doesn't give me the correct output:
cat dosage.txt | "%d\n" "$2" 2>/dev/null
Does anyone know how to do this using an awk or sed command?
Thank you.
how to convert floating number to integer in linux (...) using awk
You might use the int function of GNU AWK. Consider the following simple example; let file.csv content be
name,x,y,z
A,1.0,2.1,3.5
B,4.7,5.9,7.0
then
awk 'BEGIN{FS=OFS=","}NR==1{print;next}{for(i=2;i<=NF;i+=1){$i=int($i)};print}' file.csv
gives output
name,x,y,z
A,1,2,3
B,4,5,7
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I print the first row as-is and instruct GNU AWK to go to the next record, i.e. do nothing else for that line. For all other lines I use a for loop to apply int to the fields from the 2nd to the last; after that is done I print the altered line.
(tested in GNU Awk 5.0.1)
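If you prefer the printf-style "%d" that the question attempted, a roughly equivalent sketch (assuming the same comma-separated file.csv as above) uses sprintf instead of int; for this data the output is the same:
awk 'BEGIN{FS=OFS=","} NR==1{print; next} {for(i=2;i<=NF;i+=1) $i=sprintf("%d",$i); print}' file.csv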
This might work for you (GNU sed):
sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta' file
Presuming you want to remove the period and trailing digits from all floating-point numbers (where n.n represents a minimal example of such a number):
Match a space or start-of-line, followed by one or more digits, a period, and one or more digits, followed by a space or end-of-line, and remove the period and the digits following it. Do this for all such numbers throughout the file (globally).
N.B. The substitution must be performed twice (hence the loop) because the trailing space of one floating point number may overlap with the leading space of another. The ta command is enacted when the previous substitution is true and causes sed to branch to the a label at the start of the sed cycle.
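A quick way to see the overlap problem (GNU sed, using a made-up three-number line): without the loop the first pass misses 2.5, because its leading space was already consumed by the previous match, while the looped version catches it:
$ echo '1.5 2.5 3.5' | sed -E 's/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g'
1 2.5 3
$ echo '1.5 2.5 3.5' | sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta'
1 2 3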
Maybe this will help. This regex captures the whole-number part in a group and removes the rest. Regexes can often be fooled by unexpected input, so make sure that you test this against all forms of input data, as I did (partially) for this example.
echo 1234.5 345 a.2 g43.3 546.0 234. hi | sed 's/\b\([0-9]\+\)\.[0-9]\+/\1/g'
outputs
1234 345 a.2 g43.3 546 234. hi
Note that this was tested with GNU sed (standard on Linux), so it should not be assumed to work on systems that use a different sed (like FreeBSD).

Bash script: filter columns based on a character

My text file should be of two columns separated by a tab-space (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the 2nd value present after the space in column 1; e.g. for C\sx\t3 I can discard the x that is present after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=($(cut -d$'\t' -f1 "$file" | cut -d' ' -f1))
col2=($(cut -d$'\t' -f2 "$file"))
myArr=()
for ((idx=0; idx<${#col1[@]}; idx++)); do
  echo "${col1[$idx]} ${col2[$idx]}"
  # I will append to myArr here
done
The output appends the list of col2 to col1 as A B C D E 1 2 3 4 5. On top of this, my file is very large, i.e. 5,300,000 rows, so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search and replace any literal space followed by any number of non-TAB-characters with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
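For example, fed two of the sample rows with real tabs (via printf), the corrupted row is trimmed and the tab separator is preserved:
$ printf 'A\t1\nC x\t3\n' | awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1'
A 1
C 3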
Solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several differing versions of sed):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This code captures the non-whitespace characters at the front of the line and the non-whitespace characters after the \t.
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
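A quick check (assuming GNU cat, whose -A option shows tabs as ^I) confirms that $'...' turns \t into a real tab character:
$ printf '%s\n' $'a\tb' | cat -A
a^Ib$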
take advantage of FS and OFS and let them do all the hard work for you
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Swapping the first word with itself 3 times only if there are 4 words only using sed

Hi, I'm trying to solve a problem using only sed commands and without using a pipeline. But I am allowed to redirect the result of a sed command to a file or to read from a file.
EX:
sed s/dog/cat/ >| tmp
or
sed s/dog/cat/ < tmp
Anyway, let's say I had a file F1 whose contents were:
Hello hi 123
if a equals b
you
one abc two three four
dany uri four 123
The output should be:
if if if a equals b
dany dany dany uri four 123
Explanation: the program must only print lines that have exactly 4 words and when it prints them it must print the first word of the line 3 times.
I've tried doing commands like this:
sed '/[^ ]*.[^ ]*.[^ ]*/s/[^ ]\+/& & &/' F1
or
sed 's/[^ ]\+/& & &/' F1
but I can't figure out how I can check with sed that there are exactly 4 words in a line.
Any help will be appreciated.
$ sed -En 's/^([^[:space:]]+)([[:space:]]+[^[:space:]]+){3}$/\1 \1 &/p' file
if if if a equals b
dany dany dany uri four 123
The above uses a sed that supports EREs with a -E option (e.g. GNU and OSX seds).
If the fields are blank (space or tab) separated:
sed 'h;s/[^[:blank:]]//g;s/[[:blank:]]\{3\}//;/^$/!d;x;s/\([^[:blank:]]*[[:blank:]]\)/\1\1\1/' infile
This copies the line to the hold space, strips everything except the separators, deletes the line unless exactly three separators (i.e. four words) remain, then swaps the original line back and triples its first word.

How to remove odd lines except for first line using SED or AWK

I have the following file
# header1 header2
zzzz yyyy
1
kkkkk wwww
2
What I want to do is to remove odd lines except the header
yielding:
# header1 header2
zzzz yyyy
kkkkk wwww
I tried this but it removes the header too
awk 'NR%2==0'
What's the right way to do it?
This works with GNU sed:
sed '3~2d' ip.txt
This deletes the 3rd line and then every 2nd line after it (5th, 7th, and so on).
Example:
$ seq 10 | sed '3~2d'
1
2
4
6
8
10
awk 'NR==1 || NR%2==0'
If the record number is 1 or is even, print it.
awk 'NR % 2 == 0 || NR == 1'
Reversing the comparisons might be marginally faster. The difference probably isn't measurable. (And the choice of spacing is essentially immaterial too.)
You just need
awk 'NR==1 || NR%2==0' file
This keeps the header of the file intact and applies the rule NR%2==0, which is true only for even-numbered lines, in which case the line is printed.
Another variant of the same above answer
awk 'NR==1 || !(NR%2)' file
For even lines NR%2 becomes 0, and its negation becomes a true condition, so the line is printed.
sed '1!{N;P;d}'
1! On lines other than the first (the default behavior echoes the first line)
N append the next line to the current line
P print only the first of the two
d delete them both.
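For instance, applied to seq 5 it keeps the first line and the even-numbered lines:
$ seq 5 | sed '1!{N;P;d}'
1
2
4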
This might work for you (GNU sed):
sed '1b;n;d' file
(Leave the first line untouched; thereafter n prints one line and d deletes the next.) But:
sed '3~2d' file
is far neater.

How to use a Linux command (sed?) to delete specific lines in a file?

I have a file that contains a matrix. For example, I have:
1 a 2 b
2 b 5 b
3 d 4 b
4 b 7 b
I know it's easy to use the sed command to delete specific lines containing specific strings. But what if I only want to delete those lines where the second field's value is b (i.e., the second and fourth lines)?
You can use a regex in sed:
sed -i 's/^[0-9]\s\+b.*//g' xxx_file
or
sed -i '/^[0-9]\s\+b.*/d' xxx_file
(Note that the first form only empties the matching lines, while the second actually deletes them.) The -i option modifies the file's content in place; you can remove -i and redirect the output to another file if you want.
Awk works just fine here; use code like this:
awk '{if ($2 != "b") print $0;}' file
If you want to learn more about awk usage, just man it!
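For example, assuming the matrix above is saved as file, only the rows whose second field is not b survive:
$ awk '{if ($2 != "b") print $0;}' file
1 a 2 b
3 d 4 b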
awk:
cat yourfile.txt | awk '{if($2!="b"){print;}}'
