How to convert a floating point number to an integer in Linux

I have a file that looks like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert it to look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0 0 0 0 0 0 0 0 0 0 0
I basically want to convert the 0.0 or 1.0 or 2.0 to 0, 1, 2.
I tried to use this command, but it doesn't give me the correct output:
cat dosage.txt | "%d\n" "$2" 2>/dev/null
Does anyone know how to do this using an awk or sed command?
Thank you.

How to convert a floating point number to an integer in Linux (...) using awk
You might use the int function of GNU AWK. Consider the following simple example: let file.csv content be
name,x,y,z
A,1.0,2.1,3.5
B,4.7,5.9,7.0
then
awk 'BEGIN{FS=OFS=","}NR==1{print;next}{for(i=2;i<=NF;i+=1){$i=int($i)};print}' file.csv
gives output
name,x,y,z
A,1,2,3
B,4,5,7
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I print the first row as-is and instruct GNU AWK to go to the next line, i.e. do nothing else for that line. For all but the first line I use a for loop to apply int to the fields from the 2nd to the last; after that is done I print the altered line.
(tested in GNU Awk 5.0.1)

This might work for you (GNU sed):
sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta' file
Presuming you want to remove the period and trailing digits from all floating point numbers (where n.n represents a minimum example of such a number).
Match a space or start-of-line, followed by one or more digits, followed by a period, followed by one or more digits, followed by a space or end-of-line, and remove the period and the digits following it. Do this for all such numbers throughout the file (globally).
N.B. The substitution must be performed twice (hence the loop) because the trailing space of one floating point number may overlap with the leading space of another. The ta command is enacted when the previous substitution is true and causes sed to branch to the a label at the start of the sed cycle.
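To see why the loop matters, here is a quick hypothetical demonstration (GNU sed; the sample numbers are made up): a single pass skips every other number, because each match consumes the trailing space that the next number needs as its leading delimiter.

```shell
# Single pass: 2.0 survives, because the match on "1.0 " consumed the space before it
echo '1.0 2.0 3.0' | sed -E 's/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g'
# 1 2.0 3

# With the :a;...;ta loop the substitution is retried until nothing changes
echo '1.0 2.0 3.0' | sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta'
# 1 2 3
```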

Maybe this will help. This regex captures the integer part in a group and removes the rest. Regexes can often be fooled by unexpected input, so make sure that you test this against all forms of input data, as I did (partially) for this example.
echo 1234.5 345 a.2 g43.3 546.0 234. hi | sed 's/\b\([0-9]\+\)\.[0-9]\+/\1/g'
outputs
1234 345 a.2 g43.3 546 234. hi
It is important to note that this was based on GNU sed (standard on Linux), so it should not be assumed to work on systems that use a different sed (like on FreeBSD).

Related

How do I use grep to get numbers larger than 50 from a txt file

I am relatively new to grep and unix. I am trying to get the names of people who have won more than 50 races from a txt file. So far the code I have used is cat file.txt|grep -E "[5-9][0-9]$", but this is only giving me numbers from 50-99. How could I get it from 50 to 200? Thank you!!
driver races wins
Some_Man 90 160
Some_Man 10 80
The above is similar to the format of the data, although it is not tabulated.
Do you have to use grep? You could use awk like this:
awk '{if($N>50) print $2}' < file.txt
(replace N with the number of the wins field). This assumes your fields are delimited by spaces; otherwise you can use the -F flag to specify the delimiter.
If you must use grep, then it's a regular expression like the one you used. To cover 50 to 200 inclusive you can do:
grep -E "\b([5-9][0-9]|1[0-9][0-9]|200)$" file.txt
Input:
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
4 [Finland] Kimi_Raikkonen 326 21
5 [Germany] Nico_Rosberg 200 23
Awk would be a better candidate for this:
awk '$4>=50 && $4<=200 { print $0 }' file
Check whether the fourth space-delimited field ($4 - change this to whatever field number it actually is) is both greater than or equal to 50 and less than or equal to 200, and print the line ($0) if the condition is met.
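Applied to the sample input above, where wins happen to sit in the fifth field (so $5 replaces $4), that looks like this; the header line is excluded because "Wins" fails the numeric range test:

```shell
# Sample data from the question (wins are in field 5 here)
cat > drivers.txt <<'EOF'
Rank Country Driver Races Wins
1 [United_Kingdom] Lewis_Hamilton 264 94
2 [Germany] Sebastian_Vettel 254 53
3 [Spain] Fernando_Alonso 311 32
EOF
awk '$5>=50 && $5<=200 { print $0 }' drivers.txt
# 1 [United_Kingdom] Lewis_Hamilton 264 94
# 2 [Germany] Sebastian_Vettel 254 53
```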

Deleting the first row and replacing the first column data with the given data

I have directories named folder1, folder2, folder3 ... folder10. Inside each directory there are many text files with different names, and each text file contains a text row at the top and two columns of data that look like this:
data_example1 textfile_1000_manygthe>written
20.0 0.53
30.0 2.56
40.0 2.26
45.59 1.28
50.24 1.95
data_example2 textfile_1002_man_gth<>writ
10.0 1.28
45.0 2.58
48.0 1.58
12.0 2.69
69.0 1.59
Then what I want to do is: first, completely remove the first row of all text files present inside the directories folder1, folder2, folder3 ... folder10. After that, replace the first-column values of every text file with the values saved in a separate text file named "replace_first_column.txt", whose content looks as below, and finally save all files under their original names inside the same directories.
10.0
20.0
30.0
40.0
50.0
I tried the code below, but it does not work. I hope I will get some solutions. Thanks.
#!/bin/sh
for file in folder1,folder2....folder10
do
sed -1 $files
done
for file in folder1,folder2....folder10
do
replace ##i dont know here replace first column
Something like this should do it using bash (for {1..10}), GNU find (for + at the end) and GNU awk (for -i inplace):
#!/usr/bin/env bash
find folder{1..10} -type f -exec gawk -i inplace '
{ print FILENAME, $0 | "cat>&2" } # for tracing, see comments
NR==FNR { a[NR+1]=$1; print; next }
FNR>1 { $1=a[FNR]; print }
' /wherever/replace_first_column.txt {} +
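Here is a stripped-down sketch of the same two-file awk idea, without find or in-place editing, so the effect is visible on stdout (file names and sample values are hypothetical):

```shell
# Hypothetical sample inputs
printf '10.0\n20.0\n' > replace_first_column.txt
printf 'data_example1 header\n20.0 0.53\n30.0 2.56\n' > sample.txt

# First pass (NR==FNR): store replacement values shifted by one line, so
# a[2] holds the value for data line 2, a[3] for line 3, and so on.
# Second pass (FNR>1): drop the header row and swap in the stored values.
awk 'NR==FNR{a[NR+1]=$1; next} FNR>1{$1=a[FNR]; print}' replace_first_column.txt sample.txt
# 10.0 0.53
# 20.0 2.56
```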

How can a Linux cpuset be iterated over with shell?

Linux's sys filesystem represents sets of CPU ids with the syntax:
0,2,8: Set of CPUs containing 0, 2 and 8.
4-6: Set of CPUs containing 4, 5 and 6.
Both syntaxes can be mixed and matched, for example: 0,2,4-6,8
For example, running cat /sys/devices/system/cpu/online prints 0-3 on my machine which means CPUs 0, 1, 2 and 3 are online.
The problem is the above syntax is difficult to iterate over using a for loop in a shell script. How can the above syntax be converted to one more conventional such as 0 2 4 5 6 8?
Try:
$ echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' ORS=' ' RS=, FS=-
0 2 4 5 6 8
This can be used in a loop as follows:
for n in $(echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=-)
do
echo cpu="$n"
done
Which produces the output:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
Or like:
printf "%s" 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=- | while read n
do
echo cpu="$n"
done
Which also produces:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
How it works
The awk command works as follows:
RS=,
This tells awk to use , as the record separator.
If, for example, the input is 0,2,4-6,8, then awk will see four records: 0 and 2 and 4-6 and 8.
FS=-
This tells awk to use - as the field separator.
With FS set this way and if, for example, the input record consists of 2-4, then awk will see 2 as the first field and 4 as the second field.
/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next}
For any record that contains -, we print out each number starting with the value of the first field, $1, and ending with the value of the second field, $2. Each such number is followed by the Output Record Separator, ORS. By default, ORS is a newline character. For some of the examples above, we set ORS to a blank.
After we have printed these numbers, we skip the rest of the commands and jump to the next record.
1
If we get here, then the record did not contain - and we print it out as is. 1 is awk's shorthand for print-the-line.
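A quick way to see how this record and field splitting works is to print what awk sees for each record (using printf so no trailing newline ends up inside the last record):

```shell
printf '0,2,4-6,8' | awk '{print "record " NR ": $1=" $1 " $2=" $2}' RS=, FS=-
# record 1: $1=0 $2=
# record 2: $1=2 $2=
# record 3: $1=4 $2=6
# record 4: $1=8 $2=
```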
A Perl one:
echo "0,2,4-6,8" | perl -lpe 's/(\d+)-(\d+)/{$1..$2}/g; $_="echo {$_}"' | bash
Just convert the original string into echo {0,2,{4..6},8} and let bash's brace expansion interpolate it.
eval echo $(cat /sys/devices/system/cpu/online | sed 's/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g' | tr , ' ')
Explanation:
cat /sys/devices/system/cpu/online reads the file from sysfs. This can be changed to any other file such as offline.
The output is piped through the substitution s/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g. This matches something like 4-6 and replaces it with $(seq 4 6).
tr , ' ' replaces all commas with spaces.
At this point, the input 0,2,4-6,8 is transformed to 0 2 $(seq 4 6) 8. The final step is to eval this sequence to get 0 2 4 5 6 8.
The example echoes the output. Alternatively, it can be written to a variable or used in a for loop.

How to replace parts of lines of certain form?

I have a large file where each line is of the form
b d
where b and d are numbers. I'd like to change all lines of the form
b -1
to
b 1
where b is an arbitrary number (i.e. it should remain unchanged).
For a concrete example, the file
0.2 0.5
0.1 -1
0 -1
0.3 0.6
should become
0.2 0.5
0.1 1
0 1
0.3 0.6
Is there an easy way to achieve this using, say, sed or a similar tool?
Edit: It suffices to remove all -'s from the file. Thanks to @Cyrus for this observation. This particular problem has now been solved; however, the general question of how to perform replacements of this kind with a more general pattern remains open. Answers are still welcome.
Try this:
tr -d '-' < old_file > new_file
or replace all -1 in column 2 by 1:
awk '$2==-1 {$2=1} 1' old_file > new_file
or with GNU sed:
sed 's/ -1$/ 1/' old_file > new_file
If you want to edit your file with GNU sed "in place" use sed's option -i:
sed -i 's/ -1$/ 1/' file

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - lazily match any character ('.') as few times as possible ('*?'), i.e. only as many characters as needed before the next token in the regex (\d in our case) can match.
\d+:\d+ - match at least one digit followed by colon and another number
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first parenthesized group will be known as $1, the second as $2, etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output, append s/\t$// or s/'$'\t''$// respectively to remove it.
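Combining the extraction and the trailing-tab removal into one GNU sed call, with the question's first sample line as input:

```shell
echo 'Sequences (1:4) Aligned. Score: 4' | sed 's/[^0-9]*\([0-9]*\)/\1\t/g; s/\t$//'
# prints 1, 4 and 4, separated by tabs
```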
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)/\1 \2 \3/g' file.txt
