splitting a file based on repetition - linux

Experts, I have a file, shown below, where the first column repeats through the values 0.0, 5.0, 10.0. I want to split the third column at each repetition of the first column and arrange the data side by side:
0.0 0.0 0.50000E+00
5.0 0.0 0.80000E+00
10.0 0.0 0.80000E+00
0.0 1.0 0.10000E+00
5.0 1.0 0.90000E+00
10.0 1.0 0.30000E+00
0.0 2.0 0.90000E+00
5.0 2.0 0.50000E+00
10.0 2.0 0.60000E+00
so that my final file will be
0.50000E+00 0.10000E+00 0.90000E+00
0.80000E+00 0.90000E+00 0.50000E+00
0.80000E+00 0.30000E+00 0.60000E+00

Using GNU awk:
awk '{ map[$1][NR]=$3 } END { PROCINFO["sorted_in"]="@ind_num_asc"; for (i in map) { for (j in map[i]) printf "%s\t", map[i][j]; printf "\n" } }' file
Process each line and add it to a two-dimensional array map, with the first space-delimited field as the first index and the line number as the second. The third field is the value. At the end of processing the file, set the array ordering and then loop through the array, printing the values in the required format.
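Run against the sample input above, this should print the desired three rows, one tab-separated row per distinct first-column value:
0.50000E+00	0.10000E+00	0.90000E+00
0.80000E+00	0.90000E+00	0.50000E+00
0.80000E+00	0.30000E+00	0.60000E+00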

how to convert a floating point number to an integer in linux

I have a file that look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert it to look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0 0 0 0 0 0 0 0 0 0 0
I basically want to convert the 0.0 or 1.0 or 2.0 to 0, 1, 2.
I tried to use this command, but it doesn't give me the correct output:
cat dosage.txt | "%d\n" "$2" 2>/dev/null
Does anyone know how to do this using an awk or sed command?
Thank you.
how to convert a floating point number to an integer in linux (...) using awk
You might use the int function of GNU AWK. Consider the following simple example: let file.csv content be
name,x,y,z
A,1.0,2.1,3.5
B,4.7,5.9,7.0
then
awk 'BEGIN{FS=OFS=","}NR==1{print;next}{for(i=2;i<=NF;i+=1){$i=int($i)};print}' file.csv
gives output
name,x,y,z
A,1,2,3
B,4,5,7
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I print the first row as-is and instruct GNU AWK to go to the next line, i.e. do nothing else for that line. For all but the first line, I use a for loop to apply int to the fields from the 2nd through the last; after that is done, I print the altered line.
(tested in GNU Awk 5.0.1)
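Worth noting, though it does not matter for non-negative dosage values: int truncates toward zero rather than rounding, e.g.
echo "-1.9 2.7" | awk '{ print int($1), int($2) }'
gives output
-1 2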
This might work for you (GNU sed):
sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta' file
Presuming you want to remove the period and trailing digits from all floating point numbers (where n.n represents a minimal example of such a number):
Match a space or start-of-line, followed by one or more digits, a period, and one or more digits, followed by a space or end-of-line, and remove the period and the digits following it. Do this for all such numbers throughout the file (globally).
N.B. The substitution must be performed twice (hence the loop) because the trailing space of one floating point number may overlap with the leading space of the next. The ta command is enacted when the previous substitution succeeded and causes sed to branch back to the a label at the start of the sed cycle.
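For example, applied to a line shaped like the question's data (assuming GNU sed):
echo '22 20012563 T C 0.0 1.0 2.0' | sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta'
gives output
22 20012563 T C 0 1 2
Note the purely integer fields (22, 20012563) are left untouched.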
Maybe this will help. This regex captures the whole-number part in a group and removes the rest. Regexes can often be fooled by unexpected input, so make sure that you test this against all forms of input data, as I did (partially) for this example.
echo 1234.5 345 a.2 g43.3 546.0 234. hi | sed 's/\b\([0-9]\+\)\.[0-9]\+/\1/g'
outputs
1234 345 a.2 g43.3 546 234. hi
It is important to note that this is based on GNU sed (standard on Linux), so it should not be assumed to work on systems that ship a different sed (like FreeBSD).
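If you cannot rely on GNU sed, a field-based awk sketch behaves the same way on this sample, converting only fields that are pure n.n floats:
echo 1234.5 345 a.2 g43.3 546.0 234. hi | awk '{ for (i=1; i<=NF; i++) if ($i ~ /^[0-9]+\.[0-9]+$/) sub(/\..*/, "", $i) } 1'
outputs
1234 345 a.2 g43.3 546 234. hi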

pasting files side by side

I have many ASCII files in a directory. I just want to sort the file names numerically and then paste the files side by side.
Secondly, after pasting, I want to make all the columns the same length by appending zeros at the end.
My files are named as
data_Z_1 data_N_457 data_E_45
1.5 1.2 2.3
2.0 2.3 1.8
4.5
At first I just want to sort the above file names numerically as given below, and then paste them side by side as
data_Z_1 data_E_45 data_N_457
1.5 2.3 1.2
2.0 1.8 2.3
4.5
Secondly, I need to make all the columns equal length in the pasted file, so that the output looks like
1.5 2.3 1.2
2.0 1.8 2.3
0.0 0.0 4.5
I tried as below:
ls data_*_* | sort -V
But it does not work. Can anybody help me overcome this problem? Thanks in advance.
Would you please try the following:
paste $(ls data* | sort -t_ -k3n) | awk -F'\t' -v OFS='\t' '
{for (i=1; i<=NF; i++) if ($i == "") $i = "0.0"} 1'
Output:
1.5 2.3 1.2
2.0 1.8 2.3
0.0 0.0 4.5
sort -t_ -k3n sets the field separator to _ and numerically sorts
the filenames on the 3rd field values.
The options -F'\t' -v OFS='\t' to the awk command assign
input/output field separator to a tab character.
The awk statement for (i=1; i<=NF; i++) if ($i == "") $i = "0.0"
scans the input fields and sets 0.0 for the empty fields.
The final 1 is equivalent to print $0 to print the fields.
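To see why the awk stage is needed: for the three sample files, the bare paste output would have empty cells in the last row (tab-separated):
1.5	2.3	1.2
2.0	1.8	2.3
		4.5
The awk loop then rewrites those empty fields as 0.0.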
[Edit]
If you have a huge number of files, the expanded command line may exceed the shell's limits. Here is an alternative in python using pandas dataframes.
#!/usr/bin/python
import glob
import pandas as pd
import re
files = glob.glob('data*')
files.sort(key=lambda x: int(re.sub(r'.*_', '', x)))  # sort filenames numerically by their trailing number
dfs = []  # list of dataframes
for f in files:
    df = pd.read_csv(f, header=None, names=[f])    # read file and name its column after the file
    df = df.apply(pd.to_numeric, errors='coerce')  # force the cell values to floats
    dfs.append(df)                                 # add as a new column
df = pd.concat(dfs, axis=1, join='outer') # create a dataframe from the list of dataframes
df = df.fillna(0) # fill empty cells
print(df.to_string(index=False, header=False)) # print the dataframe removing index and header
which will produce the same results.

Deleting the first row and replacing the first column data with the given data

I have directories named folder1, folder2, folder3, ..., folder10. Inside each directory there are many text files with different names, and each text file contains a text row at the top and two columns of data that look like this:
data_example1 textfile_1000_manygthe>written
20.0 0.53
30.0 2.56
40.0 2.26
45.59 1.28
50.24 1.95
data_example2 textfile_1002_man_gth<>writ
10.0 1.28
45.0 2.58
48.0 1.58
12.0 2.69
69.0 1.59
What I want to do is: first, completely remove the top row of every text file inside the directories folder1, folder2, ..., folder10. After that, I want to replace the first-column values of each text file with the values saved in a separate single text file named "replace_first_column.txt", and finally save every file under its original name in the same directory. The content of replace_first_column.txt looks like this:
10.0
20.0
30.0
40.0
50.0
I tried the code below, but it does not work. I hope I will get some solutions. Thanks.
#!/bin/sh
for file in folder1,folder2....folder10
do
sed -1 $files
done
for file in folder1,folder2....folder10
do
replace ##i dont know here replace first column
Something like this should do it using bash (for {1..10}), GNU find (for + at the end) and GNU awk (for -i inplace):
#!/usr/bin/env bash
find folder{1..10} -type f -exec gawk -i inplace '
{ print FILENAME, $0 | "cat>&2" } # for tracing, see comments
NR==FNR { a[NR+1]=$1; print; next }
FNR>1 { $1=a[FNR]; print }
' /wherever/replace_first_column.txt {} +
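Since -i inplace rewrites the files as gawk goes, you might dry-run first by dropping the inplace option and the tracing line, so the would-be contents just go to stdout (a sketch; the replacement-file path is the same placeholder as above):
find folder{1..10} -type f -exec gawk '
NR==FNR { a[NR+1]=$1; next }
FNR>1   { $1=a[FNR]; print }
' /wherever/replace_first_column.txt {} +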

Read line range from a file and find largest value within the range in another file

I'm looking to extract the largest value from a range of line numbers in a file, with the range being read from another file.
Define three files:
position_file: Containing two columns of integers defining a range of line numbers so col1[i] < col2[i]
full_data_file: Containing a single column of numerical data (>=0)
extracted_data_file: Containing for each line in position_file the largest value in full_data_file where the line number in full_data_file falls within the range defined in position_file
cat position_file
1 3
5 7
cat full_data_file
1
4.3
5.2
2.0
0.1
0
4
9
cat extracted_data_file
5.2
4
My current way of doing this is
while read pos1 pos2; do
awk -v p1="$pos1" -v p2="$pos2" 'BEGIN {max=0} NR>=p1 && NR<=p2 && $1>max {max=$1} END {print max}' < full_data_file >> extracted_data_file
done < position_file
This is not a good way because I repeatedly scan full_data_file from the top, which is very slow. I'm looking for a way to do this in a single pass. I'm not very accomplished at using arrays in awk, but I imagine the solution will probably (though not necessarily) use them.
Thank you very much for your help.
You may use this awk:
awk 'FNR==NR { a[FNR]=$1; next }   # 1st file: store each value by its line number
     { max=a[$1]                   # 2nd file: seed max with the value at the range start
       for (i=$1+1; i<=$2; i++) if (a[i]>max) max=a[i]
       print max }' full_data_file position_file > extracted_data_file
cat extracted_data_file
5.2
4

Cleaning up CSV (artifacts and lack of spacing)

My data was in this form, where you can see that the 3rd column (2nd if you start with 0) touches the one before it when its values rise to the next order of magnitude, as well as artifacts in the last column that come from non-data input being recorded.
17:10:39 2.039 26.84 4.6371E-9 -0.7$R200$O100
17:10:41 2.082 27.04 4.6334E-9 -0.4
17:10:43 1.980 26.97 4.6461E-9 0.3
17:10:45 2.031 26.87 4.6502E-9 1.0$R200
17:10:47 2.090 27.09 4.6296E-9 0.1
...
18:49:40 1.930226.34 2.8246E-5 7.1
18:49:42 2.031226.04 2.8264E-5 8.2
Now, I did fix this all by hand by adding a "|" delimiter instead of " " and cutting away the few artifacts, but it was a pain.
So, with the prospect of getting even larger data sets from the same machine in the future: are there any tips on how to write a script in python, or are there linux-based tools out there already, to fix this csv / make a new fixed csv out of it?
In linux shell:
cut -c 1-14 data.csv > DataA
cut -c 15-49 data.csv > DataB
paste DataA DataB | tr -s " " "\t" > DataC
This cuts the csv into two parts, with the split placed where the columns touch; in the second part we also cut away the unwanted trailing artifacts.
Then we paste the parts together and squeeze the runs of spaces into tabs (paste itself joins them with a tab).
In case we'd rather stick with the "|" delimiter, the next step could be
cat DataC | tr -s "\t" "|" > DataFinal
rm DataA DataB DataC
But this is purely optional
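Assuming the spacing shown in the sample, the problematic touching line should come out of DataC as a clean tab-separated record:
18:49:40	1.930	226.34	2.8246E-5	7.1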
The data you are showing is not csv (or dsv), but plain text data with fixed field widths. Trying to read this as csv will be error-prone.
Instead this data should be processed as fixed width with the following field widths:
8 / 6 / 6 / 10 (or 11) / 8 (or 7) / rest of line
See this question on how to parse fixed width fields in Python.
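If you want to stay in the shell instead of Python, GNU awk can also split fixed-width records via its FIELDWIDTHS extension. A minimal sketch, assuming the widths suggested above (taking the 11/8 variants) and the asker's "|" delimiter; the $R.../$O... suffixes land in the fifth field and are stripped there:
gawk 'BEGIN { FIELDWIDTHS = "8 6 6 11 8"; OFS = "|" }
      { sub(/\$.*/, "", $5)                                # drop the $R200/$O100 artifacts
        for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)  # trim the space padding
        print $1, $2, $3, $4, $5 }' data.csv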
