Deleting the first row and replacing the first column data with the given data - linux

I have directories named folder1, folder2, folder3, ..., folder10. Inside each directory there are many text files with different names, and each text file contains a text row at the top and two columns of data, as shown below:
data_example1 textfile_1000_manygthe>written
20.0 0.53
30.0 2.56
40.0 2.26
45.59 1.28
50.24 1.95
data_example2 textfile_1002_man_gth<>writ
10.0 1.28
45.0 2.58
48.0 1.58
12.0 2.69
69.0 1.59
What I want to do is: first, completely remove the first row of every text file inside the directories folder1, folder2, folder3, ..., folder10. After that, whatever the values in the first column of each text file are, I want to replace them with the values saved in a single separate text file named "replace_first_column.txt", whose contents look like the example below, and finally save every file under its original name inside the same directory.
10.0
20.0
30.0
40.0
50.0
I tried the code below, but it does not work. I hope I will get some solutions. Thanks.
#!/bin/sh
for file in folder1,folder2....folder10
do
sed -1 $files
done
for file in folder1,folder2....folder10
do
replace ##i dont know here replace first column

Something like this should do it using bash (for {1..10}), GNU find (for + at the end) and GNU awk (for -i inplace):
#!/usr/bin/env bash
find folder{1..10} -type f -exec gawk -i inplace '
{ print FILENAME, $0 | "cat>&2" } # optional: trace each input line to stderr
NR==FNR { a[NR+1]=$1; print; next }
FNR>1 { $1=a[FNR]; print }
' /wherever/replace_first_column.txt {} +
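If GNU awk's -i inplace option is not available, a rough POSIX-shell sketch of the same idea (rewriting each file through a temporary file, and assuming replace_first_column.txt sits in the current directory) could look like this:
#!/bin/sh
# Sketch only: assumes replace_first_column.txt is in the current directory
# and that every regular file under folder1..folder10 should be rewritten.
for dir in folder1 folder2 folder3 folder4 folder5 folder6 folder7 folder8 folder9 folder10; do
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        awk 'NR==FNR { a[NR+1]=$1; next }     # remember the replacement values
             FNR>1   { $1=a[FNR]; print }     # drop row 1, swap column 1
            ' replace_first_column.txt "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
done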

Related

how to convert floating number to integer in linux

I have a file that looks like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert it to look like this:
#[1]CHROM [2]POS [3]REF [4]ALT [5]GTEX-1117F_GTEX-1117F [6]GTEX-111CU_GTEX-111CU [7]GTEX-111FC_GTEX-111FC [8]GTEX-111VG_GTEX-111VG [9]GTEX-111YS_GTEX-111YS [10]GTEX-ZZPU_GTEX-ZZPU
22 20012563 T C 0 0 0 0 0 0 0 0 0 0 0
I basically want to convert the 0.0 or 1.0 or 2.0 to 0,1,2
I tried to use this command but it doesn't give me the correct output:
cat dosage.txt | "%d\n" "$2" 2>/dev/null
Does anyone know how to do this using an awk or sed command?
Thank you.
how to convert floating number to integer in linux(...)using awk
You might use int function of GNU AWK, consider following simple example, let file.csv content be
name,x,y,z
A,1.0,2.1,3.5
B,4.7,5.9,7.0
then
awk 'BEGIN{FS=OFS=","}NR==1{print;next}{for(i=2;i<=NF;i+=1){$i=int($i)};print}' file.csv
gives output
name,x,y,z
A,1,2,3
B,4,5,7
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). I print the first row as-is and instruct GNU AWK to go to the next line, i.e. do nothing else for that line. For all but the first line I use a for loop to apply int to fields from the 2nd to the last; after that is done I print the altered line.
(tested in GNU Awk 5.0.1)
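Applied to the whitespace-delimited dosage.txt from the question (a sketch, assuming the header is the line starting with # and the dosage values begin in field 5), the same idea would be:
awk '/^#/ { print; next } { for (i = 5; i <= NF; i++) $i = int($i); print }' dosage.txt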
This might work for you (GNU sed):
sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta' file
Presuming you want to remove the period and trailing digits from all floating point numbers (where n.n represents a minimum example of such a number).
Match a space or start-of-line, followed by one or more digits, followed by a period, followed by one or more digits, followed by a space or end-of-line and remove the period and the digits following it. Do this for all such numbers through out the file (globally).
N.B. The substitution must be performed twice (hence the loop) because the trailing space of one floating point number may overlap with the leading space of another. The ta command is enacted when the previous substitution is true and causes sed to branch to the a label at the start of the sed cycle.
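A small demonstration of that overlap (the sample line here is made up for illustration):
$ echo '1.5 2.5 3.5' | sed -E 's/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g'
1 2.5 3
$ echo '1.5 2.5 3.5' | sed -E ':a;s/((\s|^)[0-9]+)\.[0-9]+(\s|$)/\1\3/g;ta'
1 2 3
The single pass misses 2.5 because its surrounding spaces were already consumed by the neighbouring matches; the looped version catches it on the second pass.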
Maybe this will help. This regex saves the whole-number part in a group and removes the rest. Regexes can often be fooled by unexpected input, so make sure you test this against all forms of input data, as I did (partially) for this example.
echo 1234.5 345 a.2 g43.3 546.0 234. hi | sed 's/\b\([0-9]\+\)\.[0-9]\+/\1/g'
outputs
1234 345 a.2 g43.3 546 234. hi
It is important to note that this is based on GNU sed (standard on Linux), so it should not be assumed to work on systems that use an older sed (like on FreeBSD).

splitting the file based on repetition

Experts, I have a file as below where the first column repeats 0.0, 5.0, 10.0. Now I want to split the third column at each repetition of the first column and arrange the data side by side, as below:
0.0 0.0 0.50000E+00
5.0 0.0 0.80000E+00
10.0 0.0 0.80000E+00
0.0 1.0 0.10000E+00
5.0 1.0 0.90000E+00
10.0 1.0 0.30000E+00
0.0 2.0 0.90000E+00
5.0 2.0 0.50000E+00
10.0 2.0 0.60000E+00
so that my final file will be
0.50000E+00 0.10000E+00 0.90000E+00
0.80000E+00 0.90000E+00 0.50000E+00
0.80000E+00 0.30000E+00 0.60000E+00
Using GNU awk:
awk '{ map[$1][NR]=$3 } END { PROCINFO["sorted_in"]="#ind_num_asc";for(i in map) { for ( j in map[i]) { printf "%s\t",map[i][j] } printf "\n" } }' file
Process each line and add it to a two-dimensional array map, with the first space-delimited field as the first index and the line number as the second; the third field is the value. At the end of processing the file, set the array ordering and then loop through the array, printing the values in the required format.
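If GNU awk's true multidimensional arrays and PROCINFO["sorted_in"] are not available, a sketch of the same idea in any POSIX awk (assuming the first-column values always cycle through in the same order, as in the example) could be:
awk '{
    if (!($1 in row)) order[++n] = $1                  # remember first-seen order of column 1
    row[$1] = (row[$1] == "" ? "" : row[$1] "\t") $3   # append column 3 to that group
}
END {
    for (i = 1; i <= n; i++) print row[order[i]]
}' file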

Move all rows in a tsv with a certain date to their own file

I have a TSV file with 4 columns in this format
dog phil tall 2020-12-09 12:34:22
cat jill tall 2020-12-10 11:34:22
The 4th column is a date string Example : 2020-12-09 12:34:22
I want every row with the same date to go into its own file
For example,
file 20201209 should have all rows that start with 2020-12-09 in the 4th column
file 20201210 should have all rows that start with 2020-12-10 in the 4th column
Is there any way to do this through the terminal?
With GNU awk to allow potentially large numbers of concurrently open output files and gensub():
awk '{print > gensub(/-/,"","g",$(NF-1))}' file
With any awk:
awk '{out=$(NF-1); gsub(/-/,"",out); if (seen[out]++) print >> out; else print > out; close(out)}' file
There are ways to speed up either script by sorting the input first, if that's an issue.
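For example, a sketch of that sort-first variant (assuming, as above, that the date is the second-to-last whitespace-delimited field) might be:
sort -k4,4 file |
awk '{
    out = $(NF-1); gsub(/-/, "", out)                            # 2020-12-09 -> 20201209
    if (out != prev) { if (prev != "") close(prev); prev = out } # one output file open at a time
    print > out
}'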

How to extract names of compound present in sub files?

I have a list of 15000 compound names (file name: uniq-compounds) which contains the names of 15000 folders. Each folder has a sub-file, out.pdbqt, which contains the compound name in its 3rd row (e.g. Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide). I want to extract all those 15000 names, out of 50,000 folders, by providing the uniq-compound file (it contains the folder names, e.g. ligand_*).
directory and subfiles
sidra --- 50,000 folders (ligand_00001 - ligand50,000) --- each contains a sub-file (out.pdbqt) that contains the name (shown below).
Another file (uniq-compound) contains the 15000 folder names (the compound names I want).
out.pdbqt
MODEL 1
REMARK VINA RESULT: -6.0 0.000 0.000
REMARK Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: N_1 and C_7
Assuming uniq-compound.txt contains the folder names, each folder contains an out.pdbqt, and the compound name appears in the 3rd row of out.pdbqt, the script below will work:
#!/bin/bash
while IFS= read -r line; do
    awk 'FNR == 3 {print $4}' "$line"/out.pdbqt
done < uniq-compound.txt
The loop iterates through uniq-compound.txt one line at a time; for each line (i.e. folder), it uses awk to print the 4th column of the 3rd line of the out.pdbqt file inside that folder.
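Since starting one awk process per folder can be slow with 15000 folders, an alternative sketch that hands all the paths to awk in batches (assuming the folder names contain no whitespace and that your awk supports nextfile, as GNU awk does) would be:
sed 's|$|/out.pdbqt|' uniq-compound.txt | xargs awk 'FNR == 3 { print $4; nextfile }'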

How to delete lines from TXT or CSV with specific pattern

I have a txt file (shown below).
The aim is to remove the rows which begin with the words "Subtotal Group 1", "Subtotal Group 2" or "Grand Total" (such strings are always at the beginning of the line), but I need to remove them only if the remaining portion of the line has blank fields (or fields filled only with spaces).
It should be achievable with awk or sed in a single pass, but I'm currently doing it in 3 separate steps (one for each string). A more generic syntax would be great. Thanks everybody.
My txt file looks like this:
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_________________________________________________________________________
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1 2123.82
Subtotal Group 1
Subtotal Group 1
========================================================================
GROUP 2
========================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2 612.79
Subtotal Group 2
_______________________________________________________________________
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total 1511.03
The goal output I'm trying to achieve is:
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_______________________________________________________________________
Subtotal Group 1 2123.82
=======================================================================
GROUP 2
=======================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2 612.79
_______________________________________________________________________
Grand total 1511.03
That's a job grep was invented to do:
$ grep -Ev '^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$' file
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00 500 First Line Text 1685.52
1.00 502 Second Line Text 280.98
530 Other Line text 157.32
_________________________________________________________________________
Subtotal Group 1 2123.82
========================================================================
GROUP 2
========================================================================
7.00 701 First Line Text 53.63
711 Second Line text 97.85
7.00 740 Third Line text 157.32
741 Any Line text 157.32
742 Any Line text 18.04
801 Last Line text 128.63
_______________________________________________________________________
Subtotal Group 2 612.79
_______________________________________________________________________
Grand total 1511.03
You can use the same regexp in awk or sed if you prefer:
awk '!/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/' file
sed -E '/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/d' file
If your good lines always end with a number and your Any Text lines don't, you could use:
sed -n '/^.*[0-9]$/p' file
Where -n will suppress printing of pattern space, and you will only output lines ending with [0-9]. Given your example file, the output is:
Subtotal 2123.82
Total 625.80
Any Word 9999.99
You can do:
grep -v -P "^(Subtotal Group \d+|Grand total)[,\s]*$" inputfile > outputfile
The question isn't quite clear if the goal is to keep the total/subtotal lines, or if they should be removed.
Also, it is not clear if the "#*" comments are an actual part of the input file, or they're merely descriptive.
Fortunately, both of these are minor details. This is fairly simple to do with perl:
$ perl -n -e 'print if /^(Subtotal|Grand Total),(,| |#.*)*/' inputfile
Subtotal,,, #This is unuseful --> To be removed
Subtotal,,, #This is unuseful --> To be removed
Subtotal,,,125.40 #This is a good line
Subtotal,,, #This is unuseful --> To be removed
Grand Total,,, #This is unuseful --> To be removed
Grand Total,,,125.40 #This is a good line
This assumes you want to keep the total and the subtotal lines, and remove all other lines.
To do it the other way around, to remove the total/subtotal lines, and keep the others, replace the if keyword with unless.
And if the comments aren't actually in the input file itself, the pattern only needs to be tweaked slightly:
perl -n -e 'print if /^(Subtotal|Grand Total),(,| )*/' inputfile
This also ignores any extra whitespace. If you want whitespace to be significant, this becomes:
perl -n -e 'print if /^(Subtotal|Grand Total),(,)*/' inputfile
Like I said, even though your question is not 100% clear, the unclear parts are just minor details. perl will easily handle every possibility.
As shown in the example, perl will print the edited inputfile on standard output. In order to replace inputfile with the edited contents, simply add the -i option to the command (before the -e option).
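For example, to remove the total/subtotal lines and keep everything else, editing inputfile in place, the combination described above would look something like this:
perl -n -i -e 'print unless /^(Subtotal|Grand Total),(,| |#.*)*/' inputfile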
And an attempt at an awk solution ...
awk -F, '{for(i=2;i<=NF;i++){if($i~/[0-9.-]+/){print $0;next}}}' falzone
Subtotal,,,125.40
Grand Total,,,125.40
Any other text,,,9999.99
Or, looking at the non-csv version:
grep [0-9.-] falzone2
Subtotal 2123.82
Total 625.80
Any Word 9999.99
