Splitting text files on two consecutive lines containing only one integer number - text

I have a single long text file that contains a list of 3D coordinates. The file begins with a header like this:
10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
After that, the list of coordinates starts. Each line contains 3 to 7 numbers. For example:
0.001686 0.812066 -1.686245 0.074434
0.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
...
The total length of the list equals the product of the first two numbers of the header (10112*2455). These are PTX files, which contain 3D points from laser scanning in text format.
The problem is that the file is a concatenation of headers and coordinates, and I want to split it at each header. The ideal solution would split the file at the two consecutive lines that each contain a single integer. I was looking for a generic solution using, for example, csplit, but csplit matches one line at a time, so it cannot detect two consecutive lines.
As a last resort I will write a program myself, but I would prefer a solution based on CLI tools (Awk?), if one is available.
Any ideas?
Thank you.
Edit: examples
Let's say I have a file with the following content:
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
3 <--- cut before this line
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
In this case I should end up with two files, cut just before the first of the two lines that consist of a single integer.
Alternatively, since the two single-number lines say how many points the section contains, the first output file consists of the first 2*3+10=16 lines (10 header lines and 6 data lines), and the second file consists of the following 3*1+10=13 lines (again 10 header lines, this time with 3 data lines).

So you want to split a file into several files, printing the header in each of them.
This does it. Set the number of lines per output file with -v lines=XX and the number of header lines to repeat with -v head=YY:
awk -v lines=5 -v head=2 '
NR<=head {header[NR]=$0; next}
!((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file}
{print > file}
' file
One-liner:
awk -v lines=5 -v head=2 'NR<=head{header[NR]=$0; next} !((NR-head-1)%lines) {file="output_"++count; for (i=1;i<=head;i++) print header[i] > file} {print > file}' file
For your specific sample input, giving head=2 and lines=5, it returns two files:
$ cat output_1
10112
2455
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
$ cat output_2
10112
2455
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
If what you want is to split the file for every header you find, this should do:
awk '(!flag && NF==1) {header[1]=$1; flag=1; next} (flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next} {print > file}' file
Explanation
(!flag && NF==1) {header[1]=$1; flag=1; next}: if no flag is set, assume this is the first line of the header and store it.
(flag && NF==1) {header[2]=$1; flag=0; file="output_"++count; printf "%d\n%d\n", header[1], header[2] > file; next}: if the flag is set, the first header line has already been captured and this is the second one. Unset the flag, build the file name as output_ plus a counter, and write the stored header to that file.
{print > file}: in all other cases, print the current line to the current file.
Given your sample file, it returns output_1 and output_2:
$ cat output_1
2
3
121.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
121.417670 172.321300 1.704072 1
6.001686 0.812066 -1.686245 0.074434
3.001695 0.816359 -1.692300 0.087190
6.001699 0.818673 -1.694508 0.097398
2.001686 0.812066 -1.686245 0.074434
1.001695 0.816359 -1.692300 0.087190
0.001699 0.818673 -1.694508 0.097398
$ cat output_2
3
1
421.417670 172.321300 1.704072
0.997697 0.067831 -0.000222
-0.067831 0.997697 0.000207
0.000236 -0.000191 1.000000
0.997697 0.067831 -0.000222 0
-0.067831 0.997697 0.000207 0
0.000236 -0.000191 1.000000 0
421.417670 172.321300 1.704072 1
1.001686 0.812066 -1.686245 0.074434
2.001695 0.816359 -1.692300 0.087190
3.001699 0.818673 -1.694508 0.097398
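The count-based alternative mentioned in the question (each section is rows*cols+10 lines long) can also be sketched in awk. This is only a sketch under stated assumptions: every section has exactly 10 header lines, as in the question, and the function name split_ptx and the output names section_N.ptx are made up for illustration. It closes each output file before opening the next one, which helps when a file contains many sections.

```shell
# split_ptx FILE: write each section of FILE to section_1.ptx, section_2.ptx, ...
# (hypothetical names). Assumes 10 header lines per section, the first two
# of which are the counts, as described in the question.
split_ptx() {
    awk -v header_lines=10 '
    left == 0 {                        # first count line: a new section starts
        if (out != "") close(out)      # keep the number of open files small
        out = "section_" (++n) ".ptx"
        first = $1
        print > out
        left = -1                      # -1 means: waiting for the second count
        next
    }
    left == -1 {                       # second count line
        print > out
        left = first * $1 + header_lines - 2   # lines still owed to this section
        next
    }
    { print > out; left-- }            # rest of the header, then the points
    ' "$1"
}

# Tiny demo with dummy values: a 2x3 section followed by a 3x1 section.
{
    printf '2\n3\n'
    awk 'BEGIN { for (i = 0; i < 14; i++) print "0.5 0.5 0.5" }'
    printf '3\n1\n'
    awk 'BEGIN { for (i = 0; i < 11; i++) print "0.5 0.5 0.5" }'
} > demo.ptx
split_ptx demo.ptx
wc -l section_1.ptx section_2.ptx
```

Unlike the NF==1 test, this variant does not get confused by a data line that happens to contain a single number, but it relies on the counts in the header being correct.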

Related

BashScript to organize a given txt file

I have a single-column txt file that I would like to divide into multiple columns and label.
I've tried what I know, but the division is not what I want.
I have a file ivmeasurementff.txt which contains:
24.000000
0.003207
0.000002
25.000000
0.003435
0.000002
26.000000
0.003991
0.000002
27.000000
0.003207
0.000002
28.000000
0.003232
0.000002
29.000000
0.003283
pr -ts" " --columns 2 ivmeasurementff.txt
This command just splits the column into two.
Expected output:
Actual Vol T Vol Current
24.000000 0.003207 0.000002
25.000000 0.003435 0.000002
26.000000 0.003991 0.000002
Actual output:
24.000000 0.003435
0.003207 0.000002
0.000002 26.000000
25.000000 0.003991
You may use paste to format it into 3 columns:
paste - - - < ivmeasurementff.txt
Since there's no header, you have to add it manually (printf is used here because echo does not expand \t portably):
printf 'Actual Vol\tT Vol\tCurrent\n'; paste - - - < ivmeasurementff.txt
Simple solution with xargs:
xargs -n 3 < Input-file
Or, if you are OK with awk:
awk 'FNR%3==0 && FNR!=1{print val,$0;val="";next} {val=(val?val OFS:"")$0} END{if(val){print val}}' Input_file
perl solution:
perl -pe 's{\n$}{ } if $. % 3' Input_file
Take a look at https://docstore.mik.ua/orelly/unix3/upt/ch21_15.htm
Under 21.15.3 it suggests using the -l option
pr -ts" " -l1 -3 ivmeasurementff.txt
24.000000 0.003207 0.000002
25.000000 0.003435 0.000002
26.000000 0.003991 0.000002
27.000000 0.003207 0.000002
28.000000 0.003232 0.000002
29.000000 0.003283
Quoting:
21.15.3. Order Lines Across Columns: -l
Do you want to arrange your data across the columns, so that the first three lines print across the top of each column, the next three lines are the second in each column, and so on, like this?
% pr -l1 -t -3 file1
Line 1 here Line 2 here Line 3 here
Line 4 here Line 5 here Line 6 here
Line 7 here Line 8 here Line 9 here
... ... ...
Use the -l1 (page length 1 line) and -t (no title) options. Each "page" will be filled by three lines (or however many columns you set). You have to use -t; otherwise, pr will silently ignore any page lengths that don't leave room for the header and footer. That's just what you want if you want data in columns with no headings.
Simply add option -a and specify three columns.
$ pr -ats" " --columns 3 ivmeasurementff.txt
24.000000 0.003207 0.000002
25.000000 0.003435 0.000002
26.000000 0.003991 0.000002
27.000000 0.003207 0.000002
28.000000 0.003232 0.000002
29.000000 0.003283
This works because -a (--across) makes pr fill the columns across each row instead of down each column.
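A quick way to see what -a does is to feed pr a few numbered lines. This is a small sketch that assumes GNU pr (the --columns spelling is the one used in the answer above); pr_demo.txt is a made-up file name.

```shell
# Six one-word lines arranged in three columns. With -a the lines are laid
# out across each row first, so every output row holds three consecutive
# input lines.
printf '%s\n' 1 2 3 4 5 6 | pr -ats" " --columns 3 > pr_demo.txt
cat pr_demo.txt
```

Each output row then contains three consecutive input lines, which is the layout the question asks for.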

How to remove duplicate rows and create index in awk

I have tab delimited files as shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 XX
CNV_chr1_13398757_13402091 3334 4 RR123 YY
CNV_chr1_13398757_13402091 3334 4 RR224 YY
CNV_chr1_14001365_14004064 2699 1 RR123 YX
CNV_chr1_14001365_14004064 2699 1 RR224 YX
Columns $1 and $2 stay identical. In this case, I need to remove the duplicate rows by merging the values in the 4th column, and add an additional column with the number of comma-separated strings in $4. Sample output shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
Any working solution would be helpful.
Try this:
awk '($1 in ar){br[$1]=br[$1]","$4; next}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
The output:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
For tab-delimited input, add -F"\t" and -v OFS="\t" to awk, so that the rebuilt lines stay tab-separated:
awk -F"\t" -v OFS="\t" '($1 in ar){br[$1]=br[$1]","$4; next}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
and get:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
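Note that for (key in ar) visits the keys in an unspecified order, which is why the output above is not in input order. If the original order matters, one possible variant remembers the order in which each key was first seen. This is a sketch, not the answer's method: it assumes tab-separated input with five columns keyed on $1, and the file names cnv_demo.txt and cnv_merged.txt are made up for the demo.

```shell
# Demo input: tab-separated, same rows as in the question.
printf '%s\t%s\t%s\t%s\t%s\n' \
    CNV_chr1_12623251_12632176 8925 3 RR123 XX \
    CNV_chr1_13398757_13402091 3334 4 RR123 YY \
    CNV_chr1_13398757_13402091 3334 4 RR224 YY \
    CNV_chr1_14001365_14004064 2699 1 RR123 YX \
    CNV_chr1_14001365_14004064 2699 1 RR224 YX > cnv_demo.txt

awk 'BEGIN { FS = OFS = "\t" }
# First time a key is seen: remember its position and its fixed columns.
!($1 in count) { order[++n] = $1; lead[$1] = $1 OFS $2 OFS $3; tail[$1] = $5 }
# Every row: append $4 to the comma-separated list and count it.
{ ids[$1] = (count[$1]++ ? ids[$1] "," : "") $4 }
END {
    for (i = 1; i <= n; i++) {      # walk the keys in first-seen order
        k = order[i]
        print lead[k], ids[k], count[k], tail[k]
    }
}' cnv_demo.txt > cnv_merged.txt
cat cnv_merged.txt
```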

In place string replacement of file in linux

I am trying to read a file in Linux line by line, containing data like this:
522240227 B009CPMJ1M 20141003 20141103 1063278 1 1 6
604710621 B004NPI3OI 20141003 20141103 166431 1 1 6
1498812521 B00LFEHWJM 20141003 20141103 1044646 1 10 6
1498812521 B00D3IK0Y2 20141003 20141103 1044646 2 10 6
I then have to add 2000000000 to the first integer of each line and replace that integer. So the final file would look like this:
2522240227 B009CPMJ1M 20141003 20141103 1063278 1 1 6
2604710621 B004NPI3OI 20141003 20141103 166431 1 1 6
3498812521 B00LFEHWJM 20141003 20141103 1044646 1 10 6
3498812521 B00D3IK0Y2 20141003 20141103 1044646 2 10 6
Is there a way to do this on the same file, without creating a temporary file, using a shell script?
This is a job for awk!
awk '{$1+=2000000000}1' file
It returns:
2522240227 B009CPMJ1M 20141003 20141103 1063278 1 1 6
2604710621 B004NPI3OI 20141003 20141103 166431 1 1 6
3498812521 B00LFEHWJM 20141003 20141103 1044646 1 10 6
3498812521 B00D3IK0Y2 20141003 20141103 1044646 2 10 6
It is quite self-explanatory: $1+=2000000000 adds 2000000000 to the first value. Then 1 is true, so awk performs its default action, {print $0}, that is, it prints the line.
To replace the file, redirect to a temp file and then move:
awk '{$1+=2000000000}1' file > new_file && mv new_file file
Since the number is big and may be printed in e+xx format, fix the format with sprintf():
awk '{$1 = sprintf("%.0f", $1 + 2000000000)}1' file
This uses sprintf to store a formatted value in $1, so that it is then printed properly.
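Putting the two pieces together, a minimal sketch of the whole update might look like this. The file name ids_demo.txt is made up for the demo. awk itself cannot overwrite its input, so a temporary file is still involved; GNU awk 4.1 or later also offers -i inplace, which does much the same behind the scenes.

```shell
# Demo file with the four rows from the question.
cat > ids_demo.txt <<'EOF'
522240227 B009CPMJ1M 20141003 20141103 1063278 1 1 6
604710621 B004NPI3OI 20141003 20141103 166431 1 1 6
1498812521 B00LFEHWJM 20141003 20141103 1044646 1 10 6
1498812521 B00D3IK0Y2 20141003 20141103 1044646 2 10 6
EOF

# Write the result next to the original, then replace the original only if
# awk succeeded. %.0f prints the sum as a plain integer, never as e+xx.
tmp=$(mktemp ./ids_demo.XXXXXX) &&
awk '{ $1 = sprintf("%.0f", $1 + 2000000000) } 1' ids_demo.txt > "$tmp" &&
mv "$tmp" ids_demo.txt
cat ids_demo.txt
```

One thing to keep in mind is that the replaced file takes on the permissions of the temporary file, so they may need to be restored afterwards.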

sort a file based on a column in another file

I have two files both in the format of:
loc1 num1 num2
loc2 num3 num4
The first column is the location. I want to use the order of the locations in the first file to sort the second file, so that I can combine the two files with the numbers lined up by location.
I could write a Perl script to do this, but I felt there might be some quick and easy shell/awk command to achieve it. Do you have any ideas?
Thanks.
Edits:
Here is the input. I actually want to use column 2 in file 1 to sort file 2.
File1:
GID location NAME GWEIGHT C1SI M1CO M1SI C1LY M1LY C1CO C1LI M1LI
AID ARRY2X ARRY1X ARRY3X ARRY4X ARRY5X ARRY0X ARRY6X ARRY7X
EWEIGHT 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
GENE735X chr17:66199278-66199496 chr17:66199278-66199496 1.000000 0.211785 -0.853890 1.071875 0.544136 0.703871 0.371880 0.218960 -2.268618
GENE1562X chr10:80097054-80097298 chr10:80097054-80097298 1.000000 0.533673 -0.397202 0.783363 0.109824 -0.436342 0.158667 0.475748 -1.227730
GENE6579X chr19:23694188-23694395 chr19:23694188-23694395 1.000000 0.127748 -0.203827 0.846738 0.045599 -0.211767 0.415442 0.282123 -1.302055
File 2:
GID location NAME GWEIGHT C1SI M1CO M1SI C1LY M1LY C1CO C1LI M1LI
AID ARRY2X ARRY1X ARRY3X ARRY4X ARRY5X ARRY0X ARRY6X ARRY7X
EWEIGHT 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
GENE6579X chr19:23694188-23694395 chr19:23694188-23694395 1.000000 0.127748 -0.203827 0.846738 0.045599 -0.211767 0.415442 0.282123 -1.302055
GENE735X chr17:66199278-66199496 chr17:66199278-66199496 1.000000 0.211785 -0.853890 1.071875 0.544136 0.703871 0.371880 0.218960 -2.268618
GENE1562X chr10:80097054-80097298 chr10:80097054-80097298 1.000000 0.533673 -0.397202 0.783363 0.109824 -0.436342 0.158667 0.475748 -1.227730
An awk solution: store the 2nd file in memory, then loop over the first file, emitting matching lines from the 2nd file:
awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' second first
Implementing @Barmar's comment:
join -1 2 -o "1.1 1.2 2.2 2.3" <(cat -n first | sort -k2) <(sort second) |
sort -n |
cut -d ' ' -f 2-
Note to other answerers: I tested with these files:
$ cat first
foo x y
bar x y
baz x y
$ cat second
bar x1 y1
baz x2 y2
foo x3 y3
Explanation of
awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' second first
This part reads the 1st file in the command line parameters (here, "second"):
FNR==NR {x2[$1] = $0; next}
The condition FNR == NR will be true only for the first named file. FNR is awk's "File Record Number" variable, NR is the current record number from all input sources. The current line is stored in an associative array named x2 (not a great variable name) indexed by the first field of the record.
The next condition, $1 in x2, is only reached once the file "second" has been completely read, because the first action ends with next. It looks at the first field of each line in the file named "first", and the action prints the corresponding line from file "second", which has been stored in the array.
Note that the order of the files in the awk command is important. Since you control the output based on the file named "first", it must be the last file processed by awk.
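As a worked example, here is the same command run on the two small test files listed above. The output file name second.sorted is made up for the demo.

```shell
# Recreate the two test files from the answer.
printf '%s\n' 'foo x y' 'bar x y' 'baz x y' > first
printf '%s\n' 'bar x1 y1' 'baz x2 y2' 'foo x3 y3' > second

# Load "second" into the array, then walk "first" and print the stored
# line for each key, so the output follows the order of "first".
awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' second first > second.sorted
cat second.sorted
```

The lines of "second" come out in the key order of "first": foo, bar, baz.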
Use the paste command to merge lines of two files.
For example:
file1:
f1_11 f1_12
f1_21 f1_22
f1_31 f1_32
f1_41 f1_42
file2:
f2_11 f2_12
f2_21 f2_22
f2_31 f2_32
f2_41 f2_42
➜ ~ paste file1 file2
f1_11 f1_12 f2_11 f2_12
f1_21 f1_22 f2_21 f2_22
f1_31 f1_32 f2_31 f2_32
f1_41 f1_42 f2_41 f2_42
Now you can do a sort on column 1.
paste file1 file2 | sort -k1,1
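Last but not least, cut out the columns that belong to the second file if you do not want to see the data of file1 in the final output. With the three-column files from the question, and assuming the columns are tab-separated, those are fields 4 to 6: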
paste file1 file2 | sort -k1,1 | cut -f4-6

replacing specific lines below the line containing a certain string using sed inplace editing in linux

I am trying to automate the editing of an input file that looks like this:
*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE
$# cid title
$# ssid msid sstyp mstyp sboxid mboxid spr mpr
1 2 3 3 0 0 0 0
$# fs fd dc vc vdc penchk bt dt
0.0100 0.000 0.000 0.000 0.000 0 0.000 1.0000E+7
$# sfs sfm sst mst sfst sfmt fsf vsf
1.000000 1.000000 0.000 0.000 1.000000 1.000000 1.000000 1.000000
*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE
$# cid title
$# ssid msid sstyp mstyp sboxid mboxid spr mpr
1 3 3 3 0 0 0 0
$# fs fd dc vc vdc penchk bt dt
0.0100 0.000 0.000 0.000 0.000 0 0.000 1.0000E+7
$# sfs sfm sst mst sfst sfmt fsf vsf
1.000000 1.000000 0.000 0.000 1.000000 1.000000 1.000000 1.000000
I want to change the fifth line after the string
*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE
with a line from another file, frictionValues.txt.
What I am using is as follows:
sed -i -e '/^\*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE/{n;n;n;n;n;R frictionValues.txt' -e 'd}' input.txt
but this changes the fifth line after every occurrence of the string, and it reads a new line from frictionValues.txt for each match (two reads for the two matches here). I want it to read only the first line and copy that line to every place where it finds the string. Can anybody tell me how to do this using sed with in-place editing, like the command above?
This might work for you (I might be well off the mark as to what you want!):
sed '1s|.*|1{x;s/^/&/;x};/^\*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE/{n;n;n;n;n;G;s/.*\\n//}|;q' frictionValues.txt |
sed -i -f - input.txt
Explanation:
Build a sed script from the first line of frictionValues.txt that stuffs that line into the hold space (HS). The rest of the script is as before, but instead of R frictionValues.txt it appends the HS to the pattern space using G and then deletes everything up to the newline, leaving only the stored line.
Run that sed script against input.txt. With -f -, sed reads the script from stdin, that is, from the previous stage of the pipeline.
Try with this:
Content of frictionValues.txt:
monday
tuesday
Content of input.txt will be the same that you pasted in the question.
Content of script.sed:
## Match literal string.
/^\*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE/ {
## Append next five lines.
N
N
N
N
N
## Delete the last one.
s/\(^.*\)\n.*$/\1/
## Print the rest of lines.
p
## Queue a line from external file.
R frictionValues.txt
## Branch to the end of the script (the queued external line is output next).
b
}
## Print line.
p
Run it like:
sed -nf script.sed input.txt
With following result:
*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE
$# cid title
$# ssid msid sstyp mstyp sboxid mboxid spr mpr
1 2 3 3 0 0 0 0
$# fs fd dc vc vdc penchk bt dt
monday
$# sfs sfm sst mst sfst sfmt fsf vsf
1.000000 1.000000 0.000 0.000 1.000000 1.000000 1.000000 1.000000
*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE
$# cid title
$# ssid msid sstyp mstyp sboxid mboxid spr mpr
1 3 3 3 0 0 0 0
$# fs fd dc vc vdc penchk bt dt
tuesday
$# sfs sfm sst mst sfst sfmt fsf vsf
1.000000 1.000000 0.000 0.000 1.000000 1.000000 1.000000 1.000000
I have a two-step approach:
First find the line number of the first line that contains the matching text:
linenum=$(grep -n -m 1 -F '*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE' input.txt | cut -d: -f1)
Now use sed to replace the line at linenum+5 with the value from frictionValues.txt. The c command already replaces the addressed line, so no separate d command is needed:
sed "$((linenum+5))c $(cat frictionValues.txt)" input.txt
Assumptions
frictionValues.txt has only one line
Only the first occurrence of the string needs to change (grep -m 1 stops at the first match)
You are using a Linux system with GNU grep and GNU sed
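For comparison, the same replacement can be sketched in awk, which makes it straightforward to reuse only the first line of frictionValues.txt at every occurrence. This is a sketch under stated assumptions: the values file is not empty, the marker starts at the beginning of its line, and the function name replace_friction and the demo file names are made up for illustration.

```shell
# replace_friction INPUT VALUES: print INPUT with the 5th line after every
# marker line replaced by the first line of VALUES.
replace_friction() {
    awk -v marker='*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE' '
    NR == FNR { if (FNR == 1) repl = $0; next }   # VALUES file: keep line 1 only
    index($0, marker) == 1 { target = FNR + 5 }   # marker found: note the target
    FNR == target { print repl; next }            # swap in the stored line
    { print }
    ' "$2" "$1"
}

# Demo: two marker blocks with placeholder lines, and a two-line values file.
printf '%s\n' \
    '*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE' a1 a2 a3 a4 'old friction' a6 \
    '*CONTACT_FORMING_ONE_WAY_SURFACE_TO_SURFACE' b1 b2 b3 b4 'old friction' b6 > contact_demo.txt
printf '%s\n' '0.0500 0.000' 'unused second line' > friction_demo.txt

# awk cannot edit in place, so write to a new file and move it over the old one.
replace_friction contact_demo.txt friction_demo.txt > contact_demo.new &&
mv contact_demo.new contact_demo.txt
cat contact_demo.txt
```

The marker is compared as a plain string with index(), so the leading * needs no escaping.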

Resources