Split text file into parts based on a pattern taken from the text file - linux

I have many text files of fixed-width data, e.g.:
$ head model-q-060.txt
% x y
15.0 0.0
15.026087 -1.0
15.052174 -2.0
15.07826 -3.0
15.104348 -4.0
15.130435 -5.0
15.156522 -6.0
15.182609 -6.9999995
15.208695 -8.0
The data comprise 3 or 4 runs of a simulation, all stored in the one text file, with no separator between runs. In other words, there is no empty line or anything, e.g. if there were only 3 'records' per run it would look like this for 3 runs:
$ head model-q-060.txt
% x y
15.0 0.0
15.026087 -1.0
15.052174 -2.0
15.0 0.0
15.038486 -1.0
15.066712 -2.0
15.0 0.0
15.041089 -1.0
15.087612 -2.0
It's a COMSOL Multiphysics output file for those interested. Visually you can tell where the new run data begin, as the first x-value is repeated (actually the entire second line might be the same for all of them). So I need to firstly open the file and get this x-value, save it, then use it as a pattern to match with awk or csplit. I am struggling to work this out!
csplit will do the job:
$ csplit -z -f 'temp' -b '%02d.txt' model-q-060.txt /^15\.0\\s/ {*}
but I have to know the pattern to split on. This question is similar but each of my text files might have a different pattern to match: Split files based on file content and pattern matching.
Ben.

Here's a simple awk script that will do what you want:
BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
f=sprintf("test%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }
- initialize the output file number
- ignore the first line
- extract the delimiter from the second line
- for every input line whose first token matches the delimiter, set up the output file name
- for all lines, write to the current output file
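A minimal way to run it, assuming the script above is saved as (for example) split_runs.awk:
awk -f split_runs.awk model-q-060.txt
This writes the runs to test00.txt, test01.txt, and so on, announcing each output file as it is created.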

This should do the job - test it somewhere you don't already have a lot of temp*.txt files. :)
rm -f temp*.txt
cat > f1.txt <<EOF
% x y
15.0 0.0
15.026087 -1.0
15.052174 -2.0
15.0 0.0
15.038486 -1.0
15.066712 -2.0
15.0 0.0
15.041089 -1.0
15.087612 -2.0
EOF
first=`awk 'NR==2{print $1}' f1.txt|sed 's/\./\\./'`
echo --- Splitting by: $first
csplit -z -f temp -b %02d.txt f1.txt /^"$first"\\s/ {*}
for i in temp*.txt; do
echo ---- $i
cat $i
done
The output of the above is:
--- Splitting by: 15\.0
51
153
153
136
---- temp00.txt
% x y
---- temp01.txt
15.0 0.0
15.026087 -1.0
15.052174 -2.0
---- temp02.txt
15.0 0.0
15.038486 -1.0
15.066712 -2.0
---- temp03.txt
15.0 0.0
15.041089 -1.0
15.087612 -2.0
Of course, you will run into trouble if the value you split on (15.0 in the above example) also turns up as the first field of other lines within a run - solving that would be a tad harder - exercise left for the reader...
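If that is a real concern, one workaround - a sketch only, adapting the awk script from the first answer and leaning on the question's note that the entire second line is probably identical at each run boundary - is to compare the whole line rather than just its first field:
NR == 1 { next }                                    # skip the "% x y" header
NR == 2 { delim = $0 }                              # remember the entire second line
$0 == delim { f = sprintf("temp%02d.txt", fn++) }   # a repeat of that line starts a new run
{ print > f }                                       # write every data line to the current file
It is run the same way as the script in the first answer.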

If the number of lines per run is constant, you could use this:
cat your_file.txt | grep -P "^\d" | \
split --lines=$(expr \( $(wc -l "your_file.txt" | \
awk '{print $1}') - 1 \) / number_of_runs)
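For instance, with the three-run sample shown in the question (one header line plus nine data lines, saved here as model-q-060.txt) and number_of_runs set to 3, the computed chunk size is (10 - 1) / 3 = 3, so the pipeline reduces to something like this (a sketch; xaa, xab and xac are split's default output names):
grep -P "^\d" model-q-060.txt | split --lines=3
ls x??    # xaa xab xac - one run per file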

Related

Awk average of column by moving difference of grouping column variable

I have a file that look like this:
1 snp1 0.0 4
1 snp2 0.2 6
1 snp3 0.3 4
1 snp4 0.4 3
1 snp5 0.5 5
1 snp6 0.6 6
1 snp7 1.3 5
1 snp8 1.3 3
1 snp9 1.9 4
File is sorted by column 3. I want the average of the 4th column, grouped by column 3 in windows of 0.5 units. For example, it should output like this:
1 snp1 0.0 4.4
1 snp6 0.6 6.0
1 snp7 1.3 4.0
1 snp9 1.9 4.0
I can print all positions without average like this:
awk 'NR==1 {pos=$3; print $0} $3>=pos+0.5{pos=$3; print $0}' input
But I am not able to figure out how to print the average of the 4th column. It would be great if someone could help me find a solution to this problem. Thanks!
Something like this, maybe:
awk '
NR==1 {c1=$1; c2=$2; v=$3; n=1; s=$4; next}                           # start the first group
$3>v+0.5 {print c1, c2, v, s/n; c1=$1; c2=$2; v=$3; n=1; s=$4; next}  # window exceeded: print the average, start a new group
{n+=1; s+=$4}                                                         # otherwise accumulate count and sum
END {print c1, c2, v, s/n}                                            # flush the last group
' input
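For the sample input this reproduces the expected output shown above. As a quick illustrative check of the first group's arithmetic (a throwaway one-liner, not part of the answer):
awk '$3 <= 0.5 {s += $4; n++} END {print s/n}' input    # prints 4.4, i.e. (4+6+4+3+5)/5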

How to match two text files with different lengths and different columns, with headers, using the join command in Linux

I have two text files of different lengths, A.txt and B.txt.
A.txt looks like:
ID pos val1 val2 val3
1 2 0.8 0.5 0.6
2 4 0.9 0.6 0.8
3 6 1.0 1.2 1.3
4 8 2.5 2.2 3.4
5 10 3.2 3.4 3.8
B.txt looks like:
pos category
2 A
4 B
6 A
8 C
10 B
I want to match the pos column in both files and get output like this:
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I used the join command join -1 2 -2 1 <(sort -k2 A.txt) <(sort -k1 B.txt) > C.txt
but C.txt comes out without a header:
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
I want to get output with a header from the join command. Kindly help me out.
Thanks in advance.
In case you are OK with awk, you could try the following. Written and tested with the shown samples in GNU awk.
awk 'FNR==NR{a[$1]=$2;next} ($2 in a){$2=a[$2] OFS $2} 1' B.txt A.txt | column -t
Explanation: a detailed, line-by-line breakdown of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when B.txt is being read.
a[$1]=$2 ##Creating array a with index of 1st field and value is 2nd field of current line.
next ##next will skip all further statements from here.
}
($2 in a){ ##Checking condition if 2nd field is present in array a then do following.
$2=a[$2] OFS $2 ##Adding array a value along with 2nd field in 2nd field as per output.
}
1 ##1 will print current line.
' B.txt A.txt | column -t ##Mentioning Input_file names and passing awk program output to column to make it look better.
As you requested... It is perfectly possible to get the desired output using just GNU join:
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$
The key to getting the correct output is using the sort -g option, and specifying the join output column order using the -o option.
To "pretty print" the output, pipe to column -t
$ join -1 2 -2 1 <(sort -k2 -g A.txt) <(sort -k1 -g B.txt) -o 1.1,2.2,1.2,1.3,1.4,1.5 | column -t
ID category pos val1 val2 val3
1 A 2 0.8 0.5 0.6
2 B 4 0.9 0.6 0.8
3 A 6 1.0 1.2 1.3
4 C 8 2.5 2.2 3.4
5 B 10 3.2 3.4 3.8
$

How to compare two text files on the first column - if there is a match, print the value, otherwise print zero?

1.txt contains:
1
2
3
4
5
.
.
180
2.txt contains:
3 0.5
4 0.8
9 9.0
120 3.0
179 2.0
I want the output such that, where a value in 1.txt matches the first column of 2.txt, the corresponding second-column value from 2.txt is printed; where there is no match, zero is printed instead.
The output should look like this:
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
.
.
8 0.0
9 9.0
10 0.0
11 0.0
.
.
.
120 3.0
121 0.0
.
.
150 0.0
.
179 2.0
180 0.0
awk 'NR==FNR{a[$1]=$2;next}{if($1 in a){print $1,a[$1]}else{print $1,"0.0"}}' 2.txt 1.txt
Brief explanation:
NR==FNR{a[$1]=$2;next}: record each $1 of 2.txt in array a, with $2 as its value
If $1 of 1.txt exists in array a, print $1 and a[$1]; otherwise print $1 and 0.0
Could you please try the following and let me know if this helps you.
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
Explanation of code:
awk '
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1.txt is being read.
a[$1]; ##Creating an array a whose index is $1.
next ##next will skip all further statements from here.
}
{
for(i=prev+1;i<=($1-1);i++){ ##Loop from prev+1 up to one less than the current first field, filling the gap.
print i,"0.0"} ##Printing value of variable i and 0.0 here.
}
{
prev=$1; ##Setting $1 value to variable prev here.
$1=$1; ##Resetting $1 here so the output becomes TAB-delimited.
print ##Printing the current line here.
}' OFS="\t" 1.txt 2.txt ##Setting OFS as TAB and mentioning Input_file(s) name here.
Execution of above code:
Input_file(s):
cat 1.txt
1
2
3
4
5
6
7
cat 2.txt
3 0.5
4 0.8
9 9.0
Output will be as follows:
awk 'FNR==NR{a[$1];next} {for(i=prev+1;i<=($1-1);i++){print i,"0.0"}}{prev=$1;$1=$1;print}' OFS="\t" 1.txt 2.txt
1 0.0
2 0.0
3 0.5
4 0.8
5 0.0
6 0.0
7 0.0
8 0.0
9 9.0
This might work for you (GNU sed):
sed -r 's#^(\S+)\s.*#/^\1\\s*$/c&#' file2 | sed -i -f - -e 's/$/ 0.0/' file1
Create a sed script from file2: whenever a line of file1 consists solely of a first field that matches the first field of a line in file2, change that line to the contents of the corresponding line from file2. All other lines are then zeroed, i.e. lines the script did not change get 0.0 appended.
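To make that concrete, this is roughly the intermediate sed script that the first command generates from the sample 2.txt (file2 in the command above); shown for illustration only:
/^3\s*$/c3 0.5
/^4\s*$/c4 0.8
/^9\s*$/c9 9.0
/^120\s*$/c120 3.0
/^179\s*$/c179 2.0
The second sed runs this script on 1.txt in place; because c replaces the matching line and starts the next cycle, the trailing s/$/ 0.0/ only reaches lines that were not matched.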

Sorting rows of a data file with Linux

I would like to sort, in decreasing order, the values on each line of a data file, starting from the first field (each line independent from the others). For example, if I have a data file
1 0.1 0.6 0.4
2 0.5 0.2 0.3
3 1.0 0.2 0.8
I would like to end with something like
1 0.6 0.4 0.1
2 0.5 0.3 0.2
3 1.0 0.8 0.2
I have tried to do it using the sort command, but it sorts whole lines against each other (not the values within one line). Transposing the data file and then sorting could also be a good solution (I don't know any easy way to transpose data files).
Thanks for the help!
Perl to the rescue!
perl -lawne '
print join "\t", $F[0], sort { $b <=> $a } @F[1..$#F]
' < input > output
-n reads the input line by line
-a splits the line on whitespace into the @F array
-l adds newlines to print
See sort, join.
Or to read input line by line, use tr and sort like this:
#! /bin/sh
while read -r line; do
    echo $line | tr ' ' '\n' | sort -k1,1nr -k2 | tr '\n' '\t' >> output
    echo >> output
done < input
tr ' ' '\n' is to convert row to column.

How to remove only the first two leading spaces in all lines of a file

My input file looks like:
*CONTROL_ADAPTIVE
$ adpfreq adptol adpopt maxlvl tbirth tdeath lcadp ioflag
0.10 5.000 2 3 0.0 0.0 0 0
I JUST want to remove the leading 2 spaces in all the lines.
I used
sed "s/^[ \t]*//" -i inputfile.txt
but it deletes all the leading whitespace from all the lines. I just want to shift the complete text in the file two positions to the left.
Any solutions to this?
You can specify that you want to delete two matches of the character set in the brackets:
sed -r -i "s/^[ \t]{2}//" inputfile.txt
See the output:
$ sed -r "s/^[ \t]{2}//" file
*CONTROL_ADAPTIVE
$ adpfreq adptol adpopt maxlvl tbirth tdeath lcadp ioflag
0.10 5.000 2 3 0.0 0.0 0 0
