Split text file into blocks and save

Split text file into blocks and save - linux

I have a big text file named test.txt. I want to split it into blocks at the ... separator lines and save each block under the file name that follows /home/niu/. In the data example below, the blocks should be saved as 20190630_073410_1.5_29_PCK.txt for the first block, 20180630_073410_1.5_29_PCK.txt for the second block and 20190830_093410_1.5_29_PCK.txt for the third block.
I tried the code below:
#!/bin/sh
for file in 'test.txt'
do
split -l '...'
done
It does not work. I hope somebody can help me. Thanks.
My data saved in test.txt is given below:
...........................................................................................................
/home/niu/20190630_073410_1.5_29_PCK.txt 470.2359935984357 41573823894247.63 53.46648291467124 216 1 0.1
/home/niu/20190630_073410_1.5_29_PCK.txt 13.124782961287574 219608788311302.7 53.46425102814092 219 1 0.6
/home/niu/20190630_073410_1.5_29_PCK.txt 4.092419925137149 12174862157739.746 53.44206693334351 291 1 1.1
...........................................................................................................
/home/niu/20180630_073410_1.5_29_PCK.txt 2.241494955966288 363350265475740.4 53.36874778729164 219 1 0.1
/home/niu/20180630_073410_1.5_29_PCK.txt 1.6671382966847936 282579486756.3921 53.234249504389624 218 1 2.1
/home/niu/20180630_073410_1.5_29_PCK.txt 1.4410832347641427 17729080367.579777 53.06935945567802 216 1 2.6
...........................................................................................................
/home/niu/20190830_093410_1.5_29_PCK.txt 1.2367527642969733 5141.577700615736 52.776493933960644 127 0 3.6
/home/niu/20190830_093410_1.5_29_PCK.txt 1.171644866817557 3279.978138771641 52.65760209064783 135 0 4.1
/home/niu/20190830_093410_1.5_29_PCK.txt 1.120249969361367 2441.45977994814 52.54882982584634 105 0 4.6

awk '/\.\.\./{close(out); next} {split($1, a, "/"); out=a[4]; print > out}' file
You can use this awk. I have assumed that dots (...) appear only in the separating lines, and that all other lines start with /home/niu/filename.txt, from which we get the output filename. If this is not the case, please update the question.

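You can use csplit like this (by default the pieces are named xx00, xx01, and so on, so they still need to be renamed afterwards):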
csplit test.txt '/^\./' {*}
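As a rough sketch of how the csplit approach could be completed, the pieces can be renamed after the file name found in their first data line. This assumes GNU csplit; the scratch directory, the piece_ prefix and the shortened sample names (a_PCK.txt, b_PCK.txt) are made up for illustration and are not from the question.

```shell
#!/bin/sh
# Sketch only: assumes GNU csplit (-z and '{*}' are GNU extensions).
# The scratch directory and the short sample names are made up for illustration.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > test.txt <<'EOF'
..........
/home/niu/a_PCK.txt 470.2 41573823894247.63
/home/niu/a_PCK.txt 13.1 219608788311302.7
..........
/home/niu/b_PCK.txt 2.2 363350265475740.4
EOF

# Split before every line that starts with a dot.
# -s: no byte counts, -z: drop the empty first piece, -f: piece name prefix.
csplit -s -z -f piece_ test.txt '/^\./' '{*}'

for piece in piece_*; do
    # The first line that does not start with a dot carries the path.
    target=$(awk '!/^\./ {print $1; exit}' "$piece")
    target=${target##*/}                  # keep only the file name
    grep -v '^\.' "$piece" > "$target"    # copy the piece without its separator line
    rm -- "$piece"
done

ls
```

Each piece starts with its dotted separator line, which is why the loop filters that line out while copying.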

Could you please try the following, written and tested with the shown samples in GNU awk.
awk -F'[ /]' '
!NF || /^\.+/{
next
}
out_file!=$4{
close(out_file)
out_file=$4
}
{
print >> (out_file)
}' Input_file
Explanation: A detailed explanation of the above.
awk -F'[ /]' ' ##Starting awk program from here and setting space and / as field separators for all lines.
!NF || /^\.+/{ ##Checking condition if number of fields is zero (empty line) OR line starts with a dot then do following.
next ##next will skip all further statements from here.
}
out_file!=$4{ ##Checking condition if out_file is NOT equal to the 4th field then do following.
close(out_file) ##Closing the previous output file to avoid a too many open files error.
out_file=$4 ##Setting out_file as 4th field here.
}
{
print >> (out_file) ##Appending current line to the out_file output file.
}' Input_file ##Mentioning Input_file name here.
EDIT: As per OP there could be lines starting with spaces, so in that case try the following.
awk -F'/' '
!NF || /^\./{
next
}
{
split($4,arr," ")
}
out_file!=arr[1]{
close(out_file)
out_file=arr[1]
}
{
print >> (out_file)
}' Input_file

Related

Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)

With your shown samples please try following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: Within a single read of Input_file try following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation(1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleting arr value with 2nd field index here, so each key is printed only once.
}
' Input_file Input_file ##Mentioning Input_file names here.
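The single-pass 2nd solution iterates with for(i in arr), and awk does not define the order of that traversal, so the groups may come out in any order. A minimal sketch of a single-pass variant that keeps the groups in first-seen order follows; the order array and the counter n are names introduced here for illustration, and the four sample lines are taken from the question.

```shell
#!/bin/sh
# Sketch: single pass that keeps groups in first-seen order.
# The four sample lines are taken from the input shown in the question.
sample='TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
index_08_barcode_37_PA-21-YC-15 21-YC'

result=$(printf '%s\n' "$sample" | awk '
!($2 in arr) {                 # first time this key is seen
    order[++n] = $2            # remember its position
    arr[$2] = $1
    next
}
{ arr[$2] = arr[$2] "," $1 }   # later lines: append with a comma
END {
    for (i = 1; i <= n; i++)
        print order[i] "(" arr[order[i]] ")"
}')

printf '%s\n' "$result"
```

Like the 2nd solution, this holds all groups in memory until the END block, so it trades memory for not having to read or sort the input twice.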

Assuming your input is grouped by the $2 value (if it isn't, just run sort -k2,2 on your input first), this uses 1 pass, stores only one token at a time in memory, and produces the output in the same order of $2s as the input. The sample as posted is not fully grouped, which is why CC_LlanR and EN_DavaW each appear twice in the output below:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)

This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.

How to edit output rows from awk with defined position?

Is there a way to solve this?
I have a bash script which creates .dat and .log files from source files.
I'm using awk with print and the positions I need to print. The problem is with the last position, ID2. It should be just \*[0-9]{3}\*#, but in some cases it is preceded by a string matching [0-9]{12}\[00]\>.
The row then looks, for example, like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is to remove that leading string, so that the row in the file becomes:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.

awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only the last 6 characters of the last field, with the semicolon as the field separator. If there is nothing in front of the *ID*# part, the whole field is kept.

Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using match function to match regex in it, what regex does is: It matches 12 digits, then [, then one or more digits, then ] and >. Also checking that the line contains * followed by 3 digits and *#.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If above condition is TRUE then printing sub-string from 1st character to RSTART-1 and then sub-string from RSTART+RLENGTH value to till last of line.
}
' Input_file ##Mentioning Input_file name here.
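The match-based program above prints only the rows that contain the prefix. A minimal sketch of a variant that strips the prefix from the last field and still prints every row could look like this. It uses [0-9]+ instead of {12} on the assumption that some awk versions lack interval expressions, and the second sample row is made up for illustration.

```shell
#!/bin/sh
# Sketch: strip only a leading "<digits>[<digits>]>" prefix from the last field
# and print every row. The second sample row is made up for illustration.
sample='2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
2020-01-12 02:02:02;test;test456;*456*#'

cleaned=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = OFS = ";" }
{ sub(/^[0-9]+\[[0-9]+\]>/, "", $NF) }   # no-op when the prefix is absent
1                                        # print the (possibly modified) row
')

printf '%s\n' "$cleaned"
```

Anchoring the pattern with ^ limits the substitution to the start of the ID2 field, so the ID itself is never touched.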

Select specific strings in different columns and print it AWK

I have syntax :
awk -F'\t' '{for(i=1;i<=NF;i++) {if($i~/ensembl_gene_id*/) {h=$i}} ;for(a=1;a<=NF;a++) {if($a~/ensembl_gn*/) {z=$a}} print $1,$2,$3,z,h}'
This command searches for several strings in arbitrary fields separated by "\t" and prints them. My skills are not so good, and I would like to rewrite it with only one loop (now I have two loops, over "i" and "a"). Could you help me find a simpler way with awk? (The code is working.)
I think something like this :
awk -F'\t' '{for(i=1;i<=NF;i++) {if($i~/ensembl_gene_id* | esnembl_gn*/) {h=$i}} {print $1,$2,$3,h}'
But it prints only first match.
INPUT:
1 2 les ensembl_gene_id=aaa aha ensembl_gn=BRAF
2 3 pes ccds ensembl_gene_id=kkk ahl klkl ensembl_gn=OTC
2 2 ves ccds=1 ccds=2 ensembl_gene_id=cac ensembl_gn=BRCA
OUTPUT:
1 2 les ensembl_gene_id=aaa ensembl_gn=BRAF
2 3 pes ensembl_gene_id=kkk ensembl_gn=OTC
2 2 ves ensembl_gene_id=cac
Thank you

EDIT: After seeing OP's samples, adding the following solution. (Change awk to awk 'BEGIN{FS=OFS="\t"} in case your Input_file is TAB delimited and your output should be TAB delimited too.)
awk '
match($0,/ensembl_gene_id[^ ]*/){
val=substr($0,RSTART,RLENGTH)
}
match($0,/ensembl_gn[^ ]*/){
val1=substr($0,RSTART,RLENGTH)
}
{
print $1,$2,$3,val,val1
val=val1=""
}
' Input_file
As far as I understood from your question (you want to run a single for loop and check 2 conditions; if so, we need not use 2 loops and can instead use a single loop with 2 conditions in it), could you please try the following.
awk -F'\t' '{h=z="";for(i=1;i<=NF;i++){if($i~/ensembl_gene_id*/){h=$i};if($i~/ensembl_gn*/){z=$i}};print $1,$2,$3,z,h}' Input_file
OR(a non-one liner form of solution):
awk '
{
h=z=""
for(i=1;i<=NF;i++){
if($i~/ensembl_gene_id*/){
h=$i
}
if($i~/ensembl_gn*/){
z=$i
}
}
print $1,$2,$3,z,h
}
' Input_file
Issue with OP's attempt: It will always print only 1 value, since h is a single variable and each later match overwrites the previous one.

Are you just trying to print the ensembl_gene_id and ensembl_gn fields? That'd be:
$ awk '{
delete f
for (i=1;i<=NF;i++) {
split($i,t,/=/)
f[t[1]] = $i
}
print $1, $2, $3, f["ensembl_gene_id"], f["ensembl_gn"]
}' file
1 2 les ensembl_gene_id=aaa ensembl_gn=BRAF
2 3 pes ensembl_gene_id=kkk ensembl_gn=OTC
2 2 ves ensembl_gene_id=cac ensembl_gn=BRCA

Find a pattern and replace

This is the input to my file.
Number : 123
PID : IIT/123/Dakota
The expected output is :
Number : 111
PID : IIT/111/Dakota
I want to replace 123 to 111. To solve this I have tried following:
awk '/Number/{$NF=111} 1' log.txt
awk -F '[/]' '/PID/{$2="123"} 1' log.txt

Use sed for something this simple?
Print the change to the screen (test with this):
sed -e 's:123:111:g' f2.txt
Update the file (with this):
sed -i 's:123:111:g' f2.txt
Example:
$ sed -i 's:123:111:g' f2.txt
$ cat f2.txt
Number : 111
PID : IIT/111/Dakota
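Note that s:123:111:g replaces every 123 anywhere in the file. A hedged sketch of a narrower variant, which changes the number only at the end of the Number line and between the slashes of the PID line, might look like this. The third sample line ("Other") is made up to show that unrelated lines stay untouched.

```shell
#!/bin/sh
# Sketch: replace the number only on the Number and PID lines.
# The "Other" line is made up to show that unrelated lines stay untouched.
sample='Number : 123
PID : IIT/123/Dakota
Other : 123'

replaced=$(printf '%s\n' "$sample" | sed \
    -e '/^Number/s/[0-9][0-9]*$/111/' \
    -e '/^PID/s#/[0-9][0-9]*/#/111/#')

printf '%s\n' "$replaced"
```

The second expression uses # as the s delimiter so that the slashes in the PID value do not need escaping.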

EDIT2: Or, if you want to substitute 123 with 111 on each line without checking any of the conditions you tried in your awk, then simply do:
awk '{sub(/123/,"111")} 1' Input_file
Change sub to gsub in case 123 occurs more than once in a single line.
Explanation of the code given in the EDIT below:
awk -v new_value="111" ' ##Creating an awk variable named new_value where OP could keep its new value which OP needs to be there in line.
/^Number/ { $NF=new_value } ##Checking if a line starts from Number string and then setting last field value to new_value variable here.
/^PID/ { num=split($NF,array,"/"); ##Checking if a line starts from PID then creating an array named array whose delimiter it / from last field value
array[2]=new_value; ##Setting second item of array to variable new_value here.
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] }; ##Starting a loop from 1 to till length of array and creating variable val to re-create last field of current line.
$NF=val; ##Setting last field value to variable val here.
val="" ##Nullifying variable val here.
}
1' Input_file ##Mentioning 1 to print the line and mentioning Input_file name here too.
EDIT: In case you need to keep / in your output too, then use the following awk.
awk -v new_value="111" '
/^Number/ { $NF=new_value }
/^PID/ { num=split($NF,array,"/");
array[2]=new_value;
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] };
$NF=val;
val=""
}
1' Input_file
The following awk may help you here. (It seems that after I applied code tags to your samples your sample input changed a bit, so I am editing my code accordingly.)
awk -F"[ /]" -v new_value="111" '/^Number/{$NF=new_value} /^PID/{$(NF-1)=new_value}1' Input_file
In case you want to save the changes into Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
Explanation:
awk -F"[ /]" -v new_value="111" ' ##Setting field separator as space and / to each line and creating awk variable new_value which OP wants to have new value.
/^Number/{ $NF=new_value } ##Checking condition if a line is starting with string Number then change its last field to new_value value.
/^PID/ { $(NF-1)=new_value } ##Checking condition if a line starts from string PID then setting second last field to variable new_value.
1 ##awk works on method of condition then action, so putting 1 making condition TRUE here and not mentioning any action so by default print of current line will happen.
' Input_file ##Mentioning Input_file name here.

AWK interpretation awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}'

I've been going through some simple bash scripts at work that someone else wrote months ago and I've found this line:
| awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}'
Can someone help me interpret this line in simple words? Thank you!

awk -F'AUTO_INCREMENT=' ' # Set 'AUTO_INCREMENT=' as a field separator
NF==1 { # If number of fields is one i.e. the line does not contain 'AUTO_INCREMENT='
print "0"; # print '0'
next # Go to next record i.e. skip following code
}
{
sub(/ .*/,"",$2); # Delete everything from the first space onward in the second field
print $2 # Print the second field
}'
Example
Sample inputs
AUTO_INCREMENT=3
AUTO_INCREMENT=10 20 30 foo bar
Output
3
0
10
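To see both branches of the program at work, here is a small sketch that runs the same awk command on two hypothetical table-option lines (they are assumptions for illustration, not taken from the script in the question): one that contains AUTO_INCREMENT= and one that does not.

```shell
#!/bin/sh
# Sketch: the two table-option lines are hypothetical examples, not from the question.
sample='ENGINE=InnoDB AUTO_INCREMENT=42 DEFAULT CHARSET=utf8
ENGINE=InnoDB DEFAULT CHARSET=utf8'

values=$(printf '%s\n' "$sample" |
    awk -F'AUTO_INCREMENT=' 'NF==1{print "0";next}{sub(/ .*/,"",$2);print $2}')

printf '%s\n' "$values"
```

The first line is split into two fields around AUTO_INCREMENT=, so the number at the start of the second field is printed; the second line has no separator, so NF is 1 and 0 is printed instead.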
