TCL "join" command not merging list into single string - string

With the "join" command I've assumed tcl would merge a list of elements into a single string with delimiter.
However this is not what I see at my terminal. Also, without a delimiter it returns the same list of elements with a space in between although ideally it should merge them with no spaces
Example:
## Setting original string
set A [list 1 2 3]
% 1 2 3
puts [llength $A]
% 3
## Join list without delimiter
set B [join $A]
% 1 2 3
puts [llength $B]
% 3
## Join list with space delimiter (actual requirement)
set C [join $A " "]
% 1 2 3
puts [llength $C]
% 3
## Join list with comma delimiter (to also visibly check what happens to each element of list)
set D [join $A ","]
% 1, 2, 3
puts [llength $D]
% 3
foreach item $D {puts $item}
1,
2,
3
I'm not sure what is going wrong here.
I am trying to set a variable to the single string "1 2 3", i.e. to merge all elements of a list into one string.
However, "join" seems to return the same list as before, just with the delimiter added to each element of the list (except the last).
EDIT: On my new machine, [join $A ","] works correctly, giving 1,2,3 without spaces.

The highly unlikely bit is this:
set A [list 1 2 3]
# ==> 1 2 3
set D [join $A ","]
# ==> 1, 2, 3
as when I put that into a fresh Tcl session I instead get a final output of 1,2,3 (and that's not behaviour anyone's planning to change). I'm guessing you have a stray space in there or have defined a custom version of join. Note also that llength reporting 3 for the space-joined result does not mean the join failed: "1 2 3" is a single string, but any well-formed space-separated string can itself be parsed as a list, so llength sees three elements; check with string length if you want to confirm it is one string.

Related

Get the most frequently appearing phrase (not word) in a file in bash

My file is
cat a.txt
a
b
aa
a
a a
I am trying to get the most frequently appearing phrase (not word).
My code is
tr -c '[:alnum:]' '[\n*]' < a.txt | sort | uniq -c | sort -nr
4 a
1 b
1 aa
1
I need
2 a
1 b
1 aa
1 a a
sort a.txt | uniq -c | sort -rn
When you say “in Bash”, I’m going to assume that no external programs are allowed in this exercise. (Also, what is a phrase? I’m going to assume that there is one phrase per line and that no extra preprocessing (such as whitespace trimming) is needed.)
frequent_phrases() {
    local -Ai phrases
    local -ai {dense_,}counts
    local phrase
    local -i count i

    while IFS= read -r phrase; do                  # Step 0
        ((++phrases["${phrase}"]))
    done

    for phrase in "${!phrases[@]}"; do             # Step 1
        ((count = phrases["${phrase}"]))
        ((++counts[count]))
        local -a "phrases_$((count))"
        local -n phrases_ref="phrases_$((count))"
        phrases_ref+=("${phrase}")
    done

    dense_counts=("${!counts[@]}")                 # Step 2

    for ((i = ${#counts[@]} - 1; i >= 0; --i)); do # Step 3
        ((count = dense_counts[i]))
        local -n phrases_ref="phrases_$((count))"
        for phrase in "${phrases_ref[@]}"; do
            printf '%d %s\n' "$((count))" "${phrase}"
        done
    done
}
frequent_phrases < a.txt
Steps taken by the frequent_phrases function (marked in code comments):
Step 0: Read lines (phrases) into an associative array while counting their occurrences. This yields a mapping from phrases to their counts (the phrases array).
Step 1: Create a reverse mapping from counts back to phrases. Obviously, this will be a “multimap”, because multiple different phrases can occur the same number of times. To avoid assumptions about separator characters disallowed in a phrase, we store the list of phrases for each count in a dynamically named array (instead of a single array); a standalone sketch of this trick follows at the end of this answer. For example, all phrases that occur 11 times will be stored in an array called phrases_11.
Besides the map inversion (from (phrase → count) to (count → phrases)), we also gather all known counts in an array called counts. The values of this array (how many different phrases occur a particular number of times) are somewhat useless for this task, but its keys (the counts themselves) are a useful representation of a sparse set of counts that can later be iterated in sorted order.
Step 2: We compact our sparse array of counts into a dense array dense_counts for easy backward iteration. (This would be unnecessary if we were to just iterate through the counts in increasing order. A reverse order of iteration is not that easy in Bash if we want to implement it efficiently, without trying every possible count between the maximum and 1.)
Step 3: We iterate through all known counts backwards (from highest to lowest) and for each count print out all phrases that occur that number of times. Again, phrases that occur 11 times, for example, are read from the array called phrases_11.
Just for completeness, to print out (also) the extra bits of statistics we gathered, one could extend the printf command like this:
printf 'count: %d, phrases with this count: %d, phrase: "%s"\n' \
"$((count))" "$((counts[count]))" "${phrase}"

Print out even numbers between two given integers

How would I print out even numbers between two numbers?
I have a script where a user enters two values and those two values are placed into their respective array elements. How would I print the even numbers between the two values?
See man seq. You can use
seq first incr last
for example
seq 4 2 18
to print even numbers from 4 to 18 (inclusive)
If you have bash:
printf '%s\n' {4..18..2}
Or a C-style for loop:
for ((i=4; i<=18; i+=2)); do echo "$i"; done
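Since the question says the two bounds come from user input, and brace expansion such as {4..18..2} does not expand variables, here is a minimal sketch assuming the bounds have been read into the variables first and last:
read -r first last                               # e.g. the user types: 5 18
start=$(( first % 2 == 0 ? first : first + 1 ))  # round an odd lower bound up to the next even number
for (( i = start; i <= last; i += 2 )); do
    echo "$i"
done
With 5 and 18 this prints 6 8 10 12 14 16 18; with seq the equivalent would be seq "$start" 2 "$last".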

Multiple text insertion in Linux

Can someone help me write a command that will insert some text at multiple places (given column and row) of a given file that already contains data? For example: old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I have read something about the awk and sed commands, but I don't think I understand how to incorporate them to get what I want.
I would like to add that I want to use this command as part of a script:
for b in ./*/ ; do (cd "$b" && command); done
If we imagine the content of old_data as a matrix of elements {A(n,m)}, where n is the row number and m the column number of this matrix, I wish to manipulate the matrix so that I can add new elements. A in old_data has coordinates (1,1). In new_data, therefore, I wish to assign 1 to the matrix element with coordinates (1,3).
If we compare the content of old_data and new_data, we see that element (1,2) corresponds to a space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you are saying that you would like a way to insert some given text into a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3 (using commas as more visible separators), as per the newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4
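The question also mentions wanting to run such a command inside a loop over subdirectories. Purely as a sketch (it assumes every subdirectory contains a file named old_data and that row, column and text carry the values you want, as in the awk answer above), the two pieces can be combined like this:
row=1 column=3 text="1"
for b in ./*/ ; do
    (
        cd "$b" &&
        awk 'NR==row {$column = new_data " " $column}1' \
            row="$row" column="$column" new_data="$text" old_data > new_data
    )
done
The subshell keeps each cd local to a single iteration, exactly as in the loop from the question.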

Mapping lines to columns in *nix

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.
Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{$fields_per_line=5; $field_separator="\t";
                $line_break="\n"}
          chomp;
          print $_,
                $. % $fields_per_line ? $field_separator : $line_break;
          END{print $line_break}' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tab), and "\n" (newline) as needed.
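If external tools are acceptable, a shorter route for this particular shape of data is to let paste fold every five lines into one tab-separated row and then align the result with column, much like the manual vim-plus-column approach described in the question. A minimal sketch, assuming the pasted text is in a file named input and every record really has five fields:
paste - - - - - < input | column -ts$'\t'
Each - tells paste to take one more line from standard input per output row, so five dashes rebuild the five columns; column -t aligns them, and -s$'\t' keeps multi-word cells from being split on spaces.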
You could use a script with a read loop and a counter. When the loop reaches the line you want, use the cut command with a space as the delimiter to pick out the word you want:
counter=0
lineNumber=3
while read -r line
do
    counter=$((counter + 1))
    if [ "$counter" -eq "$lineNumber" ]
    then
        echo "$line" | cut -d" " -f 4
    fi
done < input

Extract rows and substrings from one file conditional on information of another file

I have a file 1.blast with coordinate information like this
1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3
27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46
35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46
35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27
46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
and a file 1.fasta with sequence information like this
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG
I am now looking for a script that takes the first column from 1.blast, extracts those sequence IDs (= first column $1) plus their sequences from the 1.fasta file, and then keeps from each sequence everything except the positions between $7 and $8, meaning that for the first two matches the output would be
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...
(please notice that the first three bases of >1 are not in this sequence)
The IDs are consecutive, meaning I can extract the required information like this:
awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast
This gives me a matrix whose first column is the row number of the sequence identifier, whose second column is the row number of the sequence itself (= the row after the ID row), and then the two coordinates that should be excluded. So it is basically a matrix containing all the information needed to decide which elements of 1.fasta shall be extracted.
Unfortunately I do not have much experience with scripting, so I am now a bit lost: how do I feed these values into, for example, a suitable sed command?
I can get specific rows like this:
sed -n 3,4p 1.fasta
and the string that I want to remove e.g. via
sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'
But my problem now is: how can I pipe the information from the first awk call into the other commands so that they extract the right rows and then remove the given coordinate range from the sequence rows? So substr isn't the right command; I would need something like remstr(string,start,stop) that removes everything between the two positions from a given string, but I think I could write that in a script of my own. It is especially the correct piping that is a problem for me.
If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at Bioperl. This obviously requires knowledge of Perl, but has quite a lot of functionality.
In your case you would want to generate Bio::Seq objects from your fasta-file using the Bio::SeqIO module.
Then you would need to read the wanted fasta entry numbers and positions into a hash, with the fasta name as the key and, as the value, an array of the two positions for each subsequence you want to extract. If there can be more than one such subsequence per fasta entry, you would have to create an array of arrays as the value for each key.
With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.
This isn't an answer, it is an attempt to clarify your problem; please let me know if I have gotten the nature of your task correct.
foreach row in blast:
get the proper (blast[$1]) sequence from fasta
drop bases (blast[$7..$8]) from sequence
print blast[$1], shortened_sequence
If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.
As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here is a script that can teach you something about how to handle it with awk:
Content of script.awk:
## Process first file from arguments.
FNR == NR {
## Save ID and the range of characters to remove from sequence.
blast[ $1 ] = $(NF-1) " " $NF
next
}
## Process second file. For each FASTA id...
$1 ~ /^>/ {
## Get number.
id = substr( $1, 2 )
## Read next line (the sequence).
getline sequence
## if the ID is one found in the other file, get ranges and
## extract those characters from sequence.
if ( id in blast ) {
split( blast[id], ranges )
sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )
## Print both lines with the shortened sequence.
printf "%s\n%s\n", $0, sequence
}
}
Assuming the 1.blast from the question and a customized 1.fasta to test it:
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA
Run the script like:
awk -f script.awk 1.blast 1.fasta
That yields:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
Of course I'm assuming some things, the most important being that the fasta sequences are not longer than one line.
Updated the answer:
awk '
    # First file (1.fasta): remember each sequence, keyed by its numeric ID.
    NR==FNR && NF {
        id = substr($1, 2)                  # strip the leading ">"
        getline seq
        a[id] = seq
        next
    }
    # Second file (1.blast): cut out the bases between columns $7 and $8.
    ($1 in a) && NF {
        x = substr(a[$1], $7, $8 - $7 + 1)  # the stretch to remove
        sub(x, "", a[$1])                   # assumes that stretch does not occur earlier in the sequence
        print ">" $1 "\n" a[$1]
    }' 1.fasta 1.blast
