I have a tsv file like
1 2 3 4 5 ...
a b c d e ...
x y z j k ...
How can I merge two contiguous columns, say the 2nd and the 3rd, to get
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
I need the code to work with text files with different numbers of columns, so I can't use something like awk 'BEGIN{FS="\t"} {print $1"\t"$2"-"$3"\t"$4"\t"$5}' file
awk is the first tool I thought about for the task and one I'm trying to learn, so I'm very interested in answers using it, but any solution with any other tool would be greatly appreciated.
With a simple sed command for a TSV file:
sed 's/\t/-/2' file
The output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
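The trailing 2 is sed's occurrence flag, so only the second tab on each line is replaced. As a sketch of the general case (assuming GNU sed, which understands \t): to merge column N with column N+1, replace the Nth tab, e.g. for columns 4 and 5:
sed 's/\t/-/4' file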
The following awk may also help, provided you are not worried about the extra space that is left behind when the 3rd field is emptied.
awk '{$2=$2"-"$3;$3=""} 1' Input_file
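If that leftover separator does bother you, one possible variant (a sketch, assuming tab-separated input and no other empty fields earlier in the line) squeezes the doubled tab back out after emptying the field:
awk 'BEGIN{FS=OFS="\t"} {$2=$2"-"$3; $3=""; sub(/\t\t/,"\t")} 1' Input_file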
With awk:
awk -v OFS='\t' -v col=2 '{
$(col)=$(col)"-"$(col+1); # merge col and col+1
for (i=col+1;i<NF;i++) $(i)=$(i+1); # shift columns right of col+1 by one to the left
NF--; # remove the last field
}1' file # print the record
Output:
1 2-3 4 5 ...
a b-c d e ...
x y-z j k ...
My text file should have two columns separated by a tab (represented by \t below). However, there are a few corrupted values where column 1 contains two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the 2nd value that appears after the space in column 1. For example, in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=(cut -d$'\t' -f1 $file | cut -d' ' -f1)
col2=(cut -d$'\t' -f1 $file)
myArr=()
for((idx=0;idx<${#col1[#]};idx++));do
echo "#{col1[$idx]} #{col2[$idx]}"
# I will append to myArr here
done
The output appends the list in col2 to col1, giving A B C D E 1 2 3 4 5. On top of this, my file is very large (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search for any literal space followed by any number of non-TAB characters and replace it with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there, just make it 's/ +[^\t]+//' ...
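For example, as a quick sanity check (the printf line is just made-up sample input):
printf 'C  x\t3\n' | sed -E 's/ +[^\t]+//'
C 3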
Assuming that when you say a space you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
A solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several different versions of sed):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/g'
A 1
B 2
C 3
D 4
E 5
This captures the run of non-whitespace characters at the front of the line and the run of non-whitespace characters after the \t.
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C quoting ($'...') feature of Bash is used here so that tab characters can be written visibly as \t.
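A quick illustration of that quoting, separate from the actual data (the strings here are just examples):
echo $'a\tb'    # bash expands \t to a real tab before echo ever sees it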
Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
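In case that one-liner looks cryptic, here is a rough reading of it (my sketch of the same idea, not the original author's wording): the field-separator regex swallows everything from the first blank character through the last blank on the line, so a corrupted line like C x<TAB>3 splits into just C and 3, while a clean line with a single tab has nothing for the regex to match and passes through as one field; reassigning NF then forces awk to rebuild the record with the tab OFS. Spelled out long-hand it might look like:
awk 'BEGIN{FS="[ \t].*[ \t]"; OFS="\t"} {$1=$1} 1' file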
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'
I am looking for a fast way to extract lines of a file based on a list of line numbers read from a different file in bash.
Define three files:
position_file: Containing a single column of integers
full_data_file: Containing a single column of data
extracted_data_file: Containing those lines in full_data_file whose line numbers match the integers in position_file
My current way of doing this is
while read position; do
awk -v pos="$position" 'NR==pos {print; exit}' < full_data_file >> extracted_data_file
done < position_file
The problem is that this is painfully slow and I'm trying to do this for a large number of rather large files. I was hoping someone might be able to suggest a faster way.
Thank you for your help.
The right way to do this with an awk command:
Input files:
$ head pos.txt data.txt
==> pos.txt <==
2
4
6
8
10
==> data.txt <==
a
b
c
d
e
f
g
h
i
j
awk 'NR==FNR{ a[$1]; next }FNR in a' pos.txt data.txt > result.txt
$ cat result.txt
b
d
f
h
j
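If the NR==FNR idiom is new to you, here is roughly the same command spread over several lines with comments (same logic, only reformatted):
awk '
  NR == FNR {   # true only while reading the first file, pos.txt
    a[$1]       # remember each wanted line number as an array key
    next        # and skip the remaining rules for those lines
  }
  FNR in a      # while reading data.txt: print lines whose number is a key
' pos.txt data.txt > result.txt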
Can someone help me write a command that will insert some text at multiple places (given column and row) in a file that already contains data? For example, old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I read something about the awk and sed commands, but I don't really understand how to incorporate them to get what I want.
I should also add that I would like to use this command as part of a script:
for b in ./*/ ; do (cd "$b" && command); done
If we imagine the content of old_data as a matrix of elements {An*m}, where n is the row number and m the column number, I wish to manipulate the matrix so that I can add new elements. A in old_data has coordinates (1,1). In new_data, therefore, I wish to assign 1 to the matrix element with coordinates (1,3).
If we compare the contents of old_data and new_data, we see that element (1,2) corresponds to a space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you are saying that you would like a way to insert some given text into a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
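A note on the form used there: the row=$row column=$column new_data="$text" assignments sit between the program and the input file name, and awk applies such var=value operands in order, so the variables are set before any input is read. An equivalent sketch using -v options instead:
awk -v row="$row" -v column="$column" -v new_data="$text" 'NR==row {$column = new_data " " $column}1' input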
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3 (using commas as more visible separators), as per the newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4
So I have a huge list of items.
I need to grep every line containing a number of 1300 or above.
How can I do this? Will grep do this? Thanks
While grep technically can, it's probably not the best tool for the job. If the list is in a fixed format, you might be better off using something like awk.
Sample input:
a b c 1100 d e f
g h i 1200 j k l
m n o 1300 p q r
s t u 1400 v w x
Sample code:
awk -F' ' '($4 >= 1300) { print $0 }' input_file
Sample output:
m n o 1300 p q r
s t u 1400 v w x
awk goes through every line, splitting it into tokens delimited by a space (as dictated by the -F' ' parameter; by default it already splits on whitespace, but showing it explicitly lets you change it to however your file is formatted). The logic then says: for every line whose field 4 is greater than or equal to 1300, print the line (print $0).
Yes you can do this with grep, something along the lines of:
$ grep -E '(1[3-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})'
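For example, run against the same sample input as the awk answer above:
$ grep -E '(1[3-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})' input_file
m n o 1300 p q r
s t u 1400 v w x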
I would like to merge two files (one space delimited and the other tab delimited), keeping only the records that match between the two files:
File 1: space delimited
A B C D E F G H
s e id_234 4 t 5 7 9
r d id_45 6 h 3 9 10
f w id_56 2 y 7 3 0
s f id_67 2 y 10 3 0
File 2: tab delimited
I L M N O P
s e 4 u id_67 88
d a 5 d id_33 67
g r 1 o id_45 89
I would like to match File 1 field 3 ("C") with file 2 field 5 ("O"), and merge the files like this:
File 3: tab delimited
I L M N O P A B D E F G H
s e 4 u id_67 88 s f 2 y 10 3 0
g r 1 o id_45 89 r d 6 h 3 9 10
There are entries in file 1 that don't appear in file 2, and vice versa, but I only want to keep the intersection (the common ids).
I don't really care about the order.
I would prefer not to use join because these are really big unsorted files, and join requires sorting on the common field first, which takes a very long time and a lot of memory.
I have tried with awk but unsuccessfully
awk > file3 'NR == FNR {
f2[$3] = $2; next
}
$5 in f2 {
print $0, f2[$2]
}' file2 file1
Can someone please help me?
Thank you very much
Hmm... you'll ideally be looking to avoid an n^2 solution, which is what the awk-based approach seems to require: for each record in file1 you have to scan file2 to see if it occurs. That's where the time is going.
I'd suggest writing a Python (or similar) script for this, building a map of id -> file position for one of the files, and then querying that while scanning the other file. That'd get you an n log n runtime, which, to me at least, looks to be the best you could do here (using a hash for the index still leaves you with the expensive problem of seeking to the file position).
In fact, here's the Python script to do that:
f1 = open("file1.txt")
f1_index = {}

# Generate index for file1: id -> byte offset of the line
fpos = f1.tell()
line = f1.readline()
while line:
    id = line.split()[2]
    f1_index[id] = fpos
    fpos = f1.tell()
    line = f1.readline()

# Now scan file2 and output matches
f2 = open("file2.txt")
line = f2.readline()
while line:
    id = line.split()[4]
    if id in f1_index:
        # Found a matching line, seek to the file1 pos and read
        # the line back in
        f1.seek(f1_index[id], 0)
        line2 = f1.readline().split()
        del line2[2]  # <- Remove the redundant id_XX
        new_line = "\t".join(line.strip().split() + line2)
        print(new_line)
    line = f2.readline()
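Assuming you save that as something like merge_files.py (the name is only illustrative), you would run it and capture the result with:
python3 merge_files.py > file3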
If sorting the two files (on the columns you want to match on) is a possibility (and wouldn't break the content somehow), join is probably a better approach than trying to accomplish this with bash or awk. Since you state you don't really care about the order, this would probably be an appropriate method.
It would look something like this:
join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2)
I wish there was a better way to tell it which columns to output, because that's a lot of typing, but that's the way it works. You could probably also leave off the -o ... stuff, and then just post-process the output with awk or something to get it into the order you want...
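As a rough sketch of that post-processing idea (assuming, as in the sample data, that the fields themselves never contain spaces), you could convert join's space-separated output back into tabs by piping the whole command through tr:
join -1 3 -2 5 -o '2.1,2.2,2.3,2.4,2.5,2.6,1.1,1.2,1.4,1.5,1.6,1.7,1.8' <(sort -k3,3 file1) <(sort -k5,5 file2) | tr ' ' '\t'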