Mapping lines to columns in *nix - linux

I have a text file that was created when someone pasted from Excel into a text-only email message. There were originally five columns.
Column header 1
Column header 2
...
Column header 5
Row 1, column 1
Row 1, column 2
etc
Some of the data is single-word, some has spaces. What's the best way to get this data into column-formatted text with unix utils?
Edit: I'm looking for the following output:
Column header 1 Column header 2 ... Column header 5
Row 1 column 1 Row 1 column 2 ...
...
I was able to achieve this output by manually converting the data to CSV in vim by adding a comma to the end of each line, then manually joining each set of 5 lines with J. Then I ran the csv through column -ts, to get the desired output. But there's got to be a better way next time this comes up.

Perhaps a perl-one-liner ain't "the best" way, but it should work:
perl -ne 'BEGIN{$fields_per_line=5; $field_seperator="\t"; \
$line_break="\n"} \
chomp; \
print $_, \
$. % $fields_per_row ? $field_seperator : $line_break; \
END{print $line_break}' INFILE > OUTFILE.CSV
Just substitute the "5", "\t" (tabspace), "\n" (newline) as needed.

You would have to use a script that uses readline and counter. When the program reaches that line you want, use cut command and space as a dilimeter to get the word you want
counter=0
lineNumber=3
while read line
do
counter += 1
if lineNumber==counter
do
echo $line | cut -d" " -f 4
done
fi

Related

Bash script: filter columns based on a character

My text file should be of two columns separated by a tab-space (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the 2nd value that is present after the space in column 1 for eg. in C\sx\t3 I can discard the x that is present after space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=(cut -d$'\t' -f1 $file | cut -d' ' -f1)
col2=(cut -d$'\t' -f1 $file)
myArr=()
for((idx=0;idx<${#col1[#]};idx++));do
echo "#{col1[$idx]} #{col2[$idx]}"
# I will append to myArr here
done
The output is appending the list of col2 to the col1 as A B C D E 1 2 3 4 5. And on top of this, my file is very huge i.e. 5,300,000 rows so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search and replace any literal space followed by any number of non-TAB-characters with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
Solution using Perl regular expressions (for me they are easier than seds, and more portable as there are few versions of sed)
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This code gets all non-empty characters from the front and all non-empty characters from after \t
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
take advantage of FS and OFS and let them do all the hard work for you
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
if there's a chance of leading edge or trailing edge spaces and tabs, then perhaps
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Linux filtering a file by two columns and printing the output

I have a table that has 9 columns as shown below.
How would I first sort by the strand column so only those with a "+" are selected, and then of those I select the ones that have 3 exons (In the exon count column).
I have been trying to use grep for this as I understand I can pick out a word from a column, but I only get the particular column or just the total number.
using awk
awk -F "," ' $4=="+" && $9=="3" ' file.csv
If it's not CSV then remove -F "," from this command

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is same as the shown sample then following code may help you in same.
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parantheses your store in groups what's inbetween.
First group is everything after name=" till the first ". [^"] means "not a double-quote".
Second group is simply "one or more numbers at the end of the line preceeded with a space".

Multiple text insertion in Linux

Can someone help me how to write a piece of command that will insert some text in multiple places (given column and row) of a given file that already contains data. For example: old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I read something about awk and sed commands, but I don't believe to understand how to incorporate these, to get what I want.
I would like to add up, that this command I would like to use as a part of script
for b in ./*/ ; do (cd "$b" && command); done
If we imagine content of old_data as a matrix of elements {An*m} where n corresponds to number of row and m to number of column of this matrix, I wish to manipulate with matrix so that I could add new elements. A in old-data has coordinates (1,1). In new_data therefore, I wish to assign 1 to a matrix element that has coordinates (1,3).
If we compare content of old_data and new_data we see that (1,2) element corresponds to space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you are saying that you would like a way to insert some given text in to a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3, (using commas as more visible separators), as per newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4

script to compare two large 900 x 900 comma delimited files

I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files. I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files. I have tried awk but havent been able to perform a diff for every cell 1 at a time on both files.
If you just want a rough answer, possibly the simplest thing is to do something like:
tr , \\n file1 > /tmp/output
tr , \\n file2 | diff - /tmp/output
That will convert each file to one column and run diff. You can compute the cells that differ from the line numbers of the output.
Simplest way with awk without accounting for newlines inside fields,quoted commas etc.
Print the same
awk 'BEGIN{RS=",|"RS}a[FNR]==$0;{a[NR]=$0}' file{,2}
Print differences
awk 'BEGIN{RS=",|"RS}FNR!=NR&&a[FNR]!=$0;{a[NR]=$0}' file{,2}
Print which are the same different
awk 'BEGIN{RS=",|"RS}FNR!=NR{print "cell"FNR (a[FNR]==$0?"":" not")" the same"}{a[NR]=$0}' file{,2}
Input
file
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
file2
1,2,3,4,5
2,7,1,9,12
1,1,1,1,12
Output
same
1
2
3
4
5
7
9
Different
2
1
12
1
1
1
1
12
Same different
cell1 the same
cell2 the same
cell3 the same
cell4 the same
cell5 the same
cell6 not the same
cell7 the same
cell8 not the same
cell9 the same
cell10 not the same
cell11 not the same
cell12 not the same
cell13 not the same
cell14 not the same
cell15 not the same

Resources