Bash script: filter columns based on a character - Linux

My text file should consist of two columns separated by a tab (represented below by \t), as shown. However, there are a few corrupted rows where column 1 contains two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
i.e. discard the second value that appears after the space in column 1. For example, in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the columns apart on \t, then cut the first column again on \s, and join them back together. However, it did not work.
Here is the snippet:
col1=($(cut -d$'\t' -f1 "$file" | cut -d' ' -f1))
col2=($(cut -d$'\t' -f2 "$file"))
myArr=()
for ((idx=0; idx<${#col1[@]}; idx++)); do
    echo "${col1[$idx]} ${col2[$idx]}"
    # I will append to myArr here
done
The output appends the col2 list after col1, as A B C D E 1 2 3 4 5. On top of this, my file is very large (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)

A sed solution:
Search for any literal space followed by any number of non-tab characters, and replace it with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there, just make it 's/ +[^\t]+//' ...
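For example, with GNU sed (where \t is understood inside a bracket expression), a doubly corrupted row such as C\s\sx\t3 still comes out right:
$ printf 'C  x\t3\n' | sed -E 's/ +[^\t]+//'
C 3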

Assuming that when you say a space you mean a blank character, then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
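A commented version of the same one-liner (functionally identical):
awk '
    BEGIN { FS = OFS = "\t" }  # read and write tab-separated fields
    { sub(/ .*/, "", $1) }     # in column 1, delete from the first blank to the end of the field
    1                          # always-true pattern: print the (possibly fixed) record
' file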

A solution using Perl regular expressions (for me they are easier than sed's, and more portable, since sed exists in several incompatible versions):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls | perl -pe 's/^(\S+).*\t(\S+)/$1 $2/g'
A 1
B 2
C 3
D 4
E 5
This captures the run of non-whitespace characters at the front and the run of non-whitespace characters after the \t.
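Note that the replacement joins the two captures with a space; if the output must stay tab-separated as in the question, a small tweak is to join with \t instead:
perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/' ls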

Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C quoting ($'...') feature of Bash is used so that tab characters can be written as \t.
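To see the script that sed actually receives once the quoting is expanded, you can print it yourself; the bracket expressions then contain literal tab characters:
$ printf '%s\n' $'s/^\\([^ \t]*\\) [^\t]*/\\1/'
s/^\([^ 	]*\) [^	]*/\1/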

Take advantage of FS and OFS and let them do all the hard work for you:
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
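In effect, everything from the first blank or tab through the last blank or tab on a corrupted line is treated as one big field separator; a spelled-out sketch of the same idea:
awk 'NF = NF' FS='[ \t].*[ \t]' OFS='\t' < file
# On "C x\t3", FS swallows " x\t", leaving $1="C" and $2="3"; assigning to NF
# makes awk rebuild $0 with OFS, i.e. a single tab. Clean lines like "A\t1"
# contain no match for FS and pass through unchanged.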
If there's a chance of leading or trailing spaces and tabs, then perhaps:
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Related

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands of columns and they are unnumbered), so I need to use the column names. I have searched and tried multiple approaches, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME"
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
With your input file and the implied conditions, this should work:
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
Or you can replace c1 and c2 with the literal values in the script.
To find the column indices you can run:
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
Or perhaps with this chain:
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although this may produce false positives due to the regex match...
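One way to tighten that (assuming GNU or BSD grep) is to match whole words only:
$ sed 1q file | tr -s ' ' \\n | nl | grep -wE 'ad|bd'
2 ad
3 bd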
You can rewrite the awk to be more succinct:
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
    NR==1 {
        for (i=1; i<=NF; i++) {
            f[$i] = i
        }
    }
    ($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
    END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines so as to map column names to numbers. If I'm wrong and the fields are just separated by runs of white space, then remove -F'\t' from the above.
Use the Miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (the delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
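Assuming a Miller build that includes the count verb, the tail/wc post-processing can also be done inside Miller itself, which prints a one-column count table:
mlr --tsv filter '$ad == "c" && $bd == "water"' then count file.txt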
For years it bugged me that there is no succinct way in Unix to do this sort of thing, although Miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns; -h omits it. To print columns you simply name them on the command line, e.g.
pick ad bd < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

How can a Linux cpuset be iterated over with shell?

Linux's sys filesystem represents sets of CPU ids with the syntax:
0,2,8: Set of CPUs containing 0, 2 and 8.
4-6: Set of CPUs containing 4, 5 and 6.
Both syntaxes can be mixed and matched, for example: 0,2,4-6,8
For example, running cat /sys/devices/system/cpu/online prints 0-3 on my machine which means CPUs 0, 1, 2 and 3 are online.
The problem is that the above syntax is difficult to iterate over with a for loop in a shell script. How can it be converted to a more conventional form such as 0 2 4 5 6 8?
Try:
$ echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' ORS=' ' RS=, FS=-
0 2 4 5 6 8
This can be used in a loop as follows:
for n in $(echo 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=-)
do
echo cpu="$n"
done
Which produces the output:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
Or like:
printf "%s" 0,2,4-6,8 | awk '/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next} 1' RS=, FS=- | while read n
do
echo cpu="$n"
done
Which also produces:
cpu=0
cpu=2
cpu=4
cpu=5
cpu=6
cpu=8
How it works
The awk command works as follows:
RS=,
This tells awk to use , as the record separator.
If, for example, the input is 0,2,4-6,8, then awk will see four records: 0 and 2 and 4-6 and 8.
FS=-
This tells awk to use - as the field separator.
With FS set this way and if, for example, the input record consists of 2-4, then awk will see 2 as the first field and 4 as the second field.
/-/{for (i=$1; i<=$2; i++)printf "%s%s",i,ORS;next}
For any record that contains -, we print out each number starting with the value of the first field, $1, and ending with the value of the second field, $2. Each such number is followed by the Output Record Separator, ORS. By default, ORS is a newline character. For some of the examples above, we set ORS to a blank.
After we have printed these numbers, we skip the rest of the commands and jump to the next record.
1
If we get here, then the record did not contain - and we print it out as is. 1 is awk's shorthand for print-the-line.
A Perl one:
echo "0,2,4-6,8" | perl -lpe 's/(\d+)-(\d+)/{$1..$2}/g; $_="echo {$_}"' | bash
Just convert the original string into echo {0,2,{4..6},8} and let bash's brace expansion interpolate it.
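To inspect the generated command before it is handed to bash, just drop the final pipe:
$ echo "0,2,4-6,8" | perl -lpe 's/(\d+)-(\d+)/{$1..$2}/g; $_="echo {$_}"'
echo {0,2,{4..6},8}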
eval echo $(cat /sys/devices/system/cpu/online | sed 's/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g' | tr , ' ')
Explanation:
cat /sys/devices/system/cpu/online reads the file from sysfs. This can be changed to any other file such as offline.
The output is piped through the substitution s/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g. This matches something like 4-6 and replaces it with $(seq 4 6).
tr , ' ' replaces all commas with spaces.
At this point, the input 0,2,4-6,8 is transformed to 0 2 $(seq 4 6) 8. The final step is to eval this sequence to get 0 2 4 5 6 8.
The example echoes the output. Alternatively, it can be written to a variable or used in a for loop.
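As a quick sanity check, the inner pipeline can be run without the surrounding eval echo to see the intermediate form described above:
$ echo 0,2,4-6,8 | sed 's/\([[:digit:]]\+\)-\([[:digit:]]\+\)/$(seq \1 \2)/g' | tr , ' '
0 2 $(seq 4 6) 8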

Extract substring from first column

I have a large text file with 2 columns. The first column is large and complicated, but contains a name="..." portion. The second column is just a number.
How can I produce a text file such that the first column contains ONLY the name, but the second column stays the same and shows the number? Basically, I want to extract a substring from the first column only AND have the 2nd column stay unaltered.
Sample data:
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
So the result file would be something like this
app-name_01 0
myapp-02 1
app_name_public 3
...
If your actual Input_file is the same as the shown sample, then the following code may help:
awk '{sub(/.*name=\"/,"");sub(/\".* /," ")} 1' Input_file
Output will be as follows.
app-name_01 0
myapp-02 1
app_name_public 3
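The same one-liner, spread out with comments (functionally identical):
awk '{
    sub(/.*name=\"/,"")   # delete everything up to and including name="
    sub(/\".* /," ")      # replace from the closing " through the last space with a single space
} 1' Input_file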
Using GNU awk
$ awk 'match($0,/name="([^"]*)"/,a){print a[1],$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Non-Gawk
awk 'match($0,/name="([^"]*)"/){t=substr($0,RSTART,RLENGTH);gsub(/name=|"/,"",t);print t,$NF}' infile
app-name_01 0
myapp-02 1
app_name_public 3
Input:
$ cat infile
application{id="1821", name="app-name_01"} 0
application{id="1822", name="myapp-02", optionalFlag="false"} 1
application{id="1823", optionalFlag="false", name="app_name_public"} 3
...
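For reference, here is the non-gawk variant spread out with comments (functionally identical):
awk 'match($0, /name="([^"]*)"/) {    # find name="..." anywhere in the line
    t = substr($0, RSTART, RLENGTH)   # t now holds, e.g., name="app-name_01"
    gsub(/name=|"/, "", t)            # strip the name= prefix and both quotes
    print t, $NF                      # print the extracted name and the last field
}' infile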
Here's a sed solution:
sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
Explanation:
With the parentheses you capture the matched parts as groups.
The first group is everything after name=" up to the next ". [^"] means "not a double-quote".
The second group is simply one or more digits at the end of the line, preceded by a space.
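Run against the three sample rows, this should reproduce the desired result:
$ sed -r 's/.*name="([^"]+).* ([0-9]+)$/\1 \2/g' Input_file
app-name_01 0
myapp-02 1
app_name_public 3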

Multiple text insertion in Linux

Can someone help me write a command that will insert some text at multiple places (given column and row) in a given file that already contains data? For example: old_data is a file that contains:
A
And I wish to get new_data that will contain:
A 1
I have read a bit about the awk and sed commands, but I don't quite understand how to use them to get what I want.
I would like to add that I want to use this command as part of a script:
for b in ./*/ ; do (cd "$b" && command); done
If we imagine the content of old_data as a matrix of elements {A(n,m)}, where n is the row number and m the column number, I wish to manipulate the matrix so that I can add new elements. A in old_data has coordinates (1,1). In new_data, therefore, I wish to assign 1 to the matrix element with coordinates (1,3).
If we compare the contents of old_data and new_data, we see that element (1,2) corresponds to a space (it is empty).
It's not at all clear to me what you are asking for, but I suspect you would like a way to insert some given text into a particular row and column. Perhaps:
$ cat input
A
B
C
D
$ row=2 column=2 text="This is some new data"
$ awk 'NR==row {$column = new_data " " $column}1' row=$row column=$column new_data="$text" input
A
B This is some new data
C
D
This bash & unix tools code works:
# make the input files.
echo {A..D} | tr ' ' '\n' > abc ; echo {1..4} | tr ' ' '\n' > 123
# print as per previous OP spec
head -1q abc 123 ; paste abc 123 123 | tail -n +2
Output:
A
1
B 2 2
C 3 3
D 4 4
Version #3 (using commas as more visible separators), as per the newest OP spec:
# for the `sed` code change the `2` to whatever column needs deleting.
paste -d, abc 123 123 | sed 's/[^,]*//2'
Output:
A,,1
B,,2
C,,3
D,,4
The same, with tab delimiters (less visually obvious):
paste abc 123 123 | sed 's/[^\t]*//2'
A 1
B 2
C 3
D 4

Separate comma delimited cells to new rows with shell script

I have a table with comma-delimited values inside some columns, and I want to split the comma-delimited values in a specified column into new rows. For example, the given table is
Name Start Name2
A 1,2 X,a
B 5 Y,b
C 6,7,8 Z,c
And I need to separate the comma delimited values in column 2 to get the table below
Name Start Name2
A 1 X,a
A 2 X,a
B 5 Y,b
C 6 Z,c
C 7 Z,c
C 8 Z,c
I am wondering if there is any solution with shell script, so that I can create a workflow pipe.
Note: the original table may contain more than 3 columns.
Assuming the format of your input and output does not change:
awk 'BEGIN{FS="[ ,]"} {print $1, $2, $NF; print $1, $3, $NF}' input_file
Input:
input_file:
A 1,2 X
B 5,6 Y
Output:
A 1 X
A 2 X
B 5 Y
B 6 Y
Explanation:
awk: invoke awk, a tool for manipulating lines (records) and fields
'...': content enclosed in single quotes is supplied to awk as instructions
BEGIN{FS="[ ,]"}: before reading any lines, tell awk to use both space and comma as delimiters; FS stands for Field Separator.
{print $1, $2, $NF; print $1, $3, $NF}: For each input line read, print the 1st, 2nd and last field on one line, and then print the 1st, 3rd, and last field on the next line. NF stands for Number of Fields, so $NF is the last field.
input_file: supply the name of the input file to awk as an argument.
In response to the updated input format:
awk 'BEGIN{FS="[ ,]"} {print $1, $2, $4","$5; print $1, $3, $4","$5}' input_file
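If the number of comma-separated values in column 2 varies from row to row (as in the question's 6,7,8 example), a split-based sketch may be more robust, assuming three space-separated columns:
awk '{n=split($2,a,","); for(i=1;i<=n;i++) print $1, a[i], $3}' input_file
Here split() breaks field 2 on commas into the array a, and one line is printed per value; the header passes through unchanged because it contains no comma.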
After Runner's modification of the original question, another approach might look like this:
#!/bin/sh
# Usage: $0 <file> <column>
#
FILE="${1}"
COL="${2}"
# tokens separated by linebreaks
IFS="
"
for LINE in `cat ${FILE}`; do
    # get number of columns
    COLS="`echo ${LINE} | awk '{print NF}'`"
    # get the actual field by COL; this contains the keys to be split into individual lines
    # replace comma with newline to "reuse" the newline field separator in IFS
    KEYS="`echo ${LINE} | cut -d' ' -f${COL}-${COL} | tr ',' '\n'`"
    COLB=$(( ${COL} - 1 ))
    COLA=$(( ${COL} + 1 ))
    # get text from the columns before and after the actual field
    if [ ${COLB} -gt 0 ]; then
        BEFORE="`echo ${LINE} | cut -d' ' -f1-${COLB}` "
    else
        BEFORE=""
    fi
    AFTER=" `echo ${LINE} | cut -d' ' -f${COLA}-`"
    # echo "-A: $COLA ($AFTER) | B: $COLB ($BEFORE)-"
    # iterate over the keys and re-build the original line
    for KEY in ${KEYS}; do
        echo "${BEFORE}${KEY}${AFTER}"
    done
done
With this shell script you can do what you want. The following call will split column 2 into multiple lines:
./script.sh input.txt 2
If you'd like to pass input through standard input using pipes (e.g. to split multiple columns in one go), you could change the FILE="${1}" line to:
if [ "${1}" == "-" ]; then
FILE="/dev/stdin"
else
FILE="${1}"
fi
And run it this way:
./script.sh input.txt 1 | ./script.sh - 2 | ./script.sh - 3
Note that cut is very sensitive about field separators. So if the line starts with a space character, column 1 would be "" (empty). If the fields were separated by a mixture of spaces and tabs, this script would have other issues too. In that case (as explained above), filtering the input first (so that fields are separated by exactly one space character) should do it. If this is not possible, or the data in the columns contains space characters too, the script might get more complicated.
