How to split a column which has multiple dots using Linux command line - linux

I have a file which looks like this:
chr10:100013403..100013414,- 0 0 0 0
chr10:100027943..100027958,- 0 0 0 0
chr10:100076685..100076699,+ 0 0 0 0
I want output to be like:
chr10 100013403 100013414 - 0 0 0 0
chr10 100027943 100027958 - 0 0 0 0
chr10 100076685 100076699 + 0 0 0 0
So I want the first column split on the delimiters : , and .. with the resulting fields tab-separated.
I have used awk -F":|," '$1=$1' OFS="\t" file to split the first column on : and , but I am still struggling with the .. characters.
I tried awk -F":|,|.." '$1=$1' OFS="\t" file, but this doesn't work.

The .. must be escaped, because an unescaped . in a regexp matches any character:
awk -F':|,|\\.\\.' '$1=$1' OFS="\t" file
It is important to remember that when you assign a string constant as the value of FS, it undergoes normal awk string processing. For example, with Unix awk and gawk, the assignment FS = "\.." assigns the character string .. to FS (the backslash is stripped). This creates a regexp meaning “fields are separated by occurrences of any two characters.” If instead you want fields to be separated by a literal period followed by any single character, use FS = "\\..".
https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html
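As a quick check of the escaping rule, here is a minimal sketch run on the three sample lines. The inline input and the variable names input and out are only for illustration, and adding a space to the separator list is my own assumption, so that the trailing columns come out tab-separated too.

```shell
# Sample lines from the question, supplied inline for illustration.
input='chr10:100013403..100013414,- 0 0 0 0
chr10:100027943..100027958,- 0 0 0 0
chr10:100076685..100076699,+ 0 0 0 0'

# Split on ":", ",", a literal ".." or a space (the space is an added
# assumption), then rebuild each record with tabs between the fields.
# "\\." survives awk string processing as "\.", i.e. a literal dot.
out=$(printf '%s\n' "$input" |
  awk -F':|,|\\.\\.| ' -v OFS='\t' '{ $1 = $1 } 1')

printf '%s\n' "$out"
```

Putting $1 = $1 inside a { } block, followed by the pattern 1, prints every line even if the first field happens to be empty or 0.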

If your Input_file matches the sample shown, the following may also help:
awk '{gsub(/:|\.+|,/,"\t")} 1' Input_file
Here gsub globally substitutes every :, every run of dots (\.+) and every , with a TAB, and the trailing 1 prints each line of Input_file, edited or not. I hope this helps.

Related

GNU Awk - don't modify whitespaces

I am using GNU Awk to replace a single character in a file. The file is a single line with varying whitespacing between "fields". After passing through gawk all the extra whitespacing is removed and I end up with single spaces. This is completely unintended and I need it to ignore these spaces and only change the one character I have targeted. I have tried several variations, but I cannot seem to get gawk to ignore these extra spaces.
Since I know this will come up, I read from the end of the line for replacement because the whitespacing is arbitrary/inconsistent in the source file.
Command:
gawk -i inplace -v new=3 'NF {$(NF-5) = new} 1' ~/scripts/tmp_beta_weather_file
Original file example:
2020-07-01 18:29:51.00 C M -11.4 28.9 29 9 23 5.5 000 0 0 00020 044013.77074 1 1 1 3 0 0
Result after command above:
2020-07-01 18:30:51.00 C M -11.8 28.8 29 5 23 5.5 000 0 0 00020 044013.77143 3 1 1 3 0 0
It might be easier with sed:
sed -E 's/([^ ]+)(( [^ ]+){5})$/3\2/' file
Test it first, then add -i for an in-place edit.
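If the last six fields can themselves be separated by more than one space, the same idea can be stretched a little. This is only a sketch: the sample line and the variable names line, new and out are made up, and it assumes space separators (no tabs) and no trailing whitespace.

```shell
# Hypothetical line with uneven spacing between the fields.
line='C  M   1  1 1   3 0 0'
new=3

# Replace the sixth field from the end. " +" tolerates runs of spaces,
# and \1 puts the last five fields back with their spacing untouched.
out=$(printf '%s\n' "$line" |
  sed -E 's/[^ ]+(( +[^ ]+){5})$/'"$new"'\1/')

printf '%s\n' "$out"
```

Because sed only rewrites the matched text, everything to the left of the target field keeps its original whitespace.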

How can I replace a specific character in a file where its position changes in bash command line or script?

I have the following file:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
The character I need to change is the "3" that follows 043857.82219. The value of this character is dynamic, but always a single digit. I have tried a few things using sed, but I can't come up with a way to account for the character changing position when additional characters are added before it.
This character is always at the same position from the END of the line, but not from the beginning. Meaning, the content to the left of this character may change and it may be longer, but this is always the 11th character and 6th digit from the end. It is easy to devise a way to cut it, or find it using tail, but I can't devise a way to replace it.
To be clear, the single digit character in question will always be replaced with another single digit character.
With GNU awk
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1
$ gawk -i inplace -v new=9 'NF {$(NF-5) = new} 1' file
$ cat file
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 9 1 1 1 1 1
Where:
NF {$(NF-5) = new} means, when the line is not empty, replace the 6th-last field with the new value (9).
1 means print every record.
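To preview the result before letting -i inplace touch the file, the same program can be fed from standard input, and then it no longer needs gawk. A small sketch on the sample line (the variable names line and out are just for illustration):

```shell
line='2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'

# NF is the number of fields, so $(NF-5) is the sixth field from the end.
# Assigning to a field makes awk rebuild the line with single spaces.
out=$(printf '%s\n' "$line" | awk -v new=9 'NF { $(NF-5) = new } 1')

printf '%s\n' "$out"
```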
awk '{ $(NF-5) = ($(NF - 5) + 8) % 10; print }'
Given your input data, it produces:
2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 1 1 1 1 1 1
The 3 has been mapped via 11 to 1 — pick your poison on how you assign the new value, but the magic is $(NF - 5) to pick up the fifth column before the last one (or sixth from end).
Would you try the following:
replace="x" # or whatever you want to replace
sed 's/\(.\)\(.\{10\}\)$/'"$replace"'\2/' file
The left portion of the sed command \(.\)\(.\{10\}\)$ matches a character, followed by ten characters, then anchored by the end of line.
Then the 1st character is replaced with the specified character and the following ten characters are reused.
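A slightly stricter variant, assuming the target is always a digit: matching [0-9] instead of . leaves a line alone when the character at that offset is not a digit. The helper name swap_digit and the second test line are hypothetical.

```shell
replace=9
line='2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'
other='hello world!'   # 11th character from the end is "e", not a digit

# Hypothetical helper: swap the digit that sits exactly 11 characters
# from the end of each input line; lines without one pass through as-is.
swap_digit() { sed 's/[0-9]\(.\{10\}\)$/'"$replace"'\1/'; }

out=$(printf '%s\n' "$line" | swap_digit)
unchanged=$(printf '%s\n' "$other" | swap_digit)

printf '%s\n%s\n' "$out" "$unchanged"
```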
I'm going to assume that the number you're looking for is always the same distance from the end, regardless of what comes before it (new=9 stands in for the replacement digit):
rev ~/test.txt | awk -v new=9 '{ $6 = new } 1' | rev
Using the bash shell, which should be the last option:
rep=10
read -ra var <<< '2020-01-27 19:43:57.00 C M -8.5 0.2 0 4 81 -2.9 000 0 0 00020 043857.82219 3 1 1 1 1 1'
var[${#var[@]}-6]=$rep   # overwrite only the sixth field from the end
printf '%s\n' "${var[*]}"
If it is in a file:
rep=10
read -ra var < file.txt
var[${#var[@]}-6]=$rep
printf '%s\n' "${var[*]}"
Not the shortest or fastest way, but it can be done.

first column copy under empty line

Salam
Following is the required output:
RXOTG-136 VENEN6 0
VENEN6 1
VENEN7 0
VENEN7 1
RXOTG-137 TIVIK6 0
TIVIK6 1
RXOTG-138 KESTA1 0
KESTA1 1
KESTA2 0
KESTA2 1
KESTA3 0
KESTA3 1
RXOTG-139 KESTA4 0
KESTA4 1
For which I used the following command:
awk 'NF==1{a=$1; next}{ print val}'
but the output I am getting is
RXOTG-136 VENEN6 0
RXOTG-136 VENEN6 1
RXOTG-136 VENEN7 0
RXOTG-136 VENEN7 1
RXOTG-137 TIVIK6 0
RXOTG-137 TIVIK6 1
RXOTG-138 KESTA1 0
RXOTG-138 KESTA1 1
RXOTG-138 KESTA2 0
RXOTG-138 KESTA2 1
RXOTG-138 KESTA3 0
RXOTG-138 KESTA3 1
RXOTG-139 KESTA4 0
RXOTG-139 KESTA4 1
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file
you need to store the first field somewhere
1 is for printing every line
the format will change due to reassignment of $1~$3 thus you can use column -t to format it
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file | column -t
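To see the fill-down at work, here is a small sketch on a cut-down sample. The indentation of the continuation lines is an assumption about the original file, and the variable names input and out are only for illustration.

```shell
# Cut-down sample; the leading spaces on the two-field lines are assumed.
input='RXOTG-136 VENEN6 0
          VENEN6 1
RXOTG-137 TIVIK6 0
          TIVIK6 1'

# Remember the first field of every complete (3-field) line and
# prepend it to the 2-field lines that follow.
out=$(printf '%s\n' "$input" |
  awk 'NF==3 { a = $1 } NF==2 { $1 = a OFS $1 } 1')

printf '%s\n' "$out"
```

Assigning to $1 makes awk rebuild the record, which is also what drops the leading spaces from the continuation lines.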
The following simple awk may also help, assuming the continuation lines start with a space:
awk '!/^ /{val=$1} /^ /{$1=val OFS $1} 1' Input_file | column -t
This might work for you (GNU sed):
sed -r '1h;1b;s/^/\n/;G;:a;/\n\s(.*\n)(.)(.*\S+\s+\S+$)/s//\2\n\1\3/;ta;s/\n//;s/\n.*//;h' file
Print the first line after making a copy in the hold space. For all subsequent lines, prepend a newline and append the previous line. Copy a character at a time from the previous line to the front of the current line until either there are no more spaces at the front of the current line or there are only two fields in the previous line. Remove the first introduced newline and remove the remains of the previous line. Copy the current line to the hold space, ready for the next time and print the current line.

How to delete the first column ( which is in fact row names) from a data file in linux?

I have a data file with many thousands of columns and rows. I want to delete the first column, which is in fact the row counter. I used this command in Linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody know why it does not work and what I should do?
This is what my input file looks like:
col1 col2 col3 col4 ...
1 0 0 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 0 0
5 0 1 1 1
6 1 1 1 0
7 1 0 0 0
8 0 0 0 0
9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output to look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row not the first column.
Could anybody guide me?
The idiomatic use of cut would be
cut -f2- input > output
if your delimiter is a tab ("\t").
Or simply with awk (this works for both space and tab delimiters):
awk '{$1=""}1' input | awk '{$1=$1}1' > output
The first awk empties field 1 but leaves its delimiter behind; the second awk rebuilds the record, which removes that delimiter. The default output delimiter is a space; to change it to a tab, add -v OFS="\t" to the second awk.
UPDATED
Based on your updated input, the problem is the leading spaces, each of which cut treats as a column delimiter. One way to address this is to strip them before feeding the lines to cut:
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above which will work in this case as well.
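One thing to watch: in the sample, the header line has no row-counter entry, so stripping the first field from every line would also drop col1. A minimal sketch that leaves line 1 alone, assuming the header is always the first line (the inline input is a shortened, made-up version of the sample):

```shell
# Shortened sample: a header without a counter, then numbered rows.
input='col1 col2 col3 col4
 1 0 0 0 1
10 1 1 1 1'

# From line 2 onwards, delete optional leading blanks, the first field
# and the blanks after it; the header (NR == 1) is printed untouched.
out=$(printf '%s\n' "$input" |
  awk 'NR > 1 { sub(/^[ \t]*[^ \t]+[ \t]+/, "") } 1')

printf '%s\n' "$out"
```

Since sub only edits the matched prefix, the spacing between the remaining columns stays as it was.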
@Karafka I had CSV files, so I used "," as the separator (you can replace it with yours):
cut -d"," -f2- input.csv > output.csv
Then I used a loop to go over all the files inside the directory:
# files are in the directory tmp/
for f in tmp/*
do
name=$(basename "$f")
echo "processing file : $name"
# keep all columns except the first one of each csv file
cut -d"," -f2- "$f" > "new/$name"
# files using the same names are stored in the directory new/
done
You can use cut command with --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
As @karakfa notes, it looks like it's the leading whitespace that is causing your issues.
Here's a sed one-liner to do the job (it accounts for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any number 0 - 9
As noted; the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt
You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
Which assumes single character digit and space columns. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.

How to do sum from the file and move in particular way in another file in linux?

Actually, this is my assignment. I have three or four files of student records. Every file has two or three student records, like this:
Course Name:Opreating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file has a different course name. I have moved every course name and student ID into one file, but now I don't know how to add up all the marks and put each total in another file, in the same row as the ID. Can you please tell me how to do it?
The result should look like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
This is what I did:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers :
It would help to add 'comments' for each line of code that is not self-evident; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of your many lines of code is.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample file shows mixed case (and Credit, singular). Either use the correct case for your search targets, i.e.
`grep -l 'Credit'`
OR tell grep to -i(gnore case), i.e.
`grep -il 'Credit'`
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and you have the same case confusion going on). With GNU sed, try
sed -n '/COURSE/I{s/COURSE NAME: *//gip;}' rsh1 >> rsh2
This means: -n = don't print every line by default; the I after the address = match COURSE ignoring case; the ' *' allows for the missing space after the colon in your sample;
`gip` = global substitute,
ignore case in matching,
print only lines where a substitution was made
So you're editing out the string COURSE NAME for any line that has COURSE in it, and only printing those lines (you probably don't need the 'g' (global) specifier, given that you expect only 1 instance per line).
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced; you're trying to 'fold' all the lines of rsh2 together into 1 header line, right?
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this. Is this where you're trying to total a student's score? (With a sort embedded, I guess not.) Why do you think you need a sort?
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a unix tool that is designed to process text AND perform logical and mathematical operations on the data presented; awk or perl come to mind here.
Anyway, one solution to total each score is to use awk
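echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{ tot = 0; for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
That should give you a clue on how to proceed. (Resetting tot to 0 at the start of each line keeps the totals separate once there is more than one student.)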
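Applied to a whole record file, the same loop could look like the sketch below. The inline input copies the sample from the question; treating any line whose first field is all digits as a student line is my assumption about the file layout, and input and out are illustrative names.

```shell
# Sample record from the question, supplied inline for illustration.
input='Course Name:Opreating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15'

# Only lines that start with a numeric ID are totalled (assumption);
# tot is reset for every student so the sums do not run together.
out=$(printf '%s\n' "$input" |
  awk '$1 ~ /^[0-9]+$/ {
         tot = 0
         for (i = 2; i <= NF; i++) tot += $i
         print $1 "\t" tot
       }')

printf '%s\n' "$out"
```

The header lines are skipped by the pattern rather than deleted beforehand, so no separate sed pass is needed for this step.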
Awk has predefined variables that it populates for each file and each line of text that it reads, e.g.
$0 = the complete line of text (as delimited by the internal variable RS (record separator),
which defaults to '\n', the newline char, the unix end-of-line char)
$1 = the first field in the text (as delimited by the internal variable FS (field separator),
which defaults to any run of space or tab chars, so
a line whose words are separated by 2 adjacent spaces and then 1 tab has 3 fields)
NF = the number of fields in the current line of data (again, fields as defined by the value of FS
described above)
(There are many others besides $0, $n, NF, FS and RS.)
You can programmatically step through fields like $1, $2, $3 by using a variable, as in the example code: in $i, i is a variable that holds a number between 2 and NF, and the leading '$'
says give me the value of field i (i.e. $2, $3, $4 ...).
Incidentally, your problem could be easily solved with a single awk script, but apparently, you're supposed to learn about cat, cut, grep, etc, which is a very worthwhile goal.
I hope this helps.

Resources