How to delete the first column (which is in fact row names) from a data file in Linux?

I have a data file with many thousands of columns and rows. I want to delete the first column, which is in fact the row counter. I used this command in Linux:
cut -d " " -f 2- input.txt > output.txt
but nothing changed in my output. Does anybody know why it does not work and what I should do?
This is what my input file looks like:
col1 col2 col3 col4 ...
1 0 0 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 0 0
5 0 1 1 1
6 1 1 1 0
7 1 0 0 0
8 0 0 0 0
9 1 0 0 0
10 1 1 1 1
11 0 0 0 1
.
.
.
I want my output to look like this:
col1 col2 col3 col4 ...
0 0 0 1
0 1 0 1
0 1 0 0
0 0 0 0
0 1 1 1
1 1 1 0
1 0 0 0
0 0 0 0
1 0 0 0
1 1 1 1
0 0 0 1
.
.
.
I also tried the sed command:
sed '1d' input.file > output.file
But it deletes the first row, not the first column.
Could anybody guide me?

The idiomatic use of cut would be
cut -f2- input > output
if your delimiter is tab ("\t").
Or, simply with awk magic (this works for both space and tab delimiters):
awk '{$1=""}1' input | awk '{$1=$1}1' > output
The first awk deletes field 1 but leaves a leading delimiter; the second awk removes it. The default output delimiter is space; if you want tab instead, add -vOFS="\t" to the second awk.
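For what it's worth, the two awk passes can also be collapsed into one; a minimal sketch, assuming the default space OFS (adjust the sub() pattern if you set OFS to tab):
# blanking $1 rebuilds $0 with OFS, leaving a leading separator;
# sub() strips that separator, and the trailing 1 prints the line
awk '{ $1 = ""; sub(/^ /, "") } 1' input > output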
UPDATED
Based on your updated input, the problem is the leading spaces, which cut treats as extra (empty) columns. One way to address this is to remove them before feeding the data to cut:
sed 's/^ *//' input | cut -d" " -f2- > output
or use the awk alternative above, which works in this case as well.
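As a quick check of the fix on one sample line:
$ printf ' 1 0 0 0 1\n' | sed 's/^ *//' | cut -d" " -f2-
0 0 0 1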

@karakfa I had CSV files, so I added the "," separator (you can replace it with yours):
cut -d"," -f2- input.csv > output.csv
Then I used a loop to go over all files inside the directory:
# files are in the directory tmp/
for f in tmp/*
do
    name=`basename $f`
    echo "processing file : $name"
    # keep all columns except the first one of each csv file
    cut -d"," -f2- $f > new/$name
    # files using the same names are stored in directory new/
done
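A minimal variant of the same loop, assuming the same tmp/ and new/ directory layout, that quotes the paths and uses shell parameter expansion instead of basename:
mkdir -p new                           # make sure the output directory exists
for f in tmp/*
do
    name=${f##*/}                      # strip the directory part, like basename
    echo "processing file : $name"
    cut -d"," -f2- "$f" > "new/$name"  # keep every column except the first
done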

You can use the cut command with the --complement option:
cut -f1 -d" " --complement input.file > output.file
This will output all columns except the first one.
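For example, on a small made-up stream (--complement is a GNU cut option):
$ printf '1 a b\n2 c d\n' | cut -d" " -f1 --complement
a b
c d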

As @karakfa notes, it looks like the leading whitespace is causing your issues.
Here's a sed one-liner to do the job (it will account for spaces or tabs):
sed -i.bak "s|^[ \t]\+[0-9]\+[ \t]\+||" input.txt
Explanation:
-i edit existing file in place
.bak backup original file and add .bak file extension (can use whatever you like)
s substitute
| separator (easiest character to read as sed separator IMO)
^ start match at start of the line
[ \t] match space or tab
\+ match one or more times (escape required so sed does not interpret '+' literally)
[0-9] match any number 0 - 9
As noted, the input.txt file will be edited in place. The original content of input.txt will be saved as input.txt.bak. Use just -i instead if you don't want a backup of the original file.
Also, if you know that they are definitely leading spaces (not tabs), you could shorten it to this:
sed -i.bak "s|^ \+[0-9]\+[ \t]\+||" input.txt

You can also achieve this with grep:
grep -E -o '[[:digit:]]([[:space:]][[:digit:]]){3}$' input.txt
This assumes single-character digit and space columns. To accommodate a variable number of spaces and digits you can do:
grep -E -o '[[:digit:]]+([[:space:]]+[[:digit:]]+){3}$' input.txt
If your grep supports the -P flag (--perl-regexp) you can do:
grep -P -o '\d+(\s+\d+){3}$' input.txt
And here are a few options if you are using GNU sed:
sed 's/^\s\+\w\+\s\+//' input.txt
sed 's/^\s\+\S\+\s\+//' input.txt
sed 's/^\s\+[0-9]\+\s\+//' input.txt
sed 's/^\s\+[[:digit:]]\+\s\+//' input.txt
Note that the grep regexes are matching the parts that we want to keep while the sed regexes are matching the parts we want to remove.
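As a quick sanity check of the -P variant on one of the sample lines:
$ printf ' 10 1 1 1 1\n' | grep -P -o '\d+(\s+\d+){3}$'
1 1 1 1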

Related

first column copy under empty line

Hello,
My input file looks like this (the first column is blank on the continuation lines):
RXOTG-136 VENEN6 0
          VENEN6 1
          VENEN7 0
          VENEN7 1
RXOTG-137 TIVIK6 0
          TIVIK6 1
RXOTG-138 KESTA1 0
          KESTA1 1
          KESTA2 0
          KESTA2 1
          KESTA3 0
          KESTA3 1
RXOTG-139 KESTA4 0
          KESTA4 1
I want the value in the first column copied down into the blank cells below it. I used the following command
awk 'NF==1{a=$1; next}{ print val}'
but it does not produce that. The required output is:
RXOTG-136 VENEN6 0
RXOTG-136 VENEN6 1
RXOTG-136 VENEN7 0
RXOTG-136 VENEN7 1
RXOTG-137 TIVIK6 0
RXOTG-137 TIVIK6 1
RXOTG-138 KESTA1 0
RXOTG-138 KESTA1 1
RXOTG-138 KESTA2 0
RXOTG-138 KESTA2 1
RXOTG-138 KESTA3 0
RXOTG-138 KESTA3 1
RXOTG-139 KESTA4 0
RXOTG-139 KESTA4 1
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file
You need to store the first field somewhere.
The trailing 1 prints every line.
The spacing will change due to the reassignment of $1, so you can pipe to column -t to line the output up:
awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1' file | column -t
The following simple awk may also help with the same:
awk '!/^ /{val=$1} /^ /{$1=val OFS $1} 1' Input_file | column -t
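To see the mechanics on a tiny made-up sample:
$ printf 'A x 0\n  y 1\n  z 0\n' | awk 'NF==3{a=$1} NF==2{$1=a OFS $1} 1'
A x 0
A y 1
A z 0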
This might work for you (GNU sed):
sed -r '1h;1b;s/^/\n/;G;:a;/\n\s(.*\n)(.)(.*\S+\s+\S+$)/s//\2\n\1\3/;ta;s/\n//;s/\n.*//;h' file
Print the first line after making a copy in the hold space. For all subsequent lines, prepend a newline and append the previous line. Copy a character at a time from the previous line to the front of the current line until either there are no more spaces at the front of the current line or only two fields of the previous line remain. Remove the first introduced newline, remove the remains of the previous line, copy the current line to the hold space ready for the next iteration, and print the current line.

How to split a column which has multiple dots using Linux command line

I have a file which looks like this:
chr10:100013403..100013414,- 0 0 0 0
chr10:100027943..100027958,- 0 0 0 0
chr10:100076685..100076699,+ 0 0 0 0
I want output to be like:
chr10 100013403 100013414 - 0 0 0 0
chr10 100027943 100027958 - 0 0 0 0
chr10 100076685 100076699 + 0 0 0 0
So, I want the first column split at the field delimiters :, , and ..
I have used awk -F":|," '$1=$1' OFS="\t" file to separate the first column, but I am still struggling with the .. characters.
I tried awk -F":|,|.." '$1=$1' OFS="\t" file but this doesn't work.
.. should be escaped.
awk -F':|,|\\.\\.' '$1=$1' OFS="\t" file
It is important to remember that when you assign a string constant as the value of FS, it undergoes normal awk string processing. For example, with Unix awk and gawk, the assignment FS = "\.." assigns the character string .. to FS (the backslash is stripped). This creates a regexp meaning “fields are separated by occurrences of any two characters.” If instead you want fields to be separated by a literal period followed by any single character, use FS = "\\..".
https://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html
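As a quick check on one of the sample lines (the four split fields come out tab-separated; the spaces in the tail are left alone, since space is not part of FS):
$ echo 'chr10:100013403..100013414,- 0 0 0 0' | awk -F':|,|\\.\\.' '$1=$1' OFS="\t"
chr10	100013403	100013414	- 0 0 0 0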
If your Input_file is the same as the sample shown, then the following may help you too:
awk '{gsub(/:|\.+|,/,"\t")} 1' Input_file
Here I am using awk's gsub function to globally substitute : (colon), \.+ (which matches any run of dots), and , (comma) with TAB, and then 1 prints each edited/non-edited line of Input_file. I hope this helps.

Merging two files in bash with a twist

The following question is somewhat tricky but seemingly simple, and I need to use bash.
Let us suppose I have 2 text files; the first one is
FirstFile.txt
0 1
0 2
1 1
1 2
2 0
SecondFile.txt
0 1
0 2
0 3
0 4
0 5
1 0
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
I want to be able to create a new ThirdFile.txt that contains the values that are not in the first file, meaning any pair that also appears in the first file should be removed, keeping in mind that "2 0" and "0 2" count as the same pair ...
Can you help me out ?
Using awk, you can rearrange the columns so that the lower number always comes first. When reading the first file, save these pairs as keys in an associative array. When reading the second file, print only the pairs not found in the array.
awk '{if ($1 <= $2) { a = $1; b = $2; } else { a = $2; b = $1 } }
FNR==NR { arr[a, b] = 1; next; }
!arr[a, b]' FirstFile.txt SecondFile.txt > ThirdFile.txt
Results:
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
paste <(cut -f2 a.txt) <(cut -f1 a.txt) > tmp.txt
cat a.txt b.txt tmp.txt | sort | uniq -u
or
cat a.txt b.txt <(paste <(cut -f2 a.txt) <(cut -f1 a.txt)) | sort | uniq -u
Result
0 3
0 4
0 5
1 3
1 4
1 5
2 2
2 3
2 4
2 5
Explanation
uniq removes duplicate rows from a text file.
uniq requires that its input be sorted.
uniq -u prints only the rows that do not have duplicates.
So, cat a.txt b.txt | sort | uniq -u will almost get you there: Only rows in b.txt that are not in a.txt will get printed. However it doesn't handle the reversed cases, like '1 2' <-> '2 1'.
Therefore, you need a temp file that holds all the reversed removal keys. That's what paste <(cut -f2 a.txt) <(cut -f1 a.txt) does.
Note that cut assumes columns are separated by \t's. If they are not, you will need to specify a delimiter with, for example, -d ' '.
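For space-separated files like the samples above, the same pipeline with explicit delimiters would be (a sketch):
cat a.txt b.txt <(paste -d' ' <(cut -d' ' -f2 a.txt) <(cut -d' ' -f1 a.txt)) | sort | uniq -u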

How to remove a specific string common in multiple lines in a CSV file using shell script?

I have a CSV file which contains 65000 lines (approximately 28 MB in size). Each line begins with a certain path, e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content differs. I have to remove the common part (in this case "c:\abc\bcd\") from all the lines using a shell script. For example, the content of the CSV file is as follows:
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.frag 0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.vert 0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp\.\test\GLNext\FILE0.link-link-6 0 0 0
In the above example I need the output as below
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
Can any of you please help me out with this?
You could use sed:
$ cat test.csv
"c:\abc\bcd\def\123\456", 1, 2
"c:\abc\bcd\def\234\456", 1, 2
"c:\abc\bcd\def\432\456", 3, 4
$ sed -i.bak -e 's/c\:\\abc\\bcd\\//1' test.csv
$ cat test.csv
"def\123\456", 1, 2
"def\234\456", 1, 2
"def\432\456", 3, 4
I am using sed here in this way:
sed -e 's/<SEARCH TERM>/<REPLACE TERM>/<OCCURRENCE>' FILE
where
<SEARCH TERM> is what we are looking for (in this case c:\abc\bcd\, with the backslashes escaped),
<REPLACE TERM> is what we want to replace it with (in this case nothing), and
<OCCURRENCE> is which occurrence of the item on each line we want to replace (in this case the first).
(-i.bak stands for: don't output, just edit this file, but make a backup first.)
Updated according to @David C. Rankin's comment. He is right: make a backup before editing files, in case you make a mistake.
# init variables
MaxPath="$( sed -n 's/,.*//p;1q' YourFile )"
GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"

# search for the biggest pattern to remove: shrink the candidate
# prefix until every line of the file matches it
while [ ${#MaxPath} -gt 0 ] && [ $( grep -c -v -E "${GrepPath}" YourFile ) -gt 0 ]
do
    MaxPath="${MaxPath%%?}"
    GrepPath="^$( printf "%s" "${MaxPath}" | sed 's#\\#\\\\#g' )"
done

# adapt your file
if [ ${#MaxPath} -gt 0 ]
then
    sed "s#${GrepPath}##" YourFile
fi
This assumes for the sample that there is no special regex character nor # in MaxPath.
Note that the grep -c -v -E step is not optimized for performance (it reads the whole file on each iteration, where it could stop at the first miss).
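An alternative sketch that sidesteps the regex-escaping concern entirely: compute the longest common prefix of all lines with awk, then strip it by character position with cut, so nothing in the prefix is ever treated as a pattern:
# longest common leading string of every line in YourFile
prefix=$( awk 'NR == 1 { p = $0; next }
               { while (substr($0, 1, length(p)) != p) p = substr(p, 1, length(p) - 1) }
               END { print p }' YourFile )
# print everything after that prefix, counting characters instead of matching a regex
cut -c "$(( ${#prefix} + 1 ))-" YourFile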

How to sum values from a file and write them in a particular way to another file in Linux?

Actually this is my assignment. I have three or four files containing student records; every file has two or three student records, like this:
Course Name: Operating System
Credit: 4
123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25
243567 0 1 1 0 1 1 0 1 0 0 0 7 9 12 15 17 15
Every file has a different course name. I managed to move every course name and student ID into one file, but now I don't know how to sum all the marks and place them in another file next to the corresponding ID. Can you please tell me how to do it?
It looks like this:
Student# Operating Systems JAVA C++ Web Programming GPA
123456 76 63 50 82 67.75
243567 80 - 34 63 59
I did it like this:
#!/bin/sh
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
echo "STUDENT ID" > rsh2
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
echo "GPA" >> rsh2
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
Some comments and a few pointers :
It would help to add 'comments' for each line of code that is not self-evident; i.e. code like mv f f.bak doesn't need to be commented, but I'm not sure what the intent of many of your lines of code is.
You insert a comment with the '#' char, like
# concatenate all files that contain the word CREDITS into a file called rsh1
find ~/2011/Fall/StudentsRecord -name "*.rec" | xargs grep -l 'CREDITS' | xargs cat > rsh1
Also note that you consistently use all uppercase for your search targets, i.e. CREDITS, when your sample files show mixed case. Either use the correct case for your search targets, i.e.
`grep -l 'Credits'`
OR tell grep to -i(gnore) case, i.e.
`grep -il 'Credits'`
Your line
sed -n /COURSE/p rsh1 | sed 's/COURSE NAME: //g' >> rsh2
can be reduced to 1 call to sed (and it has the same case-confusion issue going on); try
sed -n '/COURSE/I{s/COURSE NAME: //gIp}' rsh1 >> rsh2
This means (-n: don't print every line by default):
g = substitute globally,
I = ignore case in matching,
p = print only lines where a substitution was made.
So you're editing out the string COURSE NAME from any line that has COURSE in it, and only printing those lines (you probably don't need the 'g' (global) specifier, given that you expect only 1 instance per line).
Your line
sed -e :a -e '{N; s/\n/ /g; ta}' rsh2 > rshf
Actually looks pretty good, very advanced, you're trying to 'fold' each 2 lines together into 1 line, right?
But,
sed '/COURSE/d;/CREDIT/d' rsh1 | sort -uk 1,1 | cut -d' ' -f1 | paste -d' ' >> rshf
I'm really confused by this. Is this where you're trying to total a student's score? (With a sort embedded, I guess not.) Why do you think you need a sort?
While it is possible to perform arithmetic in sed, it is super-crazy hard, so you can either use bash variables to calculate the values OR use a Unix tool that is designed to process text AND perform logical and mathematical operations on the data presented; awk or perl come to mind here.
Anyway, one solution to total each score is to use awk:
echo "123456 1 1 0 1 1 0 1 0 0 0 1 5 8 0 12 10 25" |\
awk '{ tot=0; for (i=2;i<=NF;i++) { tot+=$i }; print $1 "\t" tot }'
This will give you a clue on how to proceed. (Note that tot is reset to 0 on each line, so totals don't leak between records.)
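Applied to a whole file of records, a sketch (marks.txt is a hypothetical file in the format shown in the question; the /^[0-9]/ guard assumes record lines start with a student ID, which skips the Course Name/Credit header lines):
# total the marks on every record line, printing "id<TAB>total"
awk '/^[0-9]/ { tot = 0; for (i = 2; i <= NF; i++) tot += $i; print $1 "\t" tot }' marks.txt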
Awk has predefined variables that it populates for each file and for each line of text that it reads, i.e.
$0 = the complete line of text (as delimited by the internal variable RS (RecordSeparator), which defaults to the '\n' newline char, the Unix end-of-line char)
$1 = the first field in the text (as delimited by the internal variable FS (FieldSeparator), which defaults to runs of space chars OR a tab char; a line whose separators are 2 adjacent space chars and 1 tab char still has 3 fields)
NF = the Number (of) Fields in the current line of data (again, fields are defined by the value of FS as described above)
(There are many others besides $0, $1...$NF, NF, FS, and RS.)
You can programmatically step through field values like $1, $2, $3 by using a variable, as in the example code: $i (i is a variable holding a number between 2 and NF; the leading '$' says give me the value of field i, i.e. $2, $3, $4 ...).
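For example, a quick illustration of NF and $i:
$ echo 'a b c' | awk '{ for (i = 1; i <= NF; i++) print i, $i }'
1 a
2 b
3 c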
Incidentally, your problem could easily be solved with a single awk script, but apparently you're supposed to learn about cat, cut, grep, etc., which is a very worthwhile goal.
I hope this helps.
