add a row with empty columns to a tab delimited file - linux

I have some tab-delimited data files with varying numbers of columns. I want to add a header line to each file; the header line should contain only 'ID' in the first column, but it must have the same total number of columns as the rest of the file. Can I do this with some Linux commands? Thank you very much!

for file in *
do
  awk 'NR==1{hdr=$0; gsub(/[^\t]/,"",hdr); print "ID" hdr}1' "$file" > tmp &&
  mv tmp "$file"
done

sed -i '1 { h; s/[^\t]//g; s/^/ID/; p; g; }' *.tsv
Copy the first line, remove all non-tabs (to clear the field contents), tack on "ID", then print this line plus the original.
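As a quick check, here is the same awk run against a tiny hypothetical 3-column file (the file names and contents below are assumptions for illustration):

```shell
# Work in a scratch directory so we don't clobber anything
cd "$(mktemp -d)"

# Sample 3-column tab-delimited file (hypothetical data)
printf 'a\tb\tc\nd\te\tf\n' > sample.tsv

# Build the header from the first data line: delete every non-tab
# character so only the tabs survive, then prefix "ID"
awk 'NR==1{hdr=$0; gsub(/[^\t]/,"",hdr); print "ID" hdr}1' sample.tsv > tmp &&
mv tmp sample.tsv

head -1 sample.tsv    # "ID" followed by two empty tab-separated columns
```

The header row ends up with exactly as many tabs as the first data line, so every file gets a header matching its own column count.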


Replace all double quotes only in Nth Column

I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th column could be double quoted. I want to replace all the occurrences of double quotes in this field with a backslash-double quote (\")
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of the below, but fall short each time
sed -i 's/\"/\\\"/2' file.txt (this replaces only 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces only 2nd occurrence and all rest also)
My file will be having millions of rows; so I may need a sed or awk command only.
Please help.
You may use this awk solution in any version of awk:
awk 'BEGIN {FS=OFS="|"} {
c1 = substr($6, 1, 1)
c2 = substr($6, length($6), 1)
s = substr($6, 2, length($6)-2)
gsub(/"/, "\\\"", s)
$6 = c1 s c2
} 1' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
If this isn't all you need, then edit your question to provide more truly representative sample input/output, including cases that this doesn't work for. Alternatively, with sed:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed.
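Since the question used sed -i, note that updating the file in place with the portable awk route means writing to a temporary file and then renaming it; a sketch (the file name is an assumption):

```shell
cd "$(mktemp -d)"
printf 'abc|def||ghi|jklm|"nopq"rst"|uv||xyz\n' > file.txt

# Portable in-place update: write to a temp file, then rename over the original
awk 'BEGIN {FS=OFS="|"} {
  c1 = substr($6, 1, 1)
  c2 = substr($6, length($6), 1)
  s  = substr($6, 2, length($6)-2)
  gsub(/"/, "\\\"", s)
  $6 = c1 s c2
} 1' file.txt > file.tmp && mv file.tmp file.txt

cat file.txt    # abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
```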
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field with newlines and make a copy in the hold space.
Replace all "s with \"s, then remove the \s at the start and end of the field if the field begins and ends with "s.
Append the amended line to the copy and replace the current line with the doubled line.
Using pattern matching, replace the copied line's 6th field with the amended one.

Add pipe delimiter at the end of each row using unix

I am new to Unix commands, so please forgive me if I am not using the correct line of code below.
I have files (xxxx.txt.date) on WinSCP with a header and footer. I want to add N pipes (|) at the end of each row of all files, from the 2nd line to the second-to-last line (I don't want | in the header or the footer).
I have created a script in which I am using the command below:
sed -e "2,\$s/$/|/" $file | column -t
2,$s/$/|/: adds | at the end of every line from line 2
Below are the issues I am facing.
First, the data doesn't change in the files; I can see the pipe added at the end of each row in Hive, but how can I change the data in the files themselves?
Second, I don't want | in the footer.
Any suggestion or help will be appreciated.
Thanks in advance!
If you need to append just one "|" at the end of each line except the header and footer:
sed -i '1n; $n; s/$/|/' file_name
1n; $n; : Just print first and last line as is.
-i : make changes to the file instead of printing to STDOUT.
If you need to append n pipes at the end of each line except the header and footer, you can use the awk command below; you will have to redirect the output to a temporary file and then rename it.
Assumptions:
I am assuming your header and footer are standard and start with some character (e.g., H, F, T) or string (Header, Footer, Trailer, etc.)
I am assuming your original file is delimited with "|". You can specify your actual delimiter in the awk below.
awk -F'|' -v n=7 '{if(/^Header|^Footer/) {print} else {end="";for (i=1;i<=n;i++) end=sprintf("%s%s", end, "|"); rec=sprintf("%s%s", $0, end); print rec}}' file_name
n=number of times you want to repeat | at the end of each line.
^Header|^Footer - If the line starts with "Header" or "Footer", just print the record as it is. You can specify the header and footer strings from your file.
for loop - prepares a string "end" which contains "|" n times.
rec - contains the entire record concatenated with the end string
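A condensed sketch of the same idea on hypothetical data (the file names and the literal Header/Footer markers are assumptions):

```shell
cd "$(mktemp -d)"
printf 'Header\na|b\nc|d\nFooter\n' > data.txt

# Append n pipes to every row except the Header and Footer lines;
# the for loop builds the "end" string of n pipes, as described above
awk -v n=3 '/^Header|^Footer/ {print; next}
            {end=""; for(i=1;i<=n;i++) end=end "|"; print $0 end}' data.txt > out.txt

cat out.txt
```

The data rows come out as a|b||| and c|d|||, while Header and Footer pass through untouched.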

insert column with same row content to csv in cli

I have a csv to which I need to add a new column at the end, filling the newly added column with a certain string in every row.
Example csv:
os,num1,alpha1
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
I tried using an awk command and did not get the expected result. I am not sure whether my indexing or column counting is right.
awk -F'[[:null:]]' '$2 && !$1{ $4="NA" }1'
Expected result is:
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
You can use sed:
sed 's/$/,NA/' db1.csv > db2.csv
then edit the first line containing the column titles.
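The header edit can also be folded into the same sed invocation; a sketch (the file names are assumptions):

```shell
cd "$(mktemp -d)"
printf 'os,num1,alpha1\nUnix,10,A\nLinux,30,B\n' > db1.csv

# Line 1 gets the new column title; every other line gets NA
sed '1s/$/,code/; 1!s/$/,NA/' db1.csv > db2.csv

cat db2.csv
```

This avoids the manual edit of the title row afterwards.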
I'm not quite sure how you came up with that awk statement of yours, or why you'd think that your file has NUL-terminated lines or that [[:null:]] has become a valid character class ...
The following, however, will do your bidding:
awk 'NR==1{print $0",code"}; NR>1{print $0",NA"}' example.csv
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA

How to Split a Delimited Text file in Linux, based on no of records, which has end-of-record separator in data fields

Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters or EOL markers) inside data fields.
The same EOL marker also appears at the end of each full line, i.e. each record.
I need to split this file into two or more files (based on a number of records given by me) while retaining the newline characters inside the data fields, splitting only on the line breaks at the end of each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation :
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't discern between newlines inside data fields and the actual record-ending newlines. Is there a way this can be achieved?
EDIT: I have changed the problem statement and example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r=(r!="") ? r RS $0 : $0; if(NR%4==0){ print r > ("file" ++i ".txt"); r="" } }
END{ if(r) print r > ("file" ++i ".txt") }' inputfile.txt
NR%4==0 - each of your logical records occupies two physical lines, and two records go into each file, so we split after every 4 physical lines
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
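To try it end to end, here is the same approach run against the sample input in a scratch directory, with the redirection target parenthesized for portability:

```shell
cd "$(mktemp -d)"
printf '1|Alan\nWake|15\n2|Nathan\nDrake|10\n3|Gordon\nFreeman|11\n' > inputfile.txt

# Accumulate 4 physical lines (2 logical records) in r, then flush
# to the next numbered output file; END flushes any leftover records
awk '{ r=(r!="") ? r RS $0 : $0
       if (NR%4==0) { print r > ("file" ++i ".txt"); r="" } }
     END{ if (r) print r > ("file" ++i ".txt") }' inputfile.txt

cat file2.txt    # 3|Gordon / Freeman|11
```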
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }

# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
  # Send record with the appropriate key to a numbered file
  printf("%s", d $0) > ("file" i ".txt")
}

# When we have found enough records, close the current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
  close("file" i ".txt")
  i++
}

# Remember the record key in d, again,
# because of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11

Removing two columns from csv without removing the column heading

Been stuck on this for a while; I managed to remove two columns completely from it, but now I need to remove two parts (3 in total) within the one column heading. I've attached a snippet from my csv file.
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
2014-09-17 10-20-39 UTC;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
2014-09-17 10-20-41 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-43 UTC;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
2014-09-17 10-20-45 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-47 UTC;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
2014-09-17 10-20-49 UTC;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
What I'm wanting to do is remove yyyy-mm-dd and also UTC, leaving just 10-20-39 underneath the timestamp column heading. I've tried removing them but I can't seem to do it without taking out the headings.
Thanks to anyone who can help me with this
A perl way:
perl -pe 's/^.+? (.+?) .+?;/$1;/ if $.>1' file
Explanation
The -pe means "print each line after applying the script to it". The script identifies the first 3 whitespace-separated words and replaces them with the 2nd of the three ($1, since that part of the pattern was captured). This is only run if the current line number ($.) is greater than 1.
An awk way
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1;' OFS=";" file
Here, we set the input field delimiter to ; and use sub() to remove the 1st and last word from the 1st field.
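For example, on a trimmed-down version of the sample data (the file names are assumptions):

```shell
cd "$(mktemp -d)"
printf '%s\n' \
  'timestamp;CPU;%usr;%idle' \
  '2014-09-17 10-20-39 UTC;-1;6.53;89.45' > a.csv

# NR>1 skips the header; the two sub() calls strip the leading date
# and the trailing " UTC" from the first field
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1' OFS=';' a.csv > b.csv

cat b.csv
```

The header line passes through unchanged and the data row keeps only the time in its first field.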
The following sed command works for you:
sed '1!s/^[^ ]\+ //;1!s/ UTC//'
Explanations:
1! Do not apply to the first line.
s/^[^ ]\+ // Remove the first group of non-space characters at line beginning ("2014-09-17 " in your case).
s/ UTC// Remove the string " UTC".
Assuming the csv file is stored as a.csv, then
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
prints the results to standard output, and
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv > b.csv
saves the result to b.csv.
EDIT: Added sample results:
[pengyu#GLaDOS tmp]$ sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
