Split flat file and add delimiter in Linux

I would like to know how to improve some code that I have.
My shell script reads a flat file and splits it into two files based on the first char of each line: header and detail. For header the first char is 1 and for detail it is 2. The split files do not include the first char.
The header is delimited by "|", and the detail is fixed-width, so I add the delimiter to it later.
What I want is to do this in one single awk, to avoid creating a tmp file.
For splitting the file I use one awk command, and for adding the delimiter another one.
This is what I have now:
Input=Input.txt
Header=Header.txt
DetailTmp=DetailTmp.txt
Detail=Detail.txt
#First I split in two files and remove first char
awk -v vFileHeader="$Header" -v vFileDetail="$DetailTmp" '/^1/ {f=vFileHeader} /^2/ {f=vFileDetail} {sub(/^./,""); print > f}' $Input
#Then, I add the delimiter to detail
awk '{OFS="|"};{print substr($1,1,10),substr($1,11,5),substr($1,16,2),substr($1,18,14),substr($1,32,4),substr($1,36,18),substr($1,54,1)}' $DetailTmp > $Detail
Any suggestions?
Input.txt file
120190301|0170117174|FRANK|DURAND|USA
2017011717400052082911070900000000000000000000091430200
120190301|0170117204|ERICK|SMITH|USA
2017011720400052082911070900000000000000000000056311910
Header.txt (split)
20190301|0170117174|FRANK|DURAND|USA
20190301|0170117204|ERICK|SMITH|USA
DetailTmp.txt (split)
017011717400052082911070900000000000000000000091430200
017011720400052082911070900000000000000000000056311910
017011727100052052911070900000000000000000000008250000
017011718200052082911070900000000000000000000008102500
017011726300052052911070900000000000000000000008250000
Detail.txt desired
0170117174|00052|08|29110709000000|0000|000000000009143020|0
0170117204|00052|08|29110709000000|0000|000000000005631191|0
0170117271|00052|05|29110709000000|0000|000000000000825000|0
0170117182|00052|08|29110709000000|0000|000000000000810250|0
0170117263|00052|05|29110709000000|0000|000000000000825000|0

Just combine the scripts:
$ awk -v OFS='|' '/^1/{print substr($0,2) > "header"}
/^2/{print substr($0,2,10),substr($0,11,5),... > "detail"}' file
However, you may be better off using FIELDWIDTHS on the detail file in a second pass.
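For reference, a complete one-pass sketch using the fixed widths taken from the question's second awk (the output file names are the ones from the question; this is only a sketch, not tested against the full data):
awk -v OFS='|' '/^1/ { print substr($0,2) > "Header.txt" }
/^2/ { print substr($0,2,10), substr($0,12,5), substr($0,17,2), substr($0,19,14), substr($0,33,4), substr($0,37,18), substr($0,55,1) > "Detail.txt" }' Input.txt
And the FIELDWIDTHS variant for the second pass, assuming GNU awk:
awk -v FIELDWIDTHS='10 5 2 14 4 18 1' -v OFS='|' '{ $1=$1; print }' DetailTmp.txt > Detail.txt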

Related

Find a line and modify it in a csv file given an input

I have a csv file with a list of workers, and I want to make a script to modify their work group given their IDs. Lines in the CSV file look like this:
Before:
ID TAG GROUP
niub16677500;B00;AB0
After:
ID TAG GROUP
niub16677500;B00;BC0
How can I do this?
I'm working with awk and sed commands but I haven't gotten anywhere so far.
With awk:
awk -F';' -v OFS=';' -v id="niub16677500" -v new_group="BC0" '{if($1==id)$3=new_group}1' input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
Redirect the output to a file and note that the csv header should use the same field separator as the body.
Explanations:
-F';' to have input field separator as ;
-v OFS=';' same for the output FS
-v id="niub16677500" -v new_group="BC0" define the variables that you are going to use in the awk commands
'{if($1==id)$3=new_group}1' when the first column is equal to the value contained in the variable id, overwrite the 3rd field; the trailing 1 prints every line
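To persist the change, you can redirect to a temporary file and move it back over the original (a sketch using the same command; the temporary file name is arbitrary):
awk -F';' -v OFS=';' -v id="niub16677500" -v new_group="BC0" '{if($1==id)$3=new_group}1' input.csv > input.tmp && mv input.tmp input.csv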
With sed:
id="niub16677500"; new_group="BC0"; sed "/^$id/s/;[^;]*$/;$new_group/" input.csv
ID;TAG;GROUP
niub16677500;B00;BC0
You can either do an inline change using -i.bak option, or redirect the output to a file.
Explanations:
Store the values in 2 variables
/^$id/ when you reach a line that starts with the ID stored in the variable id, run the sed search and replace
s/;[^;]*$/;$new_group/ search and replace command that will replace the last field by the new value
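For an in-place edit with a backup, as mentioned above, a sketch would be (assuming your sed supports -i.bak):
id="niub16677500"; new_group="BC0"; sed -i.bak "/^$id/s/;[^;]*$/;$new_group/" input.csv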
Sed can do it,
echo 'niub16677500;B00;AB0' | sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/'
will replace the AB0 group in your example with BC0, by matching the user name, a semicolon, any 3 characters and another semicolon, and then matching the remaining 3 characters. As output it repeats the first captured group with \1 and appends BC0.
You can use :
sed 's/\(^niub16677500;...;\)\(...\)$/\1BC0/' <old_file >new_file
to make a new_file with this change.
https://www.grymoire.com/Unix/Sed.html is a great resource, you should take a look at it.

How to Grep the complete sequences containing a specific motif in a fasta file?

How can I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them to another file? Also, I want to include the lines beginning with ">" before these target sequences.
Example: I have a fasta file of 10000 sequences.
$cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I am currently using, based on the answers I got. Others may find it helpful. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
First, preprocess the fasta file to get each sequence onto one line (which is very important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Get rid of the first empty line
tail -n +2 file1.fa > file2.fa
Extract the target sequences containing the substring, including their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: Take KME as the target substring as an example
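The three steps can also be chained into a single pipeline so no intermediate files are written (a sketch reusing the same commands):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa | tail -n +2 | LC_ALL=C grep -B 1 KME > result.txt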
If you have multi-line fasta files, first linearize with awk, then use another awk to filter the sequences containing the motif. Using grep would be dangerous if a sequence name contains the short motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well
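Note that with GNU grep, -B1 prints a -- separator between non-adjacent matches; if that gets in the way, --no-group-separator suppresses it (this assumes GNU grep):
grep -B1 --no-group-separator KME file > output_file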

How to split a file into two files based on a pattern?

In a file in Linux I have the following
123_test
234_test
abc_rest
cde_rest
and so on
Now I want to get two files in Linux. One which contains only records like below
123_test
234_test
and 2nd file like below
abc_rest
cde_rest
I want to split the file based on what comes after the _, i.e. _test or _rest.
Edited:
123_test
234_test
abc_rest
cde_rest
456_test
fgh_rest
How can I achieve that in Linux?
Can we use split function for this?
You can use this single awk command for splitting:
awk '{ print > (/_test$/ ? "file1" : "file2") }' file
This awk command will copy all lines ending in _test to file1 and the remaining lines to file2.
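If you would rather name the output files after the suffix itself, a sketch (assuming the suffix is always the last _-separated field; the .txt names are arbitrary):
awk -F'_' '{ print > ($NF ".txt") }' file
With the sample input this writes the _test lines to test.txt and the _rest lines to rest.txt.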

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here to remove unwanted lines which contained strings which matched a particular pattern:
Remove lines containg string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and will be initiated by crontab. For this reason, I would like to keep a record of the lines this command removes, just so I can go back and check that the correct data is being removed. I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
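To keep the grep version's behaviour of writing the surviving lines to newfile, redirect stdout the same way (a sketch):
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile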

fastest way convert tab-delimited file to csv in linux

I have a tab-delimited file that has over 200 million lines. What's the fastest way in linux to convert this to a csv file? This file does have multiple lines of header information which I'll need to strip out down the road, but the number of lines of header is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.
Just to clarify, there are no embedded tabs in this file.
If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:
import sys
import csv
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
    commaout.writerow(row)
Run it from a shell as follows:
python script.py < input.tsv > output.csv
If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.
The blank space here is a literal tab:
$ echo "hello world" | tr "\\t" ","
hello,world
Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv
Perl is generally faster at this sort of thing than sed, awk, and Python.
If you want to convert the whole tsv file into a csv file:
$ cat data.tsv | tr "\\t" "," > data.csv
If you want to omit some fields:
$ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv
The above command will convert the data.tsv file to a data.csv file containing only the first three fields.
sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile
Damn the critics, quote everything, CSV doesn't care.
<tab> is the actual tab character. \t didn't work for me. In bash, use ^V to enter it.
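Alternatively, Bash's ANSI-C quoting can supply the tab without typing it literally (a sketch of the same command; assumes a shell that supports $'...'):
sed -e 's/"/\\"/g' -e $'s/\t/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile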
@ignacio-vazquez-abrams's Python solution is great! For people who are looking to parse delimiters other than tab, the library actually allows you to set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:
import sys
import csv
pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
    commaout.writerow(row)
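Usage is the same as the original script (assuming you save it as, say, pipe2csv.py; the file names are just examples):
python pipe2csv.py < input.psv > output.csv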
Assuming you don't want to change the header, and assuming you don't have embedded tabs:
# cat file
header header header
one two three
$ awk 'NR>1{$1=$1}1' OFS="," file
header header header
one,two,three
NR>1 skips the first header line. You mentioned you know how many lines of header there are, so use the correct number for your own case. With this, you also do not need to call any other external commands; just one awk command does the job.
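If the header spans more than one line, the cutoff is easy to parameterize (a sketch, assuming a 3-line header):
awk -v hdr=3 'NR>hdr{$1=$1}1' OFS="," file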
Another way, if you have blank columns and you care about that:
awk 'NR>1{gsub("\t",",")}1' file
Using sed:
sed '2,$y/\t/,/' file #skip 1 line header and translate (same as tr)
You can also use xsv for this
xsv input -d '\t' input.tsv > output.csv
In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).
The following awk one-liner supports quoting and quote-escaping (quotes inside a field are doubled, per CSV convention):
printf "flop\tflap\"" | awk -F '\t' '{ for(i = 1; i <= NF; i++) { gsub(/"/,"\"\"",$i); printf "\"%s\"",$i; if( i < NF ) printf "," }; printf "\n" }'
gives
"flop","flap"""
Right click the file, click rename, delete the 't' and put a 'c'. I'm actually not joking; most csv parsers can handle tab delimiters. I had this issue just now, and for my purposes renaming worked just fine.
I think it is better not to cat the file, because it may create problems in the case of a large file. A better way may be:
$ tr '\t' ',' < data.tsv > data.csv
The command reads its input from data.tsv via redirection and stores the result, comma-separated, in data.csv.

Resources