Fastest way to convert a tab-delimited file to CSV in Linux

I have a tab-delimited file that has over 200 million lines. What's the fastest way in Linux to convert this to a CSV file? The file does have multiple lines of header information which I'll need to strip out down the road, but the number of header lines is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.
Just to clarify, there are no embedded tabs in this file.

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:
import sys
import csv

# read tab-separated rows from stdin, write properly quoted CSV rows to stdout
tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
    commaout.writerow(row)
Run it from a shell as follows:
python script.py < input.tsv > output.csv

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.
The blank space between "hello" and "world" here is a literal tab:
$ echo "hello world" | tr "\\t" ","
hello,world
Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.
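Since the number of header lines is known, you can strip them in the same pipeline. A minimal sketch, assuming 3 header lines (adjust tail's offset to your case):
$ tail -n +4 input.tsv | tr '\t' ',' > output.csv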

perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv
Perl is generally faster at this sort of thing than sed, awk, and Python.
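If speed really matters at 200 million lines, it's worth timing the candidates on a sample before committing to one; a rough sketch (file names are illustrative):
$ head -n 1000000 input.tab > sample.tab
$ time tr '\t' ',' < sample.tab > /dev/null
$ time perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < sample.tab > /dev/null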

If you want to convert the whole tsv file into a csv file:
$ cat data.tsv | tr "\\t" "," > data.csv
If you want to omit some fields:
$ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv
The above command converts data.tsv to a data.csv file containing only the first three fields.

sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile
Damn the critics, quote everything, CSV doesn't care.
<tab> is the actual tab character; \t didn't work for me. In bash, press Ctrl+V and then the Tab key to enter it.
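If you'd rather not type a literal tab, bash's ANSI-C quoting ($'\t') can supply one; a sketch of the same command under that assumption:
sed -e 's/"/\\"/g' -e $'s/\t/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile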

@ignacio-vazquez-abrams's Python solution is great! For people who are looking to parse delimiters other than tab, the csv library actually lets you set an arbitrary delimiter. Here is my modified version to handle pipe-delimited files:
import sys
import csv

# same idea as above, but with '|' as the input delimiter
pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
    commaout.writerow(row)

Assuming you don't want to change the header, and assuming you don't have embedded tabs:
$ cat file
header header header
one two three
$ awk 'NR>1{$1=$1}1' OFS="," file
header header header
one,two,three
NR>1 skips the first line, which is the header. You mentioned you know how many lines of header there are, so use the correct number for your own case. With this you also do not need to call any other external commands; a single awk command does the job.
Another way, if you have blank columns and you care about keeping them ($1=$1 re-splits on whitespace, which collapses empty fields):
awk 'NR>1{gsub("\t",",")}1' file
Using sed:
sed '2,$y/\t/,/' file  # skip the one-line header and translate (same idea as tr)
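If there are several header lines, the same pattern generalizes; a sketch assuming, say, 3 header lines (a hypothetical count):
awk 'NR>3{gsub("\t",",")}1' file > file.csv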

You can also use xsv for this
xsv input -d '\t' input.tsv > output.csv
In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).
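xsv is a third-party tool written in Rust, so it may not be preinstalled; assuming you have a Rust toolchain, one way to get it is:
$ cargo install xsv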

The following awk one-liner supports quoting and quote-escaping:
printf "flop\tflap\"" | awk -F '\t' '{ for (i = 1; i <= NF; i++) { gsub(/"/, "\"\"", $i); printf "\"%s\"", $i; if (i < NF) printf "," }; printf "\n" }'
gives
"flop","flap"""

Right-click the file, click rename, delete the 't' and put a 'c'. I'm actually not joking: most CSV parsers can handle tab delimiters. I had this issue just now, and for my purposes renaming worked just fine.

I think it is better not to cat the file, because that needlessly pipes the whole file through an extra process. A better way may be
$ tr '\t' ',' < tabdelimitedFile.txt > csvfile.csv
The command reads its input from tabdelimitedFile.txt via redirection and stores the comma-separated result in csvfile.csv.

Related

Split flat file and add delimiter in Linux

I would like to know how to improve some code that I have.
My shell script reads a flat file and splits it into two files, header and detail, based on the first char of each line: for header records the first char is 1, and for detail it is 2. The split files do not include the first char.
The header is delimited by "|", and the detail is fixed-width, so I add the delimiter to it later.
What I want is to do this in one single awk, to avoid creating a tmp file.
For splitting the file I use an awk command, and for adding the delimiter another awk command.
This is what I have now:
Input=Input.txt
Header=Header.txt
DetailTmp=DetailTmp.txt
Detail=Detail.txt
#First I split into two files and remove the first char
awk -v vFileHeader="$Header" -v vFileDetail="$DetailTmp" '/^1/ {f=vFileHeader} /^2/ {f=vFileDetail} {sub(/^./,""); print > f}' $Input
#Then, I add the delimiter to detail
awk '{OFS="|"};{print substr($1,1,10),substr($1,11,5),substr($1,16,2),substr($1,18,14),substr($1,32,4),substr($1,36,18),substr($1,54,1)}' $DetailTmp > $Detail
Any suggestions?
Input.txt file
120190301|0170117174|FRANK|DURAND|USA
2017011717400052082911070900000000000000000000091430200
120190301|0170117204|ERICK|SMITH|USA
2017011720400052082911070900000000000000000000056311910
Header.txt after the split
20190301|0170117174|FRANK|DURAND|USA
20190301|0170117204|ERICK|SMITH|USA
DetailTmp.txt after the split
017011717400052082911070900000000000000000000091430200
017011720400052082911070900000000000000000000056311910
017011727100052052911070900000000000000000000008250000
017011718200052082911070900000000000000000000008102500
017011726300052052911070900000000000000000000008250000
Detail.txt as desired
0170117174|00052|08|29110709000000|0000|000000000009143020|0
0170117204|00052|08|29110709000000|0000|000000000005631191|0
0170117271|00052|05|29110709000000|0000|000000000000825000|0
0170117182|00052|08|29110709000000|0000|000000000000810250|0
0170117263|00052|05|29110709000000|0000|000000000000825000|0
Just combine the scripts:
$ awk -v OFS='|' '/^1/{print substr($0,2) > "header"}
             /^2/{print substr($0,2,10),substr($0,12,5),... > "detail"}' file
However, you may be better off using FIELDWIDTHS on the detail records in a second pass.
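A minimal sketch of that second pass, assuming GNU awk (FIELDWIDTHS is gawk-only) and the widths implied by the substr() calls above:
# rebuild each detail record with '|' between the fixed-width fields
gawk 'BEGIN { FIELDWIDTHS = "10 5 2 14 4 18 1"; OFS = "|" }
      { $1 = $1; print }' DetailTmp.txt > Detail.txt
The $1 = $1 assignment forces gawk to rebuild the record, which inserts OFS between the fields.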

Print duplicated numbers to another file

Is there an easy way to count occurrences of a specific number inside a file? For example, I've got a file numbers.txt containing the following:
154;459;444;154
356;2;478;154
I need to print to another file only the numbers that are duplicated, so in the file duplicate.txt I should have only one occurrence of 154.
As suggested, "uniq -d", with suitable preprocessing will work:
tr ';' '\n' <numbers.txt | sort |uniq -d
Steps:
translate the semicolons to newlines using tr
sort the data
use "uniq -d" to show duplicates.
Note, however, that while the newline escape \n is specified by POSIX tr, a few very old implementations may not support it; it works on both macOS and Linux.
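For the sample numbers.txt above, the whole pipeline, redirecting to duplicate.txt as asked, looks like this:
$ tr ';' '\n' < numbers.txt | sort | uniq -d > duplicate.txt
$ cat duplicate.txt
154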

how to cut CSV file

I have the following CSV file
more file.csv
Number,machine_type,OS,Version,Mem,CPU,HW,Volatge
1,HG652,linux,23.12,256,III,LOP90,220
2,HG652,linux,23.12,256,III,LOP90,220
3,HG652,SCO,MK906G,526,1G,LW1005,220
4,HG652,solaris,1172,1024,2Core,netra,220
5,HG652,solaris,1172,1024,2Core,netra,220
Please advise how to cut a CSV file (with the cut, sed, or awk command)
in order to get a partial CSV file.
The command needs to take a value representing how many fields to keep from the CSV.
According to example 1, the value should be 6.
Example 1
In this example we keep the first 6 fields, from left to right; the CSV will then look like this:
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
cut is your friend:
$ cut -d',' -f-6 file
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
Explanation
-d',' set comma as field separator
-f-6 print up to field number 6 based on that delimiter. It is equivalent to -f1-6, as 1 is the default.
awk can also do it, if necessary:
$ awk -v FS="," 'NF{for (i=1;i<=6;i++) printf "%s%s", $i, (i==6?RS:FS)}' file
Number,machine_type,OS,Version,Mem,CPU
1,HG652,linux,23.12,256,III
2,HG652,linux,23.12,256,III
3,HG652,SCO,MK906G,526,1G
4,HG652,solaris,1172,1024,2Core
5,HG652,solaris,1172,1024,2Core
The cut command line is rather simple and well suited to your case:
cut -d, -f1-6 yourfile
Everybody agrees that cut is the way to go in this case. But we can also talk about the awk solution, and there I may point out that fedorqui's answer uses a clever trick to silence empty lines (NF as a selection pattern), which has the disadvantage of removing blank lines from the original file. I propose below another solution (using the -F option instead of the variable-passing mechanism on FS, in passing) that preserves any empty line and also respects lines with fewer than 6 fields, printing those lines without adding extra commas:
awk -F, '{min=(NF>6?6:NF); for (i=1;i<min;i++) printf "%s,", $i; printf "%s\n", $min}' yourfile
This works nicely because printf-ing $min is never an error, even when the line has fewer than 6 fields. This is true with my gawk 4.0.1, at least...
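A quick check on a short line, a blank line, and a long line (illustrative input) shows the intended pass-through behavior:
$ printf 'a,b,c\n\n1,2,3,4,5,6,7\n' | awk -F, '{min=(NF>6?6:NF); for (i=1;i<min;i++) printf "%s,", $i; printf "%s\n", $min}'
a,b,c

1,2,3,4,5,6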

How do I remove newlines from a text file?

I have the following data, and I need to put it all into one line.
I have this:
22791
;
14336
;
22821
;
34653
;
21491
;
25522
;
33238
;
I need this:
22791;14336;22821;34653;21491;25522;33238;
EDIT
None of these commands is working perfectly.
Most of them leave the data looking like this:
22791
;14336
;22821
;34653
;21491
;25522
tr --delete '\n' < yourfile.txt
tr -d '\n' < yourfile.txt
Edit:
If none of the commands posted here are working, then you have something other than a newline separating your fields. Possibly you have DOS/Windows line endings in the file (although I would expect the Perl solutions to work even in that case)?
Try:
tr -d "\n\r" < yourfile.txt
If that doesn't work then you're going to have to inspect your file more closely (e.g. in a hex editor) to find out what characters are actually in there that you want to remove.
tr -d '\n' < file.txt
Or
awk '{ printf "%s", $0 }' file.txt
Or
sed ':a;N;$!ba;s/\n//g' file.txt
This page here has a bunch of other methods to remove newlines.
edited to remove feline abuse :)
perl -p -i -e 's/\R//g;' filename
Must do the job.
paste -sd "" file.txt
Expanding on a previous answer, this removes all newlines and saves the result to a new file (thanks to @tripleee):
tr -d '\n' < yourfile.txt > yourfile2.txt
Which is better than a "useless cat" (see comments):
cat file.txt | tr -d '\n' > file2.txt
Also useful for getting rid of new lines at the end of the file, e.g. created by using echo blah > file.txt.
Note that the destination filename is different, important, otherwise you'll wipe out the original content!
You can edit the file in vim:
$ vim inputfile
:%s/\n//g
use
head -n 1 filename | od -c
to figure out WHAT the offending character is.
then use
tr -d '\n' <filename
for LF
tr -d '\r\n' <filename
for CRLF
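For example, a file with DOS line endings shows \r \n pairs in the dump; illustrative output for the sample data:
$ head -n 1 filename | od -c
0000000   2   2   7   9   1  \r  \n
0000007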
Use sed with POSIX classes
This will remove all lines containing only whitespace (spaces & tabs)
sed '/^[[:space:]]*$/d'
Just take whatever you are working with and pipe it to that
Example
cat filename | sed '/^[[:space:]]*$/d'
Using man 1 ed:
# cf. http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
ed -s file <<< $'1,$j\n,p' # print to stdout
ed -s file <<< $'1,$j\nwq' # in-place edit
xargs consumes newlines as well (but adds a final trailing newline):
xargs < file.txt | tr -d ' '
Nerd fact: use the ASCII code instead ('\012' is octal for newline, ASCII 10).
tr -d '\012' < filename.extension
(Edited because I didn't see the answer that had the same solution; the only difference is that mine uses the octal code.)
Using the gedit text editor (3.18.3)
Click Search
Click Find and Replace...
Enter \n\s into Find field
Leave Replace with blank (nothing)
Check Regular expression box
Click the Find button
Note: this doesn't exactly address the OP's original, 7-year-old problem, but it should help some noob Linux users (like me) who find their way here from the SE sites with similar "how do I get my text all on one line" questions.
I was having the same case today; it's super easy in vim or nvim, where you can use gJ to join lines without inserting spaces (plain J would add them). For your use case, just do
99gJ
This will join all 99 of your lines. Adjust the number according to how many lines you need to join; to join just the current line with the next one, gJ alone is enough.
$ perl -0777 -pe 's/\n+//g' input >output
$ perl -0777 -pe 'tr/\n//d' input >output
If the data is in file.txt, then:
echo $(<file.txt) | tr -d ' '
The $(<file.txt) reads the file and gives the contents as a series of words, which echo then prints with a single space between them; the tr command then deletes any spaces:
22791;14336;22821;34653;21491;25522;33238;
Assuming you only want to keep the digits and the semicolons, the following should do the trick assuming there are no major encoding issues, though it will also remove the very last "newline":
$ tr -cd ";0-9"
You can easily modify the above to include other characters, e.g. if you want to retain decimal points, commas, etc.
I usually hit this use case when I'm copying a code snippet from a file and want to paste it into a console without adding unnecessary newlines, so I ended up making a bash alias
(I called it oneline, if you are curious):
xsel -b -o | tr -d '\n' | tr -s ' ' | xsel -b -i
xsel -b -o reads my clipboard
tr -d '\n' removes new lines
tr -s ' ' removes recurring spaces
xsel -b -i pushes this back to my clipboard
after that I would paste the new contents of the clipboard into oneline in a console or whatever.
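A sketch of the alias definition itself, as a hypothetical ~/.bashrc entry (requires xsel to be installed):
alias oneline="xsel -b -o | tr -d '\n' | tr -s ' ' | xsel -b -i"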
I would do it with awk, e.g.
awk '/[0-9]+/ { a = a $0 ";" } END { print a }' file.txt
(a disadvantage is that a is "accumulated" in memory).
EDIT
Forgot about printf! So also
awk '/[0-9]+/ { printf "%s;", $0 }' file.txt
or, likely better, what was already given in the other awk answer.
You are missing the most obvious and fastest answer, especially when you need to do this in a GUI in order to fix some weird word wrap.
Open gedit
Then press Ctrl + H, put \n in the Find textbox and an empty space in the Replace with field, tick the Regular expression checkbox, and voilà.
To also remove the trailing newline at the end of the file
python -c "s=open('filename','r').read();open('filename', 'w').write(s.replace('\n',''))"
The fastest way I found:
Open vim by typing this on your command line:
vim inputfile
press ":" and input the following command to remove all newlines
:%s/\n//g
Input this to also remove spaces, in case some characters were spaces: :%s/ //g
make sure to save by writing to the file with
:w
The same format can be used to remove any other character. If you're not sure which invisible character you're dealing with, a site like
https://apps.timwhitlock.info/unicode/inspect
can help you figure out what character you're missing, including other characters you can't see.

Replace whitespace with a comma in a text file in Linux

I need to edit a few text files (an output from sar) and convert them into CSV files.
I need to change every whitespace (maybe it's a tab between the numbers in the output) using sed or awk functions (an easy shell script in Linux).
Can anyone help me? Every command I used didn't change the file at all; I tried gsub.
tr ' ' ',' <input >output
This substitutes each space with a comma. If you need to, you can also make a pass with the -s flag (squeeze repeats), which replaces each sequence of a repeated character listed in SET1 (the blank space) with a single occurrence of that character.
To squeeze repeated tabs first and then substitute:
tr -s '\t' <input | tr '\t' ',' >output
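For sar-style output, where columns are separated by runs of spaces, the translate-and-squeeze form does it in one step (illustrative input line):
$ echo '12:00:01 AM  all   3.21   0.00' | tr -s ' ' ','
12:00:01,AM,all,3.21,0.00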
Try something like:
sed -E 's/[[:space:]]+/,/g' orig.txt > modified.txt
The character class [[:space:]] will match all whitespace (spaces, tabs, etc.). If you just want to replace a single character, e.g. just space, use that only. Note the doubled brackets (a POSIX class must sit inside a bracket expression) and the -E flag, without which sed's basic regular expressions treat + as a literal.
EDIT: Actually [[:space:]] includes carriage return, so this may not do what you want. The following will replace tabs and spaces only.
sed -E 's/[[:blank:]]+/,/g' orig.txt > modified.txt
as will (with GNU sed, which accepts \t in a bracket expression)
sed -E 's/[\t ]+/,/g' orig.txt > modified.txt
In all of this, you need to be careful that the items in your file that are separated by whitespace don't contain their own whitespace that you want to keep, e.g. two words.
Without looking at your input file, this is only a guess:
awk '{$1=$1}1' OFS=","
redirect to another file and rename as needed
What about something like this:
cat texte.txt | sed -e 's/\s/,/g' > texte-new.txt
(Yes, with some useless catting and piping; I could also have used < to read from the file directly. I used cat first to output the content of the file, and only afterwards added sed to my command line.)
EDIT: as @ghostdog74 pointed out in a comment, there's definitely no need for the cat/pipe; you can give the name of the file to sed directly:
sed -e 's/\s/,/g' texte.txt > texte-new.txt
If "texte.txt" is this way :
$ cat texte.txt
this is a text
in which I want to replace
spaces by commas
You'll get a "texte-new.txt" that looks like this:
$ cat texte-new.txt
this,is,a,text
in,which,I,want,to,replace
spaces,by,commas
I wouldn't just replace the old file with the new one (that could be done with sed -i, if I remember correctly; and as @ghostdog74 said, that option can create a backup on the fly): keeping the original might be wise as a safety measure, even if it means renaming it to something like "texte-backup.txt".
This command should work:
sed "s/\s/,/g" < infile.txt > outfile.txt
Note that you have to redirect the output to a new file. The input file is not changed in place.
sed can do this:
sed 's/[\t ]/,/g' input.file
That prints the result to the console, while
sed -i 's/[\t ]/,/g' input.file
will edit the file in-place
Here's a Perl script which will edit the files in-place:
perl -i.bak -lpe 's/\s+/,/g' files*
Consecutive whitespace is converted to a single comma.
Each input file is moved to .bak
These command-line options are used:
-i.bak edit in-place and make .bak copies
-p loop around every line of the input file, automatically print the line
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
If you want to replace an arbitrary sequence of blank characters (tab, space) with one comma, use the following:
sed -E 's/[\t ]+/,/g' input_file > output_file
or
sed -r 's/[[:blank:]]+/,/g' input_file > output_file
(-E and -r both enable extended regular expressions, where + works as a repetition operator; \t inside a bracket expression is a GNU sed extension.)
If some of your input lines include leading space characters which are redundant and don't need to be converted to commas, then first you need to get rid of them, and then convert the remaining blank characters to commas. For such a case, use the following:
sed -E 's/^ +//' input_file | sed -E 's/[\t ]+/,/g' > output_file
This worked for me.
sed -e 's/\s\+/,/g' input.txt >> output.csv
