Linux split text file by number of lines, keep linebreaks in place - linux

I am new to Linux (not my own server) and I want to split some Windows txt files by calling a bash script from a third-party application:
So far I have it working in two ways up to a point:
split -l 5000 LargeFile.txt SmallFile
for file in SmallFile*
do
mv "$file" "$file.txt"
done
awk '{filename = "wrd." int((NR-1)/5000) ".txt"; print >> filename}' LargeFile.txt
But both give me txt files with the result:
line1line2line3line4
I found some topics about writing LargeFile.txt like this: $(LargeFile.txt), but it is not working for me. (I also found a switch to let the split command produce txt files directly, but this is not working either.)
I hope someone can help me out on this one.

Explanation: Line terminators
As explained by various answers to this question, the standard line terminators differ between operating systems:
Linux uses LF (line feed, 0x0a)
Windows uses CRLF (carriage return and line feed 0x0d 0x0a)
Mac, pre-OS X, used CR (carriage return, 0x0d)
To solve your problem, it would be important to figure out what line terminators your LargeFile.txt uses. The simplest way would be the file command:
file LargeFile.txt
The output will indicate if line terminators are CR or CRLF and otherwise just state that it is an ASCII file.
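For instance, on three hypothetical sample files (the names here are made up for illustration), the output would look roughly like this:
file unix.txt windows.txt oldmac.txt
unix.txt:    ASCII text
windows.txt: ASCII text, with CRLF line terminators
oldmac.txt:  ASCII text, with CR line terminators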
Since LF and CRLF line terminators are recognized properly on Linux, lines should not appear merged together (no matter which way you use to view the file) unless you specifically configure an editor to behave that way. I will therefore assume that your file has CR line terminators.
Example solution to your problem (assuming CR line terminators)
If you want to split the file in the shell and with shell commands, you will potentially face the problem that the likes of cat, split, awk, etc. will not recognize the line endings in the first place, treating the whole file as one long line. If your file is very large, this may additionally lead to memory issues.
Therefore, the best way to handle this may be to translate the line terminators first (using the tr command) so that they are understood in Linux (i.e. to LF) and then apply your split or awk code before translating the line terminators back (if you believe you need to do this).
tr "\r" "\n" < LargeFile.txt > temporary_file.txt
split -l 5000 temporary_file.txt SmallFile
rm temporary_file.txt
for file in SmallFile*; do tr "\n" "\r" < "$file" > "$file.txt"; rm "$file"; done
Note that the last line is actually a for loop:
for file in SmallFile*
do
tr "\n" "\r" < "$file" > "$file.txt"
rm "$file"
done
This loop will again use tr to restore the CR line terminators and additionally give the resulting files a txt filename ending.
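If you want to verify the round trip, you can run file on one of the results (SmallFileaa.txt is the name the first chunk would typically get, since split numbers its output files aa, ab, and so on); it should report something like:
file SmallFileaa.txt
SmallFileaa.txt: ASCII text, with CR line terminators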
Some Remarks
Of course, if you would like to keep the LF line terminators, you should not execute this last loop.
And finally, if you find that you have a different type of line terminators, you may need to adapt the tr command in the first line accordingly.
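For example, if file had reported CRLF line terminators instead, a minimal variant would be to delete the CR characters outright with tr -d (which removes every occurrence of the listed characters) rather than translating them:
tr -d "\r" < LargeFile.txt > temporary_file.txt
In that case the split pieces already end in plain LF and the back-translation loop is unnecessary (tr alone cannot restore CRLF, since it only maps single characters to single characters).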
Both tr and split (and also cat and rm) are part of GNU coreutils and should be installed on your system unless you are in a very atypical environment (a rescue shell of an initial RAM disk, perhaps). The same goes for the file command, which should typically be available as well.


What's confusing both grep and ack?

Try this: download https://www.mathworks.com/matlabcentral/fileexchange/19-delta-sigma-toolbox
In the unzipped folder, I get the following results:
ack --no-heading --no-break --matlab dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo);
fprintf(1,'Done.\n');
adc.sys_cs = sys_cs;
grep -nH -R --include="*.m" dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo); d center frequency larger Hinfation Script
fprintf(1,'Done.\n');c = c;formed.s of finite op-amp gain and capacitorased;;n for the input.
adc.sys_cs = sys_cs;snr;seed with CT simulations tora states used in the d-t model_amp); Response');
What's going on?
[Edit for clarification]: Why is there no file name and no line number on the 3rd result line? Why do the results on the 4th and 5th lines not even contain dsexample?
NB: using ack 3.40 and grep 2.16
I do not deserve any credit for this answer - it is all about line endings.
I have known for years about Windows line endings (CR-LF) and Linux line endings (LF only), but I had never heard of legacy Mac line endings (CR only)... The latter really upset ack, grep, and I'm sure lots of other tools.
dos2unix and unix2dos have no effect on files in legacy Mac format - but after using the nifty little endlines tool, I could eventually bring some consistency to the source files:
endlines : 129 files converted from :
- 23 Legacy Mac (CR)
- 105 Unix (LF)
- 1 Windows (CR-LF)
Now, ack and grep are much happier.
Let's see which files contain dsexample; grep -l doesn't print the contents, just the file names:
$ grep -l dsexample *
Contents.m
demoLPandBP.m
dsexample1.m
dsexample2.m
OK then, file shows that they have CR line terminators. (It would say "CRLF line terminators" for Windows files.)
$ file Contents.m demoLPandBP.m dsexample*
Contents.m: ASCII text
demoLPandBP.m: ASCII text, with CR line terminators
dsexample1.m: ASCII text, with CR line terminators
dsexample2.m: ASCII text, with CR line terminators
Unlike what I commented before, Contents.m is fine. Let's look at another one, and at how it prints:
$ grep dsexample demoLPandBP.m
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
The output from grep is actually the whole file, since grep doesn't consider a plain CR as breaking a line -- the whole file is just one line. If we change CRs to LFs, we see it better, or we can just count the lines:
$ grep dsexample demoLPandBP.m | tr '\r' '\n' | wc -l
51
These are the longest lines there, in order:
%% 5th-order lowpass with optimized zeros and larger Hinf
dsm.f0 = 1/6; % Normalized center frequency
dsexample1(dsm, LiveDemo);
With a CR in the end of each, the cursor moves back to the start of the line, partially overwriting the previous output, so you get:
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
(There's a space after the semicolon on that line, so the e gets overwritten too. I checked.)
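You can reproduce the overwriting effect in any terminal with printf; this is just a sketch with made-up strings, where the second line is what actually appears on screen:
printf 'hello, world\rHI \n'
HI lo, world
The CR sends the cursor back to column 1, so HI plus a space overwrites hel.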
Someone said dos2unix can't deal with that, and well, they're not DOS or Windows files anyway so why should it. You could do something like this, though, in Bash:
for f in *.m; do
    if [[ $(file "$f") = *"ASCII text, with CR line terminators" ]]; then
        tr '\r' '\n' < "$f" > tmptmptmp &&
        mv tmptmptmp "$f"
    fi
done
I think it was just the .m files that had the issue, hence the *.m in the loop. There was at least one PDF file there, and we don't want to break that. Though with the check on file there, it should be safe even if you just run the loop on *.
It looks like both ack and grep are getting confused by the line endings in the files. Run file *.m on your files. You'll see that some files have proper linefeeds, and some have CR line terminators.
If you clean up your line endings, things should be OK.

head file with windows line endings

On my CentOS server, I have several huge files with Windows line endings, from which I need the first n lines in order to run some tests.
I've tried a few standard "linux" ways of doing it:
head -10 file.dat
And
sed -n 1,10p file.dat
And
awk 'NR <=10' file.dat
None of these respect the Windows line endings; they simply output the entire file.
Is there a way to get the n lines of a file with windows line endings?
Also, it should be noted that the output should still have the windows line endings.
This wouldn't happen with Windows line endings, which are CRLF: since Unix uses LF, the LF would still be seen and used.
What you're describing would happen if the line endings were just CR without LF. You can translate this with:
tr '\r' '\n' < file.dat | head -10 | tr '\n' '\r'
The first tr converts to Unix format, and the second one translates back to the original format.
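As an alternative sketch, awk can be told to treat CR as both the input and output record separator, which avoids translating twice (this assumes, as above, that the file really uses bare CR):
awk -v RS='\r' -v ORS='\r' 'NR <= 10' file.dat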
You could always use vim:
vim foo.txt +"%s/\r/\r/g" +wq
This will replace all carriage returns with proper line breaks: in a Vim substitution, \r in the pattern matches a CR, while \r in the replacement inserts a newline.

concatenation of strings in bash results in substitution

I need to read a file into an array and concatenate a string at the end of each line. Here is my bash script:
#!/bin/bash
IFS=$'\n' read -d '' -r -a lines < ./file.list
for i in "${lines[@]}"
do
tmp="$i"
tmp="${tmp}stuff"
echo "$tmp"
done
However, when I do this, the appended string appears to replace the beginning of each line instead of being concatenated.
For example, in the file.list, we have:
http://www.example1.com
http://www.example2.com
What I need is:
http://www.example1.comstuff
http://www.example2.comstuff
But after executing the script above, I get things as below on the terminal:
stuff//www.example1.com
stuff//www.example2.com
Btw, my PC runs macOS.
The problem also occurs when concatenating strings via the awk, printf, and echo commands, for example echo $tmp"stuff" or echo "${tmp}""stuff".
The file ./file.list is, most probably, generated on a Windows system or, at least, it was saved using the Windows convention for end of line.
Windows uses a sequence of two characters to mark the end of lines in a text file. These characters are CR (\r) followed by LF (\n). Unix-like systems (Linux and macOS starting with version 10) use LF as end of line character.
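You can make the two-character sequence visible with od; here is a minimal sketch that fabricates one Windows-style line with printf:
printf 'example\r\n' | od -c
0000000   e   x   a   m   p   l   e  \r  \n
0000011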
The assignment IFS=$'\n' in front of read in your code tells read to use LF as the line separator. read doesn't store the LF characters in the array it produces (lines[]), but each entry of lines[] still ends with a CR character.
The line tmp="${tmp}stuff" does what it is supposed to do, i.e. it appends the word stuff to the content of the variable tmp (a line read from the file).
The first line read from the input file contains the string http://www.example1.com followed by the CR character. After the string stuff is appended, the content of variable tmp is:
http://www.example1.com$'\r'stuff
The CR character is not printable. It has a special interpretation when it is printed on the terminal: it sends the cursor to the start of the line (column 1) without changing the line.
When echo prints the line above, it prints (starting on a new line) http://www.example1.com, then the CR character sends the cursor back to the start of the line, where it prints the string stuff. The stuff fragment overwrites the first 5 characters already printed on that line (http:) and the result, as visible on screen, is:
stuff//www.example1.com
The solution is to get rid of the CR characters from the input file. There are several ways to accomplish this goal.
A simple way to remove the CR characters from the input file is to use the command:
sed -i.bak s/$'\r'//g file.list
It removes all the CR characters from the content of file.list, saves the updated content back into file.list, and stores the original as file.list.bak (a backup copy in case it doesn't produce the output you expect).
Another way to get rid of the CR character is to ask the shell to remove it in the command where stuff is appended:
tmp="${tmp/$'\r'/}stuff"
When a variable is expanded in a construct like ${tmp/a/b}, the first appearance of a in $tmp is replaced with b (the two-slash form ${tmp//a/b} replaces all appearances). In this case we replace \r with nothing.
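A quick sketch you can paste into bash to see both forms side by side (the sample string is made up):
tmp='aXbXc'
echo "${tmp/X/-}"     # prints a-bXc (first occurrence only)
echo "${tmp//X/-}"    # prints a-b-c (all occurrences)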
I'm guessing it has something to do with the carriage return character.
Was your file.list created on Windows? If so, try to use dos2unix before running the script.
Edit
You can check your files using the file command.
Example:
file file.list
If you saved the file in Windows Notepad, then it will probably come up like this:
file.list: ASCII text, with no line terminators
You can use built-in tools like iconv to convert between encodings. However, for a simple use like this, you can just use a command that works for multiple line-ending conventions without any conversion.
You could simply buffer the file through cat, and use a regular expression that applies to either:
Carriage return followed by line terminator, or
Line terminator on its own
Then append the string.
Example:
cat file.list | grep -E -v "^$" | sed -E -e "s/(\r?$)/stuff/g"
Will work with ASCII text, and ASCII text with no line terminators.
If you need to modify a stream to append a fixed string, you can use sed or awk, for instance:
sed 's/$/stuff/'
to append stuff to the end of each line.
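If the input may still carry Windows CRLF endings, a hedged awk variant strips a trailing CR before appending, so the appended text cannot land "before" the CR (a sketch, assuming the same file.list as above):
awk '{ sub(/\r$/, ""); print $0 "stuff" }' file.list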
using "dos2unix file.list" would also solve the problem

bash not parsing (cat + grep) correctly

I have a text file 1.grep:
grep -P -e "^<job.+type.+rule" "Emake-4agents-1st-10-25-51.53.xml"
To make my grepping go faster, I do the following in bash
cat 1.grep | bash > 1.search
This works fine normally but in this case, I get the following:
$ cat 1.grep
grep -P -e "^<job.+type.+rule" "Emake-4agents-1st-10-25-51.53.xml"
$ cat 1.grep | bash > 2.search
: No such file or directory25-51.53.xml
Why does bash think that my .xml filename is a directory?
The immediate problem is that the file 1.grep is in DOS/Windows format, and has a carriage return followed by linefeed at the end of the line. Windows treats that two-character combination as the end-of-line marker, but unix tools like bash (and grep and ...) will treat just the linefeed as the end-of-line marker, so the carriage return is treated as part of the line. As a result, it's trying to read from a file named "Emake-4agents-1st-10-25-51.53.xml^M" (where ^M indicates the carriage return), which doesn't exist, so it prints an error message with a carriage return in the middle of it:
cat: Emake-4agents-1st-10-25-51.53.xml^M
: No such file or directory
...where the carriage return makes the second part overwrite the first part, giving the cryptic result you saw.
Solution: use something like dos2unix to convert the file to unix (line-feed-only) format, and use text editors that store in the unix format.
However, I also have to agree with several comments that said using cat | bash is ... just plain weird. I'm not sure exactly what you're trying to accomplish in the bigger picture, but I can't think of any situation where that'd be the "right" way to do it.
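For what it's worth, once 1.grep is converted there is no need for cat at all; bash can read a script file directly, which would look like this:
dos2unix 1.grep
bash 1.grep > 1.search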

Best tool to remove dos line ends and join lines back up again

I have a csv file into which some ^M dos line ends have crept, and I want to get rid of them, as well as the 16 spaces and 3 tabs which follow. That is, I have to merge that line with the next one down. Here's an offending record and a good one as a sample of what I mean:
"Mary had a ^M
little lamb", "Nursery Rhyme", 1878
"Mary, Mary quite contrary", "Nursery Rhyme", 1838
I can remove the ^M using sed as you can see, but I cannot work out how to remove the nix line end to join the lines back up.
sed -e "s/^M$ //g" rhymes.csv > rhymes.csv
UPDATE
Then I read "However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file." from:
http://sed.sourceforge.net/sedfaq4.html
So I am editing my question to ask: which tool should I be using?
With help from How can I replace a newline (\n) using sed?, I made this one (the pattern is a literal CR, LF, 16 spaces, and 3 tabs; here = marks each join visibly, use an empty replacement to join the lines seamlessly):
sed -e ':a;N;$!ba;s/\r\n                \t\t\t/=/g' -i rhymes.csv
<CR> <LF> <16 spaces> <3 tabs>
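To check the result, cat -v renders any leftover carriage returns visibly as ^M (the same notation used in the question), and grep can count lines that still contain a raw CR:
cat -v rhymes.csv
grep -c $'\r' rhymes.csv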
If you just want to delete the CR, you could use:
tr -d "\r" < yourfile > output
(with output a different file than the input; redirecting straight back into yourfile would truncate it before tr reads it, which is also why the tempting tr ... | tee yourfile variant is unsafe).
dos2unix file_name
to convert the file in place, or
dos2unix old_file new_file
to create a new file.
