I have a text file, file.txt (12 MB), containing:
something1
something2
something3
something4
(...)
Is there a way to split file.txt into 12 *.txt files, let’s say file2.txt, file3.txt, file4.txt, etc.?
You can use the GNU coreutils utility split:
split -b 1M -d file.txt file
Note that both M and MB are accepted, but they mean different sizes: MB is 1000 * 1000 bytes, while M is 1024 * 1024 bytes.
If you want to split by lines instead, use the -l option.
UPDATE
lines=$(( $(wc -l < file.txt) / 12 )) ; split -l "$lines" -d file.txt file
Another solution, as suggested by Kirill, is to do something like the following:
split -n l/12 file.txt
Note that it is the letter l, not the digit one. split -n accepts several chunk formats, such as N, K/N, l/N, l/K/N, r/N, and r/K/N.
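If you also want numbered .txt names close to what the question asks for, the chunk mode can be combined with the suffix options (a sketch assuming a reasonably recent GNU split; --numeric-suffixes with a start value and --additional-suffix are GNU extensions, and the default two-digit suffix gives file02.txt through file13.txt):
split -n l/12 --numeric-suffixes=2 --additional-suffix=.txt file.txt file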
$ split -l 100 input_file output_file
where -l specifies the number of lines in each file. This will create:
output_fileaa
output_fileab
output_fileac
output_filead
....
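If you need the .txt extension on these pieces (as the question asks), one simple follow-up is to rename them afterwards (a minimal sketch, assuming the output_file prefix used above):
for f in output_file*; do mv "$f" "$f.txt"; done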
CS Pei's answer won't produce .txt files as the OP wants. Use:
split -b=1M -d file.txt file --additional-suffix=.txt
Using Bash:
readarray -t lines < file.txt
count=${#lines[@]}
for i in "${!lines[@]}"; do
    index=$(( (i * 12 - 1) / count + 1 ))
    echo "${lines[i]}" >> "file${index}.txt"
done
Using AWK:
awk '{
    a[NR] = $0
}
END {
    for (i = 1; i in a; ++i) {
        x = (i * 12 - 1) / NR + 1
        sub(/\..*$/, "", x)
        print a[i] > "file" x ".txt"
    }
}' file.txt
Unlike split, this makes sure that the line counts are as even as possible.
Regardless of what was said in previous answers, on my Ubuntu 16.04 (Xenial Xerus) I had to do:
split -b 10M -d system.log system_split.log
Please note the space between -b and the value.
I agree with @CS Pei; however, this didn't work for me:
split -b=1M -d file.txt file
...as the = after -b threw it off. Instead, I simply deleted it, left no space between the option and the value, and used lowercase "m":
split -b1m -d file.txt file
And to append ".txt", we use what @schoon said:
split -b1m -d file.txt file --additional-suffix=.txt
I had a 188.5 MB txt file and I used this command [but with -b5m for 5.2 MB files], and it returned 35 split files, all of which were .txt files of 5.2 MB except the last, which was 5.0 MB. Now, since I wanted my lines to stay whole, I wanted to split the main file every 1 million lines, but the split command didn't let me do even -l 100000, let alone -l 1000000, so splitting on that many lines at a time would not work for me.
Try something like this:
awk -vc=1 'NR%1000000==0{++c}{print $0 > c".txt"}' Datafile.txt
for filename in [0-9]*.txt; do mv "$filename" "Prefix_$filename"; done
(The [0-9]*.txt glob avoids renaming Datafile.txt itself.)
On my Linux system (Red Hat Enterprise Linux 6.9), the split command does not have the -n or --additional-suffix command-line options.
Instead, I've used this:
split -d -l NUM_LINES really_big_file.txt split_files.txt.
where -d adds a numeric suffix to the end of the split_files.txt. prefix and -l specifies the number of lines per file.
For example, suppose I have a really big file like this:
$ ls -laF
total 1391952
drwxr-xr-x 2 user.name group 40 Sep 14 15:43 ./
drwxr-xr-x 3 user.name group 4096 Sep 14 15:39 ../
-rw-r--r-- 1 user.name group 1425352817 Sep 14 14:01 really_big_file.txt
This file has 100,000 lines, and I want to split it into files with at most 30,000 lines each. This command will run the split and append an integer to the end of the output file prefix split_files.txt.
$ split -d -l 30000 really_big_file.txt split_files.txt.
The resulting files are split correctly with at most 30,000 lines per file.
$ ls -laF
total 2783904
drwxr-xr-x 2 user.name group 156 Sep 14 15:43 ./
drwxr-xr-x 3 user.name group 4096 Sep 14 15:39 ../
-rw-r--r-- 1 user.name group 1425352817 Sep 14 14:01 really_big_file.txt
-rw-r--r-- 1 user.name group 428604626 Sep 14 15:43 split_files.txt.00
-rw-r--r-- 1 user.name group 427152423 Sep 14 15:43 split_files.txt.01
-rw-r--r-- 1 user.name group 427141443 Sep 14 15:43 split_files.txt.02
-rw-r--r-- 1 user.name group 142454325 Sep 14 15:43 split_files.txt.03
$ wc -l *.txt*
100000 really_big_file.txt
30000 split_files.txt.00
30000 split_files.txt.01
30000 split_files.txt.02
10000 split_files.txt.03
200000 total
If you want each part to have the same number of lines, for example 22, here is my solution:
split --numeric-suffixes=2 --additional-suffix=.txt -l 22 file.txt file
And you obtain file2.txt with the first 22 lines, file3.txt with the next 22 lines, and so on.
Thanks to @hamruta-takawale, @dror-s, and @stackoverflowuser2010.
My search for how to do this led me here, so I'm posting this here for others too:
To get all of the contents of the file, split is the right answer! But for those looking to extract just a piece of a file, as a sample of it, use head or tail:
# extract just the **first** 100000 lines of /var/log/syslog into
# ~/syslog_sample.txt
head -n 100000 /var/log/syslog > ~/syslog_sample.txt
# extract just the **last** 100000 lines of /var/log/syslog into
# ~/syslog_sample.txt
tail -n 100000 /var/log/syslog > ~/syslog_sample.txt
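If you want a slice from the middle rather than from either end, head and tail can be chained (a sketch; the line numbers are just an illustration):
# extract lines 200001-300000 of /var/log/syslog into ~/syslog_sample.txt
head -n 300000 /var/log/syslog | tail -n 100000 > ~/syslog_sample.txt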
I have a file with an unknown number of lines (but always an even number of them). I want to print the lines side by side based on the total number of lines in the file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now I want to print them side by side. As there are 16 lines in total, I am trying to get the result split 8:8, like below:
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
The paste command did not work for me exactly (paste - - - - - - - - < file1), nor did the awk command that I used: awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: The number of lines in the file is dynamic. The only known thing in my scenario is that it is always an even number.
If you have the memory to hash the whole file ("max" below):
$ awk '{
a[NR]=$0 # hash all the records
}
END { # after hashing
mid=int(NR/2) # compute the midpoint, int in case NR is uneven
for(i=1;i<=mid;i++) # iterate from start to midpoint
print a[i],a[mid+i] # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR { # on 1st pass hash second half of records
if(FNR>1) { # we don't ever need the 1st record
a[FNR]=$0 # hash record
if(FNR%2) # if odd record
delete a[int(FNR/2)+1] # remove one from the past
}
next
}
FNR==1 { # on the start of 2nd pass
if(NR%2==0) # NR is total+1 here, so this means the total record count is odd
exit # exit, since the count is supposed to always be even
offset=int((NR-1)/2) # compute offset to the beginning of hash
}
FNR<=offset { # only process the 1st half of records
print $0,a[offset+FNR] # output one from file, one from hash
next
}
{ # once 1st half of 2nd pass is finished
exit # just exit
}' file file # notice filename twice
And finally, if you have awk compiled into a worm's brain (i.e. not much memory, "min"):
$ awk '
NR==FNR { # just get the NR of 1st pass
next
}
FNR==1 {
mid=(NR-1)/2 # get the midpoint
file=FILENAME # filename for getline
while(++i<=mid && (getline line < file)>0); # jump getline to mid
}
{
if((getline line < file)>0) # getline read from mid+FNR
print $0,line # output
}' file file # notice filename twice
The standard disclaimer on getline applies, and no real error control is implemented.
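If you do want that error control, a minimal sketch of checking the getline return value looks like this (half1.txt and half2.txt are hypothetical placeholder files holding the two halves; getline returns 1 when a line is read, 0 at end of file, and -1 on error; /dev/stderr is understood by gawk and available on most Linux systems):
awk '{
    # half1.txt / half2.txt are placeholder names, not files from the answers above
    r = (getline line < "half2.txt")
    if (r > 0)                  # 1: a line was read from the second file
        print $0, line
    else if (r == 0)            # 0: the second file ran out of lines early
        print $0
    else {                      # -1: read error, e.g. missing or unreadable file
        print "cannot read half2.txt" > "/dev/stderr"
        exit 1
    }
}' half1.txt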
Performance:
I ran seq 1 100000000 > file and tested how the above solutions performed. Output went to > /dev/null, but writing it to a file took around 2 s longer. The performance of max is so-so, as its memory footprint was 88% of my 16 GB, so it might have swapped. Well, I killed all the browsers and shaved 7 seconds off the real time of max.
+------+-----------+-----------+-----------+
|      |    min    |    mid    |    max    |
+------+-----------+-----------+-----------+
| real | 1m7.027s  | 1m30.146s | 0m48.405s |
| user | 1m6.387s  | 1m27.314s | 0m43.801s |
| sys  | 0m0.641s  | 0m2.820s  | 0m4.505s  |
+------+-----------+-----------+-----------+
| mem  | 3 MB      | 6.8 GB    | 13.5 GB   |
+------+-----------+-----------+-----------+
Update:
I tested @DavidC.Rankin's and @EdMorton's solutions, and they ran, respectively, in:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
Their memory footprint was about the same as that of my mid solution. It pays to use wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
If you want just one space between the columns, change it to:
$ pr -2ts' ' file
You can also do it with awk, simply by storing the first half of the lines in an array and then printing each stored line alongside its partner from the second half, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use the columns utility from autogen:
columns -c2 --by-columns file.txt
You can also use column, but the number of output columns is calculated in a strange way from the width (in columns) of your terminal. So, assuming your lines have 28 characters, you can do:
column -c $((28*2+8)) file.txt
I do not want to solve this, but if I were you:
wc -l file.txt
gives the number of lines;
echo $(($(wc -l < file.txt)/2))
gives half of that; and
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
create files with the first half and the last half of the original file. Now you can merge those files side by side, as described here.
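For completeness, a minimal sketch of that final merge step, assuming first.txt and last.txt were created as above (-d' ' makes paste separate the columns with a space instead of a tab):
paste -d' ' first.txt last.txt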
Here is my take on it, using the bash shell, wc(1), and ed(1):
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1;m<=half;m++)); do
array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that, if you just want to print the output to stdout, add the line below to the script:
printf '%s\n' "${array[@]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, use this line at the end of the script instead:
printf '%s\n' "${array[@]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or, using the bash 4+ feature mapfile, a.k.a. readarray:
Save the file in an array named array.
mapfile -t array < file.txt
Separate the files.
left=("${array[@]::((${#array[@]} / 2))}") right=("${array[@]:((${#array[@]} / 2))}")
Loop and print side by side:
for i in "${!left[@]}"; do
printf '%s %s\n' "${left[i]}" "${right[i]}"
done
Given what you said, "The only known thing in my scenario is, they are even number all the time", that solution should work.
I want to get a part of a binary file, from byte #480161397 to byte #480170447 (inclusive, 9051 bytes in total).
I used cut -b and expected the size of trunk1.gz to be 9051 bytes, but I got a different result.
$ wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701152097.59/warc/CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz
$ cut -b480161397-480170447 CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz >trunk1.gz
$ echo $((480170447-480161397+1))
9051
$ ls -l trunk1.gz
-rw-r--r-- 1 david staff 3400324 Sep 8 10:28 trunk1.gz
What is wrong?
cut -bN-M copies bytes N through M from every line of the input.
Example:
$ cut -b4-7 <<END
0123456789
abcdefghij
ABCDEFGHIJ
END
Output:
3456
defg
DEFG
Consider using dd for your purposes.
If you work with binary data, I advise you to use the dd command:
dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz of=trunk1.gz bs=1 skip=480161397 count=9051
bs is the block size and is set to 1 byte, so skip and count are counted in bytes.
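bs=1 can be very slow at offsets this large; with GNU dd you can keep a bigger block size and still address exact byte positions via iflag (a sketch assuming a reasonably recent GNU coreutils dd, which supports the skip_bytes and count_bytes flags):
dd if=CC-MAIN-20160205193912-00264-ip-10-236-182-209.ec2.internal.warc.gz of=trunk1.gz bs=4M iflag=skip_bytes,count_bytes skip=480161397 count=9051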
Is it possible to concatenate the header line of a file with the output from a filter such as grep? Perhaps using the cat command or something else from GNU coreutils?
In particular, I have a tab-delimited file that roughly looks like the following:
var1 var2 var3
1 MT 500
30 CA 40000
10 NV 1240
40 TX 500
30 UT 35000
10 AZ 1405
35 CO 500
15 UT 9000
1 NV 1505
30 CA 40000
10 NV 1240
I would like to use grep to select, from lines 2 through N, all lines that contain "CA", and also to place the first row (the variable names) as the first line of the output file, using GNU/Linux commands.
The desired output for the example would be:
var1 var2 var3
30 CA 40000
35 CA 65000
15 CA 2500
I can select the two sets of desired output with the following lines of code.
head -1 filename
grep -E CA filename
My initial idea is to combine the output of these commands using cat, but I have not been successful so far.
If you're running the commands from a shell (including shell scripts), you can run each command separately and redirect the output:
head -1 filename > outputfile
grep -E CA filename >> outputfile
The first line will overwrite outputfile, because a single > was used. The second line will append to outputfile, because >> was used.
If you want to do this in a single command, the following worked in bash:
(head -1 filename && grep -E CA filename) > outputfile
If you want the output to go to standard output, leave off the parentheses and the redirection:
head -1 filename && grep -E CA filename
It's not clear what you're looking for, but perhaps just:
{ head -1 filename; grep -E CA filename; } > output
or
awk 'NR==1 || /CA/' filename > output
But another interpretation of your question is best addressed using sed or awk.
For example, to print lines 5-9 and line 14, you can do:
sed -n -e 5,9p -e 14p
or
awk '(NR >=5 && NR <=9) || NR==14'
I just came across a method that uses the cat command.
cat <(head -1 filename) <(grep -E CA filename) > outputfile
This site, tldp.org, calls the <(command) syntax "process substitution."
It is unclear to me which method would be more efficient in terms of memory/speed, but this is testable.
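A rough way to test is to time each variant on the same input (a sketch; the results will of course depend on your file and system):
time { head -1 filename; grep -E CA filename; } > /dev/null
time cat <(head -1 filename) <(grep -E CA filename) > /dev/null
time awk 'NR==1 || /CA/' filename > /dev/null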
Would it be possible to give the input to a very long command at the end of the command? The example below explains my query more clearly.
Currently, while grepping, I have to do something like this:
zgrep -i "A VERY LONG TEXT" file |awk '{print $1}'
Every time, I have to move the cursor back to "A VERY LONG TEXT" to change the pattern. I want to alter the command in such a way that "A VERY LONG TEXT" comes at the end of the command, so I can change it quickly:
command1 |command2 |some_magic "A VERY LONG TEXT"
I know I can achieve this result by using cat and then grepping, but I am wondering if there is an alternative way to do this. Maybe by assigning it to a temporary variable?
EXAMPLE 2:
I need to get a real-time timestamp of all the commands and their output in my session files, so I have to use the command below. But before executing any command, I have to move my cursor back to unbuffer and change the command. Is there any way I can alter the command below such that I can enter my commands at the end of the line?
/yumm 194>unbuffer ls -lrt | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; }'
2014-10-01 10:38:19 total 0
2014-10-01 10:38:19 -rw-rw-r-- 1 user bcam 0 Oct 1 10:37 1
2014-10-01 10:38:19 -rw-rw-r-- 1 user bcam 0 Oct 1 10:38 test1
2014-10-01 10:38:19 -rw-rw-r-- 1 user bcam 0 Oct 1 10:38 test2
2014-10-01 10:38:19 -rw-rw-r-- 1 user bcam 0 Oct 1 10:38 test3
2014-10-01 10:38:19 -rw-rw-r-- 1 user bcam 0 Oct 1 10:38 test4
yumm 195>
In short, I need some way to get a timestamp for every command I execute along with its output.
What if you just assign this text to a variable?
mystring="A VERY LONG TEXT"
zgrep -i "$mystring" file | awk '{print $1}'
(Note that you need the double quotes around $mystring to make it work.)
Based on your edit, you can also do:
awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; }' <<< "$(unbuffer ls -ltr)"
(The here-string <<< "$(unbuffer ls -ltr)" is what feeds the command's output to awk.)
When editing and re-submitting a command, use:
Ctrl-A to move the cursor back to the start of the line quickly
Ctrl-E to move to the end of the line quickly
Alt-F to move forwards one word
Alt-B to move backwards one word
Or use the fc command to open the last command in an editor and edit it, say with vi commands; when you save it, it gets re-submitted for execution.
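For example (a sketch of how fc behaves in bash; fc -s substitutes and re-runs without opening the editor, and 'NEW PATTERN' is just an illustrative replacement):
fc                                    # open the previous command in $EDITOR; it runs when you save and quit
fc -s 'A VERY LONG TEXT=NEW PATTERN'  # re-run the previous command with the pattern replaced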
These shortcut keys may help you:
Ctrl-a: move to beginning of line
Ctrl-e: move to end of line
Ctrl-b: move to previous character
Ctrl-f: move to next character
Ctrl-p: previous command (same as "UP" key in bash history)
Ctrl-n: next command (same as "DOWN" key in bash history)
Ctrl-h: Delete backward character
Ctrl-d: Delete current cursored character
Ctrl-k: Delete characters after cursor
You can easily picture them if you know the Emacs editor; these shortcuts use the same key bindings as Emacs.
You can define a function, e.g., in your ~/.bash_profile:
some_magic() {
zgrep "$1" file | awk '{print $1}'
}
And use it the following way:
some_magic "A VERY LONG TEXT"
As for your second example, as soon as a command's output is piped, it gets buffered by the pipe. Therefore, the timestamp acquired on the other side of the pipe is wrong. Anyway, if you don't mind having a slightly wrong timestamp, you can use this function:
some_other_magic() {
$1 | awk '{print strftime("%Y-%m-%d %H:%M:%S"), $0}'
}
And use it the following way:
some_other_magic "ls -lrt"
I'm working on a homework assignment. The question is:
Write an awk script to select all regular files (not directories or
links) in /etc ending with .conf, sort the result by size from
smallest to largest, count the number of files, and print out the
number of files followed by the filenames and sizes in two columns.
Include a header row for the filenames and sizes. Paste both your
script and its output in the answer area.
I'm really struggling to get this to work using awk. Here's what I came up with.
ls -lrS /etc/*.conf | wc -l
will return the number 33, which is the number of .conf files in the directory.
ls -lrS /etc/*.conf |awk '{print "File_Size"": " $5 " ""File_Name and Size"": " $9}'
This will make two columns with the size and name of each .conf file in the directory.
It works, but I don't think it is what he's looking for. I'm having an AWKful time.
Let's see here...
select all regular files (not directories or links)
So far you haven't addressed this, but if you are piping in the output of ls -l..., it is easy; select on
/^-/
because directories start with d, symbolic links with l and so on. Only plain old files start with -. Now
print out the number of files followed
Well, counting matches is easy enough...
BEGIN{count=0} # This is not *necessary*, but I tend to put it in for clarity
/^-/ {count++;}
To get the filename and size, look at the output of ls -l and count up columns
BEGIN{count=0}
/^-/ {
count++;
SIZE=$5;
FNAME=$9;
}
The big difficulty here is that awk doesn't provide much by way of sorting primitives, so that's the hard part. That can be beaten if you want to be clever, but it is not particularly efficient (see the awful thing I did in a [code-golf] solution). The easy (and Unixy) thing to do is to pipe part of the output to sort, so we collect a line for each file into a big string:
BEGIN{count=0}
/^-/ {
count++
SIZE=$5;
FNAME=$9;
OUTPUT=sprintf("%10d\t%s\n%s",SIZE,FNAME,OUTPUT);
}
END{
printf("%d files\n",count);
printf(" SIZE \tFILENAME"); # No newline here because OUTPUT has it
print OUTPUT|"sort -n --key=1";
}
Gives output like
11 files
SIZE FILENAME
673 makefile
2192 houghdata.cc
2749 houghdata.hh
6236 testhough.cc
8751 fasthough.hh
11886 fasthough.cc
19270 HoughData.png
60036 houghdata.o
104680 testhough
150292 testhough.o
168588 fasthough.o
(By the way, there is a test subdirectory here, and you'll note that it does not appear in the output.)
Maybe something like this will get you on your way:
ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"} # Prints Headers
/^-/{print $5"\t"$9} # Prints two desired columns, /^-/ captures only files
END{print "Total Files = "(NR-1)}' # Uses in-built variable to print count
Test: The text after # consists of comments for your reference.
[jaypal:~/Temp] ls -lrS /etc/*.conf |
awk '
BEGIN{print "Size:\tFilename:"}
/^-/{print $5"\t"$9}
END{print "Total Files = "(NR-1)}'
Size: Filename:
0 /etc/kern_loader.conf
22 /etc/ntp.conf
54 /etc/ftpd.conf
105 /etc/launchd.conf
168 /etc/memberd.conf
242 /etc/notify.conf
366 /etc/ntp-restrict.conf
526 /etc/gdb.conf
723 /etc/pf.conf
753 /etc/6to4.conf
772 /etc/syslog.conf
983 /etc/rtadvd.conf
1185 /etc/asl.conf
1238 /etc/named.conf
1590 /etc/newsyslog.conf
1759 /etc/autofs.conf
2378 /etc/dnsextd.conf
4589 /etc/man.conf
Total Files = 18
I would first find the files with something like find /etc -type f -name '*.conf', so that you get the right list of files. Then do ls -l on them (perhaps using xargs). After that, using awk should be simple.
But I don't think that doing more of your homework for you would help. You need to think it through yourself and work it out.
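For what it's worth, a rough sketch of how those pieces could fit together (just an outline of the approach above, not a complete answer to the assignment):
find /etc -maxdepth 1 -type f -name '*.conf' -exec ls -l {} + |
    sort -n -k5 |
    awk 'BEGIN{print "SIZE\tFILENAME"} {n++; print $5 "\t" $NF} END{print n " files"}'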
Disclaimer: I'm not a shell expert.
Thought I'd give this a go; I've been beaten on speed of reply, though :-)
clear
FILE_COUNT=$(find /etc/ -maxdepth 1 -type f -name '*.conf' | wc -l)
echo "Number of files: $FILE_COUNT"
ls -lrS /etc/[^-]*.conf | awk '
BEGIN {print "NAME | SIZE"}\
{print $9," | ",$5}\
END {print "- DONE -"}\
'
My output is ugly :-( :
Number of files: 21
NAME | SIZE
/etc/kern_loader.conf | 0
/etc/resolv.conf | 20
/etc/AFP.conf | 24
/etc/ntp.conf | 42
/etc/ftpd.conf | 54
/etc/notify.conf | 132
/etc/memberd.conf | 168
/etc/Symantec.conf | 246
/etc/ntp-restrict.conf | 366
/etc/gdb.conf | 526
/etc/6to4.conf | 753
/etc/syslog.conf | 772
/etc/asl.conf | 860
/etc/liveupdate.conf | 861
/etc/rtadvd.conf | 983
/etc/named.conf | 1238
/etc/newsyslog.conf | 1590
/etc/autofs.conf | 1759
/etc/dnsextd.conf | 2378
/etc/smb.conf | 2975
/etc/man.conf | 4589
/etc/amavisd.conf | 31925
- DONE -