Split texts into smaller texts of n words

I have a large number of texts (several thousand) in a txt format and would like to split them into 500-word long chunks and to save these chunks into separate folders.
< *.txt tr -c A-Za-z0-9 '\n' | grep -v '^$' | split -l 500
can do the job but it splits texts to one word per line, whereas I would like to retain the original format.
I was wondering if there is a bash command or Python script to do this.

You should also be able to do this with csplit, but I had better luck with the Perl solution found here: https://unix.stackexchange.com/questions/66513/how-can-i-split-a-large-text-file-into-chunks-of-500-words-or-so
Thanks to Joseph R.
$ cat generatewordchunks.pl
perl -e '
undef $/;                  # slurp mode: read the whole file at once
$file = <>;
# write each run of 500 whitespace-separated words to its own chunk file
while ($file =~ /\G((\S+\s+){500})/gc) {
    $i++;
    open A, ">", "chunk-$i.txt";
    print A $1;
    close A;
}
# write whatever is left over (fewer than 500 words) to a final chunk
$i++;
if ($file =~ /\G(.+)\Z/sg) {
    open A, ">", "chunk-$i.txt";
    print A $1;
}
' $1
$ ./generatewordchunks.pl woord.list
$ ls -ltr
total 13
-rwxrwx--- 1 root vboxsf 5934 Jul 31 16:03 woord.list
-rwxrwx--- 1 root vboxsf 362 Jul 31 16:08 generatewordchunks.pl
-rwxrwx--- 1 root vboxsf 4203 Jul 31 16:11 chunk-1.txt
-rwxrwx--- 1 root vboxsf 1731 Jul 31 16:11 chunk-2.txt

Related

Find regular expression matching condition

I have a set of files including a date in their name:
MERRA2_400.tavg1_2d_slv_Nx.20151229.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151230.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151231.SUB.nc
I want to select the files matching a condition on this date. In this example: date > 20151230
I tried things like:
find . -regex ".*.SUB.nc" | cut -d "." -f 4 | while read a; do if [ $a -ge 20151201 ]; then echo $a; fi; done
BUT:
1) This is returning only a part of the filename, whereas I would like to return the entire filename.
2) There may be a more elegant way than using while read/do
thanks in advance!
Rearranged, your code becomes:
#!/usr/bin/env bash
find . -regex ".*.SUB.nc" \
| rev | cut -d '.' -f 3 | rev \
| while read a; do
if [ $a -ge 20151201 ]; then
echo $a
fi
done
rev | cut -d '.' -f 3 | rev is used because, if you give an absolute path or the subdirectories contain . in their names, the date won't be the 4th field; it will, however, always be the 3rd-from-last field.
This will give the output:
20151231
20151229
20151230
To show the complete file names, replace echo $a with ls *$a*. Output:
MERRA2_400.tavg1_2d_slv_Nx.20151231.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151229.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151230.SUB.nc
I tested this script with file names whose dates are earlier than 20151201, for example MERRA2_400.tavg1_2d_slv_Nx.20151200.SUB.nc. The results are consistent.
Perhaps a more efficient way to accomplish your task is to use a grep regex such as:
find . -regex ".*.SUB.nc" | grep -E "201512(0[1-9]|[1-9][0-9])|201[6-9][0-9][0-9][0-9]"
This will work just fine.
find . -regex ".*.SUB.nc" | rev | cut -d '.' -f 3 | rev | while read a; do if [ $a -ge 20151201 ]; then echo `ls -R | grep $a` ;fi ;done
rev | cut -d '.' -f 3 | rev is used because, if you give an absolute path or the subdirectories contain . in their names, the date won't be the 4th field; it will always be the 3rd-from-last field.
ls -R | grep $a so that you can recursively find out the name of the file.
Assume the files and directory structure are:
[root@localhost temp]# ls -lrt -R
.:
total 8
-rw-r--r--. 1 root root 0 Apr 25 16:15 MERRA2_400.tavg1_2d_slv_Nx.20151231.SUB.nc
-rw-r--r--. 1 root root 0 Apr 25 16:15 MERRA2_400.tavg1_2d_slv_Nx.20151230.SUB.nc
-rw-r--r--. 1 root root 0 Apr 25 16:15 MERRA2_400.tavg1_2d_slv_Nx.20151229.SUB.nc
drwxr-xr-x. 2 root root 4096 Apr 25 16:32 temp.3
drwxr-xr-x. 3 root root 4096 Apr 25 17:13 temp2
./temp.3:
total 0
./temp2:
total 4
-rw-r--r--. 1 root root 0 Apr 25 16:27 MERRA2_400.tavg1_2d_slv_Nx.20151111.SUB.nc
-rw-r--r--. 1 root root 0 Apr 25 16:27 MERRA2_400.tavg1_2d_slv_Nx.20151222.SUB.nc
drwxr-xr-x. 2 root root 4096 Apr 25 17:13 temp21
./temp2/temp21:
total 0
-rw-r--r--. 1 root root 0 Apr 25 17:13 MERRA2_400.tavg1_2d_slv_Nx.20151333.SUB.nc
Running the above command gives:
MERRA2_400.tavg1_2d_slv_Nx.20151229.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151231.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151230.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151333.SUB.nc
MERRA2_400.tavg1_2d_slv_Nx.20151222.SUB.nc
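The rev | cut | rev plus while read pipeline above can also be collapsed into a single awk filter that keeps the whole path, a sketch assuming the date is always the 3rd-from-last dot-separated field:

```shell
# Sketch: print full paths whose 3rd-from-last dot-separated field
# (the date) is >= 20151201.
find . -name '*.SUB.nc' | awk -F. '$(NF-2) >= 20151201'
```

This avoids the separate ls lookup entirely, since the path is never cut apart.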

Print permissions from file arguments in Bash script

I'm having trouble reading the permissions of file arguments. It looks like it has something to do with hidden files, but I'm not sure why.
Current Code:
#!/bin/bash
if [ $# = 0 ]
then
echo "Usage ./checkPerm filename [filename2 ... filenameN]"
exit 0
fi
for file in "$@"
do
ls -l | grep $file | cut -f1 -d' '
# Do Something
done
I can get the permissions for each input, but when a hidden file is run through through the loop it re-prints the permissions of all files.
-bash-4.1$ ll test*
-rw-r--r-- 1 user joe 0 Nov 11 19:07 test1
-r-xr-xr-x 1 user joe 0 Nov 11 19:07 test2*
-r--r----- 1 user joe 0 Nov 11 19:07 test3
-rwxr-x--- 1 user joe 0 Nov 11 19:07 test4*
-bash-4.1$ ./checkPerm test*
-rw-r--r--
-rw-r--r--
-r-xr-xr-x
-r--r-----
-rwxr-x---
-r--r-----
-rw-r--r--
-r-xr-xr-x
-r--r-----
-rwxr-x---
-bash-4.1$
What is going on in the loop?
It's your grep:
ls -l | grep 'test2*'
This will grep out anything containing test, since test2* asks for test followed by zero or more 2s, as specified by the 2*.
To get your intended result, simply remove your loop and replace it with this:
ls -l "$@" | cut -d' ' -f1
Or keep your loop, but remove the grep:
ls -l $file | cut -d' ' -f1
Also, technically, none of those files are hidden. Hidden files in bash start with ., like .bashrc.
When you do the ls -l inside the loop and then grep the results, if there are files that contain test1 in the name, but not at the start, they are selected by the grep, giving you extra results. You could see that by doing:
ls -l | grep test
and seeing that there are many more entries than the 4 you get with ls -l test*.
Inside your loop, you should probably use just:
ls -ld "$file" | cut -d' ' -f1
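As an alternative to grepping ls at all, here is a sketch using stat to print just the permission string per argument. The %A format is GNU coreutils; BSD/macOS stat would need -f '%Sp' instead.

```shell
#!/bin/bash
# Sketch: print the permission string of each argument with stat
# instead of parsing ls output (GNU coreutils %A format).
for file in "$@"; do
    stat -c '%A' -- "$file"
done
```

Because stat takes the filename directly, globs, hidden files, and substring collisions in other filenames can no longer distort the output.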

Can I change the order of the output fields from the Linux cut command? [duplicate]

This question already has answers here:
Rearrange columns using cut
(9 answers)
Closed 5 years ago.
I am using the cut command on the command line and it seems I can't get the output I want.
Do you have any idea why I am getting this? Is it something that I do wrong?
This is the normal output and I would like to output in different order:
[root@upbvm500 root]# ls -al IDS_DIR/a | tr -s " "
-rw-r--r-- 1 root root 0 Jan 1 17:18 IDS_DIR/a
[root@upbvm500 root]#
[root@upbvm500 root]# ls -al IDS_DIR/a | tr -s " " | cut -d" " -f5,6,7,8,3,4,1
-rw-r--r-- root root 0 Jan 1 17:18
But as you can see, this is not working like expected.
Any idea why they are switching places?
From man cut:
Selected input is written in the same order that it is read, and is written exactly once.
Use awk '{print $5,$6,$7,$8,$3,$4,$1}' instead of cut.
cut does not reorder its output. It simply collects a list of which columns to print, then prints them out as they arrive.
Use a different tool such as Awk to reorder output columns.
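A quick demonstration of the difference (the field numbers here are just illustrative):

```shell
# cut emits fields in input order regardless of how -f lists them;
# awk emits them in the order you ask for.
printf '1 2 3\n' | cut -d' ' -f3,1      # prints "1 3"
printf '1 2 3\n' | awk '{print $3, $1}' # prints "3 1"
```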
However, in this particular case, try stat or find instead of ls. It is generally not recommended to parse the output of ls. See http://mywiki.wooledge.org/ParsingLs
As others have mentioned, don't parse ls. If you want file information, use stat
stat -c "%s %y %U %G %A %n" filename
You may need to do some extra work to get the timestamp formatted as you want.
$ ls -l data
-rw-r--r-- 1 glennj glennj 13 2013-01-01 11:19 data
$ LC_TIME=POSIX ls -l data
-rw-r--r-- 1 glennj glennj 13 Jan 1 11:19 data
$ stat -c "%s %y %U %G %A %n" data
13 2013-01-01 11:19:53.670015242 -0500 glennj glennj -rw-r--r-- data
$ stat -c "%s %Y %U %G %A %n" data | awk '{$2 = strftime("%b %e %H:%M", $2)} 1'
13 Jan 1 11:19 glennj glennj -rw-r--r-- data

Get first and last files per month

Based on this question Group files and pipe to awk command
I have a set of files like this:-
-rw-r--r-- 1 root root 497186 Apr 21 13:17 2012_03_25
-rw-r--r-- 1 root root 490558 Apr 21 13:17 2012_03_26
-rw-r--r-- 1 root root 488797 Apr 21 13:17 2012_03_27
-rw-r--r-- 1 root root 316290 Apr 21 13:17 2012_03_28
-rw-r--r-- 1 root root 490081 Apr 21 13:17 2012_03_29
-rw-r--r-- 1 root root 486621 Apr 21 13:17 2012_03_30
-rw-r--r-- 1 root root 490904 Apr 21 13:17 2012_03_31
-rw-r--r-- 1 root root 491788 Apr 21 13:17 2012_04_01
-rw-r--r-- 1 root root 488630 Apr 21 13:17 2012_04_02
Based on the answer in the linked question I have a script with the following code, which works fine:-
DIR="/tmp/tmp"
for month in $(find "$DIR" -maxdepth 1 -type f | sed 's/.*\/\([0-9]\{4\}_[0-9]\{2\}\).*/\1/' | sort -u); do
    echo "Start awk command for files $month"
    power=$(awk -F, '{ x += $1 } END { print x/NR }' "$DIR/${month}"_[0-3][0-9])
    echo $power
done
The below command on its own returns a list like this:-
find /tmp/tmp -maxdepth 1 -type f | sed 's/.*\/\([0-9]\{4\}_[0-9]\{2\}\).*/\1/' | sort -u
2011_05
2011_06
2011_07
2011_08
2011_09
2011_10
2011_11
2011_12
2012_01
2012_02
2012_03
2012_04
The find command is passing a set of files, selected by a glob, to awk to be processed as a batch.
Based on this, I want to be able to run the following cut commands:
head -1 FirstFile | date -d "`cut -d, -f7`" +%s
tail -1 LastFile | date -d "`cut -d, -f7`" +%s
These need to be run for the FIRST and LAST file PER SET
So for 2012_03 above, the head would need to be run for the 2012_03_25 file and the tail would need to be run for the 2012_03_31 as these are the first and last files in the set for March.
So basically I need to be able to get the FIRST and LAST file PER BATCH.
I hope I have made this clear enough, if not please comment.
DIR="/tmp/tmp"
for month in $(find "$DIR" -maxdepth 1 -type f | sed 's/.*\/\([0-9]\{4\}_[0-9]\{2\}\).*/\1/' | sort -u); do
    echo "Start awk command for files $month"
    IFS=, read start end power < <(awk -F, 'BEGIN{OFS = ","} NR == 1 {printf "%s,", $7} { x += $1; d = $7 } END { print d, x/NR }' "$DIR/${month}"_[0-3][0-9])
    echo $power
    date -d "$start" +%s
    date -d "$end" +%s
done
Here is how you would use a here-doc, which should work in most shells:
read start end power <<EOF
$(awk -F, 'NR == 1 {printf "%s ", $7} { x += $1; d = $7 } END { print d, x/NR }' "$DIR/${month}"_[0-3][0-9])
EOF
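Since the YYYY_MM_DD names sort chronologically, the first and last file of each month's set can also be picked straight from the glob expansion, which the shell returns in sorted order. A POSIX sketch, reusing $DIR and $month from the loop above:

```shell
# Sketch: first and last file of the month's set, relying on the
# sorted glob expansion of the YYYY_MM_DD names.
set -- "$DIR/${month}"_[0-3][0-9]
first=$1
for last in "$@"; do :; done   # after the loop, $last holds the final file
# $first and $last can now be fed to the head/tail + date commands
```

This avoids running head and tail on every file just to locate the boundaries of the set.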

How to recursively list all files and directories

Using the tcsh shell on FreeBSD, is there a way to recursively list all files and directories including the owner, group and relative path to the file?
ls -alR comes close, but it does not show the relative path in front of every file, it shows the path at the top of a grouping i.e.
owner% ls -alR
total 0
drwxr-xr-x 3 owner group 102 Feb 1 10:50 .
drwx------+ 27 owner group 918 Feb 1 10:49 ..
drwxr-xr-x 5 owner group 170 Feb 1 10:50 subfolder
./subfolder:
total 16
drwxr-xr-x 5 owner group 170 Feb 1 10:50 .
drwxr-xr-x 3 owner group 102 Feb 1 10:50 ..
-rw-r--r-- 1 owner group 0 Feb 1 10:50 file1
-rw-r--r-- 1 owner group 0 Feb 1 10:50 file2
What I would like is output like:
owner group ./relative/path/to/file
The accepted answer to this question shows the relative path to a file, but does not show the owner and group.
How about this:
find . -exec ls -dl \{\} \; | awk '{print $3, $4, $9}'
Use tree. Few Linux distributions install it by default (in these dark days of GUI-only tools :-), but it's always available in the standard repositories. It should be available for *BSD too; see http://mama.indstate.edu/users/ice/tree/
Use:
tree -p -u -g -f -i
or
tree -p -u -g -f
or check the man page for many other useful arguments.
This works on Debian Linux:
find $PWD -type f
find comes close:
find . -printf "%u %g %p\n"
There is also "%P", which removes the prefix from the filename, if you want the paths to be relative to the specified directory.
Note that this is GNU find, I don't know if the BSD find also supports -printf.
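For instance, a sketch of the %P variant (GNU find; the /var/log path is just illustrative):

```shell
# Sketch: owner, group, and the path relative to the search root (%P).
find /var/log -printf '%u %g %P\n'
```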
You've already got an answer that works, but for reference you should be able to do this on the BSDs (I've tested it on a mac) :
find . -ls
If you fancy using Perl, don't use it as a wrapper around shell commands. Doing it in native Perl is faster, more portable, and more resilient; plus it avoids ad-hoc regexes.
use File::Find;
use File::stat;
find (\&myList, ".");
sub myList {
    my $st = lstat($_) or die "No $_: $!";
    # getpwuid/getgrgid in scalar context return the owner and group names
    print scalar getpwuid($st->uid), " ",
          scalar getgrgid($st->gid), " ",
          $File::Find::name, "\n";
}
A simple way I found was this:
ls -lthr /path_to_directory/*
The * represents one directory level; use */* for two levels, and so on.
Ajiths-MBP:test ajith$ ls -lthr *
test2:
total 0
-rw-r--r-- 1 ajith staff 0B Oct 17 18:22 test2.txt
test3:
total 0
-rw-r--r-- 1 ajith staff 0B Oct 17 18:22 test3.txt
test1:
total 0
-rw-r--r-- 1 ajith staff 0B Oct 17 18:21 test1.txt
drwxr-xr-x 3 ajith staff 96B Oct 17 18:22 test1_sub_dir
Ajiths-MBP:test ajith$ ls -lthr */*
-rw-r--r-- 1 ajith staff 0B Oct 17 18:21 test1/test1.txt
-rw-r--r-- 1 ajith staff 0B Oct 17 18:22 test2/test2.txt
-rw-r--r-- 1 ajith staff 0B Oct 17 18:22 test3/test3.txt
test1/test1_sub_dir:
total 0
-rw-r--r-- 1 ajith staff 0B Oct 17 18:22 test1_sub_file.txt
Use a shell script. Or a Perl script. Example Perl script (because it's easier for me to do):
#!/usr/bin/perl
use strict;
use warnings;
foreach (`find . -name '*'`) {
    chomp;
    my $ls = `ls -ld "$_"`;
    # extract the owner and group fields from the ls output
    # (an incomprehensible string of characters because it's Perl)
    my ($owner, $group) = $ls =~ /\S+\s+\S+\s+(\S+)\s+(\S+)/;
    printf("%-10s %-10s %s\n", $owner, $group, $_);
}
Perhaps a bit more verbose than the other answers, but should do the trick, and should save you having to remember what to type. (Code untested.)
