Creating multiple files with names matching a pattern from one file - linux

I have a text file such as:
>Tolten.WP_096573835.1
MSSPKSLLIRRARIILPNGELMVGDVLTRDRQIVEVAPEIYTPTPTTEIDAAGLTLLPGVIDPQVHFREPGLEHKEDLFTASCACAKGGVTSFLEMPNTRPLTTN
--
>Trivar.WP_011317016.1
MSSPQSLLIRGARILLPNGEFLLGDVLIRDRHIIEVGTEIVNTTPATEIDAKGLTLLPGVIDPQVHFREPGLEHKEDLFTASCACAKGGVTSFLEMPNTRPLTTS
--
>uniSU2.WP_085434158.1
MTQLLIRHAQILLPNGQFLLGDVLTQDGKILEVASEIAATDLSNIIDATGLTLLPGVIDPQVHFREPGLEHKEDLFTATCACAKGGVTSFLEMPNTRPLTTTQAA
--
>Chlfri.WP_016876644.1
MSETPLLDKVIKNVRVVRPNQHTIEKVDIGIKNGKFAQIAPQISPDQTKEVFDAKNLLGFPGVVDAHMHIGIYQPLAQDAVSESKAAAMGGVTTSLNYIRTGQYY
--
>Noslin.WP_099070767.1
MSEASLLDRVIKNVRVVRPHNDAIELLDLGIKDGKFARIASHISPDTAKEVFDAKNLLGFPGVVDAHMHIGIYQPLDKDAVTESKAAAMGGVTTSLNYIRTGQYY
I want to create multiple text files with the content between each and every "--". The text files would be named after the string following ">".
In the end I would have these text files:
Tolten.WP_096573835.1
Trivar.WP_011317016.1
uniSU2.WP_085434158.1
Chlfri.WP_016876644.1
Noslin.WP_099070767.1
With the following content:
Tolten.WP_096573835.1 text file:
>Tolten.WP_096573835.1
MSSPKSLLIRRARIILPNGELMVGDVLTRDRQIVEVAPEIYTPTPTTEIDAAGLTLLPGVIDPQVHFREPGLEHKEDLFTASCACAKGGVTSFLEMPNTRPLTTN
Trivar.WP_011317016.1 text file:
>Trivar.WP_011317016.1
MSSPQSLLIRGARILLPNGEFLLGDVLIRDRHIIEVGTEIVNTTPATEIDAKGLTLLPGVIDPQVHFREPGLEHKEDLFTASCACAKGGVTSFLEMPNTRPLTTS
uniSU2.WP_085434158.1 text file:
>uniSU2.WP_085434158.1
MTQLLIRHAQILLPNGQFLLGDVLTQDGKILEVASEIAATDLSNIIDATGLTLLPGVIDPQVHFREPGLEHKEDLFTATCACAKGGVTSFLEMPNTRPLTTTQAA
Chlfri.WP_016876644.1 text file:
>Chlfri.WP_016876644.1
MSETPLLDKVIKNVRVVRPNQHTIEKVDIGIKNGKFAQIAPQISPDQTKEVFDAKNLLGFPGVVDAHMHIGIYQPLAQDAVSESKAAAMGGVTTSLNYIRTGQYY
Noslin.WP_099070767.1 text file:
>Noslin.WP_099070767.1
MSEASLLDRVIKNVRVVRPHNDAIELLDLGIKDGKFARIASHISPDTAKEVFDAKNLLGFPGVVDAHMHIGIYQPLDKDAVTESKAAAMGGVTTSLNYIRTGQYY
I know csplit works for this sort of thing:
csplit --suppress-matched original_text_file.txt '/^--/' '{*}'
But I can't get it to name the files appropriately.
Does anyone know how to help?
Thanks in advance :)

I'm afraid csplit can't do that directly; you can only change the prefix and suffix of the filenames using a "fixed" format. Nothing stops you from doing the renaming afterwards with a simple loop, e.g.:
$ csplit --suppress-matched original_text_file.txt '/^--$/' '{*}'
129
129
129
129
129
$ for f in xx*; do mv "$f" "`head -n 1 "$f" | cut -c 2-`"; done
$ ls -1
Chlfri.WP_016876644.1
Noslin.WP_099070767.1
original_text_file.txt
Tolten.WP_096573835.1
Trivar.WP_011317016.1
uniSU2.WP_085434158.1
$
You can easily combine csplit and the loop into a one-liner:
csplit --suppress-matched original_text_file.txt '/^--$/' '{*}' && for f in xx*; do mv "$f" "`head -n 1 "$f" | cut -c 2-`"; done
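If you'd rather skip the intermediate xx* files entirely, a single awk pass can split and name the files in one go. A minimal sketch, assuming every header starts with ">" and the separator lines are exactly "--":
awk '/^--$/ {next}                            # skip separator lines
     /^>/   {close(out); out=substr($1, 2)}  # start a new file named after the header, ">" stripped
     out    {print > out}                    # write the current line to the active file
' original_text_file.txt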

Related

Grep in a for loop in Linux

I have a file containing a number of occurrences of a particular string, "Thermochemistry". I'm trying to write a script to grep the 500 lines above each occurrence of this string and create a new file numbered accordingly. Here is what I've tried:
occurrences="$(grep -c 'Thermochemistry' $FILE)"
for (( i=1; i<=$occurrences; i++)); do
grep -B500 -m "$i" 'Thermochemistry' $FILE > newfile_"$i".tmp
done
If there are 20 occurrences of 'Thermochemistry' in the file, I wanted it to create 20 new files, called newfile_1.tmp to newfile_20.tmp, but it doesn't work.
Can anyone help?
Besides the magic command from oguz ismail, you could use the following awk line, which keeps a ring buffer of the last 500 lines and dumps it to a new file at each match:
awk '/Thermochemistry/{close(f); f="newfile_" ++n ".tmp"
for(j=(FNR>500 ? FNR-500 : 1); j<FNR; ++j) print b[j%500] > f
print > f
}{b[FNR%500]=$0}' file
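If you prefer to stay closer to your grep-based attempt, a loop over the matching line numbers works too (a sketch; sed re-reads the file once per match, so it is slower on big files):
n=0
grep -n 'Thermochemistry' "$FILE" | cut -d: -f1 |
while read -r lineno; do
    n=$((n+1))
    start=$(( lineno > 500 ? lineno - 500 : 1 ))    # don't run past the top of the file
    sed -n "${start},${lineno}p" "$FILE" > "newfile_${n}.tmp"
done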

Create CSV file using file name and file contents in Linux

I have a folder with over 400K txt files.
With names like
deID.RESUL_12433287659.txt_234323456.txt
deID.RESUL_34534563649.txt_345353567.txt
deID.RESUL_44235345636.txt_537967875.txt
deID.RESUL_35234663456.txt_423452545.txt
Each file has different content
I want to grab file name and file content and put in CSV.
Something like:
file_name,file_content
deID.RESUL_12433287659.txt_234323456.txt,Content 1
deID.RESUL_34534563649.txt_345353567.txt,Content 2
deID.RESUL_44235345636.txt_537967875.txt,Content 3
deID.RESUL_35234663456.txt_423452545.txt,Content 4
I know how to grab all the files in a directory in CSV using:
find * > files.csv
How can I also grab the contents of the file?
find * is somewhat strange: find already scans recursively, so find . is enough to cover everything that find * would (well, unless you are relying on particular shell glob rules, such as * skipping hidden files).
We need to iterate over the files. It would also be nice to remove newlines.
# create file for a MCVE
while IFS=' ' read -r file content; do echo "$content" > "$file"; done <<EOF
deID.RESUL_12433287659.txt_234323456.txt Content 1
deID.RESUL_34534563649.txt_345353567.txt Content 2
deID.RESUL_44235345636.txt_537967875.txt Content 3
deID.RESUL_35234663456.txt_423452545.txt Content 4
EOF
{
# I'm using `|` as the separator for columns
# output header names
echo 'file_name|file_content';
# this is the heart of the script
# find the files
# for each file execute `sh -c 'printf "%s|%s\n" "$1" "$(cat "$1")"' -- <filename>`
# printf - nice printing
# "$(cat "$1")" - gets file content and also removes trailing empty newlines. Neat.
find . -type f -name 'deID.*' -exec sh -c 'printf "%s|%s\n" "$1" "$(cat "$1")"' -- {} \;
} |
# nice formatting:
column -t -s'|' -o ' '
will output:
file_name file_content
./deID.RESUL_44235345636.txt_537967875.txt Content 3
./deID.RESUL_35234663456.txt_423452545.txt Content 4
./deID.RESUL_34534563649.txt_345353567.txt Content 2
./deID.RESUL_12433287659.txt_234323456.txt Content 1
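If you need an actual comma-separated CSV rather than the |-separated view above, a variation that quotes the content column might look like this (a sketch: newlines are flattened to spaces and embedded double quotes are doubled, per CSV convention):
{
  echo 'file_name,file_content'
  find . -type f -name 'deID.*' -exec sh -c '
    # flatten newlines, escape embedded double quotes, emit one CSV row
    printf "%s,\"%s\"\n" "$1" "$(tr "\n" " " < "$1" | sed "s/\"/\"\"/g")"
  ' -- {} \;
} > files.csv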

Better way to rename files based on multiple patterns

a lot of files I download have crap/spam in their filenames, e.g.
[ www.crap.com ] file.name.ext
www.crap.com - file.name.ext
I've come up with two ways for dealing with them but they both seem pretty clunky:
with parameter expansion:
if [[ ${base_name} != ${base_name//\[+([^\]])\]} ]]
then
mv -v "${dir_name}/${base_name}" "${dir_name}/${base_name//\[+([^\]])\]}" &&
base_name="${base_name//\[+([^\]])\]}"
fi
if [[ ${base_name} != ${base_name//www.*.com - /} ]]
then
mv -v "${dir_name}/${base_name}" "${dir_name}/${base_name//www.*.com - /}" &&
base_name="${base_name//www.*.com - /}"
fi
# more of these type of statements; one for each type of frequently-encountered pattern
and then with echo/sed:
tmp=`echo "${base_name}" | sed -e 's/\[[^][]*\]//g' | sed -e 's/\s-\s//g'`
mv "${base_name}" "{tmp}"
I feel like the parameter expansion is the worse of the two but I like it because I'm able to keep the same variable assigned to the file for further processing after the rename (the above code is used in a script that's called for each file after the file download is complete).
So anyway I was hoping there's a better/cleaner way to do the above that someone more knowledgeable than myself could show me, preferably in a way that would allow me to easily reassign the old/original variable to the new/renamed file.
Thanks
Two answers: one using perl rename, one using pure bash.
As some people dislike perl, I wrote a bash-only version as well.
Renaming files by using the rename command.
Introduction
Yes, this is a typical job for the rename command, which was designed precisely for this:
man rename | sed -ne '/example/,/^[^ ]/p'
For example, to rename all files matching "*.bak" to strip the
extension, you might say
rename 's/\.bak$//' *.bak
To translate uppercase names to lower, you'd use
rename 'y/A-Z/a-z/' *
More oriented samples
Simply drop all spaces and square brackets:
rename 's/[ \[\]]*//g;' *.ext
Rename all .jpg by numbering from 1:
rename 's/^.*$/sprintf "IMG_%05d.JPG",++$./e' *.jpg
Demo:
touch {a..e}.jpg
ls -ltr
total 0
-rw-r--r-- 1 user user 0 sep 6 16:35 e.jpg
-rw-r--r-- 1 user user 0 sep 6 16:35 d.jpg
-rw-r--r-- 1 user user 0 sep 6 16:35 c.jpg
-rw-r--r-- 1 user user 0 sep 6 16:35 b.jpg
-rw-r--r-- 1 user user 0 sep 6 16:35 a.jpg
rename 's/^.*$/sprintf "IMG_%05d.JPG",++$./e' *.jpg
ls -ltr
total 0
-rw-r--r-- 1 user user 0 sep 6 16:35 IMG_00005.JPG
-rw-r--r-- 1 user user 0 sep 6 16:35 IMG_00004.JPG
-rw-r--r-- 1 user user 0 sep 6 16:35 IMG_00003.JPG
-rw-r--r-- 1 user user 0 sep 6 16:35 IMG_00002.JPG
-rw-r--r-- 1 user user 0 sep 6 16:35 IMG_00001.JPG
Full syntax for matching the SO question, in a safe way
There is a strong and safe way to do this using the rename utility:
As this is a common perl tool, we have to use perl syntax:
rename 'my $o=$_;
s/[ \[\]]+/-/g;
s/-+/-/g;
s/^-//g;
s/-(\..*|)$/$1/g;
s/(.*[^\d])(|-(\d+))(\.[a-z0-9]{2,6})$/
my $i=$3;
$i=0 unless $i;
sprintf("%s-%d%s", $1, $i+1, $4)
/eg while
$o ne $_ &&
-f $_;
' *
Testing rule:
touch '[ www.crap.com ] file.name.ext' 'www.crap.com - file.name.ext'
ls -1
[ www.crap.com ] file.name.ext
www.crap.com - file.name.ext
rename 'my $o=$_; ...
...
...' *
ls -1
www.crap.com-file.name-1.ext
www.crap.com-file.name.ext
touch '[ www.crap.com ] file.name.ext' 'www.crap.com - file.name.ext'
ls -1
www.crap.com-file.name-1.ext
[ www.crap.com ] file.name.ext
www.crap.com - file.name.ext
www.crap.com-file.name.ext
rename 'my $o=$_; ...
...
...' *
ls -1
www.crap.com-file.name-1.ext
www.crap.com-file.name-2.ext
www.crap.com-file.name-3.ext
www.crap.com-file.name.ext
... and so on...
... and it's safe as long as you don't pass the -f flag to the rename command: files won't be overwritten and you will get an error message if something goes wrong.
Renaming files by using bash and so-called bashisms:
I prefer doing this with a dedicated utility, but it could even be done in pure bash (i.e. without any fork).
No binary other than bash is used (no sed, awk, tr or anything else):
#!/bin/bash
for file;do
newname=${file//[ \]\[]/.}
while [ "$newname" != "${newname#.}" ] ;do
newname=${newname#.}
done
while [ "$newname" != "${newname//[.-][.-]/.}" ] ;do
newname=${newname//[.-][.-]/-};done
if [ "$file" != "$newname" ] ;then
if [ -f "$newname" ] ;then
ext=${newname##*.}
basename=${newname%.$ext}
partname=${basename%%-[0-9]}
count=${basename#${partname}-}
[ "$partname" = "$count" ] && count=0
while printf -v newname "%s-%d.%s" "$partname" $((++count)) "$ext" &&
[ -f "$newname" ] ;do
:;done
fi
mv "$file" $newname
fi
done
To be run with files as arguments, for example:
/path/to/my/script.sh \[*
Replacing spaces and square brackets with dots.
Replacing sequences of .-, -., -- or .. with a single -.
If the filename doesn't differ, there is nothing to do.
Testing whether a file with the new name already exists...
Splitting the filename into base name, counter and extension to build an indexed new name.
Looping while a file with the new name exists.
Finally, renaming the file.
Take advantage of the following classical pattern:
job_select /path/to/directory | job_strategy | job_process
where job_select is responsible for selecting the objects of your job, job_strategy prepares a processing plan for these objects and job_process eventually executes the plan.
This assumes that filenames do not contain a vertical bar | nor a newline character.
The job_select function
# job_select PATH
# Produce the list of files to process
job_select()
{
find "$1" -name 'www.*.com - *' -o -name '[*] - *'
}
The find command can examine all properties of the file maintained by the file system, like creation time, access time and modification time. It is also possible to control how the filesystem is explored, by telling find not to descend into mounted filesystems or how many recursion levels are allowed. It is common to append pipes to the find command to perform more complicated selections based on the filename.
Avoid the common pitfall of including the contents of hidden directories in the output of the job_select function. For instance, the directories CVS, .svn, .svk and .git are used by the corresponding source control management tools and it is almost always wrong to include their contents in the output of the job_select function. By inadvertently batch processing these files, one can easily make the affected working copy unusable.
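For instance, a job_select variant that prunes the usual SCM directories could look like this (a sketch):
# job_select PATH
# Produce the list of files to process, skipping SCM metadata directories
job_select()
{
  find "$1" \( -name .git -o -name .svn -o -name CVS \) -prune \
    -o -type f \( -name 'www.*.com - *' -o -name '\[*\] *' \) -print
}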
The job_strategy function
# job_strategy
# Prepare a plan for renaming files
job_strategy()
{
sed -e '
h
s#/www\..*\.com - *#/#
s#/\[[^]]*\] *#/#
x
G
s/\n/|/
'
}
This command reads the output of job_select and makes a plan for our renaming job. The plan is represented by text lines with two fields separated by the character |, the first field being the old name of the file and the second being the newly computed name; it looks like:
[ www.crap.com ] file.name.1.ext|file.name.1.ext
www.crap.com - file.name.2.ext|file.name.2.ext
The particular program used to produce the plan is essentially irrelevant, but it is common to use sed (as in the example), awk or perl for this. Let us walk through the sed script used here:
h Replace the contents of the hold space with the contents of the pattern space.
… Edit the contents of the pattern space.
x Swap the contents of the pattern and hold spaces.
G Append a newline character followed by the contents of the hold space to the pattern space.
s/\n/|/ Replace the newline character in the pattern space by a vertical bar.
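For example, feeding a single name through job_strategy (assuming the function above is defined in the current shell):
$ printf '%s\n' 'dir/[ www.crap.com ] file.name.ext' | job_strategy
dir/[ www.crap.com ] file.name.ext|dir/file.name.ext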
It can be easier to use several filters to prepare the plan. Another common case is the use of the stat command to add creation times to file names.
The job_process function
# job_process
# Rename files according to a plan
job_process()
{
local oldname
local newname
while IFS='|' read -r oldname newname; do
mv "$oldname" "$newname"
done
}
The input field separator IFS is adjusted to let the function read the output of job_strategy. Declaring oldname and newname as local is useful in large programs but can be omitted in very simple scripts. The job_process function can be adjusted to avoid overwriting existing files and report the problematic items.
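For instance, a minimally adjusted loop that reports collisions instead of clobbering files (a sketch):
job_process()
{
  local oldname
  local newname
  while IFS='|' read -r oldname newname; do
    if [ -e "$newname" ]; then
      # report the problematic item instead of overwriting it
      printf 'skipping %s: %s already exists\n' "$oldname" "$newname" >&2
    else
      mv "$oldname" "$newname"
    fi
  done
}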
About data structures in shell programs
Note the use of pipes to transfer data from one stage to the other: apprentices often rely on variables to represent such information but it turns out to be a clumsy choice. Instead, it is preferable to represent data as tabular files or as tabular data streams moving from one process to the other, in this form, data can be easily processed by powerful tools like sed, awk, join, paste and sort — only to cite the most common ones.
You can use rnm
rnm -rs '/\[crap\]|\[spam\]//g' *.ext
The above will remove [crap] or [spam] from the filename.
You can pass multiple regex patterns by terminating them with ; or by overloading the -rs option.
rnm -rs '/[\[\]]//g;/\s*\[crap\]//g' -rs '/crap2//' *.ext
The general format of this replace string is /search_part/replace_part/modifier
search_part : regex to search for.
replace_part : string to replace with
modifier : i (case insensitive), g (global replace)
uppercase/lowercase:
A replace string of the form /search_part/\c/modifier will make the part of the filename selected by the regex search_part lowercase, while \C (capital C) in the replace part will make it uppercase.
rnm -rs '/[abcd]/\C/g' *.ext
## this will capitalize all a,b,c,d in the filenames
If you have many regex patterns that need to be dealt with, then put those patterns in a file and pass the file with -rs/f option.
rnm -rs/f /path/to/regex/pattern/file *.ext
You can find some other examples here.
Note:
rnm uses PCRE2 (revised PCRE) regex.
You can undo an unwanted rename operation by running rnm -u
P.S: I am the author of this tool.
If you are using an Ubuntu/Debian OS, use the rename command to rename multiple files at a time.
If you want to use something that does not depend on perl, you can use the following code (let's call it sanitizeNames.sh). It only covers a few cases, but it's easily extensible using string substitution, tr (and sed, too).
#!/bin/bash
for f in "$@"; do
newfname=$(echo "$f" |
tr -d '\[ ' |    # delete opening square brackets and spaces
tr ' \]' '-' |   # translate closing square brackets to dashes
tr -s '-' |      # squeeze multiple dashes
tr -s '.'        # squeeze multiple dots
)
newfname=${newfname//-./.}
if [ -f "$newfname" ]; then
# Some string magic...
extension=${newfname##*\.}
basename=${newfname%\.*}
basename=${basename%\-[1-9]*}
lastNum=$(ls "$basename"* | wc -l)
mv "$f" "$basename-$lastNum.$extension"
else
mv "$f" "$newfname"
fi
done
And use it:
$ touch '[ www.crap.com ] file.name.ext' 'www.crap.com - file.name.ext' '[ www.crap.com ] - file.name.ext' '[www.crap.com ].file.anothername.ext2' '[www.crap.com ].file.name.ext'
$ ls -1 *crap*
[ www.crap.com ] - file.name.ext
[ www.crap.com ] file.name.ext
[www.crap.com ].file.anothername.ext2
[www.crap.com ].file.name.ext
www.crap.com - file.name.ext
$ ./sanitizeNames.sh *crap*
$ ls -1 *crap*
www.crap.com-file.anothername.ext2
www.crap.com-file.name-1.ext
www.crap.com-file.name-2.ext
www.crap.com-file.name-3.ext
www.crap.com-file.name.ext

Count occurrences of a character in files

I want to count all $ characters in each file in a directory with several subdirectories.
My goal is to count all variables in a PHP project. The files have the suffix .php.
I tried
grep -r '$' . | wc -c
grep -r '$' . | wc -l
and a lot of other stuff, but all returned a number that could not be right. In my example file there are only four $.
So I hope someone can help me.
EDIT
My example file
<?php
class MyClass extends Controller {
$a;$a;
$a;$a;
$a;
$a;
To recursively count the number of $ characters in a set of files in a directory you could do:
fgrep -Rho '$' some_dir | wc -l
To include only files of extension .php in the recursion you could instead use:
fgrep -Rho --include='*.php' '$' some_dir | wc -l
The -R is for recursively traversing the files in some_dir, and the -o is for printing each matching part of a line on its own line. The set of files is restricted to the pattern *.php, and file names are suppressed in the output with -h, which could otherwise have caused false positives.
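To convince yourself that the counting is literal, a quick check on a throwaway file (a sketch):
printf '$a;$a;\n$a;\n' > t.php
fgrep -o '$' t.php | wc -l    # prints 3: fgrep treats $ as a fixed string, not as an end-of-line anchor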
For counting variables in a PHP project you can use the variable regex defined here.
So, the following will grep all variables in each file:
cd ~/my/php/project
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
-P - use perlish regex
-r - recursive
-o - each match on separate line
will produce something like:
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeMySQL.class.php:$driverId
./elFinderVolumeMySQL.class.php:$db
./elFinderVolumeMySQL.class.php:$tbf
You want to count them, so you can use:
$ grep -Proc '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
and you will get the count of variables in each file, like:
./connector.minimal.php:9
./connector.php:9
./elFinder.class.php:437
./elFinderConnector.class.php:46
./elFinderVolumeDriver.class.php:1343
./elFinderVolumeFTP.class.php:577
./elFinderVolumeFTPIIS.class.php:63
./elFinderVolumeLocalFileSystem.class.php:279
./elFinderVolumeMySQL.class.php:335
./mime.types:0
./MySQLStorage.sql:0
When you want to count by file and by variable, you can use:
$ grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
to get a result like:
17 ./elFinderVolumeLocalFileSystem.class.php:$target
8 ./elFinderVolumeLocalFileSystem.class.php:$targetDir
3 ./elFinderVolumeLocalFileSystem.class.php:$test
97 ./elFinderVolumeLocalFileSystem.class.php:$this
1 ./elFinderVolumeLocalFileSystem.class.php:$write
6 ./elFinderVolumeMySQL.class.php:$arc
3 ./elFinderVolumeMySQL.class.php:$bg
10 ./elFinderVolumeMySQL.class.php:$content
1 ./elFinderVolumeMySQL.class.php:$crop
where you can see that the variable $write is used only once, so (maybe) it is useless.
You can also count per variable across the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
and will get something like:
13 $tree
1 $treeDeep
3 $trg
3 $trgfp
10 $ts
6 $tstat
35 $type
where you can see that $treeDeep is used only once in the whole project, so it is surely useless.
You can achieve many other combinations with different grep, sort and uniq commands.
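For example, to list the most-used variables across the whole project, append a numeric sort (a sketch building on the command above):
grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c | sort -rn | head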

grouping lines from a txt file using filters in Linux to create multiple txt files

I have a txt file where each line starts with a participant number, followed by the date and other variables (numbers only), in this format:
S001_2 20090926 14756 93
S002_2 20090803 15876 13
I want to write a script that creates smaller txt files containing only 20 participants per file (so the first one will contain lines from S001_2 to S020_2, the second from S021_2 to S040_2; the total number of subjects is approximately 200). However, the subjects are not in order, therefore I can't set a range with sed.
What would be the best command to filter participants into chunks depending on what number (S001_2) the line starts with?
Thanks in advance.
Use the split command to split a file (or a filtered result) without ranges and sed. According to the documentation, this should work:
cat file.txt | split -l 20 - PREFIX
This will produce the files PREFIXaa, PREFIXab, ... (Note that it does not add the .txt extension to the file name!)
If you want to filter the files first, in the way @Sergey described:
cat file.txt | sort | split -l 20 - PREFIX
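With GNU split you can also keep the .txt extension mentioned above (assuming GNU coreutils):
sort file.txt | split -l 20 --additional-suffix=.txt - PREFIX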
Sort without any parameters should be suitable, because there are leading zeros in your numbers like S001_2. So, first sort the file:
sort file.txt > sorted.txt
Then you will be able to set ranges with sed on sorted.txt.
Here is a complete script for splitting the sorted file into 20-line files:
num=1;
i=1;
lines=`wc -l sorted.txt | cut -d' ' -f 1`;#get number of lines
while [ $i -le $lines ];do
sed -n $i,`echo $i+19 | bc`p sorted.txt > file$num;
num=`echo $num+1 | bc`;
i=`echo $i+20 | bc`;
done;
$ split -d -l 20 file.txt -a3 db_
produces: db_000, db_001, db_002, ..., db_N
