I have a folder structure like this ...
data/
---B1/
   name_x_1.gz
   name_y_1.gz
   name_z_2.gz
   name_p_2.gz
---C1/
   name_s_1.gz
   name_t_1.gz
   name_u_2.gz
   name_v_2.gz
I need to go in to each subdirectory (e.g. B1) and perform the following:
cat *_1.gz > B1_1.gz
cat *_2.gz > B1_2.gz
I'm having problems with the file naming part. I can get into the directories using the following:
for d in */; do
cat *_1.gz > $d_1.gz
cat *_2.gz > $d_2.gz
done
However, I get an error that $d is a directory -- how do I strip the name to create the concatenated filename?
Thanks
Taking your question verbatim: if you have a variable d where you know that it ends in / (as is the case in your example), you can get the value with this last character stripped by writing ${d:0:-1} (i.e. the substring starting at the beginning, up to, but excluding, the last character).
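For instance (the ${d%/} form is a POSIX alternative that strips one trailing slash the same way):
d=B1/
echo "${d:0:-1}"   # prints B1; negative length offsets need bash 4.2+
echo "${d%/}"      # prints B1 as well, and works in any POSIX shell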
Of course in your case, I would rather write the loop as
for d in *; do
which already creates the names without a trailing slash. But this is still probably not what you want, because d would take on the names of the entries in the directory you have cd'ed to, whereas you want the name of the directory itself. You can obtain this, for instance, with $(basename "$PWD"), which turns your loop into:
cd B1
prefix=$(basename "$PWD") # This sets prefix to B1
for f in *
do
# Since your original code indicates that you want to create a *copy* of the file
# with a new name, I do the same here.
cp -v "$f" "${prefix}_$f"
done
You can also use cat, as in your original solution, if you prefer.
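For instance, a minimal cat-based sketch of the same idea, producing the names from your question:
cd B1
prefix=$(basename "$PWD")
cat *_1.gz > "${prefix}_1.gz"   # note: on a second run, B1_1.gz itself matches *_1.gz
cat *_2.gz > "${prefix}_2.gz"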
If you're calling bash, you can use parameter expansion and do everything natively in the shell, without spawning a sub-shell for another process (as $(basename ...) would). The ${dir##*/} expansion is POSIX-compliant, too:
#!/bin/bash
for dir in data/*; do
cat "$dir/"*_1.gz > "$dir/${dir##*/}_1.gz"
cat "$dir/"*_2.gz > "$dir/${dir##*/}_2.gz"
done
Sure, just descend into the directory.
# assuming PWD = data/
for d in */; do
(   # subshell: the cd below is local to these parentheses
cd "$d"
cat *_1.gz > "$(basename "$d")"_1.gz
cat *_2.gz > "$(basename "$d")"_2.gz
)
done
how do I strip the name to create the concatenated filename?
The simplest and most portable way is with basename.
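For example:
d=data/B1/
basename "$d"   # prints B1; basename ignores a trailing slash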
The script below requires Ed, which should hopefully be present on your machine. If not, I trust your distribution will have a package for it.
#!/bin/sh
cat >> edprint+.txt << EOF
1p
q
EOF
cat >> edpop+.txt << EOF
1d
wq
EOF
b1="${PWD}/data/B1"
c1="${PWD}/data/C1"
find "${b1}" -maxdepth 1 -type f > b1stack
find "${c1}" -maxdepth 1 -type f > c1stack
while [ "$(wc -l < b1stack)" -gt 0 ]
do
b1line=$(ed -s b1stack < edprint+.txt)
b1name=$(basename "${b1line}")
b1suffix=$(echo "${b1name}" | cut -d'_' -f3)
b1fixed="B1_${b1suffix}"
mv -v "${b1}/${b1line}" "${b1}/${b1fixed}"
ed -s b1stack < edpop+.txt
done
while [ "$(wc -l < c1stack)" -gt 0 ]
do
c1line=$(ed -s c1stack < edprint+.txt)
c1name=$(basename "${c1line}")
c1suffix=$(echo "${c1name}" | cut -d'_' -f3)
c1fixed="C1_${c1suffix}"
mv -v "${c1}/${c1line}" "${c1}/${c1fixed}"
ed -s c1stack < edpop+.txt
done
rm -v ./edprint+.txt
rm -v ./edpop+.txt
rm -v ./b1stack
rm -v ./c1stack
I am trying to segregate filenames matching a particular pattern into a separate file, and their contents into different files matching particular patterns. The filenames include special characters like '|'.
I tried using grep: grep -Ril and grep -H to print the filenames, but it is not working.
#!bin/bash
cd home/test
let "x = 1"
for file in $(find home/test/* -type f -name "*.txt") ;
do
var=$(echo "${x}|fill|${file##*/}")
echo "${var}" | grep -n "*|fill|*.txt" >header.txt
myvar=$(sed 's/^/'${x}'|/g' ${file})
echo "${myvar}" |grep -n "*|Ball|*" >Ball.txt
echo "${myvar}" |grep -n "*|Fire|*" >Fire.txt
let x=x+1
done
unset 'x'
I have the filenames in this format:
1|fill|abc.txt
2|fill|def.txt
The 'fill' remains the same in all files. The final file for this should have values like this
1|fill|abc.txt
2|fill|def.txt
3...
4...
5...
etc...
Then, each file contains different contents.
File1 contains data similar to this pattern:
1|Ball|202029|
1|Cat|202029|
1|fire|202898
...
File 2 contains data similar to this pattern:
2|Bat|202029|
2|Ball|202029|
2|cat|202898
Now the final output should be in such a way that all the data containing 'ball' should be in a separate file, 'cat' in separate file, 'fire' in separate file and so on.
I'm not sure the code below will do exactly what you want, but I believe it is close; let me know and I will update it accordingly.
Note that the files below will be created in the same directory as the files the script reads, and since they also end in .txt, the next script run will read them as well.
header.txt
B.txt
C.txt
F.txt
#!/bin/bash
# Put the directory in a variable, so it can be changed in a single place.
dir='/home/test'
# If cd fails, print an error on standard error and terminate the script.
if ! cd "${dir}" ;then
echo "cd failed into ${dir}" >&2
exit 1
fi
# set counter to 1
let "x = 1"
# Truncate the output files (or create them if they don't exist yet);
# without this, content from earlier script runs would be preserved.
> header.txt
> B.txt
> C.txt
> F.txt
# Go through every regular file under ${dir} whose name ends with .txt
for file in $(find "${dir}" -type f -name "*.txt")
do
# Store the base filename in a variable, prefixed with the counter number and |fill|.
filename=$(echo "${x}|fill|${file##*/}")
echo "${filename}" >> header.txt
# This would work just as well:
##echo "${x}|fill|${file##*/}" >> header.txt
# The only difference is that above, the output is stored in a variable first.
# find matching line in files
grep -i '|Ball|' ${file} | sed 's/^/'${x}'|/g' >> B.txt
grep -i '|Cat|' ${file} | sed 's/^/'${x}'|/g' >> C.txt
grep -i '|Fire|' ${file} | sed 's/^/'${x}'|/g' >> F.txt
# add 1 to counter
let "x=x+1"
done
# unset counter
unset 'x'
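One caveat: for file in $(find ...) word-splits on whitespace and expands globs, so it can break on unusual filenames (the '|' characters from the question survive it, but spaces would not). A sketch of a safer variant using process substitution, assuming find supports -print0 and keeping the counter x from the script above:
while IFS= read -r -d '' file; do
    echo "${x}|fill|${file##*/}" >> header.txt
    grep -i '|Ball|' "${file}" | sed "s/^/${x}|/" >> B.txt
    grep -i '|Cat|' "${file}" | sed "s/^/${x}|/" >> C.txt
    grep -i '|Fire|' "${file}" | sed "s/^/${x}|/" >> F.txt
    let "x=x+1"
done < <(find "${dir}" -type f -name "*.txt" -print0)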
Input files:
File1.txt
1|Ball|202029|
1|Cat|202029|
1|fire|202898
File2.txt
2|Bat|202029|
2|Ball|202029|
2|cat|202898
Output files:
header.txt
1|fill|header.txt
2|fill|B.txt
3|fill|C.txt
4|fill|F.txt
5|fill|File1.txt
6|fill|File2.txt
B.txt
5|1|Ball|202029|
6|2|Ball|202029|
C.txt
5|1|Cat|202029|
6|2|cat|202898
F.txt
5|1|fire|202898
I'm working on a task for uni work where the aim is to count all files and directories within a given directory and then all subdirectories as well. We are forbidden from using find, locate, du or any recursive commands (e.g. ls -R).
To solve this I've tried making my own recursive command and have run into the error above; more specifically, it is line 37: testdir/.hidd1/: syntax error: operand expected (error token is ".hidd1/")
The Hierarchy I'm using
The code for this is as follows:
tgtdir=$1
visfiles=0
hidfiles=0
visdir=0
hiddir=0
function searchDirectory {
curdir=$1
echo "curdir = $curdir"
# Rather than change directory ensure that each recursive call uses the $curdir/NameOfWantedDirectory
noDir=$(ls -l -A $curdir| grep ^d | wc -l) # Work out the number of directories in the current directory
echo "noDir = $noDir"
shopt -s nullglob # Enable nullglob to prevent a null term being added to the array
directories=(*/ .*/) # Store all directories and hidden directories into the array 'directories'
shopt -u nullglob #Turn off nullglob to ensure it doesn't later interfere
echo "${directories[#]}" # Print out the array directories
y=0 # Declares a variable to act as a index value
for i in $( ls -d ${curdir}*/ ${curdir}.*/ ); do # loops through all directories both visible and hidden
if [[ "${i:(-3)}" = "../" ]]; then
echo "Found ./"
continue;
elif [[ "${i:(-2)}" = "./" ]]; then
echo "Found ../"
continue;
else # When position i is ./ or ../ the loop advances otherwise the value is added to directories and y is incremented before the loop advances
echo "Adding $i to directories"
directories[y]="$i"
let "y++"
fi
done # Adds all directories except ./ and ../ to the array directories
echo "${directories[#]}"
if [[ "${noDir}" -gt "0" ]]; then
for i in ${directories[@]}; do
echo "at position i ${directories[$i]}"
searchDirectory ${directories[$i]} #### <--- line 37 - the error line
done # Loops through subdirectories to reach the bottom of the hierarchy using recursion
fi
visfiles=$(ls -l $tgtdir | grep -v ^total | grep -v ^d | wc -l)
# Calls the ls -l command which puts each file on a new line, then removes the line which states the total and any lines starting with a 'd' which would be a directory with grep -v,
#finally counts all lines using wc -l
hiddenfiles=$(expr $(ls -l -a $tgtdir | grep -v ^total | grep -v ^d | wc -l) - $visfiles)
# Finds the total number of files including hidden and puts them on a line each (using -l and -a (all)) removes the line stating the total as well as any directoriesand then counts them.
#Then stores the number of hidden files by expressing the complete number of files minus the visible files.
visdir=$(ls -l $tgtdir | grep ^d | wc -l)
# Counts visible directories by using ls -l then filtering it with grep to find all lines starting with a d indicating a directory. Then counts the lines with wc -l.
hiddir=$(expr $(ls -l -a $tgtdir | grep ^d | wc -l) - $visdir)
# Finds hidden directories by expressing total number of directories including hidden - total number of visible directories
#At minimum this will be 2 as it includes the directories . and ..
total=$(expr $visfiles + $hiddenfiles + $visdir + $hiddir) # Calculates total number of files and directories including hidden.
}
searchDirectory $tgtdir
echo "Total Files: $visfiles (+$hiddenfiles hidden)"
echo "Directories Found: $visdir (+$hiddir hidden)"
echo "Total files and directories: $total"
exit 0
Thanks for any help you can give
Line 37 is searchDirectory ${directories[$i]}, as I count. Yes?
Replace the for loop with for i in "${directories[@]}"; do - add double quotes. This will keep each element as its own word.
Replace line 37 with searchDirectory "$i". The for loop gives you each element of the array in i, not each index. Therefore, you don't need to go into directories again - i already has the word you need.
Also, I note that the echos on lines 22 and 25 are swapped :) .
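Applied to your snippet, the corrected loop would read:
for i in "${directories[@]}"; do
    echo "at position i $i"
    searchDirectory "$i"
done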
Hey, I'm stuck on how to count the different file types / extensions recursively in a folder. I also need to print the counts to a .txt file.
For example, I have 10 .txt and 20 .docx files mixed up in multiple folders.
Help me!
find ./ -type f | awk -F . '{print $NF}' | sort | awk '{count[$1]++} END{for (j in count) print j, "(" count[j] " occurrences)"}'
Gets all filenames with find, then uses awk to get the extension, then uses awk again to count the occurrences.
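Since you also need the result in a .txt file, redirect the pipeline's output (counts.txt is an arbitrary name):
find ./ -type f | awk -F . '{print $NF}' | sort | awk '{count[$1]++} END{for (j in count) print j, "(" count[j] " occurrences)"}' > counts.txt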
Just with bash: version 4 required for this code
#!/bin/bash
shopt -s globstar nullglob
declare -A exts
for f in **; do # ** matches everything recursively; listing * as well would double-count top-level files
[[ -f $f ]] || continue # only count files
filename=${f##*/} # remove directories from pathname
ext=${filename##*.}
[[ $filename == "$ext" ]] && ext="no_extension" # name contains no dot
: "${exts[$ext]=0}" # initialize array element if unset
(( exts[$ext]++ ))
done
for ext in "${!exts[#]}"; do
echo "$ext ${exts[$ext]}"
done | sort -k2nr | column -t
This one seems unsolved so far, so here is how far I got counting files and ordering them:
find . -type f | sed -n 's/..*\.//p' | sort -f | uniq -ic
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).
Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.
I am guessing some concoction of split and head will do the trick?
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.
A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.
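For instance, a minimal sketch of that variable-directory idea (the part_ prefix and the data.dat name are illustrative, not fixed):
split_filter () { mkdir -p "$FILE" && { head -n 1 file.txt; cat; } > "$FILE/data.dat"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - part_
This creates directories part_aa, part_ab, ... each containing a data.dat with the header on top.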
This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer.
See comments for some tips on installing parallel
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
You can use [mg]awk:
awk 'NR==1{
header=$0;
count=1;
print header > "x_" count;
next
}
!( (NR-1) % 100){
count++;
print header > "x_" count;
}
{
print $0 > "x_" count
}' file
100 is the number of lines of each slice.
It doesn't require temp files and can be put on a single line.
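For reference, here is the same program collapsed onto a single line:
awk 'NR==1{header=$0; count=1; print header > "x_" count; next} !((NR-1)%100){count++; print header > "x_" count} {print $0 > "x_" count}' file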
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard utilities (head, split, find, grep, xargs, and sed), so it should work on most *nix systems. It should also work on Windows if you install mingw-64 / git-bash.
csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
head -n 1 file.txt > tmp_file
cat $file >> tmp_file
mv -f tmp_file $file
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
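Putting the two suggestions together, a sketch might look like this (the hardcoded tmp_file is replaced by the mktemp name, which is why the trap references "$tmp_file"):
tmp_file=$(mktemp) || exit 1
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done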
I liked marco's awk version and adapted from it this simplified one-liner, where you can easily specify the split fraction as granular as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done
Differences:
in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines
Inspired by @Arkady's comment on a one-liner.
MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header onto each piece with cat, or with whatever tool reads the pieces in.
So something like:
head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 - part_   # pick any chunk size; pieces are part_aa, part_ab, ...
cat header.txt part_aa                     # rejoin the header onto one piece
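To rejoin the header onto every piece in place, a quick sketch (the .tmp suffix is arbitrary):
for f in part_*
do
    cat header.txt "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done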
I had better results using the following code: every split file will have a header, and the generated files will have normalized names.
export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
head -n 1 "${F}" > tmp_file
cat "$file" >> tmp_file
mv -f tmp_file "${file}"
done
Output:
$ wc -l input*
22 input.csv
3 input_aa.csv
4 input_ab.csv
4 input_ac.csv
4 input_ad.csv
4 input_ae.csv
4 input_af.csv
4 input_ag.csv
2 input_ah.csv
51 total