Split files according to a field and save in subdirectory created using the root name


I am having trouble with several bits of code. I am no expert in Bash programming, so I have spent all day trying, unsuccessfully, to find something that works for my task, and was hoping you could point me in the right direction.
I have many large files that I would like to split according to the third field of each. I would like to keep the header in each of the sub-files and save the sub-files in new directories named after the root names of the files.
The initial files stored in the original directory are:
Downloads/directory1/Levels_CHG_Lab_S_sample1.txt
Downloads/directory1/Levels_CHG_Lab_S_sample2.txt
Downloads/directory1/Levels_CHG_Lab_S_sample3.txt
and so on.
Each of these files has 200 columns, and column 3 contains values from 1 through 10.
I would like to split each of the files above based on the value of this column and store the sub-files in sub-folders. For example, the sub-folder "Downloads/directory1/sample1" should contain 10 files (each with the header line) derived by splitting the file Downloads/directory1/Levels_CHG_Lab_S_sample1.txt.
I have tried many different approaches for these steps, with no success. I must be making this more complicated than it is, since the code I have tried looks awful.
Here is the code I am trying to work from:
FILES=Downloads/directory1/
for f in $FILES
do
# Create folder with root name by stripping file names
fname=${echo $f | sed 's/.txt//;s/Levels_CHG_Lab_S_//'}
echo "Creating sub-directory [$fname]"
mkdir "$fname"
# Save the header
awk 'NR==1{print $0}' $f > header
# Split each file by third column
echo "Splitting file $f"
awk 'NR>1 {print $0 > $3".txt" }' $f
# Move newly created files in sub directory
mv {1..10}.txt $fname # I have no idea how to do specify the files just created
# Loop through the sub-files to attach header row:
for subfile in $fname
do
cat header $subfile >> tmp_file
mv -f tmp_file $subfile
done
done
All these steps seem very complicated to me. I would very much appreciate it if you could help me solve this the right way. Thank you very much for your help.
-fra

You have a few problems with your code right now. First of all, at no point do you list the contents of your downloads directory. You are simply setting the FILES variable to a string that is the path to that directory. You would need something like:
FILES=$(ls Downloads/directory1/*.txt)
You also never cd to the Downloads/directory1 folder, so your mkdir would create directories in cwd; probably not what you want.
If you know that the numbers in column 3 always range from 1 to 10, I would just pre-populate those files with the header line before you split the file.
Try this code to do what you want (untested):
BASEDIR=Downloads/directory1
FILES=$(ls "${BASEDIR}"/*.txt)
for f in $FILES; do
# Create folder with root name by stripping path, extension and prefix
name=$(basename "$f" .txt)
dirname="${BASEDIR}/${name#Levels_CHG_Lab_S_}"
echo "Creating sub-directory [$dirname]"
mkdir -p "$dirname"
# Save the header to each file (quoted so tabs and spacing are preserved)
HEADER_LINE=$(head -n1 "$f")
for i in {1..10}; do
printf '%s\n' "$HEADER_LINE" > "${dirname}/${i}.txt"
done
# Split each file by third column, appending below the header
echo "Splitting file $f"
awk -v dirname="$dirname" 'NR>1 {filename=dirname"/"$3".txt"; print $0 >> filename }' "$f"
done
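A variant of the same idea, as a minimal sketch rather than a tested solution: awk can write the header itself the first time it meets each value in column 3, so the values do not have to be known in advance. The function name split_by_field is hypothetical; the prefix Levels_CHG_Lab_S_ comes from the question, and whitespace-separated columns with a one-line header are assumed.

```shell
# split_by_field FILE: split FILE by its third column into
# <directory of FILE>/<root name>/<value>.txt, each starting with the header.
# Hypothetical helper; assumes whitespace-separated columns and a one-line header.
split_by_field() {
    local f=$1 name outdir
    name=$(basename "$f" .txt)
    name=${name#Levels_CHG_Lab_S_}       # sample1, sample2, ...
    outdir="$(dirname "$f")/$name"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        NR == 1 { header = $0; next }    # remember the header line
        {
            out = dir "/" $3 ".txt"
            if (!(out in seen)) {        # first row for this value:
                print header > out       # start the file with the header
                seen[out] = 1
            }
            print > out                  # awk truncates only when the file is first opened
        }' "$f"
}
```

It could be called as: for f in Downloads/directory1/*.txt; do split_by_field "$f"; done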

Related

How do i extract the date from multiple files with dates in it?

Let's say I have multiple file names, e.g. R014-20171109-1159.log.20171109_1159.
I want to create a shell script that creates a folder for every date and moves the files matching that date into it.
Is this possible?
For the example, a folder "20171109" should be created and should contain the file "R014-20171109-1159.log.20171109_1159".
Thanks

This is a typical application of a for loop in bash to iterate through files.
The solution also uses shell parameter expansion to cut the date out of the file name.
for file in /path/to/files/*.log.*
do
name=${file##*/} # drop the directory part so a '-' in the path cannot interfere
foldername=${name#*-}
foldername=${foldername%%-*}
mkdir -p "${foldername}" && mv "${file}" "${foldername}" # -p: no error if the folder already exists; move only if mkdir succeeded
done

Since you want to write a shell script, use shell commands. To get the date, use cut, for example:
cat 1.txt
R014-20171109-1159.log.20171109_1159
cat 1.txt | cut -d "-" -f2
Output:
20171109
This is your date; create the folder from it. This way you can loop and create as many folders as you need.

It's actually quite easy (my Bash syntax might be a bit off). In outline:
for f in /path/to/your/files*; do
## Check if the glob gets expanded to existing files.
## If not, f here will be exactly the pattern above
## and the exists test will evaluate to false.
[ -e "$f" ] && echo $f > #grep the file name for "*.log."
#and extract 8 characters after "*.log." .
#Next check if a folder already exists with the name of those 8 characters.
#If not { create}
#else just move the file to that folder path
break
done
The main idea is from this post. Sorry for not providing the actual code, as I haven't worked with Bash recently.
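The outline above could be filled in roughly as follows. This is only a sketch under the assumption that every file is named like the example in the question, with the date as the eight digits right after ".log."; the function name move_by_date is hypothetical.

```shell
# move_by_date DIR: move every "*.log.YYYYMMDD_HHMM" file in DIR into DIR/YYYYMMDD.
# Hypothetical helper; the naming pattern is taken from the example file name.
move_by_date() {
    local dir=$1 f name stamp day
    for f in "$dir"/*.log.*; do
        [ -f "$f" ] || continue          # also skips the unexpanded pattern when nothing matches
        name=${f##*/}                    # strip the directory part
        stamp=${name##*.log.}            # e.g. 20171109_1159
        day=${stamp%%_*}                 # e.g. 20171109
        case $day in
            [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;   # exactly eight digits
            *) continue ;;               # anything else is left where it is
        esac
        mkdir -p "$dir/$day" && mv -- "$f" "$dir/$day/"
    done
}
```

Files whose suffix after ".log." does not start with eight digits are skipped rather than moved.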

The commands below can be put in a script to achieve this.
Assign the date to a variable as shown below (use the --date='n day ago' option if you need an older date).
If you need to get it from the file name itself, loop over the files and use cut to extract the date string:
dirVar=$(date +%Y%m%d) # current day
dirVar=$(date +%Y%m%d --date='1 day ago') # yesterday
dirVar=$(echo "$fileName" | cut -c6-13) # from $fileName, by character position
dirVar=$(echo "$fileName" | cut -d- -f2) # from $fileName, by field
Create the directory named by the variable (-p: no error if the directory already exists):
mkdir -p "${dirVar}"
Move the matching files into that directory:
mv *log.${dirVar}* "${dirVar}/"

How to add a column at the end of multiple csv files using shell script

I have a couple of thousand CSV files. All of them have the same structure and header. I would like to add a column at the end of each file. I found several solutions that add a column and a value for that column, but nothing that also adds the header for the new column. For example, I have files like 1001.csv, 1002.csv, 1003.csv and so on.
Contents of 1001.csv
ID,URL
1,one.com
2,two.com
I want to modify it like this
ID,URL,FILE
1,one.com,1001
2,two.com,1001
Since I have tons of files like this, I don't want to mess up the data while adding a column. Also, I don't want to produce extra files if it's possible to do in place update.

I tested this on a huge number of files and it worked really fast. The script removes the header first, then appends the new value to every row, and finally puts the header back with the new column name.
#!/bin/bash
# How to run: $ ./this-script.sh inputdir/
# inputdir contains all the csv files
# input argument is the directory name
DIRNAME=$1
# go to target directory
cd "$DIRNAME" || exit 1
# loop over all csv files
for FILENAME in *.csv
do
echo "$FILENAME"
# filename without extension
CODE="${FILENAME%.*}"
echo "$CODE"
## remove header
tail -n +2 "$FILENAME" > "$FILENAME.tmp" && mv "$FILENAME.tmp" "$FILENAME"
## add new field at the end
sed "s/$/,$CODE/" "$FILENAME" > "$FILENAME.tmp2"
## add header with new column name (GNU sed; -i.bak leaves a temporary backup)
sed -i.bak '1i ID,URL,FILE' "$FILENAME.tmp2"
# if all good then remove temp files
rm "$FILENAME"
rm "$FILENAME.tmp2.bak"
# rename output file to original name
mv "$FILENAME.tmp2" "$FILENAME"
done
# go back to the previous directory
cd - > /dev/null
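The same result can probably be had in one pass per file with awk, which avoids stripping and re-adding the header. A minimal sketch, assuming no file already has the extra column and that the files use plain newline line endings; the function name add_file_column is hypothetical.

```shell
# add_file_column DIR: append a FILE column holding each CSV's base name.
# Hypothetical helper; assumes the column has not been added already.
add_file_column() {
    local dir=$1 f code tmp
    for f in "$dir"/*.csv; do
        [ -f "$f" ] || continue
        code=$(basename "$f" .csv)
        tmp=$(mktemp "$f.XXXXXX") || return 1
        # line 1 is the header; every other line gets the file's base name
        if awk -v code="$code" 'NR == 1 { print $0 ",FILE"; next } { print $0 "," code }' "$f" > "$tmp"; then
            mv -- "$tmp" "$f"            # replace the original only after awk succeeded
        else
            rm -f -- "$tmp"
            return 1
        fi
    done
}
```

A temporary file per CSV is still written, but it replaces the original only once it is complete, so an interrupted run should not leave a half-written CSV behind.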

Copy text from multiple files, same names to different path in bash (linux)

I need help copying content from various files to others (same name and format, different path).
For example, $HOME/initial/baby.desktop has text which I need to write into $HOME/scripts/baby.desktop. This is very simple for a single file, but I have 2500 files in $HOME/initial/ and the same number in $HOME/scripts/ with corresponding names (same names and format). I want to append the content of each file in path A to the end of the file with the same name in path B, without erasing the existing content of the file in path B.
That is, the content of $HOME/initial/*.desktop should end up in $HOME/scripts/*.desktop. I tried the following, but it doesn't work:
cd $HOME/initial/
for i in $( ls *.desktop ); do egrep "Icon" $i >> $HOME/scripts/$i; done

Firstly, I would back up $HOME/initial and $HOME/scripts, because there is a lot of scope for people misunderstanding your question. Like this:
cd $HOME
tar -cvf initial.tar initial
tar -cvf scripts.tar scripts
That will put all the files in $HOME/initial into a single tarfile called initial.tar and all the files in $HOME/scripts into a single tarfile called scripts.tar.
Now for your question... in general, if you want to put the contents of FileB onto the end of FileA, the command is
cat FileB >> FileA
Note the DOUBLE ">>" which means "append" rather than single ">" which means overwrite.
So, I think you want to do this:
cd $HOME/initial
cat baby.desktop >> $HOME/scripts/baby.desktop
Here baby.desktop is just one file to test with. I would check that this has worked and then, if you are happy with it, go ahead and run the same command inside a loop:
cd $HOME/initial
for SOURCE in *.desktop
do
DESTINATION="$HOME/scripts/$SOURCE"
echo Appending "$SOURCE" to "$DESTINATION"
#cat "$SOURCE" >> "$DESTINATION"
done
When the output looks correct, remove the "#" at the start of the penultimate line and run it again.

I solved it. In case others want to know how, it is very simple.
Using sed:
I needed only the matching line "Icon=/usr/share/some_picture.png" from $HOME/initial/example.desktop appended to the file with the same name and format, $HOME/scripts/example.desktop, but I had a lot of .desktop files (2500 files).
cd $HOME/initial
STRING_LINE=`grep -l -R "Icon=" *.desktop`
for i in $STRING_LINE; do sed -ne '/Icon=/ p' $i >> $HOME/scripts/$i ; done
_________
If you need to copy the whole content to the other file with the same name and format:
Using cat:
cd $HOME/initial
STRING_LINE=`grep -l -R "Icon=" *.desktop`
for i in $STRING_LINE; do cat $i >> $HOME/scripts/$i ; done
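For file names that might contain spaces, a glob loop is a safer variant of the same approach. A minimal sketch: the function name append_icon_lines is hypothetical, and anchoring the pattern as ^Icon= assumes the key sits at the start of its line, as in the example above.

```shell
# append_icon_lines SRC DEST: for every SRC/*.desktop that has a same-named
# file in DEST, append its "Icon=" lines to the DEST copy.
# Hypothetical helper; assumes the Icon key starts at the beginning of a line.
append_icon_lines() {
    local src=$1 dest=$2 f name
    for f in "$src"/*.desktop; do
        [ -f "$f" ] || continue
        name=${f##*/}
        [ -f "$dest/$name" ] || continue                  # skip files without a counterpart
        grep '^Icon=' "$f" >> "$dest/$name" || true       # no match is not an error
    done
}
```

It could be called as: append_icon_lines "$HOME/initial" "$HOME/scripts"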

copy all unique files in a directory based on hashes

file=$3
#Using $3 as I am using 1 & 2 in the rest of the script[that works]
file_hash=md5sum "$file" | cut -d ' ' -f l
#generates hashes for file
for a in /path/to/source/* #loop for all files in directory
do
if [ "$file_hash" == $(md5sum "$a" | cut -d ' ' -f l) ]:
#if the file hash is equal to the hash generated then file is copied to path/to/source
then cp "file" /path/to/source/*
else cp "$file" "file.JPG" mv "file.JPG" /path/to/source/$file #otherwise the file renamed as file.JPG so it is not overwritten
fi
done
Can anyone help me with this code?
I'm trying to write a Bash script that generates hashes for all the files in a directory; if two files have the same hash, only one of the images should be copied to the destination directory. Can anyone see where I am going wrong here?
I have to use md5sum, so no other sha1s, fdupes or anything like that unfortunately.

Assuming it doesn't matter which of the unique files is copied, a simple way would be to use bash's support for associative arrays:
declare -A files
while read -r hash name
do
files[$hash]=$name
done < <(md5sum /path/to/source/*)
cp "${files[@]}" /path/to/dest
Any file with an identical hash will simply overwrite the record of the previous one, leaving you with only unique files in the array.
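If it matters that the first file with a given hash wins, or if the shell has no associative arrays, the same idea can be sketched with a temporary list of seen hashes. The function name copy_unique is hypothetical, and md5sum is used because the question requires it.

```shell
# copy_unique SRC DEST: copy one file per distinct md5 hash from SRC to DEST,
# keeping the first file (in glob order) for each hash.
# Hypothetical helper; rescans the list per file, so fine for modest directories.
copy_unique() {
    local src=$1 dest=$2 seen f hash
    mkdir -p "$dest"
    seen=$(mktemp) || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        hash=$(md5sum < "$f" | cut -d ' ' -f 1)   # reading stdin keeps the name out of the output
        if ! grep -qxF "$hash" "$seen"; then      # hash not recorded yet
            echo "$hash" >> "$seen"
            cp -- "$f" "$dest/"
        fi
    done
    rm -f -- "$seen"
}
```

Hashing each file through standard input means unusual file names never appear in the md5sum output that gets parsed.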

How to read the complete path till the end of the directory structure using loop in scripting

I have the following directory structure:
/home/ABCD/apple/ball/car/divider.txt. /home/ABCD is like a root directory for my apps and I can get it easily; below it, the sub-folders vary from case to case, so I am looking for a generic program that can extract the path using some loops.
I want to extract the directory structure into a separate variable as "/home/ABCD/apple/ball/car/".
Can anyone help me?
2nd example: /home/ABCD/adam/nest/mary/user.txt
The variable should get the following value: "/home/ABCD/adam/nest/mary/"

Use dirname
$ dirname /home/ABCD/apple/ball/car/divider.txt
/home/ABCD/apple/ball/car
To assign the result to a variable:
var=$(dirname /home/ABCD/apple/ball/car/divider.txt)
echo "$var"
No spaces before and after the =

If the trailing slash / is required, you could pick one of these:
kent$ echo "/home/ABCD/adam/nest/mary/user.txt"|grep -Po '.*/'
/home/ABCD/adam/nest/mary/
or
kent$ echo "/home/ABCD/adam/nest/mary/user.txt"|sed -r 's#(.*/).*#\1#'
/home/ABCD/adam/nest/mary/
or
kent$ echo $(dirname /home/ABCD/adam/nest/mary/user.txt)"/"
/home/ABCD/adam/nest/mary/
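A further option that needs no external command is shell parameter expansion. A minimal sketch, assuming the path contains at least one slash (the variable names path and dir are only illustrative):

```shell
path=/home/ABCD/adam/nest/mary/user.txt
dir=${path%/*}/        # remove the shortest trailing "/..." match, then re-add the slash
echo "$dir"            # prints /home/ABCD/adam/nest/mary/
```

For a bare file name without any slash the expansion removes nothing, so dirname is the safer choice when that can happen.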
