Linux: replace column in filename with parent directory name

I have a file structure that looks like this:
Surge/Track_000000/000_extracted.csv
where the zeroes can be any numerical value.
The files 000_extracted.csv look like:
Timestep ElementID SE
1 100 .5
2 100 1.3
3 100 .7
4 100 .2
Ideally what I would like to have is a resulting file that looks like this:
Track Timestep ElementID SE
0000000 1 100 .5
0000000 2 100 1.3
0000000 3 100 .7
0000000 4 100 .2
Where the 0000000 is the 7-digit track code from the parent directory name.
As a first step, I want to prefix the filename with the directory name (Track_000000), so it would go from 212_extracted.csv to Track_000000_212_extracted.csv.
I tried this:
for i in 'ls ./Surge/'
do
for j in 'ls ./Surge/$i/'
do
mv -v './Surge/$i/*.csv' './Surge/$i/$i-*.csv'
done
done
This is not working: $i should be Track_0000000, but instead it comes out as /Surge/.
Any help would be appreciated.
Thanks,
K

Something along these lines:
for dir in ./Surge/*/; do
    prefix=$(basename "$dir")      # e.g. Track_000000
    trk=${prefix#Track_}           # strip "Track_", leaving just the track code
    for file in "$dir"*.csv; do
        awk -v TRK="$trk" '{ print TRK, $0 }' "$file" > "${dir}${prefix}_$(basename "$file")"
    done
done

Use backticks (`) instead of single quotes (') around the ls commands:
for i in `ls ./Surge/`; do
    for j in `ls ./Surge/$i/`; do
        mv -v "./Surge/$i/$j" "./Surge/$i/${i}_$j"
    done
done
The single quotes around the mv arguments have to go as well (use double quotes so $i and $j expand), and since mv cannot apply a wildcard in the destination name, rename each file $j individually as above.
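For the end result described in the question (the track code prepended as a Track column, with an updated header), here is a minimal sketch along the same lines; the layout Surge/Track_*/NNN_extracted.csv and the Timestep ElementID SE header come from the question, while the output names and the trk/out variables are my own assumptions:

#!/usr/bin/env bash
# Sketch only: for each track directory, write Track_XXXXXX_NNN_extracted.csv
# files whose rows carry the track code as a first "Track" column.
for dir in ./Surge/Track_*/; do
    prefix=$(basename "$dir")        # e.g. Track_000000
    trk=${prefix#Track_}             # the numeric track code
    for file in "$dir"*_extracted.csv; do
        [ -e "$file" ] || continue   # skip if the glob matched nothing
        out="${dir}${prefix}_$(basename "$file")"
        awk -v trk="$trk" '
            NR==1 { print "Track", $0; next }   # rewrite the header line
                  { print trk, $0 }             # prepend the track code to data rows
        ' "$file" > "$out"
    done
done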

Related

Print a row of 16 lines evenly side by side (column)

I have a file with an unknown number of lines (but always an even number). I want to print the lines side by side based on the total number of lines in the file. For example, I have a file with 16 lines like below:
asdljsdbfajhsdbflakjsdff235
asjhbasdjbfajskdfasdbajsdx3
asjhbasdjbfajs23kdfb235ajds
asjhbasdjbfajskdfbaj456fd3v
asjhbasdjb6589fajskdfbaj235
asjhbasdjbfajs54kdfbaj2f879
asjhbasdjbfajskdfbajxdfgsdh
asjhbasdf3709ddjbfajskdfbaj
100
100
150
125
trh77rnv9vnd9dfnmdcnksosdmn
220
225
sdkjNSDfasd89asdg12asdf6asdf
So now I want to print them side by side. As there are 16 lines in total, I am trying to get the result split 8:8, like below:
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
The paste command did not work for me exactly (paste - - - - - - - - < file1), nor did the awk command I used: awk '{printf "%s" (NR%2==0?RS:FS),$1}'
Note: the number of lines in the file is dynamic. The only known thing in my scenario is that it is always an even number.
If you have the memory to hash the whole file ("max" below):
$ awk '{
    a[NR]=$0                # hash all the records
}
END {                       # after hashing
    mid=int(NR/2)           # compute the midpoint, int in case NR is uneven
    for(i=1;i<=mid;i++)     # iterate from start to midpoint
        print a[i],a[mid+i] # output
}' file
If you have the memory to hash half of the file ("mid"):
$ awk '
NR==FNR {                          # on 1st pass hash second half of records
    if(FNR>1) {                    # we don't need the 1st record ever
        a[FNR]=$0                  # hash record
        if(FNR%2)                  # if odd record
            delete a[int(FNR/2)+1] # remove one from the past
    }
    next
}
FNR==1 {                           # on the start of 2nd pass
    if(NR%2==0)                    # if record count is uneven
        exit                       # exit as there is always an even count of them
    offset=int((NR-1)/2)           # compute offset to the beginning of hash
}
FNR<=offset {                      # only process the 1st half of records
    print $0,a[offset+FNR]         # output one from file, one from hash
    next
}
{                                  # once 1st half of 2nd pass is finished
    exit                           # just exit
}' file file                       # notice the filename twice
And finally, if you have awk compiled into a worm's brain (i.e. not so much memory, "min"):
$ awk '
NR==FNR {                          # just get the NR of 1st pass
    next
}
FNR==1 {
    mid=(NR-1)/2                   # get the midpoint
    file=FILENAME                  # filename for getline
    while(++i<=mid && (getline line < file)>0);  # jump getline to mid
}
{
    if((getline line < file)>0)    # getline reads from mid+FNR
        print $0,line              # output
}' file file                       # notice the filename twice
The standard disclaimer on getline applies, and no real error control is implemented.
Performance:
I ran seq 1 100000000 > file and tested how the above solutions performed. Output went to /dev/null, but writing it to a file took around 2 s longer. The max performance is so-so, as its memory footprint was 88 % of my 16 GB, so it might have swapped. Well, I killed all the browsers and shaved 7 seconds off the real time of max.
+------+-----------+-----------+-----------+
|      | min       | mid       | max       |
+------+-----------+-----------+-----------+
| real | 1m7.027s  | 1m30.146s | 0m48.405s |
| user | 1m6.387s  | 1m27.314s | 0m43.801s |
| sys  | 0m0.641s  | 0m2.820s  | 0m4.505s  |
+------+-----------+-----------+-----------+
| mem  | 3 MB      | 6.8 GB    | 13.5 GB   |
+------+-----------+-----------+-----------+
Update:
I tested #DavidC.Rankin's and #EdMorton's solutions and they ran, respectively:
real 0m41.455s
user 0m39.086s
sys 0m2.369s
and
real 0m39.577s
user 0m37.037s
sys 0m2.541s
The memory footprint was about the same as my mid solution had. It pays to use wc, it seems.
$ pr -2t file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
If you want just one space between the columns, change it to:
$ pr -2ts' ' file
You can also do it with awk, simply by storing the first half of the lines in an array and then concatenating the second half to the end, e.g.
awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
Example Use/Output
With your data in file, then
$ awk -v nlines=$(wc -l < file) -v j=0 'FNR<=nlines/2{a[++i]=$0; next} j<i{print a[++j],$1}' file
asdljsdbfajhsdbflakjsdff235 100
asjhbasdjbfajskdfasdbajsdx3 100
asjhbasdjbfajs23kdfb235ajds 150
asjhbasdjbfajskdfbaj456fd3v 125
asjhbasdjb6589fajskdfbaj235 trh77rnv9vnd9dfnmdcnksosdmn
asjhbasdjbfajs54kdfbaj2f879 220
asjhbasdjbfajskdfbajxdfgsdh 225
asjhbasdf3709ddjbfajskdfbaj sdkjNSDfasd89asdg12asdf6asdf
Extract the first half of the file and the last half of the file and merge the lines:
paste <(head -n $(($(wc -l <file.txt)/2)) file.txt) <(tail -n $(($(wc -l <file.txt)/2)) file.txt)
You can use columns utility from autogen:
columns -c2 --by-columns file.txt
You can use column, but its number of output columns is calculated in a strange way from the width of your terminal. So, assuming your lines have 28 characters, you can also do:
column -c $((28*2+8)) file.txt
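If you would rather not hard-code the 28, here is a small sketch (my own addition, assuming the util-linux column and the same file.txt) that derives the width from the longest line using the 2 columns + padding formula above:

# width of the longest line, then room for two such columns plus some padding
w=$(awk '{ if (length($0) > w) w = length($0) } END { print w }' file.txt)
column -c $(( w * 2 + 8 )) file.txt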
I do not want to solve this, but if I were you:
wc -l file.txt
gives the number of lines,
echo $(($(wc -l < file.txt)/2))
gives half of that, and
head -n $(($(wc -l < file.txt)/2)) file.txt > first.txt
tail -n $(($(wc -l < file.txt)/2)) file.txt > last.txt
create files with the first half and the last half of the original file. Now you can merge those files together side by side, as described here.
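The merge step itself could be, for example:

paste -d' ' first.txt last.txt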
Here is my take on it using the bash shell, wc(1) and ed(1).
#!/usr/bin/env bash
array=()
file=$1
total=$(wc -l < "$file")
half=$(( total / 2 ))
plus1=$(( half + 1 ))
for ((m=1; m<=half; m++)); do
    array+=("${plus1}m$m" "${m}"'s/$/ /' "${m}"',+1j')
done
After all of that, if you just want to print the output to stdout, add the line below to the script.
printf '%s\n' "${array[@]}" ,p Q | ed -s "$file"
If you want to write the changes directly to the file itself, use this line instead at the end of the script.
printf '%s\n' "${array[@]}" w | ed -s "$file"
Here is an example.
printf '%s\n' {1..10} > file.txt
Now running the script against that file.
./myscript file.txt
Output
1 6
2 7
3 8
4 9
5 10
Or, using the bash 4+ feature mapfile (aka readarray):
Save the file in an array named array.
mapfile -t array < file.txt
Split the array into two halves.
left=("${array[@]::((${#array[@]} / 2))}") right=("${array[@]:((${#array[@]} / 2))}")
Loop and print side by side.
for i in "${!left[@]}"; do
    printf '%s %s\n' "${left[i]}" "${right[i]}"
done
Given what you said, "The only known thing in my scenario is, they are even number all the time", that solution should work.

How to grep file names which contain a range of values (-6 to -7) in column 2 of their sub files?

I have 5000 directories (ligand_0001 to ligand_5000). Each contains a sub file named log.txt which holds scores in column 2. I want to extract all those directory names (ligand_*) which have a log file containing -6 to -7 scores in the second column.
1 -6.1 0.000 0.000
2 -6.1 2.657 3.713
3 -5.9 26.479 28.383
4 -5.9 27.924 30.549
5 -5.8 4.579 8.657
6 -5.8 26.841 28.725
7 -5.8 25.192 27.089
8 -5.6 3.119 4.640
This is the sub file (log.txt) in the ligand_0005 folder. I want only the name of the folder, because it contains a -6 to -7 value in column 2 (i.e. ligand_0005).
Here is a small awk script that scans all the files together in one sweep.
script.awk
BEGINFILE {                                          # on every file
    pathPartsLen = split(FILENAME, pathParts, "/");  # split the path into its parts, into array pathParts
    currentDir = pathParts[pathPartsLen - 1];        # find the current parent dir
}
$2 ~ "^-[67]" {                                      # match 2nd field starting with -6 or -7
    print currentDir;
    nextfile;                                        # skip the rest of the file, go to the next file
}
running (note that BEGINFILE is a GNU awk feature):
awk -f script.awk $(find ligand_* -name log.txt)
explaining:
find ligand_* -name log.txt : list all log.txt files in directories ligand_*
Use awk to figure out whether numbers in that range exist in the second column; iterate over the folders and check each folder's log.txt:
ARRAY=()
for i in ligand_*
do
    if [[ ! -z $(awk '$2>=-7 && $2<=-6' ${i}/log.txt) ]]
    then
        ARRAY+=("${i}")
    fi
done
printf '%s\n' "${ARRAY[@]}"
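A single-pass variant combining both ideas is also possible. This is only a sketch, assuming the ligand_*/log.txt layout from the question and an awk that supports nextfile (GNU awk does), and it tests the range numerically rather than by regex:

awk '$2 >= -7 && $2 <= -6 {           # score within the [-7, -6] range
         dir = FILENAME
         sub(/\/log\.txt$/, "", dir)  # keep only the directory part
         print dir
         nextfile                     # one hit per directory is enough
     }' ligand_*/log.txt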

NTILE a column in csv - Linux

I have a csv file that reads like this:
a,b,c,2
d,e,f,3
g,h,i,3
j,k,l,4
m,n,o,5
p,q,r,6
s,t,u,7
v,w,x,8
y,z,zz,9
I want to assign quintiles to this data (like we do in SQL), preferably using a bash command in Linux. The quintiles, if assigned as a new column, would make the final output look like this:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,z,9, 5
The only thing I am able to achieve is adding a new incremental column to the csv file:
`awk '{$3=","a[$3]++}1' f1.csv > f2.csv`
But I am not sure how to do the quintiles. Please help. Thanks.
awk '{a[NR]=$0}
END{
    for(i=1;i<=NR;i++) {
        p=100/NR*i
        q=1
        if(p>20){q=2}
        if(p>40){q=3}
        if(p>60){q=4}
        if(p>80){q=5}
        print a[i] ", " q
    }
}' file
Output:
a,b,c,2, 1
d,e,f,3, 2
g,h,i,3, 2
j,k,l,4, 3
m,n,o,5, 3
p,q,r,6, 4
s,t,u,7, 4
v,w,x,8, 5
y,z,zz,9, 5
Short wc + awk approach:
awk -v n=$(cat file | wc -l) \
'BEGIN{ OFS=","; n=sprintf("%.f\n", n*0.2); c=1 }
{ $(NF+1)=" "c }!(NR % n){ ++c }1' file
n=$(cat file | wc -l) - get the total number of lines of the input file
n*0.2 - one fifth (20 percent) of the line count, used as the bucket size
$(NF+1)=" "c - set the new last field to the current rank value c
The output:
a,b,c,2, 1
d,e,f,3, 1
g,h,i,3, 2
j,k,l,4, 2
m,n,o,5, 3
p,q,r,6, 3
s,t,u,7, 4
v,w,x,8, 4
y,z,zz,9, 5
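If you prefer the tile count as a parameter rather than a hard-coded 20 percent, here is a sketch of a more general NTILE-style assignment (two passes over the file, so the total line count is known up front; the ntiles variable name is my own):

awk -v ntiles=5 '
    NR==FNR { n++; next }                                      # 1st pass: count the lines
            { print $0 ", " (int((FNR-1) * ntiles / n) + 1) }  # 2nd pass: assign the tile
' file file

For the nine sample rows this yields the 1,1,2,2,3,3,4,4,5 assignment shown in the question.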

Process large amount of data using bash

I've got to process a large number of txt files in a folder using bash scripting.
Each file contains millions of rows, and they are formatted like this:
File #1:
en ample_1 200
it example_3 24
ar example_5 500
fr.b example_4 570
fr.c example_2 39
en.n bample_6 10
File #2:
de example_3 4
uk.n example_5 50
de.n example_4 70
uk example_2 9
en ample_1 79
en.n bample_6 1
...
I've got to filter by "en" or "en.n", find duplicate occurrences in the second column, sum the third column, and get a sorted file like this:
en ample_1 279
en.n bample_6 11
Here is my script:
#! /bin/bash
clear
BASEPATH=<base_path>
FILES=<folder_with_files>
TEMP_UNZIPPED="tmp"
FINAL_RES="pg-1"
#iterate over each file in the folder and apply grep
INDEX=0
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" > log
for i in ${BASEPATH}${FILES}
do
    FILENAME="${i%.*}"
    if [ $INDEX = 0 ]; then
        VAR=$(gunzip $i)
        #-e -> multiple conditions; -w exact word; -r grep recursively; -h remove file path
        FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $FINAL_RES)
        INDEX=1
        #remove the file to free space
        rm $FILENAME
    else
        VAR=$(gunzip $i)
        FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $TEMP_UNZIPPED)
        cat $TEMP_UNZIPPED >> $FINAL_RES
        #AWK BLOCK
        #create array indexed by page title, adding the frequency value.
        #e.g. a['ciao']=2 -> the second time I find "ciao", I sum the previous value 2 with the new one; this is why I use the "+=" operator
        #for each element in the array, print i=page_title and the array content, i.e. the frequency
        PARSING=$(awk '{ page_title=$1" "$2;
            frequency=$3;
            array[page_title]+=frequency
        }END{
            for (i in array){
                print i,array[i] | "sort -k2,2"
            }
        }' $FINAL_RES)
        echo "$PARSING" > $FINAL_RES
        #END AWK BLOCK
        rm $FILENAME
        rm $TEMP_UNZIPPED
    fi
done
mv $FINAL_RES $BASEPATH/06/01/
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" >> log
Everything works, but it takes a very long time to execute. Does anyone know how to get the same result in less time and with fewer lines of code?
The UNIX shell is an environment from which to manipulate files and processes and to sequence calls to tools. The UNIX tool which the shell calls to manipulate text is awk, so just use it:
$ awk '$1~/^en(\.n)?$/{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' file | sort
en ample_1 279
en.n bample_6 11
Your script has too many issues to comment on, which indicates you are a beginner at shell programming - get the books Bash Shell Scripting Recipes by Chris Johnson and Effective Awk Programming, 4th Edition, by Arnold Robbins.
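Since the inputs in the original script are gzipped, here is a sketch of the same awk idea applied to a whole folder of compressed files (assuming .gz inputs; BASEPATH, FILES and pg-1 are the names used in the question's script, and the path concatenation mirrors it):

zcat "${BASEPATH}${FILES}"/*.gz |
    awk '$1~/^en(\.n)?$/ { tot[$1" "$2]+=$3 }
         END { for (key in tot) print key, tot[key] }' |
    sort > "pg-1"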

Using the command line to combine non-adjacent sections of a file

Is it possible to concatenate the header lines of a file with the output from a filter using grep? Perhaps using the cat command or something else from GNU coreutils?
In particular, I have a tab delimited file that roughly looks like the following:
var1 var2 var3
1 MT 500
30 CA 40000
10 NV 1240
40 TX 500
30 UT 35000
10 AZ 1405
35 CO 500
15 UT 9000
1 NV 1505
30 CA 40000
10 NV 1240
I would like to select from lines 2 - N all lines that contain "CA" using grep and also to place the first row, the variable names, in the first line of the output file using GNU/Linux commands.
The desired output for the example would be:
var1 var2 var3
30 CA 40000
35 CA 65000
15 CA 2500
I can select the two sets of desired output with the following lines of code.
head -1 filename
grep -E CA filename
My initial idea is to combine the output of these commands using cat, but I have not been successful so far.
If you're running the commands from a shell (including shell scripts), you can run each command separately and redirect the output:
head -1 filename > outputfile
grep -E CA filename >> outputfile
The first line will overwrite outputfile, because a single > was used. The second line will append to outputfile, because >> was used.
If you want to do this in a single command, the following worked in bash:
(head -1 filename && grep -E CA filename) > outputfile
If you want the output to go to standard output, leave off the parentheses and the redirection:
head -1 filename && grep -E CA filename
It's not clear what you're looking for, but perhaps just:
{ head -1 filename; grep -E CA filename; } > output
or
awk 'NR==1 || /CA/' filename > output
But another interpretation of your question is best addressed using sed or awk.
For example, to print lines 5-9 and line 14, you can do:
sed -n -e 5,9p -e 14p
or
awk '(NR >=5 && NR <=9) || NR==14'
I just came across a method that uses the cat command.
cat <(head -1 filename) <(grep -E CA filename) > outputfile
This site, tldp.org, calls the <(command) syntax "process substitution."
It is unclear to me which method would be more efficient in terms of memory / speed, but this is testable.
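One rough way to compare, assuming bash and an input file large enough to measure (output is discarded so only the filtering work is timed):

time { head -1 filename; grep -E CA filename; } > /dev/null
time awk 'NR==1 || /CA/' filename > /dev/null
time cat <(head -1 filename) <(grep -E CA filename) > /dev/null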
