I have a list file with many columns, and I want to know, for each row, whether the next row's value in the third column is equal to the current row's value in the third column. My problem is that those values are file paths, looking like this:
/media/user/DATA/Folder1/File1.extension
/media/user/DATA/Folder1/File1.extension
/media/user/DATA/Folder1/File2.extension
/media/user/DATA/Folder1/File2.extension
/media/user/DATA/Folder2/File3.extension
/media/user/DATA/Folder2/File4.extension
/media/user/DATA/Folder2/File5.extension
/media/user/DATA/Folder3/File6.extension
/media/user/DATA/Folder4/File6.extension
I have files with the same name in the same folder (which should be considered as an equal value), same name in different folders (must be recognized as different values) and same folder but different names (different values).
What I have been trying to do is sort them all (with sort -k3,3) and then add a fourth column saying whether the next row's value is the same as the current one, looking something like this:
/media/user/DATA/Folder1/File1.extension Y
/media/user/DATA/Folder1/File1.extension N
/media/user/DATA/Folder1/File2.extension Y
/media/user/DATA/Folder1/File2.extension N
/media/user/DATA/Folder2/File3.extension N
/media/user/DATA/Folder2/File4.extension N
/media/user/DATA/Folder2/File5.extension N
/media/user/DATA/Folder3/File6.extension N
/media/user/DATA/Folder4/File6.extension N
I came up with this code to do that, but it keeps showing errors when comparing R1 with LastR1, which I think may be due to the "/" in the strings...
echo "N" > LastR1.kp
sort -k3,3 list > list.tmp
mv list list.backup
mv list.tmp list
while read T1 T2 R1; do
    LastR1=`cat LastR1.kp`
    if [ $R1 == $LastR1 ]
    then
        KeepR1="Y"
    else
        KeepR1="N"
    fi
    echo "$KeepR1" > CKeep.kp
    cat KeepList.kp CKeep.kp > KeepList.kp
done
sed -i -e 1,2d KeepList
join list KeepList > list.tmp
mv list.tmp list
So what I would have at the end is my original file with a fourth column containing either Y (for 'the next row has the same value in the 3rd column') or N (for 'different value on the next row').
I can't seem to find the reason why my if statement doesn't work, and even though I think this approach could do what I need, I'm definitely open to different approaches.
Not sure if you are using bash or sh. Add a shebang line at the top of your script:
#!/usr/bin/env bash
and change the if condition to use [[ ]]:
if [[ $R1 == $LastR1 ]]
I hope this solves your problem.
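Since you mentioned being open to other approaches, here is a minimal sketch that does the whole job in one pass with sort and awk, assuming the path really is the third whitespace-separated field and writing to a new file list.flagged (both the field number and the output name are assumptions, so adjust as needed):
sort -k3,3 list | awk '
    # Buffer each line; once the next line has been read we know whether
    # its third field matches, so the previous line can be printed with Y/N.
    NR > 1 { print prev, (prev3 == $3 ? "Y" : "N") }
    { prev = $0; prev3 = $3 }
    END { if (NR) print prev, "N" }   # the last line has no next row
' > list.flagged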
I have created a script that identifies a unique set of files (from a large list) and saves the file paths as unique variables (e.g. $vol1, $vol2, $vol3... see below).
for volume in `seq 1 $num_ap_files`
do
bindex=$(cat ../b0_volumes_tmp.txt | head -$volume | tail -1)
eval 'vol'${volume}=$(cat ../all_volumes.txt | head -$bindex | tail -1)
done
I'd like to concatenate the variables into one string variable to use as input for a separate command-line program. As the number of volumes may differ, it seems most appropriate to create a loop of the following nature (??? marks where I am having trouble):
??? files=$vol1
for i in `seq 2 $num_ap_files`
do
??? eval files=${files}" $vol"${i)
done
I've tried a few different options here using files+=, eval function, etc. and continue to get an error message: -bash: vol0001.nii.gz: command not found. Is there a way around this? The filenames need to be in string format to be fed into subsequent processing steps.
Thanks,
David
Use an array for this:
vol=()
for ((volume = 1; volume <= num_ap_files; volume++)); do
bindex=$(sed -n "$volume p" ../b0_volumes_tmp.txt)
vol[$volume]=$(sed -n "$bindex p" ../all_volumes.txt)
done
files="${vol[*]}"
The variable files contains the content of the array vol as one single string.
If you now want to call a command with each element of vol as an argument, no need to use an intermediate variable like files:
my_next_command "${vol[@]}"
Note: the command sed -n "$N p" file prints the Nth line of the file (the same as your cat file | head -$N | tail -1)
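A small illustration of the difference between the two expansions (the file names are just made-up placeholders in the style of your vol0001.nii.gz):
vol=(vol0001.nii.gz vol0002.nii.gz vol0003.nii.gz)

files="${vol[*]}"            # joined into one single word
printf '<%s>\n' "$files"     # -> <vol0001.nii.gz vol0002.nii.gz vol0003.nii.gz>

printf '<%s>\n' "${vol[@]}"  # -> three separate arguments, one per element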
I'm trying to output lines of a CSV file which is quite large. In the past I have tried different things and ultimately found that Linux command-line tools (sed, awk, grep, etc.) are the fastest way to handle these kinds of files.
I have a CSV file like this:
1,rand1,rand2
4,randx,randy,
6,randz,randq,
...
1001,randy,randi,
1030,rando,randn,
1030,randz,randc,
1036,randp,randu
...
1230994,randm,randn,
1230995,randz,randl,
1231869,rande,randf
Although the first column is numerically increasing, the gap between consecutive numbers varies randomly. I need to be able to output all lines that have a value between X and Y in their first column.
Something like:
sed ./csv -min --col1 1000 -max --col1 1400
which would output all the lines that have a first column value between 1000 and 1400.
The lines are different enough that in a >5 GB file there might only be ~5 duplicates, so it wouldn't be a big deal if it counted the duplicates only once -- but it would be a big deal if it threw an error due to a duplicate line.
I may not know whether particular line values exist (e.g. 1000 is a rough estimate and should not be assumed to exist as a first column value).
Optimizations matter when it comes to large files; the following awk command:
is parameterized (uses variables to define the range boundaries)
performs only a single comparison for records that come before the range.
exits as soon as the last record of interest has been found.
awk -F, -v from=1000 -v to=1400 '$1 < from { next } $1 > to { exit } 1' ./csv
Because awk performs numerical comparison (with input fields that look like numbers), the range boundaries needn't match field values precisely.
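If it helps, the same one-liner can be wrapped in a tiny script that takes the boundaries as arguments, roughly in the spirit of the hypothetical sed invocation above (the script name and argument order here are just an assumption):
#!/usr/bin/env bash
# usage: ./csv-range.sh FROM TO FILE    e.g.  ./csv-range.sh 1000 1400 ./csv
from=$1 to=$2 file=$3
awk -F, -v from="$from" -v to="$to" '
    $1 < from { next }    # before the range: skip
    $1 > to   { exit }    # past the range: stop reading the file
    1                     # inside the range: print the line
' "$file"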
You can easily do this with awk, though it won't take full advantage of the file being sorted:
awk -F , '$1 > 1400 { exit(0); } $1 >= 1000 { print }' file.csv
If you know that the numbers are increasing and unique, you can use addresses like this:
sed '/^1000,/,/^1400,/!d' infile.csv
which deletes (!d) every line outside the range that starts at the line matching /^1000,/ and ends at the line matching /^1400,/.
Notice that this doesn't work if 1000 or 1400 don't actually exist as first-column values: if 1000 is missing, nothing is printed at all, and if 1400 is missing, everything from 1000 to the end of the file is printed.
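For instance, with the sample data above, where 1000 does not actually occur as a first-column value, the address range never opens and nothing is printed:
$ sed '/^1000,/,/^1400,/!d' infile.csv
$ # (no output: no line starts with "1000,", so every line is deleted)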
In any case, as demonstrated by the answers by mklement0 and that other guy, awk is the better choice here.
Here's a bash-version of the script:
#! /bin/bash
fname="$1"
start_nr="$2"
end_nr="$3"
while IFS=, read -r nr rest || [[ -n $nr && -n $rest ]]; do
if (( $nr < $start_nr )); then continue;
elif (( $nr > $end_nr )); then break; fi
printf "%s,%s\n" "$nr" "$rest"
done < "$fname"
You would then call it as script.sh foo.csv 1000 2000.
The script will start printing when the number is large enough and then immediately stops when the number gets above the limit.
I have two questions.
I have found the following line in a script: IFS=${IFS#??}
I would like to understand what exactly it is doing.
Also, when I want to do something with every component of a path, e.g.:
$1 = home/user/bin/etc/something...
I need to change IFS to "/" and then process it in a loop like
while [ -e "$1" ]; do
for F in `$1`
#do something
done
shift
done
Is that the correct way ?
${var#??} is a shell parameter expansion. It tries to match the beginning of $var against the pattern written after #. If it matches, it expands to the value of $var with that part removed. Since ? matches any single character, ${var#??} removes the first two characters from the value of $var.
$ var="hello"
$ echo ${var#??}
llo
So with IFS=${IFS#??} you are resetting IFS to its value after removing its two first chars.
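For example, with the default IFS (a space, a tab and a newline), stripping the first two characters leaves only the newline, so word splitting afterwards happens only at line breaks:
$ printf '%q\n' "$IFS"    # default value
$' \t\n'
$ IFS=${IFS#??}
$ printf '%q\n' "$IFS"    # only the newline is left
$'\n'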
To loop through the words of a /-delimited string, you can store the split string in an array and then iterate over it:
$ IFS="/" read -r -a myarray <<< "home/user/bin/etc/something"
$ for w in "${myarray[@]}"; do echo "-- $w"; done
-- home
-- user
-- bin
-- etc
-- something
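And to tie it back to your while/shift loop: you could apply the same splitting to every positional parameter, something like this sketch (the echo is only a stand-in for whatever you actually do with each component):
for path in "$@"; do
    IFS=/ read -r -a parts <<< "$path"
    for part in "${parts[@]}"; do
        echo "-- $part"    # do something with each path component
    done
done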
I'm conducting an iterative analysis and have to submit more than 5000 jobs to a batch system on a large computer cluster.
I want to run a bash for loop but use both the current list item and the next item in my script. I'm not sure of the best way to do this using this format:
#! /bin/bash
for i in `cat list.txt`;
do
# run a bunch of code on $i (ex. Data file for 'Apples')
# compare input of $i with next in list {$i+1} (ex. Compare 'Apples' to 'Oranges', save output)
# take output of this comparison and use it as an input for the next analysis of $i (ex. analyze 'Apples' some more, save output for the next step, analyze data on 'Oranges')
# save this output as the input for next script which analyses the next item in the list {$i+1} (Analysis of 'Oranges' with input data from 'Apples', and comparing to 'Grapes' in the middle of the loop, etc., etc.)
done
Would it be easiest for me to provide a tabular input list in a while loop? I would really prefer not to do this as I would have to do some code editing, albeit minor.
Thanks for helping a novice -- I've looked all over the interwebs and ran through a bunch of books and haven't found a good way to do this.
EDIT: For some reason I was thinking there might have been a for loop trick to do this but I guess not; it's probably easier for me to do a while loop with a tabular input. I was prepared to do this, but I didn't want to re-write the code I had already.
UPDATE: Thank you all so much for your time and input! Greatly appreciated.
Another solution is to use bash arrays. For example, given a file list.txt with content:
1
2
3
4
4
5
You can create an array variable with the lines of the file as its elements:
$ myarray=(1 2 3 4 4 5)
While you could also do myarray=( $(cat list.txt) ), that would split on all whitespace and glob-expand the contents; the better method is:
$ IFS=$'\n' read -r -d '' -a myarray < list.txt
Then you can access elements as:
$ echo "${myarray[2]}"
3
The length of the array is given by ${#myarray[@]}. A list of all indices is given by ${!myarray[@]}, and you can loop over this list of indices:
for i in "${!myarray[@]}"; do
    echo "${myarray[$i]} ${myarray[$(( $i + 1 ))]}"
done
Output:
1 2
2 3
3 4
4 4
4 5
5
While there are likely simpler solutions to your particular use case, this would allow you to access arbitrary combinations of array elements in the loop.
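If the lone trailing element (the 5 above) gets in the way, one option is simply to stop one index short of the end of the array:
for i in "${!myarray[@]}"; do
    (( $i + 1 < ${#myarray[@]} )) || break    # no "next" element left
    echo "${myarray[$i]} ${myarray[$(( $i + 1 ))]}"
done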
This answer assumes that you want your values to overlap -- meaning that a value given as next then becomes curr on the following iteration.
Assuming you encapsulate your code in a function that takes two arguments (current and next) when a next item exists, or one argument when on the last item:
# a "higher-order function"; it takes another function as its argument
# and calls that argument with current/next input pairs.
invokeWithNext() {
    local funcName=$1
    local curr next
    read -r curr
    while read -r next; do
        "$funcName" "$curr" "$next"
        curr=$next
    done
    "$funcName" "$curr"
}
# replace this with your own logic
yourProcess() {
    local curr=$1 next=$2
    if (( $# > 1 )); then
        printf 'Current value is %q, and next item is %q\n' "$curr" "$next"
    else
        printf 'Current value is %q; no next item exists\n' "$curr"
    fi
}
These definitions done, you can run:
invokeWithNext yourProcess <list.txt
...to yield output such as:
Current value is 1, and next item is 2
Current value is 2, and next item is 3
Current value is 3, and next item is 4
Current value is 4, and next item is 5
Current value is 5; no next item exists
$ printf '%d\n' {0..10} | paste - -
0 1
2 3
4 5
6 7
8 9
10
So if you just want to pair up lines so that you can read two variables per line...
while read -r odd even; do
…
done < <(paste - - < inputfile)
You will need to do additional work if your lines contain whitespace.
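One way to handle whitespace in the lines (assuming they contain no tab characters) is to split only on the tab that paste inserts between the two columns:
while IFS=$'\t' read -r odd even; do
    printf 'curr=%s next=%s\n' "$odd" "$even"
done < <(paste - - < inputfile)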
I would replace the for loop with a while read xx loop.
Something along the lines of
cat list.txt | while read line; do
    if read nextline; then
        : # You have $line and $nextline
    else
        : # $nextline is empty and $line holds the last line of list.txt
    fi
done
I am trying to grep all the lines between two dates, where the dates are formatted like this:
date_time.strftime("%Y%m%d%H%M")
so say between [201211150821 - 201211150824]
I am trying to write a script which involves looking for lines between these dates:
cat <somepattern>*.log | grep [201211150821 - 201211150824]
I am trying to find out if something exists in Unix that lets me search for a range of dates.
I could convert the dates in the logs to seconds since the epoch and then use a regular grep with [time1 - time2], but that means reading each line, extracting the time value, converting it, and so on.
Maybe something simple already exists, so that I can specify date/timestamp ranges the way I can provide a numeric range to grep?
Thanks!
P.S:
Also, I can pass in a pattern like 2012111511(27|28|29|[3-5][0-9]), but that's specific to the range I want, it's tedious to work out for different dates each time, and it gets trickier doing it at runtime.
Use awk. Assuming the first token in the line is the timestamp:
awk '
  # Remember the boundaries, then blank them out so awk does not try to
  # open them as input files; the log lines then come from stdin (or from
  # any file names listed after the two boundaries).
  BEGIN { first = ARGV[1]; last = ARGV[2]; ARGV[1] = ARGV[2] = "" }
  $1 > first && $1 < last { print; }
' 201211150821 201211150824
A Perl solution:
perl -wne 'print if m/(?<!\d)(20\d{10})(?!\d)/
    && $1 >= 201211150821 && $1 <= 201211150824'
(It finds the first twelve-digit integer that starts with 20, and prints the line if that integer is within your range of interest. If it doesn't find any such integer, it skips the line. You can tweak the regex to be more restrictive about valid months and hours and so on.)
You are looking for the somewhat obscure 'csplit' (context split) command:
csplit file '%201211150821%' '/201211150824/'
will split out all the lines between the first and second regexps from file. It is likely to be the fastest and shortest if your files are sorted on the dates (you said you were grepping logs).
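If I remember csplit's behaviour correctly, the pieces are written to files named xx00, xx01, and so on; because the first pattern uses %...%, the part of the file before the first matching line is skipped rather than written, so the range you are after should end up in xx00:
csplit file '%201211150821%' '/201211150824/'
cat xx00    # from the first matching timestamp up to (not including) the second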
Bash + coreutils' expr only:
export cmp=201211150823
cat file.txt | while read line; do
    range=$(expr match "$line" '.*\[\(.*\)\].*')
    [ "x$range" = "x" ] && continue
    start=${range:0:12}
    end=${range:15:12}
    [ $start -le $cmp -a $end -ge $cmp ] && echo "match: $line"
done
cmp is your comparison value.
I wrote a specific tool for similar searches - http://code.google.com/p/bsearch/
In your example, the usage will be:
$ bsearch -p '$[YYYYMMDDhhmm]' -t 201211150821 -t 201211150824 logfile.