Linux (command) | rename | trim leading characters after x vs. keep only numbers - linux

I want to delete some parts from file names such that
101 - title [1994].mp4
102 - title [1994].mp4
103 - title [1994].mp4
104 - title [1994].mp4
105 - title [1994].mp4
becomes
101.mp4
102.mp4
103.mp4
104.mp4
105.mp4
There are two or more ways to handle this, either by:
keeping the numbers and removing the non-numeric characters, or
trimming everything after the first (3) characters.
How would I use the Linux command rename to keep only the first (3) characters and trim the rest, while keeping the extension of course?
I would like to avoid the mv command; what are the ways to do this with rename?

This is the expression you want: s/(\d{3}).*$/$1.mp4/. Take a look at the output:
rename -n 's/(\d{3}).*$/$1.mp4/' *mp4
101 - title [1994].mp4 renamed as 101.mp4
102 - title [1994].mp4 renamed as 102.mp4
103 - title [1994].mp4 renamed as 103.mp4
104 - title [1994].mp4 renamed as 104.mp4
105 - title [1994].mp4 renamed as 105.mp4
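The -n flag only previews the renames without touching any files. Once the preview looks right, drop -n (optionally keeping -v for verbose output) to actually rename; this assumes the Perl-based rename whose s/// syntax is used above (the util-linux rename takes different arguments):
rename -v 's/(\d{3}).*$/$1.mp4/' *.mp4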

Related

awk split adds whole string to array position 1 (reason unknown)

So I have a .txt file that looks like this:
mona 70 77 85 77
john 85 92 78 80
andreja 89 90 85 94
jasper 84 64 81 66
george 54 77 82 73
ellis 90 93 89 88
I have created a grades.awk script that contains the following code:
{
FS=" "
names=$1
vi1=$2
vi2=$3
vi3=$4
rv=$5
#printf("%s ",names);
split(names,nameArray," ");
printf("%s\t",nameArray[1]); //prints the whole array of names for some reason, instead of just the name at position 1 in array ("john")
}
So my question is, how do I split this correctly? Am I doing something wrong?
How do you read line by line, word by word, correctly? I need to add each column into its own array. I've been searching for the answer for quite some time now and can't fix my problem.
Here is a template to calculate average grades per student:
$ awk '{sum=0; for(i=2;i<=NF;i++) sum+=$i;
printf "%s\t%5.2f\n", $1, sum/(NF-1)}' file
mona 77.25
john 83.75
andreja 89.50
jasper 73.75
george 71.50
ellis 90.00
printf("%s\t",nameArray[1])
is doing exactly what you want it to do, but you aren't printing any newline between invocations. It gets called once per input line and outputs one word at a time, but since you aren't outputting any newlines between the words, you just get one line of output. Change it to:
printf("%s\n",nameArray[1])
There are a few other issues with your code of course (e.g. you're setting FS in the wrong place and unnecessarily, names only ever contains 1 word so splitting it into an array doesn't make sense, etc.) but I think that's what you were asking about specifically.
If that's not all you want then edit your question to clarify what you're trying to do and add concise, testable sample input and expected output.
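If the underlying goal really is to collect each column into its own array (indexed by line number), a minimal sketch along those lines, with illustrative array names, might be:
awk '{ names[NR] = $1; vi1[NR] = $2; vi2[NR] = $3; vi3[NR] = $4; rv[NR] = $5 }   # one array per column, keyed by line number
     END { for (i = 1; i <= NR; i++) printf "%s\t%s\n", names[i], vi1[i] }' file
The END block then has every column available; here it just prints each name with its first grade as an example.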

change in the text file in linux command line

I have a big file like this example:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 14361 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
I want to change the 4th column. Each element in column 4 should be replaced by the element from column 5 in the same row, minus 1; that is, the new value is "(element of column 5) - 1".
I am not so familiar with the command line in Linux (shell). Do you know how I can do that in a single line?
here is the expected output:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 14408 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 14408 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 14408 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 16764 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
awk is a great tool for manipulating files like this. It allows processing a file that consists of records of fields; by default records are defined by lines in the file and fields are separated by spaces. The awk command line to do what you want is:
awk '!/^#/ { $4 = $5 - 1 } { print }' <filename>
An awk program is a sequence of pattern-action pairs. If a pattern is omitted, the action is performed for all input records; if an action is omitted (not used in this program), the default action is to print the record. Fields are referenced in an awk program as $n where n is the field number. There are several forms of pattern, but the one used here is the negation of a regular expression that is matched against the whole record. So this program updates the 4th field to be the value of the 5th field minus 1, but only for lines that do not start with a #, to avoid messing up the header. Then for all records (because the pattern is omitted) the record is printed. The pattern-action pairs are evaluated in order, so the record is printed after updating the 4th field.
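One caveat, assuming the real file may be tab-separated rather than space-separated: assigning to a field makes awk rebuild the record using its output field separator, which defaults to a single space. To keep tabs in the output, set both separators explicitly:
awk 'BEGIN { FS = OFS = "\t" } !/^#/ { $4 = $5 - 1 } { print }' <filename>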
Save your content in a file named a, then:
awk '{if(NR>1){$4=$5-1;print $0}else{print $0}}' a

How do I convert spaced columns to tabs? [duplicate]

This question already has an answer here: Fixed width to CSV (1 answer). Closed 5 years ago.
This question is not a duplicate as someone had suggested. Mods, pay attention
I'm running a for loop on multiple files that contain information like below
1 Leer Normal [status] — 100
1 Wrap Normal [physical] 15 90
4 Poison Sting Poison [physical] 15 100
9 Bite Dark [physical] 60 100
12 Glare Normal [status] — 100
17 Screech Normal [status] — 85
20 Acid Poison [special] 40 100
25 Spit Up Normal [special] — 100
25 Stockpile Normal [status] — —
25 Swallow Normal [status] — —
28 Acid Spray Poison [special] 40 100
33 Mud Bomb Ground [special] 65 85
36 Gastro Acid Poison [status] — 100
38 Belch Poison [special] 120 90
41 Haze Ice [status] — —
44 Coil Poison [status] — —
49 Gunk Shot Poison [physical] 120 80
I need to be able to extract data from it.
The problem is, each file has different column lengths.
Column 2 sometimes has spaces in it, so squeezing all spaces and using space as a delimiter for cut is not an option. I need the columns separated by tabs without relying on file-specific information, because the loop goes over about 800 files.
sed 's/ \+/ /g' | cut -f 2 -d " "
^ Not what I need since column 2 has spaces in it
cut -b "5-20"
^ Can't use this either because the columns lengths are different for each file.
With sed, to replace multiple consecutive spaces or tabs with one tab:
sed 's/[[:space:]]\{1,\}/\t/g' file
Explanations:
s: substitute
[[:space:]]: space or tab characters
\{1,\}: when at least one occurrence is found
g: apply substitution to all occurrences in line
Edit:
To preserve single spaces in second column, you can replace only when 2 spaces/tabs are found:
sed 's/[[:space:]]\{2,\}/\t/g' file
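With the columns tab-separated, the second column (its internal single spaces preserved) can then be pulled out with cut, which uses tab as its default delimiter, for example:
sed 's/[[:space:]]\{2,\}/\t/g' file | cut -f2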

R: How to filter through a string of characters in the header of a table

I have a table, here's the start:
TargetID SM_H1462 SM_H1463 SM_K1566 SM_X1567 SM_V1568 SM_K1534 SM_K1570 SM_K1571
ENSG00000000419.8 290 270 314 364 240 386 430 329
ENSG00000000457.8 252 230 242 220 106 234 343 321
ENSG00000000460.11 154 158 162 136 64 152 206 432
ENSG00000000938.7 20106 18664 19764 15640 19024 18508 45590 32113
I want to write code that will filter through the names of each column (the SM_... ones) and only look at the fourth character in each name. There are 4 different options for the 4th character: the letters H, K, X or V. This can be seen from the table above, e.g. SM_H1462, SM_K1571, etc. Names that have the letter H or K as the 4th character are the Control, and names that have the letter X or V as the 4th character are the Case.
I want the code to separate the column names based on the 4th letter and group them into two groups: either Case and Control.
Essentially, we can ignore the data for now, I just want to work with the col names first.
You could try checking the fourth character and get case and control as two separate data frames, if that helps you:
my.df <- data.frame(matrix(rep(seq(1,8),3), ncol = 8))
colnames(my.df) <- c('SM_H1462','SM_H1463','SM_K1566','SM_X1567', 'SM_V1568', 'SM_K1534', 'SM_K1570','SM_K1571')
my.df
control = my.df[,(substr(colnames(my.df),4,4) == 'H' | substr(colnames(my.df),4,4) == 'K')]
case = my.df[,(substr(colnames(my.df),4,4) == 'X' | substr(colnames(my.df),4,4) == 'V')]
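A slightly more compact variant of the same idea, extracting the fourth characters once and using %in% (equivalent under the same assumptions):
key <- substr(colnames(my.df), 4, 4)        # 4th character of each column name
control <- my.df[, key %in% c("H", "K")]    # H or K: Control
case    <- my.df[, key %in% c("X", "V")]    # X or V: Case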

How to truncate text in string after/before separator in PowerShell

I want to make a PowerShell script that takes each file of my music library and then does a hash sum of it and writes that into a file like so:
test.txt ; 131 136 80 89 119 17 60 123 210 121 188 42 136 200 131 198
When I start the script, I need it to first compare my music library with the already existing values, but for this I just want to cut off everything after the ; so that it can compare filename against filename (or filepath)... but I'm stumped at how to do that.
I tried replacing the value via $name = $name -replace ";*","", but that didn't work. I also tried to filter... but I don't know how.
$pos = $name.IndexOf(";")              # position of the first ";"
$leftPart = $name.Substring(0, $pos)   # everything before the ";"
$rightPart = $name.Substring($pos+1)   # everything after the ";"
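For the sample line from the question this would give, for example (note that the filename keeps its trailing space unless it is trimmed):
$name = "test.txt ; 131 136 80 89 119 17 60 123 210 121 188 42 136 200 131 198"
$pos = $name.IndexOf(";")
$name.Substring(0, $pos).Trim()        # "test.txt"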
Internally, PowerShell uses the String class.
$text = "test.txt ; 131 136 80 89 119 17 60 123 210 121 188 42 136 200 131 198"
$text.split(';')[1].split(' ')
You can use Split:
$text = "test.txt ; 131 136 80 89 119 17 60 123 210 121 188 42 136 200 131 198"
$separator = ";" # you can put many separator like this "; : ,"
$parts = $text.split($separator)
echo $parts[0] # return test.txt
echo $parts[1] # return the part after the separator
$name -replace ";*",""
You were close, but you used the syntax of a wildcard expression rather than a regular expression, which is what the -replace operator expects.
Therefore (hash sequence shortened):
PS> 'test.txt ; 131 136 80 89 119 17 60 123 210 121 188' -replace '\s*;.*'
test.txt
Note:
Omitting the substitution-text operand (the 2nd RHS operand) implicitly uses "" (the empty string), i.e. it effectively removes what the regex matched.
.* is what represents a potentially empty run (*) of characters (.) in a regex - it is the regex equivalent of * by itself in a wildcard expression.
Adding \s* before the ; in the regex also removes trailing whitespace (\s) after the filename.
I've used '...' rather than "..." to enclose the regex, so as to prevent confusion between what PowerShell expands up front (see expandable strings in PowerShell) and what the .NET regex engine sees.
This does work for a specific delimiter with a specific number of characters before the delimiter. I had many issues attempting to use this in a foreach loop where the position changed but the delimiter was the same. For example, I was using the backslash as the delimiter and wanted to use only everything to the right of the backslash. The issue was that once the position was defined (71 characters from the beginning), it would use $pos as 71 every time, regardless of where the delimiter actually was. I found another method: use a delimiter with .split to break things up, then use the split variable to call the sections; for instance, the first section is $variable[0] and the second section is $variable[1], as in the sketch below.
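A minimal sketch of that split-based approach, assuming a backslash-delimited string ($fullPath is an illustrative name):
$variable = $fullPath.Split('\')   # break the string at every backslash
$variable[0]                       # section before the first backslash
$variable[-1]                      # section after the last backslash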
I had a dir full of files including some that were named invoice no-product no.pdf and wanted to sort these by product no, so...
get-childitem *.pdf | sort-object -property @{expression={$_.name.substring($_.name.indexof("-")+1)}}
Note that in the absence of a "-" this sorts by $_.name.
Using regex, the result is in $matches[1]:
$str = "test.txt ; 131 136 80 89 119 17 60 123 210 121 188 42 136 200 131 198"
$str -match "^(.*?)\s\;"
$matches[1]
test.txt
