Pharo beginner here - I've got some (ugly?) code that works for a specific case, but wonder if there's a better (more readable) way. I want to get the name of a repo without the .git at the end (if it's there).
Trimming off n characters would be great. I'm looking for a general solution.
I found trimRight:, but it doesn't take a number of characters; it trims either whitespace or specific characters.
(projectName endsWith: '.git')
    ifTrue: [ projectName := projectName
        copyFrom: 1 to: projectName size - '.git' size ]
For these cases where you want to remove a specific suffix if it is present, you have exactly #withoutSuffix:.
In your case, it should be
projectName withoutSuffix: '.git'
For the general case of dropping the last n characters, there is allButLast:
'Hello World' allButLast: 2 "=> 'Hello Wor'"
I don't know if this is a bind problem, a character set problem, or something I haven't considered. Here's the challenge: the file system was written by some CD ripping software, but the problem is more generic. I need to be able to insert any legal filename from a Linux OS into a database; this set of titles is just a test case.
I tried to make this as clear as I could:
find(\&process, $_);   # populates @fnames with $File::Find::name
my $count = scalar @fnames;
print "File count $count\n";
So I use File::Find to fill the array with the file name strings; some are really problematic.
$stm = $dbh_sms->prepare(qq{
INSERT INTO $table (path) VALUES (?)
});
foreach $fname (@fnames) {
print "File:$fname\n";
$stm->execute("$fname");
}
Now here's what I see printed, compared to what comes back out of MariaDb, just a few examples showing the problem:
File:/test1/music/THE DOOBIE BROTHERS - BEST OF THE DOOBIES/02 - THE DOOBIE BROTHERS - LONG TRAIN RUNNIN´.mp3
A bunch of these titles have the back ticks, they are problem #1 - here's how they come back out of a select against the table after I populate it:
| /test1/music/THE DOOBIE BROTHERS - BEST OF THE DOOBIES/02 - THE DOOBIE BROTHERS - LONG TRAIN RUNNIN´.mp3
This character is also a problem:
File:/test1/music/Blue Öyster Cult - Workshop Of The Telescopes/01 - Blue Öyster Cult - Don't Fear The Reaper.mp3
From the database:
/test1/music/Blue Ãyster Cult - Workshop Of The Telescopes/01 - Blue Ãyster Cult - Don't Fear The Reaper.mp3
And one more, but far from the last of the problematic strings:
File:/test1/music/Better Than Ezra - Deluxe/03 - Better Than Ezra - Southern Gürl.mp3
Comes out as:
/test1/music/Better Than Ezra - Deluxe/03 - Better Than Ezra - Southern Gürl.mp3
I thought this was a character set problem, so I added this to the connect string:
{ RaiseError => 1, mysql_enable_utf8mb4 => 1 }
I've also tried:
$dbh_sms->do('SET NAMES utf8mb4');
Is that a hex C2 being added to the string? (Edit: it's an octal 303, hex c3) I've also tried changing the column to varbinary and still see the same results. Anyone have a clue why this is happening? I'm at a loss....
TIA
Edit - I dumped the table with OD to find out what is actually getting inserted since I was of the belief that with a placeholder, a bind variable would be written without interpolation, basically a binary transfer. To save my eyes, I just did a handful of records concentrating on what I thought was a 'back tick' from the above example, which is an octal 264, AKA "Acute accent - spacing acute".
It is in the table, but preceded by 303 202 and 302, which are a couple of those 'A' characters with the icing on top and a "low quote" character in between. Which contradicts my prior understanding about the utility of place holders and bind variables.
So I am more confused now than before.
I found the problem: it was the Perl character encoding. Decoding each file name before binding it (decode comes from the Encode module):
$fname = decode("UTF-8", $fname);
was all I needed.
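The byte values reported in the question (octal 303 202 302 264 where a single acute accent was expected) are what you get when UTF-8 bytes are treated as Latin-1 characters and then encoded as UTF-8 a second time. The following Python sketch reproduces that effect outside Perl and DBI; the function names are hypothetical, and it assumes the file names on disk are UTF-8. It only illustrates the suspected mechanism, not what the driver does internally.

```python
def double_encode(raw: bytes) -> bytes:
    """Mimic the suspected bug: UTF-8 bytes read as Latin-1, then re-encoded as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def decode_first(raw: bytes) -> bytes:
    """Mimic the fix: decode the bytes as UTF-8 once, then encode once on the way out."""
    return raw.decode("utf-8").encode("utf-8")


acute = "´".encode("utf-8")            # two bytes, octal 302 264
print(double_encode(acute))            # four bytes: the accent is now preceded by extra bytes
print(decode_first(acute) == acute)    # the decoded string round-trips unchanged
```

Decoding first tells the runtime that the bytes already are UTF-8 text, so they are not encoded a second time.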
I have a list of phrases. It is actually an Excel file, but I can extract each single line if needed.
I need to find the lines that are very similar to each other. For example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are nearly the same: not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines, so I'm looking for a way to do this quickly and, ideally, visually.
The first thing that came to mind was to compare the 1st line to the 2nd, then to the 3rd, and so on to the end, then do the same starting from the 2nd line, and so on up to the second-to-last line, producing a kind of score. For example, the 1st line is 100% similar to line 42, 99% to line 522 ... 21% to line 22142, etc.
But that is only one idea, and maybe not the best.
Maybe a good program, script, or online service for this already exists. I searched but couldn't find one, so I'm asking here.
Does anyone know a good way (if this is possible), a script, or an online service to achieve this?
One thing you can do is write a script that does the following:
Extract the data from the CSV file.
Define a regex that captures the structure similar lines share. A Python example could be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
or something along those lines; refer to the documentation of Python's re module.
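As a quick check of this approach: the pattern exactly as written above does not match the sample line from the question, because "(type 2)" contains a space and "mod." and "temp.max60°" contain dots. The sketch below compares it with a widened variant; WIDENED and has_shape are my own assumptions for illustration, not part of the original answer.

```python
import re

SAMPLE = "ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)"

# The pattern as posted: \w alone cannot match the space in "type 2" or the dots.
POSTED = re.compile(r"[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)")

# Hypothetical widened variant: allow whitespace inside the first parentheses
# and dots in the middle part and inside the second parentheses.
WIDENED = re.compile(r"[\w\s]+\([\w\s]+\)[\w\s.]+\([\w.°]+\)")


def has_shape(line: str) -> bool:
    """Return True if the whole line follows the 'text (x) text (y)' shape."""
    return WIDENED.fullmatch(line) is not None


print(POSTED.search(SAMPLE) is None)
print(has_shape(SAMPLE))
```

Note that a pattern like this only says two lines have the same shape; it does not measure how close their contents are.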
The problem you have is that you are not looking for an exact match, but for a similar ("like") one.
This is a problem even databases have never solved and results in a full table scan.
So we're unlikely to solve it.
However, I'd like to propose that you consider alternatives:
You could decide to limit the differences to specific character sets.
In the above example, you were ignoring numbers, but respected letters.
If we can assume that this rule will always hold true, then we can perform a text replace on the string.
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
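A minimal sketch of this normalize-then-hash idea, assuming that only runs of digits may differ between lines that should count as the same. The function names are hypothetical, and a plain dictionary stands in for the hash index.

```python
import re
from collections import defaultdict


def normalize(line: str) -> str:
    """Replace every run of digits with a single underscore."""
    return re.sub(r"\d+", "_", line)


def group_similar(lines):
    """Map each normalized text to the 1-based numbers of the lines that share it."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        groups[normalize(line)].append(number)
    # Keep only the texts that occur on more than one line.
    return {text: numbers for text, numbers in groups.items() if len(numbers) > 1}


lines = [
    "ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)",
    "SOME OTHER PART mod. X1",
    "ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)",
]
print(group_similar(lines))
```

Each line is processed once, so 45k lines are handled in a single pass instead of comparing every pair.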
You could decide to trade time for space.
For example, you can feed the strings to a service which will build lots of different variations of indexes on your string. For example, feed elasticsearch with your data, and then perform analytic queries on it.
Fuzzy search is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it is quite old, but in this case it works for me. I wrote this little script to build a list of near-matching lines, so I can check them manually against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########
lines=$(grep "" -c "$original_file")
if [[ -s "$destination_file" ]]; then
rm -rf "$destination_file"
fi
start=1
while IFS= read -r line; do
echo "Checking line $start/$lines"
lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
echo >> "$destination_file"
start=$((start+1))
done < "$original_file"
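For comparison, the pairwise scoring described in the question can be sketched with Python's standard library. This uses difflib.SequenceMatcher rather than tre-agrep, so the scores are similarity ratios between 0 and 1, not edit distances; the function name and the 0.9 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similar_pairs(lines, threshold=0.9):
    """Return (line_a, line_b, ratio) for every pair at or above the threshold.

    Line numbers are 1-based. Every pair is compared, so the cost grows
    quadratically with the number of lines.
    """
    pairs = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            ratio = SequenceMatcher(None, lines[i], lines[j]).ratio()
            if ratio >= threshold:
                pairs.append((i + 1, j + 1, ratio))
    return pairs


lines = [
    "ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)",
    "ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)",
    "unrelated text",
]
for a, b, ratio in similar_pairs(lines):
    print(f"line {a} ~ line {b}: {ratio:.2f}")
```

With about 45k lines this means roughly a billion comparisons, so it is likely only practical on a subset or after a cheaper pre-grouping step.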
Basically, I have to name several directories with a "/" at the end of the name (e.g. nameOfDir/), and also give some others names like these:
example 1: The Sun
example 2 : Sth & Sth
example 3 : Sth: 'Sth'
How can I do this? Whenever I write 'mkdir The Sun' it creates two directories. I am really stuck at this point.
The "/" is the path separator, so you cannot use it as part of a directory name.
Names with multiple words and special characters such as "*", " ", and "&" are possible if you quote them, for example mkdir "The Sun"; unquoted, the shell splits The Sun into two arguments, which is why you get two directories. I personally recommend against such names, as they are hard to read and easily lead to errors in scripts.
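A small sketch of both points, assuming a POSIX file system. shlex.quote shows how each name from the question would have to be written on a shell command line, and os.mkdir shows that the names themselves are legal once no shell word splitting is involved. The helper names are hypothetical, and the directories are created in a temporary location.

```python
import os
import shlex
import tempfile

NAMES = ["The Sun", "Sth & Sth", "Sth: 'Sth'"]


def mkdir_command(name: str) -> str:
    """Build a shell command line that passes the name as one argument."""
    return "mkdir -- " + shlex.quote(name)


def create_all(parent: str, names) -> list:
    """Create one directory per name under parent and return the sorted listing."""
    for name in names:
        os.mkdir(os.path.join(parent, name))
    return sorted(os.listdir(parent))


for name in NAMES:
    print(mkdir_command(name))

with tempfile.TemporaryDirectory() as parent:
    print(create_all(parent, NAMES))
```

Passing the name directly to os.mkdir avoids the shell entirely, which is why no quoting is needed there.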
I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_format = "AC_XXXXXX.abc"
The first step required building some kind of indexes for all the input files, this was done with the tool's build-index command and now each file had 6 indexes associated with it. Therefore now I have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.2"
Now, I need to use these indexes to perform the alignment. All the 6 indexes of each file are called together and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the index base name associated with that particular input file. All 6 index files are matched by AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so its first 14 characters include that folder name. If you only want the file name, strip the directory first, for example with ${file##*/}.
To take the first 14 characters of a variable, use ${file:0:14}, where 0 is the starting index and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a subshell via command substitution:
for file in indexes/*; do ./tool "$(echo "$file" | cut -c 1-14)" Project/query_file; done
I changed the argument to cut from -b to -c, so it counts characters instead of bytes.
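The same prefix logic, sketched in Python for clarity. It drops the directory part first, keeps the first 14 characters of the file name, and removes duplicates so that each index set is listed once rather than six times. The function name and the file names in the example are hypothetical.

```python
import os


def index_prefixes(paths, length=14):
    """Return the distinct file-name prefixes of the given length, in first-seen order."""
    seen = []
    for path in paths:
        prefix = os.path.basename(path)[:length]   # ignore the leading "indexes/"
        if prefix not in seen:
            seen.append(prefix)
    return seen


paths = [
    "indexes/AC_0000001.abc.1",
    "indexes/AC_0000001.abc.rev.1",
    "indexes/AC_0000002.abc.1",
]
print(index_prefixes(paths))
```

Each returned prefix could then be joined with the index directory again and passed to the tool once.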
I have code like this:
define(`DOSPACE',`...................................
.........................
................`'ifelse(eval(len(`Space: $2: $3')>60),1,`...'substr($3,eval((60-len(`Space: $2: $3')-7)*-1),$3)
...................
...................')
That is, if the total string is too long (over 60 chars), I want to strip off enough characters (+3) off the left of $3 to make it fit, and add a "..." to show the missing characters.
I'm at a loss. I'm convinced it's either a quoting issue or a "time of execution" issue (whereby $2 and $3 are not what I think they are...)
To complicate matters: this macro is being called in another (I'd forgotten...)
define(`DOGROWTH',dnl
DOSPACE(`$1',`$2',`/path/to/some/$3/where')
DOSPACE(`$1',`$2',`/path/to/some/$3/else')
)
What about patsubst?
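To pin down the behaviour being asked for, independent of m4 quoting, here is the truncation rule from the question as a Python sketch. It reflects my reading of the requirement (total width of 60, three dots replacing the removed left part of the path); the function and parameter names are hypothetical.

```python
def fit_label(name: str, path: str, width: int = 60) -> str:
    """Build 'Space: name: path', trimming the left of path if the result is too long."""
    prefix = f"Space: {name}: "
    full = prefix + path
    if len(full) <= width:
        return full
    # Remove the excess plus three more characters to make room for the dots.
    cut = len(full) - width + 3
    return prefix + "..." + path[cut:]


long_path = "/path/to/some/" + "x" * 50 + "/where"
print(fit_label("db1", long_path))
print(fit_label("db1", "/path/to/some/short/where"))
```

Having the expected output for a few inputs makes it easier to tell whether the m4 version fails because of quoting or because of the arithmetic.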