How to remove accents / diacritic marks from a string in Qt? - string

How to remove diacritic marks from a string in Qt. For example, this:
QString test = QString::fromUtf8("éçàÖœ");
qDebug() << StringUtil::removeAccents(test);
should output:
ecaOoe

There is not straighforward, built-in solution in Qt. A simple solution, which should work in most cases, is to loop through the string and replace each character by their equivalent:
QString StringUtil::diacriticLetters_;
QStringList StringUtil::noDiacriticLetters_;
QString StringUtil::removeAccents(QString s) {
if (diacriticLetters_.isEmpty()) {
diacriticLetters_ = QString::fromUtf8("ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ");
noDiacriticLetters_ << "S"<<"OE"<<"Z"<<"s"<<"oe"<<"z"<<"Y"<<"Y"<<"u"<<"A"<<"A"<<"A"<<"A"<<"A"<<"A"<<"AE"<<"C"<<"E"<<"E"<<"E"<<"E"<<"I"<<"I"<<"I"<<"I"<<"D"<<"N"<<"O"<<"O"<<"O"<<"O"<<"O"<<"O"<<"U"<<"U"<<"U"<<"U"<<"Y"<<"s"<<"a"<<"a"<<"a"<<"a"<<"a"<<"a"<<"ae"<<"c"<<"e"<<"e"<<"e"<<"e"<<"i"<<"i"<<"i"<<"i"<<"o"<<"n"<<"o"<<"o"<<"o"<<"o"<<"o"<<"o"<<"u"<<"u"<<"u"<<"u"<<"y"<<"y";
}
QString output = "";
for (int i = 0; i < s.length(); i++) {
QChar c = s[i];
int dIndex = diacriticLetters_.indexOf(c);
if (dIndex < 0) {
output.append(c);
} else {
QString replacement = noDiacriticLetters_[dIndex];
output.append(replacement);
}
}
return output;
}
Note that noDiacriticLetters_ needs to be a QStringList since some characters with diacritic marks can match to two single characters. For example œ => oe

Your question is a bit misleading. You seem to want to do more than merely remove diacritical marks (œ is a ligature letter without diacritics). I guess you want to turn any Unicode string into a roughly corresponding ASCII string?
For diacritics, you could perform a decomposing Unicode normalization (NFD or NFKD, depending on your specific needs) and then remove all characters of the "Mark" categories (QChar::Mark_NonSpacing, QChar::Mark_SpacingCombining and QChar::Mark_Enclosing).
For everything else (e.g. œ), I don't know of a generic solution. Create a look-up table with all your desired replacements and then search and replace (see Laurent's answer).

A partial solution is to use QString::normalized, than remove the special characters.
QString test = QString::fromUtf8("éçàÖœ");
QString stringNormalized = test.normalized (QString::NormalizationForm_KD);
stringNormalized.remove(QRegExp("[^a-zA-Z\\s]"));
This is however a partial solution because it will not convert "œ" into "oe".

There is crude way to partial solve your problem (just accents, but not ligatures such as "oe").
QString title=QString::fromUtf8("éçàÖ");
qDebug("%s\n", title.toLocal8Bit().data());

Related

How to set maximum length in String value DART

i am trying to set maximum length in string value and put '..' instead of removed chrs like following
String myValue = 'Welcome'
now i need the maximum length is 4 so output like following
'welc..'
how can i handle this ? thanks
The short and incorrect version is:
String abbrevBad(String input, int maxlength) {
if (input.length <= maxLength) return input;
return input.substring(0, maxLength - 2) + "..";
}
(Using .. is not the typographical way to mark an elision. That takes ..., the "ellipsis" symbol.)
A more internationally aware version would count grapheme clusters instead of code units, so it handles complex characters and emojis as a single character, and doesn't break in the middle of one. Might also use the proper ellipsis character.
String abbreviate(String input, int maxLength) {
var it = input.characters.iterator;
for (var i = 0; i <= maxLength; i++) {
if (!it.expandNext()) return input;
}
it.dropLast(2);
return "${it.current}\u2026";
}
That also works for characters which are not single code units:
void main() {
print(abbreviate("argelbargle", 7)); // argelb…
print(abbreviate("🇩🇰🇩🇰🇩🇰🇩🇰🇩🇰", 4)); // 🇩🇰🇩🇰🇩🇰…
}
(If you want to use ... instead of …, just change .dropLast(2) to .dropLast(4) and "…" to "...".)
You need to use RichText and you need to specify the overflow type, just like this:
Flexible(
child: RichText("Very, very, very looong text",
overflow: TextOverflow.ellipsis,
),
);
If the Text widget overflows, some points (...) will appears.

Replacing the number in a string

if my string is lets say "Alfa1234Beta"
how can I convert all the number in to "_"
for example "Alfa1234Beta"
will be "Alfa____Beta"
Going with the Regex approach pointed out by others is possibly OK for your scenario. Mind you however, that Regex sometimes tend to be overused. A hand rolled approach could be like this:
static string ReplaceDigits(string str)
{
StringBuilder sb = null;
for (int i = 0; i < str.Length; i++)
{
if (Char.IsDigit(str[i]))
{
if (sb == null)
{
// Seen a digit, allocate StringBuilder, copy non-digits we might have skipped over so far.
sb = new StringBuilder();
if (i > 0)
{
sb.Append(str, 0, i);
}
}
// Replace current character (a digit)
sb.Append('_');
}
else
{
if (sb != null)
{
// Seen some digits (being replaced) already. Collect non-digits as well.
sb.Append(str[i]);
}
}
}
if (sb != null)
{
return sb.ToString();
}
return str;
}
It is more light weight than Regex and only allocates when there is actually something to do (replace). So, go ahead use the Regex version if you like. If you figure out during profiling that is too heavy weight, you can use something like the above. YMMV
You can run for loop on the string and then use the following method to replace numbers with _
if (!System.Text.RegularExpressions.Regex.IsMatch(i, "^[0-9]*$"))
Here variable i is the character in the for loop .
You can use this:
var s = "Alfa1234Beta";
var s2 = System.Text.RegularExpressions.Regex.Replace(s, "[0-9]", "_");
s2 now contains "Alfa____Beta".
Explanation: the regex [0-9] matches any digit from 0 to 9 (inclusive). The Regex.Replace then replaces all matched characters with an "_".
EDIT
And if you want it a bit shorter AND also match non-latin digits, use \d as a regex:
var s = "Alfa1234Beta๓"; // ๓ is "Thai digit three"
var s2 = System.Text.RegularExpressions.Regex.Replace(s, #"\d", "_");
s2 now contains "Alfa____Beta_".

D: how to remove last char in string?

I need to remove last char in string in my case it's comma (","):
foreach(line; fcontent.splitLines)
{
string row = line.split.map!(a=>format("'%s', ", a)).join;
writeln(row.chop.chop);
}
I have found only one way - to call chop two times. First remove \r\n and second remove last char.
Is there any better ways?
import std.array;
if (!row.empty)
row.popBack();
As it usually happens with string processing, it depends on how much Unicode do you care about.
If you only work with ASCII it is very simple:
import std.encoding;
// no "nice" ASCII literals, D really encourages Unicode
auto str1 = cast(AsciiString) "abcde";
str1 = str1[0 .. $-1]; // get slice of everything but last byte
auto str2 = cast(AsciiString) "abcde\n\r";
str2 = str2[0 .. $-3]; // same principle
In "last char" actually means unicode code point (http://unicode.org/glossary/#code_point) it gets a bit more complicated. Easy way is to just rely on D automatic decoding and algorithms:
import std.range, std.stdio;
auto range = "кириллица".retro.drop(1).retro();
writeln(range);
Here retro (http://dlang.org/phobos/std_range.html#.retro) is a lazy reverse iteration function. It takes any range (unicode string is a valid range) and returns wrapper that is capable of iterating it backwards.
drop (http://dlang.org/phobos/std_range.html#.drop) simply pops a single range element and ignores it. Calling retro again will reverse the iteration order back to normal, but now with the last element dropped.
Reason why it is different from ASCII version is because of nature of Unicode (specifically UTF-8 which D defaults to) - it does not allow random access to any code point. You actually need to decode them all one by one to get to any desired index. Fortunately, D takes care of all decoding for you hiding it behind convenient range interface.
For those who want even more Unicode correctness, it should be possible to operate on graphemes (http://unicode.org/glossary/#grapheme):
import std.range, std.uni, std.stdio;
auto range = "abcde".byGrapheme.retro.drop(1).retro();
writeln(range);
Sadly, looks like this specific pattern is not curently supported because of bug in Phobos. I have created an issue about it : https://issues.dlang.org/show_bug.cgi?id=14394
NOTE: Updated my answer to be a bit cleaner and removed the lambda function in 'map!' as it was a little ugly.
import std.algorithm, std.stdio;
import std.string;
void main(){
string fcontent = "I am a test\nFile\nwith some,\nCommas here and\nthere,\n";
auto data = fcontent
.splitLines
.map!(a => a.replaceLast(","))
.join("\n");
writefln("%s", data);
}
auto replaceLast(string line, string toReplace){
auto o = line.lastIndexOf(toReplace);
return o >= 0 ? line[0..o] : line;
}
module main;
import std.stdio : writeln;
import std.string : lineSplitter, join;
import std.algorithm : map, splitter, each;
enum fcontent = "some text\r\nnext line\r\n";
void main()
{
fcontent.lineSplitter.map!(a=>a.splitter(' ')
.map!(b=>"'" ~ b ~ "'")
.join(", "))
.each!writeln;
}
Take a look, I use this extension method to replace any last character or sub-string, for example:
string testStr = "Happy holiday!";<br>
Console.Write(testStr.ReplaceVeryLast("holiday!", "Easter!"));
public static class StringExtensions
{
public static string ReplaceVeryLast(this string sStr, string sSearch, string sReplace = "")
{
int pos = 0;
sStr = sStr.Trim();
do
{
pos = sStr.LastIndexOf(sSearch, StringComparison.CurrentCultureIgnoreCase);
if (pos >= 0 && pos + sSearch.Length == sStr.Length)
sStr = sStr.Substring(0, pos) + sReplace;
} while (pos == (sStr.Length - sSearch.Length + 1));
return sStr;
}
}

Need a program to reverse the words in a string

I asked this question in a few interviews. I want to know from the Stackoverflow readers as to what should be the answer to this question.
Such a seemingly simple question, but has been interpreted quite a few different ways.
if your definition of a "word" is a series of non-whitespace characters surrounded by a whitespace character, then in 5 second pseudocode you do:
var words = split(inputString, " ")
var reverse = new array
var count = words.count -1
var i = 0
while count != 0
reverse[i] = words[count]
count--
i++
return reverse
If you want to take into consideration also spaces, you can do it like that:
string word = "hello my name is";
string result="";
int k=word.size();
for (int j=word.size()-1; j>=0; j--)
{
while(word[j]!= ' ' && j>=0)
j--;
int end=k;
k=j+1;
int count=0;
if (j>=0)
{
int temp=j;
while (word[temp]==' '){
count++;
temp--;
}
j-=count;
}
else j=j+1;
result+=word.substr(k,end-k);
k-=count;
while(count!=0)
{
result+=' ';
count--;
}
}
It will print out for you "is name my hello"
Taken from something called "Hacking a Google Interview" that was somewhere on my computer ... don't know from where I got it but I remember I saw this exact question inside ... here is the answer:
Reverse the string by swapping the
first character with the last
character, the second with the
second-to-last character, and so on.
Then, go through the string looking
for spaces, so that you find where
each of the words is. Reverse each of
the words you encounter by again
swapping the first character with the
last character, the second character
with the second-to-last character, and
so on.
This came up in LessThanDot Programmer Puzzles
#include<stdio.h>
void reverse_word(char *,int,int);
int main()
{
char s[80],temp;
int l,i,k;
int lower,upper;
printf("Enter the ssentence\n");
gets(s);
l=strlen(s);
printf("%d\n",l);
k=l;
for(i=0;i<l;i++)
{
if(k<=i)
{temp=s[i];
s[i]=s[l-1-i];
s[l-1-i]=temp;}
k--;
}
printf("%s\n",s);
lower=0;
upper=0;
for(i=0;;i++)
{
if(s[i]==' '||s[i]=='\0')
{upper=i-1;
reverse_word(s,lower,upper);
lower=i+1;
}
if(s[i]=='\0')
break;
}
printf("%s",s);
return 0;
}
void reverse_word(char *s,int lower,int upper)
{
char temp;
//int i;
while(upper>lower)
{
temp=s[lower];
s[lower]=s[upper];
s[upper]=temp;
upper=upper-1;
lower=lower+1;
}
}
The following code (C++) will convert a string this is a test to test a is this:
string reverseWords(string str)
{
string result = "";
vector<string> strs;
stringstream S(str);
string s;
while (S>>s)
strs.push_back(s);
reverse(strs.begin(), strs.end());
if (strs.size() > 0)
result = strs[0];
for(int i=1; i<strs.size(); i++)
result += " " + strs[i];
return result;
}
PS: it's actually a google code jam question, more info can be found here.

Format string to title case

How do I format a string to title case?
Here is a simple static method to do this in C#:
public static string ToTitleCaseInvariant(string targetString)
{
return System.Threading.Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase(targetString);
}
I would be wary of automatically upcasing all whitespace-preceded-words in scenarios where I would run the risk of attracting the fury of nitpickers.
I would at least consider implementing a dictionary for exception cases like articles and conjunctions. Behold:
"Beauty and the Beast"
And when it comes to proper nouns, the thing gets much uglier.
Here's a Perl solution http://daringfireball.net/2008/05/title_case
Here's a Ruby solution http://frankschmitt.org/projects/title-case
Here's a Ruby one-liner solution: http://snippets.dzone.com/posts/show/4702
'some string here'.gsub(/\b\w/){$&.upcase}
What the one-liner is doing is using a regular expression substitution of the first character of each word with the uppercase version of it.
To capatilise it in, say, C - use the ascii codes (http://www.asciitable.com/) to find the integer value of the char and subtract 32 from it.
This is a poor solution if you ever plan to accept characters beyond a-z and A-Z.
For instance: ASCII 134: å, ASCII 143: Å.
Using arithmetic gets you: ASCII 102: f
Use library calls, don't assume you can use integer arithmetic on your characters to get back something useful. Unicode is tricky.
In Silverlight there is no ToTitleCase in the TextInfo class.
Here's a simple regex based way.
Note: Silverlight doesn't have precompiled regexes, but for me this performance loss is not an issue.
public string TitleCase(string str)
{
return Regex.Replace(str, #"\w+", (m) =>
{
string tmp = m.Value;
return char.ToUpper(tmp[0]) + tmp.Substring(1, tmp.Length - 1).ToLower();
});
}
In Perl:
$string =~ s/(\w+)/\u\L$1/g;
That's even in the FAQ.
If the language you are using has a supported method/function then just use that (as in the C# ToTitleCase method)
If it does not, then you will want to do something like the following:
Read in the string
Take the first word
Capitalize the first letter of that word 1
Go forward and find the next word
Go to 3 if not at the end of the string, otherwise exit
1 To capitalize it in, say, C - use the ascii codes to find the integer value of the char and subtract 32 from it.
There would need to be much more error checking in the code (ensuring valid letters etc.), and the "Capitalize" function will need to impose some sort of "title-case scheme" on the letters to check for words that do not need to be capatilised ('and', 'but' etc. Here is a good scheme)
In what language?
In PHP it is:
ucwords()
example:
$HelloWorld = ucwords('hello world');
In Java, you can use the following code.
public String titleCase(String str) {
char[] chars = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (i == 0) {
chars[i] = Character.toUpperCase(chars[i]);
} else if ((i + 1) < chars.length && chars[i] == ' ') {
chars[i + 1] = Character.toUpperCase(chars[i + 1]);
}
}
return new String(chars);
}
Excel-like PROPER:
public static string ExcelProper(string s) {
bool upper_needed = true;
string result = "";
foreach (char c in s) {
bool is_letter = Char.IsLetter(c);
if (is_letter)
if (upper_needed)
result += Char.ToUpper(c);
else
result += Char.ToLower(c);
else
result += c;
upper_needed = !is_letter;
}
return result;
}
http://titlecase.com/ has an API
There is a built-in formula PROPER(n) in Excel.
Was quite pleased to see I didn't have to write it myself!
Here's an implementation in Python: https://launchpad.net/titlecase.py
And a port of this implementation that I've just done in C++: http://codepad.org/RrfcsZzO
Here is a simple example of how to do it :
public static string ToTitleCaseInvariant(string str)
{
return System.Threading.Thread.CurrentThread.CurrentCulture.TextInfo.ToTitleCase(str);
}
I think using the CultureInfo is not always reliable, this the simple and handy way to manipulate string manually:
string sourceName = txtTextBox.Text.ToLower();
string destinationName = sourceName[0].ToUpper();
for (int i = 0; i < (sourceName.Length - 1); i++) {
if (sourceName[i + 1] == "") {
destinationName += sourceName[i + 1];
}
else {
destinationName += sourceName[i + 1];
}
}
txtTextBox.Text = desinationName;
In C#
using System.Globalization;
using System.Threading;
protected void Page_Load(object sender, EventArgs e)
{
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
Response.Write(textInfo.ToTitleCase("WelcometoHome<br />"));
Response.Write(textInfo.ToTitleCase("Welcome to Home"));
Response.Write(textInfo.ToTitleCase("Welcome#to$home<br/>").Replace("#","").Replace("$", ""));
}
In C# you can simply use
CultureInfo.InvariantCulture.TextInfo.ToTitleCase(str.ToLowerInvariant())
Invariant
Works with uppercase strings
Without using a ready-made function, a super-simple low-level algorithm to convert a string to title case:
convert first character to uppercase.
for each character in string,
if the previous character is whitespace,
convert character to uppercase.
This asssumes the "convert character to uppercase" will do that correctly regardless of whether or not the character is case-sensitive (e.g., '+').
Here you have a C++ version. It's got a set of non uppercaseable words like prononuns and prepositions. However, I would not recommend automating this process if you are to deal with important texts.
#include <iostream>
#include <string>
#include <vector>
#include <cctype>
#include <set>
using namespace std;
typedef vector<pair<string, int> > subDivision;
set<string> nonUpperCaseAble;
subDivision split(string & cadena, string delim = " "){
subDivision retorno;
int pos, inic = 0;
while((pos = cadena.find_first_of(delim, inic)) != cadena.npos){
if(pos-inic > 0){
retorno.push_back(make_pair(cadena.substr(inic, pos-inic), inic));
}
inic = pos+1;
}
if(inic != cadena.length()){
retorno.push_back(make_pair(cadena.substr(inic, cadena.length() - inic), inic));
}
return retorno;
}
string firstUpper (string & pal){
pal[0] = toupper(pal[0]);
return pal;
}
int main()
{
nonUpperCaseAble.insert("the");
nonUpperCaseAble.insert("of");
nonUpperCaseAble.insert("in");
// ...
string linea, resultado;
cout << "Type the line you want to convert: " << endl;
getline(cin, linea);
subDivision trozos = split(linea);
for(int i = 0; i < trozos.size(); i++){
if(trozos[i].second == 0)
{
resultado += firstUpper(trozos[i].first);
}
else if (linea[trozos[i].second-1] == ' ')
{
if(nonUpperCaseAble.find(trozos[i].first) == nonUpperCaseAble.end())
{
resultado += " " + firstUpper(trozos[i].first);
}else{
resultado += " " + trozos[i].first;
}
}
else
{
resultado += trozos[i].first;
}
}
cout << resultado << endl;
getchar();
return 0;
}
With perl you could do this:
my $tc_string = join ' ', map { ucfirst($\_) } split /\s+/, $string;

Resources