I'm trying to look for a function/class (in Qt or the standard library, preferably) that could help me find a part of a string that matches a search string, but ignores all diacritics.
For instance, when I have a string that looks like: "This is éval" and I look for "eval" or "Eval", it should return true.
bool contains(QString fullString, QString searchString)
{
QCollator collator;
collator.setCaseSensitivity(Qt::CaseInsensitive);
collator.setIgnorePunctuation(true);
for (int i = 0; i <= fullString.length() - searchString.length(); ++i)
{
if (collator.compare(fullString.mid(i, searchString.length()), searchString) == 0)
{
return true;
}
}
return false;
}
At the moment, this doesn't yield the correct result. The only functions I found that do these comparisons only do this kind of comparing on full strings, not on parts of it.
Edit
So thanks to the additional input on this page, I already came up with this example:
QString stripDiacritics(const QString& source, QString::NormalizationForm form)
{
QString stripped;
QString normalizedSource = source.normalized(form);
for (auto chr : normalizedSource) {
// strip diacritic marks
if (chr.category() != QChar::Mark_NonSpacing && chr.category() != QChar::Mark_SpacingCombining) {
stripped.append(chr);
}
}
return stripped;
}
QString origin = QString::fromUtf8("üó ÐÈMØ");
QString form_D = stripDiacritics(origin, QString::NormalizationForm::NormalizationForm_D);
spdlog::info("origin: {}, form_d: {}", origin.toStdString(), form_D.toStdString());
Which then prints:
origin: üó ÐÈMØ, form_d: uo ÐEMØ
This is already fairly close to what I want to achieve, but additional help to remove, for instance, the stroke from the 'D' would still be really helpful.
If you need to ignore all diacritics unconditionally, you may try to normalize the string to its canonical decomposition form, and then filter out characters that belong to the non-spacing mark category or possibly to other two mark categories as well. And, while you are at it, possibly to other categories such as punctuation; maybe also squash whitespace sequences or do other transformations.
If you need to remove some diacritics and keep others according to some locale-specific rules (for example, keep the acute accent but remove the dieresis in some languages), I'm afraid you need to compile your own table of rules.
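As a rough illustration of what such a table of rules might look like, here is a minimal sketch. It assumes the text has already been normalized to canonical decomposition form and converted to UTF-32 (for example with `QString::normalized` and `QString::toStdU32String`), and it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as marks, which is a simplification. The names `stripMarksExcept` and `keep` are made up for this example.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical helper: removes combining marks from an already-decomposed
// (NFD) string, except for the marks listed in `keep`.
// Assumption: only U+0300..U+036F (Combining Diacritical Marks) count as
// marks here; a complete solution would consult the Unicode category.
std::u32string stripMarksExcept(std::u32string_view decomposed,
                                const std::unordered_set<char32_t>& keep)
{
    std::u32string result;
    result.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        const bool isMark = c >= 0x0300 && c <= 0x036F;
        // Copy every non-mark, and every mark the locale wants to keep.
        if (!isMark || keep.count(c) != 0) {
            result.push_back(c);
        }
    }
    return result;
}
```

A locale that keeps the acute accent but drops the dieresis would pass `{0x0301}` as `keep`; the per-locale sets themselves still have to be compiled by hand.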
Addendum. The Unicode standard does not consider the character 00D0 LATIN CAPITAL LETTER ETH a variant of 0044 LATIN CAPITAL LETTER D, nor does it define any relation between the two. They are just characters that look somewhat similar. Likewise, the character 00D8 LATIN CAPITAL LETTER O WITH STROKE is not a 004F LATIN CAPITAL LETTER O with a diacritical mark; it is its own thing. If you need special treatment for these and similar cases, you need to look outside of the Unicode standard and the libraries that implement it.
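For cases like these, one option is a small hand-written substitution table applied after the combining marks have been stripped. The sketch below is only an illustration: the table is deliberately incomplete, the choice of replacements is an application decision rather than anything Unicode defines, and the names `foldLookalike` and `foldLookalikes` are hypothetical.

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical, deliberately incomplete table of letters that have no
// Unicode decomposition but that an application may want to treat as
// their look-alike base letter. Returns `c` unchanged if it is not listed.
char32_t foldLookalike(char32_t c)
{
    static const std::unordered_map<char32_t, char32_t> table{
        {U'\u00D0', U'D'}, // LATIN CAPITAL LETTER ETH
        {U'\u00F0', U'd'}, // LATIN SMALL LETTER ETH
        {U'\u00D8', U'O'}, // LATIN CAPITAL LETTER O WITH STROKE
        {U'\u00F8', U'o'}, // LATIN SMALL LETTER O WITH STROKE
        {U'\u0141', U'L'}, // LATIN CAPITAL LETTER L WITH STROKE
        {U'\u0142', U'l'}, // LATIN SMALL LETTER L WITH STROKE
    };
    const auto it = table.find(c);
    return it != table.end() ? it->second : c;
}

// Applies foldLookalike to every code point of a UTF-32 string.
std::u32string foldLookalikes(std::u32string_view text)
{
    std::u32string result;
    result.reserve(text.size());
    for (char32_t c : text) {
        result.push_back(foldLookalike(c));
    }
    return result;
}
```

With Qt, the input could come from `QString::toStdU32String()` and the result could be converted back with `QString::fromStdU32String()`.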
Here is a quick and dirty example. It is the shortest way I can think of, but it is not optimized:
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>
bool is_ignore_code(uint32_t code) {
struct Range {
uint32_t min;
uint32_t max;
};
constexpr size_t range_size = 1;
constexpr Range ranges[range_size] = {{0x303, 0x36F}}; // ranges of code points to ignore, usually combining marks
constexpr size_t ignore_size = 1;
constexpr uint32_t ignore_code[ignore_size] = {0x301}; // individual code points to ignore; 0x301 = COMBINING ACUTE ACCENT, written as a hex literal because it is hard to type
for (int i = 0; i < range_size; i++) { // search in range first
if (code >= ranges[i].min && code <= ranges[i].max) {
return true;
}
}
for (int i = 0; i < ignore_size; i++) {
if (code == ignore_code[i]) {
return true;
}
}
return false;
}
char code_to_char(uint32_t code) {
constexpr size_t lookup_size = 3;
constexpr uint32_t lookup[lookup_size] = {L'T', L'é', L'E'}; // store unicode for look up
constexpr char normal[lookup_size] = {'t', 'e', 'e'}; // store normalized char for look up index
for (int i = 0; i < lookup_size; ++i) {
if (lookup[i] == code) {
return normal[i];
}
}
return 0; // not found
}
int utf8_code_at(const char* data, size_t i, uint32_t& code) { // simple UTF-8 decoder; returns the sequence length in bytes, or 0 if the lead byte is invalid
int len = 0;
if ((data[i] & 0x80) == 0) {
len = 1;
code = data[i];
} else if ((data[i] & 0xE0) == 0xC0) {
len = 2;
code = ((data[i] & 0x1F) << 6) | (data[i + 1] & 0x3F);
} else if ((data[i] & 0xF0) == 0xE0) {
len = 3;
code = ((data[i] & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6) |
(data[i + 2] & 0x3F);
} else if ((data[i] & 0xF8) == 0xF0) {
len = 4;
code = ((data[i] & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) |
((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F);
}
return len;
}
void normalize_string(std::string_view from, std::string& to) { // must be utf8
to.reserve(
from.size()); // the normalized string is never longer than the original
uint32_t code;
for (int it = 0; it < from.length();) {
auto len = utf8_code_at(from.data(), it, code);
if (len == 0) {
it++; // can't decode go next char
continue;
}
if (is_ignore_code(code)) { // code is in ignore list go to next code
it += len;
continue;
}
auto ch = code_to_char(code);
if (ch == 0) { // not found = copy original to destination
to += from.substr(it, len);
} else {
to += ch;
}
it += len; //go to next code
}
}
int main() {
std::string str{"This a is évalsß"};
std::string str_n;
normalize_string(str, str_n);
std::string search{"Evál"};
std::string search_n;
normalize_string(search, search_n);
std::cout << str << " >> " << str_n << "\n";
std::cout << search << " >> " << search_n << "\n";
std::cout << str << " =contain= " << search << " >> "
<< (str_n.find(search_n) != std::string::npos) << "\n";
return 0;
}
godbolt
This code works only when your string is well-formed UTF-8. It has no error handling for malformed UTF-8. See also: utf8 error handling
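For reference, a decoder with basic error handling could look roughly like the sketch below. It is not part of the example above: the name `decodeUtf8At` is hypothetical, and the checks (truncation, bad continuation bytes, overlong encodings, surrogates, values above U+10FFFF) are the usual well-formedness rules as I understand them, not a full validator.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for utf8_code_at: decodes the code point starting at
// byte `i` and returns its length in bytes, or 0 if the bytes at `i` are not
// a well-formed UTF-8 sequence. `code` is only written on success.
std::size_t decodeUtf8At(std::string_view data, std::size_t i, std::uint32_t& code)
{
    if (i >= data.size()) return 0;
    const auto byte = [&](std::size_t k) {
        return static_cast<std::uint8_t>(data[k]);
    };
    const std::uint8_t lead = byte(i);
    std::size_t len = 0;
    std::uint32_t value = 0;
    std::uint32_t minValue = 0; // smallest code point allowed for this length
    if (lead < 0x80)                { len = 1; value = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; value = lead & 0x1Fu; minValue = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; value = lead & 0x0Fu; minValue = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; value = lead & 0x07u; minValue = 0x10000; }
    else return 0;                         // stray continuation or invalid lead byte
    if (len > data.size() - i) return 0;   // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
        const std::uint8_t next = byte(i + k);
        if ((next & 0xC0) != 0x80) return 0;  // not a continuation byte
        value = (value << 6) | (next & 0x3Fu);
    }
    if (value < minValue) return 0;                    // overlong encoding
    if (value >= 0xD800 && value <= 0xDFFF) return 0;  // UTF-16 surrogate
    if (value > 0x10FFFF) return 0;                    // beyond Unicode range
    code = value;
    return len;
}
```

A caller such as `normalize_string` could then skip one byte, or stop with an error, whenever the function returns 0.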