qt - Find a substring in C++ but ignore diacritics - Stack Overflow


I'm trying to look for a function/class (in Qt or the standard library, preferably) that could help me find a part of a string that matches a search string, but ignores all diacritics.

For instance, when I have a string that looks like: "This is éval" and I look for "eval" or "Eval", it should return true.

bool contains(QString fullString, QString searchString)
{
    QCollator collator;
    collator.setCaseSensitivity(Qt::CaseInsensitive);
    collator.setIgnorePunctuation(true);
    for (int i = 0; i <= fullString.length() - searchString.length(); ++i) 
    {
        if (collator.compare(fullString.mid(i, searchString.length()), searchString) == 0) 
        {
            return true;
        }
    }
    return false;
}

At the moment, this doesn't yield the correct result. The only functions I found that do this kind of comparison work on whole strings, not on parts of them.

Edit

So thanks to the additional input on this page, I already came up with this example:

QString stripDiacritics(const QString& source, QString::NormalizationForm form)
{
    QString stripped;

    QString normalizedSource = source.normalized(form);
    for (auto chr : normalizedSource) {
        // strip diacritic marks
        if (chr.category() != QChar::Mark_NonSpacing && chr.category() != QChar::Mark_SpacingCombining) {
            stripped.append(chr);
        }
    }
    return stripped;
}

QString origin = QString::fromUtf8("üó ÐÈMØ");
QString form_D = stripDiacritics(origin, QString::NormalizationForm::NormalizationForm_D);

spdlog::info("origin: {}, form_d: {}", origin.toStdString(), form_D.toStdString());

Which then prints:

origin: üó ÐÈMØ, form_d: uo ÐEMØ

This is already fairly close to what I want to achieve, but additional help to remove, for instance, the stroke from the 'D' would still be really helpful.

Asked Jan 15 at 4:41 by Laurens Brock, edited Jan 16 at 10:47
  • Not quite an answer, but maybe you could use NFD or NFKD (decomposition) to decompose accented letters into the base letter and a separate accent, then remove all accents, and then do the search? – Eugene Ryabtsev Commented Jan 15 at 4:57
  • Unfortunately, the user that approved your post did not consider the whole thread in the Staging Ground, as the post is missing an important aspect: you should edit your post and clarify that you also want to filter out diacritics that, depending on the language, may not match the filter (eg: "Å" should not be considered as "A" in some languages, since it's a different letter). That said, as already noted in other comments to the SG post, there is no common library/function that can do that, since you want to consider language aware aspects. – musicamante Commented Jan 15 at 5:21
  • Also note that even if you manage to create your own "table" of rules and exceptions, that may not be consistent in case the general context (of the search or of the program) is inconsistent with a "special content", such as a word originally intended in another language. For instance, if the context is intended for an English usage, but it contains a word or person name that, in its original language, has a diacritic that should not be matched. As suggested by others in the SG, the only reliable way is to add a flag/option that eventually tells the filter to consider or ignores diacritics. – musicamante Commented Jan 15 at 5:21
  • It's unclear what you're trying to achieve with this. Still, just guessing the intention, I'd suggest you look at so-called "Unicode lookalikes". In short: Reduce the input strings to their lookalike base (e.g. strip accents) and then do the substring search. – Ulrich Eckhardt Commented Jan 15 at 11:14
  • @UlrichEckhardt The intention is to allow the user to search for text, but for usability's sake, they shouldn't be bothered with the diacritics. I'm looking for behavior similar to what your browser does. If you search this page for the letter 'd', it will highlight all variants. I'm looking for similar matching behavior. – Laurens Brock Commented Jan 16 at 10:42

2 Answers

If you need to ignore all diacritics unconditionally, you may try to normalize the string to its canonical decomposition form, and then filter out characters that belong to the non-spacing mark category, or possibly to the other two mark categories as well. While you are at it, you could also filter other categories such as punctuation, squash whitespace sequences, or apply other transformations.
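As a rough illustration of that unconditional approach, here is a minimal sketch in standard C++ only. It rests on several assumptions: the text has already been canonically decomposed (for example with QString::normalized(QString::NormalizationForm_D) followed by toStdU32String()), only the Combining Diacritical Marks block U+0300 to U+036F is treated as a mark, and case folding covers ASCII letters only. The names is_combining_mark, fold and contains_folded are made up for this example.

```cpp
#include <string>
#include <string_view>

// Assumption: only U+0300..U+036F counts as a diacritic here. A fuller
// version would test the Unicode mark categories instead of a fixed range.
bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Expects canonically decomposed (NFD) input. Drops combining marks and
// lowercases ASCII letters; every other code point is copied unchanged.
std::u32string fold(std::u32string_view decomposed) {
    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        if (is_combining_mark(c)) {
            continue;  // skip the diacritic, keep the base letter
        }
        if (c >= U'A' && c <= U'Z') {
            c += U'a' - U'A';  // ASCII-only case folding
        }
        out.push_back(c);
    }
    return out;
}

// Substring search on the folded forms of both strings.
bool contains_folded(std::u32string_view haystack, std::u32string_view needle) {
    return fold(haystack).find(fold(needle)) != std::u32string::npos;
}
```

Folding both strings first and then doing an ordinary substring search sidesteps the fixed-width window problem in the question: after decomposition, "é" occupies two code points, so a window of searchString.length() characters no longer lines up.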

If you need to remove some diacritics and keep others according to some locale-specific rules (for example, keep the acute accent but remove the dieresis in some languages), I'm afraid you need to compile your own table of rules.
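Such a table could take many shapes. One possible sketch, again in standard C++ and on already decomposed input, maps a language tag to the set of combining marks that must be kept. Everything here is hypothetical: the type KeepRules, the table example_rules, the function strip_marks_for, and the single Swedish entry, which is only meant to show the idea (å, ä and ö are separate letters in Swedish) and is not a vetted rule set.

```cpp
#include <map>
#include <set>
#include <string>
#include <string_view>

// Hypothetical rule table: language tag -> combining marks to KEEP.
using KeepRules = std::map<std::string, std::set<char32_t>>;

// Illustrative entry only: keep the ring above (U+030A) and the
// diaeresis (U+0308) for Swedish, so that å, ä and ö stay distinct.
const KeepRules example_rules = {
    {"sv", {U'\u030A', U'\u0308'}},
};

// Expects canonically decomposed (NFD) input. Marks outside U+0300..U+036F
// are not recognised by this sketch and pass through unchanged.
std::u32string strip_marks_for(std::u32string_view decomposed,
                               const std::string& lang,
                               const KeepRules& rules = example_rules) {
    static const std::set<char32_t> keep_none;
    const auto it = rules.find(lang);
    const std::set<char32_t>& keep =
        (it != rules.end()) ? it->second : keep_none;

    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t c : decomposed) {
        const bool is_mark = c >= 0x0300 && c <= 0x036F;
        if (is_mark && keep.count(c) == 0) {
            continue;  // drop marks that are not on the keep list
        }
        out.push_back(c);
    }
    return out;
}
```

A language without an entry falls back to stripping every mark in the range, which matches the unconditional behavior described earlier.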

Addendum. The Unicode standard does not consider the character 00D0 LATIN CAPITAL LETTER ETH a variant of 0044 LATIN CAPITAL LETTER D, nor does it define any relation between the two. They are just characters that look somewhat similar. Likewise, the character 00D8 LATIN CAPITAL LETTER O WITH STROKE is not a 004F LATIN CAPITAL LETTER O with a diacritical mark; it's its own thing. If you need special treatment for these and similar cases, you need to look outside of the Unicode standard and the libraries that implement it.
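Since normalization cannot help with these letters, one option is a small hand-written table applied after the marks have been stripped. The sketch below is only an assumption about what such a table might contain: the function name fold_lookalikes and the choice of mappings are made up for illustration, and whether "Ð" should match "D" at all is an application decision, not something Unicode defines.

```cpp
#include <map>
#include <string>
#include <string_view>

// Replaces a few letters that have no canonical decomposition with a
// look-alike ASCII spelling. The table is hand-picked and incomplete.
std::u32string fold_lookalikes(std::u32string_view text) {
    static const std::map<char32_t, std::u32string_view> table = {
        {U'\u00D0', U"D"},   // LATIN CAPITAL LETTER ETH
        {U'\u00F0', U"d"},   // LATIN SMALL LETTER ETH
        {U'\u00D8', U"O"},   // LATIN CAPITAL LETTER O WITH STROKE
        {U'\u00F8', U"o"},   // LATIN SMALL LETTER O WITH STROKE
        {U'\u0110', U"D"},   // LATIN CAPITAL LETTER D WITH STROKE
        {U'\u0111', U"d"},   // LATIN SMALL LETTER D WITH STROKE
        {U'\u0141', U"L"},   // LATIN CAPITAL LETTER L WITH STROKE
        {U'\u0142', U"l"},   // LATIN SMALL LETTER L WITH STROKE
        {U'\u00DF', U"ss"},  // LATIN SMALL LETTER SHARP S
    };

    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        const auto it = table.find(c);
        if (it != table.end()) {
            out += it->second;  // substitute the look-alike spelling
        } else {
            out.push_back(c);   // anything not in the table is kept as-is
        }
    }
    return out;
}
```

Applied to the "ÐEMØ" left over in the question's output, this table would give "DEMO". Note that a mapping such as "ß" to "ss" changes the string length, so match positions in the folded text no longer correspond directly to positions in the original.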

The shortest way, though not an optimized one, is:

  • normalize both the string and the search string, then compare them
  • the easiest way to compare characters is to compare their Unicode code points, because every code point has the same size

Here is a quick and dirty example:

#include <inttypes.h>
#include <stddef.h>

#include <iostream>
#include <string>
#include <string_view>

bool is_ignore_code(uint32_t code) {
    struct Range {
        uint32_t min;
        uint32_t max;
    };
    constexpr size_t range_size = 1;
    constexpr Range ranges[range_size] = {{0x303, 0x36F}};  // ranges of code points to ignore, usually combining marks
    constexpr size_t ignore_size = 1;
    constexpr uint32_t ignore_code[ignore_size] = {0x301};  // individual code points to ignore; 0x301 = Combining Acute Accent, which is hard to type, so a hex literal is used instead
    for (size_t i = 0; i < range_size; i++) {  // search the ranges first
        if (code >= ranges[i].min && code <= ranges[i].max) {
            return true;
        }
    }
    for (size_t i = 0; i < ignore_size; i++) {  // then the individual code points
        if (code == ignore_code[i]) {
            return true;
        }
    }
    return false;
}

char code_to_char(uint32_t code) {
    constexpr size_t lookup_size = 3;
    constexpr uint32_t lookup[lookup_size] = {U'T', U'é', U'E'};  // code points to look up
    constexpr char normal[lookup_size] = {'t', 'e', 'e'};  // normalized char stored at the same index

    for (size_t i = 0; i < lookup_size; ++i) {
        if (lookup[i] == code) {
            return normal[i];
        }
    }

    return 0;  // not found
}

int utf8_code_at(const char* data, size_t i, uint32_t& code) {  // simple UTF-8 decoder; returns the sequence length, or 0 when the lead byte is invalid
    int len = 0;

    if ((data[i] & 0x80) == 0) {
        len = 1;
        code = data[i];
    } else if ((data[i] & 0xE0) == 0xC0) {
        len = 2;
        code = ((data[i] & 0x1F) << 6) | (data[i + 1] & 0x3F);
    } else if ((data[i] & 0xF0) == 0xE0) {
        len = 3;
        code = ((data[i] & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6) |
               (data[i + 2] & 0x3F);
    } else if ((data[i] & 0xF8) == 0xF0) {
        len = 4;
        code = ((data[i] & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) |
               ((data[i + 2] & 0x3F) << 6) | (data[i + 3] & 0x3F);
    }

    return len;
}

void normalize_string(std::string_view from, std::string& to) {  // input must be well-formed UTF-8
    to.reserve(
        from.size());  // the normalized string is never longer than the original
    uint32_t code;
    for (size_t it = 0; it < from.length();) {
        auto len = utf8_code_at(from.data(), it, code);
        if (len == 0) {
            it++;  // can't decode, skip to the next byte
            continue;
        }
        if (is_ignore_code(code)) {  // code point is on the ignore list, skip it
            it += len;
            continue;
        }
        auto ch = code_to_char(code);
        if (ch == 0) {  // not in the lookup table: copy the original bytes
            to += from.substr(it, len);
        } else {
            to += ch;
        }
        it += len;  // go to the next code point
    }
}

int main() {
    std::string str{"This a is évalsß"};

    std::string str_n;

    normalize_string(str, str_n);

    std::string search{"Evál"};

    std::string search_n;

    normalize_string(search, search_n);

    std::cout << str << " >> " << str_n << "\n";
    std::cout << search << " >> " << search_n << "\n";

    std::cout << str << " =contain= " << search << " >> "
              << (str_n.find(search_n) != std::string::npos) << "\n";

    return 0;
}


This code works only when your string is well-formed UTF-8; it has no UTF-8 error handling.
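For input that may not be well-formed, the decoder would need bounds and continuation-byte checks. The following is one possible sketch, not a drop-in replacement: the name utf8_decode_checked is hypothetical, and it keeps the convention of utf8_code_at of returning the number of bytes consumed, with 0 meaning "could not decode".

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Decodes one code point starting at byte index i.
// Returns the number of bytes consumed, or 0 if the sequence is truncated,
// has a bad lead or continuation byte, is an overlong encoding, or encodes
// a surrogate or a value above U+10FFFF. On failure, code is unspecified.
int utf8_decode_checked(std::string_view s, std::size_t i, std::uint32_t& code) {
    if (i >= s.size()) {
        return 0;
    }
    const auto b0 = static_cast<unsigned char>(s[i]);
    int len = 0;
    std::uint32_t min_code = 0;  // smallest value allowed for this length
    if (b0 < 0x80) {
        code = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        len = 2; code = b0 & 0x1F; min_code = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        len = 3; code = b0 & 0x0F; min_code = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        len = 4; code = b0 & 0x07; min_code = 0x10000;
    } else {
        return 0;  // stray continuation byte or invalid lead byte
    }
    if (s.size() - i < static_cast<std::size_t>(len)) {
        return 0;  // sequence runs past the end of the input
    }
    for (int k = 1; k < len; ++k) {
        const auto b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) {
            return 0;  // not a continuation byte
        }
        code = (code << 6) | (b & 0x3Fu);
    }
    if (code < min_code) return 0;                   // overlong encoding
    if (code > 0x10FFFF) return 0;                   // outside Unicode
    if (code >= 0xD800 && code <= 0xDFFF) return 0;  // surrogate
    return len;
}
```

Because a return value of 0 still means "skip one byte and continue", the loop in normalize_string could call this with the string_view itself instead of from.data() without other changes.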
