Search within e-mails and other applications

ATSrl 2025-06-15 17:01:17 UTC #1

Hi all,
Sorry for the banal question, but searching within emails, the address book and generally in any EGW application allows you to use a string, but it doesn’t seem to allow sophisticated searches, such as searching for item1 and item2, for example, in the subject line of the email regardless of the order and whether there are additional strings between them.
Filters, on the other hand, allow sophisticated searches using regex.
Is there any way to perform more advanced searches?
Suggestions are welcome.

Gabriele

RalfBecker 2025-06-16 06:06:59 UTC #2

All EGroupware apps, with the exception of mail (!), use the following search-syntax:
A B: contains A or contains B
A or B: contains A or contains B
A and B: contains A and contains B
A +B: contains A and contains B
A -B: contains A, but NOT B
(This also works with more than 2 arguments.)
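To make the rules above concrete, here is a minimal Python sketch of how such a query could be evaluated against a piece of text. It is not EGroupware code: the function names are invented, matching is assumed to be case-insensitive substring matching, and connectors are assumed to apply strictly left to right without operator precedence.

```python
def parse_query(query):
    """Split a query into (connector, term) pairs.

    The connector is 'or', 'and' or 'not' and says how the term is
    combined with everything to its left (assumed semantics).
    """
    pairs = []
    pending = "or"  # plain whitespace means OR
    for raw in query.split():
        lowered = raw.lower()
        if lowered in ("and", "or"):
            pending = lowered
            continue
        if raw.startswith("+") and len(raw) > 1:
            pairs.append(("and", raw[1:]))
        elif raw.startswith("-") and len(raw) > 1:
            pairs.append(("not", raw[1:]))
        else:
            pairs.append((pending, raw))
        pending = "or"
    return pairs


def matches(query, text):
    """Return True if text satisfies the query, folding left to right."""
    haystack = text.lower()
    result = None
    for connector, term in parse_query(query):
        hit = term.lower() in haystack
        if result is None:
            result = (not hit) if connector == "not" else hit
        elif connector == "and":
            result = result and hit
        elif connector == "not":
            result = result and not hit
        else:
            result = result or hit
    return bool(result)
```

For example, `matches("alpha -beta", "alpha only")` is true, while `matches("alpha and beta", "just alpha")` is false.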

Mail is the exception, as whatever you type into the search field is sent directly to the IMAP server. I’m not aware that IMAP allows for more sophisticated combinations of search patterns.

Ralf

ATSrl 2026-01-10 17:29:47 UTC #3

Hi Ralf,
Thank you very much for your prompt and helpful response. As always, your support and the transparency of EGroupware’s architecture are greatly appreciated. We took some time to analyze the code more deeply, and while we are certainly not experts we’d like to share some observations.

Looking at the createIMAPFilter() function in api/src/Mail.php, it appears that the search string is not passed directly to the IMAP server as a raw query, but rather gets wrapped into a specific search criterion based on the selected search type.
For example, when we search for “hospital project” with the Subject filter, the actual IMAP query becomes:
SUBJECT "hospital project"
This means the IMAP server searches for emails containing the exact substring “hospital project” in the subject line.

The practical limitation:
This creates some challenges in daily operations, especially with large mailboxes:

Exact substring matching: If I’m looking for emails about a hospital project but I type “Hospita” (missing the final “l”), I won’t find “Hospital” - the search requires the exact substring.
Word order matters: Searching for “New hospital” won’t find an email with subject “Hospital new project” because IMAP substring matching is sequential.
No logical operators: We cannot search for emails that contain “hospital” AND “budget” in the subject, or emails from “mario” OR “luigi”. The search is limited to a single substring.

Technical observation:
We verified that our IMAP server (Dovecot) fully supports RFC 3501 search capabilities, including:

OR SUBJECT "term1" SUBJECT "term2"
SUBJECT "term1" FROM "term2" (implicit AND)
NOT SEEN
Combined criteria

We also noticed that Horde_Imap_Client_Search_Query (which EGroupware uses) does support these advanced queries through methods like orSearch(), andSearch(), and flag(). In fact, the quicksearch already uses orSearch() to combine SUBJECT and FROM criteria.
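As an illustration of how the RFC 3501 criteria listed above compose, here is a small Python sketch that renders AND, OR and NOT combinations as IMAP SEARCH keys. It is independent of Horde and EGroupware: all function names are hypothetical, and it assumes ASCII terms that can be sent as quoted strings (non-ASCII terms would need a CHARSET and literals).

```python
def quoted(value):
    """Render an IMAP quoted string, escaping backslash and double quote."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def search_key(field, term):
    """A single criterion such as SUBJECT "term"."""
    return f"{field.upper()} {quoted(term)}"


def negate(key):
    """NOT applies to exactly one search key."""
    return f"NOT {key}"


def all_of(keys):
    """AND is implicit: keys are simply listed. Parentheses keep the
    list usable as a single operand of OR or NOT."""
    keys = list(keys)
    if not keys:
        raise ValueError("at least one search key is required")
    if len(keys) == 1:
        return keys[0]
    return "(" + " ".join(keys) + ")"


def any_of(keys):
    """OR takes exactly two operands, so longer lists are nested."""
    keys = list(keys)
    if not keys:
        raise ValueError("at least one search key is required")
    expression = keys[-1]
    for key in reversed(keys[:-1]):
        expression = f"OR {key} {expression}"
    return expression
```

With these helpers, `any_of([search_key("subject", "hospital"), search_key("subject", "budget")])` yields `OR SUBJECT "hospital" SUBJECT "budget"`. The parenthesised form produced by `all_of()` is equivalent to the unparenthesised implicit AND shown in the list.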

It would be very useful if Mail interpreted the search syntax already used in other EGroupware applications.
For example:

hospital budget → OR search (contains “hospital” OR “budget”)
hospital +budget → AND search (contains both)
hospital -draft → NOT search (contains “hospital” but NOT “draft”)

This would align mail search behavior with other EGW apps (Addressbook, Calendar, InfoLog) and leverage the IMAP capabilities that are already available through Horde.

Thank you again for your time and for maintaining such a transparent and well-documented project. EGroupware’s adherence to open standards (CardDAV, CalDAV, IMAP) is one of its greatest strengths, and we believe this enhancement would further improve the user experience.
Best regards,
Gabriele

ATSrl 2026-01-23 07:27:42 UTC #4

Good morning everyone,
Does anyone else have this problem or is interested?
Any advice is welcome.
Gabriele

ATSrl 2026-02-16 07:13:55 UTC #5

Good morning, Everyone.
Does anyone else have this kind of problem?
If anyone has had this problem and solved it, any suggestions are welcome.
Thank you.

RalfBecker 2026-02-16 08:18:52 UTC #6

@bbegw can you reproduce this?

ATSrl 2026-02-16 08:39:11 UTC #7

Hi Ralf,
First of all, thank you for your interest. However, I don’t understand your request.
In our case, searching emails presents the problems I described in my previous posts.
If you give me some more information, I will certainly take care of it.
Thank you
Gabriele

RalfBecker 2026-02-16 09:01:51 UTC #8

Hi Gabriele,

Fixing something like what you describe, especially in our mail sources, takes considerable effort, so I need to make sure it also happens on our systems before committing time.

Ralf

bbegw 2026-02-16 11:34:10 UTC #9

Hi Ralf,

I can reproduce that a search for “ralf becker” gives no matches, so and/or/negative search is not working at all.

In general, the search always works as “contains XYZ” for me, so searching for “becker” or “beck” finds “becker”.

If the change does not make the search any slower, it would be helpful.

From my point of view, Gabriele could certainly also look into it and send some “good” patches, or put some money from a support budget into the development …

Or for sure a combination of both is also welcome!

Birgit

ATSrl 2026-05-12 10:25:44 UTC #10

Hi all,

Following up on this thread — I’ve implemented and tested a patch that brings the unified EGroupware search syntax (+, -, and, or, quoted phrases) to the Mail module, in line with what already works in Addressbook, Calendar and InfoLog. Sharing here for review and to ask if the maintainers would be open to a PR.

Recap of the problem

As Ralf noted earlier in this thread, the Mail module is the only one that forwards the search string directly to the IMAP server without any tokenisation. So a query like invoice overdue becomes SEARCH TEXT "invoice overdue", and Dovecot looks for the literal contiguous substring. An email whose body says “the invoice from 02/2026 is overdue” is not matched, even though both words are clearly present.

This holds even with fts_flatcurve enabled on Dovecot: Flatcurve indexes content by token, but a multi-word phrase from the client is still matched as a bigram of consecutive tokens — non-contiguous words remain a hard miss.

What the patch does

It modifies api/src/Mail.php, function createIMAPFilter(). Specifically the case 'BODY'/'TEXT' and case 'FROM'/'TO'/'CC'/'BCC'/'SUBJECT' branches — so the input string is tokenised and each token becomes a separate Horde_Imap_Client_Search_Query sub-query. The sub-queries are then combined with andSearch(), orSearch() or negated according to the operator on each token.

The resulting IMAP query for invoice +overdue with TEXT search type becomes:

(BODY "invoice") (BODY "overdue")

Dovecot intersects the two token lookups. With Flatcurve, that’s two index hits + a posting-list intersection — sub-millisecond. Without Flatcurve, it’s two linear scans — slower but still correct.

User-facing syntax (matches the conventions Ralf already documented for the other apps):

| Input | Meaning |
|---|---|
| `invoice` | substring “invoice” anywhere |
| `invoice overdue` | “invoice” OR “overdue” (default) |
| `invoice and overdue` | “invoice” AND “overdue” |
| `invoice +overdue` | “invoice” AND “overdue” |
| `invoice -spam` | “invoice” AND NOT “spam” |
| `"invoice overdue"` | literal phrase as a single token (original behaviour preserved) |
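The tokenisation step behind this syntax can be sketched as follows. This is a hypothetical Python illustration, not the PHP code from the patch: the `tokenize()` name, the regular expression and the (connector, term) output shape are assumptions, and an unterminated quote is simply treated as part of a word.

```python
import re

# A quoted phrase (optionally prefixed with + or -) or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)"([^"]*)"|(\S+)')


def tokenize(query):
    """Return (connector, term) pairs; connector is 'or', 'and' or 'not'."""
    pairs = []
    pending = "or"  # whitespace between terms defaults to OR
    for match in TOKEN.finditer(query):
        sign, phrase, word = match.groups()
        if phrase is not None:
            term = phrase  # quoted phrase stays one literal token
        else:
            lowered = word.lower()
            if lowered in ("and", "or"):
                pending = lowered  # keyword applies to the next term
                continue
            sign, term = "", word
            if word[0] in "+-" and len(word) > 1:
                sign, term = word[0], word[1:]
        if not term:
            continue  # ignore empty quotes
        if sign == "+":
            connector = "and"
        elif sign == "-":
            connector = "not"
        else:
            connector = pending
        pairs.append((connector, term))
        pending = "or"
    return pairs
```

Each pair could then be turned into one sub-query and combined with the previous ones according to its connector, which is the shape the post describes for andSearch(), orSearch() and negation.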

Code and unified diff

Full code (the two new protected static helpers + the two modified case branches) and the patch -p0 unified diff are in this gist:

https://gist.github.com/CActor/092c75d6394a0a37fd2b76fa3c2caf34

In summary: ~130 lines of delta, one file touched (api/src/Mail.php), two new methods (buildTokenizedSearch() and parseSearchTokens()) added right after createIMAPFilter(). The two case branches are rewritten to delegate to the tokeniser via a factory closure (headerText() for header searches, text() for body/full-text searches).

Notes from real-world testing

Tested on EGroupware 26.1 (26.5.20260507) running from the official Docker image, with fts_flatcurve enabled on the Dovecot side
Cross-folder full-text search now matches messages with words in any order and with arbitrary text between them, as expected
Quoted phrases ("foo bar") still produce the historical contiguous-substring behaviour, preserving backward compatibility for users who relied on it
Zero changes to UI, configuration, DB schema, sessions or authentication — single file changed, two new helpers
The case 'QUICK'/'QUICKWITHCC'/'BYDATE' branches are intentionally left alone for this first cut: they already use orSearch() to combine SUBJECT+FROM(+CC), and applying tokenisation there is a follow-up that needs slightly more thought to avoid combinatorial explosion. Happy to address it as a separate change if there’s interest.

Question for the maintainers

Ralf, Birgit — would you be open to a Pull Request that brings this in? I’m happy to:

Open a PR against master with the diff and unit tests for buildTokenizedSearch() / parseSearchTokens()
Adjust the parser if you’d prefer stricter or different semantics (e.g. proper operator precedence with parentheses, default-to-AND instead of default-to-OR specifically for Mail)
Add functional tests at the IMAP-builder level

Thanks for the architecture work that already had andSearch() / orSearch() / negation built into Horde_Imap_Client_Search_Query — the change really is straightforward thanks to those primitives.

I hope someone will be interested in trying out the modification.

Best regards,
Gabriele

RalfBecker 2026-05-12 11:19:03 UTC #11

Hi Gabriele,

Thanks for your effort.

Please do so and open a pull request.

Ralf

ATSrl 2026-05-12 19:51:34 UTC #12

Hi Ralf, thanks for the green light on the PR.

While testing the IMAP search patch in production we noticed the same user-facing limitation also applies to the Mail filter rules. A user who learns to write invoice +overdue in the search box naturally tries the same in the “Subject contains” field of a filter rule, and the filter doesn’t fire on messages where the words are not adjacent.

From the end-user perspective, having the search box and the filter rules of the same module follow two different syntaxes is hard to justify: it forces them to keep two mental models for what looks like the same operation. Aligning both — the search-side patch you’ve already greenlit, plus this filter-side one — gives a single, consistent search/filter syntax across the Mail module, in line with what Addressbook, Calendar and InfoLog already do.

The fix is structurally identical to the search-side one, just applied to a different code path. I have implemented and tested it. Posting it here so we can discuss it before I open two PRs — one for each.

The problem on filter rules

EGroupware Mail filters end up as Sieve scripts on Dovecot, generated by api/src/Mail/Sieve/Script.php and uploaded via managesieve. For a filter “Subject contains invoice overdue”, the script emits:

if allof (
    header :contains "subject" "invoice overdue"
) { ... }

:contains is a literal contiguous-substring test, so a message with subject “the invoice from 02/2026 is overdue” is not matched — the same gotcha as on the search side. Users currently work around this by typing +nuovo in the special “header” row and another in the standard “subject” row, then setting match all of — clunky.

The fix: same tokenisation, applied to Sieve generation

Two new static helpers on the Script class, buildTokenizedSieveTest() and parseSieveTokens(), and a tokenised path in five case branches of the rule generator (FROM / TO / SUBJECT / custom header / body), gated on plain :contains mode (so wildcards and the regex checkbox are untouched).

A filter value invoice +overdue on Subject now emits:

if allof (
    allof (
        header :contains "subject" "invoice",
        header :contains "subject" "overdue"
    )
) { ... }

A single-token value (e.g. invoice only) still produces byte-identical output to the historical generator — no regression on existing filters.
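The mapping from tokens to a composite Sieve test can be sketched in Python as below. This is an illustration under assumptions, not the patched Script.php: the function names are invented, the input is an already tokenised list of (connector, term) pairs with connectors 'or', 'and' and 'not', and the output is a single-line test rather than the indented layout shown above.

```python
def sieve_string(value):
    """Render a Sieve quoted string, escaping backslash and double quote."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def contains_test(header, term):
    return f"header :contains {sieve_string(header)} {sieve_string(term)}"


def build_test(header, pairs):
    """Fold (connector, term) pairs left to right into allof/anyof/not."""
    if not pairs:
        raise ValueError("at least one token is required")
    kind, parts = None, []
    for connector, term in pairs:
        test = contains_test(header, term)
        if connector == "not":
            test = f"not {test}"
        wanted = "anyof" if connector == "or" else "allof"
        if not parts:
            parts = [test]  # first token: its connector has nothing to join
        elif kind in (None, wanted):
            kind = wanted
            parts.append(test)
        else:
            # Connector changed: wrap the group so far and start a new one.
            parts = [f"{kind} ({', '.join(parts)})", test]
            kind = wanted
    if kind is None:
        return parts[0]  # single token: plain test, no wrapper
    return f"{kind} ({', '.join(parts)})"
```

For `[("or", "invoice"), ("and", "overdue")]` on the subject header this produces `allof (header :contains "subject" "invoice", header :contains "subject" "overdue")`, and a single token falls back to the plain `header :contains` test.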

User-facing syntax matches the search patch exactly, so there’s a single mental model across the whole Mail module:

| Input in any filter value | Meaning |
|---|---|
| `invoice` | substring “invoice” anywhere |
| `invoice overdue` | “invoice” OR “overdue” (default) |
| `invoice and overdue` | “invoice” AND “overdue” |
| `invoice +overdue` | “invoice” AND “overdue” |
| `invoice -spam` | “invoice” AND NOT “spam” |
| `"invoice overdue"` | literal phrase (preserves legacy behaviour) |

Code and unified diff

Second gist (the Sieve one): https://gist.github.com/CActor/82621b8df78662e6117cb3beeedb94e0

It contains the full Script.php patched, the unified diff (patch -p0 applicable, ~286 lines), and a worked-out example of the generated Sieve script.

Tested

EGroupware 26.1 (26.5.20260507), official Docker image, Dovecot with fts_flatcurve enabled. The patched file is bind-mounted source-side to survive Watchtower image updates. Eight functional test cases (prova +filtro +nuovo on Subject, sent via SMTP from a separate account, observed routing to INBOX/TEST vs INBOX):

five positives matched as expected, including the case where words are far apart in the subject and the case where one term is a substring of a larger word (“prova” inside “approvazione”)
three negatives stayed in INBOX as expected (missing token / declension differences / split word)

Breaking change for existing filter rules with multi-word values

Since the tokeniser kicks in by default on whitespace (consistent with the documented EGroupware search syntax — “A B: contains A or contains B”), filter rules that today rely on whitespace being part of a contiguous-substring match will change behaviour after this patch is deployed.

Three example patterns we observed on a real-world deployment:

| Filter value (before) | Old generated Sieve | New generated Sieve |
|---|---|---|
| Subject contains `Project 70` | `header :contains "subject" "Project 70"` (only matches contiguous) | `anyof (... "Project", ... "70")` (matches “Project” OR “70”, much broader) |
| Subject contains `urgent meeting` | only contiguous match | `anyof(...)`, broader |
| body `:text :contains "J. Smith"` | only contiguous match | `anyof(...)`, much broader: “J.” is a substring of many words |

Filters that use wildcards (*/?) or the per-rule regex checkbox are unaffected (the tokeniser branch is gated off in those modes), so any rule that already disambiguates via *phrase* or regex stays as-is.

Migration for end-users: review existing filter rule values; for any with whitespace that you intend as a literal phrase, wrap the value in double quotes (e.g. change Project 70 to "Project 70"). The quoted-phrase token preserves the historical contiguous-substring behaviour.
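The review step described here could be partly automated. The following Python sketch classifies a stored filter value according to the rules stated in this post; the function name and the three labels are invented for illustration, and the per-rule regex checkbox is passed in as a plain boolean because how it is stored is not shown here.

```python
def classify_filter_value(value, regexp=False):
    """Classify a filter value for the always-on tokeniser.

    Returns 'unchanged' if the rule keeps its behaviour, 'quote' if the
    value should be wrapped in double quotes to stay a literal phrase,
    or 'review' if it already contains quotes and needs a manual look.
    """
    text = value.strip()
    if regexp:
        return "unchanged"  # regex mode: tokeniser is gated off
    if "*" in text or "?" in text:
        return "unchanged"  # wildcard mode: tokeniser is gated off
    if not any(ch.isspace() for ch in text):
        return "unchanged"  # single token: identical output
    fully_quoted = (
        len(text) >= 2
        and text.startswith('"')
        and text.endswith('"')
        and '"' not in text[1:-1]
    )
    if fully_quoted:
        return "unchanged"  # already a literal phrase
    if '"' in text:
        return "review"  # embedded quotes: do not guess
    return "quote"
```

For instance, `classify_filter_value("Project 70")` returns `"quote"`, while `classify_filter_value('"Project 70"')` returns `"unchanged"`.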

Plan for the PR

For now I’ll open only the search PR (api/src/Mail.php, as discussed in the previous post). The filter patch (api/src/Mail/Sieve/Script.php) is implemented, tested in our staging environment, and ready as a second PR — but I’d like to wait for your feedback on the search PR first, and confirmation that you want the filter side merged too, before opening a second one. The two share the same tokeniser shape; if you’d prefer it factored into a small trait (e.g. Mail/SearchTokeniserTrait) used by both, happy to refactor either at PR-review time.

Just let me know on this thread (or directly on the search PR) whether you want me to open the filter PR as a follow-up, and I’ll do so within the day — with the BREAKING CHANGE note in the description and a changelog entry suggesting end-users audit their multi-word filter values.

Best regards,
Gabriele

RalfBecker 2026-05-13 05:57:28 UTC #13

Hmm, I’m a bit concerned about the Sieve part.

Saying users should review their Sieve rules is not going to work. Best-case scenario, they find the rule no longer working and report it as a bug (as it was working before). Worst case, it just stays broken.

I don’t mind if the syntax for the search and the Sieve rules is slightly different.

One way to implement it in a safe way would be to add a 3rd search-mode beside wildcard and regexp…

Ralf

ATSrl 2026-05-13 07:02:22 UTC #14

Hi Ralf,
In our experience, filters are little used precisely because they are complex to define, so I do not think the impact would be major. However, I agree with you that it could be interpreted as a bug.
Anyway, not all filters would be affected, and I think that if the change came with a clear and obvious disclaimer, the problem would be very small.
I don’t know whether the change you propose is beyond my abilities. In any case, I’ll analyze the problem.
In the meantime, do you think I should proceed with the PR for the mail search?
Thank you

RalfBecker 2026-05-13 07:13:05 UTC #15

Yes please

I also like the Sieve part, but there must be something which switches to the new syntax explicitly, so it does not affect existing rules.

Ralf

ATSrl 2026-05-13 07:32:49 UTC #16

Hi Ralf,
I’ll prepare the PR this evening and try to analyze the explicit switch for Sieve.
Thanks
Gabriele

ATSrl 2026-05-13 19:41:19 UTC #17

Hi Ralf,
The PR has been created. I’m now trying to work out how to implement your suggestion. I’ve added an option in the new filter creation process that allows users to follow the different search filter specification method, though it’s not mandatory, and I’m trying to see if it’s possible to maintain both approaches.
I have 48 ‘OLD’ filters in my email; I’ve added some new filters and am sending test emails to check everything.
Gabriele

ATSrl 2026-05-14 07:36:05 UTC #18

Hi Ralf, follow-up on the filter side.

[quote="RalfBecker, post:13, topic:79137"]
One way to implement it in a safe way would be to add a 3rd search-mode beside wildcard and regexp…
[/quote]

Per your feedback (“filter rules need an explicit switch so existing rules are not changed silently”), I redesigned the Sieve patch around an opt-in checkbox in the rule editor. Default = off, so existing rules continue to produce byte-identical Sieve, no surprises on upgrade. The original always-on draft from my previous post is replaced.

New per-rule checkbox in mail/templates/default/sieve.edit.xet:

☐ Use tokenised search syntax (+word required, -word forbidden, “…” literal phrase; same as the EGroupware search box)
Persistence: stored as the next free bit 256 in the existing flg integer column of the rule — no schema migration.
Generator gating in api/src/Mail/Sieve/Script.php: the tokenised branch only fires when ($rule['flg'] & 256) AND !regexp AND no */? in the value. All other modes (wildcards, regex, plain unchecked) are left untouched.
Small UX touch in mail/inc/class.mail_sieve.inc.php: the last state of the checkbox is remembered as a per-user preference (mail/sieve_last_tokenized) and used as the default for new rules. Existing rules are not affected — only the default for new ones.
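The gating condition and the bit handling can be illustrated with a short Python sketch. It is not the PHP from the patch: the helper names are invented, the rule is modelled as a plain dict, and the `regexp` key is an assumed stand-in for however the regex mode is actually stored on a rule.

```python
TOKENISED_FLAG = 256  # the bit in the rule's flg integer named in this post


def set_tokenised(flg, enabled):
    """Return flg with the tokenised bit set or cleared; other bits are kept."""
    return (flg | TOKENISED_FLAG) if enabled else (flg & ~TOKENISED_FLAG)


def uses_tokenised_syntax(rule, value):
    """True only if the rule opted in, is not a regex rule, and the
    value contains no wildcard characters."""
    flagged = bool(int(rule.get("flg", 0)) & TOKENISED_FLAG)
    regexp = bool(rule.get("regexp", False))  # hypothetical field name
    has_wildcard = "*" in value or "?" in value
    return flagged and not regexp and not has_wildcard
```

A rule saved before the patch has the bit cleared, so `uses_tokenised_syntax({"flg": 0}, "invoice +overdue")` is false and the legacy code path would be taken.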

What a user sees in the rule editor

The new checkbox sits in the dialog footer, right below “Use regular expressions”, so it is visible alongside the other rule-wide mode toggles:

Three behaviours, one screen:

Existing rule, checkbox NOT ticked → Save → emitted Sieve is byte-identical to pre-patch. No behaviour change, no surprises.
Existing rule, checkbox ticked → Save → the value fields become tokenised on emission, generating allof / anyof / not composite tests.
New rule → the checkbox default reflects what the user last chose. This is the per-user preference described above: power users who always want tokenisation get it without re-ticking on every new rule, and users who never tick it never see the new behaviour propagate to their new rules.

The “last choice is the new default” logic, end-to-end

This is the subtle part — worth spelling out because it is what makes the opt-in feel natural rather than annoying:

First time the user opens the rule editor, the checkbox is unticked (the absolute default for a brand-new installation, stored as mail/sieve_last_tokenized = 0 in user preferences).
The user creates a new rule, ticks the checkbox, clicks Save → the rule is saved with flg & 256, AND the user preference mail/sieve_last_tokenized flips to 1.
Next new rule the same user creates → checkbox renders already ticked as the default.
If the user later unticks it and saves a rule, the preference flips back to 0 → next new rule renders unticked again.
Existing rules are never affected by the preference value — when an existing rule is loaded, its checkbox reflects that rule’s own flg & 256, not the user preference.

So: the preference influences only the default for new rules. Each saved rule remembers its own choice. Two users with different preferences see different defaults but never affect each other.
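The sequence above can be modelled in a few lines of Python. This is a simplified, hypothetical model rather than the code in class.mail_sieve.inc.php: the class and method names are invented, rules are plain dicts, and it assumes that every save updates the preference with the checkbox state of the saved rule.

```python
TOKENISED_FLAG = 256


class RuleEditorModel:
    """Per-user model of the checkbox default and per-rule state."""

    def __init__(self):
        # Mirrors the preference mail/sieve_last_tokenized; 0 on a fresh install.
        self.last_tokenized = 0

    def default_for_new_rule(self):
        """New rules start with whatever the user chose last."""
        return bool(self.last_tokenized)

    def checkbox_for_existing(self, rule):
        """Existing rules show their own bit, never the preference."""
        return bool(rule.get("flg", 0) & TOKENISED_FLAG)

    def save(self, rule, ticked):
        """Store the choice on the rule and remember it as the new default."""
        saved = dict(rule)
        flg = saved.get("flg", 0)
        saved["flg"] = (flg | TOKENISED_FLAG) if ticked else (flg & ~TOKENISED_FLAG)
        self.last_tokenized = 1 if ticked else 0
        return saved
```

Because each user would have their own instance of this state, one user's choice never changes another user's default, and a saved rule keeps its own bit regardless of later preference changes.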

This keeps the experience consistent with how the EGroupware search box is used: a user who has internalised the +word/-word/"phrase" syntax wants it everywhere by default; a user who hasn’t will not have it forced on them.

Tokenised syntax (same as the search box)

| Input | Meaning |
|---|---|
| `invoice` | substring “invoice” anywhere |
| `invoice overdue` | “invoice” OR “overdue” (default for whitespace) |
| `invoice and overdue` | “invoice” AND “overdue” |
| `invoice +overdue` | “invoice” AND “overdue” |
| `invoice -spam` | “invoice” AND NOT “spam” |
| `"invoice overdue"` | literal phrase (= legacy contiguous match) |

Fields covered (only when the checkbox is checked and the value is plain :contains): From, To, Subject, custom header (If mail header), body (If mail body message type in both :text and :raw).

Migration story for upgrade — nothing required

Because the patch is gated on flg & 256, no existing rule changes behaviour at deploy time and no migration step is required. Rules saved before the patch have flg & 256 == 0, so the generator emits the historical contiguous :contains Sieve, byte-identical to pre-patch. The editor renders them with the new checkbox unchecked. The first time a user opens an existing rule, ticks the checkbox and saves, that single rule is regenerated under the patched generator. End-users do not need to take any action.

For instances that prefer to bulk-normalise the stored Sieve scripts at deploy time anyway (e.g. to validate the patched parser end-to-end across all users), I include an optional CLI helper resave-sieve-rules.php that iterates all active users and round-trips their rules through retrieveRules() → setRules(). Pure bookkeeping, no semantic effect.

docker cp resave-sieve-rules.php egroupware:/tmp/
docker exec -u www-data -e EGW_USER=admin -e EGW_PASS='...' egroupware \
    php /tmp/resave-sieve-rules.php --dry-run --verbose
docker exec -u www-data -e EGW_USER=admin -e EGW_PASS='...' egroupware \
    php /tmp/resave-sieve-rules.php

--user=<lid> is supported for staged rollout. We used it on our own test instance before considering the full fleet.

Files in the patch (3)

api/src/Mail/Sieve/Script.php — two new static helpers + tokenised branch in 5 case blocks.
mail/templates/default/sieve.edit.xet — one new <et2-checkbox id="tokenized">.
mail/inc/class.mail_sieve.inc.php — load/save of flg & 256, plus the last-choice user preference.

~340 lines effective delta across the three files, no new dependencies, no schema migration. Unified diff and the worked-out generated Sieve are in the gist linked above.

Tested

EGroupware 26.1 (26.5.20260507), official Docker image, Dovecot with fts_flatcurve enabled. The three patched files are bind-mounted source-side to survive Watchtower updates. 34 functional test cases via SMTP from a separate account:

8 tokenised positives (checkbox checked: AND / OR / +token / -token / quoted phrase / case-insensitive / cross-position / substring-of-larger-word) all matched as expected.
4 tokenised negatives (missing required token, forbidden token present, spaced-out non-substring) correctly stayed in INBOX.
22 legacy rules (checkbox unchecked) routed identically to pre-patch behaviour for every message — confirmed zero regression.

Plan for the PR

#240 — the search-side patch on api/src/Mail.php, ready for review.
#241 — the filter-side patch (api/src/Mail/Sieve/Script.php + 2 companion files), opened as Draft because #240 is its functional prerequisite. Same v2 design as described above. The migration helper (resave-sieve-rules.php) is referenced from the PR description and lives in the companion gist linked below. The Draft state is just to invite review/discussion while #240 is merged first — happy to convert to ready-for-review at any moment.

If you’d prefer the tokeniser factored out of Mail.php and Script.php into a small shared trait (e.g. EGroupware\Api\Mail\SearchTokeniserTrait) before the second PR lands, happy to do that at PR-review time on either patch.

Best regards,
Gabriele