PCRE2: optimize memory allocations #15395

Draft
wants to merge 7 commits into
base: master

@ysbaddaden (Contributor) commented Jan 31, 2025

We noticed in #15088 that we don't need Crystal::ThreadLocalValue in the Regex PCRE2 engine.

We can reuse not only the JIT stack but also the match data across every Regex: there is no need for a dedicated match data per Regex. We allocate one of each and make sure it's never used by another thread, hence the thread locals. No more spinlock (thread contention) and no more hash.

It's simpler and faster. Here's the benchmark from #13144 for example:

$ crystal run --release bench/regex.cr
starts_with?  29.46M ( 33.94ns) (± 0.27%)
    matches?  40.34M ( 24.79ns) (± 0.74%)
$ bin/crystal run --release bench/regex.cr
starts_with?  30.70M ( 32.58ns) (± 0.26%)
    matches?  46.28M ( 21.61ns) (± 0.72%)

Enabling MT also no longer has any impact on performance:

$ crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  26.50M ( 37.73ns) (± 0.09%)
    matches?  41.75M ( 23.95ns) (± 0.56%)
$ bin/crystal run --release -Dpreview_mt bench/regex.cr
starts_with?  30.61M ( 32.67ns) (± 0.36%)
    matches?  48.45M ( 20.64ns) (± 1.51%)

The drawback is that each match data must be allocated with the maximum number of ovectors (65535). That might increase memory usage, though I didn't notice any increase in practice. Maybe no longer allocating memory for every regular expression compensates?
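To put a rough number on that drawback, here is a back-of-the-envelope sketch. It assumes that 65535 counts offset pairs, that each pair holds two `size_t` values (`PCRE2_SIZE` is `size_t` in PCRE2), and it ignores the match data header. `MAX_OVECTOR_PAIRS` and `ovector_bytes` are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 65535 is the number of offset pairs in the ovector. */
enum { MAX_OVECTOR_PAIRS = 65535 };

/* Bytes taken by an ovector of `pairs` offset pairs.
   Each pair stores a start and an end offset, both size_t. */
static size_t ovector_bytes(size_t pairs) {
    return pairs * 2 * sizeof(size_t);
}
```

With an 8-byte `size_t` that is 65535 × 16 = 1,048,560 bytes, so just under 1 MiB per thread. Pages that are never written to may never be committed by the OS, which could be one reason the increase doesn't show up in practice, but that's a guess.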

Note: this PR will be split into a couple of PRs, so that Crystal::System::ThreadLocal(T) is introduced separately. The new type only exists for the sake of this patch, so I'd like approval on the overall approach before the split.

I could have used @[ThreadLocal] but some targets don't support it (namely: Android, MinGW and OpenBSD) and we can't register destructors either (on thread shutdown). But using pthread_key_create or FlsAlloc we can 😍
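For reference, here's a minimal C sketch of the pthread side of that: a key registered with a destructor that the runtime calls at thread exit. It is not the code from this PR; `scratch_key`, `scratch_destroy`, `worker` and `run_threads` are hypothetical names, and the 64-byte allocation is just a stand-in for a real per-thread value.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static pthread_key_t scratch_key;
static pthread_once_t scratch_once = PTHREAD_ONCE_INIT;
static atomic_int destroyed = 0;

/* Called by the pthread runtime when a thread exits with a non-NULL value. */
static void scratch_destroy(void *ptr) {
    free(ptr);
    atomic_fetch_add(&destroyed, 1);
}

static void scratch_key_init(void) {
    int err = pthread_key_create(&scratch_key, scratch_destroy);
    assert(err == 0);
    (void)err;
}

static void *worker(void *arg) {
    (void)arg;
    void *scratch = malloc(64); /* stand-in for a per-thread allocation */
    if (scratch != NULL && pthread_setspecific(scratch_key, scratch) != 0)
        free(scratch);          /* could not register it: don't leak */
    return NULL;                /* the destructor runs after this return */
}

/* Starts up to 8 threads and returns how many destructor calls happened. */
static int run_threads(int n) {
    pthread_t threads[8];
    int started = 0;
    int before;

    pthread_once(&scratch_once, scratch_key_init);
    before = atomic_load(&destroyed);
    if (n > 8) n = 8;
    for (int i = 0; i < n; i++) {
        if (pthread_create(&threads[started], NULL, worker, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&destroyed) - before;
}
```

Calling `run_threads(4)` should report four destructor calls once the threads are joined, with no explicit cleanup in `worker`. On Windows, `FlsAlloc` takes a callback that plays the same role.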

Wraps the system API (pthread on unix, FLS on windows) for thread local
storage (TLS) or thread specific storage (TSS). Allows registering a
destructor to automatically clean up the thread local values when a
thread terminates.

Both the JIT stack and the match data are merely scratchpads for the
current thread to execute the current, blocking, PCRE2 match. There are
no concurrency issues. We only need to make sure that only one thread
can access both allocations at any one time.

Keeping a thread local is simpler and faster than using
Crystal::ThreadLocalValue and we don't need to keep live references to
GC allocated objects (both are allocated using malloc).
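The scratchpad idea can be sketched in plain C as follows: each thread lazily mallocs its own buffer on first use and then reuses it for every later call, with no lock and no lookup table. This only illustrates the pattern under simplified assumptions; `thread_scratch`, `scratch_is_per_thread` and the 4096-byte size are made up for the example and are not what the patch allocates.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t buf_key;
static pthread_once_t buf_once = PTHREAD_ONCE_INIT;

static void buf_key_init(void) {
    /* free() is the destructor: the buffer is released at thread exit. */
    int err = pthread_key_create(&buf_key, free);
    assert(err == 0);
    (void)err;
}

/* Returns the calling thread's scratch buffer, allocated on first use. */
static void *thread_scratch(void) {
    void *buf;

    pthread_once(&buf_once, buf_key_init);
    buf = pthread_getspecific(buf_key);
    if (buf == NULL) {
        buf = malloc(4096); /* plain malloc: nothing for a GC to track */
        if (buf != NULL && pthread_setspecific(buf_key, buf) != 0) {
            free(buf);
            buf = NULL;
        }
    }
    return buf;
}

static void *other_thread(void *out) {
    /* Keep the address as an integer: the buffer is freed at thread exit. */
    *(uintptr_t *)out = (uintptr_t)thread_scratch();
    return NULL;
}

/* Returns 1 if the buffer is reused within a thread and distinct across threads. */
static int scratch_is_per_thread(void) {
    void *mine = thread_scratch();
    void *again = thread_scratch();
    uintptr_t theirs = 0;
    pthread_t t;

    if (mine == NULL || pthread_create(&t, NULL, other_thread, &theirs) != 0)
        return 0;
    pthread_join(t, NULL);
    return mine == again && theirs != 0 && theirs != (uintptr_t)mine;
}
```

Since a thread only ever sees its own buffer, and a match blocks that thread until it completes, nothing else can touch the buffer in the meantime.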
@ysbaddaden ysbaddaden self-assigned this Jan 31, 2025
@ysbaddaden ysbaddaden changed the title Refactor: PCRE2 memory allocations PCRE2: optimize memory allocations Jan 31, 2025