What should it mean to run a regexp match on invalid UTF-8 bytes?
The coherent behavior options are:
1. Invalid UTF-8 does not match any character classes,
nor a U+FFFD literal (nor \x{fffd}).
2. Each byte of invalid UTF-8 is treated identically to a U+FFFD in the input,
as a utf8.DecodeRune loop might.
RE2 uses Rule 1.
Because it works byte at a time, it can also provide \C to match any
single byte of input, which matches invalid UTF-8 as well.
This provides the nice property that a match for a regexp without \C
is guaranteed to be valid UTF-8.
Unfortunately, today Go has an incoherent mix of these two, although
mostly Rule 2. This is a deviation from RE2, and it gives up the nice
property, but we probably can't correct that at this point.
In particular .* already matches entire inputs today, valid UTF-8 or
not, and I doubt we can break that.
This CL adopts Rule 2 officially, fixing the few places that deviate from it.
Fixes#48749.
Change-Id: I96402527c5dfb1146212f568ffa09dde91d71244
Reviewed-on: https://go-review.googlesource.com/c/go/+/354569
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
Many uses of Index/IndexByte/IndexRune/Split/SplitN
can be written more clearly using the new Cut functions.
Do that. Also rewrite to other functions if that's clearer.
For #46336.
Change-Id: I68d024716ace41a57a8bf74455c62279bde0f448
Reviewed-on: https://go-review.googlesource.com/c/go/+/351711
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
SubexpIndex returns the index of the first subexpression with the given name,
or -1 if there is no subexpression with that name.
Fixes#32420
Change-Id: Ie1f9d22d50fb84e18added80a9d9a9f6dca8ffc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/187919
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Currently we say that a negative index means no match,
but we don't say how "no match" is expressed when 'Index'
is not present. Say how it is expressed.
Change-Id: I82b6c9038557ac49852ac03642afc0bc545bb4a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/175677
Reviewed-by: Ian Lance Taylor <iant@golang.org>
This change limits the capacity of the slices of bytes returned by:
- Find
- FindAll
- FindAllSubmatch
to be the same as their length.
Fixes#30169
Change-Id: I07b632757d2bfeab42fce0d42364e2a16c597360
Reviewed-on: https://go-review.googlesource.com/c/161877
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Change-Id: I628aad9a3abe9cc0c3233f476960e53bd291eca9
Reviewed-on: https://go-review.googlesource.com/135235
Reviewed-by: Ralph Corderoy <ralph@inputplus.co.uk>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Before:
// Find returns a slice holding the text of the leftmost match in b of the regular expression.
// Match checks whether a textual regular expression matches a byte slice.
After:
// Match reports whether the byte slice b contains any match of the regular expression re.
The use of different wording for Find and Match always makes me think
that Match required the entire string to match while Find clearly allows
a substring to match.
This CL makes the Match wording correspond more closely to Find,
to try to avoid that confusion.
Change-Id: I97fb82d5080d3246ee5cf52abf28d2a2296a5039
Reviewed-on: https://go-review.googlesource.com/123736
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Revert CL 101715.
The size of a sync.Pool scales linearly with GOMAXPROCS,
making it inappropriate to put a sync.Pool in any individually
allocated object, as the sync.Pool documentation explains.
The change also broke DeepEqual on regexps.
I have a cleaner way to do this with global sync.Pools but it's
too late in the cycle. Will revisit in Go 1.12. For now, revert.
Fixes#26219.
Change-Id: Ie632e709eb3caf489d85efceac0e4b130ec2019f
Reviewed-on: https://go-review.googlesource.com/122596
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Each URL was manually verified to ensure it did not serve up incorrect
content.
Change-Id: I4dc846227af95a73ee9a3074d0c379ff0fa955df
Reviewed-on: https://go-review.googlesource.com/115798
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Performance optimization for the internals of the Regexp type. This adds
no features and has no user-visible impact beyond performance. Copy now
shares the cache, so memory usage for programs that use Copy a lot
should go down; Copy has effectively become a no-op.
The before v. after benchmark results show a lot of noise from run to
run, but there's a clear improvement to the Shared case and no detriment
to the Copied case.
BenchmarkMatchParallelShared-4 361 77.9 -78.42%
BenchmarkMatchParallelCopied-4 70.3 72.2 +2.70%
Macro benchmarks show that the lock contention in Regexp is gone, and my
server is now able to scale linearly 2.5x times more than before (and I
only stopped there because I ran out of CPU in my test machine).
Fixes#24411
Change-Id: Ib33abff2802f27599f5d09084775e95b54e3e1d7
Reviewed-on: https://go-review.googlesource.com/101715
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
If you try to find something in a slice of bytes using a Regexp object,
the byte array will not be released by GC until you use the Regexp object
on another slice of bytes. It happens because the Regexp object keep
references to the input data in its cache.
Change-Id: I873107f15c1900aa53ccae5d29dbc885b9562808
Reviewed-on: https://go-review.googlesource.com/96715
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
The existing implementation declares a variable error which collides
with builting type error.
This change simply renames error variable to err.
Change-Id: Ib56c2530f37f53ec70fdebb825a432d4c550cd04
Reviewed-on: https://go-review.googlesource.com/87775
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
strings.IndexByte was introduced in go1.2 and it can be used
effectively wherever the second argument to strings.Index is
exactly one byte long.
This avoids generating unnecessary string symbols and saves
a few calls to strings.Index.
Change-Id: I1ab5edb7c4ee9058084cfa57cbcc267c2597e793
Reviewed-on: https://go-review.googlesource.com/65930
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
This is the same technique used in CL 24466. By adding a little bit of
size to the binary, we can remove a function call and gain a lot of
performance.
A raw array ([128]bool) would be faster, but is also be 128 bytes
instead of 16.
Running tip on a Mac:
name old time/op new time/op delta
QuoteMetaAll-4 192ns ±12% 120ns ±11% -37.27% (p=0.000 n=10+10)
QuoteMetaNone-4 186ns ± 6% 64ns ± 6% -65.52% (p=0.000 n=10+10)
name old speed new speed delta
QuoteMetaAll-4 73.2MB/s ±11% 116.6MB/s ±10% +59.21% (p=0.000 n=10+10)
QuoteMetaNone-4 139MB/s ± 6% 405MB/s ± 6% +190.74% (p=0.000 n=10+10)
Change-Id: I68ce9fe2ef1c28e2274157789b35b0dd6ae3efb5
Reviewed-on: https://go-review.googlesource.com/41495
Run-TryBot: Kevin Burke <kev@inburke.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
The step method implementations check directly if the next rune
only needs one byte to be decoded and avoid calling utf8.DecodeRune
for such ASCII characters.
Introduce the same fast path optimization for rune decoding
for the context methods.
Results for regexp benchmarks that use the context methods:
name old time/op new time/op delta
AnchoredLiteralShortNonMatch-4 97.5ns ± 1% 94.8ns ± 2% -2.80% (p=0.000 n=45+43)
AnchoredShortMatch-4 163ns ± 1% 160ns ± 1% -1.84% (p=0.000 n=46+47)
NotOnePassShortA-4 742ns ± 2% 742ns ± 2% ~ (p=0.440 n=49+50)
NotOnePassShortB-4 535ns ± 1% 533ns ± 2% -0.37% (p=0.005 n=46+48)
OnePassLongPrefix-4 169ns ± 2% 166ns ± 2% -2.06% (p=0.000 n=50+49)
Change-Id: Ib302d9e8c63333f02695369fcf9963974362e335
Reviewed-on: https://go-review.googlesource.com/38256
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This change removes a lot of dead code. Some of the code has never been
used, not even when it was first commited. The rest shouldn't have
survived refactors.
This change doesn't remove unused routines helpful for debugging, nor
does it remove code that's used in commented out blocks of code that are
only unused temporarily. Furthermore, unused constants weren't removed
when they were part of a set of constants from specifications.
One noteworthy omission from this CL are about 1000 lines of unused code
in cmd/fix, 700 lines of which are the typechecker, which hasn't been
used ever since the pre-Go 1 fixes have been removed. I wasn't sure if
this code should stick around for future uses of cmd/fix or be culled as
well.
Change-Id: Ib714bc7e487edc11ad23ba1c3222d1fd02e4a549
Reviewed-on: https://go-review.googlesource.com/20926
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
There's nothing guaranteeing that the *Regexp isn't in active use,
and so copying the sync.Mutex value is invalid.
Updates #14839.
Change-Id: Iddf52bf69df1b563377922399f64a571f76b95dd
Reviewed-on: https://go-review.googlesource.com/20841
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
The tree's pretty inconsistent about single space vs double space
after a period in documentation. Make it consistently a single space,
per earlier decisions. This means contributors won't be confused by
misleading precedence.
This CL doesn't use go/doc to parse. It only addresses // comments.
It was generated with:
$ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])')
$ go test go/doc -update
Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
Reviewed-on: https://go-review.googlesource.com/20022
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Dave Day <djd@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
This helps users who wish to use separate Regexps in each goroutine to
avoid lock contention. Previously they had to parse the expression
multiple times to achieve this.
I used variants of the included benchmark to evaluate this change. I
used the arguments -benchtime 20s -cpu 1,2,4,8,16 on a machine with 16
hardware cores.
Comparing a single shared Regexp vs. copied Regexps, we can see that
lock contention causes huge slowdowns at higher levels of parallelism.
The copied version shows the expected linear speedup.
name old time/op new time/op delta
MatchParallel 366ns ± 0% 370ns ± 0% +1.09% (p=0.000 n=10+8)
MatchParallel-2 324ns ±28% 184ns ± 1% -43.37% (p=0.000 n=10+10)
MatchParallel-4 352ns ± 5% 93ns ± 1% -73.70% (p=0.000 n=9+10)
MatchParallel-8 480ns ± 3% 46ns ± 0% -90.33% (p=0.000 n=9+8)
MatchParallel-16 510ns ± 8% 24ns ± 6% -95.36% (p=0.000 n=10+8)
I also compared a modified version of Regexp that has no mutex and a
single machine (the "RegexpForSingleGoroutine" rsc mentioned in
https://github.com/golang/go/issues/8232#issuecomment-66096128).
In this next test, I compared using N copied Regexps vs. N separate
RegexpForSingleGoroutines. This shows that, even for this relatively
simple regex, avoiding the lock entirely would only buy about 10-12%
further improvement.
name old time/op new time/op delta
MatchParallel 370ns ± 0% 322ns ± 0% -12.97% (p=0.000 n=8+8)
MatchParallel-2 184ns ± 1% 162ns ± 1% -11.60% (p=0.000 n=10+10)
MatchParallel-4 92.7ns ± 1% 81.1ns ± 2% -12.43% (p=0.000 n=10+10)
MatchParallel-8 46.4ns ± 0% 41.8ns ±10% -9.78% (p=0.000 n=8+10)
MatchParallel-16 23.7ns ± 6% 20.6ns ± 1% -13.14% (p=0.000 n=8+8)
Updates #8232.
Change-Id: I15201a080c363d1b44104eafed46d8df5e311902
Reviewed-on: https://go-review.googlesource.com/16110
Reviewed-by: Russ Cox <rsc@golang.org>
Regular expressions involving a (x){0} term are
simplified by removing this term from the
expression, just before the expression is compiled.
The number of subexpressions is evaluated before
the simplification. The number of capture instructions
in the compiled expressions is not necessarily in line
with the number of subexpressions.
When the ReplaceAll(String) methods are used, a number
of capture slots (nmatch) is evaluated as 2*(s+1)
(s being the number of subexpressions).
In some case, it can be higher than the number of capture
instructions evaluated at compile time, resulting in a
panic when the internal slices of regexp.machine
are resized to this value.
Fixed by capping the number of capture slots to the number
of capture instructions.
I must say I do not really see the benefits of setting
nmatch lower than re.prog.NumCap using this 2*(s+1) formula,
so perhaps this can be further simplified.
Fixes#11178Fixes#11176
Change-Id: I21415e8ef2dd5f2721218e9a679f7f6bfb76ae9b
Reviewed-on: https://go-review.googlesource.com/14013
Reviewed-by: Russ Cox <rsc@golang.org>
In 1.6, go doc is more likely to be available.
Change-Id: I970ad1d3317b35273f5c8d830f75713d3570c473
Reviewed-on: https://go-review.googlesource.com/10518
Reviewed-by: Andrew Gerrand <adg@golang.org>