99 Commits

Author SHA1 Message Date
David Chase
fc5073bc15 runtime,internal: move runtime/internal/sys to internal/runtime/sys
Cleanup and friction reduction

For #65355.

Change-Id: Ia14c9dc584a529a35b97801dd3e95b9acc99a511
Reviewed-on: https://go-review.googlesource.com/c/go/+/600436
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
2024-07-23 19:05:35 +00:00
Nick Ripley
87abb4afb6 runtime: avoid multiple records with identical stacks from MutexProfile
When using frame pointer unwinding, we defer frame skipping and inline
expansion for call stacks until profile reporting time. We can end up
with records which have different stacks if no frames are skipped, but
identical stacks once skipping is taken into account. Returning multiple
records with the same stack (but different values) has broken programs
which rely on the records already being fully aggregated by call stack
when returned from runtime.MutexProfile. This CL addresses the problem
by handling skipping at recording time. We do full inline expansion to
correctly skip the desired number of frames when recording the call
stack, and then handle the rest of inline expansion when reporting the
profile.
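
A hedged sketch of the kind of consumer this change protects, assuming the
usual size-then-fill calling convention for runtime.MutexProfile (the helper
name recordsByStack is made up): such code keys records by call stack and
would silently drop data if the same stack appeared in more than one record.

    package main

    import (
        "fmt"
        "runtime"
    )

    // recordsByStack indexes mutex-profile records by their call stack,
    // relying on each stack appearing in at most one record.
    func recordsByStack() map[[32]uintptr]runtime.BlockProfileRecord {
        n, _ := runtime.MutexProfile(nil)
        p := make([]runtime.BlockProfileRecord, n+50)
        n, ok := runtime.MutexProfile(p)
        if !ok {
            return nil // profile grew between the two calls; a real caller would retry
        }
        byStack := make(map[[32]uintptr]runtime.BlockProfileRecord)
        for _, r := range p[:n] {
            var key [32]uintptr
            copy(key[:], r.Stack())
            // With pre-aggregated records this overwrite is harmless; with
            // duplicate stacks it would lose the earlier Count and Cycles.
            byStack[key] = r
        }
        return byStack
    }

    func main() {
        fmt.Println("unique stacks:", len(recordsByStack()))
    }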

The regression test in this CL is adapted from the reproducer in
https://github.com/grafana/pyroscope-go/issues/103, authored by Tolya
Korniltsev.

Fixes #67548

This reapplies CL 595966.
The original version of this CL introduced a bounds error in
MutexProfile and failed to correctly expand inlined frames from that
call. This CL applies the original CL, fixing the bounds error and
adding a test for the output of MutexProfile to ensure the frames are
expanded properly.

Change-Id: I5ef8aafb9f88152a704034065c0742ba767c4dbb
Reviewed-on: https://go-review.googlesource.com/c/go/+/598515
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-07-17 19:17:19 +00:00
Cherry Mui
6948b4df8c Revert "runtime: avoid multiple records with identical stacks from MutexProfile"
This reverts CL 595966.

Reason for revert: This CL contains a bug. See the comment in https://go-review.googlesource.com/c/go/+/595966/8#message-57f4c1f9570b5fe912e06f4ae3b52817962533c0

Change-Id: I48030907ded173ae20a8965bf1b41a713dd06059
Reviewed-on: https://go-review.googlesource.com/c/go/+/598219
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-07-15 20:09:02 +00:00
Nick Ripley
183a40db6d runtime: avoid multiple records with identical stacks from MutexProfile
When using frame pointer unwinding, we defer frame skipping and inline
expansion for call stacks until profile reporting time. We can end up
with records which have different stacks if no frames are skipped, but
identical stacks once skipping is taken into account. Returning multiple
records with the same stack (but different values) has broken programs
which rely on the records already being fully aggregated by call stack
when returned from runtime.MutexProfile. This CL addresses the problem
by handling skipping at recording time. We do full inline expansion to
correctly skip the desired number of frames when recording the call
stack, and then handle the rest of inline expansion when reporting the
profile.

The regression test in this CL is adapted from the reproducer in
https://github.com/grafana/pyroscope-go/issues/103, authored by Tolya
Korniltsev.

Fixes #67548

Co-Authored-By: Tolya Korniltsev <korniltsev.anatoly@gmail.com>
Change-Id: I6a42ce612377f235b2c8c0cec9ba8e9331224b84
Reviewed-on: https://go-review.googlesource.com/c/go/+/595966
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Carlos Amedee <carlos@golang.org>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
2024-07-09 21:41:42 +00:00
Cherry Mui
82c14346d8 cmd/link: don't disable memory profiling when pprof.WriteHeapProfile is used
We have an optimization that if the memory profile is not consumed
anywhere, we set the memory profiling rate to 0 to disable the
"background" low-rate profiling. We detect whether the memory
profile is used by checking whether the runtime.MemProfile function
is reachable at link time. Previously, all APIs that accessed the
memory profile went through runtime.MemProfile. But the code was
refactored in CL 572396, and now the legacy entry point
WriteHeapProfile uses pprof_memProfileInternal without going
through runtime.MemProfile. In fact, even with the recommended
runtime/pprof.Profile API (pprof.Lookup or pprof.Profiles),
runtime.MemProfile only happens to be reachable through
countHeap.

Change the linker to check runtime.memProfileInternal instead,
which is on all code paths that retrieve the memory profile. Add
a test case for WriteHeapProfile, so we cover all entry points.
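
A minimal program using the legacy entry point mentioned above; with this fix
the linker keeps memory profiling enabled even when pprof.WriteHeapProfile is
the only profiling call (the output file name is illustrative).

    package main

    import (
        "os"
        "runtime/pprof"
    )

    func main() {
        f, err := os.Create("heap.pprof")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        // The only profiling API used by this program.
        if err := pprof.WriteHeapProfile(f); err != nil {
            panic(err)
        }
    }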

Fixes #68136.

Change-Id: I075c8d45c95c81825a1822f032e23107aea4303c
Reviewed-on: https://go-review.googlesource.com/c/go/+/596538
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-07-03 22:43:11 +00:00
Rhys Hiltner
9114c51521 Revert "runtime: prepare for extensions to waiting M list"
This reverts commit be0b569caa0eab1a7f30edf64e550bbf5f6ff235 (CL 585635).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-30 17:57:37 +00:00
Rhys Hiltner
afbbc2894b Revert "runtime: double-link list of waiting Ms"
This reverts commit d881ed6384ae58154d99682f1e20160c64e7c3c2 (CL 585637).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-30 17:56:46 +00:00
Rhys Hiltner
5dead59add Revert "runtime: profile mutex contention during unlock"
This reverts commit ba1c5b2c4573e10f3b5f0e0f25a27f17fba67eb0 (CL 585638).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-05-30 17:56:43 +00:00
Rhys Hiltner
ca7d300509 Revert "runtime: remove GODEBUG=runtimecontentionstacks"
This reverts commit 87e930f7289136fad1310d4b63dd4127e409bac5 (CL 585639)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
2024-05-30 17:52:17 +00:00
Rhys Hiltner
1be701a344 Revert "runtime: split mutex profile clocks"
This reverts commit 8ab131fb1256a4a795c610e145c022e22e2d1567 (CL 586796)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2024-05-30 17:49:07 +00:00
Rhys Hiltner
8ab131fb12 runtime: split mutex profile clocks
Mutex contention measurements work with two clocks: nanotime for use in
runtime/metrics, and cputicks for the runtime/pprof profile. They're
subject to different sampling rates: the runtime/metrics view is always
enabled, but the profile is adjustable and is turned off by default.
They have different levels of overhead: it can take as little as one
instruction to read cputicks while nanotime calls are more elaborate
(although some platforms implement cputicks as a nanotime call). The use
of the timestamps is also different: the profile's view needs to attach
the delay in some Ms' lock2 calls to another M's unlock2 call stack, but
the metric's view is only an int64.

Treat them differently. Don't bother threading the nanotime clock
through to the unlock2 call, measure and report it directly within
lock2. Sample nanotime at a constant gTrackingPeriod.

Don't consult any clocks unless the mutex is actually contended.

Continue liberal use of cputicks for now.

For #66999

Change-Id: I1c2085ea0e695bfa90c30fadedc99ced9eb1f69e
Reviewed-on: https://go-review.googlesource.com/c/go/+/586796
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-05-22 14:34:20 +00:00
Rhys Hiltner
87e930f728 runtime: remove GODEBUG=runtimecontentionstacks
Go 1.22 promised to remove the setting in a future release once the
semantics of runtime-internal lock contention matched that of
sync.Mutex. That work is done, remove the setting.

For #66999

Change-Id: I3c4894148385adf2756d8754e44d7317305ad758
Reviewed-on: https://go-review.googlesource.com/c/go/+/585639
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-21 17:17:36 +00:00
Rhys Hiltner
ba1c5b2c45 runtime: profile mutex contention during unlock
When an M's use of a lock causes delays in other Ms, capture the stack
of the unlock call that caused the delay. This makes the meaning of the
mutex profile for runtime-internal mutexes match the behavior for
sync.Mutex: the profile points to the end of the critical section that
is responsible for delaying other work.
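
For reference, the profile in question is the one users already enable and
read through the standard APIs; a small hedged sketch, with the sampling
fraction and file name chosen arbitrarily.

    package main

    import (
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        // Enable mutex profiling; with this change, contention on
        // runtime-internal locks is attributed to the unlocking call stack,
        // matching the sync.Mutex semantics described above.
        runtime.SetMutexProfileFraction(5) // sample roughly 1 in 5 contention events
        defer runtime.SetMutexProfileFraction(0)

        // ... run the contended workload here ...

        f, err := os.Create("mutex.pprof")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        pprof.Lookup("mutex").WriteTo(f, 0)
    }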

Fixes #66999

Change-Id: I4abc4a1df00a48765d29c07776481a1cbd539ff8
Reviewed-on: https://go-review.googlesource.com/c/go/+/585638
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-21 17:17:34 +00:00
Rhys Hiltner
d881ed6384 runtime: double-link list of waiting Ms
When an M unlocks a contended mutex, it needs to consult a list of the
Ms that had to wait during its critical section. This allows the M to
attribute the appropriate amount of blame to the unlocking call stack.

Mirroring the implementation for users' sync.Mutex contention (via
sudog), we can (in a future commit) use the time that the head and tail
of the wait list started waiting, and the number of waiters, to estimate
the sum of the Ms' delays.

When an M acquires the mutex, it needs to remove itself from the list of
waiters. Since the futex-based lock implementation leaves the OS in
control of the order of M wakeups, we need to be prepared for quickly
(constant time) removing any M from the list.

First, have each M add itself to a singly-linked wait list when it finds
that its lock call will need to sleep. This case is safe against
live-lock, since any delay to one M adding itself to the list would be
due to another M making durable progress.

Second, have the M that holds the lock (either right before releasing,
or right after acquiring) update metadata on the list of waiting Ms to
double-link the list and maintain a tail pointer and waiter count. That
work is amortized-constant: we'll avoid contended locks becoming
proportionally more contended and undergoing performance collapse.
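
A much-simplified, hypothetical sketch of that scheme (not the runtime's
actual implementation, which works directly on M structs): waiters push
themselves with a lock-free CAS, and the lock holder later double-links the
new nodes so that any waiter can be spliced out in constant time.

    package main

    import "sync/atomic"

    // waiter stands in for an M that is about to sleep on a contended lock.
    type waiter struct {
        next *waiter // toward older waiters; set by the waiter itself
        prev *waiter // toward newer waiters; filled in later by the lock holder
    }

    type waitList struct {
        head   atomic.Pointer[waiter] // newest waiter, pushed without the lock
        linked *waiter                // newest waiter already double-linked
        tail   *waiter                // oldest waiter
        count  int
    }

    // enqueue is the waiter's half: a constant-time lock-free push.
    func (l *waitList) enqueue(w *waiter) {
        for {
            old := l.head.Load()
            w.next = old
            if l.head.CompareAndSwap(old, w) {
                return
            }
        }
    }

    // fixup is the holder's half: walk the nodes pushed since the last fixup,
    // add back-links, and maintain tail and count. Amortized O(1) per waiter,
    // and afterwards any linked node can be unlinked with a pointer splice.
    func (l *waitList) fixup() {
        h := l.head.Load()
        for w := h; w != nil && w != l.linked; w = w.next {
            if w.next != nil {
                w.next.prev = w
            } else {
                l.tail = w
            }
            l.count++
        }
        l.linked = h
    }

    func main() {
        var l waitList
        a, b, c := new(waiter), new(waiter), new(waiter)
        l.enqueue(a)
        l.enqueue(b)
        l.enqueue(c)
        l.fixup() // now c <-> b <-> a, tail == a, count == 3
    }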

For #66999

Change-Id: If75cdea915afb59ccec47294e0b52c466aac8736
Reviewed-on: https://go-review.googlesource.com/c/go/+/585637
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
2024-05-21 17:17:31 +00:00
Rhys Hiltner
be0b569caa runtime: prepare for extensions to waiting M list
Move the nextwaitm field into a small struct, in preparation for
additional metadata to track how long Ms need to wait for locks.

For #66999

Change-Id: Ib40e43c15cde22f7e35922641107973d99439ecd
Reviewed-on: https://go-review.googlesource.com/c/go/+/585635
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-05-21 17:17:24 +00:00
Felix Geisendörfer
66cc2b7ca7 runtime: make profstackdepth a GODEBUG option
Allow users to decrease the profiling stack depth back to 32 in case
they experience any problems with the new default of 128.

Users may also use this option to increase the depth up to 1024.

Change-Id: Ieaab2513024915a223239278dd97a6e161dde1cf
Reviewed-on: https://go-review.googlesource.com/c/go/+/581917
Reviewed-by: Austin Clements <austin@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2024-05-21 14:38:56 +00:00
Felix Geisendörfer
1b9dc3e178 runtime: increase profiling stack depth to 128
The current stack depth limit for alloc, mutex, block, threadcreate and
goroutine profiles of 32 frequently leads to truncated stack traces in
production applications. Increase the limit to 128 which is the same
size used by the execution tracer.

Create internal/profilerecord to define variants of the runtime's
StackRecord, MemProfileRecord and BlockProfileRecord types that can hold
arbitrarily big stack traces. Implement internal profiling APIs based on
these new types and use them for creating protobuf profiles and to act
as shims for the public profiling APIs using the old types.

This will lead to an increase in memory usage for applications that
use the impacted profile types and have stack traces exceeding the
current limit of 32. Those applications will also experience a slight
increase in CPU usage, but this will hopefully soon be mitigated via CL
540476 and 533258 which introduce frame pointer unwinding for the
relevant profile types.

For #43669.

Change-Id: Ie53762e65d0f6295f5d4c7d3c87172d5a052164e
Reviewed-on: https://go-review.googlesource.com/c/go/+/572396
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-05-21 14:38:45 +00:00
Felix Geisendörfer
47187a4f4f runtime: fix profile stack trace depth regression
Previously it was possible for mutex and block profile stack traces to
contain up to 32 frames in Stack0 or the resulting pprof profiles.
CL 533258 changed this behavior by using some of the space to
record skipped frames that are discarded when performing delayed inline
expansion. This has lowered the effective maximum stack size from 32 to
27 (the max skip value is 5), which can be seen as a small regression.

Add TestProfilerStackDepth to demonstrate the issue and protect all
profile types from similar regressions in the future. Fix the issue by
increasing the internal maxStack limit to take the maxSkip value into
account. Assert that the maxSkip value is never exceeded when recording
mutex and block profile stack traces.

Three alternative solutions to the problem were considered and
discarded:

1) Revert CL 533258 and give up on frame pointer unwinding. This seems
   unappealing as we would lose the performance benefits of frame
   pointer unwinding.
2) Discard skipped frames when recording the initial stack trace. This
   would require eager inline expansion for up to maxSkip frames and
   partially negate the performance benefits of frame pointer
   unwinding.
3) Accept and document the new behavior. This would simplify the
   implementation, but seems more confusing from a user perspective. It
   also complicates the creation of test cases that make assertions
   about the maximum profiling stack depth.

The execution tracer still has the same issue due to CL 463835. This
should be addressed in a follow-up CL.

Co-authored-by: Nick Ripley <nick.ripley@datadoghq.com>
Change-Id: Ibf4dbf08a5166c9cb32470068c69f58bc5f98d2c
Reviewed-on: https://go-review.googlesource.com/c/go/+/586657
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-05-21 14:38:39 +00:00
Nick Ripley
f4494522dc runtime: use frame pointer unwinding for block and mutex profilers
Use frame pointer unwinding, where supported, to collect call stacks for
the block, and mutex profilers. This method of collecting call stacks is
typically an order of magnitude faster than callers/tracebackPCs. The
marginal benefit for these profile types is likely small compared to
using frame pointer unwinding for the execution tracer. However, the
block profiler can have noticeable overhead unless the sampling rate is
very high. Additionally, using frame pointer unwinding in more places
helps ensure more testing/support, which benefits systems like the
execution tracer which rely on frame pointer unwinding to be practical
to use.
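
For context, the block profiler's sampling rate is set with
runtime.SetBlockProfileRate; a small hedged sketch (rate and workload chosen
arbitrarily) where a low rate samples many events and so benefits most from
cheaper stack collection.

    package main

    import (
        "runtime"
        "sync"
    )

    func main() {
        // The rate is nanoseconds of blocking per sample: smaller values
        // sample more events and cost more to collect.
        runtime.SetBlockProfileRate(1)
        defer runtime.SetBlockProfileRate(0)

        var mu sync.Mutex
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    mu.Lock()
                    mu.Unlock()
                }
            }()
        }
        wg.Wait() // contention above lands in the "block" profile
    }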

Change-Id: I4b36c90cd2df844645fd275a41b247352d635727
Reviewed-on: https://go-review.googlesource.com/c/go/+/533258
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-13 16:42:35 +00:00
Felix Geisendörfer
2141315251 runtime: move profiling pc buffers to m
Move profiling pc buffers from being stack allocated to an m field.

This is motivated by the next patch, which will increase the default
stack depth to 128, which might lead to undesirable stack growth for
goroutines that produce profiling events.

Additionally, this change paves the way to make the stack depth
configurable via GODEBUG.

Change-Id: Ifa407f899188e2c7c0a81de92194fdb627cb4b36
Reviewed-on: https://go-review.googlesource.com/c/go/+/574699
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2024-05-08 17:48:38 +00:00
Michael Anthony Knyszek
11047345f5 runtime: remove allocfreetrace
allocfreetrace prints all allocations and frees to stderr. It's not
terribly useful because it has a really huge overhead, making it not
feasible to use except for the most trivial programs. A follow-up CL
will replace it with something that is both more thorough and also lower
overhead.

Change-Id: I1d668fee8b6aaef5251a5aea3054ec2444d75eb6
Reviewed-on: https://go-review.googlesource.com/c/go/+/583376
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
2024-05-08 17:44:56 +00:00
Andy Pan
4c2b1e0feb runtime: migrate internal/atomic to internal/runtime
For #65355

Change-Id: I65dd090fb99de9b231af2112c5ccb0eb635db2be
Reviewed-on: https://go-review.googlesource.com/c/go/+/560155
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ibrahim Bazoka <ibrahimbazoka729@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
2024-03-25 19:53:03 +00:00
apocelipes
c8c46e746b runtime: use built-in clear to simplify code
Change-Id: Icb6d9ca996b4119d8636d9f7f6a56e510d74d059
GitHub-Last-Rev: 08178e8ff798f4a51860573788c9347a0fb6bc40
GitHub-Pull-Request: golang/go#66188
Reviewed-on: https://go-review.googlesource.com/c/go/+/569979
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: qiulaidongfeng <2645477756@qq.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2024-03-08 16:28:44 +00:00
Michael Pratt
65f056d07a runtime: avoid new linkname for goroutine profiles
CL 464349 added a new linkname to provide gcount to runtime/pprof to
avoid a STW when estimating the goroutine profile allocation size.

However, adding a linkname here isn't necessary for a few reasons:

1. We already export gcount via NumGoroutine. I completely forgot about
   this during review.
2. aktau suggested that goroutineProfileWithLabelsConcurrent return
   gcount as a fast path estimate when the input is empty.

The second point keeps the code cleaner overall, so I've done that.

For #54014.

Change-Id: I6cb0811a769c805e269b55774cdd43509854078e
Reviewed-on: https://go-review.googlesource.com/c/go/+/559515
Auto-Submit: Michael Pratt <mpratt@google.com>
Auto-Submit: Nicolas Hillegeer <aktau@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-01-30 16:34:07 +00:00
Jun10ng
f96d9a6432 runtime: reduce one STW when obtaining the goroutine profile
Fixes #54014

Change-Id: If4ee2752008729e1ed4b767cfda52effdcec4959
GitHub-Last-Rev: 5ce300bf5128f842604d85d5f8749027c8e091c2
GitHub-Pull-Request: golang/go#58239
Reviewed-on: https://go-review.googlesource.com/c/go/+/464349
Reviewed-by: qiulaidongfeng <2645477756@qq.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: qiulaidongfeng <2645477756@qq.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2024-01-30 00:04:58 +00:00
Olivier Mengué
202b435969 runtime: more godoc links
Change-Id: I8fe66326994894b17ce0eda991bba942844d26b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/541475
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-01-15 21:21:08 +00:00
Michael Pratt
98fd8f5768 runtime: rename GODEBUG=profileruntimelocks to runtimecontentionstacks
profileruntimelocks is new in CL 544195, but the name is deceptive. Even
with profileruntimelocks=0, runtime-internal locks are still profiled.
The actual difference is that call stacks are not collected. Instead all
contention is reported at runtime._LostContendedLock.

Rename this setting to runtimecontentionstacks to make its name more
aligned with its behavior.

In addition, for this release the default is profileruntimelocks=0,
meaning that users are fairly likely to encounter
runtime._LostContendedLock. Rename it to
runtime._LostContendedRuntimeLock in an attempt to make it more
intuitive that these are runtime locks, not locks in application code.

For #57071.

Change-Id: I38aac28b2c0852db643d53b1eab3f3bc42a43393
Reviewed-on: https://go-review.googlesource.com/c/go/+/547055
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
2023-12-06 17:57:59 +00:00
Russ Cox
c29444ef39 math/rand, math/rand/v2: use ChaCha8 for global rand
Move ChaCha8 code into internal/chacha8rand and use it to implement
runtime.rand, which is used for the unseeded global source for
both math/rand and math/rand/v2. This also affects the calculation of
the start point for iteration over very very large maps (when the
32-bit fastrand is not big enough).

The benefit is that misuse of the global random number generators
in math/rand and math/rand/v2 in contexts where non-predictable
randomness is important for security reasons is no longer a
security problem, removing a common mistake among programmers
who are unaware of the different kinds of randomness.
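
A small illustration of the affected surface: the top-level, unseeded
generators of both packages now draw from the ChaCha8-based runtime.rand.

    package main

    import (
        "fmt"
        randv1 "math/rand"
        randv2 "math/rand/v2"
    )

    func main() {
        fmt.Println(randv1.Int63())   // global math/rand source
        fmt.Println(randv2.Uint64())  // global math/rand/v2 source
        fmt.Println(randv2.IntN(100)) // bounded draw from the same global source
    }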

The cost is an extra 304 bytes per thread stored in the m struct
plus 2-3ns more per random uint64 due to the more sophisticated
algorithm. Using PCG looks like it would cost about the same,
although I haven't benchmarked that.

Before this, the math/rand and math/rand/v2 global generator
was wyrand (https://github.com/wangyi-fudan/wyhash).
For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson
ALFG was justifiable, since the latter was not any better.
But for math/rand/v2, the global generator really should be
at least as good as one of the well-studied, specific algorithms
provided directly by the package, and it's not.

(Wyrand is still reasonable for scheduling and cache decisions.)

Good randomness does have a cost: about twice wyrand.

Also rationalize the various runtime rand references.

goos: linux
goarch: amd64
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                        │ bbb48afeb7.amd64 │           5cf807d1ea.amd64           │
                        │      sec/op      │    sec/op     vs base                │
ChaCha8-32                     1.862n ± 2%    1.861n ± 2%        ~ (p=0.825 n=20)
PCG_DXSM-32                    1.471n ± 1%    1.460n ± 2%        ~ (p=0.153 n=20)
SourceUint64-32                1.636n ± 2%    1.582n ± 1%   -3.30% (p=0.000 n=20)
GlobalInt64-32                 2.087n ± 1%    3.663n ± 1%  +75.54% (p=0.000 n=20)
GlobalInt64Parallel-32        0.1042n ± 1%   0.2026n ± 1%  +94.48% (p=0.000 n=20)
GlobalUint64-32                2.263n ± 2%    3.724n ± 1%  +64.57% (p=0.000 n=20)
GlobalUint64Parallel-32       0.1019n ± 1%   0.1973n ± 1%  +93.67% (p=0.000 n=20)
Int64-32                       1.771n ± 1%    1.774n ± 1%        ~ (p=0.449 n=20)
Uint64-32                      1.863n ± 2%    1.866n ± 1%        ~ (p=0.364 n=20)
GlobalIntN1000-32              3.134n ± 3%    4.730n ± 2%  +50.95% (p=0.000 n=20)
IntN1000-32                    2.489n ± 1%    2.489n ± 1%        ~ (p=0.683 n=20)
Int64N1000-32                  2.521n ± 1%    2.516n ± 1%        ~ (p=0.394 n=20)
Int64N1e8-32                   2.479n ± 1%    2.478n ± 2%        ~ (p=0.743 n=20)
Int64N1e9-32                   2.530n ± 2%    2.514n ± 2%        ~ (p=0.193 n=20)
Int64N2e9-32                   2.501n ± 1%    2.494n ± 1%        ~ (p=0.616 n=20)
Int64N1e18-32                  3.227n ± 1%    3.205n ± 1%        ~ (p=0.101 n=20)
Int64N2e18-32                  3.647n ± 1%    3.599n ± 1%        ~ (p=0.019 n=20)
Int64N4e18-32                  5.135n ± 1%    5.069n ± 2%        ~ (p=0.034 n=20)
Int32N1000-32                  2.657n ± 1%    2.637n ± 1%        ~ (p=0.180 n=20)
Int32N1e8-32                   2.636n ± 1%    2.636n ± 1%        ~ (p=0.763 n=20)
Int32N1e9-32                   2.660n ± 2%    2.638n ± 1%        ~ (p=0.358 n=20)
Int32N2e9-32                   2.662n ± 2%    2.618n ± 2%        ~ (p=0.064 n=20)
Float32-32                     2.272n ± 2%    2.239n ± 2%        ~ (p=0.194 n=20)
Float64-32                     2.272n ± 1%    2.286n ± 2%        ~ (p=0.763 n=20)
ExpFloat64-32                  3.762n ± 1%    3.744n ± 1%        ~ (p=0.171 n=20)
NormFloat64-32                 3.706n ± 1%    3.655n ± 2%        ~ (p=0.066 n=20)
Perm3-32                       32.93n ± 3%    34.62n ± 1%   +5.13% (p=0.000 n=20)
Perm30-32                      202.9n ± 1%    204.0n ± 1%        ~ (p=0.482 n=20)
Perm30ViaShuffle-32            115.0n ± 1%    114.9n ± 1%        ~ (p=0.358 n=20)
ShuffleOverhead-32             112.8n ± 1%    112.7n ± 1%        ~ (p=0.692 n=20)
Concurrent-32                  2.107n ± 0%    3.725n ± 1%  +76.75% (p=0.000 n=20)

goos: darwin
goarch: arm64
pkg: math/rand/v2
                       │ bbb48afeb7.arm64 │           5cf807d1ea.arm64            │
                       │      sec/op      │    sec/op     vs base                 │
ChaCha8-8                     2.480n ± 0%    2.429n ± 0%    -2.04% (p=0.000 n=20)
PCG_DXSM-8                    2.531n ± 0%    2.530n ± 0%         ~ (p=0.877 n=20)
SourceUint64-8                2.534n ± 0%    2.533n ± 0%         ~ (p=0.732 n=20)
GlobalInt64-8                 2.172n ± 1%    4.794n ± 0%  +120.67% (p=0.000 n=20)
GlobalInt64Parallel-8        0.4320n ± 0%   0.9605n ± 0%  +122.32% (p=0.000 n=20)
GlobalUint64-8                2.182n ± 0%    4.770n ± 0%  +118.58% (p=0.000 n=20)
GlobalUint64Parallel-8       0.4307n ± 0%   0.9583n ± 0%  +122.51% (p=0.000 n=20)
Int64-8                       4.107n ± 0%    4.104n ± 0%         ~ (p=0.416 n=20)
Uint64-8                      4.080n ± 0%    4.080n ± 0%         ~ (p=0.052 n=20)
GlobalIntN1000-8              2.814n ± 2%    5.643n ± 0%  +100.50% (p=0.000 n=20)
IntN1000-8                    4.141n ± 0%    4.139n ± 0%         ~ (p=0.140 n=20)
Int64N1000-8                  4.140n ± 0%    4.140n ± 0%         ~ (p=0.313 n=20)
Int64N1e8-8                   4.140n ± 0%    4.139n ± 0%         ~ (p=0.103 n=20)
Int64N1e9-8                   4.139n ± 0%    4.140n ± 0%         ~ (p=0.761 n=20)
Int64N2e9-8                   4.140n ± 0%    4.140n ± 0%         ~ (p=0.636 n=20)
Int64N1e18-8                  5.266n ± 0%    5.326n ± 1%    +1.14% (p=0.001 n=20)
Int64N2e18-8                  6.052n ± 0%    6.167n ± 0%    +1.90% (p=0.000 n=20)
Int64N4e18-8                  8.826n ± 0%    9.051n ± 0%    +2.55% (p=0.000 n=20)
Int32N1000-8                  4.127n ± 0%    4.132n ± 0%    +0.12% (p=0.000 n=20)
Int32N1e8-8                   4.126n ± 0%    4.131n ± 0%    +0.12% (p=0.000 n=20)
Int32N1e9-8                   4.127n ± 0%    4.132n ± 0%    +0.12% (p=0.000 n=20)
Int32N2e9-8                   4.132n ± 0%    4.131n ± 0%         ~ (p=0.017 n=20)
Float32-8                     4.109n ± 0%    4.105n ± 0%         ~ (p=0.379 n=20)
Float64-8                     4.107n ± 0%    4.106n ± 0%         ~ (p=0.867 n=20)
ExpFloat64-8                  5.339n ± 0%    5.383n ± 0%    +0.82% (p=0.000 n=20)
NormFloat64-8                 5.735n ± 0%    5.737n ± 1%         ~ (p=0.856 n=20)
Perm3-8                       26.65n ± 0%    26.80n ± 1%    +0.58% (p=0.000 n=20)
Perm30-8                      194.8n ± 1%    197.0n ± 0%    +1.18% (p=0.000 n=20)
Perm30ViaShuffle-8            156.6n ± 0%    157.6n ± 1%    +0.61% (p=0.000 n=20)
ShuffleOverhead-8             124.9n ± 0%    125.5n ± 0%    +0.52% (p=0.000 n=20)
Concurrent-8                  2.434n ± 3%    5.066n ± 0%  +108.09% (p=0.000 n=20)

goos: linux
goarch: 386
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                        │ bbb48afeb7.386 │            5cf807d1ea.386             │
                        │     sec/op     │    sec/op     vs base                 │
ChaCha8-32                  11.295n ± 1%    4.748n ± 2%   -57.96% (p=0.000 n=20)
PCG_DXSM-32                  7.693n ± 1%    7.738n ± 2%         ~ (p=0.542 n=20)
SourceUint64-32              7.658n ± 2%    7.622n ± 2%         ~ (p=0.344 n=20)
GlobalInt64-32               3.473n ± 2%    7.526n ± 2%  +116.73% (p=0.000 n=20)
GlobalInt64Parallel-32      0.3198n ± 0%   0.5444n ± 0%   +70.22% (p=0.000 n=20)
GlobalUint64-32              3.612n ± 0%    7.575n ± 1%  +109.69% (p=0.000 n=20)
GlobalUint64Parallel-32     0.3168n ± 0%   0.5403n ± 0%   +70.51% (p=0.000 n=20)
Int64-32                     7.673n ± 2%    7.789n ± 1%         ~ (p=0.122 n=20)
Uint64-32                    7.773n ± 1%    7.827n ± 2%         ~ (p=0.920 n=20)
GlobalIntN1000-32            6.268n ± 1%    9.581n ± 1%   +52.87% (p=0.000 n=20)
IntN1000-32                  10.33n ± 2%    10.45n ± 1%         ~ (p=0.233 n=20)
Int64N1000-32                10.98n ± 2%    11.01n ± 1%         ~ (p=0.401 n=20)
Int64N1e8-32                 11.19n ± 2%    10.97n ± 1%         ~ (p=0.033 n=20)
Int64N1e9-32                 11.06n ± 1%    11.08n ± 1%         ~ (p=0.498 n=20)
Int64N2e9-32                 11.10n ± 1%    11.01n ± 2%         ~ (p=0.995 n=20)
Int64N1e18-32                15.23n ± 2%    15.04n ± 1%         ~ (p=0.973 n=20)
Int64N2e18-32                15.89n ± 1%    15.85n ± 1%         ~ (p=0.409 n=20)
Int64N4e18-32                18.96n ± 2%    19.34n ± 2%         ~ (p=0.048 n=20)
Int32N1000-32                10.46n ± 2%    10.44n ± 2%         ~ (p=0.480 n=20)
Int32N1e8-32                 10.46n ± 2%    10.49n ± 2%         ~ (p=0.951 n=20)
Int32N1e9-32                 10.28n ± 2%    10.26n ± 1%         ~ (p=0.431 n=20)
Int32N2e9-32                 10.50n ± 2%    10.44n ± 2%         ~ (p=0.249 n=20)
Float32-32                   13.80n ± 2%    13.80n ± 2%         ~ (p=0.751 n=20)
Float64-32                   23.55n ± 2%    23.87n ± 0%         ~ (p=0.408 n=20)
ExpFloat64-32                15.36n ± 1%    15.29n ± 2%         ~ (p=0.316 n=20)
NormFloat64-32               13.57n ± 1%    13.79n ± 1%    +1.66% (p=0.005 n=20)
Perm3-32                     45.70n ± 2%    46.99n ± 2%    +2.81% (p=0.001 n=20)
Perm30-32                    399.0n ± 1%    403.8n ± 1%    +1.19% (p=0.006 n=20)
Perm30ViaShuffle-32          349.0n ± 1%    350.4n ± 1%         ~ (p=0.909 n=20)
ShuffleOverhead-32           322.3n ± 1%    323.8n ± 1%         ~ (p=0.410 n=20)
Concurrent-32                3.331n ± 1%    7.312n ± 1%  +119.50% (p=0.000 n=20)

For #61716.

Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2
Reviewed-on: https://go-review.googlesource.com/c/go/+/516860
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-12-05 20:34:30 +00:00
Rhys Hiltner
450ecbe905 runtime: profile contended lock calls
Add runtime-internal locks to the mutex contention profile.

Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.

Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.

Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.

Fixes #57071

This is a roll-forward of https://go.dev/cl/528657, which was reverted
in https://go.dev/cl/543660

Reason for revert: de-flakes tests (reduces dependence on fine-grained
timers, correctly identifies contention on big-endian futex locks,
attempts to measure contention in the semaphore implementation but only
uses that secondary measurement to finish the test early, skips tests on
single-processor systems)

Change-Id: I31389f24283d85e46ad9ba8d4f514cb9add8dfb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/544195
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
2023-11-21 21:02:20 +00:00
Matthew Dempsky
468bc94188 Revert "runtime: profile contended lock calls"
This reverts commit go.dev/cl/528657.

Reason for revert: broke a lot of builders.

Change-Id: I70c33062020e997c4df67b3eaa2e886cf0da961e
Reviewed-on: https://go-review.googlesource.com/c/go/+/543660
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-20 13:20:29 +00:00
Rhys Hiltner
0b31a46f1f runtime: profile contended lock calls
Add runtime-internal locks to the mutex contention profile.

Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.

Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.

Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.

Fixes #57071

Change-Id: I3eee0ccbfc20f333b56f20d8725dfd7f3a526b41
Reviewed-on: https://go-review.googlesource.com/c/go/+/528657
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Than McIntosh <thanm@google.com>
2023-11-17 23:04:58 +00:00
Michael Pratt
6ef98ac87c runtime/metrics: add STW stopping and total time metrics
This CL adds four new time histogram metrics:

/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds

The "stopping" metrics measure the time taken to start a stop-the-world
pause. i.e., how long it takes stopTheWorldWithSema to stop all Ps.
This can be used to detect STW struggling to preempt Ps.

The "total" metrics measure the total duration of a stop-the-world
pause, from starting to stop-the-world until the world is started again.
This includes the time spent in the "start" phase.

The "gc" metrics are used for GC-related STW pauses. The "other" metrics
are used for all other STW pauses.
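
A short sketch of reading the four new histograms through the runtime/metrics
package (output formatting is illustrative).

    package main

    import (
        "fmt"
        "runtime/metrics"
    )

    func main() {
        samples := []metrics.Sample{
            {Name: "/sched/pauses/stopping/gc:seconds"},
            {Name: "/sched/pauses/stopping/other:seconds"},
            {Name: "/sched/pauses/total/gc:seconds"},
            {Name: "/sched/pauses/total/other:seconds"},
        }
        metrics.Read(samples)
        for _, s := range samples {
            if s.Value.Kind() == metrics.KindFloat64Histogram {
                h := s.Value.Float64Histogram()
                fmt.Printf("%s: %d buckets\n", s.Name, len(h.Counts))
            }
        }
    }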

All of these metrics start timing in stopTheWorldWithSema only after
successfully acquiring sched.lock, thus excluding lock contention on
sched.lock. The reasoning behind this is that while waiting on
sched.lock the world is not stopped at all (all other Ps can run), so
the impact of this contention is primarily limited to the goroutine
attempting to stop-the-world. Additionally, we already have some
visibility into sched.lock contention via contention profiles (#57071).

/sched/pauses/total/gc:seconds is conceptually equivalent to
/gc/pauses:seconds, so the latter is marked as deprecated and returns
the same histogram as the former.

In the implementation, there are a few minor differences:

* For both mark and sweep termination stops, /gc/pauses:seconds started
  timing prior to calling startTheWorldWithSema, thus including lock
  contention.

These details are minor enough that I do not believe the slight change
in reporting will matter. For mark termination stops, moving the timing stop
into startTheWorldWithSema does have the side effect of requiring that
other GC metric calculations move outside of the STW, as they depend on the
same end time.

Fixes #63340

Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-15 16:49:45 +00:00
Michael Anthony Knyszek
130baf3d42 runtime: improve tickspersecond
Currently tickspersecond forces a 100 millisecond sleep the first time
it's called. This isn't great for profiling short-lived programs, since
both CPU profiling and block profiling might call into it.

100 milliseconds is a long time, but it's chosen to try to capture a
decent estimate of the conversion on platforms with coarse-granularity
clocks. If the granularity is 15 ms, it'll only be 15% off at worst.

Let's try a different strategy. First, let's require 5 milliseconds of
time to have elapsed at a minimum. This should be plenty on platforms
with nanosecond time granularity from the system clock, provided the
caller of tickspersecond intends to use it for calculating durations,
not timestamps. Next, grab a timestamp as close to process start as
possible, so that we can cover some of that 5 millisecond just during
runtime start.

Finally, this function is only ever called from normal goroutine
contexts. Let's do a regular goroutine sleep instead of a thread-level
sleep under a runtime lock, which has all sorts of nasty effects on
preemption.

While we're here, let's also rename tickspersecond to ticksPerSecond.
Also, let's write down some explicit rules of thumb on when to use this
function. Clocks are hard, and using this for timestamp conversion is
likely to make lining up those timestamps with other clocks on the
system difficult if not impossible.

Note that while this improves ticksPerSecond on platforms with good
clocks, we still end up with a pretty coarse sleep on platforms with
coarse clocks, and a pretty coarse result. On these platforms, keep the
minimum required elapsed time at 100 ms. There's not much we can do
about these platforms except spin and try to catch the clock boundary,
but at 10+ ms of granularity, that might be a lot of spinning.

Fixes #63103.
Fixes #63078.

Change-Id: Ic32a4ba70a03bdf5c13cb80c2669c4064aa4cca2
Reviewed-on: https://go-review.googlesource.com/c/go/+/538898
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2023-11-09 22:38:24 +00:00
cui fliter
6a1bbca2b3 runtime: add available godoc link
Change-Id: Ifb4844efddcb0369b0302eeab72394eeaf5c8072
Reviewed-on: https://go-review.googlesource.com/c/go/+/540022
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: shuang cui <imcusg@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-08 16:59:11 +00:00
Rhys Hiltner
9cdcb01320 runtime/pprof: include labels for caller of goroutine profile
The goroutine profile has close to three code paths for adding a
goroutine record to the goroutine profile: one for the goroutine that
requested the profile, one for every other goroutine, plus some special
handling for the finalizer goroutine. The first of those captured the
goroutine stack, but neglected to include that goroutine's labels.

Update the tests to check for the inclusion of labels for all three
types of goroutines, and include labels for the creator of the goroutine
profile.
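
A hedged sketch of the scenario covered by the fix: the goroutine that
requests the profile also carries labels of its own (the label key and value
here are made up).

    package main

    import (
        "context"
        "os"
        "runtime/pprof"
    )

    func main() {
        // This goroutine both sets labels and requests the goroutine profile;
        // with this change its own record includes those labels too.
        pprof.Do(context.Background(), pprof.Labels("requester", "true"), func(ctx context.Context) {
            pprof.Lookup("goroutine").WriteTo(os.Stdout, 0)
        })
    }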

For #63712

Change-Id: Id5387a5f536d3c37268c240e0b6db3d329a3d632
Reviewed-on: https://go-review.googlesource.com/c/go/+/537515
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: David Chase <drchase@google.com>
2023-10-25 17:37:34 +00:00
Andy Pan
a7c3de7052 runtime: document maxStack and m.createstack in more details
Change-Id: If93b6cfa5a598a5f4101c879a0cd88a194e4a6aa
Reviewed-on: https://go-review.googlesource.com/c/go/+/518116
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Andy Pan <panjf2000@gmail.com>
2023-08-21 17:10:00 +00:00
Russ Cox
1c00354013 runtime: change mutex profile to count every blocked goroutine
The pprof mutex profile was meant to match the Google C++ (now Abseil)
mutex profiler, originally designed and implemented by Mike Burrows.
When we worked on the Go version, pjw and I missed that C++ counts the
time each thread is blocked, even if multiple threads are blocked on a
mutex. That is, if 100 threads are blocked on the same mutex for the
same 10ms, that still counts as 1000ms of contention in C++. In Go, to
date, /debug/pprof/mutex has counted that as only 10ms of contention.
If 100 goroutines are blocked on one mutex and only 1 goroutine is
blocked on another mutex, we probably do want to see the first mutex
as being more contended, so the Abseil approach is the more useful one.

This CL adopts "contention scales with number of goroutines blocked",
to better match Abseil [1]. However, it still makes sure to attribute the
time to the unlock that caused the backup, not subsequent innocent
unlocks that were affected by the congestion. In this way it still gives
more accurate profiles than Abseil does.

[1] https://github.com/abseil/abseil-cpp/blob/lts_2023_01_25/absl/synchronization/mutex.cc#L2390

Fixes #61015.

Change-Id: I7eb9e706867ffa8c0abb5b26a1b448f6eba49331
Reviewed-on: https://go-review.googlesource.com/c/go/+/506415
Run-TryBot: Russ Cox <rsc@golang.org>
Auto-Submit: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2023-08-17 01:30:31 +00:00
Michael Anthony Knyszek
b1aadd034c runtime: emit STW events for all pauses, not just those for the GC
Currently STW events are only emitted for GC STWs. There's little reason
why the trace can't contain events for every STW: they're rare so don't
take up much space in the trace, yet being able to see when the world
was stopped is often critical to debugging certain latency issues,
especially when they stem from user-level APIs.

This change adds new "kinds" to the EvGCSTWStart event, renames the
GCSTW events to just "STW," and lets the parser deal with unknown STW
kinds for future backwards compatibility.

But, this change must break trace compatibility, so it bumps the trace
version to Go 1.21.

This change also includes a small cleanup in the trace command, which
previously checked for STW events when deciding whether user tasks
overlapped with a GC. Looking at the source, I don't see a way for STW
events to ever enter the stream that that code looks at, so that
condition has been deleted.

Change-Id: I9a5dc144092c53e92eb6950e9a5504a790ac00cf
Reviewed-on: https://go-review.googlesource.com/c/go/+/494495
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
2023-05-19 17:06:45 +00:00
David Chase
a2838ec5f2 runtime: redefine _type to abi.Type; add rtype for methods.
Change-Id: I1c478b704d84811caa209006c657dda82d9c4cf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/488435
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
2023-05-11 04:50:30 +00:00
Austin Clements
3e360b035f runtime: new API for filling PC traceback buffers
Currently, filling PC traceback buffers is one of the jobs of
gentraceback. This moves it into a new function, tracebackPCs, with a
simple API built around unwinder, and changes all callers to use this
new API.

Updates #54466.

Change-Id: Id2038bded81bf533a5a4e71178a7c014904d938c
Reviewed-on: https://go-review.googlesource.com/c/go/+/468300
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
2023-03-10 17:59:37 +00:00
Keith Randall
469afbcc46 runtime: remove go119ConcurrentGoroutineProfile flag
Change-Id: If7a248374dcb2c276d2d85a4863eb2ed1bc246a4
Reviewed-on: https://go-review.googlesource.com/c/go/+/463226
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
2023-01-28 19:56:15 +00:00
Nick Ripley
30b1af00ff runtime/pprof: scale mutex profile samples when they are recorded
Samples in the mutex profile have their count and duration scaled
according to the probability they were sampled. This is done when the
profile is actually requested. The adjustment is done using the
currently configured sampling rate. However, if the sample rate is changed
after a specific sample is recorded, then the sample will be scaled
incorrectly. In particular, if the sampling rate is changed to 0, all of
the samples in the encoded profile will have 0 count and duration. This
means the profile will be "empty", even if it should have had samples.
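
A sketch of that ordering, with an arbitrarily chosen fraction; the comments
describe the before/after behavior from the paragraph above.

    package main

    import (
        "os"
        "runtime"
        "runtime/pprof"
    )

    func main() {
        runtime.SetMutexProfileFraction(5)
        // ... contended sync.Mutex work happens here and gets sampled ...

        runtime.SetMutexProfileFraction(0) // stop sampling new events

        // Before this change, encoding rescaled the already-recorded samples
        // using the new fraction (0), so the written profile looked empty;
        // scaling at record time keeps the recorded counts intact.
        pprof.Lookup("mutex").WriteTo(os.Stdout, 0)
    }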

This CL scales the samples in the profile when they are recorded, rather
than when the profile is requested. This matches what is currently done
for the block profile.

With this change, neither the block profile nor mutex profile are scaled
when they are encoded, so the logic for scaling the samples can be
removed.

Change-Id: If228cf39284385aa8fb9a2d62492d839e02f027f
Reviewed-on: https://go-review.googlesource.com/c/go/+/443056
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
2022-11-15 17:33:07 +00:00
Wang Deyu
f63b268b9a runtime: avoid initializing MemProfileRate in init function
Fixes #55100

Change-Id: Ibbff921e74c3a416fd8bb019d20410273961c015
Reviewed-on: https://go-review.googlesource.com/c/go/+/431315
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
2022-10-05 17:31:56 +00:00
Leonard Wang
bd5595d7fa runtime: refactor finalizer goroutine status
Use an atomic.Uint32 to represent the state of the finalizer goroutine.
fingStatus will only be changed to fingWake in a non-fingWait state,
so it is safe to set the fingRunningFinalizer status in runfinq.

name            old time/op  new time/op  delta
Finalizer-8      592µs ± 4%   561µs ± 1%  -5.22%  (p=0.000 n=10+10)
FinalizerRun-8   694ns ± 6%   675ns ± 7%    ~     (p=0.059 n=9+8)

Change-Id: I7e4da30cec98ce99f7d8cf4c97f933a8a2d1cae1
Reviewed-on: https://go-review.googlesource.com/c/go/+/400134
Reviewed-by: Joedian Reid <joedian@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Michael Pratt <mpratt@google.com>
2022-09-05 08:28:34 +00:00
Cuong Manh Le
a719a78c1b runtime: add and use runtime/internal/sys.NotInHeap
Updates #46731

Change-Id: Ic2208c8bb639aa1e390be0d62e2bd799ecf20654
Reviewed-on: https://go-review.googlesource.com/c/go/+/421878
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
2022-08-19 00:29:18 +00:00
Jun10ng
de8101d21b runtime: fix typos
"finializer" =>  "finalizer"

Change-Id: Ia3c12bff8556b6a8d33b700c59357f47502757b1
GitHub-Last-Rev: c64cf47974020c8480039ba61d0890bdc07a3b0f
GitHub-Pull-Request: golang/go#53917
Reviewed-on: https://go-review.googlesource.com/c/go/+/417915
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Keith Randall <khr@google.com>
2022-07-18 18:04:23 +00:00
John Bampton
b2116f748a all: fix spelling
Change-Id: Iee18987c495d1d4bde9da888d454eea8079d3ebc
GitHub-Last-Rev: ff5e01599ddf7deb3ab6ce190ba92eb02ae2cb15
GitHub-Pull-Request: golang/go#52949
Reviewed-on: https://go-review.googlesource.com/c/go/+/406915
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
2022-05-17 21:46:33 +00:00
Michael Anthony Knyszek
8fdd277fe6 runtime: profile finalizer G more carefully in goroutine profile
If SetFinalizer is never called, we might readgstatus on a nil fing
variable, resulting in a crash. This change guards code that accesses
fing with a nil check.

Fixes #52821.

Change-Id: I3e8e7004f97f073dc622e801a1d37003ea153a29
Reviewed-on: https://go-review.googlesource.com/c/go/+/405475
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
2022-05-10 15:43:40 +00:00
Rhys Hiltner
89c0dd829f runtime: split mprof locks
The profiles for memory allocations, sync.Mutex contention, and general
blocking store their data in a shared hash table. The bookkeeping work
at the end of a garbage collection cycle involves maintenance on each
memory allocation record. Previously, a single lock guarded access to
the hash table and the contents of all records. When a program has
allocated memory at a large number of unique call stacks, the
maintenance following every garbage collection can hold that lock for
several milliseconds. That can prevent progress on all other goroutines
by delaying acquirep's call to mcache.prepareForSweep, which needs the
lock in mProf_Free to report when a profiled allocation is no longer in
use. With no user goroutines making progress, it is in effect a
multi-millisecond GC-related stop-the-world pause.

Split the lock so the call to mProf_Flush no longer delays each P's call
to mProf_Free: mProf_Free uses a lock on the memory records' N+1 cycle,
and mProf_Flush uses locks on the memory records' accumulator and their
N cycle. mProf_Malloc also no longer competes with mProf_Flush, as it
uses a lock on the memory records' N+2 cycle. The profiles for
sync.Mutex contention and general blocking now share a separate lock,
and another lock guards insertions to the shared hash table (uncommon in
the steady-state). Consumers of each type of profile take the matching
accumulator lock, so they will observe consistent count and magnitude
values for each record.
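
A hedged, self-contained illustration of the splitting idea (a single
record and invented names such as memRecord, futureLock, and flush, not
the runtime's actual mprof code): each path locks only the cycle it
touches, so a slow flush no longer blocks allocation or free reporting.

    package main

    import (
        "fmt"
        "sync"
    )

    // memRecord keeps three in-flight GC cycles plus a published
    // accumulator, loosely mirroring the N/N+1/N+2 scheme above.
    type memRecord struct {
        future [3]int64 // cycles N, N+1, N+2 (allocated bytes, say)
        active int64    // published value that consumers read
    }

    var (
        rec memRecord

        // Previously a single lock guarded everything; here each path
        // takes only the lock for the data it touches.
        futureLock [3]sync.Mutex // one per in-flight cycle
        activeLock sync.Mutex    // accumulator read by profile consumers
    )

    // recordAlloc plays the role of mProf_Malloc: write into cycle N+2.
    func recordAlloc(cycle uint32, bytes int64) {
        i := (cycle + 2) % 3
        futureLock[i].Lock()
        rec.future[i] += bytes
        futureLock[i].Unlock()
    }

    // recordFree plays the role of mProf_Free: write into cycle N+1.
    func recordFree(cycle uint32, bytes int64) {
        i := (cycle + 1) % 3
        futureLock[i].Lock()
        rec.future[i] -= bytes
        futureLock[i].Unlock()
    }

    // flush plays the role of mProf_Flush: publish cycle N into the
    // accumulator without holding the locks used by alloc/free paths.
    func flush(cycle uint32) {
        i := cycle % 3
        futureLock[i].Lock()
        delta := rec.future[i]
        rec.future[i] = 0
        futureLock[i].Unlock()

        activeLock.Lock()
        rec.active += delta
        activeLock.Unlock()
    }

    func main() {
        recordAlloc(0, 4096) // becomes visible once cycle 2 is flushed
        flush(1)
        flush(2)
        activeLock.Lock()
        fmt.Println("published bytes:", rec.active) // 4096
        activeLock.Unlock()
    }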

For #45894

Change-Id: I615ff80618d10e71025423daa64b0b7f9dc57daa
Reviewed-on: https://go-review.googlesource.com/c/go/+/399956
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-05-03 20:49:59 +00:00
Rhys Hiltner
52bd1c4d6c runtime: decrease STW pause for goroutine profile
The goroutine profile needs to stop the world to get a consistent
snapshot of all goroutines in the app. Leaving the world stopped while
iterating over allgs leads to a pause proportional to the number of
goroutines in the app (or its high-water mark).

Instead, do only a fixed amount of bookkeeping while the world is
stopped. Install a barrier so the scheduler confirms that a goroutine
appears in the profile, with its stack recorded exactly as it was during
the stop-the-world pause, before it allows that goroutine to execute.
Iterate over allgs while the app resumes normal operations, adding each
goroutine to the profile unless it has been scheduled in the meantime (and
so has already profiled itself). Stop the world a second time to remove the
barrier and do a fixed amount of cleanup work.
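
A schematic sketch of that barrier scheme with invented types (the real
coordination lives in the scheduler, uses per-g state, and relies on the
two short stop-the-world pauses; concurrency is elided here for clarity):
either the scheduler records a goroutine just before running it, or the
profiler records it while walking the list, and each stack is captured
exactly once.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // g stands in for a runtime goroutine descriptor.
    type g struct {
        id       int
        recorded atomic.Bool // has this goroutine's stack been captured?
    }

    var (
        profiling atomic.Bool // barrier flag, set during the first short STW
        profile   []int       // captured goroutine IDs, standing in for stacks
    )

    // capture records gp at most once, whichever side gets there first.
    func capture(gp *g) {
        if gp.recorded.CompareAndSwap(false, true) {
            profile = append(profile, gp.id)
        }
    }

    // schedule stands in for the scheduler: before letting a goroutine run
    // (and mutate its own stack), it records the goroutine if a profile is
    // in progress.
    func schedule(gp *g) {
        if profiling.Load() {
            capture(gp)
        }
        // ... run gp ...
    }

    // goroutineProfile does a fixed amount of work under the (simulated)
    // stop-the-world pauses and walks allgs with the world running.
    func goroutineProfile(allgs []*g) []int {
        profiling.Store(true) // first short STW: install the barrier
        schedule(allgs[1])    // world resumes; g2 is scheduled and so
                              // records itself before it runs
        for _, gp := range allgs {
            capture(gp) // no-op for g2: it already profiled itself
        }
        profiling.Store(false) // second short STW: remove the barrier
        return profile
    }

    func main() {
        allgs := []*g{{id: 1}, {id: 2}, {id: 3}}
        ids := goroutineProfile(allgs)
        fmt.Println("goroutines profiled exactly once:", len(ids)) // 3
    }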

This increases both the fixed overhead and per-goroutine CPU-time cost
of GoroutineProfile. It also increases the wall-clock latency of the
call to GoroutineProfile, since the scheduler may interrupt it to
execute other goroutines.

    name                                  old time/op    new time/op    delta
    GoroutineProfile/small/loaded-8         1.05ms ± 5%    4.99ms ±31%   +376.85%  (p=0.000 n=10+9)
    GoroutineProfile/sparse/loaded-8        1.04ms ± 4%    3.61ms ±27%   +246.61%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8         7.69ms ±17%   20.35ms ± 4%   +164.50%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle              958µs ± 0%    1820µs ±23%    +89.91%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8          1.00ms ± 3%    1.52ms ±17%    +51.18%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8           1.01ms ± 4%    1.47ms ± 7%    +45.28%  (p=0.000 n=9+9)
    GoroutineProfile/sparse/idle             980µs ± 1%    1403µs ± 2%    +43.22%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle-8           7.19ms ± 8%    8.43ms ±21%    +17.22%  (p=0.011 n=10+10)
    PingPongHog                              511ns ± 8%     585ns ± 9%    +14.39%  (p=0.000 n=10+10)
    GoroutineProfile/large/idle             6.71ms ± 0%    7.58ms ± 3%    +13.08%  (p=0.000 n=8+10)
    PingPongHog-8                            469ns ± 8%     509ns ±12%     +8.62%  (p=0.010 n=9+10)
    WakeupParallelSyscall/5µs                216µs ± 4%     229µs ± 3%     +6.06%  (p=0.000 n=10+9)
    WakeupParallelSyscall/5µs-8              147µs ± 1%     149µs ± 2%     +1.12%  (p=0.009 n=10+10)
    WakeupParallelSyscall/2µs-8              140µs ± 0%     142µs ± 1%     +1.11%  (p=0.001 n=10+9)
    WakeupParallelSyscall/50µs-8             236µs ± 0%     238µs ± 1%     +1.08%  (p=0.000 n=9+10)
    WakeupParallelSyscall/1µs-8              138µs ± 0%     140µs ± 1%     +1.05%  (p=0.013 n=10+9)
    Matmult                                 8.52ns ± 1%    8.61ns ± 0%     +0.98%  (p=0.002 n=10+8)
    WakeupParallelSyscall/10µs-8             157µs ± 1%     158µs ± 1%     +0.58%  (p=0.003 n=10+8)
    CreateGoroutinesSingle-8                 328ns ± 0%     330ns ± 1%     +0.57%  (p=0.000 n=9+9)
    WakeupParallelSpinning/100µs-8           343µs ± 0%     344µs ± 1%     +0.30%  (p=0.015 n=8+8)
    WakeupParallelSyscall/20µs-8             178µs ± 0%     178µs ± 0%     +0.18%  (p=0.043 n=10+9)
    StackGrowthDeep-8                       22.8µs ± 0%    22.9µs ± 0%     +0.12%  (p=0.006 n=10+10)
    StackGrowth                             1.06µs ± 0%    1.06µs ± 0%     +0.09%  (p=0.000 n=8+9)
    WakeupParallelSpinning/0s               10.7µs ± 0%    10.7µs ± 0%     +0.08%  (p=0.000 n=9+9)
    WakeupParallelSpinning/5µs              30.7µs ± 0%    30.7µs ± 0%     +0.04%  (p=0.000 n=10+10)
    WakeupParallelSpinning/100µs             411µs ± 0%     411µs ± 0%     +0.03%  (p=0.000 n=10+9)
    WakeupParallelSpinning/2µs              18.7µs ± 0%    18.7µs ± 0%     +0.02%  (p=0.026 n=10+10)
    WakeupParallelSpinning/20µs-8           93.0µs ± 0%    93.0µs ± 0%     +0.01%  (p=0.021 n=9+10)
    StackGrowth-8                            216ns ± 0%     216ns ± 0%       ~     (p=0.209 n=10+10)
    CreateGoroutinesParallel-8              49.5ns ± 2%    49.3ns ± 1%       ~     (p=0.591 n=10+10)
    CreateGoroutinesSingle                   699ns ±20%     748ns ±19%       ~     (p=0.353 n=10+10)
    WakeupParallelSpinning/0s-8             15.9µs ± 2%    16.0µs ± 3%       ~     (p=0.315 n=10+10)
    WakeupParallelSpinning/1µs              14.6µs ± 0%    14.6µs ± 0%       ~     (p=0.513 n=10+10)
    WakeupParallelSpinning/2µs-8            24.2µs ± 3%    24.1µs ± 2%       ~     (p=0.971 n=10+10)
    WakeupParallelSpinning/10µs             50.7µs ± 0%    50.7µs ± 0%       ~     (p=0.101 n=10+10)
    WakeupParallelSpinning/20µs             90.7µs ± 0%    90.7µs ± 0%       ~     (p=0.898 n=10+10)
    WakeupParallelSpinning/50µs              211µs ± 0%     211µs ± 0%       ~     (p=0.382 n=10+10)
    WakeupParallelSyscall/0s-8               137µs ± 1%     138µs ± 0%       ~     (p=0.075 n=10+10)
    WakeupParallelSyscall/1µs                216µs ± 1%     219µs ± 3%       ~     (p=0.065 n=10+9)
    WakeupParallelSyscall/2µs                214µs ± 7%     219µs ± 1%       ~     (p=0.101 n=10+8)
    WakeupParallelSyscall/50µs               317µs ± 5%     326µs ± 4%       ~     (p=0.123 n=10+10)
    WakeupParallelSyscall/100µs              450µs ± 9%     459µs ± 8%       ~     (p=0.247 n=10+10)
    WakeupParallelSyscall/100µs-8            337µs ± 0%     338µs ± 1%       ~     (p=0.089 n=10+10)
    WakeupParallelSpinning/5µs-8            32.2µs ± 0%    32.2µs ± 0%     -0.05%  (p=0.026 n=9+10)
    WakeupParallelSpinning/50µs-8            216µs ± 0%     216µs ± 0%     -0.12%  (p=0.004 n=10+10)
    WakeupParallelSpinning/1µs-8            20.6µs ± 0%    20.5µs ± 0%     -0.22%  (p=0.014 n=10+10)
    WakeupParallelSpinning/10µs-8           54.5µs ± 0%    54.2µs ± 1%     -0.57%  (p=0.000 n=10+10)
    CreateGoroutines-8                       213ns ± 1%     211ns ± 1%     -0.86%  (p=0.002 n=10+10)
    CreateGoroutinesCapture                 1.03µs ± 0%    1.02µs ± 0%     -0.91%  (p=0.000 n=10+10)
    CreateGoroutinesCapture-8               1.32µs ± 1%    1.31µs ± 1%     -1.06%  (p=0.001 n=10+9)
    CreateGoroutines                         188ns ± 0%     186ns ± 0%     -1.06%  (p=0.000 n=9+10)
    CreateGoroutinesParallel                 188ns ± 0%     186ns ± 0%     -1.27%  (p=0.000 n=8+10)
    WakeupParallelSyscall/0s                 210µs ± 3%     207µs ± 3%     -1.60%  (p=0.043 n=10+10)
    StackGrowthDeep                          121µs ± 1%     119µs ± 1%     -1.70%  (p=0.000 n=9+10)
    Matmult-8                               1.82ns ± 3%    1.78ns ± 3%     -2.16%  (p=0.020 n=10+10)
    WakeupParallelSyscall/20µs               281µs ± 3%     269µs ± 4%     -4.44%  (p=0.000 n=10+10)
    WakeupParallelSyscall/10µs               239µs ± 3%     228µs ± 9%     -4.70%  (p=0.001 n=10+10)
    GoroutineProfile/sparse-nil/idle-8       485µs ± 2%      12µs ± 4%    -97.56%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle-8        484µs ± 2%      12µs ± 1%    -97.60%  (p=0.000 n=10+7)
    GoroutineProfile/small-nil/loaded-8      487µs ± 2%      11µs ± 3%    -97.68%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8     507µs ± 4%      11µs ± 6%    -97.78%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/idle-8        709µs ± 2%      11µs ± 4%    -98.38%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8      717µs ± 2%      11µs ± 3%    -98.43%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle         465µs ± 3%       1µs ± 1%    -99.84%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle          493µs ± 3%       1µs ± 0%    -99.85%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle          716µs ± 1%       1µs ± 2%    -99.89%  (p=0.000 n=7+10)

    name                                  old alloc/op   new alloc/op   delta
    CreateGoroutinesCapture                   144B ± 0%      144B ± 0%       ~     (all equal)
    CreateGoroutinesCapture-8                 144B ± 0%      144B ± 0%       ~     (all equal)

    name                                  old allocs/op  new allocs/op  delta
    CreateGoroutinesCapture                   5.00 ± 0%      5.00 ± 0%       ~     (all equal)
    CreateGoroutinesCapture-8                 5.00 ± 0%      5.00 ± 0%       ~     (all equal)

    name                                  old p50-ns     new p50-ns     delta
    GoroutineProfile/small/loaded-8          1.01M ± 3%     3.87M ±45%   +282.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/loaded-8         1.02M ± 3%     2.43M ±41%   +138.42%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8          7.43M ±16%    17.28M ± 2%   +132.43%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle               956k ± 0%     1559k ±16%    +63.03%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.01M ± 3%     1.45M ± 7%    +44.31%  (p=0.000 n=10+9)
    GoroutineProfile/sparse/idle              977k ± 1%     1399k ± 2%    +43.20%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.00M ± 3%     1.41M ± 3%    +40.47%  (p=0.000 n=10+10)
    GoroutineProfile/large/idle-8            6.97M ± 1%     8.41M ±25%    +20.54%  (p=0.003 n=8+10)
    GoroutineProfile/large/idle              6.71M ± 1%     7.46M ± 4%    +11.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        483k ± 3%       13k ± 3%    -97.41%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         483k ± 2%       12k ± 1%    -97.43%  (p=0.000 n=10+8)
    GoroutineProfile/small-nil/loaded-8       484k ± 3%       10k ± 2%    -97.93%  (p=0.000 n=10+8)
    GoroutineProfile/sparse-nil/loaded-8      492k ± 2%       10k ± 4%    -97.97%  (p=0.000 n=10+8)
    GoroutineProfile/large-nil/idle-8         708k ± 2%       12k ±15%    -98.36%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       714k ± 2%       10k ± 2%    -98.60%  (p=0.000 n=10+8)
    GoroutineProfile/sparse-nil/idle          459k ± 1%        1k ± 1%    -99.85%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle           477k ± 1%        1k ± 0%    -99.85%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle           712k ± 1%        1k ± 1%    -99.90%  (p=0.000 n=7+10)

    name                                  old p90-ns     new p90-ns     delta
    GoroutineProfile/small/loaded-8          1.13M ±10%     7.49M ±35%   +562.07%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/loaded-8         1.10M ±12%     4.58M ±31%   +318.02%  (p=0.000 n=10+9)
    GoroutineProfile/large/loaded-8          8.78M ±24%    27.83M ± 2%   +217.00%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle               967k ± 0%     2909k ±50%   +200.91%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.02M ± 3%     1.96M ±76%    +92.99%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.07M ±17%     1.55M ±12%    +45.23%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle              992k ± 1%     1417k ± 3%    +42.79%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle              6.73M ± 0%     7.99M ± 8%    +18.80%  (p=0.000 n=8+10)
    GoroutineProfile/large/idle-8            8.20M ±25%     9.18M ±25%       ~     (p=0.315 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        495k ± 3%       13k ± 1%    -97.36%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         494k ± 2%       13k ± 3%    -97.36%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/loaded-8       496k ± 2%       13k ± 1%    -97.41%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8      544k ±11%       13k ± 1%    -97.62%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle-8         724k ± 1%       13k ± 3%    -98.20%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       729k ± 3%       13k ± 2%    -98.23%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle          476k ± 4%        1k ± 1%    -99.85%  (p=0.000 n=9+10)
    GoroutineProfile/small-nil/idle           537k ±10%        1k ± 0%    -99.87%  (p=0.000 n=10+9)
    GoroutineProfile/large-nil/idle           729k ± 0%        1k ± 1%    -99.90%  (p=0.000 n=7+10)

    name                                  old p99-ns     new p99-ns     delta
    GoroutineProfile/sparse/loaded-8         1.27M ±33%    20.49M ±17%  +1514.61%  (p=0.000 n=10+10)
    GoroutineProfile/small/loaded-8          1.37M ±29%    20.48M ±23%  +1399.35%  (p=0.000 n=10+10)
    GoroutineProfile/large/loaded-8          9.76M ±23%    39.98M ±22%   +309.52%  (p=0.000 n=10+8)
    GoroutineProfile/small/idle               976k ± 1%     3367k ±55%   +244.94%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle-8           1.03M ± 3%     2.50M ±65%   +142.30%  (p=0.000 n=10+10)
    GoroutineProfile/small/idle-8            1.17M ±34%     1.70M ±14%    +45.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse/idle             1.02M ± 3%     1.45M ± 4%    +42.64%  (p=0.000 n=9+10)
    GoroutineProfile/large/idle              6.92M ± 2%     9.00M ± 7%    +29.98%  (p=0.000 n=8+9)
    GoroutineProfile/large/idle-8            8.74M ±23%     9.90M ±24%       ~     (p=0.190 n=10+10)
    GoroutineProfile/sparse-nil/idle-8        508k ± 4%       16k ± 2%    -96.90%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/idle-8         508k ± 4%       16k ± 3%    -96.91%  (p=0.000 n=10+9)
    GoroutineProfile/small-nil/loaded-8       542k ± 5%       15k ±15%    -97.15%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/loaded-8      649k ±16%       15k ±18%    -97.67%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/idle-8         738k ± 2%       16k ± 2%    -97.86%  (p=0.000 n=10+10)
    GoroutineProfile/large-nil/loaded-8       765k ± 4%       15k ±17%    -98.03%  (p=0.000 n=10+10)
    GoroutineProfile/sparse-nil/idle          539k ±26%        1k ±17%    -99.84%  (p=0.000 n=10+10)
    GoroutineProfile/small-nil/idle           659k ±25%        1k ± 0%    -99.84%  (p=0.000 n=10+8)
    GoroutineProfile/large-nil/idle           760k ± 2%        1k ±22%    -99.88%  (p=0.000 n=9+10)

Fixes #33250
For #50794

Change-Id: I862a2bc4e991cec485f21a6fce4fca84f2c6435b
Reviewed-on: https://go-review.googlesource.com/c/go/+/387415
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-05-03 20:49:34 +00:00