Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-128002: fix asyncio.all_tasks against concurrent deallocations of tasks #128541

Merged
merged 4 commits into from
Jan 9, 2025

Conversation

kumaraditya303
Copy link
Contributor

@kumaraditya303 kumaraditya303 commented Jan 6, 2025

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right to me. The object can still be part way through a dealloc call when a stop the world call happens.

Can you provide more details about what bug or crash you are trying to fix?

@kumaraditya303
Copy link
Contributor Author

kumaraditya303 commented Jan 6, 2025

This doesn't look right to me. The object can still be part way through a dealloc call when a stop the world call happens.

Please see #128002 (comment)

The bug is that the linked list holds borrowed references to tasks so it is possible that task is concurrently deallocated while it gets added to tasks list.

@kumaraditya303
Copy link
Contributor Author

Crash backtrace:

Objects/object.c:578: PyObject_CallFinalizerFromDealloc: Assertion failed: PyObject_CallFinalizerFromDealloc called on object with a non-zero refcount
Enable tracemalloc to get the memory block allocation traceback

object address : 0x7fb3781e1820
object refcount : 1
object type : 0x7fb3629e7d10
object type name: _asyncio.Task
object repr : <Task finished name='None' coro=<TestFreeThreading.test_all_tasks_race..main..coro() done, defined at /home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py:24> result=None>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized

Thread 0x00007fb34edfd6c0 (most recent call first):
File "/home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py", line 29 in main
File "/home/realkumaraditya/cpython/Lib/asyncio/events.py", line 98 in _run
File "/home/realkumaraditya/cpython/Lib/asyncio/base_events.py", line 2033 in _run_once
File "/home/realkumaraditya/cpython/Lib/asyncio/base_events.py", line 678 in run_forever
File "/home/realkumaraditya/cpython/Lib/asyncio/base_events.py", line 707 in run_until_complete
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 127 in run
File "/home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py", line 49 in runner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 996 in run
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1054 in _bootstrap_inner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1016 in _bootstrap

Current thread 0x00007fb34fdff6c0 (most recent call first):
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 208 in _cancel_all_tasks
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 71 in close
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 63 in exit
File "/home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py", line 46 in runner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 996 in run
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1054 in _bootstrap_inner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1016 in _bootstrap

Thread 0x00007fb3542ff6c0 (most recent call first):
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 208 in _cancel_all_tasks
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 71 in close
File "/home/realkumaraditya/cpython/Lib/asyncio/runners.py", line 63 in exit
File "/home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py", line 46 in runner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 996 in run
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1054 in _bootstrap_inner
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1016 in _bootstrap

Thread 0x00007fb3523fe6c0 (most recent call first):

Thread 0x00007fb3a66ebf40 (most recent call first):
File "/home/realkumaraditya/cpython/Lib/threading.py", line 1105 in join
File "/home/realkumaraditya/cpython/Lib/test/support/threading_helper.py", line 147 in start_threads
File "/home/realkumaraditya/cpython/Lib/contextlib.py", line 148 in exit
File "/home/realkumaraditya/cpython/Lib/test/test_asyncio/test_free_threading.py", line 57 in test_all_tasks_race
File "/home/realkumaraditya/cpython/Lib/unittest/case.py", line 606 in _callTestMethod
File "/home/realkumaraditya/cpython/Lib/unittest/case.py", line 660 in run
File "/home/realkumaraditya/cpython/Lib/unittest/case.py", line 716 in call
File "/home/realkumaraditya/cpython/Lib/unittest/suite.py", line 122 in run
File "/home/realkumaraditya/cpython/Lib/unittest/suite.py", line 84 in call
File "/home/realkumaraditya/cpython/Lib/unittest/suite.py", line 122 in run
File "/home/realkumaraditya/cpython/Lib/unittest/suite.py", line 84 in call
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/testresult.py", line 148 in run
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 58 in _run_suite
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 38 in run_unittest
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 136 in test_func
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 92 in regrtest_runner
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 139 in _load_run_test
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 184 in _runtest_env_changed_exc
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 284 in _runtest
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/single.py", line 313 in run_single_test
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/worker.py", line 83 in worker_process
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/worker.py", line 118 in main
File "/home/realkumaraditya/cpython/Lib/test/libregrtest/worker.py", line 122 in
File "/home/realkumaraditya/cpython/Lib/runpy.py", line 88 in _run_code
File "/home/realkumaraditya/cpython/Lib/runpy.py", line 198 in _run_module_as_main
Kill <WorkerThread #1 running test=test_asyncio.test_free_threading pid=35002 time=4.4 sec> process group
Kill <WorkerThread #2 running test=test_asyncio.test_free_threading pid=35001 time=4.6 sec> process group
Kill <WorkerThread #3 running test=test_asyncio.test_free_threading pid=35045 time=1.8 sec> process group

== Tests result: FAILURE ==

1 test failed:
test_asyncio.test_free_threading

50 tests OK.

Total duration: 5 min 30 sec
Total tests: run=800
Total test files: run=51 failed=1
Result: FAILURE

@kumaraditya303 kumaraditya303 marked this pull request as draft January 6, 2025 15:40
@colesbury
Copy link
Contributor

colesbury commented Jan 6, 2025

Do you have a command to reproduce the crash?

You wrote that:

[A task object] can be concurrently deallocated while list is being read in another thread as deallocation doesn't hold the state lock

There are two things that confuse me:

  • The unregister_task() function holds the state lock, which prevents concurrent traversal of the list for that important function.
  • The state lock isn't held for the entire deallocation sequence, but it's also not atomic with respect to a stop the world pause. That makes me think that it's likely to have similar problems. For example, the unregister_task function may block on acquiring the state lock allowing a stop-the-world pause to happen concurrently (while the task is blocked on the lock). There may also be other points where a stop the world call can happen, such as when calling other objects' destructors.

Additionally, the way this linked list works worries me:

  • It's essentially a list of weak references, just not implemented using PyWeakRef object for performance reasons
  • Weakrefs are cleared later on during dealloc -- not in tp_finalize -- when we know that the object will definitely be destroyed.
  • Weakrefs have special handling for dealing with similar concurrent deallocations.
  • Weakrefs also have special hooks in the GC, but I'm not sure that's relevant here.

Anyways, I think it may be worth trying to match the weakref implementation since we know that it works.

@colesbury
Copy link
Contributor

From looking at the crash backtrace, I think the important thing here would be to use _Py_TryIncref in all_tasks like we do in _PyWeakref_GET_REF. This will avoid incref'ing objects that have zero refcount (and are either about to be deallocated or partway through deallocation).

Something like:

    llist_for_each_safe(node, &state->asyncio_tasks_head) {
        TaskObj *task = llist_data(node, TaskObj, task_node);
        if (_Py_TryIncref(task)) {
            if (_PyList_AppendTakeRef(tasks, (PyObject *)task) < 0) {
                ...
            }
        }
    }

@kumaraditya303
Copy link
Contributor Author

Do you have a command to reproduce the crash?

Running test_asyncio.test_free_threading in a loop on main reproduces the crash.
env TSAN_OPTIONS=suppressions={$PWD}/Tools/tsan/suppressions_free_threading.txt ./python -m test test_asyncio.test_free_threading -j 4 -F

@kumaraditya303
Copy link
Contributor Author

Anyways, I think it may be worth trying to match the weakref implementation since we know that it works.

So I think the following changes need to be made:

  • Move unregister_task to be in tp_dealloc so that it is called when task is definitely getting deallocated.
  • In all_tasks use _Py_TryIncref so that if task is concurrently deallocated, it doesn't gets added to the tasks list.
  • The state lock is held while iteration of linked list as it was done before this PR
  • No need for stop the world pause in all_tasks

Did I get it correctly? Aside, thanks for your explanation! I didn't know that it was possible that stop the world could happen while a deallocator is running concurrently.

@colesbury
Copy link
Contributor

Yes, but on second thought let's keep the unregister_task in the finalizer for now. I'm not entirely sure where it should be. Normal dealloc calls clear weakrefs relatively late (after finalizers are invoked), but the cyclic GC clears them earlier (before finalizers). I can't think of any concrete problems immediately, but I'm a bit worried about the ordering.

@kumaraditya303 kumaraditya303 changed the title gh-128002: use a stop the world event in asyncio.all_tasks gh-128002: fix asyncio.all_tasks against concurrent deallocations of tasks Jan 7, 2025
@kumaraditya303 kumaraditya303 marked this pull request as ready for review January 8, 2025 11:11
@kumaraditya303 kumaraditya303 merged commit 7dc41ad into python:main Jan 9, 2025
45 checks passed
@kumaraditya303 kumaraditya303 deleted the async-finalize branch January 9, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
  NODES
COMMUNITY 2
Project 5
USERS 1