Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Use Python pickle protocol version 4 for np.save #26388

Merged
merged 4 commits into from
May 16, 2024
Merged

Conversation

vxst
Copy link
Contributor

@vxst vxst commented May 6, 2024

Currently, when NumPy saves data using pickle, it forces the protocol version to 3, which was the default value from Python 3.0 to Python 3.7. However, since Python 3.7 has reached its end-of-life (EOL), there are no actively maintained Python versions that default to using pickle protocol 3.

By allowing NumPy to use the default pickle protocol version, objects larger than 4GB can be pickled, resolving the issue described in #26224. Although this new protocol version is incompatible with Python 3.3 and earlier versions, Python 3.3 has long reached its EOL. Therefore, there is no need to force pickle protocol 3 when saving data using pickle.

It is important to note that the pickle module still supports reading pickled objects from protocol 3. As a result, this change will not introduce incompatibilities when reading files saved with an earlier pickle protocol(which is used in current numpy release).

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

Should this change be backported to NumPy 1.23/1.24? Python changed its default pickle protocol version starting from Python 3.8, and NumPy 1.23 and later versions only support Python 3.8 and newer.

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

The failures of the two Windows jobs are not related to this PR. The issue lies with the new spin version 0.9, which added gcov support but called the build with the gcov keyword argument even when the test does not require gcov. I have created a PR (scientific-python/spin#183) at spin to address this problem.

@ngoldbaum ngoldbaum added the 09 - Backport-Candidate PRs tagged should be backported label May 6, 2024
@ngoldbaum
Copy link
Member

Should this change be backported to NumPy 1.23/1.24?

No, we're currently only backporting to the 2.0 and 1.26 branches.

@rkern
Copy link
Member

rkern commented May 6, 2024

There is a reason that we hardcode the protocol, despite the drawbacks when we forget about it as we drop support for versions of Python. We don't want Python 3.14 to decide that pickle.DEFAULT_PROTOCOL is 6, or something, while we're still supporting a version of Python that doesn't have it.

In an ideal world, the policy that we probably want is something like:

  • we specify the protocol to be the "best" (probably highest, but only if the features warrant it; I'd probably go with 5 right now) one supported by the versions of Python that we support
  • we add "check if we need to change the np.save() pickle protocol" to the checklist when we drop EOL Python versions
    • creating such a checklist if we don't have one already
  • allow np.save() users to override it

@charris
Copy link
Member

charris commented May 6, 2024

we're currently only backporting to the 2.0 and 1.26

Scratch 1.26, we cannot build wheels for macosx arm64 and it is not worth the work to fix that.

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

There is a reason that we hardcode the protocol, despite the drawbacks when we forget about it as we drop support for versions of Python. We don't want Python 3.14 to decide that pickle.DEFAULT_PROTOCOL is 6, or something, while we're still supporting a version of Python that doesn't have it.

In an ideal world, the policy that we probably want is something like:

  • we specify the protocol to be the "best" (probably highest, but only if the features warrant it; I'd probably go with 5 right now) one supported by the versions of Python that we support

  • we add "check if we need to change the np.save() pickle protocol" to the checklist when we drop EOL Python versions

    • creating such a checklist if we don't have one already
  • allow np.save() users to override it

The pickle.DEFAULT_PROTOCOL exists so that the oldest still supported version of Python already support the latest Python's pickle.DEFAULT_PROTOCOL version, according to CPython:

https://github.com/python/cpython/blob/5a1618a2c8c108b8c73aa9459b63f0dbd66b60f6/Lib/pickle.py#L68-L70

Since NumPy only supports the officially supported Python versions, using the pickle.DEFAULT_PROTOCOL value will not cause NumPy portability issues between two supported NumPy versions. I believe it's logical to follow the upstream Python policy for pickle portability, rather than having a NUMPY_DEFAULT_PICKLE_PROTOCOL, which will generate a lot of unnecessary work while solving a problem already addressed by upstream.

It's also possible to allow np.save() users to override the pickle protocol in order to benefit from the performance gains provided by the newer protocols. However, I believe this requirement can be addressed by letting users tweak the pickle.DEFAULT_PROTOCOL value themselves, which saves adding a parameter on the NumPy side. New features provided by pickle actually need to be adapted by new code (e.g., Out-of-band Buffers for pickle protocol 5). If we only use pickle to pickle things, keeping the pickle.DEFAULT_PROTOCOL value may be better.

@rkern @mattip

@ngoldbaum
Copy link
Member

The pickle.DEFAULT_PROTOCOL exists so that the oldest still supported version of Python already support the latest Python's pickle.DEFAULT_PROTOCOL version, according to CPython:

This doesn't say anything about changes in Python 3.14 or any future version though...

However, I believe this requirement can be addressed by letting users tweak the pickle.DEFAULT_PROTOCOL value themselves,

Today I learned this is a mutable constant:

>>> import pickle
>>> pickle.DEFAULT_PROTOCOL = 7
>>> pickle.DEFAULT_PROTOCOL
7

Still, I don't think we should be telling users to mutate that, and adding a new keyword to handle this in numpy makes sense to me, along with updating the default in numpy.

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

@ambv is this a CPython policy that DEFAULT_PROTOCOL for pickle is supported across all currently supported Python versions?

@charris
Copy link
Member

charris commented May 6, 2024

This could use a release note, see doc/release/upcoming_changes for examples.

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

Still, I don't think we should be telling users to mutate that, and adding a new keyword to handle this in numpy makes sense to me, along with updating the default in numpy.

I believe users who really request a specific pickle protocol version know what they're doing, since it's pretty low level details.

This could use a release note, see doc/release/upcoming_changes for examples.

If we decide to add the pickle_protocol keyword to numpy.save, I can edit the code & documentation and add the doc/release/upcoming_changes release notes. However, it's your call on whether to add this keyword or just follow the CPython policy.

@rkern
Copy link
Member

rkern commented May 6, 2024

The pickle.DEFAULT_PROTOCOL exists so that the oldest still supported version of Python already support the latest Python's pickle.DEFAULT_PROTOCOL version, according to CPython:

The policy in that comment is that you are only allowed to bump it if the last version that didn't have it is EOLed. There's nothing in there that states that it will be bumped when that version goes EOL. 5 was introduced in 3.8, and 3.7 EOLed before 3.12, but 3.12 and 3.13 still have 4 as the default. In any case, Python's EOL schedule and numpy's Python version support are not always in lock-step.

New features provided by pickle actually need to be adapted by new code (e.g., Out-of-band Buffers for pickle protocol 5).

The 5 format will handle them in-band just fine without needing to modify the pickle.dump() call, and AFAICT, more efficiently than if one was stuck with 4. So if people have objects that support 5 by returning PickleBuffers, it would be good to support it even if we don't do anything out-of-band. People are increasingly sticking weirder things into arrays and passing them about (e.g. variably sized images, text fragments, etc.) for ML purposes, so this isn't entirely theoretical.

@vxst
Copy link
Contributor Author

vxst commented May 6, 2024

Since the support of Python version is 3.8+, I believe 5 will be the best default? Does this change need steering council or something?

@ambv
Copy link

ambv commented May 7, 2024

@ambv is this a CPython policy that DEFAULT_PROTOCOL for pickle is supported across all currently supported Python versions?

Yes.

@charris
Copy link
Member

charris commented May 10, 2024

Lets just hardwire a later version. '4' would be conservative and would solve the size problem.

@vxst
Copy link
Contributor Author

vxst commented May 10, 2024

  • we specify the protocol to be the "best" (probably highest, but only if the features warrant it; I'd probably go with 5 right now) one supported by the versions of Python that we support

  • we add "check if we need to change the np.save() pickle protocol" to the checklist when we drop EOL Python versions

    • creating such a checklist if we don't have one already
  • allow np.save() users to override it

After some thought, I believe it's the best way if we change the interface to provide users with flexibility. I will implement the changes with a default protocol of 5 (or 4?), and allow users to use the keyword pickle_protocol to control the version they want for np.save. However, I can't find the checklist in the repo. Should I add one? Or perhaps we add them to doc/source/dev/releasing.rst?

@mattip
Copy link
Member

mattip commented May 15, 2024

We discussed this at the triage meeting. We suggest sticking with 4. This does need a release note.

@vxst
Copy link
Contributor Author

vxst commented May 15, 2024

We discussed this at the triage meeting. We suggest sticking with 4. This does need a release note.

Shall we add the keyword to np.save? I’m currently doing some benchmarks to see whether it’s possible to increase the performance with protocol 5

@vxst
Copy link
Contributor Author

vxst commented May 15, 2024

I'll follow your decision. If you believe we shall give user the flexibility, I'll add the keyword, add the performance character of version 4 and 5 to the document, and write the release note following the existing example.

If you believe we should hardwire version 4, I'll only write the release note to explain the performance improvement of protocol 4 and the possible incompatibility if someone's using ancient python version, everyone using Python 3.4+ shall be fine, since pickle.load in read_array does not have any version hard-coded:-)

@mattip
Copy link
Member

mattip commented May 15, 2024

Let's break this into pieces:

  • Changing the protocol to 4 and document it seems reasonable
  • Exposing protocol as a kwarg that can be used in the pickle_kwargs dict is a harder sell, is there a clear need for it? I don't see any real advantage to move to 5, since the main change, out-of-band buffer storage, would require a new npy file format. The original PR to add pickling BUG: enable working around pickle compatibility issues on Py3 in npy files #5640 did not say why they rejected adding the protocol. It would need to be popped off the dict and tested for validity (minimum 4, maximum pickle.HIGHEST_PROTOCOL).

@vxst vxst changed the title BUG: Use default Python pickle protocol version for np.save BUG: Use Python pickle protocol version 4 for np.save May 16, 2024
@vxst
Copy link
Contributor Author

vxst commented May 16, 2024

Currently I only see performance gains for protocol 5 in IPC situation, which is not actually the use case for numpy.save.

Since we hardcode protocol version 4 in numpy.save, fix_imports will not be viable(actually it's not viable since we upgraded the protocol version to 3 in 1.17). Shall I also also remove the fix_imports keywords in this PR, or shall I open a new PR to remove it?

@vxst
Copy link
Contributor Author

vxst commented May 16, 2024

Many people actually use numpy.save(..., allow_pickle=True, fix_imports=True) as a default behavior (e.g., see this example: intel/intel-extension-for-transformers). Removing fix_imports keyword may break compatibility with existing code. However, since this keyword does nothing when the protocol is set to 3 and above, perhaps we could deprecate them first? Or maybe NumPy 2.0 would be a good opportunity to remove them?

@mattip
Copy link
Member

mattip commented May 16, 2024

It is best to treat separate problems in separate PRs/issues. Let's go with what we have here now. If you want to continue, you can open a deprecation PR to deprecate the out-dated fix_imports.

@mattip mattip merged commit 98e8913 into numpy:main May 16, 2024
63 of 65 checks passed
@mattip
Copy link
Member

mattip commented May 16, 2024

Thanks @vxst

charris pushed a commit to charris/numpy that referenced this pull request May 16, 2024
* BUG: Use default Python pickle protocol version rather than outdated protocol 3

* Update default pickle of np.save to 4

* Adds document for default pickle protocol 4
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label May 16, 2024
charris added a commit that referenced this pull request May 16, 2024
BUG: Use Python pickle protocol version 4 for np.save (#26388)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
  NODES
COMMUNITY 2
Idea 2
idea 2
Note 8
Project 5
USERS 12