With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.
This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.
Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.
Closes#1776.
When building an extension with rust-numpy and ndarray on the MSRV of
1.41 with complex numbers. The num-complex crate version needs to be
0.2 which was the pinned version as of ndarray 0.13.1 which was the last
release of ndarray that supported building with rust 1.41. However, the
pyo3 pinned version of 0.4 is incompatible with this and will cause an
error when building because of the version mismatch. To fix this This
commit expands the supported versions for num-complex to match what
rust-numpy uses [1] so that we can build pyo3, numpy, ndarray, and
num-complex in an extension with rust 1.41.
Fixes#1798
[1] https://github.com/PyO3/rust-numpy/blob/v0.14.1/Cargo.toml#L19
All symbols which are canonically defined in cpython/unicodeobject.h
and had been defined in our unicodeobject.rs have been moved to our
corresponding cpython/unicodeobject.rs file.
This module is only present in non-limited build configurations, so
we were able to drop the cfg annotations as part of moving the
definitions.
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.
This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.
I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate `#define`s as inline functions.
Missing symbols have been annotated with `skipped` and symbols currently
defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.
The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't
support. So we've provided accessor functions for retrieving these
fields' values. No accessor functions are present because you shouldn't
be touching these values from Rust code.
Tests of the bitfield APIs and macro implementations have been added.
The setter function will receive a NULL value on deletion requests.
This wasn't properly handled before, leading to a panic.
The new code raises AttributeError in this scenario instead.
A test for the behavior has been added. Documentation has also
been updated to reflect the behavior.
The field wasn't defined previously. And the enum wasn't defined as
`[repr(C)]`.
This missing field could result in memory corruption if a Rust-allocated
`PyStatus` was passed to a Python API, which could perform an
out-of-bounds write. In my code, the out-of-bounds write corrupted a
variable on the stack, leading to a segfault due to illegal memory
access. However, this crash only occurred on Rust 1.54! So I initially
mis-attribted it as a compiler bug / regression. It appears that a
low-level Rust change in 1.54.0 changed the LLVM IR in such a way to
cause LLVM optimization passes to produce sufficiently different
assembly code, tickling the crash. See
https://github.com/rust-lang/rust/issues/87947 if you want to see
the wild goose chase I went on in Rust / LLVM land to potentially
pin this on a compiler bug.
Lessen learned: Rust crashes are almost certainly due to use of
`unsafe`.