[Bug 2358292] Review Request: rust-icu_locid_transform_data - Data for the icu_locid_transform crate

https://bugzilla.redhat.com/show_bug.cgi?id=2358292



--- Comment #5 from Ben Beasley <code@xxxxxxxxxxxxxxxxxx> ---
(In reply to Fabio Valentini from comment #4)
> Hm, the data/* files exclusively contain generated code, not "data" per se.

True, it *is* generated data (byte buffers), but wrapped up in generated
boilerplate code.

> Not sure *what* they're generated from, but it doesn't look like the code
> for generating these files is part of the published crate,
> which would be a violation of the rules here:
> https://docs.fedoraproject.org/en-US/packaging-guidelines/what-can-be-packaged/#pregenerated-code

As far as I can tell, the generation starts from

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/Makefile.toml#L115

[tasks.ci-job-full-datagen]
description = "Run full data generation on latest CLDR and ICU"
category = "CI"
dependencies = [
    "bakeddata-check",
]

which refers to

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/tools/make/data.toml#L119

[tasks.bakeddata-check]
description = "Rebuild baked data and ensure that the working copy is clean"
category = "ICU4X Data"
dependencies = ["bakeddata"]
script_runner = "@duckscript"
script = '''
exit_on_error true

output = exec git status --porcelain=v1
output_length = length ${output.stdout}
if greater_than ${output_length} 0
    msg = array "" ""
    array_push ${msg} "Baked data needs to be updated. Please run `cargo make bakeddata`"
    array_push ${msg} ""
    array_push ${msg} "${output.stdout}"
    msg = array_join ${msg} "\n"
    trigger_error ${msg}
end
'''

which refers to

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/tools/make/data.toml#L102

[tasks.bakeddata]
description = "Builds full baked data"
category = "ICU4X Data"
script_runner = "@duckscript"
script = '''
exit_on_error true

if array_is_empty ${@}
    exec --fail-on-error cargo run -p bakeddata-scripts --release
else
    exec --fail-on-error cargo build -p bakeddata-scripts
    for component in ${@}
        exec --fail-on-error target/debug/bakeddata-scripts "${component}"
    end
end
'''

which relies on

https://github.com/unicode-org/icu4x/tree/release/1.5/tools/bakeddata-scripts

which uses

https://github.com/unicode-org/icu4x/tree/release/1.5/provider/datagen

https://crates.io/crates/icu_datagen
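
The bakeddata-check task quoted above amounts to: regenerate the baked data,
then fail if `git status --porcelain=v1` prints anything. A minimal Rust sketch
of that check, using only the standard library, is below. It is an
illustration, not code from ICU4X; `check_porcelain` and `check_baked_data` are
hypothetical names, and the sketch assumes the regeneration step has already
been run in the given checkout.

```rust
use std::path::Path;
use std::process::Command;

/// Mirrors the duckscript test: any output at all from
/// `git status --porcelain=v1` means the working copy is dirty.
fn check_porcelain(porcelain_output: &str) -> Result<(), String> {
    if porcelain_output.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "Baked data needs to be updated. Please run `cargo make bakeddata`\n\n{porcelain_output}"
        ))
    }
}

/// Hypothetical wrapper (not part of ICU4X): run git in `repo_dir`, after the
/// baked data has been regenerated there, and report whether anything changed.
fn check_baked_data(repo_dir: &Path) -> Result<(), String> {
    let output = Command::new("git")
        .args(["status", "--porcelain=v1"])
        .current_dir(repo_dir)
        .output()
        .map_err(|err| format!("failed to run git: {err}"))?;
    if !output.status.success() {
        return Err(format!("git status exited with {}", output.status));
    }
    check_porcelain(&String::from_utf8_lossy(&output.stdout))
}
```

A clean result only shows that the committed files match what the generator
produces from whatever inputs it used. It says nothing about where those inputs
came from.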

The bits of generated code appear to be scattered across

https://github.com/unicode-org/icu4x/blob/release/1.5/provider/datagen/src/baked_exporter.rs

The templates aren’t immediately recognizable there because most of each
generated block of code consists of names substituted into the templates, but
the comments and constructs I spot-checked seemed to be present.

The data is ultimately encoded into buffers with databake,
https://src.fedoraproject.org/rpms/rust-databake.

I think it’s reasonable to argue that the generated code is all boilerplate
from the code generator (icu_datagen+databake) and has no sources of its own
that are specific to this crate. In terms of
https://docs.fedoraproject.org/en-US/packaging-guidelines/what-can-be-packaged/#pregenerated-code,
this is similar to the bison example. There, bison consumes a rules file (which
must be in the source RPM) and produces C sources that contain a lot of extra
boilerplate code from bison itself. The version of bison that did the
generating doesn’t have to be in the source RPM, or even in Fedora, although
it’s better if it can be.

The origin of the data in the buffers is another matter. According to

https://github.com/unicode-org/icu4x/blob/release/1.5/tutorials/data-management.md

“Data generation is done using the icu_datagen crate, which pulls in data from
Unicode's Common Locale Data Repository (CLDR) and from ICU4C releases to
generate ICU4X data. The crate has a command line interface as well as a Rust
API, which can be used in Rust scripts.”

So the ultimate sources for the data are somewhere in

https://cldr.unicode.org/index/downloads

Looking at

https://github.com/unicode-org/icu4x/blob/release/1.5/provider/datagen/src/provider.rs

it appears that there could also be data from an ICU(4C) release, e.g.

https://github.com/unicode-org/icu/releases/download/release-77-1/icuexportdata_release-77-1.zip

I’m not sure how practical it is to follow the breadcrumb trails to associate
particular Unicode data files with particular baked data buffers. Given that
README.md records the CLDR and ICU versions used,

  This data was generated with CLDR version 46.0.0-BETA2, ICU version
  icu4x/2024-05-16/75.x, and LSTM segmenter version v0.1.0.

I suppose it should suffice to add the following as additional Sources to cover
the requirement to include original sources for the data:

https://github.com/unicode-org/icu/releases/download/icu4x%2F2024-05-16%2F75.x/icuexportdata_icu4x-2024-05-16-75.x.zip
https://github.com/unicode-org/cldr-json/releases/download/46.0.0-BETA2/cldr-46.0.0-BETA2-json-full.zip

This is a bit bulky, but I don’t know what else to do.

Whatever we decide to do with this crate, we’ll need to repeat the exercise
several times. There are 13 "icu_*_data" crates in ICU4X.

> 
> And I'm a little bit unsure whether we've already encountered issues with
> endianness in API calls like this one?
> 
> > zerovec::ZeroVec::from_bytes_unchecked(b"am\0ar\0as\0balbe\0bg\0bgcbhobn\0brxchrcswcv...")

There was https://github.com/unicode-org/icu4x/issues/6292, but that was with
rkyv, not zerovec/databake. To be honest, I’m not sure how to check for
endianness issues here other than running the tests, which do pass on s390x
(https://copr.fedorainfracloud.org/coprs/music/idna1/build/8874479/).
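
For what it's worth, the visible part of the buffer quoted above looks like a
run of fixed-width, NUL-padded ASCII records, which are plain bytes with no
byte order. For multi-byte integers, zerovec's ULE types are, as the name says,
unaligned little-endian, so they should be decoded explicitly rather than
reinterpreted in host order. The sketch below illustrates both points with the
standard library only. It is not zerovec's implementation; the three-byte
record width is an assumption read off the quoted prefix, and `RECORD_WIDTH`,
`VISIBLE_PREFIX`, `split_records` and `read_u32_le` are hypothetical names.

```rust
/// Assumed record width, inferred from the quoted prefix ("am\0", "bal", ...).
const RECORD_WIDTH: usize = 3;

/// Only the complete records visible in the quoted call; the real buffer
/// continues past this point and is deliberately not reconstructed here.
const VISIBLE_PREFIX: &[u8] = b"am\0ar\0as\0balbe\0bg\0bgcbhobn\0brxchrcsw";

/// Split a flat buffer into fixed-width ASCII records, dropping NUL padding.
/// Each record is read byte by byte, so host endianness never comes into it.
fn split_records(bytes: &[u8]) -> Vec<&str> {
    bytes
        .chunks_exact(RECORD_WIDTH)
        .map(|chunk| {
            let end = chunk.iter().position(|&b| b == 0).unwrap_or(RECORD_WIDTH);
            std::str::from_utf8(&chunk[..end]).expect("records are ASCII")
        })
        .collect()
}

/// Decode a u32 that was stored little-endian. `from_le_bytes` gives the same
/// value on little-endian and big-endian hosts (s390x included).
fn read_u32_le(bytes: [u8; 4]) -> u32 {
    u32::from_le_bytes(bytes)
}
```

Calling `split_records(VISIBLE_PREFIX)` yields the twelve subtags from "am"
through "csw". An endianness bug would need a code path that reinterprets
stored bytes as native-order integers, which is the kind of thing only the
test suite on a big-endian builder would catch.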

