https://bugzilla.redhat.com/show_bug.cgi?id=2358292

--- Comment #5 from Ben Beasley <code@xxxxxxxxxxxxxxxxxx> ---
(In reply to Fabio Valentini from comment #4)
> Hm, the data/* files exclusively contain generated code, not "data" per se.

True, it *is* generated data (byte buffers), but wrapped up in generated
boilerplate code.

> Not sure *what* they're generated from, but it doesn't look like the code
> for generating these files is part of the published crate,
> which would be a violation of the rules here:
> https://docs.fedoraproject.org/en-US/packaging-guidelines/what-can-be-packaged/#pregenerated-code

As far as I can tell, this is

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/Makefile.toml#L115

[tasks.ci-job-full-datagen]
description = "Run full data generation on latest CLDR and ICU"
category = "CI"
dependencies = [
    "bakeddata-check",
]

which refers to

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/tools/make/data.toml#L119

[tasks.bakeddata-check]
description = "Rebuild baked data and ensure that the working copy is clean"
category = "ICU4X Data"
dependencies = ["bakeddata"]
script_runner = "@duckscript"
script = '''
exit_on_error true

output = exec git status --porcelain=v1
output_length = length ${output.stdout}
if greater_than ${output_length} 0
    msg = array "" ""
    array_push ${msg} "Baked data needs to be updated. Please run `cargo make bakeddata`"
    array_push ${msg} ""
    array_push ${msg} "${output.stdout}"
    msg = array_join ${msg} "\n"
    trigger_error ${msg}
end
'''

which refers to

https://github.com/unicode-org/icu4x/blob/58e7b89140dd95dfc778b1bc34d88abefe598208/tools/make/data.toml#L102

[tasks.bakeddata]
description = "Builds full baked data"
category = "ICU4X Data"
script_runner = "@duckscript"
script = '''
exit_on_error true

if array_is_empty ${@}
    exec --fail-on-error cargo run -p bakeddata-scripts --release
else
    exec --fail-on-error cargo build -p bakeddata-scripts
    for component in ${@}
        exec --fail-on-error target/debug/bakeddata-scripts "${component}"
    end
end
'''

which relies on

https://github.com/unicode-org/icu4x/tree/release/1.5/tools/bakeddata-scripts

which uses

https://github.com/unicode-org/icu4x/tree/release/1.5/provider/datagen
https://crates.io/crates/icu_datagen

The bits of generated code appear to be scattered across

https://github.com/unicode-org/icu4x/blob/release/1.5/provider/datagen/src/baked_exporter.rs

They aren’t immediately recognizable there because most of each generated
block of code is templated-in names, but the comments and constructs I
spot-checked seemed to be present.

The data is ultimately encoded into buffers with databake,
https://src.fedoraproject.org/rpms/rust-databake.

I think it’s reasonable to argue that the generated code is actually all
boilerplate from the code generator (icu_datagen+databake), and doesn’t
have its own sources specific to this crate. In
https://docs.fedoraproject.org/en-US/packaging-guidelines/what-can-be-packaged/#pregenerated-code,
this is similar to the bison example: bison consumes a rules file (which
must be in the source RPM) and produces C sources, which contain a lot of
extra boilerplate code from bison itself, but the version of bison that did
the generating doesn’t have to be in the source RPM, or even in Fedora,
although it’s better if it can be.

The origin of the data in the buffers is another matter.
According to

https://github.com/unicode-org/icu4x/blob/release/1.5/tutorials/data-management.md

“Data generation is done using the icu_datagen crate, which pulls in data
from Unicode's Common Locale Data Repository (CLDR) and from ICU4C releases
to generate ICU4X data. The crate has a command line interface as well as a
Rust API, which can be used in Rust scripts.”

So the ultimate sources for the data are somewhere in

https://cldr.unicode.org/index/downloads

Looking at

https://github.com/unicode-org/icu4x/blob/release/1.5/provider/datagen/src/provider.rs

it appears that there could also be data from an ICU(4C) release, e.g.

https://github.com/unicode-org/icu/releases/download/release-77-1/icuexportdata_release-77-1.zip

I’m not sure how practical it is to follow the breadcrumb trails to
associate particular Unicode data files with particular baked data buffers.

Given README.md attributes the CLDR and ICU versions,

This data was generated with CLDR version 46.0.0-BETA2, ICU version
icu4x/2024-05-16/75.x, and LSTM segmenter version v0.1.0.

I suppose it should suffice to add the following as additional Sources to
cover the requirement to include original sources for the data:

https://github.com/unicode-org/icu/releases/download/icu4x%2F2024-05-16%2F75.x/icuexportdata_icu4x-2024-05-16-75.x.zip
https://github.com/unicode-org/cldr-json/releases/download/46.0.0-BETA2/cldr-46.0.0-BETA2-json-full.zip

This is a bit bulky, but I don’t know what else to do.

Whatever we decide to do with this crate, we’ll need to repeat the exercise
several times. There are 13 "icu_*_data" crates in ICU4X.

> 
> And I'm a little bit unsure whether we've already encountered issues with
> endianness in API calls like this one?
> 
> > zerovec::ZeroVec::from_bytes_unchecked(b"am\0ar\0as\0balbe\0bg\0bgcbhobn\0brxchrcswcv...")

There was https://github.com/unicode-org/icu4x/issues/6292, but that was
with rkyv, not zerovec/databake.
To be honest, I’m not sure how to check for endianness issues here other
than running the tests, which do pass on s390x
(https://copr.fedorainfracloud.org/coprs/music/idna1/build/8874479/).
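
On the endianness question, it may help that zerovec is designed around an
unaligned little-endian ("ULE") byte representation, so a baked byte string
should decode to the same values on any host. The sketch below illustrates
that idea in plain Rust only. It does not use zerovec, and decode_u16_le,
decode_u16_native, and the sample bytes are hypothetical names and data
chosen for illustration, not anything taken from the icu_*_data crates.

```rust
/// Decode a byte buffer as little-endian u16 values.
/// Hypothetical helper for illustration; this is not zerovec's API.
fn decode_u16_le(bytes: &[u8]) -> Option<Vec<u16>> {
    // A trailing odd byte cannot form a u16, so reject the buffer.
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

/// Decode the same buffer using the host's native byte order.
/// This is the host-dependent approach that a fixed-endian format avoids.
fn decode_u16_native(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_ne_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

fn main() {
    // Made-up sample buffer standing in for a baked byte string.
    let baked: &[u8] = b"\x01\x00\x02\x00\x00\x01";

    // Explicit little-endian decoding is the same on every architecture.
    let values = decode_u16_le(baked).expect("buffer has even length");
    println!("{:?}", values); // prints [1, 2, 256]

    // Native-order decoding agrees only on little-endian hosts.
    let native = decode_u16_native(baked).expect("buffer has even length");
    println!("native order matches: {}", native == values);
}
```

If the baked buffers are only ever read through an explicitly little-endian
decoding like the first function, a passing test suite on s390x is what one
would expect to see, which is consistent with the Copr build above.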