tools: fix DBC string-column detection false positives in both dbc_to_csv and asset_extract

The string-column auto-detector in both tools had two gaps that caused small
integer fields (RaceID=1, SexID=0/1, BaseSection, ColorIndex) to be falsely
classified as string columns, corrupting the generated CSVs:

1. No boundary check: a value of N was accepted as a valid string offset even
   when N landed inside a longer string (e.g. offset 3 inside "Character\...").
   Fix: precompute valid string-start boundaries (offset 0 plus every position
   immediately after a null byte); reject offsets that are not boundaries.

2. No diversity check: a column whose only non-zero value is 1 would pass the
   boundary test because offset 1 is always a valid boundary (it follows the
   mandatory null at offset 0). Fix: require at least 2 distinct non-empty
   string values before marking a column as a string column. A column like
   SexID, whose values are all 0 or 1, resolves only to "" and a single path
   fragment, so it is an integer field, not a string field.
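
The two rules above can be sketched together as follows. This is a minimal
illustration, not the code from either tool: computeStringBoundaries and
looksLikeStringColumn are hypothetical names, and the string block is assumed
to be available as a byte vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: collect every offset at which a string can start,
// i.e. offset 0 plus each position immediately after a null byte.
std::set<uint32_t> computeStringBoundaries(const std::vector<uint8_t>& block) {
    std::set<uint32_t> boundaries;
    if (block.empty()) return boundaries;
    boundaries.insert(0);
    // Stop one short of the end so a trailing null does not yield an
    // out-of-range boundary.
    for (std::size_t i = 0; i + 1 < block.size(); ++i) {
        if (block[i] == 0) boundaries.insert(static_cast<uint32_t>(i + 1));
    }
    return boundaries;
}

// Hypothetical per-column check applying both rules to one column's values.
bool looksLikeStringColumn(const std::vector<uint32_t>& values,
                           const std::vector<uint8_t>& block) {
    const std::set<uint32_t> boundaries = computeStringBoundaries(block);
    std::set<std::string> distinct;
    for (uint32_t v : values) {
        // Rule 1: the value must be a valid string-start boundary.
        if (boundaries.count(v) == 0) return false;
        // Read the string at this offset, never past the end of the block.
        std::string s;
        for (std::size_t i = v; i < block.size() && block[i] != 0; ++i) {
            s.push_back(static_cast<char>(block[i]));
        }
        if (!s.empty()) distinct.insert(s);
    }
    // Rule 2: at least two distinct non-empty strings.
    return distinct.size() >= 2;
}
```

With a block holding "abc" at offset 1 and "de" at offset 5, a column with
values {0, 1, 5} passes, {0, 1, 1} fails the diversity rule, and {1, 3} fails
the boundary rule.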

Both dbc_to_csv and asset_extract now produce correct column metadata,
e.g. CharSections.dbc yields "strings=6,7,8" instead of "strings=0,1,...,9".
Kelsi 2026-03-10 03:49:06 -07:00
parent 5b06a62d91
commit b31a2a66b6
2 changed files with 47 additions and 4 deletions

@@ -104,6 +104,7 @@ std::set<uint32_t> detectStringColumns(const DBCFile& dbc,
     for (uint32_t col = 0; col < fieldCount; ++col) {
         bool allZeroOrValid = true;
         bool hasNonZero = false;
+        std::set<std::string> distinctStrings;
         for (uint32_t row = 0; row < recordCount; ++row) {
             uint32_t val = dbc.getUInt32(row, col);
@@ -113,9 +114,18 @@ std::set<uint32_t> detectStringColumns(const DBCFile& dbc,
                 allZeroOrValid = false;
                 break;
             }
+            // Collect distinct non-empty strings for diversity check.
+            const char* s = reinterpret_cast<const char*>(stringBlock.data() + val);
+            if (*s != '\0') {
+                distinctStrings.insert(std::string(s, strnlen(s, 256)));
+            }
         }
-        if (allZeroOrValid && hasNonZero) {
+        // Require at least 2 distinct non-empty string values. Columns that
+        // only ever point to a single string (e.g. SexID=1 always resolves to
+        // the same path fragment at offset 1 in the block) are almost certainly
+        // integer fields whose small values accidentally land at a string boundary.
+        if (allZeroOrValid && hasNonZero && distinctStrings.size() >= 2) {
             stringCols.insert(col);
         }
     }
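
In the hunk above, strnlen(s, 256) caps the read at 256 bytes, not at the end
of the string block. Whether an offset near the end of the block can reach that
line depends on the validity check that is elided from the hunk. As a hedged
sketch, a read that stops at the block end could look like this.
readBlockString is a hypothetical helper, and the block is assumed to be a byte
vector:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: return the string starting at `offset`, stopping at the
// first null byte or at the end of the block, whichever comes first.
// Out-of-range offsets yield an empty string.
std::string readBlockString(const std::vector<uint8_t>& block, uint32_t offset) {
    if (offset >= block.size()) return {};
    const auto first = block.begin() + offset;
    const auto last = std::find(first, block.end(), uint8_t{0});
    return std::string(first, last);
}
```

The result can be inserted into distinctStrings whenever it is non-empty, which
keeps the diversity check unchanged while avoiding reads past the block.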