HiveStructurePaths

Documentation for HiveStructurePaths.

HiveStructurePaths.HiveSchema - Type
HiveSchema(; parsers::Dict, order::Vector, filename::String)

Defines the structure and parsing rules for a Hive file hierarchy.

Fields

  • parsers: Dict mapping key names to parsing functions
  • order: Vector defining the hierarchical order of keys in paths
  • filename: The target filename that appears in all Hive paths (one per schema)
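
As a rough illustration of the three fields, the sketch below declares a stand-in type with the same keyword constructor shape. The name SketchSchema and the concrete field types are assumptions made for this example; the package's actual definition of HiveSchema may differ.

```julia
# Hypothetical stand-in for HiveSchema; not the package's real definition.
Base.@kwdef struct SketchSchema
    parsers::Dict{String, Function}   # key name => function applied to the raw string value
    order::Vector{String}             # hierarchical order of keys in a path
    filename::String                  # file name expected at the end of every path
end

schema = SketchSchema(
    parsers = Dict{String, Function}(
        "criterion" => identity,
        "k"         => x -> parse(Int, x)
    ),
    order = ["criterion", "k"],
    filename = "data.arrow"
)

# Each parser converts the raw string taken from a path segment into a typed value.
schema.parsers["k"]("10")   # returns 10
```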
HiveStructurePaths.build_hive_path - Method
build_hive_path(schema::HiveSchema, base_dir::AbstractString; kwargs...) → String

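Construct a Hive-style output path with consistent key ordering.

The path structure follows the schema order: base_dir/key1=<val1>/key2=<val2>/.../filename, where filename comes from schema.filename. Keys that are not passed in kwargs are left out of the path, as the first example shows.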

Examples

const schema = HiveSchema(
    parsers = Dict{String, Function}(
        "criterion" => identity,
        "partition" => x -> parse(Int, x),
        "k"         => x -> parse(Int, x)
    ),
    order = ["criterion", "partition", "k"],
    filename = "data.arrow"
)

build_hive_path(schema, "data/binned"; criterion="depth_iso", partition=1)
# → "data/binned/criterion=depth_iso/partition=1/data.arrow"

build_hive_path(schema, "data/cluster_assignments"; partition=2, criterion="depth_iso", k=10)
# → "data/cluster_assignments/criterion=depth_iso/partition=2/k=10/data.arrow"
# Segments always follow the schema's `order`; the order in which `kwargs` are passed does not matter.

Arguments

  • base_dir: Base directory path
  • kwargs: Key-value pairs matching schema keys

Returns

Complete path string with Hive-style structure
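
The ordering rule can be sketched in a few lines of plain Julia. The helper sketch_build_path below is hypothetical and is not the package's implementation; it takes the order and filename directly instead of a HiveSchema, and it assumes that keys missing from kwargs are simply skipped, which is what the first example suggests.

```julia
# Hypothetical helper, not part of HiveStructurePaths.
function sketch_build_path(order::Vector{String}, filename::AbstractString,
                           base_dir::AbstractString; kwargs...)
    segments = String[]
    for key in order                      # iterate in schema order, not kwargs order
        sym = Symbol(key)
        # Assumption: keys that were not supplied are skipped.
        if haskey(kwargs, sym)
            push!(segments, string(key, "=", kwargs[sym]))
        end
    end
    return joinpath(base_dir, segments..., filename)
end

order = ["criterion", "partition", "k"]

# Same keys, different keyword order: the resulting paths are identical.
p1 = sketch_build_path(order, "data.arrow", "data/binned"; criterion="depth_iso", partition=1)
p2 = sketch_build_path(order, "data.arrow", "data/binned"; partition=1, criterion="depth_iso")
p1 == p2   # true
```

Because the loop walks the order vector rather than kwargs, the caller's keyword order cannot influence the result.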

HiveStructurePaths.find_hive_files - Method
find_hive_files(schema::HiveSchema, root_dir::AbstractString;
                validate_keys=[], error_if_empty=false) → Vector{String}

Recursively find files that match the schema's filename AND structure.

Arguments

  • validate_keys: List of keys (e.g. [:criterion]) that must be present in a path for it to be considered valid (default: []).
  • error_if_empty: If true, throw an error when no matching files are found (default: false).

Returns

Sorted list of absolute paths.
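
A recursive search of this kind could look roughly like the sketch below, built on Base's walkdir. The helper sketch_find_files is hypothetical and simplified: it takes the filename directly instead of a schema, and it only checks that each key in validate_keys appears as a key= directory segment, which may be weaker than the structural check the package performs.

```julia
# Hypothetical helper, not the package's implementation.
function sketch_find_files(filename::AbstractString, root_dir::AbstractString;
                           validate_keys=String[], error_if_empty::Bool=false)
    matches = String[]
    for (dir, _, files) in walkdir(root_dir)
        filename in files || continue
        path = abspath(joinpath(dir, filename))
        segments = splitpath(path)
        # Keep the path only if every required key shows up as a `key=...` segment.
        ok = all(k -> any(startswith(s, string(k, "=")) for s in segments), validate_keys)
        ok && push!(matches, path)
    end
    if error_if_empty && isempty(matches)
        error("no files named $filename found under $root_dir")
    end
    return sort!(matches)
end

# Build a small throwaway tree to search.
root = mktempdir()
for sub in ("criterion=depth_iso/partition=1", "criterion=depth_iso/partition=2/k=10", "scratch")
    mkpath(joinpath(root, sub))
    touch(joinpath(root, sub, "data.arrow"))
end

found = sketch_find_files("data.arrow", root; validate_keys=["criterion"])
length(found)   # 2: the file under "scratch" has no criterion= segment
```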

HiveStructurePaths.parse_hive_path - Method
parse_hive_path(schema::HiveSchema, path::AbstractString; required_keys=[]) → NamedTuple

Extract key-value pairs from Hive-style paths according to the schema.

Examples


const schema = HiveSchema(
    parsers = Dict{String, Function}(
        "criterion" => identity,
        "partition" => x -> parse(Int, x),
        "k"         => x -> parse(Int, x)
    ),
    order = ["criterion", "partition", "k"],
    filename = "data.arrow"
)

parse_hive_path(schema, "data/binned/criterion=depth_iso/partition=1/data.arrow")
# → (criterion="depth_iso", partition=1, k=nothing)

parse_hive_path(schema, "data/cluster_assignments/criterion=depth_iso/partition=2/k=10/data.arrow")
# → (criterion="depth_iso", partition=2, k=10)

# Require specific keys; an ErrorException is thrown if any of them is missing from the path
parse_hive_path(schema, "data/binned/criterion=depth_iso/partition=1/data.arrow"; required_keys=["criterion", "partition"])
# → (criterion="depth_iso", partition=1, k=nothing)

Arguments

  • path: Path string containing Hive-style key=value segments
  • required_keys: Optional list of keys that must be present (default: [])

Returns

NamedTuple with one field per schema key; keys absent from the path have the value nothing

Throws

  • ErrorException if any required_keys are missing from the path
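
The parsing and validation behaviour described above can be approximated with the sketch below. The helper sketch_parse_path is hypothetical and is not the package's implementation; it takes the parsers and order directly instead of a HiveSchema and assumes that every path component containing "=" is a key=value segment.

```julia
# Hypothetical helper, not the package's implementation.
function sketch_parse_path(parsers::Dict{String, Function}, order::Vector{String},
                           path::AbstractString; required_keys=String[])
    raw = Dict{String, String}()
    for segment in splitpath(path)
        parts = split(segment, "="; limit=2)
        if length(parts) == 2
            raw[parts[1]] = parts[2]      # raw string value, not yet parsed
        end
    end
    missing_keys = [string(k) for k in required_keys if !haskey(raw, string(k))]
    isempty(missing_keys) || error("missing required keys: " * join(missing_keys, ", "))
    # One entry per schema key, in schema order; absent keys become `nothing`.
    vals = [haskey(raw, k) ? parsers[k](raw[k]) : nothing for k in order]
    return NamedTuple{Tuple(Symbol.(order))}(Tuple(vals))
end

parsers = Dict{String, Function}(
    "criterion" => identity,
    "partition" => x -> parse(Int, x),
    "k"         => x -> parse(Int, x)
)
order = ["criterion", "partition", "k"]

nt = sketch_parse_path(parsers, order, "data/binned/criterion=depth_iso/partition=1/data.arrow")
# nt.criterion == "depth_iso", nt.partition == 1, nt.k === nothing

# A missing required key raises an ErrorException.
threw = try
    sketch_parse_path(parsers, order, "data/binned/criterion=depth_iso/data.arrow";
                      required_keys=["partition"])
    false
catch err
    err isa ErrorException
end
```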