Introduction

Inspired by RDatasets.jl, SmallDatasetMaker provides tools to create, extend, or update a Julia package of datasets in only a few steps.

Getting started

1. Create a package

Create a Julia package, for example, YourDatasets.jl. For convenience, YourDatasets hereinafter refers to an arbitrary package of datasets that works with SmallDatasetMaker.

See PkgTemplates and Pkg.jl/Creating Packages for how to create a Julia package.

SmallDatasetMaker should be added to the Project.toml of YourDatasets:

  • (YourDatasets) pkg> add SmallDatasetMaker

2. Convert the raw data to a dataset

Note

Activate the YourDatasets environment and run using SmallDatasetMaker first!

  1. Save the dataset to be compressed as a CSV file.
  2. Create a SourceData object whose srcfile is the path to this CSV file.
  3. Call compress_save! or compress_save.

SmallDatasetMaker.SourceData - Type

SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)

srcfile is the path to the source file, package_name is the folder in which the file resides, and dataset_name is the file name without its extension.

If timestamps is not specified, it defaults to today().

If description is not specified, it defaults to "".

If rows and columns are not specified, CSV.read(srcfile, DataFrame) is called to get the number of rows and columns.

If zipfile is not specified, it defaults to dir_data(package_name, dataset_name*".gz").

If title is not specified, it defaults to "Data [$dataset_name] of [$package_name]".

If package_name and dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.
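
The defaults described above can be sketched in plain Julia. This is only an illustration, not the package's implementation: default_metadata is a hypothetical helper, and joinpath("data", ...) is an assumed stand-in for dir_data, whose actual directory layout is not specified here.

```julia
using Dates

# Hypothetical helper (not part of SmallDatasetMaker) mirroring the documented defaults.
function default_metadata(package_name::AbstractString, dataset_name::AbstractString)
    return (
        title = "Data [$dataset_name] of [$package_name]",
        # Assumption: `joinpath("data", ...)` stands in for `dir_data(...)`.
        zipfile = joinpath("data", package_name, dataset_name * ".gz"),
        description = "",
        timestamps = today(),
    )
end

meta = default_metadata("Category_A", "Dataset_B")
println(meta.title)
```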

Example

using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)

SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., dataset_table(mod)), with abspath! applied.

This is for loading data according to dataset_table; thus, paths are resolved relative to mod rather than to the current directory.

SmallDatasetMaker.compress_save! - Function

compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the compressed file to SD.zipfile, and updates dataset_table(mod).

Options:

  • By default, move_source = true, so the source file is moved to dir_raw().
  • Set targeting_mod = true to make sure the compressed data is saved relative to the repo of mod. The default is false, which means you can compress and save the data to whatever directory you like.

compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]: relpath! is applied, so the paths in SD and in dataset_table(mod) are both made relative.
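
The idea of storing paths relative to the package directory can be illustrated with Base functions only. This is a sketch under assumptions: the root below is a made-up directory standing in for DATASET_ABS_DIR(mod)[], and the package's own relpath! and abspath! may behave differently in detail.

```julia
# Hypothetical package root; the real one would come from DATASET_ABS_DIR(mod)[].
root = abspath("YourDatasets")
zip_abs = joinpath(root, "data", "Hello", "world.gz")

# Store the path relative to the package root so that it does not depend on
# where the package is installed.
zip_rel = relpath(zip_abs, root)

# Resolve it against the root again when the data is loaded.
restored = joinpath(root, zip_rel)
```

Relative paths keep the dataset table valid regardless of where the package ends up being installed.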

Example

using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)

This does the following:

  1. Creates the compressed files under data/ of the YourDatasets package under development.
  2. Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
  3. Adds a new line to SmallDatasetMaker.dataset_table(YourDatasets) (i.e., updates data/doc/datasets.csv of YourDatasets).

See also SourceData, compress_save.

SmallDatasetMaker.compress_save - Function

compress_save(mod::Module, srcpath; args...) is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).

compress_save takes the same keyword arguments as compress_save!, which returns an SD::SourceData whose paths are relative to DATASET_ABS_DIR(mod)[].

Example

using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)

3. Add methods dataset and datasets

  • (Required) Add using SmallDatasetMaker in the module scope of YourDatasets.
  • (Optional) Define new methods for dataset and datasets.

Example

In src/YourDatasets.jl:


module YourDatasets

    # (required) See also `SmallDatasetMaker.datasets`.
    using SmallDatasetMaker

    # (optional but recommended)
    # To allow direct use of `dataset` without `SmallDatasetMaker`.
    function dataset(package_name, dataset_name)
        SmallDatasetMaker.dataset(YourDatasets, package_name, dataset_name)
    end

    # (optional but recommended) To allow the direct use of `YourDatasets.datasets()`.
    datasets() = SmallDatasetMaker.datasets(YourDatasets)
end

4. Use YourDatasets

If the new methods YourDatasets.dataset and YourDatasets.datasets have been defined:

using YourDatasets
YourDatasets.datasets() # a DataFrame for all available packages and datasets
df = YourDatasets.dataset("LHVRSHIVA", "SHIVA") # load dataset "SHIVA" in package "LHVRSHIVA" as a DataFrame

If the new methods YourDatasets.dataset and YourDatasets.datasets have NOT been defined:

using YourDatasets, SmallDatasetMaker
SmallDatasetMaker.datasets(YourDatasets)
df = SmallDatasetMaker.dataset(YourDatasets, "LHVRSHIVA", "SHIVA")

Best practice/Hints

Keep the default branch clean without raw data

  • Commit and push only the compressed .gz files and the updated data/doc/datasets.csv.
  • You may work on an alternative branch, e.g. new-dataset-from-raw, then use git merge --no-ff new-dataset-from-raw to merge it into your default branch and manually un-stage all artifacts.
  • Note that if the default branch isn't clean, pkg> add YourDatasets will take up unnecessary extra disk space.
  • You may simply follow the hygiene of
    • always placing raw data in data/raw/ and
    • adding data/raw/ to .gitignore
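
The .gitignore hygiene above can be automated from Julia. The following is a minimal sketch using Base only. ensure_ignored is a hypothetical helper, not part of SmallDatasetMaker, and the temporary directory merely stands in for the root of your YourDatasets repository.

```julia
# Hypothetical helper: append `entry` to `.gitignore` in `repo_dir`
# unless it is already listed. Returns true if the file was modified.
function ensure_ignored(repo_dir::AbstractString, entry::AbstractString = "data/raw/")
    path = joinpath(repo_dir, ".gitignore")
    content = isfile(path) ? read(path, String) : ""
    entry in strip.(split(content, '\n')) && return false  # already ignored
    open(path, "a") do io
        # Start on a fresh line if the existing file lacks a trailing newline.
        if !isempty(content) && !endswith(content, "\n")
            println(io)
        end
        println(io, entry)
    end
    return true
end

demo_repo = mktempdir()    # stand-in for the repository root
ensure_ignored(demo_repo)  # writes "data/raw/" into demo_repo/.gitignore
```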

Optional

Test

You may also optionally have the following tests in YourDatasets:

Test that the table listing the datasets of YourDatasets is intact and that every dataset can be loaded:

# Load the packages once at top level rather than inside each testset.
using Test, DataFrames, YourDatasets

@testset "Test if datasets() works" begin
    df = YourDatasets.datasets()
    @test isa(df, DataFrame)
    @test isa(YourDatasets.__datasets, DataFrame)
end
@testset "Test if ALL datasets can be successfully loaded." begin
    for row in eachrow(YourDatasets.__datasets)
        pkgnm = row.PackageName
        datnm = row.Dataset
        df = YourDatasets.dataset(pkgnm, datnm)
        @info "$pkgnm/$datnm loaded without error."
        # The recorded dimensions must match the loaded DataFrame.
        @test row.Columns == ncol(df)
        @test row.Rows == nrow(df)
    end
    @test true
end

See also

How dataset and datasets work

See also

SmallDatasetMaker.datasets - Function

datasets(mod::Module) reads the table from dataset_table(mod) and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).

If there is no using SmallDatasetMaker in module $mod ... end, it will fail, since it is executed in the scope of mod.
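
One way a function can define a const inside another module is to evaluate an expression in that module. The sketch below illustrates the mechanism only and assumes nothing about how datasets(mod) is actually implemented. DemoDatasets and set_table! are hypothetical names, and a plain vector stands in for the DataFrame.

```julia
module DemoDatasets end  # stand-in for YourDatasets

# Hypothetical helper: define `__datasets` as a `const` inside `mod`.
function set_table!(mod::Module, table)
    # The expression is evaluated in `mod`, so every name it mentions
    # must be visible inside `mod`.
    Core.eval(mod, :(const __datasets = $table))
    return nothing
end

set_table!(DemoDatasets, [("Hello", "world")])
println(DemoDatasets.__datasets)
```

Because the expression runs in the target module, any function it calls has to be imported there, which is consistent with the requirement of using SmallDatasetMaker inside the module.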

SmallDatasetMaker.dataset - Function

dataset(mod, package_name::AbstractString, dataset_name::AbstractString) returns a DataFrame decompressed from the file referenced by the last row returned by target_row(mod, package_name, dataset_name). This function mimics the dataset function in RDatasets.jl.

dataset(target_path) decompresses target_path and returns it as a DataFrame.

Notice!

If you intend to load target_path from SmallDatasetMaker or any other package rather than from your current working directory, build it with abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), i.e., target_path = SmallDatasetMaker.abspath(...).

How package_name and dataset_name are automatically determined:

SD = SourceData("data/raw/Hello/world.csv");
(SD.package_name, SD.dataset_name)
("Hello", "world")
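
The derivation shown above can be reproduced with Base path functions. This is a minimal sketch, not the package's code: derive_names is a hypothetical helper that assumes the package name is the parent folder and the dataset name is the file name without its extension.

```julia
# Hypothetical helper mirroring the documented behaviour of
# `get_package_dataset_name`: parent folder and extension-free file name.
function derive_names(srcpath::AbstractString)
    package_name = basename(dirname(srcpath))
    dataset_name = first(splitext(basename(srcpath)))
    return (package_name, dataset_name)
end

names_hello = derive_names("data/raw/Hello/world.csv")
names_iris = derive_names(joinpath("Whatever", "RDatasets", "iris.csv"))
```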

See also

SmallDatasetMaker.get_package_dataset_name - Function

Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and dataset name from srcpath.

Example

srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)

# output

("RDatasets", "iris")

Where is the raw data

Difference between the usage of YourDatasets and RDatasets

The main differences between using YourDatasets (created with SmallDatasetMaker) and using RDatasets are:

  • For RDatasets, RDatasets.__datasets is a global variable, whereas YourDatasets.__datasets is a const variable.