Introduction

Inspired by RDatasets.jl, SmallDatasetMaker provides tools to create, extend, or update a Julia package of datasets in only a few steps.

Getting started

1. Create a package

Create a Julia package, for example, YourDatasets.jl. For convenience, YourDatasets hereafter refers to an arbitrary package of datasets that works with SmallDatasetMaker.

See PkgTemplates and Pkg.jl/Creating Packages for how to create a Julia package.

SmallDatasetMaker should be added to the Project.toml of YourDatasets:

(YourDatasets) pkg> add SmallDatasetMaker
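
The same setup can also be scripted through the Pkg API instead of the pkg> prompt. The following is a minimal sketch, assuming the package is generated inside an existing parent directory; the helper setup_dataset_package and its add_dependency keyword are hypothetical names and not part of SmallDatasetMaker.

```julia
using Pkg

# Hypothetical helper (not part of SmallDatasetMaker): generate a bare package
# and, optionally, add SmallDatasetMaker to its Project.toml.
function setup_dataset_package(parent_dir::AbstractString;
                               name::AbstractString = "YourDatasets",
                               add_dependency::Bool = true)
    pkg_dir = joinpath(parent_dir, name)
    Pkg.generate(pkg_dir)   # creates Project.toml and src/<name>.jl
    Pkg.activate(pkg_dir)   # make the new package the active environment
    if add_dependency
        Pkg.add("SmallDatasetMaker")   # requires network access to the registry
    end
    return pkg_dir
end
```

Calling setup_dataset_package(pwd()) would leave the new environment active; Pkg.activate() without arguments returns to the default environment.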

2. Convert the raw data to a dataset

Note

Activate the YourDatasets environment and load SmallDatasetMaker (using SmallDatasetMaker) first!

Save the dataset to be compressed as a CSV file.
Define a SourceData object with srcfile set to the path of this CSV file.
Call compress_save! or compress_save.
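
As a rough illustration of the first step, the raw table can be written to a CSV file before a SourceData object is constructed. This sketch uses only Base Julia; the column names, values, and temporary root directory are made-up placeholders, and the path simply mirrors the data/raw/Category_A/Dataset_B.csv example used in this documentation. In practice, CSV.write from CSV.jl would typically do this job.

```julia
# Hypothetical example table; the column names and values are placeholders.
header = ["id", "value"]
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Stand-in for the root of YourDatasets; the layout mirrors the example path.
root = mktempdir()
srcfile = joinpath(root, "data", "raw", "Category_A", "Dataset_B.csv")
mkpath(dirname(srcfile))

# Write the header line followed by one comma-separated line per row.
open(srcfile, "w") do io
    println(io, join(header, ","))
    for row in rows
        println(io, join(row, ","))
    end
end
```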

SmallDatasetMaker.SourceData — Type

SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)

srcfile is the path to the source file; package_name is the name of the folder in which the file resides, and dataset_name is the file name without its extension.

If timestamps is not specified, it defaults to today().

If description is not specified, it defaults to "".

If rows and columns are not specified, CSV.read(srcfile, DataFrame) is called to obtain the number of rows and columns.

If zipfile is not specified, it defaults to dir_data(package_name, dataset_name*".gz").

If title is not specified, it defaults to "Data [$dataset_name] of [$package_name]".

If package_name and dataset_name are not specified, they are obtained from (package_name, dataset_name) = get_package_dataset_name(srcfile).

Example

using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)
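
To illustrate the defaults described in this docstring, the following sketch derives package_name, dataset_name, and the default title from srcfile using only Base path functions. It is an assumption based on the description here, not the actual implementation of get_package_dataset_name; guess_names and default_title are hypothetical names.

```julia
# Hypothetical re-implementation, for illustration only.
function guess_names(srcfile::AbstractString)
    package_name = basename(dirname(srcfile))          # folder containing the file
    dataset_name = first(splitext(basename(srcfile)))  # file name without extension
    return package_name, dataset_name
end

# Default title format as stated in the docstring.
default_title(package_name, dataset_name) = "Data [$dataset_name] of [$package_name]"

package_name, dataset_name = guess_names("data/raw/Category_A/Dataset_B.csv")
title = default_title(package_name, dataset_name)
```

For the example path, this yields Category_A as the package name and Dataset_B as the dataset name.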

SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., dataset_table(mod)), with abspath! applied.

This method is for loading data according to dataset_table; paths are therefore resolved relative to mod rather than to the current directory.

SmallDatasetMaker.compress_save! — Function

compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the compressed file to SD.zipfile, and updates dataset_table(mod).

Options:

move_source = true (default): the source file is moved to dir_raw().
targeting_mod = true: ensures that the compressed data is saved relative to the repo of mod. The default is false, which means you can compress and save the data to any directory you like.

compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied, so the paths in SD as well as in dataset_table(mod) are made relative.

Example

using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)

This does the following:

Creates the compressed file under data/ of the package YourDatasets in development.
Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
Adds a new line to SmallDatasetMaker.dataset_table(YourDatasets) (i.e., updates data/doc/datasets.csv of YourDatasets).
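
The relative paths returned by compress_save! can be pictured with Base path functions alone. This is only a sketch under assumed names: root stands in for DATASET_ABS_DIR(YourDatasets)[], the data/Category_A/Dataset_B.gz layout is assumed, and the two steps imitate what relpath! and abspath! are described as doing rather than calling them.

```julia
# Assumed layout; `root` stands in for DATASET_ABS_DIR(YourDatasets)[].
root = joinpath(homedir(), "YourDatasets")
zipfile_abs = joinpath(root, "data", "Category_A", "Dataset_B.gz")

# Make the path relative to the package root (a relpath!-style step),
# so that the stored table does not depend on where the package is installed.
zipfile_rel = relpath(zipfile_abs, root)

# Resolve it against the root again when loading (an abspath!-style step).
zipfile_restored = joinpath(root, zipfile_rel)
```

Storing relative paths is what lets the same datasets.csv work on any machine where the package is installed.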

3. Add methods `dataset` and `datasets`

Add using SmallDatasetMaker in the module scope of YourDatasets.
(Optional) Define new methods for dataset and datasets.

Example

In src/YourDatasets.jl:


module YourDatasets

    using SmallDatasetMaker # (required) See also `SmallDatasetMaker.datasets`.

    # (optional but recommended)
    # To allow direct use of `dataset` without `SmallDatasetMaker`.
    function dataset(package_name, dataset_name)
        SmallDatasetMaker.dataset(YourDatasets, package_name, dataset_name)
    end

    # (optional but recommended)
    # To allow the direct use of `YourDatasets.datasets()`.
    datasets() = SmallDatasetMaker.datasets(YourDatasets)
end

4. Use `YourDatasets`

If the new methods YourDatasets.dataset and YourDatasets.datasets have been defined:

using YourDatasets
YourDatasets.datasets() # a DataFrame of all available packages and datasets
df = YourDatasets.dataset("LHVRSHIVA", "SHIVA") # load dataset "SHIVA" in package "LHVRSHIVA" as a DataFrame

If the new methods YourDatasets.dataset and YourDatasets.datasets have NOT been defined:

using YourDatasets, SmallDatasetMaker
SmallDatasetMaker.datasets(YourDatasets)
df = SmallDatasetMaker.dataset(YourDatasets, "LHVRSHIVA", "SHIVA")

Best practice/Hints

Keep the default branch clean without raw data

Commit and push only the compressed .gz files and the updated data/doc/datasets.csv.
You may work on a separate branch, e.g. new-dataset-from-raw, merge it into your default branch with git merge --no-ff new-dataset-from-raw, and manually un-stage all artifacts.

Note that if the default branch isn't clean, pkg> add YourDatasets will take up unnecessary disk space.
A simple way to keep it clean is to
- always place raw data in data/raw/ and
- add data/raw/ to .gitignore
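
The second point can be automated with a few lines of Base Julia. This is a minimal sketch; ensure_ignored is a hypothetical helper, not part of SmallDatasetMaker, and it assumes the .gitignore lives at the root of the repository.

```julia
# Hypothetical helper: list `entry` in .gitignore unless it is already there.
# Returns true if the file was changed and false otherwise.
function ensure_ignored(repo_dir::AbstractString, entry::AbstractString = "data/raw/")
    gitignore = joinpath(repo_dir, ".gitignore")
    lines = isfile(gitignore) ? readlines(gitignore) : String[]
    if entry in strip.(lines)
        return false   # already ignored, nothing to do
    end
    # Rewrite the file with the new entry appended on its own line.
    write(gitignore, join(vcat(lines, entry), "\n") * "\n")
    return true
end
```

Running ensure_ignored(pwd()) from the root of YourDatasets before committing keeps raw data out of the repository even if it is staged by accident.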

Optional

Test

You may optionally add the following tests to YourDatasets:

Test that the dataset table of YourDatasets is valid and that every listed dataset can be loaded:

using Test, DataFrames, YourDatasets

@testset "Test if datasets() works" begin
    df = YourDatasets.datasets()
    @test isa(df, DataFrame)
    @test isa(YourDatasets.__datasets, DataFrame)
end

@testset "Test if ALL datasets can be successfully loaded." begin
    # Load every dataset listed in the table and compare its size with the recorded one.
    for row in eachrow(YourDatasets.__datasets)
        pkgnm = row.PackageName
        datnm = row.Dataset
        df = YourDatasets.dataset(pkgnm, datnm)
        @info "$pkgnm/$datnm goes through `PrepareTableDefault` without error."
        @test row.Columns == ncol(df)
        @test row.Rows == nrow(df)
    end
    @test true
end

Difference between the usage of `YourDatasets` and `RDatasets`

The main differences between using YourDatasets (created by SmallDatasetMaker) and RDatasets are:

In RDatasets, RDatasets.__datasets is a global variable, whereas YourDatasets.__datasets is a const variable.
