Introduction
Inspired by RDatasets.jl, SmallDatasetMaker provides tools to create, extend, and update a Julia package of datasets in only a few steps.
Getting started
1. Create a package
Create a Julia package, for example YourDatasets.jl. For convenience, YourDatasets hereinafter refers to any dataset package that works with SmallDatasetMaker.
See PkgTemplates and Pkg.jl/Creating Packages for how to create a Julia package.
SmallDatasetMaker should be added to the Project.toml of YourDatasets:
(YourDatasets) pkg> add SmallDatasetMaker
2. Convert the raw data to a dataset
First, activate the environment of YourDatasets and load SmallDatasetMaker (using SmallDatasetMaker).
- Save the dataset to be compressed as a CSV file.
- Define a SourceData object with srcpath set to the path of this CSV file.
- Call compress_save! or compress_save.
SmallDatasetMaker.SourceData — Type
SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)
srcfile is the path to the source file, package_name is the name of the folder in which the file resides, and dataset_name is the file name without its extension.
If timestamps is not specified, it defaults to today().
If description is not specified, it defaults to "".
If rows and columns are not specified, CSV.read(srcfile, DataFrame) is applied to get the number of rows and columns.
If zipfile is not specified, it defaults to dir_data(package_name, dataset_name*".gz").
If title is not specified, it defaults to "Data [$dataset_name] of [$package_name]".
If package_name and dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.
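The defaults listed above can be sketched with Base and Dates alone. This is only an illustration of the documented behavior, not the package's implementation: derive_names and default_fields are hypothetical helpers, and the zipfile, rows, and columns defaults are left out because they depend on dir_data and CSV.read.

```julia
using Dates

# Hypothetical helper (not part of SmallDatasetMaker):
# the package name is the folder containing the file,
# the dataset name is the file name without its extension.
function derive_names(srcfile::AbstractString)
    package_name = basename(dirname(srcfile))
    dataset_name = first(splitext(basename(srcfile)))
    return package_name, dataset_name
end

# Hypothetical helper: collect the documented default values in a NamedTuple.
function default_fields(srcfile::AbstractString)
    package_name, dataset_name = derive_names(srcfile)
    return (
        package_name = package_name,
        dataset_name = dataset_name,
        title = "Data [$dataset_name] of [$package_name]",
        description = "",
        timestamps = today(),
    )
end

fields = default_fields(joinpath("data", "raw", "Category_A", "Dataset_B.csv"))
println(fields.title)
```

With the path used here, the package name resolves to Category_A and the dataset name to Dataset_B, which is why the folder layout of the raw data matters.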
Example
using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)
SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., a row of dataset_table(mod)), with abspath! applied.
This method is for loading data according to dataset_table; thus, paths are resolved relative to mod rather than to the current directory.
SmallDatasetMaker.compress_save! — Function
compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the compressed file to SD.zipfile, and updates dataset_table(mod).
Options:
- move_source = true (the default) moves the source file to dir_raw().
- targeting_mod = true ensures that the compressed data is saved relative to the repository of mod. The default is false, which lets you compress and save the data to any directory you like.
compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied so that the paths in both SD and dataset_table(mod) become relative.
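To illustrate why relative paths are stored, here is a minimal sketch using only Base functions. It assumes a hypothetical package root (pkg_root stands in for whatever DATASET_ABS_DIR(mod)[] points to) and a hypothetical file location under data/; it does not call relpath! or any other SmallDatasetMaker function.

```julia
# Hypothetical package root; stands in for the directory behind DATASET_ABS_DIR(mod)[].
pkg_root = abspath("YourDatasets")

# Hypothetical absolute path of a compressed dataset inside that package.
zip_abs = joinpath(pkg_root, "data", "Category_A", "Dataset_B.gz")

# Storing the path relative to the package root keeps the table portable:
# it no longer depends on where the package is installed.
zip_rel = relpath(zip_abs, pkg_root)

# At load time the absolute path can be rebuilt from any installation root.
zip_back = joinpath(pkg_root, zip_rel)

println(zip_rel)
```

A table of relative paths can be committed to the repository and still resolve correctly on another machine, which an absolute path could not.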
Example
using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)
This does the following:
- Creates the compressed files under data/ of the package YourDatasets in development.
- Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
- Adds a new row to SmallDatasetMaker.dataset_table(YourDatasets) (i.e., updates data/doc/datasets.csv of YourDatasets).
See also SourceData, compress_save.
SmallDatasetMaker.compress_save — Function
compress_save(mod::Module, srcpath; args...) is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).
compress_save takes the same keyword arguments as compress_save!, which returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[].
Example
using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)
3. Add methods dataset and datasets
- Add using SmallDatasetMaker in the module scope of YourDatasets.
- (Optional) Define new methods for dataset and datasets.
Example
In src/YourDatasets.jl:
module YourDatasets
    # (required) See also `SmallDatasetMaker.datasets`.
    using SmallDatasetMaker

    # (optional but recommended)
    # To allow direct use of `dataset` without `SmallDatasetMaker`.
    function dataset(package_name, dataset_name)
        SmallDatasetMaker.dataset(YourDatasets, package_name, dataset_name)
    end

    # (optional but recommended) To allow the direct use of `YourDatasets.datasets()`
    datasets() = SmallDatasetMaker.datasets(YourDatasets)
end
4. Use YourDatasets
If the new methods YourDatasets.dataset and YourDatasets.datasets have been defined:
using YourDatasets
YourDatasets.datasets() # a DataFrame of all available packages and datasets
df = YourDatasets.dataset("LHVRSHIVA", "SHIVA") # load dataset "SHIVA" in package "LHVRSHIVA" as a DataFrame
If the new methods YourDatasets.dataset and YourDatasets.datasets have NOT been defined:
using YourDatasets, SmallDatasetMaker
SmallDatasetMaker.datasets(YourDatasets)
df = SmallDatasetMaker.dataset(YourDatasets, "LHVRSHIVA", "SHIVA")
Best practice/Hints
Keep the default branch clean without raw data
- Commit and push only the compressed .gz files and the updated data/doc/datasets.csv.
- You may work on an alternative branch, e.g. new-dataset-from-raw, merge it into your default branch with git merge --no-ff new-dataset-from-raw, and manually un-stage all artifacts.
- Note that if the default branch isn't clean, pkg> add YourDatasets will take up unnecessary extra disk space.
- You may simply follow this hygiene:
  - always place raw data in data/raw/, and
  - add data/raw/ to .gitignore.
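The .gitignore step above can also be scripted. The following is a minimal sketch in plain Julia; ensure_ignored! is a hypothetical helper and not part of SmallDatasetMaker, and a temporary directory stands in for the repository of YourDatasets.

```julia
# Hypothetical helper: make sure `entry` is listed in the `.gitignore` of `repo`.
# Existing lines are kept; the entry is added only if it is not already present.
function ensure_ignored!(repo::AbstractString, entry::AbstractString = "data/raw/")
    path = joinpath(repo, ".gitignore")
    lines = isfile(path) ? readlines(path) : String[]
    if entry ∉ strip.(lines)
        push!(lines, entry)
        write(path, join(lines, "\n") * "\n")
    end
    return path
end

# Demonstration in a temporary directory standing in for the repository.
repo = mktempdir()
ensure_ignored!(repo)
ensure_ignored!(repo)  # calling it twice does not duplicate the entry
ignored = readlines(joinpath(repo, ".gitignore"))
```

Running this once per repository keeps raw CSV files out of every branch, so the merge-and-un-stage step is no longer needed for them.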
Optional
Test
You may optionally add the following tests to YourDatasets.
Test that the table listing the datasets of YourDatasets loads correctly:
@testset "Test if datasets() works" begin
    using DataFrames
    df = YourDatasets.datasets()
    @test isa(df, DataFrame)
    @test isa(YourDatasets.__datasets, DataFrame)
end
@testset "Test if ALL datasets can be successfully loaded." begin
    using DataFrames
    for lastrow in eachrow(YourDatasets.__datasets)
        pkgnm = lastrow.PackageName
        datnm = lastrow.Dataset
        df = YourDatasets.dataset(pkgnm, datnm)
        @info "$pkgnm/$datnm goes through `PrepareTableDefault` without error."
        @test lastrow.Columns == ncol(df)
        @test lastrow.Rows == nrow(df)
    end
    @test true
end
See also
How dataset and datasets work
See also
SmallDatasetMaker.datasets — Function
datasets(mod::Module) reads the table from dataset_table(mod) and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).
If there is no using SmallDatasetMaker inside module $mod ... end, it will fail because it is executed in the scope of mod.
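How a function can define a const variable in the scope of another module can be sketched with Core.eval. This is an assumption about the general mechanism, not a copy of the package's code: DemoDatasets and set_table! are hypothetical names, and a plain vector stands in for the DataFrame.

```julia
# Hypothetical stand-in for a dataset package.
module DemoDatasets end

# Hypothetical helper: evaluate `const __datasets = table` inside `mod`.
# The expression runs in the scope of `mod`, so any name it uses
# must be available there (hence the need for `using` inside the module).
function set_table!(mod::Module, table)
    Core.eval(mod, :(const __datasets = $table))
    return nothing
end

set_table!(DemoDatasets, [("RDatasets", "iris")])

# The binding now lives in the module and is constant.
println(DemoDatasets.__datasets)
println(isconst(DemoDatasets, :__datasets))
```

Because the definition is evaluated inside the target module, a missing import there surfaces as an error at this point rather than in the calling code.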
SmallDatasetMaker.dataset — Function
dataset(package_name::AbstractString, dataset_name::AbstractString) returns a DataFrame decompressed from the dataset referenced by the last row returned by target_row(mod, package_name, dataset_name). This function mimics the dataset function in RDatasets.jl.
dataset(target_path) decompresses target_path and returns it as a DataFrame.
Notice!
If you intend to load target_path from SmallDatasetMaker or any other package rather than from the current working directory, build it with abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), i.e., target_path = SmallDatasetMaker.abspath(...).
How package_name and dataset_name are automatically determined:
SD = SourceData("data/raw/Hello/world.csv");
(SD.package_name, SD.dataset_name)
("Hello", "world")
See also
SmallDatasetMaker.get_package_dataset_name — Function
Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and dataset name from srcpath.
Example
srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)
# output
("RDatasets", "iris")
Where is the raw data
SmallDatasetMaker.dir_raw — Function
Path to the directory "data/raw/" of module mod; the default directory for the raw data.
Difference between the usage of YourDatasets and RDatasets
Here are the main differences between using YourDatasets (created by SmallDatasetMaker) and RDatasets:
- RDatasets.__datasets is a global variable, whereas YourDatasets.__datasets is a const variable.