Introduction
Inspired by RDatasets.jl, SmallDatasetMaker provides tools to create/add/update a Julia package of datasets in only a few steps.
Getting started
1. Create a package
Create a Julia package, for example, YourDatasets.jl. For convenience, YourDatasets hereinafter refers to an arbitrary package of datasets that works with SmallDatasetMaker.
See PkgTemplates and Pkg.jl/Creating Packages for how to create a Julia package.
SmallDatasetMaker should be added to the Project.toml of YourDatasets:
(YourDatasets) pkg> add SmallDatasetMaker
2. Convert the raw data to a dataset
First, activate the environment of YourDatasets and run using SmallDatasetMaker.
- Save the dataset to be compressed as a CSV file.
- Define the SourceData object with srcpath set to the path to this CSV file.
- Call compress_save! or compress_save.
SmallDatasetMaker.SourceData — Type
SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)
srcfile is the path to the source file; package_name is the name of the folder in which the file resides; dataset_name is the name of the file without its extension.
If timestamps is not specified, it will be today().
If description is not specified, it will be "".
If rows, columns are not specified, CSV.read(srcfile, DataFrame) will be applied to get the number of rows/columns.
If zipfile is not specified, it will be dir_data(package_name, dataset_name*".gz").
If title is not specified, it will be "Data [$dataset_name] of [$package_name]".
If package_name, dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.
Example
using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)
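The defaulting rules listed above can be mimicked with a few lines of plain Julia. This is only a sketch of the documented behavior, not the constructor of SourceData itself: the helper name sketch_defaults is hypothetical, and the rows, columns, and zipfile fields are left out because they depend on CSV.read and dir_data.

```julia
using Dates

# Hypothetical helper (not part of SmallDatasetMaker): fills in the documented
# defaults for `title`, `description`, and `timestamps` when they are omitted.
function sketch_defaults(package_name, dataset_name;
                         title = nothing, description = nothing, timestamps = nothing)
    return (
        title = something(title, "Data [$dataset_name] of [$package_name]"),
        description = something(description, ""),
        timestamps = something(timestamps, today()),
    )
end

d = sketch_defaults("Category_A", "Dataset_B")
println(d.title)
```

Passing a keyword explicitly, for example sketch_defaults("Category_A", "Dataset_B"; title = "My title"), keeps the given value and only the remaining fields fall back to their defaults.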
SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., dataset_table(mod)), with abspath! applied.
This is for loading data according to dataset_table; thus, paths are resolved relative to mod rather than to the current directory.
SmallDatasetMaker.compress_save! — Function
compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false)
compresses SD.srcfile, saves the zipped file to SD.zipfile, and updates dataset_table(mod).
Options:
- By default, move_source = true, so the source file is moved to dir_raw().
- Set targeting_mod = true to make sure the compressed data is saved relative to the repo of mod. The default is false, which means you can compress and save the data to whatever directory you like.
compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied, so the paths in SD as well as in dataset_table(mod) are modified to be relative.
Example
using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)
This does the following:
- Creates zipped files under data/ of the package YourDatasets in development.
- Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
- Adds a new line to SmallDatasetMaker.dataset_table(YourDatasets) (updates data/doc/datasets.csv of YourDatasets).
See also SourceData, compress_save.
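To give a feel for the compression step, the following sketch round-trips a small CSV text through gzip. It assumes the third-party package CodecZlib.jl is installed and that the .gz files are ordinary gzip streams; this document does not show which routine compress_save! actually uses, so treat it as an illustration rather than the package's implementation.

```julia
using CodecZlib  # assumption: CodecZlib.jl is available in the active environment

# A tiny CSV payload standing in for the content of `SD.srcfile`.
csv_text = "x,y\n1,2\n3,4\n"
raw = Vector{UInt8}(csv_text)

# Compress to gzip bytes (what would be written to a .gz file) ...
zipped = transcode(GzipCompressor, raw)

# ... and decompress again to recover the original text.
restored = String(transcode(GzipDecompressor, zipped))
println(restored == csv_text)
```

Writing zipped with write(path, zipped) and reading it back with read(path) would give the same bytes, which is why only the compressed file needs to be committed.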
SmallDatasetMaker.compress_save — Function
compress_save(mod::Module, srcpath; args...)
is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).
compress_save takes the same keyword arguments as compress_save!, which returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[].
Example
using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)
3. Add methods dataset and datasets
- using SmallDatasetMaker in the module scope of YourDatasets
- (Optional) New methods for dataset and datasets.
Example
In src/YourDatasets.jl:
module YourDatasets
    using SmallDatasetMaker # (required) See also `SmallDatasetMaker.datasets`.

    # (optional but recommended)
    # To allow direct use of `dataset` without `SmallDatasetMaker`.
    function dataset(package_name, dataset_name)
        SmallDatasetMaker.dataset(YourDatasets, package_name, dataset_name)
    end

    # (optional but recommended) To allow the direct use of `YourDatasets.datasets()`
    datasets() = SmallDatasetMaker.datasets(YourDatasets)
end
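The forwarding pattern in the module above can be tried in isolation with two toy modules. ToyMaker and ToyDatasets are hypothetical stand-ins for SmallDatasetMaker and YourDatasets, and the toy dataset function returns a string instead of a DataFrame; the point is only that the wrapper passes its own module as the first argument.

```julia
# Hypothetical stand-in for SmallDatasetMaker: the generic function takes the
# owning module as its first argument.
module ToyMaker
    dataset(mod::Module, package_name, dataset_name) =
        "$(nameof(mod)):$package_name/$dataset_name"
end

# Hypothetical stand-in for YourDatasets: a thin wrapper that forwards to the
# generic function with itself as `mod`.
module ToyDatasets
    using ..ToyMaker
    dataset(package_name, dataset_name) =
        ToyMaker.dataset(ToyDatasets, package_name, dataset_name)
end

result = ToyDatasets.dataset("Hello", "world")
println(result)
```

With the wrapper in place, callers need only the dataset package in scope, which is the convenience the optional methods are meant to provide.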
4. Use YourDatasets
If the new methods YourDatasets.dataset and YourDatasets.datasets have been created:
using YourDatasets
YourDatasets.datasets() # a DataFrame of all available packages and datasets
df = YourDatasets.dataset("LHVRSHIVA", "SHIVA") # load dataset "SHIVA" in package "LHVRSHIVA" as a DataFrame
If the new methods YourDatasets.dataset and YourDatasets.datasets have NOT been created:
using YourDatasets, SmallDatasetMaker
SmallDatasetMaker.datasets(YourDatasets)
df = SmallDatasetMaker.dataset(YourDatasets, "LHVRSHIVA", "SHIVA")
Best practice/Hints
Keep the default branch clean without raw data
- Commit and push only the compressed .gz files and the updated data/doc/datasets.csv.
- You may work on an alternative branch, e.g. new-dataset-from-raw, then use git merge --no-ff new-dataset-from-raw to merge it into your default branch and manually un-stage all artifacts.
- Note that if the default branch isn't clean, pkg> add YourDatasets will take up unnecessary extra disk space.
- You may simply follow this hygiene:
  - always place raw data in data/raw/, and
  - add data/raw/ to .gitignore.
Optional
Test
You may optionally add the following tests to YourDatasets:
Test whether the table listing the datasets of YourDatasets is fine:
# Load dependencies once at top level, before the test sets.
using Test, DataFrames, YourDatasets

@testset "Test if datasets() works" begin
    df = YourDatasets.datasets()
    @test isa(df, DataFrame)
    @test isa(YourDatasets.__datasets, DataFrame)
end

@testset "Test if ALL datasets can be successfully loaded." begin
    for lastrow in eachrow(YourDatasets.__datasets)
        pkgnm = lastrow.PackageName
        datnm = lastrow.Dataset
        df = YourDatasets.dataset(pkgnm, datnm)
        @info "$pkgnm/$datnm goes through `PrepareTableDefault` without error."
        # The recorded shape in the table must match the loaded DataFrame.
        @test lastrow.Columns == ncol(df)
        @test lastrow.Rows == nrow(df)
    end
    @test true
end
See also
How dataset and datasets work
See also
SmallDatasetMaker.datasets — Function
datasets(mod::Module)
reads the table from dataset_table(mod), and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).
If there is no using SmallDatasetMaker in the module $mod ... end, it will fail since it is executed in the scope of mod.
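The failure mode described above is a general property of evaluating code inside another module: names must be resolvable in that module's scope. The sketch below shows this with Core.eval and the standard library Dates; the module names are hypothetical, and whether SmallDatasetMaker uses Core.eval internally is an assumption made only for illustration.

```julia
using Dates

module WithImport
    using Dates          # `today` is visible inside this module
end

module WithoutImport     # no `using Dates` here
end

expr = :(today())

# Evaluating in a module that imported the name works ...
ok = Core.eval(WithImport, expr) isa Date

# ... while the same expression fails where the name was never imported.
failed = try
    Core.eval(WithoutImport, expr)
    false
catch err
    err isa UndefVarError
end

println((ok, failed))
```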
SmallDatasetMaker.dataset — Function
dataset(package_name::AbstractString, dataset_name::AbstractString)
returns a DataFrame object unzipped from the last row returned by target_row(mod, package_name, dataset_name). This function mimics the dataset function in RDatasets.jl.
dataset(target_path) decompresses target_path and returns it as a DataFrame.
Notice!
If you intend to load target_path under SmallDatasetMaker or any other package rather than the current working directory, you should apply abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), i.e., target_path = SmallDatasetMaker.abspath(...).
How package_name and dataset_name are automatically determined:
SD = SourceData("data/raw/Hello/world.csv");
(SD.package_name, SD.dataset_name)
("Hello", "world")
See also
SmallDatasetMaker.get_package_dataset_name — Function
Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and the dataset name from srcpath.
Example
srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)
# output
("RDatasets", "iris")
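The same pair can be derived with path functions from Base alone. The function below is a hypothetical sketch that reproduces the documented rule (parent folder as package name, file name without extension as dataset name); it is not the source of get_package_dataset_name.

```julia
# Hypothetical re-implementation for illustration only.
function sketch_package_dataset_name(srcpath::AbstractString)
    # File name without its extension, e.g. "iris.csv" -> "iris".
    dataset_name = first(splitext(basename(srcpath)))
    # Name of the folder that directly contains the file.
    package_name = basename(dirname(srcpath))
    return (package_name, dataset_name)
end

names = sketch_package_dataset_name(joinpath("Whatever", "RDatasets", "iris.csv"))
println(names)
```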
Where is the raw data
SmallDatasetMaker.dir_raw — Function
Path to the directory "data/raw/" of module mod; the default directory for the raw data.
Difference between the usage of YourDatasets and RDatasets
Here are the main differences between the usage of YourDatasets (created by SmallDatasetMaker) and RDatasets:
- For RDatasets, RDatasets.__datasets is a global variable, whereas YourDatasets.__datasets is a const variable.
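What this distinction means at the language level can be checked with isconst. The two modules below are hypothetical and hold plain vectors instead of DataFrames; they only illustrate the difference between a const binding and a non-const global binding.

```julia
# Hypothetical module with a `const` binding, as in YourDatasets.__datasets.
module ConstHolder
    const table = [1, 2, 3]
end

# Hypothetical module with a plain (non-const) global binding.
module GlobalHolder
    global table = [1, 2, 3]
end

is_const_binding = isconst(ConstHolder, :table)
is_global_const = isconst(GlobalHolder, :table)
println((is_const_binding, is_global_const))
```

A const binding fixes which object the name refers to, which lets the compiler rely on its type; the contents of a mutable object bound that way can still be modified in place.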