SmallDatasetMaker

Documentation for SmallDatasetMaker.

SmallDatasetMaker.SourceData - Method

SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)

srcfile is the path to the source file; package_name is the name of the folder in which the file resides, and dataset_name is the file name without its extension.

If timestamps is not specified, it defaults to today().

SmallDatasetMaker.SourceData - Method

If package_name and dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.

Example

using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)

SmallDatasetMaker.SourceData - Method

SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., dataset_table(mod)), with abspath! applied.

This method is for loading data according to dataset_table; paths are therefore resolved relative to mod rather than to the current directory.

SmallDatasetMaker.abspath! - Method

abspath!(SD::SourceData, mod::Module) makes all paths in SD absolute, using DATASET_ABS_DIR as the starting directory.

SmallDatasetMaker.abspath - Method

abspath(mod::Module, args...) = joinpath(DATASET_ABS_DIR(mod)[], args...) returns an absolute path rooted at the directory of the module mod.

WARNING: DO NOT EXPORT THIS FUNCTION

This function has the same name as abspath in FilePathsBase and Base.Filesystem.
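
A minimal sketch of this pattern, using only Base. BASE_DIR and module_path are hypothetical stand-ins introduced for illustration; they are not the package's code. The sketch also shows why the function should stay unexported: the unqualified name abspath keeps referring to the Base version.

```julia
# Hypothetical stand-in for DATASET_ABS_DIR(mod): a Ref holding a root directory.
const BASE_DIR = Ref(pwd())

# Same shape as the definition above: join the root with the given components.
module_path(args...) = joinpath(BASE_DIR[], args...)

p = module_path("data", "doc", "datasets.csv")

# The unqualified name still refers to Base.Filesystem.abspath,
# which resolves against the current working directory.
q = abspath(joinpath("data", "doc", "datasets.csv"))
```

In package code the call would be written qualified, as SmallDatasetMaker.abspath(...), so that the two functions are never confused.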

SmallDatasetMaker.cleantable - Method

cleantable(mod::Module) removes redundant entries from dataset_table(mod) and overwrites data/doc/datasets.csv. Use with caution, and verify the result before committing and pushing.

Example

using SmallDatasetMaker, YourDatasets
SmallDatasetMaker.cleantable(YourDatasets)

SmallDatasetMaker.compress_save! - Method

compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the compressed file to SD.zipfile, and updates dataset_table(mod).

Options:

  • By default, move_source = true, so the source file is moved to dir_raw().
  • Set targeting_mod = true to make sure the compressed data is saved relative to the repository of mod. The default is false, which lets you compress and save the data to any directory you like.

compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied, so the paths in SD as well as those in dataset_table(mod) are made relative.

Example

using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)

This does the following:

  1. Create zipped files under data/ of package YourDatasets in development.
  2. Move the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
  3. Add a new line to SmallDatasetMaker.dataset_table(YourDatasets) (update data/doc/datasets.csv of YourDatasets).
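
The three steps can be imitated in a temporary directory, as in the rough sketch below. Everything in it is an assumption made for illustration: gzip via CodecZlib stands in for whatever codec the package uses, and the file layout and the comma-separated table line are hypothetical. It is not how compress_save! is implemented.

```julia
using CodecZlib  # assumption: gzip stands in for the package's actual codec

root = mktempdir()
src = joinpath(root, "Dataset_B.csv")
write(src, "x,y\n1,2\n")

# 1. Write a compressed copy under data/ (hypothetical layout).
zipped = joinpath(root, "data", "Category_A", "Dataset_B.gz")
mkpath(dirname(zipped))
write(zipped, transcode(GzipCompressor, read(src)))

# 2. Move the raw file into a raw-data directory.
rawdir = joinpath(root, "data", "raw", "Category_A")
mkpath(rawdir)
moved = mv(src, joinpath(rawdir, basename(src)))

# 3. Append one line to a table of datasets, storing a relative path.
table = joinpath(root, "data", "doc", "datasets.csv")
mkpath(dirname(table))
open(table, "a") do io
    println(io, join(("Category_A", "Dataset_B", relpath(zipped, root)), ","))
end

# Round trip: decompressing the saved file restores the original text.
restored = String(transcode(GzipDecompressor, read(zipped)))
```

Storing the path relative to the root, as in step 3, is what keeps such a table valid when the repository is cloned elsewhere.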

See also SourceData, compress_save.

SmallDatasetMaker.compress_save - Method

compress_save(mod::Module, srcpath; args...) is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).

compress_save takes the same keyword arguments as compress_save!, which returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[].

Example

using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)

SmallDatasetMaker.dataset - Method

dataset(target_path) decompresses target_path and returns it as a DataFrame.

Notice!

If you intend to load target_path from SmallDatasetMaker or any other package rather than from the current working directory, apply abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), so that target_path = SmallDatasetMaker.abspath(...).

SmallDatasetMaker.dataset - Method

dataset(package_name::AbstractString, dataset_name::AbstractString) returns a DataFrame decompressed according to the last matching row, as returned by target_row(mod, package_name, dataset_name). This function mimics the dataset function in RDatasets.jl.

SmallDatasetMaker.dataset_table - Method

dataset_table(mod::Module) = joinpath(DATASET_ABS_DIR(mod)[],"data", "doc", "datasets.csv")

dataset_table is a function rather than a constant so that it can be redefined in the scope of the tests. See test/compdecomp.jl.

SmallDatasetMaker.datasets - Method

datasets(mod::Module) reads the table from dataset_table(mod) and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).

If module mod ... end does not contain using SmallDatasetMaker, this call fails, because it is executed in the scope of mod.

SmallDatasetMaker.difftables - Method

Given a series of DataFrames, difftables(df0::AbstractDataFrame, dfs::AbstractDataFrame...; ignoring = Cols()) returns report::DataFrame with the following columns:

  • :nrow: number of rows of each DataFrame.
  • :ncol: number of columns of each DataFrame.
  • :cols_lack: columns that are missing compared to df0.
  • :cols_add: extra columns compared to df0.

This function is useful when updating an existing dataset, where the new data might have different column names.
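
The kind of comparison reported in these columns can be reproduced by hand with setdiff on the column names. This is only an illustrative sketch with made-up data frames, not the implementation of difftables.

```julia
using DataFrames

df0 = DataFrame(id = 1:3, name = ["a", "b", "c"])   # existing dataset
df1 = DataFrame(id = 1:2, label = ["x", "y"])       # new data with a renamed column

cols_lack = setdiff(names(df0), names(df1))  # in df0 but absent from df1
cols_add  = setdiff(names(df1), names(df0))  # in df1 but absent from df0
shape = (nrow(df1), ncol(df1))               # counterpart of :nrow and :ncol
```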

SmallDatasetMaker.get_package_dataset_name - Method

Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and the dataset name from srcpath.

Example

srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)

# output

("RDatasets", "iris")
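
For intuition, the same result can be obtained with Base path functions alone. derive_names below is a hypothetical helper written for this illustration; the package's own implementation may differ, and the sketch assumes the path has at least two components.

```julia
# Hypothetical helper, not the package's implementation.
function derive_names(srcpath::AbstractString)
    parts = splitpath(srcpath)                  # ["Whatever", "RDatasets", "iris.csv"]
    package_name = parts[end - 1]               # folder that contains the file
    dataset_name = first(splitext(parts[end]))  # file name without its extension
    return (package_name, dataset_name)
end

names_found = derive_names(joinpath("Whatever", "RDatasets", "iris.csv"))
```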

SmallDatasetMaker.target_row - Method

target_row returns the latest information in datasets(mod::Module). Given package_name and dataset_name, target_row(mod, package_name, dataset_name) returns the last row that satisfies row.PackageName == package_name && row.Dataset == dataset_name.
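
The "last matching row" rule can be sketched with DataFrames directly. The table below is hypothetical: PackageName and Dataset are the column names used in the condition above, while Version is invented here only to tell the rows apart.

```julia
using DataFrames

# Hypothetical table; later rows represent more recent entries.
tbl = DataFrame(
    PackageName = ["RDatasets", "RDatasets", "Other"],
    Dataset     = ["iris", "iris", "iris"],
    Version     = [1, 2, 1],
)

# Keep the rows that satisfy the condition, then take the last one.
matches = filter(row -> row.PackageName == "RDatasets" && row.Dataset == "iris", tbl)
latest = last(eachrow(matches))
```

Taking the last match rather than the first means that appending a new row for an existing dataset is enough to supersede the older entry.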

SmallDatasetMaker.tryparse_summary - Method

tryparse_summary(v::AbstractVector, typetoparse::Type{<:Any})

Example

julia> tryparse_summary(["1", "2", "3.3", 10, "NaN"], Float64) .|> typeof
5-element Vector{DataType}:
 SmallDatasetMaker.NotException
 SmallDatasetMaker.NotException
 SmallDatasetMaker.NotException
 MethodError
 SmallDatasetMaker.NotException
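
A comparable per-element report can be built with try/catch around Base's parse. This is a sketch under assumptions: parse_outcomes is a hypothetical name, nothing plays the role that NotException plays above, and the package may not use parse internally.

```julia
# Hypothetical helper: returns `nothing` where parsing succeeds,
# and the caught exception where it fails.
function parse_outcomes(v::AbstractVector, ::Type{T}) where {T}
    return map(v) do x
        try
            parse(T, x)
            nothing
        catch err
            err
        end
    end
end

outcomes = parse_outcomes(["1", "2", "3.3", 10, "NaN"], Float64)
kinds = typeof.(outcomes)
```

As in the example above, only the integer 10 fails, because parse has no method that takes a non-string input here.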

SmallDatasetMaker.tryparse_summary - Method

tryparse_summary(df::AbstractDataFrame, typetoparse) returns a "long" dataframe with columns :variable_name, :exception_type and :exception_msg.

Example

using DataFrames
df = DataFrame(
    :name => ["John", "Roe", "Mary", "Hello", "World"],
    :salary => [5.372, "1.1", "1", "NaN", "#value"],
    :age => string.([20, 13, 17, 22, 100])
)
summary = tryparse_summary(df, Float64)
combine(groupby(summary, [:variable_name, :exception_type, :exception_msg]), nrow)

SmallDatasetMaker.unzip_file - Method

unzip_file(target_path) unzips the file at target_path into the current directory, preserving its original name.

Notice!

If you intend to load target_path from SmallDatasetMaker or any other package rather than from the current working directory, apply abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), so that target_path = SmallDatasetMaker.abspath(...).
