SmallDatasetMaker
Documentation for SmallDatasetMaker.
SmallDatasetMaker.column_field_dictionary
SmallDatasetMaker.field_column_dictionary
SmallDatasetMaker.ordered_columns
DataFrames.DataFrame
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.SourceData
SmallDatasetMaker.DATASET_ABS_DIR
SmallDatasetMaker.abspath
SmallDatasetMaker.abspath!
SmallDatasetMaker.cleantable
SmallDatasetMaker.compress_save
SmallDatasetMaker.compress_save
SmallDatasetMaker.compress_save!
SmallDatasetMaker.compress_save!
SmallDatasetMaker.create_empty_table
SmallDatasetMaker.dataset
SmallDatasetMaker.dataset
SmallDatasetMaker.dataset
SmallDatasetMaker.dataset_dir
SmallDatasetMaker.dataset_table
SmallDatasetMaker.datasets
SmallDatasetMaker.datasets
SmallDatasetMaker.difftables
SmallDatasetMaker.difftables
SmallDatasetMaker.dir_data
SmallDatasetMaker.dir_data
SmallDatasetMaker.dir_raw
SmallDatasetMaker.dir_raw
SmallDatasetMaker.get_package_dataset_name
SmallDatasetMaker.get_package_dataset_name
SmallDatasetMaker.load_original
SmallDatasetMaker.relpath!
SmallDatasetMaker.return_compressed
SmallDatasetMaker.return_compressed
SmallDatasetMaker.target_row
SmallDatasetMaker.tryparse_summary
SmallDatasetMaker.tryparse_summary
SmallDatasetMaker.tryparse_summary
SmallDatasetMaker.unzip_file
SmallDatasetMaker.unzip_file
SmallDatasetMaker.column_field_dictionary
— Constant
column_field_dictionary follows the order of the fields of SourceData.
SmallDatasetMaker.field_column_dictionary
— Constant
field_column_dictionary follows the order of the fields of SourceData.
SmallDatasetMaker.ordered_columns
— Constant
The order of the columns for dataset_table().
DataFrames.DataFrame
— Method
Constructs a DataFrame following the order of ordered_columns.
SmallDatasetMaker.SourceData
— Method
If zipfile is not specified, it will be dir_data(package_name, dataset_name*".gz").
SmallDatasetMaker.SourceData
— Method
If rows and columns are not specified, CSV.read(srcfile, DataFrame) will be applied to get the number of rows/columns.
SmallDatasetMaker.SourceData
— Method
If description is not specified, it will be "".
SmallDatasetMaker.SourceData
— Method
SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)
srcfile is the path to the source file, package_name is the folder in which the file resides, and dataset_name is the name of the data file without its extension.
If timestamps is not specified, it will be today().
SmallDatasetMaker.SourceData
— Method
If title is not specified, it will be "Data [$dataset_name] of [$package_name]".
SmallDatasetMaker.SourceData
— Method
If package_name and dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.
Example
using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)
SmallDatasetMaker.SourceData
— Method
SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., the table at dataset_table(mod)), with abspath! applied.
This is for loading data according to dataset_table; thus, paths are resolved relative to mod rather than to the current directory.
SmallDatasetMaker.DATASET_ABS_DIR
— Method
DATASET_ABS_DIR(mod::Module) returns the absolute directory of the package mod.
SmallDatasetMaker.abspath!
— Method
abspath!(SD::SourceData, mod::Module) makes all paths in SD absolute, with DATASET_ABS_DIR as the starting directory.
SmallDatasetMaker.abspath
— Method
abspath(mod::Module, args...) = joinpath(DATASET_ABS_DIR(mod)[], args...)
returns an absolute path within the module mod.
WARNING: DO NOT EXPORT THIS FUNCTION. It has the same name as abspath in FilePathsBase and Base.Filesystem.
SmallDatasetMaker.cleantable
— Method
cleantable(mod::Module) removes redundant entries from dataset_table(mod) and overwrites data/doc/datasets.csv. Use with caution, and verify the result before you commit and push.
Example
using SmallDatasetMaker, YourDatasets
SmallDatasetMaker.cleantable(YourDatasets)
SmallDatasetMaker.compress_save!
— Method
compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the zipped file to SD.zipfile, and updates dataset_table(mod).
Options:
- By default, move_source = true, so the source file is moved to dir_raw().
- Set targeting_mod = true to make sure the compressed data is saved relative to the repo of mod. The default is false, which means you can compress and save the data to whatever directory you like.
compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied, so the paths in SD as well as in dataset_table(mod) are modified to be relative.
Example
using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)
This does the following:
- Creates zipped files under data/ of the package YourDatasets in development.
- Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
- Adds a new line to SmallDatasetMaker.dataset_table(YourDatasets) (i.e., updates data/doc/datasets.csv of YourDatasets).
See also SourceData, compress_save.
SmallDatasetMaker.compress_save
— Method
compress_save(mod::Module, srcpath; args...) is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).
compress_save takes the same keyword arguments as compress_save!, which returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[].
Example
using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)
SmallDatasetMaker.create_empty_table
— Method
Initiates the reference table at dataset_table(args...). It takes exactly the same arguments as dataset_table.
SmallDatasetMaker.dataset
— Method
dataset(target_path) decompresses target_path and returns it as a DataFrame.
Notice!
If you intend to load target_path from SmallDatasetMaker or any other package rather than from the directory you are currently working in, build the path with abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), so that target_path = SmallDatasetMaker.abspath(...).
SmallDatasetMaker.dataset
— Method
dataset(package_name::AbstractString, dataset_name::AbstractString) returns a DataFrame object unzipped from the last row returned by target_row(mod, package_name, dataset_name). This function mimics the dataset function in RDatasets.jl.
SmallDatasetMaker.dataset_dir
— Method
dataset_dir(mod::Module, args::String...) returns the absolute dataset path referencing mod.
SmallDatasetMaker.dataset_table
— Method
dataset_table(mod::Module) = joinpath(DATASET_ABS_DIR(mod)[],"data", "doc", "datasets.csv")
dataset_table is a function rather than a constant so that it can be redefined in the scope of tests. See test/compdecomp.jl.
SmallDatasetMaker.datasets
— Method
datasets(mod::Module) reads the table from dataset_table(mod) and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).
If there is no using SmallDatasetMaker inside module $mod ... end, it will fail, since it is executed in the scope of mod.
SmallDatasetMaker.difftables
— Method
Given a series of DataFrames, difftables(df0::AbstractDataFrame, dfs::AbstractDataFrame...; ignoring = Cols()) returns report::DataFrame with columns
- :nrow: number of rows of each DataFrame.
- :ncol: number of columns of each DataFrame.
- :cols_lack: columns that are missing compared to df0.
- :cols_add: extra columns compared to df0.
This function is useful for updating an existing dataset (where the new data might have non-identical column names).
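The column comparison described above can be illustrated without DataFrames. The following is a minimal sketch that assumes the comparison amounts to set differences on column names; the toy tables df0 and df1 (plain NamedTuples) and their contents are made up for illustration, and this is not the package's implementation.

```julia
# Toy "tables" as NamedTuples of column vectors (hypothetical data).
df0 = (name = ["a", "b"], age = [1, 2], salary = [1.0, 2.0])
df1 = (name = ["c"], age = [3], city = ["X"])

cols0 = collect(keys(df0))  # column names of the reference table
cols1 = collect(keys(df1))  # column names of the table being compared

cols_lack = setdiff(cols0, cols1)  # columns in df0 that df1 is missing
cols_add  = setdiff(cols1, cols0)  # columns in df1 that df0 does not have
```

Here cols_lack holds :salary and cols_add holds :city, which is the kind of information the :cols_lack and :cols_add columns are described as reporting.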
SmallDatasetMaker.dir_data
— Method
Relative path to the directory of data; this is called by SourceData.
SmallDatasetMaker.dir_data
— Method
Absolute path to the directory of data.
SmallDatasetMaker.dir_raw
— Method
Path to the directory "data/raw/" of module mod; the default directory for the raw data.
SmallDatasetMaker.get_package_dataset_name
— Method
Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and the dataset name from srcpath.
Example
srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)
# output
("RDatasets", "iris")
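For intuition, the same result can be derived with Base path functions alone. This is a minimal sketch that assumes the package name is the containing folder and the dataset name is the file name without its extension, as stated for SourceData; package_dataset_name_sketch is a hypothetical helper, not the package's implementation.

```julia
# Hypothetical helper: derive (package_name, dataset_name) from a file path.
function package_dataset_name_sketch(srcpath::AbstractString)
    dataset_name = first(splitext(basename(srcpath)))  # file name without extension
    package_name = basename(dirname(srcpath))          # name of the containing folder
    return (package_name, dataset_name)
end

srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
names = package_dataset_name_sketch(srcpath)
```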
SmallDatasetMaker.load_original
— Method
load_original(path::AbstractString) opens path and returns the data read from it.
SmallDatasetMaker.relpath!
— Method
relpath!(SD::SourceData, mod::Module) makes all paths in SD relative to DATASET_ABS_DIR.
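The relationship between the relative and absolute forms of a path can be sketched with Base functions. This example assumes a hypothetical root directory standing in for DATASET_ABS_DIR(mod)[] and a made-up file name; it only illustrates the round trip and does not touch SourceData.

```julia
# Hypothetical root directory, standing in for DATASET_ABS_DIR(mod)[].
root = joinpath(pwd(), "YourDatasets")

# An absolute path to a compressed file under that root (made-up name).
abs_zip = joinpath(root, "data", "Category_A", "Dataset_B.gz")

rel_zip = relpath(abs_zip, root)   # path relative to the root
restored = joinpath(root, rel_zip) # joining with the root gives the absolute path back
```

Storing the relative form keeps the table portable across machines, while the absolute form is what is needed to open the file.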
SmallDatasetMaker.return_compressed
— Method
return_compressed(path::AbstractString) returns the compressed data.
Example
compressed = return_compressed("data/data.csv")
SmallDatasetMaker.return_compressed
— Method
return_compressed(data::Vector{UInt8}) returns the compressed data.
Example
data = load_original("data/data.csv")
compressed = return_compressed(data)
SmallDatasetMaker.target_row
— Method
target_row returns the latest information in datasets(mod::Module). Given package_name and dataset_name, target_row(mod, package_name, dataset_name) returns the last row that matches row.PackageName == package_name && row.Dataset == dataset_name.
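The "last matching row" rule can be sketched on a plain vector of rows. The column names PackageName and Dataset follow the condition above; the RowID column, the data, and the helper last_matching_row are hypothetical and only illustrate the selection logic.

```julia
# Toy stand-in for the dataset table (hypothetical rows).
rows = [
    (PackageName = "RDatasets", Dataset = "iris",   RowID = 1),
    (PackageName = "RDatasets", Dataset = "mtcars", RowID = 2),
    (PackageName = "RDatasets", Dataset = "iris",   RowID = 3),
]

# Hypothetical helper: return the last row matching both names.
function last_matching_row(rows, package_name, dataset_name)
    i = findlast(r -> r.PackageName == package_name && r.Dataset == dataset_name, rows)
    i === nothing && error("no entry for ($package_name, $dataset_name)")
    return rows[i]
end

latest = last_matching_row(rows, "RDatasets", "iris")
```

Because the last match wins, a dataset that was compressed again and appended to the table is picked up in preference to the older entry.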
SmallDatasetMaker.tryparse_summary
— Method
tryparse_summary(v::AbstractVector, typetoparse::Type{<:Any})
Example
julia> tryparse_summary(["1", "2", "3.3", 10, "NaN"], Float64) .|> typeof
5-element Vector{DataType}:
SmallDatasetMaker.NotException
SmallDatasetMaker.NotException
SmallDatasetMaker.NotException
MethodError
SmallDatasetMaker.NotException
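The output above pairs each element with either an exception or a "no exception" marker. A minimal Base-only sketch of that idea follows, assuming each element is parsed inside try/catch; NoException and parse_outcome are hypothetical names, and the package's actual implementation may differ.

```julia
# Hypothetical marker type for "parsing succeeded".
struct NoException end

# Hypothetical helper: try to parse x as T and report what happened.
function parse_outcome(x, T)
    try
        parse(T, x)
        return NoException()
    catch err
        return err  # e.g. ArgumentError for bad text, MethodError for non-string input
    end
end

outcomes = [parse_outcome(x, Float64) for x in ["1", "2", "3.3", 10, "NaN"]]
outcome_types = typeof.(outcomes)
```

In this sketch the integer 10 yields a MethodError because parse(Float64, ...) is attempted on a non-string value, which matches the pattern shown in the example output.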
SmallDatasetMaker.tryparse_summary
— Method
tryparse_summary(df::AbstractDataFrame, typetoparse) returns a "long" DataFrame with columns :variable_name, :exception_type and :exception_msg.
Example
using DataFrames
df = DataFrame(
    :name => ["John", "Roe", "Mary", "Hello", "World"],
    :salary => [5.372, "1.1", "1", "NaN", "#value"],
    :age => string.([20, 13, 17, 22, 100])
)
summary = tryparse_summary(df, Float64)
combine(groupby(summary, [:variable_name, :exception_type, :exception_msg]), nrow)
SmallDatasetMaker.unzip_file
— Method
unzip_file(target_path) unzips the file at target_path to the current directory, preserving its original name.
Notice!
If you intend to load target_path from SmallDatasetMaker or any other package rather than from the directory you are currently working in, build the path with abspath(args::String...) or abspath(ACertainImportedPackage, args::String...), so that target_path = SmallDatasetMaker.abspath(...).
SmallDatasetMaker.unzip_file
— Method
The same as dataset, but also saves the unzipped file.