SmallDatasetMaker
Documentation for SmallDatasetMaker.
- SmallDatasetMaker.column_field_dictionary
- SmallDatasetMaker.field_column_dictionary
- SmallDatasetMaker.ordered_columns
- DataFrames.DataFrame
- SmallDatasetMaker.SourceData
- SmallDatasetMaker.DATASET_ABS_DIR
- SmallDatasetMaker.abspath
- SmallDatasetMaker.abspath!
- SmallDatasetMaker.cleantable
- SmallDatasetMaker.compress_save
- SmallDatasetMaker.compress_save!
- SmallDatasetMaker.create_empty_table
- SmallDatasetMaker.dataset
- SmallDatasetMaker.dataset_dir
- SmallDatasetMaker.dataset_table
- SmallDatasetMaker.datasets
- SmallDatasetMaker.difftables
- SmallDatasetMaker.dir_data
- SmallDatasetMaker.dir_raw
- SmallDatasetMaker.get_package_dataset_name
- SmallDatasetMaker.load_original
- SmallDatasetMaker.relpath!
- SmallDatasetMaker.return_compressed
- SmallDatasetMaker.target_row
- SmallDatasetMaker.tryparse_summary
- SmallDatasetMaker.unzip_file
SmallDatasetMaker.column_field_dictionary — Constant
column_field_dictionary follows the order of the fields of SourceData.
SmallDatasetMaker.field_column_dictionary — Constant
field_column_dictionary follows the order of the fields of SourceData.
SmallDatasetMaker.ordered_columns — Constant
The column order for dataset_table().
DataFrames.DataFrame — Method
Construct a DataFrame following the order of ordered_columns.
SmallDatasetMaker.SourceData — Method
If zipfile is not specified, it defaults to dir_data(package_name, dataset_name*".gz").
SmallDatasetMaker.SourceData — Method
If rows and columns are not specified, CSV.read(srcfile, DataFrame) is applied to get the number of rows and columns.
SmallDatasetMaker.SourceData — Method
If description is not specified, it defaults to "".
SmallDatasetMaker.SourceData — Method
SourceData(srcfile, package_name, dataset_name, title, zipfile, rows, columns, description, timestamps)
srcfile is the path to the source file, package_name is the name of the folder in which the file resides, and dataset_name is the file name without its extension.
If timestamps is not specified, it defaults to today().
SmallDatasetMaker.SourceData — Method
If title is not specified, it defaults to "Data [$dataset_name] of [$package_name]".
SmallDatasetMaker.SourceData — Method
If package_name and dataset_name are not specified, (package_name, dataset_name) = get_package_dataset_name(srcfile) is applied.
Example
using SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
SD = SourceData(srcfile)
SmallDatasetMaker.SourceData — Method
SourceData(mod::Module, row::DataFrameRow) creates a SourceData object from a row of a DataFrame (i.e., the table at dataset_table(mod)), with abspath! applied.
This is for loading data according to dataset_table; thus, paths are resolved relative to mod instead of the current directory.
SmallDatasetMaker.DATASET_ABS_DIR — Method
DATASET_ABS_DIR(mod::Module) returns the absolute directory for package mod.
SmallDatasetMaker.abspath! — Method
abspath!(SD::SourceData, mod::Module) makes all paths in SD absolute, starting from the directory DATASET_ABS_DIR.
SmallDatasetMaker.abspath — Method
abspath(mod::Module, args...) = joinpath(DATASET_ABS_DIR(mod)[], args...) returns an absolute path under the module mod.
WARNING: DO NOT EXPORT THIS FUNCTION
This function has the same name as abspath in FilePathsBase and Base.Filesystem.
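The definition above joins a per-module root directory with further path components. A minimal sketch of that pattern follows, assuming the root is held in a Ref (as the `[]` dereference in the definition suggests). The names ROOT_DIR and abs_under_root are hypothetical and are not part of SmallDatasetMaker.

```julia
# Hypothetical stand-in for DATASET_ABS_DIR(mod): a Ref holding an absolute root directory.
ROOT_DIR = Ref(pwd())

# Deliberately not called `abspath`, to avoid clashing with Base.Filesystem.abspath.
abs_under_root(args::AbstractString...) = joinpath(ROOT_DIR[], args...)

p = abs_under_root("data", "doc", "datasets.csv")
println(p)
```

Because the root is absolute, every path built this way is absolute regardless of the directory the caller is working in.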
SmallDatasetMaker.cleantable — Method
cleantable(mod::Module) removes redundant entries of dataset_table(mod) and overwrites data/doc/datasets.csv. Use with caution and verify before commit & push.
Example
using SmallDatasetMaker, YourDatasets
SmallDatasetMaker.cleantable(YourDatasets)
SmallDatasetMaker.compress_save! — Method
compress_save!(mod::Module, SD::SourceData; move_source = true, targeting_mod = false) compresses SD.srcfile, saves the zipped file to SD.zipfile, and updates dataset_table(mod).
Options:
- By default, move_source = true, so the source file is moved to dir_raw().
- Set targeting_mod = true to make sure the compressed data is saved relative to the repo of mod. The default is false, which means you can compress and save the data to whatever directory you like.
compress_save! returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[]; relpath! is applied so that the paths in SD, as well as those in dataset_table(mod), are relative.
Example
using YourDatasets, SmallDatasetMaker
compress_save!(YourDatasets, SD; targeting_mod = true)
This does the following:
- Creates zipped files under data/ of the package YourDatasets in development.
- Moves the source file SD.srcfile (i.e., the raw .csv data) to dir_raw(YourDatasets, ...) by default.
- Adds a new line to SmallDatasetMaker.dataset_table(YourDatasets) (updates data/doc/datasets.csv of YourDatasets).
See also SourceData, compress_save.
SmallDatasetMaker.compress_save — Method
compress_save(mod::Module, srcpath; args...) is equivalent to compress_save!(mod, SourceData(srcpath)) but returns SD = SourceData(srcpath).
compress_save takes the same keyword arguments as compress_save!, which returns SD::SourceData with paths relative to DATASET_ABS_DIR(mod)[].
Example
using YourDatasets, SmallDatasetMaker
srcfile = "data/raw/Category_A/Dataset_B.csv" # path to the .csv to be compressed.
compress_save(YourDatasets, srcfile; targeting_mod = true)
SmallDatasetMaker.create_empty_table — Method
Initiates the reference table at dataset_table(args...). It takes exactly the same arguments as dataset_table.
SmallDatasetMaker.dataset — Method
dataset(target_path) decompresses target_path and returns it as a DataFrame.
Notice!
If you intend to load target_path from SmallDatasetMaker or any other package rather than from the directory you are currently working in, apply abspath(args::String...) or abspath(ACertainImportedPackage, args::String...) so that target_path = SmallDatasetMaker.abspath(...).
SmallDatasetMaker.dataset — Method
dataset(package_name::AbstractString, dataset_name::AbstractString) returns a DataFrame unzipped from the row returned by target_row(mod, package_name, dataset_name), which is the last matching row. This function mimics the dataset function in RDatasets.jl.
SmallDatasetMaker.dataset_dir — Method
dataset_dir(mod::Module, args::String...) returns the absolute dataset path referencing mod.
SmallDatasetMaker.dataset_table — Method
dataset_table(mod::Module) = joinpath(DATASET_ABS_DIR(mod)[],"data", "doc", "datasets.csv")
dataset_table is a function rather than a constant so that it can be redefined in the scope of tests. See test/compdecomp.jl.
SmallDatasetMaker.datasets — Method
datasets(mod::Module) reads the table from dataset_table(mod) and sets __datasets::DataFrame as a const variable in the scope of mod (i.e., mod.__datasets shows the list of packages and datasets).
If there is no using SmallDatasetMaker inside module $mod ... end, it will fail, since it is executed in the scope of mod.
SmallDatasetMaker.difftables — Method
Given a series of DataFrames, difftables(df0::AbstractDataFrame, dfs::AbstractDataFrame...; ignoring = Cols()) returns report::DataFrame with columns
- :nrow: number of rows of each DataFrame.
- :ncol: number of columns of each DataFrame.
- :cols_lack: columns that are missing compared to df0.
- :cols_add: extra columns compared to df0.
This function is useful for updating an existing dataset (where the new data might have non-identical column names).
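The two column-comparison fields can be illustrated with plain set differences on column names. This is only a sketch of the idea using Base Julia, not the implementation of difftables. The function name column_diff and the sample column names are hypothetical.

```julia
# Compare the column names of a new table against those of a reference table.
# Returns the columns the new table lacks and the columns it adds.
function column_diff(ref_cols::AbstractVector, new_cols::AbstractVector)
    return (
        cols_lack = setdiff(ref_cols, new_cols),  # in the reference, missing from the new table
        cols_add = setdiff(new_cols, ref_cols),   # in the new table, absent from the reference
    )
end

report = column_diff([:id, :name, :value], [:id, :value, :unit])
println(report)
```

With DataFrames.jl, the inputs could come from names(df) for each table.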
SmallDatasetMaker.dir_data — Method
Relative path to the directory of data; this is called by SourceData.
SmallDatasetMaker.dir_data — Method
Absolute path to the directory of data.
SmallDatasetMaker.dir_raw — Method
Path to the directory "data/raw/" of module mod; the default directory for the raw data.
SmallDatasetMaker.get_package_dataset_name — Method
Given the path to the source file, get_package_dataset_name(srcpath) derives the package name and the dataset name from srcpath.
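One way such a derivation could work is sketched below with Base path functions, assuming (as stated for SourceData) that the package name is the containing folder and the dataset name is the file name without its extension. The function name package_and_dataset is hypothetical; the actual implementation may differ.

```julia
# Derive (package_name, dataset_name) from a path of the form .../<package>/<dataset>.<ext>
function package_and_dataset(srcpath::AbstractString)
    dataset_name = first(splitext(basename(srcpath)))  # file name without extension
    package_name = basename(dirname(srcpath))          # name of the containing folder
    return (package_name, dataset_name)
end

result = package_and_dataset(joinpath("Whatever", "RDatasets", "iris.csv"))
println(result)
```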
Example
srcpath = joinpath("Whatever", "RDatasets", "iris.csv")
SmallDatasetMaker.get_package_dataset_name(srcpath)
# output
("RDatasets", "iris")
SmallDatasetMaker.load_original — Method
load_original(path::AbstractString) opens path and returns the data read from it.
SmallDatasetMaker.relpath! — Method
relpath!(SD::SourceData, mod::Module) makes all paths in SD relative to DATASET_ABS_DIR.
SmallDatasetMaker.return_compressed — Method
return_compressed(path::AbstractString) returns the compressed data.
Example
compressed = return_compressed("data/data.csv")
SmallDatasetMaker.return_compressed — Method
return_compressed(data::Vector{UInt8}) returns the compressed data.
Example
data = load_original("data/data.csv")
compressed = return_compressed(data)
SmallDatasetMaker.target_row — Method
target_row returns the latest information in datasets(mod::Module). Given package_name and dataset_name, target_row(mod, package_name, dataset_name) returns the last row that satisfies row.PackageName == package_name && row.Dataset == dataset_name.
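The "last matching row" rule can be sketched with findlast over a small in-memory table. This is an illustration under assumptions, not the package's code: the table is a vector of named tuples rather than a DataFrame, the Note field and the function name last_matching_row are hypothetical, and only PackageName and Dataset come from the description above.

```julia
# A made-up reference table; later rows represent more recent entries.
table = [
    (PackageName = "RDatasets", Dataset = "iris", Note = "first entry"),
    (PackageName = "RDatasets", Dataset = "mtcars", Note = "other dataset"),
    (PackageName = "RDatasets", Dataset = "iris", Note = "latest entry"),
]

# Return the last row whose PackageName and Dataset both match.
function last_matching_row(rows, package_name, dataset_name)
    i = findlast(r -> r.PackageName == package_name && r.Dataset == dataset_name, rows)
    i === nothing && error("no entry for $package_name/$dataset_name")
    return rows[i]
end

row = last_matching_row(table, "RDatasets", "iris")
println(row.Note)
```

Taking the last match means that re-adding a dataset under the same names makes the newest entry the one that is loaded.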
SmallDatasetMaker.tryparse_summary — Method
tryparse_summary(v::AbstractVector, typetoparse::Type{<:Any})
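The example for this method is consistent with attempting to parse each element and recording whatever exception is thrown; that reading is an assumption, and the sketch below only illustrates it with Base Julia. The marker type ParsedOk and the function parse_outcomes are hypothetical stand-ins, not SmallDatasetMaker names.

```julia
# Hypothetical marker for elements that parse without error.
struct ParsedOk end

# Try to parse every element as T; keep the exception object when parsing fails.
function parse_outcomes(v::AbstractVector, T::Type)
    return map(v) do x
        try
            parse(T, x)
            ParsedOk()
        catch err
            err
        end
    end
end

outcomes = parse_outcomes(["1", "2", "3.3", 10, "#value"], Float64)
println(typeof.(outcomes))
```

Here the integer 10 fails because parse expects a string, and "#value" fails because it is not a valid number, so the two failures carry different exception types.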
Example
julia> tryparse_summary(["1", "2", "3.3", 10, "NaN"], Float64) .|> typeof
5-element Vector{DataType}:
SmallDatasetMaker.NotException
SmallDatasetMaker.NotException
SmallDatasetMaker.NotException
MethodError
SmallDatasetMaker.NotException
SmallDatasetMaker.tryparse_summary — Method
tryparse_summary(df::AbstractDataFrame, typetoparse) returns a "long" DataFrame with columns :variable_name, :exception_type and :exception_msg.
Example
using DataFrames
df = DataFrame(
:name => ["John", "Roe", "Mary", "Hello", "World"],
:salary => [5.372, "1.1", "1", "NaN", "#value"],
:age => string.([20, 13, 17, 22, 100])
)
summary = tryparse_summary(df, Float64)
combine(groupby(summary, [:variable_name, :exception_type, :exception_msg]), nrow)
SmallDatasetMaker.unzip_file — Method
unzip_file(target_path) unzips the file at target_path into the current directory, preserving its original name.
Notice!
If you intend to load target_path from SmallDatasetMaker or any other package rather than from the directory you are currently working in, apply abspath(args::String...) or abspath(ACertainImportedPackage, args::String...) so that target_path = SmallDatasetMaker.abspath(...).
SmallDatasetMaker.unzip_file — Method
The same as dataset, but also saves the unzipped file.