Index · BetaML.jl Documentation (2024)

Welcome to the documentation of the Beta Machine Learning toolkit.

About

The BetaML toolkit provides machine learning algorithms written in the Julia programming language.

Besides the algorithms themselves, BetaML provides many "utility" functions. Because the algorithms are all self-contained in the library itself (you are invited to explore their source code by typing @edit functionOfInterest(par1,par2,...)), the utility functions have APIs that are coordinated with the algorithms, facilitating the preparation of the data for the analysis, the choice of the hyper-parameters and the evaluation of the models. Most models have an interface for the MLJ framework.

Besides Julia, BetaML can be accessed from R or Python using JuliaCall and PyJulia, respectively. See the tutorial for details.

Installation

The BetaML package is included in the standard Julia registry; install it with:

] add BetaML

Available modules

Although BetaML is split into several (sub)modules, all of them are re-exported at the root module level. This means that you can access their functionality by simply typing using BetaML:

using BetaML
myLayer = DenseLayer(2,3) # DenseLayer is defined in the Nn submodule
res = KernelPerceptronClassifier() # KernelPerceptronClassifier is defined in the Perceptron module
@edit DenseLayer(2,3) # Open a text editor at the relevant source code

Each module is documented on the links below (you can also use the inline Julia help system: just press the question mark ? and then, on the special help prompt help?>, type the function name):

BetaML.Perceptron: The Perceptron, Kernel Perceptron and Pegasos classification algorithms;
BetaML.Trees: The Decision Trees and Random Forests algorithms for classification or regression (with missing values supported);
BetaML.Nn: Implementation of Artificial Neural Networks;
BetaML.Clustering: (hard) Clustering algorithms (K-Means, K-Medoids);
BetaML.GMM: Various algorithms (clustering, regression, missing-value imputation / collaborative filtering / recommendation systems) that use a generative (Gaussian) mixture model (probabilistic) fitter, fitted using an EM algorithm;
BetaML.Imputation: Imputation algorithms;
BetaML.Utils: Various utility functions (scale, one-hot, distances, kernels, pca, autoencoder, predictions analysis, feature importance...).

Available models

Currently BetaML provides the following models:

`BetaML` name	Hp	`MLJ` Interface	Category*
PerceptronClassifier	☒	PerceptronClassifier	Supervised classifier
KernelPerceptronClassifier	☒	KernelPerceptronClassifier	Supervised classifier
PegasosClassifier	☒	PegasosClassifier	Supervised classifier
DecisionTreeEstimator	☒	DecisionTreeClassifier, DecisionTreeRegressor	Supervised regressor and classifier
RandomForestEstimator	☒	RandomForestClassifier, RandomForestRegressor	Supervised regressor and classifier
NeuralNetworkEstimator	☒	NeuralNetworkRegressor, MultitargetNeuralNetworkRegressor, NeuralNetworkClassifier	Supervised regressor and classifier
GaussianMixtureRegressor	☒	GaussianMixtureRegressor, MultitargetGaussianMixtureRegressor	Supervised regressor
GaussianMixtureRegressor2	☒		Supervised regressor
KMeansClusterer	☒	KMeansClusterer	Unsupervised hard clusterer
KMedoidsClusterer	☒	KMedoidsClusterer	Unsupervised hard clusterer
GaussianMixtureClusterer	☒	GaussianMixtureClusterer	Unsupervised soft clusterer
SimpleImputer	☒	SimpleImputer	Unsupervised missing data imputer
GaussianMixtureImputer	☒	GaussianMixtureImputer	Unsupervised missing data imputer
RandomForestImputer	☒, ☒	RandomForestImputer	Unsupervised missing data imputer
GeneralImputer	☒	GeneralImputer	Unsupervised missing data imputer
MinMaxScaler			Data transformer
StandardScaler			Data transformer
Scaler	☒		Data transformer
PCAEncoder	☒		Unsupervised dimensionality reduction
AutoEncoder	☒	AutoEncoder	Unsupervised non-linear dimensionality reduction
OneHotEncoder	☒		Data transformer
OrdinalEncoder	☒		Data transformer
ConfusionMatrix	☒		Predictions analysis
FeatureRanker	☒		Predictions analysis

* There is no formal distinction in BetaML between a transformer, a model to assess predictions and an unsupervised model. They are all treated as unsupervised models that, given some data, learn how to return some useful information, whether a class grouping, a specific transformation or a quality evaluation.
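To make the "learn from the data, then return useful information" idea concrete, here is a minimal sketch in plain Julia. It does not use BetaML: `ToyStandardiser`, `toy_fit!` and `toy_predict` are hypothetical names invented for this illustration, and the code is not how BetaML implements its scalers.

```julia
using Statistics

# Hypothetical toy transformer (not a BetaML type): it "learns" the column means
# and standard deviations from the data, then uses them to transform any matrix.
mutable struct ToyStandardiser
    μ::Vector{Float64}
    σ::Vector{Float64}
    fitted::Bool
end
ToyStandardiser() = ToyStandardiser(Float64[], Float64[], false)

# Learning step: store the per-column statistics and return the transformed data
function toy_fit!(m::ToyStandardiser, X::AbstractMatrix)
    m.μ = vec(mean(X, dims=1))
    m.σ = vec(std(X, dims=1))
    m.fitted = true
    return toy_predict(m, X)
end

# Application step: reuse the learned statistics on (possibly new) data
function toy_predict(m::ToyStandardiser, X::AbstractMatrix)
    m.fitted || error("the model must be fitted first")
    return (X .- m.μ') ./ m.σ'
end

X  = [1.0 10.0; 2.0 20.0; 3.0 30.0]   # made-up data, one record per row
m  = ToyStandardiser()
Xs = toy_fit!(m, X)                   # each column now has mean 0 and std 1
println(Xs)
```

Nothing here needs a target variable, which is why such a transformer can be treated like any other unsupervised model.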

Usage

New to BetaML or even to Julia / Machine Learning altogether? Start from the tutorial!

All models support the (a) model construction (where hyperparameters and options are chosen), (b) fitting and (c) prediction paradigm. A few models also support inverse_predict, for example to go back from the one-hot encoded columns to the original categorical variable (factor).

This paradigm is described in detail in the API V2 page.
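The three steps, plus the inverse step, can be sketched in plain Julia with a toy one-hot encoder. This is only an illustration of the paradigm: `ToyOneHot`, `toy_fit!`, `toy_predict` and `toy_inverse` are hypothetical names, and the `sorted` option is an assumption made for the example, not a BetaML hyperparameter.

```julia
# Hypothetical toy encoder (not BetaML's OneHotEncoder), written only to show the
# construct -> fit -> predict -> inverse steps in plain Julia.
mutable struct ToyOneHot
    sorted::Bool             # "hyperparameter": chosen at construction time
    levels::Vector{String}   # learned parameter: filled in by the fitting step
end
ToyOneHot(; sorted=true) = ToyOneHot(sorted, String[])

# (b) fitting: learn the category levels from the data
function toy_fit!(m::ToyOneHot, y::Vector{String})
    m.levels = m.sorted ? sort(unique(y)) : unique(y)
    return m
end

# (c) prediction: one row per record, one column per learned level
toy_predict(m::ToyOneHot, y::Vector{String}) = [yi == l for yi in y, l in m.levels]

# Inverse step: map each one-hot row back to its category
toy_inverse(m::ToyOneHot, Y::AbstractMatrix{Bool}) = [m.levels[findfirst(row)] for row in eachrow(Y)]

y      = ["b", "a", "c", "a"]      # made-up categorical variable
enc    = ToyOneHot(sorted=true)    # (a) construction
toy_fit!(enc, y)                   # (b) fitting
y_oh   = toy_predict(enc, y)       # (c) prediction
y_back = toy_inverse(enc, y_oh)    # back to the original categories
println(y_oh)
println(y_back)
```

Keeping the hyperparameters and the learned parameters in the same object is what lets the prediction and inverse steps be called later on new data without refitting.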

Quick examples

(see the tutorial for a more step-by-step guide to the examples below and to other examples)

Using an Artificial Neural Network for multinomial categorisation

In this example we see how to train a neural network model to predict the species name (5th column) given the measurements of floral sepals and petals (first 4 columns) in the famous iris flower dataset.

# Load Modules
using DelimitedFiles, Random
using Pipe, Plots, BetaML # Load BetaML and other auxiliary modules
Random.seed!(123); # Fix the random seed (to obtain reproducible results).
# Load the data
iris = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
x = convert(Array{Float64,2}, iris[:,1:4])
y = convert(Array{String,1}, iris[:,5])
# Encode the categories (levels) of y using a separate column per each category (aka "one-hot" encoding)
ohmod = OneHotEncoder()
y_oh = fit!(ohmod,y)
# Split the data in training/testing sets
((xtrain,xtest),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,y,y_oh],[0.8,0.2])
(ntrain, ntest) = size.([xtrain,xtest],1)
# Define the Artificial Neural Network model
l1 = DenseLayer(4,10,f=relu) # The activation function is `ReLU`
l2 = DenseLayer(10,3) # The activation function is `identity` by default
l3 = VectorFunctionLayer(3,f=softmax) # Add a (parameterless) layer whose activation function (`softmax` in this case) is defined to all its nodes at once
mynn = NeuralNetworkEstimator(layers=[l1,l2,l3],loss=crossentropy,descr="Multinomial logistic regression Model Sepal", batch_size=2, epochs=200) # Build the NN and use the cross-entropy as error function.
# Alternatively, switch to hyperparameters auto-tuning with `autotune=true` instead of specifying `batch_size` and `epochs` manually
# Train the model (using the ADAM optimizer by default)
res = fit!(mynn,fit!(Scaler(),xtrain),ytrain_oh) # Fit the model to the (scaled) data
# Obtain predictions and test them against the ground truth observations
ŷtrain = @pipe predict(mynn,fit!(Scaler(),xtrain)) |> inverse_predict(ohmod,_) # Note the scaling and reverse one-hot encoding functions
ŷtest = @pipe predict(mynn,fit!(Scaler(),xtest)) |> inverse_predict(ohmod,_)
train_accuracy = accuracy(ŷtrain,ytrain) # 0.975
test_accuracy = accuracy(ŷtest,ytest) # 0.96
# Analyse model performances
cm = ConfusionMatrix()
fit!(cm,ytest,ŷtest)
print(cm)

A ConfusionMatrix BetaMLModel (fitted)

-----------------------------------------------------------------

*** CONFUSION MATRIX ***

Scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"      "virginica"  "versicolor"  "setosa"
 "virginica"   8            1             0
 "versicolor"  0            14            0
 "setosa"      0            0             7

Normalised scores actual (rows) vs predicted (columns):

4×4 Matrix{Any}:
 "Labels"      "virginica"  "versicolor"  "setosa"
 "virginica"   0.888889     0.111111      0.0
 "versicolor"  0.0          1.0           0.0
 "setosa"      0.0          0.0           1.0

*** CONFUSION REPORT ***

- Accuracy:               0.9666666666666667
- Misclassification rate: 0.033333333333333326
- Number of classes:      3

  N Class        precision  recall  specificity  f1score  actual_count  predicted_count
                            TPR     TNR                   support

  1 virginica    1.000      0.889   1.000        0.941    9             8
  2 versicolor   0.933      1.000   0.938        0.966    14            15
  3 setosa       1.000      1.000   1.000        1.000    7             7

- Simple   avg.  0.978      0.963   0.979        0.969
- Weigthed avg.  0.969      0.967   0.971        0.966

ϵ = info(mynn)["loss_per_epoch"]
plot(1:length(ϵ),ϵ, xlabel="epochs",ylabel="error",legend=nothing,title="Avg. error per epoch on the Sepal dataset")
heatmap(info(cm)["categories"],info(cm)["categories"],info(cm)["normalised_scores"],c=cgrad([:white,:blue]),xlabel="Predicted",ylabel="Actual", title="Confusion Matrix")
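The figures in the confusion report above can be re-derived by hand from the raw counts of the confusion matrix. The following sketch uses plain Julia with only the LinearAlgebra standard library; the variable names are chosen for this illustration and the code is not BetaML's implementation.

```julia
using LinearAlgebra # for `diag`

# Counts copied from the confusion matrix above: actual classes on rows, predicted on columns
labels = ["virginica", "versicolor", "setosa"]
scores = [8  1 0;
          0 14 0;
          0  0 7]

n         = sum(scores)                # total number of test records
tp        = diag(scores)               # correctly classified records per class
actual    = vec(sum(scores, dims=2))   # row sums: records truly belonging to each class
predicted = vec(sum(scores, dims=1))   # column sums: records assigned to each class
fp        = predicted .- tp            # false positives per class
fneg      = actual .- tp               # false negatives per class
tn        = n .- tp .- fp .- fneg      # true negatives per class

overall_accuracy  = sum(tp) / n
class_recall      = tp ./ actual       # true positive rate
class_precision   = tp ./ predicted
class_specificity = tn ./ (tn .+ fp)   # true negative rate
class_f1          = 2 .* class_precision .* class_recall ./ (class_precision .+ class_recall)

println("Accuracy: ", overall_accuracy)
for (i, l) in enumerate(labels)
    println(rpad(l, 12),
            " precision=",   round(class_precision[i], digits=3),
            " recall=",      round(class_recall[i], digits=3),
            " specificity=", round(class_specificity[i], digits=3),
            " f1=",          round(class_f1[i], digits=3))
end
```

The single misclassified record (a virginica flower predicted as versicolor) lowers the recall of virginica and the precision and specificity of versicolor, while the setosa class is unaffected.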

Using Random forests for regression

In this example we predict, using another classical ML dataset, the miles per gallon of various car models.

Note in particular:

(a) how easy it is in Julia to import remote data, even cleaning them without ever saving a local file on disk;
(b) how Random Forest models can work directly on data with missing values, categorical values and non-numerical values in general, without any preprocessing.

# Load modules
using Random, HTTP, CSV, DataFrames, BetaML, Plots
import Pipe: @pipe
Random.seed!(123)
# Load data
urlData = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
data = @pipe HTTP.get(urlData).body |>
             replace!(_, UInt8('\t') => UInt8(' ')) |>
             CSV.File(_, delim=' ', missingstring="?", ignorerepeated=true, header=false) |>
             DataFrame;
# Preprocess data
X = Matrix(data[:,2:8]) # cylinders, displacement, horsepower, weight, acceleration, model year, origin
y = data[:,1] # miles per gallon
(xtrain,xtest),(ytrain,ytest) = partition([X,y],[0.8,0.2])
# Model definition, hyper-parameters auto-tuning, training and prediction
m = RandomForestEstimator(autotune=true)
ŷtrain = fit!(m,xtrain,ytrain) # shortcut for `fit!(m,xtrain,ytrain); ŷtrain = predict(m,xtrain)`
ŷtest = predict(m,xtest)
# Prediction assessment
relative_mean_error_train = relative_mean_error(ytrain,ŷtrain) # 0.039
relative_mean_error_test = relative_mean_error(ytest,ŷtest) # 0.076
scatter(ytest,ŷtest,xlabel="Actual",ylabel="Estimated",label=nothing,title="Est vs. obs MPG (test set)")
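A "relative" error can be computed in more than one way, and the choice changes the number reported. The sketch below contrasts two common conventions in plain Julia. `rel_mean_err` and `mean_rel_err` are hypothetical helpers written for this illustration, and the data are made up; which convention `relative_mean_error` follows by default is not stated on this page, so check its docstring before comparing figures.

```julia
# Two hypothetical helpers (not BetaML functions) showing common ways to turn
# absolute errors into a relative figure.

# Total absolute error divided by the total absolute size of the observations
rel_mean_err(y, ŷ) = sum(abs.(ŷ .- y)) / sum(abs.(y))

# Relative error computed record by record and then averaged (MAPE-like)
mean_rel_err(y, ŷ) = sum(abs.(ŷ .- y) ./ abs.(y)) / length(y)

y = [10.0, 20.0, 40.0]   # made-up observed values
ŷ = [11.0, 18.0, 40.0]   # made-up estimates

println("Aggregate relative error:  ", rel_mean_err(y, ŷ))   # (1 + 2 + 0) / 70
println("Per-record relative error: ", mean_rel_err(y, ŷ))   # (0.1 + 0.1 + 0) / 3
```

The first form gives more weight to records with large observed values, while the second treats every record equally, whatever its magnitude.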

Further examples

Finally, you may want to take a look at the "test" folder. While the primary objective of the scripts under the "test" folder is to provide automatic testing of the BetaML toolkit, they can also be used to see how functions should be called, as virtually all functions provided by BetaML are tested there.

Benchmarks

A page summarising some basic benchmarks for BetaML and other leading Julia ML libraries is available here.

Acknowledgements

The development of this package at the Bureau d'Economie Théorique et Appliquée (BETA, Nancy) was supported by the French National Research Agency through the Laboratory of Excellence ARBRE, a part of the “Investissements d'Avenir” Program (ANR 11 – LABX-0002-01).