This page describes how you can quickly get started using mlpack in C++ and gives a few examples of usage, and pointers to deeper documentation.
Keep in mind that mlpack also has interfaces to other languages, and quickstart guides for those other languages are available too. If that is what you are looking for, see the quickstarts for Python, the command line, Julia, R, or Go.
To use mlpack in C++, you only need the header files associated with the libraries, and the dependencies Armadillo and ensmallen (detailed in the main README).
The headers may already be pre-packaged for your distribution; for instance, for Ubuntu and Debian you can simply run the command
sudo apt-get install libmlpack-dev
and on Fedora or Red Hat:
sudo dnf install mlpack-devel
You can also use a Docker image from Dockerhub, which has mlpack headers already installed:
docker run -it mlpack/mlpack /bin/bash
If you prefer to build mlpack from scratch, see the main README.
Note for Ubuntu LTS Users: The libmlpack-dev version in the Ubuntu LTS
repositories may not always be the latest. This can lead to issues, such as
missing header files (e.g., mlpack.hpp
missing in versions prior to 4.0).
To ensure compatibility with the latest mlpack features and examples,
we recommend building mlpack from source, as explained in the
main README.
The mlpack port in vcpkg is kept up to date by Microsoft team members and community contributors. The url of vcpkg is: https://github.com/Microsoft/vcpkg . You can download and install mlpack using the vcpkg dependency manager:
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh # ./bootstrap-vcpkg.bat for Windows
./vcpkg integrate install
./vcpkg install mlpack
If the version is out of date, please create an issue or pull request on the vcpkg repository.
The mlpack recipe in Conan is kept up to date by the Conan team members and the community contributors.
Follow the instructions on this page on how to set up Conan.
Install mlpack:
conan install --requires="mlpack/[*]" --build=missing
If the version is outdated or there is a new release version, please create an issue or pull request on the conan-center-index repository.
As a really simple example of how to use mlpack in C++, let's do some simple
classification on a subset of the standard machine learning covertype
dataset.
We'll first split the dataset into a training set and a test set, then we'll
train an mlpack random forest on the training data, and finally we'll print the
accuracy of the random forest on the test dataset.
The first step is to download the covertype dataset onto your system so that it is available for the program. A shell command below is given to do this:
# Get the dataset and unpack it.
wget https://www.mlpack.org/datasets/covertype-small.data.csv.gz
wget https://www.mlpack.org/datasets/covertype-small.labels.csv.gz
gunzip covertype-small.data.csv.gz covertype-small.labels.csv.gz
With that in place, let's write a C++ program to split the data and perform the classification:
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>
using namespace arma;
using namespace mlpack;
using namespace mlpack::tree;
using namespace std;
int main()
{
// Load the datasets.
mat dataset;
Row<size_t> labels;
if (!data::Load("covertype-small.data.csv", dataset))
throw std::runtime_error("Could not read covertype-small.data.csv!");
if (!data::Load("covertype-small.labels.csv", labels))
throw std::runtime_error("Could not read covertype-small.labels.csv!");
// Labels are 1-7, but we want 0-6 (we are 0-indexed in C++).
labels -= 1;
// Now split the dataset into a training set and test set, using 30% of the
// dataset for the test set.
mat trainDataset, testDataset;
Row<size_t> trainLabels, testLabels;
data::Split(dataset, labels, trainDataset, testDataset, trainLabels,
testLabels, 0.3);
// Create the RandomForest object and train it on the training data.
RandomForest<> r(trainDataset,
trainLabels,
7 /* number of classes */,
10 /* number of trees */,
3 /* minimum leaf size */);
// Compute and print the training error.
Row<size_t> trainPredictions;
r.Classify(trainDataset, trainPredictions);
const double trainError =
arma::accu(trainPredictions != trainLabels) * 100.0 / trainLabels.n_elem;
cout << "Training error: " << trainError << "%." << endl;
// Now compute predictions on the test points.
Row<size_t> testPredictions;
r.Classify(testDataset, testPredictions);
const double testError =
arma::accu(testPredictions != testLabels) * 100.0 / testLabels.n_elem;
cout << "Test error: " << testError << "%." << endl;
}
Now, you can compile the program with your favorite C++ compiler; here's an
example command that uses g++
, and assumes the file above is saved as
cpp_quickstart_1.cpp
.
g++ -O3 -std=c++17 -o cpp_quickstart_1 cpp_quickstart_1.cpp -larmadillo -fopenmp
Then, you can run the program easily:
./cpp_quickstart_1
We can see by looking at the output that we achieve reasonably good accuracy on the test dataset (80%+):
Training error: 19.4329%.
Test error: 24.17%.
It's easy to modify the code above to do more complex things, or to use different mlpack learners, or to interface with other machine learning toolkits.
In this example, we'll train a collaborative filtering model using mlpack's CF
class. We'll train this on this
MovieLens dataset, and then we'll
use the model that we train to give recommendations.
First, download the MovieLens dataset:
wget https://www.mlpack.org/datasets/ml-20m/ratings-only.csv.gz
wget https://www.mlpack.org/datasets/ml-20m/movies.csv.gz
gunzip ratings-only.csv.gz movies.csv.gz
Next, we can use the following C++ code:
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>
using namespace arma;
using namespace mlpack;
using namespace mlpack::cf;
using namespace std;
int main()
{
// Load the ratings.
mat ratings;
if (!data::Load("ratings-only.csv", ratings))
throw std::runtime_error("Could not load ratings-only.csv!");
// Now, load the names of the movies as a single-feature categorical dataset.
// We can use `moviesInfo.UnmapString(i, 0)` to get the i'th string.
data::DatasetInfo moviesInfo;
mat movies; // This will be unneeded.
if (!data::Load("movies.csv", movies, moviesInfo))
throw std::runtime_error("Could not load movies.csv!");
// Split the ratings into a training set and a test set, using 10% of the
// dataset for the test set.
mat trainRatings, testRatings;
data::Split(ratings, trainRatings, testRatings, 0.1);
// Train the CF model using RegularizedSVD as the decomposition algorithm.
// Here we use a rank of 10 for the decomposition.
CFType<RegSVDPolicy> cf(
trainRatings,
RegSVDPolicy(),
5, /* number of users to use for similarity computations */
10 /* rank of decomposition */);
// Now compute the RMSE for the test set user and item combinations. To do
// this we must assemble the list of users and items.
Mat<size_t> combinations(2, testRatings.n_cols);
for (size_t i = 0; i < testRatings.n_cols; ++i)
{
combinations(0, i) = size_t(testRatings(0, i)); // (user)
combinations(1, i) = size_t(testRatings(1, i)); // (item)
}
vec predictions;
cf.Predict(combinations, predictions);
const double rmse = norm(predictions - testRatings.row(2).t(), 2) /
sqrt((double) testRatings.n_cols);
std::cout << "RMSE of trained model is " << rmse << "." << endl;
// Compute the top 10 movies for user 1.
Col<size_t> users = { 1 };
Mat<size_t> recommendations;
cf.GetRecommendations(10, recommendations, users);
// Now print each movie.
cout << "Recommendations for user 1:" << endl;
for (size_t i = 0; i < recommendations.n_elem; ++i)
{
cout << " " << (i + 1) << ". "
<< moviesInfo.UnmapString(recommendations[i], 2) << "." << endl;
}
}
This can be compiled the same way as before, assuming the code is saved as
cpp_quickstart_2.cpp
:
g++ -O3 -std=c++17 -o cpp_quickstart_2 cpp_quickstart_2.cpp -fopenmp -larmadillo
And then it can be easily run:
./cpp_quickstart_2
Here is some example output, showing that user 1 seems to have good taste in movies:
RMSE of trained model is 0.795323.
Recommendations for user 1:
1: Casablanca (1942)
2: Pan's Labyrinth (Laberinto del fauno, El) (2006)
3: Godfather, The (1972)
4: Answer This! (2010)
5: Life Is Beautiful (La Vita è bella) (1997)
6: Adventures of Tintin, The (2011)
7: Dark Knight, The (2008)
8: Out for Justice (1991)
9: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
10: Schindler's List (1993)
Now that you have done some simple work with mlpack, you have seen how it can easily plug into a data science production workflow in C++. But these two examples have only shown a little bit of the functionality of mlpack. Lots of other functionality is available.
Some of this functionality is demonstrated in the examples repository.
A full list of all classes and functions that mlpack implements can be found by browsing the well-commented source code.