CatBoost is an open-source gradient boosting library from Yandex, designed to be fast and accurate on data that is not purely numeric. It converts categorical values into numbers internally, so categorical columns can be passed in as they are, without hand-coded features or complex feature engineering, and it can also use text columns for classification. It is aimed at large-scale classification workloads, for example multilingual social media feeds with billions of posts.
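To illustrate how a non-numeric value can be turned into a number, the sketch below implements a simplified ordered target statistic, the style of encoding described in the CatBoost paper. The function name, the `prior` and `prior_weight` parameters, and the toy data are assumptions made for this example; the library's real implementation differs in detail (for instance, it uses random permutations of the rows).

```python
from collections import defaultdict

def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row's category using only the targets of EARLIER rows.

    Looking only at earlier rows means a row's own label never leaks
    into its own encoded value.
    """
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        # Smoothed mean of the targets seen so far for this category.
        value = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        encoded.append(value)
        # Update the running statistics only after encoding the row.
        sums[cat] += y
        counts[cat] += 1
    return encoded

# Hypothetical toy data: a colour column and a binary label.
cats = ["red", "blue", "red", "red", "blue"]
ys = [1, 0, 1, 0, 1]
encoded = ordered_target_statistic(cats, ys)
print(encoded)
```

The first occurrence of each category receives the prior alone, and later occurrences drift toward the mean label observed for that category so far.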
CatBoost achieves its accuracy by combining two established techniques: gradient boosting and decision trees. Trees are added one at a time, each fit to correct the errors of the ensemble built so far, so the combined model is considerably more accurate than a single decision tree.
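The way boosting combines many weak trees can be sketched in a few lines. The example below is a generic gradient boosting loop with one-split trees (stumps) and squared-error loss on a single numeric feature. It is not CatBoost's algorithm, which adds ordered boosting and symmetric trees on top of this idea, and all function names and data are invented for the illustration.

```python
def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals (least squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave rows on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_boosted_stumps(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)            # start from the mean target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_boosted_stumps(xs, ys)
mse_base = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boost = mse(ys, [predict(model, x) for x in xs])
print(mse_base, mse_boost)
```

Each stump on its own is a poor model, but because every round fits what the previous rounds got wrong, the training error of the ensemble falls well below that of the constant baseline.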
Much of the speed comes from the symmetric trees CatBoost builds by default, in which every node on a level shares one split condition, and from optional GPU training, which makes larger datasets practical. High accuracy combined with fast prediction is particularly useful for real-time classification in high-volume settings, such as social media feeds, where high precision is critical.
CatBoost trains an ensemble of decision trees that predicts the category of unlabeled examples from their features, which may be numerical, categorical, or textual. When a new text document arrives at the classifier, each tree routes it to a leaf according to its feature values, and the leaf values are summed to produce the predicted category. By default CatBoost trains 1000 trees of depth 6; both settings can be changed through the iterations and depth parameters.
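The prediction step can be sketched as follows, assuming a trained binary classifier made of symmetric trees in which each level applies a single (feature, threshold) test. The two trees, their thresholds, and their leaf values are hypothetical numbers chosen for the example, not output from a real CatBoost model.

```python
import math

# Each hypothetical tree has one split per level and 2**depth leaf values.
trees = [
    {"splits": [(0, 0.5), (1, 2.0)], "leaves": [-0.4, 0.1, 0.2, 0.6]},
    {"splits": [(1, 1.0)], "leaves": [-0.3, 0.3]},
]

def leaf_index(splits, features):
    """Build the leaf index bit by bit: one bit per level of the tree."""
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if features[feature] > threshold:
            index |= 1 << level
    return index

def raw_score(trees, features):
    """Sum the value of the leaf that each tree routes the example to."""
    return sum(tree["leaves"][leaf_index(tree["splits"], features)]
               for tree in trees)

def predict_proba(trees, features):
    """Map the summed score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-raw_score(trees, features)))

example = [0.9, 3.0]   # numeric feature vector for one new example
score = raw_score(trees, example)
probability = predict_proba(trees, example)
label = 1 if probability > 0.5 else 0
print(score, probability, label)
```

Because every level of a symmetric tree asks the same question of every example, finding the leaf reduces to a handful of comparisons and bit operations, which is one reason prediction with such trees is fast.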