Recall

Multi-task prediction: the demo predicts CTR and CTCVR simultaneously, where p(CTCVR) = p(CTR) × p(CVR).
The model uses two MLPs over the same input; the embedding layer is shared, while the later layers are separate. The MLP input has width dim_emb * n_sparse + n_dense, i.e., the concatenated embeddings of the sparse features plus the raw dense features.
The two outputs are CTR and CVR predictions.
In the demo, each uses a separate BCE loss, and the losses are summed.
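
A minimal sketch of the summed loss for a single sample. It assumes the two BCE terms supervise CTR (click label) and CTCVR (click-and-convert label); the notes do not say which labels the demo uses. The names bce, multitask_loss, click and conversion are made up for illustration.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def multitask_loss(p_ctr, p_cvr, click, conversion):
    """Sum of the CTR loss and the CTCVR loss (assumed label choice)."""
    p_ctcvr = p_ctr * p_cvr  # p(CTCVR) = p(CTR) * p(CVR)
    return bce(p_ctr, click) + bce(p_ctcvr, conversion)

# One impression that was clicked but did not convert.
loss = multitask_loss(p_ctr=0.2, p_cvr=0.5, click=1, conversion=0)
```

Because the CVR output only enters the loss through the product, it is trained on all impressions rather than only on clicked ones.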

Two MLPs: one takes the user feature matrix as input, the other takes the item feature matrix, each of shape [num_samples, dim_features].
The outputs are two embedding matrices: [num_samples, dim_features].
Loss functions can be point-wise, pair-wise, or batch-wise. Here it uses batch-wise InfoNCE.
Only positive (user, item) pairs are sampled; for each user, all other items within the batch are treated as negatives.
We want the positive samples to be classified as positive, which is equivalent to BCE(dot(user_emb, item_emb)).
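
A rough sketch of one common in-batch InfoNCE formulation: softmax cross-entropy over the user-item dot products, with the matching row as the target. The demo's exact loss may differ. The function name and the temperature parameter are assumptions, not taken from the demo.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_info_nce(user_emb, item_emb, temperature=1.0):
    """Mean InfoNCE loss; row i of each matrix forms a positive pair."""
    losses = []
    for i, u in enumerate(user_emb):
        # Score user i against every item in the batch.
        logits = [dot(u, v) / temperature for v in item_emb]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log softmax at the positive item
    return sum(losses) / len(losses)

user_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_info_nce(user_emb, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_info_nce(user_emb, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each user's embedding is closest to its own item (aligned) than when the pairs are mismatched (swapped).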

Main Rank

Basic Principle

Ranking is a classification problem. Using the actual 0/1 click labels as ground truth, the predicted probability is directly used as the CTR prediction.

FM is LR plus second-order cross terms.

Engineering Optimization

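Use the sum-of-squares identity to reduce complexity: rewrite the pairwise cross terms as half of “square of sums minus sum of squares,” which cuts the cost from O(k * n^2) to O(k * n) for n features and embedding size k. Then store one k-dimensional embedding per feature instead of an explicit weight for every feature pair; this is equivalent to the original FM definition but saves memory.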
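
A small numerical check of the identity, as a sketch. Here x is one sample's feature vector and v[i] is the embedding of feature i; the sizes (5 features, embedding size 3) and the function names are arbitrary choices for illustration.

```python
import random

def fm_pairwise_naive(x, v):
    """sum over i<j of <v_i, v_j> * x_i * x_j, pair by pair: O(k * n^2)."""
    n, k = len(x), len(v[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(v[i][f] * v[j][f] for f in range(k)) * x[i] * x[j]
    return total

def fm_pairwise_fast(x, v):
    """0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]: O(k * n)."""
    k = len(v[0])
    total = 0.0
    for f in range(k):
        terms = [v_i[f] * x_i for v_i, x_i in zip(v, x)]
        total += sum(terms) ** 2 - sum(t * t for t in terms)
    return 0.5 * total

random.seed(0)
x = [random.random() for _ in range(5)]
v = [[random.random() for _ in range(3)] for _ in range(5)]
naive = fm_pairwise_naive(x, v)
fast = fm_pairwise_fast(x, v)  # agrees with naive up to rounding error
```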

An embedding is equivalent to one-hot encoding followed by a linear layer without bias, except that the weight row is looked up directly instead of computed by a matrix multiplication.
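
The equivalence can be checked on a toy table. This is a sketch; the weight values and the helper names are hypothetical.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def linear_no_bias(x, weight):
    """x @ weight, where weight has shape [vocab_size, dim_emb]."""
    dim = len(weight[0])
    return [sum(x[i] * weight[i][d] for i in range(len(x))) for d in range(dim)]

weight = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical table: 3 ids, dim_emb 2
idx = 2

via_matmul = linear_no_bias(one_hot(idx, len(weight)), weight)
via_lookup = weight[idx]  # same row, without building the one-hot vector
```

The lookup avoids both the one-hot vector and the multiplications by zero, which matters when the vocabulary is large.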

Basic Principle

Input sample features x have shape [num_samples, num_features].

GBDT maps each sample to one leaf in each of the num_trees trees, producing a sparse one-hot matrix of shape [num_samples, num_trees * num_leaves], with num_leaves leaves per tree. Most entries are zero because each tree activates exactly one leaf per sample.
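
A sketch of the leaf encoding only. It assumes the leaf index each sample reaches in each tree is already available from a trained GBDT, and that every tree has the same number of leaves. The leaf_ids values are made up.

```python
def leaf_one_hot(leaf_ids, num_leaves):
    """leaf_ids[i][t] is the leaf that sample i reaches in tree t."""
    rows = []
    for sample_leaves in leaf_ids:
        row = [0] * (len(sample_leaves) * num_leaves)
        for tree, leaf in enumerate(sample_leaves):
            row[tree * num_leaves + leaf] = 1  # one active leaf per tree
        rows.append(row)
    return rows

# Hypothetical leaf indices: 2 samples, 3 trees, 4 leaves per tree.
leaf_ids = [[0, 3, 1], [2, 2, 0]]
encoded = leaf_one_hot(leaf_ids, num_leaves=4)
# Each row has 3 * 4 columns and exactly 3 ones, one per tree.
```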

Then concatenate this sparse matrix with the original feature matrix along the feature dimension and feed it into LR.

Core Idea

LR cannot learn nonlinear feature combinations.

GBDT learns nonlinear combinations through tree structures (leaf grouping), and LR assigns linear weights to these combinations.

Engineering Optimization

GBDT is trained offline and used online for inference only. LR can be trained online with SGD, updating on a single sample at a time.
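
A minimal sketch of one online LR update from a single sample, using the log-loss gradient. In the GBDT+LR setup, x would be the leaf one-hot vector concatenated with the original features; here it is a toy two-feature vector. The learning rate and the function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    """One online LR update from a single (x, y) sample with log loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = p - y  # derivative of the log loss with respect to the logit
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w, b - lr * grad, p

w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1  # one clicked impression
w, b, p_before = sgd_step(w, b, x, y)
p_after = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

After the update, the predicted click probability for this sample moves toward its label, and no other data is needed to take the step.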

GBDT is slower to train, cannot be updated with single samples, and must rebuild its tree structure using the full dataset.

Even inference can be slow, but optimizations (e.g., hashing) can help.