The decoder consists of three main components:
1) a face transformer; 2) a classification head; and 3) a fusion network.
It performs point cloud completion by predicting the 3D coordinate offset and opacity value for each candidate point in $P^S$.
Based on these predictions, the missing 3D point cloud $P^{\text{out}}$ is estimated.
Importantly, the proposed decoder focuses solely on reconstructing the missing regions, rather than regenerating the entire object point cloud.
Moreover, thanks to its transformer-based design, the decoder can handle input point clouds with varying densities.
We define the offset $o_m = (o_x, o_y, o_z)$ to represent the $(x, y, z)$ displacements of the $m$-th candidate point $p_m$.
For each candidate point $p_m$, a local coordinate system is defined with $p_m$ as the origin, where the sampling ray is aligned with the $z$-axis, and the two axes of the corresponding face define the $x$- and $y$-axes.
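To make this convention concrete, the following minimal sketch (in Python/NumPy; all function and variable names are our own illustration, not part of the method) shows how an offset predicted in the local frame could be mapped back to global coordinates, assuming the face axes and the ray direction are available as unit vectors:
\begin{verbatim}
import numpy as np

def local_to_global_offset(o_local, face_x, face_y, ray_dir):
    # Columns of R are the local axes (face x-axis, face y-axis,
    # sampling ray = local z-axis) expressed in global coordinates,
    # so R maps local-frame offsets to global-frame displacements.
    R = np.stack([face_x, face_y, ray_dir], axis=-1)  # (3, 3)
    return R @ o_local

# Example: displace a candidate point p_m by its predicted offset.
p_m     = np.array([0.1, 0.2, 0.3])
o_local = np.array([0.01, -0.02, 0.05])          # (o_x, o_y, o_z)
face_x  = np.array([1.0, 0.0, 0.0])
face_y  = np.array([0.0, 1.0, 0.0])
ray_dir = np.array([0.0, 0.0, 1.0])
p_refined = p_m + local_to_global_offset(o_local, face_x, face_y, ray_dir)
\end{verbatim}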
The opacity value $\sigma_m$ represents the influence of the $m$-th point, similar to its role in NeRF.
Only points with $\sigma_m \geq 0.5$ are retained as meaningful, while the rest are filtered out.
Since this threshold determines how many candidate points survive, the decoder design allows the model to adjust the output point cloud density.
Then, the missing 3D point cloud $P^{\text{out}}$ is defined as follows:
\begin{equation}
\label{eq:p_out}
P^{\text{out}} = \{ p_m + o_m \mid \sigma_m \geq 0.5,\; m = 1, \ldots, M \}.
\end{equation}
The final completion result is obtained by combining the predicted points $P^{\text{out}}$ with the incomplete input $P^I$, yielding $P^{\text{pred}} = P^I \cup P^{\text{out}}$.
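Computationally, Eq.~\eqref{eq:p_out} and the final union reduce to a single masking step; the sketch below (array names and shapes are our assumptions) illustrates the assembly:
\begin{verbatim}
import numpy as np

def complete_point_cloud(P_I, P_S, offsets, sigma, threshold=0.5):
    # P_I:     (N, 3) incomplete input point cloud
    # P_S:     (M, 3) candidate points
    # offsets: (M, 3) predicted offsets o_m (global frame)
    # sigma:   (M,)   predicted opacity values
    keep = sigma >= threshold            # opacity filtering
    P_out = P_S[keep] + offsets[keep]    # Eq. (p_out)
    # Final prediction: union of the input and the estimated missing part.
    return np.concatenate([P_I, P_out], axis=0)
\end{verbatim}
Raising or lowering the threshold (0.5 above) trades off density against confidence, which is the mechanism behind the density-adjustment property noted earlier.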