3.2. Dynamic Points Culled Algorithm
Our method combines optical flow and epipolar constraints for an initial assessment of dynamic feature points and then refines the removal with semantic information. The approach builds on the LK optical flow method and the epipolar motion consistency check, and resembles the dynamic detection strategy of DS-SLAM. First, we apply the LK optical flow method to track the feature points extracted by HFNet in the previous frame into the current frame, which yields the optical flow vectors illustrated in Figure 2b. The points successfully tracked across the previous and current frames are denoted as $P_1$ and $P_2$, as depicted in Equation (1), where $u$ and $v$ represent pixel coordinates.

$$P_1 = [u_1, v_1, 1]^T, \quad P_2 = [u_2, v_2, 1]^T \tag{1}$$

After applying the RANSAC algorithm to filter out anomalous optical flow vectors and obtain the fundamental matrix $F$ between the two frames [31], the epipolar line $l$ can be represented as shown in Equation (2):

$$l = [X, Y, Z]^T = F P_1 \tag{2}$$

The distance $D$ between a pixel in the current frame and its corresponding epipolar line can be expressed as shown in Equation (3). If the distance is too large, the point is considered a potential dynamic point.

$$D = \frac{\left| P_2^T F P_1 \right|}{\sqrt{X^2 + Y^2}} \tag{3}$$
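The epipolar check described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the system's implementation: the function names, the nested-list representation of the fundamental matrix, and the `max_dist` threshold value are assumptions made for the example.

```python
import math

def epipolar_distance(F, p1, p2):
    """Distance from p2 (current frame) to the epipolar line of p1 (previous frame).

    F is a 3x3 fundamental matrix given as nested lists; p1 and p2 are (u, v) pixels.
    """
    h1 = (p1[0], p1[1], 1.0)  # homogeneous coordinates of the previous-frame point
    # Epipolar line [X, Y, Z]^T = F * P1
    X, Y, Z = (sum(F[r][c] * h1[c] for c in range(3)) for r in range(3))
    norm = math.hypot(X, Y)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(p2[0] * X + p2[1] * Y + Z) / norm

def split_dynamic_static(F, matches, max_dist=1.0):
    """Split tracked pairs into potential dynamic and static points.

    max_dist is a hypothetical pixel threshold chosen for illustration.
    """
    dynamic, static = [], []
    for p1, p2 in matches:
        target = dynamic if epipolar_distance(F, p1, p2) > max_dist else static
        target.append(p2)
    return dynamic, static
```

For a camera translating purely along the image x-axis, the epipolar lines are horizontal, so a tracked point that changes its row by three pixels is three pixels away from its line and would be flagged as potentially dynamic.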
This method has been widely applied in dynamic SLAM systems. Although it can approximately identify dynamic feature points, it suffers from errors in optical flow tracking and in the epipolar line calculation, and it cannot detect dynamic feature points that move along their epipolar lines, because their distance to the line remains close to zero. It therefore needs to be combined with semantic information. Over-removal or under-removal can severely impact the system’s robustness, and despite heightened attention to this issue in recent research, a satisfactory solution remains elusive. Our proposed Algorithm 1 addresses this issue to a certain extent, facilitating more precise removal.
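Algorithm 1 Dynamic Points Culled Algorithm
Input: dynamic points, static points, detection boxes, mask; sub-box ratio threshold, dynamic sub-box count threshold
Output: precise mask, final mask
1: for each detection box in the detection boxes do
2:   Divide the detection box into nine sub-boxes;
3:   for each sub-box in the nine sub-boxes do
4:     Initialize Dib, Sib;
5:     AppendTheDynamicPoints(Dib, dynamic points);
6:     AppendTheStaticPoints(Sib, static points);
7:     Compute the ratio of dynamic to static points from Dib and Sib;
8:     if the ratio exceeds the ratio threshold then
9:       Append the sub-box to the dynamic sub-boxes;
10:     else
11:       Append the sub-box to the static sub-boxes;
12:     end if
13:   end for
14:   Check and merge near boxes from the static sub-boxes to the dynamic sub-boxes;
15:   if the number of dynamic sub-boxes exceeds the count threshold then
16:     Mark all nine sub-boxes as dynamic;
17:   end if
18:   Remove the mask inside the static sub-boxes from the mask;
19: end for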
After the initial selection of dynamic and static points on the current frame using the optical flow–epipolar line method, the system divides each target detection box into nine areas, as shown in Figure 2d. In lines 3 to 6, the feature points inside each sub-box are collected into Dynamic In Box (Dib) or Static In Box (Sib). In lines 7 to 12, each sub-box is classified as either a static or a dynamic sub-box based on its ratio of dynamic to static points. At line 14, the system identifies the dynamic sub-boxes within the mother detection box and sets their adjacent sub-boxes to a dynamic state so that no potentially dynamic region is missed in the mask; the result is shown on the left side of Figure 2e. Moreover, because dynamic objects may move between frames, lines 15 to 18 assess the dynamic level of the target. If the majority of the mother box associated with the target is occupied by dynamic sub-boxes, the target is considered highly dynamic and marked for complete removal, as depicted on the right side of Figure 2e. Lines 2 to 18 thus describe the assessment procedure for a single target.
The system evaluates all targets on the frame, preserving the masks within all dynamic sub-boxes and performing dilation. This strategy avoids probability calculations for dynamic objects and does not rely on subjective judgments, providing an accurate assessment of all potentially moving regions on the frame. It maximizes the utilization of the optical flow–epipolar line judgment method.
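The sub-box logic of Algorithm 1 for a single detection box can be illustrated with the following sketch. It is written under several assumptions that the text does not fix: the function name, both threshold values, the use of a 4-neighborhood for merging adjacent sub-boxes, and treating a sub-box with dynamic points but no static points as dynamic. The mask removal and dilation steps are omitted.

```python
def classify_sub_boxes(box, dynamic_pts, static_pts,
                       ratio_thresh=1.0, count_thresh=5):
    """Return a 3x3 grid of booleans (True = dynamic sub-box) for one box.

    box is (x0, y0, x1, y1); points are (x, y) pixels.
    ratio_thresh and count_thresh are hypothetical values.
    """
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) / 3.0, (y1 - y0) / 3.0

    def cell_of(pt):
        x, y = pt
        if not (x0 <= x < x1 and y0 <= y < y1):
            return None  # point lies outside the detection box
        return min(int((y - y0) / h), 2), min(int((x - x0) / w), 2)

    # Lines 3-6: count dynamic (Dib) and static (Sib) points per sub-box.
    dib = [[0] * 3 for _ in range(3)]
    sib = [[0] * 3 for _ in range(3)]
    for pts, counts in ((dynamic_pts, dib), (static_pts, sib)):
        for pt in pts:
            cell = cell_of(pt)
            if cell is not None:
                counts[cell[0]][cell[1]] += 1

    # Lines 7-12: classify each sub-box by its dynamic-to-static ratio.
    dynamic = [[False] * 3 for _ in range(3)]
    for r in range(3):
        for c in range(3):
            d, s = dib[r][c], sib[r][c]
            ratio = d / s if s > 0 else (float("inf") if d > 0 else 0.0)
            dynamic[r][c] = ratio > ratio_thresh

    # Line 14: neighbours of a dynamic sub-box also become dynamic.
    merged = [row[:] for row in dynamic]
    for r in range(3):
        for c in range(3):
            if dynamic[r][c]:
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < 3 and 0 <= cc < 3:
                        merged[rr][cc] = True

    # Lines 15-17: a mostly dynamic box is treated as fully dynamic.
    if sum(map(sum, merged)) >= count_thresh:
        merged = [[True] * 3 for _ in range(3)]
    return merged
```

With these particular choices, a single dynamic sub-box in a corner spreads to its two neighbors only, whereas a dynamic center sub-box already reaches five of nine cells and marks the whole target as dynamic. Different neighborhood or threshold choices would change that balance.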
3.3. Frame Rotation Estimation and Feature Point Matching
The feature extraction based on HFNet provides high-quality feature points and descriptors, thereby improving matching efficiency and triangulation accuracy, ultimately enhancing the accuracy of pose estimation. However, in instances of frame rotation, as noted in the HFNet-SLAM paper [7], the performance of deep-learning-based feature extraction significantly deteriorates, resulting in feature matching failures and tracking loss. In contrast, ORB descriptors used in ORB-SLAM3 have demonstrated excellent robustness in rotation, which is the foundation of our work.
To tackle this issue, we assess the frame rotation between the current and previous frames to decide on the descriptor to be used and to select a suitable distance calculation method for various descriptors. We utilize the optical flow vectors obtained from the previous tracking step and estimate the frame rotation angle using these vectors. The detailed process of frame rotation angle estimation is outlined in Algorithm 2.
When a frame rotation occurs, the optical flow vectors distributed around the image’s center point exhibit a characteristic pattern, as shown in the figure. Because motion between frames is complex and may involve rotations about multiple axes in addition to the frame rotation, the perpendicular bisectors of the optical flow vectors do not necessarily pass exactly through the center of the frame rotation. Additionally, the tracking of optical flow vectors may not be entirely accurate. Therefore, we employ a least-squares method to estimate the rotation center.
In Algorithm 2, the first line calculates the perpendicular bisector for each optical flow vector. Given a base optical flow vector, the midpoint and slope are used to determine the equation of its perpendicular bisector, with lines written in the general form of Equation (4):

$$ax + by + c = 0 \tag{4}$$

For the line on which the optical flow vector lies, $a$, $b$, and $c$ are the coefficients of this line, and $x$ and $y$ are the coordinates of points on it, including the two endpoints of the optical flow vector. The perpendicular bisector is the line through the midpoint of the vector that is perpendicular to this line.
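As a small illustration of this step, the perpendicular bisector of a flow vector can be computed directly in general line form. The sketch below is an assumption about one convenient way to do it: it uses the flow direction as the normal of the bisector rather than an explicit slope, which avoids the special case of vertical lines. The function name is hypothetical.

```python
def perpendicular_bisector(p_start, p_end):
    """Return (a, b, c) such that a*x + b*y + c = 0 is the perpendicular
    bisector of the flow vector from p_start to p_end."""
    (x1, y1), (x2, y2) = p_start, p_end
    # The bisector is normal to the flow direction, so (a, b) is that direction.
    a, b = x2 - x1, y2 - y1
    if a == 0 and b == 0:
        raise ValueError("zero-length flow vector has no bisector")
    # The bisector passes through the midpoint of the vector.
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    c = -(a * mx + b * my)
    return a, b, c
```

For a point that moves from (1, 0) to (0, 1), which corresponds to a quarter turn about the origin, the resulting line passes through the origin, as expected for a pure rotation.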
In the second line, the center point of the frame is chosen as the starting point for optimization. Since most detected and recognized frame rotations typically occur near the center of the image, this choice significantly reduces optimization time.
Algorithm 2 Rotation Estimation Algorithm
Input: optical flow vectors $V$; frame’s dimensions $H$, $W$; distance threshold $\tau$.
Output: optimal point $P^*$; rotation angle $\theta$.
1: $L \leftarrow$ list of perpendicular bisectors for each $v \in V$
2: $P_0 \leftarrow (W/2, H/2)$
3: $P^* \leftarrow$ solve minimization problem for $f(P)$ using $L$, starting from $P_0$
4: $\Theta \leftarrow$ empty list
5:
6: for each $v$ in $V$ do
7:   Calculate distance of $P^*$ to both ends of $v$
8:   if difference in distances $< \tau$ then
9:     $\theta_v \leftarrow$ angle between line endpoints
10:     Append $\theta_v$ to $\Theta$
11:   end if
12: end for
13: if length of $\Theta$ is sufficient then
14:   $\theta \leftarrow$ sum of $\Theta$ / length of $\Theta$
15: else
16:   Indicate rotation did not happen
17: end if
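The third line constructs and solves the optimization function $f(P)$ as outlined in Equation (5), aiming to minimize the distance between $P = (x_p, y_p)$ and all perpendicular bisector lines:

$$f(P) = \sum_{i=1}^{n} \frac{(a_i x_p + b_i y_p + c_i)^2}{a_i^2 + b_i^2} \tag{5}$$

where $a_i x + b_i y + c_i = 0$ is the perpendicular bisector of the $i$-th optical flow vector and $n$ is the number of vectors.

We seek a rough estimate of the frame rotation angle for real-time applications. BFGS optimization is employed for faster iteration. The gradient of the objective function is computed as shown in Equation (6), where $\partial f / \partial x_p$ and $\partial f / \partial y_p$ are given by Equations (7) and (8):

$$\nabla f(P) = \left[ \frac{\partial f}{\partial x_p}, \frac{\partial f}{\partial y_p} \right]^T \tag{6}$$

$$\frac{\partial f}{\partial x_p} = \sum_{i=1}^{n} \frac{2 a_i (a_i x_p + b_i y_p + c_i)}{a_i^2 + b_i^2} \tag{7}$$

$$\frac{\partial f}{\partial y_p} = \sum_{i=1}^{n} \frac{2 b_i (a_i x_p + b_i y_p + c_i)}{a_i^2 + b_i^2} \tag{8}$$

Each update of $P$ occurs in a specified direction, allowing $P$ to converge quickly. The updated $P$ is represented by Equation (9), where $\alpha_k$ is the step size:

$$P_{k+1} = P_k + \alpha_k \mathbf{d}_k, \quad \mathbf{d}_k = [d_x, d_y]^T \tag{9}$$

Here, $d_x$ and $d_y$ are the iteration directions, determined by Equation (10):

$$\mathbf{d}_k = -H_k \nabla f(P_k) \tag{10}$$

The matrix $H_k$ reflects local curvature information of the objective function near the latest iteration point, providing a more accurate descent direction. The formula for $H_{k+1}$ is given by Equation (11):

$$H_{k+1} = (I - \rho_k \mathbf{s}_k \mathbf{y}_k^T) H_k (I - \rho_k \mathbf{y}_k \mathbf{s}_k^T) + \rho_k \mathbf{s}_k \mathbf{s}_k^T \tag{11}$$

Here, $\mathbf{s}_k$ represents the change vector, $\mathbf{y}_k$ represents the gradient change vector, and $\rho_k$ is used to adjust the update magnitude, ensuring positivity. The specific formulas are as follows:

$$\mathbf{s}_k = P_{k+1} - P_k \tag{12}$$

$$\mathbf{y}_k = \nabla f(P_{k+1}) - \nabla f(P_k) \tag{13}$$

$$\rho_k = \frac{1}{\mathbf{y}_k^T \mathbf{s}_k} \tag{14}$$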
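As a rough illustration of this optimization, the sketch below runs a two-dimensional BFGS iteration on a sum of squared point-to-line distances. Several details are assumptions rather than the paper's implementation: the squared-distance form of the objective, the backtracking (Armijo) line search used to pick the step size, the tolerance and iteration limits, and all function names. Each line is an `(a, b, c)` tuple for a perpendicular bisector `a*x + b*y + c = 0`.

```python
def objective_and_grad(p, lines):
    """Sum of squared distances from p to each line, and its gradient."""
    x, y = p
    f = gx = gy = 0.0
    for a, b, c in lines:
        n2 = a * a + b * b
        r = a * x + b * y + c
        f += r * r / n2
        gx += 2.0 * a * r / n2
        gy += 2.0 * b * r / n2
    return f, (gx, gy)

def bfgs_update(H, s, y):
    """Inverse-Hessian update H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    sy = s[0] * y[0] + s[1] * y[1]
    if sy <= 1e-12:  # skip the update if the curvature condition fails
        return H
    rho = 1.0 / sy
    A = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(2)]
         for i in range(2)]
    AH = [[sum(A[i][k] * H[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(AH[i][k] * A[j][k] for k in range(2)) + rho * s[i] * s[j]
             for j in range(2)] for i in range(2)]

def bfgs_center(lines, p0, tol=1e-9, max_iter=100):
    """Estimate the rotation center, starting from p0 (e.g. the image center)."""
    p = (float(p0[0]), float(p0[1]))
    H = [[1.0, 0.0], [0.0, 1.0]]  # initial inverse-Hessian approximation
    f, g = objective_and_grad(p, lines)
    for _ in range(max_iter):
        if g[0] * g[0] + g[1] * g[1] < tol * tol:
            break
        d = (-(H[0][0] * g[0] + H[0][1] * g[1]),
             -(H[1][0] * g[0] + H[1][1] * g[1]))
        slope = g[0] * d[0] + g[1] * d[1]
        alpha = 1.0
        while True:  # backtracking line search
            p_new = (p[0] + alpha * d[0], p[1] + alpha * d[1])
            f_new, g_new = objective_and_grad(p_new, lines)
            if f_new <= f + 1e-4 * alpha * slope or alpha < 1e-12:
                break
            alpha *= 0.5
        s = (p_new[0] - p[0], p_new[1] - p[1])
        y = (g_new[0] - g[0], g_new[1] - g[1])
        H = bfgs_update(H, s, y)
        p, f, g = p_new, f_new, g_new
    return p
```

Because this particular objective is quadratic in the center coordinates, it also has a closed-form least-squares solution; the iterative form is shown only because it mirrors the update equations in the text.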
After iterative optimization, the optimal point $P^*$ is obtained as the frame rotation center. Lines 6 to 12 rely on the observation that, if the motion between the two frames is relatively close to a frame rotation, the distances from $P^*$ to the two ends of each optical flow vector should be similar. As shown in Figure 3, there will be a certain difference $\Delta d$ between the distance from $P^*$ to $A$ and the distance from $P^*$ to $B$. If $\Delta d$ is too large, this vector will be filtered out. This process is expressed in Equation (15), where $P^*$ is the frame rotation center, and $A$ and $B$ are the two endpoints of the optical flow vector.

$$\Delta d = \left| \, \lVert P^* - A \rVert_2 - \lVert P^* - B \rVert_2 \, \right| \tag{15}$$
Lines 13 to 17 then use the outcome of this filtering. If too many vectors have been discarded, the motion between the two frames does not clearly involve a frame rotation. Otherwise, the angles subtended at the frame rotation center by the two endpoints of the remaining vectors are used to estimate the rotational movement between the current and previous frames. This is shown in Equation (16), where $P^*$ is the frame rotation center, $n$ is the number of remaining vectors, and $A_i$ and $B_i$ represent the two endpoints of each optical flow vector:

$$\theta = \frac{1}{n} \sum_{i=1}^{n} \arccos \frac{(A_i - P^*) \cdot (B_i - P^*)}{\lVert A_i - P^* \rVert_2 \, \lVert B_i - P^* \rVert_2} \tag{16}$$
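The filtering and averaging steps can be sketched as follows, given an already estimated rotation center. This is an illustration under assumptions: the distance threshold, the minimum share of vectors that must survive the filter (`min_kept_ratio`), the use of the mean of unsigned angles, and the function name are all choices made for the example.

```python
import math

def estimate_rotation(center, flows, dist_thresh=2.0, min_kept_ratio=0.5):
    """Return the mean rotation angle in degrees, or None if no rotation is found.

    flows is a list of ((x1, y1), (x2, y2)) optical flow vectors.
    dist_thresh and min_kept_ratio are hypothetical values.
    """
    cx, cy = center
    angles = []
    for (x1, y1), (x2, y2) in flows:
        d1 = math.hypot(x1 - cx, y1 - cy)
        d2 = math.hypot(x2 - cx, y2 - cy)
        # Discard vectors whose endpoints are not roughly equidistant from the center.
        if abs(d1 - d2) >= dist_thresh or d1 == 0.0 or d2 == 0.0:
            continue
        cos_t = ((x1 - cx) * (x2 - cx) + (y1 - cy) * (y2 - cy)) / (d1 * d2)
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_t)))))
    # Too few surviving vectors: the motion is not clearly a frame rotation.
    if not flows or len(angles) < min_kept_ratio * len(flows):
        return None
    return sum(angles) / len(angles)
```

The angle returned here is unsigned, which is sufficient for comparing against a threshold; recovering the direction of rotation would require a signed angle, for example via `atan2`.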
The threshold angle is set at 20 degrees because performance typically starts to degrade once estimated angles exceed 15° in testing. If the estimated angle is larger than the threshold angle, the system assumes that HFNet-generated descriptors should not be used due to potential frame rotation. Instead, it recalculates ORB descriptors for the feature points of the current and previous frames and employs the same method as ORB-SLAM3 for subsequent matching. Conversely, if the angle is smaller, the descriptors from HFNet are used for the current frame, and the descriptors from the previous frame are retrieved from storage. In this case, BOW is still employed to accelerate matching, but the distance calculation shifts from Hamming distance to Euclidean norm, as shown in Equation (17), where $\mathbf{d}_1$ and $\mathbf{d}_2$ represent the descriptors being matched:

$$\mathrm{dist}(\mathbf{d}_1, \mathbf{d}_2) = \lVert \mathbf{d}_1 - \mathbf{d}_2 \rVert_2 \tag{17}$$
After this step, scenarios in which tracking would be lost under frame rotation when using HFNet descriptors are handled with ORB descriptors instead. In summary, the system estimates the frame rotation angle and, based on it, either keeps the HFNet descriptors of the previously extracted feature points or re-extracts ORB descriptors, and it chooses the matching distance accordingly. This step effectively combines the rotation robustness of handcrafted extraction methods with the accuracy advantages of deep features, allowing deep-feature-based systems to navigate through scenarios where significant frame rotations could otherwise lead to a drop in matching performance and tracking loss, as shown in Figure 4.
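The descriptor switch can be summarized in a short sketch. The 20-degree threshold comes from the text; everything else is an assumption for illustration, including the function names, the string labels, the representation of ORB descriptors as `bytes` and HFNet descriptors as sequences of floats, and the treatment of "no rotation detected" (`None`) as a case for HFNet descriptors.

```python
import math

ANGLE_THRESHOLD_DEG = 20.0  # threshold angle stated in the text

def choose_descriptor(rotation_deg):
    """Pick the descriptor type from the estimated frame rotation angle."""
    if rotation_deg is not None and rotation_deg > ANGLE_THRESHOLD_DEG:
        return "orb"
    return "hfnet"

def hamming_distance(a, b):
    """Number of differing bits between two binary descriptors (bytes)."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def euclidean_distance(a, b):
    """L2 distance between two floating-point descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def descriptor_distance(kind, a, b):
    """Dispatch to the distance that matches the chosen descriptor type."""
    if kind == "orb":
        return hamming_distance(a, b)
    return euclidean_distance(a, b)
```

Keeping the choice of descriptor and the choice of distance in one place makes it harder to compare binary and floating-point descriptors with the wrong metric by accident.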
3.4. Loop Closure
ORB-SLAM3, on which our system is based, utilizes the bag-of-words (BOW) method for loop closure detection. However, BOW-based approaches have limited descriptive capabilities for scenes and tend to lose spatial information about depicted objects. Hence, we replace the original BOW method with global descriptors generated by HFNet. When the local mapping thread receives a new keyframe, it computes its global descriptor vector, denoted as $\mathbf{g}$, and saves it to the keyframe library. It then calculates the Euclidean norm ($L_2$-norm) between the global descriptor vectors of the current keyframe and other keyframes in the library:

$$d(\mathbf{g}_i, \mathbf{g}_j) = \lVert \mathbf{g}_i - \mathbf{g}_j \rVert_2$$

Here, $\mathbf{g}_i$ and $\mathbf{g}_j$ represent the global descriptors extracted by HFNet, each consisting of 4096 floating-point numbers. The system computes this distance between the new keyframe and all global descriptors stored in the keyframe library. A smaller $d$ indicates higher similarity between the two frames, increasing the likelihood of a loop closure. The system selects the frames with the highest similarity as loop closure candidates. Following this, in a manner akin to ORB-SLAM3, a geometric verification is conducted on co-visible keyframes to ascertain the occurrence of a loop closure. Our approach effectively leverages the global descriptor capabilities of HFNet, providing more accurate descriptions of keyframes compared to ORB-SLAM3 in experiments. This leads to more accurate loop closure detection, thereby mitigating substantial trajectory deviations.
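The candidate selection can be sketched as a nearest-neighbor search over the stored global descriptors. This is a simplified illustration: the dictionary-based keyframe library, the function names, and the number of candidates `k` are assumptions, and the subsequent geometric verification is not shown.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance between two global descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def loop_candidates(query, keyframe_db, k=3):
    """Return the ids of the k keyframes closest to the query descriptor.

    keyframe_db maps a keyframe id to its global descriptor;
    k is a hypothetical number of candidates.
    """
    scored = ((l2_distance(query, desc), kf_id)
              for kf_id, desc in keyframe_db.items())
    return [kf_id for _, kf_id in heapq.nsmallest(k, scored)]
```

In this sketch every stored keyframe is a possible candidate. A practical system would typically also exclude keyframes that are temporally adjacent to the query before ranking, since they are trivially similar.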