4.2. Performance Evaluation
We conducted a comprehensive quantitative and qualitative analysis of 11 state-of-the-art segmentation models, including the three benchmark models we developed. This thorough analysis involved intra-dataset performance assessments across five diverse datasets, taking into account various factors such as demography, NIR and VIS illumination, sensor types, collection setups, and environmental conditions. Furthermore, we assessed the models’ ability to detect noise (eyelash detection), their cross-dataset performance, and practical usability.
In addition to intra-dataset performance, we placed emphasis on cross-dataset evaluation to assess the generalization capability of the models. This is a crucial aspect for biometric systems that need to operate effectively on unseen data in real-world applications. Our benchmarking methodology also accounts for practicability by evaluating inference time and computational efficiency, both of which are essential for real-time deployment in biometric systems. These factors together ensure that the models not only excel in accuracy but are also robust and practical for deployment.
(A) Intra-dataset performance evaluation: A detailed summary of intra-dataset performance is provided in Table 4, with key observations for each dataset given below:
Composite subset (NIR): Our benchmark model, U-Net, achieved the highest mIoU score of 91.70% and the second-highest F1 score of 94.04%. Our U-Net++ model secured the second-highest mIoU of 91.41% and the highest F1 score of 94.20%. Lozej et al. (FT-10) and Lozej et al. (FT-ES) posted mIoU scores of 91.13% and 91.16%, with F1 scores of 93.62% and 94.01%, respectively. Our U-Net with weight map model achieved 91.03% mIoU and 93.80% F1 score. Trokielewicz et al. (FT), OSIRIS, and USIT achieved mIoU scores of 85.09%, 85.98%, and 87.30%, and F1 scores of 88.14%, 87.17%, and 87.10%, respectively, indicating that these three methods are less suited to this dataset than the other DL-based models.
CASIA Thousand (NIR): Our U-Net++ model achieved the highest mIoU score of 95.26% and the second-highest F1 score of 95.22%. Our U-Net model also demonstrated comparable results. The models developed by Lozej et al. achieved mIoU scores similar to those of U-Net and U-Net++, albeit with a lower F1 score of 94.61%. The FCN and GAN models by Bezerra showed robust performance, with F1 scores of 94.42% and 95.38% (highest among all models), respectively, suggesting they are highly effective for this dataset. Trokielewicz et al. (FT) and traditional methods such as OSIRIS v4.1 and USIT (Wahet) performed comparatively worse, with mIoU scores of 82.98%, 88.51%, and 80.83%, and F1 scores of 89.52%, 87.78%, and 81.62%, respectively.
CASIA Distance (NIR): Our U-Net++ model achieved the highest mIoU and F1 scores of 94.72% and 94.51%, respectively. The U-Net model achieved a mIoU of 94.48% and an F1 score of 93.58%. The models of Lozej et al. (FT-10) and Lozej et al. (FT-ES) achieved mIoU scores of 92.80% and 93.56% and F1 scores of 92.20% and 93.04%, respectively. The models developed by Wang [7] and Wang [23] achieved F1 scores of 94.25% and 94.30%, respectively. However, Wang’s [23] model achieved a lower mIoU score (89.40%) compared with our models and Lozej’s models. The model from Trokielewicz et al. (FT) recorded a mIoU of 79.38% and an F1 score of 85.28%, making it less effective than the other deep learning (DL) models. The traditional method, OSIRIS v4.1, achieved a mIoU of 83.38% and an F1 score of 82.87%, demonstrating that traditional approaches are less effective for this dataset than DL-based models. USIT (Wahet) achieved a mIoU of 70.34% and an F1 score of 72.45%, indicating its unsuitability for this dataset and further underscoring the advantage of DL-based methods here.
UBIRIS.v2 NICE.I (VIS): Our U-Net and U-Net++ models, along with Lozej’s models, achieved F1 scores of 90.89%, 90.87%, and 90.66%, respectively. In terms of mIoU, our U-Net++ and U-Net models recorded the highest (91.86%) and second-highest (91.78%) scores, respectively, while Lozej’s model achieved a mIoU of 91.04%. Wang et al. reported the highest F1 score of 91.78% for this dataset, indicating top performance among the evaluated models. Additionally, Bezerra’s GAN demonstrated strong performance with an F1 score of 91.42%. Bezerra’s FCN, with an F1 score of 88.20%, and the Trokielewicz et al. (FT) model, with an F1 score of 85.28%, were less effective than the other deep learning models on this dataset. Traditional methods, such as OSIRIS v4.1 and USIT (Wahet), were found to be unsuitable for this dataset, achieving mIoU and F1 scores ranging from 20% to 43%.
MICHE-I (VIS): Our benchmark models U-Net and U-Net++ stood out with the highest and second-highest mIoU scores of 92.98% and 92.94%, respectively, complemented by F1 scores of 92.27% and 92.82%. Lozej’s model achieved a mIoU of 92.36% and an F1 score of 91.83%. Trokielewicz’s model was less effective, achieving a mIoU score of 84.21% and an F1 score of 83.42%. Bezerra’s FCN and GAN models showed F1 scores of 83.03% and 87.20%, indicating they were less accurate than the U-Net variants. Traditional methods, OSIRIS v4.1 and USIT (Wahet), performed considerably worse, with F1 scores of 32.48% and 26.03%, respectively, underscoring their unsuitability for this dataset.
In summary, the performance of all U-Net variant models, including those developed by us, Lozej, and Wang, was comparable and consistent across both NIR and VIS datasets, with minimal differences in accuracy observed. The FCN model demonstrated strong performance on the CASIA Thousand dataset but was less effective on the UBIRIS.v2 NICE.I (VIS) and MICHE-I (VIS) datasets, indicating its reduced efficacy for VIS datasets. Similarly, GAN-based models exhibited robust performance for the CASIA Thousand and UBIRIS.v2 NICE.I (VIS) datasets but were less effective on the MICHE-I (VIS) dataset compared with the U-Net variants. DL-based iris segmentation methods, with the exception of the model developed by Trokielewicz et al. [24], consistently outperformed traditional methods such as OSIRIS [12] and USIT [32] across all evaluated metrics for each dataset. Trokielewicz et al. [24] surpassed traditional methods in all metrics on VIS datasets and achieved higher F1 scores, albeit with a lower mIoU score than OSIRIS, for NIR datasets. Initially designed for postmortem data, the model by Trokielewicz et al. [24] notably outperformed OSIRIS on postmortem datasets. However, our observations lead to the conclusion that a model tailored for postmortem iris analysis cannot be optimally fine-tuned for live iris datasets, resulting in its underperformance compared with more recent DL-based models and traditional iris segmentation methods on NIR datasets with live subjects. These findings further underscore that traditional methods are not well-suited for VIS datasets.
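For readers reproducing the scores discussed above, the following is a minimal sketch of how IoU and F1 can be computed from binary iris masks. It assumes that mIoU is the iris-class IoU averaged over test images; this section does not restate the averaging convention, so that choice, along with the function names, is an assumption made for illustration only.

```python
from typing import Sequence, Tuple

# A mask is a 2-D grid of 0/1 values: 1 = iris pixel, 0 = background.
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for the iris class."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    # Assumption: two empty masks count as perfect agreement.
    return tp / denom if denom else 1.0


def f1(pred: Mask, truth: Mask) -> float:
    """F1 (Dice) score: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_scores(preds: Sequence[Mask], truths: Sequence[Mask]) -> Tuple[float, float]:
    """Average per-image IoU and F1 over a test set (assumed mIoU convention)."""
    ious = [iou(p, t) for p, t in zip(preds, truths)]
    f1s = [f1(p, t) for p, t in zip(preds, truths)]
    return sum(ious) / len(ious), sum(f1s) / len(f1s)
```

Because F1 = 2·IoU / (1 + IoU) for a single image, the two metrics rank individual predictions identically; differences in model ranking between the mIoU and F1 columns can arise only after averaging over images.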
(B) Assessment of eyelash detection capability: To evaluate the models’ capability in eyelash detection, we separated 160 samples with eyelash occlusions from the test set of the composite subset dataset and used mIoU as the evaluation metric. In Table 4, mIoU EL denotes the score on these 160 samples, while mIoU denotes the score on the entire test set of 330 samples. Our three models and Lozej’s model demonstrated strong (>90%) accuracy and comparable performance in eyelash occlusion detection. Despite these high accuracy rates, all models performed worse on the samples containing eyelash occlusions than on the complete test set. The limited number of training masks with annotated eyelashes could be the reason for the lower scores, and adding more such masks may improve this performance.
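The relationship between the two columns can be sketched as follows, assuming per-image IoU scores are already available and each test image carries a flag marking eyelash occlusion. The function and key names are hypothetical.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def miou_by_subset(
    scores: Sequence[float],
    has_eyelash_occlusion: Sequence[bool],
) -> Dict[str, Optional[float]]:
    """Aggregate per-image IoU over the full test set and the eyelash subset."""
    if len(scores) != len(has_eyelash_occlusion):
        raise ValueError("scores and flags must have the same length")
    if not scores:
        raise ValueError("at least one score is required")
    occluded = [s for s, flag in zip(scores, has_eyelash_occlusion) if flag]
    return {
        "mIoU": mean(scores),  # entire test set
        "mIoU_EL": mean(occluded) if occluded else None,  # eyelash subset only
    }
```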
(C) Cross-dataset performance evaluation: Generalization capability is a cornerstone of biometric recognition systems, ensuring that models can effectively adapt to and classify new, unseen data. To assess the generalization capability of the implemented models, we conducted cross-dataset testing. The models, fine-tuned on the composite subset dataset, were tested on the CASIA Distance dataset without any further tuning. The CASIA Distance dataset was not included in our composite or composite subset dataset. Its images were captured from a three-meter distance under moving conditions with NIR illumination, making it distinct from the other CASIA datasets and unseen by the models trained on the composite or composite subset dataset. The cross-dataset performance is shown in Table 5.
Table 5 indicates that all of our proposed models exhibited strong generalization capabilities and outperformed all other implemented methods. The U-Net model achieved a mIoU of 0.9231 and an F1 score of 91.43%, while the U-Net++ model showed a slight improvement with a mIoU of 0.9331 and an F1 score of 92.01%. These results establish a standard benchmark for generalization capability in iris segmentation models. Conversely, the original model from Lozej et al. showed limited adaptability, with a mIoU of 0.4347 and an F1 score of 6.67% when applied to the CASIA Distance dataset. However, once fine-tuned on our composite subset dataset, the same model’s performance was significantly enhanced, reaching a mIoU of 0.9080 and an F1 score of 89.72%. This demonstrates that even models initially lacking in generalization can approach benchmark standards with an appropriate training dataset. Similarly, Trokielewicz et al.’s original model was outperformed by our benchmark models when tested on the CASIA Distance dataset, achieving a mIoU of 0.7398 and an F1 score of 79.28%. After fine-tuning, there was a notable increase to a mIoU of 0.8137 and an F1 score of 87.91%, though this was still below the performance of our benchmark models. Overall, the existing open-source DL segmentation models, as originally released, have limited generalization capability. These results underscore the importance of comprehensive and diverse training data in developing models that excel not only in familiar conditions but also maintain high accuracy when faced with new and challenging datasets. Our benchmark models serve as a robust standard for future work on iris segmentation, guiding researchers toward more adaptive and reliable biometric recognition systems.
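To make the effect of fine-tuning easier to compare, the short sketch below tabulates the gains from the figures quoted in this paragraph. Only the four pairs of numbers come from the text; the dictionary layout and function name are illustrative. mIoU values are converted from fractions to percentage points so that both metrics are on the same scale.

```python
from typing import Dict, Tuple

# (mIoU as a fraction, F1 in percent) on CASIA Distance, as quoted in the text.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Lozej et al.": {"original": (0.4347, 6.67), "fine_tuned": (0.9080, 89.72)},
    "Trokielewicz et al.": {"original": (0.7398, 79.28), "fine_tuned": (0.8137, 87.91)},
}


def fine_tuning_gains(
    results: Dict[str, Dict[str, Tuple[float, float]]]
) -> Dict[str, Dict[str, float]]:
    """Return the gain from fine-tuning, in percentage points, for each model."""
    gains = {}
    for model, runs in results.items():
        miou_before, f1_before = runs["original"]
        miou_after, f1_after = runs["fine_tuned"]
        gains[model] = {
            "mIoU_gain_pp": round((miou_after - miou_before) * 100, 2),
            "F1_gain_pp": round(f1_after - f1_before, 2),
        }
    return gains


if __name__ == "__main__":
    for model, gain in fine_tuning_gains(RESULTS).items():
        print(model, gain)
```

With these inputs, fine-tuning on the composite subset raises the Lozej et al. model by about 47.3 mIoU points and 83.1 F1 points, and the Trokielewicz et al. model by about 7.4 mIoU points and 8.6 F1 points.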
(D) Assessment of practicability: The practicability of the DL models is evaluated in terms of the number of parameters, storage space, and inference time (the duration required for a single prediction) of all models, as summarized in Table 6. These factors are critical for the deployment of models in practical applications, where efficiency and minimal resource consumption are often as important as accuracy. We measured the average inference time on a desktop configured with an Intel Core i9-12900K CPU, with all models evaluated in .h5 format. Our evaluation indicates that the Wang et al. model requires the lowest storage space at 100 MB, while Lozej’s model requires 13 times more space at 13.8 GB. The Trokielewicz model has the fastest inference time at 0.004 s with a storage space of 104 MB. In contrast, Lozej’s model requires the highest inference time of 0.294 s. Our U-Net and U-Net with weight map models have comparable storage space (360 MB) and inference time (0.02 s). Our U-Net++ model requires a storage space of 414 MB and an inference time of 0.269 s. Overall, all models present significant computational and storage demands, making their implementation on mobile devices challenging. Therefore, further optimization of these models is required.
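The timing procedure is not detailed in this section, so the following is only a sketch of one common way to measure average single-prediction latency. The `predict` callable stands in for a loaded model's forward pass, and the warm-up step is an assumption, included because first calls often carry one-off initialization cost.

```python
import time
from statistics import mean
from typing import Any, Callable, Iterable


def average_inference_time(
    predict: Callable[[Any], Any],
    inputs: Iterable[Any],
    warmup: int = 3,
) -> float:
    """Return the mean wall-clock time, in seconds, of one call to `predict`."""
    samples = list(inputs)
    if not samples:
        raise ValueError("at least one input sample is required")

    # Warm-up calls are executed but not timed (assumed protocol).
    for sample in samples[:warmup]:
        predict(sample)

    durations = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Timing each image individually, rather than timing a batch and dividing, matches the definition of inference time given above as the duration of a single prediction.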
(E) Visual evaluation: Figure 5 shows the segmentation results of the implemented methods on different datasets. Generally, noise factors such as eyelashes, eyelid occlusion, specular reflections, uneven illumination, off-angle irises, and small irises make the segmentation task difficult. To evaluate the models’ performance on these edge cases, we visually assessed all challenging iris images from all the datasets. The findings are summarized here:
DL methods showed improved noise masking capability compared with the traditional methods (OSIRIS and USIT) for cases like eyelid and eyelash occlusions, high and low illuminated samples, variable pupil dilation, smaller iris area, eyeglasses, and off-angle iris.
A few incorrect segmentations were observed for cases with eyeglasses, eyelash occlusion, and off-angle iris samples. Limited training samples representing these cases might be the cause of these errors. Hence, adding more annotated training images with those cases may improve the segmentation performance.
OSIRIS showed limited capability in eyeglass and eyelash detection.
USIT could not detect eyelashes.
Both OSIRIS and USIT performed poorly against off-angle iris.
Note: For visual comparison, we only compared OSIRIS and USIT’s performance on the NIR dataset, as they were unsuitable for VIS datasets.