1. Introduction
Korean pine nut is the seed of Pinus koraiensis Sieb. et Zucc., which is resistant to cold and prefers slightly acidic or neutral soil. It is mainly produced in the Changbai Mountain area in northeast China, including Jilin and Xiaoxing’anling, with an altitude range of 150–1800 m, in forests with warm, cold, and humid climates. It is also distributed in Japan (Honshu), Korea, and Russia (Amur, Khabarovsk) [1]. Pine nuts are rich in unsaturated fatty acids beneficial for human health [2], which makes them and derived products such as pine nut oil popular among consumers.
The unsaturated fatty acids in Korean pine nuts are an important indicator of their nutritional value, and Korean pine nuts consequently occupy a high position among nut foods. From the perspective of consumption, the fat content is a direct indicator of the fatty acid content and the oil yield [2,3]. Therefore, the fat content can be used as the detection target. Traditional chemical detection methods are laborious and time-consuming, which makes them impractical for testing large quantities, and improper treatment of waste liquid can pollute the environment. Rapid and nondestructive detection of pine nut fat content can help classify its edible grade. The principal use of near-infrared spectroscopy (NIRS) as an indirect analysis method is the regression-based determination of the content of specific substances in samples. Research on NIRS mainly focuses on obtaining accurate detection results as quickly as possible, ideally in real time. By reducing the number of steps in the detection process, simplifying operation, avoiding large amounts of waste and harmful reagents, and making precise detection less dependent on strict experimental conditions, NIRS is expected to become widely adopted, with wide sample coverage and low equipment and operating thresholds for manufacturers. This nondestructive detection technology is capable of detecting the fat content of Korean pine nuts [4].
Currently, there is limited research on the application of NIRS for analyzing Korean pine nuts. However, owing to advantages such as rapidity, environmental friendliness, and ease of operation, NIRS has found widespread use in analyzing agricultural and food products [4,5,6]. In recent years, researchers have increasingly utilized spectroscopic techniques to analyze the fat content in various foods, such as soybeans, meat, and dairy products [7,8,9,10]. NIRS can be used to analyze samples using diffuse reflectance spectroscopy. Spectroscopy combined with chemometrics has been extensively employed in testing nut quality, encompassing qualitative tasks such as variety identification and adulteration detection, as well as quantitative analysis of substance content. Existing research demonstrates the traceability of multiple varieties of walnuts from different production areas using NIRS [11]. In a study on peanuts and blocky nuts, NIRS successfully distinguished among peanuts, pine nuts, almonds, sesame seeds, and flax seeds [12]. Moreover, in the domain of substance content analysis, NIRS accurately detects constituents present at higher levels in nuts, such as protein and water, and characterizes constituents present at lower levels, such as water-soluble sugars and aflatoxin B1 (AFB1) [13,14,15,16,17]. Additionally, NIRS can quantify the unique crispy texture of nuts, corresponding to physical properties such as fracture force, hardness, and elasticity modulus [14].
The prediction and classification of nuts have relied on various methods, including statistical techniques such as multiple linear regression (MLR), partial least squares (PLS) regression, and the naive Bayes algorithm; chemometric techniques such as first and second derivative, multiplicative scatter correction (MSC), and standard normal variate (SNV) algorithms [18,19]; and machine learning techniques such as different types of kernel smoothing methods, boosting methods, and additive models [20,21,22]. These models typically operate in a batch learning or offline learning mode. Traditional batch-style machine learning methods, however, have two significant limitations: (1) they exhibit low efficiency in terms of time and space costs; and (2) they scale poorly to large-scale applications because the models typically require retraining from scratch when new training data arrive.
Online learning, a subfield of machine learning, differs from traditional batch-style machine learning in that it aims to learn incrementally from sequential data. Online learning algorithms are easy to understand and implement, are typically built on theories with rigorous regret bounds [23], and can immediately update the prediction model when new data arrive. Therefore, in large-scale food inspection, when the test data are input to the model sequentially and the detection target may drift or evolve over time, online learning algorithms are usually more efficient and scalable than offline learning algorithms.
Online learning includes unsupervised learning and supervised learning, with unsupervised learning mostly using methods such as kernel PCA [24], kernel ICA [25], and manifold learning [26]. However, the spectral band extraction techniques for NIRS, including uninformative variable elimination (UVE), Monte Carlo uninformative variable elimination (MC-UVE), the successive projections algorithm (SPA), and the competitive adaptive reweighted sampling (CARS) algorithm [27], do not currently have incremental online variants. To address this issue, we propose an online detection model based on recursive partial least squares (RPLS). We utilize UVE to extract spectral bands and adjust model parameters to align with the requirements of online learning. Additionally, we design an improved online preprocessing method to calibrate raw spectra. The main contributions are as follows:
(1) To address the issue of independent scatter correction not allowing the addition of new samples outside the original modeling dataset, we propose an online multiplicative scatter correction (OMSC) preprocessing algorithm. Inspired by the reference spectrum in MSC, we design a dynamic reference spectrum that can change with variations in the detection samples, enabling online correction of the original spectra.
(2) To address the constantly changing dataset during online detection, which in turn causes the selected feature bands to change, we use UVE to extract the spectral feature bands and expand the number of bands in the feature subset. During the iterative update process of the model, we analyze the impact of parameter settings on the coverage range of the selected feature bands and verify the necessity of adjusting the number of features.
(3) To detect newly added pine nut samples without rebuilding the model, and to solve the problem of the original regression model performing poorly on samples from different batches, we conducted research on the sustainable use of offline models and established an online detection model based on RPLS.
The rest of this article is structured as follows. In Section 2, we describe the experimental setup, sample preparation, and details of the proposed method. Section 3 presents extensive experiments conducted on the dataset prepared for this study to validate the effectiveness of the online learning model. In Section 4, we compare the prediction performance of the original offline model with that of the online model on new samples. Furthermore, we examine whether the online approach proposed in this study enhances the model’s generalization ability.
2. Materials and Methods
2.1. Preparation of Materials and Dataset Partitioning
In accordance with the research requirements, the samples needed for the experiment were all purchased from the main production areas of Pinus koraiensis in Northeast China. The sample preparation mainly included pine nut selection, shelling, and kernel separation. Based on the principles of random sampling, chemometrics, and machine learning modeling requirements, the final sample size was determined. After the pine cones had matured, 100 mature and well-preserved pine nuts were randomly sampled as one group, with each group weighing about 20 g. A total of 120 groups of samples were collected and placed in separate sealed bags, numbered from #1 to #120. These samples were used to establish the offline model. For the online learning model, 75 new samples were needed, which were purchased from different batches of pine nuts. Starting from the first purchase, a small batch of freshly picked pine nuts was purchased from farmers every 3 days, and 5 groups of new samples were made following the above experimental steps. A total of 75 groups of samples were collected, numbered from #1 to #75. The sampling process and the final prepared samples are shown in Figure 1. All samples were properly stored away from light, waiting for the next step of spectral collection and chemical experiments.
During the data acquisition phase of pine nut processing, we conducted spectral detection, comparison, and analysis of pine nuts with and without their skins. Figure 2a,b illustrate the spectral data for skinned and unshelled pine nuts, respectively. It is evident that the spectral trends and absorption peaks are largely consistent between the two sets. Considering the conclusions drawn from Figure 2a,b, along with the ability of near-infrared spectroscopy to penetrate materials up to 0.1 mm [28], we ultimately decided to use unshelled pine nuts with skins for detection to ensure the integrity of the samples.
It should be noted that quantitative analysis of the fat content requires coordination with chemical analysis. Chemical analysis requires a certain amount of the sample to undergo a series of reactions and extractions to obtain valid data. Considering the potential loss during chemical analysis and the feasibility of collecting spectral data, each sample was standardized to 20 g.
2.2. Spectral Data Collection and Chemical Experiments
The NIRQuest512 near-infrared spectrometer from Ocean Optics was selected for spectral data acquisition due to its robustness, high signal-to-noise ratio, high resolution, and ability to acquire high-dimensional spectral data. The spectrometer operates within a wavelength range of 900 nm to 1700 nm, which covers the spectral information needed to analyze the chemical bonds in the fats of pine nuts. The light source for the spectrometer was the HL-2000 Tungsten Halogen Light Source. To minimize light leakage and ensure accurate data collection, we maintained close contact between the pine nut sample and the reflection probe fixture. The fiber optic accessories included VIS-NIR fibers with core sizes of 200, 400, and 800 μm. The reflection probe had configurations with one fiber for illumination and three fibers for collection. Additionally, the entrance slit of the spectrometer was 50 μm, with a pixel size of 50 × 300 μm.
Sampling points with uniform texture were randomly selected, and spectra were acquired when the spectral curve became clear, stable, and exhibited no significant fluctuations. Each data point was averaged over three scans, and this process was repeated to collect 100 samples, with the mean calculated as the raw spectral data for each set of samples. The NIRQuest512 spectrometer is accompanied by SpectraSuite® software, which facilitates sampling, averaging, and exporting commands using scripts. All data were exported to an Excel file for storage and further analysis.
Subsequently, the 195 sets of samples with completed spectral data collection underwent fat content detection using the Soxhlet extraction method. The Soxhlet extraction method is widely recognized as the standard method for measuring fat content due to the principle that fat readily dissolves in organic solvents. After extracting the sample directly with anhydrous ether or petroleum ether, the solvent is evaporated, and the residue is dried to a constant weight, allowing for the calculation of the free fat content. The main steps include processing the pine nuts, packaging the pine nut powder, drying the samples, extracting the samples, weighing the extracted material, and calculating the results. It is important to note that the number of extraction cycles was set based on experiments with other nuts. Since pine nuts have a high fat content, to ensure data accuracy, the extraction cycle was set to 72 times as per the standard. The experiment was conducted in accordance with national food safety standards (GB5009.6-2016) [29], with fat content determination certified by the Heilongjiang Institute of Quality Supervision and Testing.
2.3. Data Analysis
2.3.1. Offline and Online Preprocessing of Spectral Data
MSC is a preprocessing algorithm designed to mitigate the scattering effects caused by surface properties of samples, such as variations in refractive index, particle size, and surface roughness. It is particularly suitable for diffuse reflectance spectroscopy due to its ability to effectively remove unwanted scattering effects from spectral data, thereby enhancing the accuracy of quantitative analysis. According to reference [30], MSC has demonstrated effective performance in processing the spectra of pine nuts compared to other preprocessing algorithms designed to mitigate scattering effects. For a specific spectrum, the MSC algorithm can be performed as follows.
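First, the average spectrum $\bar{x}$ of the calibration set samples is calculated. Then, a linear regression operation, given by Equation (1), is conducted between each individual spectrum $x_i$ and the average spectrum $\bar{x}$.
$$x_i = k_i\,\bar{x} + b_i \tag{1}$$
Calculate the slope $k_i$ and intercept $b_i$ based on the principle of least squares. Then, obtain the corrected spectrum $x_{i(\mathrm{MSC})}$, which is given by Equation (2).
$$x_{i(\mathrm{MSC})} = \frac{x_i - b_i}{k_i} \tag{2}$$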
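The two MSC steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `msc`, the array layout (one spectrum per row), and the synthetic spectra are assumptions made for illustration only.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative scatter correction for an (n_samples, n_wavelengths) array."""
    spectra = np.asarray(spectra, dtype=float)
    # Reference spectrum: the calibration-set mean unless one is supplied.
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # Least-squares fit x ~ k * ref + b (np.polyfit returns the slope first).
        k, b = np.polyfit(ref, x, 1)
        # Remove the additive offset, then the multiplicative scatter.
        corrected[i] = (x - b) / k
    return corrected, ref


# Synthetic illustration (not measured data): one base curve distorted by a
# per-sample slope and offset, which is the effect MSC is meant to remove.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 1.5
slopes = np.array([0.8, 1.0, 1.3])
offsets = np.array([0.05, -0.02, 0.10])
raw = slopes[:, None] * base + offsets[:, None]
corrected, ref = msc(raw)
```

Because the synthetic spectra differ only by slope and offset, every corrected row collapses onto the reference spectrum; real spectra retain their chemical differences after correction.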
During the research process, it was observed that, besides the inherent differences in physical and chemical properties between new and old pine nuts, the spectral data collection intervals varied significantly among different batches of samples. This resulted in distinct initial conditions for data collection in experiments. Experimental validation indicated that simply merging new and old data for scattering correction did not yield satisfactory results when predicting the behavior of the new sample set. In online model research, where the sample set is real-time and dynamic, it is imperative that new data undergo independent scattering correction from the original dataset. Additionally, since the preprocessed new data will iteratively update the parameters of the original predictive model, it is crucial that preprocessing does not deviate significantly from the original modeling dataset. The real-time and dynamic nature of new sample data necessitates online preprocessing. To address these challenges, we propose an enhanced version of the MSC algorithm, termed OMSC.
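When the $i$-th new sample spectrum $x_i$ enters the preprocessing stage, the mean value of all $i$ sample data is:
$$\bar{x}_i = \frac{1}{i}\sum_{j=1}^{i} x_j = \frac{(i-1)\,\bar{x}_{i-1} + x_i}{i} \tag{3}$$
According to the MSC principle, the data of the $i$-th sample obtained after MSC preprocessing is:
$$x_{i(\mathrm{MSC})} = \frac{x_i - b_i}{k_i} \tag{4}$$
At this point, $x_i$ has not been corrected and, by applying the least squares method to obtain $k_i$ and $b_i$, we have:
$$(k_i, b_i) = \arg\min_{k,\,b}\left\| x_i - \left(k\,\bar{x}_i + b\,\mathbf{1}\right) \right\|_2^2 \tag{5}$$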
Theorem 1. Regret is the difference between the cumulative actual loss and the minimum loss under a fixed hypothesis known in advance. It can be represented as:
$$\mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w}\sum_{t=1}^{T} \ell_t(w)$$
where $\ell_t$ is the loss at iteration $t$, $w_t$ is the hypothesis used by the online algorithm at that iteration, and $w$ ranges over the fixed hypotheses. In general [31], the regret bound is defined as the upper bound corresponding to the worst-case regret value of a certain online learning algorithm. If the regret bound of a certain online learning algorithm is a sub-linear function with respect to the number of iterations $T$, that is, $\mathrm{Regret}_T = o(T)$, then this online learning algorithm can be considered ideal because, as $T$ tends to infinity, the average per-iteration losses of the optimal offline algorithm and the online learning algorithm can be considered approximately equal. The proposed OMSC in this study does not affect the convergence of online learning algorithms, and this will be proven next.
Theorem 2. Suppose the maximum deviation of the near-infrared spectral absorbance at the same wavelength between the samples is E, where E is a positive constant.
Theorem 3. If the upper bound of the regret value for the online learning algorithm in this study is R, then after preprocessing the newly added sample data with OMSC, the upper bound of the regret value for the final algorithm is .
Proof. To prove it by contradiction, try to assume that the statement is false; proceed from there, and at some point, you will arrive at a contradiction.
When the new sample data
are preprocessed, it will cause a slight variation in
, and the maximum magnitude of the variation will not exceed
.
According to the principle of the MSC algorithm,
, it can be inferred that:
Therefore, the loss function,
, and we can obtain:
□
The pseudocode of the OMSC algorithm is represented in Algorithm 1.
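Algorithm 1: The OMSC pseudo-code.
Input: A set of NIR spectra collected for $i$ samples, $X_i = \{x_1, \dots, x_i\}$; $\bar{x}_i$, the reference NIR spectrum of $X_i$; a new NIR spectrum $x_{i+1}$
Output: MSC-transformed spectrum $x_{i+1(\mathrm{MSC})}$; reference NIR spectrum $\bar{x}_{i+1}$
1: Compute the reference spectrum of the $i+1$ samples: $\bar{x}_{i+1} = (i\,\bar{x}_i + x_{i+1})/(i+1)$
2: Residual: $e = x_{i+1} - (k\,\bar{x}_{i+1} + b)$
3: Compute $k$ and $b$: $(k, b) = \arg\min_{k,\,b}\|e\|_2^2$
4: $x_{i+1(\mathrm{MSC})} = (x_{i+1} - b)/k$
5: return $x_{i+1(\mathrm{MSC})}$, $\bar{x}_{i+1}$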
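A minimal sketch of how the steps of Algorithm 1 might be coded is given here. It assumes that the dynamic reference spectrum is the running mean of all spectra seen so far, updated before the new spectrum is regressed against it; the class name `OnlineMSC`, its method names, and the synthetic spectra are hypothetical and not taken from the authors' code.

```python
import numpy as np


class OnlineMSC:
    """Scatter correction against a running-mean reference (illustrative only)."""

    def __init__(self, reference, n_samples):
        # Reference spectrum and sample count inherited from the offline model.
        self.reference = np.array(reference, dtype=float)
        self.n_samples = int(n_samples)

    def correct(self, x_new):
        x_new = np.asarray(x_new, dtype=float)
        # Step 1: fold the new spectrum into the running-mean reference.
        self.n_samples += 1
        self.reference += (x_new - self.reference) / self.n_samples
        # Steps 2-3: least-squares slope k and intercept b of x_new on the reference.
        k, b = np.polyfit(self.reference, x_new, 1)
        # Step 4: corrected spectrum.
        return (x_new - b) / k


# Synthetic illustration (not measured data).
base = np.sin(np.linspace(0.0, 3.0, 50)) + 1.5
calibration = np.array([k * base + b for k, b in [(0.9, 0.02), (1.0, 0.0), (1.1, -0.03)]])
omsc = OnlineMSC(calibration.mean(axis=0), len(calibration))
new_spectrum = 1.4 * base + 0.08
corrected = omsc.correct(new_spectrum)
```

Only the reference spectrum and a sample counter are stored, so the original calibration spectra need not be kept once the offline model has been built.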
After the scattering correction, the spectral data are subjected to the S–G convolution smoothing process. This algorithm utilizes polynomial fitting and the least squares method to calculate the weighted average value of wavelength points within the window [32]. Its purpose is to eliminate the high-frequency noise carried by the original spectral data. The result of S–G algorithm preprocessing is not affected by other samples in the dataset. In both offline and online model studies, the data can be directly smoothed after scattering correction.
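The S–G weights can be derived directly from the least-squares polynomial fit described above, as the following NumPy sketch shows. The window width and polynomial order used in this study are not stated here, so the values 5 and 2 are assumptions, as are the function names and the edge-padding choice.

```python
import numpy as np


def savgol_coefficients(window, polyorder):
    """Weights that return the fitted polynomial's value at the window centre."""
    if window % 2 == 0 or polyorder >= window:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Design matrix with columns 1, z, z^2, ... for the local polynomial fit.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps the window values to the constant term,
    # i.e. the smoothed value at the centre point.
    return np.linalg.pinv(design)[0]


def savgol_smooth(spectrum, window=5, polyorder=2):
    """Smooth one spectrum; the ends are handled by repeating the edge values."""
    spectrum = np.asarray(spectrum, dtype=float)
    weights = savgol_coefficients(window, polyorder)
    padded = np.pad(spectrum, window // 2, mode="edge")
    return np.convolve(padded, weights[::-1], mode="valid")


# Illustration: a quadratic curve is reproduced exactly away from the edges.
wavelengths = np.linspace(900.0, 1700.0, 81)
clean = 1e-6 * (wavelengths - 1300.0) ** 2
smoothed = savgol_smooth(clean, window=5, polyorder=2)
```

Each spectrum is smoothed on its own, which is why this step needs no online variant. In practice, SciPy's `scipy.signal.savgol_filter` provides the same smoothing with more options for edge handling.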
2.3.2. Feature Extraction Methodology
The UVE algorithm can screen and remove the invalid information carried by the full spectrum data. It reduces the data size to within a reasonable limit while retaining as much effective information as possible. The basic principle of UVE is to introduce a random noise matrix into the spectral data matrix and obtain the PLS regression model by cross-validation [33]. Because the noise matrix and the original spectral data matrix have the same dimension, the regression coefficient matrix can be obtained, denoted as $B$. There exists a linear relationship between the spectral data matrix $X$ and the fat content vector $y$ as follows:
$$y = Xb + e \tag{6}$$
In the equation above, $b$ is the regression coefficient vector, and $e$ represents the error vector. For each variable, the mean of its coefficient $b_i$ across the vectors in matrix $B$ is divided by the corresponding standard deviation to obtain $C_i$:
$$C_i = \frac{\mathrm{mean}(b_i)}{\mathrm{std}(b_i)} \tag{7}$$
Here, $i$ denotes the $i$-th column of the spectral data matrix, and $\mathrm{mean}(b_i)$ and $\mathrm{std}(b_i)$ represent the mean and standard deviation of the corresponding regression coefficient, respectively. By judging the absolute value of $C_i$, we consider whether to retain the $i$-th column vector in the spectral data matrix.
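The UVE procedure can be sketched as follows. This is a simplified illustration rather than the implementation used in this study: it assumes leave-one-out resampling, a small NIPALS PLS1 routine written inline, a noise scale of 1e-10, and a cutoff equal to the largest |C| among the noise variables. All function names, parameter values, and the synthetic data are hypothetical.

```python
import numpy as np


def pls1_coefficients(X, y, n_components):
    """NIPALS PLS1 on mean-centred data; returns the regression vector b."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        P[:, a] = Xr.T @ t / tt
        q[a] = yr @ t / tt
        W[:, a] = w
        # Deflate X and y before extracting the next latent variable.
        Xr = Xr - np.outer(t, P[:, a])
        yr = yr - q[a] * t
    return W @ np.linalg.solve(P.T @ W, q)


def uve_select(X, y, n_components=2, noise_scale=1e-10, seed=0):
    """Return a boolean mask of retained variables and their reliability values."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Append a noise matrix with the same dimension as X.
    augmented = np.hstack([X, rng.random((n, p)) * noise_scale])
    # One coefficient vector per left-out sample forms the matrix B.
    B = np.array([
        pls1_coefficients(np.delete(augmented, k, axis=0), np.delete(y, k), n_components)
        for k in range(n)
    ])
    C = B.mean(axis=0) / B.std(axis=0)
    cutoff = np.abs(C[p:]).max()  # assumed rule: largest |C| among noise variables
    return np.abs(C[:p]) > cutoff, C[:p]


# Synthetic illustration: only the first of eight variables carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = 3.0 * X[:, 0] + 0.05 * rng.normal(size=40)
keep, reliability = uve_select(X, y)
```

Loosening the cutoff, for example by using a lower quantile of the noise reliabilities, retains more bands, which is one way the feature subset could be expanded for online use.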
2.3.3. Modeling Methodology
PLS is widely used in various fields for its stability. In particular, its excellent performance in dealing with multicollinearity makes it one of the most recognized regression algorithms in the field of NIRS analysis [34,35].
The recursive partial least squares (RPLS) regression algorithm is commonly used in regression analysis. RPLS updates the regression coefficients of the original model during the iterative process with newly added modeling data. In this way, it can extract information from the feature data of newly added pine nut samples [36,37].
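RPLS involves operations with two important covariance matrices. The regression coefficients and the latent variables of the PLS model are calculated from the matrices $X^{\top}X$ and $X^{\top}y$. Here, $X$ denotes the spectral feature matrix, and $y$ represents the vector of actual fat content. When the feature data of the $t$-th sample in the new dataset are added to the sample database, $X^{\top}X$ and $X^{\top}y$ are recursively updated using Equations (8) and (9):
$$(X^{\top}X)_t = \lambda\,(X^{\top}X)_{t-1} + x_t^{\top}x_t \tag{8}$$
$$(X^{\top}y)_t = \lambda\,(X^{\top}y)_{t-1} + x_t^{\top}y_t \tag{9}$$
where $(X^{\top}X)_{t-1}$ and $(X^{\top}y)_{t-1}$ respectively represent the replaced covariance matrices, and $\lambda$ represents the forgetting factor, whose role is to control the rate at which the original covariance matrix is updated through feedback. During the recursive calculation process, the $t$-th sample spectrum feature data $x_t$ (a row vector) and the true value data $y_t$ of the fat content both need to be standardized. The process of standardization involves an average vector and a standard deviation vector. The recursive updating process of the average and standard deviation vectors is shown in Equations (10) to (13):
$$\bar{x}_N = \frac{N-1}{N}\,\bar{x}_{N-1} + \frac{1}{N}\,x_t \tag{10}$$
$$\bar{y}_N = \frac{N-1}{N}\,\bar{y}_{N-1} + \frac{1}{N}\,y_t \tag{11}$$
$$\sigma_{x,N}^2 = \frac{N-2}{N-1}\,\sigma_{x,N-1}^2 + \frac{1}{N}\left(x_t - \bar{x}_{N-1}\right)^2 \tag{12}$$
$$\sigma_{y,N}^2 = \frac{N-2}{N-1}\,\sigma_{y,N-1}^2 + \frac{1}{N}\left(y_t - \bar{y}_{N-1}\right)^2 \tag{13}$$
where $N$ represents the number of all samples in the database at this time, and the squares in Equation (12) are taken element-wise. The value of $N$ varies according to the change in the number of iterations. Then, Equations (14) and (15) are used to calculate the standardized spectrum feature data $\tilde{x}_t$ and the true value data $\tilde{y}_t$ of fat content.
$$\tilde{x}_t = \frac{x_t - \bar{x}_N}{\sigma_{x,N}} \tag{14}$$
$$\tilde{y}_t = \frac{y_t - \bar{y}_N}{\sigma_{y,N}} \tag{15}$$
The initial series of values before recursive updating of the RPLS model can be calculated based on the feature data and the true values of fat content in the modeling dataset. These initial values are represented as $(X^{\top}X)_0$, $(X^{\top}y)_0$, $\bar{x}_0$, $\bar{y}_0$, $\sigma_{x,0}$, and $\sigma_{y,0}$.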
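The recursive covariance update can be sketched in NumPy as follows, together with one way of obtaining PLS1 regression coefficients from the two covariance matrices alone. The second function follows a kernel-style PLS formulation that is assumed here for illustration and is not confirmed to be the variant used in this study. The inputs are assumed to be already standardized, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def update_covariances(XtX, Xty, x_t, y_t, forgetting=1.0):
    """One recursive update with forgetting factor `forgetting` (0 < value <= 1)."""
    x_t = np.asarray(x_t, dtype=float)
    XtX_new = forgetting * XtX + np.outer(x_t, x_t)
    Xty_new = forgetting * Xty + x_t * y_t
    return XtX_new, Xty_new


def pls1_from_covariances(XtX, Xty, n_components):
    """PLS1 regression vector computed from X'X and X'y only."""
    p = XtX.shape[0]
    R = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    q = np.zeros(n_components)
    s = np.array(Xty, dtype=float)
    for a in range(n_components):
        w = s / np.linalg.norm(s)
        r = w.copy()
        for j in range(a):
            # Make the new direction orthogonal to the earlier loadings.
            r -= (P[:, j] @ w) * R[:, j]
        tt = r @ XtX @ r
        P[:, a] = XtX @ r / tt
        q[a] = r @ s / tt
        s = s - P[:, a] * q[a] * tt  # deflate X'y
        R[:, a] = r
    return R @ q


# Synthetic illustration: feed samples one at a time, then refit.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=30)
XtX, Xty = np.zeros((3, 3)), np.zeros(3)
for x_t, y_t in zip(X, y):
    XtX, Xty = update_covariances(XtX, Xty, x_t, y_t, forgetting=1.0)
b = pls1_from_covariances(XtX, Xty, n_components=3)
```

With a forgetting factor of 1 the recursion reproduces the batch covariance matrices exactly; values below 1 progressively down-weight older samples. Only the two covariance matrices are stored, so memory use does not grow with the number of samples.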
2.3.4. Fat Content Calibration Model
This study first establishes an offline PLS prediction model for the fat content of Korean pine nuts. The process of upgrading this offline model to an online learning model mainly revolves around the newly added sample dataset. The spectral data in the new dataset are preprocessed by the OMSC and S–G convolution smoothing algorithms, and the features are then reselected by UVE. Starting from the offline model, the RPLS algorithm can be used to achieve online updating of the prediction model. The construction process of the offline and online learning models is shown in Figure 3.
2.3.5. Model Validation
By adjusting and comparing parameters, the modeling process is fine-tuned, and the final model is evaluated and decided upon. In this study, the root mean square error (RMSE) and the coefficient of determination ($R^2$) [38] are used as evaluation metrics for the regression model. Specifically, RMSE is divided into the root mean square error of cross-validation (RMSECV) and the root mean square error of prediction (RMSEP) [39] for the calibration set and prediction set, respectively. RMSE and $R^2$ are calculated using Equations (16) and (17), respectively:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{16}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{17}$$
where $y_i$ represents the actual measured values corresponding to the pine nut fat content in this study, $\hat{y}_i$ represents the predicted values, $\bar{y}$ represents the average measured values, and $n$ is the number of samples.
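Both metrics are straightforward to compute. The sketch below uses made-up numbers rather than measured fat contents, and the function names are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration only.
measured = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
error = rmse(measured, predicted)        # sqrt(1/3), about 0.577
fit = r_squared(measured, predicted)     # 1 - 1/2 = 0.5
```

The same `rmse` function gives RMSECV when applied to cross-validation predictions on the calibration set and RMSEP when applied to the prediction set.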
4. Discussion
NIRS is a rapid and effective technology for assessing the quality of agricultural products and food. Several studies have successfully correlated the nutritional content of nuts with spectral data [40,41,42]. However, a common challenge in practical application arises from the fact that the items being measured often arrive in batches, making it difficult to consistently match the physicochemical properties of the samples used during offline model development. Consequently, offline models may remain limited to feasibility studies and may fail to transition out of laboratory settings. This limitation stems from the requirement in offline learning that all training data be available during model development, with the model only becoming usable for predictions after training is complete. In contrast, online learning processes data sequentially, continuously updating the model (referred to as the offline model in this study) as real-time data become available. Nonetheless, this advantage of online learning introduces certain risks. Since the model processes one data point at a time and updates weights immediately after training, erroneous weight calculations resulting from faulty data can potentially lead the model astray. To mitigate this risk, this study thoroughly preprocesses new samples, ensuring alignment with the original reference spectra and thereby reducing the likelihood of online learning weight calculations veering off course from the source data, effectively minimizing residuals.
To validate the superiority of the online learning model, we focused on comparing the performance of three online models. We specifically discussed the impact of spectral data dimensionality on modeling work. Additionally, the sample quantity determines the dataset volume, thereby affecting optimization effectiveness. Having too many new samples would eliminate the advantage of iteratively updating model parameters instead of reestablishing the model. The number of online dataset samples should be kept within a reasonable range. In this study, data collection was conducted in batches, with a total of 75 sets of samples comprising the NIR_ONLINE dataset. The NIR_ONLINE dataset samples were organized into batches of 5, sequentially inputted into the online model for training, and real-time outputs of RMSEP and $R^2$ were obtained. This process was used to analyze the model quality mentioned in Table 3. Figure 9a,b depict the RMSEP and $R^2$ iteration curves for the online model. The results of the prediction set tests directly reflect the strength of the predictive model performance.
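The batch-wise procedure described above can be pictured with the following sketch, in which 75 online samples are fed in batches of 5 and RMSEP on a fixed prediction set is recorded after every update. As a stand-in for the RPLS model it uses a least-squares fit on recursively accumulated covariance matrices, starts from empty matrices rather than from an offline model, and applies the forgetting factor once per batch. These choices, the function name, the small ridge term, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def run_online_batches(X_online, y_online, X_test, y_test,
                       batch_size=5, forgetting=1.0, ridge=1e-8):
    """Update after each batch and record RMSEP on a fixed prediction set."""
    p = X_online.shape[1]
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    history = []
    for start in range(0, len(X_online), batch_size):
        Xb = X_online[start:start + batch_size]
        yb = y_online[start:start + batch_size]
        # Recursive covariance update for the whole batch.
        XtX = forgetting * XtX + Xb.T @ Xb
        Xty = forgetting * Xty + Xb.T @ yb
        # Stand-in regression step (least squares with a tiny ridge for stability).
        coef = np.linalg.solve(XtX + ridge * np.eye(p), Xty)
        residual = y_test - X_test @ coef
        history.append(float(np.sqrt(np.mean(residual ** 2))))
    return history


# Synthetic, noise-free illustration with 75 online samples and 3 features.
rng = np.random.default_rng(0)
beta = np.array([0.5, -1.0, 2.0])
X_online = rng.normal(size=(75, 3))
y_online = X_online @ beta
X_test = rng.normal(size=(20, 3))
y_test = X_test @ beta
rmsep_curve = run_online_batches(X_online, y_online, X_test, y_test, batch_size=5)
```

The resulting list holds one RMSEP value per batch, which is the kind of iteration curve that can be plotted to see when further samples stop improving the model.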
Figure 9a,b show that the OMSC-RPLS method can accurately extract effective feature information from the new sample dataset. Compared with traditional preprocessing methods, the OMSC and RPLS algorithms make the updating and correction process more consistent with the requirements of improving model adaptability. The OMSC-RPLS-100 model is initially unstable, and its accuracy is slightly lower than that of the OMSC-RPLS-70 model. However, as the model continues to iterate and the dataset is input in batches to the online model for training and prediction, new data gradually increase, and the OMSC-RPLS-70 model strictly controls the number of bands. When the number of feature bands gradually exceeds 70, the model accuracy gradually decreases. On the other hand, the accuracy of the OMSC-RPLS-100 model gradually increases, and when the number of samples in the NIR_ONLINE dataset reaches about 30, the model accuracy approaches the maximum value, and the weight tends to stop updating. The NIR_ONLINE dataset should ideally contain as few samples as possible. In this study, the sample size of the NIR_ONLINE dataset ranged from 10 to 60. The model validation results are shown in Table 3.
In general, a larger number of training samples leads to higher prediction accuracy. Following the design principle of minimizing the size of the online learning dataset, this study set the number of samples in the online partial correction set to 30. At this point, the prediction accuracy of the updated online learning model exceeds that of the offline model, with both RMSEP and $R^2$ approaching their optimal values. The prediction results meet the accuracy requirements. In future optimizations of the model through online learning, the proportion of online samples can serve as a reference. In similar detection tasks in the future, a more in-depth investigation can be conducted into the setting of the volume of the online learning dataset.
5. Conclusions
This study aims to address the limitations of conventional methods for determining the fat content of Korean pine nuts by proposing a comprehensive approach that leverages near-infrared spectroscopy (NIRS). Initially, a PLS offline prediction model was developed, which offers a rapid, nondestructive, and accurate detection method. However, recognizing the offline model’s shortcomings, such as limited generalization and inadequate sample preprocessing for model updates, we advanced an OMSC-RPLS online learning model. The OMSC algorithm performs independent scatter correction on new sample data without the need for the complete set of original model data, thereby enhancing prediction accuracy. Additionally, by expanding the feature selection range during preprocessing, the new model captures a greater proportion of relevant information, which, when fed into the RPLS model, leads to a progressively updated and stable regression model.
The results demonstrate that the online learning model significantly outperforms the original model in detecting new batches of samples, showcasing enhanced adaptability. This online model holds significant guiding and practical value for the determination of pine nut nutritional content, offering reference and application value for quantitative analysis, quality testing, and online learning research related to other nut varieties. Theoretically, the establishment of an updated sample database facilitates the long-term optimization and updating of models aimed at similar detection objectives.
Despite these advancements, the recursive PLS model is not the only option. Various incremental learning algorithms, such as online stochastic gradient descent, online AdaBoost, online SVM, and online k-means, remain underexplored. Furthermore, algorithms like online collaborative filtering, widely used on e-commerce platforms to update user preferences in real time, could also be applied to near-infrared spectroscopy. Compared to these emerging algorithms, recursive PLS offers higher interpretability of spectral data, and the predecessor algorithm, PLS, has been widely applied in near-infrared spectroscopy. After balancing the advantages of recursive PLS with those of emerging algorithms, this study chose the recursive PLS algorithm as the online model. Future research should investigate these alternatives to assess their applicability to NIRS.
Moreover, this study focused on near-infrared spectroscopy, while other potential nondestructive techniques, such as laser-induced breakdown spectroscopy (LIBS) and Raman spectroscopy, could also be applied to nondestructive food testing. Given LIBS’s high sensitivity and limited penetration, it could be a viable alternative for detecting the nutritional content of pine nuts. Future research should explore these techniques for potential applications in nondestructive analysis.
In conclusion, this study has delved deeply into chemometrics, machine learning, and online learning methods, seamlessly integrating them to establish a robust quality evaluation model for Korean pine nuts. This model not only characterizes the properties and determines the nutrient content of the nuts but also transitions from an offline to an online learning model, setting the stage for ongoing research and development in this field.