Journal Description
Data
Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), Ei Compendex, dblp, Inspec, RePEc, and other databases.
- Journal Rank: JCR - Q2 (Multidisciplinary Sciences) / CiteScore - Q2 (Information Systems and Management)
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 27.7 days after submission; acceptance to publication is undertaken in 3.5 days (median values for papers published in this journal in the first half of 2024).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 2.2 (2023); 5-Year Impact Factor: 2.4 (2023)
Latest Articles
Teal-WCA: A Climate Services Platform for Planning Solar Photovoltaic and Wind Energy Resources in West and Central Africa in the Context of Climate Change
Data 2024, 9(12), 148; https://doi.org/10.3390/data9120148 - 10 Dec 2024
Abstract
To address the growing electricity demand driven by population growth and economic development while mitigating climate change, West and Central African countries are increasingly prioritizing renewable energy as part of their Nationally Determined Contributions (NDCs). This study evaluates the implications of climate change on renewable energy potential using ten downscaled and bias-adjusted CMIP6 models (CDFt method). Key climate variables (temperature, solar radiation, and wind speed) were analyzed and integrated into the Teal-WCA platform to aid in energy resource planning. Projected temperature increases of 0.5–2.7 °C (2040–2069) and 0.7–5.2 °C (2070–2099) relative to 1985–2014 underscore the need for strategies to manage the rising demand for cooling. Solar radiation reductions (~15 W/m²) may lower photovoltaic (PV) efficiency by 1–8.75%, particularly in high-emission scenarios, requiring a focus on system optimization and diversification. Conversely, wind speeds are expected to increase, especially in coastal regions, enhancing wind power potential by 12–50% across most countries and by 25–100% in coastal nations. These findings highlight the necessity of integrating climate-resilient energy policies that leverage wind energy growth while mitigating challenges posed by reduced solar radiation. By providing a nuanced understanding of the renewable energy potential under changing climatic conditions, this study offers actionable insights for sustainable energy planning in West and Central Africa.
Open Access Article
Parallel Simplex, an Alternative to Classical Experimentation: A Case Study
by
Francisco Zorrilla Briones, Inocente Yuliana Meléndez Pastrana, Manuel Alonso Rodríguez Morachis and José Luís Anaya Carrasco
Data 2024, 9(12), 147; https://doi.org/10.3390/data9120147 - 10 Dec 2024
Abstract
Experimentation is a strong methodology that improves and optimizes processes. Nevertheless, in many cases, real-life dynamics of production demands and other restrictions inhibit the use of these methodologies because their use implies stopping production, generating scrap, jeopardizing demand accomplishments, and other problems. Proposed here is an alternative methodology to search for the best process variable levels and optimize the response of the process without the need to stop production. This algorithm is based on the principles of the Variable Simplex developed by Nelder and Mead and the continuous iterative process of EVOPS developed by Box, which is then modified as a simplex by Spendley. It is named parallel simplex because it searches for the best response with three independent Simplexes searching for the same response at the same time. The algorithm was designed for three simplexes of two input variables each. The case study documented shows that it is efficient and effective.
(This article belongs to the Special Issue Cutting-Edge Datasets and Algorithms for Enhancing Industrial Processes and Supply Chain Optimization)
Open Access Article
Data Decomposition Modeling Based on Improved Dung Beetle Optimization Algorithm for Wind Power Prediction
by
Jiajian Ke and Tian Chen
Data 2024, 9(12), 146; https://doi.org/10.3390/data9120146 - 9 Dec 2024
Abstract
Accurate wind power forecasting is essential for maintaining the stability of a power system and enhancing scheduling efficiency in the power sector. To enhance prediction accuracy, this paper presents a hybrid wind power prediction model that integrates the improved complementary ensemble empirical mode decomposition (ICEEMDAN), the RIME optimization algorithm (RIME), sample entropy (SE), the improved dung beetle optimization (IDBO) algorithm, the bidirectional long short-term memory (BiLSTM) network, and multi-head attention (MHA). In this model, RIME is utilized to improve the parameters of ICEEMDAN, reducing data decomposition complexity and effectively capturing the original data information. The IDBO algorithm is then utilized to improve the hyperparameters of the MHA-BiLSTM model. The proposed RIME-ICEEMDAN-IDBO-MHA-BiLSTM model is contrasted with ten others in ablation experiments to validate its performance. The experimental findings prove that the proposed model achieves MAPE values of 5.2%, 6.3%, 8.3%, and 5.8% across four datasets, confirming its superior predictive performance and higher accuracy.
(This article belongs to the Topic Decision-Making and Data Mining for Sustainable Computing)
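The abstract above reports accuracy as MAPE (mean absolute percentage error). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mape` and the sample values are illustrative assumptions and are not taken from the paper.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have the same length and that no actual
    value is zero (MAPE is undefined in that case).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Illustrative values only (not data from the paper).
observed_power = [100.0, 200.0]
predicted_power = [90.0, 220.0]
print(mape(observed_power, predicted_power))
```

Because each error is divided by the observed value, MAPE becomes unstable when observed wind power is close to zero, which is worth keeping in mind when comparing scores across datasets.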
Open Access Article
Formalization for Subsequent Computer Processing of Kara Sea Coastline Data
by
Daria Bogatova and Stanislav Ogorodov
Data 2024, 9(12), 145; https://doi.org/10.3390/data9120145 - 9 Dec 2024
Abstract
This study aimed to develop a methodological framework for predicting shoreline dynamics using machine learning techniques, focusing on analyzing generalized data without distinguishing areas with higher or lower retreat rates. Three sites along the southwestern Kara Sea coast were selected for this investigation. The study analyzed key coastal features, including lithology, permafrost, and geomorphology, using a combination of field studies and remote sensing data. Essential datasets were compiled and formatted for computer-based analysis. These datasets included information on permafrost and the geomorphological characteristics of the coastal zone, climatic factors influencing the shoreline, and measurements of bluff top positions and retreat rates over defined time periods. The positions of the bluff tops were determined through a combination of imagery with varying resolutions and field measurements. A novel aspect of the study involved employing geostatistical methods to analyze erosion rates, providing new insights into the shoreline dynamics. The data analysis allowed us to identify coastal areas experiencing the most significant changes. By continually refining neural network models with these datasets, we can improve our understanding of the complex interactions between natural factors and shoreline evolution, ultimately aiding in developing effective coastal management strategies.
Open Access Data Descriptor
Multi-Modal Dataset of Human Activities of Daily Living with Ambient Audio, Vibration, and Environmental Data
by
Thomas Pfitzinger, Marcel Koch, Fabian Schlenke and Hendrik Wöhrle
Data 2024, 9(12), 144; https://doi.org/10.3390/data9120144 - 9 Dec 2024
Abstract
The detection of human activities is an important step in automated systems to understand the context of given situations. It can be useful for applications like healthcare monitoring, smart homes, and energy management systems for buildings. To achieve this, a sufficient data basis is required. The presented dataset contains labeled recordings of 25 different activities of daily living performed individually by 14 participants. The data were captured by five multisensors in supervised sessions in which a participant repeated each activity several times. Flawed recordings were removed, and the different data types were synchronized to provide multi-modal data for each activity instance. Apart from this, the data are presented in raw form, and no further filtering was performed. The dataset comprises ambient audio and vibration, as well as infrared array data, light color and environmental measurements. Overall, 8615 activity instances are included, each captured by the five multisensor devices. These multi-modal and multi-channel data allow various machine learning approaches to the recognition of human activities, for example, federated learning and sensor fusion.
Open Access Article
A Data Storage, Analysis, and Project Administration Engine (TMFdw) for Small- to Medium-Size Interdisciplinary Ecological Research Programs with Full Raster Data Capabilities
by
Paulina Grigusova, Christian Beilschmidt, Maik Dobbermann, Johannes Drönner, Michael Mattig, Pablo Sanchez, Nina Farwig and Jörg Bendix
Data 2024, 9(12), 143; https://doi.org/10.3390/data9120143 - 6 Dec 2024
Abstract
Over almost 20 years, a data storage, analysis, and project administration engine (TMFdw) has been continuously developed in a series of several consecutive interdisciplinary research projects on functional biodiversity of the southern Andes of Ecuador. Starting as a “working database”, the system now includes program management modules and literature databases, which are all accessible via a web interface. Originally designed to manage data in the ecological Research Unit 816 (SE Ecuador), the open software is now being used in several other environmental research programs, demonstrating its broad applicability. While the system was mainly developed for abiotic and biotic tabular data in the beginning, the new research program demands full capabilities to work with area-wide and high-resolution big models and remote sensing raster data. Thus, a raster engine was recently implemented based on the Geo Engine technology. The great variety of pre-implemented desktop GIS-like analysis options for raster point and vector data is an important incentive for researchers to use the system. A second incentive is to implement use cases prioritized by the researchers. As an example, we present machine learning models to generate high-resolution (30 m) microclimate raster layers for the study area in different temporal aggregation levels for the most important variables of air temperature, humidity, precipitation, and solar radiation. The models implemented as use cases outperform similar models developed in other research programs.
Open Access Article
Nearest-Better Network-Assisted Fitness Landscape Analysis of Contaminant Source Identification in Water Distribution Network
by
Yiya Diao, Changhe Li, Sanyou Zeng and Shengxiang Yang
Data 2024, 9(12), 142; https://doi.org/10.3390/data9120142 - 6 Dec 2024
Abstract
Contaminant Source Identification in Water Distribution Network (CSWIDN) is critical for ensuring public health, and optimization algorithms are commonly used to solve this complex problem. However, these algorithms are highly sensitive to the problem’s landscape features, which has limited their effectiveness in practice. Despite this, there has been little experimental analysis of the fitness landscape for CSWIDN, particularly given its mixed-encoding nature. This study addresses this gap by conducting a comprehensive fitness landscape analysis of CSWIDN using the Nearest-Better Network (NBN), the only applicable method for mixed-encoding problems. Our analysis reveals for the first time that CSWIDN exhibits landscape features including neutrality, ruggedness, modality, dynamic change, and separability. These findings not only deepen our understanding of the problem’s inherent landscape features but also provide quantitative insights into how these features influence algorithm performance. Additionally, based on these insights, we propose specific algorithm design recommendations that are better suited to the unique challenges of the CSWIDN problem. This work advances the knowledge of CSWIDN optimization by both qualitatively characterizing its landscape and quantitatively linking these features to algorithms’ behaviors.
(This article belongs to the Topic Water and Energy Monitoring and Their Nexus)
Open Access Data Descriptor
A Dataset of Plant Species Richness in Chinese National Nature Reserves
by
Chunjing Wang, Wuxian Yan and Jizhong Wan
Data 2024, 9(12), 141; https://doi.org/10.3390/data9120141 - 30 Nov 2024
Abstract
This comprehensive dataset on the number of plant species, genera, and families in 383 national nature reserves in China has been compiled based on the available literature. Heilongjiang Province and the Guangxi Zhuang Autonomous Region have the highest number of nature reserves. Species richness is relatively high in the Jinfoshan, Dabashan, Wenshan, Hupingshan, and Shennongjia Nature Reserves. This dataset provides important baseline information on plant species richness, coupled with genus and family numbers, in Chinese national nature reserves and should help researchers and environmentalists understand the dynamic species changes in various nature reserves. This detailed and reliable information may serve as the foundation for future plant research in Chinese nature reserves and play a positive role in promoting more effective natural protection, biological distribution, and biodiversity conservation in these areas.
Open Access Article
Algorithm for Trajectory Simplification Based on Multi-Point Construction in Preselected Area and Noise Smoothing Processing
by
Simin Huang and Zhiying Yang
Data 2024, 9(12), 140; https://doi.org/10.3390/data9120140 - 29 Nov 2024
Abstract
Simplifying trajectory data can improve the efficiency of trajectory data analysis and query and reduce the communication cost and computational overhead of trajectory data. In this paper, a real-time trajectory simplification algorithm (SSFI) based on the spatio-temporal feature information of implicit trajectory points is proposed. The algorithm constructs the preselected area through the error measurement method based on the feature information of implicit trajectory points (IEDs) proposed in this paper, predicts the falling point of trajectory points, and realizes the one-way error-bounded simplified trajectory algorithm. Experiments show that the simplification algorithm improves markedly in three aspects: running speed, compression accuracy, and simplification rate. When the trajectory data scale is large, the performance of the algorithm is much better than that of other line segment simplification algorithms. GPS error cannot be avoided, but smoothing the trajectory with a Kalman filter can effectively eliminate the influence of noise and significantly improve the performance of the simplification algorithm. According to the characteristics of the trajectory data, this paper constructs a mathematical model to describe the motion state of objects, so that the Kalman filter performs better than other filters when smoothing trajectory data. The trajectory data smoothing experiment is carried out by adding random Gaussian noise to the trajectory data. The experiment shows that the Kalman filter’s performance under this mathematical model is better than that of other filters.
(This article belongs to the Special Issue IoT and Big Data Applications in Smart Cities: Recent Advances, Challenges, and Critical Issues)
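To make the Kalman-filter smoothing step mentioned in the abstract more concrete, the following is a minimal sketch of a one-dimensional filter with a constant-velocity motion model, written in plain Python. The constant-velocity model, the function name `kalman_filter_1d`, and the noise parameters `q` and `r` are all assumptions chosen for illustration; the abstract does not specify the paper's motion model, which may well differ.

```python
def kalman_filter_1d(measurements, dt=1.0, q=1e-3, r=1.0):
    """Filter noisy 1-D positions with a constant-velocity Kalman filter.

    State is (position, velocity); only position is measured.
    q is a simplified process-noise term added to the velocity variance,
    r is the measurement-noise variance. Both are hypothetical defaults.
    """
    pos, vel = measurements[0], 0.0
    # Covariance entries: position variance starts at r, velocity is
    # initially very uncertain.
    p00, p01, p10, p11 = r, 0.0, 0.0, 1000.0
    filtered = [pos]
    for z in measurements[1:]:
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = pos + vel * dt
        p00, p01, p10, p11 = (
            p00 + dt * (p01 + p10) + dt * dt * p11,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + q,
        )
        # Update with the position measurement z (H = [1, 0]).
        s = p00 + r                 # innovation variance
        k0, k1 = p00 / s, p10 / s   # Kalman gain
        y = z - pos                 # innovation
        pos += k0 * y
        vel += k1 * y
        p00, p01, p10, p11 = (
            (1 - k0) * p00,
            (1 - k0) * p01,
            p10 - k1 * p00,
            p11 - k1 * p01,
        )
        filtered.append(pos)
    return filtered


# Illustrative use on one coordinate of a track (values are made up).
noisy_x = [0.0, 1.2, 1.9, 3.1, 3.9, 5.2]
print(kalman_filter_1d(noisy_x))
```

A two-dimensional GPS track could be handled by running the same filter on each coordinate independently, which is a simplification that ignores any correlation between the axes.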
Open Access Article
Detective Gadget: Generic Iterative Entity Resolution over Dirty Data
by
Marcello Buoncristiano, Giansalvatore Mecca, Donatello Santoro and Enzo Veltri
Data 2024, 9(12), 139; https://doi.org/10.3390/data9120139 - 25 Nov 2024
Abstract
In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same entity in the real world, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid integrating different entities or missing matches. However, existing approaches struggle with the challenges posed by rapidly changing data and the presence of dirtiness, which require iterative refinement over time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data-cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets.
(This article belongs to the Section Information Systems and Data Management)
Open Access Article
CARE to Compare: A Real-World Benchmark Dataset for Early Fault Detection in Wind Turbine Data
by
Christian Gück, Cyriana M. A. Roelofs and Stefan Faulstich
Data 2024, 9(12), 138; https://doi.org/10.3390/data9120138 - 23 Nov 2024
Abstract
Early fault detection plays a crucial role in the field of predictive maintenance for wind turbines, yet the comparison of different algorithms poses a difficult task because domain-specific public datasets are scarce. Many comparisons of different approaches either use benchmarks composed of data from many different domains, inaccessible data, or one of the few publicly available datasets that lack detailed information about the faults. Moreover, many publications highlight a couple of case studies where fault detection was successful. With this paper, we publish a high-quality dataset that contains data from 36 wind turbines across 3 different wind farms as well as the most detailed fault information of any public wind turbine dataset as far as we know. The new dataset contains 89 years' worth of real-world operating data of wind turbines, distributed across 44 labeled time frames for anomalies that led up to faults, as well as 51 time series representing normal behavior. Additionally, the quality of training data is ensured by turbine-status-based labels for each data point. Furthermore, we propose a new scoring method, called CARE (Coverage, Accuracy, Reliability and Earliness), which takes advantage of the information depth that is present in the dataset to identify good early fault detection models for wind turbines. This score considers the anomaly detection performance, the ability to recognize normal behavior properly, and the capability to raise as few false alarms as possible while simultaneously detecting anomalies early.
Open Access Data Descriptor
Dual Transcriptome of Post-Germinating Mutant Lines of Arabidopsis thaliana Infected by Alternaria brassicicola
by
Mailen Ortega-Cuadros, Laurine Chir, Sophie Aligon, Nubia Velasquez, Tatiana Arias, Jerome Verdier and Philippe Grappin
Data 2024, 9(11), 137; https://doi.org/10.3390/data9110137 - 18 Nov 2024
Abstract
Alternaria brassicicola is a seed-borne pathogen that causes black spot disease in Brassica crops, yet the seed defense mechanisms against this fungus remain poorly understood. Building upon recent reports that highlighted the involvement of indole pathways in seeds infected by Alternaria, this study provides transcriptomic resources to further elucidate the role of these metabolic pathways during the interaction between seeds and fungal pathogens. Using RNA sequencing, we examined the gene expression of glucosinolate-deficient mutant lines (cyp79B2/cyp79B3 and qko) and a camalexin-deficient line (pad3), generating a dataset from 14 samples. These samples were inoculated with Alternaria or water, and collected at 3, 6, and 10 days after sowing to extract total RNA. Sequencing was performed using DNBseq™ technology, followed by bioinformatics analyses with tools such as FastQC (version 0.11.9), multiQC (version 1.13), Venny (version 2.0), Salmon software (version 0.14.1), and R packages DESeq2 (version 1.36.0), ClusterProfiler (version 4.12.6) and ggplot2 (version 3.4.0). By providing this valuable dataset, we aim to contribute to a deeper understanding of seed defense mechanisms against Alternaria, leveraging RNA-seq for various analyses, including differential gene expression and co-expression correlation. This work serves as a foundation for a more comprehensive grasp of the interactions during seed infection and highlights potential targets for enhancing crop protection and management.
(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)
Open Access Data Descriptor
Two Datasets over South Tyrol and Tyrol Areas to Understand and Characterize Water Resource Dynamics in Mountain Regions
by
Ludovica De Gregorio, Giovanni Cuozzo, Riccardo Barella, Francisco Corvalán, Felix Greifeneder, Peter Grosse, Abraham Mejia-Aguilar, Georg Niedrist, Valentina Premier, Paul Schattan, Alessandro Zandonai and Claudia Notarnicola
Data 2024, 9(11), 136; https://doi.org/10.3390/data9110136 - 16 Nov 2024
Abstract
In this work, we present two datasets for specific areas located on the Alpine arc that can be exploited to monitor and understand water resource dynamics in mountain regions. The idea is to provide the reader with information about the different sources of water supply over five defined test areas in alpine environments of South Tyrol (Italy) and Tyrol (Austria). The snow cover fraction (SCF) and Soil Moisture Content (SMC) datasets are derived from machine learning algorithms based on remote sensing data. Both SCF and SMC products are characterized by a spatial resolution of 20 m and are provided for the period from October 2020 to May 2023 (SCF) and from October 2019 to September 2022 (SMC), respectively, covering winter seasons for SCF and spring–summer seasons for SMC. For SCF maps, the validation with very high-resolution images shows high correlation coefficients of around 0.9. The SMC products were originally produced with an algorithm validated at a global scale, but here, to obtain more insights into the specific alpine mountain environment, the values estimated from the maps are compared with ground measurements of automatic stations located at different altitudes and characterized by different aspects in the Val Mazia catchment in South Tyrol (Italy). In this case, an MAE between 0.05 and 0.08 and an unbiased RMSE between 0.05 and 0.09 m³·m⁻³ were achieved. The datasets presented can be used as input for hydrological models and to hydrologically characterize the study alpine area starting from different sources of information.
(This article belongs to the Topic Techniques and Science Exploitations for Earth Observation and Planetary Exploration)
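The soil moisture validation above is summarized with MAE and unbiased RMSE. The sketch below shows the usual definitions of both metrics in plain Python, where the unbiased RMSE removes the mean of each series before comparing them. The function names and the sample values are illustrative assumptions and do not come from the dataset.

```python
import math


def mae(estimated, measured):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / len(measured)


def unbiased_rmse(estimated, measured):
    """RMSE computed after subtracting each series' own mean.

    This removes any constant offset (bias) between the map estimates
    and the station measurements, leaving only the random component.
    """
    n = len(measured)
    mean_e = sum(estimated) / n
    mean_m = sum(measured) / n
    squared = sum(
        ((e - mean_e) - (m - mean_m)) ** 2 for e, m in zip(estimated, measured)
    )
    return math.sqrt(squared / n)


# Made-up soil moisture values in m³·m⁻³, offset by a constant 0.05.
map_values = [0.30, 0.35, 0.40]
station_values = [0.25, 0.30, 0.35]
print(mae(map_values, station_values))
print(unbiased_rmse(map_values, station_values))
```

In this toy case the two series differ only by a constant offset, so the MAE equals that offset while the unbiased RMSE is essentially zero, which illustrates why the two metrics are reported together.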
Open Access Data Descriptor
Dataset to Quantify Spillover Effects Among Concurrent Green Initiatives
by
Rong Zhang, Qi Zhang, Conghe Song and Li An
Data 2024, 9(11), 135; https://doi.org/10.3390/data9110135 - 13 Nov 2024
Abstract
Green initiatives are popular mechanisms globally to enhance environmental and human wellbeing. However, multiple green initiatives, when overlapping geographically and targeting the same participants, may interact with each other, giving rise to what is termed “spillover effects”, where one initiative and its outcomes influence another. This study examines the spillover effects among four major concurrent initiatives in the United States (U.S.) and China using a comprehensive dataset. In the U.S., we analysed county-level data in 2018 for the Conservation Reserve Program (CRP) and the Environmental Quality Incentives Program (EQIP), both operational for over 25 years. In China, data from Fanjingshan and Tianma National Nature Reserves (2014–2015) were used to evaluate the Grain-to-Green Program (GTGP) and the Forest Ecological Benefit Compensation (FEBC) program. The dataset comprises 3106 records for the U.S. and 711 plots for China, including several socio-economic variables. The results of multivariate linear regression indicate that there exist significant spillover effects between CRP & EQIP and GTGP & FEBC, with one initiative potentially enhancing or offsetting another’s impacts by 22% to 100%. This dataset provides valuable insights for researchers and policymakers to optimize the effectiveness and resilience of concurrent green initiatives.
Open Access Data Descriptor
The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset
by
Mamtimin Qasim, Wushour Silamu and Minghui Qiu
Data 2024, 9(11), 134; https://doi.org/10.3390/data9110134 - 11 Nov 2024
Abstract
Script identification is easier to implement than language identification, and its identification rate is very high. The fewer languages a language identification algorithm has to distinguish, the higher its identification rate. However, no systematic study on script identification involving multiple languages and determining how to construct relevant language identification datasets has been conducted. Therefore, in this paper, we discuss and design a script identification algorithm and the construction of a language identification dataset based on script groups. The data sources in this paper comprise 261 different languages’ text corpora from the Leipzig Corpora Collection, which are grouped into 23 different script groups. In the Unicode encoding scheme, different scripts are arranged into different code regions. Based on this feature, we propose a script identification algorithm based on regular expression matching, the micro F-score of which reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, a script identification algorithm is used to filter out other-script content in each text.
(This article belongs to the Section Information Systems and Data Management)
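The idea described in the abstract, that Unicode places different scripts in different code regions so a script can be recognized by regular expression matching, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-script table, the approximate block ranges, and the function names `identify_script` and `filter_other_script` are hypothetical and are not the authors' algorithm, which covers 23 script groups.

```python
import re

# Hypothetical, heavily abbreviated script table. Each pattern is built
# from approximate Unicode block ranges (the Latin range also contains a
# couple of non-letter symbols such as the multiplication sign).
SCRIPT_PATTERNS = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u024F]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}


def identify_script(text):
    """Return the script with the most matching characters, or None."""
    counts = {
        name: len(pattern.findall(text))
        for name, pattern in SCRIPT_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def filter_other_script(text, script):
    """Drop whitespace-separated tokens that contain other-script characters."""
    others = [p for name, p in SCRIPT_PATTERNS.items() if name != script]
    kept = [
        token
        for token in text.split()
        if not any(p.search(token) for p in others)
    ]
    return " ".join(kept)


print(identify_script("Привет мир"))
print(filter_other_script("hello мир world", "Latin"))
```

Counting matches per script and taking the majority is one simple decision rule; a production system would likely use the Unicode Script property rather than hand-written block ranges, since some scripts span several non-contiguous blocks.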
Open Access Data Descriptor
Additions to Space Physics Data Facility and pysatNASA: Increasing Mars Global Surveyor and Mars Atmosphere and Volatile EvolutioN Dataset Utility
by
Teresa M. Esman, Alexa J. Halford, Jeff Klenzing and Angeline G. Burrell
Data 2024, 9(11), 133; https://doi.org/10.3390/data9110133 - 8 Nov 2024
Abstract
The Space Physics Data Facility (SPDF) is a digital archive of space physics data and is useful for the storage, analysis, and dissemination of data. We discuss the process used to create an amended dataset and store it on the SPDF. The operational software that generates the archival data uses the open-source Python package pysat, and an end-user module has been added to the pysatNASA module. The result is the addition of data products to the Mars Global Surveyor (MGS) magnetometer (MAG) dataset, its archival location on SPDF, and pysat compatibility. The primary and metadata format increases the convenience and efficiency for users of the MGS MAG data. The storage of planetary and heliophysics data in one location supports the use of data throughout the solar system for comparison, while pysat compatibility enables loading data in an identical format for ease of processing. We encourage the use of the outlined process for past, present, and future space science missions of all sizes and funding levels, from balloons to Flagship-class missions.
Open Access Data Descriptor
The VNF Cybersecurity Dataset for Research (VNFCYBERDATA)
by
Believe Ayodele and Victor Buttigieg
Data 2024, 9(11), 132; https://doi.org/10.3390/data9110132 - 8 Nov 2024
Abstract
Virtualisation has received widespread adoption and deployment across a wide range of enterprises and industries throughout the years. Network Function Virtualisation (NFV) is a technical concept that presents a method for dynamically delivering virtualised network functions as virtualised or software components. Virtualised Network Function (VNF) has distinct advantages, but it also faces serious security challenges. Cyberattacks such as Denial of Service (DoS), malware/rootkit injection, port scanning, and so on can target VNF appliances just like any other network infrastructure. To create exceptional training exercises for machine or deep learning (ML/DL) models to combat cyberattacks in VNF, a suitable dataset (VNFCYBERDATA) exhibiting an actual reflection, or one that is reasonably close to an actual reflection, of the problem that the ML/DL model could address is required. This article describes a real VNF dataset that contains over seven million data points and twenty-five cyberattacks generated from five VNF appliances. To facilitate a realistic examination of VNF traffic, the dataset includes both benign and malicious traffic.
Open Access Data Descriptor
Influence of Temperature Variability on the Efficacy of Negative Ions in Removing Particulate Matter and Pollutants: An Experimental Database
by
Paola M. Ortiz-Grisales, Leidy Gutiérrez-León and Carlos D. Zuluaga-Ríos
Data 2024, 9(11), 131; https://doi.org/10.3390/data9110131 - 8 Nov 2024
Abstract
Cities globally must make urgent decisions to ensure a sustainable future as rising pollution, particularly PM2.5, poses severe health risks like respiratory and heart diseases. PM2.5’s harmful composition also impacts vegetation and the environment. Immediate government intervention is necessary to mitigate these effects. This study tackles the urgent problem of reducing PM2.5 levels in Medellín’s urban and indoor environments, where pollution presents serious health risks. To explore effective solutions, this research provides new data on the interaction between particulate matter from various pollutants and negative ions under different temperature conditions, offering valuable insights into air quality improvement strategies. Using a high-voltage system, ions bind to pollutants, accelerating their removal. Experiments measured temperature, humidity, formaldehyde, volatile organic compounds, negative ions, and PM2.5 in a 40 cm³ chamber across various conditions. Pollutants tested included cigarette smoke, incense, charcoal, and gasoline at two voltage levels and three temperature ranges. The data, available in CSV format, were based on 36,000 samples and repeated tests for reliability. This resource is designed to support studies investigating particulate matter control in urban and indoor environments, as well as to improve our understanding of negative ion-based air purification processes. The data are publicly available and structured in formats compatible with leading data analysis platforms.
Open Access Data Descriptor
Non-Destructive Wood Analysis Dataset: Comparing X-Ray and Terahertz Imaging Techniques
by
Caroline Marc, Bertrand Marcon, Louis Denaud and Stéphane Girardon
Data 2024, 9(11), 130; https://doi.org/10.3390/data9110130 - 5 Nov 2024
Abstract
Wood density measurement plays a crucial role in assessing wood quality and predicting its mechanical performance. This dataset was collected to compare the accuracy and reliability of two non-destructive techniques, X-rays and terahertz waves, for measuring wood density. While X-rays have been commonly used in the industry due to their effectiveness, they pose health risks due to ionizing radiation. Terahertz waves, on the other hand, are non-ionizing and offer high spatial resolution. This article presents a database of wood sample measurements obtained with both techniques on the same 110 samples, with finely located measuring points, covering a wide range of wood species (tropical and temperate) and densities, from 111 kg·m⁻³ to 1086 kg·m⁻³. The database includes X-ray and terahertz scans, sample dimensions, moisture content, and color photographs.
Open Access Article
Data Hub for Life Cycle Assessment of Climate Change Solutions—Hydrogen Case Study
by
Shiva Zargar, Miyuru Kannangara, Giovanna Gonzales-Calienes, Jianjun Yang, Jalil Shadbahr, Cyrille Decès-Petit and Farid Bensebaa
Data 2024, 9(11), 129; https://doi.org/10.3390/data9110129 - 5 Nov 2024
Abstract
Life cycle assessment, which evaluates the complete life cycle of a product, is considered the standard methodological framework to evaluate the environmental performance of climate change solutions. However, significant challenges exist related to datasets used to quantify these environmental indicators. Although extensive research and commercial data on climate change technologies, pathways, and facilities exist, they are not readily available to practitioners of life cycle assessment in the right format and structure using an open platform. In this study, we propose a new open data hub platform for life cycle assessment, considering a hierarchical data flow starting with raw data collected on climate change technologies at laboratory, pilot, demonstration, or commercial scales to provide the information required for policy and decision-making. This platform makes data accessible at multiple levels for practitioners of life cycle assessment, while making data interoperable across platforms. The proposed data hub platform and workflow are explained through the polymer electrolyte membrane electrolysis hydrogen production as a case study. The climate change environment impact of 1.17 ± 0.03 kg CO₂ eq./kg H₂ was calculated for the case study. The current data hub platform is limited to evaluating environmental impacts; however, future additions of economic and social aspects are envisaged.
(This article belongs to the Section Information Systems and Data Management)
Topics
Topic in
BDCC, Data, MAKE, Mathematics
Big Data Intelligence: Methodologies and Applications
Topic Editors: Liang Zhao, Liang Zou, Boxiang Dong
Deadline: 31 December 2024
Topic in
BDCC, Data, Environments, Geosciences, Remote Sensing
Database, Mechanism and Risk Assessment of Slope Geologic Hazards
Topic Editors: Chong Xu, Yingying Tian, Xiaoyi Shao, Zikang Xiao, Yulong Cui
Deadline: 28 February 2025
Topic in
Data, Energies, Sensors, Sustainability, Water
Water and Energy Monitoring and Their Nexus
Topic Editors: Lucas Pereira, Hugo Morais, Wolf-Gerrit Früh
Deadline: 31 March 2025
Topic in
Algorithms, Data, Earth, Geosciences, Mathematics, Land, Water
Applications of Algorithms in Risk Assessment and Evaluation
Topic Editors: Yiding Bao, Qiang Wei
Deadline: 31 July 2025
Special Issues
Special Issue in
Data
Navigating Emerging Advancements and Challenges in AI and Big Data Technologies for Business and Society
Guest Editor: Michael Gerlich
Deadline: 30 March 2025
Special Issue in
Data
New Progress in Big Earth Data
Guest Editors: Aditya Chakravarty, Juanle Wang
Deadline: 30 March 2025
Special Issue in
Data
Cutting-Edge Datasets and Algorithms for Enhancing Industrial Processes and Supply Chain Optimization
Guest Editors: Iván Pérez-Olguín, Luis Carlos Méndez González, Luis Alberto Rodríguez-Picón
Deadline: 30 April 2025
Special Issue in
Data
Data-Driven Approaches for Safety in Industrial Sites
Guest Editors: Francesca Mauro, Mara Lombardi, Mario Fargnoli
Deadline: 30 June 2025
Topical Collections
Topical Collection in
Data
Modern Geophysical and Climate Data Analysis: Tools and Methods
Collection Editors: Vladimir Sreckovic, Zoran Mijic