Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project
DOI: https://doi.org/10.1590/
Keywords: Self-organizing map, Random forest, Oceanography, Modeling, Santos basin
Abstract
Integrating complex environmental data into a unified analytical framework for large-scale conservation and
monitoring initiatives poses several challenges. These include defining a conceptual model of
cause-and-effect relationships, handling differences in the quantity and information content of data sources,
dealing with missing or noisy data, optimizing models, achieving accurate predictions, and handling
imbalanced observations across factors. In the context of the Santos project, dedicated to comprehending
the spatio-temporal dynamics of benthic, pelagic, and physical systems for the facilitation of conservation and
monitoring programs, the application of machine learning's random forest (RF) technique for modeling univariate
data offers notable advantages. This approach adeptly handles non-linearity, covariation, and interactive effects
among predictors. For modeling multivariate data sets, a hybrid strategy combining a self-organizing map (SOM)
and RF is used to tackle these challenges. For missing values, the bagging imputation
technique outperformed other methods. Both machine learning techniques
discussed herein exhibit resilience against the impact of noisy data, yet the identification of noisy data remains
feasible based on model outputs. In scenarios of imbalanced data sets, we investigate the correlation between
the RF model's overall statistics and those of individual classes. The joint interpretation of these statistics aids in
comprehending model limitations and facilitates discussions on the environmental mechanisms shaping observed
patterns. We propose two analytical workflows that not only enable the exploration and enhancement of model
accuracy but also facilitate the investigation of potential cause-and-effect relationships inherent in the data.
Furthermore, these workflows lay the foundation for implementing long-term learning algorithms, a key
step forward for monitoring initiatives. These workflows, together with the approaches to the analytical
challenges discussed, can be implemented within iMESc, an open-source application.
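
The comparison between overall and per-class statistics mentioned for imbalanced data sets can be illustrated with a minimal sketch. It is not the authors' implementation or the iMESc code: the class names and confusion-matrix counts below are invented for illustration, and overall accuracy, recall, and precision are used here as stand-ins for whichever statistics the study reports.

```python
# Hypothetical confusion matrix for a three-class classifier.
# Rows = observed class, columns = predicted class.
# Class names and counts are invented for illustration only.
labels = ["shelf", "slope", "deep"]
confusion = [
    [90, 5, 5],  # observed "shelf" (majority class)
    [10, 8, 2],  # observed "slope"
    [6, 2, 2],   # observed "deep" (minority class)
]


def overall_accuracy(matrix):
    """Fraction of all observations that were classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """For each observed class, the fraction that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def per_class_precision(matrix):
    """For each predicted class, the fraction that was truly that class."""
    n = len(matrix)
    column_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    return [matrix[c][c] / column_totals[c] for c in range(n)]


accuracy = overall_accuracy(confusion)
recall = per_class_recall(confusion)
precision = per_class_precision(confusion)

print(f"overall accuracy: {accuracy:.3f}")
for name, r, p in zip(labels, recall, precision):
    print(f"{name:>5}: recall={r:.3f} precision={p:.3f}")
```

In this invented example the overall accuracy is dominated by the majority class, while the minority classes are recovered far less often. Reading the two kinds of statistics together is what exposes that limitation.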
Machine learning workflows for ecosystem studies
Ocean and Coastal Research 2023, v71(suppl 3):e23021 22
Fonseca and Vieira