Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project

Authors

  • Gustavo Fonseca
  • Danilo Candido Vieira

DOI:

https://doi.org/10.1590/

Keywords:

Self-organizing map, Random forest, Oceanography, Modeling, Santos basin

Abstract

Integrating complex environmental data within a unified analytical framework for extensive conservation and
monitoring initiatives poses several challenges. These include defining a conceptual model
that outlines cause-and-effect relationships, handling differences in the number and information content of data sources,
dealing with missing or noisy data, tuning models, achieving accurate predictions, and handling
imbalanced observations across factors. In the context of the Santos project, dedicated to comprehending
the spatio-temporal dynamics of benthic, pelagic, and physical systems for the facilitation of conservation and
monitoring programs, the application of machine learning's random forest (RF) technique for modeling univariate
data offers notable advantages. This approach adeptly handles non-linearity, covariation, and interactive effects
among predictors. For modeling multivariate data sets, a hybrid strategy combining a self-organizing map (SOM)
and RF is used to address these challenges. For missing values, the bagging imputation
technique performed better than other methods. Both machine learning techniques
discussed herein exhibit resilience against the impact of noisy data, yet the identification of noisy data remains
feasible based on model outputs. In scenarios of imbalanced data sets, we investigate the correlation between
the RF model's overall statistics and those of individual classes. The joint interpretation of these statistics aids in
comprehending model limitations and facilitates discussions on the environmental mechanisms shaping observed
patterns. We propose two analytical workflows that not only enable the exploration and enhancement of model
accuracy but also facilitate the investigation of potential cause-and-effect relationships inherent in the data.
Furthermore, these workflows lay the foundation for implementing long-term learning algorithms, an important
step for monitoring initiatives. These workflows, along with the approaches to the analytical challenges discussed here,
can be implemented within iMESc, an open-source application.
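
The comparison between a model's overall statistics and those of individual classes, mentioned in the abstract for imbalanced data sets, can be illustrated with a small sketch. It assumes a classification setting; the class names, the hand-made labels, and the function names are hypothetical and are not taken from the Santos project or from iMESc. It computes overall accuracy next to per-class sensitivity and precision from a confusion matrix.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Count observations; rows are observed classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for observed, predicted in zip(y_true, y_pred):
        matrix[index[observed]][index[predicted]] += 1
    return matrix


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of all observations that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_stats(matrix: List[List[int]],
                    labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Sensitivity (recall) and precision for each class."""
    stats = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        n_observed = sum(matrix[i])                  # row total
        n_predicted = sum(row[i] for row in matrix)  # column total
        stats[label] = {
            "sensitivity": true_positive / n_observed if n_observed else 0.0,
            "precision": true_positive / n_predicted if n_predicted else 0.0,
        }
    return stats


# Hypothetical, hand-made data: 18 "shelf" samples and only 2 "slope" samples.
labels = ["shelf", "slope"]
y_true = ["shelf"] * 18 + ["slope"] * 2
y_pred = ["shelf"] * 18 + ["shelf", "slope"]  # one rare-class sample is missed

matrix = confusion_matrix(y_true, y_pred, labels)
accuracy = overall_accuracy(matrix)
stats = per_class_stats(matrix, labels)

print("overall accuracy:", accuracy)
for label in labels:
    print(label, stats[label])
```

In this toy case the overall accuracy is high while the sensitivity of the rare class is only one half. Reading the overall and class-level statistics together is meant to expose this kind of discrepancy.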

Ocean and Coastal Research 2023, v71(suppl 3):e23021

Published

10.04.2024

How to Cite

Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project. (2024). Ocean and Coastal Research, 71(Suppl. 3). https://doi.org/10.1590/