Simplification of machine learning models can reduce computational costs

Tool developed by USP researcher uses algorithms that handle real-time data streams to optimize large-scale data processing

Published: 06/11/2024 at 15:01

Text: Editorial staff

Art: Beatriz Haddad*

Researchers and professionals without access to very powerful computers can benefit, when analyzing continuously generated data, from models that automatically adjust to changes – Photo: Freepik

The use of artificial intelligence (AI) to optimize digital systems is becoming increasingly prevalent as data production grows at an accelerating pace. The algorithms behind this type of technology need to process large volumes of data without consuming excessive computational resources.

Defended at the Institute of Mathematics and Computer Sciences (ICMC) of USP in São Carlos, the doctoral research “Incremental and efficient algorithms for decision trees and rules and proximity-based algorithms” contributed to the creation of a tool that simplifies the implementation of online machine learning algorithms.

The research is linked to the Center for Mathematical Sciences Applied to Industry, a Research, Innovation and Dissemination Center that promotes the transfer of technologies and scientific knowledge to industry.

Unlike traditional machine learning, an area of AI in which the model is trained on a fixed, isolated dataset, online machine learning is an incremental method that deals with real-time data streams. “There are scenarios where you need to be as up to date as possible, so you need to keep updating your model,” says Saulo Martiello Mastelini, computer scientist and author of the doctoral thesis.
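
To make the incremental idea concrete, the following is a minimal sketch in plain Python, not code from the thesis. It keeps a running mean and variance of a stream and updates them one observation at a time (Welford's method), so the full history never has to be stored. The class name `RunningStats` and the sample readings are assumptions made for illustration.

```python
import math
import statistics


class RunningStats:
    """Mean and variance of a stream, updated one value at a time."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: constant time and memory per observation.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance of the values seen so far.
        return self._m2 / self.n if self.n > 0 else 0.0


# Hypothetical sensor readings arriving one by one.
readings = [21.5, 22.0, 23.1, 22.8, 24.0]
stats = RunningStats()
for reading in readings:
    stats.update(reading)

# The streaming result agrees with a batch computation over the stored list.
print(math.isclose(stats.mean, statistics.mean(readings)))
print(math.isclose(stats.variance, statistics.pvariance(readings)))
```

The batch functions from `statistics` need the whole list in memory, while `RunningStats` only keeps three numbers regardless of how long the stream runs.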

Surveillance tools, medical diagnostics, financial transactions, and fraud detection are cases that can benefit from systems that adapt their algorithms as new data arrives. However, this does not come without costs. Saulo explains that processing data-rich environments under constant change is slow and memory-intensive. “Maybe I can run it on a normal computer, but on a small sensor that’s in the middle of a forest and running on battery, that might not be efficient,” he emphasizes.

Saulo Martiello Mastelini - Photo: LinkedIn

The research sought solutions capable of optimizing these processes, reducing computational costs and, at the same time, maintaining good predictive performance, that is, the ability to predict future events based on available data. The study focused on regression models: algorithms that predict numerical values, unlike classification models, which predict categorical values.

“In general, regression algorithms tend to be more challenging in data manipulation due to the nature of the problem. When you are going to predict whether it’s a cat or a dog, you have two options. Now, if you are going to predict, for example, a temperature, there are infinite possibilities,” says the researcher, explaining the greater complexity of the models studied.

The thesis also investigated, within the regression scenario, the use of so-called decision trees. They are an important type of algorithm used in machine learning, as they are versatile and visually intuitive. Formed by decision nodes and branches, the trees follow a hierarchical flow when dealing with data: they start from a root node, the initial stage of processing, and reach a leaf node, which holds the final prediction generated as a response.
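
The root-to-leaf flow can be sketched in a few lines of plain Python. This is an illustrative toy, not one of the algorithms from the thesis: the tree is fixed by hand rather than learned, and the feature names (`hour`, `cloud_cover`), thresholds, and temperature values are all invented for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Leaf node: holds the final numerical prediction."""
    prediction: float


@dataclass
class DecisionNode:
    """Decision node: routes a sample by comparing one feature to a threshold."""
    feature: str
    threshold: float
    left: "Node"    # followed when x[feature] <= threshold
    right: "Node"   # followed when x[feature] > threshold


Node = Union[Leaf, DecisionNode]


def predict(node, x):
    """Walk from the root down to a leaf and return the leaf's value."""
    while isinstance(node, DecisionNode):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Hand-built regression tree predicting a temperature in degrees Celsius.
TREE = DecisionNode(
    feature="hour",
    threshold=12.0,
    left=Leaf(18.0),
    right=DecisionNode(
        feature="cloud_cover",
        threshold=0.5,
        left=Leaf(28.0),
        right=Leaf(22.0),
    ),
)

print(predict(TREE, {"hour": 14, "cloud_cover": 0.2}))
```

Each decision node applies one rule, so a prediction costs only as many comparisons as the tree is deep.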

With an easy-to-visualize structure, decision trees can be used for both classification and regression algorithms, making an initial database generate a final value by passing through nodes that establish decision rules – Illustration by Jornal da USP made with images from juicy_fish/Freepik, Freepik, and macrovector/Freepik

Simplification of models

In addition to developing more efficient and less costly processing models, the work also contributed to the creation of River, a library that simplifies the application of online machine learning algorithms. Written in the Python programming language, the software is a collaboration between various researchers and is open source, so anyone can access the source code to use it, inspect it, and modify it with new features.
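
River's estimators work one sample at a time, with methods such as `learn_one` and `predict_one` that receive each sample as a Python dict of features. The sketch below is a hypothetical stand-in written in plain Python to show that style of interface; it is not River code and does not use any algorithm from the thesis. The class `OnlineLinearRegressor`, the feature name `load`, and the synthetic stream are assumptions made for illustration.

```python
class OnlineLinearRegressor:
    """Toy linear regressor trained by stochastic gradient descent, one sample at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}   # one weight per feature name, created on first use
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # Gradient step on the squared error of this single sample.
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.learning_rate * error * v
        self.bias -= self.learning_rate * error


def synthetic_stream(n):
    """Yield n (features, target) pairs following y = 2 * load + 1."""
    for i in range(n):
        x = {"load": (i % 10) / 10.0}
        yield x, 2.0 * x["load"] + 1.0


def progressive_validation(model, stream):
    """Predict each sample before learning from it; return the mean absolute error."""
    total, count = 0.0, 0
    for x, y in stream:
        total += abs(model.predict_one(x) - y)
        model.learn_one(x, y)
        count += 1
    return total / count if count else 0.0


model = OnlineLinearRegressor()
mae = progressive_validation(model, synthetic_stream(5000))
print(round(model.predict_one({"load": 0.5}), 3), mae)
```

Because every sample is used for testing first and for training afterwards, the model is always evaluated on data it has not yet seen, which suits streams that never end.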

Already applied in both industry and academia, River reflects the intentions of the study done by Saulo. According to André Ponce de Leon Ferreira de Carvalho, director of the Institute and thesis advisor, its proposals democratize the use of AI, as they allow “research groups and companies with fewer resources to conduct research and develop products for applications where data is continuously generated.”

Due to its potential impact on society, the research, defended last year, won the 37th Thesis and Dissertation Contest of the Congress of the Brazilian Computer Society (CSBC), one of the most important in the country in the field of Computing. It also won the first edition of the Maria Carolina Monard Award, an honor created by ICMC that recognizes doctoral theses related to artificial intelligence.

The thesis Incremental and efficient algorithms for decision trees and rules and proximity-based algorithms can be read here.

More information: Saulo Martiello Mastelini, email saulomastelini@gmail.com.

*Intern under the supervision of Moisés Dorado

English version: Nexus Traduções


Usage policy
Articles and photographs may be freely reproduced provided that Jornal da USP and the author are cited. For audio files, the credits must mention Rádio USP and, where stated, the authors. For video files, the credits must mention TV USP and, where stated, the authors. Photos must be credited as USP Imagens along with the photographer's name.