Selected Publications

Below is a list of selected publications.

2025
	Dopke, Luan; Accorsi, Arthur; Aires, João; Guder, Larissa; Manssour, Isabel; Griebler, Dalvan. SpeechVis: Simplifying Speech Emotion Visualization. In: Proceedings of the 31st Brazilian Symposium on Multimedia and the Web, pp. 428-436, SBC, Rio de Janeiro, Brazil, 2025. doi:10.5753/webmedia.2025.16115
	@inproceedings{DOPKE:WebMedia:25,
	  title = {SpeechVis: Simplifying Speech Emotion Visualization},
	  author = {Luan Dopke and Arthur Accorsi and João Aires and Larissa Guder and Isabel Manssour and Dalvan Griebler},
	  url = {https://doi.org/10.5753/webmedia.2025.16115},
	  doi = {10.5753/webmedia.2025.16115},
	  year = {2025},
	  date = {2025-11-01},
	  booktitle = {Proceedings of the 31st Brazilian Symposium on Multimedia and the Web},
	  pages = {428-436},
	  address = {Rio de Janeiro, Brazil},
	  organization = {SBC},
	  abstract = {As the amount of online content increases, analyzing and following discussions becomes harder. Relevant information, such as the main discussion topics and the emotions expressed in audio, e.g., in a podcast, requires people to watch or listen to the entire content to understand the context. However, this can take a long time, and people’s interpretations of emotions can bias their understanding of them. A visual summarization of such information can help people quickly understand the audio context and analyze the content regarding speakers, their emotions, and the main topics covered. In this work, we introduce SpeechVis, a visual analytics tool that visually summarizes speech emotions from an audio source. SpeechVis extracts multiple information from the audio, such as the transcription, speakers, main topics, and emotions, to provide visualizations and statistics about the discussed topics and each speaker’s emotions. We used multiple off-the-shelf machine learning models to extract audio information and developed several visual representations that aim to facilitate audio analysis. To evaluate SpeechVis, we selected two use cases and performed an analysis to demonstrate how the SpeechVis visualizations can give valuable insights and facilitate audio interpretation.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Guder, Larissa; Dopke, Luan; Kaiser, Marcos; Griebler, Dalvan; Meneguzzi, Felipe. BAH: Beyond Acoustic Handcrafted features for speech emotion recognition in Portuguese. In: Proceedings of the 31st Brazilian Symposium on Multimedia and the Web, pp. 86-93, SBC, Rio de Janeiro, Brazil, 2025. doi:10.5753/webmedia.2025.16129
	@inproceedings{GUDER:WebMedia:25,
	  title = {BAH: Beyond Acoustic Handcrafted features for speech emotion recognition in Portuguese},
	  author = {Larissa Guder and Luan Dopke and Marcos Kaiser and Dalvan Griebler and Felipe Meneguzzi},
	  url = {https://doi.org/10.5753/webmedia.2025.16129},
	  doi = {10.5753/webmedia.2025.16129},
	  year = {2025},
	  date = {2025-11-01},
	  booktitle = {Proceedings of the 31st Brazilian Symposium on Multimedia and the Web},
	  pages = {86-93},
	  address = {Rio de Janeiro, Brazil},
	  organization = {SBC},
	  abstract = {It is through affective computing that we have the integration of human feelings and computing applications. One affective computing task is Speech Emotion Recognition (SER), which identifies emotions from spoken audio. Even though emotion is a universal aspect of human experience, each culture and language has different ways to express and understand emotions. So, when designing models for SER, it is common to focus on a single language. In this work, we explore VERBO, a Brazilian Portuguese dataset for categorical emotion recognition. Our main objective is to define the best way to extract acoustic features to train a classifier for SER. We compare 18 different methods to generate audio representations, grouped by handcrafted features and audio embeddings. The best representation for VERBO is TRILL embeddings, and with an SVM classifier, we achieved 92% accuracy in VERBO. As far as we know, this was the state of the art for this dataset.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Ahmad, Sunna Imtiaz; Olczyk, Jakub; Araújo, Adriel S.; de Moura Medeiros, João Pedro; Teixeira, Vinicius C.; Gomes, Carlos F. A.; Magnaguagno, Maurício Cecílio; Roederer, Quinn; Dutra, Vinicius; Conley, R. Scott; Griebler, Dalvan; Eckert, George; Pinho, Márcio Sarroglia; Turkkahraman, Hakan. A Novel Multimodal Deep Image Analysis Model for Predicting Extraction/Non-Extraction Decision. In: Orthodontics & Craniofacial Research, vol. na, pp. na, 2025. doi:10.1111/ocr.70057
	@article{AHMAD:OCR:25,
	  title = {A Novel Multimodal Deep Image Analysis Model for Predicting Extraction/Non-Extraction Decision},
	  author = {Sunna Imtiaz Ahmad and Jakub Olczyk and Adriel S. Araújo and João Pedro de Moura Medeiros and Vinicius C. Teixeira and Carlos F. A. Gomes and Maurício Cecílio Magnaguagno and Quinn Roederer and Vinicius Dutra and R. Scott Conley and Dalvan Griebler and George Eckert and Márcio Sarroglia Pinho and Hakan Turkkahraman},
	  url = {https://doi.org/10.1111/ocr.70057},
	  doi = {10.1111/ocr.70057},
	  year = {2025},
	  date = {2025-10-01},
	  urldate = {2025-10-01},
	  journal = {Orthodontics & Craniofacial Research},
	  volume = {na},
	  pages = {na},
	  publisher = {Wiley},
	  abstract = {This study aimed to develop a deep learning model classifier capable of predicting the extraction/non-extraction binary decision using lateral cephalometric radiographs (LCRs) and intraoral scans (IOS) to serve as an additional decision-support tool for orthodontists. Materials and Methods: The dataset was composed of LCRs and IOS from 617 patients (mean age: 18.2, 63.5% female) treated at the Indiana University School of Dentistry. Subjects were categorised into two groups: extraction (192) and non-extraction (425). Two sets of features were extracted from IOS: traditional arch measurements and novel tooth spatial features. For LCRs, features were derived using CephNet-based landmark detection (Land), a convolutional autoencoder (AE), and the dimensionality was reduced using Principal Component Analysis (PCA). Models were evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV or precision), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR−), and F1 score. Results: IOS + Land model achieved the highest overall accuracy (77%) and F1 score (0.62), with strong specificity (83%) and PPV (62%). In contrast, the Land model yielded the highest sensitivity (82%), but at the cost of lower specificity (57%). McNemar's test revealed that the AE model was significantly less accurate than IOS + AE (p = 0.048), IOS + Land (p = 0.006), and IOS + AE + Land (p = 0.005). Conclusion: Deep learning models can predict the extraction/non-extraction decision using IOS and LCRs with high accuracy and diagnostic performance. Multimodal approaches, particularly those integrating IOS with cephalometric landmarks, demonstrate superior accuracy, sensitivity, and specificity compared to single-modality models.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {article}
	}
	Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz Gustavo. Performance, Portability, and Productivity of HIP on GPUs with NAS Parallel Benchmarks. In: 2025 IEEE/SBC 37th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 204-214, IEEE, Bonito, Brazil, 2025. doi:10.1109/SBAC-PAD66369.2025.00027
	@inproceedings{ARAUJO:SBAC-PAD:25,
	  title = {Performance, Portability, and Productivity of HIP on GPUs with NAS Parallel Benchmarks},
	  author = {Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes},
	  url = {https://doi.org/10.1109/SBAC-PAD66369.2025.00027},
	  doi = {10.1109/SBAC-PAD66369.2025.00027},
	  year = {2025},
	  date = {2025-10-01},
	  booktitle = {2025 IEEE/SBC 37th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)},
	  pages = {204-214},
	  publisher = {IEEE},
	  address = {Bonito, Brazil},
	  series = {SBAC-PAD'25},
	  abstract = {Graphics Processing Units (GPUs) are powerful, massively parallel processors that have become ubiquitous in modern computing. In recent years, the GPU market has diversified, with vendors like AMD and Intel offering high-performance alternatives to NVIDIA. However, most applications are written using NVIDIA's CUDA API, which is incompatible with non-NVIDIA GPUs, creating significant challenges for developers who must port their code to different architectures. To address this issue, AMD developed the Heterogeneous-Compute Interface for Portability (HIP), an open-source API for cross-vendor GPU programming. However, HIP is relatively new, leaving gaps in the literature regarding its performance, portability, and productivity. In this paper, we evaluate HIP using the NAS Parallel Benchmarks (NPB), a CFD-based suite maintained by NASA. We present the first HIP-based implementation of NPB and conduct experiments on integrated and discrete GPUs from NVIDIA, AMD, and Intel. Our results provide novel insights into HIP’s performance and portability, particularly for integrated GPUs and Intel discrete GPUs, which have been underrepresented in prior studies. We also assess productivity using different metrics to quantify the programming effort of HIP-based implementations. This work addresses key gaps in the literature, offering valuable data and insights for developers targeting emerging GPU architectures.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Martins, Eduardo; Hoffmann, Renato; Alf, Lucas; Griebler, Dalvan. Interface para Programação de Pipelines Lineares Tolerantes a Falha para MPI Padrão C++. In: Anais do XXVI Simpósio em Sistemas Computacionais de Alto Desempenho, pp. 133-144, SBC, Bonito, Brazil, 2025. doi:10.5753/sscad.2025.15867
	@inproceedings{MARTINS:SSCAD:25,
	  title = {Interface para Programação de Pipelines Lineares Tolerantes a Falha para MPI Padrão C++},
	  author = {Eduardo Martins and Renato Hoffmann and Lucas Alf and Dalvan Griebler},
	  url = {https://doi.org/10.5753/sscad.2025.15867},
	  doi = {10.5753/sscad.2025.15867},
	  year = {2025},
	  date = {2025-10-01},
	  booktitle = {Anais do XXVI Simpósio em Sistemas Computacionais de Alto Desempenho},
	  pages = {133-144},
	  publisher = {SBC},
	  address = {Bonito, Brazil},
	  series = {SSCAD'25},
	  abstract = {Stream processing systems are designed to operate continuously and must be able to recover from failures. However, programming high-performance applications in distributed environments introduces high development complexity. This work presents a programming interface that eases the construction of fault-tolerant linear pipelines for stream processing applications in C++. The solution uses MPI (Message Passing Interface) for communication and the ABS (Asynchronous Barrier Snapshotting) protocol together with a monitor agent for the recovery step. The experimental results indicate a significant reduction in the estimated development time for the programmer, with an average impact of -0.98% to 6.73% on application throughput. In addition, the recovery process mitigates the impact of failures on program throughput.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Faé, Leonardo; Griebler, Dalvan. Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines. In: Anais do XXIX Simpósio Brasileiro de Linguagens de Programação, pp. 75-83, SBC, Recife/PE, 2025. doi:10.5753/sblp.2025.13152
	@inproceedings{FAE:SBLP:25,
	  title = {Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines},
	  author = {Leonardo Faé and Dalvan Griebler},
	  url = {https://sol.sbc.org.br/index.php/sblp/article/view/36951/36736},
	  doi = {10.5753/sblp.2025.13152},
	  year = {2025},
	  date = {2025-09-01},
	  booktitle = {Anais do XXIX Simpósio Brasileiro de Linguagens de Programação},
	  pages = {75-83},
	  publisher = {SBC},
	  address = {Recife/PE},
	  series = {SBLP'25},
	  abstract = {Programming Graphics Processing Units (GPUs) for general-purpose computation remains a daunting task, often requiring specialized knowledge of low-level APIs like CUDA or OpenCL. While Rust has emerged as a modern, safe, and performant systems programming language, its adoption in the GPU computing domain is still nascent. Existing approaches often involve intricate compiler modifications or complex static analysis to adapt CPU-centric Rust code for GPU execution. This paper presents a novel high-level abstraction in Rust, leveraging procedural macros to automatically generate GPU-executable code from constrained Rust functions. Our approach simplifies the code generation process by imposing specific limitations on how these functions can be written, thereby avoiding the need for complex static analysis. We demonstrate the feasibility and effectiveness of our abstraction through a case study involving linear pipeline parallel patterns, a common structure in data-parallel applications. By transforming Rust functions annotated as source, stage, or sink in a pipeline, we enable straightforward execution on the GPU. We evaluate our abstraction's performance and programmability using two benchmark applications: sobel (image filtering) and latbol (fluid simulation), comparing it against manual OpenCL implementations. Our results indicate that while incurring a small performance overhead in some cases, our approach significantly reduces development effort and, in certain scenarios, achieves comparable or even superior throughput compared to CPU-based parallelism.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Ahmad, Sunna I.; Araújo, Adriel S.; Teixeira, Vinicius C.; Gomes, Carlos F. A.; Dutra, Vinicius; Roederer, Quinn; Conley, R. Scott; Griebler, Dalvan; Pinho, Márcio S.; Turkkahraman, Hakan. A Novel AI-driven Automated Orthodontic Model Analysis to Improve Classification of Orthodontic Extraction Cases. In: 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1853-1860, IEEE, Toronto, Canada, 2025. doi:10.1109/COMPSAC65507.2025.00254
	@inproceedings{AHMAD:COMPSAC:25,
	  title = {A Novel AI-driven Automated Orthodontic Model Analysis to Improve Classification of Orthodontic Extraction Cases},
	  author = {Sunna I. Ahmad and Adriel S. Araújo and Vinicius C. Teixeira and Carlos F. A. Gomes and Vinicius Dutra and Quinn Roederer and R. Scott Conley and Dalvan Griebler and Márcio S. Pinho and Hakan Turkkahraman},
	  url = {https://doi.org/10.1109/COMPSAC65507.2025.00254},
	  doi = {10.1109/COMPSAC65507.2025.00254},
	  year = {2025},
	  date = {2025-07-01},
	  booktitle = {2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC)},
	  pages = {1853-1860},
	  publisher = {IEEE},
	  address = {Toronto, Canada},
	  abstract = {Malocclusion, a prevalent dental condition worldwide, necessitates orthodontic intervention to correct tooth misalignment and improve oral health. Treatment can involve extraction of permanent teeth, depending on dental crowding, jaw relationships, and facial aesthetics. Today, clinical decision support systems have introduced machine learning (ML) to assist orthodontists in determining optimal treatment plans. This study explores the development of a novel, fully automated method for extracting dentoalveolar features from 3D intraoral scans (IOS), aiming to enhance orthodontic decision-making. Using deep learning-based IOS segmentation as basis, dental measurements were developed and utilized to train supervised ML classifiers, including support vector machines (SVM), logistic regression, decision trees, and random forests. An ensemble of SVM models demonstrated the highest accuracy (73%) in predicting extraction decisions, with these novel domain-specific features proving more informative than traditional dental arch measurements. While we can make further improvements not only in the automated segmentation but also by applying feature selection, the results highlight the potential of AI-driven analysis to streamline orthodontic workflows, reduce manual intervention and improve clinical efficiency.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Guder, Larissa; Aires, João Paulo; Manssour, Isabel H; Griebler, Dalvan. GoViz: A Visualization Tool for Empowering Transparency in Government Speech. In: Annual International Conference on Digital Government Research, pp. 954, Digital Government Society, Porto Alegre, Brasil, 2025. doi:10.59490/dgo.2025.954
	@inproceedings{GUDER:DGO:25,
	  title = {GoViz: A Visualization Tool for Empowering Transparency in Government Speech},
	  author = {Larissa Guder and João Paulo Aires and Isabel H Manssour and Dalvan Griebler},
	  url = {https://doi.org/10.59490/dgo.2025.954},
	  doi = {10.59490/dgo.2025.954},
	  year = {2025},
	  date = {2025-05-01},
	  booktitle = {Annual International Conference on Digital Government Research},
	  volume = {26},
	  pages = {954},
	  publisher = {Digital Government Society},
	  address = {Porto Alegre, Brasil},
	  abstract = {Public speech from government figures often describes relevant actions that can impact the population's lives. However, most people do not have time and access to analyze and understand public speech. Such a scenario narrows the participation of the people in the main discussions, which leads to multiple misunderstandings. In this work, we propose GoViz, a tool that automatically produces visual representations to outline governmental speeches regarding the subject, its main actors, and how they connect to the discussion topics. GoViz processes natural language from speech transcriptions in a pipeline that identifies part-of-speech elements, named-entities, and the relation between persons, making speech content more accessible and insightful. Using publicly available data, we evaluate our tool in two different languages (Portuguese and English). The results demonstrate that the visualizations from both data facilitate understanding the speech content. Thus, our main contribution is to encourage the participation of citizens in parliamentary issues, allowing a simplified and visually engaging avenue to access long speeches and fostering improved communication between parliamentarians and the population.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {inproceedings}
	}
	Czarnul, Paweł; Antal, Marcel; Baniata, Hamza; Griebler, Dalvan; Kertesz, Attila; Kessler, Christoph W.; Kouloumpris, Andreas; Kovačić, Salko; Markus, Andras; Michael, Maria K.; Nikolaou, Panagiota; Öz, Isil; Prodan, Radu; Rakić, Gordana. Optimization of resource-aware parallel and distributed computing: a review. In: The Journal of Supercomputing, vol. 81, no. 7, pp. 848, 2025. doi:10.1007/s11227-025-07295-7
	@article{CZARNUL:Supercomputing:25,
	  title = {Optimization of resource-aware parallel and distributed computing: a review},
	  author = {Paweł Czarnul and Marcel Antal and Hamza Baniata and Dalvan Griebler and Attila Kertesz and Christoph W. Kessler and Andreas Kouloumpris and Salko Kovačić and Andras Markus and Maria K. Michael and Panagiota Nikolaou and Isil Öz and Radu Prodan and Gordana Rakić},
	  url = {https://doi.org/10.1007/s11227-025-07295-7},
	  doi = {10.1007/s11227-025-07295-7},
	  year = {2025},
	  date = {2025-05-01},
	  urldate = {2025-05-01},
	  journal = {The Journal of Supercomputing},
	  volume = {81},
	  number = {7},
	  pages = {848},
	  publisher = {Springer},
	  abstract = {This paper presents a review of state-of-the-art solutions concerning the optimization of computing in the field of parallel and distributed systems. Firstly, we contribute by identifying resources and quality metrics in this context including servers, network interconnects, storage systems, computational devices as well as execution time/performance, energy, security, and error vulnerability, respectively. We subsequently identify commonly used problem formulations and algorithms for integer linear programming, greedy algorithms, dynamic programming, genetic algorithms, particle swarm optimization, ant colony optimization, game theory, and reinforcement learning. Afterward, we characterize frequently considered optimization problems by stating these terms in domains such as data centers, cloud, fog, blockchain, high performance, and volunteer computing. Based on the extensive analysis, we identify how particular resources and corresponding quality metrics are considered in these domains and which problem formulations are used for which system types, either parallel or distributed environments. This allows us to formulate open research problems and challenges in this field and analyze research interest in problem formulations/domains in recent years.},
	  keywords = {},
	  pubstate = {published},
	  tppubtype = {article}
	}
	Rockenbach, Dinei A.; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz Gustavo GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism Journal Article In: Computer Standards & Interfaces, vol. 92, pp. 103922, 2025.
@article{ROCKENBACH:GSParLib:CSI:25, title = {GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism}, author = {Dinei A. Rockenbach and Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.csi.2024.103922}, doi = {10.1016/j.csi.2024.103922}, year = {2025}, date = {2025-03-01}, urldate = {2025-03-01}, journal = {Computer Standards \& Interfaces}, volume = {92}, pages = {103922}, publisher = {Elsevier}, abstract = {The evolution of Graphics Processing Units (GPUs) has allowed the industry to overcome long-lasting problems and challenges. Many belong to the stream processing domain, whose central aspect is continuously receiving and processing data from streaming data producers such as cameras and sensors. Nonetheless, programming GPUs is challenging because it requires deep knowledge of many-core programming, mechanisms and optimizations for GPUs. Current GPU programming standards do not target stream processing and present programmability and code portability limitations. Among our main scientific contributions resides GSParLib, a C++ multi-level programming interface unifying CUDA and OpenCL for GPU processing on stream and data parallelism with negligible performance losses compared to manual implementations; GSParLib is organized in two layers: one for general-purpose computing and another for high-level structured programming based on parallel patterns; a methodology to provide unified and driver agnostic interfaces minimizing performance losses; a set of parallelism strategies and optimizations for GPU processing targeting stream and data parallelism; and new experiments covering GPU performance on applications exposing stream and data parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Löff, Júnior; Hoffmann, Renato B.; Bianchessi, Arthur S.; Mallmann, Leonardo; Griebler, Dalvan; Binder, Walter NPB-PSTL: C++ STL Algorithms with Parallel Execution Policies in NAS Parallel Benchmarks Inproceedings In: 33rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 162-169, IEEE, Torino, Italy, 2025.
@inproceedings{LOFF:PDP:25, title = {NPB-PSTL: C++ STL Algorithms with Parallel Execution Policies in NAS Parallel Benchmarks}, author = {Júnior Löff and Renato B. Hoffmann and Arthur S. Bianchessi and Leonardo Mallmann and Dalvan Griebler and Walter Binder}, url = {https://doi.org/10.1109/PDP66500.2025.00030}, doi = {10.1109/PDP66500.2025.00030}, year = {2025}, date = {2025-03-01}, booktitle = {33rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {162-169}, publisher = {IEEE}, address = {Torino, Italy}, series = {PDP'25}, abstract = {The C++ language continually evolves through formal specifications established by its standards committee, proposing new features to maintain C++ as a relevant programming language while improving usability, performance, and portability across platforms. With the addition of parallel Standard Template Library (STL) algorithms in C++17, programmers can now leverage parallel processing capabilities via vendor-neutral parallel execution policies. This study presents an adaptation of the NAS Parallel Benchmarks (NPB), a well-established suite of applications for evaluating parallel architectures, by porting its sequential C-style code to use C++ STL abstractions and performance-portable parallelism features. Our goals are to (1) assess the suitability of C++ STL for scientific applications like the ones in the NPB and (2) provide a comparative performance and portability of STL algorithms' parallel execution policies across different multicore architectures (x86 and AArch64). Results indicate that the performance of parallel STL algorithms is often close to that of optimized handwritten versions (OpenMP, Intel TBB, and FastFlow) on different architectures, with notable shortfalls. Across all NPB benchmarks, the STL algorithms' geometric mean shows sequential execution times that are between 3.76% and 6.9% higher, while parallel executions may reach a geometric mean of up to 21.21% higher execution time.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Hoffmann, Renato B.; Faé, Leonardo G.; Griebler, Dalvan; Li, Xinliang David; Pereira, Fernando Magno Quintão Automatic Synthesis of Specialized Hash Functions Inproceedings In: Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, pp. 317-330, ACM, Las Vegas, NV, USA, 2025.
@inproceedings{HOFFMANN:sepe:cgo:25, title = {Automatic Synthesis of Specialized Hash Functions}, author = {Renato B. Hoffmann and Leonardo G. Faé and Dalvan Griebler and Xinliang David Li and Fernando Magno Quintão Pereira}, url = {https://doi.org/10.1145/3696443.3708940}, doi = {10.1145/3696443.3708940}, year = {2025}, date = {2025-03-01}, booktitle = {Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization}, pages = {317-330}, publisher = {ACM}, address = {Las Vegas, NV, USA}, series = {CGO '25}, abstract = {This paper introduces a technique for synthesizing hash functions specialized to particular byte formats. This code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging on predetermined sequences of bytes. Code generation involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. Comparative analysis demonstrates that the synthetic functions outperform the general-purpose hashes in the C++ Standard Template Library and the Google Abseil Library when keys are given in ascending, normal or uniform distribution. In applications where low-mixing hashes are acceptable, the synthetic functions achieve speedups ranging from 2% to 11% on full benchmarks, and speedups of almost 50x once only hashing speed is considered.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Mencagli, Gabriele; Rymarchuk, Yuriy; Griebler, Dalvan PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams Inproceedings In: Proceedings of the 19th ACM International Conference on Distributed and Event-Based Systems, pp. 51-61, ACM, Gothenburg, Sweden, 2025.
@inproceedings{MENCAGLI:DEBS:25, title = {PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams}, author = {Gabriele Mencagli and Yuriy Rymarchuk and Dalvan Griebler}, url = {https://doi.org/10.1145/3701717.3730542}, doi = {10.1145/3701717.3730542}, year = {2025}, date = {2025-01-01}, booktitle = {Proceedings of the 19th ACM International Conference on Distributed and Event-Based Systems}, pages = {51-61}, publisher = {ACM}, address = {Gothenburg, Sweden}, series = {DEBS'25}, abstract = {Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, named online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most of the existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism as demonstrated in our experimental analysis.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Araujo, Gabriell; Rockenbach, Dinei A.; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz G. A C++ annotation-based domain-specific language for expressing stream and data parallelism supporting CPU and GPU Journal Article In: Journal of Computer Languages, vol. 85, pp. 101369, 2025.
@article{ARAUJO:COLA:25, title = {A C++ annotation-based domain-specific language for expressing stream and data parallelism supporting CPU and GPU}, author = {Gabriell Araujo and Dinei A. Rockenbach and Júnior Löff and Dalvan Griebler and Luiz G. Fernandes}, url = {https://doi.org/10.1016/j.cola.2025.101369}, doi = {10.1016/j.cola.2025.101369}, year = {2025}, date = {2025-01-01}, urldate = {2025-01-01}, journal = {Journal of Computer Languages}, volume = {85}, pages = {101369}, publisher = {Elsevier}, abstract = {Graphics processing units (GPUs) and central processing units (CPUs) provide massive parallel computing in our modern computer systems (e.g., servers, desktops, smartphones, and laptops), and efficiently utilizing their processing power requires expertise in parallel programming. Mainly, domain-specific languages (DSLs) address this challenge by improving productivity and abstractions. SPar is a high-level DSL that promotes parallel programming abstractions for stream and data parallelism using C++ attribute annotations for serial code. Unlike existing solutions, SPar eliminates the need to manually implement low-level mechanisms to leverage stream and data parallelism on heterogeneous systems. In this article, we design an extended version of the language and compiler algorithm for GPU code generation. We newly offer a single parallel programming model targeting CPUs and GPUs to exploit stream and data parallelism. The experiments indicated performance improvement compared with previous versions of SPar and achieved performance comparable to handwritten code using lower-level programming abstractions in specific scenarios.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Leonarczyk, Ricardo; Mencagli, Gabriele; Griebler, Dalvan Self-Adaptive Micro-Batching for Low-Latency GPU-Accelerated Stream Processing Journal Article In: International Journal of Parallel Programming, vol. 53, no. 2, pp. 14, 2025.
@article{LEONARCZYK:IJPP:25, title = {Self-Adaptive Micro-Batching for Low-Latency GPU-Accelerated Stream Processing}, author = {Ricardo Leonarczyk and Gabriele Mencagli and Dalvan Griebler}, url = {https://doi.org/10.1007/s10766-025-00793-4}, doi = {10.1007/s10766-025-00793-4}, year = {2025}, date = {2025-01-01}, urldate = {2025-01-01}, journal = {International Journal of Parallel Programming}, volume = {53}, number = {2}, pages = {14}, publisher = {Springer}, abstract = {Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization with stream processing applications often requires micro-batching techniques, i.e., the continuous processing of data batches to expose data parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides different techniques to address such a challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms have already been evaluated using a smaller workload with the same application. We propose two new algorithms to address the shortcomings detected in the former four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they could not meet the most strict latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating more effectiveness in highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
2024
	Hoffmann, Renato B.; Griebler, Dalvan; Righi, Rodrigo Rosa; Fernandes, Luiz G. Benchmarking parallel programming for single-board computers Journal Article In: Future Generation Computer Systems, vol. 161, pp. 119-134, 2024.
@article{HOFFMANN:single-board-computers:FGCS:24, title = {Benchmarking parallel programming for single-board computers}, author = {Renato B. Hoffmann and Dalvan Griebler and Rodrigo Rosa Righi and Luiz G. Fernandes}, url = {https://doi.org/10.1016/j.future.2024.07.003}, doi = {10.1016/j.future.2024.07.003}, year = {2024}, date = {2024-12-01}, urldate = {2024-12-01}, journal = {Future Generation Computer Systems}, volume = {161}, pages = {119-134}, publisher = {Elsevier}, abstract = {Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. In this way, parallel computing plays a crucial role in enabling the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no solution presents a combined discussion of these three aspects. To tackle this problem, this article aims to provide a benchmark of the best-suited parallelism PPIs given a set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Therefore, our contributions are the methodology to benchmark parallelism on SBCs and the characterization of the best-performing parallelism PPIs and strategies for given situations. We are confident that parallel computing will be mainstream to process edge and fog computing; thus, our solution provides the first insights regarding what kind of application and parallel programming interface is the most suited for a particular SBC hardware.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Vogel, Adriano; Danelutto, Marco; Torquati, Massimo; Griebler, Dalvan; Fernandes, Luiz Gustavo Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores Journal Article In: The Journal of Supercomputing, vol. 80, no. 15, pp. 22213-22244, 2024.
@article{VOGEL:Supercomputing:24, title = {Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores}, author = {Adriano Vogel and Marco Danelutto and Massimo Torquati and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-024-06191-w}, doi = {10.1007/s11227-024-06191-w}, year = {2024}, date = {2024-10-01}, urldate = {2024-10-01}, journal = {The Journal of Supercomputing}, volume = {80}, number = {15}, pages = {22213-22244}, publisher = {Springer}, abstract = {Parallel computing is very important to accelerate the performance of computing applications. Moreover, parallel applications are expected to continue executing in more dynamic environments and react to changing conditions. In this context, applying self-adaptation is a potential solution to achieve a higher level of autonomic abstractions and runtime responsiveness. In our research, we aim to explore and assess the possible abstractions attainable through the transparent management of parallel executions by self-adaptation. Our primary objectives are to expand the adaptation space to better reflect real-world applications and assess the potential for self-adaptation to enhance efficiency. We provide the following scientific contributions: (I) A conceptual framework to improve the designing of self-adaptation; (II) A new decision-making strategy for applications with multiple parallel stages; (III) A comprehensive evaluation of the proposed decision-making strategy compared to the state-of-the-art. The results demonstrate that the proposed conceptual framework can help design and implement self-adaptive strategies that are more modular and reusable. The proposed decision-making strategy provides significant gains in accuracy compared to the state-of-the-art, increasing the parallel applications' performance and efficiency.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Guder, Larissa; Aires, João Paulo; Griebler, Dalvan Dimensional Speech Emotion Recognition: a Bimodal Approach Inproceedings In: Anais Estendidos do XXX Simpósio Brasileiro de Sistemas Multimídia e Web, pp. 5-6, SBC, Juiz de Fora, Brasil, 2024.
@inproceedings{GUDER:WEBMEDIA:24, title = {Dimensional Speech Emotion Recognition: a Bimodal Approach}, author = {Larissa Guder and João Paulo Aires and Dalvan Griebler}, url = {https://doi.org/10.5753/webmedia_estendido.2024.244402}, doi = {10.5753/webmedia_estendido.2024.244402}, year = {2024}, date = {2024-10-01}, booktitle = {Anais Estendidos do XXX Simpósio Brasileiro de Sistemas Multimídia e Web}, pages = {5-6}, publisher = {SBC}, address = {Juiz de Fora, Brasil}, abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance, which can represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios, where processing the input in a short time is necessary. Considering these aspects, this work provides the first step towards creating a bimodal approach for Dimensional Speech Emotion Recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech-emotion recognition. We evaluate different methods for creating audio and text representations, as well as automatic speech recognition techniques. Our best results achieve 0.5915 of CCC for arousal, 0.4165 for valence, and 0.5899 for dominance in the IEMOCAP dataset.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Faé, Leonardo; Griebler, Dalvan An internal domain-specific language for expressing linear pipelines: a proof-of-concept with MPI in Rust Inproceedings In: Anais do XXVIII Simpósio Brasileiro de Linguagens de Programação, pp. 81-90, SBC, Curitiba/PR, 2024. @inproceedings{FAE:SBLP:24, title = {An internal domain-specific language for expressing linear pipelines: a proof-of-concept with MPI in Rust}, author = {Leonardo Faé and Dalvan Griebler}, url = {https://doi.org/10.5753/sblp.2024.3691}, doi = {10.5753/sblp.2024.3691}, year = {2024}, date = {2024-09-01}, booktitle = {Anais do XXVIII Simpósio Brasileiro de Linguagens de Programação}, pages = {81-90}, publisher = {SBC}, address = {Curitiba/PR}, series = {SBLP'24}, abstract = {Parallel computation is necessary in order to process massive volumes of data in a timely manner. There are many parallel programming interfaces and environments, each with their own idiosyncrasies. This, alongside non-deterministic errors, makes parallel programs notoriously challenging to write. Great effort has been put forth to make parallel programming for several environments easier. In this work, we propose a DSL for Rust, using the language’s source-to-source transformation facilities, that allows for automatic code generation for distributed environments that support the Message Passing Interface (MPI). Our DSL simplifies MPI’s quirks, allowing the programmer to focus almost exclusively on the computation at hand. Performance experiments show little to no runtime difference between our abstraction and manually written MPI code while resulting in less than half the lines of code. More elaborate code complexity metrics (Halstead) estimate from 4.5 to 14.7 times lower effort for expressing parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo; Binder, Walter MPR: An MPI Framework for Distributed Self-adaptive Stream Processing Inproceedings In: Euro-Par 2024: Parallel Processing, pp. 400-414, Springer, Madrid, Spain, 2024. @inproceedings{LOFF:Euro-Par:24, title = {MPR: An MPI Framework for Distributed Self-adaptive Stream Processing}, author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes and Walter Binder}, url = {https://doi.org/10.1007/978-3-031-69583-4_28}, doi = {10.1007/978-3-031-69583-4_28}, year = {2024}, date = {2024-08-01}, booktitle = {Euro-Par 2024: Parallel Processing}, pages = {400-414}, publisher = {Springer}, address = {Madrid, Spain}, series = {Euro-Par'24}, abstract = {Stream processing systems must often cope with workloads varying in content, format, size, and input rate. The high variability and unpredictability make statically fine-tuning them very challenging. Our work addresses this limitation by providing a new framework and runtime system to simplify implementing and assessing new self-adaptive algorithms and optimizations. We implement a prototype on top of MPI called MPR and show its functionality. We focus on horizontal scaling by supporting the addition and removal of processes during execution time. Experiments reveal that MPR can achieve performance similar to that of a handwritten static MPI application. We also assess MPR's adaptation capabilities, showing that it can readily re-configure itself, with the help of a self-adaptive algorithm, in response to workload variations.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Gomes, Carlos Falcao Azevedo; Araujo, Adriel Silva; Ahmad, Sunna Imtiaz; Magnaguagno, Mauricio Cecilio; Teixeira, Vinicius Crisosthemos; Rajapuri, Anushri Singh; Roederer, Quinn; Griebler, Dalvan; Dutra, Vinicius; Turkkahraman, Hakan; Pinho, Marcio Sarroglia Multiview Machine Learning Classification of Tooth Extraction in Orthodontics Using Intraoral Scans Inproceedings In: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1977-1982, IEEE, Osaka, Japan, 2024. @inproceedings{GOMES:COMPSAC:24, title = {Multiview Machine Learning Classification of Tooth Extraction in Orthodontics Using Intraoral Scans}, author = {Carlos Falcao Azevedo Gomes and Adriel Silva Araujo and Sunna Imtiaz Ahmad and Mauricio Cecilio Magnaguagno and Vinicius Crisosthemos Teixeira and Anushri Singh Rajapuri and Quinn Roederer and Dalvan Griebler and Vinicius Dutra and Hakan Turkkahraman and Marcio Sarroglia Pinho}, url = {https://doi.org/10.1109/COMPSAC61105.2024.00316}, doi = {10.1109/COMPSAC61105.2024.00316}, year = {2024}, date = {2024-07-01}, booktitle = {2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)}, pages = {1977-1982}, publisher = {IEEE}, address = {Osaka, Japan}, abstract = {Orthodontic treatment planning often involves deciding whether to extract teeth, a critical and irreversible decision. Integrating machine learning (ML) can enhance decision-making. This study proposes using Intraoral Scans (IOS) 3D models to predict extraction/non-extraction binary decisions with ML models. We leverage a multiview approach, using images taken from multiple points of view of the 3D model. The methodology involved a dataset composed of preprocessed IOS from 181 subjects and an experimental procedure that evaluated multiple ML models in their ability to classify subjects using either grayscale pixel intensities or radiomic features. The results indicated that a logistic model applied to the radiomic features from the back and frontal views of the 3D models was one of the best model candidates, achieving a test accuracy of 70% and F1 scores of 0.73 and 0.65 for non-extraction and extraction cases, respectively. Overall, these findings indicate that a multiview approach to IOS 3D models can be used to predict extraction/non-extraction decisions. In addition, the results suggest that radiomic features provide useful information in the analysis of IOS data.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Guder, Larissa; Aires, João Paulo; Meneguzzi, Felipe; Griebler, Dalvan Dimensional Speech Emotion Recognition from Bimodal Features Inproceedings In: Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde, pp. 579-590, SBC, Goiânia, Brasil, 2024. @inproceedings{GUDER:SBCAS:24, title = {Dimensional Speech Emotion Recognition from Bimodal Features}, author = {Larissa Guder and João Paulo Aires and Felipe Meneguzzi and Dalvan Griebler}, url = {https://doi.org/10.5753/sbcas.2024.2779}, doi = {10.5753/sbcas.2024.2779}, year = {2024}, date = {2024-07-01}, booktitle = {Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde}, pages = {579-590}, publisher = {SBC}, address = {Goiânia, Brasil}, abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance to represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios where processing the input quickly is necessary. Considering these aspects, we take the first step towards creating a bimodal approach for dimensional speech emotion recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech-emotion recognition. Our final architecture achieves a Concordance Correlation Coefficient of 0.5915 for arousal, 0.1431 for valence, and 0.5899 for dominance in the IEMOCAP dataset.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Leonarczyk, Ricardo; Griebler, Dalvan; Mencagli, Gabriele; Danelutto, Marco Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing Inproceedings In: Euro-Par 2023: Parallel Processing Workshops, pp. 81-92, Springer, Limassol, Cyprus, 2024. @inproceedings{LEONARCZYK:Euro-ParW:23, title = {Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing}, author = {Ricardo Leonarczyk and Dalvan Griebler and Gabriele Mencagli and Marco Danelutto}, url = {https://doi.org/10.1007/978-3-031-50684-0_7}, doi = {10.1007/978-3-031-50684-0_7}, year = {2024}, date = {2024-04-01}, booktitle = {Euro-Par 2023: Parallel Processing Workshops}, pages = {81-92}, publisher = {Springer}, address = {Limassol, Cyprus}, series = {Euro-ParW'23}, abstract = {Stream processing plays a vital role in applications that require continuous, low-latency data processing. Thanks to their extensive parallel processing capabilities and relatively low cost, GPUs are well-suited to scenarios where such applications require substantial computational resources. However, micro-batching becomes essential for efficient GPU computation within stream processing systems, and finding appropriate batch sizes to maintain an adequate level of service is often challenging, particularly in cases where applications experience fluctuations in input rate and workload. Addressing this challenge requires adjusting the optimal batch size at runtime. This study proposes a methodology for evaluating different self-adaptive micro-batching strategies in a real-world complex streaming application used as a benchmark.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; García, José Daniel; Muñoz, Javier Fernández; Fernandes, Luiz Gustavo Performance and programmability of GrPPI for parallel stream processing on multi-cores Journal Article In: The Journal of Supercomputing, vol. 80, no. 9, pp. 12966-13000, 2024. @article{GARCIA:JS:24, title = {Performance and programmability of GrPPI for parallel stream processing on multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-024-05934-z}, doi = {10.1007/s11227-024-05934-z}, year = {2024}, date = {2024-02-01}, urldate = {2024-02-01}, journal = {The Journal of Supercomputing}, volume = {80}, number = {9}, pages = {12966-13000}, publisher = {Springer}, abstract = {The GrPPI library aims to simplify the burdensome task of parallel programming. It provides a unified, abstract, and generic layer while promising minimal overhead on performance. Although it supports stream parallelism, GrPPI lacks an evaluation regarding representative performance metrics for this domain, such as throughput and latency. This work evaluates GrPPI focused on parallel stream processing. We compare the throughput and latency performance, memory usage, and programmability of GrPPI against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks and benchmarks with handwritten parallel code using the same backends supported by GrPPI. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is often competitive with handwritten parallel code, the infeasibility of fine-tuning GrPPI is a crucial drawback for emerging applications. Despite this, programmability experiments estimate that GrPPI can potentially reduce the development time of parallel applications by about three times.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Mencagli, Gabriele; Torquati, Massimo; Griebler, Dalvan; Fais, Alessandra; Danelutto, Marco General-purpose data stream processing on heterogeneous architectures with WindFlow Journal Article In: Journal of Parallel and Distributed Computing, vol. 184, pp. 104782, 2024. @article{MENCAGLI:JPDC:24, title = {General-purpose data stream processing on heterogeneous architectures with WindFlow}, author = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Alessandra Fais and Marco Danelutto}, url = {https://doi.org/10.1016/j.jpdc.2023.104782}, doi = {10.1016/j.jpdc.2023.104782}, year = {2024}, date = {2024-02-01}, urldate = {2024-02-01}, journal = {Journal of Parallel and Distributed Computing}, volume = {184}, pages = {104782}, publisher = {Elsevier}, abstract = {Many emerging applications analyze data streams by running graphs of communicating tasks called operators. To develop and deploy such applications, Stream Processing Systems (SPSs) like Apache Storm and Flink have been made available to researchers and practitioners. They exhibit imperative or declarative programming interfaces to develop operators running arbitrary algorithms working on structured or unstructured data streams. In this context, the interest in leveraging hardware acceleration with GPUs has become more pronounced in high-throughput use cases. Unfortunately, GPU acceleration has been studied for relational operators working on structured streams only, while non-relational operators have often been overlooked. This paper presents WindFlow, a library supporting the seamless GPU offloading of general partitioned-stateful operators, extending the range of operators that benefit from hardware acceleration. Its design provides high throughput while still exposing a high-level API to users, compared with the raw utilization of GPUs in Apache Flink.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Fischer, Gabriel Souto; Ramos, Gabriel Oliveira; Costa, Cristiano André; Alberti, Antonio Marcos; Griebler, Dalvan; Singh, Dhananjay; Righi, Rodrigo Rosa Multi-Hospital Management: Combining Vital Signs IoT Data and the Elasticity Technique to Support Healthcare 4.0 Journal Article In: IoT, vol. 5, no. 2, pp. 381-408, 2024. @article{FISCHER:IoT:24, title = {Multi-Hospital Management: Combining Vital Signs IoT Data and the Elasticity Technique to Support Healthcare 4.0}, author = {Gabriel Souto Fischer and Gabriel Oliveira Ramos and Cristiano André Costa and Antonio Marcos Alberti and Dalvan Griebler and Dhananjay Singh and Rodrigo Rosa Righi}, url = {https://doi.org/10.3390/iot5020019}, doi = {10.3390/iot5020019}, year = {2024}, date = {2024-01-01}, urldate = {2024-01-01}, journal = {IoT}, volume = {5}, number = {2}, pages = {381-408}, publisher = {MDPI}, abstract = {Smart cities can improve the quality of life of citizens by optimizing the utilization of resources. In an IoT-connected environment, people's health can be constantly monitored, which can help identify medical problems before they become serious. However, overcrowded hospitals can lead to long waiting times for patients to receive treatment. The literature presents alternatives to address this problem by adjusting care capacity to demand. However, there is still a need for a solution that can adjust human resources in multiple healthcare settings, which is the reality of cities. This work introduces HealCity, a smart-city-focused model that can monitor patients’ use of healthcare settings and adapt the allocation of health professionals to meet their needs. HealCity uses vital signs (IoT) data in prediction techniques to anticipate when the demand for a given environment will exceed its capacity and suggests actions to allocate health professionals accordingly. Additionally, we introduce the concept of multilevel proactive human resources elasticity in smart cities, thus managing human resources at different levels of a smart city. An algorithm is also devised to automatically manage and identify the appropriate hospital for a possible future patient. Furthermore, some IoT deployment considerations are presented based on a hardware implementation for the proposed model. HealCity was evaluated with four hospital settings and obtained promising results: Compared to hospitals with rigid professional allocations, it reduced waiting time for care by up to 87.62%.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
2023
	Hoffmann, Renato Barreto; Faé, Leonardo; Manssour, Isabel; Griebler, Dalvan Analyzing C++ Stream Parallelism in Shared-Memory when Porting to Flink and Storm Inproceedings In: International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 1-8, IEEE, Porto Alegre, Brazil, 2023. @inproceedings{HOFFMANN:SBAC-PADW:23, title = {Analyzing C++ Stream Parallelism in Shared-Memory when Porting to Flink and Storm}, author = {Renato Barreto Hoffmann and Leonardo Faé and Isabel Manssour and Dalvan Griebler}, url = {https://doi.org/10.1109/SBAC-PADW60351.2023.00017}, doi = {10.1109/SBAC-PADW60351.2023.00017}, year = {2023}, date = {2023-10-01}, booktitle = {International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)}, pages = {1-8}, publisher = {IEEE}, address = {Porto Alegre, Brazil}, series = {SBAC-PADW'23}, abstract = {Stream processing plays a crucial role in various information-oriented digital systems. Two popular frameworks for real-time data processing, Flink and Storm, provide solutions for effective parallel stream processing in Java. An option to leverage Java's mature ecosystem for distributed stream processing involves porting legacy C++ applications to Java. However, this raises considerations on the adequacy of the equivalent Java mechanisms and potential degradation in throughput. Therefore, our objective is to evaluate programmability and performance when converting stream processing applications from C++ to Java while also exploring the parallelization capabilities offered by Flink and Storm. Furthermore, we aim to assess the throughput of Flink and Storm on shared-memory manycore machines, a hardware architecture commonly found in cloud environments. To achieve this, we conduct experiments involving four different stream processing applications. We highlight challenges encountered when porting C++ to Java and working with Flink and Storm. Furthermore, we discuss throughput, latency, CPU, and memory usage results.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo Extending the Planning Poker Method to Estimate the Development Effort of Parallel Applications Inproceedings In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 181-192, SBC, Porto Alegre, Brasil, 2023. @inproceedings{ANDRADE:WSCAD:23, title = {Extending the Planning Poker Method to Estimate the Development Effort of Parallel Applications}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes}, url = {https://doi.org/10.5753/wscad.2023.235925}, doi = {10.5753/wscad.2023.235925}, year = {2023}, date = {2023-10-01}, booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)}, pages = {181-192}, publisher = {SBC}, address = {Porto Alegre, Brasil}, abstract = {Since different Parallel Programming Interfaces (PPIs) are available to programmers, evaluating them to identify the most suitable PPI also became necessary. Recently, in addition to the performance of PPIs, developers’ productivity has also been evaluated by researchers in parallel processing. Some researchers conduct empirical studies involving people for productivity evaluation, which is time-consuming. Aiming to propose a less costly method for evaluating the development effort of parallel applications, we proposed modifying the Planning Poker method in this paper. We consider a representative set of parallel stream processing applications to evaluate the proposed modification. Our results showed that the proposed method required less effort for practical use than the controlled experiments with students.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Alf, Lucas; Hoffmann, Renato Barreto; Müller, Caetano; Griebler, Dalvan Análise da Execução de Algoritmos de Aprendizado de Máquina em Dispositivos Embarcados Inproceedings doi In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 61-72, SBC, Porto Alegre, Brasil, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{ALF:WSCAD:23, title = {Análise da Execução de Algoritmos de Aprendizado de Máquina em Dispositivos Embarcados}, author = {Lucas Alf and Renato Barreto Hoffmann and Caetano Müller and Dalvan Griebler}, url = {https://doi.org/10.5753/wscad.2023.235915}, doi = {10.5753/wscad.2023.235915}, year = {2023}, date = {2023-10-01}, booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)}, pages = {61-72}, publisher = {SBC}, address = {Porto Alegre, Brasil}, abstract = {Advances in IoT motivate the use of machine learning algorithms on embedded devices. However, these algorithms demand a considerable amount of computational resources. The goal of this work was to analyze machine learning algorithms on embedded devices using CPU and GPU parallelism in order to understand which hardware and software characteristics perform best with respect to energy consumption, inferences per second, and accuracy. Three Convolutional Neural Network models were evaluated, as well as traditional algorithms and classification and regression neural networks. The experiments showed that PyTorch achieved the best performance on the CNN models and on the classification and regression neural networks when using a GPU, while Keras performed better when using only the CPU.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Bianchessi, Arthur S.; Mallmann, Leonardo; Hoffmann, Renato Barreto; Griebler, Dalvan Conversão do NAS Parallel Benchmarks para C++ Standard Inproceedings doi In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 313-324, SBC, Porto Alegre, Brasil, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{BIANCHESSI:WSCAD:23, title = {Conversão do NAS Parallel Benchmarks para C++ Standard}, author = {Arthur S. Bianchessi and Leonardo Mallmann and Renato Barreto Hoffmann and Dalvan Griebler}, url = {https://doi.org/10.5753/wscad.2023.235913}, doi = {10.5753/wscad.2023.235913}, year = {2023}, date = {2023-10-01}, booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)}, pages = {313-324}, publisher = {SBC}, address = {Porto Alegre, Brasil}, abstract = {The C++ language gained new parallelism abstractions with the definition of execution policies for the standard library algorithms. However, the suitability and performance of this alternative still need to be studied in comparison with other well-established alternatives. Therefore, the goal of this work was to explore the wide range of features offered by the C++ standard library to evaluate their applicability and performance using five NPB kernels. Experiments in a multithreaded environment found that incorporating standard library data structures, as well as the abstraction created for multidimensional access, has no notable impact on execution time. The algorithms with parallel execution policies, however, showed a statistically significant performance loss.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Faé, Leonardo; Hoffmann, Renato Barreto; Griebler, Dalvan Source-to-Source Code Transformation on Rust for High-Level Stream Parallelism Inproceedings doi In: XXVII Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Campo Grande, Brazil, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{FAE:SBLP:23, title = {Source-to-Source Code Transformation on Rust for High-Level Stream Parallelism}, author = {Leonardo Faé and Renato Barreto Hoffmann and Dalvan Griebler}, url = {https://doi.org/10.1145/3624309.3624320}, doi = {10.1145/3624309.3624320}, year = {2023}, date = {2023-09-01}, booktitle = {XXVII Brazilian Symposium on Programming Languages (SBLP)}, pages = {41-49}, publisher = {ACM}, address = {Campo Grande, Brazil}, series = {SBLP'23}, abstract = {Utilizing parallel systems to their full potential can be challenging for general-purpose developers. A solution to this problem is to create high-level abstractions using Domain-Specific Languages (DSL). We create a stream-processing DSL for Rust, a growing programming language focusing on performance and safety. To that end, we explore Rust’s macros as a high-level abstraction tool to support an existing DSL language named SPar and perform source-to-source code transformations in the abstract syntax tree. We aim to assess the Rust source-to-source code transformations toolset and its implications. We highlight that Rust macros are powerful tools for performing source-to-source code transformations for abstracting structured stream processing. In addition, execution time and programmability results are comparable to other solutions.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; García, José Daniel; Muñoz, Javier Fernández; Fernandes, Luiz Gustavo A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores Inproceedings doi In: 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 164-168, IEEE, Naples, Italy, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{GARCIA:PDP:23, title = {A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/PDP59025.2023.00033}, doi = {10.1109/PDP59025.2023.00033}, year = {2023}, date = {2023-03-01}, booktitle = {31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages = {164-168}, publisher = {IEEE}, address = {Naples, Italy}, series = {PDP'23}, abstract = {Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. 
Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo A parallel programming assessment for stream processing applications on multi-core systems Journal Article doi In: Computer Standards & Interfaces, vol. 84, pp. 103691, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @article{ANDRADE:CSI:2023, title = {A parallel programming assessment for stream processing applications on multi-core systems}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.csi.2022.103691}, doi = {10.1016/j.csi.2022.103691}, year = {2023}, date = {2023-03-01}, journal = {Computer Standards & Interfaces}, volume = {84}, pages = {103691}, publisher = {Elsevier}, abstract = {Multi-core systems are any computing device nowadays and stream processing applications are becoming recurrent workloads, demanding parallelism to achieve the desired quality of service. As soon as data, tasks, or requests arrive, they must be computed, analyzed, or processed. Since building such applications is not a trivial task, the software industry must adopt parallel APIs (Application Programming Interfaces) that simplify the exploitation of parallelism in hardware for accelerating time-to-market. In the last years, research efforts in academia and industry provided a set of parallel APIs, increasing productivity to software developers. However, a few studies are seeking to prove the usability of these interfaces. In this work, we aim to present a parallel programming assessment regarding the usability of parallel API for expressing parallelism on the stream processing application domain and multi-core systems. To this end, we conducted an empirical study with beginners in parallel application development. The study covered three parallel APIs, reporting several quantitative and qualitative indicators involving developers. 
Our contribution also comprises a parallel programming assessment methodology, which can be replicated in future assessments. This study revealed important insights such as recurrent compile-time and programming logic errors performed by beginners in parallel programming, as well as the programming effort, challenges, and learning curve. Moreover, we collected the participants’ opinions about their experience in this study to understand deeply the results achieved.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Araujo, Gabriell; Griebler, Dalvan; Rockenbach, Dinei A.; Danelutto, Marco; Fernandes, Luiz Gustavo NAS Parallel Benchmarks with CUDA and Beyond Journal Article doi In: Software: Practice and Experience, vol. 53, no. 1, pp. 53-80, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @article{ARAUJO:SPE:23, title = {NAS Parallel Benchmarks with CUDA and Beyond}, author = {Gabriell Araujo and Dalvan Griebler and Dinei A. Rockenbach and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1002/spe.3056}, doi = {10.1002/spe.3056}, year = {2023}, date = {2023-01-01}, urldate = {2023-01-01}, journal = {Software: Practice and Experience}, volume = {53}, number = {1}, pages = {53-80}, publisher = {Wiley}, abstract = {NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. 
The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Micro-batch and data frequency for stream processing on multi-cores Journal Article doi In: The Journal of Supercomputing, vol. 79, no. 8, pp. 9206-9244, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @article{GARCIA:JS:23, title = {Micro-batch and data frequency for stream processing on multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-022-05024-y}, doi = {10.1007/s11227-022-05024-y}, year = {2023}, date = {2023-01-01}, journal = {The Journal of Supercomputing}, volume = {79}, number = {8}, pages = {9206-9244}, publisher = {Springer}, abstract = {Latency or throughput is often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generates the most commonly used frequency patterns for benchmarking stream processing in related work. It allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow. These are two libraries that leverage stream parallelism for multi-core architectures. 
Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo SPBench: a framework for creating benchmarks of stream processing applications Journal Article doi In: Computing, vol. 105, no. 5, pp. 1077-1099, 2023. (Abstract \| Links \| BibTeX \| Tags: ) @article{GARCIA:Computing:23, title = {SPBench: a framework for creating benchmarks of stream processing applications}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s00607-021-01025-6}, doi = {10.1007/s00607-021-01025-6}, year = {2023}, date = {2023-01-01}, urldate = {2023-01-01}, journal = {Computing}, volume = {105}, number = {5}, pages = {1077-1099}, publisher = {Springer}, abstract = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and, text are demanding quickly and efficiently computation. Stream Parallelism allows accelerating this computation for real-time processing. But it is still a challenging task and most reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. 
SPBench demonstrated to be a high-level, reusable, extensible, and easy of use abstraction to build parallel stream processing benchmarks on multi-core architectures.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
2022
	Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo Combining stream with data parallelism abstractions for multi-cores Journal Article doi In: Journal of Computer Languages, vol. 73, pp. 101160, 2022. (Abstract \| Links \| BibTeX \| Tags: ) @article{LOFF:COLA:22, title = {Combining stream with data parallelism abstractions for multi-cores}, author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.cola.2022.101160}, doi = {10.1016/j.cola.2022.101160}, year = {2022}, date = {2022-12-01}, urldate = {2022-12-01}, journal = {Journal of Computer Languages}, volume = {73}, pages = {101160}, publisher = {Elsevier}, abstract = {Stream processing applications have seen an increasing demand with the raised availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that require to be processed timely. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. In this work, we introduce improvements to stream processing applications by exploiting fine-grained data parallelism (via Map and MapReduce) inside coarse-grained stream parallelism stages. The improvements are including techniques for identifying data parallelism in sequential codes, a new language, semantic analysis, and a set of definition and transformation rules to perform source-to-source parallel code generation. Moreover, we investigate the feasibility of employing higher-level programming abstractions to support the proposed optimizations. For that, we elect SPar programming model as a use case, and extend it by adding two new attributes to its language and implementing our optimizations as a new algorithm in the SPar compiler. We conduct a set of experiments in representative stream processing and data-parallel applications. 
The results showed that our new compiler algorithm is efficient and that performance improved by up to 108.4x in data-parallel applications. Furthermore, experiments evaluating stream processing applications towards the composition of stream and data parallelism revealed new insights. The results showed that such composition may improve latencies by up to an order of magnitude. Also, it enables programmers to exploit different degrees of stream and data parallelism to accomplish a balance between throughput and latency according to their necessity.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Ernstsson, August; Griebler, Dalvan; Kessler, Christoph Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems Journal Article doi In: International Journal of Parallel Programming, vol. 51, no. 5, pp. 61-82, 2022. (Abstract \| Links \| BibTeX \| Tags: ) @article{Ernstsson:IJPP:22, title = {Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems}, author = {August Ernstsson and Dalvan Griebler and Christoph Kessler}, url = {https://doi.org/10.1007/s10766-022-00746-1}, doi = {10.1007/s10766-022-00746-1}, year = {2022}, date = {2022-12-01}, urldate = {2022-12-01}, journal = {International Journal of Parallel Programming}, volume = {51}, number = {5}, pages = {61-82}, publisher = {Springer}, abstract = {We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of a part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80% and in particular for CPU systems, SkePU can outperform hand-written code.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Rockenbach, Dinei A.; Löff, Júnior; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz G. High-Level Stream and Data Parallelism in C++ for GPUs Inproceedings In: XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022.
@inproceedings{ROCKENBACH:SBLP:22,
  title = {High-Level Stream and Data Parallelism in C++ for GPUs},
  author = {Dinei A. Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G. Fernandes},
  url = {https://doi.org/10.1145/3561320.3561327},
  doi = {10.1145/3561320.3561327},
  year = {2022},
  date = {2022-10-01},
  booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)},
  pages = {41-49},
  publisher = {ACM},
  address = {Uberlândia, Brazil},
  series = {SBLP'22},
  abstract = {GPUs are massively parallel processors that allow solving problems that are not viable on traditional processors like CPUs. However, implementing applications for GPUs is challenging for programmers, as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in the serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to the SPar language and transformation rules to the SPar compiler. These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar to achieve higher throughput when exploring combined CPU and GPU parallelism, and when using batching.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas Inproceedings In: Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 276-287, SBC, Florianópolis, Brasil, 2022.
@inproceedings{ANDRADE:WSCAD:22,
  title = {Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/wscad.2022.226392},
  doi = {10.5753/wscad.2022.226392},
  year = {2022},
  date = {2022-10-01},
  booktitle = {Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
  pages = {276-287},
  publisher = {SBC},
  address = {Florianópolis, Brasil},
  abstract = {With the popularization of parallel architectures, several programming interfaces have emerged to ease the exploitation of such architectures and increase developer productivity. However, developing parallel applications is still a complex task for developers with little experience. In this work, we conducted a survey to find out the opinion of parallel application developers about the factors that hinder productivity. Our results showed that developer experience is one of the main reasons for low productivity. In addition, the results indicated ways to work around this problem, such as improving and encouraging the teaching of parallel programming in undergraduate courses.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Kessler, Christoph; Ernstsson, August; Fernandes, Luiz Gustavo Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing Inproceedings In: 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), pp. 229-232, IEEE, Gran Canaria, Spain, 2022.
@inproceedings{ANDRADE:SEAA:22,
  title = {Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Christoph Kessler and August Ernstsson and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/SEAA56994.2022.00043},
  doi = {10.1109/SEAA56994.2022.00043},
  year = {2022},
  date = {2022-09-01},
  booktitle = {48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022)},
  pages = {229-232},
  publisher = {IEEE},
  address = {Gran Canaria, Spain},
  series = {SEAA'22},
  abstract = {Over the years, several Parallel Programming Models (PPMs) have supported the abstraction of programming complexity for parallel computer systems. However, few studies aim to evaluate the productivity reached by such abstractions, since this is a complex task that involves human beings. In software engineering, several studies have developed predictive methods to estimate the effort required to program applications. In order to evaluate the reliability of such metrics, it is necessary to assess their accuracy in different programming domains. In this work, we used the data of an experiment conducted with beginners in parallel programming to determine the effort required for implementing stream parallelism using FastFlow, SPar, and TBB. Our results show that some traditional software effort estimation models, such as COCOMO II, fall short, while Putnam's model could be an alternative for evaluating high-level PPMs. To overcome the limitations of existing models, we plan to create a parallelism-aware model to evaluate applications in this domain in future work.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices Inproceedings In: 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 34-38, IEEE, Valladolid, Spain, 2022.
@inproceedings{MENCAGLI:PDP:22,
  title = {Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices},
  author = {Gabriele Mencagli and Dalvan Griebler and Marco Danelutto},
  url = {https://doi.org/10.1109/PDP55904.2022.00014},
  doi = {10.1109/PDP55904.2022.00014},
  year = {2022},
  date = {2022-04-01},
  booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {34-38},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'22},
  abstract = {Data Stream Processing is a pervasive computing paradigm with a wide spectrum of applications. Traditional streaming systems exploit the processing capabilities provided by homogeneous Clusters and Clouds. Due to the transition to streaming systems suitable for IoT/Edge environments, there has been an urgent need for new streaming frameworks and tools tailored for embedded platforms, often available as System-on-Chips composed of a small multicore CPU and an integrated on-chip GPU. Exploiting this hybrid hardware requires special care in the runtime system design. In this paper, we discuss the support provided by the WindFlow library, showing its design principles and its effectiveness on the NVIDIA Jetson Nano board.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores Inproceedings In: 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 10-17, IEEE, Valladolid, Spain, 2022.
@inproceedings{GARCIA:PDP:22,
  title = {Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/PDP55904.2022.00011},
  doi = {10.1109/PDP55904.2022.00011},
  year = {2022},
  date = {2022-04-01},
  booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {10-17},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'22},
  abstract = {In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated up to 33% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Gomes, Márcio Miguel; Righi, Rodrigo Rosa; Costa, Cristiano André; Griebler, Dalvan Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog Journal Article In: International Journal of Computer Science & Information Technology, vol. 14, no. 1, pp. 31-51, 2022.
@article{GOMES:IJCSIT:22,
  title = {Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog},
  author = {Márcio Miguel Gomes and Rodrigo Rosa Righi and Cristiano André Costa and Dalvan Griebler},
  url = {http://dx.doi.org/10.5121/ijcsit.2022.14103},
  doi = {10.5121/ijcsit.2022.14103},
  year = {2022},
  date = {2022-02-01},
  urldate = {2022-02-01},
  journal = {International Journal of Computer Science \& Information Technology},
  volume = {14},
  number = {1},
  pages = {31-51},
  publisher = {AIRCC},
  abstract = {IoT applications usually rely on cloud computing services to perform data analysis such as filtering, aggregation, classification, pattern detection, and prediction. When applied to specific domains, the IoT needs to deal with unique constraints. Besides the hostile environment, such as vibration and electromagnetic interference, resulting in malfunction, noise, and data loss, industrial plants often have Internet access restricted or unavailable, forcing us to design stand-alone fog and edge computing solutions. In this context, we present STEAM++, a lightweight and extensible framework for real-time data stream processing and decision-making at the network edge, targeting hardware-limited devices, besides proposing a micro-benchmark methodology for assessing embedded IoT applications. In real-case experiments in a semiconductor industry, we processed an entire data flow, from sensing values, processing and analysing data, and detecting relevant events to, finally, publishing results to a dashboard. On average, the application consumed less than 500kb RAM and 1.0% of CPU usage, processing up to 239 data packets per second and reducing the output data size to 14% of the input raw data size when notifying events.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Löff, Júnior; Hoffmann, Renato Barreto; Pieper, Ricardo; Griebler, Dalvan; Fernandes, Luiz Gustavo DSParLib: A C++ Template Library for Distributed Stream Parallelism Journal Article In: International Journal of Parallel Programming, vol. 50, no. 5, pp. 454-485, 2022.
@article{LOFF:IJPP:22,
  title = {DSParLib: A C++ Template Library for Distributed Stream Parallelism},
  author = {Júnior Löff and Renato Barreto Hoffmann and Ricardo Pieper and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s10766-022-00737-2},
  doi = {10.1007/s10766-022-00737-2},
  year = {2022},
  date = {2022-01-01},
  journal = {International Journal of Parallel Programming},
  volume = {50},
  number = {5},
  pages = {454-485},
  publisher = {Springer},
  abstract = {Stream processing applications deal with millions of data items continuously generated over time. Often, they must be processed in real-time and scale performance, which requires the use of distributed parallel computing resources. In C/C++, the current state-of-the-art for distributed architectures and High-Performance Computing is Message Passing Interface (MPI). However, exploiting stream parallelism using MPI is complex and error-prone because it exposes many low-level details to the programmer. In this work, we introduce a new parallel programming abstraction for implementing distributed stream parallelism named DSParLib. Our abstraction of MPI simplifies parallel programming by providing a pattern-based and building block-oriented development to inter-connect, model, and parallelize data streams found in modern applications. Experiments conducted with five different stream processing applications and the representative PARSEC Ferret benchmark revealed that DSParLib is efficient and flexible. Also, DSParLib achieved similar or better performance, required less coding, and provided simpler abstractions to express parallelism with respect to handwritten MPI programs.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Hoffmann, Renato Barreto; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo OpenMP as runtime for providing high-level stream parallelism on multi-cores Journal Article In: The Journal of Supercomputing, vol. 78, no. 1, pp. 7655-7676, 2022.
@article{HOFFMANN:Jsuper:2022,
  title = {OpenMP as runtime for providing high-level stream parallelism on multi-cores},
  author = {Renato Barreto Hoffmann and Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s11227-021-04182-9},
  doi = {10.1007/s11227-021-04182-9},
  year = {2022},
  date = {2022-01-01},
  journal = {The Journal of Supercomputing},
  volume = {78},
  number = {1},
  pages = {7655-7676},
  publisher = {Springer},
  address = {New York, United States},
  abstract = {OpenMP is an industry and academic standard for parallel programming. However, using it for developing parallel stream processing applications is complex and challenging. OpenMP lacks key programming mechanisms and abstractions for this particular domain. To tackle this problem, we used a high-level parallel programming framework (named SPar) for automatically generating parallel OpenMP code. We achieved this by leveraging SPar’s language and its domain-specific code annotations for simplifying the complexity and verbosity added by OpenMP in this application domain. Consequently, we implemented a new compiler algorithm in SPar for automatically generating parallel code targeting the OpenMP runtime using source-to-source code transformations. The experiments in four different stream processing applications demonstrated that the execution time of SPar improved by up to 25.42% when using the OpenMP runtime. Additionally, our abstraction over OpenMP introduced at most 1.72% execution time overhead when compared to handwritten parallel codes. Furthermore, SPar significantly reduces the total source lines of code required to express parallelism with respect to plain OpenMP parallel codes.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2021
	Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz G. High-Level Stream and Data Parallelism in C++ for Multi-Cores Inproceedings In: XXV Brazilian Symposium on Programming Languages (SBLP), pp. 41-48, ACM, Joinville, Brazil, 2021.
@inproceedings{LOFF:SBLP:21,
  title = {High-Level Stream and Data Parallelism in C++ for Multi-Cores},
  author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz G. Fernandes},
  url = {https://doi.org/10.1145/3475061.3475078},
  doi = {10.1145/3475061.3475078},
  year = {2021},
  date = {2021-10-01},
  booktitle = {XXV Brazilian Symposium on Programming Languages (SBLP)},
  pages = {41-48},
  publisher = {ACM},
  address = {Joinville, Brazil},
  series = {SBLP'21},
  abstract = {Stream processing applications have seen an increasing demand with the increased availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that must be processed in a timely manner. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. However, parallel programming is often difficult and error-prone, because programmers must deal with low-level system and architecture details. In this work, we introduce a new strategy for automatic data-parallel code generation in C++ targeting multi-core architectures. This strategy was integrated with an annotation-based parallel programming abstraction named SPar. We have increased SPar’s expressiveness for supporting stream and data parallelism, and their arbitrary composition. Therefore, we added two new attributes to its language and improved the compiler parallel code generation. We conducted a set of experiments on different stream and data-parallel applications to assess the efficiency of our solution. The results showed that the new SPar version obtained similar performance with respect to handwritten parallelizations. Moreover, the new SPar version is able to achieve up to 74.9x better performance with respect to the original one due to this work.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Danelutto, Marco; Fernandes, Luiz Gustavo Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores Inproceedings In: 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2021), pp. 291-295, IEEE, Pavia, Italy, 2021.
@inproceedings{ANDRADE:SEAA:21,
  title = {Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Marco Danelutto and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/SEAA53835.2021.00044},
  doi = {10.1109/SEAA53835.2021.00044},
  year = {2021},
  date = {2021-09-01},
  booktitle = {47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2021)},
  pages = {291-295},
  publisher = {IEEE},
  address = {Pavia, Italy},
  series = {SEAA'21},
  abstract = {From the popularization of multi-core architectures, several parallel APIs have emerged, helping to abstract the programming complexity and increasing productivity in application development. Unfortunately, only a few research efforts in this direction managed to show the usability pay-back of the programming abstraction created, because it is not easy and poses many challenges for conducting empirical software engineering. We believe that coding metrics commonly used in software engineering code measurements can give useful indicators on the programming effort of parallel applications and APIs. These metrics were designed for general purposes without considering the evaluation of applications from a specific domain. In this study, we aim to evaluate the feasibility of seven coding metrics to be used in the parallel programming domain. To do so, five stream processing applications implemented with different parallel APIs for multi-cores were considered. Our experiments have shown COCOMO II is a suitable model for evaluating the productivity of different parallel APIs targeting multi-cores on stream processing applications while other metrics are restricted to the code size.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Löff, Júnior; Griebler, Dalvan; Mencagli, Gabriele; Araujo, Gabriell; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures Journal Article doi In: Future Generation Computer Systems, vol. 125, pp. 743-757, 2021. (Abstract \| Links \| BibTeX \| Tags: ) @article{LOFF:FGCS:21, title = {The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures}, author = {Júnior Löff and Dalvan Griebler and Gabriele Mencagli and Gabriell Araujo and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.future.2021.07.021}, doi = {10.1016/j.future.2021.07.021}, year = {2021}, date = {2021-07-01}, journal = {Future Generation Computer Systems}, volume = {125}, pages = {743-757}, publisher = {Elsevier}, abstract = {The NAS Parallel Benchmarks (NPB), originally implemented mostly in Fortran, is a consolidated suite containing several benchmarks extracted from Computational Fluid Dynamics (CFD) models. The benchmark suite has important characteristics such as intensive memory communications, complex data dependencies, different memory access patterns, and hardware components/sub-systems overload. Parallel programming APIs, libraries, and frameworks that are written in C++ as well as new optimizations and parallel processing techniques can benefit if NPB is made fully available in this programming language. In this paper we present NPB-CPP, a fully C++ translated version of NPB consisting of all the NPB kernels and pseudo-applications developed using OpenMP, Intel TBB, and FastFlow parallel frameworks for multicores. The design of NPB-CPP leverages the Structured Parallel Programming methodology (essentially based on parallel design patterns). 
We show the structure of each benchmark application in terms of composition of few patterns (notably Map and MapReduce constructs) provided by the selected C++ frameworks. The experimental evaluation shows the accuracy of NPB-CPP with respect to the original NPB source code. Furthermore, we carefully evaluate the parallel performance on three multi-core systems (Intel, IBM Power and AMD) with different C++ compilers (gcc, icc and clang) by discussing the performance differences in order to give to the researchers useful insights to choose the best parallel programming framework for a given type of problem.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Pieper, Ricardo; Löff, Júnior; Hoffmann, Renato Berreto; Griebler, Dalvan; Fernandes, Luiz Gustavo High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores Journal Article doi In: Journal of Computer Languages, vol. 65, pp. 101054, 2021. (Abstract \| Links \| BibTeX \| Tags: ) @article{PIEPER:COLA:21, title = {High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores}, author = {Ricardo Pieper and Júnior Löff and Renato Berreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.cola.2021.101054}, doi = {10.1016/j.cola.2021.101054}, year = {2021}, date = {2021-07-01}, journal = {Journal of Computer Languages}, volume = {65}, pages = {101054}, publisher = {Elsevier}, abstract = {This work aims at contributing with a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (i.e, video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstraction for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions. 
In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {article} }

2025
	Guder, Larissa; Dopke, Luan; Kaiser, Marcos; Griebler, Dalvan; Meneguzzi, Felipe BAH: Beyond Acoustic Handcrafted features for speech emotion recognition in Portuguese Inproceedings doi In: Proceedings of the 31st Brazilian Symposium on Multimedia and the Web, pp. 86-93, SBC Rio de Janeiro, Brazil, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{GUDER:WebMedia:25, title = {BAH: Beyond Acoustic Handcrafted features for speech emotion recognition in Portuguese}, author = {Larissa Guder and Luan Dopke and Marcos Kaiser and Dalvan Griebler and Felipe Meneguzzi}, url = {https://doi.org/10.5753/webmedia.2025.16129}, doi = {10.5753/webmedia.2025.16129}, year = {2025}, date = {2025-11-01}, booktitle = {Proceedings of the 31st Brazilian Symposium on Multimedia and the Web}, pages = {86-93}, address = {Rio de Janeiro, Brazil}, organization = {SBC}, abstract = {It is through affective computing that we have the integration of human feelings and computing applications. One affective computing task is Speech Emotion Recognition (SER), which identifies emotions from spoken audio. Even though emotion is a universal aspect of human experience, each culture and language has different ways to express and understand emotions. So, when designing models for SER, it is common to focus on a single language. In this work, we explore VERBO, a Brazilian Portuguese dataset for categorical emotion recognition. Our main objective is to define the best way to extract acoustic features to train a classifier for SER. We compare 18 different methods to generate audio representations, grouped by handcrafted features and audio embeddings. The best representation for VERBO is TRILL embeddings, and with an SVM classifier, we achieved 92% accuracy in VERBO. 
As far as we know, this was the state of the art for this dataset.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Ahmad, Sunna Imtiaz; Olczyk, Jakub; Araújo, Adriel S.; de Moura Medeiros, João Pedro; Teixeira, Vinicius C.; Gomes, Carlos F. A.; Magnaguagno, Maurício Cecílio; Roederer, Quinn; Dutra, Vinicius; Conley, R. Scott; Griebler, Dalvan; Eckert, George; Pinho, Márcio Sarroglia; Turkkahraman, Hakan A Novel Multimodal Deep Image Analysis Model for Predicting Extraction/Non-Extraction Decision Journal Article doi In: Orthodontics & Craniofacial Research, vol. na, pp. na, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @article{AHMAD:OCR:25, title = {A Novel Multimodal Deep Image Analysis Model for Predicting Extraction/Non-Extraction Decision}, author = {Sunna Imtiaz Ahmad and Jakub Olczyk and Adriel S. Araújo and João Pedro de Moura Medeiros and Vinicius C. Teixeira and Carlos F. A. Gomes and Maurício Cecílio Magnaguagno and Quinn Roederer and Vinicius Dutra and R. Scott Conley and Dalvan Griebler and George Eckert and Márcio Sarroglia Pinho and Hakan Turkkahraman}, url = {https://doi.org/10.1111/ocr.70057}, doi = {10.1111/ocr.70057}, year = {2025}, date = {2025-10-01}, urldate = {2025-10-01}, journal = {Orthodontics & Craniofacial Research}, volume = {na}, pages = {na}, publisher = {Wiley}, abstract = {This study aimed to develop a deep learning model classifier capable of predicting the extraction/non-extraction binary decision using lateral cephalometric radiographs (LCRs) and intraoral scans (IOS) to serve as an additional decision-support tool for orthodontists. Materials and Methods The dataset was composed of LCRs and IOS from 617 patients (mean age: 18.2, 63.5% female) treated at the Indiana University School of Dentistry. Subjects were categorised into two groups: extraction (192) and non-extraction (425). Two sets of features were extracted from IOS: traditional arch measurements and novel tooth spatial features. 
For LCRs, features were derived using CephNet-based landmark detection (Land), a convolutional autoencoder (AE), and the dimensionality was reduced using Principal Component Analysis (PCA). Models were evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV or precision), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR−), and F1 score. Results IOS + Land model achieved the highest overall accuracy (77%) and F1 score (0.62), with strong specificity (83%) and PPV (62%). In contrast, the Land model yielded the highest sensitivity (82%), but at the cost of lower specificity (57%). McNemar's test revealed that the AE model was significantly less accurate than IOS + AE (p = 0.048), IOS + Land (p = 0.006), and IOS + AE + Land (p = 0.005). Conclusion Deep learning models can predict the extraction/non-extraction decision using IOS and LCRs with high accuracy and diagnostic performance. Multimodal approaches, particularly those integrating IOS with cephalometric landmarks, demonstrate superior accuracy, sensitivity, and specificity compared to single-modality models.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz Gustavo Performance, Portability, and Productivity of HIP on GPUs with NAS Parallel Benchmarks Inproceedings doi In: 2025 IEEE/SBC 37th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 204-214, IEEE, Bonito, Brazil, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{ARAUJO:SBAC-PAD:25, title = {Performance, Portability, and Productivity of HIP on GPUs with NAS Parallel Benchmarks}, author = {Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/SBAC-PAD66369.2025.00027}, doi = {10.1109/SBAC-PAD66369.2025.00027}, year = {2025}, date = {2025-10-01}, booktitle = {2025 IEEE/SBC 37th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)}, pages = {204-214}, publisher = {IEEE}, address = {Bonito, Brazil}, series = {SBAC-PAD'25}, abstract = {Graphics Processing Units (GPUs) are powerful, massively parallel processors that have become ubiquitous in modern computing. In recent years, the GPU market has diversified, with vendors like AMD and Intel offering high-performance alternatives to NVIDIA. However, most applications are written using NVIDIA's CUDA API, which is incompatible with non-NVIDIA GPUs, creating significant challenges for developers who must port their code to different architectures. To address this issue, AMD developed the Heterogeneous-Compute Interface for Portability (HIP), an open-source API for cross-vendor GPU programming. However, HIP is relatively new, leaving gaps in the literature regarding its performance, portability, and productivity. In this paper, we evaluate HIP using the NAS Parallel Benchmarks (NPB), a CFD-based suite maintained by NASA. We present the first HIP-based implementation of NPB and conduct experiments on integrated and discrete GPUs from NVIDIA, AMD, and Intel. 
Our results provide novel insights into HIP’s performance and portability, particularly for integrated GPUs and Intel discrete GPUs, which have been underrepresented in prior studies. We also assess productivity using different metrics to quantify the programming effort of HIP-based implementations. This work addresses key gaps in the literature, offering valuable data and insights for developers targeting emerging GPU architectures.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Martins, Eduardo; Hoffmann, Renato; Alf, Lucas; Griebler, Dalvan Interface para Programação de Pipelines Lineares Tolerantes a Falha para MPI Padrão C++ Inproceedings doi In: Anais do XXVI Simpósio em Sistemas Computacionais de Alto Desempenho, pp. 133-144, SBC, Bonito, Brazil, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{MARTINS:SSCAD:25, title = {Interface para Programação de Pipelines Lineares Tolerantes a Falha para MPI Padrão C++}, author = {Eduardo Martins and Renato Hoffmann and Lucas Alf and Dalvan Griebler}, url = {https://doi.org/10.5753/sscad.2025.15867}, doi = {10.5753/sscad.2025.15867}, year = {2025}, date = {2025-10-01}, booktitle = {Anais do XXVI Simpósio em Sistemas Computacionais de Alto Desempenho}, pages = {133-144}, publisher = {SBC}, address = {Bonito, Brazil}, series = {SSCAD'25}, abstract = {Stream processing systems are designed to operate continuously and must be able to recover from failures. However, programming high-performance applications in distributed environments introduces high development complexity. This work presents a programming interface that eases the construction of fault-tolerant linear pipelines for stream processing applications in C++. The solution uses MPI (Message Passing Interface) for communication and the ABS (Asynchronous Barrier Snapshotting) protocol together with a monitor agent for the recovery stage. The experimental results indicate a significant reduction in the estimated development time for the programmer, with an average impact of -0.98% to 6.73% on application throughput. In addition, the recovery process mitigates the impact of failures on program throughput.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Faé, Leonardo; Griebler, Dalvan Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines Inproceedings doi In: Anais do XXIX Simpósio Brasileiro de Linguagens de Programação, pp. 75-83, SBC, Recife/PE, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{FAE:SBLP:25, title = {Towards GPU Parallelism Abstractions in Rust: A Case Study with Linear Pipelines}, author = {Leonardo Faé and Dalvan Griebler}, url = {https://sol.sbc.org.br/index.php/sblp/article/view/36951/36736}, doi = {10.5753/sblp.2025.13152}, year = {2025}, date = {2025-09-01}, booktitle = {Anais do XXIX Simpósio Brasileiro de Linguagens de Programação}, pages = {75-83}, publisher = {SBC}, address = {Recife/PE}, series = {SBLP'25}, abstract = {Programming Graphics Processing Units (GPUs) for general-purpose computation remains a daunting task, often requiring specialized knowledge of low-level APIs like CUDA or OpenCL. While Rust has emerged as a modern, safe, and performant systems programming language, its adoption in the GPU computing domain is still nascent. Existing approaches often involve intricate compiler modifications or complex static analysis to adapt CPU-centric Rust code for GPU execution. This paper presents a novel high-level abstraction in Rust, leveraging procedural macros to automatically generate GPU-executable code from constrained Rust functions. Our approach simplifies the code generation process by imposing specific limitations on how these functions can be written, thereby avoiding the need for complex static analysis. We demonstrate the feasibility and effectiveness of our abstraction through a case study involving linear pipeline parallel patterns, a common structure in data-parallel applications. By transforming Rust functions annotated as source, stage, or sink in a pipeline, we enable straightforward execution on the GPU. 
We evaluate our abstraction's performance and programmability using two benchmark applications: sobel (image filtering) and latbol (fluid simulation), comparing it against manual OpenCL implementations. Our results indicate that while incurring a small performance overhead in some cases, our approach significantly reduces development effort and, in certain scenarios, achieves comparable or even superior throughput compared to CPU-based parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Ahmad, Sunna I.; Araújo, Adriel S.; Teixeira, Vinicius C.; Gomes, Carlos F. A.; Dutra, Vinicius; Roederer, Quinn; Conley, R. Scott; Griebler, Dalvan; Pinho, Márcio S.; Turkkahraman, Hakan A Novel AI-driven Automated Orthodontic Model Analysis to Improve Classification of Orthodontic Extraction Cases Inproceedings doi In: 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1853-1860, IEEE, Toronto, Canada, 2025. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{AHMAD:COMPSAC:25, title = {A Novel AI-driven Automated Orthodontic Model Analysis to Improve Classification of Orthodontic Extraction Cases}, author = {Sunna I. Ahmad and Adriel S. Araújo and Vinicius C. Teixeira and Carlos F. A. Gomes and Vinicius Dutra and Quinn Roederer and R. Scott Conley and Dalvan Griebler and Márcio S. Pinho and Hakan Turkkahraman}, url = {https://doi.org/10.1109/COMPSAC65507.2025.00254}, doi = {10.1109/COMPSAC65507.2025.00254}, year = {2025}, date = {2025-07-01}, booktitle = {2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC)}, pages = {1853-1860}, publisher = {IEEE}, address = {Toronto, Canada}, abstract = {Malocclusion, a prevalent dental condition worldwide, necessitates orthodontic intervention to correct tooth misalignment and improve oral health. Treatment can involve extraction of permanent teeth, depending on dental crowding, jaw relationships, and facial aesthetics. Today, clinical decision support systems have introduced machine learning (ML) to assist orthodontists in determining optimal treatment plans. This study explores the development of a novel, fully automated method for extracting dentoalveolar features from 3D intraoral scans (IOS), aiming to enhance orthodontic decision-making. 
Using deep learning-based IOS segmentation as basis, dental measurements were developed and utilized to train supervised ML classifiers, including support vector machines (SVM), logistic regression, decision trees, and random forests. An ensemble of SVM models demonstrated the highest accuracy (73%) in predicting extraction decisions, with these novel domain-specific features proving more informative than traditional dental arch measurements. While we can make further improvements not only in the automated segmentation but also by applying feature selection, the results highlight the potential of AI-driven analysis to streamline orthodontic workflows, reduce manual intervention and improve clinical efficiency.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Guder, Larissa; Aires, João Paulo; Manssour, Isabel H; Griebler, Dalvan: GoViz: A Visualization Tool for Empowering Transparency in Government Speech. Inproceedings. In: Annual International Conference on Digital Government Research, pp. 954, Digital Government Society, Porto Alegre, Brasil, 2025.
@inproceedings{GUDER:DGO:25,
  title = {GoViz: A Visualization Tool for Empowering Transparency in Government Speech},
  author = {Larissa Guder and João Paulo Aires and Isabel H Manssour and Dalvan Griebler},
  url = {https://doi.org/10.59490/dgo.2025.954},
  doi = {10.59490/dgo.2025.954},
  year = {2025},
  date = {2025-05-01},
  booktitle = {Annual International Conference on Digital Government Research},
  volume = {26},
  pages = {954},
  publisher = {Digital Government Society},
  address = {Porto Alegre, Brasil},
  abstract = {Public speech from government figures often describes relevant actions that can impact the population's lives. However, most people do not have time and access to analyze and understand public speech. Such a scenario narrows the participation of the people in the main discussions, which leads to multiple misunderstandings. In this work, we propose GoViz, a tool that automatically produces visual representations to outline governmental speeches regarding the subject, its main actors, and how they connect to the discussion topics. GoViz processes natural language from speech transcriptions in a pipeline that identifies part-of-speech elements, named-entities, and the relation between persons, making speech content more accessible and insightful. Using publicly available data, we evaluate our tool in two different languages (Portuguese and English). The results demonstrate that the visualizations from both data facilitate understanding the speech content. Thus, our main contribution is to encourage the participation of citizens in parliamentary issues, allowing a simplified and visually engaging avenue to access long speeches and fostering improved communication between parliamentarians and the population.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Czarnul, Paweł; Antal, Marcel; Baniata, Hamza; Griebler, Dalvan; Kertesz, Attila; Kessler, Christoph W.; Kouloumpris, Andreas; Kovačić, Salko; Markus, Andras; Michael, Maria K.; Nikolaou, Panagiota; Öz, Isil; Prodan, Radu; Rakić, Gordana: Optimization of resource-aware parallel and distributed computing: a review. Journal Article. In: The Journal of Supercomputing, vol. 81, no. 7, pp. 848, 2025.
@article{CZARNUL:Supercomputing:25,
  title = {Optimization of resource-aware parallel and distributed computing: a review},
  author = {Paweł Czarnul and Marcel Antal and Hamza Baniata and Dalvan Griebler and Attila Kertesz and Christoph W. Kessler and Andreas Kouloumpris and Salko Kovačić and Andras Markus and Maria K. Michael and Panagiota Nikolaou and Isil Öz and Radu Prodan and Gordana Rakić},
  url = {https://doi.org/10.1007/s11227-025-07295-7},
  doi = {10.1007/s11227-025-07295-7},
  year = {2025},
  date = {2025-05-01},
  urldate = {2025-05-01},
  journal = {The Journal of Supercomputing},
  volume = {81},
  number = {7},
  pages = {848},
  publisher = {Springer},
  abstract = {This paper presents a review of state-of-the-art solutions concerning the optimization of computing in the field of parallel and distributed systems. Firstly, we contribute by identifying resources and quality metrics in this context including servers, network interconnects, storage systems, computational devices as well as execution time/performance, energy, security, and error vulnerability, respectively. We subsequently identify commonly used problem formulations and algorithms for integer linear programming, greedy algorithms, dynamic programming, genetic algorithms, particle swarm optimization, ant colony optimization, game theory, and reinforcement learning. Afterward, we characterize frequently considered optimization problems by stating these terms in domains such as data centers, cloud, fog, blockchain, high performance, and volunteer computing. Based on the extensive analysis, we identify how particular resources and corresponding quality metrics are considered in these domains and which problem formulations are used for which system types, either parallel or distributed environments. This allows us to formulate open research problems and challenges in this field and analyze research interest in problem formulations/domains in recent years.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Rockenbach, Dinei A.; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz Gustavo: GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism. Journal Article. In: Computer Standards & Interfaces, vol. 92, pp. 103922, 2025.
@article{ROCKENBACH:GSParLib:CSI:25,
  title = {GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism},
  author = {Dinei A. Rockenbach and Gabriell Araujo and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1016/j.csi.2024.103922},
  doi = {10.1016/j.csi.2024.103922},
  year = {2025},
  date = {2025-03-01},
  urldate = {2025-03-01},
  journal = {Computer Standards \& Interfaces},
  volume = {92},
  pages = {103922},
  publisher = {Elsevier},
  abstract = {The evolution of Graphics Processing Units (GPUs) has allowed the industry to overcome long-lasting problems and challenges. Many belong to the stream processing domain, whose central aspect is continuously receiving and processing data from streaming data producers such as cameras and sensors. Nonetheless, programming GPUs is challenging because it requires deep knowledge of many-core programming, mechanisms and optimizations for GPUs. Current GPU programming standards do not target stream processing and present programmability and code portability limitations. Among our main scientific contributions resides GSParLib, a C++ multi-level programming interface unifying CUDA and OpenCL for GPU processing on stream and data parallelism with negligible performance losses compared to manual implementations; GSParLib is organized in two layers: one for general-purpose computing and another for high-level structured programming based on parallel patterns; a methodology to provide unified and driver agnostic interfaces minimizing performance losses; a set of parallelism strategies and optimizations for GPU processing targeting stream and data parallelism; and new experiments covering GPU performance on applications exposing stream and data parallelism.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Löff, Júnior; Hoffmann, Renato B.; Bianchessi, Arthur S.; Mallmann, Leonardo; Griebler, Dalvan; Binder, Walter: NPB-PSTL: C++ STL Algorithms with Parallel Execution Policies in NAS Parallel Benchmarks. Inproceedings. In: 33rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 162-169, IEEE, Torino, Italy, 2025.
@inproceedings{LOFF:PDP:25,
  title = {NPB-PSTL: C++ STL Algorithms with Parallel Execution Policies in NAS Parallel Benchmarks},
  author = {Júnior Löff and Renato B. Hoffmann and Arthur S. Bianchessi and Leonardo Mallmann and Dalvan Griebler and Walter Binder},
  url = {https://doi.org/10.1109/PDP66500.2025.00030},
  doi = {10.1109/PDP66500.2025.00030},
  year = {2025},
  date = {2025-03-01},
  booktitle = {33rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {162-169},
  publisher = {IEEE},
  address = {Torino, Italy},
  series = {PDP'25},
  abstract = {The C++ language continually evolves through formal specifications established by its standards committee, proposing new features to maintain C++ as a relevant programming language while improving usability, performance, and portability across platforms. With the addition of parallel Standard Template Library (STL) algorithms in C++17, programmers can now leverage parallel processing capabilities via vendor-neutral parallel execution policies. This study presents an adaptation of the NAS Parallel Benchmarks (NPB)—a well-established suite of applications for evaluating parallel architectures-by porting its sequential C-style code to use C++ STL abstractions and performance-portable parallelism features. Our goals are to (1) assess the suitability of C++ STL for scientific applications like the ones in the NPB and (2) provide a comparative performance and portability of STL algorithms' parallel execution policies across different multicore architectures (x86 and AArch64). Results indicate that the performance of parallel STL algorithms is often close to that of optimized handwritten versions (OpenMP, Intel TBB, and FastFlow) on different architectures, with notable shortfalls. Across all NPB benchmarks, the STL algorithms' geometric mean shows sequential execution times that are between 3.76% and 6.9% higher, while parallel executions may reach a geometric mean of up to 21.21% higher execution time.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Hoffmann, Renato B.; Faé, Leonardo G.; Griebler, Dalvan; Li, Xinliang David; Pereira, Fernando Magno Quintão: Automatic Synthesis of Specialized Hash Functions. Inproceedings. In: Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, pp. 317-330, ACM, Las Vegas, NV, USA, 2025.
@inproceedings{HOFFMANN:sepe:cgo:25,
  title = {Automatic Synthesis of Specialized Hash Functions},
  author = {Renato B. Hoffmann and Leonardo G. Faé and Dalvan Griebler and Xinliang David Li and Fernando Magno Quintão Pereira},
  url = {https://doi.org/10.1145/3696443.3708940},
  doi = {10.1145/3696443.3708940},
  year = {2025},
  date = {2025-03-01},
  booktitle = {Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization},
  pages = {317-330},
  publisher = {ACM},
  address = {Las Vegas, NV, USA},
  series = {CGO '25},
  abstract = {This paper introduces a technique for synthesizing hash functions specialized to particular byte formats. This code generation method leverages three prevalent patterns: (i) fixed-length keys, (ii) keys with common subsequences, and (iii) keys ranging on predetermined sequences of bytes. Code generation involves two algorithms: one identifies relevant regular expressions within key examples, and the other generates specialized hash functions based on these expressions. Comparative analysis demonstrates that the synthetic functions outperform the general-purpose hashes in the C++ Standard Template Library and the Google Abseil Library when keys are given in ascending, normal or uniform distribution. In applications where low-mixing hashes are acceptable, the synthetic functions achieve speedups ranging from 2% to 11% on full benchmarks, and speedups of almost 50x once only hashing speed is considered.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Mencagli, Gabriele; Rymarchuk, Yuriy; Griebler, Dalvan: PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams. Inproceedings. In: Proceedings of the 19th ACM International Conference on Distributed and Event-Based Systems, pp. 51-61, ACM, Gothenburg, Sweden, 2025.
@inproceedings{MENCAGLI:DEBS:25,
  title = {PPOIJ: Shared-Nothing Parallel Patterns for Efficient Online Interval Joins over Data Streams},
  author = {Gabriele Mencagli and Yuriy Rymarchuk and Dalvan Griebler},
  url = {https://doi.org/10.1145/3701717.3730542},
  doi = {10.1145/3701717.3730542},
  year = {2025},
  date = {2025-01-01},
  booktitle = {Proceedings of the 19th ACM International Conference on Distributed and Event-Based Systems},
  pages = {51-61},
  publisher = {ACM},
  address = {Gothenburg, Sweden},
  series = {DEBS'25},
  abstract = {Joining data streams is a fundamental stateful operator in stream processing. It involves evaluating join pairs of tuples from two streams that meet specific user-defined criteria. This operator is typically time-consuming and often represents the major bottleneck in several real-world continuous queries. This paper focuses on a specific class of join operator, named online interval join, where we seek join pairs of tuples that occur within a certain time frame of each other. Our contribution is to propose different parallel patterns for implementing this join operator efficiently in the presence of watermarked data streams and skewed key distributions. The proposed patterns comply with the shared-nothing parallelization paradigm, a popular paradigm adopted by most of the existing Stream Processing Engines. Among the proposed patterns, we introduce one based on hybrid parallelism, which is particularly effective in handling various scenarios in terms of key distribution, number of keys, batching, and parallelism as demonstrated in our experimental analysis.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Araujo, Gabriell; Rockenbach, Dinei A.; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz G.: A C++ annotation-based domain-specific language for expressing stream and data parallelism supporting CPU and GPU. Journal Article. In: Journal of Computer Languages, vol. 85, pp. 101369, 2025.
@article{ARAUJO:COLA:25,
  title = {A C++ annotation-based domain-specific language for expressing stream and data parallelism supporting CPU and GPU},
  author = {Gabriell Araujo and Dinei A. Rockenbach and Júnior Löff and Dalvan Griebler and Luiz G. Fernandes},
  url = {https://doi.org/10.1016/j.cola.2025.101369},
  doi = {10.1016/j.cola.2025.101369},
  year = {2025},
  date = {2025-01-01},
  urldate = {2025-01-01},
  journal = {Journal of Computer Languages},
  volume = {85},
  pages = {101369},
  publisher = {Elsevier},
  abstract = {Graphics processing units (GPUs) and central processing units (CPUs) provide massive parallel computing in our modern computer systems (e.g., servers, desktops, smartphones, and laptops), and efficiently utilizing their processing power requires expertise in parallel programming. Mainly, domain-specific languages (DSLs) address this challenge by improving productivity and abstractions. SPar is a high-level DSL that promotes parallel programming abstractions for stream and data parallelism using C++ attribute annotations for serial code. Unlike existing solutions, SPar eliminates the need to manually implement low-level mechanisms to leverage stream and data parallelism on heterogeneous systems. In this article, we design an extended version of the language and compiler algorithm for GPU code generation. We newly offer a single parallel programming model targeting CPUs and GPUs to exploit stream and data parallelism. The experiments indicated performance improvement compared with previous versions of SPar and achieved performance comparable to handwritten code using lower-level programming abstractions in specific scenarios.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Leonarczyk, Ricardo; Mencagli, Gabriele; Griebler, Dalvan: Self-Adaptive Micro-Batching for Low-Latency GPU-Accelerated Stream Processing. Journal Article. In: International Journal of Parallel Programming, vol. 53, no. 2, pp. 14, 2025.
@article{LEONARCZYK:IJPP:25,
  title = {Self-Adaptive Micro-Batching for Low-Latency GPU-Accelerated Stream Processing},
  author = {Ricardo Leonarczyk and Gabriele Mencagli and Dalvan Griebler},
  url = {https://doi.org/10.1007/s10766-025-00793-4},
  doi = {10.1007/s10766-025-00793-4},
  year = {2025},
  date = {2025-01-01},
  urldate = {2025-01-01},
  journal = {International Journal of Parallel Programming},
  volume = {53},
  number = {2},
  pages = {14},
  publisher = {Springer},
  abstract = {Stream processing is a computing paradigm enabling the continuous processing of unbounded data streams. Some classes of stream processing applications can greatly benefit from the parallel processing power and affordability offered by GPUs. However, efficient GPU utilization with stream processing applications often requires micro-batching techniques, i.e., the continuous processing of data batches to expose data parallelism opportunities and amortize host-device data transfer overheads. Micro-batching further introduces the challenge of finding suitable micro-batch sizes to maintain low-latency processing under highly dynamic workloads. The research field of self-adaptive software provides different techniques to address such a challenge. Our goal is to assess the performance of six self-adaptive algorithms in meeting latency requirements through micro-batch size adaptation. The algorithms are applied to a GPU-accelerated stream processing benchmark with a highly dynamic workload. Four of the six algorithms have already been evaluated using a smaller workload with the same application. We propose two new algorithms to address the shortcomings detected in the former four. The results demonstrate that a highly dynamic workload is challenging for the evaluated algorithms, as they could not meet the most strict latency requirements for more than 38.5% of the stream data items. Overall, all algorithms performed similarly in meeting the latency requirements. However, one of our proposed algorithms met the requirements for 4% more data items than the best of the previously studied algorithms, demonstrating more effectiveness in highly variable workloads. This effectiveness is particularly evident in segments of the workload with abrupt transitions between low- and high-latency regions, where our proposed algorithms met the requirements for 79% of the data items in those segments, compared to 33% for the best of the earlier algorithms.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2024
	Hoffmann, Renato B.; Griebler, Dalvan; Righi, Rodrigo Rosa; Fernandes, Luiz G.: Benchmarking parallel programming for single-board computers. Journal Article. In: Future Generation Computer Systems, vol. 161, pp. 119-134, 2024.
@article{HOFFMANN:single-board-computers:FGCS:24,
  title = {Benchmarking parallel programming for single-board computers},
  author = {Renato B. Hoffmann and Dalvan Griebler and Rodrigo Rosa Righi and Luiz G. Fernandes},
  url = {https://doi.org/10.1016/j.future.2024.07.003},
  doi = {10.1016/j.future.2024.07.003},
  year = {2024},
  date = {2024-12-01},
  urldate = {2024-12-01},
  journal = {Future Generation Computer Systems},
  volume = {161},
  pages = {119-134},
  publisher = {Elsevier},
  abstract = {Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. In this way, parallel computing plays a crucial role in enabling the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no solution presents a combined discussion of these three aspects. To tackle this problem, this article aims to provide a benchmark of the best-suited parallelism PPIs given a set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Therefore, our contributions are the methodology to benchmark parallelism on SBCs and the characterization of the best-performing parallelism PPIs and strategies for given situations. We are confident that parallel computing will be mainstream to process edge and fog computing; thus, our solution provides the first insights regarding what kind of application and parallel programming interface is the most suited for a particular SBC hardware.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Vogel, Adriano; Danelutto, Marco; Torquati, Massimo; Griebler, Dalvan; Fernandes, Luiz Gustavo Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores Journal Article doi In: The Journal of Supercomputing, vol. 80, no. 15, pp. 22213-22244, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @article{VOGEL:Supercomputing:24, title = {Enhancing self-adaptation for efficient decision-making at run-time in streaming applications on multicores}, author = {Adriano Vogel and Marco Danelutto and Massimo Torquati and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-024-06191-w}, doi = {10.1007/s11227-024-06191-w}, year = {2024}, date = {2024-10-01}, urldate = {2024-10-01}, journal = {The Journal of Supercomputing}, volume = {80}, number = {15}, pages = {22213-22244}, publisher = {Springer}, abstract = {Parallel computing is very important to accelerate the performance of computing applications. Moreover, parallel applications are expected to continue executing in more dynamic environments and react to changing conditions. In this context, applying self-adaptation is a potential solution to achieve a higher level of autonomic abstractions and runtime responsiveness. In our research, we aim to explore and assess the possible abstractions attainable through the transparent management of parallel executions by self-adaptation. Our primary objectives are to expand the adaptation space to better reflect real-world applications and assess the potential for self-adaptation to enhance efficiency. We provide the following scientific contributions: (I) A conceptual framework to improve the designing of self-adaptation; (II) A new decision-making strategy for applications with multiple parallel stages; (III) A comprehensive evaluation of the proposed decision-making strategy compared to the state-of-the-art. 
The results demonstrate that the proposed conceptual framework can help design and implement self-adaptive strategies that are more modular and reusable. The proposed decision-making strategy provides significant gains in accuracy compared to the state-of-the-art, increasing the parallel applications' performance and efficiency.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Guder, Larissa; Aires, João Paulo; Griebler, Dalvan Dimensional Speech Emotion Recognition: a Bimodal Approach Inproceedings doi In: Anais Estendidos do XXX Simpósio Brasileiro de Sistemas Multimídia e Web, pp. 5-6, SBC, Juiz de Fora, Brasil, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{GUDER:WEBMEDIA:24, title = {Dimensional Speech Emotion Recognition: a Bimodal Approach}, author = {Larissa Guder and João Paulo Aires and Dalvan Griebler}, url = {https://doi.org/10.5753/webmedia_estendido.2024.244402}, doi = {10.5753/webmedia_estendido.2024.244402}, year = {2024}, date = {2024-10-01}, booktitle = {Anais Estendidos do XXX Simpósio Brasileiro de Sistemas Multimídia e Web}, pages = {5-6}, publisher = {SBC}, address = {Juiz de Fora, Brasil}, abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance, which can represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios, where processing the input in a short time is necessary. Considering these aspects, this work provides the first step towards creating a bimodal approach for Dimensional Speech Emotion Recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech-emotion recognition. 
We evaluate different methods for creating audio and text representations, as well as automatic speech recognition techniques. Our best results achieve 0.5915 of CCC for arousal, 0.4165 for valence, and 0.5899 for dominance in the IEMOCAP dataset.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Faé, Leonardo; Griebler, Dalvan An internal domain-specific language for expressing linear pipelines: a proof-of-concept with MPI in Rust Inproceedings doi In: Anais do XXVIII Simpósio Brasileiro de Linguagens de Programação, pp. 81-90, SBC, Curitiba/PR, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{FAE:SBLP:24, title = {An internal domain-specific language for expressing linear pipelines: a proof-of-concept with MPI in Rust}, author = {Leonardo Faé and Dalvan Griebler}, url = {https://doi.org/10.5753/sblp.2024.3691}, doi = {10.5753/sblp.2024.3691}, year = {2024}, date = {2024-09-01}, booktitle = {Anais do XXVIII Simpósio Brasileiro de Linguagens de Programação}, pages = {81-90}, publisher = {SBC}, address = {Curitiba/PR}, series = {SBLP'24}, abstract = {Parallel computation is necessary in order to process massive volumes of data in a timely manner. There are many parallel programming interfaces and environments, each with their own idiosyncrasies. This, alongside non-deterministic errors, make parallel programs notoriously challenging to write. Great effort has been put forth to make parallel programming for several environments easier. In this work, we propose a DSL for Rust, using the language’s source-to-source transformation facilities, that allows for automatic code generation for distributed environments that support the Message Passing Interface (MPI). Our DSL simplifies MPI’s quirks, allowing the programmer to focus almost exclusively on the computation at hand. Performance experiments show nearly or no runtime difference between our abstraction and manually written MPI code while resulting in less than half the lines of code. More elaborate code complexity metrics (Halstead) estimate from 4.5 to 14.7 times lower effort for expressing parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo; Binder, Walter MPR: An MPI Framework for Distributed Self-adaptive Stream Processing Inproceedings doi In: Euro-Par 2024: Parallel Processing, pp. 400-414, Springer, Madrid, Spain, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{LOFF:Euro-Par:24, title = {MPR: An MPI Framework for Distributed Self-adaptive Stream Processing}, author = {Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes and Walter Binder}, url = {https://doi.org/10.1007/978-3-031-69583-4_28}, doi = {10.1007/978-3-031-69583-4_28}, year = {2024}, date = {2024-08-01}, booktitle = {Euro-Par 2024: Parallel Processing}, pages = {400-414}, publisher = {Springer}, address = {Madrid, Spain}, series = {Euro-Par'24}, abstract = {Stream processing systems must often cope with workloads varying in content, format, size, and input rate. The high variability and unpredictability make statically fine-tuning them very challenging. Our work addresses this limitation by providing a new framework and runtime system to simplify implementing and assessing new self-adaptive algorithms and optimizations. We implement a prototype on top of MPI called MPR and show its functionality. We focus on horizontal scaling by supporting the addition and removal of processes during execution time. Experiments reveal that MPR can achieve performance similar to that of a handwritten static MPI application. We also assess MPR's adaptation capabilities, showing that it can readily re-configure itself, with the help of a self-adaptive algorithm, in response to workload variations.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Gomes, Carlos Falcao Azevedo; Araujo, Adriel Silva; Ahmad, Sunna Imtiaz; Magnaguagno, Mauricio Cecilio; Teixeira, Vinicius Crisosthemos; Rajapuri, Anushri Singh; Roederer, Quinn; Griebler, Dalvan; Dutra, Vinicius; Turkkahraman, Hakan; Pinho, Marcio Sarroglia Multiview Machine Learning Classification of Tooth Extraction in Orthodontics Using Intraoral Scans Inproceedings doi In: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1977-1982, IEEE, Osaka, Japan, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{GOMES:COMPSAC:24, title = {Multiview Machine Learning Classification of Tooth Extraction in Orthodontics Using Intraoral Scans}, author = {Carlos Falcao Azevedo Gomes and Adriel Silva Araujo and Sunna Imtiaz Ahmad and Mauricio Cecilio Magnaguagno and Vinicius Crisosthemos Teixeira and Anushri Singh Rajapuri and Quinn Roederer and Dalvan Griebler and Vinicius Dutra and Hakan Turkkahraman and Marcio Sarroglia Pinho}, url = {https://doi.org/10.1109/COMPSAC61105.2024.00316}, doi = {10.1109/COMPSAC61105.2024.00316}, year = {2024}, date = {2024-07-01}, booktitle = {2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)}, pages = {1977-1982}, publisher = {IEEE}, address = {Osaka, Japan}, abstract = {Orthodontic treatment planning often involves deciding whether to extract teeth, a critical and irreversible decision. Integrating machine learning (ML) can enhance decision-making. This study proposes using Intraoral Scans (IOS) 3D models to predict extraction/non-extraction binary decisions with ML models. We leverage a multiview approach, using images taken from multiple points of view of the 3D model. The methodology involved a dataset composed of preprocessed IOS from 181 subjects and an experimental procedure that evaluated multiple ML models in their ability to classify subjects using either grayscale pixel intensities or radiomic features. 
The results indicated that a logistic model applied to the radiomic features from the back and frontal views of the 3D models was one of the best model candidates, achieving a test accuracy of 70% and F1 scores of 0.73 and 0.65 for non-extraction and extraction cases, respectively. Overall, these findings indicate that a multiview approach to IOS 3D models can be used to predict extraction/non-extraction decisions. In addition, the results suggest that radiomic features provide useful information in the analysis of IOS data.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Guder, Larissa; Aires, João Paulo; Meneguzzi, Felipe; Griebler, Dalvan Dimensional Speech Emotion Recognition from Bimodal Features Inproceedings doi In: Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde, pp. 579-590, SBC, Goiânia, Brasil, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{GUDER:SBCAS:24, title = {Dimensional Speech Emotion Recognition from Bimodal Features}, author = {Larissa Guder and João Paulo Aires and Felipe Meneguzzi and Dalvan Griebler}, url = {https://doi.org/10.5753/sbcas.2024.2779}, doi = {10.5753/sbcas.2024.2779}, year = {2024}, date = {2024-07-01}, booktitle = {Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde}, pages = {579-590}, publisher = {SBC}, address = {Goiânia, Brasil}, abstract = {Considering the human-machine relationship, affective computing aims to allow computers to recognize or express emotions. Speech Emotion Recognition is a task from affective computing that aims to recognize emotions in an audio utterance. The most common way to predict emotions from the speech is using pre-determined classes in the offline mode. In that way, emotion recognition is restricted to the number of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance to represent emotions with higher granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have tried to improve results on speech emotion recognition to predict emotion dimensions, they do not consider real-world scenarios where processing the input quickly is necessary. Considering these aspects, we take the first step towards creating a bimodal approach for dimensional speech emotion recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech-emotion recognition. 
Our final architecture achieves a Concordance Correlation Coefficient of 0.5915 for arousal, 0.1431 for valence, and 0.5899 for dominance in the IEMOCAP dataset.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Leonarczyk, Ricardo; Griebler, Dalvan; Mencagli, Gabriele; Danelutto, Marco Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing Inproceedings doi In: Euro-Par 2023: Parallel Processing Workshops, pp. 81-92, Springer, Limassol, Cyprus, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @inproceedings{LEONARCZYK:Euro-ParW:23, title = {Evaluation of Adaptive Micro-batching Techniques for GPU-accelerated Stream Processing}, author = {Ricardo Leonarczyk and Dalvan Griebler and Gabriele Mencagli and Marco Danelutto}, url = {https://doi.org/10.1007/978-3-031-50684-0_7}, doi = {10.1007/978-3-031-50684-0_7}, year = {2024}, date = {2024-04-01}, booktitle = {Euro-Par 2023: Parallel Processing Workshops}, pages = {81-92}, publisher = {Springer}, address = {Limassol, Cyprus}, series = {Euro-ParW'23}, abstract = {Stream processing plays a vital role in applications that require continuous, low-latency data processing. Thanks to their extensive parallel processing capabilities and relatively low cost, GPUs are well-suited to scenarios where such applications require substantial computational resources. However, micro-batching becomes essential for efficient GPU computation within stream processing systems. However, finding appropriate batch sizes to maintain an adequate level of service is often challenging, particularly in cases where applications experience fluctuations in input rate and workload. Addressing this challenge requires adjusting the optimal batch size at runtime. This study proposes a methodology for evaluating different self-adaptive micro-batching strategies in a real-world complex streaming application used as a benchmark.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; García, José Daniel; Muñoz, Javier Fernández; Fernandes, Luiz Gustavo Performance and programmability of GrPPI for parallel stream processing on multi-cores Journal Article doi In: The Journal of Supercomputing, vol. 80, no. 9, pp. 12966-13000, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @article{GARCIA:JS:24, title = {Performance and programmability of GrPPI for parallel stream processing on multi-cores}, author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1007/s11227-024-05934-z}, doi = {10.1007/s11227-024-05934-z}, year = {2024}, date = {2024-02-01}, urldate = {2024-02-01}, journal = {The Journal of Supercomputing}, volume = {80}, number = {9}, pages = {12966-13000}, publisher = {Springer}, abstract = {GrPPI library aims to simplify the burdening task of parallel programming. It provides a unified, abstract, and generic layer while promising minimal overhead on performance. Although it supports stream parallelism, GrPPI lacks an evaluation regarding representative performance metrics for this domain, such as throughput and latency. This work evaluates GrPPI focused on parallel stream processing. We compare the throughput and latency performance, memory usage, and programmability of GrPPI against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks and benchmarks with handwritten parallel code using the same backends supported by GrPPI. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is often competitive with handwritten parallel code, the infeasibility of fine-tuning GrPPI is a crucial drawback for emerging applications. 
Despite this, programmability experiments estimate that GrPPI can potentially reduce the development time of parallel applications by about three times.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Mencagli, Gabriele; Torquati, Massimo; Griebler, Dalvan; Fais, Alessandra; Danelutto, Marco General-purpose data stream processing on heterogeneous architectures with WindFlow Journal Article doi In: Journal of Parallel and Distributed Computing, vol. 184, pp. 104782, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @article{MENCAGLI:JPDC:24, title = {General-purpose data stream processing on heterogeneous architectures with WindFlow}, author = {Gabriele Mencagli and Massimo Torquati and Dalvan Griebler and Alessandra Fais and Marco Danelutto}, url = {https://doi.org/10.1016/j.jpdc.2023.104782}, doi = {10.1016/j.jpdc.2023.104782}, year = {2024}, date = {2024-02-01}, urldate = {2024-02-01}, journal = {Journal of Parallel and Distributed Computing}, volume = {184}, pages = {104782}, publisher = {Elsevier}, abstract = {Many emerging applications analyze data streams by running graphs of communicating tasks called operators. To develop and deploy such applications, Stream Processing Systems (SPSs) like Apache Storm and Flink have been made available to researchers and practitioners. They exhibit imperative or declarative programming interfaces to develop operators running arbitrary algorithms working on structured or unstructured data streams. In this context, the interest in leveraging hardware acceleration with GPUs has become more pronounced in high-throughput use cases. Unfortunately, GPU acceleration has been studied for relational operators working on structured streams only, while non-relational operators have often been overlooked. This paper presents WindFlow, a library supporting the seamless GPU offloading of general partitioned-stateful operators, extending the range of operators that benefit from hardware acceleration. 
Its design provides high throughput still exposing a high-level API to users compared with the raw utilization of GPUs in Apache Flink.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Fischer, Gabriel Souto; Ramos, Gabriel Oliveira; Costa, Cristiano André; Alberti, Antonio Marcos; Griebler, Dalvan; Singh, Dhananjay; Righi, Rodrigo Rosa Multi-Hospital Management: Combining Vital Signs IoT Data and the Elasticity Technique to Support Healthcare 4.0 Journal Article doi In: IoT, vol. 5, no. 2, pp. 381-408, 2024. (Abstract \| Links \| BibTeX \| Tags: ) @article{FISCHER:IoT:24, title = {Multi-Hospital Management: Combining Vital Signs IoT Data and the Elasticity Technique to Support Healthcare 4.0}, author = {Gabriel Souto Fischer and Gabriel Oliveira Ramos and Cristiano André Costa and Antonio Marcos Alberti and Dalvan Griebler and Dhananjay Singh and Rodrigo Rosa Righi}, url = {https://doi.org/10.3390/iot5020019}, doi = {10.3390/iot5020019}, year = {2024}, date = {2024-01-01}, urldate = {2024-01-01}, journal = {IoT}, volume = {5}, number = {2}, pages = {381-408}, publisher = {MDPI}, abstract = {Smart cities can improve the quality of life of citizens by optimizing the utilization of resources. In an IoT-connected environment, people's health can be constantly monitored, which can help identify medical problems before they become serious. However, overcrowded hospitals can lead to long waiting times for patients to receive treatment. The literature presents alternatives to address this problem by adjusting care capacity to demand. However, there is still a need for a solution that can adjust human resources in multiple healthcare settings, which is the reality of cities. This work introduces HealCity, a smart-city-focused model that can monitor patients’ use of healthcare settings and adapt the allocation of health professionals to meet their needs. HealCity uses vital signs (IoT) data in prediction techniques to anticipate when the demand for a given environment will exceed its capacity and suggests actions to allocate health professionals accordingly. 
Additionally, we introduce the concept of multilevel proactive human resources elasticity in smart cities, thus managing human resources at different levels of a smart city. An algorithm is also devised to automatically manage and identify the appropriate hospital for a possible future patient. Furthermore, some IoT deployment considerations are presented based on a hardware implementation for the proposed model. HealCity was evaluated with four hospital settings and obtained promising results: Compared to hospitals with rigid professional allocations, it reduced waiting time for care by up to 87.62%.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
2023
	Hoffmann, Renato Barreto; Faé, Leonardo; Manssour, Isabel; Griebler, Dalvan Analyzing C++ Stream Parallelism in Shared-Memory when Porting to Flink and Storm Inproceedings In: International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 1-8, IEEE, Porto Alegre, Brazil, 2023.
@inproceedings{HOFFMANN:SBAC-PADW:23,
  title = {Analyzing C++ Stream Parallelism in Shared-Memory when Porting to Flink and Storm},
  author = {Renato Barreto Hoffmann and Leonardo Faé and Isabel Manssour and Dalvan Griebler},
  url = {https://doi.org/10.1109/SBAC-PADW60351.2023.00017},
  doi = {10.1109/SBAC-PADW60351.2023.00017},
  year = {2023},
  date = {2023-10-01},
  booktitle = {International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)},
  pages = {1-8},
  publisher = {IEEE},
  address = {Porto Alegre, Brazil},
  series = {SBAC-PADW'23},
  abstract = {Stream processing plays a crucial role in various information-oriented digital systems. Two popular frameworks for real-time data processing, Flink and Storm, provide solutions for effective parallel stream processing in Java. An option to leverage Java's mature ecosystem for distributed stream processing involves porting legacy C++ applications to Java. However, this raises considerations on the adequacy of the equivalent Java mechanisms and potential degradation in throughput. Therefore, our objective is to evaluate programmability and performance when converting stream processing applications from C++ to Java while also exploring the parallelization capabilities offered by Flink and Storm. Furthermore, we aim to assess the throughput of Flink and Storm on shared-memory manycore machines, a hardware architecture commonly found in cloud environments. To achieve this, we conduct experiments involving four different stream processing applications. We highlight challenges encountered when porting C++ to Java and working with Flink and Storm. Furthermore, we discuss throughput, latency, CPU, and memory usage results.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo Extending the Planning Poker Method to Estimate the Development Effort of Parallel Applications Inproceedings In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 181-192, SBC, Porto Alegre, Brasil, 2023.
@inproceedings{ANDRADE:WSCAD:23,
  title = {Extending the Planning Poker Method to Estimate the Development Effort of Parallel Applications},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/wscad.2023.235925},
  doi = {10.5753/wscad.2023.235925},
  year = {2023},
  date = {2023-10-01},
  booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
  pages = {181-192},
  publisher = {SBC},
  address = {Porto Alegre, Brasil},
  abstract = {Since different Parallel Programming Interfaces (PPIs) are available to programmers, evaluating them to identify the most suitable PPI also became necessary. Recently, in addition to the performance of PPIs, developers’ productivity has also been evaluated by researchers in parallel processing. Some researchers conduct empirical studies involving people for productivity evaluation, which is time-consuming. Aiming to propose a less costly method for evaluating the development effort of parallel applications, we proposed modifying the Planning Poker method in this paper. We consider a representative set of parallel stream processing applications to evaluate the proposed modification. Our results showed that the proposed method required less effort for practical use than the controlled experiments with students.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Alf, Lucas; Hoffmann, Renato Barreto; Müller, Caetano; Griebler, Dalvan Análise da Execução de Algoritmos de Aprendizado de Máquina em Dispositivos Embarcados Inproceedings In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 61-72, SBC, Porto Alegre, Brasil, 2023.
@inproceedings{ALF:WSCAD:23,
  title = {Análise da Execução de Algoritmos de Aprendizado de Máquina em Dispositivos Embarcados},
  author = {Lucas Alf and Renato Barreto Hoffmann and Caetano Müller and Dalvan Griebler},
  url = {https://doi.org/10.5753/wscad.2023.235915},
  doi = {10.5753/wscad.2023.235915},
  year = {2023},
  date = {2023-10-01},
  booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
  pages = {61-72},
  publisher = {SBC},
  address = {Porto Alegre, Brasil},
  abstract = {Advances in IoT motivate the use of machine learning algorithms on embedded devices. However, these algorithms demand a considerable amount of computational resources. The goal of this work was to analyze machine learning algorithms on embedded devices using CPU and GPU parallelism in order to understand which hardware and software characteristics perform best with respect to energy consumption, inferences per second, and accuracy. Three Convolutional Neural Network models were evaluated, as well as traditional algorithms and neural networks for classification and regression. The experiments showed that PyTorch achieved the best performance on the CNN models and on the classification and regression neural networks using GPU, while Keras performed better when using only the CPU.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Bianchessi, Arthur S.; Mallmann, Leonardo; Hoffmann, Renato Barreto; Griebler, Dalvan Conversão do NAS Parallel Benchmarks para C++ Standard Inproceedings In: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 313-324, SBC, Porto Alegre, Brasil, 2023.
@inproceedings{BIANCHESSI:WSCAD:23,
  title = {Conversão do NAS Parallel Benchmarks para C++ Standard},
  author = {Arthur S. Bianchessi and Leonardo Mallmann and Renato Barreto Hoffmann and Dalvan Griebler},
  url = {https://doi.org/10.5753/wscad.2023.235913},
  doi = {10.5753/wscad.2023.235913},
  year = {2023},
  date = {2023-10-01},
  booktitle = {Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
  pages = {313-324},
  publisher = {SBC},
  address = {Porto Alegre, Brasil},
  abstract = {The C++ language received new parallelism abstractions with the definition of execution policies for the standard library algorithms. However, the suitability and performance of this alternative still need to be studied in comparison with other well-established alternatives. Therefore, the goal of this work was to explore the wide range of C++ standard library features to evaluate applicability and performance using five NPB kernels. Through experiments in a multithreaded environment, it was found that incorporating standard library data structures, as well as the abstraction created for multidimensional access, has no notable impact on execution time. The algorithms with parallel execution policies, however, showed a statistically significant performance loss.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Faé, Leonardo; Hoffmann, Renato Barreto; Griebler, Dalvan Source-to-Source Code Transformation on Rust for High-Level Stream Parallelism Inproceedings In: XXVII Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Campo Grande, Brazil, 2023.
@inproceedings{FAE:SBLP:23,
  title = {Source-to-Source Code Transformation on Rust for High-Level Stream Parallelism},
  author = {Leonardo Faé and Renato Barreto Hoffmann and Dalvan Griebler},
  url = {https://doi.org/10.1145/3624309.3624320},
  doi = {10.1145/3624309.3624320},
  year = {2023},
  date = {2023-09-01},
  booktitle = {XXVII Brazilian Symposium on Programming Languages (SBLP)},
  pages = {41-49},
  publisher = {ACM},
  address = {Campo Grande, Brazil},
  series = {SBLP'23},
  abstract = {Utilizing parallel systems to their full potential can be challenging for general-purpose developers. A solution to this problem is to create high-level abstractions using Domain-Specific Languages (DSL). We create a stream-processing DSL for Rust, a growing programming language focusing on performance and safety. To that end, we explore Rust’s macros as a high-level abstraction tool to support an existing DSL language named SPar and perform source-to-source code transformations in the abstract syntax tree. We aim to assess the Rust source-to-source code transformations toolset and its implications. We highlight that Rust macros are powerful tools for performing source-to-source code transformations for abstracting structured stream processing. In addition, execution time and programmability results are comparable to other solutions.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; García, José Daniel; Muñoz, Javier Fernández; Fernandes, Luiz Gustavo A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores Inproceedings In: 31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 164-168, IEEE, Naples, Italy, 2023.
@inproceedings{GARCIA:PDP:23,
  title = {A Latency, Throughput, and Programmability Perspective of GrPPI for Streaming on Multi-cores},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and José Daniel García and Javier Fernández Muñoz and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/PDP59025.2023.00033},
  doi = {10.1109/PDP59025.2023.00033},
  year = {2023},
  date = {2023-03-01},
  booktitle = {31st Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {164-168},
  publisher = {IEEE},
  address = {Naples, Italy},
  series = {PDP'23},
  abstract = {Several solutions aim to simplify the burdening task of parallel programming. The GrPPI library is one of them. It allows users to implement parallel code for multiple backends through a unified, abstract, and generic layer while promising minimal overhead on performance. An outspread evaluation of GrPPI regarding stream parallelism with representative metrics for this domain, such as throughput and latency, was not yet done. In this work, we evaluate GrPPI focused on stream processing. We evaluate performance, memory usage, and programming effort and compare them against handwritten parallel code. For this, we use the benchmarking framework SPBench to build custom GrPPI benchmarks. The basis of the benchmarks is real applications, such as Lane Detection, Bzip2, Face Recognizer, and Ferret. Experiments show that while performance is competitive with handwritten code in some cases, in other cases, the infeasibility of fine-tuning GrPPI is a crucial drawback. Despite this, programmability experiments estimate that GrPPI has the potential to reduce by about three times the development time of parallel applications.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo A parallel programming assessment for stream processing applications on multi-core systems Journal Article In: Computer Standards & Interfaces, vol. 84, pp. 103691, 2023.
@article{ANDRADE:CSI:2023,
  title = {A parallel programming assessment for stream processing applications on multi-core systems},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1016/j.csi.2022.103691},
  doi = {10.1016/j.csi.2022.103691},
  year = {2023},
  date = {2023-03-01},
  journal = {Computer Standards & Interfaces},
  volume = {84},
  pages = {103691},
  publisher = {Elsevier},
  abstract = {Multi-core systems are any computing device nowadays and stream processing applications are becoming recurrent workloads, demanding parallelism to achieve the desired quality of service. As soon as data, tasks, or requests arrive, they must be computed, analyzed, or processed. Since building such applications is not a trivial task, the software industry must adopt parallel APIs (Application Programming Interfaces) that simplify the exploitation of parallelism in hardware for accelerating time-to-market. In the last years, research efforts in academia and industry provided a set of parallel APIs, increasing productivity to software developers. However, a few studies are seeking to prove the usability of these interfaces. In this work, we aim to present a parallel programming assessment regarding the usability of parallel API for expressing parallelism on the stream processing application domain and multi-core systems. To this end, we conducted an empirical study with beginners in parallel application development. The study covered three parallel APIs, reporting several quantitative and qualitative indicators involving developers. Our contribution also comprises a parallel programming assessment methodology, which can be replicated in future assessments. This study revealed important insights such as recurrent compile-time and programming logic errors performed by beginners in parallel programming, as well as the programming effort, challenges, and learning curve. Moreover, we collected the participants’ opinions about their experience in this study to understand deeply the results achieved.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Araujo, Gabriell; Griebler, Dalvan; Rockenbach, Dinei A.; Danelutto, Marco; Fernandes, Luiz Gustavo NAS Parallel Benchmarks with CUDA and Beyond Journal Article In: Software: Practice and Experience, vol. 53, no. 1, pp. 53-80, 2023.
@article{ARAUJO:SPE:23,
  title = {NAS Parallel Benchmarks with CUDA and Beyond},
  author = {Gabriell Araujo and Dalvan Griebler and Dinei A. Rockenbach and Marco Danelutto and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1002/spe.3056},
  doi = {10.1002/spe.3056},
  year = {2023},
  date = {2023-01-01},
  urldate = {2023-01-01},
  journal = {Software: Practice and Experience},
  volume = {53},
  number = {1},
  pages = {53-80},
  publisher = {Wiley},
  abstract = {NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo Micro-batch and data frequency for stream processing on multi-cores Journal Article In: The Journal of Supercomputing, vol. 79, no. 8, pp. 9206-9244, 2023.
@article{GARCIA:JS:23,
  title = {Micro-batch and data frequency for stream processing on multi-cores},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s11227-022-05024-y},
  doi = {10.1007/s11227-022-05024-y},
  year = {2023},
  date = {2023-01-01},
  journal = {The Journal of Supercomputing},
  volume = {79},
  number = {8},
  pages = {9206-9244},
  publisher = {Springer},
  abstract = {Latency or throughput is often critical performance metrics in stream processing. Applications’ performance can fluctuate depending on the input stream. This unpredictability is due to the variety in data arrival frequency and size, complexity, and other factors. Researchers are constantly investigating new ways to mitigate the impact of these variations on performance with self-adaptive techniques involving elasticity or micro-batching. However, there is a lack of benchmarks capable of creating test scenarios to further evaluate these techniques. This work extends and improves the SPBench benchmarking framework to support dynamic micro-batching and data stream frequency management. We also propose a set of algorithms that generates the most commonly used frequency patterns for benchmarking stream processing in related work. It allows the creation of a wide variety of test scenarios. To validate our solution, we use SPBench to create custom benchmarks and evaluate the impact of micro-batching and data stream frequency on the performance of Intel TBB and FastFlow. These are two libraries that leverage stream parallelism for multi-core architectures. Our results demonstrated that our test cases did not benefit from micro-batches on multi-cores. For different data stream frequency configurations, TBB ensured the lowest latency, while FastFlow assured higher throughput in shorter pipelines.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo SPBench: a framework for creating benchmarks of stream processing applications Journal Article In: Computing, vol. 105, no. 5, pp. 1077-1099, 2023.
@article{GARCIA:Computing:23,
  title = {SPBench: a framework for creating benchmarks of stream processing applications},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s00607-021-01025-6},
  doi = {10.1007/s00607-021-01025-6},
  year = {2023},
  date = {2023-01-01},
  urldate = {2023-01-01},
  journal = {Computing},
  volume = {105},
  number = {5},
  pages = {1077-1099},
  publisher = {Springer},
  abstract = {In a fast-changing data-driven world, real-time data processing systems are becoming ubiquitous in everyday applications. The increasing data we produce, such as audio, video, image, and, text are demanding quickly and efficiently computation. Stream Parallelism allows accelerating this computation for real-time processing. But it is still a challenging task and most reserved for experts. In this paper, we present SPBench, a framework for benchmarking stream processing applications. It aims to support users with a set of real-world stream processing applications, which are made accessible through an Application Programming Interface (API) and executable via Command Line Interface (CLI) to create custom benchmarks. We tested SPBench by implementing parallel benchmarks with Intel Threading Building Blocks (TBB), FastFlow, and SPar. This evaluation provided useful insights and revealed the feasibility of the proposed framework in terms of usage, customization, and performance analysis. SPBench demonstrated to be a high-level, reusable, extensible, and easy of use abstraction to build parallel stream processing benchmarks on multi-core architectures.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2022
	Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo. Combining stream with data parallelism abstractions for multi-cores. Journal Article. In: Journal of Computer Languages, vol. 73, pp. 101160, 2022.
@article{LOFF:COLA:22,
  title = {Combining stream with data parallelism abstractions for multi-cores},
  author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1016/j.cola.2022.101160},
  doi = {10.1016/j.cola.2022.101160},
  year = {2022},
  date = {2022-12-01},
  urldate = {2022-12-01},
  journal = {Journal of Computer Languages},
  volume = {73},
  pages = {101160},
  publisher = {Elsevier},
  abstract = {Stream processing applications have seen an increasing demand with the raised availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that require to be processed timely. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. In this work, we introduce improvements to stream processing applications by exploiting fine-grained data parallelism (via Map and MapReduce) inside coarse-grained stream parallelism stages. The improvements are including techniques for identifying data parallelism in sequential codes, a new language, semantic analysis, and a set of definition and transformation rules to perform source-to-source parallel code generation. Moreover, we investigate the feasibility of employing higher-level programming abstractions to support the proposed optimizations. For that, we elect SPar programming model as a use case, and extend it by adding two new attributes to its language and implementing our optimizations as a new algorithm in the SPar compiler. We conduct a set of experiments in representative stream processing and data-parallel applications. The results showed that our new compiler algorithm is efficient and that performance improved by up to 108.4x in data-parallel applications. Furthermore, experiments evaluating stream processing applications towards the composition of stream and data parallelism revealed new insights. The results showed that such composition may improve latencies by up to an order of magnitude. Also, it enables programmers to exploit different degrees of stream and data parallelism to accomplish a balance between throughput and latency according to their necessity.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Ernstsson, August; Griebler, Dalvan; Kessler, Christoph. Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems. Journal Article. In: International Journal of Parallel Programming, vol. 51, no. 5, pp. 61-82, 2022.
@article{Ernstsson:IJPP:22,
  title = {Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems},
  author = {August Ernstsson and Dalvan Griebler and Christoph Kessler},
  url = {https://doi.org/10.1007/s10766-022-00746-1},
  doi = {10.1007/s10766-022-00746-1},
  year = {2022},
  date = {2022-12-01},
  urldate = {2022-12-01},
  journal = {International Journal of Parallel Programming},
  volume = {51},
  number = {5},
  pages = {61-82},
  publisher = {Springer},
  abstract = {We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of a part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80% and in particular for CPU systems, SkePU can outperform hand-written code.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Rockenbach, Dinei A.; Löff, Júnior; Araujo, Gabriell; Griebler, Dalvan; Fernandes, Luiz G. High-Level Stream and Data Parallelism in C++ for GPUs. Inproceedings. In: XXVI Brazilian Symposium on Programming Languages (SBLP), pp. 41-49, ACM, Uberlândia, Brazil, 2022.
@inproceedings{ROCKENBACH:SBLP:22,
  title = {High-Level Stream and Data Parallelism in C++ for GPUs},
  author = {Dinei A. Rockenbach and Júnior Löff and Gabriell Araujo and Dalvan Griebler and Luiz G. Fernandes},
  url = {https://doi.org/10.1145/3561320.3561327},
  doi = {10.1145/3561320.3561327},
  year = {2022},
  date = {2022-10-01},
  booktitle = {XXVI Brazilian Symposium on Programming Languages (SBLP)},
  pages = {41-49},
  publisher = {ACM},
  address = {Uberlândia, Brazil},
  series = {SBLP'22},
  abstract = {GPUs are massively parallel processors that allow solving problems that are not viable to traditional processors like CPUs. However, implementing applications for GPUs is challenging to programmers as it requires parallel programming to efficiently exploit the GPU resources. In this sense, parallel programming abstractions, notably domain-specific languages, are fundamental for improving programmability. SPar is a high-level Domain-Specific Language (DSL) that allows expressing stream and data parallelism in the serial code through annotations using C++ attributes. This work elaborates on a methodology and tool for GPU code generation by introducing new attributes to SPar language and transformation rules to SPar compiler. These new contributions, besides the gains in simplicity and code reduction compared to CUDA and OpenCL, enabled SPar achieve of higher throughput when exploring combined CPU and GPU parallelism, and when using batching.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo. Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas. Inproceedings. In: Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD), pp. 276-287, SBC, Florianópolis, Brasil, 2022.
@inproceedings{ANDRADE:WSCAD:22,
  title = {Opinião de Brasileiros Sobre a Produtividade no Desenvolvimento de Aplicações Paralelas},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.5753/wscad.2022.226392},
  doi = {10.5753/wscad.2022.226392},
  year = {2022},
  date = {2022-10-01},
  booktitle = {Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)},
  pages = {276-287},
  publisher = {SBC},
  address = {Florianópolis, Brasil},
  abstract = {With the popularization of parallel architectures, several programming interfaces have emerged to make these architectures easier to exploit and to increase developer productivity. However, developing parallel applications is still a complex task for developers with little experience. In this work, we conducted a survey to find out the opinion of parallel application developers about the factors that hinder productivity. Our results showed that developer experience is one of the main reasons for low productivity. In addition, the results indicated ways to work around this problem, such as improving and encouraging the teaching of parallel programming in undergraduate courses.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Kessler, Christoph; Ernstsson, August; Fernandes, Luiz Gustavo. Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing. Inproceedings. In: 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022), pp. 229-232, IEEE, Gran Canaria, Spain, 2022.
@inproceedings{ANDRADE:SEAA:22,
  title = {Analyzing Programming Effort Model Accuracy of High-Level Parallel Programs for Stream Processing},
  author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Christoph Kessler and August Ernstsson and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/SEAA56994.2022.00043},
  doi = {10.1109/SEAA56994.2022.00043},
  year = {2022},
  date = {2022-09-01},
  booktitle = {48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2022)},
  pages = {229-232},
  publisher = {IEEE},
  address = {Gran Canaria, Spain},
  series = {SEAA'22},
  abstract = {Over the years, several Parallel Programming Models (PPMs) have supported the abstraction of programming complexity for parallel computer systems. However, few studies aim to evaluate the productivity reached by such abstractions since this is a complex task that involves human beings. There are several studies to develop predictive methods to estimate the effort required to program applications in software engineering. In order to evaluate the reliability of such metrics, it is necessary to assess the accuracy in different programming domains. In this work, we used the data of an experiment conducted with beginners in parallel programming to determine the effort required for implementing stream parallelism using FastFlow, SPar, and TBB. Our results show that some traditional software effort estimation models, such as COCOMO II, fall short, while Putnam's model could be an alternative for high-level PPMs evaluation. To overcome the limitations of existing models, we plan to create a parallelism-aware model to evaluate applications in this domain in future work.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Mencagli, Gabriele; Griebler, Dalvan; Danelutto, Marco. Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices. Inproceedings. In: 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 34-38, IEEE, Valladolid, Spain, 2022.
@inproceedings{MENCAGLI:PDP:22,
  title = {Towards Parallel Data Stream Processing on System-on-Chip CPU+GPU Devices},
  author = {Gabriele Mencagli and Dalvan Griebler and Marco Danelutto},
  url = {https://doi.org/10.1109/PDP55904.2022.00014},
  doi = {10.1109/PDP55904.2022.00014},
  year = {2022},
  date = {2022-04-01},
  booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {34-38},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'22},
  abstract = {Data Stream Processing is a pervasive computing paradigm with a wide spectrum of applications. Traditional streaming systems exploit the processing capabilities provided by homogeneous Clusters and Clouds. Due to the transition to streaming systems suitable for IoT/Edge environments, there has been the urgent need of new streaming frameworks and tools tailored for embedded platforms, often available as System-on-Chips composed of a small multicore CPU and an integrated on-chip GPU. Exploiting this hybrid hardware requires special care in the runtime system design. In this paper, we discuss the support provided by the WindFlow library, showing its design principles and its effectiveness on the NVIDIA Jetson Nano board.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Garcia, Adriano Marques; Griebler, Dalvan; Schepke, Claudio; Fernandes, Luiz Gustavo. Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores. Inproceedings. In: 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 10-17, IEEE, Valladolid, Spain, 2022.
@inproceedings{GARCIA:PDP:22,
  title = {Evaluating Micro-batch and Data Frequency for Stream Processing Applications on Multi-cores},
  author = {Adriano Marques Garcia and Dalvan Griebler and Claudio Schepke and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1109/PDP55904.2022.00011},
  doi = {10.1109/PDP55904.2022.00011},
  year = {2022},
  date = {2022-04-01},
  booktitle = {30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  pages = {10-17},
  publisher = {IEEE},
  address = {Valladolid, Spain},
  series = {PDP'22},
  abstract = {In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools for helping researchers to investigate micro-batching and data stream frequency implications. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects from two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, which are two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrated up to 33% throughput gain over latency using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inproceedings}
}
	Gomes, Márcio Miguel; Righi, Rodrigo Rosa; Costa, Cristiano André; Griebler, Dalvan. Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog. Journal Article. In: International Journal of Computer Science & Information Technology, vol. 14, no. 1, pp. 31-51, 2022.
@article{GOMES:IJCSIT:22,
  title = {Steam++: An Extensible End-to-end Framework for Developing IoT Data Processing Applications in the Fog},
  author = {Márcio Miguel Gomes and Rodrigo Rosa Righi and Cristiano André Costa and Dalvan Griebler},
  url = {http://dx.doi.org/10.5121/ijcsit.2022.14103},
  doi = {10.5121/ijcsit.2022.14103},
  year = {2022},
  date = {2022-02-01},
  urldate = {2022-02-01},
  journal = {International Journal of Computer Science \& Information Technology},
  volume = {14},
  number = {1},
  pages = {31-51},
  publisher = {AIRCC},
  abstract = {IoT applications usually rely on cloud computing services to perform data analysis such as filtering, aggregation, classification, pattern detection, and prediction. When applied to specific domains, the IoT needs to deal with unique constraints. Besides the hostile environment such as vibration and electricmagnetic interference, resulting in malfunction, noise, and data loss, industrial plants often have Internet access restricted or unavailable, forcing us to design stand-alone fog and edge computing solutions. In this context, we present STEAM++, a lightweight and extensible framework for real-time data stream processing and decision-making in the network edge, targeting hardware-limited devices, besides proposing a micro-benchmark methodology for assessing embedded IoT applications. In real-case experiments in a semiconductor industry, we processed an entire data flow, from values sensing, processing and analysing data, detecting relevant events, and finally, publishing results to a dashboard. On average, the application consumed less than 500kb RAM and 1.0% of CPU usage, processing up to 239 data packets per second and reducing the output data size to 14% of the input raw data size when notifying events.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Löff, Júnior; Hoffmann, Renato Barreto; Pieper, Ricardo; Griebler, Dalvan; Fernandes, Luiz Gustavo. DSParLib: A C++ Template Library for Distributed Stream Parallelism. Journal Article. In: International Journal of Parallel Programming, vol. 50, no. 5, pp. 454-485, 2022.
@article{LOFF:IJPP:22,
  title = {DSParLib: A C++ Template Library for Distributed Stream Parallelism},
  author = {Júnior Löff and Renato Barreto Hoffmann and Ricardo Pieper and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s10766-022-00737-2},
  doi = {10.1007/s10766-022-00737-2},
  year = {2022},
  date = {2022-01-01},
  journal = {International Journal of Parallel Programming},
  volume = {50},
  number = {5},
  pages = {454-485},
  publisher = {Springer},
  abstract = {Stream processing applications deal with millions of data items continuously generated over time. Often, they must be processed in real-time and scale performance, which requires the use of distributed parallel computing resources. In C/C++, the current state-of-the-art for distributed architectures and High-Performance Computing is Message Passing Interface (MPI). However, exploiting stream parallelism using MPI is complex and error-prone because it exposes many low-level details to the programmer. In this work, we introduce a new parallel programming abstraction for implementing distributed stream parallelism named DSParLib. Our abstraction of MPI simplifies parallel programming by providing a pattern-based and building block-oriented development to inter-connect, model, and parallelize data streams found in modern applications. Experiments conducted with five different stream processing applications and the representative PARSEC Ferret benchmark revealed that DSParLib is efficient and flexible. Also, DSParLib achieved similar or better performance, required less coding, and provided simpler abstractions to express parallelism with respect to handwritten MPI programs.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
	Hoffmann, Renato Barreto; Löff, Júnior; Griebler, Dalvan; Fernandes, Luiz Gustavo. OpenMP as runtime for providing high-level stream parallelism on multi-cores. Journal Article. In: The Journal of Supercomputing, vol. 78, no. 1, pp. 7655-7676, 2022.
@article{HOFFMANN:Jsuper:2022,
  title = {OpenMP as runtime for providing high-level stream parallelism on multi-cores},
  author = {Renato Barreto Hoffmann and Júnior Löff and Dalvan Griebler and Luiz Gustavo Fernandes},
  url = {https://doi.org/10.1007/s11227-021-04182-9},
  doi = {10.1007/s11227-021-04182-9},
  year = {2022},
  date = {2022-01-01},
  journal = {The Journal of Supercomputing},
  volume = {78},
  number = {1},
  pages = {7655-7676},
  publisher = {Springer},
  address = {New York, United States},
  abstract = {OpenMP is an industry and academic standard for parallel programming. However, using it for developing parallel stream processing applications is complex and challenging. OpenMP lacks key programming mechanisms and abstractions for this particular domain. To tackle this problem, we used a high-level parallel programming framework (named SPar) for automatically generating parallel OpenMP code. We achieved this by leveraging SPar’s language and its domain-specific code annotations for simplifying the complexity and verbosity added by OpenMP in this application domain. Consequently, we implemented a new compiler algorithm in SPar for automatically generating parallel code targeting the OpenMP runtime using source-to-source code transformations. The experiments in four different stream processing applications demonstrated that the execution time of SPar was improved up to 25.42% when using the OpenMP runtime. Additionally, our abstraction over OpenMP introduced at most 1.72% execution time overhead when compared to handwritten parallel codes. Furthermore, SPar significantly reduces the total source lines of code required to express parallelism with respect to plain OpenMP parallel codes.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
2021
	Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz G. High-Level Stream and Data Parallelism in C++ for Multi-Cores Inproceedings In: XXV Brazilian Symposium on Programming Languages (SBLP), pp. 41-48, ACM, Joinville, Brazil, 2021. @inproceedings{LOFF:SBLP:21, title = {High-Level Stream and Data Parallelism in C++ for Multi-Cores}, author = {Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz G. Fernandes}, url = {https://doi.org/10.1145/3475061.3475078}, doi = {10.1145/3475061.3475078}, year = {2021}, date = {2021-10-01}, booktitle = {XXV Brazilian Symposium on Programming Languages (SBLP)}, pages = {41-48}, publisher = {ACM}, address = {Joinville, Brazil}, series = {SBLP'21}, abstract = {Stream processing applications have seen an increasing demand with the increased availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that require to be processed timely. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. However, parallel programming is often difficult and error-prone, because programmers must deal with low-level system and architecture details. In this work, we introduce a new strategy for automatic data-parallel code generation in C++ targeting multi-core architectures. This strategy was integrated with an annotation-based parallel programming abstraction named SPar. We have increased SPar’s expressiveness for supporting stream and data parallelism, and their arbitrary composition. Therefore, we added two new attributes to its language and improved the compiler parallel code generation. We conducted a set of experiments on different stream and data-parallel applications to assess the efficiency of our solution. 
The results showed that the new SPar version obtained similar performance with respect to handwritten parallelizations. Moreover, the new SPar version is able to achieve up to 74.9x better performance with respect to the original ones due to this work.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Danelutto, Marco; Fernandes, Luiz Gustavo Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores Inproceedings In: 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2021), pp. 291-295, IEEE, Pavia, Italy, 2021. @inproceedings{ANDRADE:SEAA:21, title = {Assessing Coding Metrics for Parallel Programming of Stream Processing Programs on Multi-cores}, author = {Gabriella Andrade and Dalvan Griebler and Rodrigo Santos and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1109/SEAA53835.2021.00044}, doi = {10.1109/SEAA53835.2021.00044}, year = {2021}, date = {2021-09-01}, booktitle = {47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA 2021)}, pages = {291-295}, publisher = {IEEE}, address = {Pavia, Italy}, series = {SEAA'21}, abstract = {From the popularization of multi-core architectures, several parallel APIs have emerged, helping to abstract the programming complexity and increasing productivity in application development. Unfortunately, only a few research efforts in this direction managed to show the usability pay-back of the programming abstraction created, because it is not easy and poses many challenges for conducting empirical software engineering. We believe that coding metrics commonly used in software engineering code measurements can give useful indicators on the programming effort of parallel applications and APIs. These metrics were designed for general purposes without considering the evaluation of applications from a specific domain. In this study, we aim to evaluate the feasibility of seven coding metrics to be used in the parallel programming domain. To do so, five stream processing applications implemented with different parallel APIs for multi-cores were considered. 
Our experiments have shown COCOMO II is a suitable model for evaluating the productivity of different parallel APIs targeting multi-cores on stream processing applications while other metrics are restricted to the code size.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
	Löff, Júnior; Griebler, Dalvan; Mencagli, Gabriele; Araujo, Gabriell; Torquati, Massimo; Danelutto, Marco; Fernandes, Luiz Gustavo The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures Journal Article In: Future Generation Computer Systems, vol. 125, pp. 743-757, 2021. @article{LOFF:FGCS:21, title = {The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures}, author = {Júnior Löff and Dalvan Griebler and Gabriele Mencagli and Gabriell Araujo and Massimo Torquati and Marco Danelutto and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.future.2021.07.021}, doi = {10.1016/j.future.2021.07.021}, year = {2021}, date = {2021-07-01}, journal = {Future Generation Computer Systems}, volume = {125}, pages = {743-757}, publisher = {Elsevier}, abstract = {The NAS Parallel Benchmarks (NPB), originally implemented mostly in Fortran, is a consolidated suite containing several benchmarks extracted from Computational Fluid Dynamics (CFD) models. The benchmark suite has important characteristics such as intensive memory communications, complex data dependencies, different memory access patterns, and hardware components/sub-systems overload. Parallel programming APIs, libraries, and frameworks that are written in C++ as well as new optimizations and parallel processing techniques can benefit if NPB is made fully available in this programming language. In this paper we present NPB-CPP, a fully C++ translated version of NPB consisting of all the NPB kernels and pseudo-applications developed using OpenMP, Intel TBB, and FastFlow parallel frameworks for multicores. The design of NPB-CPP leverages the Structured Parallel Programming methodology (essentially based on parallel design patterns). 
We show the structure of each benchmark application in terms of composition of few patterns (notably Map and MapReduce constructs) provided by the selected C++ frameworks. The experimental evaluation shows the accuracy of NPB-CPP with respect to the original NPB source code. Furthermore, we carefully evaluate the parallel performance on three multi-core systems (Intel, IBM Power and AMD) with different C++ compilers (gcc, icc and clang) by discussing the performance differences in order to give to the researchers useful insights to choose the best parallel programming framework for a given type of problem.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
	Pieper, Ricardo; Löff, Júnior; Hoffmann, Renato Barreto; Griebler, Dalvan; Fernandes, Luiz Gustavo High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores Journal Article In: Journal of Computer Languages, vol. 65, pp. 101054, 2021. @article{PIEPER:COLA:21, title = {High-level and Efficient Structured Stream Parallelism for Rust on Multi-cores}, author = {Ricardo Pieper and Júnior Löff and Renato Barreto Hoffmann and Dalvan Griebler and Luiz Gustavo Fernandes}, url = {https://doi.org/10.1016/j.cola.2021.101054}, doi = {10.1016/j.cola.2021.101054}, year = {2021}, date = {2021-07-01}, journal = {Journal of Computer Languages}, volume = {65}, pages = {101054}, publisher = {Elsevier}, abstract = {This work aims at contributing with a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (e.g., video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstraction for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions. 
In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
