Energy-efficient Data Engineering Practices for Big Data Workloads in Cloud Infrastructure
The exponential growth of data-driven applications in modern enterprises has significantly increased the demand for big data processing within cloud infrastructures. While the scalability and flexibility of cloud services offer clear advantages, the energy consumption associated with running large-scale data engineering workloads presents critical sustainability and operational challenges. This paper explores and evaluates a set of energy-efficient practices tailored for engineering big data pipelines in cloud environments, with the goal of minimizing carbon footprints without sacrificing performance or reliability.
We adopt a mixed-methods approach, combining a literature analysis, cloud benchmarking experiments on platforms such as AWS and Azure, and comparative evaluations of common tools including Apache Spark, AWS Glue, and Apache NiFi. Energy and performance metrics, including power usage, execution time, and CPU utilization, were collected using cloud-native and third-party monitoring solutions. Several optimization strategies were examined: serverless compute provisioning, storage tiering, compressed columnar file formats (e.g., Parquet with Snappy), and directed acyclic graph (DAG) orchestration enhancements.
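As a minimal illustration of the format-compression strategy mentioned above, the following PySpark sketch rewrites raw CSV data as Snappy-compressed Parquet; the bucket paths and schema are hypothetical placeholders and do not come from the study itself.

```python
# Sketch: convert raw CSV input to Snappy-compressed Parquet with Apache Spark.
# Paths below are illustrative assumptions, not the paper's actual datasets.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-snappy-compression")
    .getOrCreate()
)

# Read an uncompressed CSV dataset (hypothetical location).
raw_df = spark.read.option("header", "true").csv("s3://example-bucket/raw/events/")

# Write the same data as columnar Parquet with Snappy compression,
# which typically shrinks storage footprint and downstream I/O.
(
    raw_df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("s3://example-bucket/curated/events_parquet/")
)

spark.stop()
```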
Results from our experiments demonstrate that these optimized data engineering techniques can lead to energy savings of up to 37% across different pipeline stages, with minimal impact on execution time. Furthermore, the study highlights how cloud-native services, when configured with sustainability in mind, can align enterprise data operations with global green computing goals. This work contributes practical insights and actionable frameworks for engineers and organizations seeking to enhance the energy efficiency of their big data workloads in cloud infrastructure.
Copyright (c) 2023 Kishore Arul (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.