I'm running a Spark application through an Apache Zeppelin notebook. Right after startup, tasks complete in seconds, but after a couple of days of continuous operation the same tasks start taking several minutes. Restarting the Spark session temporarily restores performance, but the slowdown always comes back.
Here are some details about my setup:
- Spark Version: 3.2.3
- Zeppelin Version: 0.11.1
I’ve observed the following:
- There are no obvious errors or failures in the logs.
- Resource usage (CPU, memory, disk) doesn’t appear to be maxed out.
Questions:
- What are the potential causes of this gradual slowdown in Spark performance over time?
- Are there specific configurations or optimizations I should check to prevent this issue?
- Could this be related to memory leaks, accumulation of metadata, or other long-running session issues?
- Are there best practices for managing long-running Spark sessions with Zeppelin to avoid performance degradation?
Any insights, debugging tips, or recommendations would be greatly appreciated!