Why is ReLU a good activation function?
I understand that ReLU helps avoid the vanishing-gradient problem during backpropagation.

However, I don't understand why ReLU is used as an activation function if its output is linear.

Doesn't the whole point of an activation function get defeated if it doesn't introduce non-linearity? Can someone explain mathematically why this is a good choice?
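For context on the question: a quick check in plain Python (the `relu` helper is my own, just `max(0, x)`) suggests ReLU is not actually linear, since a linear function must satisfy f(a + b) = f(a) + f(b) for all inputs, and ReLU fails this whenever the signs of a and b differ:

```python
def relu(x):
    # ReLU: identity for positive inputs, zero otherwise
    return max(0.0, x)

a, b = 2.0, -3.0

# Additivity test for linearity: f(a + b) == f(a) + f(b)?
print(relu(a + b))        # relu(-1.0) -> 0.0
print(relu(a) + relu(b))  # 2.0 + 0.0  -> 2.0, so ReLU is not linear
```

ReLU is linear only on each half of its domain separately (piecewise linear), which is where the confusion in the question seems to come from.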