BLIP Encoder-based Image Captioning

I want to fine-tune Hugging Face's BLIP model for image captioning. Specifically, I want the caption to be generated by the encoder rather than the decoder. How can I do this with Hugging Face?
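For context, the standard fine-tuning setup I'm starting from looks roughly like this: a `BlipForConditionalGeneration` model that returns a captioning loss when `labels` are passed along with the pixel values and token ids. This is only a minimal sketch under my own assumptions (checkpoint name, single-batch training step, no dataloader or scheduler), not a complete training script:

```python
# Hypothetical minimal BLIP captioning fine-tune sketch (Hugging Face
# Transformers). Note: in BLIP the vision encoder only produces image
# features; the text decoder is what actually generates the caption.
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; any BLIP captioning checkpoint should work the same way.
MODEL_NAME = "Salesforce/blip-image-captioning-base"


def load_model(name: str = MODEL_NAME):
    """Load the processor (image + tokenizer) and the captioning model."""
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    return processor, model


def train_step(model, batch, optimizer):
    """One optimization step on a batch of (pixel_values, input_ids).

    Passing `labels` makes the model compute the language-modeling loss
    for the caption tokens internally.
    """
    outputs = model(
        pixel_values=batch["pixel_values"],
        input_ids=batch["input_ids"],
        labels=batch["input_ids"],
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()


if __name__ == "__main__":
    processor, model = load_model()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    # batch = processor(images=..., text=..., return_tensors="pt")
    # loss = train_step(model, batch, optimizer)
```

What I'm unsure about is whether (and how) this loss/generation path can be redirected to the encoder instead of the decoder.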