Hi there,
I am currently working on bidirectional (Bidi) streaming over WebSockets, and I am trying to add an image streaming feature as well.
When I send images together with audio (or text), I get a response. But I couldn't find a way to get a response to image data alone. Is there a way to do that?
In main.py, I handle image data like this:
elif mime_type.startswith("image"):
    # Send image data (video frames)
    decoded_data = base64.b64decode(data)
    live_request_queue.send_realtime(Blob(data=decoded_data, mime_type=mime_type))
    print(f"[CLIENT TO AGENT]: {mime_type}: {len(decoded_data)} bytes")
In app.js, I added:
const startCamButton = document.getElementById('startCamButton')
startCamButton.addEventListener('click', async () => {
  try {
    startCamButton.disabled = true
    startAudioButton.disabled = true
    startAudio()
    is_audio = true
    connectWebsocket()
    // Start video capture at fps FPS
    await startVideoCapture(videoFrameHandler, fps)
  } catch (error) {
    console.error('Failed to start camera:', error)
    startCamButton.disabled = false
    alert(
      'Failed to access camera. Please make sure you have granted camera permissions.'
    )
  }
})
// Video frame handler
function videoFrameHandler(frameData) {
  // Send the frame data as base64
  sendMessage({
    mime_type: 'image/jpeg',
    data: arrayBufferToBase64(frameData),
  })
  console.log('[CLIENT TO AGENT] sent video frame: bytes')
}
With these additions, I don't get a response to image data alone. For example, when I say "Tell me the number of fingers you see, continuously," I only get a response while I am talking. I'd like to send webcam frames and get responses like "You are showing three fingers" without having to speak or type anything.
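To make the desired behavior concrete, here is a rough sketch of one possible workaround, not a confirmed solution: forward each frame as before, and have the server also send a short text prompt at a throttled rate so the model has a turn to respond to. The `FramePrompter` class, its parameters, and the prompt wording are hypothetical names introduced for illustration. Whether a text turn sent this way actually triggers a response to the most recent frames is an assumption I have not verified.

```python
import time
from typing import Callable, Optional


class FramePrompter:
    """Hypothetical helper: sends a text prompt at most once per interval."""

    def __init__(
        self,
        send_text: Callable[[str], None],
        prompt: str,
        interval_s: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_text = send_text      # callback that delivers text to the agent
        self._prompt = prompt            # text sent alongside the frames
        self._interval_s = interval_s    # minimum seconds between prompts
        self._clock = clock              # injectable clock, useful for testing
        self._last_sent: Optional[float] = None

    def on_frame(self) -> bool:
        """Call after forwarding a frame. Returns True if a prompt was sent."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._interval_s:
            return False  # too soon since the last prompt, skip this frame
        self._last_sent = now
        self._send_text(self._prompt)
        return True


# Assumed wiring in main.py (untested; depends on the queue accepting text turns):
#   prompter = FramePrompter(
#       lambda text: live_request_queue.send_content(
#           Content(role="user", parts=[Part.from_text(text=text)])),
#       "Describe what you see in the latest frame.")
#   ...then call prompter.on_frame() right after send_realtime(...) in the image branch.
```

The throttle is there so that a 1 FPS (or faster) frame stream does not produce one model turn per frame; the interval would need tuning against response latency.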