Ah, after watching the video I see what you want to do. But clearly the method uses blob detection and sets some variables, with several sine (or other) generators driven by the detected blobs. These generators are created only once, or every n frames, not on each frame.
Actually, the camera image is perhaps scanned only about once per second, and the sounds are generated only after this scan.
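To make the idea concrete, here is a rough sketch of how I imagine it could work. This is only my guess, not the method from the video. All names (`SineGenerator`, `BlobSonifier`, `blob_to_params`), the x-to-pitch and y-to-loudness mapping, and the 30-frame scan interval are assumptions of mine. The blob detector is just a callable you pass in, so any library could sit behind it.

```python
import math

SAMPLE_RATE = 44100  # audio samples per second (assumed value)


class SineGenerator:
    """A persistent oscillator: created once, afterwards only retuned."""

    def __init__(self, freq=440.0, amp=0.0):
        self.freq = freq
        self.amp = amp
        self.phase = 0.0

    def render(self, n_samples):
        """Return n_samples of audio, keeping the phase between calls."""
        out = []
        step = 2.0 * math.pi * self.freq / SAMPLE_RATE
        for _ in range(n_samples):
            out.append(self.amp * math.sin(self.phase))
            self.phase = (self.phase + step) % (2.0 * math.pi)
        return out


def blob_to_params(blob, frame_w, frame_h):
    """Hypothetical mapping: x position -> pitch, y position -> loudness."""
    x, y = blob
    freq = 200.0 + 800.0 * (x / frame_w)
    amp = 1.0 - (y / frame_h)
    return freq, amp


class BlobSonifier:
    def __init__(self, n_generators=4, scan_every=30, frame_w=640, frame_h=480):
        # The generators are built a single time, here, not on every frame.
        self.generators = [SineGenerator() for _ in range(n_generators)]
        self.scan_every = scan_every  # e.g. 30 frames, about 1 s at 30 fps
        self.frame_w = frame_w
        self.frame_h = frame_h
        self.frame_count = 0
        self.scans = 0

    def on_frame(self, detect_blobs):
        """Call once per camera frame; detect_blobs() returns [(x, y), ...]."""
        if self.frame_count % self.scan_every == 0:
            # Only now is the image scanned and the generators retuned.
            blobs = detect_blobs()
            self.scans += 1
            for i, gen in enumerate(self.generators):
                if i < len(blobs):
                    gen.freq, gen.amp = blob_to_params(
                        blobs[i], self.frame_w, self.frame_h
                    )
                else:
                    gen.amp = 0.0  # silence generators that have no blob
        self.frame_count += 1
```

Between scans the generators simply keep playing with their last settings, which would explain why the sound in the video does not follow every single frame.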