First, I assembled the system as shown below. During the initial setup, the system builds an image database of known persons. During registration, the facial landmarks of each person are detected and an affine transformation is applied to obtain a frontal (aligned) view. These aligned images are saved and later compared against incoming frames to identify the person.
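The write-up does not say which library performs the alignment, so as an illustration only: an affine transform that maps detected landmarks (e.g. the two eye centers and nose tip) onto canonical frontal positions can be estimated from three point correspondences. The sketch below solves for the 2×3 affine matrix with Cramer's rule in plain Python; the landmark coordinates are made-up examples, not values from the project.

```python
def solve3(A, b):
    """Solve a 3x3 linear system A·x = b via Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    D = det(A)
    xs = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = b[r]
        xs.append(det(M) / D)
    return xs

def affine_from_landmarks(src, dst):
    """Estimate the 2x3 affine matrix mapping 3 source landmarks onto
    3 canonical frontal positions: [x', y'] = [a b; c d]·[x, y] + [tx, ty]."""
    A = [[x, y, 1.0] for (x, y) in src]
    row_x = solve3(A, [x for (x, _) in dst])  # a, b, tx
    row_y = solve3(A, [y for (_, y) in dst])  # c, d, ty
    return [row_x, row_y]

# Hypothetical landmarks: a face shifted 5 px right and 3 px down
# relative to the canonical frontal template.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(5.0, 3.0), (6.0, 3.0), (5.0, 4.0)]
M = affine_from_landmarks(src, dst)
print(M)  # [[1.0, 0.0, 5.0], [0.0, 1.0, 3.0]] — a pure translation
```

In practice this matrix would be fed to an image-warping routine (OpenCV's `cv2.warpAffine` is a common choice) to produce the frontalized crop that gets stored in the database.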
To avoid repeated triggers, a greeting is published only when the same person has not appeared in the last 'n' frames. This is implemented with a double-ended queue that stores the identities from recent frames. When a person is identified, the greeting message is spoken via the eSpeak text-to-speech synthesizer; the voice configuration on the Pi was set up beforehand.
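The debounce logic described above can be sketched with `collections.deque` and its `maxlen` parameter, which automatically discards identities older than n frames. The class name and window size here are illustrative, not from the original code.

```python
from collections import deque

class GreetingDebouncer:
    """Publish a greeting only if the person was absent from the last n frames."""

    def __init__(self, n=3):
        # Oldest entries fall off automatically once the deque holds n items.
        self.recent = deque(maxlen=n)

    def update(self, person_id):
        """Record this frame's identification (None = nobody recognized)
        and return True when a greeting should be published."""
        should_greet = person_id is not None and person_id not in self.recent
        self.recent.append(person_id)
        return should_greet

d = GreetingDebouncer(n=3)
print(d.update("alice"))  # True  -> publish greeting
print(d.update("alice"))  # False -> seen in the last 3 frames, suppressed
print(d.update(None))     # False
print(d.update(None))     # False
print(d.update(None))     # False -> "alice" has now aged out of the window
print(d.update("alice"))  # True  -> greet again
```

When `update` returns True, the greeting would be handed to eSpeak; a common way to do that is a subprocess call such as `subprocess.run(["espeak", "Hello, Alice"])`, though the original setup may invoke it differently.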