Figure 1: Sample attack scenarios: (a) an indoor lounge scenario where the attacker pretends to watch a video while recording the victim typing; (b) a long-range outdoor scenario where the attacker hides a smartphone with a budget telephoto lens (<$60) inside a building (behind a window) to record the victim typing in the courtyard (∼12 m away). The victim can type on a physical keyboard, like the iPad keyboard used in (a) and (b), or, as in (c), use an "invisible" keyboard and type directly on the table.
Our work exposes a new, general video-based keystroke inference attack that operates in common public settings. The attacker needs only a single commodity RGB camera, which records the target's typing fingers from a frontal view (see Figure 1a). By analyzing the video, the attacker recovers the typed content. Our work differs significantly from prior work in that it does not rely on side-channel data or on any assumption beyond having a frontal view of the target's typing hands: the attack assumes no pretraining, no keyboard knowledge, no training data from the target, no local sensors, and no side channels. Our user study results show that such attacks can succeed in realistic scenarios, including long-range attacks (Figure 1b) and typing on iPads or on invisible keyboards. Our work highlights the immediate need for users working in public settings to protect their typing privacy, e.g., by setting up a physical screen that blocks frontal views of their hands.
Figure 2: Our self-supervised approach to keystroke inference. We first run unsupervised inference on fingertip data extracted from each video frame, from which we identify keystrokes with high-confidence labels (this process is marked by thin arrows). We then use these labeled keystrokes as training data to build DNN models that detect and recognize keystrokes directly from the video (thick arrow).
A self-supervised approach to keystroke inference. We propose a new approach to keystroke inference that takes no input other than video captured from a distance via commodity phone cameras. The key insight is to use a two-layer self-supervised system. In the first layer, the noisy output of hand tracking on the target video drives keystroke detection and clustering, and a language-based Hidden Markov Model (HMM) then recognizes the keystrokes. These initial labels are filtered using multiple consistency checks to produce high-confidence labels on video frames. In the second layer, the filtered labels are used to train two 3D-CNN models that detect and recognize keystrokes directly from the video. This two-layer process is illustrated in Figure 2.
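To make the role of the language-based HMM more concrete, the following is a minimal sketch of Viterbi decoding, the standard way to find the most likely hidden sequence in an HMM. It is not the authors' implementation: the function name `viterbi`, the two-letter alphabet, the cluster IDs, and every probability below are hypothetical toy values chosen only for illustration. Hidden states stand for typed characters, observations stand for noisy keystroke cluster IDs, and the transition table plays the part of a language model.

```python
import math


def to_log(table):
    """Convert a nested dict of probabilities into log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s]: best log-probability of any path that ends in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on the best path into step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy setup (all values are assumptions): two characters, two keystroke clusters.
states = ["a", "b"]
log_start = {"a": math.log(0.5), "b": math.log(0.5)}
# Hypothetical "language model": the two characters strongly tend to alternate.
log_trans = to_log({"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.9, "b": 0.1}})
# Noisy clustering: cluster 0 is usually "a", cluster 1 is usually "b".
log_emit = to_log({"a": {0: 0.7, 1: 0.3}, "b": {0: 0.3, 1: 0.7}})

decoded = viterbi([0, 1, 1, 1], states, log_start, log_trans, log_emit)
print(decoded)
```

In this toy run the third observation (cluster 1) would suggest "b" on its own, but the transition model makes the alternating sequence more likely overall, so the decoder outputs "a" at that position. This is the general way a language prior can correct noisy per-keystroke evidence; the actual state space, emission model, and consistency checks used in the attack are not specified in this summary.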
We evaluate our video-based attacks using real-world user studies under a diverse set of conditions: (1) different scenarios, covering environments (indoor/outdoor, varying attack distances and blockages) and keyboard devices (visible/invisible keyboards, varying size and layout, placed on a desk vs. on a lap); and (2) 16 different users with different typing styles (e.g., using different sets of fingers to type) and abilities (e.g., high-speed typing). The attack is highly effective in nearly all scenarios and performs well across our user study participants, despite their significantly different typing behaviors.
Towards a General Video-based Keystroke Inference Attack.
Zhuolin Yang, Yuxin Chen, Zain Sarwar, Hadleigh Schwartz, Ben Y. Zhao and Haitao Zheng.
To Appear In Proceedings of USENIX Security Symposium 2023.