Seeing your VR avatar = knowing your typed content:

Keystroke Inference Attack in Shared VR Environments

Virtual Reality (VR) has gained popularity by providing immersive and interactive experiences without geographical limitations. It also provides a sense of personal privacy through physical separation. In this paper, we show that despite assumptions of enhanced privacy, VR is unable to shield its users from side-channel attacks that steal private information. Ironically, this vulnerability arises from VR's greatest strength, its immersive and interactive nature. We demonstrate this by designing and implementing a new set of keystroke inference attacks in shared virtual environments, where an adversary (VR user) can recover the content typed by another VR user by observing their avatar.


Figure 1: An illustrative example of keystroke inference attacks in a shared VR space. In this scenario, U is writing a document while enjoying the immersive experience of a virtual beachside cafe. U types comfortably on a physical keyboard that is wirelessly connected to the VR headset. Adversary A is another VR user in the same virtual beachside cafe. From U's perspective, they see the typed content on a virtual screen and a rendered view of their own hands. From A's perspective, they see U's avatar making typing hand motions. Using this information, A seeks to recover the content that U is typing. Credit: The VR scenes are screenshots taken while running the Horizon Workrooms application from Meta.

VR enables immersive interaction among people in shared virtual environments where each VR user is represented by their own avatar rather than their physical self. When an individual performs physical actions such as typing on a keyboard, their avatar demonstrates an approximate, digital version of those hand movements, which are visible to other users in the same virtual space (as illustrated in Figure 1).


Different Levels of Information Access → Three Attack Scenarios

We assume that the adversary (A) can only gather data related to the target user U's avatar. To render U's avatar for A, A's headset receives a continuous stream of U's telemetry data, i.e., U's handposes. We investigate three attack scenarios by varying A's level of information access (see Figure 2).


Figure 2: Our work considers three keystroke inference attacks which operate on the original telemetry, the observed telemetry (used to render the target's avatar), or the video of the target's avatar displayed on the adversary's VR screen.

Attack I: Original telemetry attack
This represents the strongest adversary, one who has access to either the target's headset or the VR rendering server. Here the attacker is able to obtain the 3D telemetry data on U's hands collected by U's headset.
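To make the data involved concrete, below is a minimal Python sketch of what a per-frame hand telemetry record could look like to an Attack I adversary who logs the stream from U's headset or the rendering server. The field names and format are our own illustration, not the actual format of any specific VR SDK.

# Illustrative only: a hypothetical per-frame hand telemetry record an
# Attack I adversary could log. Field names are assumptions, not an SDK format.
from dataclasses import dataclass
from typing import List, Tuple
import json

Vec3 = Tuple[float, float, float]          # (x, y, z) in tracking space, meters

@dataclass
class HandTelemetryFrame:
    timestamp_ms: float                     # capture time of this sample
    left_joints: List[Vec3]                 # 3D positions of the left-hand joints
    right_joints: List[Vec3]                # 3D positions of the right-hand joints

def log_stream(frames: List[HandTelemetryFrame], path: str) -> None:
    """Persist the raw 3D telemetry stream for offline keystroke inference."""
    with open(path, "w") as f:
        json.dump([frame.__dict__ for frame in frames], f)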

Attack II: Observed telemetry attack
Here the adversary obtains U's avatar data by setting up a virtual camera (e.g., A's avatar in the same VR environment). To render the virtual hands of U's avatar to be seen by A, the VR system needs to transform the original telemetry data to screen coordinates of A's virtual camera.
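The sketch below illustrates the transformation this attack relies on: projecting U's 3D world-space hand joints into the screen coordinates of A's virtual camera. The matrix conventions (view and projection matrices, a camera looking down -z) are illustrative assumptions rather than any particular engine's API.

# A minimal sketch of the world-to-screen transformation that the VR system
# must perform before A's headset can draw U's virtual hands.
import numpy as np

def world_to_screen(joints_world: np.ndarray,   # (N, 3) world-space joint positions
                    view: np.ndarray,            # (4, 4) A's camera view matrix
                    proj: np.ndarray,            # (4, 4) perspective projection matrix
                    width: int, height: int) -> np.ndarray:
    """Return (N, 3) screen-space joints: (pixel x, pixel y, depth to camera)."""
    n = joints_world.shape[0]
    homo = np.hstack([joints_world, np.ones((n, 1))])   # to homogeneous coordinates
    clip = (proj @ view @ homo.T).T                      # world -> clip space
    ndc = clip[:, :3] / clip[:, 3:4]                     # perspective divide
    x_px = (ndc[:, 0] * 0.5 + 0.5) * width               # NDC -> pixel coordinates
    y_px = (1.0 - (ndc[:, 1] * 0.5 + 0.5)) * height
    depth = -(view @ homo.T).T[:, 2]                      # distance along camera forward (-z)
    return np.stack([x_px, y_px, depth], axis=1)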

Attack III: Rendered handpose attack
This attack operates on a 2D projection of the observed (transformed) telemetry used by Attack II, i.e., with the depth-to-camera data removed. This represents a limited adversary who has no access to any telemetry data, but instead records U's avatar shown on A's VR screen and applies a hand tracking tool to extract the 2D handpose from each video frame.
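Below is a minimal sketch of this pipeline: read the screen-recorded video frame by frame and run an off-the-shelf hand tracker to obtain 2D handposes in pixel coordinates. We use MediaPipe Hands here purely as one example of such a tracker; the specific tool and parameters are assumptions.

# Illustrative Attack III pipeline: screen-recorded video -> 2D handposes.
import cv2
import mediapipe as mp

def extract_2d_handposes(video_path: str):
    """Yield, per video frame, a list of detected hands as 2D pixel landmarks."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        landmarks = []
        for hand in (result.multi_hand_landmarks or []):
            landmarks.append([(lm.x * w, lm.y * h) for lm in hand.landmark])
        yield landmarks                      # no depth-to-camera data is available here
    cap.release()
    hands.close()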


Attack Evaluation

We evaluated our attacks across 15 users with different typing behaviors. On average, the original telemetry attack accurately recognizes over 90% of the typed keys, and the recovered content retains more than 70% of the meaning of the typed content. We also evaluated the performance of the observed telemetry and rendered handpose attacks. In Table 1, we present the attack performance for one participant; for more comprehensive results, please refer to our paper.


Table 1: Performance of original telemetry, observed telemetry and rendered handpose attacks on a study participant.

Defense

Our study shows that by either intercepting the telemetry data used to render a target's avatar or simply observing the rendered hand movements of that avatar, an attacker in the same virtual environment can successfully recover the text the VR user physically types. Left unaddressed, these attacks can cause significant harm to users. Therefore, we also discuss several defense options:

Defense I: Limiting access to telemetry (hand tracking) data
We seek to minimize the chance of leaking sensitive handpose data to attackers. In our experiments, reducing the sampling rate of the shared handpose data from the default 60fps to 15fps drops the percentage of fully recovered words from 85.8% to 62%. When the rate is further reduced to 6fps, the attack becomes ineffective.
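A minimal sketch of this defense, assuming telemetry frames are shared as a simple list sampled at a fixed rate: keep only every k-th frame so the effective rate drops from 60fps to the target rate (k=4 for 15fps, k=10 for 6fps).

# Illustrative rate-limiting of the handpose telemetry shared for avatar rendering.
def downsample_telemetry(frames, original_fps: int = 60, target_fps: int = 15):
    """Return the subset of telemetry frames that would be shared at the reduced rate."""
    keep_every = max(1, round(original_fps / target_fps))
    return frames[::keep_every]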

Defense II: Adding noise to telemetry data
We perturb the handpose data to prevent the attacker from extracting useful keystroke information. We design the noise to be hard to remove; with the injected noise, the attack becomes much less effective (over 50% of the words are recovered incorrectly). Please refer to our paper for the detailed design of this defense.
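For illustration only, the sketch below perturbs each 3D joint position with zero-mean Gaussian noise before the handpose data is shared. The noise scale and structure here are simplifying assumptions; the defense in the paper designs the noise specifically so that the attacker cannot easily filter it out.

# Illustrative noise injection into shared handpose data (not the paper's exact noise design).
import numpy as np

def add_handpose_noise(joints: np.ndarray, sigma_m: float = 0.01,
                       rng: np.random.Generator = None) -> np.ndarray:
    """Add zero-mean Gaussian noise (std sigma_m meters) to (N, 3) joint positions."""
    rng = rng or np.random.default_rng()
    return joints + rng.normal(0.0, sigma_m, size=joints.shape)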

Publication

Can Virtual Reality Protect Users from Keystroke Inference Attacks?
Zhuolin Yang, Zain Sarwar, Iris Hwang, Ronik Bhaskar, Ben Y. Zhao and Haitao Zheng.
To appear in Proceedings of the USENIX Security Symposium 2024. (arXiv)

Press

New Scientist: AI can steal passwords in virtual reality from avatar hand motions