A Survey of Human Pose Recognition Based on WiFi Sensing and Neural Networks
1. Introduction
Currently, behavior recognition is a hot research topic [1] [2]. Three-dimensional human pose estimation, a core task in computer vision and artificial intelligence, plays a critical role in numerous fields, such as human-computer interaction, virtual reality, medical monitoring, and security surveillance [3]. However, traditional methods primarily rely on RGB cameras or depth sensors, whose performance is significantly limited in privacy-sensitive scenarios (e.g. homes, hospitals) and in environments with insufficient lighting or occlusions. For example, camera-based systems not only require an unobstructed line of sight but are also highly susceptible to environmental factors, making large-scale deployment challenging.
In recent years, research on human perception using wireless signals, particularly WiFi, has become a hotspot. This approach offers significant advantages such as non-invasiveness, strong penetrability, no line-of-sight requirement, and privacy protection [4], attracting numerous researchers. Among various wireless signal features, Channel State Information (CSI) has become a key tool for human activity recognition and pose estimation due to its ability to capture fine-grained characteristics of signal propagation (e.g. amplitude and phase of multiple subcarriers) [5]. The development of deep learning has brought significant breakthroughs to CSI-based pose estimation. Neural networks such as CNNs and RNNs are widely used to learn complex patterns from CSI data and map them to human poses while effectively capturing spatiotemporal dependencies in the data [6] [7].
Additionally, fusing CSI with multimodal data further enhances the robustness of pose estimation. Some studies use visual data to generate training labels or to supplement critical information in complex environments [8]. In terms of application prospects, this technology demonstrates great potential in medical monitoring (e.g. fall detection for the elderly), security, gaming (sensorless motion capture), and other fields [9].
This comprehensive review systematically investigates the current state of research concerning pose estimation utilizing CSI and neural network architectures. The study provides a detailed synthesis of existing methodologies and compares the merits and limitations of typical approaches across various applications, including data preprocessing techniques, network models, and feature extraction strategies. Furthermore, this paper examines the challenges associated with environmental variability and practical deployment. These challenges, stemming from factors such as multipath fading, dynamic environments, and subject-specific variations, impact the generalizability and reliability of CSI-based pose estimation models. Finally, the review identifies and discusses promising avenues for future research, emphasizing the development of novel algorithms and techniques designed to enhance model generalization capabilities and robustness. This work contributes a valuable resource for researchers seeking to navigate the rapidly evolving landscape of wireless sensing and human activity recognition.
2. State of the Art
2.1. Research Status of 3D Human Pose Recognition Based on Computer Vision
In the field of computer vision, researchers focus on estimating 3D human poses and shapes from video images. Many scholars have proposed 3D human pose recognition models based on computer vision. For example, Lin et al. [10] proposed a model for reconstructing 3D human poses and mesh vertices from single images. It uses an attention encoder to jointly model vertex-vertex and vertex-joint interactions while outputting 3D joint coordinates and mesh vertices. Li et al. [11] proposed HybrIK, a hybrid inverse kinematics method connecting 3D pose estimation and body mesh regression, which converts 3D keypoint positions into 3D human meshes, enabling end-to-end training and forming a closed loop between 3D skeletons and parametric models. This solves the alignment problem of model-based methods and the unrealistic human structure issue of keypoint estimation methods. Jin [12] used the ResNeXt-CBAM encoding network and the parametric human model SMPL to reconstruct 3D humans from single-frame RGB images, effectively improving 3D human reconstruction performance and reducing errors. Although these methods achieve high recognition accuracy, they are strongly affected by factors such as lighting and occlusion, may raise privacy issues, and cannot penetrate obstacles like walls.
2.2. Research Status of 3D Human Pose Recognition Based on Wireless Millimeter-Wave Radar
Millimeter-wave radar identifies 3D human poses by transmitting millimeter-wave signals and analyzing the phase, amplitude, and other information in the received signals. Pioneering work such as RF-Capture [13] demonstrated the possibility of using Radio Frequency (RF) signals for human recognition, sparking significant interest in human pose and shape estimation. Subsequent studies such as RF-Pose [14] and RF-Avatar [15] used finer-grained RF signals to construct human poses and meshes. These works show that RF signals contain sufficient information to estimate human poses and overcome many limitations of traditional vision-based methods, such as poor lighting, clothing interference, and privacy issues. Li et al. [16] proposed a precise human pose estimation system based on 77 GHz millimeter-wave radar, which generates heatmaps from two sets of radar data and uses a CNN to convert 2D heatmaps into human poses. Kwon et al. [17] developed a hands-free human activity recognition system using millimeter-wave sensors, whose network protects user privacy and can reconstruct the skeleton of the active human body. Xue et al. [18] introduced the first real-time 3D human mesh estimation system using commercial portable millimeter-wave devices, innovatively addressing point cloud sparsity with a deep learning framework, representing complex human meshes with few parameters, generating more realistic meshes using prior knowledge, and using Recurrent Neural Networks (RNNs) to handle missing body part points. Gu et al. [19] proposed “mmSense”, a multi-person detection and recognition framework that uses the unique sensing characteristics of millimeter waves and an LSTM-based classification model to detect and locate multiple persons simultaneously. While millimeter-wave radar-based 3D human pose recognition is unaffected by lighting conditions, it still cannot penetrate walls and requires expensive specialized hardware.
2.3. Research Status of 3D Human Pose Recognition Based on WiFi Devices and Neural Networks
With the rapid development of information technology, home WiFi devices have become ubiquitous. By collecting and processing the radio frequency signals of WiFi, we can generate 3D human poses. Compared with data from video, audio, or optical signals, WiFi signals offer unique advantages: they work in low-light environments, can penetrate walls, effectively protect privacy, and require no specialized hardware (e.g. USRP) or dedicated radio waveforms (e.g. FMCW), making them low-cost. Therefore, this technology has enormous application potential.
WiFi technology uses the transmission characteristics of wireless signals to recognize 3D human poses by analyzing CSI information in signals received by receivers [20]. When humans move within a WiFi coverage area, pose changes affect signal propagation characteristics, allowing pose inference. Common CSI collection tools include the Linux 802.11n CSI Tool [21], ESP32 CSI Tool [22], and Atheros CSI Tool [23]. CSI samples from WiFi systems are often affected by noise and interference, including random phase drift and inversion. Some solutions only use CSI amplitude information while ignoring phase [24], compromising information integrity. The PhaseFi system created by Wang et al. [25] used unwrapping and linear transformation to denoise phase signals, effectively preserving integrity.
Jiang et al. [26] extracted 2D Angle of Arrival (AoA) spectra from WiFi signals to locate different body parts and used deep learning models to establish complex relationships between 2D AoA spectra and 3D skeletons for pose tracking. Ren et al. [27] used a CNN and LSTM deep learning model to abstract 3D human poses from 2D AoA, where the CNN extracts spatial dynamics (e.g. limb and torso positions) and the LSTM models temporal dynamics (e.g. limb and torso trajectories). Han et al. [28] proposed a cross-modal meta-learning method based on Model-Agnostic Meta-Learning (MAML) for few-shot human activity recognition using WiFi, addressing data dependency and scalability issues. Wang et al. [29]-[31] demonstrated that commercial WiFi can construct 3D human meshes and estimate the 2D AoA of signal reflections using multiple transmit and receive antennas on WiFi devices. Ren et al. [32] proposed GoPose, a 3D skeleton-based human pose estimation system using reusable WiFi devices in home environments, suitable for predefined activities at fixed locations but capable of through-wall estimation. Most of these methods are limited to single-person pose recognition, with data collection and pose inference performed separately, and can only recognize a few predefined poses, failing to achieve real-time arbitrary pose recognition for multiple persons. Ren et al. [33] proposed Winect, a skeleton-based human pose tracking system that does not rely on predefined activities and can simultaneously track free-form movements of multiple limbs in real time. However, it only generates 3D human skeletons without estimating body shape, limiting its value in game modeling and other fields.
In summary, while there have been research achievements in 3D human pose generation using WiFi devices and neural networks, most current results are limited to non-real-time single-person 3D skeleton generation, with no real-time multi-person 3D contour generation. Therefore, WiFi and neural network-based 3D human contour generation systems have significant development space and application potential.
3. Fundamental Architecture of the Pose Recognition System
This section introduces the fundamental architecture of a pose recognition system. We will then provide a detailed analysis of the function and composition of each component, aiming to clarify the overall system structure and functionalities.
3.1. Data Collection
3.1.1. WiFi Signal Collection
A device (e.g. a laptop) equipped with an Intel 5300 network card, three antennas, and running Ubuntu serves as the core receiver. Professional tools are used to analyze the site’s electromagnetic environment, select network card frequency bands and channels, and deploy routers at key positions to build a WiFi network. During experiments, volunteers perform diverse activities in simulated scenarios (e.g. offices, classrooms, corridors), including standing, walking, sitting, and waving. The network card collects CSI data in real time at high frequency and stores it.
When humans move within the coverage area of WiFi transmitters and receivers, their bodies affect wireless signal propagation, causing changes in CSI measurements that contain key information about human poses, providing a basis for subsequent pose and contour analysis based on CSI data.
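For illustration, the following minimal sketch converts raw complex CSI samples into the amplitude and phase matrices used downstream. The file name and array layout are assumptions, not a tool-specific format; the Intel 5300 reports 30 subcarrier groups on each of its three receive antennas.

```python
import numpy as np

# Assumed layout: complex CSI saved as (n_packets, n_subcarriers, n_rx_antennas),
# e.g. (N, 30, 3) for an Intel 5300 receiver with three antennas.
csi = np.load("csi_capture.npy")   # hypothetical file written by the collection step

amplitude = np.abs(csi)            # per-subcarrier amplitude |H|
phase = np.angle(csi)              # raw phase: still wrapped and offset-corrupted
```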
3.1.2. Camera Data Co-Collection
A Kinect 2.0 camera is introduced, with installation parameters determined based on optical imaging principles and scene geometry to achieve wide coverage of the experimental scene. High-frame-rate, high-resolution videos are recorded to accurately capture human appearance, limb dynamics, and other information. Video data is encoded, labeled, and stored in millisecond-level synchronization with CSI data using high-precision timestamps to ensure accurate correspondence. Stored data is filtered and optimized via image quality assessment algorithms to remove low-quality frames (e.g. blurry or noisy frames), improving data usability.
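As a sketch of the timestamp-pairing step, the following matches each video frame to the nearest-in-time CSI packet; the millisecond tolerance and the assumption of a shared clock are illustrative, not taken from a specific system.

```python
import numpy as np

def align_frames(csi_ts, video_ts, tol_ms=5.0):
    """Pair each video frame with its nearest-in-time CSI packet.

    csi_ts and video_ts are sorted 1-D timestamp arrays in milliseconds,
    assumed to share a common clock. Returns (video indices, CSI indices)
    for pairs within the tolerance.
    """
    idx = np.searchsorted(csi_ts, video_ts)
    idx = np.clip(idx, 1, len(csi_ts) - 1)
    left, right = csi_ts[idx - 1], csi_ts[idx]
    nearest = np.where(video_ts - left < right - video_ts, idx - 1, idx)
    ok = np.abs(csi_ts[nearest] - video_ts) <= tol_ms
    return np.flatnonzero(ok), nearest[ok]
```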
3.2. Data Preprocessing
3.2.1. WiFi Signal Processing
For CSI phase correction and noise processing, filtering algorithms are studied to remove noise, and phase compensation techniques are used for correction. The applicability of phase compensation algorithms in complex indoor environments is analyzed, and a filter is designed based on the channel model

$$Y = HX + N$$

(where $Y$ is the received signal, $H$ is the channel frequency response, $X$ is the transmitted signal, and $N$ is noise) to remove noise and improve signal quality.
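As a concrete example, the following minimal sketch implements the unwrap-plus-linear-transformation idea used by PhaseFi [25] on a single packet's per-subcarrier phase; the array shape and names are illustrative.

```python
import numpy as np

def sanitize_phase(raw_phase):
    """Per-packet CSI phase sanitization in the spirit of PhaseFi [25].

    raw_phase: measured phase across subcarriers, shape (n_subcarriers,).
    First undo 2*pi wrap-arounds, then subtract a linear term across the
    subcarrier index, which absorbs the timing-offset-induced slope and
    the constant phase offset.
    """
    n = raw_phase.shape[0]
    unwrapped = np.unwrap(raw_phase)                   # remove 2*pi jumps
    k = np.arange(n)
    slope = (unwrapped[-1] - unwrapped[0]) / (n - 1)   # linear phase-error slope
    offset = unwrapped.mean()                          # constant phase offset
    return unwrapped - slope * k - offset
```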
3.2.2. Data Fusion
Features from WiFi signal processing are fused with visual features from camera video analysis. Specific algorithms are used to correlate and integrate human position information at the feature level (e.g. a deep learning-based feature fusion network takes CSI [8] features from WiFi signals and visual features from videos as input, learns the correlation between them via the network, and fuses features from different modalities to construct a comprehensive dataset reflecting human states [30]). During fusion, dimensionality reduction or expansion is used to match feature dimensions, and attention mechanisms are applied to weight features based on their importance in describing human states, improving fused data quality.
Some studies have explored combining CSI with visual data, where visual data can generate accurate training labels for CSI-based models or provide supplementary information in hybrid systems fusing multi-sensor data. The fusion method enhances pose estimation accuracy and provides important references for achieving more precise real-time 3D pose generation for multiple persons [29].
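A minimal sketch of such a feature-level fusion module is shown below, assuming PyTorch; all feature dimensions and the gating design are illustrative rather than drawn from any cited system.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Feature-level fusion of CSI and visual embeddings.

    Both modalities are projected to a shared dimension, then a learned
    softmax gate weights each modality per sample before summation.
    """
    def __init__(self, csi_dim=270, vis_dim=512, fused_dim=256):
        super().__init__()
        self.proj_csi = nn.Linear(csi_dim, fused_dim)
        self.proj_vis = nn.Linear(vis_dim, fused_dim)
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, 2), nn.Softmax(dim=-1))

    def forward(self, csi_feat, vis_feat):
        c = self.proj_csi(csi_feat)                  # match feature dimensions
        v = self.proj_vis(vis_feat)
        w = self.gate(torch.cat([c, v], dim=-1))     # per-sample modality weights
        return w[..., 0:1] * c + w[..., 1:2] * v
```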
3.3. Algorithm Design and Selection
After studying geometric models and deep learning algorithms, a hybrid approach combining CNN, LSTM, and Transformer is chosen for pose recognition, as geometric models require precise prior knowledge and are limited in complex scenarios, while deep learning algorithms can automatically learn feature representations. Specifically, CNN first extracts spatial features from processed WiFi signal pose data to obtain preliminary representations; Transformer then uses self-attention mechanisms to further mine long-range dependencies between features and enhance feature expressiveness; finally, LSTM learns temporal variation features of poses to achieve accurate real-time pose recognition for multiple persons.
CNN’s powerful spatial feature extraction capability allows it to capture spatial information such as limb and torso positions from processed WiFi signal pose data via sliding convolution kernels, obtaining preliminary human pose representations. Figure 1 shows CNN achieving preliminary human pose representation.
Transformer’s self-attention mechanism effectively captures dependency relationships between features at different positions, deeply processing features extracted by CNN to mine potential correlations and enrich feature connotations.
LSTM excels at processing sequential data and learning temporal variation features of poses. Human movements are continuous processes with obvious time-series characteristics. LSTM uses gating mechanisms to effectively memorize past information and update memory states based on current inputs, accurately modeling temporal changes in human poses.
Figure 1. Preliminary human pose representation by CNN.
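To make the pipeline concrete, the following is a minimal PyTorch sketch of the CNN-Transformer-LSTM arrangement described above. The input layout (batch, time, subcarriers), every layer size, and the 17-joint output are assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class CsiPoseNet(nn.Module):
    """Sketch of the CNN -> Transformer -> LSTM pipeline described above."""
    def __init__(self, n_sub=270, d_model=128, n_joints=17):
        super().__init__()
        self.cnn = nn.Sequential(                    # spatial features per frame
            nn.Conv1d(n_sub, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)  # long-range deps
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)      # temporal dynamics
        self.head = nn.Linear(d_model, n_joints * 3)                 # 3D joint coords

    def forward(self, csi):                                   # csi: (B, T, n_sub)
        x = self.cnn(csi.transpose(1, 2)).transpose(1, 2)     # (B, T, d_model)
        x = self.transformer(x)                               # mine feature dependencies
        x, _ = self.lstm(x)                                   # model temporal variation
        return self.head(x)                                   # (B, T, n_joints * 3)
```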
3.4. Model Training and Optimization
3.4.1. Multimodal Data-Driven Model Training Process
Using preprocessed WiFi signals and fused multimodal data as input, a deep learning model is constructed based on selected algorithms. To enable the model to accurately learn human pose and position patterns, researchers collected a large amount of diverse labeled data covering different indoor scenarios (living rooms, offices) [30], various human actions (walking, jumping, sitting, standing), and different lighting conditions, personnel numbers, and interactions.
During training, CNN extracts spatial features (e.g. limb positions, movement amplitudes) via layer-wise convolution with sliding kernels, providing a basis for subsequent analysis. Transformer processes CNN-extracted features using self-attention to strengthen inter-feature connections. LSTM takes features processed by CNN and Transformer as input, using gating mechanisms to handle sequential features and learn temporal variation patterns of human poses, enabling recognition of complex action sequences.
Through repeated training on large labeled datasets, the model continuously optimizes parameters, gradually learning human pose and position patterns, and improving performance in multi-person real-time 3D pose recognition tasks, laying a foundation for intelligent and accurate 3D human pose recognition.
3.4.2. Model Optimization Strategies for Complex Scenarios
During model training, optimization is performed to enhance positioning and pose recognition performance in complex scenarios.
Network parameter optimization is critical. For CNN, convolution kernel size, quantity, and pooling methods matter: small kernels capture details, and large kernels extract global features. For pooling, max pooling retains salient features, while average pooling smooths extraction; alternating use is recommended. For Transformer, adjusting parameters like the number of multi-head attention heads and hidden layer dimensions optimizes long-range dependency capture. For LSTM, adjusting layer numbers, neuron connections, and gate weights enhances learning capabilities.
Optimization algorithms and hyperparameters are also important. Adam converges fast and stably, while SGD is simple but learning rate-dependent; choices should be based on data and training conditions. Learning rate decay strategies and regularization parameters (L1/L2) prevent overfitting.
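A short sketch of such a training configuration in PyTorch, combining Adam, L2 regularization via weight decay, step learning-rate decay, and gradient clipping; all values are illustrative, and `train_loader` is an assumed DataLoader yielding (CSI, joint-label) batches.

```python
import torch

model = CsiPoseNet()   # the architecture sketch from Section 3.3

# Adam with L2 regularization (weight_decay); values are illustrative.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Step decay: halve the learning rate every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
criterion = torch.nn.MSELoss()   # regression loss on 3D joint coordinates

num_epochs = 50
for epoch in range(num_epochs):
    for csi, joints in train_loader:   # assumed DataLoader of (CSI, label) batches
        optimizer.zero_grad()
        loss = criterion(model(csi), joints)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```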
Introducing attention mechanisms enhances model performance. In multi-person WiFi data processing, attention calculates feature weights to focus on key features and suppress interference. Adding attention modules to the model dynamically adjusts weights, enabling more accurate pose and position recognition and improving accuracy and efficiency. The attention mechanism principle is shown in Figure 2.
Figure 2. Schematic of the attention mechanism.
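The weighting principle in Figure 2 can be written in a few lines; this is the generic scaled dot-product form of attention, not a specific system's implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Each query scores all keys, softmax turns scores into weights, and
    values are averaged by those weights so informative features dominate
    while interference is suppressed."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # importance of each feature
    return weights @ v, weights
```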
This section has systematically described the methodology for human behavior recognition, including data collection, preprocessing, algorithm design, and model training. By integrating signal processing techniques (e.g. phase unwrapping, Wiener filtering) and a hybrid deep learning architecture (CNN-Transformer-LSTM), the proposed framework addresses critical challenges in existing WiFi-based pose estimation, such as noise-sensitive phase signals and limited spatiotemporal feature learning.
However, despite these advancements, the outlined methods face inherent limitations. Existing approaches often rely on simplified assumptions about signal propagation or human motion, which may not fully generalize to highly dynamic environments with strong multipath interference. Additionally, while the hybrid model enhances feature representation, the computational complexity poses challenges for real-time deployment on resource-constrained devices.
Signal processing and model training are invariably integral to pose recognition. Given that the suitability of specific methods is largely determined by the application, selecting and adapting the processing approach to address specific requirements is crucial for the development of real-world systems.
4. Typical Applications
4.1. Gesture Recognition
Gesture recognition, a core technology in human-computer interaction, holds significant value in smart devices, virtual reality, and other fields. Traditional methods relying on specialized sensors or cameras suffer from high deployment costs and line-of-sight limitations. In recent years, contactless gesture recognition based on WiFi signals has become a research hotspot due to its lack of need for extra hardware and wide coverage. Below, two representative studies are discussed, focusing on position-agnostic sensing and fine-grained finger gesture recognition.
Gao et al. [34] proposed a position-agnostic sensing technology to address the issue of signal features in traditional WiFi gesture recognition being affected by user position and orientation. They innovatively shifted the observation perspective from “transceiver view” to “hand view” and introduced Motion Navigation Primitive (MNP) as a core feature. MNP extracts position-agnostic invariant features by analyzing patterns of hand movement direction changes in gestures. Experiments show the system achieves an average recognition accuracy of over 92% for 10 gestures across different positions, orientations, and environments, significantly outperforming traditional methods and providing key support for universal interaction in complex scenarios.
Tan et al. [35] designed the WiFinger system, focusing on single-finger micro-action recognition using CSI from commercial WiFi devices to achieve fine-grained gesture sensing. To address environmental noise and individual differences, the system employs multipath mitigation and wavelet denoising to filter dynamic environmental interference and retain subtle signal changes caused by gestures. It uses Principal Component Identification to extract inherent gesture features and selects subcarriers sensitive to finger movements to handle inter-user differences effectively. In home and office environments, WiFinger achieves an average recognition accuracy of over 93% for 8 finger gestures (e.g. zoom, flip, slide) and maintains high robustness in Non-Line-of-Sight (NLOS) scenarios. This study breaks through traditional methods’ reliance on high sampling rates and specialized hardware, proving the feasibility of commercial WiFi devices in fine-grained gesture recognition.
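As an illustration of the wavelet-denoising step, the following is a minimal sketch using PyWavelets with a universal soft threshold; the wavelet choice and decomposition level are assumptions, not WiFinger's actual parameters.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Soft-threshold wavelet denoising of a 1-D CSI amplitude stream,
    in the spirit of WiFinger's preprocessing [35]."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise scale estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))   # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```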
In summary, WiFi-based gesture recognition has made significant progress in position agnosticism and fine-grained recognition through signal processing and feature extraction innovations, laying a foundation for practical deployment of interaction systems. Future research could further integrate multimodal fusion and deep learning to improve recognition accuracy and generalization in complex scenarios.
4.2. Vital Sign Monitoring
Monitoring vital signs during sleep is crucial for health assessment and disease diagnosis. Contactless sensing based on WiFi signals provides low-cost, non-invasive solutions for monitoring key indicators such as respiration and heart rate by mining fine-grained CSI features. Below, two key studies are discussed, moving from single-user fine-grained monitoring to multi-user collaborative sensing, to outline the technological advancements and application breakthroughs in this field.
Liu et al. [36] proposed the first sleep monitoring system based on commercial WiFi devices, using CSI sensitivity to human micro-movements to synchronously monitor respiratory rate, heart rate, and sleep posture. The system uses Hampel filters and moving average filtering to remove noise, selects sensitive channels via subcarrier variance, and extracts sinusoidal features of respiratory cycles and high-frequency components of heart rates via time-frequency analysis. Experiments show that in single-user scenarios, respiratory rate monitoring error is less than 0.5 breaths/minute, heart rate error is within ±3 beats/minute, and four sleep postures (supine, side-lying, etc.) are recognized with over 90% accuracy. Its core advantage lies in requiring no extra hardware, relying solely on existing WiFi devices, making it feasible for long-term home health management, especially for elderly users with low wearable device compliance. This study demonstrates the effectiveness of commercial WiFi devices in contactless heart rate monitoring, providing technical support for early screening of diseases like sleep apnea and heart rate abnormalities.
For monitoring in multi-person shared spaces (e.g. wards, dormitories), Zeng et al. [37] developed the MultiSense system, achieving multi-user respiration signal separation from a single pair of WiFi devices via Blind Source Separation (BSS) and Independent Component Analysis (ICA) for the first time. The system uses the linear mixing characteristics of multi-antenna received signals to build signal models, remove background noise and phase offsets, and distinguish different individuals’ respiratory rates via K-means clustering. Experimental results show 92.7% respiratory rate recognition accuracy in 3-person monitoring scenarios, with robustness to personnel position changes (e.g. within 50 cm movement). Compared to traditional spectral analysis methods, MultiSense breaks through the “blind spot” limitation, effectively monitoring even in weak signal reflection areas, and offering efficient solutions for group health monitoring in nursing homes, post-disaster rescue, and other scenarios. Technical comparisons and evolution are shown in Table 1.
Table 1. Technical comparison and evolution of vital sign monitoring.
| | Single-user Monitoring [36] | Multi-user Sensing [37] |
| --- | --- | --- |
| Core Method | Time-frequency analysis, subcarrier screening, peak detection | Blind Source Separation (BSS), Independent Component Analysis (ICA) |
| Monitoring Target | Respiratory rate, heart rate, sleep posture | Synchronous multi-person respiration monitoring, up to 4 people |
| Signal Processing | Band-pass filtering to separate respiration and heart rate signals | Multi-antenna linear mixing model; removes phase offset and background noise |
| Environmental Adaptability | Supports NLOS, 10 m range, multiple sleep postures | Robust to furniture movement and AC vibration noise; strong positional robustness |
| Accuracy | Respiratory rate error < 0.5 breaths/min; heart rate ± 3 beats/min | 3-person: 92.7% accuracy, mean absolute error 0.73 breaths/min |
| Hardware Dependence | Single pair of commercial WiFi devices (router + laptop) | Same; multiple antennas improve signal separation |
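To ground the single-user pipeline described above, the following is a simplified sketch of FFT-based respiration-rate estimation in the spirit of [36]; a median filter stands in for the Hampel filter, and the CSI sampling rate is an assumption.

```python
import numpy as np
from scipy.signal import medfilt

def respiration_rate(csi_amp, fs=50.0):
    """Estimate breaths/min from one subcarrier's amplitude stream.

    csi_amp: 1-D amplitude time series; fs: assumed sampling rate in Hz.
    """
    x = medfilt(csi_amp, kernel_size=5)         # suppress impulsive noise
    x = x - x.mean()                            # remove the DC component
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= 0.1) & (freqs <= 0.5)      # 6-30 breaths per minute
    peak = freqs[band][np.argmax(spec[band])]   # dominant respiration frequency
    return peak * 60.0
```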
4.3. Personnel Identification
In IoT and smart environments, personnel identification, as a core technology for security authentication and personalized services, faces challenges such as user compliance, privacy protection, and environmental adaptability. Traditional vision-based or biometric-based methods (e.g. facial recognition, fingerprint scanning) require active user cooperation and pose privacy risks, while wireless gait-based recognition, though passive, requires users to walk several meters along fixed paths, leading to high time costs and limited application scenarios. Addressing these issues, Wang et al. [38] proposed the WiPIN system, a new paradigm of passive, operation-free personnel identification. Its core lies in using the unique signal distortions caused by individual physiological characteristics (e.g. body shape, body fat rate, muscle distribution) when WiFi signals penetrate the human body for identity differentiation. Users only need to stand still for ~200 ms for identification, significantly improving user-friendliness and system practicality.
Experimental validation shows WiPIN achieves 92% recognition accuracy in 30-person tests, maintaining stable performance as the user scale increases (2 - 30 people). Its robustness is demonstrated by maintaining >90% accuracy over 15 days via periodic training-data updates, achieving 94% intra-category and a minimum of 77% cross-category recognition accuracy. This study not only expands the application boundaries of WiFi signals in passive perception, but also proves the feasibility and superiority of wireless signal-human feature integration for recognition technology in real-world deployment.
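Table 2 below lists SVM classification as WiPIN's final stage; the following is a hedged scikit-learn sketch of that stage on synthetic stand-in features (nothing here reproduces the actual features or data of [38]).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative only: X holds per-sample CSI feature vectors (e.g. amplitude
# statistics per subcarrier), y holds person IDs; both are synthetic here.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 90))        # 300 samples, 90-dim features (assumed)
y = rng.integers(0, 30, size=300)     # 30 enrolled persons

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X, y)
pred = clf.predict(X[:5])             # identity prediction for new samples
```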
4.4. Summary
A summary of the above application scenarios is shown in Table 2. WiFi-based contactless sensing technology has demonstrated groundbreaking application value across multiple fields by innovating signal processing and feature extraction and building low-cost, high-robustness intelligent sensing systems. This technology system, centered on “contactless sensing, multi-scenario adaptation, and efficient recognition”, lays a technical foundation for smart interaction, medical monitoring, intelligent security, and other fields. Future research could integrate multimodal fusion and deep learning to further enhance generalization in complex environments and promote large-scale application.
Table 2. Summary of typical application scenarios.
| | Core Objective | Key Technology | Performance Index | Typical Scenarios |
| --- | --- | --- | --- | --- |
| Gesture Recognition | Recognize digital gestures (0 - 9) and finger motions (zoom, slide) | MNP, multipath signal processing, subcarrier screening | Digital gesture accuracy >92%; finger motion accuracy >93% | Smart device interaction, VR, smart home |
| Vital Sign Monitoring | Single-user: sleep respiration/heart rate/posture; multi-user: synchronous respiration monitoring | Time-frequency analysis, BSS, ICA | Single-user: respiration error <0.5 breaths/min, heart rate ±3 beats/min, posture accuracy >90%; multi-user (3 people): respiration accuracy 92.7% | Home health management, nursing homes, hospital wards |
| Passive Personnel Identification | Operation-free identity authentication (physiological-feature signal distortion) | Signal filtering, feature extraction, SVM classification | 30-person accuracy 92%; single recognition <300 ms | Security authentication, smart space services |
5. Existing Challenges
5.1. Complex Environment Effect
In complex indoor environments, WiFi signals are susceptible to multipath effects, electromagnetic interference, and other factors, causing increased CSI data fluctuations and noise. Despite using various filtering and phase correction algorithms, signal processing remains suboptimal in scenarios with strong interference sources or signal blockages. Achieving precise phase correction and deep noise removal remains challenging, affecting human feature extraction accuracy and limiting 3D pose recognition precision.
5.2. Challenges in Multimodal Data Fusion
Fusing WiFi signal features with camera video features is complicated by differences in data characteristics, dimensions, and sampling frequencies, making correlation and integration difficult. Millisecond-level errors in data synchronization and inconsistent feature representations can lead to redundant fused data or loss of critical information, hindering comprehensive and accurate reflection of human states and reducing model learning efficiency and recognition performance.
5.3. Optimization Bottlenecks in Model Training
During model training, hyperparameter tuning for deep learning architectures is complex and time-consuming. Adjusting parameters like CNN convolution kernels, LSTM network structures, and attention mechanism scopes requires extensive experimental exploration. Optimization algorithms like Adam and SGD, affected by data distribution and gradient vanishing, may get trapped in local optima for specific datasets/tasks, slowing model convergence, weakening generalization, and impeding accuracy improvements for multi-person complex scenarios.
5.4. Limitations of WiFi-Based Sensing
WiFi sensing, despite its privacy and cost benefits, faces limitations. First, signal reliability is challenged by multipath interference in complex environments, impacting feature extraction accuracy, especially for fine details and rapid movements. Second, the limited spatial and temporal resolution that comes with lower carrier frequencies hinders capturing high-fidelity human contours and subtle motion, and signal superposition in multi-person scenarios further complicates individual separation. Third, environmental adaptability is bounded by penetration limits and signal attenuation, and environmental changes necessitate recalibration. Finally, WiFi lacks the high-fidelity visual features of vision-based systems and the precision of specialized hardware. Multimodal fusion is therefore needed, leveraging WiFi's privacy benefits while integrating visual or radar data to compensate for these limitations and balance practicality and performance.
6. Future Development Suggestions
6.1. Innovations in Signal Processing Technology
Future developments should focus on signal processing technologies.
Develop new adaptive algorithms: deep learning-driven signal processing models can intelligently capture signal patterns as the environment changes, strengthening filtering and denoising under multipath interference and enabling precise phase offset correction to stabilize human feature extraction. Such algorithms enhance anti-interference capabilities, ensure feature extraction accuracy, and improve overall human action recognition performance in complex environments.
6.2. Efficient Multimodal Application Strategies
Develop intelligent fusion algorithms to automatically learn feature mapping and weight allocation via deep learning, enabling seamless fusion of WiFi and video data. Utilize Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) to mine latent data correlations, align spatiotemporal features, unify feature representation formats, improve fusion efficiency and data quality, provide accurate human state information for model input, and enhance recognition performance.
6.3. New Approaches to Model Training Optimization
Introduce Neural Architecture Search (NAS) technology to automatically design optimal network architectures, allocating computing resources, determining layer connections, and configuring convolution kernels based on task requirements. Combine adaptive learning rate adjustment strategies, gradient clipping, and novel regularization methods to mitigate gradient issues and accelerate convergence to global optima. Apply few-shot learning and transfer learning to leverage pre-trained model knowledge, reduce data dependency, and rapidly optimize models for improved action recognition accuracy and efficiency in complex scenarios.
7. Conclusion
This paper surveyed real-time multi-person pose recognition based on WiFi signals and deep learning neural networks using ubiquitous WiFi devices. Such systems can recognize human poses in complex environments, enabling precise analysis of human movements. In the surveyed work, large volumes of WiFi signal data are collected, preprocessed, and feature-extracted with deep learning neural networks. Extensive experimental validation across the literature shows that deep learning neural networks perform well in recognizing human poses, effectively improving the accuracy and efficiency of human action recognition in complex environments. This technology creates more convenient and intelligent experiences for fields such as intelligent environmental perception, virtual reality, and human-computer interaction.
Acknowledgements
This work is funded by the Innovation and Entrepreneurship Training Program for College Students (202410424048).