Scalable Swarm Robotics in GPS-Denied Environments
Mihir Parekh, Advisor: Kshitij Jerath
Overview
Swarm algorithms have seen significant development in simulation environments; however, real-world validation is essential for verifying their robustness, scalability, and performance under uncertainty. Simulation often abstracts away critical real-world constraints, while experimental testbeds bridge the gap, providing feedback on swarm theory and design. We propose a testbed built on Anki Vectors, commercially available differential-drive mobile robots equipped with an HD camera and their own processor, enabling onboard compute. Our objective is to deliver a scalable swarm testbed that can operate in GPS-denied environments.
Approach
Emergent behavior refers to the collective behaviors that arise in multi-agent systems, such as bird flocking or fish schooling. What makes these patterns remarkable is that they emerge from simple agents operating with minimal capabilities.
The Anki Vector, out of the box, is an AI-powered companion robot that can serve as a single agent within a swarm. To adapt it for our swarm application, we engineer the following local abilities:
- Online Robot Detection
- Inter-Robot Distance Estimation
- Target Following
- Real Time Pose Estimation
Online Robot Detection
The first objective was to teach a Vector robot to recognize another Vector. To achieve this, we designed and implemented a computer vision pipeline capable of detecting and tracking nearby Vectors. This process began with manually collecting 300 images from the robot’s onboard camera, augmenting the dataset with noise, and labeling each image. We then trained a lightweight YOLO model using transfer learning, leveraging pre-trained weights to accelerate development and improve accuracy.
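As an illustration of this step, the sketch below fine-tunes a lightweight YOLO model with the Ultralytics library. The model variant, dataset config name, and hyperparameters are assumptions for illustration, not our exact training configuration.

```python
from ultralytics import YOLO

# Start from a small pre-trained checkpoint (transfer learning).
model = YOLO("yolov8n.pt")  # assumed lightweight variant

# Fine-tune on the hand-labeled Vector images.
# "vector_dataset.yaml" is a hypothetical dataset config listing the
# train/val image folders and the single "vector" class.
model.train(
    data="vector_dataset.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
)

# Run inference on a frame from the onboard camera.
results = model.predict("frame.jpg", conf=0.5)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box corners in pixels
    print(f"Detected Vector at ({x1:.0f}, {y1:.0f}) - ({x2:.0f}, {y2:.0f})")
```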




The model generalizes well, achieving near-perfect precision (0.999), recall (1.0), and mAP@0.5 (0.995). Validation box loss falls to 0.277, indicating effective learning without overfitting and demonstrating robustness for our lightweight application.
View Project Here
Inter-Robot Distance Estimation
After detecting a Vector in the frame using our computer vision model, the next step is to estimate its distance. YOLO returns a bounding box around the detected Vector. Using the pinhole camera model, we calculate the camera’s focal length, which indicates how much the image is scaled. With this focal length, the bounding box height, and the known real-world height of the Vector, we can estimate the distance, as sketched after the definitions below. We use height rather than width because the width varies significantly with the Vector’s orientation.
- h_img = height of the image (pixels)
- h_obj = real-world height of the object
- h_bb = height of the bounding box (pixels)
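A minimal sketch of the distance estimate is shown below, assuming the focal length is expressed in pixels and derived from the image height and the camera’s vertical field of view; the FOV, image height, and Vector height values are illustrative placeholders, not calibrated numbers.

```python
import math

def focal_length_px(h_img: float, vertical_fov_deg: float) -> float:
    """Pinhole model: focal length in pixels from image height and vertical FOV."""
    return (h_img / 2.0) / math.tan(math.radians(vertical_fov_deg) / 2.0)

def estimate_distance_mm(h_bb: float, h_obj_mm: float, f_px: float) -> float:
    """Similar triangles: d = f * h_obj / h_bb."""
    return f_px * h_obj_mm / h_bb

# Illustrative numbers only -- not calibrated values for the Vector camera.
H_IMG = 360.0      # image height in pixels (example value)
V_FOV_DEG = 45.0   # assumed vertical field of view
H_OBJ_MM = 70.0    # assumed real-world height of a Vector robot

f = focal_length_px(H_IMG, V_FOV_DEG)
print(estimate_distance_mm(h_bb=50.0, h_obj_mm=H_OBJ_MM, f_px=f))
```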

Data was collected by placing a stationary Vector robot at increasing distances from the observing Vector’s camera in 10 mm increments. The model’s distance estimates were compared against tape measurements for front and side orientations.
Overall, the model’s distance estimates were reasonably accurate but tended to overestimate distances at close range, likely due to the lack of camera calibration. Applying intrinsic camera calibration is expected to improve close-range accuracy.
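If we pursue intrinsic calibration, a standard OpenCV checkerboard routine would look roughly like the sketch below; the board dimensions, square size, and image folder are placeholders.

```python
import glob
import cv2
import numpy as np

# Checkerboard with 9x6 inner corners and 20 mm squares (placeholder values).
PATTERN = (9, 6)
SQUARE_MM = 20.0

# 3D corner positions in the board frame (z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.jpg"):  # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# The camera matrix holds the focal lengths (fx, fy) used for distance estimation.
ret, camera_matrix, dist_coeffs, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None
)
print("fx, fy:", camera_matrix[0, 0], camera_matrix[1, 1])
```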
Target Following
Within an agent's local neighborhood, behavior is governed by three primary objectives: maintaining personal space from nearby agents, aligning with others at an optimal distance, and seeking companionship when isolated. In the case of isolation, target following allows the agent to pursue a detected target to rejoin the swarm.
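The sketch below illustrates, in simplified 2D form, how these three local rules might be combined into a single velocity command; the weights and distance thresholds are illustrative assumptions rather than our deployed behavior.

```python
import numpy as np

def local_rule_velocity(pos, neighbors, personal_space=100.0, optimal_dist=250.0):
    """Combine separation and optimal-distance alignment into one 2D command.

    pos: (2,) position of the agent; neighbors: list of (2,) neighbor positions.
    Distances are in mm; weights and thresholds are illustrative, not tuned values.
    """
    if not neighbors:
        return np.zeros(2)  # isolated: target following takes over instead
    separation = np.zeros(2)
    cohesion = np.zeros(2)
    for n in neighbors:
        offset = np.asarray(n, dtype=float) - np.asarray(pos, dtype=float)
        dist = np.linalg.norm(offset)
        if dist < personal_space:          # too close: move away
            separation -= offset / (dist + 1e-6)
        elif dist > optimal_dist:          # too far: move toward the neighbor
            cohesion += offset / dist
    return 1.5 * separation + 0.5 * cohesion
```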
Initially, we implemented a PID controller, tuning its gains through trial and error to minimize oscillations and achieve smooth convergence toward the target. In our setup, the target is a detected Vector robot within the camera frame, and the PID controller iteratively adjusts the trajectory to reach it. As an alternative, we designed a kinematic state feedback controller based on the standard motion model of a differential-drive robot, applying a control law that directly regulates the robot’s position and orientation.
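For reference, a minimal sketch of a kinematic state feedback law for a differential-drive robot is shown below. This is the textbook go-to-pose formulation in polar coordinates; the gains are illustrative assumptions, not necessarily the values used on the Vector.

```python
import math

def kinematic_feedback(x, y, theta, x_g, y_g, theta_g,
                       k_rho=0.4, k_alpha=1.2, k_beta=-0.3):
    """Differential-drive go-to-pose law in polar coordinates.

    Returns linear velocity v and angular velocity w driving (x, y, theta)
    toward the goal pose (x_g, y_g, theta_g). Gains are illustrative; the
    standard stability conditions are k_rho > 0, k_beta < 0, k_alpha > k_rho.
    """
    dx, dy = x_g - x, y_g - y
    rho = math.hypot(dx, dy)                              # distance to goal
    alpha = math.atan2(dy, dx) - theta                    # heading error to goal
    alpha = math.atan2(math.sin(alpha), math.cos(alpha))  # wrap to [-pi, pi]
    beta = theta_g - theta - alpha                        # remaining orientation error
    v = k_rho * rho
    w = k_alpha * alpha + k_beta * beta
    return v, w
```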
Comparative Evaluation of Follow Methods
To determine the best approach for Vector target following, we tested two scenarios: following a stationary target and following a moving target, as demonstrated in the videos above. Each scenario was evaluated by running five iterations for both the PID and kinematic controllers to ensure comprehensive coverage and reliable results.
Stationary Target Test


Moving Target Test


The figures compare two target-following approaches implemented on the Vector. The PID controller reduces oscillations over time, while the kinematic state feedback controller produces sharper angular adjustments but reaches the target faster. These results highlight the trade-off between stability and convergence speed.
Beyond these differences, the kinematic controller’s use of the robot’s motion model allows for faster, more reliable convergence without the need for manual tuning. In contrast, the PID controller requires careful gain adjustment and can behave unpredictably in dynamic or complex environments. This makes the kinematic approach particularly suitable for swarm robotics, where consistent performance and scalability across many agents are crucial.
View Project Here
Real-Time Pose Estimation
Knowing a neighbor's orientation is crucial for developing a swarm, as an agent will align itself with a neighbor if that neighbor lies within an optimal distance threshold. Individual pose estimation is essential for enabling coordinated collective behavior. We implemented this using OpenCV’s ArUco markers, homogeneous frame transformations, and an Extended Kalman Filter for smooth and accurate real-time pose estimation. An initial step was to develop a centralized perception system that would maintain all active poses of deployed Vectors.
ArUco Marker Setup
Our starting test environment consists of a 400 x 400 millimeter square space, with eight markers placed along the edges. A picture can be seen below:

Fixed Landmark Localization
After calibrating the camera, we leveraged these markers as fixed reference points to estimate the robot’s pose within a global frame. We define the global frame as a 2D coordinate system (x, y). When the Vector’s camera captures a raw image, it searches for visible markers. If a marker is detected, we retrieve its pose relative to the camera frame. Referring to the transformation chain below, our end objective is to obtain the camera’s pose in the global frame.
$$ {}^{\text{Global}}\mathbf{T}_{\text{Camera}} = {}^{\text{Global}}\mathbf{T}_{\text{Marker}} \cdot \left({}^{\text{Camera}}\mathbf{T}_{\text{Marker}}\right)^{-1} $$
$$ \Downarrow $$
$$ {}^{\text{Global}}\mathbf{T}_{\text{Camera}} = {}^{\text{Global}}\mathbf{T}_{\text{Marker}} \cdot {}^{\text{Marker}}\mathbf{T}_{\text{Camera}} $$
This transformation chain works because the marker frames cancel out in the multiplication. Here, each T represents a 4 x 4 homogeneous transformation matrix composed of a rotation matrix and a translation vector. Once the chain is applied, we obtain a relatively accurate estimate within our small test area. However, these marker-based “GPS-like” readings can be noisy and unreliable at times: pose ambiguity can lead to mirrored or incorrect readings, extreme viewing angles cause poor detections, and lighting conditions further affect accuracy. The biggest challenge is that markers are not always within the camera’s field of view, limiting continuous pose tracking.
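A sketch of this chain using OpenCV’s ArUco module and NumPy is shown below; the dictionary choice, marker size, and the classic detectMarkers API (pre-OpenCV 4.7) are assumptions for illustration.

```python
import cv2
import numpy as np

MARKER_SIDE_MM = 50.0  # assumed physical marker size

def pose_to_T(rvec, tvec):
    """Build a 4x4 homogeneous transform from a Rodrigues rotation and a translation."""
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = np.asarray(tvec).ravel()
    return T

def camera_in_global(gray, camera_matrix, dist_coeffs, T_global_marker, marker_id):
    """Estimate Global_T_Camera from one detected marker with a known global pose."""
    aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None or marker_id not in ids:
        return None
    img_pts = corners[list(ids.ravel()).index(marker_id)].reshape(-1, 2)
    half = MARKER_SIDE_MM / 2.0
    # Marker corners in the marker frame, matching detectMarkers corner order.
    obj_pts = np.array([[-half,  half, 0], [ half,  half, 0],
                        [ half, -half, 0], [-half, -half, 0]], dtype=np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, camera_matrix, dist_coeffs)
    if not ok:
        return None
    T_camera_marker = pose_to_T(rvec, tvec)
    # Global_T_Camera = Global_T_Marker * (Camera_T_Marker)^-1
    return T_global_marker @ np.linalg.inv(T_camera_marker)
```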
Sensor Fusion
As discussed previously, while our marker-based localization provides valuable pose estimates, it is inherently noisy and unreliable when markers are not in view. To mitigate these limitations, we implemented a simple odometry-based motion model that performs dead reckoning using wheel speed data.
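A minimal dead-reckoning update from wheel speeds might look like the following; the track width is a placeholder, not the Vector’s measured wheel separation.

```python
import math

TRACK_WIDTH_MM = 48.0  # placeholder wheel separation, not a measured value

def dead_reckon(x, y, theta, v_left, v_right, dt):
    """Propagate a differential-drive pose from wheel speeds over one time step."""
    v = (v_right + v_left) / 2.0             # forward speed (mm/s)
    w = (v_right - v_left) / TRACK_WIDTH_MM  # yaw rate (rad/s)
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += w * dt
    return x, y, theta
```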
However, odometry alone is susceptible to errors such as wheel slippage and accumulates drift over time, making it unsuitable for long-term state estimation. In contrast, camera-based localization is drift-free but can suffer from intermittent inaccuracies and occlusions. To leverage the complementary strengths of both systems, we integrated an Extended Kalman Filter (EKF). The EKF is particularly well-suited for estimating the state of systems with nonlinear dynamics in real time. In our approach, the prediction step of the EKF uses odometry to estimate the next state based on the previous one, while the update step incorporates camera-based observations to correct and refine the prediction. This fusion allows us to maintain a more accurate and robust estimate of the robot’s pose over time.
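The sketch below outlines the predict/update structure described above for a planar pose state [x, y, theta]; the noise covariances are placeholders, not tuned values.

```python
import numpy as np

class PoseEKF:
    """EKF over state [x, y, theta]: odometry predicts, marker poses correct."""

    def __init__(self):
        self.x = np.zeros(3)                  # state estimate
        self.P = np.eye(3) * 1e-3             # state covariance
        self.Q = np.diag([2.0, 2.0, 0.01])    # process noise (placeholder)
        self.R = np.diag([5.0, 5.0, 0.05])    # measurement noise (placeholder)

    def predict(self, v, w, dt):
        """Propagate the state with the differential-drive motion model."""
        x, y, th = self.x
        self.x = np.array([x + v * dt * np.cos(th),
                           y + v * dt * np.sin(th),
                           th + w * dt])
        # Jacobian of the motion model with respect to the state.
        F = np.array([[1, 0, -v * dt * np.sin(th)],
                      [0, 1,  v * dt * np.cos(th)],
                      [0, 0, 1]])
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z):
        """Correct with a marker-based pose measurement z = [x, y, theta]."""
        H = np.eye(3)                         # markers observe the full pose directly
        y_res = z - H @ self.x
        y_res[2] = np.arctan2(np.sin(y_res[2]), np.cos(y_res[2]))  # wrap angle error
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y_res
        self.P = (np.eye(3) - K @ H) @ self.P
```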
Results
The plots below highlight the sensor fusion performed by the EKF in the video demo, showcasing how the EKF selectively trusts either the camera-based observations (in orange) or the odometry data (in blue) at different times, and illustrating how it performs iterative filtering in real time.


Limitations & Challenges
Sequential development of individual tools has helped us progressively transform the off-the-shelf Vector robot into a more intelligent and capable agent. However, we are still far from achieving our goal of building a fully functional swarm testbed. A key limitation is the lack of a reliable method for determining a neighbor’s orientation. Our current approach uses a centralized perception system, where each robot’s pose is published in real time to a central server. While this works well within a small 400 x 400 mm operating space, it does not scale effectively to larger environments. When expanding to areas such as a 2 x 2 meter workspace, several challenges emerge: marker visibility decreases significantly, leading to increased pose ambiguity at greater distances, along with more general sensor noise. In regions where markers become unreliable or completely occluded, the system must fall back on dead reckoning, resulting in increased drift and a higher likelihood of inter-agent collisions.
Future Work
To address the challenges in real-time pose estimation, we are currently implementing visual odometry as a supplementary input. This additional data stream will be fused within our Extended Kalman Filter (EKF) to enhance overall robustness and accuracy. In parallel, we are developing open-source software to support scalable Vector swarm deployments, enabling broader accessibility and collaboration.