What is claimed is:
1. A method of resilient human-on-the-loop range-only cooperative positioning of a plurality of unmanned aerial vehicles (UAVs), comprising:
determining an initial distribution, an initial policy, and a pre-defined tolerance value by a processor;
computing an initial exploitability using the initial distribution and the initial policy;
for an i-th iteration, for each time step, performing a forward updating of a distribution of a portion of the plurality of UAVs by computing a distribution of the portion of the plurality of UAVs at a k-th time step using a distribution of the portion of the plurality of UAVs at a (k−1)-th time step under a policy at the (k−1)-th time step; and performing a backward updating of a Q function of each UAV of the plurality of UAVs by computing a Q function value of each UAV of the plurality of UAVs at the k-th time step using a Q function value of each UAV of the plurality of UAVs at a (k+1)-th time step;
for each time step, calculating a dual variable at an (i+1)-th iteration using a dual variable at the i-th iteration and the computed Q function value of each UAV of the plurality of UAVs at the i-th iteration; and calculating a policy at the (i+1)-th iteration using the calculated dual variable at the (i+1)-th iteration;
computing an exploitability at the (i+1)-th iteration and a ratio of the exploitability at the (i+1)-th iteration over the initial exploitability; and
when the ratio is less than or equal to the pre-defined tolerance value, controlling movement of each UAV of the plurality of UAVs by maintaining the policy at the i-th iteration; and when the ratio is greater than the pre-defined tolerance value, controlling the movement of each UAV of the plurality of UAVs using the policy at the (i+1)-th iteration.
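For illustration only, and not as part of the claim, the forward and backward updates recited in claim 1 can be sketched in Python for a small finite model. The transition table `P`, the `reward` callable, the terminal condition at the last time step, the use of the next-step policy inside the backward recursion, and both function names are assumptions introduced for this sketch; the claim does not specify them.

```python
def forward_distribution(mu0, policy, P):
    """Forward pass: the distribution at step k is computed from the
    distribution and the policy at step k-1.

    mu0[s]          initial probability of state s (assumed finite state set)
    policy[k][s][a] probability of action a in state s at step k
    P[s][a][s2]     assumed transition probability from s to s2 under a
    """
    n_states = len(mu0)
    n_actions = len(P[0])
    mu = [list(mu0)]
    for k in range(1, len(policy)):
        prev = mu[k - 1]
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                weight = prev[s] * policy[k - 1][s][a]
                for s2 in range(n_states):
                    nxt[s2] += weight * P[s][a][s2]
        mu.append(nxt)
    return mu


def backward_q(policy, mu, P, reward):
    """Backward pass: Q at step k is computed from Q at step k+1.

    reward(s, a, mu_k) is a hypothetical per-step reward that may depend
    on the population distribution mu_k at that step.
    """
    n_states = len(mu[0])
    n_actions = len(P[0])
    last = len(policy) - 1
    Q = [None] * (last + 1)
    # Assumed terminal condition: only the immediate reward remains.
    Q[last] = [[reward(s, a, mu[last]) for a in range(n_actions)]
               for s in range(n_states)]
    for k in range(last - 1, -1, -1):
        # Expected value of each next state under the step-(k+1) policy.
        v_next = [sum(policy[k + 1][s2][a2] * Q[k + 1][s2][a2]
                      for a2 in range(n_actions))
                  for s2 in range(n_states)]
        Q[k] = [[reward(s, a, mu[k])
                 + sum(P[s][a][s2] * v_next[s2] for s2 in range(n_states))
                 for a in range(n_actions)]
                for s in range(n_states)]
    return Q
```

In this sketch the distribution is propagated once from the first time step to the last, and the Q values are then filled in from the last time step back to the first, which mirrors the order of the two updates in the claim.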
2. The method according to claim 1, wherein the dual variable at the (i+1)-th iteration is calculated by:
$$y_k^{i+1}(\underline{s}, \mu) = y_k^{i}(\underline{s}, \mu) + \alpha\, Q^{\pi_k^{i}}(\underline{s}, \mu)$$
wherein y denotes the dual variable, s denotes a UAV failure probability, μ denotes a recovery rate, α denotes a step size, Q denotes a Q-function, and π denotes a policy.
3. The method according to claim 2, wherein the policy at the (i+1)-th iteration is calculated by:
$$\pi_k^{i+1}(\cdot \mid \underline{s}) = \Gamma\left(y_k^{i+1}(\underline{s}, \mu)\right)$$
wherein Γ denotes a function that maps the dual variable at the (i+1)-th iteration to the policy at the (i+1)-th iteration.
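For illustration only, the dual-variable update of claim 2 and the mapping Γ of claim 3 can be sketched as follows. The claims leave Γ unspecified; a softmax is assumed here purely as one common choice of such a mapping. The table layout `y[s][c]`, where `c` indexes the second argument of the dual variable, the `temperature` parameter, and both function names are likewise assumptions of this sketch.

```python
import math


def update_dual(y, q, alpha):
    """Additive update at one time step: y_next = y + alpha * Q.

    y[s][c] and q[s][c] are tables over a state index s and a second
    index c (assumed layout); alpha is the step size.
    """
    return [[y_sc + alpha * q_sc for y_sc, q_sc in zip(y_s, q_s)]
            for y_s, q_s in zip(y, q)]


def softmax_policy(y, temperature=1.0):
    """Assumed form of Gamma: a softmax over the second index per state."""
    policy = []
    for y_s in y:
        top = max(y_s)  # subtract the maximum for numerical stability
        weights = [math.exp((v - top) / temperature) for v in y_s]
        total = sum(weights)
        policy.append([w / total for w in weights])
    return policy
```

With a softmax, entries that have accumulated larger Q values receive more probability mass, and every row of the resulting policy sums to one.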
4. The method according to claim 1, further including:
inputting a plurality of discretized states and a plurality of discretized actions into a mean field game (MFG) model.
5. The method according to claim 4, wherein:
a ratio of an exploitability at the (i+1)-th iteration to the initial exploitability is configured to determine early stopping of the mean field game model.
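For illustration only, the ratio-based early stopping can be sketched as a loop around a per-iteration update. The `step` callable, which is assumed to return a candidate policy for iteration i+1 together with its exploitability, as well as the `max_iterations` cap and the function name, are hypothetical and are not recited in the claims.

```python
def run_until_tolerance(step, policy0, initial_exploitability, tolerance,
                        max_iterations=100):
    """Iterate until exploitability / initial exploitability <= tolerance.

    step(policy, i) is a hypothetical callable returning
    (candidate_policy, candidate_exploitability) for iteration i+1.
    """
    if initial_exploitability <= 0:
        raise ValueError("initial exploitability must be positive")
    policy = policy0
    ratio = 1.0
    for i in range(max_iterations):
        candidate, exploitability = step(policy, i)
        ratio = exploitability / initial_exploitability
        if ratio <= tolerance:
            # Tolerance met: keep the policy of the i-th iteration.
            return policy, i, ratio
        # Tolerance not met: adopt the policy of the (i+1)-th iteration.
        policy = candidate
    return policy, max_iterations, ratio
```

Normalizing by the initial exploitability makes the tolerance a relative quantity, so the same threshold can be reused across problems whose exploitability values differ in scale.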
6. A system, comprising:
a memory, configured to store program instructions for performing a method of resilient human-on-the-loop range-only cooperative positioning of a plurality of unmanned aerial vehicles (UAVs); and a processor, coupled with the memory and, when executing the program instructions, configured for:
determining an initial distribution, an initial policy, and a pre-defined tolerance value by a processor;
computing an initial exploitability using the initial distribution and the initial policy;
for an i-th iteration, for each time step, performing a forward updating of a distribution of a portion of the plurality of UAVs by computing a distribution of the portion of the plurality of UAVs at a k-th time step using a distribution of the portion of the plurality of UAVs at a (k−1)-th time step under a policy at the (k−1)-th time step; and performing a backward updating of a Q function of each UAV of the plurality of UAVs by computing a Q function value of each UAV of the plurality of UAVs at the k-th time step using a Q function value of each UAV of the plurality of UAVs at a (k+1)-th time step;
for each time step, calculating a dual variable at an (i+1)-th iteration using a dual variable at the i-th iteration and the computed Q function value of each UAV of the plurality of UAVs at the i-th iteration; and calculating a policy at the (i+1)-th iteration using the calculated dual variable at the (i+1)-th iteration;
computing an exploitability at the (i+1)-th iteration and a ratio of the exploitability at the (i+1)-th iteration over the initial exploitability; and
when the ratio is less than or equal to the pre-defined tolerance value, controlling movement of each UAV of the plurality of UAVs by maintaining the policy at the i-th iteration; and when the ratio is greater than the pre-defined tolerance value, controlling the movement of each UAV of the plurality of UAVs using the policy at the (i+1)-th iteration.
7. The system according to claim 6, wherein the dual variable at the (i+1)-th iteration is calculated by:
y
k