Spatio-temporal trajectory privacy-protection algorithm based on amending prefix tree

doi:10.15406/iratj.2017.02.00018

eISSN: 2574-8092

International Robotics & Automation Journal

Research Article Volume 2 Issue 3


HE Ming,¹,²,⁴ LIU Fangxin,¹,⁴


ZHOU Huan,³ CHEN Qiuli,¹,²,⁴ MIAO Zhuang,¹ ZHOU Bo¹

¹College of Command Information System, PLA Science and Technology University, China
²The 61st Research Institute of PLA, China
³Institute of Vocational Education, Tongji University, China
⁴Nanjing University of Information Science and Technology, China

Correspondence: LIU Fangxin, College of Command Information System, PLA Science and Technology University, China

Received: March 27, 2017 | Published: April 28, 2017

Citation: Ming HE, Fangxin LIU, Huan ZHOU, Qiuli CHEN, Zhuang MIAO, et al. (2017) Spatio-Temporal Trajectory Privacy-Protection Algorithm Based on Amending Prefix Tree. Int Rob Auto J 2(3): 00018. DOI: 10.15406/iratj.2017.02.00018


Abstract

This paper aims to prevent the privacy leakage that results from the improper release of spatio-temporal trajectory data while balancing privacy protection against data utility. To this end, it builds a model of self-evolving attack patterns and proposes the APT-PP algorithm, which exploits the spatio-temporal correlation within trajectory data. The core ideas of APT-PP are as follows. First, trajectory points are classified into hold points and moving points, a trajectory chart is built, and the sensitivity of each point is calculated, which reduces the size of the data set and improves efficiency. Second, the chart is transformed into a prefix tree, which is used to evaluate how well the trajectories are protected and which also improves retrieval efficiency. Finally, the prefix tree is amended with a grafting operation that protects privacy while preserving data utility. A comparative experiment on a real data set assesses the performance of APT-PP. The results show that the algorithm protects users' privacy while still providing high-quality spatio-temporal trajectory data and a positive user experience.

Keywords: Trajectory; Privacy protection; Similarity; Sensitivity

Abbreviations

TS: Trajectory Sequences; TISS: Temporal Information Similarity Set; TD: Temporal Data

Introduction

With the spread of location-based services and devices with positioning functions, large volumes of location information and movement paths of moving objects are being collected. Analyzing and mining the internal relations and regularities of these data serves many applications.¹⁻² For example, mining the GPS records of city vehicles can help the government plan urban road traffic, and mining the locations of consumers in a business district can support decisions on site selection and advertisement placement. Although more and more location-related decisions benefit from the analysis and mining of spatio-temporal data, the release of such data inevitably poses privacy threats: the trajectories of moving objects contain a great deal of private information about the users involved, such as work habits, the nature of their work, work and home addresses, and financial status, and this hidden information can be revealed by trajectory pattern mining. Privacy-protection algorithms have been proposed to prevent an attacker from re-identifying users in a released spatio-temporal data set by combining background information about a user with the part of the data the attacker has obtained.

Trajectory privacy protection has become a focus of research in recent years. Using background information about a user together with partially stolen data, the attacker mines the movement patterns of moving objects in the released data set in order to infer which user each trajectory belongs to. Protecting user privacy means not only raising the privacy level of the trajectories but also preserving the availability of the released spatio-temporal data set, since the released data will still be used by location-dependent applications.³ Some progress has been made in trajectory privacy-protection techniques. Abul et al.⁴ observe that the positions reported by some devices are imprecise, and on this basis propose the concept of $(k, \delta)$-anonymity and the NWA algorithm, which is based on trajectory clustering. Yang et al.⁵ protect trajectory privacy with fuzzy regions based on graph theory: the trajectories of the spatio-temporal data set are transformed into a graph, and trajectory privacy protection becomes a $k$-anonymity sub-graph partition problem. The GC-DM algorithm proposed by Wang et al.⁶ considers time span, position information and trajectory shape together; it clusters trajectories by a similarity measure and then reconstructs the trajectories within each cluster so that every cluster contains $k$ trajectories. Xu and Cai⁷ record the trajectories of moving objects in a footprint table of historical trajectories; in the subsequent release of trajectory data, a trajectory that does not meet the privacy requirement is replaced with a historical footprint found by querying the table. Taking the attacker's point of view, Zhao et al.⁸ propose two trajectory privacy-protection schemes based on the attack models an attacker might use. The first generates a number of false trajectories by rotating and deforming the threatened trajectory and adds them to the released spatio-temporal data set to achieve $k$-anonymity. The second refines the first by suppressing locally threatened trajectory fragments, which are kept out of the released data set, in order to improve both privacy and data availability. Gidófalvi et al.⁹ present a trajectory privacy-protection approach for data collection under a client-server framework, in which the movement estimates of different users are kept separate; on the server side, trajectory points or fragments are exchanged between users and anonymized before the requested service is provided. Although these algorithms achieve $k$-anonymity of the trajectories and thus meet users' privacy needs, the following problems remain:

The attack model is too simple, so the actual privacy protection is insufficient. The models do not consider that attackers enrich their knowledge with the results of a first attack; when they attack again, the owner of a trajectory faces a higher risk of privacy leakage.
Little data availability is retained.
The algorithms do not take into account the offset error that some applications can accept.
The algorithms do not execute efficiently enough.

In view of these problems, this paper enriches the attacker's attack model and, on that basis, proposes a privacy-protection algorithm based on an amending prefix tree. The algorithm not only improves the privacy of trajectory data but also substantially improves data availability and execution efficiency.

Problem Description and Model Building

Problem description

To serve location-based applications, devices with positioning or sign-in functions collect and distribute large amounts of spatio-temporal data. An attacker can easily discover the identity of a user from these data, which leads to the disclosure of the user's privacy. Research on trajectory privacy protection must therefore ensure the privacy level of users' trajectories while also guaranteeing high-quality data for each kind of service; in other words, privacy level and data availability must be balanced.¹⁰

Attack model

The released spatio-temporal data set is shown in Table 1. Given a set of attackers $A = \{a_{1}, a_{2}, \dots, a_{n}\}$, the partial information held by any attacker $a_{i}$ is shown in Table 2 (a). From this partial information, attacker $a_{i}$ can identify in the spatio-temporal data set that the trajectory of Object_1 belongs to user X_1 and the trajectory of Object_3 belongs to user X_2, while the trajectories of Object_1 and Object_5 may both belong to user X_3. Under the condition that each trajectory has exactly one corresponding user, Object_1 can be excluded, so the attacker infers that Object_5 belongs to user X_3. Trajectory privacy protection therefore modifies the data in the spatio-temporal data set so that the probability of an attacker re-identifying a user's trajectory in the newly released data set satisfies $P_{attack} \leq \frac{1}{k}$.

Existing methods, however, assume that a repeated attack still uses only the information the attacker held originally. They do not consider that the attacker updates this information after first stealing private data and learns from what was stolen, so the information held by attacker $a_{i}$ in a second attack is not that of Table 2 (a) but that of Table 2 (b). If a trajectory is deleted or transformed abruptly, without regard to the spatio-temporal correlation and sensitivity of its points, the attacker is likely to treat the changed point as an outlier or a disturbing point and ignore it, so the actual privacy protection is insufficient. For example, suppose that $loc_{4}$ is spatially transformed into $loc_{8}$ to keep the trajectory of Object_5 from being found. Because $loc_{8}$ does not appear in the information held by the attacker, the attacker regards it as a disturbing point and ignores it, and can still infer that the transformed trajectory of Object_5 belongs to user X_3.

Moving object	Moving trajectory
Object_1	$loc_{1} \to loc_{2} \to loc_{3} \to loc_{4}$
Object_2	$loc_{1} \to loc_{3} \to loc_{5} \to loc_{6}$
Object_3	$loc_{2} \to loc_{7} \to loc_{6} \to loc_{2}$
Object_4	$loc_{2} \to loc_{1} \to loc_{5}$
Object_5	$loc_{2} \to loc_{3} \to loc_{4}$

Table 1 The data included in the spatio-temporal data set

User	Trajectory information held by the attacker
X_1	$\dots \to loc_{1} \to \dots \to loc_{4} \to \dots$
X_2	$\dots \to loc_{7} \to \dots \to loc_{6} \to \dots$
X_3	$\dots \to loc_{2} \to \dots \to loc_{4} \to \dots$

(a) The information obtained by the attacker in one attack.

User	Trajectory information held by the attacker
X_1	$\begin{cases} \dots \to loc_{1} \to \dots \to loc_{4} \to \dots \\ loc_{1} \to loc_{2} \to loc_{3} \to loc_{4} \end{cases}$
X_2	$\begin{cases} \dots \to loc_{7} \to \dots \to loc_{6} \to \dots \\ loc_{2} \to loc_{7} \to loc_{6} \to loc_{2} \end{cases}$
X_3	$\begin{cases} \dots \to loc_{2} \to \dots \to loc_{4} \to \dots \\ loc_{2} \to loc_{3} \to loc_{4} \end{cases}$

(b) The information obtained by the attacker in another attack.

Table 2 The information obtained by the attacker in one and another attack, respectively

First, attacker $a_{i}$ steals spatio-temporal trajectory data that has no privacy protection. Using this incomplete information together with the background information already held, $a_{i}$ infers the correspondence between trajectories and moving objects, and then analyzes the data to discover more private information about the moving objects, which threatens their privacy. Attacker $a_{i}$ then updates the information held with what was obtained in the first attack. When $a_{i}$ attacks again with this more comprehensive information, the moving objects face a greater privacy threat and their privacy is more likely to be disclosed. The effect of a privacy-protection algorithm is measured by how much of the targets' privacy leaks when they are attacked again by these attackers; at this stage, further learning by the attackers is not taken into consideration (as shown in Figure 1).

Figure 1 Sketch of attack model.
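The first attack described above can be reproduced on the data of Table 1 and Table 2 (a). The following is a minimal sketch, not the authors' implementation: it assumes that the attacker's knowledge is an ordered list of locations with unknown gaps, that a trajectory matches when it contains those locations in order, and that each trajectory has exactly one user. The names `matches` and `reidentify` are hypothetical.

```python
from typing import Dict, List, Set

# Released data set from Table 1 (loc_n is written as the integer n).
RELEASED: Dict[str, List[int]] = {
    "Object_1": [1, 2, 3, 4],
    "Object_2": [1, 3, 5, 6],
    "Object_3": [2, 7, 6, 2],
    "Object_4": [2, 1, 5],
    "Object_5": [2, 3, 4],
}

# Attacker knowledge from Table 2 (a): locations known to be visited in
# this order, with unknown gaps in between.
KNOWN: Dict[str, List[int]] = {"X_1": [1, 4], "X_2": [7, 6], "X_3": [2, 4]}


def matches(known: List[int], trajectory: List[int]) -> bool:
    """True if `known` occurs in `trajectory` as an ordered subsequence."""
    remaining = iter(trajectory)
    return all(loc in remaining for loc in known)


def reidentify(released: Dict[str, List[int]],
               known: Dict[str, List[int]]) -> Dict[str, Set[str]]:
    """Assumed attack: match partial knowledge, then exclude objects that
    are already assigned to exactly one other user."""
    cands = {user: {obj for obj, tra in released.items() if matches(k, tra)}
             for user, k in known.items()}
    changed = True
    while changed:
        changed = False
        taken = {next(iter(c)): u for u, c in cands.items() if len(c) == 1}
        for user, c in cands.items():
            drop = {o for o in c if o in taken and taken[o] != user}
            if drop and len(c) > 1:
                c -= drop          # one trajectory has only one user
                changed = True
    return cands


RESULT = reidentify(RELEASED, KNOWN)
# RESULT == {"X_1": {"Object_1"}, "X_2": {"Object_3"}, "X_3": {"Object_5"}}
```

Before the exclusion step, X_3 matches both Object_1 and Object_5; the exclusion step removes Object_1, which mirrors the reasoning in the text. In a second attack, the attacker would replace the partial sequences in `KNOWN` with the full trajectories just identified, as in Table 2 (b).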

Algorithm design

Relevant definition

Definition 1: Spatio-temporal data set. $O$ denotes the set of moving objects, written $O = \{o_{1}, o_{2}, o_{3}, \dots, o_{n}\}$, and $tra_{o_{i}}$ denotes the record of the $n$ trajectories of moving object $o_{i}$:

$tra_{o_{i}} = \begin{cases} \langle (lon_{o_{i-1}}^{1}, lat_{o_{i-1}}^{1}, t_{o_{i-1}}^{1}), (lon_{o_{i-1}}^{2}, lat_{o_{i-1}}^{2}, t_{o_{i-1}}^{2}), (lon_{o_{i-1}}^{3}, lat_{o_{i-1}}^{3}, t_{o_{i-1}}^{3}), \dots, (lon_{o_{i-1}}^{m_{1}}, lat_{o_{i-1}}^{m_{1}}, t_{o_{i-1}}^{m_{1}}) \rangle \\ \langle (lon_{o_{i-2}}^{1}, lat_{o_{i-2}}^{1}, t_{o_{i-2}}^{1}), (lon_{o_{i-2}}^{2}, lat_{o_{i-2}}^{2}, t_{o_{i-2}}^{2}), (lon_{o_{i-2}}^{3}, lat_{o_{i-2}}^{3}, t_{o_{i-2}}^{3}), \dots, (lon_{o_{i-2}}^{m_{2}}, lat_{o_{i-2}}^{m_{2}}, t_{o_{i-2}}^{m_{2}}) \rangle \\ \vdots \\ \langle (lon_{o_{i-n}}^{1}, lat_{o_{i-n}}^{1}, t_{o_{i-n}}^{1}), (lon_{o_{i-n}}^{2}, lat_{o_{i-n}}^{2}, t_{o_{i-n}}^{2}), (lon_{o_{i-n}}^{3}, lat_{o_{i-n}}^{3}, t_{o_{i-n}}^{3}), \dots, (lon_{o_{i-n}}^{m_{n}}, lat_{o_{i-n}}^{m_{n}}, t_{o_{i-n}}^{m_{n}}) \rangle \end{cases}$

Here the length of a trajectory of moving object $o_{i}$ is written $|tra_{o_{i}}| = m$. The triple $(lon_{o_{i}}^{k}, lat_{o_{i}}^{k}, t_{o_{i}}^{k})$, $k \leq m$, is written $point_{k}$ and gives the location of moving object $o_{i}$ at time $t^{k}$; its elements are longitude, latitude and time, respectively.

Definition 2: Sub-trajectory. Given two trajectories $tra_{o_{i}} = \langle point_{o_{i}}^{1}, point_{o_{i}}^{2}, \dots, point_{o_{i}}^{l} \rangle$ and $tra_{o_{j}} = \langle point_{o_{j}}^{1}, point_{o_{j}}^{2}, \dots, point_{o_{j}}^{r} \rangle$, if there exist integers $1 \leq k_{1} < k_{2} < \dots < k_{r} \leq l$ such that $point_{o_{i}}^{k_{1}} = point_{o_{j}}^{1}, point_{o_{i}}^{k_{2}} = point_{o_{j}}^{2}, \dots, point_{o_{i}}^{k_{r}} = point_{o_{j}}^{r}$, then $tra_{o_{j}}$ is called a sub-trajectory of $tra_{o_{i}}$, written $tra_{o_{j}} \subseteq tra_{o_{i}}$.

Definition 3: Sensitivity. Different moving objects define privacy differently, so this paper uses sensitivity to represent the privacy of different trajectories. The sensitivity of a trajectory point $tra\_p = (lon, lat)$ is the number of times it is visited in the trajectory data set $TD$; the more visits a point has, the less sensitive it is and the higher its privacy level.
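As an illustration of Definition 3, the sketch below counts visits per location over the trajectories of Table 1. It assumes that every occurrence of a location counts as one visit, including repeated visits within one trajectory; the function name `sensitivity` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def sensitivity(trajectories: Iterable[List[Hashable]]) -> Dict[Hashable, int]:
    """Return the number of visits to each location over all trajectories."""
    visits: Counter = Counter()
    for trajectory in trajectories:
        visits.update(trajectory)  # one count per occurrence of a location
    return dict(visits)


# Trajectories of Table 1, with loc_n written as the integer n.
TABLE_1 = [[1, 2, 3, 4], [1, 3, 5, 6], [2, 7, 6, 2], [2, 1, 5], [2, 3, 4]]
SENS = sensitivity(TABLE_1)
# SENS == {1: 3, 2: 5, 3: 3, 4: 2, 5: 2, 6: 2, 7: 1}
```

Under this counting, $loc_{2}$ is the most visited location and $loc_{7}$, visited once, is the one that most easily singles out a trajectory.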

Definition 4: Trajectory information gain. The information gain in this paper refers to the sensitivity carried by each point of each trajectory in the trajectory data set $TD$.

Definition 5: Temporal information similarity set. Trajectories in the trajectory data set $TD$ that meet the following two conditions belong to the same temporal information similarity set $TISS$:

The time spans of the whole trajectories are approximately equal.
The trajectories have the same stagnation points.

Definition 6: Temporal information relevance set. Trajectories in the trajectory data set $TD$ that meet the following two conditions belong to the same temporal information relevance set $TIRS$:

The sensitivities of the trajectory points over the whole trajectories are approximately equal.
The time spans of the whole trajectories are approximately equal.

Definition 7: Sensitivity location set. This is the set of trajectory points $tra\_p = (lon, lat)$ with very low sensitivity in the given trajectory data set $TD$. Note that the sensitivity location set is variable: it follows the sensitivities as the privacy-protection algorithm changes the trajectories, and the number of elements of the original sensitivity location set gradually falls to 0.

Definition 8: Safe trajectory sequence. Given the trajectory data set $TD$ and the sensitivity location set $SLS$, a trajectory $tra_{o_{i-k}}$ in $TD$ is safe if and only if none of its trajectory points is in $SLS$, that is, $\forall\, tra\_p_{i} \in tra_{o_{i-k}},\ tra\_p_{i} \notin SLS$.
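Definitions 7 and 8 can be sketched together. The code below assumes that "very low sensitivity" means a visit count below the threshold $k$ used later by the algorithm; that reading, the helper names, and the visit counts (tallied by hand from Table 1, one count per occurrence) are assumptions of this illustration.

```python
from typing import Dict, Hashable, List, Set


def sensitive_locations(sens: Dict[Hashable, int], k: int) -> Set[Hashable]:
    """Sensitivity location set: locations visited fewer than k times."""
    return {loc for loc, visits in sens.items() if visits < k}


def is_safe(trajectory: List[Hashable], sls: Set[Hashable]) -> bool:
    """A trajectory is safe iff none of its points lies in the set."""
    return not any(point in sls for point in trajectory)


# Visit counts of loc_1 .. loc_7 tallied from Table 1.
VISITS = {1: 3, 2: 5, 3: 3, 4: 2, 5: 2, 6: 2, 7: 1}
SLS = sensitive_locations(VISITS, k=2)   # {7}
```

With $k = 2$, only the trajectory of Object_3, which passes through $loc_{7}$, is unsafe; as the algorithm replaces such points, the set shrinks toward the empty set.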

Definition 9: $k$-anonymity spatio-temporal data. Given the trajectory data set $TD$, if the probability that each of its trajectories is successfully attacked, resulting in privacy leakage, satisfies $P_{divulge} \leq \frac{1}{k}$, the data is called $k$-anonymity spatio-temporal data.

Definition 10: Trajectory chart. Given the spatio-temporal data set $TD$, the trajectory chart is $TG = (V, E, S)$, where $V$ is the node set of $TG$, composed of stagnation points, that is, $\forall v_{i} \in V, v_{i} \in HPS$; $E$ is the edge set of $TG$, composed of moving points, that is, $\forall e_{i} \in E, e_{i} \in MPS$; and $S$ is the sensitivity of each node and edge, obtained as the average of the sensitivities of the corresponding stagnation points and moving points.

Definition 11: Tolerance error. Some positioning or sign-in services allow errors: when a personal position is released, an error within a certain range is acceptable. For example, the clock-in sign-in function of "ding talk" allows fine tuning within 500 meters. Here $terror$ denotes the tolerance error.
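A tolerance test of this kind needs a distance between two $(lon, lat)$ points. The paper does not state which distance is used; the sketch below assumes the great-circle (haversine) distance on a spherical Earth, and the names `haversine_m` and `within_tolerance` are hypothetical.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, a spherical approximation

LonLat = Tuple[float, float]  # (longitude, latitude) in degrees


def haversine_m(p: LonLat, q: LonLat) -> float:
    """Great-circle distance in metres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_tolerance(p: LonLat, q: LonLat, terror_m: float = 500.0) -> bool:
    """True if q may replace p without exceeding the tolerance error."""
    return haversine_m(p, q) <= terror_m
```

With the 500-meter tolerance of the example, a candidate about 0.003 degrees of latitude away (roughly 330 meters) is acceptable, while one 0.01 degrees away (roughly 1.1 kilometers) is not.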

The core steps of the algorithm in this paper are as follows (as shown in Figure 2):

Figure 2 Privacy preserving model of APT-PP algorithm.

Pre-process the spatio-temporal data set $TD$, transform $TD$ into a trajectory chart for storage, and calculate the sensitivity of each trajectory point $tra\_p$.
According to Definition 6, divide the trajectories into temporal information relevance sets $TIRS$ by the sensitivity of the trajectory points in each trajectory and the time span of the trajectory.
Transform the chart into a prefix tree for storage.
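The pre-processing step compresses each raw track into a sequence of hold (stagnation) points and moving points. A minimal sketch of that classification follows; it assumes that the first and last samples are hold points, that a location repeated in consecutive samples is a single hold point, and that every other sample is a moving point. The function name `compress` is hypothetical.

```python
from typing import List, Tuple

Location = Tuple[float, float]  # (longitude, latitude)


def compress(track: List[Location]) -> List[Tuple[Location, str]]:
    """Label each stay or pass of a raw track as 'hold' or 'move'."""
    out: List[Tuple[Location, str]] = []
    last = len(track) - 1
    for j, point in enumerate(track):
        if j > 0 and track[j - 1] == point:
            continue  # still the same stay as the previous sample
        stays = j < last and track[j + 1] == point
        kind = "hold" if j == 0 or j == last or stays else "move"
        out.append((point, kind))
    return out


RAW = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (2.0, 2.0)]
SEQ = compress(RAW)
# SEQ == [((0.0, 0.0), "hold"), ((1.0, 1.0), "move"), ((2.0, 2.0), "hold")]
```

Collapsing repeated samples is what shrinks the data set: five raw samples become three labelled points here.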

APT-PP algorithm

The APT-PP algorithm proposed in this paper is as follows:
Step 1 Pre-process the trajectories.

Step 1.1 Generate the trajectory sequences from the given spatio-temporal data $TD$; each generated trajectory is composed of stagnation points and moving points.

Step 1.2 Calculate the sensitivity of each trajectory point from the generated trajectory sequences.

Step 1.3 According to Definition 5, divide the generated trajectory sequences into temporal information similarity sets.

Step 1.4 Generate the trajectory chart from the temporal information similarity sets obtained by the pre-processing above.

Step 2 Transform the trajectory chart obtained by pre-processing into a prefix tree for storage.

Step 3 Carry out the "grafting" operation on the prefix tree, as follows.

Step 3.1 Count the leaf nodes of the prefix tree and test whether $|leaf| \geq k$, the condition under which the stagnation points satisfy the $k$-anonymity privacy-protection operation. Depending on the result, either only the moving points or both the stagnation points and the moving points go on to Step 3.2 (in the pseudo code of Table 3, the stagnation points are partitioned only when $|LeafNode| \geq k$).

Step 3.2 Put the stagnation points and the moving points whose sensitivity is less than $k$ into the sets $k\text{-}LHP$ and $k\text{-}LMP$, respectively, and those whose sensitivity is not less than $k$ into the sets $k\text{-}MHP$ and $k\text{-}MMP$. Then count the trajectory points in the two low-sensitivity sets, $|k\text{-}LHP|$ and $|k\text{-}LMP|$. If $|k\text{-}LHP| \geq k$, select $\lfloor \frac{|k\text{-}LHP|}{k} \rfloor$ trajectory points from $k\text{-}LHP$ as the replacement points of the set. If $|k\text{-}LHP| < k$, select from $k\text{-}MHP$ the point closest to the point being replaced, that is, a point of $\{replace\_p \mid d_{replace\_tra} \leq terror\}$, where $replace\_p = (lon_{0}, lat_{0})$ is the replacement point in $k\text{-}MHP$, $tra\_p = (lon, lat)$ is the point currently to be replaced in $k\text{-}LHP$, and $d_{replace\_tra}$ is the distance between the two points. If no such $replace\_p$ exists, randomly select a point from $k\text{-}MHP$ as the replacement point. The set $k\text{-}LMP$ is handled in the same way.

Step 4 Traverse the entire prefix tree from the root node to generate the safe trajectory sequence set $SLS$ that is to be released (Table 3).
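Steps 2 and 4 rely on a prefix tree whose nodes carry sensitivities. The sketch below is one possible realization, not the authors' data structure: it assumes that the sensitivity of a node is the number of inserted trajectories that pass through it, and the names `Node`, `insert`, `paths` and `count_leaves` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Tuple


@dataclass
class Node:
    sensitivity: int = 0
    children: Dict[Hashable, "Node"] = field(default_factory=dict)


def insert(root: Node, trajectory: List[Hashable]) -> None:
    """Follow the longest existing prefix, then extend it; every node on
    the way gains one visit."""
    node = root
    for loc in trajectory:
        node = node.children.setdefault(loc, Node())
        node.sensitivity += 1


def paths(node: Node, prefix: Tuple = ()) -> List[Tuple]:
    """Depth-first list of root-to-leaf paths (the root is excluded)."""
    if not node.children:
        return [prefix] if prefix else []
    found: List[Tuple] = []
    for loc, child in node.children.items():
        found.extend(paths(child, prefix + (loc,)))
    return found


def count_leaves(root: Node) -> int:
    """Number of leaf nodes, the quantity compared with k in Step 3.1."""
    return len(paths(root))


ROOT = Node()
for tra in [[1, 2, 3, 4], [1, 3, 5, 6], [2, 7, 6, 2], [2, 1, 5], [2, 3, 4]]:
    insert(ROOT, tra)  # trajectories of Table 1, loc_n written as n
```

For the five trajectories of Table 1, the root has two children: the node for $loc_{1}$ with sensitivity 2 and the node for $loc_{2}$ with sensitivity 3, and the tree has five leaves. A trajectory that is a strict prefix of another one would not produce its own leaf in this sketch.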

1. for $tra_{i}$ in $TD$
2. {
3. $num = 0$;
4. for $point_{j}$ in $tra_{i}$
5. {
6. if ($point_{j}$ is $begin\_point$ or $end\_point$)
7. put $point_{j}$ in the set of hold points $HS$;
8. else if ($point_{j} == point_{j+1}$)
9. {
10. if ($point_{j}$ not in $HS$)
11. put $point_{j}$ in the set of hold points $HS$;
12. else
13. continue;
14. }
15. else
16. put $point_{j}$ in the set of moving points $MS$;
17. $point_{j}.sensitivity \mathrel{+}= 1$;
18. restructure $tra_{i}$ from the points in $HS$ or $MS$, then put $tra_{i}$ in the trajectory sequence set $TS$;
19. }
20. }
21. while ($TS \neq \emptyset$)
22. {
23. $tra_{i}$ and $tra_{j}$ in $TS$ and $i \neq j$;
24. if ($tra_{i}.span == tra_{j}.span$ and $tra_{i}.holdPoint == tra_{j}.holdPoint$)
25. {
26. put $tra_{i}$ and $tra_{j}$ in the same $TISS$;
27. remove $tra_{i}$ and $tra_{j}$ from $TS$;
28. }
29. }
30. for each $TISS_{i}$ in $TISS$
31. {
32. initialize the prefix tree $PT = \emptyset$;
33. for $tra_{i}$ in $TISS_{i}$
34. {
35. if ($PT = \emptyset$)
36. $path_{i} = tra_{i}$;
37. else
38. $path_{i} =$ maximum prefix of ($tra_{i}$, $path_{j}$ in $PT$);
39. add $path_{i}$ into $PT$;
40. for $point_{m}$ in $path_{i} \cap tra_{i}$
41. $path_{i}.point_{m}.sensitivity \mathrel{+}= tra_{i}.point_{m}.sensitivity$;
42. for $point_{m}$ not in $path_{i}$ but in $tra_{i}$
43. $path_{i}.point_{m}.sensitivity = tra_{i}.point_{m}.sensitivity$;
44. }
45. for each $point$ in $path_{i}$
46. {
47. if ($point.sensitivity < k$ and $point$ in $MS$)
48. put $point$ in the set $k\text{-}LMP$;
49. else if ($point.sensitivity \geq k$ and $point$ in $MS$)
50. put $point$ in the set $k\text{-}MMP$;
51. if ($|LeafNode| \geq k$)
52. {
53. if ($point.sensitivity < k$ and $point$ in $HS$)
54. put $point$ in the set $k\text{-}LHP$;
55. else if ($point.sensitivity \geq k$ and $point$ in $HS$)
56. put $point$ in the set $k\text{-}MHP$;
57. }
/* divide $MS$ and $HS$ into the point sets $k\text{-}LMP$ and $k\text{-}LHP$ with point sensitivity less than $k$ and the point sets $k\text{-}MMP$ and $k\text{-}MHP$ with point sensitivity not less than $k$ */
58. }
59. if ($k\text{-}LMP \neq \emptyset$ and $|k\text{-}LMP| \geq k$)
60. randomly select $\lfloor \frac{|k\text{-}LMP|}{k} \rfloor$ points as the replacement points;
61. else if ($k\text{-}LMP \neq \emptyset$ and $|k\text{-}LMP| < k$)
62. {
63. for each $replace\_point$ in $k\text{-}MMP$ and $point$ in $k\text{-}LMP$
64. if ($\{replace\_point \mid d_{replace\_point} \leq terror\} == \emptyset$)
65. randomly select a point in $k\text{-}MMP$ as the replacement point;
66. else
67. select a point in $\{replace\_point \mid d_{replace\_point} \leq terror\}$ as the replacement point;
68. }
69. if ($k\text{-}LHP \neq \emptyset$)
70. process $k\text{-}LHP$ in the same way as $k\text{-}LMP$ above;
71. }

Table 3 Pseudo Code for the First Part of the APT-PP Algorithm
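Lines 45 to 70 of Table 3 can be read as a partition followed by a replacement choice. The sketch below is one interpretation under stated assumptions, not the authors' code: when the low-sensitivity set has at least $k$ points, each of its points is mapped to the nearest of $\lfloor n/k \rfloor$ randomly chosen representatives (how points are assigned to the chosen replacement points is not specified in the source), and otherwise each point is mapped to the nearest high-sensitivity point if it lies within `terror`, or to a random one if not. The names `partition`, `choose_replacement` and `graft` and the caller-supplied `dist` function are hypothetical.

```python
import random
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Dist = Callable[[Hashable, Hashable], float]


def partition(sens: Dict[Hashable, int], k: int) -> Tuple[List, List]:
    """Split points into (low, high): sensitivity < k and sensitivity >= k."""
    low = [p for p, s in sens.items() if s < k]
    high = [p for p, s in sens.items() if s >= k]
    return low, high


def choose_replacement(point, high: List, dist: Dist, terror: float,
                       rng=random) -> Optional[Hashable]:
    """Nearest high-sensitivity point within terror, else a random one."""
    if not high:
        return None
    nearest = min(high, key=lambda q: dist(point, q))
    return nearest if dist(point, nearest) <= terror else rng.choice(high)


def graft(sens: Dict[Hashable, int], k: int, dist: Dist, terror: float,
          rng=random) -> Dict[Hashable, Optional[Hashable]]:
    """Map every low-sensitivity point to its replacement point."""
    low, high = partition(sens, k)
    if len(low) >= k:
        reps = rng.sample(low, len(low) // k)   # assumed representatives
        return {p: min(reps, key=lambda q: dist(p, q)) for p in low}
    return {p: choose_replacement(p, high, dist, terror, rng) for p in low}


# One-dimensional toy positions, distance = absolute difference.
PLAN = graft({0: 5, 10: 1, 100: 1}, k=3,
             dist=lambda a, b: abs(a - b), terror=20.0)
# PLAN == {10: 0, 100: 0}
```

In the toy example the point at 10 is replaced by the high-sensitivity point at 0 because it lies within the tolerance, while the point at 100 has no candidate within the tolerance and falls back to a random choice, which is again 0 because it is the only candidate.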

Lines 1 to 29 of the pseudo code above pre-process the spatio-temporal data set: they compress the data set into trajectory sequences (forming the nodes described above) to save space. Lines 1 to 20 generate the trajectory sequence set $TS$ from the input spatio-temporal data set $TD$ and calculate the sensitivity of each trajectory point. Lines 21 to 29 divide the trajectory sequences of $TS$ into temporal information similarity sets $TISS$ that conform to Definition 5: trajectory sequences with approximately equal time spans and the same stagnation points are placed in the same $TISS$. Lines 32 to 44 generate the prefix tree for each temporal information similarity set; storing the trajectory sequences as a prefix tree improves retrieval efficiency and so reduces computation time. Starting from the first point of the trajectory sequence $tra_{i}$, the algorithm finds the maximum common sequence (the maximum prefix) in the prefix tree $PT$ generated so far, adds $tra_{i}$ to $PT$ as the path $path_{i}$, and updates the sensitivity of each node (stagnation point or moving point) in $PT$. If a maximum prefix exists, the sensitivities on the shared part of the two paths are added; otherwise the sensitivity of each node on $path_{i}$ equals the sensitivity of the corresponding trajectory point in $tra_{i}$.

Lines 45 to 70 are the "grafting" process, the privacy-protection optimization of the prefix tree. It moves each node that does not satisfy the privacy requirement (a trajectory point with $sensitivity < k$) to another place, which preserves the availability of the data to a certain extent while raising the privacy level. During grafting, the stagnation points and moving points on each path $path_{i}$ are divided into four sets $(k\text{-}LHP, k\text{-}MHP, k\text{-}LMP, k\text{-}MMP)$ with $k$ as the threshold. If the size of $k\text{-}LHP$ or $k\text{-}LMP$ reaches the threshold $k$, replacement points are selected at random from that set; otherwise the distances to the candidate replacement points in $k\text{-}MHP$ or $k\text{-}MMP$ are calculated. If the candidate set $\{replace\_point \mid d_{replace\_point} \leq terror\}$ is not empty, a point is selected from it as the replacement; otherwise a point is selected at random from $k\text{-}MHP$ or $k\text{-}MMP$. After grafting, the tree is visited from the root node, and each path from the root node to a leaf node is a sequence of the required safe trajectory sequence set $SLS$.

The algorithm of this paper is divided into two parts. The number of trajectories in a temporal information similarity set $TISS$ generated during pre-processing satisfies $|TISS| \in [1, |TD|]$. Considering the trade-off between the complexity of the algorithm and the loss of data, a set with $|TISS| < k/2$ is costly to process; on this consideration, Algorithm 2 is proposed for the trajectory sequences whose stagnation points were not replaced in Algorithm 1. Its specific procedure is as follows:

Step 1. Pre-process the spatio-temporal data as in Algorithm 1. The only difference lies in Step 1.3: in this algorithm the temporal information relevance sets $TIRS$ are generated according to Definition 6.

Step 2. Generate the prefix tree $PT$ from the temporal information relevance set $TIRS$. Put the stagnation points of $PT$ whose sensitivity is less than $k$ into the set $S\text{-}LHP$, and those whose sensitivity is equal to or larger than $k$ into the set $S\text{-}MHP$.

Step 3. If $S\text{-}MHP$ is not empty, that is, $S\text{-}MHP \neq \emptyset$, replace each point to be replaced in $S\text{-}LHP$ with the closest point in $S\text{-}MHP$ (within the tolerance error range, that is, within $terror$ meters). Otherwise replace it with the closest point in $S\text{-}LHP$. The result is the amended prefix tree $PT_{new}$.

Step 4. Carry out the "grafting" operation of Algorithm 1 on the moving points of the amended $PT_{new}$. Divide the moving points of $PT_{new}$ into the two sets $k\text{-}LMP$ and $k\text{-}MMP$ with $k$ as the threshold. If $|k\text{-}LMP| \geq k$, select $\lfloor \frac{|k\text{-}LMP|}{k} \rfloor$ trajectory points from $k\text{-}LMP$ as the replacement points; otherwise randomly select from $k\text{-}MMP$ a point within the $terror$-meter tolerance error of the point to be replaced, that is, a point of $\{point(lon, lat) \mid d_{replace\_point} \leq terror\}$.

Lines 1 to 11 of the pseudo code of Algorithm 2 (Table 4) pre-process the trajectory sequences that were not treated in Algorithm 1: the sequences are divided into temporal information relevance sets $TIRS$ and the corresponding prefix tree is generated. Lines 12 to 34 amend the prefix tree. With the amended prefix tree, the cost of processing the stagnation points falls from the original $O(n^{2})$ to $O(n)$, which improves the efficiency of the algorithm to some extent.

Input：the remaining sequence of trajectory $R T S$ that’re not processed by algorithm 1
Output：The modified prefix tree $P T_{n e w}$

1. while( $R T S \neq \emptyset$ )
2. {
3. $t r a_{i}$ and $t r a_{j}$ in $R T S$ and $i \neq j$ ；
4. if( $t r a_{i} . s p a n = = t r a_{j} . s p a n$ )
5. {
6. put $t r a_{i}$ and $t r a_{j}$ in the same $T I R S$ ；
7. remove $t r a_{i}$ and $t r a_{j}$ from $R T S$ ；
8. }
9. }
10. establish prefix tree $P T$ ；
11. $(S - L H P, S - M H P) = d i v i d e (H S)$ /* $H S$ represent the hold point in the path of the prefix tree $P T$ */
/*according to the sensitivity $k$ divide $H S$ into the point set $S - M H P$ and $S - L H P$ */
12. if( $S - M H P \neq \emptyset$ )
13. {
14. for $H P_{i}$ in $S - L H P$
15. {
16. $r e p l a c e\_f l a g = f a l s e$ ; /* reset once per $H P_{i}$ so that a successful replacement is not overwritten by a later $H P_{j}$ */
17. for $H P_{j}$ in $S - M H P$
18. {
19. if( $H P_{j}$ in $r a n g e (H P_{i}, t e r r o r)$ ) /* $H P_{j}$ lies within $t e r r o r$ meters of $H P_{i}$ */
20. {
21. $H P_{i} = H P_{j}$ ;
22. $r e p l a c e\_f l a g = t r u e$ ;
23. }
24. }
25. if( $r e p l a c e\_f l a g = = f a l s e$ )
26. $H P_{i} = r a n d o m (S - L H P)$ ;
27. }
28. }
29. while( $S - L H P \neq \emptyset$ )
30. {
31. $M i n k P o i n t = M I N (S - L H P)$ ; /* the point with the minimum sensitivity in $S - L H P$ */
32. $M a x k P o i n t = M A X (S - L H P)$ ; /* the point with the maximum sensitivity in $S - L H P$ */
33. $M i n k P o i n t = M a x k P o i n t$ ;
34. update the sets $S - L H P$ and $S - M H P$ ; /* if the sensitivity of a point is more than $k$ , remove the point from $S - L H P$ and put it in $S - M H P$ */
35. }
36. return $P T_{n e w}$ ;

Table 4 Amending Prefix Tree Algorithm Pseudo Code for the Second Part of the APT-PP Algorithm
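The hold-point amendment in lines 12 to 28 of Table 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `haversine_m` and `amend_hold_points`, the dictionary of per-point sensitivities, the great-circle distance, and the use of "sensitivity at least $k$" as the division rule are all assumptions made for illustration. Following Step3, the sketch picks the closest in-range point of $S - M H P$ and falls back to a random point of $S - L H P$ as in line 26.

```python
import math
import random


def haversine_m(p, q):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def amend_hold_points(sensitivity, k, terror, rng=None):
    """Return {low-sensitivity hold point: replacement point}.

    sensitivity: dict mapping (lon, lat) -> sensitivity (assumed representation).
    k: sensitivity threshold; terror: tolerated error in meters.
    """
    rng = rng or random.Random(0)  # seeded only to keep the sketch reproducible
    # Assumed division rule: sensitivity >= k goes to S-MHP, the rest to S-LHP.
    s_mhp = [p for p, s in sensitivity.items() if s >= k]
    s_lhp = [p for p, s in sensitivity.items() if s < k]
    replacement = {}
    if not s_mhp:  # lines 12-28 only run when S-MHP is not empty
        return replacement
    for hp in s_lhp:
        in_range = [q for q in s_mhp if haversine_m(hp, q) <= terror]
        if in_range:
            # closest S-MHP point within terror meters
            replacement[hp] = min(in_range, key=lambda q: haversine_m(hp, q))
        else:
            # fallback: a random other point of S-LHP (hp itself if it is alone)
            others = [q for q in s_lhp if q != hp]
            replacement[hp] = rng.choice(others) if others else hp
    return replacement
```

Returning a mapping rather than mutating the tree keeps the sketch independent of any particular prefix tree representation; applying the mapping to the tree nodes is left out.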

Analysis for Data Availability and Privacy Level

Data availability and privacy level are discussed in this section. It shows that the spatio-temporal data republished by the APT-PP algorithm effectively improves the privacy level, while the availability of the data is not greatly reduced and the information loss stays within a relatively stable and acceptable range. The section defines the relevant measurement parameters and discusses how to measure the availability of data and the protection level of privacy.

Protection level of privacy:

The protection level of privacy is measured by the probability that an attacker re-identifies a trajectory from the republished spatio-temporal data.^11-13 After the privacy is optimized by the algorithm proposed in the paper, the $k$ -anonymity demand of the trajectory is met. The proof is as follows:

Applying the APT-PP algorithm proposed in the paper to the given original spatio-temporal data set $T D$ yields the safe trajectory sequence set $S L S$ . The set $S L S$ is composed of the paths traversed from the root node to the leaf nodes of the amended prefix tree, which achieves the privacy protection (each path contains all nodes except the root node). In the “grafting” process of the prefix tree, every trajectory point whose sensitivity does not meet the requirement is transferred or replaced so that its sensitivity reaches the threshold $k$ or above. As a result, the sensitivity of every stagnation point and moving point in the amended prefix tree is greater than or equal to $k$ , and therefore the final path sequence set $S L S$ acquired by depth-first traversal meets the $k$ -anonymity demand of the trajectory.
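The argument above can be checked mechanically on a small example. The following Python sketch builds a prefix tree from trajectories and verifies that every non-root node reaches the threshold $k$ . It rests on a loudly stated assumption: the sensitivity of a node is taken to be the number of trajectories passing through it, which may differ from the sensitivity definition used elsewhere in the paper. The names `Node`, `build_prefix_tree` and `satisfies_k` are hypothetical.

```python
class Node:
    """Prefix tree node; count = number of trajectories passing through it
    (assumed stand-in for the sensitivity of the trajectory point)."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_prefix_tree(trajectories):
    """Insert each trajectory (a sequence of hashable points) under a shared root."""
    root = Node()
    for tra in trajectories:
        node = root
        for point in tra:
            node = node.children.setdefault(point, Node())
            node.count += 1
    return root


def satisfies_k(root, k):
    """Depth-first check that every node except the root has count >= k."""
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        if node.count < k:
            return False
        stack.extend(node.children.values())
    return True
```

For the toy trajectories `[("A", "B", "C"), ("A", "B", "C"), ("A", "B", "D")]`, the node `D` is reached by only one trajectory, so the check fails for $k = 2$ ; once `D` is replaced by `C`, all three paths coincide and the check passes.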

Loss level of information and tolerance error of information

The loss level of information and the loss level of tolerance error of information, represented by $M I L$ and $M I T E L$ respectively, measure how far the spatio-temporal data is distorted when trajectory points are modified during algorithm processing and privacy optimization. The former is the measure index for services that demand accurate and strict data, and the latter is the measure index for services that allow error, such as location and check-in.

$\begin{cases} M I L = 1 - \frac{\sum_{1}^{| P T S |} | P T_{n e w} \cap P T |}{| T D . p o i n t |} \\ M I T E L = 1 - \frac{\sum_{1}^{| P T S |} | P T_{t e r r o r} \cap P T |}{| T D . p o i n t |} \end{cases}$

In the above formula, $P T S$ is the prefix tree forest and $| P T S |$ is the number of prefix trees contained in the forest. $P T_{n e w}$ and $P T_{t e r r o r}$ are the prefix trees obtained after running the algorithm with tolerance error $t e r r o r = 0$ and $t e r r o r \neq 0$ respectively, $P T$ is the prefix tree before running the algorithm, and $| T D . p o i n t |$ is the number of all trajectory points included in the given original spatio-temporal data set $T D$ .
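As a minimal sketch of how the two indices could be evaluated, the Python function below applies the formula directly. It assumes, for illustration only, that each prefix tree is represented as a set of hashable point records so that the intersection counts the points left unchanged. The name `information_loss` and the toy values are hypothetical. The same function gives $M I L$ when the amended forest was produced with $t e r r o r = 0$ and $M I T E L$ when it was produced with $t e r r o r \neq 0$ .

```python
def information_loss(original_forest, amended_forest, total_points):
    """1 - (sum over trees of |PT_amended ∩ PT|) / |TD.point|.

    original_forest, amended_forest: lists of sets, one set of point
    records per prefix tree, in the same order (assumed representation).
    total_points: number of trajectory points in the original data set TD.
    """
    retained = sum(len(pt_amended & pt)
                   for pt_amended, pt in zip(amended_forest, original_forest))
    return 1 - retained / total_points


# Toy forest with a single tree of four point records (made-up values).
pt = {("t1", 0, "A"), ("t1", 1, "B"), ("t2", 0, "A"), ("t2", 1, "C")}
# Amended with terror = 0: two records were changed.
pt_new = {("t1", 0, "A"), ("t2", 0, "A"), ("t1", 1, "X"), ("t2", 1, "X")}
# Amended with terror != 0: only one record was changed.
pt_terror = {("t1", 0, "A"), ("t1", 1, "B"), ("t2", 0, "A"), ("t2", 1, "X")}

mil = information_loss([pt], [pt_new], total_points=4)       # 0.5
mitel = information_loss([pt], [pt_terror], total_points=4)  # 0.25
```

Both indices lie between 0 and 1, and a lower value means that more of the original trajectory points survive in the published data.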

Algorithm complexity

Experimental Verification

Description for Experimental data and environment

To verify the superiority of the APT-PP algorithm proposed in the paper, the algorithm is assessed in detail on the real data set T-drive provided by Microsoft Research Asia.^14-15 The T-drive data set collects the moving trajectories of 10357 taxis in the Beijing area from February 2 to February 8, 2008, with a data collection interval of 5 seconds. It contains about 15 million trajectory points in total, and the total distance of the moving trajectories reaches 9 million km, with an average sampling time interval of 177 seconds and an average sampling distance interval of 623 m. The specific attributes of T-drive are shown in Table 5, and the format of the moving trajectories recorded in the data set is shown in Figure 3.

Collection Site	Collection Object	Number of Moving Objects	Number of Moving Trajectories
Beijing	Taxi	10357	72499

Number of Trajectory Points	Moving Distance	Time Span of Sampling	Interval of Data Collection	Average Time Interval of Sampling	Average Distance Interval of Sampling
15 million	9 million km	7 days	5 seconds	177 seconds	623 m

Table 5 Description of the Data Set

As shown in Figure 3, each line of data is one spatio-temporal record collected at a sampling point. Its fields are separated by commas and represent the taxi ID, the sampling date/time, the longitude and the latitude. As the time span covers several days, the data is preprocessed before testing. The data set contains some repeated data, that is, multiple records at the same time, so the data is cleaned before processing. In addition, the position records of a moving object within one day compose one trajectory, so each moving object has seven trajectories (namely, a 7-day moving path is recorded). The APT-PP algorithm is written in Python and tested on an Intel(R) Core(TM) i7-4610M CPU @ 3.00GHz with 8.00GB memory under Microsoft Windows 10, and the results are plotted with MATLAB.

Figure 3 Data Format.
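The preprocessing described above (dropping records repeated at the same time and splitting each taxi's records into one trajectory per day) might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `load_daily_trajectories` is hypothetical, and the timestamp layout `%Y-%m-%d %H:%M:%S` is assumed rather than taken from Figure 3.

```python
import csv
from collections import defaultdict
from datetime import datetime


def load_daily_trajectories(lines):
    """Group comma-separated records (taxi ID, date/time, longitude, latitude)
    into one time-ordered trajectory per taxi and day, dropping records that
    repeat the same taxi ID and timestamp."""
    seen = set()
    by_day = defaultdict(list)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        taxi_id, stamp, lon, lat = (field.strip() for field in row)
        # Assumed timestamp layout; adjust if the data uses another format.
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if (taxi_id, ts) in seen:  # repeated record at the same time
            continue
        seen.add((taxi_id, ts))
        by_day[(taxi_id, ts.date())].append((ts, float(lon), float(lat)))
    for points in by_day.values():
        points.sort()  # order each trajectory by time
    return dict(by_day)


# Illustrative records only, not taken from the data set.
sample = [
    "1,2008-02-02 15:36:08,116.51172,39.92123",
    "1,2008-02-02 15:36:08,116.51172,39.92123",
    "1,2008-02-02 15:30:00,116.40000,39.80000",
    "1,2008-02-03 00:00:10,116.50000,39.90000",
]
trajectories = load_daily_trajectories(sample)
```

With this grouping, a taxi that reports on all seven days contributes seven trajectories, which matches the per-object count described above.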

Experimental results and analysis

The algorithm in literature⁴ (called the NWA algorithm) and that in literature⁶ (called the GC-DM algorithm) are used as comparison algorithms in the test. Contrast experiments are carried out on the loss of information, the tolerated loss of information and the execution efficiency to verify the superiority of the proposed algorithm. The algorithm in literature⁴ is chosen because it is a classic privacy protection algorithm from which many current privacy protection methods are derived or inspired, while the algorithm in literature⁶ is a comparatively novel, improved method for the privacy protection of trajectories.

Experiment 1: Contrast test between the privacy protection level $k$ and the loss level of information. The contrast algorithms (NWA algorithm and GC-DM algorithm) and the APT-PP algorithm of the paper are tested on the T-drive data set, and Figure 4 compares the information loss $M I L$ caused by the different algorithms as the $k$ value (anonymity level) changes. As can be seen from the graph, with the increase of the $k$ value, the loss of the published data set relative to the original data set increases for all three algorithms. However, the algorithm of the paper, represented by the blue line, has a clear advantage over the contrast algorithms. The GC-DM algorithm differs little from the algorithm of the paper when the $k$ value is not large, but as the $k$ value increases the advantage of the algorithm of the paper becomes more and more obvious. In addition, the curve of the algorithm of the paper is smooth, with no sudden increase or decrease as the $k$ value changes.

Experiment 2: Contrast test between the privacy protection level $k$ and the tolerated loss level of information. The contrast algorithms (NWA algorithm and GC-DM algorithm) and the APT-PP algorithm of the paper are tested on the T-drive data set with $t e r r o r$ set to 500 m, and Figure 5 compares the loss level of tolerance error of information $M I T E L$ caused by the different algorithms as the $k$ value (anonymity level) changes. Comparing Figures 4 and 5, the loss of the contrast algorithms (NWA algorithm and GC-DM algorithm) is almost the same on the same data set in the two experiments, whereas the APT-PP algorithm improves significantly in the experiment on the loss level of tolerance error of information. The reason is that the APT-PP algorithm takes the tolerable offset $t e r r o r$ into account during processing, since the data published in this experiment serves check-in applications that tolerate error. Under privacy protection, both the geographic location and the time are fully considered when a trajectory point is offset, so the quality of the check-in service is protected to the maximum extent. The loss level of tolerance error of information therefore has a clear advantage over the other algorithms, and the algorithm is effective when applied to check-in demands.

Figure 4 Change of loss level of spatio-temporal data with $k$ value.

Figure 5 Conditions for loss level of tolerance error of information changing with $k$ value.

Experiment 3: Test of the privacy protection level $k$ , the tolerance error $t e r r o r$ and the tolerated loss level of information. The APT-PP algorithm is tested on the T-drive data set, and Figure 6 shows how the tolerated loss level of information changes with $k$ (anonymity level) and the tolerance error $t e r r o r$ . As can be seen from the figure, the tolerated loss level of information $M I T E L$ of the APT-PP algorithm is positively correlated with $k$ (anonymity level) and negatively correlated with the tolerance error $t e r r o r$ . As $t e r r o r$ increases, the acceptable error offset becomes larger, so many of the trajectory-point corrections made while amending the prefix tree fall within the tolerance range and these offsets do not affect $M I T E L$ . Applications with such positioning requirements are not affected when the final safe trajectory sequences are published as the data set, and location check-in can still be carried out on it. Therefore, as shown in Figure 6, the APT-PP algorithm is influenced by the change of the tolerance error $t e r r o r$ , whereas the NWA algorithm and the GC-DM algorithm are not.

Figure 6 Change of tolerated loss level of information of APT-PP algorithm with $t e r r o r$ and $k$ value.

Experiment 4: Contrast test between the privacy protection level $k$ and the execution efficiency of the algorithm. The contrast algorithms (NWA algorithm and GC-DM algorithm) and the APT-PP algorithm of the paper are tested on the T-drive data set, and Figure 7 shows how the execution time changes with $k$ . The execution time of all three algorithms trends downward as $k$ (anonymity level) increases, because a larger $k$ value means a stricter privacy requirement. Trajectory points that cannot meet the requirement are processed simply (for example, by direct removal), so the execution time is reduced. However, comparing with Experiment 1 and Experiment 2, although the execution becomes faster, the loss of information of the spatio-temporal data set caused by this simple processing of trajectory points increases. The APT-PP algorithm has a clear advantage over the contrast algorithms in execution efficiency, but its execution time changes little with $k$ (anonymity level). The reason is that the larger the $k$ value is, the more time information association sets or time information similarity sets are produced by the division, namely each single set becomes smaller and smaller, yet every set is still processed. If a set contains only one element, it is reprocessed uniformly. The decrease in execution time is therefore not large, and the algorithm is only slightly superior to the GC-DM algorithm when the $k$ value (anonymity level) reaches 40 or 48.

Figure 7 Change of execution efficiency of algorithm with $k$ value.

Conclusion

To address the privacy disclosure of spatio-temporal trajectory data, the APT-PP algorithm is proposed in the paper. In optimizing trajectory privacy, it considers not only the location information but also the relation of time spans, and it defines the concepts of time information similarity set and time information association set so that the spatio-temporal data set is classified and processed at different levels. On this basis, storing the data in the form of a prefix tree effectively improves the efficiency of data search, and the “grafting” operation is carried out on the trajectories stored in the prefix tree to optimize both the privacy level and the loss level of the published spatio-temporal data. Finally, the proposed algorithm is verified by contrast experiments on a real data set, and the results show that the APT-PP algorithm has a comparatively clear advantage in execution efficiency and in the quality of the published data set. The APT-PP algorithm is therefore significant for location based service applications. We will further study privacy protection at different levels according to people's needs.

Acknowledgments

This work was supported by the Natural Science Foundation of Jiangsu Province under Grant No. BK20150721, BK20161469; China Postdoctoral Science Foundation under Grant No. 2015M582786, 2016T91017; Engineering Research Center of Jiangsu Province under Grant No. BM2014391. Primary Research & Development Plan of Jiangsu Province under Grant BE2015728, BE2016904. National Key Research and Development Program 2016YFC0800606.