Archive

Differentially Private Grids for Geospatial Data

Summarize - Uniform grid approach for balancing the noise error and the non-uniformity error - Proposed a method for choosing the grid size. - Traditional differentially private methods for 2D datasets : quadtrees, kd...

2024. 1. 17.6 min read

Summarize

  • Uniform grid approach for balancing the noise error and the non-uniformity error
  • Proposed a method for choosing the grid size.
  • Traditional differentially private methods for 2D datasets : quadtrees, kd-trees (Cormode et al., 2012; Xiao et al., 2010)
  • Uniform grid method : apply grid on domain and execute count queries
  • Key problem is the best grid size
  • Proposed m=Nϵcm=\sqrt{\dfrac{N\epsilon}{c}} where NN is the number of data and cc is some small const.

Problem formulation

  • DD : 2-dimensional geospatial dataset
  • Goal : publish a synopsis of the dataset to accurately answer count queries
  • Count query : specifies a grid and asks for the number of data points in that grid
  • Query rectangle : A set of grid asked by a query

Two sources of error

Noise error

  • To satisfy differential privacy, one adds an independent noise to each cell.
  • The answer for count query then has sum of the noisy counts.
  • If the noise has variance σ2\sigma^{2} and we count qq cells then the standard deviation of answer becomes qσ\sqrt{q}\sigma.
  • That is, if we make the resolution higher, more grids are contained for same answer, which results larger variance.

Non-uniformity error

  • Caused by cells that intersect with the query rectangle but not contained in it.
  • If the data points are not distributed uniformly, the non-uniformity error occurs.
  • It depends on the number of data points in the intersected cell
  • If we make the resolution higher, the non-uniformity error gets lower.

Previous approaches

Recursive partitioning

  • (Xiao et al., 2010) proposed spatial indexing method, KD-trees
  • KD tree recursively split along some dimension. Split points were chose to have an uniformity among subregions.
  • (Cormode et al., 2012) proposed some method based on KD-trees, using the median of the partiton dimension as the split points.
  • or based on quadtree, which recursively divides the nodes into four equal regions.

Hierarchical Transformations

  • Recursive partitioning essentially builds hierarchy over data points.
  • (Xiao et al., 2011) proposed the Privlet method using wavelet transformation.
  • Applies a Haar wavelet transformation to the frequency matrix of the dataset.

Methodology

Uniform Grid method

  • Partitions the data domain into m×mm\times m grid cells of equal size.

  • The grid size(resolution) mm is recommended to be

    m=Nϵc(1) m=\sqrt{\dfrac{N\epsilon}{c}} \tag{1}

where NN is the number of data points. c=10c=10 worked well.

Proof

  • Sensitivity Δ\Delta of count query is 11.

  • Thus the noise follows the Laplace distribution Lap(1ϵ)\text{Lap}\left(\dfrac{1}{\epsilon}\right), with standard deviation of 2ϵ\dfrac{\sqrt{2}}{\epsilon}.

  • For m×mm\times m grid, suppose a query selects rr proportion of the domain.

  • Then, the query has standard deviation of 2rmϵ\dfrac{\sqrt{2r}m}{\epsilon}.

  • For rr proportion query, there are rm\sqrt{r}m cells along the border of the query.

  • That is, it includes on the order of rm×Nm2=rNm\sqrt{r}m\times \dfrac{N}{m^{2}} = \dfrac{\sqrt{r}N}{m} cells, which results the non-uniformity error rNc0m\dfrac{\sqrt{r}N}{c_{0}m} for some constant c0c_0.

Thus the error becomes

error=2rmϵ+rNmc0 \text{error}= \frac{\sqrt{2r}m}{\epsilon} + \frac{\sqrt{r}N}{mc_{0}}

and to minimize it, we should pick mm as equation (1)(1) where c=2c0c=\sqrt{2}c_{0}.

Adaptive Grid method

  • The main disadvantage of UG is that we treat each region without concerning their sparsity.
  • Instead of making equipartition, we first lay a coarse m1×m1m_{1}\times m_{1} grid over the data.
  • Then we issue a count query for each cell using a privacy budget αϵ\alpha\epsilon, where 0<α<10<\alpha<1.
  • AG then again partitions each cell based on NN' which is the noisy count of the cell.

Constrained Inference

Let vv be the noisy count of a specific first-level cell at AG method. Then, vv count is partitioned into u1,1,,um2,m2u_{1,1},\ldots,u_{m_{2},m_{2}} counts (m2×m2m_{2}\times m_{2} second-level grid). Then, one obtains more accurate count vv' by

v=α2m22(1α)2+α2m22+(1α)2(1α)2+α2m22ui,j v' = \dfrac{\alpha^{2}m_{2}^{2}}{(1-\alpha)^{2}+\alpha^{2}m_{2}^{2}}+\dfrac{(1-\alpha)^{2}}{(1-\alpha)^{2}+\alpha^{2}m_{2}^{2}}\sum u_{i,j}

This value goes to the following update:

ui,j=ui,j+(vui,j) u_{i,j}' = u_{i,j}+\left(v'-\sum u_{i,j}\right)

For AG, the second-level grid resolution m2m_{2} becomes

N(1α)ϵc/2 \left\lceil\sqrt{\dfrac{N'(1-\alpha)\epsilon}{c/2}}\right\rceil

where (1α)ϵ(1-\alpha)\epsilon indicates the remaining privacy budget,

Choosing m1m_{1} is not that important than m2m_{2} since the density of first-level grid affects on the second-level partition. But m1m_{1} should not be too small, thus it was set to

m1=max(10,14,Nϵc) m_{1}=\max \left(10, \frac{1}{4}, \left\lceil \sqrt{\frac{N\epsilon}{c}}\right\rceil\right)

Also α\alpha was set to 0.50.5.

Error

For query rr, A(r)A(r) denotes the correct answer and QM(r)Q_\mathcal{M}(r) denotes the answered value using the histogram using method M\mathcal{M}. Then, the relative error is defined as

REM(r)=QM(r)A(r)max{A(r),ρ} \text{RE}_\mathcal{M}(r) = \dfrac{\vert Q_\mathcal{M}(r)-A(r)\vert}{\max\lbrace A(r),\rho\rbrace }

where ρ=0.001D\rho=0.001\cdot \vert D\vert were set to avoid division by zero.

Experiment (GPT summarized)

In the experimental results section, the authors conducted extensive experiments using four real datasets to compare the accuracy of different methods and validate their analysis of parameter choices. The datasets include road intersections GPS coordinates, check-ins from a location-based social networking website, landmarks in the United States, and US storage facility locations.

The relative error is primarily considered, defined as the absolute error divided by the maximum of the correct answer or a threshold value. The experiments involve various query sizes, and the errors are computed for 200200 randomly generated queries per size. Two values of epsilon (ϵ\epsilon) are used (0.10.1 and 1.01.0) for privacy considerations.

The authors compare KD-tree-based methods (KD-standard and KD-hybrid) with Uniform Grid (UG), Adaptive Grid (AG), and Privlet methods. They also explore the effect of adding hierarchies to UG and evaluate adaptive grids' performance with varying parameters. The evaluation metrics include line graphs for the arithmetic mean of relative errors and candlesticks for a clearer comparison among different algorithms.

Results show that choosing an appropriate grid size is crucial, and the suggested sizes by the authors generally align well with experimentally observed optimal sizes. The authors compare KD-hybrid with UG and find that UG tends to perform better in most cases. Additionally, the authors investigate the impact of adding hierarchies and find that while it can improve accuracy, the benefit is relatively small.

The AG method consistently outperforms other methods, including KD-hybrid, Quadopt, UG with the lowest observed relative error size, and Privlet on the same grid size. The AG method with suggested grid sizes performs competitively with the experimentally observed best grid sizes. The authors also conduct a final comparison using absolute error, confirming that AG consistently and significantly outperforms other methods across various datasets and privacy settings.

References

  • G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu, “Differentially Private Spatial Decompositions,” in 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA: IEEE, Apr. 2012, pp. 20–31. doi: 10.1109/ICDE.2012.16.

  • Y. Xiao, L. Xiong, and C. Yuan, Differentially Private Data Release through Multidimensional Partitioning. 2010, p. 168. doi: 10.1007/978-3-642-15546-8_11.

  • W. Qardaji, Weining Yang, and Ninghui Li, “Differentially private grids for geospatial data,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD: IEEE, Apr. 2013, pp. 757–768. doi: 10.1109/ICDE.2013.6544872.