Horizontal Federated-Local Differential Privacy Inference Result Protection
Privacy Protection Background
The quality of federated unsupervised model training can be judged by clustering evaluation indicators computed on the cloud side from the inference results uploaded by clients; since these raw inference results can reveal user information, they need privacy protection before upload.
Effectively integrating privacy protection technology into product services helps, on the one hand, to enhance the trust of users and the industry in the products and technology, and on the other hand, to carry out federated tasks under current privacy compliance requirements and to cover the full lifecycle (training-inference-evaluation) with privacy protection.
Algorithm Analysis
Local Differential Privacy Definition and Paradigm
A randomized mechanism $M$ satisfies $\epsilon$-local differential privacy if, for any two inputs $x$ and $x'$ and any output $y$:

$$\Pr[M(x)=y] \le e^{\epsilon}\Pr[M(x')=y]$$

The inference result is generally a vector of floating-point numbers output by the model. In this paradigm, each client perturbs its own inference result locally before upload, so the cloud side only ever receives the perturbed vector.
Sensitivity
Local differential privacy introduces uncertainty on the data to be uploaded, and the sensitivity describes the upper bound used to calibrate that uncertainty: Gaussian noise is calibrated with the $L_2$ sensitivity, while the Laplace noise used here is calibrated with the $L_1$ sensitivity

$$\Delta f=\max_{x,x'}\lVert f(x)-f(x')\rVert_1$$

In this scenario, every dimension of the inference result lies in a bounded range, so the sensitivity is bounded by the width of that range.
Laplace Distribution
The Laplace distribution is continuous, and the probability density function of the Laplace distribution with mean 0 and scale $\beta$ is:

$$f(x\mid\beta)=\frac{1}{2\beta}e^{-\frac{|x|}{\beta}}$$
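As a quick numeric sanity check (the scale $\beta$ is chosen arbitrarily for illustration), the density $\frac{1}{2\beta}e^{-|x|/\beta}$ integrates to 1 and has variance $2\beta^2$:

```python
import numpy as np

beta = 0.5
xs = np.linspace(-40.0, 40.0, 400_001)
dx = xs[1] - xs[0]
# Laplace density with mean 0 and scale beta
pdf = np.exp(-np.abs(xs) / beta) / (2 * beta)

# Riemann sums over a wide interval approximate total mass and variance
total = float(pdf.sum() * dx)
variance = float((xs**2 * pdf).sum() * dx)
print(total, variance)
```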
Laplace Mechanism
For a query function $f$ with $L_1$ sensitivity $\Delta f$, the Laplace mechanism outputs

$$M(x)=f(x)+\mathrm{Lap}\left(\frac{\Delta f}{\epsilon}\right)$$

where $\mathrm{Lap}(\beta)$ denotes a random variable drawn from the Laplace distribution with mean 0 and scale $\beta=\Delta f/\epsilon$. In this scenario, independent Laplace noise is added to each dimension of the inference result on the device before upload.
Proving that the Laplace Mechanism Satisfies $\epsilon$-Local Differential Privacy
Any two different clients, after being processed by the Laplace mechanism, can output the same result, which achieves indistinguishability by confusion, and the ratio of the probabilities of outputting the same result has a tight upper bound. Substituting the Laplace density, for any inputs $x$, $x'$ and output $y$:

$$\frac{\Pr[M(x)=y]}{\Pr[M(x')=y]}=\frac{e^{-|y-f(x)|/\beta}}{e^{-|y-f(x')|/\beta}}=e^{\frac{|y-f(x')|-|y-f(x)|}{\beta}}\le e^{\frac{|f(x)-f(x')|}{\beta}}\le e^{\frac{\Delta f}{\beta}}=e^{\epsilon}$$

so the mechanism satisfies $\epsilon$-local differential privacy.
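The bound on the probability ratio can also be checked numerically; a minimal sketch with illustrative values for the budget, sensitivity, and the two query outputs (none of these numbers come from the framework):

```python
import numpy as np

def laplace_pdf(y, mu, beta):
    # Density of the Laplace distribution centered at mu with scale beta
    return np.exp(-np.abs(y - mu) / beta) / (2 * beta)

eps = 1.0           # illustrative privacy budget
delta_f = 2.0       # illustrative L1 sensitivity
beta = delta_f / eps

# Two query outputs whose difference is within the sensitivity
f_x, f_x2 = 0.3, -1.1
ys = np.linspace(-10, 10, 2001)

# The density ratio never exceeds e^eps, so no output reveals which input produced it
ratio = laplace_pdf(ys, f_x, beta) / laplace_pdf(ys, f_x2, beta)
print(float(ratio.max()), np.exp(eps))
```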
The Determination of $\epsilon$ and the Corresponding Probability Density Plot
A privacy budget with high availability is calculated by combining the data characteristics, such as the requirement that the output noise be of a given order of magnitude. Since the noise scale is $\beta=\Delta f/\epsilon$, fixing the target noise magnitude determines the budget:

$$\epsilon=\frac{\Delta f}{\beta}$$

i.e. the value laplace_eval_eps=230260 used in the configuration below.
When the privacy budget takes this value, the Laplace probability density function is as follows:
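For intuition on how the budget controls the noise magnitude: since the scale is $\beta=\Delta f/\epsilon$, a larger budget yields proportionally smaller noise. A small sketch with illustrative values (the sensitivity here is an assumption, not the framework's value):

```python
import numpy as np

rng = np.random.default_rng(0)
delta_f = 1.0  # illustrative L1 sensitivity, not the framework's value
scales = {}
for eps in (1.0, 10.0, 100.0):
    beta = delta_f / eps
    noise = rng.laplace(loc=0.0, scale=beta, size=100_000)
    # E|Lap(beta)| = beta, so larger budgets mean proportionally smaller noise
    scales[eps] = float(np.mean(np.abs(noise)))
print(scales)
```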
Impact Analysis of Clustering Evaluation Indicators
Using the Calinski-Harabasz Index assessment method as an example, the evaluation indicator is calculated in two steps:
1. For each class, calculate the sum of squared distances from all points in the class to the center of that class (the within-class dispersion);
2. Calculate the weighted sum of squared distances from each class center to the center of all points (the between-class dispersion).
Source code implementation and impact analysis after noise addition:
import numpy as np

def calinski_harabasz_score(X, labels):
    # 1.The cloud-side clustering algorithm assigned each point the class ordinal it belongs to, with impact
    n_samples = X.shape[0]
    n_labels = len(np.unique(labels))
    extra_disp, intra_disp = 0.0, 0.0
    # 2.Calculate the center of all points, without impact (the expectation of the noise sum is 0)
    mean = np.mean(X, axis=0)
    for k in range(n_labels):
        # 3.Get all points in class k, based on the effect of 1
        cluster_k = X[labels == k]
        # 4.Get the class center, based on the impact of 1
        mean_k = np.mean(cluster_k, axis=0)
        # 5.The distance between the class center and the center of all points, based on the impact of 1
        extra_disp += len(cluster_k) * np.sum((mean_k - mean) ** 2)
        # 6.The distance from the point to the center of the class, with impact
        intra_disp += np.sum((cluster_k - mean_k) ** 2)
    return (
        1.0
        if intra_disp == 0.0
        else extra_disp * (n_samples - n_labels) / (intra_disp * (n_labels - 1.0))
    )
In a comprehensive analysis, the main impact of noise addition is on the class assignment produced by the clustering algorithm and on the error in the distance calculations. When calculating a class center, the error introduced is small, because the expectation of the sum of the noise is 0.
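This impact can be reproduced on synthetic data; a self-contained sketch (the cluster layout, the noise scale 0.1, and the helper name ch_score are illustrative assumptions, not the framework's implementation):

```python
import numpy as np

def ch_score(X, labels):
    # Calinski-Harabasz index: between-class over within-class dispersion
    n_samples, n_labels = X.shape[0], len(np.unique(labels))
    mean = np.mean(X, axis=0)
    extra_disp = intra_disp = 0.0
    for k in np.unique(labels):
        cluster_k = X[labels == k]
        mean_k = np.mean(cluster_k, axis=0)
        extra_disp += len(cluster_k) * np.sum((mean_k - mean) ** 2)
        intra_disp += np.sum((cluster_k - mean_k) ** 2)
    return extra_disp * (n_samples - n_labels) / (intra_disp * (n_labels - 1.0))

rng = np.random.default_rng(42)
# Two well-separated synthetic clusters
X = np.vstack([rng.normal(0.0, 0.3, (200, 2)), rng.normal(5.0, 0.3, (200, 2))])
labels = np.repeat([0, 1], 200)

clean = ch_score(X, labels)
# Per-dimension Laplace noise, as uploaded by the clients
noisy = ch_score(X + rng.laplace(0.0, 0.1, X.shape), labels)

# The noise inflates within-class dispersion, so the score drops but stays informative
print(clean, noisy)
```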
Taking SILHOUETTE SCORE as an example, the process of calculating this evaluation indicator is divided into two steps:

1. Calculate the average distance $a(i)$ of sample $i$ from all other sample points in the same cluster; the smaller the value, the more the sample should be assigned to this cluster.
2. Calculate the average distance $b_{ij}$ of sample $i$ to all samples of some other cluster $C_j$, which is called the dissimilarity of sample $i$ to cluster $C_j$; the inter-cluster dissimilarity of sample $i$ is defined as $b(i)=\min_j b_{ij}$. The larger the value, the less the sample should belong to that cluster.

The silhouette of sample $i$ is $s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}$; the smaller $a(i)$ and the larger $b(i)$, the closer $s(i)$ is to 1 and the better the clustering. The final score is the average of $s(i)$ over all samples.
Pseudocode implementation and impact analysis after noise addition:
// Calculate the distance matrix (trading space for time, upper-triangular storage), which is affected by the noise addition
euclidean_distance_matrix(&distance_matrix, group_ids);
// Perform the same calculation for each point, and finally calculate the mean value
std::vector<float> s_i(n_samples, 0.0f);
for (size_t i = 0; i < n_samples; ++i) {
    size_t label_i = labels[i];
    std::vector<float> a_distances;
    std::unordered_map<size_t, std::vector<float>> b_i_map;
    for (size_t j = 0; j < n_samples; ++j) {
        if (j == i) {
            continue;
        }
        size_t label_j = labels[j];
        float distance = distance_matrix[i][j];
        // Same cluster: collect distances for a_i
        if (label_j == label_i) {
            a_distances.push_back(distance);
        } else {
            // Different clusters: collect distances for b_i
            b_i_map[label_j].push_back(distance);
        }
    }
    float a_i = 0.0f;
    if (a_distances.size() > 0) {
        // Calculate the average distance of the point from other points in the same cluster
        a_i = std::accumulate(a_distances.begin(), a_distances.end(), 0.0f) / a_distances.size();
    }
    // b_i is the minimum of the average distances to each of the other clusters
    float b_i = std::numeric_limits<float>::max();
    for (auto &item : b_i_map) {
        auto &b_i_distances = item.second;
        float b_i_distance = std::accumulate(b_i_distances.begin(), b_i_distances.end(), 0.0f) / b_i_distances.size();
        b_i = std::min(b_i, b_i_distance);
    }
    if (a_i == 0) {
        s_i[i] = 0;
    } else {
        s_i[i] = (b_i - a_i) / std::max(a_i, b_i);
    }
}
return std::accumulate(s_i.begin(), s_i.end(), 0.0f) / n_samples;
As above, the main impact of noise addition is on the class assignment produced by the clustering algorithm and on the error in the distance calculation.
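The same effect can be checked for the silhouette score with a compact, self-contained sketch (the silhouette helper, data layout, and noise scale are illustrative assumptions, not the framework's implementation):

```python
import numpy as np

def silhouette(X, labels):
    # Mean over all samples of s(i) = (b(i) - a(i)) / max(a(i), b(i))
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        a_i = dist[i][same].mean() if same.any() else 0.0
        # b(i): minimum over the other clusters of the mean distance to that cluster
        b_i = min(dist[i][labels == k].mean()
                  for k in np.unique(labels) if k != labels[i])
        scores.append(0.0 if a_i == 0 else (b_i - a_i) / max(a_i, b_i))
    return float(np.mean(scores))

rng = np.random.default_rng(7)
# Two well-separated synthetic clusters
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(4.0, 0.3, (50, 2))])
labels = np.repeat([0, 1], 50)

clean = silhouette(X, labels)
# The added Laplace noise inflates intra-cluster distances, lowering the score slightly
noisy = silhouette(X + rng.laplace(0.0, 0.1, X.shape), labels)
print(clean, noisy)
```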
End-side Java Implementation
There is no function in the Java standard library that generates Laplace-distributed random numbers, so they are generated by the following combination of uniform random numbers: an exponentially distributed magnitude with a uniformly random sign.
The source code is as follows:
float genLaplaceNoise(SecureRandom secureRandom, float beta) {
    // u1 chooses the sign with probability 1/2; u2 drives the exponential
    // magnitude via inverse-transform sampling, giving Laplace(0, beta)
    float u1 = secureRandom.nextFloat();
    float u2 = secureRandom.nextFloat();
    // Use 1 - u2 so the argument of log is never 0 (nextFloat() can return 0)
    if (u1 <= 0.5f) {
        return (float) (-beta * log(1. - u2));
    } else {
        return (float) (beta * log(1. - u2));
    }
}
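This sampling strategy can be sanity-checked against the known moments of the Laplace distribution; a sketch in Python (NumPy's Generator stands in for SecureRandom, and the function name is illustrative):

```python
import numpy as np

def gen_laplace_noise(rng, beta):
    # Mirror of the Java strategy: a fair coin for the sign,
    # and an exponentially distributed magnitude via inverse transform
    u1 = rng.random()
    u2 = rng.random()
    if u1 <= 0.5:
        return -beta * np.log(1.0 - u2)
    return beta * np.log(1.0 - u2)

rng = np.random.default_rng(0)
beta = 0.5
samples = np.array([gen_laplace_noise(rng, beta) for _ in range(200_000)])

# Laplace(0, beta) has mean 0 and variance 2 * beta**2
print(samples.mean(), samples.var())
```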
After the device obtains a new round of the model, it executes the inference calculation immediately. When training ends, the privacy-protected inference results are uploaded to the cloud side together with the new model, and the cloud side finally performs operations such as clustering and score calculation. The flow is shown in the following figure, where the red part is the output of the privacy protection processing:
Quick Start
Preparation
To use this feature, first successfully complete the training aggregation process of an end-cloud federated scenario. Implementing an Image Classification Application of Cross-device Federated Learning (x86) details the preparation of datasets and network models, as well as how to simulate multiple clients participating in federated learning.
Configuration Items
The cloud-side yaml configuration file gives the complete configuration items for enabling the end-cloud federated scenario; this feature involves the following additional configuration items:
encrypt:
privacy_eval_type: LAPLACE
laplace_eval:
laplace_eval_eps: 230260
where privacy_eval_type currently supports only NOT_ENCRYPT and LAPLACE, indicating that the inference results are uploaded without privacy protection or processed with the LAPLACE mechanism, respectively. laplace_eval_eps indicates the privacy budget used when LAPLACE processing is selected.
Experimental Results
The basic configuration associated with the inference result evaluation function is as follows:
unsupervised:
cluster_client_num: 1000
eval_type: SILHOUETTE_SCORE
The comparison between using NOT_ENCRYPT and using the LAPLACE mechanism with laplace_eval_eps=230260 is shown in the figure:
The red dashed line shows the SILHOUETTE scores after the Laplace mechanism is used to protect the inference results. Since the model contains