Quick start
This quick start guide helps you get up and running with DPSQL+.
Tutorial
We recommend following the tutorial in the examples directory to better understand how to use DPSQL+.
How to Use
To begin, install the project by following Installation.
To use DPSQL+, prepare the following:
A standard SQL query where the final
SELECTclause consists of aggregation expressionsPrivacy budget parameters (epsilon, delta)
A Spark session object with Python bindings (SQLite3 and DuckDB backends are also supported)
The database and table names used in the query
The names of the columns representing user IDs (e.g.,
privacy_unit) for each table, where applicable
Initialize components:
Import and initialize the necessary components. For detailed information about each component, refer to the Overview.
from dpsql.backend import SparkSQLBackend from dpsql.engine import Engine from dpsql.validator import Validator from dpsql.accountant import RenyiAccountant sql_backend = SparkSQLBackend(spark) validator = Validator() accountant = RenyiAccountant(epsilon, delta) engine = Engine(accountant, sql_backend, validator)
Register databases:
Register the databases and tables you want to query with the engine, and specify privacy unit columns:
engine.register_database("TITANIC", privacy_unit_columns={"passengers_features": "id", "passengers_survived": "id"}) engine.register_database("SEX", privacy_unit_columns={})
Set query parameters:
Define the parameters for the query:
from dpsql.dp_params import DPParams contribution_bound = 1 min_frequency = 10 epsilon_per_query = 0.5 delta_per_query = 5e-5 clipping = 1 dpparams = DPParams( contribution_bound=contribution_bound, min_frequency=min_frequency, epsilon=epsilon_per_query, delta=delta_per_query, clipping_thresholds=[None, [(0, clipping)]] )
Run the query:
Execute the query with the specified parameters:
sql = """ WITH combined_data_tmp AS ( SELECT e.id, e.who, s.survived FROM TITANIC.passengers_features AS e JOIN TITANIC.passengers_survived AS s ON e.id = s.id ), combined_data AS ( SELECT c.id, c.who, c.survived, w.adult_male FROM combined_data_tmp AS c JOIN SEX.is_adult_male AS w ON c.who = w.who ) SELECT adult_male, COUNT(adult_male), SUM(survived) FROM combined_data GROUP BY adult_male """ result = engine.execute_query(sql, dpparams) result.show()
Additional Usage
Specify sigma values directly:
You can specify sigma values directly instead of epsilon and delta:
from dpsql.dp_params import DPParams contribution_bound = 1 min_frequency = 10 tau = 100 sigma = 20 sigma_for_thresholding = sigma dpparams = DPParams(contribution_bound=contribution_bound, min_frequency=min_frequency, tau=tau, sigma_for_thresholding=sigma_for_thresholding, sigmas=[sigma], clipping_thresholds=[None])