Quick start

This quick start guide helps you get up and running with DPSQL+.

Tutorial

We recommend following the tutorial in the examples directory to better understand how to use DPSQL+.

How to Use

To begin, install the project by following Installation.

To use DPSQL+, prepare the following:

A standard SQL query where the final SELECT clause consists of aggregation expressions
Privacy budget parameters (epsilon, delta)
A Spark session object with Python bindings (SQLite3 and DuckDB backends are also supported)
The database and table names used in the query
The names of the columns representing user IDs (e.g., privacy_unit) for each table, where applicable

Initialize components:

Import and initialize the necessary components. For detailed information about each component, refer to the Overview.

from dpsql.backend import SparkSQLBackend
from dpsql.engine import Engine
from dpsql.validator import Validator
from dpsql.accountant import RenyiAccountant

sql_backend = SparkSQLBackend(spark)
validator = Validator()
accountant = RenyiAccountant(epsilon, delta)
engine = Engine(accountant, sql_backend, validator)

Register databases:

Register the databases and tables you want to query with the engine, and specify privacy unit columns:
engine.register_database("TITANIC", privacy_unit_columns={"passengers_features": "id", "passengers_survived": "id"})
engine.register_database("SEX", privacy_unit_columns={})

Set query parameters:

Define the parameters for the query:

from dpsql.dp_params import DPParams

contribution_bound = 1
min_frequency = 10
epsilon_per_query = 0.5
delta_per_query = 5e-5
clipping = 1

dpparams = DPParams(
    contribution_bound=contribution_bound,
    min_frequency=min_frequency,
    epsilon=epsilon_per_query,
    delta=delta_per_query,
    clipping_thresholds=[None, [(0, clipping)]]
)

Run the query:

Execute the query with the specified parameters:

sql = """
WITH combined_data_tmp AS (
    SELECT
        e.id,
        e.who,
        s.survived
    FROM
        TITANIC.passengers_features AS e
    JOIN
        TITANIC.passengers_survived AS s ON e.id = s.id
), combined_data AS (
    SELECT
        c.id,
        c.who,
        c.survived,
        w.adult_male
    FROM
        combined_data_tmp AS c
    JOIN
        SEX.is_adult_male AS w ON c.who = w.who
)
SELECT
    adult_male, COUNT(adult_male), SUM(survived)
FROM
    combined_data
GROUP BY
    adult_male
"""

result = engine.execute_query(sql, dpparams)
result.show()

Additional Usage

Specify sigma values directly:

You can specify sigma values directly instead of epsilon and delta:

from dpsql.dp_params import DPParams

contribution_bound = 1
min_frequency = 10
tau = 100
sigma = 20
sigma_for_thresholding = sigma

dpparams = DPParams(contribution_bound=contribution_bound, min_frequency=min_frequency, tau=tau, sigma_for_thresholding=sigma_for_thresholding, sigmas=[sigma], clipping_thresholds=[None])