Quick start

This quick start guide helps you get up and running with DPSQL+.

Tutorial

We recommend following the tutorial in the examples directory to better understand how to use DPSQL+.

How to Use

To begin, install the project by following Installation.

To use DPSQL+, prepare the following:

  • A standard SQL query where the final SELECT clause consists of aggregation expressions

  • Privacy budget parameters (epsilon, delta)

  • A Spark session object with Python bindings (SQLite3 and DuckDB backends are also supported)

  • The database and table names used in the query

  • The names of the columns representing user IDs (e.g., privacy_unit) for each table, where applicable

  1. Initialize components:

    Import and initialize the necessary components. For detailed information about each component, refer to the Overview.

    from dpsql.backend import SparkSQLBackend
    from dpsql.engine import Engine
    from dpsql.validator import Validator
    from dpsql.accountant import RenyiAccountant
    
    sql_backend = SparkSQLBackend(spark)
    validator = Validator()
    accountant = RenyiAccountant(epsilon, delta)
    engine = Engine(accountant, sql_backend, validator)
    
  2. Register databases:

    Register the databases and tables you want to query with the engine, and specify privacy unit columns:

    engine.register_database("TITANIC", privacy_unit_columns={"passengers_features": "id", "passengers_survived": "id"})
    engine.register_database("SEX", privacy_unit_columns={})
    
  3. Set query parameters:

    Define the parameters for the query:

    from dpsql.dp_params import DPParams
    
    contribution_bound = 1
    min_frequency = 10
    epsilon_per_query = 0.5
    delta_per_query = 5e-5
    clipping = 1
    
    dpparams = DPParams(
        contribution_bound=contribution_bound,
        min_frequency=min_frequency,
        epsilon=epsilon_per_query,
        delta=delta_per_query,
        clipping_thresholds=[None, [(0, clipping)]]
    )
    
  4. Run the query:

    Execute the query with the specified parameters:

    sql = """
    WITH combined_data_tmp AS (
        SELECT
            e.id,
            e.who,
            s.survived
        FROM
            TITANIC.passengers_features AS e
        JOIN
            TITANIC.passengers_survived AS s ON e.id = s.id
    ), combined_data AS (
        SELECT
            c.id,
            c.who,
            c.survived,
            w.adult_male
        FROM
            combined_data_tmp AS c
        JOIN
            SEX.is_adult_male AS w ON c.who = w.who
    )
    SELECT
        adult_male, COUNT(adult_male), SUM(survived)
    FROM
        combined_data
    GROUP BY
        adult_male
    """
    
    result = engine.execute_query(sql, dpparams)
    result.show()
    

Additional Usage

  • Specify sigma values directly:

    You can specify sigma values directly instead of epsilon and delta:

    from dpsql.dp_params import DPParams
    
    contribution_bound = 1
    min_frequency = 10
    tau = 100
    sigma = 20
    sigma_for_thresholding = sigma
    
    dpparams = DPParams(contribution_bound=contribution_bound, min_frequency=min_frequency, tau=tau, sigma_for_thresholding=sigma_for_thresholding, sigmas=[sigma], clipping_thresholds=[None])