Skip to content

Aggregating

This module provides functionalities to make representative curves from data and find statistics for metadata.

Functions
  • _generate_filter_permutations(info_table, group_by) - Generates filter permutations for grouping data.
  • make_representative_data(ds, info_path, data_dir, repres_col, group_by_keys, interp_by, interp_res, interp_range, group_info_cols) - Creates representative curves from a dataset and saves them to a directory.
  • make_representative_info(ds, group_by_keys, group_info_cols) - Creates a table of representative information for each group in a DataSet.

make_representative_data(ds, info_path, data_dir, repres_col, group_by_keys, interp_by, interp_res=200, interp_range='outer', group_info_cols=None)

Make representative curves of the DataSet and save them to a directory.

This function takes a DataSet, groups it by specific keys, and creates representative curves. The curves are then saved to a specified directory. It is useful for generating aggregated data curves that represent groups of similar tests.

Parameters:

Name Type Description Default
ds DataSet

The DataSet to make representative curves from.

required
info_path str

The path to the info file where the representative information will be saved.

required
data_dir str

The directory to save the representative curves to.

required
group_by_keys List[str]

The info columns to group the tests by.

required
repres_col str

The data column to aggregate for the y-axis of the representative curves.

required
interp_by str

The data column to interpolate for the x-axis of the representative curves.

required
interp_res int

The resolution of the interpolation.

200
interp_range Union[str, Tuple[float, float]]

Can be either "outer", "inner", or a tuple of floats, defining the domain on the x-axis for

'outer'
interpolation
  • If "outer", the domain is defined by the smallest minimum and the largest maximum values of the interpolation column in the representative subset.
  • If "inner", the domain is defined by the largest minimum and the smallest maximum values of the interpolation column in the representative subset.
  • If a tuple, the domain is directly defined by the values within the tuple.
required
group_info_cols Optional[List[str]]

The info categories to include in the aggregated info_table.

None

Returns:

Type Description

None

Examples:

Imagine you have performed a series of stress tests on different materials at various temperatures. You have collected all the data in a DataSet and want to create representative stress-strain curves for each combination of material and temperature. Here's how you can use this function:

>>> import paramaterial as pam
>>> ds = pam.DataSet('info/test_info.csv','data/tests')  # Load your dataset
>>> pam.make_representative_data(ds, 'info/representative_info.xlsx', 'data/representative_curves',
>>>                              repres_col='Stress_MPa', group_by_keys=['material', 'temperature'],
interp_by='Strain')

This will create representative curves for each material and temperature group, saving them to the specified directory and information to an Excel file.

Source code in paramaterial\aggregating.py
def make_representative_data(ds: DataSet, info_path: str, data_dir: str, repres_col: str, group_by_keys: List[str],
                             interp_by: str, interp_res: int = 200,
                             interp_range: Union[str, Tuple[float, float]] = 'outer',
                             group_info_cols: Optional[List[str]] = None):
    """Make representative curves of the DataSet and save them to a directory.

     This function takes a DataSet, groups it by specific keys, and creates representative curves. The curves are
     then saved to a specified directory. It is useful for generating aggregated data curves that represent groups of
     similar tests.

     Args:
         ds: The DataSet to make representative curves from.
         info_path: The path to the info file where the representative information will be saved.
         data_dir: The directory to save the representative curves to.
         group_by_keys: The info columns to group the tests by.
         repres_col: The data column to aggregate for the y-axis of the representative curves.
         interp_by: The data column to interpolate for the x-axis of the representative curves.
         interp_res: The resolution of the interpolation.
         interp_range: Can be either "outer", "inner", or a tuple of floats, defining the domain on the x-axis for
         interpolation:

            - If "outer", the domain is defined by the smallest minimum and the largest maximum values of the
            interpolation column in the representative subset.
            - If "inner", the domain is defined by the largest minimum and the smallest maximum values of the
            interpolation column in the representative subset.
            - If a tuple, the domain is directly defined by the values within the tuple.
         group_info_cols: The info categories to include in the aggregated info_table.

     Returns:
         None

     Examples:
        Imagine you have performed a series of stress tests on different materials at various temperatures. You have
        collected all the data in a DataSet and want to create representative stress-strain curves for each
        combination of material and temperature. Here's how you can use this function:

        >>> import paramaterial as pam
        >>> ds = pam.DataSet('info/test_info.csv','data/tests')  # Load your dataset
        >>> pam.make_representative_data(ds, 'info/representative_info.xlsx', 'data/representative_curves',
        >>>                              repres_col='Stress_MPa', group_by_keys=['material', 'temperature'],
        interp_by='Strain')

        This will create representative curves for each material and temperature group, saving them to the specified
        directory and information to an Excel file.
    """

    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    value_lists = [ds.info_table[col].unique() for col in group_by_keys]

    # make a dataset filter for each representative curve
    subset_filters = []
    for i in range(len(value_lists[0])):  # i
        subset_filters.append({group_by_keys[0]: value_lists[0][i]})
    for i in range(1, len(group_by_keys)):  # i
        new_filters = []
        for fltr in subset_filters:  # j
            for value in value_lists[i]:  # k
                new_filter = fltr.copy()
                new_filter[group_by_keys[i]] = value
                new_filters.append(new_filter)
        subset_filters = new_filters

    # make list of repres_ids and initialise info table for the representative data
    repres_ids = [f'repres_id_{i + 1:0>4}' for i in range(len(subset_filters))]
    repr_info_table = pd.DataFrame(columns=['repres_id'] + group_by_keys)

    # make representative curves and take means of info table columns
    for repres_id, subset_filter in zip(repres_ids, subset_filters):
        # get representative subset
        repres_subset = ds.subset(subset_filter)
        if repres_subset.info_table.empty:
            continue
        # add row to repr_info_table
        repr_info_table = pd.concat([repr_info_table, pd.DataFrame(
            {'repres_id': [repres_id], **subset_filter, 'nr averaged': [len(repres_subset)]})])

        # add means of group info columns to repr_info_table
        if group_info_cols is not None:
            for col in group_info_cols:
                df_col = repres_subset.info_table[col]
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, '' + col] = df_col.mean()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'std_' + col] = df_col.std()
                repr_info_table.loc[
                    repr_info_table['repres_id'] == repres_id, 'upstd_' + col] = df_col.mean() + df_col.std()
                repr_info_table.loc[
                    repr_info_table['repres_id'] == repres_id, 'downstd_' + col] = df_col.mean() - df_col.std()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'max_' + col] = df_col.max()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'min_' + col] = df_col.min()

        # find minimum of maximum interp_by vals in subset
        if interp_range == 'outer':
            min_interp_val = min([min(dataitem.data[interp_by]) for dataitem in repres_subset])
            max_interp_val = max([max(dataitem.data[interp_by]) for dataitem in repres_subset])
        elif interp_range == 'inner':
            min_interp_val = max([min(dataitem.data[interp_by]) for dataitem in repres_subset])
            max_interp_val = min([max(dataitem.data[interp_by]) for dataitem in repres_subset])
        elif type(interp_range) == tuple:
            min_interp_val = interp_range[0]
            max_interp_val = interp_range[1]
        else:
            raise ValueError(f'interp_range must be "outer", "inner" or a tuple, not {interp_range}')

        # make monotonically increasing vector to interpolate by
        interp_vec = np.linspace(min_interp_val, max_interp_val, interp_res)

        # make interpolated data for averaging, staring at origin
        interp_data = pd.DataFrame(data={interp_by: interp_vec})

        for n, dataitem in enumerate(repres_subset):
            # drop columns and rows outside interp range
            data = dataitem.data[[interp_by, repres_col]].reset_index(drop=True)
            data = data[(data[interp_by] <= max_interp_val) & (data[interp_by] >= min_interp_val)]
            # interpolate the repr_by column and add to interp_data
            # add 0 to start of data to ensure interpolation starts at origin
            interp_data[f'interp_{repres_col}_{n}'] = np.interp(interp_vec, data[interp_by].tolist(),
                                                                data[repres_col].tolist())

        # make representative data from stats of interpolated data
        interp_data = interp_data.drop(columns=[interp_by])
        repr_data = pd.DataFrame({f'{interp_by}': interp_vec})
        repr_data[f'{repres_col}'] = interp_data.mean(axis=1)
        repr_data[f'std_{repres_col}'] = interp_data.std(axis=1)
        repr_data[f'up_std_{repres_col}'] = repr_data[f'{repres_col}'] + repr_data[f'std_{repres_col}']
        repr_data[f'down_std_{repres_col}'] = repr_data[f'{repres_col}'] - repr_data[f'std_{repres_col}']
        repr_data[f'up_2std_{repres_col}'] = repr_data[f'{repres_col}'] + 2 * repr_data[f'std_{repres_col}']
        repr_data[f'down_2std_{repres_col}'] = repr_data[f'{repres_col}'] - 2 * repr_data[f'std_{repres_col}']
        repr_data[f'up_3std_{repres_col}'] = repr_data[f'{repres_col}'] + 3 * repr_data[f'std_{repres_col}']
        repr_data[f'down_3std_{repres_col}'] = repr_data[f'{repres_col}'] - 3 * repr_data[f'std_{repres_col}']
        repr_data[f'min_{repres_col}'] = interp_data.min(axis=1)
        repr_data[f'max_{repres_col}'] = interp_data.max(axis=1)
        repr_data[f'q1_{repres_col}'] = interp_data.quantile(0.25, axis=1)
        repr_data[f'q3_{repres_col}'] = interp_data.quantile(0.75, axis=1)

        # write the representative data and info
        repr_data.to_csv(os.path.join(data_dir, f'{repres_id}.csv'), index=False)
        repr_info_table.to_excel(info_path, index=False)

make_representative_info(ds, group_by_keys, group_info_cols=None)

Make a table of representative info for each group in a DataSet.

Parameters:

Name Type Description Default
ds DataSet

DataSet to make representative info from.

required
group_by_keys List[str]

Columns to group by and make representative info for.

required
group_info_cols List[str]

Columns to include in representative info table.

None

Returns:

Type Description
pd.DataFrame

A pandas DataFrame containing the representative information table.

Examples:

To create a summary table that includes specific mechanical properties like Elastic Modulus (E), Proof Stress (PS), Ultimate Tensile Strength (UTS), for each temperature and material type:

>>> import paramaterial as pam
>>> table = pam.make_representative_info(ds, group_by_keys=['temperature', 'material'], group_info_cols=['E', 'PS', 'UTS'])
>>> print(table.head())

The result will be a DataFrame containing representative information for each group, including the mean, standard deviation, maximum, minimum, and 1st and 3rd quartiles of the specified columns.

Source code in paramaterial\aggregating.py
def make_representative_info(ds: DataSet, group_by_keys: List[str], group_info_cols: List[str] = None) -> pd.DataFrame:
    """Make a table of representative info for each group in a DataSet.

    Args:
        ds: DataSet to make representative info from.
        group_by_keys: Columns to group by and make representative info for.
        group_info_cols: Columns to include in representative info table.

    Returns:
        A pandas DataFrame containing the representative information table.

    Examples:
        To create a summary table that includes specific mechanical properties like Elastic Modulus (E), Proof Stress
        (PS), Ultimate Tensile Strength (UTS), for each temperature and material type:

        >>> import paramaterial as pam
        >>> table = pam.make_representative_info(ds, group_by_keys=['temperature', 'material'], group_info_cols=['E', 'PS', 'UTS'])
        >>> print(table.head())

        The result will be a DataFrame containing representative information for each group, including the mean, standard
        deviation, maximum, minimum, and 1st and 3rd quartiles of the specified columns.
    """
    subset_filters = []
    value_lists = [ds.info_table[col].unique() for col in group_by_keys]
    for i in range(len(value_lists[0])):
        subset_filters.append({group_by_keys[0]: [value_lists[0][i]]})
    for i in range(1, len(group_by_keys)):
        new_filters = []
        for fltr in subset_filters:
            for value in value_lists[i]:
                new_filter = fltr.copy()
                new_filter[group_by_keys[i]] = [value]
                new_filters.append(new_filter)
        subset_filters = new_filters

    # make list of repres_ids and initialise info table for the representative data
    repres_ids = [f'repres_id_{i + 1:0>4}' for i in range(len(subset_filters))]
    repr_info_table = pd.DataFrame(columns=['repres_id'] + group_by_keys)

    for fltr, repres_id in zip(subset_filters, repres_ids):
        # get representative subset
        repr_subset = ds.subset(fltr)
        if repr_subset.info_table.empty:
            continue
        # add row to repr_info_table
        repr_info_table = pd.concat(
            [repr_info_table, pd.DataFrame({'repres_id': [repres_id], **fltr, 'nr averaged': [len(repr_subset)]})])

        # add means of group info columns to repr_info_table
        if group_info_cols is not None:
            for col in group_info_cols:
                df_col = repr_subset.info_table[col]
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, '' + col] = df_col.mean()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'std_' + col] = df_col.std()
                repr_info_table.loc[
                    repr_info_table['repres_id'] == repres_id, 'upstd_' + col] = df_col.mean() + df_col.std()
                repr_info_table.loc[
                    repr_info_table['repres_id'] == repres_id, 'downstd_' + col] = df_col.mean() - df_col.std()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'max_' + col] = df_col.max()
                repr_info_table.loc[repr_info_table['repres_id'] == repres_id, 'min_' + col] = df_col.min()

    return repr_info_table