ScmDataFrame

ScmDataFrame provides a high-level analysis tool for simple climate model relevant data. It provides a simple interface for reading/writing, subsetting and visualising model data. ScmDataFrames can hold multiple model runs, which aids in the analysis of ensembles of runs.

class openscm.scmdataframe.ScmDataFrame(data, index=None, columns=None, **kwargs)

Bases: openscm.scmdataframe.base.ScmDataFrameBase

OpenSCM’s custom DataFrame implementation.

The ScmDataFrame implements a subset of the functionality provided by pyam’s IamDataFrame, but is focused on providing a performant way of storing time series data and the metadata associated with those time series.

For users who wish to take advantage of all of pyam’s functionality, please cast your ScmDataFrame to an IamDataFrame first with to_iamdataframe(). Note: this operation can be computationally expensive for large data sets because IamDataFrames store data in long/tidy form internally rather than in the ScmDataFrame’s more compact internal format.

__init__(data, index=None, columns=None, **kwargs)

Initialize.

Parameters
  • data (Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]) – A pd.DataFrame or data file with IAMC-format data columns, or a numpy array of timeseries data if columns is specified. If a string is passed, an attempt is made to read data from a file at that path.

  • index (Optional[Any]) – Only used if columns is not None. If index is not None, too, then this value sets the time index of the ScmDataFrameBase instance. If index is None and columns is not None, the index is taken from data.

  • columns (Optional[Dict[str, list]]) –

    If None, ScmDataFrameBase will attempt to infer the values from the source. Otherwise, use this dict to write the metadata for each timeseries in data. For each metadata key (e.g. “model”, “scenario”), an array of values (one per time series) is expected. Alternatively, providing a list of length 1 applies the same value to all timeseries in data. For example, if you had three timeseries from ‘rcp26’ for three different models ‘model1’, ‘model2’ and ‘model3’, the column dict would look like either ‘col_1’ or ‘col_2’:

    >>> col_1 = {
        "scenario": ["rcp26"],
        "model": ["model1", "model2", "model3"],
        "region": ["unspecified"],
        "variable": ["unspecified"],
        "unit": ["unspecified"]
    }
    >>> col_2 = {
        "scenario": ["rcp26", "rcp26", "rcp26"],
        "model": ["model1", "model2", "model3"],
        "region": ["unspecified"],
        "variable": ["unspecified"],
        "unit": ["unspecified"]
    }
    >>> pd.testing.assert_frame_equal(
        ScmDataFrameBase(d, columns=col_1).meta,
        ScmDataFrameBase(d, columns=col_2).meta
    )
    

  • **kwargs – Additional parameters passed to pyam.core._read_file() to read files

Raises
  • ValueError – If metadata for [‘model’, ‘scenario’, ‘region’, ‘variable’, ‘unit’] is not found. A ValueError is also raised if you try to load from multiple files at once. If you wish to do this, please use df_append() instead.

  • TypeError – Timeseries cannot be read from data
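
A minimal constructor sketch (the values and shapes below are purely illustrative; data is assumed to be laid out with one row per time point and one column per timeseries):

>>> import numpy as np
>>> from openscm.scmdataframe import ScmDataFrame
>>> df = ScmDataFrame(
...     np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 2.5]]),  # rows: time points, columns: timeseries
...     index=[2005, 2010, 2015],
...     columns={
...         "model": ["a_iam"],
...         "scenario": ["a_scenario", "another_scenario"],
...         "region": ["World"],
...         "variable": ["Primary Energy"],
...         "unit": ["EJ/y"],
...     },
... )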

_apply_filters(filters, has_nan=True)

Determine rows to keep in data for given set of filters.

Parameters
  • filters (Dict[~KT, ~VT]) – Dictionary of filters ({col: values}); uses a pseudo-regexp syntax by default but if filters["regexp"] is True, regexp is used directly.

  • has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.

Returns

Two boolean np.ndarray’s. The first contains the columns to keep (i.e. which time points to keep). The second contains the rows to keep (i.e. which metadata matched the filters).

Return type

np.ndarray of bool, np.ndarray of bool

Raises

ValueError – Filtering cannot be performed on requested column

_day_match(values)
_sort_meta_cols()
append(other, inplace=False, duplicate_msg='warn', **kwargs)

Append additional data to the current dataframe.

For details, see df_append().

Parameters
  • other (Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]) – Data (in format which can be cast to ScmDataFrameBase) to append

  • inplace (bool) – If True, append data in place and return None. Otherwise, return a new ScmDataFrameBase instance with the appended data.

  • duplicate_msg (Union[str, bool]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. If False, take the average and do not raise a warning.

  • **kwargs – Keywords to pass to ScmDataFrameBase.__init__() when reading other

Returns

If not inplace, return a new ScmDataFrameBase instance containing the result of the append.

Return type

ScmDataFrameBase
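
A short usage sketch (scm_df and other_df are assumed to be existing, compatible ScmDataFrame instances):

>>> combined = scm_df.append(other_df)       # returns a new object containing both sets of timeseries
>>> scm_df.append(other_df, inplace=True)    # modifies scm_df in place and returns None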

convert_unit(unit, context=None, inplace=False, **kwargs)

Convert the units of a selection of timeseries.

Uses openscm.units.UnitConverter to perform the conversion.

Parameters
  • unit (str) – Unit to convert to. This must be recognised by UnitConverter.

  • context (Optional[str]) – Context to use for the conversion i.e. which metric to apply when performing CO2-equivalent calculations. If None, no metric will be applied and CO2-equivalent calculations will raise DimensionalityError.

  • inplace (bool) – If True, the operation is performed inplace, updating the underlying data. Otherwise a new ScmDataFrameBase instance is returned.

  • **kwargs – Extra arguments which are passed to filter() to limit the timeseries which are attempted to be converted. Defaults to selecting the entire ScmDataFrame, which will likely fail.

Returns

If not inplace, a new ScmDataFrameBase instance with the converted units.

Return type

ScmDataFrameBase
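
A hedged usage sketch; the variable names, unit string and GWP context below are illustrative and must match your data and the units known to UnitConverter:

>>> co2_mt = scm_df.convert_unit("Mt CO2 / yr", variable="Emissions|CO2")
>>> ch4_co2eq = scm_df.convert_unit(
...     "Mt CO2 / yr", context="AR4GWP100", variable="Emissions|CH4"
... )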

copy()

Return a copy.deepcopy() of self.

Returns

copy.deepcopy() of self

Return type

ScmDataFrameBase

data_hierarchy_separator = '|'
filter(keep=True, inplace=False, has_nan=True, **kwargs)

Return a filtered ScmDataFrame (i.e., a subset of the data).

Parameters
  • keep (bool) – If True, keep all timeseries satisfying the filters, otherwise drop all the timeseries satisfying the filters

  • inplace (bool) – If True, do operation inplace and return None

  • has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.

  • **kwargs

    Argument names are keys with which to filter, values are used to do the filtering. Filtering can be done on:

    • all metadata columns with strings, “*” can be used as a wildcard in search strings

    • ’level’: the maximum “depth” of IAM variables (number of hierarchy levels, excluding the strings given in the ‘variable’ argument)

    • ’time’: takes a datetime.datetime or list of datetime.datetime’s TODO: default to np.datetime64

    • ’year’, ‘month’, ‘day’, ‘hour’: takes an int or list of int’s (‘month’ and ‘day’ also accept str or list of str)

    If regexp=True is included in kwargs then the pseudo-regexp syntax in pattern_match is disabled.

Returns

If not inplace, return a new instance with the filtered data.

Return type

ScmDataFrameBase

Raises

AssertionError – Data and meta become unaligned
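
For illustration (assuming scm_df contains the metadata values used below), a few common filter calls:

>>> world_pe = scm_df.filter(variable="Primary Energy*", region="World")
>>> without_2005 = scm_df.filter(year=2005, keep=False)
>>> scm_df.filter(scenario="rcp26", inplace=True)    # filters in place and returns None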

head(*args, **kwargs)

Return head of self.timeseries().

Parameters
  • *args – Passed to self.timeseries().head()

  • **kwargs – Passed to self.timeseries().head()

Returns

Head of self.timeseries()

Return type

pd.DataFrame

interpolate(target_times, interpolation_type=<InterpolationType.LINEAR: 1>, extrapolation_type=<ExtrapolationType.CONSTANT: 0>)

Interpolate the dataframe onto a new time frame.

Uses openscm.timeseries_converter.TimeseriesConverter internally. For each time series a ParameterType is guessed from the variable name. To override the guessed parameter type, specify a “parameter_type” meta column before calling interpolate. The guessed parameter types are returned in meta.

Parameters
Returns

A new ScmDataFrameBase containing the data interpolated onto the target_times grid

Return type

ScmDataFrameBase
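
A minimal sketch, assuming target_times accepts a sequence of datetime.datetime objects (interpolation and extrapolation behaviour default to the types shown in the signature):

>>> import datetime as dt
>>> target_times = [dt.datetime(year, 1, 1) for year in range(2000, 2011)]
>>> annual = scm_df.interpolate(target_times)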

line_plot(x='time', y='value', **kwargs)

Plot a line chart.

See pyam.IamDataFrame.line_plot() for more information.

Return type

None

property meta

Metadata

Return type

DataFrame

pivot_table(index, columns, **kwargs)

Pivot the underlying data series.

See pyam.core.IamDataFrame.pivot_table() for details.

Return type

DataFrame

process_over(cols, operation, **kwargs)

Process the data over the input columns.

Parameters
  • cols (Union[str, List[str]]) – Columns to perform the operation on. The timeseries will be grouped by all other columns in meta.

  • operation (['median', 'mean', 'quantile']) – The operation to perform. This uses the equivalent pandas function. Note that quantile means the value of the data at a given point in the cumulative distribution of values at each point in the timeseries, for each timeseries once the groupby is applied. As a result, using q=0.5 is the same as taking the median and not the same as taking the mean/average.

  • **kwargs – Keyword arguments to pass to the pandas operation

Returns

The quantiles of the timeseries, grouped by all columns in meta other than cols

Return type

pd.DataFrame

Raises

ValueError – If the operation is not one of [‘median’, ‘mean’, ‘quantile’]
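
For example (a sketch in which “run_id” is a hypothetical meta column distinguishing ensemble members):

>>> ensemble_median = scm_df.process_over("run_id", "median")
>>> ensemble_q90 = scm_df.process_over(["run_id", "model"], "quantile", q=0.9)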

region_plot(**kwargs)

Plot regional data for a single model, scenario, variable, and year.

See pyam.plotting.region_plot for details.

Return type

None

relative_to_ref_period_mean(append_str=None, **kwargs)

Return the timeseries relative to a given reference period mean.

The reference period mean is subtracted from all values in the input timeseries.

Parameters
  • append_str (Optional[str]) – String to append to the name of all the variables in the resulting DataFrame to indicate that they are relevant to a given reference period. E.g. ‘rel. to 1961-1990’. If None, this will be autofilled with the keys and ranges of kwargs.

  • **kwargs – Arguments to pass to filter() to determine the data to be included in the reference time period. See the docs of filter() for valid options.

Returns

DataFrame containing the timeseries, adjusted to the reference period mean

Return type

pd.DataFrame
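
A sketch of computing anomalies relative to a 1961-1990 reference period (assuming the data covers those years; year is passed through to filter()):

>>> anomalies = scm_df.relative_to_ref_period_mean(
...     append_str="rel. to 1961-1990",
...     year=list(range(1961, 1991)),
... )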

rename(mapping, inplace=False)

Rename and aggregate column entries using groupby.sum() on values. When renaming models or scenarios, the uniqueness of the index must be maintained, and the function will raise an error otherwise.

Parameters
  • mapping (Dict[str, Dict[str, str]]) –

    For each column where entries should be renamed, provide current name and target name

    {<column name>: {<current_name_1>: <target_name_1>,
                     <current_name_2>: <target_name_2>}}
    

  • inplace (bool) – If True, do operation inplace and return None

Returns

If not inplace, return a new ScmDataFrameBase instance

Return type

ScmDataFrameBase

Raises

ValueError – Column is not in meta or renaming will cause non-unique metadata
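
An illustrative mapping (the scenario and region names are hypothetical; they must exist in meta and the renaming must keep the metadata unique):

>>> renamed = scm_df.rename({
...     "scenario": {"rcp26": "RCP2.6"},
...     "region": {"R5ASIA": "Asia"},
... })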

resample(rule='AS', **kwargs)

Resample the time index of the timeseries data onto a custom grid.

This helper function allows for values to be easily interpolated onto annual or monthly timesteps using the rules ‘AS’ or ‘MS’ respectively. Internally, the interpolate function performs the regridding.

Parameters
  • rule (str) – See the pandas user guide for a list of options. Note that Business-related offsets such as “BusinessDay” are not supported.

  • **kwargs – Other arguments to pass through to interpolate()

Returns

New ScmDataFrameBase instance on a new time index

Return type

ScmDataFrameBase

Examples

Resample a dataframe to annual values

>>> scm_df = ScmDataFrame(
...     pd.Series([1, 2, 10], index=(2000, 2001, 2009)),
...     columns={
...         "model": ["a_iam"],
...         "scenario": ["a_scenario"],
...         "region": ["World"],
...         "variable": ["Primary Energy"],
...         "unit": ["EJ/y"],
...     }
... )
>>> scm_df.timeseries().T
model             a_iam
scenario     a_scenario
region            World
variable Primary Energy
unit               EJ/y
year
2000                  1
2001                  2
2009                 10

An annual timeseries can then be created by interpolating to the start of years using the rule ‘AS’.

>>> res = scm_df.resample('AS')
>>> res.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2001-01-01 00:00:00       2.001825
2002-01-01 00:00:00       3.000912
2003-01-01 00:00:00       4.000000
2004-01-01 00:00:00       4.999088
2005-01-01 00:00:00       6.000912
2006-01-01 00:00:00       7.000000
2007-01-01 00:00:00       7.999088
2008-01-01 00:00:00       8.998175
2009-01-01 00:00:00      10.000000
>>> m_df = scm_df.resample('MS')
>>> m_df.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2000-02-01 00:00:00       1.084854
2000-03-01 00:00:00       1.164234
2000-04-01 00:00:00       1.249088
2000-05-01 00:00:00       1.331204
2000-06-01 00:00:00       1.416058
2000-07-01 00:00:00       1.498175
2000-08-01 00:00:00       1.583029
2000-09-01 00:00:00       1.667883
                            ...
2008-05-01 00:00:00       9.329380
2008-06-01 00:00:00       9.414234
2008-07-01 00:00:00       9.496350
2008-08-01 00:00:00       9.581204
2008-09-01 00:00:00       9.666058
2008-10-01 00:00:00       9.748175
2008-11-01 00:00:00       9.833029
2008-12-01 00:00:00       9.915146
2009-01-01 00:00:00      10.000000
[109 rows x 1 columns]

Note that the values do not fall exactly on integer values as not all years are exactly the same length.

References

See the pandas documentation for resample (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html) for more information about possible arguments.

scatter(x, y, **kwargs)

Plot a scatter chart using metadata columns.

See pyam.plotting.scatter() for details.

Return type

None

set_meta(meta, name=None, index=None)

Set metadata information.

TODO: re-write this to make it more sane and add type annotations

Parameters
  • meta (Union[Series, list, int, float, str]) – Column to be added to metadata

  • name (Optional[str]) – Meta column name (defaults to meta.name)

  • index (Union[DataFrame, Series, Index, MultiIndex, None]) – The index to which the metadata is to be applied

Raises

ValueError – No name can be determined from inputs or index cannot be coerced to pd.MultiIndex

Return type

None
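
Two common patterns, sketched under the assumption that scm_df holds three timeseries: a scalar is broadcast to all rows, while a list must provide one value per timeseries:

>>> scm_df.set_meta("unspecified", name="climate_model")   # same value for every timeseries
>>> scm_df.set_meta([1, 2, 3], name="run_id")              # one value per timeseries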

tail(*args, **kwargs)

Return tail of self.timeseries().

Parameters
  • *args – Passed to self.timeseries().tail()

  • **kwargs – Passed to self.timeseries().tail()

Returns

Tail of self.timeseries()

Return type

pd.DataFrame

property time_points

Time points of the data

Return type

ndarray

timeseries(meta=None)

Return the data in wide format (same as the timeseries method of pyam.IamDataFrame).

Parameters

meta (Optional[List[str]]) – The list of meta columns that will be included in the output’s MultiIndex. If None (default), then all metadata will be used.

Returns

DataFrame with datetimes as columns and timeseries as rows. Metadata is in the index.

Return type

pd.DataFrame

Raises

ValueError – If the metadata are not unique between timeseries
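
For instance (assuming the chosen meta subset still identifies each timeseries uniquely):

>>> wide = scm_df.timeseries()                               # full metadata MultiIndex
>>> slim = scm_df.timeseries(meta=["scenario", "variable"])  # raises ValueError if not unique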

to_csv(path, **kwargs)

Write timeseries data to a csv file

Parameters

path (str) – Path to write the file into

Return type

None

to_iamdataframe()

Convert to a LongDatetimeIamDataFrame instance.

LongDatetimeIamDataFrame is a subclass of pyam.IamDataFrame. We use LongDatetimeIamDataFrame to ensure all times can be handled, see docstring of LongDatetimeIamDataFrame for details.

Returns

LongDatetimeIamDataFrame instance containing the same data.

Return type

LongDatetimeIamDataFrame

Raises

ImportError – If pyam is not installed

to_parameterset(parameterset=None)

Add parameters in this ScmDataFrameBase to a ParameterSet.

It can only be transformed if all timeseries have the same metadata. This is typically the case if all data comes from a single scenario/model input dataset. If that is not the case, further filtering is needed to reduce to a dataframe with identical metadata.

Parameters

parameterset (Optional[ParameterSet]) – ParameterSet to add this ScmDataFrameBase’s parameters to. A new ParameterSet is created if this is None.

Returns

ParameterSet containing the data in self (equals parameterset if not None)

Return type

ParameterSet

Raises

ValueError – Not all timeseries have the same metadata or climate_model is given and does not equal “unspecified”
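
A sketch of the typical workflow: filter down to a single run so the metadata are homogeneous, then convert (the filter values are illustrative):

>>> single_run = scm_df.filter(model="model1", scenario="rcp26")
>>> parameter_set = single_run.to_parameterset()    # creates and returns a new ParameterSet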

property values

Timeseries values without metadata

Calls timeseries()

Return type

ndarray

openscm.scmdataframe.convert_openscm_to_scmdataframe(parameterset, time_points, model='unspecified', scenario='unspecified', climate_model='unspecified')

Get an ScmDataFrame from a ParameterSet.

An ScmDataFrame is a view with a common time index for all time series. All metadata in the ParameterSet must be represented as Generic parameters in the World region.

TODO: overhaul this function and move to an appropriate location

Parameters
  • parameterset (ParameterSet) – ParameterSet containing time series and optional metadata.

  • time_points (ndarray) – Time points onto which all timeseries will be interpolated.

  • model (str) – Default value for the model metadata value. This value is only used if the model parameter is not found.

  • scenario (str) – Default value for the scenario metadata value. This value is only used if the scenario parameter is not found.

  • climate_model (str) – Default value for the climate_model metadata value. This value is only used if the climate_model parameter is not found.

Raises

ValueError – If a generic parameter cannot be mapped to an ScmDataFrame meta table. This happens if the parameter has a region which is not ('World',).

Returns

ScmDataFrame containing the data from parameterset

Return type

ScmDataFrame

Base

Base and utilities for OpenSCM’s custom DataFrame implementation.

openscm.scmdataframe.base.REQUIRED_COLS = ['model', 'scenario', 'region', 'variable', 'unit']

Minimum metadata columns required by an ScmDataFrame

class openscm.scmdataframe.base.ScmDataFrameBase(data, index=None, columns=None, **kwargs)

Bases: object

Base of OpenSCM’s custom DataFrame implementation.

This base is the class other libraries can subclass. Having such a subclass avoids a potential circularity where e.g. OpenSCM imports ScmDataFrame as well as Pymagicc, but Pymagicc wants to import ScmDataFrame too. Hence, importing ScmDataFrame would require importing ScmDataFrame itself, causing a circularity.

__init__(data, index=None, columns=None, **kwargs)

Initialize.

Parameters
  • data (Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]) – A pd.DataFrame or data file with IAMC-format data columns, or a numpy array of timeseries data if columns is specified. If a string is passed, an attempt is made to read data from a file at that path.

  • index (Optional[Any]) – Only used if columns is not None. If index is not None, too, then this value sets the time index of the ScmDataFrameBase instance. If index is None and columns is not None, the index is taken from data.

  • columns (Optional[Dict[str, list]]) –

    If None, ScmDataFrameBase will attempt to infer the values from the source. Otherwise, use this dict to write the metadata for each timeseries in data. For each metadata key (e.g. “model”, “scenario”), an array of values (one per time series) is expected. Alternatively, providing a list of length 1 applies the same value to all timeseries in data. For example, if you had three timeseries from ‘rcp26’ for three different models ‘model1’, ‘model2’ and ‘model3’, the column dict would look like either ‘col_1’ or ‘col_2’:

    >>> col_1 = {
        "scenario": ["rcp26"],
        "model": ["model1", "model2", "model3"],
        "region": ["unspecified"],
        "variable": ["unspecified"],
        "unit": ["unspecified"]
    }
    >>> col_2 = {
        "scenario": ["rcp26", "rcp26", "rcp26"],
        "model": ["model1", "model2", "model3"],
        "region": ["unspecified"],
        "variable": ["unspecified"],
        "unit": ["unspecified"]
    }
    >>> pd.testing.assert_frame_equal(
        ScmDataFrameBase(d, columns=col_1).meta,
        ScmDataFrameBase(d, columns=col_2).meta
    )
    

  • **kwargs – Additional parameters passed to pyam.core._read_file() to read files

Raises
  • ValueError – If metadata for [‘model’, ‘scenario’, ‘region’, ‘variable’, ‘unit’] is not found. A ValueError is also raised if you try to load from multiple files at once. If you wish to do this, please use df_append() instead.

  • TypeError – Timeseries cannot be read from data

_apply_filters(filters, has_nan=True)

Determine rows to keep in data for given set of filters.

Parameters
  • filters (Dict[~KT, ~VT]) – Dictionary of filters ({col: values}); uses a pseudo-regexp syntax by default but if filters["regexp"] is True, regexp is used directly.

  • has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.

Returns

Two boolean np.ndarray’s. The first contains the columns to keep (i.e. which time points to keep). The second contains the rows to keep (i.e. which metadata matched the filters).

Return type

np.ndarray of bool, np.ndarray of bool

Raises

ValueError – Filtering cannot be performed on requested column

_data = None

Timeseries data

_day_match(values)
_meta = None

Meta data

_sort_meta_cols()
_time_points = None

Time points

append(other, inplace=False, duplicate_msg='warn', **kwargs)

Append additional data to the current dataframe.

For details, see df_append().

Parameters
  • other (Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]) – Data (in format which can be cast to ScmDataFrameBase) to append

  • inplace (bool) – If True, append data in place and return None. Otherwise, return a new ScmDataFrameBase instance with the appended data.

  • duplicate_msg (Union[str, bool]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. If False, take the average and do not raise a warning.

  • **kwargs – Keywords to pass to ScmDataFrameBase.__init__() when reading other

Returns

If not inplace, return a new ScmDataFrameBase instance containing the result of the append.

Return type

ScmDataFrameBase

convert_unit(unit, context=None, inplace=False, **kwargs)

Convert the units of a selection of timeseries.

Uses openscm.units.UnitConverter to perform the conversion.

Parameters
  • unit (str) – Unit to convert to. This must be recognised by UnitConverter.

  • context (Optional[str]) – Context to use for the conversion i.e. which metric to apply when performing CO2-equivalent calculations. If None, no metric will be applied and CO2-equivalent calculations will raise DimensionalityError.

  • inplace (bool) – If True, the operation is performed inplace, updating the underlying data. Otherwise a new ScmDataFrameBase instance is returned.

  • **kwargs – Extra arguments which are passed to filter() to limit the timeseries which are attempted to be converted. Defaults to selecting the entire ScmDataFrame, which will likely fail.

Returns

If not inplace, a new ScmDataFrameBase instance with the converted units.

Return type

ScmDataFrameBase

copy()

Return a copy.deepcopy() of self.

Returns

copy.deepcopy() of self

Return type

ScmDataFrameBase

data_hierarchy_separator = '|'

String used to define different levels in our data hierarchies.

By default we follow pyam and use “|”. In such a case, emissions of CO2 for energy from coal would be “Emissions|CO2|Energy|Coal”.

Type

str

filter(keep=True, inplace=False, has_nan=True, **kwargs)

Return a filtered ScmDataFrame (i.e., a subset of the data).

Parameters
  • keep (bool) – If True, keep all timeseries satisfying the filters, otherwise drop all the timeseries satisfying the filters

  • inplace (bool) – If True, do operation inplace and return None

  • has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.

  • **kwargs

    Argument names are keys with which to filter, values are used to do the filtering. Filtering can be done on:

    • all metadata columns with strings, “*” can be used as a wildcard in search strings

    • ’level’: the maximum “depth” of IAM variables (number of hierarchy levels, excluding the strings given in the ‘variable’ argument)

    • ’time’: takes a datetime.datetime or list of datetime.datetime’s TODO: default to np.datetime64

    • ’year’, ‘month’, ‘day’, ‘hour’: takes an int or list of int’s (‘month’ and ‘day’ also accept str or list of str)

    If regexp=True is included in kwargs then the pseudo-regexp syntax in pattern_match is disabled.

Returns

If not inplace, return a new instance with the filtered data.

Return type

ScmDataFrameBase

Raises

AssertionError – Data and meta become unaligned

head(*args, **kwargs)

Return head of self.timeseries().

Parameters
  • *args – Passed to self.timeseries().head()

  • **kwargs – Passed to self.timeseries().head()

Returns

Head of self.timeseries()

Return type

pd.DataFrame

interpolate(target_times, interpolation_type=<InterpolationType.LINEAR: 1>, extrapolation_type=<ExtrapolationType.CONSTANT: 0>)

Interpolate the dataframe onto a new time frame.

Uses openscm.timeseries_converter.TimeseriesConverter internally. For each time series a ParameterType is guessed from the variable name. To override the guessed parameter type, specify a “parameter_type” meta column before calling interpolate. The guessed parameter types are returned in meta.

Parameters
Returns

A new ScmDataFrameBase containing the data interpolated onto the target_times grid

Return type

ScmDataFrameBase

line_plot(x='time', y='value', **kwargs)

Plot a line chart.

See pyam.IamDataFrame.line_plot() for more information.

Return type

None

property meta

Metadata

Return type

DataFrame

pivot_table(index, columns, **kwargs)

Pivot the underlying data series.

See pyam.core.IamDataFrame.pivot_table() for details.

Return type

DataFrame

process_over(cols, operation, **kwargs)

Process the data over the input columns.

Parameters
  • cols (Union[str, List[str]]) – Columns to perform the operation on. The timeseries will be grouped by all other columns in meta.

  • operation (['median', 'mean', 'quantile']) – The operation to perform. This uses the equivalent pandas function. Note that quantile means the value of the data at a given point in the cumulative distribution of values at each point in the timeseries, for each timeseries once the groupby is applied. As a result, using q=0.5 is the same as taking the median and not the same as taking the mean/average.

  • **kwargs – Keyword arguments to pass to the pandas operation

Returns

The quantiles of the timeseries, grouped by all columns in meta other than cols

Return type

pd.DataFrame

Raises

ValueError – If the operation is not one of [‘median’, ‘mean’, ‘quantile’]

region_plot(**kwargs)

Plot regional data for a single model, scenario, variable, and year.

See pyam.plotting.region_plot for details.

Return type

None

relative_to_ref_period_mean(append_str=None, **kwargs)

Return the timeseries relative to a given reference period mean.

The reference period mean is subtracted from all values in the input timeseries.

Parameters
  • append_str (Optional[str]) – String to append to the name of all the variables in the resulting DataFrame to indicate that they are relevant to a given reference period. E.g. ‘rel. to 1961-1990’. If None, this will be autofilled with the keys and ranges of kwargs.

  • **kwargs – Arguments to pass to filter() to determine the data to be included in the reference time period. See the docs of filter() for valid options.

Returns

DataFrame containing the timeseries, adjusted to the reference period mean

Return type

pd.DataFrame

rename(mapping, inplace=False)

Rename and aggregate column entries using groupby.sum() on values. When renaming models or scenarios, the uniqueness of the index must be maintained, and the function will raise an error otherwise.

Parameters
  • mapping (Dict[str, Dict[str, str]]) –

    For each column where entries should be renamed, provide current name and target name

    {<column name>: {<current_name_1>: <target_name_1>,
                     <current_name_2>: <target_name_2>}}
    

  • inplace (bool) – If True, do operation inplace and return None

Returns

If not inplace, return a new ScmDataFrameBase instance

Return type

ScmDataFrameBase

Raises

ValueError – Column is not in meta or renaming will cause non-unique metadata

resample(rule='AS', **kwargs)

Resample the time index of the timeseries data onto a custom grid.

This helper function allows for values to be easily interpolated onto annual or monthly timesteps using the rules ‘AS’ or ‘MS’ respectively. Internally, the interpolate function performs the regridding.

Parameters
  • rule (str) – See the pandas user guide for a list of options. Note that Business-related offsets such as “BusinessDay” are not supported.

  • **kwargs – Other arguments to pass through to interpolate()

Returns

New ScmDataFrameBase instance on a new time index

Return type

ScmDataFrameBase

Examples

Resample a dataframe to annual values

>>> scm_df = ScmDataFrame(
...     pd.Series([1, 2, 10], index=(2000, 2001, 2009)),
...     columns={
...         "model": ["a_iam"],
...         "scenario": ["a_scenario"],
...         "region": ["World"],
...         "variable": ["Primary Energy"],
...         "unit": ["EJ/y"],
...     }
... )
>>> scm_df.timeseries().T
model             a_iam
scenario     a_scenario
region            World
variable Primary Energy
unit               EJ/y
year
2000                  1
2001                  2
2009                 10

An annual timeseries can then be created by interpolating to the start of years using the rule ‘AS’.

>>> res = scm_df.resample('AS')
>>> res.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2001-01-01 00:00:00       2.001825
2002-01-01 00:00:00       3.000912
2003-01-01 00:00:00       4.000000
2004-01-01 00:00:00       4.999088
2005-01-01 00:00:00       6.000912
2006-01-01 00:00:00       7.000000
2007-01-01 00:00:00       7.999088
2008-01-01 00:00:00       8.998175
2009-01-01 00:00:00      10.000000
>>> m_df = scm_df.resample('MS')
>>> m_df.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2000-02-01 00:00:00       1.084854
2000-03-01 00:00:00       1.164234
2000-04-01 00:00:00       1.249088
2000-05-01 00:00:00       1.331204
2000-06-01 00:00:00       1.416058
2000-07-01 00:00:00       1.498175
2000-08-01 00:00:00       1.583029
2000-09-01 00:00:00       1.667883
                            ...
2008-05-01 00:00:00       9.329380
2008-06-01 00:00:00       9.414234
2008-07-01 00:00:00       9.496350
2008-08-01 00:00:00       9.581204
2008-09-01 00:00:00       9.666058
2008-10-01 00:00:00       9.748175
2008-11-01 00:00:00       9.833029
2008-12-01 00:00:00       9.915146
2009-01-01 00:00:00      10.000000
[109 rows x 1 columns]

Note that the values do not fall exactly on integer values as not all years are exactly the same length.

References

See the pandas documentation for resample (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html) for more information about possible arguments.

scatter(x, y, **kwargs)

Plot a scatter chart using metadata columns.

See pyam.plotting.scatter() for details.

Return type

None

set_meta(meta, name=None, index=None)

Set metadata information.

TODO: re-write this to make it more sane and add type annotations

Parameters
  • meta (Union[Series, list, int, float, str]) – Column to be added to metadata

  • name (Optional[str]) – Meta column name (defaults to meta.name)

  • index (Union[DataFrame, Series, Index, MultiIndex, None]) – The index to which the metadata is to be applied

Raises

ValueError – No name can be determined from inputs or index cannot be coerced to pd.MultiIndex

Return type

None

tail(*args, **kwargs)

Return tail of self.timeseries().

Parameters
  • *args – Passed to self.timeseries().tail()

  • **kwargs – Passed to self.timeseries().tail()

Returns

Tail of self.timeseries()

Return type

pd.DataFrame

property time_points

Time points of the data

Return type

ndarray

timeseries(meta=None)

Return the data in wide format (same as the timeseries method of pyam.IamDataFrame).

Parameters

meta (Optional[List[str]]) – The list of meta columns that will be included in the output’s MultiIndex. If None (default), then all metadata will be used.

Returns

DataFrame with datetimes as columns and timeseries as rows. Metadata is in the index.

Return type

pd.DataFrame

Raises

ValueError – If the metadata are not unique between timeseries

to_csv(path, **kwargs)

Write timeseries data to a csv file

Parameters

path (str) – Path to write the file into

Return type

None

to_iamdataframe()

Convert to a LongDatetimeIamDataFrame instance.

LongDatetimeIamDataFrame is a subclass of pyam.IamDataFrame. We use LongDatetimeIamDataFrame to ensure all times can be handled, see docstring of LongDatetimeIamDataFrame for details.

Returns

LongDatetimeIamDataFrame instance containing the same data.

Return type

LongDatetimeIamDataFrame

Raises

ImportError – If pyam is not installed

to_parameterset(parameterset=None)

Add parameters in this ScmDataFrameBase to a ParameterSet.

It can only be transformed if all timeseries have the same metadata. This is typically the case if all data comes from a single scenario/model input dataset. If that is not the case, further filtering is needed to reduce to a dataframe with identical metadata.

Parameters

parameterset (Optional[ParameterSet]) – ParameterSet to add this ScmDataFrameBase’s parameters to. A new ParameterSet is created if this is None.

Returns

ParameterSet containing the data in self (equals parameterset if not None)

Return type

ParameterSet

Raises

ValueError – Not all timeseries have the same metadata or climate_model is given and does not equal “unspecified”

property values

Timeseries values without metadata

Calls timeseries()

Return type

ndarray

openscm.scmdataframe.base._format_data(df)

Prepare data to initialize ScmDataFrameBase from pd.DataFrame or pd.Series.

See docstring of ScmDataFrameBase.__init__() for details.

Parameters

df (Union[DataFrame, Series]) – Data to format.

Returns

First dataframe is the data. Second dataframe is metadata.

Return type

pd.DataFrame, pd.DataFrame

Raises

ValueError – Not all required metadata columns are present or the time axis cannot be understood

openscm.scmdataframe.base._format_long_data(df)
openscm.scmdataframe.base._format_wide_data(df)
openscm.scmdataframe.base._from_ts(df, index=None, **columns)

Prepare data to initialize ScmDataFrameBase from wide timeseries.

See docstring of ScmDataFrameBase.__init__() for details.

Returns

First dataframe is the data. Second dataframe is metadata

Return type

Tuple[pd.DataFrame, pd.DataFrame]

Raises

ValueError – Not all required columns are present

openscm.scmdataframe.base._handle_potential_duplicates_in_append(data, duplicate_msg)
openscm.scmdataframe.base._read_file(fnames, *args, **kwargs)

Prepare data to initialize ScmDataFrameBase from a file.

Parameters
Returns

First dataframe is the data. Second dataframe is metadata

Return type

pd.DataFrame, pd.DataFrame

openscm.scmdataframe.base._read_pandas(fname, *args, **kwargs)

Read a file and return a pd.DataFrame.

Parameters
  • fname (str) – Path from which to read data

  • *args – Passed to pd.read_csv() if fname ends with ‘.csv’, otherwise passed to pd.read_excel().

  • **kwargs – Passed to pd.read_csv() if fname ends with ‘.csv’, otherwise passed to pd.read_excel().

Returns

Read data

Return type

pd.DataFrame

Raises

OSError – Path specified by fname does not exist

openscm.scmdataframe.base.df_append(dfs, inplace=False, duplicate_msg='warn')

Append together many objects.

When appending many objects, it may be more efficient to call this routine once with a list of ScmDataFrames than to use ScmDataFrame.append() multiple times. If timeseries with duplicate metadata are found, the timeseries are appended and values falling on the same timestep are averaged (this behaviour can be adjusted with the duplicate_msg argument).

Parameters
  • dfs (List[Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]]) – The dataframes to append. Values will be attempted to be cast to ScmDataFrameBase.

  • inplace (bool) – If True, then the operation updates the first item in dfs and returns None.

  • duplicate_msg (Union[str, bool]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. If False, take the average and do not raise a warning.

Returns

If not inplace, the return value is the object containing the merged data. The resultant class will be determined by the type of the first object. If duplicate_msg == "return", a pd.DataFrame will be returned instead.

Return type

ScmDataFrameBase
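
A usage sketch (df_a, df_b and df_c are hypothetical ScmDataFrame instances):

>>> from openscm.scmdataframe.base import df_append
>>> combined = df_append([df_a, df_b, df_c])       # new object, class taken from df_a
>>> df_append([df_a, df_b, df_c], inplace=True)    # updates df_a in place and returns None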

Raises

Filters

Helpers for filtering DataFrames.

Borrowed from pyam.utils.

openscm.scmdataframe.filters.datetime_match(data, dts)

Match datetimes in time columns for data filtering.

Parameters
Returns

Array where True indicates a match

Return type

np.array of bool

Raises

TypeError – If dts contains int

openscm.scmdataframe.filters.day_match(data, days)

Match days in time columns for data filtering.

Parameters
Returns

Array where True indicates a match

Return type

np.array of bool

openscm.scmdataframe.filters.find_depth(meta_col, s, level, separator='|')

Find all values which match given depth from a filter keyword.

Parameters
  • meta_col (Series) – Column in which to find values which match the given depth

  • s (str) – Filter keyword, from which level should be applied

  • level (Union[int, str]) – Depth of value to match as defined by the number of separators in the value name. If an int, the depth is matched exactly. If a str, then the depth can be matched as either “X-”, for all levels up to level “X”, or “X+”, for all levels above level “X”.

  • separator (str) – The string used to separate levels in s. Defaults to a pipe (“|”).

Returns

Array where True indicates a match

Return type

np.array of bool

Raises

ValueError – If level cannot be understood

openscm.scmdataframe.filters.hour_match(data, hours)

Match hours in time columns for data filtering.

Parameters
  • data (List[~T]) – Input data to perform filtering on

  • hours (Union[List[int], int]) – Hours to match

Returns

Array where True indicates a match

Return type

np.array of bool

openscm.scmdataframe.filters.is_in(vals, items)

Find elements of vals which are in items.

Parameters
  • vals (List[~T]) – The list of values to check

  • items (List[~T]) – The options used to determine whether each element of vals is in the desired subset or not

Returns

Array of the same length as vals where the element is True if the corresponding element of vals is in items and False otherwise

Return type

np.array of bool

openscm.scmdataframe.filters.month_match(data, months)

Match months in time columns for data filtering.

Parameters
Returns

Array where True indicates a match

Return type

np.array of bool

openscm.scmdataframe.filters.pattern_match(meta_col, values, level=None, regexp=False, has_nan=True, separator='|')

Filter data by matching metadata columns to given patterns.

Parameters
  • meta_col (Series) – Column to perform filtering on

  • values (Union[Iterable[str], str]) – Values to match

  • level (Union[str, int, None]) – Passed to find_depth(). For usage, see docstring of find_depth().

  • regexp (bool) – If True, match using regexp rather than the pseudo-regexp syntax of pyam.

  • has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.

  • separator (str) – String used to separate the hierarchy levels in values. Defaults to ‘|’

Returns

Array where True indicates a match

Return type

np.array of bool

Raises

TypeError – Filtering is performed on a string metadata column which contains np.nan and has_nan is False
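
An illustrative call (the expected mask is noted in a comment rather than shown as output, since only the True/False pattern, not the exact array repr, is assumed here):

>>> import pandas as pd
>>> from openscm.scmdataframe.filters import pattern_match
>>> variables = pd.Series(["Emissions|CO2", "Emissions|CO2|Energy", "Emissions|CH4"])
>>> mask = pattern_match(variables, "Emissions|CO2*")   # expected: [True, True, False]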

openscm.scmdataframe.filters.time_match(data, times, conv_codes, strptime_attr, name)

Match times by applying conversion codes to filtering list.

Parameters
  • data (List[~T]) – Input data to perform filtering on

  • times (Union[List[str], List[int], int, str]) – Times to match

  • conv_codes (List[str]) – If times contains strings, conversion codes to try passing to time.strptime() to convert times to datetime.datetime

  • strptime_attr (str) – If times contains strings, the datetime.datetime attribute to finalize the conversion of strings to integers

  • name (str) – Name of the part of a datetime to extract, used to produce useful error messages.

Returns

Array where True indicates a match

Return type

np.array of bool

Raises

ValueError – If input times cannot be converted or understood, or if input strings do not lead to increasing integers (i.e. “Nov-Feb” will not work, one must use [“Nov-Dec”, “Jan-Feb”] instead)

openscm.scmdataframe.filters.years_match(data, years)

Match years in time columns for data filtering.

Parameters
  • data (List[~T]) – Input data to perform filtering on

  • years (Union[List[int], int]) – Years to match

Returns

Array where True indicates a match

Return type

np.array of bool

Raises

TypeError – If years is not int or list of int

Offsets

A simplified version of pandas.DateOffset which uses datetime-like objects instead of pandas.Timestamp.

This differentiation allows for times which exceed the range of pandas.Timestamp, which is particularly important for longer running models.

TODO: use np.timedelta64 instead?

openscm.scmdataframe.offsets.apply_dt(func, self)

Apply a wrapper which keeps the result as a datetime instead of converting to pd.Timestamp.

This decorator is a simplified version of pandas.tseries.offsets.apply_wraps(). It is required to avoid running into errors when our time data is outside pandas’ limited time range of 1677-09-22 00:12:43.145225 to 2262-04-11 23:47:16.854775807.

openscm.scmdataframe.offsets.apply_rollback(obj)

Roll provided date backward to previous offset, only if not on offset.

openscm.scmdataframe.offsets.apply_rollforward(obj)

Roll provided date forward to next offset, only if not on offset.

openscm.scmdataframe.offsets.generate_range(start, end, offset)

Generate a range of datetime objects between start and end, using offset to determine the steps.

The range will extend both ends of the span to the next valid timestep, see examples.

Parameters
  • start (datetime) – Starting datetime from which to generate the range (noting roll backward mentioned above and illustrated in the examples).

  • end (datetime) – Last datetime from which to generate the range (noting roll forward mentioned above and illustrated in the examples).

  • offset (DateOffset) – Offset object for determining the timesteps. An offsetter obtained from to_offset() must be used.

Yields

datetime.datetime – Next datetime in the range

Raises

ValueError – Offset does not result in increasing datetime.datetime values

Examples

The range is extended at either end to the nearest timestep. In the example below, the first timestep is rolled back to 1st Jan 2001 whilst the last is extended to 1st Jan 2006.

>>> import datetime as dt
>>> from pprint import pprint
>>> from openscm.scmdataframe.offsets import to_offset, generate_range
>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("AS"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2001, 1, 1, 0, 0),
 datetime.datetime(2002, 1, 1, 0, 0),
 datetime.datetime(2003, 1, 1, 0, 0),
 datetime.datetime(2004, 1, 1, 0, 0),
 datetime.datetime(2005, 1, 1, 0, 0),
 datetime.datetime(2006, 1, 1, 0, 0)]

In this example the first timestep is rolled back to 31st Dec 2000 whilst the last is extended to 31st Dec 2005.

>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("A"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2000, 12, 31, 0, 0),
 datetime.datetime(2001, 12, 31, 0, 0),
 datetime.datetime(2002, 12, 31, 0, 0),
 datetime.datetime(2003, 12, 31, 0, 0),
 datetime.datetime(2004, 12, 31, 0, 0),
 datetime.datetime(2005, 12, 31, 0, 0)]

In this example the first timestep is already on the offset so it stays there, while the last timestep is rolled forward to 1st Jul 2005.

>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("QS"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2001, 4, 1, 0, 0),
 datetime.datetime(2001, 7, 1, 0, 0),
 datetime.datetime(2001, 10, 1, 0, 0),
 datetime.datetime(2002, 1, 1, 0, 0),
 datetime.datetime(2002, 4, 1, 0, 0),
 datetime.datetime(2002, 7, 1, 0, 0),
 datetime.datetime(2002, 10, 1, 0, 0),
 datetime.datetime(2003, 1, 1, 0, 0),
 datetime.datetime(2003, 4, 1, 0, 0),
 datetime.datetime(2003, 7, 1, 0, 0),
 datetime.datetime(2003, 10, 1, 0, 0),
 datetime.datetime(2004, 1, 1, 0, 0),
 datetime.datetime(2004, 4, 1, 0, 0),
 datetime.datetime(2004, 7, 1, 0, 0),
 datetime.datetime(2004, 10, 1, 0, 0),
 datetime.datetime(2005, 1, 1, 0, 0),
 datetime.datetime(2005, 4, 1, 0, 0),
 datetime.datetime(2005, 7, 1, 0, 0)]
Return type

Iterable[datetime]

openscm.scmdataframe.offsets.to_offset(rule)

Return a wrapped DateOffset instance for a given rule.

The DateOffset class is manipulated to return datetimes instead of pd.Timestamp, allowing it to handle times outside pandas’ limited time range of 1677-09-22 00:12:43.145225 to 2262-04-11 23:47:16.854775807.

Parameters

rule (str) – The rule to use to generate the offset. For options see pandas offset aliases.

Returns

Wrapped DateOffset class for the given rule

Return type

DateOffset

Raises

ValueError – If unsupported offset rule is requested, e.g. all business related offsets

Pyam Compatibility

Imports and classes required to ensure compatibility with pyam are handled intelligently.