ScmDataFrame¶
ScmDataFrame provides a high-level analysis tool for data relevant to simple climate models. It provides a simple interface for reading/writing, subsetting and visualising model data. An ScmDataFrame can hold multiple model runs, which aids the analysis of ensembles of model runs.
-
class
openscm.scmdataframe.
ScmDataFrame
(data, index=None, columns=None, **kwargs)¶ Bases:
openscm.scmdataframe.base.ScmDataFrameBase
OpenSCM’s custom DataFrame implementation.
The ScmDataFrame implements a subset of the functionality provided by pyam’s IamDataFrame, but is focused on providing a performant way of storing time series data and the metadata associated with those time series.
For users who wish to take advantage of all of pyam’s functionality, please cast your ScmDataFrame to an IamDataFrame first with
to_iamdataframe()
. Note: this operation can be computationally expensive for large data sets because IamDataFrames store data in long/tidy form internally rather than in ScmDataFrames’ more compact internal format.
-
__init__
(data, index=None, columns=None, **kwargs)¶ Initialize.
- Parameters
data (
Union
[ScmDataFrameBase
,None
,DataFrame
,Series
,ndarray
,str
]) – A pd.DataFrame or data file with IAMC-format data columns, or a numpy array of timeseries data ifcolumns
is specified. If a string is passed, data will be attempted to be read from file.index (
Optional
[Any
]) – Only used ifcolumns
is notNone
. Ifindex
is notNone
, too, then this value sets the time index of theScmDataFrameBase
instance. Ifindex
isNone
andcolumns
is notNone
, the index is taken fromdata
.columns (
Optional
[Dict
[str
,list
]]) – If None, ScmDataFrameBase will attempt to infer the values from the source. Otherwise, use this dict to write the metadata for each timeseries in data. For each metadata key (e.g. “model”, “scenario”), an array of values (one per time series) is expected. Alternatively, providing a list of length 1 applies the same value to all timeseries in data. For example, if you had three timeseries from ‘rcp26’ for 3 different models ‘model1’, ‘model2’ and ‘model3’, the column dict would look like either ‘col_1’ or ‘col_2’:
>>> col_1 = {
...     "scenario": ["rcp26"],
...     "model": ["model1", "model2", "model3"],
...     "region": ["unspecified"],
...     "variable": ["unspecified"],
...     "unit": ["unspecified"]
... }
>>> col_2 = {
...     "scenario": ["rcp26", "rcp26", "rcp26"],
...     "model": ["model1", "model2", "model3"],
...     "region": ["unspecified"],
...     "variable": ["unspecified"],
...     "unit": ["unspecified"]
... }
>>> # assert_frame_equal returns None and raises on a mismatch, so it is not wrapped in `assert`
>>> pd.testing.assert_frame_equal(
...     ScmDataFrameBase(d, columns=col_1).meta,
...     ScmDataFrameBase(d, columns=col_2).meta
... )
**kwargs – Additional parameters passed to
pyam.core._read_file()
to read files
- Raises
ValueError – If metadata for [‘model’, ‘scenario’, ‘region’, ‘variable’, ‘unit’] is not found. A
ValueError
is also raised if you try to load from multiple files at once. If you wish to do this, please usedf_append()
instead.TypeError – Timeseries cannot be read from
data
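The length-1 shorthand for columns described above can be illustrated without the library. The following is a minimal sketch, assuming the expansion simply repeats a single value once per timeseries; broadcast_meta is a hypothetical helper written for this illustration, not part of OpenSCM.

```python
from typing import Dict, List


def broadcast_meta(columns: Dict[str, List[str]], n_timeseries: int) -> Dict[str, List[str]]:
    """Expand length-1 metadata lists so every key has one value per timeseries.

    Hypothetical helper: it mimics the documented behaviour, not the real code.
    """
    expanded = {}
    for key, values in columns.items():
        if len(values) == 1:
            # A single value is applied to every timeseries.
            expanded[key] = list(values) * n_timeseries
        elif len(values) == n_timeseries:
            expanded[key] = list(values)
        else:
            raise ValueError(
                f"'{key}' has {len(values)} values, expected 1 or {n_timeseries}"
            )
    return expanded


col_1 = {
    "scenario": ["rcp26"],
    "model": ["model1", "model2", "model3"],
    "region": ["unspecified"],
    "variable": ["unspecified"],
    "unit": ["unspecified"],
}
col_2 = dict(col_1, scenario=["rcp26", "rcp26", "rcp26"])

meta_1 = broadcast_meta(col_1, n_timeseries=3)
meta_2 = broadcast_meta(col_2, n_timeseries=3)
print(meta_1 == meta_2)
```

Both dictionaries expand to the same per-timeseries metadata, which is the equivalence the col_1/col_2 example relies on.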
-
_apply_filters
(filters, has_nan=True)¶ Determine rows to keep in data for given set of filters.
- Parameters
filters (
Dict
[~KT, ~VT]) – Dictionary of filters ({col: values})
; uses a pseudo-regexp syntax by default but iffilters["regexp"]
isTrue
, regexp is used directly.has_nan (
bool
) – If True, convert all nan values in meta_col
to empty string before applying filters. This means that “” and “*” will match rows withnp.nan
. IfFalse
, the conversion is not applied and so a search in a string column which containsnp.nan
will result in aTypeError
.
- Returns
Two boolean
np.ndarray
’s. The first contains the columns to keep (i.e. which time points to keep). The second contains the rows to keep (i.e. which metadata matched the filters).- Return type
- Raises
ValueError – Filtering cannot be performed on requested column
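The effect of has_nan can be sketched with plain Python. This is an illustration under stated assumptions: match_meta is a hypothetical stand-in for the string matching step, and the translation of “*” into a regular expression is a simplification of the pseudo-regexp syntax, not the library's actual implementation.

```python
import math
import re
from typing import List


def match_meta(values: List[object], pattern: str, has_nan: bool = True) -> List[bool]:
    """Match each metadata value against a "*"-wildcard pattern (hypothetical helper)."""
    # Escape the pattern, then turn the escaped "*" back into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if has_nan:
        # Replace nan with an empty string so string matching cannot fail on it.
        values = ["" if isinstance(v, float) and math.isnan(v) else v for v in values]
    return [re.fullmatch(regex, v) is not None for v in values]


meta_col = ["World", float("nan"), "World|Europe"]

print(match_meta(meta_col, "*"))       # nan rows match the wildcard
print(match_meta(meta_col, ""))        # only the nan row matches the empty string
print(match_meta(meta_col, "World*"))  # nan row does not match

try:
    match_meta(meta_col, "*", has_nan=False)
except TypeError as exc:
    # Without the conversion, matching a float against a string pattern fails.
    print("TypeError:", exc)
```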
-
_day_match
(values)¶
-
_sort_meta_cols
()¶
-
append
(other, inplace=False, duplicate_msg='warn', **kwargs)¶ Append additional data to the current dataframe.
For details, see
df_append()
.- Parameters
other (
Union
[ScmDataFrameBase
,None
,DataFrame
,Series
,ndarray
,str
]) – Data (in format which can be cast toScmDataFrameBase
) to appendinplace (
bool
) – IfTrue
, append data in place and returnNone
. Otherwise, return a newScmDataFrameBase
instance with the appended data.duplicate_msg (
Union
[str
,bool
]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. IfFalse
, take the average and do not raise a warning.**kwargs – Keywords to pass to
ScmDataFrameBase.__init__()
when readingother
- Returns
If not
inplace
, return a newScmDataFrameBase
instance containing the result of the append.- Return type
ScmDataFrameBase
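The duplicate handling can be pictured with a small stand-alone sketch. It assumes that timeseries are identified by their metadata and that duplicates are averaged point by point; whether the “warn” option also averages is an assumption here, and the “return” option is left out. append_timeseries is a hypothetical function, not the library's df_append().

```python
import warnings
from typing import Dict, List, Tuple, Union

Meta = Tuple[str, ...]  # e.g. (model, scenario)


def append_timeseries(
    base: Dict[Meta, List[float]],
    other: Dict[Meta, List[float]],
    duplicate_msg: Union[str, bool] = "warn",
) -> Dict[Meta, List[float]]:
    """Join two sets of timeseries, averaging those with identical metadata."""
    joined = {meta: list(values) for meta, values in base.items()}
    duplicates = []
    for meta, values in other.items():
        if meta in joined:
            duplicates.append(meta)
            # Assumption: duplicates are resolved by a point-wise average.
            joined[meta] = [(a + b) / 2 for a, b in zip(joined[meta], values)]
        else:
            joined[meta] = list(values)
    if duplicates and duplicate_msg == "warn":
        warnings.warn(f"Duplicate timeseries averaged: {duplicates}")
    return joined


a = {("model1", "rcp26"): [1.0, 2.0]}
b = {("model1", "rcp26"): [3.0, 4.0], ("model2", "rcp26"): [5.0, 6.0]}

merged = append_timeseries(a, b, duplicate_msg=False)
print(merged)
```

Neither input is modified, which corresponds to the default inplace=False behaviour of returning a new object.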
-
convert_unit
(unit, context=None, inplace=False, **kwargs)¶ Convert the units of a selection of timeseries.
Uses
openscm.units.UnitConverter
to perform the conversion.- Parameters
unit (
str
) – Unit to convert to. This must be recognised byUnitConverter
.context (
Optional
[str
]) – Context to use for the conversion i.e. which metric to apply when performing CO2-equivalent calculations. IfNone
, no metric will be applied and CO2-equivalent calculations will raiseDimensionalityError
.inplace (
bool
) – IfTrue
, the operation is performed inplace, updating the underlying data. Otherwise a newScmDataFrameBase
instance is returned.**kwargs – Extra arguments which are passed to
filter()
to limit the timeseries which are attempted to be converted. Defaults to selecting the entire ScmDataFrame, which will likely fail.
- Returns
If
inplace
is False
, a newScmDataFrameBase
instance with the converted units.- Return type
ScmDataFrameBase
-
copy
()¶ Return a
copy.deepcopy()
of self.- Returns
copy.deepcopy()
ofself
- Return type
ScmDataFrameBase
-
data_hierarchy_separator
= '|'¶
-
filter
(keep=True, inplace=False, has_nan=True, **kwargs)¶ Return a filtered ScmDataFrame (i.e., a subset of the data).
- Parameters
keep (
bool
) – If True, keep all timeseries satisfying the filters, otherwise drop all the timeseries satisfying the filtersinplace (
bool
) – If True, do operation inplace and return Nonehas_nan (
bool
) – IfTrue
, convert all nan values inmeta_col
to empty string before applying filters. This means that “” and “*” will match rows withnp.nan
. IfFalse
, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError
.**kwargs –
Argument names are keys with which to filter, values are used to do the filtering. Filtering can be done on:
all metadata columns with strings, “*” can be used as a wildcard in search strings
’level’: the maximum “depth” of IAM variables (number of hierarchy levels, excluding the strings given in the ‘variable’ argument)
’time’: takes a
datetime.datetime
or list ofdatetime.datetime
’s TODO: default to np.datetime64
’year’, ‘month’, ‘day’, ‘hour’: takes an
int
or list ofint
’s (‘month’ and ‘day’ also acceptstr
or list ofstr
)
If
regexp=True
is included inkwargs
then the pseudo-regexp syntax inpattern_match
is disabled.
- Returns
If not
inplace
, return a new instance with the filtered data.- Return type
ScmDataFrameBase
- Raises
AssertionError – Data and meta become unaligned
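How the “*” wildcard and the ‘level’ filter interact can be sketched as follows. This is a simplified illustration, assuming that depth is counted as the number of hierarchy separators in a variable beyond those written in the pattern; matches and SEPARATOR are hypothetical names, and the real pseudo-regexp handling may differ in detail.

```python
import re
from typing import Optional

SEPARATOR = "|"  # mirrors data_hierarchy_separator


def matches(value: str, pattern: str, level: Optional[int] = None) -> bool:
    """Return True if value matches a "*"-wildcard pattern and, optionally, a depth."""
    # Escape everything, then turn the escaped "*" back into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if re.fullmatch(regex, value) is None:
        return False
    if level is None:
        return True
    # Assumed definition: hierarchy levels in the value beyond those in the pattern.
    depth = value.count(SEPARATOR) - pattern.count(SEPARATOR)
    return depth <= level


variables = [
    "Emissions|CO2",
    "Emissions|CO2|Energy",
    "Emissions|CO2|Energy|Coal",
    "Emissions|CH4",
]

# keep=True: retain what matches the filters.
kept = [v for v in variables if matches(v, "Emissions|CO2*", level=1)]
# keep=False: drop what matches the filters.
dropped_co2 = [v for v in variables if not matches(v, "*CO2*")]

print(kept)
print(dropped_co2)
```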
-
head
(*args, **kwargs)¶ Return head of
self.timeseries()
.- Parameters
*args – Passed to
self.timeseries().head()
**kwargs – Passed to
self.timeseries().head()
- Returns
Head of
self.timeseries()
- Return type
pd.DataFrame
-
interpolate
(target_times, interpolation_type=<InterpolationType.LINEAR: 1>, extrapolation_type=<ExtrapolationType.CONSTANT: 0>)¶ Interpolate the dataframe onto a new time frame.
Uses
openscm.timeseries_converter.TimeseriesConverter
internally. For each time series aParameterType
is guessed from the variable name. To override the guessed parameter type, specify a “parameter_type” meta column before calling interpolate. The guessed parameter types are returned in meta.- Parameters
target_times (
Union
[ndarray
,List
[Union
[datetime
,int
]]]) – Time grid onto which to interpolateinterpolation_type (
Union
[InterpolationType
,str
]) – How to interpolate the data between timepointsextrapolation_type (
Union
[ExtrapolationType
,str
]) – If and how to extrapolate the data beyond the data inself.timeseries()
- Returns
A new
ScmDataFrameBase
containing the data interpolated onto thetarget_times
grid- Return type
ScmDataFrameBase
-
line_plot
(x='time', y='value', **kwargs)¶ Plot a line chart.
See
pyam.IamDataFrame.line_plot()
for more information.- Return type
None
-
property
meta
¶ Metadata
- Return type
DataFrame
-
pivot_table
(index, columns, **kwargs)¶ Pivot the underlying data series.
See
pyam.core.IamDataFrame.pivot_table()
for details.- Return type
DataFrame
-
process_over
(cols, operation, **kwargs)¶ Process the data over the input columns.
- Parameters
cols (
Union
[str
,List
[str
]]) – Columns to perform the operation on. The timeseries will be grouped by all other columns inmeta
.operation (['median', 'mean', 'quantile']) – The operation to perform. This uses the equivalent pandas function. Note that quantile means the value of the data at a given point in the cumulative distribution of values at each point in the timeseries, for each timeseries once the groupby is applied. As a result, using
q=0.5
is the same as taking the median and not the same as taking the mean/average.
**kwargs – Keyword arguments to pass to the pandas operation
- Returns
The result of applying the operation to the timeseries, grouped by all columns in
meta
other thancols
- Return type
pd.DataFrame
- Raises
ValueError – If the operation is not one of [‘median’, ‘mean’, ‘quantile’]
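The grouping logic and the difference between the median, the 0.5 quantile and the mean can be sketched with the standard library alone. This is an illustration, not the pandas-based implementation: the process_over function below, the “run” metadata column and quantile_50 are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[Dict[str, str], List[float]]  # (metadata, values at each time point)


def process_over(
    rows: List[Row], cols: Sequence[str], operation: Callable[[Sequence[float]], float]
) -> Dict[tuple, List[float]]:
    """Apply `operation` across the timeseries that differ only in `cols`."""
    groups = defaultdict(list)
    for meta, values in rows:
        # Group by every metadata column that is *not* being processed over.
        key = tuple(sorted((k, v) for k, v in meta.items() if k not in cols))
        groups[key].append(values)
    # zip(*members) walks through the group one time point at a time.
    return {
        key: [operation(point) for point in zip(*members)]
        for key, members in groups.items()
    }


def quantile_50(point: Sequence[float]) -> float:
    """q=0.5 quantile, interpolating linearly between order statistics."""
    return statistics.quantiles(point, n=2, method="inclusive")[0]


rows = [
    ({"model": "a_model", "scenario": "rcp26", "run": "1"}, [1.0, 1.0]),
    ({"model": "a_model", "scenario": "rcp26", "run": "2"}, [2.0, 4.0]),
    ({"model": "a_model", "scenario": "rcp26", "run": "3"}, [10.0, 4.0]),
]
key = (("model", "a_model"), ("scenario", "rcp26"))

median = process_over(rows, ["run"], statistics.median)[key]
mean = process_over(rows, ["run"], statistics.mean)[key]
q50 = process_over(rows, ["run"], quantile_50)[key]

print(median, q50, mean)
```

In this sketch the median and the 0.5 quantile agree at every time point, while the mean is pulled upwards by the outlying run at the first time point.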
-
region_plot
(**kwargs)¶ Plot regional data for a single model, scenario, variable, and year.
See
pyam.plotting.region_plot
for details.- Return type
None
-
relative_to_ref_period_mean
(append_str=None, **kwargs)¶ Return the timeseries relative to a given reference period mean.
The reference period mean is subtracted from all values in the input timeseries.
- Parameters
append_str (
Optional
[str
]) – String to append to the name of all the variables in the resulting DataFrame to indicate that they are relative to a given reference period. E.g. ‘rel. to 1961-1990’. If None, this will be autofilled with the keys and ranges of kwargs
.**kwargs – Arguments to pass to
filter()
to determine the data to be included in the reference time period. See the docs offilter()
for valid options.
- Returns
DataFrame containing the timeseries, adjusted to the reference period mean
- Return type
pd.DataFrame
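The arithmetic behind the reference period adjustment is a single subtraction per timeseries. The sketch below assumes a timeseries stored as a mapping from year to value and a reference period given by a start and end year; the function signature is hypothetical, since the real method selects the reference period through filter() keyword arguments.

```python
from statistics import mean
from typing import Dict


def relative_to_ref_period_mean(
    series: Dict[int, float], start_year: int, end_year: int
) -> Dict[int, float]:
    """Subtract the mean over [start_year, end_year] from every value."""
    ref_values = [v for year, v in series.items() if start_year <= year <= end_year]
    if not ref_values:
        raise ValueError("no data in the reference period")
    ref_mean = mean(ref_values)
    return {year: v - ref_mean for year, v in series.items()}


# Made-up values, chosen only to keep the arithmetic easy to follow.
temperature = {1961: 0.25, 1962: 0.5, 1963: 0.75, 2000: 1.5}
anomaly = relative_to_ref_period_mean(temperature, 1961, 1963)
print(anomaly)
```

The reference mean here is 0.5, so the values inside the reference period average to zero after the adjustment.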
-
rename
(mapping, inplace=False)¶ Rename and aggregate column entries using
groupby.sum()
on values. When renaming models or scenarios, the uniqueness of the index must be maintained, and the function will raise an error otherwise.- Parameters
- Returns
If
inplace
is False
, return a newScmDataFrameBase
instance- Return type
ScmDataFrameBase
- Raises
ValueError – Column is not in meta or renaming will cause non-unique metadata
-
resample
(rule='AS', **kwargs)¶ Resample the time index of the timeseries data onto a custom grid.
This helper function allows for values to be easily interpolated onto annual or monthly timesteps using the rules ‘AS’ or ‘MS’ respectively. Internally, the interpolate function performs the regridding.
- Parameters
rule (
str
) – See the pandas user guide for a list of options. Note that Business-related offsets such as “BusinessDay” are not supported.**kwargs – Other arguments to pass through to
interpolate()
- Returns
New
ScmDataFrameBase
instance on a new time index- Return type
ScmDataFrameBase
Examples
Resample a dataframe to annual values
>>> scm_df = ScmDataFrame(
...     pd.Series([1, 2, 10], index=(2000, 2001, 2009)),
...     columns={
...         "model": ["a_iam"],
...         "scenario": ["a_scenario"],
...         "region": ["World"],
...         "variable": ["Primary Energy"],
...         "unit": ["EJ/y"],
...     }
... )
>>> scm_df.timeseries().T
model             a_iam
scenario     a_scenario
region            World
variable Primary Energy
unit               EJ/y
year
2000                  1
2001                  2
2009                 10
An annual timeseries can then be created by interpolating to the start of years using the rule ‘AS’.
>>> res = scm_df.resample('AS')
>>> res.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2001-01-01 00:00:00       2.001825
2002-01-01 00:00:00       3.000912
2003-01-01 00:00:00       4.000000
2004-01-01 00:00:00       4.999088
2005-01-01 00:00:00       6.000912
2006-01-01 00:00:00       7.000000
2007-01-01 00:00:00       7.999088
2008-01-01 00:00:00       8.998175
2009-01-01 00:00:00      10.000000
A monthly timeseries can be created in the same way using the rule ‘MS’.
>>> m_df = scm_df.resample('MS')
>>> m_df.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2000-02-01 00:00:00       1.084854
2000-03-01 00:00:00       1.164234
2000-04-01 00:00:00       1.249088
2000-05-01 00:00:00       1.331204
2000-06-01 00:00:00       1.416058
2000-07-01 00:00:00       1.498175
2000-08-01 00:00:00       1.583029
2000-09-01 00:00:00       1.667883
...
2008-05-01 00:00:00       9.329380
2008-06-01 00:00:00       9.414234
2008-07-01 00:00:00       9.496350
2008-08-01 00:00:00       9.581204
2008-09-01 00:00:00       9.666058
2008-10-01 00:00:00       9.748175
2008-11-01 00:00:00       9.833029
2008-12-01 00:00:00       9.915146
2009-01-01 00:00:00      10.000000
[109 rows x 1 columns]
Note that the values do not fall exactly on integer values as not all years are exactly the same length.
References
See the pandas documentation for resample <http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html> for more information about possible arguments.
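The two ingredients of resampling, building the target time grid and interpolating onto it, can be sketched with the standard library. This is an illustration only: month_starts and linear_interp are hypothetical helpers, and plain point-to-point linear interpolation is assumed, so the sketch is not expected to reproduce the exact numbers in the example output, which come from OpenSCM's TimeseriesConverter.

```python
from datetime import datetime
from typing import List


def month_starts(first: datetime, last: datetime) -> List[datetime]:
    """Start-of-month time points from first to last, inclusive (cf. rule 'MS')."""
    points = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        points.append(datetime(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return points


def linear_interp(t: datetime, t0: datetime, v0: float, t1: datetime, v1: float) -> float:
    """Linearly interpolate between (t0, v0) and (t1, v1) in elapsed time."""
    fraction = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
    return v0 + (v1 - v0) * fraction


# Monthly grid spanning the example data: 9 full years plus the final January.
grid = month_starts(datetime(2000, 1, 1), datetime(2009, 1, 1))
print(len(grid))

# One eighth of the way in calendar years, but not in elapsed days,
# because 2004 and 2008 are leap years.
value = linear_interp(
    datetime(2002, 1, 1), datetime(2001, 1, 1), 2.0, datetime(2009, 1, 1), 10.0
)
print(value)
```

The grid has 109 points, the same as the number of rows reported for the monthly example. The interpolated value is slightly below 3 because the 365 days of 2001 are a little less than one eighth of the 2922 days between the two data points, which is the year-length effect noted for the example.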
-
scatter
(x, y, **kwargs)¶ Plot a scatter chart using metadata columns.
See
pyam.plotting.scatter()
for details.- Return type
None
-
set_meta
(meta, name=None, index=None)¶ Set metadata information.
TODO: re-write this to make it more sane and add type annotations
- Parameters
- Raises
ValueError – No name can be determined from inputs or index cannot be coerced to
pd.MultiIndex
- Return type
None
-
tail
(*args, **kwargs)¶ Return tail of
self.timeseries()
.- Parameters
*args – Passed to
self.timeseries().tail()
**kwargs – Passed to
self.timeseries().tail()
- Returns
Tail of
self.timeseries()
- Return type
pd.DataFrame
-
timeseries
(meta=None)¶ Return the data in wide format (same as the timeseries method of
pyam.IamDataFrame
).- Parameters
meta (
Optional
[List
[str
]]) – The list of meta columns that will be included in the output’s MultiIndex. If None (default), then all metadata will be used.- Returns
DataFrame with datetimes as columns and timeseries as rows. Metadata is in the index.
- Return type
pd.DataFrame
- Raises
ValueError – If the metadata are not unique between timeseries
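The difference between long/tidy data and the wide format returned here can be sketched as a small pivot. This is a minimal illustration with only two metadata columns; to_wide and the row layout are hypothetical and do not reflect how OpenSCM stores data internally.

```python
import math
from typing import Dict, List, Tuple

LongRow = Tuple[str, str, int, float]  # (model, scenario, year, value)


def to_wide(rows: List[LongRow]) -> Tuple[List[int], Dict[Tuple[str, str], List[float]]]:
    """Pivot long/tidy rows into one row of values per (model, scenario)."""
    years = sorted({year for _, _, year, _ in rows})
    position = {year: i for i, year in enumerate(years)}
    wide: Dict[Tuple[str, str], List[float]] = {}
    for model, scenario, year, value in rows:
        series = wide.setdefault((model, scenario), [math.nan] * len(years))
        if not math.isnan(series[position[year]]):
            # Two values for the same metadata and time: metadata is not unique.
            raise ValueError(f"duplicate entry for {(model, scenario, year)}")
        series[position[year]] = value
    return years, wide


long_rows = [
    ("model1", "rcp26", 2000, 1.0),
    ("model1", "rcp26", 2010, 2.0),
    ("model2", "rcp26", 2000, 3.0),
    ("model2", "rcp26", 2010, 4.0),
]
years, wide = to_wide(long_rows)
print(years)
print(wide)
```

In the wide form the metadata is stored once per timeseries (as the key) and the time axis once overall, whereas the long form repeats the metadata on every row.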
-
to_csv
(path, **kwargs)¶ Write timeseries data to a csv file
- Parameters
path (
str
) – Path to write the file into- Return type
None
-
to_iamdataframe
()¶ Convert to a
LongDatetimeIamDataFrame
instance.LongDatetimeIamDataFrame
is a subclass ofpyam.IamDataFrame
. We useLongDatetimeIamDataFrame
to ensure all times can be handled, see docstring ofLongDatetimeIamDataFrame
for details.- Returns
LongDatetimeIamDataFrame
instance containing the same data.- Return type
LongDatetimeIamDataFrame
- Raises
If pyam is not installed
-
to_parameterset
(parameterset=None)¶ Add parameters in this
ScmDataFrameBase
to aParameterSet
.
It can only be transformed if all timeseries have the same metadata. This is typically the case if all data comes from a single scenario/model input dataset. If that is not the case, further filtering is needed to reduce to a dataframe with identical metadata.
- Parameters
parameterset (
Optional
[ParameterSet
]) – ParameterSet to add thisScmDataFrameBase
’s parameters to. A newParameterSet
is created if this isNone
.- Returns
ParameterSet
containing the data inself
(equalsparameterset
if notNone
)- Return type
ParameterSet
- Raises
ValueError – Not all timeseries have the same metadata or
climate_model
is given and does not equal “unspecified”
-
property
values
¶ Timeseries values without metadata
Calls
timeseries()
- Return type
-
-
openscm.scmdataframe.
convert_openscm_to_scmdataframe
(parameterset, time_points, model='unspecified', scenario='unspecified', climate_model='unspecified')¶ Get an
ScmDataFrame
from aParameterSet
.
An ScmDataFrame is a view with a common time index for all time series. All metadata in the ParameterSet must be represented as Generic parameters in the World region.
TODO: overhaul this function and move to an appropriate location
- Parameters
parameterset (
ParameterSet
) –ParameterSet
containing time series and optional metadata.time_points (
ndarray
) – Time points onto which all timeseries will be interpolated.model (
str
) – Default value for the model metadata value. This value is only used if themodel
parameter is not found.scenario (
str
) – Default value for the scenario metadata value. This value is only used if thescenario
parameter is not found.climate_model (
str
) – Default value for the climate_model metadata value. This value is only used if theclimate_model
parameter is not found.
- Raises
ValueError – If a generic parameter cannot be mapped to an ScmDataFrame meta table. This happens if the parameter has a region which is not
('World',)
.- Returns
ScmDataFrame
containing the data fromparameterset
- Return type
Base¶
Base and utilities for OpenSCM’s custom DataFrame implementation.
-
openscm.scmdataframe.base.
REQUIRED_COLS
= ['model', 'scenario', 'region', 'variable', 'unit']¶ Minimum metadata columns required by an ScmDataFrame
-
class
openscm.scmdataframe.base.
ScmDataFrameBase
(data, index=None, columns=None, **kwargs)¶ Bases:
object
Base of OpenSCM’s custom DataFrame implementation.
This base is the class other libraries can subclass. Having such a base class avoids a potential circularity where e.g. OpenSCM imports ScmDataFrame as well as Pymagicc, but Pymagicc wants to import ScmDataFrame too. Without the base, importing ScmDataFrame would require importing Pymagicc, which in turn imports ScmDataFrame, causing a circularity.
-
__init__
(data, index=None, columns=None, **kwargs)¶ Initialize.
- Parameters
data (
Union
[ScmDataFrameBase
,None
,DataFrame
,Series
,ndarray
,str
]) – A pd.DataFrame or data file with IAMC-format data columns, or a numpy array of timeseries data ifcolumns
is specified. If a string is passed, data will be attempted to be read from file.index (
Optional
[Any
]) – Only used ifcolumns
is notNone
. Ifindex
is notNone
, too, then this value sets the time index of theScmDataFrameBase
instance. Ifindex
isNone
andcolumns
is notNone
, the index is taken fromdata
.columns (
Optional
[Dict
[str
,list
]]) – If None, ScmDataFrameBase will attempt to infer the values from the source. Otherwise, use this dict to write the metadata for each timeseries in data. For each metadata key (e.g. “model”, “scenario”), an array of values (one per time series) is expected. Alternatively, providing a list of length 1 applies the same value to all timeseries in data. For example, if you had three timeseries from ‘rcp26’ for 3 different models ‘model1’, ‘model2’ and ‘model3’, the column dict would look like either ‘col_1’ or ‘col_2’:
>>> col_1 = {
...     "scenario": ["rcp26"],
...     "model": ["model1", "model2", "model3"],
...     "region": ["unspecified"],
...     "variable": ["unspecified"],
...     "unit": ["unspecified"]
... }
>>> col_2 = {
...     "scenario": ["rcp26", "rcp26", "rcp26"],
...     "model": ["model1", "model2", "model3"],
...     "region": ["unspecified"],
...     "variable": ["unspecified"],
...     "unit": ["unspecified"]
... }
>>> # assert_frame_equal returns None and raises on a mismatch, so it is not wrapped in `assert`
>>> pd.testing.assert_frame_equal(
...     ScmDataFrameBase(d, columns=col_1).meta,
...     ScmDataFrameBase(d, columns=col_2).meta
... )
**kwargs – Additional parameters passed to
pyam.core._read_file()
to read files
- Raises
ValueError – If metadata for [‘model’, ‘scenario’, ‘region’, ‘variable’, ‘unit’] is not found. A
ValueError
is also raised if you try to load from multiple files at once. If you wish to do this, please usedf_append()
instead.TypeError – Timeseries cannot be read from
data
-
_apply_filters
(filters, has_nan=True)¶ Determine rows to keep in data for given set of filters.
- Parameters
filters (
Dict
[~KT, ~VT]) – Dictionary of filters ({col: values})
; uses a pseudo-regexp syntax by default but iffilters["regexp"]
isTrue
, regexp is used directly.has_nan (
bool
) – If True, convert all nan values in meta_col
to empty string before applying filters. This means that “” and “*” will match rows withnp.nan
. IfFalse
, the conversion is not applied and so a search in a string column which containsnp.nan
will result in aTypeError
.
- Returns
Two boolean
np.ndarray
’s. The first contains the columns to keep (i.e. which time points to keep). The second contains the rows to keep (i.e. which metadata matched the filters).- Return type
- Raises
ValueError – Filtering cannot be performed on requested column
-
_data
= None¶ Timeseries data
-
_day_match
(values)¶
-
_meta
= None¶ Meta data
-
_sort_meta_cols
()¶
-
_time_points
= None¶ Time points
-
append
(other, inplace=False, duplicate_msg='warn', **kwargs)¶ Append additional data to the current dataframe.
For details, see
df_append()
.- Parameters
other (
Union
[ScmDataFrameBase
,None
,DataFrame
,Series
,ndarray
,str
]) – Data (in format which can be cast toScmDataFrameBase
) to appendinplace (
bool
) – IfTrue
, append data in place and returnNone
. Otherwise, return a newScmDataFrameBase
instance with the appended data.duplicate_msg (
Union
[str
,bool
]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. IfFalse
, take the average and do not raise a warning.**kwargs – Keywords to pass to
ScmDataFrameBase.__init__()
when readingother
- Returns
If not
inplace
, return a newScmDataFrameBase
instance containing the result of the append.- Return type
-
convert_unit
(unit, context=None, inplace=False, **kwargs)¶ Convert the units of a selection of timeseries.
Uses
openscm.units.UnitConverter
to perform the conversion.- Parameters
unit (
str
) – Unit to convert to. This must be recognised byUnitConverter
.context (
Optional
[str
]) – Context to use for the conversion i.e. which metric to apply when performing CO2-equivalent calculations. IfNone
, no metric will be applied and CO2-equivalent calculations will raiseDimensionalityError
.inplace (
bool
) – IfTrue
, the operation is performed inplace, updating the underlying data. Otherwise a newScmDataFrameBase
instance is returned.**kwargs – Extra arguments which are passed to
filter()
to limit the timeseries which are attempted to be converted. Defaults to selecting the entire ScmDataFrame, which will likely fail.
- Returns
If
inplace
is False
, a newScmDataFrameBase
instance with the converted units.- Return type
-
copy
()¶ Return a
copy.deepcopy()
of self.- Returns
copy.deepcopy()
ofself
- Return type
-
data_hierarchy_separator
= '|'¶ String used to define different levels in our data hierarchies.
By default we follow pyam and use “|”. In such a case, emissions of CO2 for energy from coal would be “Emissions|CO2|Energy|Coal”.
- Type
-
filter
(keep=True, inplace=False, has_nan=True, **kwargs)¶ Return a filtered ScmDataFrame (i.e., a subset of the data).
- Parameters
keep (
bool
) – If True, keep all timeseries satisfying the filters, otherwise drop all the timeseries satisfying the filtersinplace (
bool
) – If True, do operation inplace and return Nonehas_nan (
bool
) – IfTrue
, convert all nan values inmeta_col
to empty string before applying filters. This means that “” and “*” will match rows withnp.nan
. IfFalse
, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError
.**kwargs –
Argument names are keys with which to filter, values are used to do the filtering. Filtering can be done on:
all metadata columns with strings, “*” can be used as a wildcard in search strings
’level’: the maximum “depth” of IAM variables (number of hierarchy levels, excluding the strings given in the ‘variable’ argument)
’time’: takes a
datetime.datetime
or list ofdatetime.datetime
’s TODO: default to np.datetime64
’year’, ‘month’, ‘day’, ‘hour’: takes an
int
or list ofint
’s (‘month’ and ‘day’ also acceptstr
or list ofstr
)
If
regexp=True
is included inkwargs
then the pseudo-regexp syntax inpattern_match
is disabled.
- Returns
If not
inplace
, return a new instance with the filtered data.- Return type
- Raises
AssertionError – Data and meta become unaligned
-
head
(*args, **kwargs)¶ Return head of
self.timeseries()
.- Parameters
*args – Passed to
self.timeseries().head()
**kwargs – Passed to
self.timeseries().head()
- Returns
Head of
self.timeseries()
- Return type
pd.DataFrame
-
interpolate
(target_times, interpolation_type=<InterpolationType.LINEAR: 1>, extrapolation_type=<ExtrapolationType.CONSTANT: 0>)¶ Interpolate the dataframe onto a new time frame.
Uses
openscm.timeseries_converter.TimeseriesConverter
internally. For each time series aParameterType
is guessed from the variable name. To override the guessed parameter type, specify a “parameter_type” meta column before calling interpolate. The guessed parameter types are returned in meta.- Parameters
target_times (
Union
[ndarray
,List
[Union
[datetime
,int
]]]) – Time grid onto which to interpolateinterpolation_type (
Union
[InterpolationType
,str
]) – How to interpolate the data between timepointsextrapolation_type (
Union
[ExtrapolationType
,str
]) – If and how to extrapolate the data beyond the data inself.timeseries()
- Returns
A new
ScmDataFrameBase
containing the data interpolated onto thetarget_times
grid- Return type
-
line_plot
(x='time', y='value', **kwargs)¶ Plot a line chart.
See
pyam.IamDataFrame.line_plot()
for more information.- Return type
None
-
property
meta
¶ Metadata
- Return type
DataFrame
-
pivot_table
(index, columns, **kwargs)¶ Pivot the underlying data series.
See
pyam.core.IamDataFrame.pivot_table()
for details.- Return type
DataFrame
-
process_over
(cols, operation, **kwargs)¶ Process the data over the input columns.
- Parameters
cols (
Union
[str
,List
[str
]]) – Columns to perform the operation on. The timeseries will be grouped by all other columns inmeta
.operation (['median', 'mean', 'quantile']) – The operation to perform. This uses the equivalent pandas function. Note that quantile means the value of the data at a given point in the cumulative distribution of values at each point in the timeseries, for each timeseries once the groupby is applied. As a result, using
q=0.5
is the same as taking the median and not the same as taking the mean/average.
**kwargs – Keyword arguments to pass to the pandas operation
- Returns
The result of applying the operation to the timeseries, grouped by all columns in
meta
other thancols
- Return type
pd.DataFrame
- Raises
ValueError – If the operation is not one of [‘median’, ‘mean’, ‘quantile’]
-
region_plot
(**kwargs)¶ Plot regional data for a single model, scenario, variable, and year.
See
pyam.plotting.region_plot
for details.- Return type
None
-
relative_to_ref_period_mean
(append_str=None, **kwargs)¶ Return the timeseries relative to a given reference period mean.
The reference period mean is subtracted from all values in the input timeseries.
- Parameters
append_str (
Optional
[str
]) – String to append to the name of all the variables in the resulting DataFrame to indicate that they are relative to a given reference period. E.g. ‘rel. to 1961-1990’. If None, this will be autofilled with the keys and ranges of kwargs
.**kwargs – Arguments to pass to
filter()
to determine the data to be included in the reference time period. See the docs offilter()
for valid options.
- Returns
DataFrame containing the timeseries, adjusted to the reference period mean
- Return type
pd.DataFrame
-
rename
(mapping, inplace=False)¶ Rename and aggregate column entries using
groupby.sum()
on values. When renaming models or scenarios, the uniqueness of the index must be maintained, and the function will raise an error otherwise.- Parameters
- Returns
If
inplace
is False
, return a newScmDataFrameBase
instance- Return type
- Raises
ValueError – Column is not in meta or renaming will cause non-unique metadata
-
resample
(rule='AS', **kwargs)¶ Resample the time index of the timeseries data onto a custom grid.
This helper function allows for values to be easily interpolated onto annual or monthly timesteps using the rules ‘AS’ or ‘MS’ respectively. Internally, the interpolate function performs the regridding.
- Parameters
rule (
str
) –See the pandas user guide for a list of options. Note that Business-related offsets such as “BusinessDay” are not supported.
**kwargs – Other arguments to pass through to
interpolate()
- Returns
New
ScmDataFrameBase
instance on a new time index- Return type
Examples
Resample a dataframe to annual values
>>> scm_df = ScmDataFrame(
...     pd.Series([1, 2, 10], index=(2000, 2001, 2009)),
...     columns={
...         "model": ["a_iam"],
...         "scenario": ["a_scenario"],
...         "region": ["World"],
...         "variable": ["Primary Energy"],
...         "unit": ["EJ/y"],
...     }
... )
>>> scm_df.timeseries().T
model             a_iam
scenario     a_scenario
region            World
variable Primary Energy
unit               EJ/y
year
2000                  1
2001                  2
2009                 10
An annual timeseries can then be created by interpolating to the start of years using the rule ‘AS’.
>>> res = scm_df.resample('AS')
>>> res.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2001-01-01 00:00:00       2.001825
2002-01-01 00:00:00       3.000912
2003-01-01 00:00:00       4.000000
2004-01-01 00:00:00       4.999088
2005-01-01 00:00:00       6.000912
2006-01-01 00:00:00       7.000000
2007-01-01 00:00:00       7.999088
2008-01-01 00:00:00       8.998175
2009-01-01 00:00:00      10.000000
A monthly timeseries can be created in the same way using the rule ‘MS’.
>>> m_df = scm_df.resample('MS')
>>> m_df.timeseries().T
model                        a_iam
scenario                a_scenario
region                       World
variable            Primary Energy
unit                          EJ/y
time
2000-01-01 00:00:00       1.000000
2000-02-01 00:00:00       1.084854
2000-03-01 00:00:00       1.164234
2000-04-01 00:00:00       1.249088
2000-05-01 00:00:00       1.331204
2000-06-01 00:00:00       1.416058
2000-07-01 00:00:00       1.498175
2000-08-01 00:00:00       1.583029
2000-09-01 00:00:00       1.667883
...
2008-05-01 00:00:00       9.329380
2008-06-01 00:00:00       9.414234
2008-07-01 00:00:00       9.496350
2008-08-01 00:00:00       9.581204
2008-09-01 00:00:00       9.666058
2008-10-01 00:00:00       9.748175
2008-11-01 00:00:00       9.833029
2008-12-01 00:00:00       9.915146
2009-01-01 00:00:00      10.000000
[109 rows x 1 columns]
Note that the values do not fall exactly on integer values as not all years are exactly the same length.
References
See the pandas documentation for resample <http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html> for more information about possible arguments.
-
scatter
(x, y, **kwargs)¶ Plot a scatter chart using metadata columns.
See
pyam.plotting.scatter()
for details.- Return type
None
-
set_meta
(meta, name=None, index=None)¶ Set metadata information.
TODO: re-write this to make it more sane and add type annotations
- Parameters
- Raises
ValueError – No name can be determined from inputs or index cannot be coerced to
pd.MultiIndex
- Return type
None
-
tail
(*args, **kwargs)¶ Return tail of
self.timeseries()
.- Parameters
*args – Passed to
self.timeseries().tail()
**kwargs – Passed to
self.timeseries().tail()
- Returns
Tail of
self.timeseries()
- Return type
pd.DataFrame
-
timeseries
(meta=None)¶ Return the data in wide format (same as the timeseries method of
pyam.IamDataFrame
).- Parameters
meta (
Optional
[List
[str
]]) – The list of meta columns that will be included in the output’s MultiIndex. If None (default), then all metadata will be used.- Returns
DataFrame with datetimes as columns and timeseries as rows. Metadata is in the index.
- Return type
pd.DataFrame
- Raises
ValueError – If the metadata are not unique between timeseries
-
to_csv
(path, **kwargs)¶ Write timeseries data to a csv file
- Parameters
path (
str
) – Path to write the file into- Return type
None
-
to_iamdataframe
()¶ Convert to a LongDatetimeIamDataFrame instance.
LongDatetimeIamDataFrame is a subclass of pyam.IamDataFrame. We use LongDatetimeIamDataFrame to ensure all times can be handled, see docstring of LongDatetimeIamDataFrame for details.
- Returns
LongDatetimeIamDataFrame instance containing the same data.
- Return type
LongDatetimeIamDataFrame
- Raises
If pyam is not installed
-
to_parameterset
(parameterset=None)¶ Add parameters in this ScmDataFrameBase to a ParameterSet.
It can only be transformed if all timeseries have the same metadata. This is typically the case if all data comes from a single scenario/model input dataset. If that is not the case, further filtering is needed to reduce to a dataframe with identical metadata.
- Parameters
parameterset (Optional[ParameterSet]) – ParameterSet to add this ScmDataFrameBase’s parameters to. A new ParameterSet is created if this is None.
- Returns
ParameterSet containing the data in self (equals parameterset if not None)
- Return type
ParameterSet
- Raises
ValueError – Not all timeseries have the same metadata or climate_model is given and does not equal “unspecified”
-
property
values
¶ Timeseries values without metadata
Calls
timeseries()
- Return type
-
-
openscm.scmdataframe.base.
_format_data
(df)¶ Prepare data to initialize ScmDataFrameBase from pd.DataFrame or pd.Series.
See docstring of ScmDataFrameBase.__init__() for details.
- Parameters
df (Union[DataFrame, Series]) – Data to format.
- Returns
First dataframe is the data. Second dataframe is metadata.
- Return type
pd.DataFrame, pd.DataFrame
- Raises
ValueError – Not all required metadata columns are present or the time axis cannot be understood
-
openscm.scmdataframe.base.
_format_long_data
(df)¶
-
openscm.scmdataframe.base.
_format_wide_data
(df)¶
-
openscm.scmdataframe.base.
_from_ts
(df, index=None, **columns)¶ Prepare data to initialize ScmDataFrameBase from wide timeseries.
See docstring of ScmDataFrameBase.__init__() for details.
- Returns
First dataframe is the data. Second dataframe is metadata
- Return type
Tuple[pd.DataFrame, pd.DataFrame]
- Raises
ValueError – Not all required columns are present
-
openscm.scmdataframe.base.
_handle_potential_duplicates_in_append
(data, duplicate_msg)¶
-
openscm.scmdataframe.base.
_read_file
(fnames, *args, **kwargs)¶ Prepare data to initialize ScmDataFrameBase from a file.
- Parameters
*args – Passed to _read_pandas().
**kwargs – Passed to _read_pandas().
- Returns
First dataframe is the data. Second dataframe is metadata
- Return type
pd.DataFrame, pd.DataFrame
-
openscm.scmdataframe.base.
_read_pandas
(fname, *args, **kwargs)¶ Read a file and return a pd.DataFrame.
- Parameters
fname (str) – Path from which to read data
*args – Passed to pd.read_csv() if fname ends with ‘.csv’, otherwise passed to pd.read_excel().
**kwargs – Passed to pd.read_csv() if fname ends with ‘.csv’, otherwise passed to pd.read_excel().
- Returns
Read data
- Return type
pd.DataFrame
- Raises
OSError – Path specified by fname does not exist
-
openscm.scmdataframe.base.
df_append
(dfs, inplace=False, duplicate_msg='warn')¶ Append together many objects.
When appending many objects, it may be more efficient to call this routine once with a list of ScmDataFrames than to call ScmDataFrame.append() multiple times. If timeseries with duplicate metadata are found, the timeseries are appended and values falling on the same timestep are averaged (this behaviour can be adjusted with the duplicate_msg argument).
- Parameters
dfs (List[Union[ScmDataFrameBase, None, DataFrame, Series, ndarray, str]]) – The dataframes to append. An attempt is made to cast each value to ScmDataFrameBase.
inplace (bool) – If True, then the operation updates the first item in dfs and returns None.
duplicate_msg (Union[str, bool]) – If “warn”, raise a warning if duplicate data is detected. If “return”, return the joint dataframe (including duplicate timeseries) so the user can inspect further. If False, take the average and do not raise a warning.
- Returns
If not inplace, the return value is the object containing the merged data. The resultant class will be determined by the type of the first object. If duplicate_msg == "return", a pd.DataFrame will be returned instead.
- Return type
- Raises
TypeError – If inplace is True but the first element in dfs is not an instance of ScmDataFrameBase
ValueError – duplicate_msg option is not recognised.
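The averaging of duplicate timeseries described above can be sketched in plain Python. This is an illustration of the behaviour only, not the df_append implementation; average_duplicates and the (metadata, points) layout are hypothetical names chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(series_list):
    """Merge (meta, {time: value}) pairs, averaging values that share meta and time."""
    collected = defaultdict(lambda: defaultdict(list))
    for meta, points in series_list:
        for time, value in points.items():
            collected[meta][time].append(value)
    return {
        meta: {time: mean(values) for time, values in sorted(times.items())}
        for meta, times in collected.items()
    }


# Two timeseries with identical metadata that overlap at 2010.
a = (("a_model", "a_scenario"), {2005: 1.0, 2010: 6.0})
b = (("a_model", "a_scenario"), {2010: 4.0, 2015: 3.0})

merged = average_duplicates([a, b])
print(merged[("a_model", "a_scenario")])  # {2005: 1.0, 2010: 5.0, 2015: 3.0}
```

Timesteps present in only one input are kept unchanged; only the overlapping timestep is averaged.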
Filters¶
Helpers for filtering DataFrames.
Borrowed from pyam.utils.
-
openscm.scmdataframe.filters.
datetime_match
(data, dts)¶ Match datetimes in time columns for data filtering.
-
openscm.scmdataframe.filters.
day_match
(data, days)¶ Match days in time columns for data filtering.
-
openscm.scmdataframe.filters.
find_depth
(meta_col, s, level, separator='|')¶ Find all values which match given depth from a filter keyword.
- Parameters
meta_col (Series) – Column in which to find values which match the given depth
s (str) – Filter keyword, from which level should be applied
level (Union[int, str]) – Depth of value to match as defined by the number of separators in the value name. If an int, the depth is matched exactly. If a str, then the depth can be matched as either “X-”, for all levels up to level “X”, or “X+”, for all levels above level “X”.
separator (str) – The string used to separate levels in s. Defaults to a pipe (“|”).
- Returns
Array where True indicates a match
- Return type
np.array of bool
- Raises
ValueError – If level cannot be understood
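A pure-Python sketch of depth matching may help make the level argument concrete. It is not the library code: find_depth_sketch is a hypothetical name, it works on a plain list rather than a Series, and it assumes that the “X-” and “X+” bounds are inclusive, which the description above does not state explicitly.

```python
def find_depth_sketch(values, s, level, separator="|"):
    """Return a list of bools marking values whose depth below `s` matches `level`."""
    prefix = s.rstrip("*")  # a trailing wildcard is not part of the prefix

    if isinstance(level, int):
        def test(depth):
            return depth == level
    elif isinstance(level, str) and level[-1:] in ("-", "+"):
        bound = int(level[:-1])  # raises ValueError for e.g. "x+"
        if level.endswith("-"):
            def test(depth):
                return depth <= bound  # assumption: bound is inclusive
        else:
            def test(depth):
                return depth >= bound  # assumption: bound is inclusive
    else:
        raise ValueError("Unknown level type: {!r}".format(level))

    matches = []
    for value in values:
        if value.startswith(prefix):
            # Depth is the number of separators after the filter keyword.
            matches.append(test(value[len(prefix):].count(separator)))
        else:
            matches.append(False)
    return matches


variables = ["Emissions", "Emissions|CO2", "Emissions|CO2|Energy", "Primary Energy"]
print(find_depth_sketch(variables, "*", 1))     # [False, True, False, False]
print(find_depth_sketch(variables, "*", "1+"))  # [False, True, True, False]
print(find_depth_sketch(variables, "*", "1-"))  # [True, True, False, True]
```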
-
openscm.scmdataframe.filters.
hour_match
(data, hours)¶ Match hours in time columns for data filtering.
-
openscm.scmdataframe.filters.
is_in
(vals, items)¶ Find elements of vals which are in items.
- Parameters
- Returns
Array of the same length as vals where the element is True if the corresponding element of vals is in items and False otherwise
- Return type
np.array of bool
-
openscm.scmdataframe.filters.
month_match
(data, months)¶ Match months in time columns for data filtering.
-
openscm.scmdataframe.filters.
pattern_match
(meta_col, values, level=None, regexp=False, has_nan=True, separator='|')¶ Filter data by matching metadata columns to given patterns.
- Parameters
meta_col (Series) – Column to perform filtering on
level (Union[str, int, None]) – Passed to find_depth(). For usage, see docstring of find_depth().
regexp (bool) – If True, match using regexp rather than pseudo regexp syntax of pyam.
has_nan (bool) – If True, convert all nan values in meta_col to empty string before applying filters. This means that “” and “*” will match rows with np.nan. If False, the conversion is not applied and so a search in a string column which contains np.nan will result in a TypeError.
separator (str) – String used to separate the hierarchy levels in values. Defaults to ‘|’
- Returns
Array where True indicates a match
- Return type
np.array of bool
- Raises
TypeError – Filtering is performed on a string metadata column which contains np.nan and has_nan is False
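The interaction between regexp and has_nan can be sketched with the standard library. The sketch assumes that the pseudo regexp syntax treats “*” as a wildcard and every other character literally; pattern_match_sketch is a hypothetical name, and float("nan") stands in for np.nan.

```python
import math
import re


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def pattern_match_sketch(values, pattern, regexp=False, has_nan=True):
    """Return a list of bools marking which values match the pattern."""
    if has_nan:
        # Convert nan to "" so that "" and "*" can match missing metadata.
        values = ["" if _is_nan(v) else v for v in values]
    if not regexp:
        # Assumed pseudo regexp: escape everything, then turn "*" into ".*".
        pattern = "^" + re.escape(pattern).replace(r"\*", ".*") + "$"
    compiled = re.compile(pattern)
    # Without the nan conversion, matching a float raises TypeError.
    return [compiled.match(v) is not None for v in values]


column = ["Emissions|CO2", "Emissions|CH4", "Primary Energy", float("nan")]
print(pattern_match_sketch(column, "Emissions|*"))  # [True, True, False, False]
print(pattern_match_sketch(column, "*"))            # [True, True, True, True]
print(pattern_match_sketch(column, r".*Energy$", regexp=True))
```

Calling pattern_match_sketch(column, "*", has_nan=False) raises TypeError, mirroring the documented behaviour for string columns that contain nan.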
-
openscm.scmdataframe.filters.
time_match
(data, times, conv_codes, strptime_attr, name)¶ Match times by applying conversion codes to filtering list.
- Parameters
data (List[~T]) – Input data to perform filtering on
times (Union[List[str], List[int], int, str]) – Times to match
conv_codes (List[str]) – If times contains strings, conversion codes to try passing to time.strptime() to convert times to datetime.datetime
strptime_attr (str) – If times contains strings, the datetime.datetime attribute to finalize the conversion of strings to integers
name (str) – Name of the part of a datetime to extract, used to produce useful error messages.
- Returns
Array where True indicates a match
- Return type
np.array of bool
- Raises
ValueError – If input times cannot be understood or if input strings do not lead to increasing integers (e.g. “Nov-Feb” will not work, one must use [“Nov-Dec”, “Jan-Feb”] instead)
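The conversion-code mechanism can be sketched for months using time.strptime() from the standard library. This is an illustration, not the library implementation: months_match_sketch and to_int are hypothetical names, the "-" range syntax is taken from the error description above, and month names are parsed in the current locale (English is assumed here).

```python
import time


def to_int(value, conv_codes, strptime_attr):
    """Try each conversion code in turn and return the requested integer attribute."""
    for code in conv_codes:
        try:
            return getattr(time.strptime(value, code), strptime_attr)
        except ValueError:
            continue
    raise ValueError("Could not convert {!r}".format(value))


def months_match_sketch(months, wanted):
    """Return a list of bools marking which entries of `months` are wanted."""
    conv_codes, attr = ["%b", "%B"], "tm_mon"  # abbreviated, then full month names
    targets = set()
    for item in wanted if isinstance(wanted, list) else [wanted]:
        if isinstance(item, int):
            targets.add(item)
        elif "-" in item:
            first, last = (to_int(part, conv_codes, attr) for part in item.split("-"))
            if first > last:
                raise ValueError("{!r} does not give increasing integers".format(item))
            targets.update(range(first, last + 1))
        else:
            targets.add(to_int(item, conv_codes, attr))
    return [month in targets for month in months]


months = [1, 2, 3, 11, 12]
print(months_match_sketch(months, ["Nov-Dec", "Jan-Feb"]))  # [True, True, False, True, True]
print(months_match_sketch(months, "March"))                 # [False, False, True, False, False]
```

A wrap-around range such as "Nov-Feb" converts to 11 and 2, which is not increasing, so the sketch raises ValueError just as the description above requires.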
-
openscm.scmdataframe.filters.
years_match
(data, years)¶ Match years in time columns for data filtering.
Offsets¶
A simplified version of pandas.DateOffset objects which use datetime-like objects instead of pandas.Timestamp.
This differentiation allows for times which exceed the range of pandas.Timestamp (see here) which is particularly important for longer running models.
TODO: use np.timedelta64 instead?
-
openscm.scmdataframe.offsets.
apply_dt
(func, self)¶ Apply a wrapper which keeps the result as a datetime instead of converting to pd.Timestamp.
This decorator is a simplified version of pandas.tseries.offsets.apply_wraps(). It is required to avoid running into errors when our time data is outside pandas’ limited time range of 1677-09-22 00:12:43.145225 to 2262-04-11 23:47:16.854775807, see this discussion.
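The upper limit quoted above is what a signed 64-bit count of nanoseconds since 1970-01-01 yields. The short check below uses only the standard library to reproduce it to microsecond precision and to show that a plain datetime has no difficulty beyond that point; the variable names are illustrative.

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)
MAX_NANOSECONDS = 2**63 - 1  # largest signed 64-bit integer

# datetime only resolves microseconds, so the last three digits are dropped.
upper_limit = EPOCH + dt.timedelta(microseconds=MAX_NANOSECONDS // 1000)
print(upper_limit)  # 2262-04-11 23:47:16.854775

# A long-running model can easily pass that date; datetime handles it.
far_future = dt.datetime(3000, 1, 1)
print(far_future > upper_limit)  # True
```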
-
openscm.scmdataframe.offsets.
apply_rollback
(obj)¶ Roll provided date backward to previous offset, only if not on offset.
-
openscm.scmdataframe.offsets.
apply_rollforward
(obj)¶ Roll provided date forward to next offset, only if not on offset.
-
openscm.scmdataframe.offsets.
generate_range
(start, end, offset)¶ Generate a range of datetime objects between start and end, using offset to determine the steps.
The range will extend both ends of the span to the next valid timestep, see examples.
- Parameters
start (datetime) – Starting datetime from which to generate the range (noting roll backward mentioned above and illustrated in the examples).
end (datetime) – Last datetime from which to generate the range (noting roll forward mentioned above and illustrated in the examples).
offset (DateOffset) – Offset object for determining the timesteps. An offsetter obtained from to_offset() must be used.
- Yields
datetime.datetime – Next datetime in the range
- Raises
ValueError – Offset does not result in increasing datetime.datetime values
Examples
The range is extended at either end to the nearest timestep. In the example below, the first timestep is rolled back to 1st Jan 2001 whilst the last is extended to 1st Jan 2006.
>>> import datetime as dt
>>> from pprint import pprint
>>> from openscm.scmdataframe.offsets import to_offset, generate_range
>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("AS"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2001, 1, 1, 0, 0),
 datetime.datetime(2002, 1, 1, 0, 0),
 datetime.datetime(2003, 1, 1, 0, 0),
 datetime.datetime(2004, 1, 1, 0, 0),
 datetime.datetime(2005, 1, 1, 0, 0),
 datetime.datetime(2006, 1, 1, 0, 0)]
In this example the first timestep is rolled back to 31st Dec 2000 whilst the last is extended to 31st Dec 2005.
>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("A"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2000, 12, 31, 0, 0),
 datetime.datetime(2001, 12, 31, 0, 0),
 datetime.datetime(2002, 12, 31, 0, 0),
 datetime.datetime(2003, 12, 31, 0, 0),
 datetime.datetime(2004, 12, 31, 0, 0),
 datetime.datetime(2005, 12, 31, 0, 0)]
In this example the first timestep is already on the offset so stays there, while the last timestep is rolled forward to 1st Jul 2005.
>>> g = generate_range(
...     dt.datetime(2001, 4, 1),
...     dt.datetime(2005, 6, 3),
...     to_offset("QS"),
... )
>>> pprint([d for d in g])
[datetime.datetime(2001, 4, 1, 0, 0),
 datetime.datetime(2001, 7, 1, 0, 0),
 datetime.datetime(2001, 10, 1, 0, 0),
 datetime.datetime(2002, 1, 1, 0, 0),
 datetime.datetime(2002, 4, 1, 0, 0),
 datetime.datetime(2002, 7, 1, 0, 0),
 datetime.datetime(2002, 10, 1, 0, 0),
 datetime.datetime(2003, 1, 1, 0, 0),
 datetime.datetime(2003, 4, 1, 0, 0),
 datetime.datetime(2003, 7, 1, 0, 0),
 datetime.datetime(2003, 10, 1, 0, 0),
 datetime.datetime(2004, 1, 1, 0, 0),
 datetime.datetime(2004, 4, 1, 0, 0),
 datetime.datetime(2004, 7, 1, 0, 0),
 datetime.datetime(2004, 10, 1, 0, 0),
 datetime.datetime(2005, 1, 1, 0, 0),
 datetime.datetime(2005, 4, 1, 0, 0),
 datetime.datetime(2005, 7, 1, 0, 0)]
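The roll-back and roll-forward behaviour in the first example (the year-start offset) can be reproduced with plain datetime arithmetic. The generator below is a sketch for that single offset only; year_start_range is a hypothetical name and the code is not taken from generate_range.

```python
import datetime as dt


def year_start_range(start, end):
    """Yield 1 Jan datetimes covering [start, end], extending both ends."""
    # Roll the start back to the beginning of its year.
    first_year = start.year
    # Roll the end forward to the next 1 Jan unless it is already on one.
    on_offset = end == dt.datetime(end.year, 1, 1)
    last_year = end.year if on_offset else end.year + 1
    for year in range(first_year, last_year + 1):
        yield dt.datetime(year, 1, 1)


dates = list(year_start_range(dt.datetime(2001, 4, 1), dt.datetime(2005, 6, 3)))
print(dates[0])   # 2001-01-01 00:00:00
print(dates[-1])  # 2006-01-01 00:00:00
```

The result spans 1st Jan 2001 to 1st Jan 2006, the same six timesteps as the first example, and every value stays a datetime.datetime throughout.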
-
openscm.scmdataframe.offsets.
to_offset
(rule)¶ Return a wrapped DateOffset instance for a given rule.
The DateOffset class is manipulated to return datetimes instead of pd.Timestamp, allowing it to handle times outside pandas’ limited time range of 1677-09-22 00:12:43.145225 to 2262-04-11 23:47:16.854775807, see this discussion.
- Parameters
rule (str) – The rule to use to generate the offset. For options see pandas offset aliases.
- Returns
Wrapped DateOffset class for the given rule
- Return type
DateOffset
- Raises
ValueError – If unsupported offset rule is requested, e.g. all business related offsets
Pyam Compatibility¶
Imports and classes required to ensure that compatibility with Pyam is handled intelligently.