economic_model_pipeline package¶

Here we add documentation concerning the overall pipeline.

The syntax is identical to that in the docstrings. For example, we can add diagrams in the same way:

Pipeline diagram of economic_model_pipeline.CutRasterForTrainingData, economic_model_pipeline.CreateSurfacePlot, economic_model_pipeline.UploadToCkan on Mon 09 Jul 12:02

This is a caption for the diagram. Hover over nodes for more details.¶

For more detailed information, hover your mouse over the nodes. Those that have docstrings defined will display them as a tooltip. The diagram background tooltip lists the tasks the diagram was generated for and the date and time it was generated. We can also specify an alternative base URL for the node hyperlinks, to jump to a task’s documentation on a different page.

We can omit tasks with the omit_tasks flag:

Diagram of the task and target DAG omitting tasks.¶

There is also an omit_outputs flag:

Diagram of the task and target DAG omitting output nodes.¶

We can filter by tag using the tags flag. Here only tasks or targets tagged another or this are shown:

Diagram including all nodes with the tags another or this.¶

We can also filter by base class - here we only display subclasses of Task and CkanTarget:

Diagram including only subclasses of pipelines.tasks.Task or CkanTarget.¶

We can confirm these using an inheritance diagram, for example for RegridRasterForTrainingData, ClusterDataCkanExternal and CkanTarget:

Inheritance diagram of economic_model_pipeline.RegridRasterForTrainingData, economic_model_pipeline.ClusterDataCkanExternal, pipelines.targets.CkanTarget

An inheritance diagram for RegridRasterForTrainingData, ClusterDataCkanExternal and CkanTarget¶

We can filter by both tags and base classes, e.g. only subclasses of Target or ExternalTask that also have the tags another or this:

A diagram combining base class and tag filters.¶

These also combine with the omit_tasks and omit_outputs flags; here omit_outputs is used (rather pointless in this case):

A diagram combining base class, tag and omit_outputs filters.¶

We can tell Sphinx to extract the documentation for every task in the module:

class economic_model_pipeline.ClusterDataCkanExternal(*args, tags=None, **kwargs)¶

output()¶

We could use standard RST docstrings:

param temp_df: The survey response DataFrame

type temp_df: DataFrame

param xyz_q_list: The list of questions for xyz

type xyz_q_list: list

returns: A list of processed answers

rtype: list
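As source, the field list rendered above might look like the following. This is a minimal sketch: `process_answers_rst` is a hypothetical function name and the body is a placeholder, since only the docstring fields come from this page.

```python
def process_answers_rst(temp_df, xyz_q_list):
    """Process the answers to the xyz questions (hypothetical example).

    :param temp_df: The survey response DataFrame
    :type temp_df: DataFrame
    :param xyz_q_list: The list of questions for xyz
    :type xyz_q_list: list
    :returns: A list of processed answers
    :rtype: list
    """
    # Placeholder body: look up each question's column in the response data.
    return [temp_df[question] for question in xyz_q_list]
```

Each parameter needs two fields (`:param:` and `:type:`), which is what makes this style fiddly to type.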

Or the Google style - more readable and less fiddly to type:

Parameters:	
temp_df (DataFrame) – The survey response DataFrame
xyz_q_list (list) – The `list` of questions for xyz
Returns:	A `list` of processed answers
Return type:	list
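The same fields written in Google style could look like this. Again a sketch: `process_answers_google` is a hypothetical name and the body is a placeholder.

```python
def process_answers_google(temp_df, xyz_q_list):
    """Process the answers to the xyz questions (hypothetical example).

    Args:
        temp_df (DataFrame): The survey response DataFrame
        xyz_q_list (list): The `list` of questions for xyz

    Returns:
        list: A `list` of processed answers
    """
    # Placeholder body: look up each question's column in the response data.
    return [temp_df[question] for question in xyz_q_list]
```

The name, type and description of each parameter sit on one line, so the docstring stays compact.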

Or the Numpy style - more readable when we’ve lots of parameters, but takes more vertical space:

Parameters:	
temp_df (DataFrame) – The survey response DataFrame
xyz_q_list (list) – The `list` of questions for xyz
Returns:	The `list` of processed answers
Return type:	list
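In Numpy style the same docstring might be written as below. This is a sketch with a hypothetical function name, `process_answers_numpy`, and a placeholder body.

```python
def process_answers_numpy(temp_df, xyz_q_list):
    """Process the answers to the xyz questions (hypothetical example).

    Parameters
    ----------
    temp_df : DataFrame
        The survey response DataFrame
    xyz_q_list : list
        The `list` of questions for xyz

    Returns
    -------
    list
        The `list` of processed answers
    """
    # Placeholder body: look up each question's column in the response data.
    return [temp_df[question] for question in xyz_q_list]
```

Each parameter takes at least two lines, which is where the extra vertical space comes from.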

Possible section headers:

Args (alias of Parameters)

Arguments (alias of Parameters)

Attributes

Example

Examples

Keyword Args (alias of Keyword Arguments)

Keyword Arguments

Methods

Note

Notes

Other Parameters

Parameters

Return (alias of Returns)

Returns

Raises

References

See Also

Todo

Warning

Warnings (alias of Warning)

Warns

Yield (alias of Yields)

Yields
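These section headers match the ones recognised by the `sphinx.ext.napoleon` extension. Assuming that is what parses the Google and Numpy styles here (this page does not say so explicitly), the relevant fragment of `conf.py` might look like this sketch:

```python
# conf.py (fragment): a sketch, to be merged with the project's existing settings.
extensions = [
    "sphinx.ext.autodoc",   # pulls docstrings out of the Python modules
    "sphinx.ext.napoleon",  # converts Google/Numpy sections into RST fields
]

# Both styles are enabled by default; they are set here only to be explicit.
napoleon_google_docstring = True
napoleon_numpy_docstring = True
```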

tags = 'another'¶

Pull the clustered economic data from CKAN.

This is generated from the docstring of ClusterDataCkanExternal. It contains examples of some of the syntax made available by the Sphinx RST language. Use the examples to build the ideal docstring format for Task classes. Also see this class’s output() docstring for the three ways we could document methods:

Standard RST
The Google format - looks nicer
The Numpy format - looks nicer for large numbers of parameters, but takes more vertical space

In addition to docstrings on Task classes and their methods, analysts will maintain RST files documenting the pipelines. See economic_pipeline.rst

The source of the below demonstrates some of the syntax.

It includes code inline, and as a block:

this.py¶

 def output(self):
     l = [1, 2, 3, ]
     i = l + [4, 5, ]
     # This should be Python highlighted and lines 2 and 3 emphasised - these don't seem to work. Hey ho.
     return [LocalTarget(os.path.join(config.get("paths", "quarterly_data_path"), "Section_6__Household.csv"))]

It includes a doctest section. Doctests can be run automatically on each commit to check that the documentation is up to date.

>>> 1 + 1
2
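One way such a check could be scripted is with the standard library `doctest` module. The sketch below is an assumption about how a commit hook might do it, not the project's actual setup; `add_one` and `run_doctests` are hypothetical names.

```python
import doctest


def add_one(x):
    """Return x plus one.

    >>> add_one(1)
    2
    """
    return x + 1


def run_doctests(obj):
    """Run the doctests found in obj's docstring; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Pass the object in explicitly so the examples can refer to it by name.
    for test in finder.find(obj, name=obj.__name__, globs={obj.__name__: obj}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


failed, attempted = run_doctests(add_one)
```

A commit hook could then fail the build whenever `failed` is non-zero.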

Some tables:

Header row, column 1 (header rows optional)	Header 2	Header 3	Header 4
body row 1, column 1	column 2	column 3	column 4
body row 2	…	…

Simpler:

A	B	A and B
False	False	False
True	False	False
False	True	False
True	True	True

A link

A footnote too.

Also, a footnoted academic reference [ref01]

Some text replacement replacement replaced

An internal reference to our in-page RegridRasterForTrainingData will convert into a hyperlink.

An external reference to Luigi Task will not convert into a hyperlink. You can use :class:, :mod: and :func: similarly.

Warning

It also includes a warning.

A version comment is added when a package has functionality added:

New in version 0.30.

An image:

Inline maths: $a^2 + b^2 = c^2$ .

Maths sections, labelled for cross-referencing:

(1)¶ $e^{i\pi} + 1 = 0$

Euler’s identity, equation (1).

NB: when a :label: is specified, readthedocs currently renders it oddly in the autodoc list, so don’t label equations until readthedocs fixes this.

There are more complex directives documented here.

This is what the pipeline DAG diagram produces for CutRasterForTrainingData. The pipeline diagram directive can merge multiple endpoint task DAGs into a single diagram - in this case CutRasterForTrainingData, CreateSurfacePlot and UploadToCkan.

There is an inheritance diagram directive, here run on RegridRasterForTrainingData:

Inheritance diagram of economic_model_pipeline.RegridRasterForTrainingData

[ref01]

https://www.example.com

class economic_model_pipeline.CreateSurfacePlot(*args, tags=None, **kwargs)¶

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output
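The completion rule described above can be sketched in plain Python. This is a simplified stand-in and not Luigi's implementation: `FileTarget`, `as_list` and `is_complete` are hypothetical names used only for illustration.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for a Target backed by a local file."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


def as_list(output):
    """Accept a single target or a list of targets, as output() may return either."""
    return list(output) if isinstance(output, (list, tuple)) else [output]


def is_complete(output):
    """A task counts as finished if and only if every output target exists."""
    return all(target.exists() for target in as_list(output))


with tempfile.TemporaryDirectory() as tmp:
    done = FileTarget(os.path.join(tmp, "done.csv"))
    missing = FileTarget(os.path.join(tmp, "missing.csv"))
    open(done.path, "w").close()  # create only the first output

    single_complete = is_complete(done)            # one existing target
    mixed_complete = is_complete([done, missing])  # one target is absent
```

Because a single missing target makes the whole task incomplete, a task that writes several files would be run again if any one of them is deleted.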

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires
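The three return shapes allowed for requires() can be normalised as in the sketch below. It is an illustration in plain Python rather than Luigi's own code: `flatten_requirements` and `can_run` are hypothetical helpers, and strings stand in for Task instances.

```python
def flatten_requirements(required):
    """Return the required tasks as a flat list.

    Accepts a single task, a list of tasks, or a dict whose values are
    tasks. None or an empty container means there are no dependencies.
    """
    if required is None:
        return []
    if isinstance(required, dict):
        return list(required.values())
    if isinstance(required, (list, tuple)):
        return list(required)
    return [required]


def can_run(required, is_complete):
    """A task may run only when every task it requires is complete."""
    return all(is_complete(task) for task in flatten_requirements(required))


# Strings stand in for Task instances; "rasters" is the only finished one.
finished = {"rasters"}
ready = can_run({"raster": "rasters"}, finished.__contains__)
blocked = can_run(["rasters", "survey"], finished.__contains__)
```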

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.CutRasterForTrainingData(*args, tags=None, **kwargs)¶

Cut Raster for Training Data.

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.ExtractClusterData(*args, tags=None, **kwargs)¶

Description:: Take GPS coordinates for clusters and extract the median value of the desired rasters for each cluster.
Inputs::
Raster Filepaths (Dictionary of CkanTargets): raster files of independent variables.
DataFrame Filepath (Dictionary of LocalTarget): dataframe file of GPS points for cluster centers.
Outputs::
raster_clusters_data_file (Serialized DataFrame): dataframe with independent variables for each cluster.
User Defined Parameters::
data_list (List): list of variables from the input dataframe to be used as independent variables in the model. Default = [‘accessibility’, ‘population’, ‘night_lights’, ‘landcover’]
multi_wave_data (List): list of variables in data_list that vary from round to round. Default = [‘population’]
waves (Dict): a dictionary translating the integer wave number to the string year. Default = {1: ‘2015’, 2: ‘2016’, 3: ‘2016’, 4: ‘2017’}
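To illustrate how these three parameters could fit together, the sketch below resolves each variable to a raster name for a given wave. The `variable_year` naming pattern is an assumption, inferred from the `population_2015` style input names listed for ImportRasterDataLocal, and `raster_key` is a hypothetical helper, not part of the task.

```python
# Defaults copied from the parameter list above.
data_list = ["accessibility", "population", "night_lights", "landcover"]
multi_wave_data = ["population"]
waves = {1: "2015", 2: "2016", 3: "2016", 4: "2017"}


def raster_key(variable, wave):
    """Return the raster name for a variable in a given wave (assumed naming)."""
    if variable in multi_wave_data:
        # Variables that vary by round get the wave's year appended.
        return f"{variable}_{waves[wave]}"
    # Static variables use the same raster for every wave.
    return variable


keys_wave_1 = [raster_key(variable, 1) for variable in data_list]
keys_wave_3 = [raster_key(variable, 3) for variable in data_list]
```

Note that waves 2 and 3 both map to 2016, so under this assumption they would share the same population raster.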

Literary References:

Methods:

buffer_points() from the geospatialtasks repo (https://github.com/kimetrica/geospatialtasks/blob/input_functions/functions/geospatial.py)

calc_cluster_data() from the geospatialtasks repo (https://github.com/kimetrica/geospatialtasks/blob/input_functions/functions/remote_sensing_functs.py)

merge_dataframes(stats_df, df, data_col_name_map, on_wave=False)¶

Concatenate multiple rounds of data.
stats_df = list of dataframes to concatenate
df = dataframe with one row per cluster that will accrue all the data
data_col_name_map = map of old to new data column names, e.g. {‘median’: ‘population’}

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.ImportHouseholdDataLocal(*args, tags=None, **kwargs)¶

Description:: Pull the 4 waves of household survey data from CKAN.
Inputs:: HFS data (DataFrame): High Frequency Survey data from 2015-2017 from the World Bank
Outputs:: DataFrame Filepaths (Dictionary of CkanTargets): filepaths of household survey data pulled from CKAN server.

User Defined Parameters: Literary References: Methods:

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

class economic_model_pipeline.ImportRasterDataLocal(*args, tags=None, **kwargs)¶

Description:: Pull the raster data for independent variables.
Inputs::
accessibility (Raster): the accessibility data from the Malaria Atlas Project
night_lights (Raster): the night lights data from NOAA
landcover (Raster): the landcover data from USGS
population_2015 (Raster): the population data from the population model
population_2016 (Raster): the population data from the population model
population_2017 (Raster): the population data from the population model
Outputs:: Raster Filepaths (Dictionary of CkanTargets): local filepaths to all data pulled in from CKAN server.

User Defined Parameters: References: Methods:

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

class economic_model_pipeline.MergeRasterEconomicClusters(*args, tags=None, **kwargs)¶

Description:: Merge the raster data with the socioeconomic data for each cluster.
Inputs::
raster_clusters_data_file (Serialized DataFrame): dataframe with independent variables for each cluster.
DataFrame Filepaths (Dictionary of CkanTargets): filepaths to household survey data containing the dependent variable.
Outputs:: processed_data_file (csv): dataframe with independent and dependent variables for each cluster.
User Defined Parameters::
filename: the filepath at which to save the results.
dep_var: the name of the dependent variable contained in the household survey data. Default = tc_imp
poverty_line: the poverty line of this dataset. Default = 8.7126

Literary References: Methods:

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.RegridRasterForTrainingData(*args, tags=None, **kwargs)¶

Regrid Raster for Training Data.

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.TrainEconomicModel(*args, tags=None, **kwargs)¶

Run the economic model using the cluster data.

Dependencies: EconomicModel, model_ert_cont

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run

class economic_model_pipeline.UploadToCkan(*args, tags=None, **kwargs)¶

Description:: Upload dataframe containing all independent and dependent variables for the clusters to CKAN
Inputs:: processed_data_file (csv): dataframe with independent and dependent variables for each cluster.
Outputs:: ea_remote_data (csv): DataFrame of all data uploaded to CKAN

User Defined Parameters: Literary References: Methods:

output()¶

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note: If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()¶

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()¶

The task run method, to be overridden in a subclass.

See Task.run