A python library for reading data from Statistics Canada

This library implements most of the functions defined by the Statistics Canada Web Data Service. It also has a number of helper functions that make it easy to read Statistics Canada tables or vectors into pandas dataframes.

Installation

The package can either be installed with pip or conda:

conda install -c conda-forge stats_can

Or:

pip install stats-can

The code is also available on github,

Quickstart

After installing the stats_can package, all of the core functionality is available by instantiating a StatsCan object:

from stats_can import StatsCan
sc = StatsCan()

Without any arguments the StatsCan object will look for a file in the current working directory named “stats_can.h5”. If it doesn’t exist it will create one when it is first asked to load a table. You can also pass in a path in order to specify the location of the file. This is useful if you or a team want persistent access to certain tables.

For example:

sc = StatsCan(data_folder="~/stats_can_data")

The most common use case for stats_can is simply to read in a table from Statistics Canada to a Pandas DataFrame. For example, table 271-000-22-01 is “Personnel engaged in research and development by performing sector and occupational category” to read in that table (downloading it first if it’s the first time accessing it) run:

df = sc.table_to_df("271-000-22-01")

If there are only a couple specific series of interest you can also read them into a dataframe (whether they’re in different source tables or not) as follows:

df = sc.vectors_to_df(["v74804", "v41692457"])

The above command takes an optional start_date argument which will return a dataframe beginning with a reference date no earlier than the provided start date. By default it will return all available history for the V#s provided.

You can check which tables you have stored locally by running

sc.downloaded_tables

Which will return a list of table numbers.

If a table is locally stored, it will not automatically update if Statistics Canada releases an update. To update locally stored tables run:

sc.update_tables()

You can optionally pass in a list of tables if you only want a subset of the locally stored tables to be updated.

Finally, if you want to delete any tables you’ve loaded you can run:

sc.delete_tables("271-000-22-01")

StatsCan class documentation

Core functions outlined in the Quickstart along with some extra functionality are described here:

class stats_can.api_class.StatsCan(data_folder=None)[source]

Load Statistics Canada data and metadata into python.

Parameters
  • data_folder (Path/str, default None) –

  • location to save/search for locally stored Statistics Canada (The) –

  • tables. Defaults to the current working directory (data) –

delete_tables(tables)[source]

Remove locally stored tables.

Parameters

tables (str or [str]) – tables to delete

Returns

Return type

[deleted tables]

property downloaded_tables

Check which tables you’ve downloaded.

Checks the file “stats_can.h5” in the instantiated data folder and lists all tables stored there.

Returns

Return type

[table_ids]

static get_code_sets()[source]

Get code sets.

Code sets provide additional metadata to describe variables and are grouped into scales, frequencies, symbols etc.

Returns

code_sets – one dictionary for each group of information

Return type

[dict]

static get_tables_for_vectors(vectors)[source]

Find which table(s) a V# or list of V#s is from.

Parameters

vectors (str or [str]) – V#(s) to look up tables for

Returns

  • dictionary of vector (table pairs plus an)

  • ”all_tables” key with a list of all tables

  • containing the input V#s

>>> StatsCan.get_tables_for_vectors("v39050")
{39050: '10100139', 'all_tables': ['10100139']}
>>> StatsCan..get_tables_for_vectors(["v39050", "v1074250274"])
{39050: '10100139', 1074250274: '16100011', 'all_tables': ['10100139', '16100011']}
table_to_df(table)[source]

Read a table to a dataframe.

Parameters

table (str) – The ID of the table of interest, e.g “271-000-22”

Returns

  • pandas.DataFrame – Dataframe of the requested table

  • If the table has been previously loaded to the file in self.data_folder

  • it will retrieve that locally stored dataframe. If it’s unavailable it will

  • download it and then return the table. To update a locally stored table,

  • call StatsCan.update_tables(), optionally passing just the table number of interest

static tables_updated_on_date(date)[source]

Get a list of tables that were updated on a given date.

Parameters

date (str or datetime.date) – The date to check tables

Returns

changed_tables – one dictionary for each table with its update date

Return type

[dict]

static tables_updated_today()[source]

Get a list of tables that were updated today.

Returns

changed_tables – one dictionary for each table with its update date

Return type

[dict]

update_tables(tables=None)[source]

Update locally stored tables.

Compares latest available reference period in locally stored tables to the latest available on Statistics Canada and updates any tables necessary

Parameters

tables (str or [str], default None) – Optional subset of tables to check for updates, defaults to update all downloaded tables

Returns

Return type

[str] list of tables that were updated, empty list if no updates made

static vector_metadata(vectors)[source]

Get metadata on vectors.

Parameters

vectors (str or [str]) – V#(s) to retrieve metadata

Returns

vector_metadata – list of dictionaries with one dict for each vector

Return type

[dict]

vectors_to_df(vectors, start_date=None)[source]

Get a dataframe of V#s.

Parameters
  • vectors (str or [str]) – the V#s to retrieve

  • start_date (datetime.date, optional) – earliest reference period to return, defaults to all available history

Returns

  • pandas.DataFrame – Dataframe indexed on reference date, with columns for each V# input

  • Note that any V#s in tables that are not currently locally stored will

  • have their tables downloaded prior to returning the dataframe

static vectors_to_df_remote(vectors, periods=1, start_release_date=None, end_release_date=None)[source]

Retrieve V# data directly from Statistics Canada.

Parameters
  • vectors (str or [str]) – V#(s) to retrieve data for

  • periods (int, default 1) – Number of periods to retrieve data. Note that this will be ignored if start_release_date and end_release date are set

  • start_release_date (datetime.date, default None) – earliest release date to retrieve data

  • end_release_date (datetime.date, default None) – latest release date to retrieve data

Returns

  • pandas.DataFrame – Dataframe indexed on reference (not release) date, with columns for each V# input

  • Note that start and end release date refer to the dates the data was released,

  • not the reference period they cover. For example. October labour force survey

  • data is released on the first or second Friday of November.

static vectors_updated_today()[source]

Get a list of all V#s that were updated today.

Returns

changed_series – one dictionary for each vector with its update date

Return type

[dict]

Contributing

Contributions to this project are welcome. Fork the repository from github,

You’ll need a python environment with poetry and nox installed. A good guide for setting up an environment and project (that I used for this library) is hypermodern python.

After making any changes you can run nox to make sure testing and linting went ok, and then you should be good to submit a PR.

I’d also welcome contributions to the docs, or anything else that would make this tool better for you or others.

Indices and tables