Datasets manual

Datasets generators

GUDHI provides generators for several customizable datasets that can be used as inputs for GUDHI complexes and data structures.

Points generators

The points module generates random points on a sphere, and points on a torus either at random or on a grid.

Points on sphere

The sphere function generates random i.i.d. points uniformly on a (d-1)-sphere in \(R^d\). The user provides the number of points to generate, n_samples, and the ambient dimension, ambient_dim. The radius of the sphere is optional and defaults to 1. Only the ‘random’ sample type is currently available.

The generated points are given as an array of shape \((n\_samples, ambient\_dim)\).

Example
from gudhi.datasets.generators import points
from gudhi import AlphaComplex

# Generate 50 points on a sphere in R^2
gen_points = points.sphere(n_samples = 50, ambient_dim = 2, radius = 1, sample = "random")

# Create an alpha complex from the generated points
alpha_complex = AlphaComplex(points = gen_points)

gudhi.datasets.generators.points.sphere(n_samples: int, ambient_dim: int, radius: float = 1.0, sample: str = 'random') → numpy.ndarray[numpy.float64]

Generate random i.i.d. points uniformly on a (d-1)-sphere in R^d

Parameters
  • n_samples (integer) – The number of points to be generated.

  • ambient_dim (integer) – The ambient dimension d.

  • radius (float) – The radius. Default value is 1.

  • sample (string) – The sample type. Default and only available value is “random”.

Returns

the generated points on a sphere.
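
To make "uniformly on a (d-1)-sphere" concrete, here is a minimal sketch of one standard way to draw such points: sample a Gaussian vector and rescale it to the requested radius. The helper name sample_sphere and its seed argument are hypothetical, and this is not necessarily how GUDHI implements sphere. It uses only the standard library and returns nested lists, whereas GUDHI returns a NumPy array.

```python
import math
import random


def sample_sphere(n_samples, ambient_dim, radius=1.0, seed=None):
    """Hypothetical helper: uniform points on the (ambient_dim - 1)-sphere.

    Each point is a standard Gaussian vector rescaled to the given radius.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n_samples:
        vec = [rng.gauss(0.0, 1.0) for _ in range(ambient_dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # extremely unlikely, but avoids dividing by zero
        points.append([radius * x / norm for x in vec])
    return points


pts = sample_sphere(n_samples=50, ambient_dim=2, radius=1.0, seed=0)
print(len(pts), len(pts[0]))  # 50 2
```

The result has n_samples rows and ambient_dim columns, matching the documented shape, and every row has norm equal to radius.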

Points on a flat torus

You can also generate points on a flat torus.

Two functions are available and give the same output: the first one, ctorus, depends on CGAL; the second, torus, is written in pure Python.

In addition, two sample types are provided: points on a d-torus in \(R^{2d}\) can be generated either randomly (i.i.d.) or on a grid.

First function: ctorus

The user provides the number of points to generate on the torus, n_samples, and the dimension dim of the torus; the points lie in \(R^{2dim}\). The sample argument is optional and defaults to ‘random’, in which case the returned points form an array of shape \((n\_samples, 2*dim)\). If sample is set to ‘grid’, the points are generated on a grid and returned as an array of shape:

\[( \lfloor n\_samples^{1 \over {dim}} \rfloor^{dim}, 2*dim )\]

Note 1: With the ‘grid’ sample type, the number of returned points (the first entry of the output shape) is n_samples rounded down to the closest perfect \(dim^{th}\) power.

Note 2: This version is recommended when the user wishes to use ‘grid’ as sample type, or ‘random’ with a relatively small number of samples (roughly fewer than 150).

Example
from gudhi.datasets.generators import points

# Generate 50 points randomly on a torus in R^6
gen_points = points.ctorus(n_samples = 50, dim = 3)

# Generate 27 points on a torus as a grid in R^6 (n_samples = 50 is rounded down to 3^3 = 27)
gen_points = points.ctorus(n_samples = 50, dim = 3, sample = 'grid')

gudhi.datasets.generators.points.ctorus(n_samples: int, dim: int, sample: str = 'random') → numpy.ndarray[numpy.float64]

Generate random i.i.d. points on a d-torus in R^2d or as a grid

Parameters
  • n_samples (integer) – The number of points to be generated.

  • dim (integer) – The dimension of the torus on which points are generated, in R^(2*dim).

  • sample (string) – The sample type. Available values are: “random” and “grid”. Default value is “random”.

Returns

the generated points on a torus.

The shape of returned numpy array is:

If sample is ‘random’: (n_samples, 2*dim).

If sample is ‘grid’: (⌊n_samples**(1./dim)⌋**dim, 2*dim), where shape[0] is rounded down to the closest perfect ‘dim’th power.
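
The ‘grid’ shape formula can be checked by hand with a short sketch. The helpers grid_side and grid_shape below are hypothetical names introduced for illustration; they only evaluate the documented formula and do not call GUDHI. Integer corrections are applied around the floating-point root, since an expression such as 64 ** (1 / 3) can evaluate to slightly less than 4.

```python
def grid_side(n_samples, dim):
    """Hypothetical helper: largest integer k such that k**dim <= n_samples."""
    k = int(round(n_samples ** (1.0 / dim)))
    # Correct any floating-point error in the dim-th root.
    while k ** dim > n_samples:
        k -= 1
    while (k + 1) ** dim <= n_samples:
        k += 1
    return k


def grid_shape(n_samples, dim):
    """Shape given by the documented formula for sample='grid'."""
    return (grid_side(n_samples, dim) ** dim, 2 * dim)


print(grid_shape(50, 3))   # (27, 6)
print(grid_shape(150, 2))  # (144, 4)
```

For n_samples = 50 and dim = 3, the grid has 3 points per circle, hence 3^3 = 27 points in R^6, which agrees with the example above.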

Second function: torus

The user provides the number of points to generate on the torus, n_samples, and the dimension dim of the torus; the points lie in \(R^{2dim}\). The sample argument is optional and defaults to ‘random’. The other allowed value is ‘grid’.

Note: This version is recommended when the user wishes to use ‘random’ as sample type with a large number of samples and a low dimension.

Example
from gudhi.datasets.generators import points

# Generate 50 points randomly on a torus in R^6
gen_points = points.torus(n_samples = 50, dim = 3)

# Generate 27 points on a torus as a grid in R^6 (n_samples = 50 is rounded down to 3^3 = 27)
gen_points = points.torus(n_samples = 50, dim = 3, sample = 'grid')

gudhi.datasets.generators.points.torus(n_samples, dim, sample='random')

Generate points on a flat dim-torus in R^2dim either randomly or on a grid

Parameters
  • n_samples – The number of points to be generated.

  • dim – The dimension of the torus on which points are generated, in R^(2*dim).

  • sample – The sample type of the generated points. Can be ‘random’ or ‘grid’.

Returns

numpy array containing the generated points on a torus.

The shape of returned numpy array is:

If sample is ‘random’: (n_samples, 2*dim).

If sample is ‘grid’: (⌊n_samples**(1./dim)⌋**dim, 2*dim), where shape[0] is rounded down to the closest perfect ‘dim’th power.
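
As an illustration of why a dim-torus sits in R^(2*dim), the sketch below maps each of the dim angles to a point (cos, sin) on a unit circle and concatenates the results. The helper names are hypothetical, the ‘random’ variant assumes independent uniform angles, and the code is a standard-library sketch rather than GUDHI's implementation; point ordering and the array type may differ from what torus or ctorus returns.

```python
import itertools
import math
import random


def embed_angles(angles):
    """Map (t_1, ..., t_d) to (cos t_1, sin t_1, ..., cos t_d, sin t_d)."""
    coords = []
    for t in angles:
        coords.extend((math.cos(t), math.sin(t)))
    return coords


def random_torus(n_samples, dim, seed=None):
    """Hypothetical helper: points with independent uniform angles."""
    rng = random.Random(seed)
    return [
        embed_angles([rng.uniform(0.0, 2.0 * math.pi) for _ in range(dim)])
        for _ in range(n_samples)
    ]


def grid_torus(side, dim):
    """Hypothetical helper: side**dim points on a regular grid of angles."""
    step = 2.0 * math.pi / side
    angles_1d = [i * step for i in range(side)]
    return [embed_angles(a) for a in itertools.product(angles_1d, repeat=dim)]


random_pts = random_torus(n_samples=50, dim=3, seed=0)
grid_pts = grid_torus(side=3, dim=3)
print(len(random_pts), len(random_pts[0]))  # 50 6
print(len(grid_pts), len(grid_pts[0]))      # 27 6
```

Each consecutive pair of coordinates lies on a unit circle, so every point has 2*dim coordinates, and a grid with 3 angles per circle in dimension 3 yields 27 points.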

Fetching datasets

We provide some ready-to-use datasets that are not included with GUDHI by default and need to be fetched explicitly.

By default, the fetched datasets directory is set to a folder named ‘gudhi_data’ in the user home folder. Alternatively, it can be set using the ‘GUDHI_DATA’ environment variable.

gudhi.datasets.remote.fetch_bunny(file_path=None, accept_license=False)

Load the Stanford bunny dataset.

This dataset contains 35947 vertices.

Note that if the dataset already exists in the target location, it is not downloaded again, and the corresponding array is returned from cache.

Parameters
  • file_path (string) –

    Full path of the downloaded file including filename.

    Default is None, meaning that it’s set to “data_home/points/bunny/bunny.npy”. In this case, the LICENSE file is downloaded as “data_home/points/bunny/bunny.LICENSE”.

    The “data_home” directory is set by default to “~/gudhi_data”, unless the ‘GUDHI_DATA’ environment variable is set.

  • accept_license (boolean) –

    Flag specifying whether the user accepts the file LICENSE. If True, the corresponding license terms are not printed.

    Default is False.

Returns

points – Array of shape (35947, 3).

Return type

numpy array

3D Stanford bunny with 35947 vertices.

gudhi.datasets.remote.fetch_spiral_2d(file_path=None)

Load the spiral_2d dataset.

Note that if the dataset already exists in the target location, it is not downloaded again, and the corresponding array is returned from cache.

Parameters

  • file_path (string) –

    Full path of the downloaded file including filename.

    Default is None, meaning that it’s set to “data_home/points/spiral_2d/spiral_2d.npy”.

    The “data_home” directory is set by default to “~/gudhi_data”, unless the ‘GUDHI_DATA’ environment variable is set.

Returns

points – Array of shape (114562, 2).

Return type

numpy array

2D spiral with 114562 vertices.

gudhi.datasets.remote.clear_data_home(data_home=None)

Delete the data home cache directory and all its content.

Parameters

data_home (string, default is None) – The path to the remote datasets directory. If None and the ‘GUDHI_DATA’ environment variable is not set, the directory to be removed defaults to “~/gudhi_data”.
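
The lookup order for the data home directory described in this section (an explicit argument first, then the ‘GUDHI_DATA’ environment variable, then “~/gudhi_data”) can be sketched as follows. The helpers resolve_data_home and default_bunny_path are hypothetical names used for illustration; they mirror the documented behavior but are not part of the GUDHI API, and they never touch the network or the file system.

```python
import os


def resolve_data_home(data_home=None):
    """Hypothetical helper mirroring the documented lookup order."""
    if data_home is None:
        # Fall back to GUDHI_DATA, then to the gudhi_data folder in the home directory.
        data_home = os.environ.get("GUDHI_DATA", os.path.join("~", "gudhi_data"))
    return os.path.expanduser(data_home)


def default_bunny_path(data_home=None):
    """Documented default location of the bunny file when file_path is None."""
    return os.path.join(resolve_data_home(data_home), "points", "bunny", "bunny.npy")


print(default_bunny_path())
```

With ‘GUDHI_DATA’ unset, the printed path is the points/bunny/bunny.npy file inside the gudhi_data folder of the user's home directory. Setting the variable before the call changes only the leading directory.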