extract

geowombat.extract(data, aoi, bands=None, time_names=None, band_names=None, frac=1.0, min_frac_area=None, all_touched=False, id_column='id', time_format='%Y%m%d', mask=None, n_jobs=8, verbose=0, n_workers=1, n_threads=-1, use_client=False, address=None, total_memory=24, processes=False, pool_kwargs=None, **kwargs)

Extracts data within areas or at points of interest. The projections of data and aoi do not need to match, as differences are handled on the fly.

Parameters:
  • data (DataArray) – The xarray.DataArray to extract data from.

  • aoi (str or GeoDataFrame) – A file or geopandas.GeoDataFrame containing the features (areas or points) at which to extract data.

  • bands (Optional[int or 1d array-like]) – A band or list of bands to extract. If not given, all bands are used. Bands should be GDAL-indexed (i.e., the first band is 1, not 0).

  • band_names (Optional[list]) – A list of band names. Length should be the same as bands.

  • time_names (Optional[list]) – A list of time names.

  • frac (Optional[float]) – A fractional subset of points to extract in each polygon feature.

  • min_frac_area (Optional[int | float]) – The minimum polygon area at which frac is applied. For polygons below this area, all samples within the polygon are used.

  • all_touched (Optional[bool]) – The all_touched argument is passed to rasterio.features.rasterize().

  • id_column (Optional[str]) – The id column name.

  • time_format (Optional[str]) – The datetime conversion format if time_names are datetime objects.

  • mask (Optional[GeoDataFrame or Shapely Polygon]) – A geopandas.GeoDataFrame or shapely.geometry.Polygon mask to subset to.

  • n_jobs (Optional[int]) – The number of features to rasterize in parallel.

  • verbose (Optional[int]) – The verbosity level.

  • n_workers (Optional[int]) – The number of process workers. Only applies when use_client = True.

  • n_threads (Optional[int]) – The number of thread workers. Only applies when use_client = True.

  • use_client (Optional[bool]) – Whether to use a dask client.

  • address (Optional[str]) – A cluster address to pass to client. Only used when use_client = True.

  • total_memory (Optional[int]) – The total memory (in GB) required when use_client = True.

  • processes (Optional[bool]) – Whether to use process workers with the dask.distributed client. Only applies when use_client = True.

  • pool_kwargs (Optional[dict]) – Keyword arguments passed to multiprocessing.Pool().imap().

  • kwargs (Optional[dict]) – Keyword arguments passed to dask.compute().

Return type:

GeoDataFrame

Returns:

geopandas.GeoDataFrame

Examples

>>> import geowombat as gw
>>>
>>> with gw.open('image.tif') as src:
>>>     df = gw.extract(src, 'poly.gpkg')
>>>
>>> # Use a local dask cluster
>>> with gw.open('image.tif') as src:
>>>     df = gw.extract(src, 'poly.gpkg', use_client=True, n_threads=16)
>>>
>>> # Specify the client address with a local cluster
>>> from dask.distributed import LocalCluster
>>>
>>> with LocalCluster(
>>>     n_workers=1,
>>>     threads_per_worker=8,
>>>     scheduler_port=0,
>>>     processes=False,
>>>     memory_limit='4GB'
>>> ) as cluster:
>>>
>>>     with gw.open('image.tif') as src:
>>>         df = gw.extract(
>>>             src,
>>>             'poly.gpkg',
>>>             use_client=True,
>>>             address=cluster
>>>         )
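
Two conventions from the parameter list are easy to get wrong: bands uses GDAL-style numbering (the first band is 1, not 0), and time_format is a strftime pattern with the default '%Y%m%d'. The sketch below illustrates both using only the standard library. The helpers gdal_to_zero_based and format_time_names are hypothetical names introduced for illustration; they are not part of geowombat and do not show how extract is implemented internally.

```python
from datetime import datetime


def gdal_to_zero_based(bands):
    """Hypothetical helper (not a geowombat function): convert GDAL-style,
    1-based band numbers to 0-based positions for array indexing."""
    if isinstance(bands, int):
        bands = [bands]
    if any(b < 1 for b in bands):
        raise ValueError('GDAL band numbers start at 1')
    return [b - 1 for b in bands]


def format_time_names(time_names, time_format='%Y%m%d'):
    """Hypothetical helper: format datetime objects as strings with a
    strftime pattern; non-datetime names are converted with str()."""
    return [
        t.strftime(time_format) if isinstance(t, datetime) else str(t)
        for t in time_names
    ]


# Bands 1 and 3 in GDAL numbering are positions 0 and 2 in an array
band_positions = gdal_to_zero_based([1, 3])

# Datetime objects rendered with the documented default format
time_labels = format_time_names([datetime(2020, 1, 15), datetime(2020, 7, 4)])

print(band_positions)  # [0, 2]
print(time_labels)     # ['20200115', '20200704']
```

In practice this means that bands=[1, 3] requests the first and third bands, and that a datetime such as 15 January 2020 would be labelled '20200115' under the default time_format.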