Land cover classification at the Mississippi Delta¶

In this notebook, you will use a k-means unsupervised clustering algorithm to group pixels by similar spectral signatures. k-means is an exploratory method for finding patterns in data. Because it is unsupervised, you don’t need any training data for the model. You also can’t measure how well it “performs” because the clusters will not correspond to any particular land cover class. However, we expect at least some of the clusters to be identifiable as different types of land cover.
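
Before working with the satellite data, it may help to see k-means in isolation. The short sketch below (not part of this notebook's workflow) clusters synthetic two-band "pixels"; the reflectance-like values and the number of clusters are invented purely for illustration.

```python
# Minimal k-means illustration on synthetic "pixels" with two bands.
# The band values and cluster count are made up for demonstration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two artificial spectral groups (e.g. "water-like" and "vegetation-like")
water_like = rng.normal(loc=[0.05, 0.02], scale=0.01, size=(100, 2))
veg_like = rng.normal(loc=[0.04, 0.40], scale=0.03, size=(100, 2))
pixels = np.vstack([water_like, veg_like])

# Fit k-means with 2 clusters; labels are arbitrary integers, not land cover classes
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(pixels)
print(labels[:5], labels[-5:])
```

The cluster labels themselves carry no meaning; interpreting them (water, vegetation, etc.) is up to the analyst, which is exactly the situation we will be in with the Delta imagery.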

You will use the Harmonized Landsat and Sentinel-2 (HLS) multispectral dataset. You can access the data with an Earthdata account and the earthaccess library from NSIDC:

STEP 1: SET UP¶

BLOCK 0
Import all libraries

In [2]:
import os
import pickle
import re
import warnings
import json 

import cartopy.crs as ccrs
import earthaccess
import earthpy as et
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import rioxarray.merge as rxrmerge
from tqdm.notebook import tqdm
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans

os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

warnings.simplefilter('ignore')

Below you can find code for a caching decorator which you can use in your code. To use the decorator:

@cached(key, override)
def do_something(*args, **kwargs):
    ...
    return item_to_cache

This decorator will pickle the results of running the do_something() function, and only run the code if the results don’t already exist. To override the caching, for example temporarily after making changes to your code, set override=True. Note that to use the caching decorator, you must write your own function to perform each task!

Notes: If the cached file (filename) exists and override is False, the function loads the results from the pickle file. If the file doesn’t exist or override=True, the function runs normally, and the result is saved in a pickle file. This avoids redundant computations, especially for expensive functions.

In Python, do_something() is simply a generic function name that is commonly used as an example in documentation and tutorials to represent a function that performs some specific task. It is not a built-in Python function, but a placeholder for any function you define. For example, you could define do_something() to perform a mathematical operation, process data, access an API, read a file, etc.

BLOCK 1

In [3]:
def cached(func_key, override=False):
    """
    A decorator to cache function results
    
    Parameters
    ==========
    func_key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function): 
        """
        Wrap the caching function
        
        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs): 
            """
            Perform a computation and cache, or load cached result.
            
            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            # Add an identifier from the particular function call 
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key  
            path = os.path.join(
                et.io.HOME, et.io.DATA_NAME, 'jars', f'{key}.pickle') 
                       
            # Check if the cache exists already or override caching
            if override or not os.path.exists(path):
                # Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)
                
                # Run the compute function as the user did 
                result = compute_function(*args, **kwargs)
                
                # Pickle the object 
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Unpickle the object 
                with open(path, 'rb') as file:
                    result = pickle.load(file)
                    
            return result 
        
        return compute_and_cache 
    return compute_and_cache_decorator

This code defines a Python cache decorator cached(), which stores the results of a function in a pickle file and reuses them in future executions to avoid unnecessary recomputations.

What does this code do?

- Saves the results of a function in a .pickle file, avoiding repeated computations.
- Uses a key identifier (func_key) to generate the name of the file where the cache will be saved.
- If override=True, reruns the function and updates the cache.
- Stores the files in a specific directory (et.io.HOME/et.io.DATA_NAME/jars/).
- If the cache exists, it loads the results from the file instead of recalculating.
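
As a hypothetical usage sketch (the function name and cache key below are placeholders, not part of the notebook), the decorator could be applied like this once earthpy's data directory exists:

```python
# Hypothetical example of the cached() decorator defined above.
@cached('example_square')
def compute_square(n, cache_key=None):
    print(f'Computing {n} squared...')  # printed only on a cache miss
    return n ** 2

# First call runs the function and saves example_square_n4.pickle
print(compute_square(4, cache_key='n4'))
# Second call loads the pickled result instead of recomputing
print(compute_square(4, cache_key='n4'))

# To force recomputation, decorate with override=True instead:
# @cached('example_square', override=True)
```

Note that override is set at decoration time, and any cache_key keyword is passed through to your function, so your function signature must accept it.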

STEP 2: STUDY SITE¶

For this analysis, you will use a watershed from the Watershed Boundary Dataset (WBD), HU12 watersheds (WBDHU12.shp).

Try It
  1. Download the Watershed Boundary Dataset for region 8 (Mississippi)
  2. Select watershed 080902030506
  3. Generate a site map of the watershed

Try to use the caching decorator

We chose this watershed because it covers parts of New Orleans and is near the Mississippi Delta. Deltas are boundary areas between the land and the ocean, and as a result tend to contain a rich variety of different land cover and land use types.

BLOCK 2

In [4]:
@cached('wbd_08')                                                    
def read_wbd_file(wbd_filename, huc_level, cache_key): 
    # Download and unzip
    wbd_url = (
        "https://prd-tnm.s3.amazonaws.com"
        "/StagedProducts/Hydrography/WBD/HU2/Shape/"
        f"{wbd_filename}.zip")
    wbd_dir = et.data.get_data(url=wbd_url)                          
                  
    # Read desired data
    wbd_path = os.path.join(wbd_dir, 'Shape', f'WBDHU{huc_level}.shp') 
    wbd_gdf = gpd.read_file(wbd_path, engine='pyogrio')               
    return wbd_gdf                                                     

huc_level = 12                                                         
wbd_gdf = read_wbd_file(
    "WBD_08_HU2_Shape", huc_level, cache_key=f'hu{huc_level}')         

delta_gdf = (
    wbd_gdf[wbd_gdf[f'huc{huc_level}']                                 
    .isin(['080902030506'])]
    .dissolve()                                                        
)

(
    delta_gdf.to_crs(ccrs.Mercator())                                 
    .hvplot(                                                           
        alpha=.2, fill_color='white',                                  
        tiles='EsriImagery', crs=ccrs.Mercator())
    .opts(width=600, height=300)                                      
)
Out[4]:

THE MISSISSIPPI DELTA¶

The Mississippi River Delta is where the Mississippi River meets the Gulf of Mexico in southeastern Louisiana. Covering about 3 million acres, it is one of the largest coastal wetland areas in the U.S., containing 37% of the nation's estuarine marshes. As the 7th largest river delta in the world, it drains 41% of the contiguous U.S. into the Gulf at an average rate of 470,000 cubic feet per second.

The Mississippi Delta has multiple definitions, including a political one encompassing counties in several states and a natural one referring to the alluvial valley formed by sediment deposition after the Ice Age. The modern delta, built over the last 5,000 years, consists of multiple deltaic lobes, with the most recent being the "bird’s foot" delta near New Orleans. Ecologically, the Delta is a crucial habitat supporting wetlands, hardwood forests, migratory birds, and diverse aquatic species. Human influence on the Delta has ranged from early Indigenous agricultural practices to massive engineering projects like levees and diversions that have reshaped the landscape. While the Delta remains a major agricultural and industrial hub, environmental concerns such as water pollution and wetland loss have led to conservation and cleanup efforts.

Using data from the Mississippi Delta, we will learn how to apply clustering techniques to delta geospatial data, including:

  1. an introduction to clustering,
  2. how to prepare datasets for analysis, and
  3. an implementation of the k-means algorithm (related techniques such as hierarchical clustering offer additional ways to evaluate these groupings).

Reference:
Mississippi River Delta
Natural Environment: The Delta and Its Resources
Mississippi Delta from space

LAND COVER CLASSIFICATION AT THE MISSISSIPPI DELTA¶

According to USGS Open-File Report 2009-1280, delta land covers include water areas (rivers, lakes, estuaries, and the ocean), as well as developed areas, where urban infrastructure, roads, and human facilities are located. There are also mechanically disturbed lands, corresponding to areas affected by deforestation or construction activities, as well as areas dedicated to mining, where subsoil materials are extracted.

Barren areas are composed of arid lands with less than 10% vegetation cover, while forests include areas with more than 10% tree density. There are also extensions of grasslands and scrublands, characterized by the presence of scattered grasses and shrubs. Agriculture is one of the predominant land covers in the region, with land used for crops, pastures, orchards and vineyards.

Wetlands represent a significant part of the landscape, with highly water-saturated soils and specific vegetation. In addition, there are non-mechanically disturbed lands, which have been affected by natural phenomena such as fires or floods. Finally, although to a lesser extent, there are areas covered by ice and snow, such as glaciers or areas with permanent snow accumulations.

Throughout the study period (1973-2000), the most notable change was the conversion of approximately 4,368 km² of wetlands into water bodies.

Reference: Land-Cover Change in the Lower Mississippi Valley, 1973-2000

STEP 3: MULTISPECTRAL DATA¶

Search for data¶

Try It
  1. Log in to the earthaccess service using your Earthdata credentials:

       earthaccess.login(persist=True)

  2. Modify the following sample code to search for granules of the HLSL30 product overlapping the watershed boundary from May to October 2023 (there should be 76 granules):

       results = earthaccess.search_data(
           short_name="...",
           cloud_hosted=True,
           bounding_box=tuple(gdf.total_bounds),
           temporal=("...", "..."),
       )

BLOCK 3

In [5]:
import earthaccess

# log in earthaccess
earthaccess.login(persist=True)

# Ensure that delta_gdf is in geographic coordinates (WGS 84)
if delta_gdf.crs != "EPSG:4326":
    delta_gdf = delta_gdf.to_crs("EPSG:4326")

# define search parameters
short_name = "HLSL30"  
bounding_box = tuple(delta_gdf.total_bounds)  
temporal_range = ("2023-05-01", "2023-10-31")  

# Execute granules search 
results = earthaccess.search_data(
    short_name=short_name,
    cloud_hosted=True,
    bounding_box=bounding_box,
    temporal=temporal_range,
)

# Show the quantity of granules found
print(f"Number of granules found: {len(results)}")

print("Bounding Box:", delta_gdf.total_bounds)
Number of granules found: 88
Bounding Box: [-89.97046834  29.68190731 -89.78679539  29.82339776]

Compile information about each granule¶

I recommend building a GeoDataFrame, as this will allow you to plot the granules you are downloading and make sure they line up with your shapefile. You could also use a DataFrame, dictionary, or a custom object to store this information.

Try It
  1. For each search result:
    1. Get the following information (HINT: look at the [‘umm’] values for each search result):
      • granule id (UR)
      • datetime
      • geometry (HINT: check out the shapely.geometry.Polygon class to convert points to a Polygon)
    2. Open the granule files. I recommend opening one granule at a time, e.g. with earthaccess.open([result]).
    3. For each file (band), get the following information:
      • file handler returned from earthaccess.open()
      • tile id
      • band number
  2. Compile all the information you collected into a GeoDataFrame

BLOCK 4

Compile information about each scene (granule)
Each scene represents a data acquisition over a specific area at a given time. In this case, for each granule (scene), we are storing:

1. Scene Metadata

- Acquisition date (datetime)
- Granule ID (granule_id)
- Spatial location (tile_id)
- Download URLs (url)
- Scene coverage (geometry - bounding box or footprint)

2. Spectral Data

- Spectral bands (B02, B03, B04, etc.)
- Cloud mask (Fmask)
- Applied scale factor (e.g., * 0.0001 for reflectance)
- Scene cropped to watershed boundary (.rio.clip())
- Quality mask applied (.where(cloud_mask == 1))

How this function works:

- Extracts granule metadata:
  - Granule ID (GranuleUR)
  - Acquisition date (BeginningDateTime)
  - Spatial extent (polygon geometry)
- Retrieves download links for each granule using earthaccess.open()
- Uses regex to extract the tile ID and band info from filenames
- Stores all the information in a GeoDataFrame for easy analysis and plotting

In [6]:
def get_earthaccess_links(results):                                       
    url_re = re.compile(                                                  
        r'\.(?P<tile_id>\w+)\.\d+T\d+\.v\d\.\d\.(?P<band>[A-Za-z0-9]+)\.tif'
    )

    # Loop through each granule
    link_rows = []                                                       
    for granule in tqdm(results):
        # Get granule information
        info_dict = granule['umm']                                        
        granule_id = info_dict['GranuleUR']                               
        datetime = pd.to_datetime(                                        
            info_dict['TemporalExtent']['RangeDateTime']['BeginningDateTime']
        )

        # Extract spatial geometry (bounding polygon)
        try:
            points = (
                info_dict['SpatialExtent']['HorizontalSpatialDomain']['Geometry']['GPolygons'][0]['Boundary']['Points']   
            )
            geometry = Polygon([(point['Longitude'], point['Latitude']) for point in points])
        except KeyError:
            print(f" Warning: No geometry found for granule {granule_id}")    
            continue                                                          

        # Get file URLs
        files = earthaccess.open([granule])                                   

        # Build metadata DataFrame
        for file in files:
            match = url_re.search(file.full_name)
            if match is not None:
                link_rows.append(
                    dict(
                        datetime=datetime,
                        granule_id=granule_id,                            
                        tile_id=match.group('tile_id'),
                        band=match.group('band'),
                        url=str(file),                                    
                        geometry=geometry
                    )
                )

    # Convert to GeoDataFrame
    if link_rows:                                                         
        file_df = gpd.GeoDataFrame(link_rows, crs="EPSG:4326")
        return file_df
    else:
        print("No valid granules found.")
        return None

# Use the function with your search results
granules_gdf = get_earthaccess_links(results)                            

# Check results
unique_granules = granules_gdf['granule_id'].nunique()
print(f"Unique granules processed: {unique_granules}")
Unique granules processed: 88
In [7]:
# Show the granule footprints on an interactive map
granules_gdf.hvplot(geo=True, alpha=0.3, fill_color="red", line_color="black", tiles="EsriImagery")
Out[7]:
In [8]:
# Show bounding box coordinates
print("Bounding Box:", delta_gdf.total_bounds)
Bounding Box: [-89.97046834  29.68190731 -89.78679539  29.82339776]
In [9]:
filtered_granules = granules_gdf.sjoin(delta_gdf, predicate="intersects")
print(f"Granules intersecting the watershed: {filtered_granules['granule_id'].nunique()}")
Granules intersecting the watershed: 88
In [10]:
granules_gdf = granules_gdf.drop_duplicates(subset=['granule_id'])
print(f"Unique granules after removing duplicates: {granules_gdf['granule_id'].nunique()}")
Unique granules after removing duplicates: 88
In [11]:
granules_gdf = granules_gdf.sort_values(by="datetime").drop_duplicates(subset="granule_id", keep="last")
print(f"Granules after keeping latest version: {granules_gdf['granule_id'].nunique()}")
Granules after keeping latest version: 88
In [12]:
# Final debugging
# Step 1: Strict spatial filtering (fully inside watershed)
filtered_granules = granules_gdf.sjoin(delta_gdf, predicate="within")
print(f"Granules fully within the watershed: {filtered_granules['granule_id'].nunique()}")

# Step 2: Remove duplicate versions, keeping the latest
granules_gdf = granules_gdf.sort_values(by="datetime").drop_duplicates(subset="granule_id", keep="last")
print(f"Granules after keeping only the latest version: {granules_gdf['granule_id'].nunique()}")

# Step 3: Check unique dates
print("Unique dates of granules:", granules_gdf["datetime"].unique())
Granules fully within the watershed: 0
Granules after keeping only the latest version: 88
Unique dates of granules: <DatetimeArray>
['2023-05-04 16:31:32.101000+00:00', '2023-05-12 16:31:44.329000+00:00',
 '2023-05-20 16:31:23.029000+00:00', '2023-05-28 16:31:34.837000+00:00',
 '2023-06-05 16:31:28.153000+00:00', '2023-06-13 16:31:26.844000+00:00',
 '2023-06-21 16:31:34.135000+00:00', '2023-06-29 16:31:32.405000+00:00',
 '2023-07-07 16:31:46.664000+00:00', '2023-07-15 16:31:42.442000+00:00',
 '2023-07-23 16:31:50.873000+00:00', '2023-07-31 16:31:47.828000+00:00',
 '2023-08-08 16:31:57.564000+00:00', '2023-08-16 16:31:52.083000+00:00',
 '2023-08-24 16:32:05.861000+00:00', '2023-09-01 16:32:03.072000+00:00',
 '2023-09-09 16:32:06.260000+00:00', '2023-09-17 16:32:11.040000+00:00',
 '2023-09-25 16:32:12.175000+00:00', '2023-10-03 16:32:11.664000+00:00',
 '2023-10-19 16:32:21.176000+00:00', '2023-10-27 16:32:18.830000+00:00']
Length: 22, dtype: datetime64[ns, UTC]

Expected outcome: some granules may only partially overlap the watershed, there may be duplicate versions, or NASA may have updated the dataset with additional images. That is likely why we have 88 granules instead of the 76 the tutorial indicates.¶

Open, crop, and mask data¶

This will be the most resource-intensive step. I recommend caching your results using the cached decorator or by writing your own caching code. I also recommend testing this step with one or two dates before running the full computation.

This code should include at least one function including a numpy-style docstring. A good place to start would be a function for opening a single masked raster, applying the appropriate scale parameter, and cropping.

Try It
  1. For each granule:
    1. Open the Fmask band, crop, and compute a quality mask for the granule. You can use the following code as a starting point, making sure that mask_bits contains the quality bits you want to consider (a small worked example of the bit unpacking appears after this list):

       ```python
       # Expand into a new dimension of binary bits
       bits = (
           np.unpackbits(da.astype(np.uint8), bitorder='little')
           .reshape(da.shape + (-1,))
       )

       # Select the required bits and check if any are flagged
       mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
       ```

    2. For each band that starts with ‘B’:

      1. Open the band, crop, and apply the scale factor
      2. Name the DataArray after the band using the .name attribute
      3. Apply the cloud mask using the .where() method
      4. Store the DataArray in your data structure (e.g. adding a GeoDataFrame column with the DataArray in it. Note that you will need to remove the rows for unused bands)
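
To make the bit logic in the starting-point code above more concrete, here is a small worked example with a made-up Fmask value. The specific value and the choice of mask_bits = [1, 2, 3] are illustrative; they match the bits used later in this notebook, and you should confirm against the HLS Fmask documentation which bits matter for your analysis.

```python
import numpy as np

# A made-up Fmask pixel value for illustration: binary 01000010.
# Bit 1 (one of this notebook's mask_bits) is set, so the pixel should be masked.
fmask_value = np.array([[0b01000010]], dtype=np.uint8)

# Expand each uint8 value into its 8 bits along a new trailing axis
bits = np.unpackbits(fmask_value, bitorder='little').reshape(fmask_value.shape + (-1,))
print(bits[0, 0])   # [0 1 0 0 0 0 1 0] -> bits 1 and 6 are set

# Keep the pixel only if none of the selected quality bits are flagged
mask_bits = [1, 2, 3]
mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
print(mask)         # [[0]] -> masked out because bit 1 is set
```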

BLOCK 5

This block processes each granule in a structured way: it crops the Fmask band, computes a cloud mask, and applies that mask to each spectral band.

In [13]:
@cached('delta_reflectance_da_df')                        
def compute_reflectance_da(search_results, boundary_gdf): 
    """
    Connect to files over VSI, crop, cloud mask, and wrangle
    
    Returns a DataFrame with one row per spectral band per granule,
    including the cropped, cloud-masked reflectance DataArray
    in the 'da' column.

    Parameters
    ==========
    search_results : list
        earthaccess granule search results
    boundary_gdf : gpd.GeoDataFrame
        Boundary used to crop the data
    """
        
    def open_dataarray(url, boundary_gdf, scale=1, masked=True):
        # Open masked DataArray
        da = rxr.open_rasterio(url, masked=masked).squeeze() * scale

        # Reproject the boundary to the raster CRS and crop to its bounds
        boundary_proj_gdf = boundary_gdf.to_crs(da.rio.crs)
        return da.rio.clip_box(*boundary_proj_gdf.total_bounds)
        
    def compute_quality_mask(da, mask_bits=[1, 2, 3]):     
        """Mask out low quality data by bit, that means pixel using bit flags."""
        # Unpack bits into a new axis 
        bits = (                                         
            np.unpackbits(
                da.astype(np.uint8), bitorder='little'
            ).reshape(da.shape + (-1,))
        )

        # Select the required bits and check if any are flagged 
        mask = np.prod(bits[..., mask_bits]==0, axis=-1)
        return mask

    file_df = get_earthaccess_links(search_results)                
    granule_da_rows= []                                            
   

    # Loop through each image 
    group_iter = file_df.groupby(['datetime', 'tile_id'])
    for (datetime, tile_id), granule_df in tqdm(group_iter):
        print(f'Processing granule {tile_id} {datetime}')
              
        # Open granule cloud cover
        cloud_mask_url = (
            granule_df.loc[granule_df.band=='Fmask', 'url']
            .values[0])
        cloud_mask_cropped_da = open_dataarray(cloud_mask_url, boundary_gdf, masked=False)

        # Compute cloud mask
        cloud_mask = compute_quality_mask(cloud_mask_cropped_da)    

        #Filter only spectral bands
        band_df = granule_df[granule_df.band.str.startswith('B')]

        # Loop through each spectral band
        for _, row in band_df.iterrows():
            band_cropped = open_dataarray(row.url, boundary_gdf, scale=0.0001)
            band_cropped.name = row.band
            row['da'] = band_cropped.where(cloud_mask)             
            granule_da_rows.append(row.to_frame().T)                
    
    # Reassemble the metadata DataFrame
    return pd.concat(granule_da_rows, ignore_index=True)
reflectance_da_df = compute_reflectance_da(results, delta_gdf)     
                                                                   

BLOCK 6

The previous function processes the remote sensing reflectance data; here we visualize one of the resulting cropped, cloud-masked bands.

In [14]:
# visualize a processed band
reflectance_da_df.iloc[0]['da'].hvplot.image(cmap='viridis')
Out[14]:

Merge and Composite Data¶

You will notice for this watershed that:

  1. The raster data for each date are spread across 4 granules
  2. Any given image is incomplete because of clouds

Try It
  1. For each band:

    1. For each date:

      1. Merge all 4 granules
      2. Mask any negative values created by interpolating from the nodata value of -9999 (rioxarray should account for this, but doesn’t appear to when merging. If you leave these values in they will create problems down the line)
    2. Concatenate the merged DataArrays along a new date dimension

    3. Take the mean in the date dimension to create a composite image that fills cloud gaps

    4. Add the band as a dimension, and give the DataArray a name

  2. Concatenate along the band dimension

BLOCK 7

In [15]:
@cached('delta_reflectance_da')
def merge_and_composite_arrays(granule_da_df):
    """
    Efficiently merges and composites satellite image granules across bands and dates.
    """
    da_list = []

    for band, band_df in tqdm(granule_da_df.groupby('band')):
        # Merge granules per date and mask negatives
        merged_das = [
            rxrmerge.merge_arrays(list(date_df.da)).where(lambda x: x > 0)
            for _, date_df in band_df.groupby('datetime')
        ]
        
        # Composite across dates using the median
        composite_da = xr.concat(merged_das, dim='datetime').median(dim='datetime')

        # Assign band metadata
        composite_da = composite_da.assign_coords(band=int(band[1:])).expand_dims('band')
        composite_da.name = 'reflectance'

        da_list.append(composite_da)

    # Concatenate all bands into a final dataset
    return xr.concat(da_list, dim='band')

reflectance_da = merge_and_composite_arrays(reflectance_da_df)
#reflectance_da

STEP 4: K-MEANS¶

Cluster your data by spectral signature using the k-means algorithm.

Try It
  1. Convert your DataArray into a tidy DataFrame of reflectance values (hint: check out the .to_dataframe() and .unstack() methods)
  2. Filter out all rows with no data (all 0s or any N/A values)
  3. Fit a k-means model. You can experiment with the number of groups to find what works best.

BLOCK 8

In [16]:
# Convert DataArray to a tidy DataFrame
model_df = reflectance_da.to_dataframe().reset_index()

# Unstack to create a feature matrix (pixels as rows, bands as columns)
model_df = model_df.pivot(index=['y', 'x'], columns='band', values='reflectance')

# Remove rows with all 0s or NaN values
model_df = model_df[(model_df > 0).any(axis=1)].dropna()

# Fit K-Means model
n_clusters = 4  # You can adjust this number
kmeans = KMeans(n_clusters=n_clusters, random_state=42)

# **Apply K-Means clustering and store results**
model_df['clusters'] = kmeans.fit_predict(model_df)

# Show sample of the clustered data
print(model_df.head())
band                              1        2        3       4        5  \
y            x                                                           
3.287163e+06 793408.062907  0.09700  0.11950  0.16060  0.1753  0.26410   
             793438.062907  0.07180  0.08560  0.12880  0.1328  0.29880   
             793468.062907  0.04585  0.05455  0.09295  0.0868  0.37255   
             793498.062907  0.03710  0.04560  0.08760  0.0718  0.38380   
             793528.062907  0.02330  0.02880  0.06110  0.0380  0.37750   

band                              6        7        9      10       11  \
y            x                                                           
3.287163e+06 793408.062907  0.30230  0.22730  0.00090  0.2905  0.24750   
             793438.062907  0.26040  0.17350  0.00100  0.2860  0.24440   
             793468.062907  0.27145  0.15045  0.00085  0.2778  0.24675   
             793498.062907  0.24620  0.13380  0.00090  0.2695  0.23650   
             793528.062907  0.15420  0.06160  0.00120  0.2636  0.23240   

band                        clusters  
y            x                        
3.287163e+06 793408.062907         3  
             793438.062907         3  
             793468.062907         3  
             793498.062907         3  
             793528.062907         3  

STEP 5: PLOT¶

Try It

Create a plot that shows the k-means clusters next to an RGB image of the area. You may need to brighten your RGB image by multiplying it by 10. The code for reshaping and plotting the clusters is provided for you below, but you will have to create the RGB plot yourself!

So, what is .sortby(['x', 'y']) doing for us? Try the code without it and find out.

BLOCK 9

In [17]:
# Select the R, G, B bands and convert to uint8
rgb = reflectance_da.sel(band=[4, 3, 2])
rgb_uint8 = (rgb * 255).astype(np.uint8).where(~np.isnan(rgb), 0)  # avoid NaN

# Brighten the image, clipping to the valid 0-255 range
rgb_bright = np.clip(rgb_uint8 * 10, 0, 255)  # avoid extreme saturation

# Convert clusters to xarray in correct order
clusters_xr = model_df.clusters.to_xarray().sortby(['x', 'y'])

# Visualize with `hvplot`
plot = (
    rgb_bright.hvplot.rgb(
        x='x', y='y', bands='band',
        data_aspect=1, xaxis=None, yaxis=None
    ) +
    clusters_xr.hvplot(cmap="tab10", aspect='equal')
)

# graphics
plot
Out[17]:

Unsupervised analysis using K-Means for class clustering¶

Introduction¶

Land cover classification is a fundamental tool in environmental monitoring, as it allows for the assessment of landscape changes and their impact on ecosystems. In this analysis, we used NASA's Harmonized Landsat and Sentinel-2 (HLS) product, which provides harmonized and compatible surface reflectance (SR) data from the Landsat-8 and Sentinel-2 missions.

The HLS product provides satellite imagery with Bottom of Atmosphere (BOA) surface reflectance. This means that the data has already been atmospherically corrected, removing the effects of scattering and atmospheric absorption, which enables better comparison between images from different dates or sensors.

The HLS dataset integrates observations from the Operational Land Imager (OLI) onboard Landsat-8 and the Multispectral Instrument (MSI) from Sentinel-2, generating a time series of images with increased temporal frequency and spectral consistency. HLS data products can be considered the building blocks of a "data cube", allowing users to examine any pixel over time and analyze near-daily surface reflectance time series as if they were derived from a single sensor. This feature enhances data continuity and facilitates multitemporal land cover analysis.

Additionally, the HLS product employs standardized processing methods across all images, including atmospheric correction and the Fmask algorithm for cloud masking. Surface reflectance is corrected to account for the effect of the viewing angle, ensuring that all pixels are normalized to nadir observation. This guarantees greater accuracy in the spectral interpretation of land cover.

For this study, we used images acquired between May 2023 and October 2023, with a spatial resolution of 30 meters, resulting in a total of 88 scenes or granules. The data were compiled and organized by acquisition date over the study area. A scale factor of 0.0001 was applied to all reflectance bands to convert the stored integer values to surface reflectance.

This study applies an unsupervised classification (clustering) approach to identify and map different land cover classes along the Mississippi River, a key ecosystem in North America. The methodology includes image preprocessing, spectral feature extraction, and the implementation of clustering algorithms to generate a detailed land cover map for the study region.

The results obtained can contribute to monitoring environmental changes, managing natural resources, and making informed decisions regarding the conservation of Mississippi’s riparian ecosystems.

Methodology¶

The analysis was conducted using the K-Means clustering algorithm to segment an image into different groups based on similar data characteristics. A total of four clusters (K=4) were selected to minimize variability within each group while maximizing the differences between them.

The methodological process followed these steps:

  1. Data Preprocessing: The image was prepared for analysis by normalizing spectral bands or pixel values to improve clustering accuracy.

  2. Application of the K-Means Algorithm: The algorithm was executed with K=4, assigning each pixel to one of the four clusters based on similarity in feature space.

  3. Results Visualization: A segmented image was generated where each cluster was represented by a distinct color, facilitating the interpretation of homogeneous areas.

  4. Cluster Analysis and Interpretation: The spatial distribution of clusters was evaluated, and possible meanings were assigned based on visual appearance and geographical context.
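
The choice of K = 4 above is empirical. As an optional sketch (not part of the original analysis) of how that choice could be examined, an inertia "elbow" sweep over several candidate values of K might look like the following, assuming model_df still holds the reflectance band columns used for clustering:

```python
# Optional sketch: compare k-means inertia for several cluster counts.
# Assumes model_df contains the reflectance band columns used above.
from sklearn.cluster import KMeans

features = model_df.drop(columns=['clusters'], errors='ignore')

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(features)
    inertias[k] = km.inertia_

# Look for the "elbow" where adding clusters stops reducing inertia much
for k, inertia in inertias.items():
    print(f'K={k}: inertia={inertia:,.1f}')
```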

Conclusion¶

Four data groups were identified, giving a classification with low internal variability and high differentiation between clusters. The K-Means algorithm, while not providing a semantic interpretation of the groups, enabled the identification of patterns in the segmented image.

Visual analysis suggests the following interpretation of the clusters:

Cluster 0 (Blue): Represents well-defined bodies of water.

Cluster 1 (Red): Corresponds to the edges of water bodies, possibly mud or soil with a high-density cover.

Cluster 2 (Pink): Areas with sparse vegetation or bare soil.

Cluster 3 (Cyan): Transition zones between water and land, possibly representing wetlands or saturated soils.

This analysis provides an initial approach to terrain segmentation and can be improved with additional information, such as spectral data or field validation, to enhance the interpretation of each cluster in environmental and geospatial monitoring studies.