Land cover classification at the Mississippi Delta¶
In this notebook, you will use a k-means unsupervised clustering algorithm to group pixels by similar spectral signatures. k-means is an exploratory method for finding patterns in data. Because it is unsupervised, you don’t need any training data for the model. You also can’t measure how well it “performs” because the clusters will not correspond to any particular land cover class. However, we expect at least some of the clusters to be identifiable as different types of land cover.
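As a quick, self-contained illustration of what k-means does (separate from the delta workflow below), the sketch here clusters a small synthetic set of two-band "pixels"; the array values and the choice of three clusters are made up for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-band reflectance values for nine pixels (made-up numbers)
pixels = np.array([
    [0.02, 0.05], [0.03, 0.04], [0.02, 0.06],   # dark, water-like
    [0.10, 0.40], [0.12, 0.45], [0.11, 0.42],   # bright NIR, vegetation-like
    [0.25, 0.28], [0.27, 0.30], [0.26, 0.29],   # moderate, soil-like
])

# Group the pixels into three clusters by spectral similarity
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(pixels)
print(labels)                    # cluster id for each pixel, e.g. [0 0 0 2 2 2 1 1 1]
print(kmeans.cluster_centers_)   # mean spectral signature of each cluster
```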
You will use the harmonized Sentinel/Landsat multispectral dataset. You can access the data with an Earthdata account and the earthaccess library from NSIDC.
STEP 1: SET UP¶
BLOCK 0
Import all libraries
import os
import pickle
import re
import warnings
import json
import cartopy.crs as ccrs
import earthaccess
import earthpy as et
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import rioxarray.merge as rxrmerge
from tqdm.notebook import tqdm
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
warnings.simplefilter('ignore')
Below you can find code for a caching decorator which you can use in your code. To use the decorator:
@cached(key, override)
def do_something(*args, **kwargs):
    ...
    return item_to_cache
This decorator will pickle the results of running the do_something() function, and only run the code if the results don't already exist. To override the caching, for example temporarily after making changes to your code, set override=True. Note that to use the caching decorator, you must write your own function to perform each task!
Notes: If the cached file (filename) exists and override is False, the function loads the results from the pickle file. If the file doesn’t exist or override=True, the function runs normally, and the result is saved in a pickle file. This avoids redundant computations, especially for expensive functions.
In Python, do_something() is simply a generic function name that is commonly used as an example in documentation and tutorials to represent a function that performs some specific task. It is not a built-in Python function, but a placeholder for any function you define. For example, you could define do_something() to perform a mathematical operation, process data, access an API, read a file, etc.
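For instance, here is a minimal sketch of how the decorator defined in Block 1 below might be applied; the function name compute_watershed_stats and its contents are hypothetical, not part of the assignment.

```python
# Hypothetical example: cache a slow summary computation with the decorator
@cached('watershed_stats')
def compute_watershed_stats(gdf, cache_key=None):
    # Any expensive computation goes here; the result is pickled under
    # et.io.HOME/et.io.DATA_NAME/jars/ the first time the function runs.
    return gdf.geometry.area.describe()

# First call runs the computation and writes the pickle;
# subsequent calls load the cached result instead.
# stats = compute_watershed_stats(delta_gdf)
```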
BLOCK 1
def cached(func_key, override=False):
    """
    A decorator to cache function results

    Parameters
    ==========
    func_key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function):
        """
        Wrap the caching function

        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs):
            """
            Perform a computation and cache, or load cached result.

            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            # Add an identifier from the particular function call
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key
            path = os.path.join(
                et.io.HOME, et.io.DATA_NAME, 'jars', f'{key}.pickle')

            # Check if the cache exists already or override caching
            if not os.path.exists(path) or override:
                # Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)

                # Run the compute function as the user did
                result = compute_function(*args, **kwargs)

                # Pickle the object
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Unpickle the object
                with open(path, 'rb') as file:
                    result = pickle.load(file)

            return result

        return compute_and_cache

    return compute_and_cache_decorator
This code defines a Python cache decorator cached(), which stores the results of a function in a pickle file and reuses them in future executions to avoid unnecessary recomputations.
What does this code do?
- Saves the results of a function in a .pickle file, avoiding repeated computations.
- Uses a key identifier (func_key) to generate the name of the file where the cache will be saved.
- If override=True, reruns the function and updates the cache.
- Stores the files in a specific directory (et.io.HOME/et.io.DATA_NAME/jars/).
- If the cache exists, loads the results from the file instead of recalculating.
STEP 2: STUDY SITE¶
For this analysis, you will use a watershed from the Watershed Boundary Dataset, HU12 watersheds (WBDHU12.shp).
- Download the Watershed Boundary Dataset for region 8 (Lower Mississippi)
- Select watershed 080902030506
- Generate a site map of the watershed
Try to use the caching decorator
We chose this watershed because it covers parts of New Orleans and is near the Mississippi Delta. Deltas are boundary areas between the land and the ocean, and as a result tend to contain a rich variety of land cover and land use types.
BLOCK 2
@cached('wbd_08')
def read_wbd_file(wbd_filename, huc_level, cache_key):
    # Download and unzip
    wbd_url = (
        "https://prd-tnm.s3.amazonaws.com"
        "/StagedProducts/Hydrography/WBD/HU2/Shape/"
        f"{wbd_filename}.zip")
    wbd_dir = et.data.get_data(url=wbd_url)
    # Read desired data
    wbd_path = os.path.join(wbd_dir, 'Shape', f'WBDHU{huc_level}.shp')
    wbd_gdf = gpd.read_file(wbd_path, engine='pyogrio')
    return wbd_gdf

huc_level = 12
wbd_gdf = read_wbd_file(
    "WBD_08_HU2_Shape", huc_level, cache_key=f'hu{huc_level}')

delta_gdf = (
    wbd_gdf[wbd_gdf[f'huc{huc_level}']
            .isin(['080902030506'])]
    .dissolve()
)

(
    delta_gdf.to_crs(ccrs.Mercator())
    .hvplot(
        alpha=.2, fill_color='white',
        tiles='EsriImagery', crs=ccrs.Mercator())
    .opts(width=600, height=300)
)
THE MISSISSIPPI DELTA¶
The Mississippi River Delta is where the Mississippi River meets the Gulf of Mexico in southeastern Louisiana. Covering about 3 million acres, it is one of the largest coastal wetland areas in the U.S., containing 37% of the nation's estuarine marshes. As the 7th largest river delta in the world, it drains 41% of the contiguous U.S. into the Gulf at an average rate of 470,000 cubic feet per second.
The Mississippi Delta has multiple definitions, including a political one encompassing counties in several states and a natural one referring to the alluvial valley formed by sediment deposition after the Ice Age. The modern delta, built over the last 5,000 years, consists of multiple deltaic lobes, with the most recent being the "bird’s foot" delta near New Orleans. Ecologically, the Delta is a crucial habitat supporting wetlands, hardwood forests, migratory birds, and diverse aquatic species. Human influence on the Delta has ranged from early Indigenous agricultural practices to massive engineering projects like levees and diversions that have reshaped the landscape. While the Delta remains a major agricultural and industrial hub, environmental concerns such as water pollution and wetland loss have led to conservation and cleanup efforts.
Using data from the Mississippi Delta, we will learn how to apply clustering techniques to geospatial data. This exercise will:
- introduce us to the world of clustering;
- show how to prepare datasets for analysis;
- implement the k-means algorithm and look at ways to evaluate how well the resulting clusters describe the landscape.
Reference:
Mississippi River Delta
Natural Environment: The Delta and Its Resources
Mississippi Delta from space
LAND COVER CLASSIFICATION AT THE MISSISSIPPI DELTA¶
According to the USGS Open-File Report 2009-1280, delta land covers include water areas (rivers, lakes, estuaries, and the ocean) as well as developed areas, where urban infrastructure, roads, and other human facilities are located. There are also mechanically disturbed lands, corresponding to areas affected by deforestation or construction activities, and areas dedicated to mining, where subsoil materials are extracted.
Barren areas are composed of arid lands with less than 10% vegetation cover, while forests include areas with more than 10% tree density. There are also extensions of grasslands and scrublands, characterized by the presence of scattered grasses and shrubs. Agriculture is one of the predominant land covers in the region, with land used for crops, pastures, orchards and vineyards.
Wetlands represent a significant part of the landscape, with highly water-saturated soils and specific vegetation. In addition, there are non-mechanically disturbed lands, which have been affected by natural phenomena such as fires or floods. Finally, although to a lesser extent, there are areas covered by ice and snow, such as glaciers or areas with permanent snow accumulations.
Throughout the study period (1973-2000), the most notable changes include the conversion of approximately 4,368 km² of wetlands into open water.
Reference: Land-Cover Change in the Lower Mississippi Valley, 1973-2000
STEP 3: MULTISPECTRAL DATA¶
Search for data¶
- Log in to the earthaccess service using your Earthdata credentials:

```python
earthaccess.login(persist=True)
```

- Modify the following sample code to search for granules of the HLSL30 product overlapping the watershed boundary from May to October 2023 (there should be 76 granules):

```python
results = earthaccess.search_data(
    short_name="...",
    cloud_hosted=True,
    bounding_box=tuple(gdf.total_bounds),
    temporal=("...", "..."),
)
```
BLOCK 3
import earthaccess

# Log in to earthaccess
earthaccess.login(persist=True)

# Ensure that delta_gdf is in geographic coordinates (WGS 84)
if delta_gdf.crs != "EPSG:4326":
    delta_gdf = delta_gdf.to_crs("EPSG:4326")

# Define search parameters
short_name = "HLSL30"
bounding_box = tuple(delta_gdf.total_bounds)
temporal_range = ("2023-05-01", "2023-10-31")

# Execute granule search
results = earthaccess.search_data(
    short_name=short_name,
    cloud_hosted=True,
    bounding_box=bounding_box,
    temporal=temporal_range,
)

# Show the quantity of granules found
print(f"Number of granules found: {len(results)}")
print("Bounding Box:", delta_gdf.total_bounds)
Number of granules found: 88
Bounding Box: [-89.97046834 29.68190731 -89.78679539 29.82339776]
Compile information about each granule¶
I recommend building a GeoDataFrame, as this will allow you to plot the granules you are downloading and make sure they line up with your shapefile. You could also use a DataFrame, dictionary, or a custom object to store this information.
- For each search result:
  - Get the following information (HINT: look at the ['umm'] values for each search result):
    - granule id (UR)
    - datetime
    - geometry (HINT: check out the shapely.geometry.Polygon class to convert points to a Polygon)
  - Open the granule files. I recommend opening one granule at a time, e.g. with earthaccess.open([result]).
  - For each file (band), get the following information:
    - file handler returned from earthaccess.open()
    - tile id
    - band number
- Compile all the information you collected into a GeoDataFrame
BLOCK 4
Compile information about each scene (granule)
Each scene represents a data acquisition over a specific area at a given time. In this case, for each granule (scene), we are storing:
1. Scene Metadata
   - Acquisition date (datetime)
   - Granule ID (granule_id)
   - Spatial location (tile_id)
   - Download URLs (url)
   - Scene coverage (geometry - bounding box or footprint)
2. Spectral Data
   - Spectral bands (B02, B03, B04, etc.)
   - Cloud mask (Fmask)
   - Applied scale factor (e.g., * 0.0001 for reflectance)
   - Scene cropped to watershed boundary (.rio.clip())
   - Quality mask applied (.where(cloud_mask == 1))
How this function works:
- Extracts granule metadata:
  - Granule ID (GranuleUR)
  - Acquisition date (BeginningDateTime)
  - Spatial extent (polygon geometry)
- Retrieves download links for each granule using earthaccess.open()
- Uses regex to extract tile ID and band info from filenames
- Stores all the information in a GeoDataFrame for easy analysis and plotting
def get_earthaccess_links(results):
    url_re = re.compile(
        r'\.(?P<tile_id>\w+)\.\d+T\d+\.v\d\.\d\.(?P<band>[A-Za-z0-9]+)\.tif'
    )

    # Loop through each granule
    link_rows = []
    for granule in tqdm(results):
        # Get granule information
        info_dict = granule['umm']
        granule_id = info_dict['GranuleUR']
        datetime = pd.to_datetime(
            info_dict['TemporalExtent']['RangeDateTime']['BeginningDateTime']
        )

        # Extract spatial geometry (bounding polygon)
        try:
            points = (
                info_dict['SpatialExtent']['HorizontalSpatialDomain']
                ['Geometry']['GPolygons'][0]['Boundary']['Points']
            )
            geometry = Polygon(
                [(point['Longitude'], point['Latitude']) for point in points])
        except KeyError:
            print(f"Warning: No geometry found for granule {granule_id}")
            continue

        # Get file URLs
        files = earthaccess.open([granule])

        # Build metadata rows
        for file in files:
            match = url_re.search(file.full_name)
            if match is not None:
                link_rows.append(
                    dict(
                        datetime=datetime,
                        granule_id=granule_id,
                        tile_id=match.group('tile_id'),
                        band=match.group('band'),
                        url=str(file),
                        geometry=geometry
                    )
                )

    # Convert to GeoDataFrame
    if link_rows:
        file_df = gpd.GeoDataFrame(link_rows, crs="EPSG:4326")
        return file_df
    else:
        print("No valid granules found.")
        return None


# Use the function with your search results
granules_gdf = get_earthaccess_links(results)

# Check results
unique_granules = granules_gdf['granule_id'].nunique()
print(f"Unique granules processed: {unique_granules}")
Unique granules processed: 88
# Show the granule footprints on an interactive map
granules_gdf.hvplot(geo=True, alpha=0.3, fill_color="red", line_color="black", tiles="EsriImagery")
# Show the bounding box coordinates
print("Bounding Box:", delta_gdf.total_bounds)
Bounding Box: [-89.97046834 29.68190731 -89.78679539 29.82339776]
filtered_granules = granules_gdf.sjoin(delta_gdf, predicate="intersects")
print(f"Granules intersecting the watershed: {filtered_granules['granule_id'].nunique()}")
Granules intersecting the watershed: 88
granules_gdf = granules_gdf.drop_duplicates(subset=['granule_id'])
print(f"Unique granules after removing duplicates: {granules_gdf['granule_id'].nunique()}")
Unique granules after removing duplicates: 88
granules_gdf = granules_gdf.sort_values(by="datetime").drop_duplicates(subset="granule_id", keep="last")
print(f"Granules after keeping latest version: {granules_gdf['granule_id'].nunique()}")
Granules after keeping latest version: 88
# Final debugging
# Step 1: Strict spatial filtering (fully inside watershed)
filtered_granules = granules_gdf.sjoin(delta_gdf, predicate="within")
print(f"Granules fully within the watershed: {filtered_granules['granule_id'].nunique()}")
# Step 2: Remove duplicate versions, keeping the latest
granules_gdf = granules_gdf.sort_values(by="datetime").drop_duplicates(subset="granule_id", keep="last")
print(f"Granules after keeping only the latest version: {granules_gdf['granule_id'].nunique()}")
# Step 3: Check unique dates
print("Unique dates of granules:", granules_gdf["datetime"].unique())
Granules fully within the watershed: 0 Granules after keeping only the latest version: 88 Unique dates of granules: <DatetimeArray> ['2023-05-04 16:31:32.101000+00:00', '2023-05-12 16:31:44.329000+00:00', '2023-05-20 16:31:23.029000+00:00', '2023-05-28 16:31:34.837000+00:00', '2023-06-05 16:31:28.153000+00:00', '2023-06-13 16:31:26.844000+00:00', '2023-06-21 16:31:34.135000+00:00', '2023-06-29 16:31:32.405000+00:00', '2023-07-07 16:31:46.664000+00:00', '2023-07-15 16:31:42.442000+00:00', '2023-07-23 16:31:50.873000+00:00', '2023-07-31 16:31:47.828000+00:00', '2023-08-08 16:31:57.564000+00:00', '2023-08-16 16:31:52.083000+00:00', '2023-08-24 16:32:05.861000+00:00', '2023-09-01 16:32:03.072000+00:00', '2023-09-09 16:32:06.260000+00:00', '2023-09-17 16:32:11.040000+00:00', '2023-09-25 16:32:12.175000+00:00', '2023-10-03 16:32:11.664000+00:00', '2023-10-19 16:32:21.176000+00:00', '2023-10-27 16:32:18.830000+00:00'] Length: 22, dtype: datetime64[ns, UTC]
Expected outcome: some granules may only partially overlap the watershed, there may be duplicate versions, or NASA may have added images to the dataset. That is likely why we have 88 granules instead of the 76 the tutorial indicates.¶
Open, crop, and mask data¶
This will be the most resource-intensive step. I recommend caching your results using the cached decorator or by writing your own caching code. I also recommend testing this step with one or two dates before running the full computation.
This code should include at least one function including a numpy-style docstring. A good place to start would be a function for opening a single masked raster, applying the appropriate scale parameter, and cropping.
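As a starting point, here is a minimal sketch of that kind of helper with a numpy-style docstring; the function name open_cropped_band, its default scale, and the use of rio.clip_box are illustrative assumptions, and Block 5 below implements the full workflow.

```python
def open_cropped_band(url, boundary_gdf, scale=0.0001, masked=True):
    """
    Open a single raster band, apply a scale factor, and crop it.

    Parameters
    ----------
    url : str or file-like
        URL or file handler returned by earthaccess.open().
    boundary_gdf : geopandas.GeoDataFrame
        Boundary used to crop the raster; must be in the raster's CRS.
    scale : float, optional
        Multiplicative scale factor (e.g. 0.0001 for HLS reflectance).
    masked : bool, optional
        Whether to mask nodata values when opening the raster.

    Returns
    -------
    xarray.DataArray
        The cropped, scaled band.
    """
    # Open the raster, drop the band dimension, and apply the scale factor
    da = rxr.open_rasterio(url, masked=masked).squeeze() * scale
    # Crop to the bounding box of the boundary
    return da.rio.clip_box(*boundary_gdf.total_bounds)
```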
- For each granule:
  - Open the Fmask band, crop, and compute a quality mask for the granule. You can use the following code as a starting point, making sure that mask_bits contains the quality bits you want to consider:

    ```python
    # Expand into a new dimension of binary bits
    bits = (
        np.unpackbits(da.astype(np.uint8), bitorder='little')
        .reshape(da.shape + (-1,))
    )

    # Select the required bits and check if any are flagged
    mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
    ```

  - For each band that starts with 'B':
    - Open the band, crop, and apply the scale factor
    - Name the DataArray after the band using the .name attribute
    - Apply the cloud mask using the .where() method
    - Store the DataArray in your data structure (e.g. by adding a GeoDataFrame column with the DataArray in it). Note that you will need to remove the rows for unused bands.
BLOCK 5
This is the structured approach: process each granule, crop the Fmask band, create a cloud mask, and apply it to each spectral band.
@cached('delta_reflectance_da_df')
def compute_reflectance_da(search_results, boundary_gdf):
    """
    Connect to files over VSI, crop, cloud mask, and wrangle

    Returns a single reflectance DataFrame
    with all bands as columns and
    centroid coordinates and datetime as the index.

    Parameters
    ==========
    search_results : list
        Search results returned by earthaccess.search_data()
    boundary_gdf : gpd.GeoDataFrame
        Boundary used to crop the data
    """
    boundary_proj_gdf = boundary_gdf.to_crs("EPSG:32614")

    def open_dataarray(url, boundary_proj_gdf, scale=1, masked=True):
        # Open masked DataArray
        da = rxr.open_rasterio(url, masked=masked).squeeze() * scale
        # Crop to watershed boundary
        return da.rio.clip_box(*boundary_proj_gdf.total_bounds)

    def compute_quality_mask(da, mask_bits=[1, 2, 3]):
        """Mask out low-quality pixels using bit flags."""
        # Unpack bits into a new axis
        bits = (
            np.unpackbits(
                da.astype(np.uint8), bitorder='little'
            ).reshape(da.shape + (-1,))
        )
        # Select the required bits and check if any are flagged
        mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
        return mask

    file_df = get_earthaccess_links(search_results)

    granule_da_rows = []
    # Loop through each image
    group_iter = file_df.groupby(['datetime', 'tile_id'])
    for (datetime, tile_id), granule_df in tqdm(group_iter):
        print(f'Processing granule {tile_id} {datetime}')

        # Open granule cloud cover
        cloud_mask_url = (
            granule_df.loc[granule_df.band == 'Fmask', 'url']
            .values[0])
        cloud_mask_cropped_da = open_dataarray(
            cloud_mask_url, boundary_proj_gdf, masked=False)

        # Compute cloud mask
        cloud_mask = compute_quality_mask(cloud_mask_cropped_da)

        # Filter only spectral bands
        band_df = granule_df[granule_df.band.str.startswith('B')]

        # Loop through each spectral band
        for _, row in band_df.iterrows():
            band_cropped = open_dataarray(
                row.url, boundary_proj_gdf, scale=0.0001)
            band_cropped.name = row.band
            row['da'] = band_cropped.where(cloud_mask)
            granule_da_rows.append(row.to_frame().T)

    # Reassemble the metadata DataFrame
    return pd.concat(granule_da_rows, ignore_index=True)

reflectance_da_df = compute_reflectance_da(results, delta_gdf)
BLOCK 6
The goal of the function above is to process remote sensing reflectance data. The cell below visualizes one processed band as a quick check.
# visualize a processed band
reflectance_da_df.iloc[0]['da'].hvplot.image(cmap='viridis')
Merge and Composite Data¶
You will notice for this watershed that:
1. The raster data for each date are spread across 4 granules
2. Any given image is incomplete because of clouds
For each band:
- For each date:
  - Merge all 4 granules
  - Mask any negative values created by interpolating from the nodata value of -9999 (rioxarray should account for this, but doesn't appear to when merging; if you leave these values in, they will create problems down the line)
- Concatenate the merged DataArrays along a new date dimension
- Take the mean in the date dimension to create a composite image that fills cloud gaps
- Add the band as a dimension, and give the DataArray a name

Concatenate along the band dimension.
BLOCK 7
@cached('delta_reflectance_da')
def merge_and_composite_arrays(granule_da_df):
    """
    Efficiently merge and composite satellite image granules across bands and dates.
    """
    da_list = []
    for band, band_df in tqdm(granule_da_df.groupby('band')):
        # Merge granules per date and mask negative values
        merged_das = [
            rxrmerge.merge_arrays(list(date_df.da)).where(lambda x: x > 0)
            for _, date_df in band_df.groupby('datetime')
        ]

        # Composite across dates using the median
        composite_da = xr.concat(merged_das, dim='datetime').median(dim='datetime')

        # Assign band metadata
        composite_da = composite_da.assign_coords(band=int(band[1:])).expand_dims('band')
        composite_da.name = 'reflectance'
        da_list.append(composite_da)

    # Concatenate all bands into a final dataset
    return xr.concat(da_list, dim='band')

reflectance_da = merge_and_composite_arrays(reflectance_da_df)
#reflectance_da
STEP 4: K-MEANS¶
Cluster your data by spectral signature using the k-means algorithm.
- Convert your DataArray into a tidy DataFrame of reflectance values (hint: check out the .to_dataframe() and .unstack() methods; see the sketch below)
- Filter out all rows with no data (all 0s or any N/A values)
- Fit a k-means model. You can experiment with the number of groups to find what works best.
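Block 8 below builds the feature matrix with pivot(); as an alternative, a minimal sketch of the .to_dataframe()/.unstack() route hinted at above might look like this (variable names follow the rest of the notebook; the exact cleanup steps are an assumption):

```python
# Tidy DataFrame: one row per (band, y, x), then move 'band' into columns
model_df = (
    reflectance_da
    .to_dataframe()['reflectance']   # Series indexed by (band, y, x)
    .unstack('band')                 # one column per band
)

# Drop pixels with missing values or no valid reflectance
model_df = model_df.dropna()
model_df = model_df[(model_df > 0).any(axis=1)]
```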
BLOCK 8
# Convert DataArray to a tidy DataFrame
model_df = reflectance_da.to_dataframe().reset_index()
# Unstack to create a feature matrix (pixels as rows, bands as columns)
model_df = model_df.pivot(index=['y', 'x'], columns='band', values='reflectance')
# Remove rows with all 0s or NaN values
model_df = model_df[(model_df > 0).any(axis=1)].dropna()
# Fit K-Means model
n_clusters = 4 # You can adjust this number
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
# Apply K-Means clustering and store results
model_df['clusters'] = kmeans.fit_predict(model_df)
# Show sample of the clustered data
print(model_df.head())
band                               1        2        3       4        5  \
y            x
3.287163e+06 793408.062907   0.09700  0.11950  0.16060  0.1753  0.26410
             793438.062907   0.07180  0.08560  0.12880  0.1328  0.29880
             793468.062907   0.04585  0.05455  0.09295  0.0868  0.37255
             793498.062907   0.03710  0.04560  0.08760  0.0718  0.38380
             793528.062907   0.02330  0.02880  0.06110  0.0380  0.37750

band                               6        7        9      10       11  \
y            x
3.287163e+06 793408.062907   0.30230  0.22730  0.00090  0.2905  0.24750
             793438.062907   0.26040  0.17350  0.00100  0.2860  0.24440
             793468.062907   0.27145  0.15045  0.00085  0.2778  0.24675
             793498.062907   0.24620  0.13380  0.00090  0.2695  0.23650
             793528.062907   0.15420  0.06160  0.00120  0.2636  0.23240

band                        clusters
y            x
3.287163e+06 793408.062907         3
             793438.062907         3
             793468.062907         3
             793498.062907         3
             793528.062907         3
STEP 5: PLOT¶
Create a plot that shows the k-means clusters next to an RGB image of the area. You may need to brighten your RGB image by multiplying it by 10. The code for reshaping and plotting the clusters is provided for you below, but you will have to create the RGB plot yourself!
So, what is .sortby(['x', 'y']) doing for us? Try the code without it and find out.
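If you would rather see the effect directly, here is a small self-contained sketch (toy coordinate values, not the delta data) showing how .sortby() reorders coordinates into monotonic order so plotting functions render the image with a consistent orientation:

```python
import numpy as np
import xarray as xr

# Toy DataArray with y coordinates in descending order (as rasters often are)
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={'y': [20, 10], 'x': [100, 200, 300]},
    dims=['y', 'x'],
)

print(da.y.values)                     # [20 10] -- descending
print(da.sortby(['x', 'y']).y.values)  # [10 20] -- ascending, ready to plot
```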
BLOCK 9
# Select the R, G, B bands and convert to uint8
rgb = reflectance_da.sel(band=[4, 3, 2])
rgb_uint8 = (rgb * 255).astype(np.uint8).where(~np.isnan(rgb), 0)  # avoid NaN

# Brighten the image, clipping to the valid range
rgb_bright = np.clip(rgb_uint8 * 10, 0, 255)  # avoid extreme saturation
# Convert clusters to xarray in correct order
clusters_xr = model_df.clusters.to_xarray().sortby(['x', 'y'])
# Visualize with `hvplot`
plot = (
rgb_bright.hvplot.rgb(
x='x', y='y', bands='band',
data_aspect=1, xaxis=None, yaxis=None
) +
clusters_xr.hvplot(cmap="tab10", aspect='equal')
)
# Display the plot
plot
Unsupervised analysis using K-Means for class clustering¶
Introduction¶
Land cover classification is a fundamental tool in environmental monitoring, as it allows for the assessment of landscape changes and their impact on ecosystems. In this analysis, we used NASA's Harmonized Landsat and Sentinel-2 (HLS) product, which provides harmonized and compatible surface reflectance (SR) data from the Landsat-8 and Sentinel-2 missions.
The HLS product provides satellite imagery with Bottom of Atmosphere (BOA) surface reflectance. This means that the data has already been atmospherically corrected, removing the effects of scattering and atmospheric absorption, which enables better comparison between images from different dates or sensors.
The HLS dataset integrates observations from the Operational Land Imager (OLI) onboard Landsat-8 and the Multispectral Instrument (MSI) from Sentinel-2, generating a time series of images with increased temporal frequency and spectral consistency. HLS data products can be considered the building blocks of a "data cube", allowing users to examine any pixel over time and analyze near-daily surface reflectance time series as if they were derived from a single sensor. This feature enhances data continuity and facilitates multitemporal land cover analysis.
Additionally, the HLS product employs standardized processing methods across all images, including atmospheric correction and the Fmask algorithm for cloud masking. Surface reflectance is corrected to account for the effect of the viewing angle, ensuring that all pixels are normalized to nadir observation. This guarantees greater accuracy in the spectral interpretation of land cover.
For this study, we used images acquired between May 2023 and October 2023, with a spatial resolution of 30 meters, resulting in a total of 88 scenes or granules. The data was compiled and organized according to acquisition date over the study area. A scale factor of 0.0001 was applied to all spectral bands to convert the stored integer values to surface reflectance.
This study applies an unsupervised classification (clustering) approach to identify and map different land cover classes along the Mississippi River, a key ecosystem in North America. The methodology includes image preprocessing, spectral feature extraction, and the implementation of a clustering algorithm to generate a land cover map for the study region.
The results obtained can contribute to monitoring environmental changes, managing natural resources, and making informed decisions regarding the conservation of Mississippi’s riparian ecosystems.
Methodology¶
The analysis was conducted using the K-Means clustering algorithm to segment an image into different groups based on similar data characteristics. A total of four clusters (K=4) were selected to minimize variability within each group while maximizing the differences between them.
The methodological process followed these steps:
Data Preprocessing: The image was prepared for analysis by normalizing spectral bands or pixel values to improve clustering accuracy (a sketch of this step follows the list below).
Application of the K-Means Algorithm: The algorithm was executed with K=4, assigning each pixel to one of the four clusters based on similarity in feature space.
Results Visualization: A segmented image was generated where each cluster was represented by a distinct color, facilitating the interpretation of homogeneous areas.
Cluster Analysis and Interpretation: The spatial distribution of clusters was evaluated, and possible meanings were assigned based on visual appearance and geographical context.
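Block 8 above fits k-means on raw reflectance values with a fixed K=4. The normalization mentioned in the preprocessing step, and a simple way to compare different numbers of clusters, are not shown in the notebook; a minimal sketch of both, assuming the model_df feature matrix from Block 8, could look like this:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize each band so no single band dominates the distance computation
band_cols = [col for col in model_df.columns if col != 'clusters']
scaled = StandardScaler().fit_transform(model_df[band_cols])

# Compare several values of K using inertia (within-cluster sum of squares);
# look for the "elbow" where adding clusters stops helping much
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled)
    print(f"K={k}: inertia={km.inertia_:.1f}")
```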
Conclusion¶
Four data groups were identified, giving a classification with low internal variability and high differentiation between clusters. The K-Means algorithm, while not providing a semantic interpretation of the groups, enabled the identification of patterns in the segmented image.
Visual analysis suggests the following interpretation of the clusters:
Cluster 0 (Blue): Represents well-defined bodies of water.
Cluster 1 (Red): Corresponds to the edges of water bodies, possibly mud or soil with a high-density cover.
Cluster 2 (Pink): Areas with sparse vegetation or bare soil.
Cluster 3 (Cyan): Transition zones between water and land, possibly representing wetlands or saturated soils.
This analysis provides an initial approach to terrain segmentation and can be improved with additional information, such as spectral data or field validation, to enhance the interpretation of each cluster in environmental and geospatial monitoring studies.