Improving representations

Some possibilities


  1. Update the encoder/projector networks during fine-tuning, rather than keeping them frozen as we do now. This will be computationally very expensive.
  2. Add a cosine encoding of time (done; a sketch of one common variant appears after this list).
  3. Add 4 SAR bands, i.e. VH/VV polarisation in ascending/descending orbits, to the 10 spectral bands in each S2 image d-pixel (a layout sketch appears after this list). SAR seems very promising, based on recent work [1][2].
  4. Add TanDEM-X interferometry when this data becomes available.
  5. Integrate Landsat.
  6. Do better (class-balanced) sampling of the ROI when creating augmentations, rather than the random sampling we do now (a sketch appears after this list).
  7. Sentinel-2 orbits overlap, with more overlap nearer the poles. If we focus on the overlap areas of orbits, learning may improve, since we have more observational data there. We need to distinguish genuine orbital overlap from the artificial (north-south) overlap introduced by tiling: tiles are 100x100 km with some extra km added at the upper/lower borders, and atmospheric correction is applied tile-wise.
  8. It helps to incorporate topography and climate variables during the creation of task-specific representations.
  9. Compare pixel-, patch-, and object-wise approaches for representation generation in the context of small fields.
  10. Different parts of the world have different levels of cloudiness. Ideally, we would sample more rows in a d-pixel for less cloudy areas of the world, but doing so would make the representations incomparable across tiles, since the encoder for each part of the world would have different-length inputs. So we need to choose the same input length across the world, which requires a compromise between the best and worst cloud conditions. Some options to tackle this:
    1. Replace ResNet in the encoder network with a vision transformer, which might allow variable-length inputs. Instead of a vision transformer, it may be better to use a transformer that works directly on vector time series rather than on a 2D image (done; a masking sketch appears after this list).
    2. We could remap representations to some common global standard.
    3. We could stratify by average cloud cover and train a different foundation model for each stratum (not clear).
    4. We could train a separate model per ecoregion (there are about 250 of them)
  11. Change the BT (Barlow Twins) loss function. For example, instead of only learning to make the cross-correlation matrix close to the identity matrix, one could simultaneously generate (from the representations, via another shallow net) the full spectral-temporal signature of the sample. This would be similar to a masked autoencoder, and this reconstruction loss could be added to the existing loss function (a sketch appears after this list).
  12. We could potentially use Matryoshka representations: https://arxiv.org/abs/2205.13147. With this approach, a separate loss term is applied at each reduced representation length, which pushes the most important features to the start of the representation and allows embeddings to be truncated later (a sketch appears after this list).
  13. Sample at GEDI footprint locations, rather than at random pixels, so that GEDI products can be leveraged.
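
Item 2 is marked done; for reference, a minimal sketch of one common cosine/sine time encoding. The function name and the period are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def encode_doy(day_of_year: np.ndarray, period: float = 365.25) -> np.ndarray:
    """Map day-of-year onto the unit circle so that Dec 31 and Jan 1
    end up adjacent (no discontinuity at the year boundary)."""
    angle = 2.0 * np.pi * day_of_year / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

# Example: two extra channels that could be appended to each row of a d-pixel.
doys = np.array([1, 91, 182, 274, 365])
print(encode_doy(doys))  # shape (5, 2), values in [-1, 1]
```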
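For item 3, a sketch of how the d-pixel might grow from 10 to 14 columns; the array shapes and the assumption that SAR observations are resampled to the S2 observation dates are illustrative:

```python
import numpy as np

# Hypothetical d-pixel layout: rows are observation dates, columns are bands.
s2 = np.random.rand(24, 10)   # 10 Sentinel-2 spectral bands
sar = np.random.rand(24, 4)   # VH/VV x ascending/descending, resampled to S2 dates
d_pixel = np.concatenate([s2, sar], axis=1)   # (24, 14) per pixel
```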
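For item 6, a minimal sketch of inverse-frequency pixel sampling; `class_map` (a per-pixel class raster over the ROI) and all names here are illustrative assumptions:

```python
import numpy as np

def balanced_sample(class_map: np.ndarray, n: int, rng=None):
    """Pick n pixel indices from a 2D class map with probability inversely
    proportional to class frequency, so rare classes in the ROI are not
    swamped by the dominant ones."""
    rng = rng or np.random.default_rng()
    flat = class_map.ravel()
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]              # rare classes get large weights
    idx = rng.choice(flat.size, size=n, replace=False, p=weights / weights.sum())
    return np.unravel_index(idx, class_map.shape)

rows, cols = balanced_sample(np.random.randint(0, 5, (100, 100)), n=50)
```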
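Item 10, option 1 is done; for reference, a minimal PyTorch sketch of the general idea, where a key-padding mask lets one transformer handle time series of different lengths. The module name, dimensions, and mean-pooling are illustrative assumptions, not the actual encoder:

```python
import torch
import torch.nn as nn

class TimeSeriesEncoder(nn.Module):
    """Transformer over per-pixel spectral time series. Padding positions
    are masked out, so sequences of different lengths (e.g. cloudier vs
    clearer regions) can in principle share one model."""
    def __init__(self, n_bands: int = 10, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(n_bands, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: (batch, max_len, n_bands); lengths: (batch,) true sequence lengths
        pad_mask = torch.arange(x.size(1), device=x.device)[None, :] >= lengths[:, None]
        h = self.encoder(self.proj(x), src_key_padding_mask=pad_mask)
        h = h.masked_fill(pad_mask[..., None], 0.0)
        return h.sum(dim=1) / lengths[:, None]   # mean-pool over valid steps only

enc = TimeSeriesEncoder()
x = torch.randn(2, 24, 10)                   # two d-pixels, up to 24 time steps
z = enc(x, lengths=torch.tensor([24, 15]))   # second sequence has 9 padded steps
print(z.shape)                               # torch.Size([2, 128])
```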
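For item 11, a minimal sketch combining the standard Barlow Twins objective with an MAE-style reconstruction term; the shallow decoder, its dimensions, and the weight `beta` are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def barlow_twins_loss(z1, z2, lam: float = 5e-3):
    """Drive the cross-correlation matrix of the two views' embeddings
    towards the identity (diagonal -> 1, off-diagonal -> 0)."""
    n = z1.size(0)
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

# Hypothetical shallow decoder that regenerates the full spectral-temporal
# signature (24 dates x 10 bands here) from the representation, MAE-style.
decoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 24 * 10))

def combined_loss(z1, z2, target, beta: float = 1.0):
    recon = decoder(z1).view(target.shape)   # predicted signature
    return barlow_twins_loss(z1, z2) + beta * F.mse_loss(recon, target)
```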
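For item 12, a minimal sketch of the Matryoshka idea: apply a base loss (e.g. the Barlow Twins loss above) to nested prefixes of the embeddings. The prefix lengths and weights are illustrative assumptions:

```python
def matryoshka_loss(z1, z2, base_loss, dims=(16, 32, 64, 128), weights=None):
    """Apply base_loss to nested prefixes of the embeddings, so the most
    informative features are pushed to the front; embeddings can then be
    truncated to any length in dims (cf. arXiv:2205.13147)."""
    weights = weights or [1.0] * len(dims)
    return sum(w * base_loss(z1[:, :d], z2[:, :d]) for d, w in zip(dims, weights))

# e.g. loss = matryoshka_loss(z1, z2, barlow_twins_loss)
```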