Hi Christopher,

thanks for the feedback :) even if it's a bit harsh, haha, probably due to your disappointment. Let's put things in perspective:

Performance:

- 4% better than UNets on average ((57.58 - 55.29) / 55.29 ≈ 4.1%), and also better in the average rank.

- 7.7% better than the second-best foundation model in the benchmark ((59.57 - 55.29) / 55.29; see the short calculation after this list)

- It's worth noting that the FM encoders are generally frozen, while the UNet encoder is trainable. That setting was chosen to investigate the generalizability of the FM encoders in PANGAEA.
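For concreteness, here is the arithmetic behind the two percentages above, plus a minimal sketch of what "frozen encoder" means in practice. The scores are the averages quoted above, and the freeze_encoder helper is a generic PyTorch-style illustration, not the actual PANGAEA evaluation code:

```python
import torch.nn as nn

# Relative improvement of TerraMind's average score over a baseline,
# using the PANGAEA averages quoted above.
def relative_improvement(score: float, baseline: float) -> float:
    return (score - baseline) / baseline * 100

print(f"vs. UNets:          {relative_improvement(57.58, 55.29):.1f}%")  # ~4.1%
print(f"vs. second-best FM: {relative_improvement(59.57, 55.29):.1f}%")  # ~7.7%

# "Frozen encoder" in practice: only the decoder / task head receives gradients,
# while the foundation-model backbone stays fixed.
def freeze_encoder(encoder: nn.Module) -> nn.Module:
    for p in encoder.parameters():
        p.requires_grad = False
    return encoder.eval()  # also fixes dropout / batch-norm behaviour
```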

Model size and pretraining:

- We use base and large as backbone sizes, just as in ViTs. If you use a single modality you end up with 86M parameters for the ViT-B backbone, so less than 100M including the patch embedding (rough count after this list). Same story for large. So it's not heavier than other models, and UNets are not a lot smaller either.

- Understanding how scaling data and compute in pretraining can help to get better models out is exciting, isn't it? You are right to refer to the language domain, where this works particularly well, and I do see some scaling behaviour when comparing other models with TerraMind in your figures. However, I agree that we should not generally expect the same scaling behaviours people see in language.
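If it helps, here is a back-of-the-envelope count showing where the ~86M for a ViT-B backbone comes from and why adding the patch embedding stays below 100M. The ViT-B configuration (hidden size 768, depth 12, MLP ratio 4) is the standard one; the patch size of 16 and the 12 input bands are just illustrative assumptions here:

```python
# Rough parameter count for a ViT-B backbone (hidden size 768, depth 12,
# MLP ratio 4). Patch size and band count below are illustrative assumptions.

def vit_backbone_params(d: int = 768, depth: int = 12, mlp_ratio: int = 4) -> int:
    attn = 4 * d * d + 4 * d                            # QKV + output projection
    mlp = 2 * mlp_ratio * d * d + (mlp_ratio + 1) * d   # two linear layers
    norms = 2 * 2 * d                                   # two LayerNorms per block
    return depth * (attn + mlp + norms)

def patch_embed_params(d: int = 768, patch: int = 16, bands: int = 12) -> int:
    return patch * patch * bands * d + d                # linear patch projection

backbone = vit_backbone_params()         # ~85M (≈86M with pos. embeddings etc.)
total = backbone + patch_embed_params()  # ~87M, i.e. well below 100M
print(f"backbone ~{backbone / 1e6:.1f}M, with patch embedding ~{total / 1e6:.1f}M")
```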

Capabilities on top:

- UNets cannot generate synthetic data if they are built for a downstream task.

- UNets cannot do Thinking-in-Modalities, which can give an additional improvement (rough sketch of the idea after this list).
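In case it helps to picture what Thinking-in-Modalities does: roughly, the model first generates an intermediate modality from the input (its "thought") and then makes the downstream prediction from the raw input plus that generated view. The function names below are purely illustrative pseudocode, not TerraMind's actual API:

```python
# Illustrative sketch of the Thinking-in-Modalities idea (names are made up,
# not TerraMind's real API): generate an intermediate modality, then predict
# from the raw input plus that generated "thought".

def predict_with_tim(model, task_head, s2_image, intermediate_modality="land cover"):
    # 1. "Think": synthesize the scene in another modality from the input.
    generated = model.generate(s2_image, target=intermediate_modality)

    # 2. Encode the raw input together with the generated modality.
    features = model.encode([s2_image, generated])

    # 3. Run the usual downstream task head on the enriched features.
    return task_head(features)
```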

Overall:

You get a cool new model for free that is not bigger than other models, is better than many in a lot of settings, and has some cool new capabilities.

Are you still disappointed? Then we should talk in more detail. :) I would really like to understand your perspective.

Johannes
