TL;DR: After ~10²¹ FLOPs of pretraining on 500 B patches, IBM’s TerraMind beats a supervised U‑Net by just +2 mIoU on PANGAEA while losing on 5 of 9 tasks; most other GFMs do worse.
Hi Christopher,
Thanks for the feedback :) even if it's a bit harsh, haha, probably due to your disappointment. Let's put things in perspective:
Performance
- 4% better than UNets on average ((57.58 - 55.29) / 55.29) and also better in the avg rank.
- 7.5% better than the second-best foundation model in the benchmark ((59.57 - 55.29) / 55.29)
- It's worth noting that the FM encoders are generally frozen, while the UNet encoder is trainable. That setting was chosen to investigate the generalizability of the FM encoders in PANGAEA.
Model size and pretraining
- We use base and large as backbone sizes, just as in ViTs. If you use a single modality you end up with 86M parameters for the ViT-B backbone, so with the patch embedding it's less than 100M (rough count sketched below). Same story for large. So it's not heavier than other models, and UNets are not a lot smaller either.
- Understanding how scaling data and compute in pretraining can help produce better models is exciting, isn't it? You are right to refer to the language domain, where this works particularly well, and I do see some scaling behaviour when comparing other models with TerraMind in your figures. However, I agree that we should not generally expect the same scaling behaviour people see in language.
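For what it's worth, here is a quick back-of-the-envelope count for a ViT-B-style backbone (the standard 12 layers at width 768; the 16×16 patch size and 12 input bands are my assumptions for the single-modality case, not exact TerraMind numbers):

```python
# Rough ViT-B parameter count: each transformer block carries ~12 * d^2 params
# (4*d^2 for the QKV + output projections, 8*d^2 for the 4x-expanded MLP).
d, layers = 768, 12
backbone = layers * 12 * d ** 2        # ~85M, in line with the ~86M quoted above
patch_embed = 16 * 16 * 12 * d         # assumed 16x16 patches over 12 spectral bands
total = backbone + patch_embed
print(f"backbone ≈ {backbone/1e6:.1f}M, patch embed ≈ {patch_embed/1e6:.2f}M, total ≈ {total/1e6:.1f}M")
```

Layer norms, biases, and positional embeddings add a bit more, but the total still lands comfortably under 100M per modality.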
Capabilities on top:
- UNets cannot generate synthetic data if they are constructed for a downstream task
- UNets cannot do Thinking-in-Modalities (TiM), which can give an additional improvement
Overall:
You get a cool new model for free that is not bigger than other models, but better than many in a lot of settings and has some cool new capabilities.
Are you still disappointed? Then we should talk in more detail. :) I would really like to understand your perspective.
Johannes
Hi Johannes,
Thanks for the clarifications! I would love to chat; please do reach out at chris@demeterlabs.io.
Compute vs. performance:
If we treat 500 B masked‑patch tokens as the true D and apply the C ≈ 6ND approximation, TerraMind clocks in at ≈ 1 × 10²¹ FLOPs. That’s ~10⁵× the compute of DOFA/CROMA. A +7.5 mIoU mean bump is real, but still a very shallow return curve given five orders of magnitude of scaling.
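As a back-of-the-envelope check (my numbers, not official ones): with D = 500 B tokens and an assumed encoder of roughly 330 M parameters, i.e. roughly ViT-L scale, the 6ND rule lands right around 10²¹ FLOPs:

```python
# Back-of-the-envelope pretraining compute via the C ≈ 6*N*D rule of thumb.
N = 3.3e8        # assumed parameter count (~330M, roughly ViT-L scale; my guess)
D = 500e9        # masked-patch tokens seen during pretraining
C = 6 * N * D
print(f"estimated pretraining compute ≈ {C:.1e} FLOPs")   # ≈ 1e21
```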
The model also still loses to the U-Net on 5/9 tasks, which is information practitioners will want to know and shouldn't be in the supplementary material IMO.
Frozen encoders: Fair point, freezing highlights generality; I've added an edit to the blog. But practitioners usually fine‑tune encoders because they care about every last percent. It would be great to publish the “fully unfrozen” TerraMind numbers so the community can see the ceiling.
Model size: I think there is some confusion here. A large model is fine if it is trained on a lot of data, and the U-Nets in the original paper clock in at ~15M parameters, I believe. The point is that, given the model size plus the amount of data, one would perhaps expect more of an mIoU bump, which suggests something sub-optimal about the pre-training objective.
Synthetic data & TiM: Happy to test‑drive it. If a TiM‑fine‑tuned TerraMind pushes > 60 mIoU on PANGAEA, that’s compelling evidence the extra capabilities pay rent!
Finally: there is a typo in the TerraMind paper's table that I've posted in the blog. Under AI4Farms, the TerraMind v1-B score is bolded despite being lower than the UNet baseline score, which is also bolded.
Looking forward to chatting!
Best,
Chris
Hi Chris,
Sounds great, I've sent you a message. :) I also understand your concern better now, thanks for clarifying!
I agree with you that it makes a lot of sense to focus on the temporal dimension next, and that computer vision and EO are more difficult to scale than language (which I believe is a common understanding in both communities). However, several things that you suggest above are already part of TerraMind's objective, especially the design of the pretraining task. So, some things I would like to further clarify:
Tokens:
- TerraMind has 8 image-like modalities, several of which are fairly similar (like S2L2A and S2L1C), so the total token count works out to only ~60B tokens per modality. The reason we introduce the modalities is not to inflate the token count and scale through the roof, but exactly for what you suggested about the pretraining objective in your blog post. See below:
Objective:
TerraMind already does a lot of things you suggest:
- TerraMind does cross-modal sensor prediction as a pretraining task between S2L2A, S2L1C, S1GRD, S1RTC, ...
- It does not use a pixel-wise reconstruction or contrastive loss, for the reasons you point out above; instead, it's a patch-wise classification in latent space.
- TerraMind already aims to address the distribution shift between pretraining and downstream applications: instead of training the model for pixel-wise reconstruction and assuming it transfers well in finetuning, we teach it the correlation between sensor data and segmentation maps (the LULC modality), pixel-wise regression (NDVI), and classification tasks (geo-location). It's one of the first attempts I have seen in the community at a pretraining objective that is very much in line with what you suggest. So we aimed for a model that already learns during pretraining about the different types of downstream applications. That means TerraMind is not only about multimodality, but also about fixing the distribution-shift problem between pretraining and finetuning by incorporating relevant types of downstream tasks in pretraining (a rough sketch of the objective follows below).
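To make the patch-wise-classification-in-latent-space idea concrete, here is a minimal sketch of a cross-modal masked-token objective. It assumes each modality has already been tokenized into discrete codebook indices; the codebook size, model width, and module names are illustrative assumptions, not TerraMind's actual implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of cross-modal masked-token prediction with a patch-wise
# classification loss in latent space (illustrative only, not TerraMind's code).
# It assumes each modality (e.g. S2L2A, S1GRD, LULC) has already been mapped to
# discrete codebook indices of shape [batch, num_patches] by its tokenizer.
VOCAB, DIM = 1024, 768                      # assumed codebook size and model width

embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=12, batch_first=True), num_layers=2)
head = nn.Linear(DIM, VOCAB)                # classifies codebook indices, not pixels
mask_token = nn.Parameter(torch.zeros(1, 1, DIM))

def cross_modal_loss(src_tokens, tgt_tokens, mask_ratio=0.5):
    """Predict masked target-modality tokens from full source-modality context."""
    B, P = tgt_tokens.shape
    mask = torch.rand(B, P) < mask_ratio                               # patches to hide
    tgt_emb = torch.where(mask.unsqueeze(-1), mask_token, embed(tgt_tokens))
    x = torch.cat([embed(src_tokens), tgt_emb], dim=1)                 # condition on the source modality
    logits = head(encoder(x))[:, P:, :]                                # logits at target positions
    return nn.functional.cross_entropy(logits[mask], tgt_tokens[mask])

# Toy usage: predict masked S1 tokens from S2 tokens (random data stands in for real scenes).
s2 = torch.randint(0, VOCAB, (2, 64))
s1 = torch.randint(0, VOCAB, (2, 64))
loss = cross_modal_loss(s2, s1)
```

The key design point is that the loss is a classification over latent tokens per patch rather than a pixel-wise reconstruction, and the target modality can be a downstream-like product such as LULC or NDVI rather than another raw sensor.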
Finally, a meta-comment: it's great for us in the model development team to get feedback from the community, as there are always things to do as next steps. Typically, that feedback also includes what you like about the model, instead of focusing only on the single thing you don't, haha.
Really looking forward to chatting with you! I think a lot of our ideas overlap, and it will be good to better understand how we can push the next model version further. :)
Johannes
Great discussion. Thanks for sharing your insights. I am also working on geospatial foundation models, and the question keeps popping up: "If UNet does so well in segmentation tasks, why bother with transformers, pretraining and consuming so much processing power?" I received great insights from your comments and am now re-motivated for my research. I would be very glad to be included if there is any ongoing discussion.
Thank you for reading!
"If UNet does so well in segmentation tasks, why bother with transformers, pertaining and consuming so much processing power?"
This is an excellent question to ask yourself. I would encourage you to first explore what kind of contribution you are hoping to make: who are your target end users, and how does your contribution solve a problem they face? This is a very 'start-up' way of thinking, but I think society could greatly benefit from having more researchers who think like entrepreneurs.
Feel free to reach out if you want to chat more about this.