Computer Vision Newsletter #31

ConvNets are dead? 🥺 Software²: AI-generating AI 🧠 Text-to-Video is getting good fast

What a week in computer vision! 🤯 Researchers are working hard to create super-versatile foundation models for all kinds of vision tasks. They're shifting away from supervised learning, with its tedious data labeling, and concentrating on curating top-notch, meaningful training data for self-supervised learning. ConvNets? So yesterday! Vision Transformers are taking over. Plus, text-to-video generation is getting crazy good, super fast!

But hey, before we jump in, here's a reminder: Got a computer vision question? Shoot it our way, and our awesome ML team at Superb AI will tackle it for you! I'll share their answers in upcoming issues. So, stay tuned!


1. DINOv2: Learning Robust Visual Features without Supervision


Meta AI has been making waves in computer vision! Last week they open-sourced DINOv2, a cutting-edge Transformer-based computer vision model that leverages self-supervised learning to learn high-performance visual features. These features demonstrate remarkable robustness and versatility across different domains, eliminating the need for fine-tuning and serving as an excellent multipurpose backbone for diverse vision tasks. DINOv2 builds on the accomplishments of its forerunners, DINO and iBOT, by integrating improvements such as data curation, additional regularization techniques, and self-distillation. DINOv2 can learn from any image collection, irrespective of annotations, and pre-trained versions of DINOv2 are available for the community to use. Blog | GitHub | Demo
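The "frozen backbone + linear probe" recipe is the key selling point here: the DINOv2 backbone is never fine-tuned, and only a small linear layer is trained on top of its features. Here's a minimal numpy sketch of that recipe; the random vectors stand in for real DINOv2 embeddings (in practice you'd load the released model, e.g. via torch.hub from the facebookresearch/dinov2 repo, and extract features once per image), and the dimensions and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(images, dim=384):
    # Placeholder for a frozen DINOv2 forward pass
    # (the ViT-S/14 variant outputs 384-d features).
    return rng.normal(size=(len(images), dim))

# Toy "dataset": 100 images, 5 classes.
n, n_classes = 100, 5
feats = extract_features(range(n))            # frozen features, computed once
labels = rng.integers(0, n_classes, size=n)

# Train the linear probe with ridge-regularized least squares
# on one-hot targets; the backbone itself is never updated.
one_hot = np.eye(n_classes)[labels]
lam = 1e-2
W = np.linalg.solve(feats.T @ feats + lam * np.eye(feats.shape[1]),
                    feats.T @ one_hot)

preds = (feats @ W).argmax(axis=1)
train_acc = (preds == labels).mean()
print(f"linear-probe train accuracy: {train_acc:.2f}")
```

Because the features are computed once and cached, probing a new task costs only a least-squares solve rather than a full fine-tuning run.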

2. Software²: “Learning What Data to Learn From”


A growing body of evidence indicates that the generality of large models is considerably constrained by the quality of the training data. Although the training data's quality greatly influences model performance, prevailing training methodologies tend to overlook data quality, prioritizing data quantity instead. This mismatch suggests a potential shift in research trends, emphasizing data collection and generation as a primary avenue for enhancing model performance. This article paints the broad strokes of Software², a quickly emerging, data-centric paradigm for developing self-improving programs based on modern deep learning, and an approach likely to be deeply influential in the design of future software systems. Read the article here.


1. [Insight] Building and Deploying CV Models: Lessons Learned From a Computer Vision Engineer → hard-won insights from designing, building, and deploying cutting-edge CV models across various platforms, plus essential lessons, tried-and-tested techniques, and real-world examples to help you tackle the unique challenges of working as a Computer Vision Engineer.

2. [Tutorial] Image Classification Using Vision Transformer and KerasCV → a great tutorial with code on how to use KerasCV to fine-tune a vision transformer (ViT) on a custom dataset.

3. [Podcast] Drago Anguelov: Waymo and Autonomous Vehicles → the Head of Research at Waymo talks about the development and challenges of self-driving, Waymo's innovations, and what's next for the field.

4. [Course] Self-Supervised Learning and Foundation Models → MIT lectures on state-of-the-art topics and techniques like Stable Diffusion & DALL-E, neural networks, supervised learning, self-supervised & unsupervised learning, and generative AI, as well as applications in both science and business.



Latent Diffusion Models (LDMs) are an efficient approach for generating high-quality images by training diffusion models within a compressed, low-dimensional latent space. This exciting research extends LDMs to create high-resolution videos! The authors pre-train an LDM using only images, then convert it into a video generator by introducing a temporal dimension to the latent-space diffusion model and fine-tuning it on encoded video sequences. Additionally, they temporally align diffusion model upsamplers to create video super-resolution models with consistent frame-to-frame transitions (no more flickering artifacts).
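The "inflation" trick above can be sketched in a few lines: spatial layers from the pretrained image model run per-frame (time folded into the batch axis), and newly inserted temporal layers mix only along the time axis. This is a conceptual numpy sketch of the tensor bookkeeping, not the paper's implementation; real models use attention/conv blocks in latent space, and here tanh and a moving average stand in for them.

```python
import numpy as np

B, T, C, H, W = 2, 8, 4, 16, 16    # batch, frames, latent channels, spatial dims
z = np.random.default_rng(0).normal(size=(B, T, C, H, W))

def spatial_layer(x):
    # Pretrained image-model layer: sees frames independently.
    # Fold time into the batch axis: (B, T, C, H, W) -> (B*T, C, H, W).
    bt = x.reshape(B * T, C, H, W)
    out = np.tanh(bt)               # placeholder for a real spatial block
    return out.reshape(B, T, C, H, W)

def temporal_layer(x):
    # Newly added layer: mixes information only along the time axis (axis=1).
    # A simple moving average stands in for temporal attention.
    pad = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    return 0.25 * pad[:, :-2] + 0.5 * pad[:, 1:-1] + 0.25 * pad[:, 2:]

out = temporal_layer(spatial_layer(z))
print(out.shape)  # (2, 8, 4, 16, 16): same shape, now mixed across frames
```

The design point is that only the temporal layers need training on video, so the expensive image pre-training is reused wholesale.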

Looks like one of the last strongholds of ConvNets is falling. Real-Time DEtection TRansformer (RT-DETR) is the first real-time end-to-end object detector. The researchers designed an efficient hybrid encoder that processes multi-scale features and introduced IoU-aware query selection to optimize the initialization of object queries. Notably, the detector supports flexible adjustments of inference speed via different numbers of decoder layers, without requiring retraining. The source code and pre-trained models will be accessible through PaddleDetection.

This paper proposes a novel transformer architecture, SpectFormer, that combines spectral and multi-headed attention layers for image recognition tasks and shows that the resulting representation yields improved performance over other transformer architectures, achieving a top-1 accuracy of 84.25% on ImageNet.
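To give a feel for what a spectral layer does, here is a minimal numpy sketch of the kind of spectral token mixing SpectFormer pairs with multi-headed attention: transform the token sequence to the frequency domain with an FFT, modulate it with a learnable filter, and transform back. Shapes, the filter initialization, and function names are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 64                        # tokens (14x14 patches), embedding dim
x = rng.normal(size=(N, D))

# "Learnable" complex filter over the one-sided frequency axis.
filt = rng.normal(size=(N // 2 + 1, D)) + 1j * rng.normal(size=(N // 2 + 1, D))

def spectral_layer(x, filt):
    X = np.fft.rfft(x, axis=0)        # real FFT along the token axis
    X = X * filt                      # element-wise spectral gating
    return np.fft.irfft(X, n=x.shape[0], axis=0)  # back to token space

y = spectral_layer(x, filt)
print(y.shape)  # (196, 64): same shape as the input tokens
```

Unlike attention's quadratic cost in the number of tokens, this mixing runs in O(N log N), which is one motivation for using spectral layers in the early blocks.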



From AI Readiness Report

If you like Ground Truth, share it with a computer vision friend! If you hate it, share it with an enemy. 😉

Have a great week!

Over and out,


