Computer Vision Newsletter #33

New Object Detection SOTA Model; Amazing Midjourney V5.1; OpenAI strikes again

There's a new champion in town for state-of-the-art (SOTA) object detection models - YOLO-NAS 🚀. Midjourney has rolled out another version - V5.1. While it's indeed a marvel, text rendering remains its Achilles' heel 🥲. OpenAI has dropped a new generative model for 3D asset creation 🦾. I've also got a bunch of learning resources and insights to share. Hope you find them as exciting as I do! 🤗 Happy reading!

AUTHOR PICKS


1. Deci AI released YOLO-NAS - an advanced object detection model that leaves other YOLOs in the dust. 😎 The architecture was found with Neural Architecture Search to enhance the detection of small objects, improve localization accuracy, and achieve a higher performance-per-compute ratio - ideal for real-time edge-device applications. YOLO-NAS is available under an open-source license, and its pre-trained weights are available for research use. GitHub repo. (A quick-start sketch follows this list.)

2. How self-supervised learning is changing computer vision → The recently released DINOv2 🦖 by Meta AI demonstrates the power of self-supervised learning in computer vision. DINOv2 extracts meaningful features without any fine-tuning and is competitive with the weakly-supervised models considered state-of-the-art so far. It proves that metadata and labels are not mandatory, while data curation and data quality remain crucial for training top-notch, robust computer vision models. (A feature-extraction sketch follows this list.)

3. Text-to-Video: The Task, Challenges and the Current State → HuggingFace 🤗 dives into the world of text-to-video models, exploring their evolution and the distinctions between text-to-video and text-to-image tasks. They tackle the unique challenges of unconditional and text-conditioned video generation while shedding light on recent advancements, giving us a glimpse into the capabilities of these cutting-edge methods.
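
Hands-on note on YOLO-NAS: once Deci's super-gradients package is installed, inference takes only a few lines. A minimal sketch, assuming the COCO pre-trained "yolo_nas_l" variant from the super-gradients docs; the image path and confidence threshold are placeholders:

```python
# pip install super-gradients
from super_gradients.training import models

# Load the large YOLO-NAS variant with COCO pre-trained weights
# (smaller variants: "yolo_nas_s" and "yolo_nas_m")
model = models.get("yolo_nas_l", pretrained_weights="coco")

# Run inference on an image (placeholder path) and display the detections
model.predict("street_scene.jpg", conf=0.35).show()
```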
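
Likewise for DINOv2: the pre-trained backbones are published on Torch Hub, so extracting self-supervised features is a short exercise. A minimal sketch, assuming the ViT-S/14 variant; the random tensor stands in for a real, normalized image batch:

```python
import torch

# Load the ViT-S/14 DINOv2 backbone from Meta AI's Torch Hub repo
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Dummy batch of one RGB image; sides must be multiples of the 14-px patch size
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    features = model(x)  # global (CLS-token) embedding, shape (1, 384) for ViT-S/14

# The embeddings are usable as-is for retrieval, clustering, or a linear probe
print(features.shape)
```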

NEWSy bits


Midjourney 5 (left) vs 5.1 (right)

1. Midjourney V5.1 → The new version of the engine is "more opinionated", bringing it closer to the kind of results you would get with Midjourney V4, but at a higher quality. Other claimed improvements include greater accuracy, fewer unwanted borders or text artifacts in images, and improved sharpness. Still can't render text though…

2. AI Leaders Meet With the Biden Administration → A high-profile meeting recently took place at the White House 🇺🇸, featuring the CEOs of major AI players including Google, OpenAI, Microsoft, and Anthropic. Biden told them, "I hope you can educate us as to what you think is most needed to protect society."

3. Google "We Have No Moat, And Neither Does OpenAI" → A supposed leaked document from Google talking about the impact of open source models, basically saying open source will outcompete both, Google and Open AI, in the long run. 🫣 What do you think? Reply in the comments.

LEARNING & INSIGHTS


Source: fast.ai

1. Discover how to build the remarkable Stable Diffusion algorithm from the ground up in a new course by Jeremy Howard, co-founder of fast.ai. Built in collaboration with experts from Stability.ai and Hugging Face (the minds behind the Diffusers library), the course promises a comprehensive understanding of the most cutting-edge techniques.

2. What Are Image Embeddings for Computer Vision Data Curation? → Modern deep learning algorithms may free us from explicit feature engineering, but that doesn't guarantee amazing results from any random pile of data. With embeddings, developers can curate relevant, diverse, and representative datasets to train robust CV models that generalize well across various scenarios. (A curation sketch follows this list.)

3. Segment Anything: A Foundation Model for Image Segmentation → Segment Anything is a project by Meta to build a starting point for foundation models for image segmentation. This article dives into the most essential components of the project, including the dataset and the model. (A prompt-based inference sketch follows this list.)

4. The Little Book of Deep Learning → a concise introduction to deep learning techniques, covering foundational concepts, model components, and various applications. It addresses the challenges of training deep neural network architectures and presents solutions to overcome them.
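
To make the embeddings idea from item 2 concrete: a common curation trick is to embed every image with an off-the-shelf CLIP model and flag near-duplicates by cosine similarity. A minimal sketch - the model choice, file paths, and the 0.95 threshold are illustrative, not prescriptions from the article:

```python
# pip install sentence-transformers pillow
from itertools import combinations

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf CLIP image encoder (illustrative model choice)
model = SentenceTransformer("clip-ViT-B-32")

paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # placeholder files
embeddings = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

# Flag near-duplicate pairs so redundant images can be dropped from the dataset
for i, j in combinations(range(len(paths)), 2):
    sim = util.cos_sim(embeddings[i], embeddings[j]).item()
    if sim > 0.95:  # illustrative threshold
        print(f"near-duplicates: {paths[i]} ~ {paths[j]} (cos={sim:.3f})")
```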
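
And for item 3, prompting SAM with a single foreground point looks roughly like the sketch below, following the API of Meta's segment-anything repo; the checkpoint path, image, and point coordinates are placeholders:

```python
# pip install git+https://github.com/facebookresearch/segment-anything.git
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H checkpoint (downloaded separately from the repo; placeholder path)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point (x, y); label 1 means "foreground"
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return three candidate masks for an ambiguous prompt
)
print(masks.shape, scores)  # (3, H, W) boolean masks and their confidence scores
```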

RESEARCH SPOTLIGHT 


Text-conditional meshes generated by Shap·E

1. OpenAI strikes again with their latest creation, Shap·E, an innovative conditional generative model for 3D assets. It stands out from other 3D generative models because it can directly generate the parameters of implicit functions from a single text prompt, rendering both textured meshes and neural radiance fields (NeRFs). OpenAI is sharing the goods, offering model weights, inference code, and samples on GitHub. (A text-to-3D sketch follows this list.)

2. What Do Self-Supervised Vision Transformers Learn? → compares the representations and performance of contrastive learning (CL) and masked image modeling (MIM) in self-supervised Vision Transformers (ViTs). CL trains self-attention to capture longer-range global patterns and shape information in the later layers, while MIM focuses mainly on the early layers and texture information. The two methods can complement each other to improve representation diversity and downstream task performance.

3. Personalize Segment Anything Model with One Shot → proposes PerSAM, a personalized approach for the Segment Anything Model (SAM) that requires no training data. It localizes and segments target concepts within images or videos via target-guided attention, target-semantic prompting, and cascaded post-refinement. The paper also presents an efficient one-shot fine-tuning variant, PerSAM-F, which trains only 2 parameters within 10 seconds for improved performance.
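
For the curious, OpenAI's shap-e repo ships sample notebooks for text-to-3D generation; the sketch below follows the flow of the text-to-3D example. Treat it as an outline of the workflow rather than a verified recipe - the prompt and output filename are placeholders:

```python
# pip install git+https://github.com/openai/shap-e.git
import torch
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.diffusion.sample import sample_latents
from shap_e.models.download import load_config, load_model
from shap_e.util.notebooks import decode_latent_mesh

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

xm = load_model("transmitter", device=device)  # decodes latents into 3D outputs
model = load_model("text300M", device=device)  # text-conditional latent model
diffusion = diffusion_from_config(load_config("diffusion"))

# Sample one latent conditioned on a text prompt (placeholder prompt)
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=["a red sports car"]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Decode the latent into a textured mesh and save it as a PLY file
mesh = decode_latent_mesh(xm, latents[0]).tri_mesh()
with open("car.ply", "wb") as f:
    mesh.write_ply(f)
```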

DEVELOPER’s CORNER

1. YOLO-NAS Starter Notebook → a Kaggle notebook to get you going with the new YOLO-NAS object detection model.

2. DeepDoctection → a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. (A minimal sketch follows this list.)

3. The-incredible-pytorch → a curated list of tutorials, projects, libraries, videos, papers, books, and anything related to PyTorch.
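
To give DeepDoctection a spin, its README's default analyzer covers the common case of parsing a PDF. A minimal sketch following that README - the file path is a placeholder, and the default pipeline downloads several model weights on first run:

```python
# pip install deepdoctection
import deepdoctection as dd

# Build the default analysis pipeline (layout detection, OCR, table parsing)
analyzer = dd.get_dd_analyzer()

# Stream pages from a PDF (placeholder path); analysis runs page by page
df = analyzer.analyze(path="report.pdf")
df.reset_state()  # required before iterating

for page in df:
    print(page.text)  # extracted text in reading order
    # page.viz()      # returns an image with the detected layout boxes drawn
```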

Drop me a line if you have any feedback or questions.

Sending you good vibes,

Dasha 🫶 
