Computer Vision Newsletter #37

Apple Vision Pro and Computer Vision; Ten Years of AI in Review; Guide to Diffusion Models; and much more

The unveiling of Apple Vision Pro has been the talk of the week, and I must admit, it's truly exciting to witness consumer tech incorporating computer vision, edging this powerful technology towards mainstream use.

Let’s dive in!

AUTHOR PICKS

On June 5th, 2023, Apple played its usual 'one more thing' card at the annual Worldwide Developers Conference (WWDC), revealing the much-anticipated Apple Vision Pro.

The Vision Pro headset has several cool features that are powered by computer vision:

  • Hand Gesture Recognition: The headset uses advanced computer vision algorithms to accurately recognize and interpret hand gestures. This allows users to interact with the digital environment in an intuitive and natural way, without the need for physical controllers.

  • Human Detection: Another important feature of the Vision Pro is its ability to detect and recognize people in the user's vicinity. This allows the headset to overlay digital content on the real world while maintaining awareness of the people around the user.

  • Device Detection: The headset can also recognize and interact with other Apple devices.

The visionOS SDK is slated for release in June 2023. If you're keen to build visionOS applications using computer vision, there's plenty you can do to get ready. For a deep dive into all things ML and vision from WWDC23, check out the ML & Vision page.


Walk down memory lane and revisit some of the key breakthroughs that got us to where we are today. Whether you are a seasoned AI practitioner or simply interested in the latest developments in the field, this article will provide you with a comprehensive overview of the remarkable progress that led AI to become a household name.

LEARNING & INSIGHTS


1. An Illustrated Guide to Diffusion Models → Why these models are called diffusion models, the two processes that govern these models, and how these models are trained.

2. Exploring SAHI: Slicing Aided Hyper Inference for Small Object Detection → Deep dive into the challenges of small object detection, a few existing approaches, and the SAHI (Slicing Aided Hyper Inference) technique.

3. AI Canon → a curated list of resources a16z has relied on to get smarter about modern AI. They call it the “AI Canon” because these papers, blog posts, courses, and guides have had an outsized impact on the field over the past several years.

4. The Data-centric AI Concepts in Segment Anything → Unpacking the data-centric AI concepts used in Segment Anything, the first foundation model for image segmentation.

5. Understanding Logits, Sigmoid, Softmax, and Cross-Entropy Loss in Deep Learning → Learn what logits are, how the sigmoid and softmax activation functions turn them into probabilities, where each is used in deep learning networks, and how they connect to cross-entropy loss.

6. Building efficient Experimentation Environments for ML Projects → a look at what it takes to make an experimentation environment efficient in machine learning projects.
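The core ideas from item 5 fit in a few lines of NumPy. Here is a minimal sketch (the function names are my own, not from the linked article) showing how raw logits become probabilities and how cross-entropy loss scores them against the true class:

```python
import numpy as np

def sigmoid(z):
    # Maps a single logit to a probability in (0, 1);
    # typical for binary or multi-label outputs.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Maps a logit vector to a probability distribution over classes.
    # Subtracting the max logit keeps the exponentials numerically stable.
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, target_index):
    # Negative log-likelihood of the true class under the predicted
    # distribution: small when the model is confident and correct.
    return -np.log(probs[target_index])

logits = np.array([2.0, 1.0, 0.1])     # raw, unnormalized network outputs
probs = softmax(logits)                # e.g. ~[0.66, 0.24, 0.10]
loss = cross_entropy(probs, target_index=0)
```

Note that deep learning frameworks usually fuse softmax and cross-entropy into one numerically stable operation applied directly to the logits, which is why loss functions there expect raw logits rather than probabilities.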

RESEARCH SPOTLIGHT

1. Segment Anything in High Quality → The paper introduces HQ-SAM, an enhancement to the Segment Anything Model (SAM) that improves the quality of mask predictions for segmenting objects with intricate structures while maintaining SAM's promptable design, efficiency, and zero-shot generalizability. 👇️ 

[Image: the predicted masks of SAM vs. HQ-SAM, given the same red box or several points]

2. Occ-BEV: Multi-Camera Unified Pre-training via 3D Scene Reconstruction → a multi-camera unified pre-training framework that leverages 3D scene reconstruction and Bird's Eye View (BEV) features from multi-view images to capture spatial and temporal correlations among different camera views, leading to improved performance in multi-camera 3D perception tasks.

3. Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles → a hierarchical vision transformer that achieves high accuracy while being significantly faster than previous models by eliminating unnecessary components through pretraining with a strong visual pretext task.

DEVELOPER’S CORNER

1. AI Basketball Referee → a computer vision-based system that uses a custom YOLO model trained on 3,000 annotated images to detect basketballs in real time. It also uses YOLO pose estimation to detect keypoints on the players' bodies.

2. AITemplate → a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving.

3. DragGAN → an unofficial implementation of the DragGAN paper, a method for interactive point-based manipulation of generative adversarial networks introduced just two weeks ago.

Previous Issue’s 3 Most Clicked Links

NEWSy bits

Drop me a line if you have any feedback or questions.

Sending you good vibes,

Dasha 🫶 
