Computer Vision Newsletter #38

New Self-Supervised Computer Vision Model from Meta; ViT Paper Explained; CVPR 2023 Survival Guide and more

AUTHOR PICKS  

Meta has been one of the biggest proponents of self-supervised learning. Yesterday, Meta AI announced I-JEPA, a self-supervised computer vision model that learns about the world by predicting it, building on Yann LeCun’s vision of autonomous machine intelligence that learns and reasons the way humans and animals do.

Key insights:

  • I-JEPA builds an internal model of the external world by comparing abstract representations of images rather than comparing pixels directly (a minimal sketch of the idea follows these bullets).

  • By learning from representations instead of pixels, the model avoids the biases and pitfalls that come with invariance-based pre-training.

  • No labeled data or data augmentation is needed, just masking!

  • The model delivers strong performance on various computer vision tasks without requiring extensive fine-tuning.

  • It is efficient, reaching SOTA results by training a 632M-parameter vision transformer model using just 16 A100 GPUs in under 72 hours.

  • It will be presented at CVPR 2023 next week, and the training code and model checkpoints are already available on GitHub.
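
For readers who want to see what “predicting in representation space” means concretely, here is a minimal, simplified sketch of a JEPA-style training step. It is my own toy illustration, not Meta’s code: the tiny MLP encoders, the 75% mask ratio, and the tensor shapes are all made up for readability.

    # Simplified JEPA-style training step (toy illustration, not Meta's code).
    # Idea: predict the *representations* of masked patches from the visible
    # (context) patches -- no pixel-level reconstruction.
    import copy
    import torch
    import torch.nn as nn

    patch_dim, embed_dim, num_patches = 48, 64, 196      # toy sizes

    context_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                    nn.Linear(embed_dim, embed_dim))
    target_encoder = copy.deepcopy(context_encoder)       # updated via EMA, not gradients
    predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                              nn.Linear(embed_dim, embed_dim))

    patches = torch.randn(8, num_patches, patch_dim)      # a batch of patchified images
    mask = torch.rand(8, num_patches) < 0.75              # patches whose representations we predict

    with torch.no_grad():                                 # targets come from the EMA encoder
        targets = target_encoder(patches)

    context = patches.clone()
    context[mask] = 0.0                                   # hide the target patches from the context
    preds = predictor(context_encoder(context))

    loss = ((preds[mask] - targets[mask]) ** 2).mean()    # L2 loss in representation space
    loss.backward()
    # After the optimizer step, the target encoder's weights are refreshed as an
    # exponential moving average of the context encoder's weights.

In the real model the encoders and predictor are Vision Transformers and the context/target blocks are sampled with a multi-block masking strategy, but the loss has the same structure: compare predicted and target representations, not pixels.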

LEARNING & INSIGHTS

1. [Video] Vision Transformer (ViT) Paper Explained → Jean de Nyandwi dives deep into the “An Image is Worth 16x16 Words” paper: its motivation, architecture, and training, and highlights recent follow-up works that improved the original Vision Transformer (a minimal patch-embedding sketch follows this list). 👇️ 

2. [Upcoming Event] How to leverage Embeddings for Data Curation in Computer Vision → Ever wondered how to train robust computer vision models? Trust me, randomly selecting a portion of your image dataset for training isn't as effective as you might think. I highly recommend registering for this tech talk hosted by AI Camp. The speaker, my brilliant colleague Sam, works with computer vision companies on a daily basis and has a wealth of knowledge to share (a rough sketch of embedding-based curation follows this list).

3. How to Build ML Model Training Pipeline → Picture a scenario: clean code, streamlined workflows, efficient model training. Too good to be true? That’s exactly what this guide dives into: how to create a clean, maintainable, and fully reproducible machine learning model training pipeline.

4. [Free Book] Deep Learning Interviews → hundreds of fully solved problems from a wide range of key topics in AI. It is designed both to rehearse interview- or exam-specific topics and to give machine learning M.Sc./Ph.D. students, as well as anyone preparing for an interview, a well-organized overview of the field.

5. Machine Learning Model Compression → When you deploy machine learning models to production, you need to take into account several operational metrics that are generally not ML-related. This article walks through different model compression methods you can use to improve both latency and model size (see the quantization sketch after this list).

6. CVPR 2023 Survival Guide → Top 10 papers you can’t miss, with links and summaries. Voxel51 selected papers based on GitHub project star counts, perceived impact on the field, and the team’s interest.
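
To complement item 1 above, here is a minimal sketch of the core idea in “An Image is Worth 16x16 Words”: cut the image into 16x16 patches, linearly embed them, add a class token and position embeddings, and feed the sequence to a standard Transformer encoder. The shapes and layer choices below are illustrative, not the paper’s exact configuration.

    # Minimal ViT front end: image -> 16x16 patch tokens -> Transformer encoder.
    import torch
    import torch.nn as nn

    img = torch.randn(1, 3, 224, 224)                    # (batch, channels, height, width)
    patch, dim = 16, 768

    # Patchify and linearly embed in one step with a strided convolution.
    to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch "words"

    cls_token = nn.Parameter(torch.zeros(1, 1, dim))
    pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))
    x = torch.cat([cls_token.expand(tokens.shape[0], -1, -1), tokens], dim=1) + pos_embed

    # A stock PyTorch Transformer encoder stands in for the ViT backbone here.
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2)
    features = encoder(x)                                # the class token (index 0) feeds the classifier head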
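
For item 2, one rough illustration of embedding-based curation (my own toy example, not necessarily what the talk will cover): cluster the image embeddings and sample across clusters, so the selected subset covers the dataset’s visual diversity instead of being a purely random draw.

    # Toy example: pick a diverse training subset by sampling across embedding clusters.
    import numpy as np
    from sklearn.cluster import KMeans

    embeddings = np.random.rand(10_000, 512)   # stand-in for real image embeddings from a pretrained model
    budget, n_clusters = 1_000, 50             # images to keep, clusters to spread them over

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

    selected = []
    per_cluster = budget // n_clusters
    for c in range(n_clusters):
        members = np.where(kmeans.labels_ == c)[0]
        take = min(per_cluster, len(members))
        selected.extend(np.random.choice(members, size=take, replace=False))

    print(f"Selected {len(selected)} images spread across {n_clusters} clusters")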
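
And for item 5, one of the simplest compression levers is post-training quantization. Below is a minimal PyTorch dynamic-quantization example on a toy model; the article itself surveys this family of techniques more broadly.

    # Minimal post-training dynamic quantization with PyTorch.
    import os
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))   # toy model

    # Store Linear weights as int8; activations are quantized on the fly at inference time.
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    def size_mb(m, path="tmp.pt"):
        torch.save(m.state_dict(), path)
        return os.path.getsize(path) / 1e6

    print(f"fp32: {size_mb(model):.2f} MB  ->  int8: {size_mb(quantized):.2f} MB")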

DEVELOPER RESOURCES 

1. Matte Anything → Interactive natural image matting with Segment Anything Models. It can produce high-quality alpha mattes with various simple hints. 👇️ 

2. YoloV8 TensorRT CPP → This project demonstrates how to use the TensorRT C++ API to run GPU inference for YoloV8.

3. Convert and Optimize YOLOv8 with OpenVINO™ → This tutorial provides step-by-step instructions for running and optimizing PyTorch YOLOv8 with OpenVINO (a minimal export sketch follows this list).

4. I-JEPA: the Image-based Joint-Embedding Predictive Architecture from Meta → Official PyTorch codebase for I-JEPA by Meta AI (discussed in Author Picks above), published at CVPR 2023.
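
Related to item 3: the Ultralytics package can also export a YOLOv8 checkpoint straight to OpenVINO IR, which pairs nicely with the tutorial. A quick sketch (the model name and paths below are just examples, and assume pip install ultralytics openvino):

    # Export a YOLOv8 model to OpenVINO IR and run it back through the Ultralytics API.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                    # downloads the nano checkpoint if it's not cached
    model.export(format="openvino")               # writes the IR to ./yolov8n_openvino_model/

    ov_model = YOLO("yolov8n_openvino_model/")    # the exported model loads like any other
    results = ov_model("https://ultralytics.com/images/bus.jpg")
    print(results[0].boxes.xyxy)                  # detected boxes in xyxy format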

RESEARCH SPOTLIGHT

1. Recognize Anything: A Strong Image Tagging Model → a strong foundation model for image tagging that accurately recognizes any common category without manual annotations. The model is trained through several steps, including automatic annotation, data cleaning, and fine-tuning, resulting in impressive zero-shot performance.

2. CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation → The paper proposes a hybrid architecture called CiT-Net, which combines convolutional neural networks (CNNs) and vision Transformers for improved medical image segmentation.

3. Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design → The paper proposes a method to determine compute-optimal model shapes, such as width and depth, and applies it to vision transformers (ViTs). The shape-optimized vision transformer (SoViT) achieves competitive results compared to larger models.

Miscellaneous

If Ground Truth has been valuable to you, share it with a fellow computer vision enthusiast. If it's not your cup of tea, maybe it will be for someone you're not so fond of. Either way, sharing is caring! 😉 

Drop me a line if you have any feedback or questions.

Sending you good vibes,

Dasha 🫶 
