Computer Vision Newsletter #36

CVPR 2023 & State of Computer Vision; Face Recognition Models, Toolkit and Datasets; Document Understanding with Donut and more

CVPR 2023 is fast approaching, bringing a surge of breakthroughs in computer vision and, admittedly, a dash of keep-up anxiety. But don't worry! I've got you covered. I'll be weaving in these exciting developments starting today. 🤗 

AUTHOR PICKS

1. Advancements in Face Recognition Models, Toolkit and Datasets → The article offers an extensive tour of current face recognition models, toolkits, datasets, and pipelines. By dissecting popular models, it traces the evolution of face recognition technology, showing how each successive innovation builds on the last and propels the field to impressive new milestones (a minimal code sketch of the shared pipeline follows below).

2. CVPR 2023 and the State of Computer Vision → This article digs into key stats and info about CVPR 2023, along with a handpicked list of standout papers and the latest trends in computer vision as we gear up for this year's CVPR conference.

3. Cool CVPR 2023 Visualization → browse papers by subject area, team size, and category. 👇️ 

[Interactive visualization of CVPR 2023 papers]
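
Back to pick #1: most modern face recognition systems boil down to the same two steps — detect and align a face, then map it to an embedding vector that can be compared with cosine similarity. Here's a minimal sketch of that idea using the facenet-pytorch package; the model choice, file names, and the 0.6 threshold are my illustrative picks, not from the article:

```python
from facenet_pytorch import MTCNN, InceptionResnetV1
from PIL import Image
import torch

# Face detector/aligner and a pretrained embedding network
mtcnn = MTCNN(image_size=160)
embedder = InceptionResnetV1(pretrained="vggface2").eval()

def embed(path):
    face = mtcnn(Image.open(path).convert("RGB"))  # aligned face crop
    return embedder(face.unsqueeze(0)).squeeze(0)  # 512-d embedding

with torch.no_grad():
    a, b = embed("person_a.jpg"), embed("person_b.jpg")
    similarity = torch.cosine_similarity(a, b, dim=0).item()

# Same identity usually means high cosine similarity; the exact
# threshold is tuned per dataset (0.6 here is just illustrative).
print("same person?", similarity > 0.6)
```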

LEARNING & INSIGHTS

1. Generative AI for Document Understanding with Hugging Face and Amazon SageMaker → Learn how to fine-tune and deploy Donut-base for document understanding and document parsing using Hugging Face Transformers and Amazon SageMaker (a minimal inference sketch follows this list).

2. Multimodal Models and Computer Vision → Explore the challenges and opportunities of multimodal machine learning, and the different architectures and techniques used to tackle multimodal computer vision problems.

3. An Introduction to Unsupervised Learning → This article defines unsupervised learning, discusses its best-known algorithms, and provides examples of unsupervised learning in computer vision and natural language processing (a toy clustering example follows this list).

4. How to Save Trained Model in Python → Learn about the different methods of saving, storing, and packaging a trained machine-learning model, along with the pros and cons of each (the two most common options are sketched after this list).

5. Guide to Autoencoders → Understand what autoencoders are, what they're used for, and which architectures are popular (a minimal autoencoder appears after this list).

6. Review: ImageBind - One Embedding Space To Bind Them All → Deep dive into ImageBind, a novel approach that enables learning of a joint embedding across six different modalities — images, text, audio, depth, thermal, and IMU data.
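
Since Donut shows up twice in this issue, here's roughly what inference with a fine-tuned checkpoint looks like in Transformers. The CORD receipt-parsing checkpoint and its task prompt below are one public example, not necessarily the tutorial's exact setup:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"  # public receipt-parsing checkpoint
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut is prompted with a task token and decodes the document as a token sequence
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=768)
sequence = processor.batch_decode(outputs)[0]
print(processor.token2json(sequence))  # structured fields as JSON
```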
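
For the unsupervised learning piece, the classic computer-vision example is clustering images by their embeddings, no labels required. A toy version with scikit-learn, where random vectors stand in for real image embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real image embeddings (e.g. from a pretrained CNN)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))

# Group visually similar images without any labels
kmeans = KMeans(n_clusters=10, random_state=0).fit(embeddings)
print(kmeans.labels_[:20])  # cluster id per image
```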
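
On saving models, the usual trade-off is: pickle/joblib serialize the whole Python object (convenient, but tied to the library versions used to save it), while framework-native formats like a PyTorch state_dict store only the weights (robust, but you must re-create the architecture yourself). A quick sketch of both, under those assumptions:

```python
import pickle
import joblib
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

sk_model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
net = nn.Linear(4, 2)

# Option 1: pickle / joblib serialize the entire Python object
with open("model.pkl", "wb") as f:
    pickle.dump(sk_model, f)
joblib.dump(sk_model, "model.joblib")  # better for large numpy arrays

# Option 2: framework-native weights only (PyTorch shown here);
# you rebuild the architecture, then load the saved weights into it
torch.save(net.state_dict(), "weights.pt")
net.load_state_dict(torch.load("weights.pt"))
```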
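
And to make the autoencoder guide concrete, the smallest possible version: an encoder compresses the input to a low-dimensional code, a decoder reconstructs it, and training minimizes reconstruction error. Dimensions below are arbitrary:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compress the input down to a small latent code
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        # Decoder: reconstruct the input from the code
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)  # stand-in batch (e.g. flattened MNIST images)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error
loss.backward()
opt.step()
```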

DATA CURATION WEBINAR

If we want our CV models to perform at their best and generalize well across scenarios, we have to carefully curate relevant, diverse, and representative datasets. But how do you sift through a massive trove of image data? The answer lies in embeddings, the Swiss Army knife of machine learning. This webinar guides you through data curation with embeddings; a tiny illustration of the core idea follows.
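
Once every image is a vector, near-duplicates are just pairs with very high cosine similarity. A self-contained toy example (random vectors stand in for real image embeddings, and the 0.95 threshold is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 512))
embeddings[42] = embeddings[7] + 0.01 * rng.normal(size=512)  # plant a near-duplicate

# Normalize rows so the dot product equals cosine similarity
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)

# Upper triangle avoids reporting each pair twice
dup_a, dup_b = np.nonzero(np.triu(sim > 0.95))
print(list(zip(dup_a, dup_b)))  # pairs to review or drop, e.g. [(7, 42)]
```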

RESEARCH SPOTLIGHT  


1. ControlVideo: Training-free Controllable Text-to-Video Generation → This paper presents a training-free framework called ControlVideo for efficient and natural text-to-video generation. It addresses the training cost, appearance inconsistency, and structural flicker of video synthesis by leveraging structural consistency, introducing fully cross-frame interaction, employing frame interpolation, and using a hierarchical sampler (see the sketch after this list).

2. Model evaluation for extreme risks → This paper from DeepMind emphasizes the importance of model evaluation in addressing extreme risks associated with the development of AI systems, highlighting the need for identifying dangerous capabilities and assessing model alignment to prevent harm and make responsible decisions.

3. VanillaNet: the Power of Minimalism in Deep Learning → VanillaNet is a neural network architecture that embraces simplicity and minimalism: no great depth, no shortcuts, and no intricate operations like self-attention. The result is refreshingly concise yet remarkably powerful, sidestepping complexity and enabling efficient deployment in resource-constrained environments (a toy illustration follows below).
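
ControlVideo's "cross-frame interaction" is easier to grasp in code. The general trick (this is the idea in its simplest form, not the paper's exact mechanism) is to let every frame's queries attend to keys and values gathered from all frames, which pushes the frames toward a consistent appearance:

```python
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim) per-frame attention projections."""
    f, t, d = k.shape
    # Share keys/values across all frames so each frame attends to every frame
    k_all = k.reshape(1, f * t, d).expand(f, -1, -1)
    v_all = v.reshape(1, f * t, d).expand(f, -1, -1)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / d**0.5, dim=-1)
    return attn @ v_all  # (frames, tokens, dim)

out = cross_frame_attention(*(torch.rand(8, 64, 32) for _ in range(3)))
print(out.shape)  # torch.Size([8, 64, 32])
```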
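
And in the VanillaNet spirit, here's a toy illustration of the design philosophy (nothing like the paper's actual layer counts or training tricks): a plain stack of convolutions, activations, and pooling, with no residual connections or attention anywhere:

```python
import torch
import torch.nn as nn

# A plain conv stack: no shortcuts, no attention, just conv -> act -> pool
plain_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=4),  # stem
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, kernel_size=1), nn.BatchNorm2d(256), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(256, 1000),
)
print(plain_net(torch.rand(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```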

DEVELOPER'S CORNER

1. ControlVideo [official PyTorch implementation] → ControlVideo adapts ControlNet to the video setting without any fine-tuning, aiming to directly inherit its high-quality and consistent generation.

2. Donut 🍩: Document Understanding Transformer → a new document-understanding model achieving state-of-the-art performance, released under an MIT license, which allows commercial use.

3. Roop: One-click deep fake → Take a video and replace the face in it with a face of your choice. You only need one image of the desired face. No dataset, and no training.

NEWSY BITS

P.S. Should you find yourself in the thick of real-world computer vision projects and seeking a streamlined approach to handling, curating, and labeling your data, consider giving Superb AI a shot. It might just be the solution you need!

Drop me a line if you have any feedback or questions.

Sending you good vibes,

Dasha 🫶 
