Last Week in Computer Vision #26

The World Will Never Be The Same 🤯

Hello there! 👋 

Are you feeling dizzy from all the AI updates and breakthroughs that flooded our feeds last week? I know I am. 😵‍💫 I've sifted through the chaos and extracted the juiciest bits for you in this week's issue. But brace yourself, as the NVIDIA GTC conference is kicking off today with a bold promise of ushering in the "Era of AI and the Metaverse". Looks like we're in for another wild ride of mind-bending news and innovation!

Quote of the week:

I've been working on neural networks for almost a decade. The best way to describe how I'm currently feeling is like a dog that caught up with the car it was chasing, sunk its teeth in the fender, and is now traveling at 80 mph -- tail wagging, jaw tiring.

@charles_irl

Super quick highlights of this issue:

 🤌 GPT-4 can see and Midjourney V5 can do hands now

 🧐 Understanding ViTs, Best Laptops for Deep Learning, MIT course, Multimodal Learning course

 ⚙️ Web Stable Diffusion, PyTorch 2.0, repository of ML surveys

 🔬 Consistency Models, Personalized Image Manipulation by Stable Diffusion, DeepMIM

 😁 Quiz: AI-generated or not + Meme therapy

Author Picks

On Tuesday, OpenAI announced the release of GPT-4, their most secretive release to date. This marks their full transition from a nonprofit research lab to a for-profit tech firm. So, what's new in GPT-4? It can now process images alongside text, plays with language more skillfully, can handle much longer inputs, and has been powering Bing all along. However, it still makes mistakes.

On Thursday, Midjourney released Version 5 and a new magazine. The magazine aims to showcase the diverse creativity of the Midjourney community. You can get the first issue of the magazine for free with the promo code "subscribe."

However, the real excitement surrounds Midjourney V5. This version includes several improvements, but everyone is talking about the unmatched photorealism of generated images and the ability to generate hands!

Whenever new technology captures consumer attention this quickly, it raises the question: is there real value here? a16z believes the answer is undoubtedly yes. Generative AI will be the next major platform upon which founders build category-defining products.

[Event] NVIDIA GTC Developer Conference → GTC is starting today! Based on the tagline "The Conference for the Era of AI and the Metaverse," it appears this week will be just as eventful as the last in terms of major AI announcements. Buckle up, everyone!

Learning & Insights

🧐 Understanding Vision Transformers (ViTs): Hidden properties, insights, and robustness of their representations → It is well-established that Vision Transformers (ViTs) can outperform convolutional neural networks (CNNs), such as ResNets, in image recognition. But what factors account for ViTs' superior performance?

🧐 Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023 → Top picks for laptops that are perfect for machine learning, data science, and deep learning at every budget. Over 8,000 options were analyzed to identify the best to help future-proof your AI rig.

🧐 Multimodal Learning: Vision Language Models → A resource that distills the field's rapid advancement by presenting a few key architectures and core concepts behind the best results. Transformer-like models and contrastive learning are currently the most promising approaches.

🧐 Introduction to Deep Learning → An efficient and intense boot camp from MIT designed to teach you the fundamentals of deep learning as quickly as possible. Classes are on Friday at 10 am ET, every week.

Developer’s Corner

⚙️ Web Stable Diffusion → This project brings stable diffusion models onto web browsers. Everything runs inside the browser with no server support.

⚙️ PyTorch 2.0 → PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing how PyTorch operates at the compiler level under the hood, delivering faster performance along with support for Dynamic Shapes and Distributed.
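The headline API is a single function, `torch.compile`, layered on top of unchanged eager-mode code. A minimal sketch (assumes PyTorch >= 2.0 is installed; the toy model and the `backend="eager"` choice are my own, picked so the sketch runs without a compiler toolchain):

```python
# Sketch of the PyTorch 2.0 workflow: write ordinary eager-mode code,
# then wrap the model with torch.compile. The model here is a toy
# placeholder, not from the release notes.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# backend="eager" captures the graph but skips codegen, so this runs
# anywhere; drop the argument to use the default inductor backend.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 8)
out = compiled(x)
print(out.shape)  # torch.Size([4, 2])
```

The compiled module is called exactly like the original one, which is why existing training loops typically need only the one-line change.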

⚙️ ml-surveys → It's hard to keep up with the latest and greatest in machine learning. Here's a selection of survey papers summarizing the advances in the field.

⚙️ xView Dataset → One of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.

⚙️ modelstore → A Python library that allows you to version, export, save, and download machine learning models in your choice of storage.

Research Spotlight

HiPer

Image manipulation results with highly personalized (HiPer) text embeddings

🔬 Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion → Diffusion models have been successful in image generation and manipulation. However, their stochastic nature presents challenges in maintaining image content and identity. Previous models, such as DreamBooth and Textual Inversion, rely on multiple reference images and complex training, limiting their usefulness. This paper presents a simple approach to personalization using highly personalized (HiPer) text embedding by decomposing the CLIP embedding space for personalization and content manipulation. The proposed method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just one image and target text.

🔬 Consistency Models → Diffusion models are computationally expensive because they must decode the output iteratively over many steps. To overcome this limitation, this paper proposes consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks.
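The trick behind one-step generation is a parameterization that forces the model to be the identity at the smallest noise level. A toy sketch of that boundary condition (my own illustration, not code from the paper; the constants follow the paper's schedule choices, and the network is a stand-in):

```python
# Consistency-model parameterization: f(x, t) = c_skip(t)*x + c_out(t)*F(x, t),
# with coefficients chosen so that f(x, EPS) == x exactly. That boundary
# condition is what lets a single forward pass map noise straight to data.
import math

EPS = 0.002        # smallest time step used in the paper
SIGMA_DATA = 0.5   # assumed data standard deviation

def c_skip(t):
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    return SIGMA_DATA * (t - EPS) / math.sqrt(t**2 + SIGMA_DATA**2)

def consistency_fn(x, t, network):
    # One forward pass maps a noisy sample at time t back to the data point.
    return c_skip(t) * x + c_out(t) * network(x, t)

# At t = EPS, c_skip is 1 and c_out is 0, so the network output is ignored
# and the input passes through unchanged:
print(consistency_fn(1.7, EPS, network=lambda x, t: 123.0))  # 1.7
```

Because every point on a diffusion trajectory is trained to map to the same origin, sampling can be a single network evaluation, or a handful of them when more quality is wanted.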

🔬 DeepMIM: Deep Supervision for Masked Image Modeling → Deep supervision, which adds extra supervision to a neural network's intermediate features, was widely used for image classification in the early days of deep learning. It simplified training and avoided issues like vanishing gradients, but normalization techniques and residual connections eventually replaced it. This paper revisits deep supervision for masked image modeling (MIM) using a pre-trained Vision Transformer (ViT) and a mask-and-predict scheme. DeepMIM accelerates model convergence, improves the representations learned in shallower layers, and increases attention diversity.

Fun

😅 Latest Midjourney V5 in action! I got a score of 51% on this quiz despite my best efforts.

🤭 Meme Therapy

Have a great week!

Over and out,

Dasha
