Last Week in Computer Vision #18

Crypto and the Metaverse are out. Generative AI is in.

Models like DALL-E, Midjourney, and Stable Diffusion are transcending the boundaries of AI into mainstream pop culture. The opportunities for transforming industries with generative AI seem endless, and, not surprisingly, this has quickly become part of the investment thesis of many of the top venture capital firms in the world.

Here is what Emad Mostaque, Stability AI's CEO, had to say: “So much of the world is creatively constipated, and we’re going to make it so that they can poop rainbows.”  🦄

In this issue of The Ground Truth: Stability AI is now a unicorn; AI in medicine is over-hyped; MUSIQ - multi-scale image quality transformer; The State of AI Report; Stable Diffusion Course; Measuring Perception in AI models; Text-based real image editing with Diffusion Models; CircularNet - reducing waste with Machine Learning and much more!

Author Picks

StabilityAI

"No generative AI project has created as much buzz - or as much controversy - as Stable Diffusion. Stability AI, the company behind the Stable Diffusion models and the LAION datasets, raised $101 million at a $1 billion valuation! Stability AI challenged the many concerns around the safety of large text-to-image synthesis models by open-sourcing Stable Diffusion ahead of AI powerhouses like Google, Meta and OpenAI."

The first CNN prototypes that diagnosed diseases from CT or X-ray scans were built many years ago, and I vaguely remember many people proclaiming at the time that physicians might soon be displaced by AI models. This hasn’t happened yet, and it is unlikely to happen any time soon. The gap between tech innovation and tech deployment is often bigger and more uncertain than people estimate.

Insights & Learning

The State of AI Report, produced by AI investors Nathan Benaich and Ian Hogarth, analyzes top developments in AI. It encourages conversations around the state of AI and its implications for the future.

What do I need for running the state-of-the-art text-to-image model? Can a gaming card do the job, or should I get a fancy A100? What if I only have a CPU?

Illustrated Deep Learning cheatsheets covering the content of the Stanford CS 230 class. They are pretty great and can be useful not only to students but also to anyone else working with or interested in Deep Learning.

A course from fast.ai that will walk you through implementing the Stable Diffusion algorithm. Nearly every key technique in modern deep learning comes together in Stable Diffusion, making it a great learning objective.

Publications & Research

Google proposes a multi-scale image quality transformer (MUSIQ), which can handle full-size image input with varying resolutions and aspect ratios. By transforming the input image to a multi-scale representation with both global and local views, the model can capture the image quality at different granularities. Although MUSIQ is designed for image quality assessment, it can be applied to other scenarios where task labels are sensitive to image resolution and aspect ratio. 

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning.

"As we work towards the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that expand AI models’ capabilities is as important as developing the models themselves." To that end, DeepMind introduced the Perception Test, a multimodal benchmark that uses real-world videos to help evaluate the perception capabilities of a model.

Imagic

In this paper, the authors demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image! Unlike previous work, the proposed method requires only a single input image and a target text (the desired edit). It operates on real images and does not require any additional inputs (such as image masks or additional views of the object). "Imagic" leverages a pre-trained text-to-image diffusion model for this task.

Tools & Libraries

Lightly is a computer vision framework for self-supervised learning. The solution can be applied before any data annotation step, and the learned representations can be used to visualize and analyze datasets. This makes it possible to select the best core set of samples for model training through advanced filtering.

CircularNet from TensorFlow is a set of models that lowers barriers to AI/ML tech for waste identification and all the benefits this new level of transparency can offer. The goal is to develop a robust and data-efficient model for waste/recyclables detection, which can support the way we identify, sort, manage, and recycle materials across the waste management ecosystem.

AITemplate: Faster, more flexible inference on GPUs

Meta AI has developed and is open-sourcing AITemplate (AIT), a unified inference system with separate acceleration back ends for both AMD and NVIDIA GPU hardware. It delivers close to hardware-native Tensor Core (NVIDIA GPU) and Matrix Core (AMD GPU) performance on a variety of widely used AI models such as convolutional neural networks, transformers, and diffusers.
