Last Week in Computer Vision #25

AI is no longer a one-trick pony. 🦄 Multimodal AI is here.

Hello there! 👋 

Do you remember when we used to think that AI models could only perform one task at a time? And that one model couldn't be expected to deal with different data modalities? Well, while working on this issue, I almost forgot this was ever the case.

In the past week, there has been a surge of research directed toward developing Multimodal ML systems capable of comprehending/generating natural language and visual content simultaneously. As someone who is invested in the advancements of AI, I must say that this is particularly thrilling. Make sure to check out the Research Spotlight section of this issue.

Super Quick Highlights: Survey of AI-Generated Content | Unlocking ML requires an ecosystem approach | AI is eating the world | Ultra-fast ControlNet | Learn PyTorch in a day | Fine-Tuning Pre-Trained Models in TensorFlow & Keras | Serve Stable Diffusion Three Times Faster | Visual ChatGPT | Scaling up GANs for Text-to-Image Synthesis | PaLM-E: An embodied multimodal language model | Prismer: A Vision-Language Model with Multi-Modal Experts and more!

If you like Ground Truth, share it with a computer vision friend! If you hate it, share it with an enemy. 😉 

Author Picks

Image credit: midjourney_by_nextdimension

 A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT → This survey provides a comprehensive review of the history of generative models and their basic components, as well as recent advances in AIGC across unimodal and multimodal interaction.

 Unlocking ML requires an ecosystem approach → The ML field is still nascent, and significant work is needed to transform it into an established discipline. MLCommons identified critical ecosystem areas to improve: data discoverability, access, and quality; operational scaling and hardening; and ethical applied ML.

 AI is Eating The World → As new AI capabilities make way for new products, it’s reasonable to ask: How does this change the value game? This article shares observations on the value of the AI technology stack and focuses on where some of the technical moats might be.

Learnings & Insights

🤓 [Tutorial] Serve Stable Diffusion Three Times Faster → Leverage optimizations from PyTorch and third-party libraries such as DeepSpeed to reduce the cost of serving Stable Diffusion without a significant impact on the quality of the generated images.
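To give a flavor of the kind of optimization the post covers, here is a minimal sketch that loads the diffusers StableDiffusionPipeline in half precision and lets DeepSpeed's kernel injection wrap the UNet. The checkpoint name, prompt, and exact kwargs are illustrative, not copied from the post.

```python
import torch
import deepspeed
from diffusers import StableDiffusionPipeline

# Half precision alone already cuts latency and memory noticeably.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; use your own
    torch_dtype=torch.float16,
).to("cuda")

# Let DeepSpeed swap the UNet's modules for its fused inference kernels;
# the UNet dominates the cost of every denoising step.
pipe.unet = deepspeed.init_inference(
    pipe.unet,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

image = pipe("an astronaut riding a horse, photorealistic").images[0]
image.save("astronaut.png")
```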

🤓 [Tutorial] Ultra-fast ControlNet with Diffusers → ControlNet provides a framework for customizing the generation process of Stable Diffusion models. This blog post introduces pre-trained ControlNets on Hugging Face and shows how they can be applied with various control conditionings.
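Roughly, running a pre-trained ControlNet with diffusers looks like the sketch below. It assumes the Canny-edge ControlNet checkpoint and a pre-computed edge map (the file path is a placeholder); it is not the post's exact code.

```python
import torch
from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)
from diffusers.utils import load_image

# A pre-computed Canny edge map that conditions the generation (illustrative path).
edge_map = load_image("canny_edges.png")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

# The speed tricks: a fast scheduler, fewer steps, and offloading idle submodels.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

image = pipe(
    "a futuristic city at night", image=edge_map, num_inference_steps=20
).images[0]
image.save("controlnet_city.png")
```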

🤓 [Tutorial] Fine-Tuning Pre-Trained Models in TensorFlow & Keras → A walkthrough of fine-tuning in the context of image classification, using the VGG-16 network as an example.
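The general recipe, sketched below with standard Keras APIs, is to freeze the pre-trained backbone, train a new classification head, then unfreeze and fine-tune at a much lower learning rate. The class count, learning rates, and datasets are placeholders.

```python
import tensorflow as tf

# VGG-16 pre-trained on ImageNet, without its original classification head.
base = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)
base.trainable = False  # step 1: freeze the convolutional backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 target classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)

# Step 2: unfreeze and fine-tune end to end with a much smaller learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```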

🤓 Prompt Engineering Guide [so you can ask ChatGPT questions about computer vision] → A guide that collects the latest papers, learning guides, lectures, references, and tools related to prompt engineering.

🤓 [Course] Learn PyTorch in a day. Literally. → a great course that will teach you the foundations of machine learning and deep learning with PyTorch.

Research Spotlight

🔎 Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models → Microsoft researchers built Visual ChatGPT, a system that incorporates different Visual Foundation Models so users can interact with ChatGPT beyond text: they can send, receive, and edit images during chatting.

🔎 Scaling up GANs for Text-to-Image Synthesis → 1B-parameter GigaGAN is a large-scale GAN model trained on a diverse large dataset (i.e. LAION), that can generate high-resolution images orders of magnitude faster than diffusion and autoregressive models. While the current standards have shifted towards diffusion and autoregressive modeling, this work shows that GANs are still competitive given the right design choices.

GigaGAN: Large-scale GAN for Text-to-Image Synthesis

🔎 PaLM-E: An embodied multimodal language model → A 562-billion 🤯 parameter, general-purpose, embodied visual-language generalist, trained on a diverse mixture of tasks across multiple robot embodiments as well as general vision-language tasks. The researchers demonstrate that this diversity in training leads to positive transfer from the vision-language domains into embodied decision-making, enabling robot planning tasks to be learned data-efficiently.

🔎 Prismer: A Vision-Language Model with Multi-Modal Experts → Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of diverse, pre-trained domain experts. It achieves fine-tuned and few-shot vision-language reasoning performance competitive with the current state of the art, while requiring up to two orders of magnitude less training data.

Developer’s Corner

⚙️ pigeon → Pigeon is a simple widget that lets you quickly annotate a dataset of unlabeled examples from the comfort of your Jupyter notebook.
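As a rough sketch of how that looks in practice (the file names are made up, and the display function is a plain IPython image display):

```python
from IPython.display import Image, display
from pigeon import annotate

# Show each example with label buttons underneath; the results accumulate in
# `annotations` as (example, label) pairs as you click through them.
annotations = annotate(
    ["img_001.jpg", "img_002.jpg", "img_003.jpg"],  # hypothetical unlabeled files
    options=["cat", "dog", "other"],
    display_fn=lambda path: display(Image(path)),
)
```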

⚙️ Pandas 2.0 Release Candidate is out → Judging by the release notes, the focus of this release is mainly on bug fixes, with no major API changes.

⚙️ sketchy vision → Visuals covering key Computer Vision concepts.

⚙️ phind → The AI search engine for developers.

⚙️ FSVVD: A Dataset of Full Scene Volumetric Video → A dataset of volumetric videos depicting people interacting with external scenes in real-life scenarios.

Miscellaneous

Meme Therapy 😏 

Have a great week!

Over and out,

Dasha
