Ground Truth
Computer Vision Newsletter #27
Text-to-video breakthroughs 📹️ How GPT-4 Changes Computer Vision 👀 Sparks of AGI 🧠
Quote of the week:
Starting to have a feeling that none of our SciFi imaginings have adequately prepared us for what’s coming.
🔬 Gen-2: A multi-modal AI system that can generate novel videos with text, images, or video clips → Just a short while ago, Runway broke new ground in generative AI by introducing Gen-1, a video-to-video model that lets users craft new videos from existing ones using words and images. Now Runway has raised the bar again with the unveiling of Gen-2, a multi-modal AI system capable of creating entirely new videos from nothing more than a few words.
🤌 Speculating on How GPT-4 Changes Computer Vision → Are general models going to obviate the need to label images and train models? Has YOLO only lived once? How soon will general models be adopted throughout the industry? What tasks will benefit the most from general model inference and what tasks will remain difficult?
🤌 The AI Powerhouses of Tomorrow → an opinion piece on who will profit most from AI and whether it'll be accessible to all or remain exclusive. AI is entering a "takeoff stage," becoming increasingly intuitive and user-friendly. This shift is transforming our interaction with the world. Although some remain skeptical of AI's potential, its capabilities continue to expand and improve each day.
🤌 What can we learn from Lex Fridman’s interview with Sam Altman? → In his signature long-form style, Lex Fridman engaged in an insightful conversation with Sam Altman. For those who'd like to watch or listen, the YouTube video is available here. However, if you're short on time, you might enjoy these highly opinionated notes featured on LessWrong.
Learning & Insights
🤓 Vision Transformers: Intelligent video processing with vision transformers → Recent work on Multiscale Vision Transformers (MViTs) has shown that ViTs are incredibly useful for solving video recognition tasks like action recognition or detection. However, the ViT architecture in its original form is not enough to solve such tasks accurately or efficiently.
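The core ViT idea that MViTs build on — cutting the input into fixed-size patches that become a sequence of tokens — can be sketched in a few lines of NumPy (the `patchify` helper and the shapes below are illustrative, not MViT's actual implementation):

```python
import numpy as np

def patchify(image, patch=16):
    # Split an (H, W, C) image into non-overlapping patches,
    # each flattened into a token vector of length patch*patch*C.
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)  # (num_tokens, token_dim)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)
# 224/16 = 14 patches per side, so 14*14 = 196 tokens of dim 16*16*3 = 768
```

For video, the same trick is applied over space and time (tubelets instead of patches), which is exactly where the multiscale pooling of MViTs earns its efficiency.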
🤓 CVPR 2022 Tutorial on MultiModal Machine Learning → This tutorial builds upon the annual course on multimodal machine learning taught at Carnegie Mellon University and is a completely revised version of the previous tutorials on multimodal learning at CVPR, ACL, and ICMI conferences.
🤓 Midjourney V5! How Does It Stack Up? → what’s changed in Midjourney version 5 and how does it compare to previous versions? Side-by-side comparison and analyses.
🤓 Stable Diffusion – A New Paradigm in Generative AI → This article breaks down the Stable Diffusion model into its components, dissects its inner workings, and covers its different versions and variations.
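Underneath all the components the article covers sits one simple identity: the forward process blends a clean signal with Gaussian noise, and generation is learning to undo that blend. A toy NumPy sketch (the schedule values are illustrative, and a real model *predicts* the noise rather than being handed it):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over T steps (illustrative values,
# not the schedule Stable Diffusion actually uses).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.ones(8)  # a tiny stand-in for a clean latent

def q_sample(x0, t, noise):
    # Forward process: mix the clean signal with Gaussian noise.
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

# With an "oracle" noise predictor, the blend inverts exactly;
# training a U-Net to approximate this inversion is the hard part.
noise = rng.standard_normal(8)
xt = q_sample(x0, T - 1, noise)
x0_hat = (xt - np.sqrt(1 - alpha_bar[T - 1]) * noise) / np.sqrt(alpha_bar[T - 1])
```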
⚙️ CleanVision → automatically detects potential issues in image datasets, such as blurry, under- or over-exposed, or (near-)duplicate images.
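The duplicate check in particular can be approximated with a plain content hash. A minimal stdlib sketch (the `find_exact_duplicates` helper is a hypothetical illustration, not CleanVision's API, and it only catches exact duplicates, not near-duplicates):

```python
import hashlib
from collections import defaultdict

def find_exact_duplicates(images):
    # Hash each image's raw bytes and group identical digests;
    # any group with more than one entry is a set of exact duplicates.
    groups = defaultdict(list)
    for name, data in images.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Two identical byte strings stand in for identical image files.
dupes = find_exact_duplicates({
    "a.png": b"\x89PNG...",
    "b.png": b"\x89PNG...",
    "c.png": b"\x89PNGother",
})
# → [["a.png", "b.png"]]
```

Near-duplicate detection is what makes tools like CleanVision more useful than this: it requires perceptual hashing or embedding similarity, since a one-pixel change defeats an exact hash.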
⚙️ Stable Diffusion Reimagine → a new Clipdrop tool that generates multiple variations of a single image without limits. Users simply upload an image and create as many variations as they want. Reimagine’s model will soon be open-sourced on StabilityAI’s GitHub.
⚙️ MONAI: Generative models for medical imaging → a GitHub repo for the creation, evaluation, and curation of generative models for a wide variety of tasks, with a specific focus on medical imaging.
⚙️ Alibaba’s Text-to-Video Synthesis → Alibaba released a 1.7-billion-parameter text-to-video diffusion model. The model has been launched on ModelScope Studio and Hugging Face, where you can play with it directly, or refer to this Colab page to build it yourself.
🔬 Sparks of Artificial General Intelligence: Early experiments with GPT-4 → GPT-4 excels in various fields, including vision. Its capabilities suggest it could be an early, albeit incomplete, version of artificial general intelligence (AGI). This study from Microsoft focuses on GPT-4's limitations and explores challenges for more advanced AGI development, contemplating the need for new paradigms beyond next-word prediction. The paper concludes with reflections on societal impacts and future research directions.
🔬 A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need? → This work offers a comprehensive review of Generative AI techniques and applications, covering foundations like model architecture, pretraining, and generative modeling methods. The study explores AIGC tasks for various output types and their applications in industries like education and creative content. It also discusses current challenges and the future evolution of generative AI.
🔬 Pix2Video: Video Editing using Image Diffusion → This paper proposes a training-free and generalizable approach for text-guided video editing using pre-trained structure-guided image diffusion models. The two-step method involves performing text-guided edits on an anchor frame using the diffusion model and then progressively propagating the changes to future frames via self-attention feature injection, resulting in realistic edits without any compute-intensive preprocessing or video-specific finetuning.
Quick Industry News
⚡️ NVIDIA to Bring AI to Every Industry → From AI training to deployment, semiconductors to software libraries, systems to cloud services, NVIDIA CEO Jensen Huang outlined how a new generation of breakthroughs will be put at the world’s fingertips.
⚡️ Adobe Firefly: an AI image generator → a family of creative generative AI models coming to Adobe products. The initial feedback from the community is that the tool is actually pretty good and easier to use than other image generators.
⚡️ Canva adds AI to everything → Canva users can now turn an image into a completely personalized design template, remove elements from an image, replace an element with an AI-generated image, generate a presentation from a prompt, and enjoy improved text-to-image.
⚡️ Microsoft’s Bing chatbot now lets you create images via OpenAI’s DALL-E → The Bing Image Creator will be powered by an “advanced version” of OpenAI’s DALL-E model and will let Bing users create images by simply writing what they want to generate.
Midjourney V5’s photorealism took the internet by storm! The plot thickens as we ponder the implications of these ultra-realistic digital illusions.
Thanks for reading!
Over and out,