OpenAI’s new ‘Sora’ AI-powered text-to-video tool is so good that its outputs could easily be mistaken for real videos, prompting deepfake fears in a year of important global elections.
Sora
OpenAI says that its new Sora text-to-video model can generate realistic videos up to a minute long while maintaining visual quality and adherence to the user’s prompt. Sora can either generate entire videos “all at once” or extend previously generated videos to make them longer.
According to OpenAI, Sora can “generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background”.
How?
Although Sora is based on OpenAI’s existing technologies, such as its DALL·E image generator and its GPT large language models (LLMs), what makes its outputs so realistic is the combination of Sora being a diffusion model and its use of “transformer architecture”. As a diffusion model, Sora starts its video-making process with something resembling “static noise” and gradually transforms it into a coherent video by removing that noise over many steps.
The transformer architecture, meanwhile, means the “model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world”, i.e. it contextualises and pieces together sequential data.
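To make the diffusion idea above more concrete, here is a minimal, purely illustrative Python sketch. The arrays, the denoise_step function, and the “prompt embedding” are stand-ins invented for this example and bear no relation to Sora’s actual internals: the “video” starts as random noise and is nudged, step by step, toward something consistent with the prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: an 8-frame "video" of 4x4 pixels, and a fake prompt embedding.
frames = rng.normal(size=(8, 4, 4))       # starts as pure static noise
prompt_embedding = np.full((4, 4), 0.5)   # pretend this encodes the prompt

def denoise_step(frames, prompt_embedding, step, total_steps):
    """One toy denoising step: nudge the noisy frames toward a target
    derived from the prompt embedding. A real diffusion model would use
    a trained network to predict and subtract the noise instead."""
    target = np.broadcast_to(prompt_embedding, frames.shape)
    blend = 1.0 / (total_steps - step)    # remove a little more noise each step
    return frames + blend * (target - frames)

total_steps = 50
for step in range(total_steps):
    frames = denoise_step(frames, prompt_embedding, step, total_steps)

print(float(np.abs(frames - 0.5).max()))  # ~0.0: the noise has been removed
```

In Sora itself, the denoising is performed by a large trained transformer rather than a hand-written rule, which is how prompt understanding and gradual noise removal are combined in a single model.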
Other aspects that make Sora special include how its “deep understanding of language” enables it to accurately interpret prompts and “generate compelling characters that express vibrant emotions”, and the fact that Sora can “create multiple shots within a single generated video that accurately persist characters and visual style”.
Weaknesses
OpenAI admits, however, that Sora has its weaknesses, including:
– Not always accurately simulating the “physics of a complex scene” or understanding cause and effect. OpenAI gives the example of a person taking a bite out of a cookie, after which the cookie may not show a bite mark.
– Confusing spatial details of a prompt, e.g. mixing up left and right.
– Struggling with precise descriptions of events that take place over time, e.g. following a specific camera trajectory.
Testing & Safety
The potential and power of Sora (for both good and bad) mean that OpenAI appears to be making sure it is thoroughly tested before any public release. For example, Sora is currently available only to ‘red teamers’, who are assessing critical areas for potential harms and risks, and to a number of visual artists, designers, and filmmakers, whose feedback will help advance the model to be most helpful for creative professionals.
Other measures that OpenAI says it’s taking to make sure Sora is safe include:
– Building tools to help detect misleading content, including a detection classifier that can tell when a video was generated by Sora, and plans to include C2PA metadata (data that verifies a video’s origin and related information). Both of these could help combat the use of Sora for malicious or misleading deepfakes.
– Leveraging the existing safety methods used for DALL·E, such as a text classifier that checks and rejects input prompts that violate OpenAI’s usage policies, e.g. requests for extreme violence, sexual content, hateful imagery, celebrity likeness, or the intellectual property of others.
– Using image classifiers to review each video frame for adherence to OpenAI’s usage policies before a video is shown to the user (a simplified sketch of how such a two-stage check might fit together follows this list).
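As a rough illustration of that two-stage check, here is a hypothetical Python sketch. The policy list and both classifier functions are toy stand-ins invented for this example; OpenAI has not published its actual implementation.

```python
# Illustrative two-stage moderation pipeline, loosely modelled on the
# checks described above. All helpers here are hypothetical stand-ins.

BLOCKED_TERMS = {"extreme violence", "hateful imagery"}  # toy policy list

def prompt_violates_policy(prompt: str) -> bool:
    """Toy text 'classifier': a real system would use a trained model."""
    return any(term in prompt.lower() for term in BLOCKED_TERMS)

def frame_violates_policy(frame: str) -> bool:
    """Toy image classifier stand-in: always passes in this sketch."""
    return False

def generate_video_safely(prompt: str, frames: list[str]) -> list[str]:
    # Stage 1: reject non-compliant prompts before any generation happens.
    if prompt_violates_policy(prompt):
        raise ValueError("prompt rejected under usage policy")
    # Stage 2: review every generated frame before showing the video.
    if any(frame_violates_policy(f) for f in frames):
        raise ValueError("generated video withheld under usage policy")
    return frames

# Usage: a compliant prompt and stand-in frames pass both checks.
print(generate_video_safely("a dog surfing at sunset", ["frame1", "frame2"]))
```

The key design point the sketch illustrates is that checks happen both before generation (on the prompt) and after it (on every frame), so a policy-violating video is never shown to the user.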
Concerns
Following the announcement of how realistic Sora’s videos can be, concerns have been expressed online about its potential use by bad actors to spread misinformation and disinformation via convincing Sora-produced deepfake videos (if Sora is publicly released in time for that to be possible). The ability of convincing deepfake videos to influence opinion is of particular concern with major elections coming up this year, e.g. in the US, Russia, Taiwan, the UK, and many more countries, and with major high-profile conflicts still ongoing (e.g. in Ukraine and Gaza).
More than 50 countries, collectively accounting for half the planet’s population, will hold national elections during 2024, and if Sora’s videos are as convincing as reported, and/or the security measures and tools prove less effective than hoped, the consequences for countries, economies, and world peace could be dire.
What Does This Mean For Your Business?
For businesses, the ability to create impressively professional and imaginative videos from simple text prompts, whenever and as often as they want, could significantly strengthen their marketing. For example, it could enable them to add value, reduce the cost and complications of video-making, improve the quality of their image and communications, and develop an extra competitive advantage, all without needing any special video training, skills, or hires.
Sora could, however, also be a disruptive threat to video-production businesses and to those whose value lies in their video-making skills. Also, as mentioned above, there is the very real threat of political or criminal damage (e.g. fraud) caused by Sora’s convincingly realistic videos being used as deepfakes, and the difficulty of trying to control such a powerful tool. Some tech commentators have suggested that AI companies may need to collaborate with social media networks and governments to help tackle the potential risks, e.g. the spread of misinformation and disinformation once Sora is released for public use.
That said, it will be interesting to see just how good the finished product’s outputs are. Competitors of OpenAI (and its partner Microsoft) are also working on bringing their own text-to-video products to market, including Google’s Lumiere model, so it will be exciting to see how these compare and what level of choice businesses will have.