
OpenAI Announces Sora, A Text-To-Video Tool

So about a year ago, Stable Diffusion showed its efforts in generating videos using AI. Suffice it to say, most of us thought: well, if this is it, they are still a long way off from making convincing videos that can actually be used anywhere outside of a horror flick.

For those who don’t know what I am talking about, here is the infamous Will Smith eating spaghetti video from an early release of Stable Diffusion’s text-to-video platform.

In this video you can clearly see the horrors visited upon those who watched it. I mean, this is the stuff nightmares are made of. Anyway, jokes aside, for most creators this was not a concern. It seemed like it would be another couple of years before such a model would really threaten the videographers and photographers we so admire for the perspectives and creativity they bring to every shoot.

Conventions like frame rates, apertures, framing, and lighting were all violated when AI first tried its hand at video.

A couple of days ago, Sam Altman, CEO of OpenAI, announced Sora, their attempt at AI-generated video. What they showed the world shocked the entire creative community. This is perhaps the straw that breaks the camel’s back.

OpenAI describes Sora as an AI model being trained with the end goal of simulating the physical world in motion more accurately, in order to help people solve problems that require real-world interaction. Sora is a diffusion model: it generates videos up to a minute long by starting from what looks like static noise and gradually transforming it, removing the noise over many steps.

Sora can generate entire videos all at once or extend generated videos to make them longer. By giving the model foresight of many frames at a time, OpenAI solved the challenging problem of keeping a subject consistent even when it temporarily goes out of view.
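To make that idea concrete, here is a minimal sketch of what a diffusion model’s denoising loop looks like in general. This is illustrative Python, not OpenAI’s actual implementation; the `denoise_model` callable, the tensor shapes, and the simple noise-removal rule are all assumptions made for the sake of the example.

```python
import torch

def sample_video(denoise_model, num_frames=60, height=64, width=64, steps=50):
    """Illustrative diffusion sampling loop: start from pure noise and
    iteratively remove predicted noise until a coherent video emerges.

    `denoise_model` is a hypothetical network that, given a noisy video
    and the current timestep, predicts the noise present in the input.
    """
    # Begin with pure Gaussian noise, shaped (frames, channels, H, W).
    video = torch.randn(num_frames, 3, height, width)

    # Walk the schedule backwards, from most noisy to least noisy.
    for t in reversed(range(steps)):
        predicted_noise = denoise_model(video, t)
        # Subtract a fraction of the predicted noise at each step.
        # Real samplers (DDPM, DDIM, etc.) use a learned noise schedule;
        # this simple fraction is a stand-in for illustration.
        video = video - predicted_noise / (t + 1)

    return video

# Example usage with a stand-in "model" that predicts zero noise:
dummy = lambda x, t: torch.zeros_like(x)
clip = sample_video(dummy, num_frames=8, height=32, width=32, steps=10)
print(clip.shape)  # torch.Size([8, 3, 32, 32])
```

The real model also conditions each denoising step on the text prompt and on many frames at once, which is what keeps subjects consistent across the clip, as described above.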

Now before we go further, I would like you to see how far AI companies have come with video in barely a year. I understand you might expect these to be close to the Will Smith video; however, the videos you are about to see are leaps ahead in capability, composition, lighting, framing, and more. It’s just incredible what they have done in such a short time. There are still a few dead giveaways that these videos are AI-generated, but you would not notice them unless you looked carefully.

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
Prompt: Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere.
Prompt: A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.

Use Cases For Text-To-Video AI

I know most photographers and videographers are probably seeing a good chunk of their careers go up in smoke right now. I think this technology will reduce the need for marketers and other content creators to purchase licenses for such content. Whether they need it for an ad or just to string together a video post, they will no longer have to pay hefty licensing fees that often come with heavy usage restrictions: you can only use this on TV, or only on digital, or only for this campaign and not that other one. These conditions have always been a point of friction between creators and those who use their work.

It will also reduce the work and cost required to edit or create certain scenes in video production. For example, creating something like the Santorini video would normally require travelling to Santorini to shoot it, which means the cost of plane tickets or, at the very least, hiring a local videographer to do the job.

This presents an opportunity for brands and companies to build video banks they can use to engage with customers quickly and on the fly. Just to put this into context, a 45-second animated video used to sell an item like a phone, just showing specifications, pricing, a call to action and some music, would cost you upwards of USD 2,500 in most African countries. It also takes at least three days to get the video done; by then the market opportunity could already have passed.

This technology is particularly interesting to me when coupled with VR. Users could simply speak a prompt and create any kind of environment, watch any kind of video, and let their imagination run wild, since every prompt they put in would be generated in real time. For instance, one could tell the VR headset they want to see a large dragon, or a peaceful waterfall, and actually get immersed in it.

I see this technology being used in marketing, with campaign management tools able to generate 60-second ads that do not require a shooting crew. It could also be used to make TV shows and movies in the future: imagine all those risky stunts in your favourite movies being generated on a computer instead.

Implications For The Creative Economy

The global creative economy is worth over USD 1 trillion and employs more than 50 million people. If this economy has to compete with machine-generated content that can be produced in mere seconds, versus mobilizing people and resources to create, then a lot of livelihoods will be affected.

I see companies that sell stock images and footage struggling to make ends meet.

I see this affecting the lives of those who create, and their perspective on beauty.

So should we shut it down? Or is this the price of progress?

Let me know what you think in the comments below.

