ABOUT THE SPEAKER

Fei-Fei Li - Computer scientist
As Director of Stanford’s Artificial Intelligence Lab and Vision Lab, Fei-Fei Li is working to solve AI’s trickiest problems -- including image recognition, learning and language processing.

Why you should listen

Using algorithms built on machine learning methods such as neural network models, the Stanford Artificial Intelligence Lab led by Fei-Fei Li has created software capable of recognizing scenes in still photographs -- and accurately describing them using natural language.

Li’s work with neural networks and computer vision (with Stanford’s Vision Lab) marks a significant step forward for AI research, and could lead to applications ranging from more intuitive image searches to robots able to make autonomous decisions in unfamiliar situations.

Fei-Fei was honored as one of Foreign Policy's 2015 Global Thinkers.

TED2015

Fei-Fei Li: How we're teaching computers to understand pictures

Filmed: 2015-03-17

Readability: 4.5

2,702,344 views

When a very young child looks at a picture, she can identify simple elements: "cat," "book," "chair." Now, computers are getting smart enough to do that too. What's next? In a thrilling talk, computer vision expert Fei-Fei Li describes the state of the art -- including the database of 15 million photos her team built to "teach" a computer to understand pictures -- and the key insights yet to come.

00:14

Let me show you something.

00:18

(Video) Girl: Okay, that's a cat
sitting in a bed.

00:22

The boy is petting the elephant.

00:26

Those are people
that are going on an airplane.

00:30

That's a big airplane.

00:33

Fei-Fei Li: This is
a three-year-old child

00:35

describing what she sees
in a series of photos.

00:39

She might still have a lot
to learn about this world,

00:42

but she's already an expert
at one very important task:

00:46

to make sense of what she sees.

00:50

Our society is more
technologically advanced than ever.

00:54

We send people to the moon,
we make phones that talk to us

00:58

or customize radio stations
that can play only music we like.

01:03

Yet, our most advanced
machines and computers

01:07

still struggle at this task.

01:09

So I'm here today
to give you a progress report

01:13

on the latest advances
in our research in computer vision,

01:17

one of the most frontier
and potentially revolutionary

01:21

technologies in computer science.

01:24

Yes, we have prototyped cars
that can drive by themselves,

01:29

but without smart vision,
they cannot really tell the difference

01:33

between a crumpled paper bag
on the road, which can be run over,

01:37

and a rock that size,
which should be avoided.

01:41

We have made fabulous megapixel cameras,

01:44

but we have not delivered
sight to the blind.

01:48

Drones can fly over massive land,

01:51

but don't have enough vision technology

01:53

to help us to track
the changes of the rainforests.

01:57

Security cameras are everywhere,

02:00

but they do not alert us when a child
is drowning in a swimming pool.

02:06

Photos and videos are becoming
an integral part of global life.

02:11

They're being generated at a pace
that's far beyond what any human,

02:15

or teams of humans, could hope to view,

02:18

and you and I are contributing
to that at this TED.

02:22

Yet our most advanced software
is still struggling at understanding

02:27

and managing this enormous content.

02:31

So in other words,
collectively as a society,

02:36

we're very much blind,

02:38

because our smartest
machines are still blind.

02:43

"Why is this so hard?" you may ask.

02:46

Cameras can take pictures like this one

02:49

by converting light into
a two-dimensional array of numbers

02:53

known as pixels,

02:54

but these are just lifeless numbers.

02:57

They do not carry meaning in themselves.
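
To make the "lifeless numbers" point concrete, here is a minimal sketch of what a camera hands to a program. The 4x4 grid and its values are made up for illustration and do not come from the talk; a real photo would have millions of such numbers.

```python
# A hypothetical 4x4 grayscale "photo": each number is a brightness
# from 0 (black) to 255 (white). The values are invented for illustration.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]


def mean_brightness(pixels):
    """Average of all pixel values: a statistic, not an understanding of the scene."""
    flat = [value for row in pixels for value in row]
    return sum(flat) / len(flat)


height, width = len(image), len(image[0])
average = mean_brightness(image)
print(f"{height}x{width} image, mean brightness {average}")
```

The program can compute statistics over the grid, but nothing in the array says whether the picture shows a cat, a checkerboard, or anything else.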

03:00

Just like to hear is not
the same as to listen,

03:04

to take pictures is not
the same as to see,

03:08

and by seeing,
we really mean understanding.

03:13

In fact, it took Mother Nature
540 million years of hard work

03:19

to do this task,

03:21

and much of that effort

03:23

went into developing the visual
processing apparatus of our brains,

03:28

not the eyes themselves.

03:31

So vision begins with the eyes,

03:33

but it truly takes place in the brain.

03:38

So for 15 years now, starting
from my Ph.D. at Caltech

03:43

and then leading Stanford's Vision Lab,

03:46

I've been working with my mentors,
collaborators and students

03:50

to teach computers to see.

03:54

Our research field is called
computer vision and machine learning.

03:57

It's part of the general field
of artificial intelligence.

04:03

So ultimately, we want to teach
the machines to see just like we do:

04:08

naming objects, identifying people,
inferring 3D geometry of things,

04:13

understanding relations, emotions,
actions and intentions.

04:19

You and I weave together entire stories
of people, places and things

04:25

the moment we lay our gaze on them.

04:28

The first step towards this goal
is to teach a computer to see objects,

04:34

the building block of the visual world.

04:37

In its simplest terms,
imagine this teaching process

04:42

as showing the computers
some training images

04:45

of a particular object, let's say cats,

04:48

and designing a model that learns
from these training images.

04:53

How hard can this be?

04:55

After all, a cat is just
a collection of shapes and colors,

04:59

and this is what we did
in the early days of object modeling.

05:03

We'd tell the computer algorithm
in a mathematical language

05:07

that a cat has a round face,
a chubby body,

05:10

two pointy ears, and a long tail,

05:12

and that looked all fine.

05:14

But what about this cat?

05:16

(Laughter)

05:18

It's all curled up.

05:19

Now you have to add another shape
and viewpoint to the object model.

05:24

But what if cats are hidden?

05:27

What about these silly cats?

05:31

Now you get my point.

05:33

Even something as simple
as a household pet

05:36

can present an infinite number
of variations to the object model,

05:41

and that's just one object.

05:44

So about eight years ago,

05:47

a very simple and profound observation
changed my thinking.

05:53

No one tells a child how to see,

05:56

especially in the early years.

05:58

They learn this through
real-world experiences and examples.

06:03

If you consider a child's eyes

06:06

as a pair of biological cameras,

06:08

they take one picture
about every 200 milliseconds,

06:12

the average time an eye movement is made.

06:15

So by age three, a child would have seen
hundreds of millions of pictures

06:21

of the real world.

06:23

That's a lot of training examples.
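
The "hundreds of millions" figure can be checked with a back-of-the-envelope calculation. The 200-millisecond interval and the age of three come from the talk; the 12 waking hours per day is an assumption added here only to make the arithmetic concrete.

```python
# Rough count of "pictures" a child has taken in by age three.
FIXATION_MS = 200          # from the talk: one picture about every 200 milliseconds
YEARS = 3                  # from the talk: by age three
WAKING_HOURS_PER_DAY = 12  # assumption, not stated in the talk

pictures_per_second = 1000 / FIXATION_MS
waking_seconds = YEARS * 365 * WAKING_HOURS_PER_DAY * 3600
total_pictures = int(pictures_per_second * waking_seconds)
print(f"{total_pictures:,}")
```

Under these assumptions the total comes to roughly 237 million, which is consistent with the order of magnitude quoted above.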

06:26

So instead of focusing solely
on better and better algorithms,

06:32

my insight was to give the algorithms
the kind of training data

06:37

that a child was given through experiences

06:40

in both quantity and quality.

06:44

Once we knew this,

06:46

we knew we needed to collect a data set

06:49

that has far more images
than we have ever had before,

06:54

perhaps thousands of times more,

06:56

and together with Professor
Kai Li at Princeton University,

07:00

we launched the ImageNet project in 2007.

07:05

Luckily, we didn't have to mount
a camera on our head

07:09

and wait for many years.

07:11

We went to the Internet,

07:12

the biggest treasure trove of pictures
that humans have ever created.

07:17

We downloaded nearly a billion images

07:20

and used crowdsourcing technology
like the Amazon Mechanical Turk platform

07:25

to help us to label these images.

07:28

At its peak, ImageNet was one of
the biggest employers

07:33

of the Amazon Mechanical Turk workers:

07:36

together, almost 50,000 workers

07:40

from 167 countries around the world

07:44

helped us to clean, sort and label

07:48

nearly a billion candidate images.

07:52

That was how much effort it took

07:55

to capture even a fraction
of the imagery

07:59

a child's mind takes in
in the early developmental years.

08:04

In hindsight, this idea of using big data

08:08

to train computer algorithms
may seem obvious now,

08:12

but back in 2007, it was not so obvious.

08:16

We were fairly alone on this journey
for quite a while.

08:20

Some very friendly colleagues advised me
to do something more useful for my tenure,

08:25

and we were constantly struggling
for research funding.

08:29

Once, I even joked to my graduate students

08:32

that I would just reopen
my dry cleaner's shop to fund ImageNet.

08:36

After all, that's how I funded
my college years.

08:41

So we carried on.

08:43

In 2009, the ImageNet project delivered

08:46

a database of 15 million images

08:50

across 22,000 classes
of objects and things

08:55

organized by everyday English words.

08:58

In both quantity and quality,

09:01

this was an unprecedented scale.

09:04

As an example, in the case of cats,

09:08

we have more than 62,000 cats

09:11

of all kinds of looks and poses

09:15

and across all species
of domestic and wild cats.

09:20

We were thrilled
to have put together ImageNet,

09:23

and we wanted the whole research world
to benefit from it,

09:27

so in the TED fashion,
we opened up the entire data set

09:31

to the worldwide
research community for free.

09:36

(Applause)

09:41

Now that we have the data
to nourish our computer brain,

09:45

we're ready to come back
to the algorithms themselves.

09:49

As it turned out, the wealth
of information provided by ImageNet

09:54

was a perfect match to a particular class
of machine learning algorithms

09:59

called convolutional neural networks,

10:02

pioneered by Kunihiko Fukushima,
Geoff Hinton, and Yann LeCun

10:07

back in the 1970s and '80s.

10:10

Just like the brain consists
of billions of highly connected neurons,

10:16

a basic operating unit in a neural network

10:20

is a neuron-like node.

10:22

It takes input from other nodes

10:25

and sends output to others.

10:28

Moreover, these hundreds of thousands
or even millions of nodes

10:32

are organized in hierarchical layers,

10:36

also similar to the brain.
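
The description of nodes and layers can be sketched in a few lines. This is a toy illustration rather than the network used in the research: the sigmoid activation, the layer sizes, and every weight below are assumptions chosen for readability.

```python
import math


def node(inputs, weights, bias):
    """One neuron-like node: a weighted sum of its inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is a group of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical weights for a network with 3 inputs, 2 hidden nodes, 1 output node.
hidden = layer([0.5, -1.0, 2.0], [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])
print(len(hidden), len(output))
```

Each node takes input from the layer before it and sends its output to the layer after it, which is the hierarchy the talk describes. Training, which adjusts the weights, is not shown here.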

10:38

In a typical neural network we use
to train our object recognition model,

10:43

it has 24 million nodes,

10:46

140 million parameters,

10:49

and 15 billion connections.

10:52

That's an enormous model.
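
One reason a convolutional network can have far more connections than parameters is weight sharing: the same small filter is reused at every image position. The sketch below counts both for a single layer. The layer dimensions are hypothetical and are not taken from the model described in the talk.

```python
def conv_layer_counts(in_channels, out_channels, kernel, out_height, out_width):
    """Return (parameters, connections) for one convolutional layer."""
    weights_per_filter = kernel * kernel * in_channels
    # Each filter has its weights plus one bias, shared across all positions.
    parameters = (weights_per_filter + 1) * out_channels
    output_nodes = out_height * out_width * out_channels
    # Each output node reads one kernel-sized patch across all input channels.
    connections = output_nodes * weights_per_filter
    return parameters, connections


# Hypothetical layer: 3x3 filters, RGB input, 64 filters, 224x224 output.
params, conns = conv_layer_counts(3, 64, 3, 224, 224)
print(params, conns)
```

For this made-up layer, fewer than two thousand parameters drive tens of millions of connections, which is the same kind of gap as between the 140 million parameters and 15 billion connections quoted above.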

10:55

Powered by the massive data from ImageNet

10:58

and the modern CPUs and GPUs
to train such a humongous model,

11:04

the convolutional neural network

11:06

blossomed in a way that no one expected.

11:10

It became the winning architecture

11:12

to generate exciting new results
in object recognition.

11:18

This is a computer telling us

11:20

this picture contains a cat

11:23

and where the cat is.

11:25

Of course there are more things than cats,

11:27

so here's a computer algorithm telling us

11:29

the picture contains
a boy and a teddy bear;

11:32

a dog, a person, and a small kite
in the background;

11:37

or a picture of very busy things

11:40

like a man, a skateboard,
railings, a lamppost, and so on.

11:45

Sometimes, when the computer
is not so confident about what it sees,

11:51

we have taught it to be smart enough

11:53

to give us a safe answer
instead of committing too much,

11:57

just like we would do,

12:00

but other times our computer algorithm
is remarkable at telling us

12:05

what exactly the objects are,

12:07

like the make, model, year of the cars.
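
The idea of giving a "safe answer" when confidence is low can be illustrated with a simple back-off rule: if no specific label is confident enough, pool the probabilities under a more general label and answer with that instead. This is only a sketch of one possible approach, not the method used in the research; the label hierarchy, the probabilities, and the 0.8 threshold are all made up.

```python
# Hypothetical hierarchy: each specific label maps to a more general one.
PARENT = {"tabby cat": "cat", "siamese cat": "cat", "husky": "dog"}


def hedged_answer(probabilities, threshold=0.8):
    """Return the specific label if confident, else a more general fallback."""
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= threshold:
        return best
    # Pool probability mass under each general label.
    totals = {}
    for label, p in probabilities.items():
        parent = PARENT.get(label, "object")
        totals[parent] = totals.get(parent, 0.0) + p
    best_parent = max(totals, key=totals.get)
    return best_parent if totals[best_parent] >= threshold else "object"


unsure = hedged_answer({"tabby cat": 0.5, "siamese cat": 0.4, "husky": 0.1})
sure = hedged_answer({"tabby cat": 0.9, "siamese cat": 0.05, "husky": 0.05})
print(unsure, "|", sure)
```

In the first call the classifier cannot decide between two cat breeds, so it commits only to "cat"; in the second it is confident enough to name the specific breed.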

12:10

We applied this algorithm to millions
of Google Street View images

12:16

across hundreds of American cities,

12:19

and we have learned something
really interesting:

12:22

first, it confirmed our common wisdom

12:25

that car prices correlate very well

12:28

with household incomes.

12:31

But surprisingly, car prices
also correlate well

12:35

with crime rates in cities,

12:39

or voting patterns by zip codes.

12:44

So wait a minute. Is that it?

12:46

Has the computer already matched
or even surpassed human capabilities?

12:51

Not so fast.

12:53

So far, we have just taught
the computer to see objects.

12:58

This is like a small child
learning to utter a few nouns.

13:03

It's an incredible accomplishment,

13:05

but it's only the first step.

13:08

Soon, another developmental
milestone will be hit,

13:12

and children begin
to communicate in sentences.

13:15

So instead of saying
this is a cat in the picture,

13:19

you already heard the little girl
telling us this is a cat lying on a bed.

13:24

So to teach a computer
to see a picture and generate sentences,

13:30

the marriage between big data
and machine learning algorithm

13:34

has to take another step.

13:36

Now, the computer has to learn
from both pictures

13:40

as well as natural language sentences

13:43

generated by humans.

13:47

Just like the brain integrates
vision and language,

13:50

we developed a model
that connects parts of visual things

13:56

like visual snippets

13:58

with words and phrases in sentences.

14:02

About four months ago,

14:04

we finally tied all this together

14:07

and produced one of the first
computer vision models

14:11

that is capable of generating
a human-like sentence

14:15

when it sees a picture for the first time.

14:18

Now, I'm ready to show you
what the computer says

14:23

when it sees the picture

14:25

that the little girl saw
at the beginning of this talk.

14:31

(Video) Computer: A man is standing
next to an elephant.

14:36

A large airplane sitting on top
of an airport runway.

14:41

FFL: Of course, we're still working hard
to improve our algorithms,

14:45

and it still has a lot to learn.

14:47

(Applause)

14:51

And the computer still makes mistakes.

14:54

(Video) Computer: A cat lying
on a bed in a blanket.

14:58

FFL: So of course, when it sees
too many cats,

15:00

it thinks everything
might look like a cat.

15:05

(Video) Computer: A young boy
is holding a baseball bat.

15:08

(Laughter)

15:09

FFL: Or, if it hasn't seen a toothbrush,
it confuses it with a baseball bat.

15:15

(Video) Computer: A man riding a horse
down a street next to a building.

15:18

(Laughter)

15:20

FFL: We haven't taught Art 101
to the computers.

15:25

(Video) Computer: A zebra standing
in a field of grass.

15:28

FFL: And it hasn't learned to appreciate
the stunning beauty of nature

15:32

like you and I do.

15:34

So it has been a long journey.

15:37

To get from age zero to three was hard.

15:41

The real challenge is to go
from three to 13 and far beyond.

15:47

Let me remind you with this picture
of the boy and the cake again.

15:51

So far, we have taught
the computer to see objects

15:55

or even tell us a simple story
when seeing a picture.

15:59

(Video) Computer: A person sitting
at a table with a cake.

16:03

FFL: But there's so much more
to this picture

16:06

than just a person and a cake.

16:08

What the computer doesn't see
is that this is a special Italian cake

16:12

that's only served during Easter time.

16:16

The boy is wearing his favorite t-shirt

16:19

given to him as a gift by his father
after a trip to Sydney,

16:23

and you and I can all tell how happy he is

16:27

and what's exactly on his mind
at that moment.

16:31

This is my son Leo.

16:34

On my quest for visual intelligence,

16:36

I think of Leo constantly

16:39

and the future world he will live in.

16:42

When machines can see,

16:44

doctors and nurses will have
extra pairs of tireless eyes

16:48

to help them to diagnose
and take care of patients.

16:53

Cars will run smarter
and safer on the road.

16:57

Robots, not just humans,

17:00

will help us to brave the disaster zones
to save the trapped and wounded.

17:05

We will discover new species,
better materials,

17:09

and explore unseen frontiers
with the help of the machines.

17:15

Little by little, we're giving sight
to the machines.

17:19

First, we teach them to see.

17:22

Then, they help us to see better.

17:24

For the first time, human eyes
won't be the only ones

17:29

pondering and exploring our world.

17:31

We will not only use the machines
for their intelligence,

17:35

we will also collaborate with them
in ways that we cannot even imagine.

17:41

This is my quest:

17:43

to give computers visual intelligence

17:46

and to create a better future
for Leo and for the world.

17:51

Thank you.

17:53

(Applause)

THE ORIGINAL VIDEO ON TED.COM

Fei-Fei Li: How we're teaching computers to understand pictures | TED Talk | TED.com