ABOUT THE SPEAKER
Fei-Fei Li - Computer scientist
As Director of Stanford’s Artificial Intelligence Lab and Vision Lab, Fei-Fei Li is working to solve AI’s trickiest problems -- including image recognition, learning and language processing.

Why you should listen

Using algorithms built on machine learning methods such as neural network models, the Stanford Artificial Intelligence Lab led by Fei-Fei Li has created software capable of recognizing scenes in still photographs -- and accurately describing them using natural language.

Li’s work with neural networks and computer vision (with Stanford’s Vision Lab) marks a significant step forward for AI research, and could lead to applications ranging from more intuitive image searches to robots able to make autonomous decisions in unfamiliar situations.

Fei-Fei was honored as one of Foreign Policy's 2015 Global Thinkers.

TED2015

Fei-Fei Li: How we're teaching computers to understand pictures

2,702,344 views

When a very young child looks at a picture, she can identify simple elements: "cat," "book," "chair." Now, computers are getting smart enough to do that too. What's next? In a thrilling talk, computer vision expert Fei-Fei Li describes the state of the art -- including the database of 15 million photos her team built to "teach" a computer to understand pictures -- and the key insights yet to come.

00:14
Let me show you something.
00:18
(Video) Girl: Okay, that's a cat
sitting in a bed.
00:22
The boy is petting the elephant.
00:26
Those are people
that are going on an airplane.
00:30
That's a big airplane.
00:33
Fei-Fei Li: This is
a three-year-old child
00:35
describing what she sees
in a series of photos.
00:39
She might still have a lot
to learn about this world,
00:42
but she's already an expert
at one very important task:
00:46
to make sense of what she sees.
00:50
Our society is more
technologically advanced than ever.
00:54
We send people to the moon,
we make phones that talk to us
00:58
or customize radio stations
that can play only music we like.
01:03
Yet, our most advanced
machines and computers
01:07
still struggle at this task.
01:09
So I'm here today
to give you a progress report
01:13
on the latest advances
in our research in computer vision,
01:17
one of the most frontier
and potentially revolutionary
01:21
technologies in computer science.
01:24
Yes, we have prototyped cars
that can drive by themselves,
01:29
but without smart vision,
they cannot really tell the difference
01:33
between a crumpled paper bag
on the road, which can be run over,
01:37
and a rock that size,
which should be avoided.
01:41
We have made fabulous megapixel cameras,
01:44
but we have not delivered
sight to the blind.
01:48
Drones can fly over massive land,
01:51
but don't have enough vision technology
01:53
to help us to track
the changes of the rainforests.
01:57
Security cameras are everywhere,
02:00
but they do not alert us when a child
is drowning in a swimming pool.
02:06
Photos and videos are becoming
an integral part of global life.
02:11
They're being generated at a pace
that's far beyond what any human,
02:15
or teams of humans, could hope to view,
02:18
and you and I are contributing
to that at this TED.
02:22
Yet our most advanced software
is still struggling at understanding
02:27
and managing this enormous content.
02:31
So in other words,
collectively as a society,
02:36
we're very much blind,
02:38
because our smartest
machines are still blind.
02:43
"Why is this so hard?" you may ask.
02:46
Cameras can take pictures like this one
02:49
by converting light into
a two-dimensional array of numbers
02:53
known as pixels,
02:54
but these are just lifeless numbers.
02:57
They do not carry meaning in themselves.
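
As a rough illustration of why pixels are "just lifeless numbers", the sketch below stores a tiny grayscale image as a two-dimensional list. The 4x4 values and the helper names `mean_brightness` and `flipped_horizontally` are invented for this example and do not come from the talk. Mirroring the image changes every stored number, even though a person would say the pattern is the same.

```python
# A tiny 4x4 grayscale "image": 0 is black, 255 is white.
# The values are invented for illustration.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]


def mean_brightness(pixels):
    """Average of all pixel values: easy to compute, but silent about content."""
    flat = [value for row in pixels for value in row]
    return sum(flat) / len(flat)


def flipped_horizontally(pixels):
    """Mirror each row, as if the same scene were photographed in a mirror."""
    return [list(reversed(row)) for row in pixels]


height, width = len(image), len(image[0])
mirrored = flipped_horizontally(image)

# Count how many positions hold a different number after mirroring.
differing = sum(
    1
    for row_a, row_b in zip(image, mirrored)
    for a, b in zip(row_a, row_b)
    if a != b
)

print(f"{height}x{width} image, mean brightness {mean_brightness(image)}")
print(f"pixels that changed after mirroring: {differing} of {height * width}")
```

The summary statistic is identical for both versions and the raw numbers are entirely different, so neither one captures what the picture shows.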
03:00
Just like to hear is not
the same as to listen,
03:04
to take pictures is not
the same as to see,
03:08
and by seeing,
we really mean understanding.
03:13
In fact, it took Mother Nature
540 million years of hard work
03:19
to do this task,
03:21
and much of that effort
03:23
went into developing the visual
processing apparatus of our brains,
03:28
not the eyes themselves.
03:31
So vision begins with the eyes,
03:33
but it truly takes place in the brain.
03:38
So for 15 years now, starting
from my Ph.D. at Caltech
03:43
and then leading Stanford's Vision Lab,
03:46
I've been working with my mentors,
collaborators and students
03:50
to teach computers to see.
03:54
Our research field is called
computer vision and machine learning.
03:57
It's part of the general field
of artificial intelligence.
04:03
So ultimately, we want to teach
the machines to see just like we do:
04:08
naming objects, identifying people,
inferring 3D geometry of things,
04:13
understanding relations, emotions,
actions and intentions.
04:19
You and I weave together entire stories
of people, places and things
04:25
the moment we lay our gaze on them.
04:28
The first step towards this goal
is to teach a computer to see objects,
04:34
the building block of the visual world.
04:37
In its simplest terms,
imagine this teaching process
04:42
as showing the computers
some training images
04:45
of a particular object, let's say cats,
04:48
and designing a model that learns
from these training images.
04:53
How hard can this be?
04:55
After all, a cat is just
a collection of shapes and colors,
04:59
and this is what we did
in the early days of object modeling.
05:03
We'd tell the computer algorithm
in a mathematical language
05:07
that a cat has a round face,
a chubby body,
05:10
two pointy ears, and a long tail,
05:12
and that looked all fine.
05:14
But what about this cat?
05:16
(Laughter)
05:18
It's all curled up.
05:19
Now you have to add another shape
and viewpoint to the object model.
05:24
But what if cats are hidden?
05:27
What about these silly cats?
05:31
Now you get my point.
05:33
Even something as simple
as a household pet
05:36
can present an infinite number
of variations to the object model,
05:41
and that's just one object.
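
The hand-written object model described above can be sketched as a fixed rule over a few visible parts. This is only a toy reconstruction under stated assumptions: the `Shape` record, its fields, and the three example cats are hypothetical, and the talk does not say how the early models were actually coded.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Hypothetical summary of the parts a detector found in one image."""
    round_face: bool
    chubby_body: bool
    pointy_ears: int
    long_tail: bool


def looks_like_cat(shape: Shape) -> bool:
    """Hand-written rule: every listed part must be visible."""
    return (
        shape.round_face
        and shape.chubby_body
        and shape.pointy_ears == 2
        and shape.long_tail
    )


# Three views of a cat; only the first shows all the expected parts.
sitting_cat = Shape(round_face=True, chubby_body=True, pointy_ears=2, long_tail=True)
curled_up_cat = Shape(round_face=False, chubby_body=True, pointy_ears=1, long_tail=False)
hidden_cat = Shape(round_face=True, chubby_body=False, pointy_ears=2, long_tail=False)

for name, shape in [
    ("sitting", sitting_cat),
    ("curled up", curled_up_cat),
    ("hidden", hidden_cat),
]:
    print(name, looks_like_cat(shape))
```

Each new pose or occlusion needs another hand-written rule, which is the open-ended growth in variations the talk points to.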
05:44
So about eight years ago,
05:47
a very simple and profound observation
changed my thinking.
05:53
No one tells a child how to see,
05:56
especially in the early years.
05:58
They learn this through
real-world experiences and examples.
06:03
If you consider a child's eyes
06:06
as a pair of biological cameras,
06:08
they take one picture
about every 200 milliseconds,
06:12
the average time an eye movement is made.
06:15
So by age three, a child would have seen
hundreds of millions of pictures
06:21
of the real world.
06:23
That's a lot of training examples.
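
The "hundreds of millions" estimate can be checked with a back-of-the-envelope calculation. The 200-millisecond interval and the age of three come from the talk; the figure of 12 waking hours per day is an assumption added here, since the talk does not state one.

```python
MS_PER_PICTURE = 200        # from the talk: one "picture" per eye movement
YEARS = 3                   # from the talk: "by age three"
WAKING_HOURS_PER_DAY = 12   # assumption, not stated in the talk
DAYS_PER_YEAR = 365

pictures_per_second = 1000 / MS_PER_PICTURE
waking_seconds = YEARS * DAYS_PER_YEAR * WAKING_HOURS_PER_DAY * 3600
pictures_seen = int(pictures_per_second * waking_seconds)

print(f"{pictures_seen:,} pictures by age {YEARS}")
```

Under these assumptions the total is about 237 million, which is consistent with the order of magnitude quoted in the talk.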
06:26
So instead of focusing solely
on better and better algorithms,
06:32
my insight was to give the algorithms
the kind of training data
06:37
that a child was given through experiences
06:40
in both quantity and quality.
06:44
Once we knew this,
06:46
we knew we needed to collect a data set
06:49
that has far more images
than we have ever had before,
06:54
perhaps thousands of times more,
06:56
and together with Professor
Kai Li at Princeton University,
07:00
we launched the ImageNet project in 2007.
07:05
Luckily, we didn't have to mount
a camera on our head
07:09
and wait for many years.
07:11
We went to the Internet,
07:12
the biggest treasure trove of pictures
that humans have ever created.
07:17
We downloaded nearly a billion images
07:20
and used crowdsourcing technology
like the Amazon Mechanical Turk platform
07:25
to help us to label these images.
07:28
At its peak, ImageNet was one of
the biggest employers
07:33
of the Amazon Mechanical Turk workers:
07:36
together, almost 50,000 workers
07:40
from 167 countries around the world
07:44
helped us to clean, sort and label
07:48
nearly a billion candidate images.
07:52
That was how much effort it took
07:55
to capture even a fraction
of the imagery
07:59
a child's mind takes in
in the early developmental years.
08:04
In hindsight, this idea of using big data
08:08
to train computer algorithms
may seem obvious now,
08:12
but back in 2007, it was not so obvious.
08:16
We were fairly alone on this journey
for quite a while.
08:20
Some very friendly colleagues advised me
to do something more useful for my tenure,
08:25
and we were constantly struggling
for research funding.
08:29
Once, I even joked to my graduate students
08:32
that I would just reopen
my dry cleaner's shop to fund ImageNet.
08:36
After all, that's how I funded
my college years.
08:41
So we carried on.
08:43
In 2009, the ImageNet project delivered
08:46
a database of 15 million images
08:50
across 22,000 classes
of objects and things
08:55
organized by everyday English words.
08:58
In both quantity and quality,
09:01
this was an unprecedented scale.
09:04
As an example, in the case of cats,
09:08
we have more than 62,000 cats
09:11
of all kinds of looks and poses
09:15
and across all species
of domestic and wild cats.
09:20
We were thrilled
to have put together ImageNet,
09:23
and we wanted the whole research world
to benefit from it,
09:27
so in the TED fashion,
we opened up the entire data set
09:31
to the worldwide
research community for free.
09:36
(Applause)
09:41
Now that we have the data
to nourish our computer brain,
09:45
we're ready to come back
to the algorithms themselves.
09:49
As it turned out, the wealth
of information provided by ImageNet
09:54
was a perfect match to a particular class
of machine learning algorithms
09:59
called convolutional neural networks,
10:02
pioneered by Kunihiko Fukushima,
Geoff Hinton, and Yann LeCun
10:07
back in the 1970s and '80s.
10:10
Just like the brain consists
of billions of highly connected neurons,
10:16
a basic operating unit in a neural network
10:20
is a neuron-like node.
10:22
It takes input from other nodes
10:25
and sends output to others.
10:28
Moreover, these hundreds of thousands
or even millions of nodes
10:32
are organized in hierarchical layers,
10:36
also similar to the brain.
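
A neuron-like node and a stack of layers can be sketched in a few lines. This is a minimal illustration, not the network from the talk: the sigmoid squashing function, the weights, and the three-pixel input are all assumptions chosen to keep the example small.

```python
import math


def node(inputs, weights, bias):
    """One neuron-like node: weighted sum of its inputs, then a squashing function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid, chosen here for simplicity


def layer(inputs, weight_rows, biases):
    """A layer is a list of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Toy numbers, invented for illustration.
pixels = [0.0, 1.0, 1.0]

# First layer: two nodes that each look at all three pixels.
hidden = layer(pixels, [[0.5, -0.5, 1.0], [1.0, 1.0, -1.0]], [0.0, 0.0])

# Second layer: one node that reads the outputs of the first layer.
output = layer(hidden, [[1.0, -1.0]], [0.0])

print("hidden layer:", hidden)
print("output layer:", output)
```

Each layer only sees the outputs of the layer before it, which is the hierarchical arrangement described above.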
10:38
In a typical neural network we use
to train our object recognition model,
10:43
it has 24 million nodes,
10:46
140 million parameters,
10:49
and 15 billion connections.
10:52
That's an enormous model.
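
One likely reason a network can have far more connections than parameters is weight sharing: a convolutional layer reuses the same small filter at every image position. The sketch below counts both quantities for a single layer. The layer sizes are hypothetical and are not taken from the talk, and the talk does not say exactly how its own figures were counted.

```python
def conv_layer_counts(kernel, in_channels, out_channels, out_height, out_width):
    """Return (parameters, connections) for one convolutional layer."""
    weights_per_filter = kernel * kernel * in_channels
    # Stored numbers: one filter per output channel, plus one bias each.
    parameters = weights_per_filter * out_channels + out_channels
    # Every output position reuses the same filter weights on a different patch.
    connections = weights_per_filter * out_channels * out_height * out_width
    return parameters, connections


# Hypothetical layer sizes, not taken from the talk.
params, conns = conv_layer_counts(
    kernel=3, in_channels=64, out_channels=64, out_height=56, out_width=56
)

print(f"parameters: {params:,}")
print(f"connections: {conns:,}")
print(f"connections per parameter: {conns / params:.0f}")
```

In this toy layer each stored weight is used thousands of times, so the connection count exceeds the parameter count by orders of magnitude.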
10:55
Powered by the massive data from ImageNet
10:58
and the modern CPUs and GPUs
to train such a humongous model,
11:04
the convolutional neural network
11:06
blossomed in a way that no one expected.
11:10
It became the winning architecture
11:12
to generate exciting new results
in object recognition.
11:18
This is a computer telling us
11:20
this picture contains a cat
11:23
and where the cat is.
11:25
Of course there are more things than cats,
11:27
so here's a computer algorithm telling us
11:29
the picture contains
a boy and a teddy bear;
11:32
a dog, a person, and a small kite
in the background;
11:37
or a picture of very busy things
11:40
like a man, a skateboard,
railings, a lamppost, and so on.
11:45
Sometimes, when the computer
is not so confident about what it sees,
11:51
we have taught it to be smart enough
11:53
to give us a safe answer
instead of committing too much,
11:57
just like we would do,
12:00
but other times our computer algorithm
is remarkable at telling us
12:05
what exactly the objects are,
12:07
like the make, model, year of the cars.
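
One plausible way to give "a safe answer instead of committing too much" is to back off to a more general category when no specific label is confident enough. The talk does not describe the actual mechanism, so the sketch below is an assumption throughout: the `PARENT` hierarchy, the `hedged_label` function, the 0.8 threshold, and the scores are all hypothetical.

```python
# Hypothetical label hierarchy: each specific label maps to a more general one.
PARENT = {
    "golden retriever": "dog",
    "labrador": "dog",
    "tabby cat": "cat",
    "dog": "animal",
    "cat": "animal",
}


def hedged_label(scores, threshold=0.8):
    """Climb the hierarchy until the top label's pooled score reaches the threshold."""
    pooled = dict(scores)
    while True:
        best = max(pooled, key=pooled.get)
        at_top = all(label not in PARENT for label in pooled)
        if pooled[best] >= threshold or at_top:
            return best, pooled[best]
        # Move every label one level up and add the scores that land together.
        merged = {}
        for label, score in pooled.items():
            parent = PARENT.get(label, label)
            merged[parent] = merged.get(parent, 0.0) + score
        pooled = merged


# Confident: the specific label is kept.
print(hedged_label({"tabby cat": 0.9, "golden retriever": 0.1}))

# Unsure which dog breed: the answer backs off to "dog".
print(hedged_label({"golden retriever": 0.5, "labrador": 0.4, "tabby cat": 0.1}))

# Unsure even between cat and dog: the answer backs off to "animal".
print(hedged_label({"golden retriever": 0.4, "labrador": 0.1, "tabby cat": 0.5}))
```

The same scores that are too weak for a precise label can still support a correct, more general one.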
12:10
We applied this algorithm to millions
of Google Street View images
12:16
across hundreds of American cities,
12:19
and we have learned something
really interesting:
12:22
first, it confirmed our common wisdom
12:25
that car prices correlate very well
12:28
with household incomes.
12:31
But surprisingly, car prices
also correlate well
12:35
with crime rates in cities,
12:39
or voting patterns by zip codes.
12:44
So wait a minute. Is that it?
12:46
Has the computer already matched
or even surpassed human capabilities?
12:51
Not so fast.
12:53
So far, we have just taught
the computer to see objects.
12:58
This is like a small child
learning to utter a few nouns.
13:03
It's an incredible accomplishment,
13:05
but it's only the first step.
13:08
Soon, another developmental
milestone will be hit,
13:12
and children begin
to communicate in sentences.
13:15
So instead of saying
this is a cat in the picture,
13:19
you already heard the little girl
telling us this is a cat lying on a bed.
13:24
So to teach a computer
to see a picture and generate sentences,
13:30
the marriage between big data
and machine learning algorithms
13:34
has to take another step.
13:36
Now, the computer has to learn
from both pictures
13:40
as well as natural language sentences
13:43
generated by humans.
13:47
Just like the brain integrates
vision and language,
13:50
we developed a model
that connects parts of visual things
13:56
like visual snippets
13:58
with words and phrases in sentences.
14:02
About four months ago,
14:04
we finally tied all this together
14:07
and produced one of the first
computer vision models
14:11
that is capable of generating
a human-like sentence
14:15
when it sees a picture for the first time.
14:18
Now, I'm ready to show you
what the computer says
14:23
when it sees the picture
14:25
that the little girl saw
at the beginning of this talk.
14:31
(Video) Computer: A man is standing
next to an elephant.
14:36
A large airplane sitting on top
of an airport runway.
14:41
FFL: Of course, we're still working hard
to improve our algorithms,
14:45
and it still has a lot to learn.
14:47
(Applause)
14:51
And the computer still makes mistakes.
14:54
(Video) Computer: A cat lying
on a bed in a blanket.
14:58
FFL: So of course, when it sees
too many cats,
15:00
it thinks everything
might look like a cat.
15:05
(Video) Computer: A young boy
is holding a baseball bat.
15:08
(Laughter)
15:09
FFL: Or, if it hasn't seen a toothbrush,
it confuses it with a baseball bat.
15:15
(Video) Computer: A man riding a horse
down a street next to a building.
15:18
(Laughter)
15:20
FFL: We haven't taught Art 101
to the computers.
15:25
(Video) Computer: A zebra standing
in a field of grass.
15:28
FFL: And it hasn't learned to appreciate
the stunning beauty of nature
15:32
like you and I do.
15:34
So it has been a long journey.
15:37
To get from age zero to three was hard.
15:41
The real challenge is to go
from three to 13 and far beyond.
15:47
Let me remind you with this picture
of the boy and the cake again.
15:51
So far, we have taught
the computer to see objects
15:55
or even tell us a simple story
when seeing a picture.
15:59
(Video) Computer: A person sitting
at a table with a cake.
16:03
FFL: But there's so much more
to this picture
16:06
than just a person and a cake.
16:08
What the computer doesn't see
is that this is a special Italian cake
16:12
that's only served during Easter time.
16:16
The boy is wearing his favorite t-shirt
16:19
given to him as a gift by his father
after a trip to Sydney,
16:23
and you and I can all tell how happy he is
263
971333
3808
16:27
and what's exactly on his mind
at that moment.
16:31
This is my son Leo.
16:34
On my quest for visual intelligence,
16:36
I think of Leo constantly
16:39
and the future world he will live in.
16:42
When machines can see,
16:44
doctors and nurses will have
extra pairs of tireless eyes
16:48
to help them to diagnose
and take care of patients.
16:53
Cars will run smarter
and safer on the road.
16:57
Robots, not just humans,
17:00
will help us to brave the disaster zones
to save the trapped and wounded.
17:05
We will discover new species,
better materials,
17:09
and explore unseen frontiers
with the help of the machines.
17:15
Little by little, we're giving sight
to the machines.
17:19
First, we teach them to see.
17:22
Then, they help us to see better.
17:24
For the first time, human eyes
won't be the only ones
17:29
pondering and exploring our world.
17:31
We will not only use the machines
for their intelligence,
17:35
we will also collaborate with them
in ways that we cannot even imagine.
17:41
This is my quest:
17:43
to give computers visual intelligence
17:46
and to create a better future
for Leo and for the world.
17:51
Thank you.
17:53
(Applause)
