ABOUT THE SPEAKER

Joseph Redmon - Computer scientist
Joseph Redmon works on the YOLO algorithm, which combines the simple face detection of your phone camera with a cloud-based AI -- in real time.

Why you should listen

Computer scientist Joseph Redmon is working on the YOLO (You Only Look Once) algorithm, which has a simple goal: to deliver image recognition and object detection at a speed that would have seemed science-fictional only a few years ago. The algorithm looks like the simple face detection of a camera app but with the level of complexity of systems like Google's Deep Mind Cloud Vision, using convolutional deep neural networks to crunch object detection in real time. It's the kind of technology that will be embedded on all smartphones in the next few years.

Redmon is also internet-famous for his resume.

More profile about the speaker
Joseph Redmon | Speaker | TED.com

TED2017

Joseph Redmon: How computers learn to recognize objects instantly


Filmed: 2017-04-24

Readability: 4.5

2,471,805 views

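Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision systems do it with greater than 99 percent accuracy. How? Joseph Redmon works on the YOLO (You Only Look Once) system, an open-source method of object detection that can identify objects in images and video -- from zebras to stop signs -- with lightning-quick speed. In a remarkable live demo, Redmon shows off this important step forward for applications like self-driving cars, robotics and even cancer detection.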


00:12

Ten years ago,

00:14

computer vision researchers thought that getting a computer

00:16

to tell the difference between a cat and a dog

00:19

would be almost impossible,

00:21

even with the significant advance in the state of artificial intelligence.

00:25

Now we can do it at a level greater than 99 percent accuracy.

00:29

This is called image classification --

00:31

give it an image, put a label to that image --

00:34

and computers know thousands of other categories as well.

00:38

I'm a graduate student at the University of Washington,

00:41

and I work on a project called Darknet,

00:43

which is a neural network framework

00:45

for training and testing computer vision models.

00:48

So let's just see what Darknet thinks

00:51

of this image that we have.

00:54

When we run our classifier

00:56

on this image,

00:58

we see we don't just get a prediction of dog or cat,

01:00

we actually get specific breed predictions.

01:02

That's the level of granularity we have now.

01:05

And it's correct.

01:06

My dog is in fact a malamute.

01:09

So we've made amazing strides in image classification,

01:13

but what happens when we run our classifier

01:15

on an image that looks like this?

01:19

Well ...

01:24

We see that the classifier comes back with a pretty similar prediction.

01:28

And it's correct, there is a malamute in the image,

01:31

but just given this label, we don't actually know that much

01:35

about what's going on in the image.

01:37

We need something more powerful.

01:39

I work on a problem called object detection,

01:41

where we look at an image and try to find all of the objects,

01:44

put bounding boxes around them

01:46

and say what those objects are.

01:48

So here's what happens when we run a detector on this image.

01:53

Now, with this kind of result,

01:55

we can do a lot more with our computer vision algorithms.

01:58

We see that it knows that there's a cat and a dog.

02:01

It knows their relative locations,

02:03

their size.

02:04

It may even know some extra information.

02:06

There's a book sitting in the background.

02:09

And if you want to build a system on top of computer vision,

02:12

say a self-driving vehicle or a robotic system,

02:16

this is the kind of information that you want.

02:18

You want something so that you can interact with the physical world.
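
To make the kind of information described above concrete, here is a minimal Python sketch of how a detector's output might be held as data: a label, a confidence score and a bounding box per object. The `Detection` class, its field names and every coordinate and score below are hypothetical illustrations, not Darknet's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure for illustration)."""
    label: str         # what the object is
    confidence: float  # score between 0 and 1
    x: float           # left edge of the bounding box, in pixels
    y: float           # top edge of the bounding box, in pixels
    w: float           # box width, in pixels
    h: float           # box height, in pixels

    @property
    def area(self) -> float:
        """Size of the box, a rough proxy for object size."""
        return self.w * self.h

    @property
    def center_x(self) -> float:
        """Horizontal center, used to compare relative locations."""
        return self.x + self.w / 2


def left_to_right(detections):
    """Return labels ordered by horizontal position in the image."""
    return [d.label for d in sorted(detections, key=lambda d: d.center_x)]


# Made-up values standing in for the cat, dog and book in the talk's image.
detections = [
    Detection("dog", 0.92, x=300, y=120, w=200, h=260),
    Detection("cat", 0.88, x=60, y=180, w=150, h=170),
    Detection("book", 0.41, x=520, y=40, w=80, h=50),
]

print(left_to_right(detections))
print(max(detections, key=lambda d: d.area).label)
```

With boxes rather than a single label, relative position and size fall out of simple comparisons, which is the sort of information a robot or vehicle would act on.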

02:22

Now, when I started working on object detection,

02:25

it took 20 seconds to process a single image.

02:28

And to get a feel for why speed is so important in this domain,

02:33

here's an example of an object detector

02:35

that takes two seconds to process an image.

02:38

So this is 10 times faster

02:40

than the 20-seconds-per-image detector,

02:44

and you can see that by the time it makes predictions,

02:47

the entire state of the world has changed,

02:49

and this wouldn't be very useful

02:52

for an application.

02:53

If we speed this up by another factor of 10,

02:56

this is a detector running at five frames per second.

02:59

This is a lot better,

03:00

but for example,

03:02

if there's any significant movement,

03:05

I wouldn't want a system like this driving my car.

03:09

This is our detection system running in real time on my laptop.

03:13

So it smoothly tracks me as I move around the frame,

03:16

and it's robust to a wide variety of changes in size,

03:21

pose,

03:23

forward, backward.

03:25

This is great.

03:26

This is what we really need

03:28

if we're going to build systems on top of computer vision.

03:31

(Applause)

03:36

So in just a few years,

03:38

we've gone from 20 seconds per image

03:41

to 20 milliseconds per image, a thousand times faster.
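
The speed figures quoted in the talk (20 seconds, 2 seconds, five frames per second and 20 milliseconds per image) can be checked with a little arithmetic. The Python sketch below converts per-image latency to frames per second and, as a rough illustration of why latency matters for driving, estimates how far a vehicle travels while one image is being processed. The 50 km/h vehicle speed is an assumed value for illustration; it does not come from the talk.

```python
def frames_per_second(ms_per_image):
    """Convert per-image latency in milliseconds to frames per second."""
    return 1000 / ms_per_image


def speedup(old_ms, new_ms):
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


def metres_travelled(speed_kmh, ms_per_image):
    """Distance covered while a single image is being processed."""
    metres_per_second = speed_kmh / 3.6
    return metres_per_second * (ms_per_image / 1000)


# Latencies mentioned in the talk, in milliseconds per image.
latencies_ms = [20_000, 2_000, 200, 20]

ASSUMED_SPEED_KMH = 50  # hypothetical city driving speed, not from the talk

for ms in latencies_ms:
    fps = frames_per_second(ms)
    dist = metres_travelled(ASSUMED_SPEED_KMH, ms)
    print(f"{ms:>6} ms/image -> {fps:g} fps, {dist:.2f} m travelled per image")

print(f"overall speedup: {speedup(20_000, 20):g}x")
```

At the assumed speed, a five-frames-per-second detector sees the road only about every 2.8 metres, while a 20-millisecond detector sees it roughly every 0.28 metres.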

03:44

How did we get there?

03:46

Well, in the past, object detection systems

03:49

would take an image like this

03:51

and split it into a bunch of regions

03:53

and then run a classifier on each of these regions,

03:56

and high scores for that classifier

03:59

would be considered detections in the image.

04:02

But this involved running a classifier thousands of times over an image,

04:06

thousands of neural network evaluations to produce detection.

04:11

Instead, we trained a single network to do all of detection for us.

04:15

It produces all of the bounding boxes and class probabilities simultaneously.

04:20

With our system, instead of looking at an image thousands of times

04:24

to produce detection,

04:25

you only look once,

04:26

and that's why we call it the YOLO method of object detection.
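
The cost difference described above can be sketched with a toy Python example that only counts model invocations. It is not the YOLO implementation or any real detector: `CountingModel` is a stand-in that does no inference, and the image size, window size and stride are hypothetical values chosen for illustration.

```python
def sliding_windows(width, height, window, stride):
    """Yield (x, y, w, h) square regions that tile the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, window, window)


class CountingModel:
    """Stand-in for a neural network; it only counts how often it is run."""

    def __init__(self):
        self.calls = 0

    def __call__(self, region):
        self.calls += 1
        return 0.0  # placeholder score, no real inference happens here


def detect_by_regions(model, width, height, window=64, stride=8, threshold=0.5):
    """Older style: classify every region, keep the high-scoring ones."""
    detections = []
    for region in sliding_windows(width, height, window, stride):
        score = model(region)
        if score >= threshold:
            detections.append((region, score))
    return detections


def detect_single_pass(model, width, height):
    """Single-network style: one evaluation over the whole image."""
    model((0, 0, width, height))
    return []  # a real network would return all boxes and classes at once


region_model = CountingModel()
detect_by_regions(region_model, width=448, height=448)

single_model = CountingModel()
detect_single_pass(single_model, width=448, height=448)

print("region-based evaluations:", region_model.calls)
print("single-pass evaluations:", single_model.calls)
```

Under these assumed settings the region-based loop runs the model 2,401 times for one image, while the single-pass version runs it once.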

04:31

So with this speed, we're not just limited to images;

04:35

we can process video in real time.

04:37

And now, instead of just seeing that cat and dog,

04:40

we can see them move around and interact with each other.

04:46

This is a detector that we trained

04:48

on 80 different classes

04:53

in Microsoft's COCO dataset.

04:56

It has all sorts of things like spoon and fork, bowl,

04:59

common objects like that.

05:02

It has a variety of more exotic things:

05:05

animals, cars, zebras, giraffes.

05:08

And now we're going to do something fun.

05:10

We're just going to go out into the audience

05:12

and see what kind of things we can detect.

05:14

Does anyone want a stuffed animal?

05:18

There are some teddy bears out there.

05:22

And we can turn down our threshold for detection a little bit,

05:26

so we can find more of you guys out in the audience.
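
Turning down the detection threshold, as in the demo above, can be illustrated with a short Python sketch. A detector typically attaches a confidence score to each candidate, and only candidates at or above the threshold are shown. The labels and scores below are invented for illustration, and `filter_detections` is a hypothetical helper, not a Darknet function.

```python
def filter_detections(candidates, threshold):
    """Keep only (label, confidence) pairs scoring at or above the threshold."""
    return [(label, conf) for label, conf in candidates if conf >= threshold]


# Made-up candidate detections for a single frame of the audience.
candidates = [
    ("person", 0.91),
    ("teddy bear", 0.62),
    ("person", 0.34),
    ("backpack", 0.27),
    ("stop sign", 0.12),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print(len(strict), "detections at threshold 0.5:", strict)
print(len(relaxed), "detections at threshold 0.25:", relaxed)
```

A lower threshold surfaces more objects, including fainter or partly hidden ones, at the price of admitting more false positives.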

05:31

Let's see if we can get these stop signs.

05:33

We find some backpacks.

05:37

Let's just zoom in a little bit.

05:42

And this is great.

05:43

And all of the processing is happening in real time

05:46

on the laptop.

05:49

And it's important to remember

05:50

that this is a general purpose object detection system,

05:53

so we can train this for any image domain.

06:00

The same code that we use

06:02

to find stop signs or pedestrians,

06:05

bicycles in a self-driving vehicle,

06:07

can be used to find cancer cells

06:10

in a tissue biopsy.

06:13

And there are researchers around the globe already using this technology

06:18

for advances in things like medicine, robotics.

06:21

This morning, I read a paper

06:23

where they were taking a census of animals in Nairobi National Park

06:27

with YOLO as part of this detection system.

06:30

And that's because Darknet is open source

06:33

and in the public domain, free for anyone to use.

06:37

(Applause)

06:43

But we wanted to make detection even more accessible and usable,

06:48

so through a combination of model optimization,

06:52

network binarization and approximation,

06:54

we actually have object detection running on a phone.

07:04

(Applause)

07:10

And I'm really excited because now we have a pretty powerful solution

07:16

to this low-level computer vision problem,

07:18

and anyone can take it and build something with it.

07:22

So now the rest is up to all of you

07:25

and people around the world with access to this software,

07:28

and I can't wait to see what people will build with this technology.

07:32

Thank you.

07:33

(Applause)

Translated by chunhua zhang
Reviewed by Yi-Fan Yu


THE ORIGINAL VIDEO ON TED.COM

Joseph Redmon: How computers learn to recognize objects instantly | TED Talk | TED.com