[EDVR] EDVR: Video Restoration with Enhanced Deformable Convolutional Networks

Paper Summary

1. Paper Bibliography

Title

- EDVR: Video Restoration with Enhanced Deformable Convolutional Networks

Authors

- Wang, Xintao, et al.

Publication / Conference Information

- Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.

Year

- 2019

2. Problems & Motivations

Summary of the problems the paper identifies in VSR research at the time, plus related work

At the time, VSR pipelines consisted of four stages: feature extraction, alignment, fusion, and reconstruction

- The hard part is usually how to align and fuse well when the video contains occlusion, large motion, or severe blurring

1) Alignment

- Explicit: estimate optical flow between the reference and its neighboring frames, then warp the neighboring frames based on the estimated motion fields [2, 48, 13]

- Implicit: dynamic filtering [10], deformable convolution [40]

- When motion is large, both explicit and implicit methods struggle to perform motion compensation at a single resolution scale

2) Fusion

- Early fusion: apply convolutions to all frames at once [2]

- RNN: fuse multiple frames gradually [32, 6]

- Neither approach accounts for the visual informativeness of each frame (frames differ in importance)

Deformable Convolution

- Dai et al. [3] first proposed deformable convolutions, which learn additional offsets so that the network can gather information from outside the regular local neighborhood, improving on what regular convolutions can do

- TDAN [40] used deformable convolutions to align input frames at the feature level, without explicit motion estimation or image warping

Attention Mechanism

- Liu et al. [22] learned weight maps that assign different weights to the features from different temporal branches

[2] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR, 2017.

[48] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. Video enhancement with task-oriented flow. arXiv preprint arXiv:1711.09078, 2017.

[13] Tae Hyun Kim, Mehdi S. M. Sajjadi, Michael Hirsch, and Bernhard Schölkopf. Spatio-temporal transformer network for video restoration. In ECCV, 2018.

[10] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, 2018.

[40] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. TDAN: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898, 2018.

[32] Mehdi S. M. Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In CVPR, 2018.

[6] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In CVPR, 2019.

[3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.

[22] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In ICCV, 2017.

3. Proposed Solutions

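Summary of the solutions proposed in the paper

1. Overview

- A generic architecture applicable to restoration tasks in general (e.g., super-resolution, deblurring)

Example: VSR

1) Takes 2N+1 LR frames as input

2) Each neighboring frame is aligned to the reference frame at the feature level by the PCD alignment module

3) The TSA fusion module fuses the image information from the different frames

4) An upsampling operation increases the spatial size

5) Finally, the predicted image residual is added to a directly upsampled image to produce the HR frame

- For other tasks with high-resolution inputs (e.g., video deblurring), the input frames are first downsampled with strided convolutions before processing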
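To make the data flow of steps 1) to 5) concrete, here is a rough shape-level sketch in NumPy. It is not the EDVR network: alignment and fusion are replaced by trivial placeholders (pass-through and averaging), nearest-neighbor upsampling stands in for the real interpolation, and the function names `upsample_nearest` and `restore_center_frame` are made up for this note. Only the final "upsampled reference + residual" structure is taken from the summary above.

```python
import numpy as np

def upsample_nearest(x, scale):
    """Nearest-neighbor upsampling of an (H, W, C) array; a stand-in for the
    interpolation used to directly upsample the reference frame."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def restore_center_frame(lr_frames, scale=4):
    """lr_frames: (2N+1, H, W, C) array. Returns an (H*scale, W*scale, C) frame."""
    num_frames = lr_frames.shape[0]
    assert num_frames % 2 == 1, "expects 2N+1 frames"
    reference = lr_frames[num_frames // 2]

    # Steps 1-2 placeholder: a real model extracts features and aligns each
    # neighbor to the reference (PCD). Here frames are passed through as-is.
    aligned = lr_frames

    # Step 3 placeholder: a real model fuses with learned attention (TSA).
    # Here all frames are simply averaged.
    fused = aligned.mean(axis=0)

    # Step 4 placeholder: treat the upsampled difference between the fused
    # result and the reference as the "predicted residual".
    residual = upsample_nearest(fused - reference, scale)

    # Step 5: add the residual to the directly upsampled reference frame.
    return upsample_nearest(reference, scale) + residual

frames = np.random.rand(5, 16, 16, 3)   # 2N+1 = 5 frames of 16x16 RGB
hr = restore_center_frame(frames, scale=4)
print(hr.shape)  # (64, 64, 3)
```

Because the network only has to predict a residual on top of an upsampled reference, identical input frames pass through this sketch unchanged apart from the upsampling.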

2. Alignment with Pyramid, Cascading and Deformable Convolution

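- Unlike optical-flow-based methods, deformable alignment is applied to the features of each frame

Modulated deformable module

- Given a deformable convolution kernel with K sampling locations, let w_k be the weight and p_k the pre-specified offset of the k-th location

- The aligned features at position p_0 are obtained as

$$F^a_{t+i}(p_0) = \sum_{k=1}^{K} w_k \cdot F_{t+i}(p_0 + p_k + \Delta p_k) \cdot \Delta m_k$$

The sampling position p_0 + p_k + Δp_k (position + pre-specified offset + learnable offset) is fractional, so bilinear interpolation is used

- The learnable offset Δp_k and modulation scalar Δm_k are predicted by feeding the concatenated features of the reference and neighboring frames into convolution layers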
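As a sanity check on the description above, the following is a minimal NumPy sketch of modulated deformable sampling at a single output position: each kernel tap reads the feature map at position + pre-specified offset + learnable offset using bilinear interpolation, and the result is scaled by the tap's weight and modulation scalar. It assumes a single-channel feature map and a 3x3 kernel (K = 9); the function names are hypothetical, and in the real module the offsets and modulation scalars come from convolution layers rather than being passed in by hand.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at fractional (y, x); outside the map counts as 0."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy, xx]
    return value

def modulated_deform_at(feat, p0, weights, offsets, masks):
    """Aligned feature at p0 for a 3x3 kernel (K = 9).

    weights: (9,) kernel weights w_k
    offsets: (9, 2) learnable offsets as (dy, dx)
    masks:   (9,) modulation scalars
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # pre-specified p_k
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        y = p0[0] + gy + offsets[k, 0]
        x = p0[1] + gx + offsets[k, 1]
        out += weights[k] * bilinear_sample(feat, y, x) * masks[k]
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)   # feat[y, x] = 5*y + x
weights = np.full(9, 1.0 / 9.0)
masks = np.ones(9)
no_shift = modulated_deform_at(feat, (2, 2), weights, np.zeros((9, 2)), masks)
half_px = np.tile([0.0, 0.5], (9, 1))             # shift every tap 0.5 px to the right
shifted = modulated_deform_at(feat, (2, 2), weights, half_px, masks)
print(no_shift, shifted)  # close to 12.0 and 12.5
```

With zero offsets and unit modulation this reduces to an ordinary 3x3 convolution tap sum; the half-pixel offsets move the effective sampling window without any explicit warping of the frame.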

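- The PCD module is proposed to handle complex motions and large parallax problems in alignment (parallax: roughly, a large difference in an object's apparent position or direction depending on the viewpoint)

- The PCD module uses pyramidal processing and cascading refinement

- To obtain the features at the l-th level, the (l-1)-th level features are downsampled by a factor of 2 with strided convolution filters (L1 is downsampled to obtain L2)

- At the l-th level, the offsets and aligned features are predicted using the ×2 upsampled offsets and aligned features from the (l+1)-th level (the offsets and aligned features of L2 draw on those of L3)

- Upsampling uses bilinear interpolation

- After the pyramid, a further deformable alignment is cascaded to refine the aligned features

- EDVR uses L = 3 pyramid levels

- To reduce the computational cost, the number of channels is not increased as the spatial size decreases

- With PCD alignment, the flow is visibly smaller and cleaner, which indicates that the PCD module successfully handles large and complex motions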
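The pyramid and coarse-to-fine steps above can be sketched as follows. This NumPy sketch covers only the offset propagation (not the aligned features or the cascaded refinement). Average pooling stands in for the learned stride-2 convolution, `predict_offset` is a hypothetical callable standing in for the learned convolution layers, and doubling the offset magnitudes after upsampling is an assumption on my part about how displacements carry across scales.

```python
import numpy as np

def downsample2x(feat):
    """Halve H and W by 2x2 average pooling (stand-in for a learned stride-2 conv)."""
    h, w = feat.shape[:2]
    return feat[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample2x_bilinear(x):
    """Double H and W of an (H, W, C) array with bilinear interpolation."""
    h, w = x.shape[:2]
    # Output pixel centers mapped back to input coordinates (half-pixel convention).
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def coarse_to_fine_offsets(ref, nbr, predict_offset, levels=3):
    """Return per-level offset fields, finest level first.

    predict_offset(ref_l, nbr_l, prior) is hypothetical; prior is None at the
    coarsest level and the upsampled coarser offsets elsewhere.
    """
    ref_pyr, nbr_pyr = [ref], [nbr]
    for _ in range(levels - 1):
        ref_pyr.append(downsample2x(ref_pyr[-1]))
        nbr_pyr.append(downsample2x(nbr_pyr[-1]))

    offsets = [None] * levels
    prior = None
    for level in reversed(range(levels)):          # coarsest level first
        offsets[level] = predict_offset(ref_pyr[level], nbr_pyr[level], prior)
        if level > 0:
            # Assumption: a shift of d pixels at the coarser level spans 2d pixels here.
            prior = upsample2x_bilinear(offsets[level]) * 2.0
    return offsets

def toy_predictor(ref_l, nbr_l, prior):
    """Toy stand-in: 1-pixel offsets at the coarsest level, then reuse the prior."""
    if prior is None:
        return np.ones(ref_l.shape[:2] + (2,))
    return prior

ref = np.random.rand(32, 32, 8)
nbr = np.random.rand(32, 32, 8)
offs = coarse_to_fine_offsets(ref, nbr, toy_predictor)
print([o.shape for o in offs])  # [(32, 32, 2), (16, 16, 2), (8, 8, 2)]
```

The point of the pyramid shows up in the toy run: a 1-pixel offset predicted at the coarsest level already corresponds to a 4-pixel displacement at full resolution, so large motions only require small offsets where they are first estimated.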

3. Fusion with Temporal and Spatial Attention

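- Inter-frame temporal relations and intra-frame spatial relations are both important for fusion, because

1) Neighboring frames are not equally informative, due to occlusion, blurry regions, and parallax problems

2) Misalignment and unalignment arising in the preceding alignment stage can adversely affect the subsequent reconstruction

--> Therefore, neighboring frames should be aggregated dynamically at the pixel level for fusion to be effective

- The TSA fusion module assigns pixel-level aggregation weights to each frame

Temporal Attention

- Computes frame similarity in an embedding space and attends more to the neighboring frames that are more similar to the reference

1) First, the similarity h is computed as

$$h(F^a_{t+i}, F^a_t) = \mathrm{sigmoid}\left(\theta(F^a_{t+i})^T \phi(F^a_t)\right)$$

- The sigmoid takes the two embeddings θ and φ (obtained with convolution filters); its output is restricted to the range 0 to 1, which stabilizes back-propagation

2) Next, the temporal attention maps are multiplied pixel-wise with the original aligned features

3) Finally, the resulting attention-modulated features are concatenated (and then aggregated by an extra fusion convolution layer)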
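Steps 1) to 3) can be sketched in NumPy as below. This is an illustration under simplifying assumptions, not the paper's implementation: the embeddings are modeled as per-pixel matrix multiplications (i.e., 1x1 convolutions) with random matrices `theta` and `phi`, the reference is taken to be the center frame, and the final fusion convolution is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_attention(aligned, theta, phi):
    """aligned: (T, H, W, C) aligned features; the center index is the reference.
    theta, phi: (C, D) matrices acting as 1x1-conv embeddings (illustrative).
    Returns (H, W, T*C) concatenated features and (T, H, W) attention maps.
    """
    t = aligned.shape[0]
    ref_emb = aligned[t // 2] @ phi                       # (H, W, D)
    nbr_emb = aligned @ theta                             # (T, H, W, D)
    # Step 1: per-pixel dot product of the two embeddings, squashed to (0, 1).
    attn = sigmoid((nbr_emb * ref_emb).sum(axis=-1))      # (T, H, W)
    # Step 2: scale the original aligned features pixel-wise.
    modulated = aligned * attn[..., None]
    # Step 3: concatenate the modulated frames along the channel axis.
    fused_in = np.concatenate(list(modulated), axis=-1)   # (H, W, T*C)
    return fused_in, attn

rng = np.random.default_rng(0)
aligned = rng.random((5, 4, 4, 8))                        # T=5 frames, 4x4, C=8
theta = rng.normal(size=(8, 8)) * 0.1
phi = rng.normal(size=(8, 8)) * 0.1
fused_in, attn = temporal_attention(aligned, theta, phi)
print(fused_in.shape, attn.shape)  # (4, 4, 40) (5, 4, 4)
```

Note that the attention is a separate weight per frame and per pixel, so a neighbor can contribute strongly in one region (where it matches the reference) and weakly in another (e.g., where it is occluded).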

Spatial Attention

- Spatial attention masks are computed from the fused features

- A pyramid design is used here to enlarge the attention receptive field

- The fused features are then modulated by the masks through element-wise multiplication and addition

- Frames with smaller flow received more attention, which indicates that those frames are more informative
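A very loose NumPy sketch of the idea follows. Only two things are taken from the notes above: the mask is computed from the fused features with a pyramid to widen the receptive field, and the features are modulated by element-wise multiplication and addition. Everything else is an assumption made for illustration: the two-level average-pool pyramid, the nearest-neighbor upsampling, and the matrices `w_mask` and `w_add` standing in for learned convolutions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool2x(x):
    """2x2 average pooling of an (H, W, C) array."""
    h, w = x.shape[:2]
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def upsample2x_nearest(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def spatial_attention(fused, w_mask, w_add):
    """fused: (H, W, C) with even H and W; w_mask, w_add: (C, C) 1x1-conv stand-ins."""
    # Two-level pyramid: pool for wider context, then bring it back to full size.
    coarse = upsample2x_nearest(avg_pool2x(fused))
    context = fused + coarse
    mask = sigmoid(context @ w_mask)   # multiplicative mask in (0, 1)
    add = context @ w_add              # additive term
    return fused * mask + add          # element-wise multiplication and addition

rng = np.random.default_rng(0)
fused = rng.random((8, 8, 4))
out = spatial_attention(fused, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
print(out.shape)  # (8, 8, 4)
```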

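Performance as modules are added

4. Two-Stage Restoration

- However well EDVR with the PCD and TSA modules performs, restoration still suffers when the input frames are blurry or severely distorted

- In such cases, motion compensation and detail aggregation are affected, which lowers performance

- Intuitively, coarsely restored frames would ease this. Two-stage strategy: a similar but shallower EDVR network is cascaded after the first stage to refine its output frames

Benefits

1) Effectively removes severe motion blur

2) Alleviates inconsistency among the output frames

- This strategy improved performance by 0.5 dB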

4. Input Format

- RGB patches of size 64x64

- 5 consecutive frames

5. Temporal Information Modeling Framework

Base framework (2D CNN, 3D CNN, RNN, etc.)

Contributions to the architecture, if any?

6. Frame Alignment Method

Implicit or Explicit

- Implicit

Additional notes

- Deformable convolution with a pyramid and cascading architecture to handle large motions

7. Upsampling Method

- pixel shuffle
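For reference, pixel shuffle trades channels for spatial resolution: a (C·r², H, W) tensor is rearranged into (C, H·r, W·r). Below is a small NumPy sketch of that rearrangement in channel-first layout, written to follow the usual sub-pixel convolution ordering; the function name is my own, and the convolution that produces the C·r² channels is not shown.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    assert c_rr % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, dy, dx)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, dy, w, dx)
    return x.reshape(c, h * r, w * r)  # interleave dy into rows, dx into columns

tiny = np.arange(4).reshape(4, 1, 1)   # four channels, one pixel each
print(pixel_shuffle(tiny, 2))          # [[[0 1]
                                       #   [2 3]]]

feat = np.random.rand(3 * 4 * 4, 16, 16)
print(pixel_shuffle(feat, 4).shape)    # (3, 64, 64)
```

Each group of r² channels becomes one r×r block of output pixels, so the upsampling itself has no parameters; the learning happens in the convolution that precedes it.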

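8. Miscellaneous

Number of model parameters

Training data

REDS

- Dataset released for NTIRE19 (originally 240 training clips, 30 validation clips, and 30 testing clips)

- 720p

- Regrouped: the training and validation clips are merged and, with the 4 REDS4 test clips held out, 266 clips are used for training (240 + 30 - 4 = 266)

- Augmentation: random horizontal flips and 90° rotations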
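One detail worth spelling out for video data is that the same random flip and rotation must be applied to every frame of a clip and to both the LR and HR versions, otherwise the pairs no longer correspond. A minimal NumPy sketch, assuming square patches in (T, H, W, C) layout and rotations by a random multiple of 90° (the exact sampling scheme is my assumption, and `augment_clip` is a made-up name):

```python
import numpy as np

def augment_clip(lr, hr, rng):
    """lr, hr: (T, H, W, C) arrays. Applies one shared random flip and rotation."""
    if rng.random() < 0.5:                      # horizontal flip along the width axis
        lr, hr = lr[:, :, ::-1], hr[:, :, ::-1]
    k = int(rng.integers(0, 4))                 # number of 90-degree rotations
    lr = np.rot90(lr, k, axes=(1, 2))           # rotate in the (H, W) plane
    hr = np.rot90(hr, k, axes=(1, 2))
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)

rng = np.random.default_rng(0)
lr_clip = rng.random((5, 64, 64, 3))                     # 5 frames of 64x64 RGB
hr_clip = lr_clip.repeat(4, axis=1).repeat(4, axis=2)    # toy 4x "HR" counterpart
lr_aug, hr_aug = augment_clip(lr_clip, hr_clip, rng)
print(lr_aug.shape, hr_aug.shape)  # (5, 64, 64, 3) (5, 256, 256, 3)
```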

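Test data

REDS4

- The original REDS has a test set, but its ground truth is not provided

- Four representative clips with diverse scenes and motions are selected

Vid4, Vimeo-90K-T

- Used to check how performance changes when the training data and evaluation data differ

-> Dataset bias causes a performance drop of 0.5 to 1.5 dB --> a model trained on Vimeo-90K is used when evaluating on Vid4 and Vimeo-90K-T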

Paper Analysis

1. Among the critiques raised for the previously reviewed papers, list any that this paper resolves

2. Critique of this paper

Google Scholar Link

https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=EDVR&btnG=

Github Link

https://github.com/xinntao/BasicSR
