
[SOF-VSR] Learning for video super-resolution through HR optical flow estimation

by 귤이두번 2022. 4. 12.

Paper Summary

 

1. Paper Bibliography

Paper title

- Deep video super-resolution using HR optical flow estimation

 

Authors

- Wang et al.

 

Publication / conference venue

- IEEE Transactions on Image Processing 29 (2020)

 

Year

- 2020

 

(An extended version of the 2018 conference paper [17])

 

[17] L. Wang, Y. Guo, Z. Lin, X. Deng, and W. An, “Learning for video super-resolution through HR optical flow estimation,” in ACCV, 2018.

 


 

2. Problems & Motivations

Problems with existing VSR research pointed out in the paper + related work

 

Deep Video SR with Separated Motion Compensation

- Kappeler et al. [13]: a two-step framework. Optical flow is estimated first, motion compensation is applied, and the compensated frames are concatenated and passed through a CNN to obtain the HR frame (VSRnet)

- Liao et al. [12]: estimate several optical flows with different parameter settings to produce multiple SR drafts, which are later combined by an ensemble (draft-ensemble learning)

- In such two-step frameworks, the ME-MC stage is separated from the CNN, so it is hard to optimize the whole pipeline end-to-end

 

Deep Video SR with Integrated Motion Compensation

- Caballero et al. [14]: the first end-to-end CNN for video SR, consisting of a motion estimation module and a spatio-temporal ESPCN module (VESPCN)

- Tao et al. [16]: reuse the motion estimation module of VESPCN and add a new layer that performs sub-pixel motion compensation (SPMC) together with resolution enhancement; an LSTM is used to capture temporal context

- Liu et al. [34]: customize ESPCN so that HR frames can be inferred simultaneously from different numbers of LR frames; a temporal adaptive network (TDVSR) then aggregates the multiple HR estimates using learned dynamic weights

- Sajjadi et al. [35]: propose a frame-recurrent architecture (FRVSR) in which the previously estimated HR frame is reused to super-resolve the next frame, exploiting information from past frames at no extra cost

 

Deep Video SR without Explicit Motion Compensation

- Huang et al. [36]: propose a bidirectional recurrent CNN that avoids explicit ME-MC. The recurrent structure captures long-term contextual information, but it struggles with large displacements and complex motion

- Jo et al. [37]: generate dynamic upsampling filters with a CNN; the filters are computed from a local spatio-temporal neighborhood

 

 

[12] R. Liao, X. Tao, R. Li, Z. Ma, and J. Jia, “Video super-resolution via deep draft-ensemble learning,” in ICCV, 2015, pp. 531–539.

[13] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE Trans. Computational Imaging, vol. 2, no. 2, pp. 109–122, Jun. 2016.

[14] J. Caballero, C. Ledig, A. P. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in CVPR, 2017, pp. 2848–2857.

[16] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-revealing deep video super-resolution,” in ICCV, 2017, pp. 4482–4490.

[34] D. Liu, Z. Wang, Y. Fan, and X. Liu, “Robust video super-resolution with learned temporal dynamics,” in ICCV, 2017.

[35] M. S. M. Sajjadi, R. Vemulapalli, and M. Brown, “Frame-recurrent video super-resolution,” in CVPR, 2018, pp. 6626–6634.

[36] Y. Huang, W. Wang, and L. Wang, “Video super-resolution via bidirectional recurrent convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2017.

[37] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim, “Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation,” in CVPR, 2018.

 

 


 

3. Proposed Solutions

Summary of the solutions proposed in the paper

 

A. Overview

1) The input LR frames are fed into OFRnet, which infers HR optical flows

2) A space-to-depth transformation converts the HR optical flows into LR flow cubes

3) Motion compensation turns the flow cubes into draft cubes

4) The draft cubes are fed into SRnet, which produces the final HR frame

import torch
import torch.nn as nn

# OFRnet, SRnet (and the CasResB / optical_flow_warp helpers used below) are
# defined in the official repository.
class SOFVSR(nn.Module):
    def __init__(self, cfg, n_frames=3, is_training=True):
        super(SOFVSR, self).__init__()
        self.scale = cfg.scale
        self.is_training = is_training
        self.OFR = OFRnet(scale=cfg.scale, channels=320)
        self.SR = SRnet(scale=cfg.scale, channels=320, n_frames=n_frames)

    def forward(self, x):
        b, n_frames, c, h, w = x.size()  # x: b*n*c*h*w
        idx_center = (n_frames - 1) // 2

        # motion estimation
        flow_L1 = []
        flow_L2 = []
        flow_L3 = []
        input = []

        for idx_frame in range(n_frames):
            if idx_frame != idx_center:
                input.append(torch.cat((x[:, idx_frame, :, :, :], x[:, idx_center, :, :, :]), 1))

        # Optical Flow Reconstruction
        optical_flow_L1, optical_flow_L2, optical_flow_L3 = self.OFR(torch.cat(input, 0))

        optical_flow_L1 = optical_flow_L1.view(-1, b, 2, h // 2, w // 2)
        optical_flow_L2 = optical_flow_L2.view(-1, b, 2, h, w)
        optical_flow_L3 = optical_flow_L3.view(-1, b, 2, h * self.scale, w * self.scale)

        # Motion Compensation
        draft_cube = []
        draft_cube.append(x[:, idx_center, :, :, :])

        for idx_frame in range(n_frames):
            if idx_frame == idx_center:
                flow_L1.append([])
                flow_L2.append([])
                flow_L3.append([])
            if idx_frame != idx_center:
                if idx_frame < idx_center:
                    idx = idx_frame
                if idx_frame > idx_center:
                    idx = idx_frame - 1

                flow_L1.append(optical_flow_L1[idx, :, :, :, :])
                flow_L2.append(optical_flow_L2[idx, :, :, :, :])
                flow_L3.append(optical_flow_L3[idx, :, :, :, :])

                for i in range(self.scale):
                    for j in range(self.scale):
                        draft = optical_flow_warp(x[:, idx_frame, :, :, :],
                                                  optical_flow_L3[idx, :, :, i::self.scale, j::self.scale] / self.scale)
                        draft_cube.append(draft)
        draft_cube = torch.cat(draft_cube, 1)

        # Super-Resolution
        SR = self.SR(draft_cube)

        if self.is_training:
            return flow_L1, flow_L2, flow_L3, SR
        if not self.is_training:
            return SR
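
A minimal smoke test of the overall pipeline (my own sketch, not from the paper or the repo; cfg is a hypothetical stand-in for the repo's config, the repo's OFRnet/SRnet/CasResB/optical_flow_warp must be importable, and a CUDA device is required since OFRnet allocates its initial zero flow with .cuda()):

from types import SimpleNamespace

cfg = SimpleNamespace(scale=4)                      # hypothetical config object
model = SOFVSR(cfg, n_frames=3, is_training=False).cuda()
lr_clip = torch.rand(1, 3, 1, 32, 32).cuda()        # b * n_frames * c * h * w (Y channel only)
sr_center = model(lr_clip)
print(sr_center.shape)                              # torch.Size([1, 1, 128, 128])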

 

B. Optical Flow Reconstruction Net (OFRnet)

- Idea: combine CNN-based non-linear mapping between LR and HR with CNN-based optical flow estimation

- Takes a pair of LR frames as input

        for idx_frame in range(n_frames):
            if idx_frame != idx_center:
                # I_L_i = concat(I_L_t-1, I_L_t), I_L_j = concat(I_L_t+1, I_L_t)
                input.append(torch.cat((x[:, idx_frame, :, :, :], x[:, idx_center, :, :, :]), 1))   # len(input) == 2

        # Optical Flow Reconstruction
        optical_flow_L1, optical_flow_L2, optical_flow_L3 = self.OFR(torch.cat(input, 0))   # input: [I_L_i, I_L_j]

 

class OFRnet(nn.Module):
    def __init__(self, scale, channels):
        super(OFRnet, self).__init__()
        self.pool = nn.AvgPool2d(2)
        self.scale = scale

        # RNN part
        self.RNN1 = nn.Sequential(                          # feature extraction
            nn.Conv2d(4, channels, 3, 1, 1, bias=False),
            nn.LeakyReLU(0.1, inplace=True),
            CasResB(3, channels)
        )
        self.RNN2 = nn.Sequential(                          # flow estimation
            nn.Conv2d(channels, 2, 3, 1, 1, bias=False),
        )

        # SR part
        SR = []
        SR.append(CasResB(3, channels))

        if self.scale == 4:
            SR.append(nn.Conv2d(channels, 64 * 4, 1, 1, 0, bias=False))
            SR.append(nn.PixelShuffle(2))
            SR.append(nn.LeakyReLU(0.1, inplace=True))
            SR.append(nn.Conv2d(64, 64 * 4, 1, 1, 0, bias=False))
            SR.append(nn.PixelShuffle(2))
            SR.append(nn.LeakyReLU(0.1, inplace=True))
        elif self.scale == 3:
            SR.append(nn.Conv2d(channels, 64 * 9, 1, 1, 0, bias=False))
            SR.append(nn.PixelShuffle(3))
            SR.append(nn.LeakyReLU(0.1, inplace=True))
        elif self.scale == 2:
            SR.append(nn.Conv2d(channels, 64 * 4, 1, 1, 0, bias=False))
            SR.append(nn.PixelShuffle(2))
            SR.append(nn.LeakyReLU(0.1, inplace=True))
        SR.append(nn.Conv2d(64, 2, 3, 1, 1, bias=False))

        self.SR = nn.Sequential(*SR)

    def __call__(self, x):  # x: b*2*h*w
        """
        # Scale-recurrent network for optical flow reconstruction
        - For first two levels, we use a recurrent module to estimate optical flows for inputs with different scale
        - For level 3, we first use the recurrent structure to genereate deep representations, and then use SR module to
        recover HR optical flows from the LR feature representations
        """
        # input: pair of LR images [I_L_i, I_L_j]

        # Part 1

        # (1): downsample by a factor of 2
        x_L1 = self.pool(x)

        # (2): concat initial flow map with (1)
        b, c, h, w = x_L1.size()
        input_L1 = torch.cat((x_L1, torch.zeros(b, 2, h, w).cuda()), 1)

        # (3): fed to feature extraction layer -> then fed to flow estimation layer
        optical_flow_L1 = self.RNN2(self.RNN1(input_L1))

        # Part 2

        # (4): upscale flow by a factor of 2 with bilinear interpolation
        optical_flow_L1_upscaled = F.interpolate(optical_flow_L1, scale_factor=2, mode='bilinear',
                                                 align_corners=False) * 2

        # (5): warp image I_L_i with the upscaled flow from (4)
        x_L2 = optical_flow_warp(torch.unsqueeze(x[:, 0, :, :], 1), optical_flow_L1_upscaled)

        # (6): concat upscaled flow, warped image, not_warped image
        input_L2 = torch.cat((x_L2, torch.unsqueeze(x[:, 1, :, :], 1), optical_flow_L1_upscaled), 1)

        # (7): fed to recurrent module to generate optical flow
        optical_flow_L2 = self.RNN2(self.RNN1(input_L2)) + optical_flow_L1_upscaled

        # Part 3

        # (8) concat LR flow, warped image, not_warped image
        x_L3 = optical_flow_warp(torch.unsqueeze(x[:, 0, :, :], 1), optical_flow_L2)
        input_L3 = torch.cat((x_L3, torch.unsqueeze(x[:, 1, :, :], 1), optical_flow_L2), 1)

        # (9): fed to recurrent module to extract features
        #      -> 3 additional residual blocks to generate deep representations
        #      -> subpixel layer for resolution enhancement
        #      -> finally flow estimation layer to generate final HR optical flow
        optical_flow_L3 = self.SR(self.RNN1(input_L3)) + \
                          F.interpolate(optical_flow_L2, scale_factor=self.scale, mode='bilinear',
                                        align_corners=False) * self.scale

        return optical_flow_L1, optical_flow_L2, optical_flow_L3
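
A hedged shape check for OFRnet's three outputs (my own sketch; assumes the repo's CasResB and optical_flow_warp are importable and a CUDA device is available):

net = OFRnet(scale=4, channels=320).cuda()
pair = torch.rand(1, 2, 32, 32).cuda()   # two concatenated Y-channel LR frames
f1, f2, f3 = net(pair)
print(f1.shape)   # torch.Size([1, 2, 16, 16])    level 1: half LR resolution
print(f2.shape)   # torch.Size([1, 2, 32, 32])    level 2: LR resolution
print(f3.shape)   # torch.Size([1, 2, 128, 128])  level 3: HR resolution (scale 4)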

 

C. Motion Compensation Module

- The reconstructed HR optical flow is resized back to LR dimensions via a space-to-depth transformation, producing an LR flow cube
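
A standalone illustration of the space-to-depth idea (my own example, not repo code): an HR flow map of size (s·H, s·W) is rearranged into s² LR flow maps of size (H, W) by strided slicing, which mirrors the slicing used in the motion compensation loop below.

import torch

s = 2
hr_flow = torch.rand(1, 2, 8 * s, 8 * s)                          # HR optical flow (2 channels: u, v)
lr_cube = [hr_flow[:, :, i::s, j::s] for i in range(s) for j in range(s)]
lr_cube = torch.cat(lr_cube, dim=1)                               # space-to-depth: 1 x (2*s*s) x 8 x 8
print(lr_cube.shape)                                              # torch.Size([1, 8, 8, 8])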

 

- The LR frame I_LR_i is then warped with these flows to generate multiple warped drafts

                for i in range(self.scale):
                    for j in range(self.scale):
                        draft = optical_flow_warp(x[:, idx_frame, :, :, :],
                                                  optical_flow_L3[idx, :, :, i::self.scale, j::self.scale] / self.scale)
                        draft_cube.append(draft)
        draft_cube = torch.cat(draft_cube, 1)
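
- For example, with scale = 4 and n_frames = 3, the draft cube contains the center LR frame plus 4² warped drafts for each of the 2 neighboring frames, i.e. 1 + 4² × 2 = 33 channels, which matches the input channel count of SRnet's first convolution (1 * scale ** 2 * (n_frames - 1) + 1).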

 

D. Super-Resolution Net (SRnet)

- The drafts produced by motion compensation are concatenated with the center LR frame and fed into SRnet

        draft_cube = torch.cat(draft_cube, 1)

        # Super-Resolution
        SR = self.SR(draft_cube)

 

1) The draft cube is fed into a feature extraction layer (320 kernels of size 3x3)

2) The output features pass through 8 efficient residual blocks to produce deeper features

3) A sub-pixel layer upscales the features

4) A final 3x3 conv layer produces the HR frame

class SRnet(nn.Module):
    def __init__(self, scale, channels, n_frames):
        super(SRnet, self).__init__()
        body = []
        body.append(nn.Conv2d(1 * scale ** 2 * (n_frames - 1) + 1, channels, 3, 1, 1, bias=False))
        body.append(nn.LeakyReLU(0.1, inplace=True))
        body.append(CasResB(8, channels))
        if scale == 4:
            body.append(nn.Conv2d(channels, 64 * 4, 1, 1, 0, bias=False))
            body.append(nn.PixelShuffle(2))
            body.append(nn.LeakyReLU(0.1, inplace=True))
            body.append(nn.Conv2d(64, 64 * 4, 1, 1, 0, bias=False))
            body.append(nn.PixelShuffle(2))
            body.append(nn.LeakyReLU(0.1, inplace=True))
        elif scale == 3:
            body.append(nn.Conv2d(channels, 64 * 9, 1, 1, 0, bias=False))
            body.append(nn.PixelShuffle(3))
            body.append(nn.LeakyReLU(0.1, inplace=True))
        elif scale == 2:
            body.append(nn.Conv2d(channels, 64 * 4, 1, 1, 0, bias=False))
            body.append(nn.PixelShuffle(2))
            body.append(nn.LeakyReLU(0.1, inplace=True))
        body.append(nn.Conv2d(64, 1, 3, 1, 1, bias=True))

        self.body = nn.Sequential(*body)

    def __call__(self, x):
        out = self.body(x)
        return out

 

E. Loss Function

- SRnet: MSE loss

criterion = torch.nn.MSELoss()
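
In the training loop, the SR loss is presumably computed between the super-resolved center frame and its HR ground truth; a minimal sketch using the variable names from the loss code below (HR is the ground-truth clip of shape b * n_frames * c * H * W):

        loss_SR = criterion(SR, HR[:, idx_center, :, :, :])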

 

- OFRnet: 

def OFR_loss(x0, x1, optical_flow):
    warped = optical_flow_warp(x0, optical_flow)
    loss = torch.mean(torch.abs(x1 - warped)) + 0.1 * L1_regularization(optical_flow)
    return loss

- The L1 term constrains the smoothness of the optical flows
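
The exact L1_regularization is defined in the repo; below is a hedged sketch of a total-variation-style smoothness penalty on the flow field (my own reimplementation, not the repo's code):

def L1_regularization_sketch(flow):
    # Penalize differences between neighboring flow vectors to encourage
    # spatially smooth optical flow.
    dx = flow[:, :, :, :-1] - flow[:, :, :, 1:]   # horizontal differences
    dy = flow[:, :, :-1, :] - flow[:, :, 1:, :]   # vertical differences
    return torch.mean(torch.abs(dx)) + torch.mean(torch.abs(dy))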

 

        loss_OFR = torch.zeros(1).cuda()

        for i in range(n_frames):
            if i != idx_center:
                loss_L1 = OFR_loss(F.avg_pool2d(LR[:, i, :, :, :], kernel_size=2),
                                   F.avg_pool2d(LR[:, idx_center, :, :, :], kernel_size=2),
                                   flow_L1[i])
                loss_L2 = OFR_loss(LR[:, i, :, :, :], LR[:, idx_center, :, :, :], flow_L2[i])
                loss_L3 = OFR_loss(HR[:, i, :, :, :], HR[:, idx_center, :, :, :], flow_L3[i])
                loss_OFR = loss_OFR + loss_L3 + 0.2 * loss_L2 + 0.1 * loss_L1

- The weights (0.1, 0.2, 1) are chosen so that training focuses on the last (HR) level


- Total loss: the two losses above are combined

        loss = loss_SR + 0.01 * loss_OFR / (n_frames - 1)
        loss_list.append(loss.data.cpu())

 

 

 


 

4. Input Format

- T = 2N + 1 consecutive LR frames (T = 3 in this work)

- Frames are converted to YCbCr and only the Y channel is used

 

- The original video clips are downsampled to 540 x 960 and used as HR ground truth

- These HR clips are further downsampled to generate LR video clips for different upscaling factors

- 32 x 32 patches are randomly cropped (see the sketch below)
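
A hedged sketch of this data preparation (my own illustration; the exact pipeline, color conversion, and downsampling kernel are defined in the repo, so treat the details below as assumptions):

import random
import numpy as np
from PIL import Image

def prepare_patch(hr_frame_path, scale=4, patch_size=32):
    # extract the Y channel of the HR frame
    y_hr = Image.open(hr_frame_path).convert('YCbCr').split()[0]
    w, h = y_hr.size

    # generate the LR frame by bicubic downsampling (assumed kernel)
    y_lr = y_hr.resize((w // scale, h // scale), Image.BICUBIC)

    # random 32 x 32 LR patch and the corresponding (32 * scale) HR patch
    x = random.randint(0, w // scale - patch_size)
    y = random.randint(0, h // scale - patch_size)
    lr_patch = np.asarray(y_lr)[y:y + patch_size, x:x + patch_size]
    hr_patch = np.asarray(y_hr)[y * scale:(y + patch_size) * scale,
                                x * scale:(x + patch_size) * scale]
    return lr_patch, hr_patch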

 


 

5. Temporal Information Modeling Framework

Base framework (2D CNN, 3D CNN, RNN, etc.)

- 2D CNN

 

Any architectural contributions?

- Scale-recurrent network for optical flow reconstruction

 


 

6. Frame Alignment

Implicit or Explicit

- explicit

 

Additional notes

- HR optical flow is estimated in a coarse-to-fine manner and used for alignment


 

7. Upsampling Method

- PixelShuffle (sub-pixel convolution)
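
A quick illustration of how nn.PixelShuffle rearranges channels into spatial resolution, as used in both OFRnet and SRnet:

import torch
import torch.nn as nn

ps = nn.PixelShuffle(2)                  # upscale factor r = 2
feat = torch.rand(1, 64 * 4, 32, 32)     # (b, c*r*r, h, w)
print(ps(feat).shape)                    # torch.Size([1, 64, 64, 64]) -> (b, c, h*r, w*r)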

 


 

8. Miscellaneous

Number of model parameters

- 1.0M
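
One way to verify this figure (a quick check, reusing the model instance from the smoke-test sketch in Section 3.A):

n_params = sum(p.numel() for p in model.parameters())
print('%.2fM' % (n_params / 1e6))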

 

Training data

CDVL

- 145 1080p HD video clips

- used for training

 

Derf's collection

- Coastguard, Foreman, Garden, Husky

- used for validation

 

Test data

Vid4

- used for testing

 

DAVIS

- 10 clips used for comparison

- each clip contains 31 consecutive frames

 

 

 


Paper Analysis

 

1. Which of the criticisms of the previously summarized papers does this paper resolve?

- Since the temporal dependency between consecutive frames is important in video SR, most SR methods focus on capturing this dependency either explicitly or implicitly. However, that temporal dependency is modeled in LR space, and its limited accuracy cannot recover fine details. Unlike prior work, SOF-VSR recovers both temporal and spatial details in an end-to-end manner: optical flows are first super-resolved to recover temporal details, and these HR optical flows provide more accurate temporal dependency, which in turn helps reconstruct spatial details.

 

2. Critique of this paper

1) 

2) 

3) 

 

 


Google Scholar Link

https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Deep+video+super+resolution+using+HR+optical+flow+estimation&btnG= 

 


 

GitHub

https://github.com/The-Learning-And-Vision-Atelier-LAVA/SOF-VSR

 


 
