Skip to main content

A Comprehensive Guide to the Correlational Neural Network with Keras

Human beings along with many other animals have 5 basic senses: Sight, Hearing, Taste, Smell, and Touch. We also have additional senses like a sense of balance and acceleration, a sense of time, etc. Every single moment the human brain processes information from all these sources and each of these senses affects our decision-making process. During any conversation, movement of lips, facial expression along with sound produced by vocal cord helps us to fully understand the meaning of words pronounced by the speaker. We can even understand words only by seeing the lips movement without any sound. This visual information is not just supplementary but necessary. This was first exemplified in the McGurk effect (McGurk & MacDonald, 1976) where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects. As we want our machine learning models to achieve human-level performance, it is also necessary to enable them to use data from various sources.

In machine learning these type of data that comes from different heterogeneous sources are known as multimodal data. For example audio and visual information for speech recognition. It is difficult to directly use these multimodal data in the training, as they may be of different dimensionality and data types. So much of the attention is paid to learning a common representation of these multimodal data. Learning common representations for such multiple views of data will help in several downstream applications. For example, learning a common representation for videos and their audios could help in generating more accurate subtitles for that video than if it generated by only using audio. But how do you learn this common representation?

An abstract view of CorrNet. It tires to learn a common representation of both views of data and tries to reconstruct both the view from that encoded representation.


Correlational neural network (CorrNet) is one of the methods for learning common representations. Its architecture is almost the same as a conventional single-view deep autoencoder. But it contains one encoder-decoder pair for each modality of data.

Let’s consider a two-view input, Z = [IaIv], where Ia and Iv are two different views of data such as audio and video. In the figure below, one simple architecture of CorrNet with this data is shown.

Example of CorrNet with bimodal data Z = [Ia, Iv] where Ia and Iv are two different views of data like (audio and video )Add caption

Where both encoder and decoder are single-layered. H is the encoded representation. Ha = f( Wa.Ia+b) is encoded representation Ia. f() is any non linearity ( sigmoid, tanh , etc.). Same is the case of Hv = f( Wa.Ia+b). And the common representation of bimodal data Z is given as :

H = f(Wa.Ia + Wv.Iv + b).

In the decoder part, the model tries to reconstruct the input from the common representation H by I’a = g(W’a.H+b’) and I’v = g(W’vH+b’), where g() is any activation and I’a and I’v is the reconstructed Inputs.

During training, the gradient is calculated based on three losses:

i) Minimize the self-reconstruction error, that is, minimize the error in reconstructing Ia from Ia and Iv from Iv.

ii) Minimize the cross-reconstruction error, that is, minimize the error in reconstructing Ia from Iv and Iv from Ia.

iii) Maximize the correlation between the hidden representations of both views, that is, maximize the correlation between Ha and Hv.

The total loss can be written as

Here, Lr() represents reconstruction loss which may be ‘mean squared error’ or ‘mean absolute error’. Our target is to minimize this loss. And as we want to increase the correlation, it is subtracted from the loss, i.e., the higher is the correlation the lower is the loss.

Implementation

Total implementation can be divided into three parts: Model Building, Setting loss functions, and Training.In the Model Building phase, we have to create that auto encodes architecture. First, we have to include all the necessary packages

from keras import Model
from keras.layers import Input,Dense,concatenate,Addfrom keras import backend as K,activationsfrom tensorflow
import Tensor as Tfrom keras.engine.topology
import Layer
import numpy as np
#Then we have to create CorrNet architecture. For simplicity, it has a single-layered encoder and decoder.
class ZeroPadding(Layer):
def __init__(self, **kwargs):
super(ZeroPadding, self).__init__(**kwargs)
def call(self, x, mask=None):
return K.zeros_like(x)
def get_output_shape_for(self, input_shape):
return input_shape
#inputDimx,inputDimy are the dimentions two input modalities. And #hdim_deep is the dimentions of shared representation( kept as #global veriable)
hdim_deep = 10 #dimenstion of shared representation
inpx = Input(shape=(inputDimx,))
inpy = Input(shape=(inputDimx,))
#Encoder
hl = Dense(hdim_deep,activation='relu')(inpx) hr = Dense(hdim_deep,activation='relu')(inpy) h = Add()([hl,hr]) #Common representation/Encoded representation
#decoder
recx = Dense(inputDimx,activation='relu')(h) recy = Dense(inputDimy,activation='relu')(h)
CorrNet = Model( [inpx,inpy],[recx,recy,h])
CorrNet.summary()
'''we have to create a separate model for training this CorrNet. During training, we have to take a gradient from 3 different loss
functions and which can not be obtained from a single input. If you look closely at the loss function, we will see it
has different input parameters.'''
[recx0,recy0,h1] = CorrNet( [inpx, inpy])
H= concatenate([h1,h2])
model = Model( [inpx,inpy],[recx0,recx1,recx2,recy0,recy1,recy2,H])
[recx1,recy1,h1] = CorrNet( [inpx, ZeroPadding()(inpy)])[recx2,recy2,h2] = CorrNet( [ZeroPadding()(inpx), inpy ])
#Now we have to write the correlational loss function for our model.
Lambda = 0.01
#by convention error function need to have two argument, but in #correlation loss we do not need any argument. Loss is computed #competely based on correlation of two hidden representation. For #this a 'fake' argument is used.
def correlationLoss(fake,H):
y1 = H[:,:hdim_deep]
y2 = H[:,hdim_deep:]
y1_mean = K.mean(y1, axis=0)
y1_centered = y1 - y1_mean
y2_mean = K.mean(y2, axis=0)
y2_centered = y2 - y2_mean
corr_nr = K.sum(y1_centered * y2_centered, axis=0)
corr_dr1 = K.sqrt(K.sum(y1_centered * y1_centered, axis=0) + 1e-8)
corr_dr2 = K.sqrt(K.sum(y2_centered * y2_centered, axis=0) + 1e-8)
corr_dr = corr_dr1 * corr_dr2
corr = corr_nr / corr_dr
return K.sum(corr) * Lambda
def square_loss(y_true, y_pred):
error = ls.mean_squared_error(y_true,y_pred)
return error
#Now we have to compile the model and train
model.compile(loss=[square_loss,square_loss,square_loss, square_loss,square_loss,square_loss,correlationLoss],optimizer="adam")
model.summary()
'''Suppose you have already prepared your data and kept one moadlity data in Ia(e.g. Audio) and another in Iv( e.g. Video).
To be used by this model Audios and videos must be converted into 1D tensor.'''
model.fit([Ia,Iv],[Ia,Ia,Ia,Iv,Iv,Iv,np.ones((Ia.shape[0],Ia.shape[1]))],nb_epoch=100)
np.ones((Ia.shape[0],Ia.shape[1])) is fake tensor that will be passed to correlationLoss function but will have no use
'''using this model we can generate Ia to Iv.For example, from video Iv we can generate corresponding audio Ia
np.zeros(Ia.shape) gives tensors of 0 of dimestions same as output tensor audio.'''
audio,_,_ = CorrNet.predict([np.zeros(Ia.shape),Iv])

After training, the common representation learned by this model can be used in different prediction tasks. For example, common representations learned using CorrNet can be used for: cross-language document classification or transliteration equivalence detection. Various studies have found that using common representation increases performance. It can also be used in data generation. For example, in a certain dataset, you have 10000 audio clips with corresponding videos, 5000 audio clips with their videos missing, and 5000 videos with their audios missing. In such cases, we can train a CorrNet with 10000 audios and videos and can be used to generate missing videos for audio and vice versa.

Popular posts from this blog

Understanding Conditional Variational Autoencoders

Add caption The variational autoencoder or VAE is a directed graphical generative model which has obtained excellent results and is among the state of the art approaches to generative modeling. It assumes that the data is generated by some random process, involving an unobserved continuous random variable  z.  it is assumed that the  z  is generated from some prior distribution  P_θ(z)  and the data is generated from some condition distribution  P_θ(X|Z) , where  X  represents that data. The  z  is sometimes called the hidden representation of data  X . Like any other autoencoder architecture, it has an encoder and a decoder. The encoder part tries to learn  q_φ(z|x) , which is equivalent to learning hidden representation of data  X  or encoding the  X  into the hidden representation (probabilistic encoder). The decoder part tries to learn  P_θ(X|z)  which decoding the hidden representation to...

Neural Architecture Search (NAS)- The Future of Deep Learning

Most of us have probably heard about the success of ResNet, winner of ILSVRC 2015 in image classification, detection, and localization and Winner of MS COCO 2015 detection, and segmentation .  It is an enormous architecture with skip connections all over. While using this ResNet as a pre-trained network for my machine learning project, I thought “ How can someone come out with such an architecture? ’’ Large Human Engineered Architecture Soon after I learned that many engineers and scientists with their years of experience build this architecture. And there is more of a hunch than complete math that will tell you “ we need to have a 5x5 filter now to achieve the best accuracy ’’. We have wonderful architectures for the image classification task, but for other tasks, we have to spend much of our energy to find an architecture with reasonable performance for those tasks. It would certainly be better if we can automate this architecture modeling task just as we learn the parameters of ...