{"title": "Few-Shot Adversarial Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 6670, "page_last": 6680, "abstract": "This work provides a framework for addressing the problem of supervised domain adaptation with deep models. The main idea is to exploit adversarial learning to learn an embedded subspace that simultaneously maximizes the confusion between two domains while semantically aligning their embedding. The supervised setting becomes attractive especially when there are only a few target data samples that need to be labeled. In this few-shot learning scenario, alignment and separation of semantic probability distributions is difficult because of the lack of data. We found that by carefully designing a training scheme whereby the typical binary adversarial discriminator is augmented to distinguish between four different classes, it is possible to effectively address the supervised adaptation problem. In addition, the approach has a high \u201cspeed\u201d of adaptation, i.e. it requires an extremely low number of labeled target training samples, even one per category can be effective. We then extensively compare this approach to the state of the art in domain adaptation in two experiments: one using datasets for handwritten digit recognition, and one using datasets for visual object recognition.", "full_text": "Few-Shot Adversarial Domain Adaptation\n\nSaeid Motiian, Quinn Jones, Seyed Mehdi Iranmanesh, Gianfranco Doretto\n\nLane Department of Computer Science and Electrical Engineering\n\n{samotiian, qjones1, seiranmanesh, gidoretto}@mix.wvu.edu\n\nWest Virginia University\n\nAbstract\n\nThis work provides a framework for addressing the problem of supervised domain\nadaptation with deep models. The main idea is to exploit adversarial learning to\nlearn an embedded subspace that simultaneously maximizes the confusion between\ntwo domains while semantically aligning their embedding. 
The supervised setting becomes attractive especially when there are only a few target data samples that need to be labeled. In this few-shot learning scenario, alignment and separation of semantic probability distributions is difficult because of the lack of data. We found that by carefully designing a training scheme whereby the typical binary adversarial discriminator is augmented to distinguish between four different classes, it is possible to effectively address the supervised adaptation problem. In addition, the approach has a high "speed" of adaptation, i.e., it requires an extremely low number of labeled target training samples; even one per category can be effective. We then extensively compare this approach to the state of the art in domain adaptation in two experiments: one using datasets for handwritten digit recognition, and one using datasets for visual object recognition.

1 Introduction

As deep learning approaches have gained prominence in computer vision we have seen tasks that have large amounts of available labeled data flourish with improved results. There are still many problems worth solving where labeled data on an equally large scale is too expensive to collect, annotate, or both, and by extension a straightforward deep learning approach would not be feasible. Typically, in such a scenario, practitioners will train or reuse a model from a closely related dataset with a large amount of samples, here called the source domain, and then train with the much smaller dataset of interest, referred to as the target domain. This process is well known under the name finetuning. Finetuning, while simple to implement, has been found to be sub-optimal when compared to later techniques such as domain adaptation [5].
Domain adaptation can be supervised [58, 27], unsupervised [15, 34], or semi-supervised [16, 21, 63], depending on what data is available in a labeled format and how much can be collected.

Unsupervised domain adaptation (UDA) algorithms do not need any target data labels, but they require large amounts of target training samples, which may not always be available. Conversely, supervised domain adaptation (SDA) algorithms do require labeled target data, and because labeling information is available, for the same quantity of target data, SDA outperforms UDA [38]. Therefore, if the available target data is scarce, SDA becomes attractive, even if the labeling process is expensive, because only few samples need to be processed.

Most domain adaptation approaches try to find a feature space such that the confusion between source and target distributions in that space is maximum (domain confusion). Because of that, it is hard to say whether a sample in the feature space has come from the source distribution or the target distribution. Recently, generative adversarial networks [18] have been introduced for image generation, which can also be used for domain adaptation. In [18], the goal is to learn a discriminator

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Examples from MNIST [32] and SVHN [40] of grouped sample pairs. G1 is composed of samples of the same class from the source dataset, in this case MNIST. G2 is composed of samples of the same class, but one is from the source dataset and the other is from the target dataset. In G3 the samples in each pair are from the source dataset but with differing class labels. Finally, pairs in G4 are composed of samples from the target and source datasets with differing class labels.

to distinguish between real samples and generated (fake) samples and then to learn a generator which best confuses the discriminator.
Domain adaptation can also be seen as a generative adversarial network with one difference: in domain adaptation there is no need to generate samples; instead, the generator network is replaced with an inference network. Since the discriminator cannot determine if a sample is from the source or the target distribution, the inference becomes optimal in terms of creating a joint latent space. In this manner, generative adversarial learning has been successfully modified for UDA [33, 59, 49] and provided very promising results.

Here instead, we are interested in adapting adversarial learning for SDA, which we call few-shot adversarial domain adaptation (FADA), for cases when there are very few labeled target samples available in training. In this few-shot learning regime, our SDA method has proven capable of increasing a model's performance at a very high rate with respect to the inclusion of additional samples. Indeed, even one additional sample can significantly increase performance.

Our first contribution is to handle this scarce data while providing effective training. Our second contribution is to extend adversarial learning [18] to exploit the label information of target samples. We propose a novel way of creating pairs of samples using source and target samples to address the first challenge. We assign a group label to a pair according to the following procedure: 0 if samples of a pair come from the source distribution and the same class label, 1 if they come from the source and target distributions but the same class label, 2 if they come from the source distribution but different class labels, and 3 if they come from the source and target distributions and have different class labels.
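To make the pairing procedure concrete, it can be sketched in a few lines of Python. This is a minimal sketch only: the list-of-(x, label)-tuples data layout and the function name are illustrative assumptions, the 0-3 labels above correspond to the groups G1 to G4 of Figure 1, and the balancing step follows the subsampling rule given in Section 3.1.

```python
import random

def make_groups(source, target, seed=0):
    """Build the four groups of sample pairs described in the text.

    `source` and `target` are lists of (x, label) tuples (an assumed,
    illustrative data layout).
      G1: both samples from the source, same class        (label 0)
      G2: one source and one target sample, same class    (label 1)
      G3: both samples from the source, different classes (label 2)
      G4: one source and one target sample, diff. classes (label 3)
    """
    rng = random.Random(seed)
    g1 = [(a, b) for a in source for b in source if a is not b and a[1] == b[1]]
    g2 = [(a, b) for a in source for b in target if a[1] == b[1]]
    g3 = [(a, b) for a in source for b in source if a[1] != b[1]]
    g4 = [(a, b) for a in source for b in target if a[1] != b[1]]
    # Balance the groups: keep all of G2 (the smallest group when target
    # data is scarce) and subsample the others down to its size.
    k = len(g2)
    g1, g3, g4 = (rng.sample(g, min(k, len(g))) for g in (g1, g3, g4))
    return g1, g2, g3, g4
```

With one labeled target sample per class, G2 contains only one pair per source sample of that class, which is why the other groups are subsampled to match it.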
The second challenge is addressed by using adversarial learning [18] to train a deep inference function, which confuses a well-trained domain-class discriminator (DCD) while maintaining a high classification accuracy for the source samples. The DCD is a multi-class classifier that takes pairs of samples as input and classifies them into the above four groups. Confusing the DCD will encourage domain confusion, as well as the semantic alignment of classes. Our third contribution is an extensive validation of FADA against the state of the art. Although our method is general, and can be used for all domain adaptation applications, we focus on visual recognition.

2 Related work

Naively training a classifier on one dataset for testing on another is known to produce sub-optimal results, because an effect known as dataset bias [42, 57, 56] or covariate shift [51] occurs due to a difference in the distributions of the images between the datasets.

Prior work in domain adaptation has minimized this shift largely in three ways. Some try to find a function which can map from the source domain to the target domain [47, 28, 19, 16, 11, 55, 52]. Others find a shared latent space that both domains can be mapped to before classification [35, 2, 39, 13, 14, 41, 37, 38]. Finally, some use regularization to improve the fit on the target domain [4, 1, 62, 10, 3, 8]. UDA can leverage the first two approaches while SDA uses the second, third, or a combination of the two approaches. In addition to these methods, [6, 36, 50] have addressed UDA when an auxiliary data view [31, 37] is available during training, but that is beyond the scope of this work.

For this approach we are focused on finding a shared subspace for both the source and target distributions. Siamese networks [7] work well for subspace learning and have worked very well with deep convolutional neural networks [9, 53, 30, 61].
Figure 2: Few-shot adversarial domain adaptation. For simplicity we show our networks in the case of weight sharing (gs = gt = g). (a) In the first step, we initialize g and h using the source samples Ds. (b) We freeze g and train a DCD. The picture shows a pair from the second group, G2, in which the samples come from two different distributions but share the same class label. (c) We freeze the DCD and update g and h.

Siamese networks have also been useful in domain adaptation recently. In [58], which is a deep SDA approach, unlabeled and sparsely labeled target domain data are used to optimize for domain invariance to facilitate domain transfer while using a soft label distribution matching loss. In [54], which is a deep UDA approach, unlabeled target data is used to learn a nonlinear transformation that aligns correlations of layer activations in deep neural networks. Some approaches went beyond Siamese weight-sharing and used coupled networks for DA. [27] uses two CNN streams, for source and target, fused at the classifier level. [45], which is a deep UDA approach and can be seen as an SDA approach after fine-tuning, also uses a two-stream architecture, for source and target, with related but not shared weights. [38], which is an SDA approach, creates positive and negative pairs using source and target data and then finds a shared feature space between source and target by bringing together the positive pairs and pushing apart the negative pairs.

Recently, adversarial learning [18] has shown promising results in domain adaptation and can be seen as an example of the second category. [33] introduced a coupled generative adversarial network (CoGAN) for learning a joint distribution of multi-domain images for different applications including UDA. [59] has used the adversarial loss for discriminative UDA.
[49] introduces an approach that leverages unlabeled data to bring the source and target distributions closer by inducing a symbiotic relationship between the learned embedding and a generative adversarial framework.

Here we use adversarial learning to train inference networks such that samples from different distributions are not distinguishable. We consider the task where very few labeled target data are available in training. With this assumption, it is not possible to use the standard adversarial loss used in [33, 59, 49], because the training target data would be insufficient. We address that problem by modifying the usual pairing technique used in many applications such as learning similarity metrics [7, 23, 22]. Our pairing technique encodes domain labels as well as class labels of the training data (source and target samples), producing four groups of pairs. We then introduce a multi-class discriminator with four outputs and design an adversarial learning strategy to find a shared feature space. Our method also encourages the semantic alignment of classes, while other adversarial UDA approaches do not.

3 Few-shot adversarial domain adaptation

In this section we describe the model we propose to address supervised domain adaptation (SDA). We are given a training dataset made of pairs Ds = {(x^s_i, y^s_i)}_{i=1}^{N}. The feature x^s_i ∈ X is a realization from a random variable X^s, and the label y^s_i ∈ Y is a realization from a random variable Y^s. In addition, we are also given the training data Dt = {(x^t_i, y^t_i)}_{i=1}^{M}, where x^t_i ∈ X is a realization from a random variable X^t, and the labels y^t_i ∈ Y. We assume that there is a covariate shift [51] between X^s and X^t, i.e., there is a difference between the probability distributions p(X^s) and p(X^t). We say that X^s represents the source domain and that X^t represents the target domain. Under these settings the goal is to learn a prediction function f : X → Y that during testing is going to perform well on data from the target domain.

The problem formulated thus far is typically referred to as supervised domain adaptation. In this work we are especially concerned with the version of this problem where only very few target labeled samples per class are available. We aim at handling cases where there is only one target labeled sample, and there can even be some classes with no target samples at all.

Algorithm 1 FADA algorithm
1: Train g and h on Ds using (1).
2: Uniformly sample G1, G3 from Ds × Ds.
3: Uniformly sample G2, G4 from Ds × Dt.
4: Train DCD w.r.t. gt = gs = g using (3).
5: while not convergent do
6:   Update g and h by minimizing (5).
7:   Update DCD by minimizing (3).
8: end while

In absence of covariate shift a visual classifier f is trained by minimizing a classification loss

L_C(f) = E[ℓ(f(X^s), Y)] ,    (1)

where E[·] denotes statistical expectation and ℓ could be any appropriate loss function. When the distributions of X^s and X^t are different, a deep model fs trained with Ds will have reduced performance on the target domain. Increasing it would be trivial by simply training a new model ft with data Dt. However, Dt is small and deep models require large amounts of labeled data.

In general, f could be modeled by the composition of two functions, i.e., f = h ∘ g. Here g : X → Z would be an inference from the input space X to a feature or inference space Z, and h : Z → Y would be a function for predicting from the feature space.
With this notation we would have fs = hs ∘ gs and ft = ht ∘ gt, and the SDA problem would be about finding the best approximation for gt and ht, given the constraints on the available data.

If gs and gt are able to embed source and target samples, respectively, to a domain invariant space, it is safe to assume from the feature to the label space that ht = hs = h. Therefore, domain adaptation paradigms are looking for such inference functions so that they can use the prediction function hs for target samples.

Traditional unsupervised DA (UDA) paradigms try to align the distributions of the features in the feature space, mapped from the source and the target domains using a metric between distributions, Maximum Mean Discrepancy [20] being a popular one, and other metrics like the Kullback–Leibler [29] and Jensen–Shannon [18] divergences becoming popular when using adversarial learning. Once they are aligned, a classifier function would no longer be able to tell whether a sample is coming from the source or the target domain. Recent UDA paradigms try to find inference functions to satisfy this important goal using adversarial learning. Adversarial training looks for a domain discriminator D that is able to distinguish between samples of source and target distributions. In this case D is a binary classifier trained with the standard cross-entropy loss

L_{adv-D}(X^s, X^t, g_s, g_t) = −E[log(D(g_s(X^s)))] − E[log(1 − D(g_t(X^t)))] .    (2)

Once the discriminator is learned, adversarial learning tries to update the target inference function gt in order to confuse the discriminator.
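As a concrete illustration, a Monte Carlo estimate of the discriminator loss in (2) can be written directly from the formula. This is a sketch under the assumption that the discriminator outputs D(g_s(x)) and D(g_t(x)) are already available as probabilities in (0, 1); the function name is illustrative.

```python
import math

def adversarial_discriminator_loss(d_source, d_target):
    """Empirical estimate of Eq. (2): the discriminator is trained to
    push its output toward 1 on embedded source samples and toward 0
    on embedded target samples."""
    loss_src = -sum(math.log(p) for p in d_source) / len(d_source)
    loss_tgt = -sum(math.log(1.0 - p) for p in d_target) / len(d_target)
    return loss_src + loss_tgt
```

A perfectly confused discriminator, outputting 0.5 everywhere, yields the value 2 log 2, the saddle point of the adversarial game.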
In other words, the adversarial training is looking for an inference function gt that is able to map a target sample to a feature space such that the discriminator D will no longer distinguish it from a source sample.

From the above discussion it is clear that UDA needs to align the distributions effectively in order to be successful. This can happen only if distributions are represented by a sufficiently large dataset. Therefore, UDA approaches are in a position of weakness when we assume Dt to be small. Moreover, UDA approaches have another intrinsic limitation; even with perfect confusion alignment, there is no guarantee that samples from different domains but with the same class label will map nearby in the feature space. This lack of semantic alignment is a major source of performance reduction.

3.1 Handling Scarce Target Data

We are interested in the case where very few labeled target samples (as low as 1 sample per class) are available. We are facing two challenges in this setting. First, since the size of Dt is small, we need to find a way to augment it. Second, we need to somehow use the label information of Dt. Therefore, we create pairs of samples. In this way, we are able to alleviate the lack of training target samples by pairing them with each training source sample. In [38], we have shown that creating positive and negative pairs using source and target data is very effective for SDA. Since the method proposed in [38] does not encode the domain information of the samples, it cannot be used in adversarial learning.
Here we extend [38] by creating four groups of pairs (Gi, i = 1, 2, 3, 4) as follows: we break down the positive pairs into two groups (groups 1 and 2), where pairs of the first group consist of samples from the source distribution with the same class labels, while pairs of the second group also have the same class label but come from different distributions (one from the source and one from the target distribution). This is important because we can encode both label and domain information of training samples. Similarly, we break down the negative pairs into two groups (groups 3 and 4), where pairs of the third group consist of samples from the source distribution with different class labels, while pairs of the fourth group come from different class labels and different distributions (one from the source and one from the target distribution). See Figure 1. In order to give each group the same amount of members we use all possible pairs from G2, as it is the smallest, and then uniformly sample from the pairs in G1, G3, and G4 to match the size of G2. Any reasonable ratio between the numbers of pairs can also be used.

In classical adversarial learning we would at this point learn a domain discriminator, but since we have semantic information to consider as well, we are interested in learning a multi-class discriminator (we call it the domain-class discriminator, DCD) in order to introduce semantic alignment of the source and target domains. By expanding the binary classifier to its multiclass equivalent, we can train a classifier that will evaluate which of the 4 groups a given sample pair belongs to. We model the DCD with 2 fully connected layers with a softmax activation in the last layer, which we can train with the standard categorical cross-entropy loss

L_{FADA-D} = −E[ Σ_{i=1}^{4} y_{G_i} log(D(φ(G_i))) ] ,    (3)

where y_{G_i} is the label of G_i and D is the DCD function.
φ is a symbolic function that takes a pair as input and outputs the concatenation of the results of the appropriate inference functions. The output of φ is passed to the DCD (Figure 2).

In the second step, we are interested in updating gt in order to confuse the DCD in such a way that the DCD can no longer distinguish between groups 1 and 2, and also between groups 3 and 4, using the loss

L_{FADA-g} = −E[ y_{G_1} log(D(φ(G_2))) + y_{G_3} log(D(φ(G_4))) ] .    (4)

(4) is inspired by the non-saturating game [17] and will force the inference function gt to embed target samples in a space such that the DCD will no longer be able to distinguish between them.

Connection with multi-class discriminators: Consider an image generation task where training samples come from k classes. Learning the image generator can be done with any standard k-class classifier by adding generated samples as a new class (generated class) and correspondingly increasing the dimension of the classifier output from k to k + 1. During the adversarial learning, only the generated class is confused. This has proven effective for image generation [48] and other tasks. However, this is different than the proposed DCD, where group 1 is confused with 2, and group 3 is confused with 4. Inspired by [48], we are able to create a k + 4 classifier to also guarantee a high classification accuracy. Therefore, we suggest that (4) needs to be minimized together with the main classifier loss

L_{FADA-g} = −γ E[ y_{G_1} log(D(φ(G_2))) + y_{G_3} log(D(φ(G_4))) ] + E[ℓ(f(X^s), Y)] + E[ℓ(f(X^t), Y)] ,    (5)

where γ strikes the balance between classification and confusion.
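The per-pair form of losses (3) and (4) can be sketched as follows. This is illustrative Python: the DCD output is assumed to be a 4-way softmax vector, with indices 0 to 3 standing for G1 to G4, and the function names are assumptions.

```python
import math

def dcd_loss(dcd_probs, group):
    """Categorical cross-entropy of Eq. (3) for a single pair:
    `dcd_probs` is the DCD softmax output over the four groups and
    `group` is the true group index (0..3 for G1..G4)."""
    return -math.log(dcd_probs[group])

def confusion_loss(dcd_probs_g2, dcd_probs_g4):
    """Confusion term of Eq. (4) for one G2 pair and one G4 pair:
    the G2 pair is scored against the G1 label and the G4 pair against
    the G3 label, so minimizing this w.r.t. the inference function makes
    cross-domain pairs indistinguishable from within-source pairs."""
    return -math.log(dcd_probs_g2[0]) - math.log(dcd_probs_g4[2])
```

Note the asymmetry: the DCD minimizes `dcd_loss` with the true group labels, while the inference function minimizes `confusion_loss`, which deliberately uses the wrong (within-source) labels for the cross-domain groups.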
Misclassifying pairs from group 2 as group 1, and likewise for groups 4 and 3, means that the DCD is no longer able to distinguish positive or negative pairs of different distributions from positive or negative pairs of the source distribution, while the classifier is still able to discriminate positive pairs from negative pairs. This simultaneously satisfies the two main goals of SDA, domain confusion and class separability in the feature space. UDA only looks for domain confusion and does not address class separability, because of the lack of labeled target samples.

Table 1: MNIST-USPS-SVHN datasets. Classification accuracy for domain adaptation over the MNIST, USPS, and SVHN datasets. M, U, and S stand for the MNIST, USPS, and SVHN domains. LB is our base model without adaptation. FT and FADA stand for fine-tuning and our method, respectively.

           LB     Traditional UDA           Adversarial UDA
                  [60]    [15]    [45]      [33]    [49]    [59]
M → U      65.4   47.8    60.7    91.8      91.2    92.5    89.4
U → M      58.6   63.1    67.3    73.7      89.1    90.8    90.1
S → M      60.1   -       -       82.0      -       84.7    76.0
M → S      20.3   -       -       40.1      -       36.4    -
S → U      66.0   -       -       -         -       -       -
U → S      15.3   -       -       -         -       -       -

SDA               n=1     n=2     n=3     n=4     n=5     n=6     n=7
M → U   FT        82.3    84.9    85.7    86.5    87.2    88.4    88.6
        [38]      85.0    89.0    90.1    91.4    92.4    93.0    92.9
        FADA      89.1    91.3    91.9    93.3    93.4    94.0    94.4
U → M   FT        72.6    78.2    81.9    83.1    83.4    83.6    84.0
        [38]      78.4    82.2    85.8    86.1    88.8    89.6    89.4
        FADA      81.1    84.2    87.5    89.9    91.1    91.2    91.5
S → M   FT        65.5    68.6    70.7    73.3    74.5    74.6    75.4
        FADA      72.8    81.8    82.6    85.1    86.1    86.8    87.2
M → S   FT        29.7    31.2    36.1    36.7    38.1    38.3    39.1
        FADA      37.7    40.5    42.9    46.3    46.1    46.8    47.0
S → U   FT        69.4    71.8    74.3    76.2    78.1    77.9    78.9
        FADA      78.3    83.2    85.2    85.7    86.2    87.1    87.5
U → S   FT        19.9    22.2    22.8    24.6    25.4    25.4    25.6
        FADA      27.5    29.8    34.5    36.0    37.9    41.3    42.9

Connection with conditional GANs: Concatenation of outputs of different inferences has been done before in conditional GANs. For example, [43, 44, 64] concatenate the input text to the penultimate layers of the discriminators. [25] concatenates positive and negative pairs before passing them to the discriminator. However, all of them use the vanilla binary discriminator.

Relationship between gs and gt: There is no restriction on gs and gt; they can be constrained or unconstrained. An obvious choice of constraint is equality (weight-sharing), which makes the inference functions symmetric. This can be seen as a regularizer and will reduce overfitting [38]. Another approach would be learning an asymmetric inference function [45]. Since we have access to very few target samples, we use weight-sharing (gs = gt = g).

Choice of gs, gt, and h: Since we are interested in visual recognition, the inference functions gs and gt are modeled by a convolutional neural network (CNN) with some initial convolutional layers, followed by some fully connected layers, which are described specifically in the experiments section. In addition, the prediction function h is modeled by fully connected layers with a softmax activation function for the last layer.

Training Process: Here we discuss the training process for the weight-sharing regularizer (gs = gt = g). Once the inference function g and the prediction function h are chosen, FADA takes the following steps: First, g and h are initialized using the source dataset Ds. Then, the mentioned four groups of pairs should be created using Ds and Dt. The next step is training the DCD using the four groups of pairs. This should be done by freezing g.
In the next step, the inference function g and prediction function h should be updated in order to confuse the DCD and maintain a high classification accuracy. This should be done by freezing the DCD. See Algorithm 1 and Figure 2. The training process for the non-weight-sharing case can be derived similarly.

4 Experiments

We present results using the Office dataset [47], the MNIST dataset [32], the USPS dataset [24], and the SVHN dataset [40].

4.1 MNIST-USPS-SVHN Datasets

The MNIST (M), USPS (U), and SVHN (S) datasets have recently been used for domain adaptation [12, 45, 59]. They contain images of digits from 0 to 9 in various different environments, including in the wild in the case of SVHN [40]. We considered six cross-domain tasks. The first two tasks, M → U and U → M, followed the experimental setting in [12, 45, 33, 59, 49], which involves randomly selecting 2000 images from MNIST and 1800 images from USPS. For the rest of the cross-domain tasks, M → S, S → M, U → S, and S → U, we used all training samples of the source domain for training and all testing samples of the target domain for testing.

Table 2: Office dataset. Classification accuracy for domain adaptation over the 31 categories of the Office dataset. A, W, and D stand for the Amazon, Webcam, and DSLR domains. LB is our base model without adaptation.

          LB           Unsupervised Methods                   Supervised Methods
                       [60]         [34]         [15]         [58]         [27]         [38]         FADA
A → W     61.2 ± 0.9   61.8 ± 0.4   68.5 ± 0.4   68.7 ± 0.3   82.7 ± 0.8   84.5 ± 1.7   88.2 ± 1.0   88.1 ± 1.2
A → D     62.3 ± 0.8   64.4 ± 0.3   67.0 ± 0.4   67.1 ± 0.3   86.1 ± 1.2   86.3 ± 0.8   89.0 ± 1.2   88.2 ± 1.0
W → A     51.6 ± 0.9   52.2 ± 0.4   53.1 ± 0.3   54.09 ± 0.5  65.0 ± 0.5   65.7 ± 1.7   72.1 ± 1.0   71.1 ± 0.9
W → D     95.6 ± 0.7   98.5 ± 0.4   99.0 ± 0.2   99.0 ± 0.2   97.6 ± 0.2   97.5 ± 0.7   97.6 ± 0.4   97.5 ± 0.6
D → A     58.5 ± 0.8   52.1 ± 0.8   54.0 ± 0.4   56.0 ± 0.5   66.2 ± 0.3   66.5 ± 1.0   71.8 ± 0.5   68.1 ± 0.6
D → W     80.1 ± 0.6   95.0 ± 0.5   96.0 ± 0.3   96.4 ± 0.3   95.7 ± 0.5   95.5 ± 0.6   96.4 ± 0.8   96.4 ± 0.8
Average   68.2         70.6         72.9         73.6         82.2         82.6         85.8         84.9

Since [12, 45, 33, 59, 49] introduced unsupervised methods, they used all samples of a target domain as unlabeled data in training. Here instead, we randomly selected n labeled samples per class from target domain data and used them in training. We evaluated our approach for n ranging from 1 to 7 and repeated each experiment 10 times (we only show the mean of the accuracies for this experiment because the standard deviation is very small).

Since the images of the USPS dataset have 16 × 16 pixels, we resized the images of the MNIST and SVHN datasets to 16 × 16 pixels. We assume gs and gt share weights (g = gs = gt) for this experiment.
Similar to [32], we used 2 convolutional layers with 6 and 16 filters of 5 × 5 kernels, each followed by a max-pooling layer, and 2 fully connected layers with sizes 120 and 84 as the inference function g, and one fully connected layer with softmax activation as the prediction function h. Also, we used 2 fully connected layers with sizes 64 and 4 as the DCD (the 4-group classifier). Training for each stage was done using the Adam optimizer [26]. We compare our method with 1 SDA method, under the same conditions, and 6 recent UDA methods. UDA methods use all target samples in their training stage, while we only use very few labeled target samples per category in training.

Table 1 shows the classification accuracies across a range for the number of target samples available in training (n = 1, . . . , 7). FADA works well even when only one target sample per category (n = 1) is available in training. We can get accuracies comparable with the state of the art using only 10 labeled target samples (one sample per class, n = 1) instead of using more than thousands of unlabeled target samples. We also report the lower bound (LB) of our model, which corresponds to training the base model using only source samples. Moreover, we report the accuracies obtained by fine-tuning (FT) the base model on the available target data and also the recent work presented in [38]. Although Table 1 shows that FT increases the accuracies over LB, it has reduced performance compared to SDA methods.

Figure 3 shows how much improvement can be obtained with respect to the base model. The base model is the lower bound LB. This is simply obtained by training g and h with only the classification loss and source training data; so, no adaptation is performed.

Weight-Sharing. As we discussed earlier, weight-sharing can be seen as a regularizer that prevents the target network gt from overfitting.
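As an aside, the spatial bookkeeping of the digit architecture above can be checked with a few lines of arithmetic. This sketch assumes 'valid' (no padding) convolutions and stride-2 2 × 2 pooling, details the text does not state explicitly.

```python
def conv_out(size, kernel, stride=1):
    """Spatial size after a 'valid' (no padding) convolution or pooling."""
    return (size - kernel) // stride + 1

def g_feature_map_size(input_size=16):
    """Trace the spatial size through the inference function g described
    above: two 5x5 'valid' convolutions, each followed by 2x2 max-pooling
    (padding and pooling stride are assumptions, not stated in the text)."""
    s = conv_out(input_size, 5)   # conv1 (6 filters):  16 -> 12
    s = conv_out(s, 2, stride=2)  # pool1:              12 -> 6
    s = conv_out(s, 5)            # conv2 (16 filters):  6 -> 2
    s = conv_out(s, 2, stride=2)  # pool2:               2 -> 1
    return s
```

Under these assumptions the 16 × 16 inputs leave a 1 × 1 map per filter after the second pooling layer, so the 120-unit fully connected layer would see a 16-dimensional flattened feature.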
This is important because gt can easily overfit since target data is scarce. We repeated the experiment for U → M with n = 5 without sharing weights. This provides an average accuracy of 84.1 over 10 repetitions, which is less than in the weight-sharing case.

4.2 Office Dataset

The Office dataset is a standard benchmark dataset for visual domain adaptation. It contains 31 object classes for three domains: Amazon, Webcam, and DSLR, indicated as A, W, and D, for a total of 4,652 images. The first domain, A, consists of images downloaded from online merchants; the second, W, consists of low resolution images acquired by webcams; the third, D, consists of high resolution images collected with digital SLRs. We consider four domain shifts using the three domains (A → W, A → D, W → A, and D → A). Since there is not a considerable domain shift between W and D, we exclude W → D and D → W.

Figure 3: MNIST-USPS-SVHN summary. The lower bar of each column represents the LB as reported in Table 1 for the corresponding domain pair. The middle bar is the improvement of fine-tuning FT the base model using the available target data reported in Table 1. The top bar is the improvement of FADA over FT, also reported in Table 1.

We followed the setting described in [58]. All classes of the Office dataset and 5 train-test splits are considered. For the source domain, 20 examples per category for the Amazon domain, and 8 examples per category for the DSLR and Webcam domains, are randomly selected for training for each split. Also, 3 labeled examples are randomly selected for each category in the target domain for training for each split. The rest of the target samples are used for testing. Note that we used the same splits generated by [58].

In addition to the SDA algorithms, we report the results of some recent UDA algorithms.
They follow a different experimental protocol than the SDA algorithms, using all samples of the target domain in training as unlabeled data, together with all samples of the source domain. Thus, an exact comparison between results is not possible. However, since UDA algorithms use all target samples in training while we use only very few of them (3 per class), we believe it is still worth looking at how they differ.

Here we are interested in the case where gs and gt share weights (gs = gt = g). For the inference function g, we used the convolutional layers of the VGG-16 architecture [53], followed by 2 fully connected layers with output sizes of 1024 and 128, respectively. For the prediction function h, we used a fully connected layer with softmax activation. Similar to [58], we used the weights pre-trained on the ImageNet dataset [46] for the convolutional layers, and initialized the fully connected layers using all the source domain data. We model the DCD with 2 fully connected layers with a softmax activation in the last layer.

Table 2 reports the classification accuracy over the 31 classes of the Office dataset and shows that FADA performs comparably to the state of the art.

5 Conclusions

We have introduced a deep model combining a classification and an adversarial loss to address SDA in the few-shot learning regime. We have shown that adversarial learning can be augmented to address SDA. The approach is general in the sense that the architecture sub-components can be changed. We found that addressing the semantic distribution alignments with point-wise surrogates of distribution distances and similarities works very effectively for SDA, even when labeled target samples are very few. In addition, we found the SDA accuracy to converge very quickly as more labeled target samples per category become available.
The approach shows clear promise, as it sets new state-of-the-art performance in the experiments.

[Figure 3, panels (a) M→U, (b) M→S, (c) U→S, (d) U→M, (e) S→U, (f) S→M: accuracy (%) for n = 1, . . . , 7, with stacked LB, FT, and FADA bars.]

References

[1] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.

[2] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In IEEE ICCV, pages 769–776, 2013.

[3] C. J. Becker, C. M. Christoudias, and P. Fua. Non-linear domain adaptation with boosting. In Advances in Neural Information Processing Systems, pages 485–493, 2013.

[4] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems, pages 181–189, 2010.

[5] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics, 2006.

[6] L. Chen, W. Li, and D. Xu. Recognizing RGB images by learning from RGB-D data. In CVPR, pages 1418–1425, June 2014.

[7] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.

[8] H. Daume III and D. Marcu.
Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.

[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

[10] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1375–1381. IEEE, 2009.

[11] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In IEEE ICCV, pages 2960–2967, 2013.

[12] B. Fernando, T. Tommasi, and T. Tuytelaars. Joint cross-domain classification and subspace learning for unsupervised adaptation. Pattern Recognition Letters, 2015.

[13] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

[14] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[15] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.

[16] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.

[17] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[19] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE ICCV, pages 999–1006, 2011.

[20] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In NIPS, 2006.

[21] Y. Guo and M. Xiao. Cross language text classification via subspace co-regularized multi-view learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 – July 1, 2012, 2012.

[22] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.

[23] J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1875–1882, June 2014.

[24] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.

[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[27] P. Koniusz, Y. Tas, and F. Porikli. Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. arXiv preprint arXiv:1611.08195, 2016.

[28] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792.
IEEE, 2011.

[29] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[30] B. Kumar, G. Carneiro, I. Reid, et al. Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5385–5394, 2016.

[31] M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.

[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[33] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.

[34] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, pages 97–105, 2015.

[35] M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. S. Yu. Transfer sparse coding for robust image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 407–414, 2013.

[36] S. Motiian and G. Doretto. Information bottleneck domain adaptation with privileged information for visual recognition. In European Conference on Computer Vision, pages 630–647. Springer, 2016.

[37] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Information bottleneck learning using privileged information for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1496–1505, 2016.

[38] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[39] K. Muandet, D.
Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In ICML (1), pages 10–18, 2013.

[40] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[41] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE TNN, 22(2):199–210, 2011.

[42] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, et al. Dataset issues in object recognition. In Toward Category-Level Object Recognition, pages 29–48. Springer, 2006.

[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, pages 1060–1069. JMLR.org, 2016.

[44] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.

[45] A. Rozantsev, M. Salzmann, and P. Fua. Beyond sharing weights for deep domain adaptation. arXiv preprint arXiv:1603.06432, 2016.

[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

[47] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 213–226, 2010.

[48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[49] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R.
Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. arXiv preprint arXiv:1704.01705, 2017.

[50] N. Sarafianos, M. Vrigkas, and I. A. Kakadiaris. Adaptive SVM+: Learning with privileged information for domain adaptation. arXiv preprint arXiv:1708.09083, 2017.

[51] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[52] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[54] B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops, pages 443–450. Springer, 2016.

[55] T. Tommasi, M. Lanzi, P. Russo, and B. Caputo. Learning the roots of visual domain shift. In Computer Vision–ECCV 2016 Workshops, pages 475–482. Springer, 2016.

[56] T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars. A deeper look at dataset bias. In German Conference on Pattern Recognition, pages 504–516. Springer, 2015.

[57] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528, 2011.

[58] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.

[59] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[60] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell.
Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

[61] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision, pages 135–153. Springer, 2016.

[62] J. Yang, R. Yan, and A. G. Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 69–76. IEEE, 2007.

[63] T. Yao, Y. Pan, C.-W. Ngo, H. Li, and T. Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[64] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.