{"id":6428,"date":"2024-08-09T11:01:01","date_gmt":"2024-08-09T03:01:01","guid":{"rendered":""},"modified":"2024-08-09T11:01:01","modified_gmt":"2024-08-09T03:01:01","slug":"\u5143\u5b66\u4e60 \u8fc1\u79fb\u5b66\u4e60_\u5143\u5b66\u4e60\u5c31\u662f\u60a8\u6240\u9700\u8981\u7684","status":"publish","type":"post","link":"https:\/\/mushiming.com\/6428.html","title":{"rendered":"\u5143\u5b66\u4e60 \u8fc1\u79fb\u5b66\u4e60_\u5143\u5b66\u4e60\u5c31\u662f\u60a8\u6240\u9700\u8981\u7684"},"content":{"rendered":"
\n

Meta-Learning and Transfer Learning<\/p>\n

\n
\n
\n
\n

Update<\/strong>: This post is part of a blog series on Meta-Learning that I\u2019m working on. Check out <\/em>part 1 and <\/em>part 2.<\/em><\/p>\n


Neural networks have been highly influential in the machine learning community over the past decades, thanks to the rise of computing power, the abundance of unstructured data, and the advancement of algorithmic solutions. However, researchers still have a long way to go before neural networks can be fully used in real-world settings where data is scarce and the requirements for model accuracy and speed are critical.<\/p>\n


Meta-learning<\/strong>, also known as learning how to learn<\/em>, has recently emerged as a potential learning paradigm that can absorb information from one task and generalize it proficiently to unseen tasks. During this quarantine time, I started watching lectures from Stanford\u2019s CS 330 class on Deep Multi-Task and Meta-Learning, taught by the brilliant Chelsea Finn. Drawing on her lectures, this blog post attempts to answer these key questions:<\/p>\n


  1. Why do we need meta-learning?<\/li>\n

  2. How does the math of meta-learning work?<\/li>\n

  3. What are the different approaches to designing a meta-learning algorithm?<\/li>\n<\/ol><\/div>\n<\/p><\/div>\n

    \n
    \n

    Note:<\/strong> The content of this post is primarily based on CS330\u2019s <\/em>lecture one on problem definitions, <\/em>lecture two on supervised and black-box meta-learning, <\/em>lecture three on optimization-based meta-learning, and <\/em>lecture four on few-shot learning via metric learning. They are all accessible to the public.<\/em><\/p>\n


    1 \u2014 Motivation For Meta-Learning<\/span><\/h2>\n

    Thanks to the advancement in algorithms, data, and compute power in the past decade, deep neural networks have allowed us to handle unstructured data (such as images, text, audio, and video) very well without the need to engineer features by hand. Empirical research has shown that neural networks can generalize very well if we feed them large and diverse inputs. For example, Transformers and GPT-2 made waves in the Natural Language Processing research community last year with their broad applicability to various tasks.<\/p>\n


    However, there is a catch with using neural networks in the real-world setting where:<\/p>\n


      \n
    • \n

      Large datasets are unavailable<\/strong>: This issue is common in many domains ranging from classification of rare diseases to translation of uncommon languages. It is impractical to learn from scratch for each task in these scenarios.<\/p>\n

<\/li>\n

    • \n

      Data has a long tail<\/strong>: This issue can easily break the standard machine learning paradigm. For example, an autonomous vehicle can be trained to handle everyday situations very well. Still, it often struggles with uncommon conditions (such as people jay-walking, animals crossing, or traffic lights not working) that humans can comfortably handle. This can lead to awful outcomes, such as Uber\u2019s accident in Arizona a few years ago.<\/p>\n

<\/li>\n

    • \n

      We want to quickly learn something about a new task without training our model from scratch:<\/strong> Humans can do this quite easily by leveraging our prior experience. For example, if I know a bit of Spanish, then it should not be too difficult for me to learn Italian, as these two languages are quite similar linguistically.<\/p>\n

<\/li>\n<\/ul>\n

      In this article, I would like to give an introductory overview of meta-learning, which is a learning framework that can help our neural network become more effective in the settings mentioned above. In this setup, we want our system to learn a new task more proficiently \u2014 assuming that it is given access to data on previous tasks.<\/p>\n


      Historically, there have been a few papers thinking in this direction.<\/p>\n


        \n
      • \n

        Back in 1992, Bengio et al. looked at the possibility of a learning rule that can solve new tasks.<\/p>\n

<\/li>\n

      • \n

        In 1997, Rich Caruana wrote a survey about multi-task learning<\/strong>, which is a variant of meta-learning. He explained how tasks could be learned in parallel using a shared representation between models and also presented a multi-task inductive transfer notion that uses back-propagation to handle additional tasks.<\/p>\n

<\/li>\n

      • \n

        In 1998, Sebastian Thrun explored the problem of lifelong learning<\/strong>, which is inspired by the ability of humans to exploit experiences that come from related learning tasks to generalize to new tasks.<\/p>\n

<\/li>\n<\/ul>\n

        Figure 1: The Domain-Adaptive Meta-Learning system described in Yu et al.
        <\/figcaption><\/figure>\n

        Right now is an exciting period to study meta-learning because it is increasingly becoming more fundamental in machine learning research. Many recent works have leveraged meta-learning algorithms (and their variants) to do well for the given tasks. A few examples include:<\/p>\n


          \n
        • \n

          Aharoni et al. expand the number of languages used in a multi-lingual neural machine translation setting from 2 to 102. Their method learns from a small number of languages and generalizes to a vast number of others.<\/p>\n

<\/li>\n

        • \n

          Yu et al. present Domain-Adaptive Meta-Learning (figure 1), a system that allows robots to learn from a single video of a human via prior meta-training data collected from related tasks.<\/p>\n

<\/li>\n

        • \n

          A recent paper from YouTube shows how their team used multi-task methods to make video recommendations and handle multiple competing ranking objectives.<\/p>\n

<\/li>\n<\/ul>\n

          Forward-looking, the development of meta-learning algorithms will help democratize deep learning and solve problems in domains with limited data.<\/p>\n


          2 \u2014 Basics of Meta-Learning<\/span><\/h2>\n

          In this section, I will cover the basics of meta-learning. Let\u2019s start with the mathematical formulation of supervised meta-learning.<\/p>\n


          2.1 \u2014 Formulation<\/span><\/h3>\n

          In a standard supervised learning, we want to maximize the likelihood of model parameters \u03d5 given the training data D:<\/p>\n


          \u03d5* = arg max_\u03d5 log p(\u03d5 | D)
          Equation 1<\/em>
          <\/figcaption><\/figure>\n

          Equation 1 can be redefined as maximizing the probability of the data given the parameters and maximizing the marginal probability of the parameters, where p(D|\u03d5) corresponds to the data likelihood, and p(\u03d5) corresponds to a regularizer term:<\/p>\n


          \u03d5* = arg max_\u03d5 [log p(D | \u03d5) + log p(\u03d5)]
          Equation 2
          <\/figcaption><\/figure>\n

          Equation 2 can be further broken down as follows, assuming that the data D consists of (input, label) pairs (x\u1d62, y\u1d62):<\/p>\n


          \u03d5* = arg max_\u03d5 [\u03a3\u1d62 log p(y\u1d62 | x\u1d62, \u03d5) + log p(\u03d5)]
          Equation 3
          <\/figcaption><\/figure>\n

          However, if we deal with massive data D (as in most cases with complicated problems), our model will likely overfit. Even if we have a regularizer term here, it might not be enough to prevent that from happening.<\/p>\n

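To make equation 3 concrete, here is a minimal sketch in which the model is a toy logistic regression and log p(\u03d5) is taken to be a Gaussian prior, i.e. an L2 penalty; the function names are mine, not from the lecture:

```python
import numpy as np

def log_likelihood(phi, X, y):
    """Sum of log p(y_i | x_i, phi) for a toy logistic-regression model."""
    p = 1.0 / (1.0 + np.exp(-(X @ phi)))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def objective(phi, X, y, lam=0.1):
    """Equation 3's objective: data log-likelihood plus log p(phi),
    with log p(phi) modeled as a Gaussian prior (an L2 penalty)."""
    return log_likelihood(phi, X, y) - lam * float(np.sum(phi ** 2))

# The regularizer can only subtract from the likelihood term.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 0.0])
phi = np.array([0.5, -0.5])
assert objective(phi, X, y) <= log_likelihood(phi, X, y)
```

As the text notes, with massive data this penalty alone may not prevent overfitting.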

          The critical problem that supervised meta-learning solves is: Is it feasible to get more data when dealing with supervised learning problems?<\/strong><\/p>\n


          Figure 2: The Meta-Learning setup described in \u201cOptimization as a Model for Few-Shot Learning.\u201d
          <\/figcaption><\/figure>\n

          Ravi and Larochelle\u2019s \u201cOptimization as a Model for Few-Shot Learning\u201d is the first paper that provides a standard formulation of the meta-learning setup, as seen in figure 2. They reframe equation 1 into equation 4 below, where D_{meta-train} is the meta-training data that allows our model to learn more efficiently. Here, D_{meta-train} corresponds to a set of datasets for predefined tasks D\u2081, D\u2082, \u2026, D\u2099:<\/p>\n


          Next, they design a set of meta-parameters \u03b8 = p(\u03b8|D_{meta-train}), which encode the information from D_{meta-train} necessary to solve new tasks.<\/p>\n


          \u03d5* = arg max_\u03d5 log p(\u03d5 | D, D_{meta-train})
          Equation 4
          <\/figcaption><\/figure>\n

          Mathematically speaking, with the introduction of this intermediary variable \u03b8, the full likelihood of parameters for the original data given the meta-training data (in equation 4) can be expressed as an integral over the meta-parameters \u03b8:<\/p>\n


          p(\u03d5 | D, D_{meta-train}) = \u222b p(\u03d5 | D, \u03b8) p(\u03b8 | D_{meta-train}) d\u03b8
          Equation 5
          <\/figcaption><\/figure>\n

          Equation 5 can be approximated further with a point estimate for our parameters:<\/p>\n


          log p(\u03d5 | D, D_{meta-train}) \u2248 log p(\u03d5 | D, \u03b8*) + log p(\u03b8* | D_{meta-train})
          Equation 6
          <\/figcaption><\/figure>\n
            \n
          • \n

            p(\u03d5|D, \u03b8*) is the adaptation<\/strong> task that computes task-specific parameters \u03d5 for a new task \u2014 assuming that it has access to the data D from that task and the meta-parameters \u03b8*.<\/p>\n

<\/li>\n

          • \n

            p(\u03b8* | D_{meta-train}) is the meta-training<\/strong> task that collects meta-parameters \u03b8 \u2014 assuming that it has access to the meta-training data D_{meta-train}.<\/p>\n

<\/li>\n<\/ul>\n

            To sum it up, the meta-learning paradigm can be broken down into two phases:<\/p>\n


              \n
            • \n

              The adaptation<\/strong> phase: \u03d5* = arg max log p(\u03d5|D,\u03b8*) (first term in equation 6)<\/p>\n

<\/li>\n

            • \n

              The meta-training<\/strong> phase: \u03b8* = arg max log p(\u03b8|D_{meta-train}) (second term in equation 6)<\/p>\n

<\/li>\n<\/ul>\n
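As a deliberately simplified sketch of the adaptation phase, the function below takes a few gradient steps on a toy regression task starting from the meta-parameters \u03b8, in the spirit of MAML (one possible instantiation, not the only one); the meta-training phase, which would tune \u03b8 across tasks, is omitted:

```python
import numpy as np

def adaptation(theta, D, lr=0.1, steps=20):
    """Phase 1 sketch: phi* = arg max log p(phi | D, theta*), instantiated
    as a few gradient steps on the new task's data, initialized at the
    meta-parameters theta."""
    X, y = D
    phi = theta.copy()
    for _ in range(steps):
        phi -= lr * X.T @ (X @ phi - y) / len(y)  # squared-error gradient
    return phi

# Toy task: y = 2*x0 - x1. A good theta* would make adaptation fast;
# here we only show that adaptation improves on the initialization.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))
y = X @ np.array([2.0, -1.0])
theta = np.zeros(2)  # stand-in for a meta-learned initialization
phi = adaptation(theta, (X, y))
assert np.mean((X @ phi - y) ** 2) < np.mean((X @ theta - y) ** 2)
```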

              2.2 \u2014 Loss Optimization<\/span><\/h3>\n

              Let\u2019s look at the optimization of the meta-learning method. Initially, our meta-training data consists of a pair of training and test sets for every task:<\/p>\n


              D_{meta-train} = {(D\u2081\u1d57\u02b3, D\u2081\u1d57\u02e2), \u2026, (D\u2099\u1d57\u02b3, D\u2099\u1d57\u02e2)}
              Equation 7
              <\/figcaption><\/figure>\n

              There are k feature-label pairs (x, y) in the training set D\u1d62\u1d57\u02b3 and l feature-label pairs (x, y) in the test set D\u1d62\u1d57\u02e2:<\/p>\n


              D\u1d62\u1d57\u02b3 = {(x\u2081, y\u2081), \u2026, (x\u2096, y\u2096)}, D\u1d62\u1d57\u02e2 = {(x\u2081, y\u2081), \u2026, (x\u2097, y\u2097)}
              Equation 8
              <\/figcaption><\/figure>\n
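A small helper makes the structure of equations 7 and 8 concrete; `make_episode` is a hypothetical name, and the k/l split mirrors the definitions above:

```python
import numpy as np

def make_episode(X, y, k=5, l=5, seed=0):
    """Split one task's data into the (D_i_tr, D_i_ts) pair of equations
    7-8: k (input, label) pairs for training, l disjoint pairs for test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, ts = idx[:k], idx[k:k + l]
    return (X[tr], y[tr]), (X[ts], y[ts])

X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
(Xtr, ytr), (Xts, yts) = make_episode(X, y)
assert len(Xtr) == 5 and len(Xts) == 5
assert set(ytr).isdisjoint(yts)  # train/test splits do not overlap
```

Repeating this over many tasks yields the D_{meta-train} of equation 7.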

              During the adaptation phase, we infer a set of task-specific parameters \u03d5* via a function f_{\u03b8*} that takes as input the training set D\u1d57\u02b3 and returns as output the task-specific parameters: \u03d5* = f_{\u03b8*} (D\u1d57\u02b3). Essentially, we want to learn a set of meta-parameters \u03b8 such that the function \u03d5\u1d62 = f_{\u03b8} (D\u1d62\u1d57\u02b3) is good enough for the test set D\u1d62\u1d57\u02e2.<\/p>\n


              During the meta-learning phase, to get the meta-parameters \u03b8*, we want to maximize the probability of the task-specific parameters \u03d5 being effective at new data points in the test set D\u1d62\u1d57\u02e2.<\/p>\n


              \u03b8* = arg max_\u03b8 \u03a3\u1d62 log p(\u03d5\u1d62 | D\u1d62\u1d57\u02e2), where \u03d5\u1d62 = f_\u03b8(D\u1d62\u1d57\u02b3)
              Equation 9
              <\/figcaption><\/figure>\n
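A minimal sketch of this meta-training objective, assuming a hypothetical closed-form adaptation rule (ridge regression shrunk toward \u03b8) and using negative mean squared test error in place of the log-likelihood:

```python
import numpy as np

def adapt(theta, D_tr, lam=1.0):
    """phi_i = f_theta(D_i_tr): a hypothetical closed-form adaptation rule,
    ridge regression shrunk toward the meta-parameters theta."""
    X, y = D_tr
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y + lam * theta)

def meta_objective(theta, tasks):
    """Equation 9 sketch: sum over tasks of the adapted parameters'
    test-set score, with log p(phi_i | D_i_ts) stood in by negative
    mean squared error."""
    score = 0.0
    for D_tr, (X_ts, y_ts) in tasks:
        phi = adapt(theta, D_tr)
        score -= float(np.mean((X_ts @ phi - y_ts) ** 2))
    return score

# Two related tasks sharing true weights w: a theta near w scores higher.
rng = np.random.default_rng(1)
w = np.array([1.0, -2.0])
tasks = []
for _ in range(2):
    X_tr, X_ts = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
    tasks.append(((X_tr, X_tr @ w), (X_ts, X_ts @ w)))
assert meta_objective(w, tasks) > meta_objective(np.zeros(2), tasks)
```

Meta-training then amounts to searching for the \u03b8 that maximizes this score.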

              2.3 \u2014 Meta-Learning Paradigm<\/span><\/h3>\n

              According to Chelsea Finn, there are two views of the meta-learning problem: a deterministic view and a probabilistic view.<\/p>\n


              The deterministic view<\/strong> is straightforward: we take as input a training data set D\u1d57\u02b3, a test data point x_test, and the meta-parameters \u03b8 to produce y_test, the label corresponding to that test input. We learn this function via D_{meta-train}, as discussed earlier.<\/p>\n


              y_test = f_\u03b8(D\u1d57\u02b3, x_test)
              Equation 10
              <\/figcaption><\/figure>\n
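A toy stand-in for this deterministic view: the function below maps a training set and a test input directly to a predicted label, with a kernel bandwidth playing the role of \u03b8 (a real meta-learner would parameterize f with a neural network; the name and the kernel choice here are illustrative assumptions):

```python
import numpy as np

def f_theta(D_tr, x_test, theta=10.0):
    """Equation 10 sketch: predict y_test from (D_tr, x_test) in one shot,
    via a kernel-weighted average over the training set; theta is a
    bandwidth standing in for the meta-parameters."""
    X, y = D_tr
    w = np.exp(-theta * np.sum((X - x_test) ** 2, axis=1))
    return float(np.sum(w * y) / np.sum(w))

X = np.array([[0.0], [1.0]])
y = np.array([0.0, 1.0])
pred = f_theta((X, y), np.array([0.1]), theta=50.0)
assert abs(pred) < 0.01  # prediction follows the nearby support example
```

This style of direct mapping foreshadows the non-parametric, metric-based approaches discussed later.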

              The probabilistic view<\/strong> incorporates Bayesian inference: we perform a maximum likelihood inference over the task-specific parameters \u03d5\u1d62 \u2014 assuming that we have the training dataset D\u1d62\u1d57\u02b3 and a set of meta-parameters \u03b8:<\/p>\n


              \u03d5\u1d62 = arg max_\u03d5 log p(\u03d5 | D\u1d62\u1d57\u02b3, \u03b8)
              Equation 11
              <\/figcaption><\/figure>\n

              Regardless of the view, there are two steps to design a meta-learning algorithm:<\/p>\n

              \u65e0\u8bba\u54ea\u79cd\u89c6\u56fe\uff0c\u90fd\u6709\u4e24\u4e2a\u6b65\u9aa4\u6765\u8bbe\u8ba1\u5143\u5b66\u4e60\u7b97\u6cd5\uff1a <\/p>\n

• Step 1 is to create the function p(ϕᵢ | Dᵢᵗʳ, θ) during the adaptation phase.

• Step 2 is to optimize θ with respect to D_{meta-train} during the meta-training phase.

In this post, I will focus only on the deterministic view of meta-learning. In the remaining sections, I cover the three different approaches to building a meta-learning algorithm: (1) the black-box approach, (2) the optimization-based approach, and (3) the non-parametric approach. More specifically, I will go over their formulations, the architectures used, and the challenges associated with each method.

3 — Black-Box Meta-Learning

3.1 — Formulation

The black-box meta-learning approach uses a neural network architecture to generate the distribution p(ϕᵢ | Dᵢᵗʳ, θ).

• Our task-specific parameters are: ϕᵢ = f_θ(Dᵢᵗʳ).

• A neural network with meta-parameters θ (denoted f_θ) takes the training data Dᵢᵗʳ as input and returns the task-specific parameters ϕᵢ as output.

• Another neural network (denoted g(ϕᵢ)) takes the task-specific parameters ϕᵢ as input and returns predictions for the test data points Dᵢᵗˢ as output.

During optimization, we maximize the log-likelihood of the outputs of g(ϕᵢ) for all the test data points. This is applied across all the tasks in the meta-training set:

[Equation 12]
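Written out, the objective just described plausibly takes the following form. This is a reconstruction from the surrounding text, since the rendered equation is not available here:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{\mathcal{T}_i} \sum_{(x,\, y) \in D_i^{ts}} \log p\!\left(y \mid x;\ \phi_i\right),
\qquad \phi_i = f_{\theta}\!\left(D_i^{tr}\right)
```

That is, for every task we adapt on the training split, then score the adapted parameters on the test split and sum the log-likelihoods.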

The log-likelihood of g(ϕᵢ) in equation 12 is essentially the loss between a set of task-specific parameters ϕᵢ and a test data point Dᵢᵗˢ:

[Equation 13]

Then, in equation 12, we optimize the loss between the function f_θ(Dᵢᵗʳ) and the evaluation on the test set Dᵢᵗˢ:

[Equation 14]

This is the black-box meta-learning algorithm in a nutshell:

• We sample a task Tᵢ, as well as a training set Dᵢᵗʳ and a test set Dᵢᵗˢ, from the task dataset Dᵢ.

• We compute the task-specific parameters ϕᵢ given the training set Dᵢᵗʳ: ϕᵢ ← f_θ(Dᵢᵗʳ).

• Then, we update the meta-parameters θ using the gradient, with respect to θ, of the loss between the computed task-specific parameters ϕᵢ and Dᵢᵗˢ: ∇_θ L(ϕᵢ, Dᵢᵗˢ).

• This process is repeated iteratively with gradient descent optimizers.

3.2 — Challenges

The main challenge with this black-box approach arises when ϕᵢ happens to be massive. If ϕᵢ is the set of all the parameters of a very deep neural network, then outputting ϕᵢ is not scalable.

“One-Shot Learning with Memory Augmented Neural Networks” and “A Simple Neural Attentive Meta-Learner” are two research papers that tackle this. Instead of having a neural network that outputs all of the parameters ϕᵢ, they output a low-dimensional vector hᵢ, which is then used alongside the meta-parameters θ to make predictions. The new task-specific parameters ϕᵢ have the form ϕᵢ = {hᵢ, θ}, where θ represents all of the parameters other than h.

Overall, the general form of this black-box approach is as follows:

[Equation 15]

Here, yᵗˢ corresponds to the labels of the test data, xᵗˢ corresponds to the features of the test data, and Dᵢᵗʳ corresponds to pairs of training data.

3.3 — Architectures

So what are the different model architectures that can represent this function f?

[Figure 3 — The memory-augmented neural network, as described in “One-Shot Learning with Memory Augmented Neural Networks.”]
• Memory-Augmented Neural Networks by Santoro et al. uses Long Short-Term Memory and Neural Turing Machine architectures to represent f. Both architectures have an external memory mechanism to store information from the training data points and then access that information during inference in a differentiable way, as seen in figure 3.

• Conditional Neural Processes by Garnelo et al. represents f via three steps: (1) using a feed-forward neural network to encode the training data, (2) aggregating that encoded information, and (3) passing the aggregate to another feed-forward network for inference.

• Meta Networks by Munkhdalai and Yu uses external memory mechanisms with slow and fast weights, inspired by neuroscience, to represent f. Specifically, the slow weights are designed for the meta-parameters θ and the fast weights for the task-specific parameters ϕ.

[Figure 4 — The simple neural attentive meta-learner, as described in “A Simple Neural Attentive Meta-Learner.”]
• Neural Attentive Meta-Learner by Mishra et al. uses an attention mechanism to represent f. Such a mechanism allows the network to pick out the most important information that it gathers, making the optimization process much more efficient, as seen in figure 4.

In conclusion, the black-box meta-learning approach has high learning capacity. Given that neural networks are universal function approximators, a black-box meta-learning algorithm can represent any function of our training data. However, since neural networks are fairly complex and the learning process usually starts from scratch, the black-box approach typically requires a large amount of training data and a large number of tasks in order to perform well.

4 — Optimization-Based Meta-Learning

Okay, so how else can we represent the distribution p(ϕᵢ | Dᵢᵗʳ, θ) in the adaptation phase of meta-learning? If we want to infer all the parameters of our network, we can treat this as an optimization procedure. The key idea behind optimization-based meta-learning is to optimize the process of getting the task-specific parameters ϕᵢ so that we get good performance on the test set.

4.1 — Formulation

Recall that the meta-learning problem can be broken down into the two terms below: one that maximizes the likelihood of the training data given the task-specific parameters, and one that maximizes the likelihood of the task-specific parameters given the meta-parameters:

[Equation 16]

Here, the meta-parameters θ are pre-trained during training time and fine-tuned during test time. The equation below is a typical optimization procedure via gradient descent, where α is the learning rate:

[Equation 17]

To get the pre-trained parameters, we can use standard benchmark datasets such as ImageNet for computer vision, the Wikipedia Text Corpus for language processing, or any other large and diverse dataset we have access to. As expected, this approach becomes less effective with a small amount of training data.

Model-Agnostic Meta-Learning (MAML) from Finn et al. is an algorithm that addresses exactly this problem. Taking the optimization procedure in equation 17, it adjusts the loss so that only the best-performing task-specific parameters ϕ on the test data points are considered. This happens across all the tasks:

[Equation 18]

The key idea is to learn a single θ across all the assigned tasks so that θ transfers effectively via the optimization procedure.

This is the optimization-based meta-learning algorithm in a nutshell:

• We sample a task Tᵢ, as well as a training set Dᵢᵗʳ and a test set Dᵢᵗˢ, from the task dataset Dᵢ.

• We compute the task-specific parameters ϕᵢ given the training set Dᵢᵗʳ using the optimization procedure described above: ϕᵢ ← θ − α ∇_θ L(θ, Dᵢᵗʳ).

• Then, we update the meta-parameters θ using the gradient, with respect to θ, of the loss between the computed task-specific parameters ϕᵢ and Dᵢᵗˢ: ∇_θ L(ϕᵢ, Dᵢᵗˢ).

• This process is repeated iteratively with gradient descent optimizers.

As described in the previous section, the black-box meta-learning approach has the general form yᵗˢ = f_θ(Dᵢᵗʳ, xᵗˢ). The optimization-based MAML method described above has the similar form below, where ϕᵢ = θ − α ∇_θ L(θ, Dᵢᵗʳ):

[Equation 19]

To prove the effectiveness of the MAML algorithm, in “Meta-Learning and Universality,” Finn and Levine show that, for a sufficiently deep function f, the MAML algorithm can approximate any function of Dᵢᵗʳ and xᵗˢ. This finding demonstrates that the optimization-based MAML algorithm is as expressive as the black-box algorithms mentioned previously.

4.2 — Architectures
[Figure 5 — The probabilistic graphical model for which MAML provides an inference procedure, as described in “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes.”]

In “Recasting Gradient-Based Meta-Learning as Hierarchical Bayes,” Grant et al. provide another MAML formulation as a method for probabilistic inference via hierarchical Bayes. Say we have a graphical model as illustrated in figure 5, where J is the task, x_{j_n} is a data point in that task, ϕⱼ are the task-specific parameters, and θ are the meta-parameters.

To do inference with respect to this graphical model, we want to maximize the likelihood of the data given the meta-parameters:

[Equation 20]

The probability of the data given the meta-parameters can be expanded into the probability of the data given the task-specific parameters and the probability of the task-specific parameters given the meta-parameters. Thus, equation 20 can be rewritten as:

                          \n Equation 21
                          \n <\/figcaption><\/figure>\n
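In the same notation, the marginalization over the task-specific parameters in equation 21 can be plausibly reconstructed as:

```latex
p(\mathbf{X}_j \mid \theta)
  = \int p(\mathbf{X}_j \mid \phi_j)\, p(\phi_j \mid \theta)\, d\phi_j
```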

                          This integral in equation 21 can be approximated with a Maximum a Posteriori estimate for \u03d5\u2c7c:<\/p>\n


                          \n Equation 22
                          \n <\/figcaption><\/figure>\n
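A plausible reconstruction of the MAP approximation in equation 22, which replaces the integral with a point estimate ϕ̂ⱼ:

```latex
p(\mathbf{X}_j \mid \theta) \approx p(\mathbf{X}_j \mid \hat{\phi}_j),
\qquad
\hat{\phi}_j = \arg\max_{\phi_j} \, p(\phi_j \mid \mathbf{X}_j, \theta)
```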

                          In order to compute this Maximum a Posteriori estimate, the paper shows that the inner-loop gradient descent corresponds to MAP inference under an implicit Gaussian prior<\/em> — with a mean determined by the initial parameters and a variance determined by the number of gradient steps and the step size.<\/p>\n


                          There have been other attempts to compute the Maximum a Posteriori estimate in equation 22:<\/p>\n


                            \n
                          • \n

                            Rajeswaran et al. propose an implicit MAML algorithm that uses gradient descent with an explicit Gaussian prior<\/em>. More specifically, they regularize the inner optimization to stay close to the meta-parameters θ: ϕ ← min_{ϕ'} L(ϕ', Dᵗʳ) + λ/2 ||θ − ϕ'||². The mean and the variance of this explicit Gaussian prior are a function of the λ regularizer.<\/p>\n

                            <\/li>\n<\/ul>\n
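As a rough illustration of Rajeswaran et al.'s regularized inner objective, the sketch below solves ϕ ← argmin_ϕ L(ϕ) + λ/2 ||θ − ϕ||² by plain gradient descent on a toy quadratic loss (the function names and the toy loss are illustrative, not from the paper's code):

```python
import numpy as np

def proximal_inner_update(theta, grad_loss, lam=1.0, lr=0.01, steps=500):
    """Approximately solve phi <- argmin_phi L(phi) + lam/2 * ||theta - phi||^2
    by gradient descent, starting from the meta-parameters theta."""
    phi = theta.copy()
    for _ in range(steps):
        # Gradient of the regularized objective: dL/dphi + lam * (phi - theta).
        phi = phi - lr * (grad_loss(phi) + lam * (phi - theta))
    return phi

# Toy task loss L(phi) = 0.5 * ||phi - target||^2, so its gradient is phi - target.
target = np.array([2.0, -1.0])
theta = np.zeros(2)
phi = proximal_inner_update(theta, lambda p: p - target, lam=1.0)
# With lam = 1, the regularized optimum sits halfway between theta and target.
```

The λ regularizer controls how far the task-specific solution may drift from the meta-initialization, which is what makes the Gaussian prior "explicit" here.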

                            \n Figure 6: The ALPaCA algorithm, which uses Bayesian linear regression, as described in “Meta-Learning Priors for Efficient Online Bayesian Regression.”
                            \n <\/figcaption><\/figure>\n
                              \n
                            • \n

                              Harrison et al. propose the ALPaCA algorithm, which runs an efficient Bayesian linear regression on top of the features learned in the inner optimization loop, and represents the mean and variance of that regression as meta-parameters themselves (illustrated in figure 6). Including this prior information reduces computational complexity and adds more confidence to the final predictions.<\/p>\n

                              <\/li>\n

                            • \n

                              Bertinetto et al. attempt to solve meta-learning with differentiable closed-form solutions. In particular, they apply a ridge regression as a base learner for the features in the inner optimization loop. The mean and variance predictions from the ridge regression are then used as meta-parameters in the outer optimization loop.<\/p>\n

                              <\/li>\n<\/ul><\/div>\n<\/p><\/div>\n
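A minimal sketch of a differentiable closed-form base learner in the spirit of Bertinetto et al.: a single ridge-regression solve over support-set features that some embedding network is assumed to have already produced (toy data, illustrative names):

```python
import numpy as np

def ridge_solve(features, targets, lam=0.1):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y.
    Because this is one linear-algebra step, the inner loop needs no
    iterative gradient descent and stays differentiable end-to-end."""
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(d),
                           features.T @ targets)

# Toy "support set": rows are feature vectors produced by some embedding.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0], [2.0], [3.0]])
W = ridge_solve(X, Y, lam=1e-6)
preds = X @ W  # with a tiny lam, the exactly-solvable system is recovered
```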

                              \n Figure 7: The MetaOptNet approach as described in “Meta-Learning with Differentiable Convex Optimization.”
                              \n <\/figcaption><\/figure>\n<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div>\n
                              \n
                              \n
                                \n
                              • \n

                                Lee et al. attempt to solve meta-learning with differentiable convex optimization solutions. The proposed method, called MetaOptNet<\/strong>, uses a support vector machine to learn the features from the inner optimization loop (as seen in figure 7).<\/p>\n

                                <\/li>\n<\/ul>\n

                                4.3 — Challenges<\/strong><\/span><\/h3>\n

                                The MAML method requires a very deep neural architecture in order to effectively get a good inner gradient update. Therefore, the first challenge lies in choosing that architecture.<\/strong> Kim et al. propose Auto-Meta, which searches for the MAML architecture. They found that highly non-standard architectures with deep and narrow layers tend to perform very well.<\/p>\n


                                The second challenge lies in the unreliability of the bi-level optimization paradigm<\/strong>. Many different optimization tricks can be useful in this scenario:<\/p>\n


                                  \n
                                • \n

                                  Li et al. propose Meta-SGD, which learns the initialization parameters, the direction of the gradient updates, and the value of the inner learning rate in an end-to-end fashion. This method has been shown to increase both the speed and the accuracy of the meta-learner.<\/p>\n

                                  <\/li>\n

                                • \n

                                  Behl et al. come up with Alpha-MAML, an extension of vanilla MAML that uses an online hyper-parameter adaptation scheme to automatically tune the learning rate, making the training process more robust.<\/p>\n

                                  <\/li>\n<\/ul><\/div>\n<\/p><\/div>\n
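Meta-SGD's inner update differs from vanilla MAML in that the step size α is a learned per-parameter vector rather than a fixed scalar; a minimal sketch with toy values (illustrative, not the paper's code):

```python
import numpy as np

def meta_sgd_inner_update(theta, alpha, grad):
    """Meta-SGD inner step: phi = theta - alpha * grad, where alpha has the
    same shape as theta and is learned jointly with theta in the outer loop."""
    return theta - alpha * grad

theta = np.array([0.5, -0.3, 1.2])
alpha = np.array([0.1, 0.01, 0.5])  # learned per-parameter step sizes
grad = np.array([1.0, 1.0, 1.0])
phi = meta_sgd_inner_update(theta, alpha, grad)
# Each coordinate moves by its own learned step size (and direction).
```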

                                  \n Figure 8: Deep Meta-Learning as described in “Deep Meta-Learning: Learning to Learn in the Concept Space.”
                                  \n <\/figcaption><\/figure>\n<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div>\n
                                  \n
                                  \n
                                    \n
                                  • \n

                                    Zhou et al. devise Deep Meta-Learning, which performs meta-learning in a concept space. As illustrated in figure 8, the concept generator generates the concept-level features from the inputs, while the concept discriminator distinguishes the features generated from that first step. The final loss function includes both the loss from the discriminator and the loss from the meta-learner.<\/p>\n

                                    <\/li>\n

                                  • \n

                                    Zintgraf et al. design CAVIA, which stands for fast context adaptation via meta-learning. To handle the overfitting challenge with vanilla MAML, CAVIA optimizes only a subset of the input parameters in the inner loop at test time (deemed context parameters<\/em>), instead of the whole neural network. By separating the task-specific parameters and task-independent parameters, they show that training CAVIA is highly efficient.<\/p>\n

                                    <\/li>\n

                                  • \n

                                    Antoniou et al. propose MAML++, a comprehensive set of guidelines for reducing hyper-parameter sensitivity, lowering the generalization error, and improving the stability of MAML. One interesting idea is that they disentangle both the learning rate and the batch-norm statistics per step of the inner loop.<\/p>\n

                                    <\/li>\n<\/ul>\n
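CAVIA's separation of task-specific and task-independent parameters can be sketched with a toy linear model: the inner loop adapts only a small context vector appended to the input, while the shared weights stay frozen (all names, shapes, and the toy loss here are illustrative):

```python
import numpy as np

def predict(x, context, w_shared):
    """Model output depends on frozen shared weights and a small context
    vector that is appended to the input features."""
    return np.concatenate([x, context]) @ w_shared

def inner_adapt(x, y, context, w_shared, lr=0.1, steps=50):
    """Adapt ONLY the context parameters at test time; w_shared is untouched."""
    ctx = context.copy()
    n_ctx = len(ctx)
    for _ in range(steps):
        err = predict(x, ctx, w_shared) - y
        # Gradient of 0.5 * err^2 with respect to the context entries only.
        ctx = ctx - lr * err * w_shared[-n_ctx:]
    return ctx

x = np.array([1.0, 2.0])
w_shared = np.array([0.5, 0.5, 1.0])  # last entry multiplies the context
ctx = inner_adapt(x, y=3.0, context=np.zeros(1), w_shared=w_shared)
# After adaptation the task loss is near zero, with w_shared unchanged.
```

Adapting a vector with far fewer entries than the full network is what makes the inner loop cheap and less prone to overfitting.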

                                    In conclusion, optimization-based meta-learning works by constructing a bi-level optimization procedure, where the inner optimization computes the task-specific parameters ϕ and the outer optimization computes the meta-parameters θ. The most representative method is the Model-Agnostic Meta-Learning (MAML) algorithm, which has been studied and improved upon extensively since its conception.<\/p>\n


                                    The big benefit of MAML is that we can optimize the model's initialization scheme, in contrast to the black-box approach, where the initialization is not optimized. Furthermore, MAML is highly consistent, so it extrapolates well to learning problems where the data is out-of-distribution (relative to what the model saw during meta-training). Unfortunately, because optimization-based meta-learning requires second-order optimization, it is very computationally expensive.<\/p>\n


                                    5 — Non-Parametric Meta-Learning<\/strong><\/span><\/h2>\n

                                    So can we perform the learning procedure described above without second-order optimization?<\/strong> This is where non-parametric methods fit in.<\/p>\n


                                    Non-parametric methods are very effective at learning from small amounts of data (e.g., k-Nearest Neighbors, decision trees, support vector machines). In non-parametric meta-learning<\/strong>, we compare the test data with the training data using some similarity metric. When we find the training data most similar to the test data, we assign their labels to the test data.<\/p>\n


                                    5.1 — Formulation<\/strong><\/span><\/h3>\n

                                    This is the non-parametric meta-learning algorithm in a nutshell:<\/p>\n


                                      \n
                                    • We sample a task T\u1d62, as well as the training set D\u1d62\u1d57\u02b3 and test set D\u1d62\u1d57\u02e2 from the task dataset D\u1d62.\n

                                      \n

                                      <\/li>\n

                                    • \n

                                      We predict the test label yᵗˢ via the similarity between training data and test data (represented by f_θ): yᵗˢ = ∑_{(x_k, y_k) ∈ Dᵗʳ} f_θ(xᵗˢ, x_k) y_k.<\/p>\n

                                      <\/li>\n

                                    • Then we update the meta-parameters θ of this learned embedding function with respect to the loss measuring how accurate our predictions are on the test set: ∇_{θ} L(ŷᵗˢ, yᵗˢ).\n

                                      \n

                                      <\/li>\n

                                    • This process is repeated iteratively with gradient descent optimizers.\n

                                      \n

                                      <\/li>\n<\/ul>\n
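The prediction step above can be sketched end to end; here the learned similarity f_θ is stood in for by a softmax over negative Euclidean distances (an assumed, illustrative choice of metric, not a learned embedding):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def predict_label(x_test, train_x, train_y, n_classes):
    """Similarity-weighted vote: weight each one-hot training label by a
    softmax over negative distances, then sum. Note there are no
    task-specific parameters phi anywhere in this computation."""
    sims = softmax(-np.linalg.norm(train_x - x_test, axis=1))
    onehot = np.eye(n_classes)[train_y]
    return sims @ onehot

train_x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
train_y = np.array([0, 0, 1])
probs = predict_label(np.array([0.05, 0.0]), train_x, train_y, n_classes=2)
# The query sits next to the two class-0 points, so class 0 dominates.
```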

                                      Unlike the black-box and optimization-based approaches, we no longer have task-specific parameters ϕ, since they are not required for the comparison between training and test data.<\/p>\n


                                      5.2 — Architectures<\/strong><\/span><\/h3>\n

                                      Now let\u2019s go over the different architectures used in non-parametric meta-learning methods.<\/p>\n


                                      \n Figure 9: The Siamese Network strategy devised in “Siamese Neural Network for One-Shot Image Recognition.”
                                      \n <\/figcaption><\/figure>\n

                                      Koch et al. propose a Siamese network that consists of two tasks: a verification task and a one-shot task. Taking in pairs of images at training time, the network verifies whether they are of the same class or of different classes. At test time, the network performs one-shot learning: it compares each test image xᵗˢ to the images in the training set Dⱼᵗʳ for the respective task and predicts for xᵗˢ the label of the closest image. Figure 9 illustrates this strategy.<\/p>\n


                                      Vinyals et al. propose Matching Networks, which make the training procedure match the conditions at test time. The network takes the training data and the test data and embeds them into their respective embedding spaces. Then, the network compares each pair of train and test embeddings to make the final label predictions:<\/p>\n


                                      \n Equation 23
                                      \n <\/figcaption><\/figure>\n
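For reference, the prediction rule of Matching Networks (equation 23) can be written with an attention kernel a given by a softmax over cosine similarities c between the embedded test and training points:

```latex
\hat{y}^{\,ts} = \sum_{k=1}^{K} a\left(x^{ts}, x_k\right) y_k,
\qquad
a\left(x^{ts}, x_k\right)
  = \frac{\exp\left(c\left(f(x^{ts}),\, g(x_k)\right)\right)}
         {\sum_{k'=1}^{K} \exp\left(c\left(f(x^{ts}),\, g(x_{k'})\right)\right)}
```

where f and g are the test-side and train-side embedding networks.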
                                      \n Figure 10: The Matching Networks architecture described in “Matching Networks for One-Shot Learning.”
                                      \n <\/figcaption><\/figure>\n

                                      The architecture used in Matching Networks includes a convolutional encoder network to embed the images and a bi-directional Long Short-Term Memory network to produce contextual embeddings of those images. As seen in figure 10, the examples in the training set are matched against the examples in the test set.<\/p>\n


                                      Snell et al. propose Prototypical Networks, which create a prototypical embedding for each class in the given data. The network then compares the embedding of a query example to those class prototypes to make the final label prediction.<\/p>\n


                                      Figure 11 provides a concrete illustration of how Prototypical Networks work in the few-shot scenario. c₁, c₂, and c₃ are the class prototypical embeddings, which are computed as:<\/p>\n


                                      \n Equation 24
                                      \n <\/figcaption><\/figure>\n
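Equation 24 defines each class prototype c_k as the mean embedding of the support examples S_k of that class:

```latex
c_k = \frac{1}{\lvert S_k \rvert} \sum_{(x_i,\, y_i) \in S_k} f_{\theta}(x_i)
```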
                                      \n Figure 11: Prototypical Networks as described in “Prototypical Networks for Few-Shot Learning.”
                                      \n <\/figcaption><\/figure>\n

Then, we compute the distances from x to each of the prototypical class embeddings: D(f_θ(x), c_k).

                                      \u7136\u540e\uff0c\u6211\u4eec\u8ba1\u7b97\u4ecex\u5230\u6bcf\u4e2a\u539f\u578b\u7c7b\u5d4c\u5165\u7684\u8ddd\u79bb\uff1a D(f\u03b8(x)\uff0cc_<\/em> k) \u3002<\/em> <\/p>\n

To get the final class prediction p_θ(y=k|x), we look at the probability of the negative distances after a softmax activation function, as seen below:

                                      \u4e3a\u4e86\u83b7\u5f97\u6700\u7ec8\u7c7b\u522b\u9884\u6d4bp_\u03b8(y = k | x)\uff0c\u6211\u4eec\u770b\u4e00\u4e0bsoftmax\u6fc0\u6d3b\u51fd\u6570\u540e\u8d1f\u8ddd\u79bb\u7684\u6982\u7387\uff0c\u5982\u4e0b\u6240\u793a\uff1a <\/p>\n

p_\theta(y = k \mid x) = \frac{\exp(-D(f_\theta(x), c_k))}{\sum_{k'} \exp(-D(f_\theta(x), c_{k'}))}    (Equation 25)
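Equations 24 and 25 can be combined into a short, runnable sketch. The snippet below is a minimal NumPy illustration, not the authors' implementation: it assumes the embedding function f_θ has already been applied (the inputs are embeddings) and uses squared Euclidean distance for D.

```python
import numpy as np

def class_prototypes(embeddings, labels, num_classes):
    """Equation 24: each prototype c_k is the mean embedding of class k."""
    return np.stack([embeddings[labels == k].mean(axis=0)
                     for k in range(num_classes)])

def predict_proba(query, prototypes):
    """Equation 25: softmax over negative squared Euclidean distances."""
    dists = np.sum((prototypes - query) ** 2, axis=1)  # D(f_theta(x), c_k)
    logits = -dists
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Toy example: two classes with two support embeddings each, in 2-D.
support = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
support_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(support, support_labels, num_classes=2)
probs = predict_proba(np.array([0.1, 0.0]), protos)  # query near class 0
```

A query closest to a prototype receives the highest probability, mirroring the nearest-centroid behavior described above.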

5.3 — Challenge

For non-parametric meta-learning, how can we learn deeper interactions between our inputs? A plain nearest-neighbor comparison probably will not work well when our data is high-dimensional. Here are three papers that attempt to accomplish this:

                                      \u5bf9\u4e8e\u975e\u53c2\u6570\u5143\u5b66\u4e60\uff0c \u6211\u4eec\u5982\u4f55\u5b66\u4e60\u8f93\u5165\u4e4b\u95f4\u7684\u66f4\u6df1\u5c42\u6b21\u7684\u4e92\u52a8\uff1f<\/strong> \u5f53\u6211\u4eec\u7684\u6570\u636e\u662f\u9ad8\u7ef4\u6570\u636e\u65f6\uff0c\u6700\u8fd1\u7684\u90bb\u5c45\u53ef\u80fd\u65e0\u6cd5\u6b63\u5e38\u5de5\u4f5c\u3002 \u8fd9\u662f\u4e09\u7bc7\u8bd5\u56fe\u8fbe\u5230\u8fd9\u4e00\u76ee\u7684\u7684\u8bba\u6587\uff1a <\/p>\n

• Sung et al. come up with RelationNet (figure 12), which has two modules: the embedding module and the relation module. The embedding module embeds the training and test inputs into training and test embeddings. Then the relation module takes in the embeddings and learns a deep distance metric to compare them (the function D in equation 25).

                                        Sung\u7b49\u3002 \u63d0\u51fa\u4e86RelationNet (\u56fe12)\uff0c\u5b83\u5177\u6709\u4e24\u4e2a\u6a21\u5757\uff1a\u5d4c\u5165\u6a21\u5757\u548c\u5173\u7cfb\u6a21\u5757\u3002 \u5d4c\u5165\u6a21\u5757\u5c06\u8bad\u7ec3\u548c\u6d4b\u8bd5\u8f93\u5165\u5d4c\u5165\u5230\u8bad\u7ec3\u548c\u6d4b\u8bd5\u5d4c\u5165\u4e2d\u3002 \u7136\u540e\uff0c\u5173\u7cfb\u6a21\u5757\u63a5\u53d7\u5d4c\u5165\u5e76\u5b66\u4e60\u6df1\u5ea6\u8ddd\u79bb\u5ea6\u91cf\u4ee5\u6bd4\u8f83\u8fd9\u4e9b\u5d4c\u5165(\u516c\u5f0f25\u4e2d\u7684\u51fd\u6570D)\u3002 <\/p>\n<\/li>\n

• Allen et al. propose an Infinite Mixture of Prototypes. This extends Prototypical Networks by adaptively setting the model capacity based on the data complexity. By representing each class with its own mixture of clusters, this method also allows the use of unsupervised clustering, which is helpful for many purposes.

                                        \u827e\u4f26\u7b49\u3002 \u63d0\u51fa\u539f\u578b\u7684\u65e0\u9650\u6df7\u5408 \u3002 \u8fd9\u662f\u539f\u578b\u7f51\u7edc\u7684\u6269\u5c55\uff0c\u56e0\u4e3a\u5b83\u53ef\u4ee5\u6839\u636e\u6570\u636e\u590d\u6742\u6027\u81ea\u9002\u5e94\u5730\u8bbe\u7f6e\u6a21\u578b\u5bb9\u91cf\u3002 \u901a\u8fc7\u4e3a\u6bcf\u4e2a\u7c7b\u5206\u914d\u81ea\u5df1\u7684\u7fa4\u96c6\uff0c\u6b64\u65b9\u6cd5\u5141\u8bb8\u4f7f\u7528\u65e0\u76d1\u7763\u7fa4\u96c6\uff0c\u8fd9\u5bf9\u4e8e\u8bb8\u591a\u76ee\u7684\u90fd\u662f\u6709\u5e2e\u52a9\u7684\u3002 <\/p>\n<\/li>\n

• Garcia and Bruna use a Graph Neural Network in their meta-learning paradigm. By mapping the inputs into their graphical representation, they can easily learn the similarity between training and test data via the edge and node features.

                                        Garcia\u548cBruna\u5728\u4ed6\u4eec\u7684\u5143\u5b66\u4e60\u8303\u5f0f\u4e2d\u4f7f\u7528\u4e86Graph Neural Network \u3002 \u901a\u8fc7\u5c06\u8f93\u5165\u6620\u5c04\u5230\u5176\u56fe\u5f62\u8868\u793a\u4e2d\uff0c\u4ed6\u4eec\u53ef\u4ee5\u8f7b\u677e\u5730\u901a\u8fc7\u8fb9\u7f18\u548c\u8282\u70b9\u7279\u5f81\u6765\u5b66\u4e60\u8bad\u7ec3\u6570\u636e\u4e0e\u6d4b\u8bd5\u6570\u636e\u4e4b\u95f4\u7684\u76f8\u4f3c\u6027\u3002 <\/p>\n<\/li>\n<\/ul><\/div>\n<\/p><\/div>\n

Figure 12: Relation Networks as described in "Learning to Compare: Relation Network for Few-Shot Learning."
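To make the relation-module idea concrete, the snippet below replaces a fixed metric D with a tiny learned scorer: it concatenates a query embedding with a support embedding and passes the pair through a small MLP. This is only a sketch of the mechanism, not the paper's architecture; the weights are random and untrained, and the dimensions (4-D embeddings, 8 hidden units) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
EMBED_DIM, HIDDEN_DIM = 4, 8  # illustrative sizes, not from the paper
W1 = rng.standard_normal((HIDDEN_DIM, 2 * EMBED_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((1, HIDDEN_DIM))
b2 = np.zeros(1)

def relation_score(z_query, z_support):
    """Learned similarity: a small MLP over the concatenated embedding pair."""
    pair = np.concatenate([z_query, z_support])       # shape (2 * EMBED_DIM,)
    hidden = np.maximum(W1 @ pair + b1, 0.0)          # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))  # sigmoid score in (0, 1)

score = relation_score(rng.standard_normal(EMBED_DIM),
                       rng.standard_normal(EMBED_DIM))
```

In the actual RelationNet, the embedding module and this relation module are trained end-to-end so that matching query/support pairs score close to 1 and mismatched pairs close to 0.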

                                        \u516d\uff0c\u7ed3\u8bba<\/strong> (<\/span>6 \u2014 Conclusion<\/strong>)<\/span><\/h2>\n

In this post, I have discussed the motivation for meta-learning, the basic formulation and optimization objective for meta-learning, as well as the three approaches regarding the design of the meta-learning algorithm. In particular:

                                        \u5728\u672c\u6587\u4e2d\uff0c\u6211\u8ba8\u8bba\u4e86\u5143\u5b66\u4e60\u7684\u52a8\u673a\uff0c\u5143\u5b66\u4e60\u7684\u57fa\u672c\u516c\u5f0f\u548c\u4f18\u5316\u76ee\u6807\uff0c\u4ee5\u53ca\u6709\u5173\u5143\u5b66\u4e60\u7b97\u6cd5\u8bbe\u8ba1\u7684\u4e09\u79cd\u65b9\u6cd5\u3002 \u7279\u522b\u662f\uff1a <\/p>\n

• Black-box meta-learning algorithms have very strong learning capacity, in the sense that neural networks are universal function approximators. But if we impose certain structure on the function, there is no guarantee that black-box models will produce consistent results. Additionally, we can use black-box approaches in different types of problem settings, such as reinforcement learning and self-supervised learning. However, because black-box models always learn from scratch, they are very data-hungry.

                                          \u5728\u795e\u7ecf\u7f51\u7edc\u662f\u901a\u7528\u51fd\u6570\u903c\u8fd1\u5668\u7684\u610f\u4e49\u4e0a\uff0c \u9ed1\u76d2\u5143\u5b66\u4e60\u7b97\u6cd5<\/strong>\u5177\u6709\u975e\u5e38\u5f3a\u7684\u5b66\u4e60\u80fd\u529b\u3002 \u4f46\u662f\uff0c\u5982\u679c\u6211\u4eec\u5c06\u67d0\u4e9b\u7ed3\u6784\u5f3a\u52a0\u7ed9\u51fd\u6570\uff0c\u5219\u4e0d\u80fd\u4fdd\u8bc1\u9ed1\u76d2\u6a21\u578b\u4f1a\u4ea7\u751f\u4e00\u81f4\u7684\u7ed3\u679c\u3002 \u6b64\u5916\uff0c\u6211\u4eec\u53ef\u4ee5\u5c06\u9ed1\u76d2\u65b9\u6cd5\u7528\u4e8e\u4e0d\u540c\u7c7b\u578b\u7684\u95ee\u9898\u8bbe\u7f6e\uff0c\u4f8b\u5982\u5f3a\u5316\u5b66\u4e60\u548c\u81ea\u6211\u76d1\u7763\u5b66\u4e60\u3002 \u4f46\u662f\uff0c\u7531\u4e8e\u9ed1\u5323\u5b50\u6a21\u578b\u603b\u662f\u4ece\u5934\u5f00\u59cb\u5b66\u4e60\uff0c\u56e0\u6b64\u5b83\u4eec\u975e\u5e38\u8017\u8d39\u6570\u636e\u3002 <\/p>\n<\/li>\n

• Optimization-based meta-learning algorithms reduce to gradient descent; thus, it is reasonable to expect consistent predictions. For deep enough neural networks, optimization-based models also have very high capacity. Because the initialization is optimized in the inner loop, optimization-based models have a better head start than black-box models. Furthermore, we can try out different architectures without any real difficulty, as evidenced by the Model-Agnostic Meta-Learning (MAML) paradigm. However, the second-order optimization procedure makes optimization-based approaches quite computationally expensive.

                                          \u57fa\u4e8e\u4f18\u5316\u7684\u5143\u5b66\u4e60\u7b97\u6cd5<\/strong>\u53ef\u4ee5\u51cf\u5c11\u5230\u68af\u5ea6\u4e0b\u964d\u3002 \u56e0\u6b64\uff0c\u671f\u671b\u4e00\u81f4\u7684\u9884\u6d4b\u662f\u5408\u7406\u7684\u3002 \u5bf9\u4e8e\u8db3\u591f\u6df1\u7684\u795e\u7ecf\u7f51\u7edc\uff0c\u57fa\u4e8e\u4f18\u5316\u7684\u6a21\u578b\u4e5f\u5177\u6709\u5f88\u9ad8\u7684\u5bb9\u91cf\u3002 \u56e0\u4e3a\u521d\u59cb\u5316\u662f\u5728\u5185\u90e8\u4f18\u5316\u7684\uff0c\u6240\u4ee5\u57fa\u4e8e\u4f18\u5316\u7684\u6a21\u578b\u6bd4\u9ed1\u76d2\u6a21\u578b\u5177\u6709\u66f4\u597d\u7684\u8d77\u70b9\u3002 \u6b64\u5916\uff0c\u6b63\u5982\u6a21\u578b\u4e0d\u53ef\u77e5\u5143\u5b66\u4e60(MAML)\u5b66\u4e60\u8303\u4f8b\u6240\u8bc1\u660e\u7684\u90a3\u6837\uff0c\u6211\u4eec\u53ef\u4ee5\u6beb\u65e0\u56f0\u96be\u5730\u5c1d\u8bd5\u4e0d\u540c\u7684\u4f53\u7cfb\u7ed3\u6784\u3002 \u4f46\u662f\uff0c\u4e8c\u9636\u4f18\u5316\u8fc7\u7a0b\u4f7f\u57fa\u4e8e\u4f18\u5316\u7684\u65b9\u6cd5\u5728\u8ba1\u7b97\u4e0a\u975e\u5e38\u6602\u8d35\u3002 <\/p>\n<\/li>\n

• Non-parametric meta-learning algorithms have good learning capacity for most choices of architecture, as well as good learning consistency under the assumption that the learned embedding space is effective enough. Furthermore, non-parametric approaches do not involve any back-propagation at test time, so they are computationally fast and easy to optimize. The downside is that, being non-parametric, they are hard to scale to large batches of data.

                                          \u5728\u5b66\u4e60\u7684\u5d4c\u5165\u7a7a\u95f4\u8db3\u591f\u6709\u6548\u7684\u5047\u8bbe\u4e0b\uff0c \u975e\u53c2\u6570\u5143\u5b66\u4e60\u7b97\u6cd5<\/strong>\u5bf9\u4e8e\u5927\u591a\u6570\u4f53\u7cfb\u7ed3\u6784\u9009\u62e9\u90fd\u5177\u6709\u826f\u597d\u7684\u5b66\u4e60\u80fd\u529b\uff0c\u5e76\u4e14\u5177\u6709\u826f\u597d\u7684\u5b66\u4e60\u4e00\u81f4\u6027\u3002 \u6b64\u5916\uff0c\u975e\u53c2\u6570\u65b9\u6cd5\u4e0d\u6d89\u53ca\u4efb\u4f55\u53cd\u5411\u4f20\u64ad\uff0c\u56e0\u6b64\u5b83\u4eec\u8ba1\u7b97\u5feb\u901f\u4e14\u6613\u4e8e\u4f18\u5316\u3002 \u7f3a\u70b9\u662f\u5b83\u4eec\u5f88\u96be\u6269\u5c55\u5230\u5927\u6279\u91cf\u6570\u636e\uff0c\u56e0\u4e3a\u5b83\u4eec\u662f\u975e\u53c2\u6570\u7684\u3002 <\/p>\n<\/li>\n<\/ul>\n

There are a lot of exciting directions for the field of meta-learning, such as Bayesian Meta-Learning (the probabilistic view of meta-learning) and Meta Reinforcement Learning (the use of meta-learning in the reinforcement learning setting). I'd certainly expect to see more real-world applications in wide-ranging domains such as healthcare and manufacturing using meta-learning under the hood. I'd highly recommend going through the course lectures and taking detailed notes on the research on these topics!

                                          \u5143\u5b66\u4e60\u9886\u57df\u6709\u5f88\u591a\u4ee4\u4eba\u632f\u594b\u7684\u65b9\u5411\uff0c\u4f8b\u5982\u8d1d\u53f6\u65af\u5143\u5b66\u4e60(\u5143\u5b66\u4e60\u7684\u6982\u7387\u89c6\u56fe)\u548c\u5143\u5f3a\u5316\u5b66\u4e60(\u5728\u5f3a\u5316\u5b66\u4e60\u73af\u5883\u4e2d\u4f7f\u7528\u5143\u5b66\u4e60)\u3002 \u6211\u5f53\u7136\u5e0c\u671b\u770b\u5230\u5728\u5e55\u540e\u4f7f\u7528\u5143\u5b66\u4e60\u7684\u5e7f\u6cdb\u9886\u57df\uff0c\u4f8b\u5982\u533b\u7597\u4fdd\u5065\u548c\u5236\u9020\u4e1a\u4e2d\u7684\u66f4\u591a\u5b9e\u9645\u5e94\u7528\u3002 \u6211\u5f3a\u70c8\u5efa\u8bae\u60a8\u9605\u8bfb\u8bfe\u7a0b\u8bb2\u5ea7\uff0c\u5e76\u5c31\u8fd9\u4e9b\u4e3b\u9898\u7684\u7814\u7a76\u505a\u8be6\u7ec6\u7684\u8bb0\u5f55\uff01 <\/p>\n

                                          \u8be5\u5e16\u5b50\u6700\u521d\u53d1\u5e03\u5728\u6211\u7684\u7f51\u7ad9\u4e0a\uff01 (<\/span>This post is originally published on my website!)<\/span><\/h3>\n<\/p><\/div>\n<\/p><\/div>\n<\/section>\n
                                          \n
                                          \n
                                          \n

If you would like to follow my work on Recommendation Systems, Deep Learning, MLOps, and Data Journalism, you can follow my Medium and GitHub, as well as other projects at https://jameskle.com/. You can also tweet at me on Twitter, email me directly, or find me on LinkedIn. Or join my mailing list to receive my latest thoughts right in your inbox!

                                          \u5982\u679c\u60a8\u60f3\u5173\u6ce8\u6211\u5728\u63a8\u8350\u7cfb\u7edf\uff0c\u6df1\u5ea6\u5b66\u4e60\uff0cMLOps\u548c\u6570\u636e\u65b0\u95fb\u5b66\u65b9\u9762\u7684\u5de5\u4f5c\uff0c\u53ef\u4ee5\u5173\u6ce8\u6211\u7684<\/em> Medium \u548c<\/em> GitHub \u4ee5\u53ca\u5176\u4ed6\u9879\u76ee\uff0c<\/em> \u7f51\u5740 \u4e3a<\/em> https:\/\/jameskle.com\/ \u3002<\/em> \u60a8\u4e5f\u53ef\u4ee5\u5728Twitter\u5728\u6211\u7684<\/em> \u5fae\u535a \uff0c<\/em> \u76f4\u63a5\u7ed9\u6211\u53d1\u7535\u5b50\u90ae\u4ef6<\/em> \uff0c\u6216\u8005<\/em> \u627e\u5230\u6211\u7684LinkedIn \u3002<\/em> \u6216<\/em> \u52a0\u5165\u6211\u7684\u90ae\u4ef6\u76ee\u5f55\uff0c \u76f4\u63a5\u5728\u60a8\u7684\u6536\u4ef6\u7bb1\u4e2d\u63a5\u6536\u6211\u7684\u6700\u65b0\u60f3\u6cd5\uff01<\/em> <\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/section><\/div>\n


                                          \u7ffb\u8bd1\u81ea: https:\/\/medium.com\/cracking-the-data-science-interview\/meta-learning-is-all-you-need-3bd0bafdf289<\/p>\n<\/blockquote>\n

                                          \u5143\u5b66\u4e60 \u8fc1\u79fb\u5b66\u4e60<\/p>\n<\/article>\n","protected":false},"excerpt":{"rendered":"\u5143\u5b66\u4e60 \u8fc1\u79fb\u5b66\u4e60_\u5143\u5b66\u4e60\u5c31\u662f\u60a8\u6240\u9700\u8981\u7684\u5143\u5b66\u4e60\u8fc1\u79fb\u5b66\u4e60Update:ThispostispartofablogseriesonMeta-LearningthatI\u2019mworkingon.C...","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"_links":{"self":[{"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/posts\/6428"}],"collection":[{"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/comments?post=6428"}],"version-history":[{"count":0,"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/posts\/6428\/revisions"}],"wp:attachment":[{"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/media?parent=6428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/categories?post=6428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mushiming.com\/wp-json\/wp\/v2\/tags?post=6428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}