Mathematical Principles in word2vec, Explained
peghoty@163.com
July 2014

Contents

1  Preface
2  Preliminaries
   2.1  The sigmoid function
   2.2  Logistic regression
   2.3  Bayes' formula
   2.4  Huffman coding
      2.4.1  Huffman trees
      2.4.2  Constructing a Huffman tree
      2.4.3  Huffman coding
3  Background
   3.1  Statistical language models
   3.2  The n-gram model
   3.3  The neural probabilistic language model
   3.4  Understanding word vectors
4  Models based on Hierarchical Softmax
   4.1  The CBOW model
      4.1.1  Network structure
      4.1.2  Gradient computation
   4.2  The Skip-gram model
      4.2.1  Network structure
      4.2.2  Gradient computation
5  Models based on Negative Sampling
   5.1  The CBOW model
   5.2  The Skip-gram model
   5.3  The negative-sampling algorithm
6  Selected source-code details
   6.1  Approximating σ(x)
   6.2  Storing the vocabulary
   6.3  The newline character
   6.4  Low-frequency and high-frequency words
   6.5  Window and context
   6.6  The adaptive learning rate
   6.7  Parameter initialization and training
   6.8  Multithreaded parallelism
   6.9  Some questions and reflections

§1  Preface

word2vec is a tool released as open source by Google in 2013 for obtaining word vectors. It is simple and efficient, and has therefore attracted a great deal of attention. Since its author, Tomas Mikolov, gives few algorithmic details in the two related papers ([3], [4]), the toolkit has acquired a certain air of mystery, and some impatient readers have chosen to dissect the source code to see for themselves what it does.

I first encountered word2vec in October 2013, when I read a paper [7] by Zheng Xiaoqing (郑骁庆) of Fudan University whose main contribution was to port the SENNA algorithms ([8]) to the Chinese setting. I found it interesting and wrote an implementation (see [20]), but the character-vector training was painfully slow, so I switched to word2vec to supply the character vectors. To my surprise, the resulting Chinese word segmentation worked quite well; word2vec immediately rose in my estimation, and my curiosity grew with it.

Later I kept running into concrete applications of word2vec, and Mikolov's team itself extended it to sentences and documents, so it seemed genuinely worthwhile to understand the algorithms inside word2vec in order to follow their subsequent work. I therefore sat down, read the code carefully, and believe I now basically understand what it does. My first reaction was: "This is plainly a very simple shallow structure; why do so many people make such a fuss about it being Deep Learning?"

While dissecting the word2vec source code I gained a good deal not only on the algorithmic side but also in programming technique. Having put in the effort to read the code, I decided to write up what I understood, in the hope that it may serve as a reference for others.

While preparing this article I had many helpful discussions with 北流浪子 ([15], [16]) from the deep-learning discussion group, for which I express my thanks. I also consulted materials by others, all listed in the references; my thanks to them as well.

§2  Preliminaries

This section introduces some important background that will be used in word2vec, including the sigmoid function, Bayes' formula, and Huffman coding.

§2.1  The sigmoid function

The sigmoid function is one of the activation functions commonly used in neural networks. It is defined as

    σ(x) = 1 / (1 + e^{−x}).

Its domain is (−∞, +∞) and its range is (0, 1). Figure 1 shows its graph.

[Figure 1: graph of the sigmoid function]

The derivative of the sigmoid function has the form

    σ′(x) = σ(x)[1 − σ(x)],

from which it follows easily that the derivatives of log σ(x) and log(1 − σ(x)) are

    [log σ(x)]′ = 1 − σ(x),    [log(1 − σ(x))]′ = −σ(x).    (2.1)

Formula (2.1) will be used in the derivations later.

§2.2  Logistic regression

Binary classification problems arise constantly in practice: whether an e-mail is spam, whether a customer is a potential customer, whether an online transaction is fraudulent, and so on. Let {(x_i, y_i)}_{i=1}^{m} be the sample data of a binary classification problem, where x_i ∈ R^n and y_i ∈ {0, 1}; a sample with y_i = 1 is called a positive example and one with y_i = 0 a negative example.

Using the sigmoid function, for any sample x = (x_1, x_2, …, x_n) the hypothesis function of the binary classifier can be written as

    h_θ(x) = σ(θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n),

where θ = (θ_0, θ_1, …, θ_n) are the parameters to be determined. To simplify notation, introduce x_0 = 1 and extend x to (x_0, x_1, x_2, …, x_n), still written x when no confusion can arise. Then h_θ can be abbreviated as

    h_θ(x) = σ(θ^⊤ x).

Taking the threshold T = 0.5, the decision rule of the binary classifier is

    y(x) = 1 if h_θ(x) ≥ 0.5,    y(x) = 0 if h_θ(x) < 0.5.

How, then, is the parameter θ found? The usual approach is first to fix an overall objective of the form

    J(θ) = (1/m) · Σ_{i=1}^{m} cost(x_i, y_i),

and then optimize it to obtain the optimal θ. In practice the per-sample term cost(x_i, y_i) is usually taken to be the log-likelihood

    cost(x_i, y_i) = log h_θ(x_i)        if y_i = 1,
                     log(1 − h_θ(x_i))   if y_i = 0.

Note that this is a piecewise function; it can also be written as the single expression

    cost(x_i, y_i) = y_i · log h_θ(x_i) + (1 − y_i) · log(1 − h_θ(x_i)).

§2.3  Bayes' formula

Bayes' formula, due to the English mathematician Thomas Bayes, describes the relationship between two conditional probabilities. Let P(A) and P(B) denote the probabilities of events A and B, P(A|B) the probability of A given that B has occurred, and P(A, B) the probability that A and B occur together. Then

    P(A|B) = P(A, B) / P(B),    P(B|A) = P(A, B) / P(A),

and combining the two gives

    P(A|B) = P(A) · P(B|A) / P(B).

This is Bayes' formula.

§2.4  Huffman coding

This section gives a brief introduction to Huffman coding (the material is drawn mainly from the Baidu Baike entry [10]). We first introduce the definition of a Huffman tree and its construction algorithm.

§2.4.1  Huffman trees

In computer science a tree is an important nonlinear data structure in which data elements (called nodes in a tree) are organized by branching relations. A collection of pairwise disjoint trees is called a forest. Some common tree-related notions:

• Path and path length. In a tree, the route from one node down to a child or descendant node is called a path, and the number of branches on the route is the path length. If the root is defined to be on level 1, the path length from the root to a node on level L is L − 1.

• Node weight and weighted path length. If each node is assigned a (nonnegative) number carrying some meaning, that number is called the node's weight. The weighted path length of a node is the path length from the root to that node multiplied by the node's weight.

• Weighted path length of a tree: the sum of the weighted path lengths of all its leaf nodes.

A binary tree is an ordered tree in which every node has at most two subtrees, conventionally called the left subtree and the right subtree; "ordered" means the two subtrees are distinguished as left and right and may not be swapped.

Given n weights as n leaf nodes, construct a binary tree; if its weighted path length attains the minimum, the tree is called an optimal binary tree, also known as a Huffman tree.

§2.4.2  Constructing a Huffman tree

Given n weights {w_1, w_2, …, w_n} as the n leaf nodes of a binary tree, a Huffman tree can be built by the following algorithm.

Algorithm 2.1 (Huffman-tree construction)
(1) Regard {w_1, w_2, …, w_n} as a forest of n trees, each consisting of a single node.
(2) Select the two trees in the forest whose root weights are smallest and merge them as the left and right subtrees of a new tree, whose root weight is the sum of the two root weights.
(3) Delete the two selected trees from the forest and add the new tree to it.
(4) Repeat steps (2) and (3) until only one tree remains in the forest; that tree is the desired Huffman tree.

Next, a concrete instance of Algorithm 2.1.

Example 2.1. Suppose that during the 2014 World Cup a number of football-related posts were crawled from Sina Weibo, and that the six words "我" (I), "喜欢" (like), "观看" (watch), "巴西" (Brazil), "足球" (football), "世界杯" (World Cup) occurred 15, 8, 6, 5, 3 and 1 times respectively. Taking these six words as leaf nodes and their counts as weights, construct a Huffman tree.

[Figure 2: construction of the Huffman tree; the sixth panel shows the final tree]

By Algorithm 2.1 the construction proceeds as shown in Figure 2, and the final tree makes clear that the more frequent a word, the closer it sits to the root. In the figure, the new nodes created by merging are highlighted. Since every merge creates exactly one new node, a tree with n leaf nodes contains n − 1 new internal nodes; here n = 6, so 5 new nodes are created.

Note that, as mentioned above, the two subtrees of a binary tree are distinguished as left and right, so for a non-leaf node the two children are ordered. In this example we uniformly take the higher-frequency node as the left child and the lower-frequency one as the right child. This is of course only a convention; taking the higher-frequency node as the right child would be equally valid.

§2.4.3  Huffman coding

In data communication, the text to be transmitted must be converted to a binary string, with different arrangements of 0s and 1s representing the characters. For example, suppose the message to send is "AFTER DATA EAR ARE ART AREA", whose character set is {A, E, R, T, F, D} with occurrence counts 8, 4, 5, 3, 1, 1, and codes are to be designed for these letters.

To distinguish 6 letters, the simplest binary scheme is fixed-length coding with 3 bits (2^3 = 8 > 6), say 000, 001, 010, 011, 100, 101 for A, E, R, T, F, D respectively; the receiver then decodes three bits at a time. The code length clearly depends on the number of distinct characters in the message: with 26 possible characters the fixed length would be 5 (2^5 = 32 > 26). When transmitting a message, however, one always wants the total length to be as short as possible. In practice character frequencies differ widely (A, B, C are used far more often than X, Y, Z), which naturally suggests giving frequent characters short codes and rare ones long codes, thereby shortening the encoded message overall.

For a variable-length code to be a prefix code (no character's code may be a prefix of another's), the characters of the character set can be taken as the leaf nodes of a coding binary tree. To minimize the transmitted length, assign each character's frequency to its leaf as the weight: the smaller the frequency, the smaller the weight and the lower the leaf sits, so rare characters get long codes and frequent ones short codes, which guarantees the minimum weighted path length of the tree, i.e., the shortest transmitted message. The shortest-message problem thus reduces to constructing a Huffman tree whose leaves are the characters and whose weights are their frequencies. The binary prefix code obtained from a Huffman tree is called a Huffman code; it satisfies the prefix condition and minimizes the total encoded length.

The word2vec tool described in this article also uses Huffman coding: the words of the training corpus are the leaf nodes, their corpus frequencies the weights, and each word receives a Huffman code from the resulting Huffman tree. Figure 3 shows the Huffman codes of the six words of Example 2.1, under the convention that a (higher-frequency) left child is coded 1 and a (lower-frequency) right child is coded 0. The codes of "我", "喜欢", "观看", "巴西", "足球", "世界杯" are then 0, 111, 110, 101, 1001 and 1000 respectively.

[Figure 3: Huffman-coding example]

Note that up to this point two conventions have been adopted for Huffman trees and Huffman codes:
(1) the larger-weight node is the left child and the smaller-weight node the right child;
(2) a left child is coded 1, a right child 0.

The word2vec source code likewise codes the larger-weight child 1 and the smaller-weight child 0. For consistency with the conventions above, "left child" in the remainder of this article always means the larger-weight child.

§3  Background

word2vec is a tool for producing word vectors, and word vectors are intimately related to language models, so it is worth first reviewing some language-model background.

§3.1  Statistical language models

Today's rapidly growing Internet produces vast amounts of text, image, audio and video data every day, and processing this data to mine valuable information from it is impossible without Natural Language Processing (NLP). A very important component of NLP is the statistical language model: it underlies all of NLP and is widely used in speech recognition, machine translation, word segmentation, part-of-speech tagging, information retrieval, and other tasks.

Example 3.1. In a speech-recognition system, given a speech segment Voice, one must find the text segment Text that maximizes the probability p(Text|Voice). By Bayes' formula,

    p(Text|Voice) = p(Voice|Text) · p(Text) / p(Voice),

where p(Voice|Text) is the acoustic model and p(Text) is the language model ([18]).

Simply put, a statistical language model is a probabilistic model that computes the probability of a sentence, usually built from a corpus. But what is "the probability of a sentence"? Let W = w_1^T := (w_1, w_2, …, w_T) be a sentence consisting of the T words w_1, w_2, …, w_T in order. Then the joint probability of w_1, w_2, …, w_T,

    p(W) = p(w_1^T) = p(w_1, w_2, …, w_T),

is the probability of the sentence. By Bayes' formula, this can be decomposed via the chain rule as

    p(w_1^T) = p(w_1) · p(w_2|w_1) · p(w_3|w_1^2) ⋯ p(w_T|w_1^{T−1}),    (3.1)

where the (conditional) probabilities p(w_1), p(w_2|w_1), p(w_3|w_1^2), …, p(w_T|w_1^{T−1}) are the parameters of the language model. Once all these parameters have been computed, the probability p(w_1^T) of any given sentence w_1^T can be evaluated quickly.

It looks simple, doesn't it? But a concrete implementation runs into trouble. Consider first the number of model parameters. A single given sentence of length T needs only T parameters. But suppose the dictionary D of the corpus has size (vocabulary) N; then there are N^T possible sentences of length T, each requiring T parameters, for a total of T·N^T. This is only a rough estimate that ignores shared parameters, but the order of magnitude is still daunting. Moreover, once computed these probabilities must be stored, so the memory cost is also enormous.

How, then, are these parameters computed? Common approaches include the n-gram model, decision trees, maximum-entropy models, maximum-entropy Markov models, conditional random fields, neural networks, and so on; this article discusses only two of them, the n-gram model and neural networks. We begin with the n-gram model.
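The sigmoid identities of §2.1 are easy to check numerically. The sketch below (plain Python, names of my own choosing) verifies σ′(x) = σ(x)[1 − σ(x)] against a central finite difference; it is illustrative only and unrelated to word2vec's table-based approximation of σ(x).

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x})"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form of the derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form against a central finite difference.
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-6
```

The same closed form immediately gives the two derivatives in (2.1), since [log σ(x)]′ = σ′(x)/σ(x) = 1 − σ(x).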
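The per-sample log-likelihood of §2.2 can be evaluated directly. The parameters and samples below are made up purely for illustration; note that, as in the text, larger (less negative) values of J are better, since J is a log-likelihood to be maximized.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, x, y):
    """Per-sample log-likelihood of Section 2.2:
    y*log h(x) + (1-y)*log(1-h(x)), with x already extended by x_0 = 1."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return y * math.log(h) + (1 - y) * math.log(1 - h)

# Hypothetical parameters (theta_0, theta_1, theta_2) and extended samples.
theta = (0.5, 1.0, -2.0)
samples = [((1.0, 2.0, 0.5), 1),
           ((1.0, 0.1, 1.5), 0)]
J = sum(cost(theta, x, y) for x, y in samples) / len(samples)
print(J)
```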
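Algorithm 2.1, combined with the coding convention of §2.4.3 (larger-weight child on the left, left coded 1, right coded 0), can be sketched in a few lines of Python; `heapq` stands in for the repeated "pick the two smallest roots" step, and the counts are those of Example 2.1. This is an illustrative sketch, not the word2vec C implementation.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree (Algorithm 2.1) and return each symbol's code,
    with the larger-weight child of every merge coded 1, the smaller 0."""
    tiebreak = itertools.count()  # keeps heap tuples comparable on weight ties
    # Forest of single-node trees: (weight, tiebreak, {symbol: code-so-far}).
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, small = heapq.heappop(heap)   # smallest root weight
        w2, _, large = heapq.heappop(heap)   # second smallest
        # Larger-weight subtree is the "left child": prefix bit 1.
        merged = {s: "0" + c for s, c in small.items()}
        merged.update({s: "1" + c for s, c in large.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

counts = {"我": 15, "喜欢": 8, "观看": 6, "巴西": 5, "足球": 3, "世界杯": 1}
print(huffman_codes(counts))
```

Running this reproduces the codes of Figure 3 (e.g. "我" → "0", "世界杯" → "1000"), and by construction no code is a prefix of another.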
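The chain-rule decomposition (3.1) can be made concrete on a toy corpus by estimating each conditional p(w_t | w_1^{t-1}) as a ratio of prefix counts; the three-sentence corpus below is made up for illustration. The product then telescopes to the sentence's relative frequency, which also hints at why the raw parameters are useless without the smoothing or modeling discussed next: any unseen sentence gets probability zero.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
corpus = [
    ("I", "like", "football"),
    ("I", "like", "Brazil"),
    ("I", "watch", "football"),
]

# Count every sentence prefix, including the empty prefix.
prefix_count = Counter()
for sent in corpus:
    for t in range(len(sent) + 1):
        prefix_count[sent[:t]] += 1

def sentence_prob(sentence):
    """Chain rule (3.1): p(w_1^T) = prod_t p(w_t | w_1^{t-1}),
    each conditional estimated as a ratio of prefix counts."""
    p = 1.0
    for t in range(1, len(sentence) + 1):
        p *= prefix_count[sentence[:t]] / prefix_count[sentence[:t - 1]]
    return p

print(sentence_prob(("I", "like", "football")))  # telescopes to 1/3 here
```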