Finite-Dimensional Vector Spaces - P. Halmos (Springer, 1987)
An algebra book worth reading alongside one's study of algebra; it introduces a richer collection of algebraic concepts and results.

PREFACE

My purpose in this book is to treat linear transformations on finite-dimensional vector spaces by the methods of more general theories. The idea is to emphasize the simple geometric notions common to many parts of mathematics and its applications, and to do so in a language that gives away the trade secrets and tells the student what is in the back of the minds of people proving theorems about integral equations and Hilbert spaces.

The reader does not, however, have to share my prejudiced motivation. Except for an occasional reference to undergraduate mathematics the book is self-contained and may be read by anyone who is trying to get a feeling for the linear problems usually discussed in courses on matrix theory or "higher" algebra. The algebraic, coordinate-free methods do not lose power and elegance by specialization to a finite number of dimensions, and they are, in my belief, as elementary as the classical coordinatized treatment.

I originally intended this book to contain a theorem if and only if an infinite-dimensional generalization of it already exists. The tempting easiness of some essentially finite-dimensional notions and results was, however, irresistible, and in the final result my initial intentions are just barely visible. They are most clearly seen in the emphasis, throughout, on generalizable methods instead of sharpest possible results. The reader may sometimes see some obvious way of shortening the proofs I give. In such cases the chances are that the infinite-dimensional analogue of the shorter proof is either much longer or else non-existent.

A preliminary edition of the book (Annals of Mathematics Studies, Number 7, first published by the Princeton University Press in 1942) has been circulating for several years. In addition to some minor changes in style and in order, the difference between the preceding version and this one is that the latter contains the following new material: (1) a brief discussion of fields and, in the treatment of vector spaces with inner products, special attention to the real case; (2) a definition of determinants in invariant terms, via the theory of multilinear forms; (3) exercises.

The exercises (well over three hundred of them) constitute the most significant addition; I hope that they will be found useful by both student and teacher. There are two things about them the reader should know. First, if an exercise is neither imperative ("prove that ...") nor interrogative ("is it true that ...?") but merely declarative, then it is intended as a challenge. For such exercises the reader is asked to discover if the assertion is true or false, prove it if true and construct a counterexample if false, and, most important of all, discuss such alterations of hypothesis and conclusion as will make the true ones false and the false ones true. Second, the exercises, whatever their grammatical form, are not always placed so as to make their very position a hint to their solution. Frequently exercises are stated as soon as the statement makes sense, quite a bit before machinery for a quick solution has been developed. A reader who tries (even unsuccessfully) to solve such a "misplaced" exercise is likely to appreciate and to understand the subsequent developments much better for his attempt. Having in mind possible future editions of the book, I ask the reader to let me know about errors in the exercises, and to suggest improvements and additions.
(Needless to say, the same goes for the text.)

None of the theorems and only very few of the exercises are my discovery; most of them are known to most working mathematicians, and have been known for a long time. Although I do not give a detailed list of my sources, I am nevertheless deeply aware of my indebtedness to the books and papers from which I learned and to the friends and strangers who, before and after the publication of the first version, gave me much valuable encouragement and criticism. I am particularly grateful to three men: J. L. Doob and Arlen Brown, who read the entire manuscript of the first and the second version, respectively, and made many useful suggestions, and John von Neumann, who was one of the originators of the modern spirit and methods that I have tried to present and whose teaching was the inspiration for this book.

P. R. H.

CONTENTS

I. SPACES
1. Fields; 2. Vector spaces; 3. Examples; 4. Comments; 5. Linear dependence; 6. Linear combinations; 7. Bases; 8. Dimension; 9. Isomorphism; 10. Subspaces; 11. Calculus of subspaces; 12. Dimension of a subspace; 13. Dual spaces; 14. Brackets; 15. Dual bases; 16. Reflexivity; 17. Annihilators; 18. Direct sums; 19. Dimension of a direct sum; 20. Dual of a direct sum; 21. Quotient spaces; 22. Dimension of a quotient space; 23. Bilinear forms; 24. Tensor products; 25. Product bases; 26. Permutations; 27. Cycles; 28. Parity; 29. Multilinear forms; 30. Alternating forms; 31. Alternating forms of maximal degree.

II. TRANSFORMATIONS
32. Linear transformations; 33. Transformations as vectors; 34. Products; 35. Polynomials; 36. Inverses; 37. Matrices; 38. Matrices of transformations; 39. Invariance; 40. Reducibility; 41. Projections; 42. Combinations of projections; 43. Projections and invariance; 44. Adjoints; 45. Adjoints of projections; 46. Change of basis; 47. Similarity; 48. Quotient transformations; 49. Range and null-space; 50. Rank and nullity; 51. Transformations of rank one; 52. Tensor products of transformations; 53. Determinants; 54. Proper values; 55. Multiplicity; 56. Triangular form; 57. Nilpotence; 58. Jordan form.

III. ORTHOGONALITY
59. Inner products; 60. Complex inner products; 61. Inner product spaces; 62. Orthogonality; 63. Completeness; 64. Schwarz's inequality; 65. Complete orthonormal sets; 66. Projection theorem; 67. Linear functionals; 68. Parentheses versus brackets; 69. Natural isomorphisms; 70. Self-adjoint transformations; 71. Polarization; 72. Positive transformations; 73. Isometries; 74. Change of orthonormal basis; 75. Perpendicular projections; 76. Combinations of perpendicular projections; 77. Complexification; 78. Characterization of spectra; 79. Spectral theorem; 80. Normal transformations; 81. Orthogonal transformations; 82. Functions of transformations; 83. Polar decomposition; 84. Commutativity; 85. Self-adjoint transformations of rank one.

IV. ANALYSIS
86. Convergence of vectors; 87. Norm; 88. Expressions for the norm; 89. Bounds of a self-adjoint transformation; 90. Minimax principle; 91. Convergence of linear transformations; 92. Ergodic theorem; 93. Power series.
APPENDIX. HILBERT SPACE

RECOMMENDED READING
INDEX OF TERMS
INDEX OF SYMBOLS

CHAPTER I
SPACES

§1. Fields

In what follows we shall have occasion to use various classes of numbers (such as the class of all real numbers or the class of all complex numbers). Because we should not at this early stage commit ourselves to any specific class, we shall adopt the dodge of referring to numbers as scalars. The reader will not lose anything essential if he consistently interprets scalars as real numbers or as complex numbers; in the examples that we shall study both classes will occur. To be specific (and also in order to operate at the proper level of generality) we proceed to list all the general facts about scalars that we shall need to assume.

(A) To every pair, α and β, of scalars there corresponds a scalar α + β, called the sum of α and β, in such a way that
(1) addition is commutative, α + β = β + α,
(2) addition is associative, α + (β + γ) = (α + β) + γ,
(3) there exists a unique scalar 0 (called zero) such that α + 0 = α for every scalar α, and
(4) to every scalar α there corresponds a unique scalar −α such that α + (−α) = 0.

(B) To every pair, α and β, of scalars there corresponds a scalar αβ, called the product of α and β, in such a way that
(1) multiplication is commutative, αβ = βα,
(2) multiplication is associative, α(βγ) = (αβ)γ,
(3) there exists a unique non-zero scalar 1 (called one) such that α1 = α for every scalar α, and
(4) to every non-zero scalar α there corresponds a unique scalar α⁻¹ (or 1/α) such that αα⁻¹ = 1.

(C) Multiplication is distributive with respect to addition, α(β + γ) = αβ + αγ.

If addition and multiplication are defined within some set of objects (scalars) so that the conditions (A), (B), and (C) are satisfied, then that set (together with the given operations) is called a field. Thus, for example, the set Q of all rational numbers (with the ordinary definitions of sum and product) is a field, and the same is true of the set R of all real numbers and the set C of all complex numbers.

EXERCISES

1. Almost all the laws of elementary arithmetic are consequences of the axioms defining a field. Prove, in particular, that if F is a field and if α, β, and γ belong to F, then the following relations hold.
(a) 0 + α = α.
(b) If α + β = α + γ, then β = γ.
(c) α + (β − α) = β. (Here β − α = β + (−α).)
(d) α·0 = 0·α = 0. (For clarity or emphasis we sometimes use the dot to indicate multiplication.)
(e) (−1)α = −α.
(f) (−α)(−β) = αβ.
(g) If αβ = 0, then either α = 0 or β = 0 (or both).

2. (a) Is the set of all positive integers a field? (In familiar systems, such as the integers, we shall almost always use the ordinary operations of addition and multiplication. On the rare occasions when we depart from this convention, we shall give ample warning. As for "positive," by that word we mean, here and elsewhere in this book, "greater than or equal to zero." If 0 is to be excluded, we shall say "strictly positive.")
(b) What about the set of all integers?
(c) Can the answers to these questions be changed by re-defining addition or multiplication (or both)?

3. Let m be an integer, m ≥ 2, and let Z_m be the set of all positive integers less than m, Z_m = {0, 1, ..., m − 1}. If α and β are in Z_m, let α + β be the least positive remainder obtained by dividing the (ordinary) sum of α and β by m, and, similarly, let αβ be the least positive remainder obtained by dividing the (ordinary) product of α and β by m. (Example: if m = 12, then 3 + 11 = 2 and 3·11 = 9.)
(a) Prove that Z_m is a field if and only if m is a prime.
(b) What is −1 in Z_5?
(c) What is 1/3 in Z_7?

4. The example of Z_p (where p is a prime) shows that not quite all the laws of elementary arithmetic hold in fields; in Z_2, for instance, 1 + 1 = 0.
Prove that if F is a field, then either the result of repeatedly adding 1 to itself is always different from 0, or else the first time that it is equal to 0 occurs when the number of summands is a prime. (The characteristic of the field F is defined to be 0 in the first case and the crucial prime in the second.)

5. Let Q(√2) be the set of all real numbers of the form α + β√2, where α and β are rational.
(a) Is Q(√2) a field?
(b) What if α and β are required to be integers?

6. (a) Does the set of all polynomials with integer coefficients form a field?
(b) What if the coefficients are allowed to be real numbers?

7. Let F be the set of all (ordered) pairs (α, β) of real numbers.
(a) If addition and multiplication are defined by
(α, β) + (γ, δ) = (α + γ, β + δ)
and
(α, β)(γ, δ) = (αγ, βδ),
does F become a field?
(b) If addition and multiplication are defined by
(α, β) + (γ, δ) = (α + γ, β + δ)
and
(α, β)(γ, δ) = (αγ − βδ, αδ + βγ),
is F a field then?
(c) What happens (in both the preceding cases) if we consider ordered pairs of complex numbers instead?

§2. Vector spaces

We come now to the basic concept of this book. For the definition that follows we assume that we are given a particular field F; the scalars to be used are to be elements of F.

DEFINITION. A vector space is a set V of elements called vectors satisfying the following axioms.

(A) To every pair, x and y, of vectors in V there corresponds a vector x + y, called the sum of x and y, in such a way that
(1) addition is commutative, x + y = y + x,
(2) addition is associative, x + (y + z) = (x + y) + z,
(3) there exists in V a unique vector 0 (called the origin) such that x + 0 = x for every vector x, and
(4) to every vector x in V there corresponds a unique vector −x such that x + (−x) = 0.

(B) To every pair, α and x, where α is a scalar and x is a vector in V, there corresponds a vector αx in V, called the product of α and x, in such a way that
(1) multiplication by scalars is associative, α(βx) = (αβ)x, and
(2) 1x = x for every vector x.

(C) (1) Multiplication by scalars is distributive with respect to vector addition, α(x + y) = αx + αy, and
(2) multiplication by vectors is distributive with respect to scalar addition, (α + β)x = αx + βx.

These axioms are not claimed to be logically independent; they are merely a convenient characterization of the objects we wish to study. The relation between a vector space V and the underlying field F is usually described by saying that V is a vector space over F. If F is the field R of real numbers, V is called a real vector space; similarly if F is Q or if F is C, we speak of rational vector spaces or complex vector spaces.

§3. Examples

Before discussing the implications of the axioms, we give some examples. We shall refer to these examples over and over again, and we shall use the notation established here throughout the rest of our work.

(1) Let C^1 (= C) be the set of all complex numbers; if we interpret x + y and αx as ordinary complex numerical addition and multiplication, C^1 becomes a complex vector space.

(2) Let P be the set of all polynomials, with complex coefficients, in a variable t. To make P into a complex vector space, we interpret vector addition and scalar multiplication as the ordinary addition of two polynomials and the multiplication of a polynomial by a complex number; the origin in P is the polynomial identically zero.

Example (1) is too simple and example (2) is too complicated to be typical of the main contents of this book. We give now another example of complex vector spaces which (as we shall see later) is general enough for all our purposes.

(3) Let C^n, n = 1, 2, ...,
be the set of all n-tuples of complex numbers. If x = (ξ_1, ..., ξ_n) and y = (η_1, ..., η_n) are elements of C^n, we write, by definition,
x + y = (ξ_1 + η_1, ..., ξ_n + η_n),
αx = (αξ_1, ..., αξ_n),
0 = (0, ..., 0),
−x = (−ξ_1, ..., −ξ_n).
It is easy to verify that all parts of our axioms (A), (B), and (C) of §2 are satisfied, so that C^n is a complex vector space; it will be called n-dimensional complex coordinate space.
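To make example (3) concrete, here is a minimal Python sketch (our own illustration, not part of Halmos's text; the helper names are made up) that implements the coordinate-space operations just defined and spot-checks a few of the axioms of §2 on sample vectors of C^3.

```python
# Componentwise operations on n-tuples of complex numbers, as in example (3).
# Illustrative sketch only; the helper names are our own.
import cmath

def vec_add(x, y):
    """The sum x + y = (xi_1 + eta_1, ..., xi_n + eta_n)."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def scal_mul(a, x):
    """The product a*x = (a*xi_1, ..., a*xi_n) of a scalar and a vector."""
    return tuple(a * xi for xi in x)

def vec_close(x, y, tol=1e-12):
    """Componentwise equality up to floating-point rounding."""
    return all(cmath.isclose(xi, yi, rel_tol=0, abs_tol=tol) for xi, yi in zip(x, y))

if __name__ == "__main__":
    zero = (0, 0, 0)
    x = (1 + 2j, 0.5, -3j)
    y = (2, 1 - 1j, 4 + 0.25j)
    a, b = 2 - 1j, 0.5j

    assert vec_close(vec_add(x, y), vec_add(y, x))                     # (A)(1) commutativity
    assert vec_close(vec_add(x, zero), x)                              # (A)(3) the origin
    assert vec_close(vec_add(x, scal_mul(-1, x)), zero)                # (A)(4) additive inverse
    assert vec_close(scal_mul(a, scal_mul(b, x)), scal_mul(a * b, x))  # (B)(1) a(bx) = (ab)x
    assert vec_close(scal_mul(a, vec_add(x, y)),
                     vec_add(scal_mul(a, x), scal_mul(a, y)))          # (C)(1) distributivity
    print("sampled axiom checks for C^3 passed")
```

A finite check like this is of course only a sanity test of the definitions on particular vectors, not a proof; the verification Halmos refers to is done once and for all by elementary algebra on the components.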
- Downloaded: 2020-12-05
- Points: 1
Mathematical Principles of the Google word2vec Algorithm
This document is a detailed explanation of the mathematics behind the word2vec algorithm. word2vec is an open-source tool from Google that computes the distance between words given only a collection of input words; once the distances are known, clustering follows naturally, and the tool even ships with built-in clustering, which makes it quite powerful.

§2 Preliminaries

This section introduces some background that word2vec relies on, including the sigmoid function, the Bayes formula, and Huffman coding.

§2.1 The sigmoid function

The sigmoid function is one of the activation functions commonly used in neural networks. It is defined as
σ(x) = 1 / (1 + e^{-x}).
Its domain is (−∞, +∞) and its range is (0, 1).
[Figure 1: graph of the sigmoid function.]
The derivative of the sigmoid function has the form
σ'(x) = σ(x)[1 − σ(x)].
From this it follows easily that the derivatives of log σ(x) and log(1 − σ(x)) are, respectively,
[log σ(x)]' = 1 − σ(x),   [log(1 − σ(x))]' = −σ(x).   (2.1)
Formula (2.1) will be used in later derivations.

§2.2 Logistic regression

Binary classification problems come up all the time in practice: is a given e-mail spam, is a given customer a potential customer, is a given online transaction fraudulent, and so on. Let {(x_i, y_i)}_{i=1}^{m} be the sample data of a binary classification problem, with x_i ∈ R^n and y_i ∈ {0, 1}; a sample with y_i = 1 is called a positive example and one with y_i = 0 a negative example.

Using the sigmoid function, for any sample x = (x_1, x_2, ..., x_n) the hypothesis function of the binary classification problem can be written as
h_θ(x) = σ(θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n),
where θ = (θ_0, θ_1, ..., θ_n) are the parameters to be determined. To simplify the notation we introduce x_0 = 1, extend x to (x_0, x_1, x_2, ..., x_n), and, when no confusion can arise, still denote it by x; h_θ can then be written compactly as h_θ(x) = σ(θᵀx). Taking the threshold T = 0.5, the classification rule is y(x) = 1 if h_θ(x) ≥ 0.5 and y(x) = 0 if h_θ(x) < 0.5.

Huffman coding

Consider encoding a message for transmission. If the message uses only the six characters "A, E, R, T, F, D", a fixed-length code of three bits suffices (2^3 = 8 > 6): the characters can be encoded as 000, 001, 010, 011, 100, 101 respectively, and the receiver decodes the message three bits at a time. Clearly the code length depends on the number of distinct characters in the message; if 26 distinct characters may occur, the fixed code length is 5 (2^5 = 32 > 26). When transmitting a message, however, one always wants the total length to be as short as possible. In practice characters occur with very different frequencies, A, B, C being used far more often than X, Y, Z, so it is natural to design the code so that frequently used characters get short codewords and rarely used ones get long codewords, optimizing the encoding of the whole message.

To make such a variable-length code a prefix code (no character's codeword may be a prefix of another's), one builds a binary coding tree with each character of the character set as a leaf. To obtain the shortest transmitted message, each character's frequency is assigned as the weight of its leaf: the lower the frequency, the smaller the weight, and the smaller the weight, the deeper the leaf sits in the tree, so low-frequency characters get long codewords and high-frequency characters get short ones. This yields the minimal weighted path length for the tree, which in effect means the shortest total message length. The problem of minimizing the transmitted message length therefore reduces to constructing the Huffman tree whose leaves are the characters of the character set, weighted by their occurrence frequencies. A binary prefix code designed from a Huffman tree is called a Huffman code; it satisfies the prefix condition and also guarantees the shortest total encoded length.

The word2vec tool described in this document also uses Huffman coding: it treats the words of the training corpus as leaves, takes their occurrence counts in the corpus as weights, and builds the corresponding Huffman tree to give every word a Huffman code. Figure 3 shows the Huffman codes of the six words of Example 2.1, under the convention that the (higher-frequency) left child is coded 1 and the (lower-frequency) right child is coded 0. The six words 我 (I), 喜欢 (like), 观看 (watch), 巴西 (Brazil), 足球 (football), 世界杯 (World Cup) then receive the Huffman codes 0, 111, 110, 101, 1001, and 1000, respectively.
[Figure 3: illustration of the Huffman codes.]

Note that up to this point two conventions have been made about Huffman trees and Huffman codes: (1) the node with the larger weight becomes the left child and the one with the smaller weight the right child; (2) the left child is coded 1 and the right child 0. In the word2vec source code the child with the larger weight is coded 1 and the child with the smaller weight is coded 0; to stay consistent with the conventions above, "left child" in what follows always refers to the child with the larger weight.

§3 Background

word2vec is a tool for producing word vectors, and word vectors are closely related to language models, so it helps to first review some knowledge about language models.

§3.1 Statistical language models

With the rapid growth of the internet, enormous amounts of text, image, speech, and video data are produced every day. Processing these data and mining valuable information from them is impossible without natural language processing (NLP) technology, and the statistical language model is a key part of it: it is the foundation of all NLP and is widely used in speech recognition, machine translation, word segmentation, part-of-speech tagging, information retrieval, and other tasks.

Example 3.1. In a speech-recognition system, for a given speech segment Voice one needs to find the text segment Text that maximizes the probability p(Text | Voice). By the Bayes formula,
p(Text | Voice) = p(Voice | Text) · p(Text) / p(Voice),
where p(Voice | Text) is the acoustic model and p(Text) is the language model ([8]).

Simply put, a statistical language model is a probabilistic model that computes the probability of a sentence; it is usually built from a corpus. What, then, is meant by the probability of a sentence? Suppose W = w_1^T := (w_1, w_2, ..., w_T) denotes a sentence consisting of the T words w_1, w_2, ..., w_T in order. Then the joint probability
p(W) = p(w_1^T) = p(w_1, w_2, ..., w_T)
is the probability of that sentence. By the Bayes formula it can be decomposed as the chain
p(w_1^T) = p(w_1) · p(w_2 | w_1) · p(w_3 | w_1^2) ··· p(w_T | w_1^{T-1}),   (3.1)
where the (conditional) probabilities p(w_1), p(w_2 | w_1), p(w_3 | w_1^2), ..., p(w_T | w_1^{T-1}) are the parameters of the language model. Once all these parameters have been computed, the probability p(w_1^T) of any given sentence w_1^T can be computed very quickly.

That looks simple enough, doesn't it? Implementing it, however, turns out to be troublesome. Consider first the number of model parameters. For a single given sentence of length T we need to compute T parameters. Suppose the dictionary D of the corpus has size (vocabulary) N; then for arbitrary sentences of length T there are in theory N^T possibilities, and each one requires T parameters, for a total of T·N^T parameters. This is of course only a rough estimate that ignores repeated parameters, but the order of magnitude is frightening enough. Moreover, once computed, these probabilities must all be stored, so keeping this information also requires a large amount of memory.

How, furthermore, are these parameters to be computed? Common approaches include the n-gram model, decision trees, maximum-entropy models, maximum-entropy Markov models, conditional random fields, neural networks, and so on. This document discusses only two of them: the n-gram model and neural networks. Let us look at the n-gram model first.

§3.2 The n-gram model

Consider the approximate computation of p(w_k | w_1^{k-1}) (k > 1). By the Bayes formula,
p(w_k | w_1^{k-1}) = p(w_1^k) / p(w_1^{k-1}),
and by the law of large numbers, when the corpus is large enough, p(w_k | w_1^{k-1}) can be approximated by
p(w_k | w_1^{k-1}) ≈ count(w_1^k) / count(w_1^{k-1}),   (3.2)
where count(w_1^k) and count(w_1^{k-1}) denote the numbers of occurrences of the word strings w_1^k and w_1^{k-1} in the corpus. One can imagine how time-consuming these counts become when k is large.

Formula (3.1) shows that the probability of a word depends on all the words before it. What if we assume instead that it depends only on a fixed number of preceding words? That is exactly the basic idea of the n-gram model: it makes a Markov assumption of order n − 1, namely that the probability of a word depends only on the n − 1 words immediately before it,
p(w_k | w_1^{k-1}) ≈ p(w_k | w_{k-n+1}^{k-1}),
so that (3.2) becomes
p(w_k | w_1^{k-1}) ≈ count(w_{k-n+1}^k) / count(w_{k-n+1}^{k-1}).   (3.3)
With n = 2, for example,
p(w_k | w_1^{k-1}) ≈ count(w_{k-1}, w_k) / count(w_{k-1}).
This simplification not only makes the counting of a single parameter easier (the word strings to be matched are shorter) but also reduces the total number of parameters.
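As a concrete illustration of the counting behind (3.2) and (3.3), here is a small Python sketch (our own toy example; it is taken neither from the document being described nor from the word2vec source) that estimates bigram probabilities p(w_k | w_{k-1}) by maximum likelihood over a tiny corpus.

```python
# Toy maximum-likelihood bigram estimator, i.e. formula (3.3) with n = 2.
# Illustrative sketch only; the corpus and the names are made up.
from collections import Counter

def bigram_estimator(sentences):
    """Build count tables and return an estimator for p(w_k | w_{k-1})."""
    unigram = Counter()   # count(w_{k-1})
    bigram = Counter()    # count(w_{k-1}, w_k)
    for sent in sentences:
        for prev, cur in zip(sent, sent[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1

    def prob(cur, prev):
        # p(w_k | w_{k-1}) ~= count(w_{k-1}, w_k) / count(w_{k-1}).
        # Unseen pairs get probability 0 here, which is exactly why the
        # smoothing step discussed later in this section is needed.
        return bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0

    return prob

corpus = [
    ["I", "like", "watching", "football"],
    ["I", "like", "Brazil"],
    ["Brazil", "like", "football"],
]
p = bigram_estimator(corpus)
print(p("like", "I"))         # count(I, like) / count(I) = 2/2 = 1.0
print(p("football", "like"))  # 1/3
print(p("Brazil", "like"))    # 1/3
```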
How large, then, should the parameter n of an n-gram model be? Generally speaking, the choice of n has to weigh two factors, computational complexity and model quality.

Table 1. Number of model parameters as a function of n
  n = 1 (unigram):  2 × 10^5
  n = 2 (bigram):   4 × 10^10
  n = 3 (trigram):  8 × 10^15
  n = 4 (4-gram):   16 × 10^20

On the complexity side, Table 1 shows how the number of model parameters grows as n increases, assuming a dictionary of size N = 200,000 (roughly the vocabulary size of Chinese). The number of parameters is in fact exponential in N, of order O(N^n), so n clearly cannot be taken too large; in practice the trigram model with n = 3 is used most often.

On the quality side, in theory the larger n, the better the model. Nowadays the massive data of the internet and improved machine performance make it feasible to compute higher-order language models (say n > 10), but once n is large enough, further increases bring smaller and smaller improvements: going from n = 1 to 2 and from 2 to 3 the model improves markedly, whereas from 3 to 4 the improvement is no longer significant (see the relevant chapters of Wu Jun's The Beauty of Mathematics, 《数学之美》). There is also a trade-off between reliability and discriminability: more parameters give better discriminability, but each individual parameter then has fewer instances, which lowers reliability, so a compromise between the two must be struck.

The n-gram model also involves an important step called smoothing. Going back to formula (3.3), consider two questions: if count(w_{k-n+1}^k) = 0, may we conclude that p(w_k | w_1^{k-1}) equals 0? And if count(w_{k-n+1}^k) = count(w_{k-n+1}^{k-1}), may we conclude that p(w_k | w_1^{k-1}) equals 1? Obviously not, yet the problem is unavoidable no matter how large the corpus is. Smoothing techniques deal with exactly this problem; we do not discuss them here (see [11]).

To sum up, an n-gram model is a model whose main work is to count the occurrences of various word strings in the corpus and to apply smoothing. The probabilities, once computed, are stored; the next time the probability of a sentence is needed, one simply looks up the relevant parameters and multiplies them together.

In machine learning, however, there is a general recipe: after modeling the problem at hand, first construct an objective function for it, then optimize that objective to obtain a set of optimal parameters, and finally use the model given by those optimal parameters for prediction. For statistical language models, using maximum likelihood, the objective function can be set to
∏_{w∈C} p(w | Context(w)),
where C denotes the corpus and Context(w) denotes the context of the word w, that is, the set of its surrounding words. When Context(w) is empty we take p(w | Context(w)) = p(w). In particular, for the n-gram model introduced above, Context(w_k) = w_{k-n+1}^{k-1}.

Note 3.1. The difference between the corpus C and the dictionary D: the dictionary D is extracted from the corpus C and contains no repeated words, whereas the corpus C refers to all of the text, repeated words included.

In practice one usually works with the maximum log-likelihood, that is, the objective function is taken to be
∑_{w∈C} log p(w | Context(w)),   (3.4)
which is then maximized. From (3.4) we see that the probability p(w | Context(w)) has been viewed as a function of w and Context(w), namely
p(w | Context(w)) = F(w, Context(w), θ).
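To tie (3.3) to the objective (3.4), the following Python sketch (again our own illustration with made-up names, not the word2vec implementation) evaluates ∑_{w∈C} log p(w | Context(w)) for a toy token stream under a bigram model, i.e. with Context(w_k) = w_{k-1}.

```python
# Evaluating the log-likelihood objective (3.4) under a bigram model,
# taking Context(w_k) to be the single preceding word w_{k-1}.
# Illustrative sketch only; this is not the word2vec implementation.
import math
from collections import Counter

def train_bigram(tokens):
    """Maximum-likelihood bigram probabilities p(w | prev), as in (3.3)."""
    unigram = Counter(tokens[:-1])
    bigram = Counter(zip(tokens, tokens[1:]))
    return lambda w, prev: bigram[(prev, w)] / unigram[prev]

def log_likelihood(tokens, prob):
    """Objective (3.4): the sum of log p(w | Context(w)) over the corpus."""
    return sum(math.log(prob(w, prev)) for prev, w in zip(tokens, tokens[1:]))

tokens = ["I", "like", "football", "I", "like", "Brazil", "I", "like", "football"]
p = train_bigram(tokens)
print(log_likelihood(tokens, p))  # closer to 0 means the model fits the corpus better
```

Here the "parameters" are simply the raw counts; the point of the functional form F(w, Context(w), θ) above is that word2vec replaces these counts with a parameterized function of word vectors and chooses θ by maximizing the same kind of objective, which is what the rest of the original document develops.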
- Downloaded: 2020-06-14
- Points: 1