First Prize in Risk-Control Modeling
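Uses PPDai (拍拍贷) data and covers the whole modeling process, from data preprocessing through the final model comparison. For exchange and study only.

1. Team Introduction

Our team name is "不得仰视本王". The team has five members who met in a similar competition (the Micro-Loan User Credit Prediction Contest); a shared love of data mining brought us together. Member profiles:

Name | School, year | Competition experience
Chen Jing (陈靖) | University of Science and Technology of China, 2nd-year graduate student | Third in overall score in the Tianchi scientist ranking; third place, Micro-Loan User Credit Prediction Contest
Zhu Zhiliang (朱治亮) | Zhejiang University, 2nd-year graduate student | Third place, Taobao outfit-matching competition; third place, Micro-Loan User Credit Prediction Contest
质耀 | Chongqing University of Posts and Telecommunications, 2nd-year graduate student | Champion, Micro-Loan User Credit Prediction Contest
Zhao Rui (赵蕊) | Chongqing University of Posts and Telecommunications, graduate student | Runner-up, Micro-Loan User Credit Prediction Contest
Huang Weipeng (黄伟鹏) | Peking University, 1st-year graduate student | Champion, Micro-Loan User Credit Prediction Contest

2. Solution Overview

2.1 Project Introduction and Problem Analysis

PPDai's "Magic Mirror" risk-control system evaluates a user's current credit status from about 400 data dimensions on average and assigns every borrower a credit score for that status. On top of this, it combines the information of each newly issued listing to predict the listing's overdue rate within 6 months, giving investors a key basis for their decisions.

The goal of this competition is to predict, from a user's historical behavior data, the probability that the user will be overdue on repayment within the next 6 months. We treat this as a binary classification problem evaluated by AUC, with features built from the Master, LogInfo, and UpdateInfo tables. Because AUC is in essence a ranking metric, the top-level model ensemble also uses a ranking-based method, RANK_AVG.

2.2 Overall Approach

This write-up starts with data cleaning: our multi-dimensional handling of missing values, our methods for removing outliers, and the handling of character case, spaces, and similar issues. Next comes feature engineering, including features built from geographic location, transaction-time features, categorical feature encoding, combined features, and features extracted from the UpdateInfo and LogInfo tables. Then comes feature selection, for which we use xgboost: training xgboost is itself a process of ranking features by importance. After that we deal with class imbalance. The competition data is imbalanced, so we used both cost-sensitive learning and oversampling, and we focus on the oversampling method. The final part is model design and analysis: we used logistic regression, which is widely applied in industry, and xgboost, the workhorse of data-mining competitions, and we explored applying a large-scale SVM to this problem, with good results. Model ensembling is also covered.

3. Data Cleaning

3.1 Multi-Dimensional Handling of Missing Values

In credit reporting, how complete a user's information is may affect that user's credit rating. A user whose information is 100% complete will pass review and obtain a loan more easily than a user whose information is 50% complete. With this in mind, we analyzed and handled missing values along several dimensions.

By column (attribute): we counted the missing values in each column and derived each column's missing rate. Figure 1 shows the attributes that contain missing values and their missing rates.

Figure 1. Missing rate of attributes

WeblogInfo_1 and WeblogInfo_3 have a missing rate of 97%; these two columns carry almost no useful information and are dropped. UserInfo_11, UserInfo_12, and UserInfo_13 have a missing rate of 63%; these three columns are categorical, so missing values are filled with -1, which treats "missing" as one more category. Other numeric attributes with small missing rates are filled with the median.

By row (sample): we counted the missing attribute values of each sample, sorted the counts in increasing order, and drew a scatter plot with the rank as the horizontal axis and the missing count as the vertical axis (Figure 2).

Figure 2. Number of missing attributes per sample (panels: train set, test set; x axis: order number, sorted increasingly; y axis: number of missing values)

Comparing the missing counts on the train set and the test set shows that the distributions are essentially the same, except that the train set contains a few samples with an exceptionally large number of missing values (inside the red box). These samples can be regarded as outliers and are removed.

In addition, the missing count itself can be used as a feature that measures how complete a user's information is.

3.2 Removing Constant Variables

The raw data has 190 numeric features. We computed the standard deviation of each numeric feature and removed those that vary very little. The 15 features in Table 1 have a standard deviation close to 0, and we removed all 15.

Table 1. Standard deviation of the removed numeric features

Attribute | Std. dev. | Attribute | Std. dev. | Attribute | Std. dev.
WeblogInfo_10 | 0.0707 | WeblogInfo_41 | 0.0212 | WeblogInfo_49 | 0.0071
WeblogInfo_23 | 0.0939 | WeblogInfo_43 | 0.0372 | WeblogInfo_52 | 0.0512
WeblogInfo_31 | 0.0828 | WeblogInfo_44 | 0.0166 | WeblogInfo_54 | 0.0946
WeblogInfo_32 | 0.0834 | WeblogInfo_46 | 0.0290 | WeblogInfo_55 | 0.0331
WeblogInfo_40 | 0.0666 | WeblogInfo_47 | 0.0401 | WeblogInfo_58 | 0.0609

3.3 Outlier Removal

A point in the sample space whose general behavior or features are inconsistent with the other samples is called an outlier. Since the abnormal features of an outlier may be a combination of several dimensions, we analyzed the number of missing attribute values per sample and removed a very small number of outliers (see Section 3.1).

We also used another simple and effective method: train xgboost on the raw data, take the feature importance output by the resulting xgb model, select the 20 most important features (Figure 3), count each sample's missing values over these 20 features, and treat samples with more than 10 missing values as outliers.

Figure 3. Xgb feature importance (the top features are ThirdParty_Info_Period features)

This method removed a little over 400 samples. These samples are missing values on the important features, which makes learning harder for the model; from this point of view they can be regarded as outliers and should be removed.

3.4 Other Processing

(1) Character case conversion
The UserupdateInfo1 field of the Userupdate_Info table takes English strings in mixed case, such as "QQ" and "qQ", which are clearly the same value. We converted all characters to lower case.

(2) Space handling
Values of the UserInfo_9 field in the Master table contain space characters, such as "中国移动 " and "中国移动". These are the same value, so the spaces must be removed.

(3) City name handling
UserInfo_8 contains values such as "重庆" and "重庆市", which are actually the same city, so every "市" in the strings is removed. After removing "市", the number of distinct cities drops from over 600 to over 400.

4. Feature Engineering

4.1 Handling Geographic Location

The simplest way to handle geographic location (a categorical variable) is one-hot encoding, but this yields very high-dimensional sparse features that hinder model learning. We therefore applied feature selection on top of one-hot encoding. The specific methods follow.

The competition data provides users' geographic location in six fields: UserInfo_2, UserInfo_4, UserInfo_7, UserInfo_8, UserInfo_19, and UserInfo_20. UserInfo_7 and UserInfo_19 hold provinces; the rest hold cities. We computed the default rate of every province and city; the figure below uses UserInfo_7 as an example.

Figure 4. Default rate by province

Figure 4 visualizes the default rate of each province; the darker the color, the higher the default rate. The provinces or municipalities with the highest default rates are Sichuan, Hunan, Hubei, Jilin, Tianjin, and Shandong, as shown below.

Figure 5. Provinces with prominent default rates

We can therefore build 6 binary features: "is Sichuan", "is Hunan", ..., "is Shandong", each taking the value 0 or 1. This is equivalent to one-hot encoding the location and then keeping only some of the discriminative columns. UserInfo_7 has 32 distinct values, so encoding gives 32 sparse dimensions, of which we keep only 6.

The binary features above were built by manual analysis, which is fairly intuitive for provinces but much less so for cities: UserInfo_2, for example, contains 333 cities. To obtain discriminative binary features, we first one-hot encoded UserInfo_2 into 333 binary features, trained an xgb model on these 333 sparse features, and then filtered the binary features by the feature importance output by xgb. Some of the selected binary features (by city) are: Weifang, Jiujiang, Sanmenxia, Shantou, Changchun, Tieling, Jinan, Chengdu, Zibo, and Mudanjiang.

Merging by city tier
When a categorical feature has too many distinct values, one-hot encoding produces sparse features of too high a dimension. Besides the feature selection method above, we also merged values. Categories are merged by city tier: for example, the first-tier cities Beijing, Shanghai, Guangzhou, and Shenzhen are merged and assigned 1; likewise, second-tier cities are merged as 2 and third-tier cities as 3.

Introducing latitude and longitude
All the handling of geographic location above is categorical. We additionally collected the latitude and longitude of every city and replaced the city name with them, which turns a categorical variable into numeric ones. Beijing, for example, is replaced with (39.92, 116.46), giving two numeric features, north latitude and east longitude. After adding latitude and longitude, offline cross-validation improved at the third decimal place.

Vectorizing city features
We counted the occurrences of each city within a city feature, took the log, and then discretized the result into 6 to 10 equal-width intervals. In the figure below, the 325 cities in UserInfo_2 are discretized into a 6-dimensional vector; the vector "100000" means the city falls in the first interval. Offline cross-validation improved at the third decimal place.

Figure 6. Discretization of city features (x axis: Log(UserInfo2_num))

Geographic difference features
As shown in Figure 7, columns 1, 2, 4, and 6 are all cities. We build city-difference features from them: diff_12, for example, indicates whether the cities in columns 1 and 2 are the same. In this way we build six city-difference features: diff_12, diff_14, diff_16, diff_24, diff_26, and diff_46. Offline cross-validation improved at the third decimal place.

Figure 7. Example of geographic location differences (columns: UserInfo_2, UserInfo_4, UserInfo_7, UserInfo_8, UserInfo_19, UserInfo_20)

4.2 Transaction-Time Features

We counted the daily loan volume in the training set, separately for positive and negative samples, and obtained the curves in Figure 8. The horizontal axis is the date (20131101 to 20141109) and the vertical axis is the daily loan volume. The blue curve is the daily number of defaulting samples (multiplied by 2 to make the comparison clearer); the green curve is the non-defaulting samples.

Figure 8. Daily loan volume (train set; x axis: date, 20131101 to 20141109; y axis: count)

PPDai's business volume is growing overall, while the number of defaults grows slowly at first and then stays roughly constant, so the default rate is stable or even falling overall. In the date range corresponding to 300~350 on the horizontal axis there are several days with very large loan volumes; these may hide some information that we have not yet uncovered.

Since the default rate is related to the timeline, we processed the transaction-time field ListingInfo in several ways. One is to treat it directly as a continuous feature, namely the horizontal axis of the figure above. Another is discretization with every 10 days as one interval: days 0~10 are mapped to 1, days 11~20 to 2, and so on.

4.3 Handling Categorical Features

Apart from the categorical features that received the special handling described above, all other categorical features are one-hot encoded.

4.4 Combined Features

Once xgboost has been trained it can output feature importance. We found that the third-party data features "ThirdParty_Info_Period_XX" have relatively large feature scores (see Figure 3), that is, they are highly discriminative, so we built combined features from them. Dividing these features pairwise gives more than 7,000 features. We trained a separate xgboost model on these 7,000-plus features, ranked them by the resulting feature importance, and took the top 500, which on their own reach an AUC of 0.73+ in offline CV. Adding these 500 features to the original feature set raised the offline CV AUC from 0.777 to 0.7833.

We also combined multiplicative features (with a log): log(x*y). We selected a little over 270 of them and added them to the original feature set, which raised single-model CV further to about 0.785.

4.5 UpdateInfo Table Features

From the provided modification-information table we extracted features describing how users modified their information, such as the number of modifications, the span from the modification time to the transaction time, and the number of modifications of each kind of information.

4.6 LogInfo Table Features

Similarly, from the login-information table we extracted login features, such as the number of login days, the average interval between logins, and the count of each operation code.

4.7 Rank Features

We sorted each of the 190 original numeric features in increasing order of value and obtained 190 rank features. Rank features are more robust to abnormal data, make the model more stable, and reduce the risk of overfitting.

5. Feature Selection

In the feature engineering part we built a series of location-related features, combined features, transaction-time features, rank features, sparse categorical features, and features from UpdateInfo and LogInfo. Together they add up to nearly 1,500 dimensions. This many features may lead to the curse of dimensionality and can easily cause overfitting, so the dimensionality has to be reduced. Common dimensionality-reduction methods include PCA and t-SNE, but their computational cost is fairly high, and in our past experience PCA and t-SNE often perform poorly in data-mining competitions.

Besides dimensionality-reduction algorithms, feature selection can also reduce the number of features. There are many feature selection methods: the maximal information coefficient (MIC), the Pearson correlation coefficient (which measures linear correlation between variables), regularization (L1, L2), and model-based feature ranking. The last is the most efficient, and it has one advantage: model learning and feature selection happen at the same time. We therefore adopted this approach and based feature selection on xgboost. After training, the xgboost model outputs feature importance (see the figure in Section 3.3), from which we keep the top N features.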
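The top-N selection described in Section 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's code: it assumes the importance scores are already available as a dict mapping feature name to score (the shape returned by, for example, xgboost's `Booster.get_score()`), and the function names `select_top_n` and `keep_columns`, the feature names, and the scores are all hypothetical.

```python
from typing import Dict, List


def select_top_n(importance: Dict[str, float], n: int) -> List[str]:
    """Return the names of the n features with the highest importance.

    Ties are broken by feature name so the result is deterministic.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Sort by descending score, then by ascending name.
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


def keep_columns(rows: List[Dict[str, float]], keep: List[str]) -> List[Dict[str, float]]:
    """Project each sample (a dict of feature -> value) onto the kept features."""
    return [{k: row[k] for k in keep if k in row} for row in rows]


# Hypothetical importance scores; in practice they would come from a trained model.
scores = {"f_ratio": 310.0, "f_geo": 42.0, "f_rank": 7.0, "f_time": 7.0}
top = select_top_n(scores, 3)

# Hypothetical samples reduced to the selected features.
samples = [{"f_ratio": 1.5, "f_geo": 0.0, "f_rank": 12.0, "f_time": 3.0}]
reduced = keep_columns(samples, top)
```

N is a tuning choice; a reasonable way to pick it is to compare cross-validated AUC for several values rather than fixing it in advance.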
- Downloaded 2020-06-23
- Points: 1
NGSIM User Manual (1)
User manual for the U.S. NGSIM system, intended to help readers use NGSIM efficiently to download data for transportation research.

Technical Report Documentation Page

1. Report No.: FHWA-HOP-06-012
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Next Generation Simulation (NGSIM) Data Format Plan
5. Report Date: July 2004
6. Performing Organization Code:
7. Author(s): Vijay Kovvali, Richard Margiotta, Robert Franc, Vassili Alexiadis
8. Performing Organization Report No.:
9. Performing Organization Name and Address: CAMBRIDGE SYSTEMATIC INC, 150 CAMBRIDGE PARK DRIVE, SUITE 4000, CAMBRIDGE MA 02140
10. Work Unit No.:
11. Contract or Grant No.: DTFH61-02-C-00036
12. Sponsoring Agency Name and Address: Department of Transportation, Federal Highway Administration, Office of Acquisition Management, 400 Seventh Street SW, RM 4410, Washington, DC 20590
13. Type of Report and Period Covered: Final report, July 2003-July 2004
14. Sponsoring Agency Code:
15. Supplementary Notes: FHWA COTR: John Halkias, Office of Operations, and James Colyar, Office of Operations R&D
16. Abstract: The Next Generation Simulation Program (NGSIM) Data Format Plan was developed to define the structure, documentation, and transfer requirements for data that will be collected for estimation, calibration, and validation of core behavioral algorithms. The development of the Data Format Plan is based on existing formats that are relevant to NGSIM and augmented to fill in gaps. To this end, a review of existing data formats was undertaken and their relevance to NGSIM was assessed. The review included data standards developed for intelligent transportation systems (ITS), data formats developed specifically for traffic simulation models, and data formats developed for broader transportation applications. The specified data formats were developed with the objective of promoting efficient research by maintaining consistency between data collection and research, and providing consistent storage and transmittal protocols. On the other hand, this plan intentionally avoids overspecification of data formats, so as to minimize unnecessary limitations to research. This document specifies the conceptual data model by means of Unified Modeling Language (UML) class diagrams; the data dictionary in the data standard prescribed by P1489-1999 format developed by the Institute of Electrical and Electronics Engineers (IEEE); the data exchange structure for data transfer from user to user or from the database/repository to users; and the NGSIM metadata.
17. Key Words: Next generation simulation, NGSIM, traffic simulation, high-level plan, traffic data collection, vehicle trajectory data
18. Distribution Statement: No restrictions. This document is available to the public through the National Technical Information Service, Springfield, VA 22161
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages:
22. Price:
Form DOT F 1700.7 (8-72). Reproduction of completed pages authorized.

TABLE OF CONTENTS

EXECUTIVE SUMMARY
1.0 INTRODUCTION
1.1 High-Level Plan Context
1.2 Background
1.3 Data Collection Types
1.4 Data Conversion
1.5 Data Formats
2.0 NGSIM DATA REQUIREMENTS
2.1 NGSIM Data
2.2 Microsimulation Software Data Format
2.3 Requirements for NGSIM Data Collection
3.0 RECOMMENDED NGSIM DATA FORMATS
3.1 NGSIM Data Model
3.2 NGSIM Data Dictionary
3.3 NGSIM Metadata
3.4 NGSIM Data Exchange Format
3.5 File and Directory Naming Convention
3.6 Summary
REFERENCES
APPENDIX A - REVIEW OF EXISTING TRANSPORTATION DATA FORMATS
APPENDIX B - ACCURACY REQUIREMENTS FOR NGSIM DATA COLLECTION
APPENDIX C - DATA MODEL
APPENDIX D - DATA DICTIONARY
APPENDIX E - METADATA
APPENDIX F - SYSTEM-STATE DATA

List of Figures

Figure 1. Diagram. NGSIM task interdependencies
Figure 2. Diagram. Data format classification relevant to NGSIM
Figure 3. Diagram. Top level data model of general traffic simulation
Figure 4. Diagram. Influencing factors database packages
Figure 5. Diagram. Behavioral models packages
Figure 6. Diagram. Facility type generalization
Figure 7. Diagram. Traffic management systems generalization
Figure 8. Diagram. Transit management systems generalization
Figure 9. Diagram. Environment generalization
Figure 10. Diagram. NTCIP Controller class diagram
Figure 11. Diagram. Actuated traffic signal controller generalization
Figure 12. Diagram. Generalized microsimulation data model
Figure 13. Diagram. Data concept components and constructs (IEEE Std 1489-1999)

List of Tables

Table 1. Example validation data by algorithm category
Table 2. Summary of NGSIM categorizations for data formats
Table 3. Accuracy requirements for vehicle trajectory data
Table 4. Accuracy requirements for instrumented vehicle data
Table 5. Accuracy requirements for wide-area detector data
Table 6. Accuracy requirements for network-related data
Table 7. Accuracy requirements for representative transportation management systems data
Table 8. Terminology for UML modeler
Table 9. Data dictionary for NGSIM
Table 10. Processing documentation metadata for NGSIM
Table 11. Requisite vehicle trajectory data
Table 12. Requisite wide-area detector data requirements

EXECUTIVE SUMMARY

The Next Generation Simulation Program (NGSIM) Data Format Plan was developed to define the structure, documentation, and transfer requirements for data that will be collected for estimation, calibration, and validation of core behavioral algorithms. The development of the data format plan is based on existing formats that are relevant to NGSIM and augmented to fill in gaps. To this end, a review of existing data formats was undertaken and their relevance to NGSIM was assessed. The review included data standards developed for intelligent transportation systems (ITS), data formats developed specifically for traffic simulation models, and data formats developed for broader transportation applications. The specified data formats were developed with the objective of promoting efficient research by maintaining consistency between data collection and research, and providing consistent storage and transmittal protocols. On the other hand, this plan intentionally avoids overspecification of data formats, so as to minimize unnecessary limitations to research.

Four data format components were specified in this document, including: 1) data model, 2) data dictionary, 3) metadata, and 4) data exchange format.

- NGSIM Data Model - The conceptual data model for NGSIM data formats is presented by means of Unified Modeling Language (UML) class diagrams. Used in conjunction with the data dictionary, the data model allows for construction of a formal database/repository for NGSIM validation data.
- NGSIM Data Dictionary - This provides definition of individual data elements required by NGSIM. It follows the data standard prescribed by P1489-1999 format developed by the Institute of Electrical and Electronics Engineers (IEEE).
- NGSIM Data Exchange Format - The data exchange structure defines how data should be transferred from user to user or from the database/repository to users. This document specifies the framework for developing data exchange formats by providing the data model and the data dictionary; it also provides clear guidance on the format standards with which the data exchange format should conform. Currently it does not provide specific schema for the data exchange formats.
- NGSIM Metadata - This includes both traditional metadata (definitions, specifications, and valid value lists for data elements and general information about the dataset and its availability); and processing metadata (what has happened to the data from data collection to data archival). Administrative metadata formats were adapted from Content Standard for Digital Geospatial Metadata (FGDC-STD-001-1998), developed by the Federal Geographic Data Committee (FGDC). Recommendations for NGSIM processing metadata are based on the guidance provided in ASTM E2259-03, developed by the American Society for Testing and Materials (ASTM).

1.0 INTRODUCTION

The objectives of the NGSIM program include the following:

- Development of a core set of open behavioral algorithms in support of traffic simulation with a primary focus on microscopic modeling.
- Collection of extensive data that will be used for estimation, calibration, and validation of the core behavioral algorithms; and storing the data in a repository that can be universally accessed.

The High-Level Plan for Datasets (Task E.3) identified different kinds of traffic data collection methods and technologies and recommended three kinds of data collection efforts for NGSIM, including vehicle trajectory data, wide-area detector data, and instrumented vehicle data.

This report (Task F) presents the documentation, format structure, and transfer requirements for the NGSIM data formats for these data collection efforts identified in Task E.3.

This report is organized as follows:

- Executive Summary - Provides an executive summary of this document.
- Section 1.0 - Provides an overview and introduction to this report, including the context of the data format plan within NGSIM, information on NGSIM data collection and data types, information on data conversion, general information on data formats, and a summary of available transportation data formats and their relevance for NGSIM.
- Section 2.0 - Presents definitions and categorization for different data types, and provides NGSIM data requirements.
- Section 3.0 - Presents data format recommendations for the NGSIM program, including a data model, data elements for the data dictionary, metadata to describe the data collection effort, and data exchange formats.
- References - Presents references used in developing this data format plan.
- Appendix A - Presents a review of existing transportation data formats.
- Appendix B - Presents accuracy requirements for NGSIM data collection.
- Appendix C - Presents a UML representation of the NGSIM data model.
- Appendix D - Presents a high-level NGSIM data dictionary.
- Appendix E - Presents metadata categories, dictionary, and recommended metadata formats for NGSIM.

1.1 HIGH-LEVEL PLAN CONTEXT

Interdependencies among NGSIM tasks are shown in figure 1. The High-Level Plan for Datasets (Task E.3) presents an assessment of existing datasets of potential use for NGSIM, and makes recommendations on the focus for NGSIM data collection methodologies. This report on the Data Format Plan (Task F) provides recommendations on the data exchange format(s) for NGSIM data collection efforts. The data formats are also influenced by the High-Level Verification and Validation Plan (Task E.2).

Figure 1. Diagram. NGSIM task interdependencies. (Boxes: Task E.1-1 Core Algorithm Assessment; Task E.1-2 High-Level Core Algorithm Plan; Task E.2 High-Level Verification and Validation Plan; Task E.3 High-Level Plan for Datasets; Task F Data Format Plan.)

1.2 BACKGROUND

The NGSIM field data collection effort pursues data required for developing, estimating, calibrating, and validating traffic behavioral algorithms. Tactical route execution, operational driving, and en-route strategic traveler behaviors were identified as the focus of the NGSIM core behavioral algorithm research in the Identification and Prioritization of Core Algorithms (Task D) report. The High-Level Verification and Validation Plan (Task E.2) provides an example of the data collection datasets for each algorithm category, as shown in table 1. The table illustrates the extent over which data must be collected for each level of algorithm. For example, for operational driving algorithms, a single stretch of roadway on a freeway will likely be sufficient, while, for development of tactical driving algorithms, the data collection effort should be expanded to include the freeway section and multiple entry and exit ramps that feed the freeway. The data formats developed in this plan address the data, both static and dynamic, that are pertinent to the data collection efforts necessary for developing and validating all three categories of driver behavioral algorithms.
- Downloaded 2020-12-05
- Points: 1