
resMBS

2016

resMBS: Constructing a Financial Supply Chain from Prospectus

Doug Burdick, IBM, drburdic@us.ibm.com
Soham De, University of Maryland, sohamde@cs.umd.edu
Louiqa Raschid, University of Maryland, louiqa@umiacs.umd.edu
Mingchao Shao, University of Maryland, shao@cs.umd.edu
Zheng Xu, University of Maryland, xuzh@cs.umd.edu
Elena Zotkina, University of Maryland, ezotkina@umiacs.umd.edu

ABSTRACT
Understanding the behavior of complex financial supply chains is usually difficult due to a lack of data capturing the interactions between financial institutions (FIs) and the roles that they play in financial contracts (FCs). resMBS is an example supply chain corresponding to the US residential mortgage backed securities that were critical in the 2008 US financial crisis. In this paper, we describe the process of creating the resMBS graph dataset from financial prospectuses. We use the SystemT rule-based text extraction platform to develop two tools, ORG NER and Dict NER, for named entity recognition of financial institution (FI) names. The resMBS graph comprises a set of FC nodes (one per prospectus) and the corresponding FI nodes that are extracted from the prospectus. A Role-FI extractor matches a role keyword such as originator, sponsor or servicer, with FI names. We study the performance of the Role-FI extractor, and of ORG NER and Dict NER, in constructing the resMBS dataset. We also present preliminary results of a clustering based analysis to identify financial communities and their evolution in the resMBS financial supply chain.

Keywords
Rule based text extraction, SystemT, named entity recognition, mortgage backed securities (MBS), clustering, community detection.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
DSMM'16, June 26-July 01 2016, San Francisco, CA, USA. © 2016 ACM. ISBN 978-1-4503-4407-4/16/06...$15.00. DOI: http://dx.doi.org/10.1145/2951894.2951895

1. INTRODUCTION
There has been significant interest in understanding the behavior of complex financial eco-systems and supply chains. An example is the supply chain comprising US residential mortgage backed securities, resMBS [1, 5, 8]. This system, combined with the subprime mortgage crisis, led to the 2008 U.S. financial crisis. Unfortunately, information about the complex financial networks that comprise such a supply chain has not historically been made available to regulators, analysts or academic researchers. Our proposition is that through text analytics, financial supply chains can be successfully extracted from prospectuses that are filed with the Securities and Exchange Commission (SEC). Creating such public financial big data collections would then enable new streams of data science for finance research.

Figure 1: Summary of a Prospectus

Figure 1 illustrates the summary section of a resMBS prospectus; typically such a prospectus can be over 500 pages. The summary describes key particulars of the prospectus including the following: the title of the series and the nominal value of the contract; the names of financial institutions (FI names) that play a key role in the contract; the roles, e.g., issuing entity, depositor, sponsor, servicer, originator, trustee, etc.; important dates, e.g., cut-off date, distribution dates, etc.

In this paper, we describe the process and tools to create the resMBS dataset. We make extensive use of the SystemT information extraction platform [2, 4]. SystemT provides a high-performance scalable runtime environment for executing rule-based extractors, implemented in the declarative AQL language, over unstructured text. Our extraction pipeline uses two tools, ORG NER and Dict NER, developed in AQL for named entity recognition (NER), to extract FI names. ORG NER is a general purpose extractor while Dict NER is customized to use dictionaries to extract FI names.

The resMBS graph comprises a set of financial contract (FC) nodes, one for each prospectus, and the corresponding FI nodes that are extracted from the prospectus. A Role-FI extractor matches a role keyword such as originator, sponsor or servicer, with FI names. This creates a set of (Role, FI) pairs, where the FI plays the specified role in the FC. For example, from Figure 1, we can say that the (Role, FI) pair (Servicer, EMC Mortgage Corporation) should be extracted from the prospectus and associated with the corresponding FC node.

We evaluate the effectiveness of the Role-FI extractor, and of Dict NER versus ORG NER, over a collection of 5000+ resMBS prospectuses that were filed with the SEC between 2000 and 2008 (see footnote 1). We map each of the FI names to a corpus of FI names that was obtained from ABSNet (see footnote 2). Our evaluation reveals that while ORG NER and Dict NER provide similar performance for approximately 2100+ documents, they extract different (Role, FI) pairs for the other documents. This suggests that both approaches are needed to successfully create the resMBS dataset.

As an exemplar of novel data science for finance research, we present preliminary results of clustering the resMBS graph dataset. Our analysis illustrates the emergence and evolution of communities along the resMBS supply chain. First, we apply clustering to the financial institutions based on pairwise FI-FI similarity. This allows us to observe community evolution from 2002 to 2008.
We note that there are multiple strong communities in 2004, but by 2007 two financial institutions, Wells Fargo and Countrywide, dominate the resMBS supply chain. We also cluster the financial institutions based on FC-FC similarity; this allows us to drill down further to observe communities at a more granular level. This drill down reveals that while Wells Fargo and Countrywide do indeed dominate in 2007, each also participates in several smaller communities. This is a first step toward more sophisticated data science for finance research to understand the role played by communities in the performance of financial contracts.

Footnote 1: We downloaded the documents from the SEC website http://www.sec.gov/
Footnote 2: http://www.absnet.net/ABSNet/

2. EXTRACTION PIPELINE

Figure 2: Named Entity Recognition (NER) and Role-FI Extraction Pipeline

Figure 2 provides an overview of the SystemT based extraction pipeline. AQL rules are used to identify the header (cover sheet) and the summary section of the prospectus. Both sections were identified by domain experts to contain relevant information. Next, FI names of participant financial institutions are extracted from the header and the summary, using Dict NER and ORG NER. The Role-FI extractor then matches roles with (adjacent) FI names to produce (Role, FI) pairs, each of which corresponds to a Role (FI, FC) labeled edge. Finally, we map the FI names to the ABSNet corpus of FI names.

2.1 Role-FI Extractor
The first task is the identification of the relevant context, i.e., sentences in the prospectus header and/or summary. We summarize the most common cases, layouts, etc., as follows:
• A two column table or a list-like layout (see Figure 1) where one column includes a role (keyword) and the second has one or more mentions of FI names.
• Lists with unstructured text; this corresponds to the text describing servicers and originators in Figure 1.
• Detection of double or multi-column layouts, e.g., the two columns in Figure 3. We use heuristics, e.g., identifying two roles which occur one after the other, in the same sentence, without a comma or 'and' separating them. As can be seen in Figure 1, there are also double columns that fail this heuristic, e.g., Depositor followed by Seller and Sponsor.
While we use the cases above to refine and modify the logic behind our Role-FI extraction and matching rules, we found from experience that a single set of rules covering all cases performed better than separate rules for each case.

The next task is Role-FI extraction and matching. We consider the following cases:
• The base case is where the role keyword and the mention of the FI name are in proximity, in the same sentence, with a 'FI as role' structure. An example is the following sentence: Lehman Brothers Inc., as underwriter, will purchase from Lehman ABS Corporation, as depositor, ... A variation of this case is where phrases such as will sell or will service replace the role keyword.
• The next case is where an organization may play multiple roles; this may result in the following sentence structure: 'fi-1 role-1 role-2'. An example is the following: Accredited Home Lenders, Inc. (Sponsor and Servicer) ... Variations include the case where the roles precede the FI name mention.
• An outlier case is where the FI name mention is not in the same sentence as the role keyword. To address this case, we extend the matching so that adjacent FI name mentions and roles will be matched, despite not being located in the same sentence. As expected, heuristics for such complex cases can be difficult to develop and may not always be successful.

2.2 Named Entity Recognition (NER)
ORG NER is a sophisticated general purpose rule-based NER tool that is built on the SystemT platform. It has been shown to achieve state-of-the-art performance on several standard NER tasks [3]. ORG NER has multiple customization points which are exposed as user-defined dictionaries. These dictionaries allow ORG NER to be tuned to capture a variety of specialized root and suffix values across many specialized domains.

Dict NER [8] was developed for the specialized task of extracting financial institution (FI) names and named entity resolution of FIs. It was designed under the following assumptions:
• The FI names are composed of a root fragment and a suffix. The root fragment tends to be distinct and does not show much variation. There are a limited number of suffix options.
• An (almost) complete list(s) of formal complete names for financial institutions (FI names) will be provided.
• A version of the formal FI name will appear at least once in the document so that Dict NER can extract at least one mention that will match the formal name.

2.3 Sample Role-FI Extraction

Figure 3: Summary of Role-FI pairs extracted by Dict NER and ORG NER.

Figure 3 provides an example of the (Role, FI) pairs extracted by Dict NER and ORG NER. The top section of the figure displays the sample text from the prospectus header. The middle section shows (Role, FI) pairs that were successfully extracted by both NERs, as well as the (Role, FI) pairs that both missed. The bottom section of the figure shows the (Role, FI) pairs that are extracted only by Dict NER (on the left) and those pairs extracted only by ORG NER (on the right). Incorrect (Role, FI) pairs are in red and preceded with **.

While the general purpose ORG NER performs well on most documents, it has difficulty with a multi-column layout. Dict NER can handle such structures and is better at detecting the boundaries of entity mentions. For example, ORG NER incorrectly extracted American Mortgage Network, Inc. National City Mortgage Co. as a single name, while Dict NER correctly extracted two separate FI names. Wells Fargo as a servicer was not extracted by either approach. Dict NER extracted National City Mortgage Co., but it was wrongly labeled as both seller and servicer, instead of only servicer.

3. EFFECTIVENESS OF EXTRACTION
We evaluate the effectiveness of the Role-FI extractor, and of Dict NER versus ORG NER, over the collection of 5000+ resMBS prospectuses. For this evaluation, we map each of the FI names to the corpus of FI names from ABSNet. We make the following assumptions:
• An FI name that does not match an entry from the corpus represents an incorrect (Role, FI) pair. This is reasonable since ABSNet maintains a fairly comprehensive collection of asset backed products.
• We eliminate duplicates; they are usually extracted from both the header and the summary.
• When an FI name from the (Role, FI) pair matches an entry in the ABSNet corpus correctly, then this is a valid (Role, FI) pair.
This last assumption is the most difficult to validate since our experience, as illustrated in Figure 3, is that there can be errors in the extraction process. In ongoing research we are performing an extensive manual evaluation of (Role, FI) pairs over randomly sampled documents, to estimate the actual proportion of valid (Role, FI) pairs that are extracted.

Next, for each document, we examine the (Role, FI) pairs extracted by ORG NER and Dict NER. Table 1 provides summary statistics of the (Role, FI) pairs for ORG NER and Dict NER, respectively. We see that the extraction failed for a small number of documents. After eliminating duplicates, we retained only those (Role, FI) pairs that match FI names from ABSNet. We then determine the (Role, FI) pairs that were common to both ORG NER and Dict NER and the excess pairs that were extracted by only one of the two tools.

                                      Dict NER   ORG NER
Failed to extract                           17        43
Count (Role, FI) pairs                   53235     55040
Duplicates                               15029     14369
Distinct count                           38206     40671
Distinct count (matching ABSNet)         32942     32826
Common (Role, FI) pairs                  29472     29472
Excess (Role, FI) pairs                   3470      3354

Table 1: Summary statistics for extracted (Role, FI) pairs for ORG NER and Dict NER.

Table 2 compares the extraction statistics of the excess (Role, FI) pairs, for ORG NER and Dict NER, respectively. As can be seen, ORG NER and Dict NER performance is identical, and there is no excess of (Role, FI) pairs, for 2100+ documents. ORG NER has an excess ranging from 1 up to 5 or more (Role, FI) pairs for approximately 1000+ documents. Dict NER has a similar excess of (Role, FI) pairs for approximately 1400+ documents. What is really noteworthy is that there are also documents in which both ORG NER and Dict NER produce an excess of (Role, FI) pairs. For example, for 180 documents, ORG NER excess is 1 and Dict NER excess is 1. For a further 100+ documents, ORG NER excess ranges from 2 to 4+, while Dict NER excess remains at 1. Similarly, for a further 180+ documents, Dict NER excess ranges from 2 to 4+, while ORG NER excess remains at 1.

ORG NER Excess   Dict NER Excess   Count
0                0                  2164
1                0                   429
2                0                   179
3                0                    89
4                0                    81
5+               0                   126

ORG NER Excess   Dict NER Excess   Count
0                0                  2164
0                1                   885
0                2                   271
0                3                   144
0                4                    73
0                5+                   35

ORG NER Excess   Dict NER Excess   Count
1                1                   180
2                1                    48
3                1                    20
4+               1                    30

ORG NER Excess   Dict NER Excess   Count
1                1                   180
1                2                   117
1                3                    39
1                4+                   20

Table 2: Distribution of excess Role-FI pairs, for ORG NER and Dict NER, across the 5000+ financial contracts (FC).

To summarize, the performance of ORG NER and Dict NER is identical in 2100+ documents. In approximately 2400+ documents, either ORG NER or Dict NER produces an excess of 1 to 5+ (Role, FI) pairs. Finally, in about 500+ documents, both ORG NER and Dict NER produce an excess of (Role, FI) pairs, for the same document. This suggests that both approaches are needed to successfully create the resMBS dataset.

4. ANALYTICS PIPELINE AND PRELIMINARY RESULTS
Table 3 provides summary statistics for the resMBS graph dataset for each year, from 2002 through 2008. As can be seen, 2002 through 2003 is an emerging period. The number of contracts reaches a peak in 2005 and then declines while the supply chain dries up with the 2008 crisis.

Year   FC Count   FI-FC edges   $ Value (10^9)
2002        475          4037              273
2003        666          5900              414
2004        969          8367              723
2005       1184         10447             1019
2006       1051         13922              895
2007        680          9300              510
2008         71          1026               16

Table 3: Summary statistics for the financial contracts (FC).

Figure 4: Graph Analysis of the Role(FI, FC) resMBS

Figure 4 provides an overview of the analysis steps. The first step is data cleaning. This removes FCs with degree less than 3 and FIs with degree less than 20. We further filtered the dataset to only include the following four roles: Issuer, Sponsor, Originator and Servicer. While this significantly reduced the size of the graph, it also allowed us to focus on more important roles and interactions. We then compute the pairwise FI-FI and FC-FC similarity. Both FI-FI and FC-FC similarity are based on co-occurrence. FI-FI similarity is proportional to the count of documents in which the pair of FIs co-occur. FC-FC similarity is proportional to the count of FIs that co-occur in the pair of FCs. The similarity is normalized to between 0 and 1. We then performed three types of clustering as follows:
• FI-FI similarity clustering: This is a direct clustering of the financial institutions based on FI-FI similarity. We use the normalized cut (NCut) method [7]. This clustering will reveal dominant associations among financial institutions.
• FC-FC similarity: This is an indirect clustering based on FC-FC similarity using the NCut method. We then use the (FI, FC) labeled edges associated with each FC to produce clusters of FIs.
• (FI-FC) edge partition using semEP [6]: Unlike node based partitioning, this is a partitioning of the edges. The resulting clusters will provide granular insights into the key relationships among financial institutions across groups of contracts.

4.1 Observations from Clustering based on FI-FI Similarity

Figure 5: Direct clustering of the FIs from 2004 into eight clusters using NCut.

Figure 6: Direct clustering of the FIs from 2007 into eight clusters using NCut.

Figures 5 and 6 present the output of clustering FIs based on FI-FI similarity, for 2004 and 2007, respectively. We selected 2004 since we consider 2002 and 2003 as an emerging period, where the number of financial contracts increased rapidly and partnerships were being forged among the FIs. The number of contracts peaked in 2005 and declined in 2006 and 2007. 2007 is considered the terminal period since the number of financial contracts rapidly decreased with the onset of the financial crisis in 2008. We chose the number of clusters to be eight for ease of visualization.

In 2004, we observe that there are three major communities (purple, light green and light blue), three moderate communities (orange, dark green and light orange), and two small communities (red and pink) in Figure 5. Among the major communities, Wells Fargo dominates one (light green) and GreenPoint is a major partner. Another community (light blue) is dominated by Countrywide with Bear Stearns as a partner. The third (purple) community is dominated by Chase Manhattan with Cendant as a major partner. The three moderate communities are dominated by Aurora (light orange), GMAC (orange) and National City, Goldman Sachs and IndyMac (dark green). Finally, among the two small communities, Lehman is prominent in one (red) and Morgan Stanley in the other (pink).
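The co-occurrence based FI-FI similarity that feeds this clustering can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name, the toy contract identifiers and the choice to normalize by the largest pair count are all assumptions, since the paper only states that the similarity is proportional to the co-occurrence count and normalized to between 0 and 1.

```python
from collections import defaultdict
from itertools import combinations

# The four roles retained after filtering, as listed in Section 4.
KEPT_ROLES = {"Issuer", "Sponsor", "Originator", "Servicer"}


def fi_fi_similarity(edges, roles=KEPT_ROLES):
    """Return {(fi_a, fi_b): similarity} from (role, fi, fc) edges.

    The raw score is the number of contracts in which both FIs appear.
    Dividing by the largest raw score is an ASSUMED normalization that
    maps every value into (0, 1]; the paper does not give the formula.
    """
    # Group the institutions by the contract they participate in.
    fis_by_fc = defaultdict(set)
    for role, fi, fc in edges:
        if role in roles:
            fis_by_fc[fc].add(fi)

    # Count, for each unordered FI pair, the contracts they share.
    counts = defaultdict(int)
    for fis in fis_by_fc.values():
        for a, b in combinations(sorted(fis), 2):
            counts[(a, b)] += 1

    if not counts:
        return {}
    top = max(counts.values())
    return {pair: n / top for pair, n in counts.items()}


# Toy (role, FI, FC) edges; the contract ids are hypothetical.
edges = [
    ("Servicer", "Wells Fargo", "FC-1"),
    ("Originator", "GreenPoint", "FC-1"),
    ("Servicer", "Wells Fargo", "FC-2"),
    ("Originator", "GreenPoint", "FC-2"),
    ("Sponsor", "Countrywide", "FC-2"),
]
similarity = fi_fi_similarity(edges)
```

In this toy input, the pair that shares two contracts receives the highest score and the pairs that share one contract receive half of it. The resulting pairwise scores could then be arranged as a symmetric matrix and passed to a normalized cut style clustering routine; the FC-FC similarity would follow the same pattern with the roles of FIs and FCs swapped.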
We observe a very different pattern in 2007 in Figure 6. There is one large community (light green), two moderate communities (purple and light blue), and multiple smaller communities. The large (light green) community is dominated by Countrywide and Wells Fargo. Among the moderate communities, one (purple) includes GMAC, Deutsche Bank and American Home Mortgage, while the second (light blue) includes GreenPoint, Aurora, Lehman and IndyMac.

4.2 Clustering based on FC-FC Similarity

Figure 8: Indirect clustering of the FIs from 2007 into eight clusters using NCut.

As was just discussed, clustering FIs based on FI-FI similarity does not provide insight at a deeper level about smaller and more connected communities. As an alternative, we can use FC-FC similarity to cluster financial contracts, and then induce communities of FIs that are connected to those contracts. Figure 8 presents communities that were obtained following this approach. As can be seen, Figure 8 reveals many smaller communities, compared to Figure 6, for the same 2007 dataset. We note that the direct clustering of financial institutions of Figure 6 restricted each financial institution to one community. In contrast, the FC-FC based clustering of Figure 8 does not have that restriction.

Consider that Countrywide and Wells Fargo dominated the largest community in Figure 6. We can now observe in Figure 8 that Countrywide participates in three communities: with Wells Fargo (light green), with GMAC and Deutsche Bank (dark red), and with Morgan Stanley and Saxon (dark orange). Similarly, Wells Fargo also participates in a (pink) community with Bank of America, Washington Mutual and Sun Trust. The (purple) community of GMAC, Deutsche Bank and American Home Mortgage of Figure 6 is retained in Figure 8, with the addition of Wells Fargo into that community. Finally, the (light blue) community of Figure 6 is further divided; Aurora and Lehman now participate in a (light orange) community in Figure 8.

4.3 Edge Partitioning

Figure 7: FI FC Bipartite graph; this illustrates a community of FCs with a single issuer SACO (green edges) and multiple originators (pink edges).

Finally, we can apply edge partitioning to the Role (FI, FC) bipartite graph to further drill down to community details, i.e., the role played by the FIs. Figure 7 illustrates a community of eight FCs where SACO is the issuer (green edge) and where multiple other FIs are the originators (pink edges). We observe that all eight contracts in this community are also associated with Bear Stearns & Co. and its affiliate, EMC Mortgage.

5. SUMMARY
We describe the use of text analytics and the SystemT platform to successfully create the resMBS financial big data dataset, corresponding to the MBS supply chain. Both ORG NER and Dict NER are needed for good performance during the extraction of FI names. Preliminary analysis using a variety of clustering methods illuminates the emergence and evolution of financial communities along this supply chain. This is a first step toward more sophisticated research to understand the role played by communities.

We now discuss ongoing and future research. Inspired by Latent Dirichlet Allocation (LDA) and topic models, we have developed two probabilistic financial community models [9]. Our models are based on an intuitive assumption that FIs will form communities within an FC, and FIs within a community are more likely to collaborate with other FIs in that community, and play the same role, in another FC. Results from [9] indicate that these communities are indeed created, and that they can be used to describe and characterize the resMBS financial supply chain. In future research, we will explore if specific (Role, FI) pairs are correlated with the performance of the securities in the financial contracts. Alternatively, network characteristics of the resMBS graph may reveal issues related to contagion or stability under various stress conditions.

6. ACKNOWLEDGMENTS
The authors would like to thank Nancy Wallace and Paulo Issler from the Haas School of Business, UC Berkeley, and Rajasekar Krishnamurthy and Howard Ho from IBM Research, for many discussions and insightful feedback. This research was partially supported by awards NSF CNS1305368, NIST 70NANB15H194 and the Smith School of Business.

7. REFERENCES
[1] D. Burdick, M. Franklin, P. Issler, R. Krishnamurthy, L. Popa, L. Raschid, R. Stanton, and N. Wallace. Data science challenges in real estate asset and capital markets. In Proceedings of the International Workshop on Data Science for Macro-Modeling, pages 1–5. ACM, 2014.
[2] D. Burdick, M. A. Hernández, H. Ho, G. Koutrika, R. Krishnamurthy, L. Popa, I. Stanoi, S. Vaithyanathan, and S. R. Das. Extracting, linking and integrating data from public sources: A financial case study. IEEE Data Eng. Bull., 34(3):60–67, 2011.
[3] L. Chiticariu, R. Krishnamurthy, Y. Li, F. Reiss, and S. Vaithyanathan. Domain adaptation of rule-based annotators for named-entity recognition tasks. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1002–1012. Association for Computational Linguistics, 2010.
[4] L. Chiticariu, Y. Li, and F. R. Reiss. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, pages 827–832, 2013.
[5] J. Hunt, R. Stanton, and N. Wallace. U.S. residential mortgage transfer systems: A data management crisis. In M. Brose, M. Flood, D. Krishna, and B. Nichols, editors, Handbook of Financial Data and Risk Information II: Software and Data, 2014.
[6] G. Palma, M. Vidal, and L. Raschid. Drug-target interaction prediction using semantic similarity and edge partitioning. In The Semantic Web - ISWC 2014 13th International Semantic Web Conference Proceedings, Part I, pages 131–146, 2014.
[7] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000.
[8] Z. Xu, D. Burdick, and L. Raschid. Exploiting lists of names for named entity identification of financial institutions from unstructured documents. arXiv preprint arXiv:1602.04427, 2016.
[9] Z. Xu and L. Raschid. Probabilistic financial community models with Latent Dirichlet Allocation for financial supply chains. In SIGMOD DSMM. ACM, 2016.