Big Data - Information Management Perspective

As businesses keep expanding, the need for information keeps growing. Information Management is critical to business growth, and an invisible string ties the two together in a partnership. In newer enterprises, agility is higher and technology largely keeps up with business expectations, but the same is not true of older enterprises, which keep struggling with legacy issues. Delivering the right data on time to the business for decisioning and reporting is the biggest challenge for major enterprises, which is why Information Management is such a hot topic. Data integration, data quality and time to market are critical measures of a CIO's success. In the US banking and finance industry in particular, the OCC and other regulators are pressing hard for details on exposures (mortgage debt, HELOCs, credit cards, commercial loans) with a much quicker turnaround than before. Regulatory agencies are also imposing penalties when errors are discovered in the final reports submitted. Hence, the line of business is under tremendous pressure to deliver reports on time and with quality. At the same time, the introduction of new government-mandated programs for mortgage and HELOC products has increased the flux of changes in business processes, and these changes again come with stringent implementation timelines.

Having said that, IT systems have their own challenges and scalability issues. Burdened with legacy issues, the various business processes run in disparate applications managed by multiple IT groups, and the applications have convoluted dependencies on each other. Under challenging deadlines, IT groups struggle to inform one another of how a change in one application might impact the business processes in other applications. Some applications have moved to cloud-based platforms (e.g. Salesforce for case management and CRM).

As the data changes continuously in the source systems and no appropriate data monitors are in place, maintaining unified metadata for the data as it flows through multiple applications becomes difficult. This impacts overall data quality and hence the reporting.

The other, bigger issue is enormous data growth. Regulation requires that account holder information be maintained in the system for 7 years. For some of the deeper trend analyses, the data has to be maintained forever (e.g. reports that must provide details of losses at the borrower and loan level, or of fees charged to the customer during the life cycle of a loan). With regulatory and compliance pressure increasing the demand for faster time to market, archiving data to tape and retrieving it for reporting is too costly. This means data sources would have to keep historical and transactional data online.

Consequently, the data would keep exploding at all levels: origination, underwriting, fulfillment, servicing, modifications, post-closing, bankruptcy and foreclosure. Regulators have also started requesting immediate availability of the pertinent loan documents at each loan processing stage to verify that the processing was carried out per the rules. This implies that the size of the data would grow further, and it would limit how far the data can be compressed for storage. It would also mean increased load on the IT systems to make these documents immediately available; the collective data stores would be ~50 PB and would keep accumulating.

But it is not just regulation that is driving these changes. Banking and financial institutions have realized that they are sitting on a pile of data, and that consumer events can reflect a customer's ability to pay back debt. The graph below, whose data was compiled recently, shows the trend of bankruptcies vs. foreclosures. There is a close correlation between bankruptcy and foreclosure. Had this data been compiled and analyzed at the right time, the mortgage crisis could perhaps have been prevented.

Similarly, there could be correlations between delinquencies, liquidations and changes in customers' credit scores. Such data would need to be thoroughly analyzed. There are other situations where we would need to analyze whether a loan still becomes delinquent despite having been modified under the provisions of law. The analytics would have to foresee such situations and raise alerts accordingly.

Similarly, unstructured data sources such as social media, feedback captured on web sites and customer support discussions can give foresight into customer issues and could also help predict the impact of a new product or of changes to existing products. Such analysis is critical for organizations to ensure that customer issues are managed and escalated properly within the organization for speedy resolution.

But this data is HUGE and would keep growing, and mining it to extract key business insights would become more challenging. The entities would also keep expanding as new data sources appear, and hence the current approach to data analysis may not scale. If we try to bring all the transactional data for loan origination, fulfillment and underwriting onto a single platform for reporting, stability becomes an issue.

This makes a perfect business case for Big Data products and for leveraging their analytical power for business benefit. In 2012, there would be increased demand for predictive and real-time analytics capability. One definition of "Big Data" I have read: Big Data technology applies the power of massively parallel distributed computing to capture and sift through data gone wild, that is, data at an extreme scale of volume, velocity, and variability. This is where the open source Apache Hadoop helps: it offers a distributed file system and a MapReduce framework for analyzing structured and unstructured data. But Hadoop alone doesn't provide database services. We would need to couple it with a NoSQL database such as HBase and a query layer such as Hive, which sit on top of the MapReduce framework; writing MapReduce jobs by hand in Java, C++ or Python would otherwise be very cumbersome (imagine writing 10 lines of code for a select count(*) query). There are also a few relational databases which implement the MapReduce framework (e.g. Teradata AsterData) and which have demonstrated the ability to crunch huge amounts of data and provide analytics.
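To make the select count(*) remark concrete, here is a minimal sketch in plain Python of what a hand-written map/shuffle/reduce count looks like. It only imitates the three phases in a single process; it does not use Hadoop's actual API, and the function names and the record layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Illustrative only: an in-process imitation of map, shuffle and reduce.
# Function names and the "row_count" key are hypothetical, not Hadoop APIs.


def map_phase(record: dict) -> Iterator[Tuple[str, int]]:
    """Mapper: emit one ("row_count", 1) pair per input record."""
    yield ("row_count", 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group mapper output by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the ones emitted for a key."""
    return key, sum(values)


def count_rows(records: Iterable[dict]) -> int:
    """Equivalent of select count(*) over the given records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    results = dict(reduce_phase(key, values) for key, values in grouped.items())
    return results.get("row_count", 0)
```

Calling count_rows on a list of three loan records returns 3. The point of the sketch is the amount of ceremony involved: a SQL-like layer expresses the same thing in a single statement.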

How can Hadoop help deliver value in a rapidly changing business process environment?
Hadoop should be used to build out a centralized hub (Operational Data Store) for storing and managing the transactional data, while the master and reference data stay on their current database platforms. The transactional data should include the movement of the loan application through origination, underwriting, fulfillment and servicing. Miscellaneous transactions should cover other critical events such as fees, credit reversals and loan modifications. Additionally, other consumer transactions (deposits, cards, trading transactions and high-value investments) should also be brought onto this centralized hub.

These transactions would be huge in volume, and Hadoop, by virtue of its data crunching ability, would be able to maintain and manage this data. This platform would have to be evangelized across different groups and would serve as Data as a Service for multiple applications. Current applications would not have to migrate onto the new platform and can continue to use their current database platforms. For transactional data, they would connect to the Hadoop platform, and their current ETL and analytical frameworks would not have to be modified. With all the transactional data on a single platform and the analytical power of Hive and HBase over that data, organizations could look to transform the data into actionable business insight.

Below I list a few of the critical areas where leveraging Hadoop as a Big Data product would help.

a. Data Management: Better profiling of customer data and related activities enables a 360-degree view of the borrower's current financial health, which provides critical, business-sensitive insights. For example, if the FICO score is falling and the available deposit balance is falling too, it is an alert that the borrower's financial condition is deteriorating. Correspondingly, if the outstanding debt (including cards and mortgage) is higher than a threshold value, IT systems could alert the corresponding line of business to communicate with the customer, understand their needs and plan next steps accordingly.

Security and data governance must be imposed to ensure that the data, even though centralized, is available only to its appropriate users.

Big Data based solutions can help interconnect the disparate data sources as well as cloud-based solutions, either through a solution integrator approach (Informatica, IIS) or through direct connectivity via adaptors.
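The borrower alert described under Data Management can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BorrowerSnapshot fields, the alert names and the notion of "falling" (every snapshot lower than the previous one) are hypothetical choices, and a real rule would be defined by the line of business.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BorrowerSnapshot:
    """Hypothetical periodic snapshot of a borrower; field names are illustrative."""
    fico_score: int
    deposit_balance: float
    outstanding_debt: float  # cards plus mortgage


def is_declining(values: List[float]) -> bool:
    """True when there are at least two values and each is lower than the one before."""
    return len(values) >= 2 and all(b < a for a, b in zip(values, values[1:]))


def borrower_alerts(history: List[BorrowerSnapshot], debt_threshold: float) -> List[str]:
    """Return alert names for a borrower's snapshot history, oldest first."""
    alerts: List[str] = []
    if not history:
        return alerts
    fico = [float(s.fico_score) for s in history]
    deposits = [s.deposit_balance for s in history]
    # FICO score and deposits both falling: financial condition is deteriorating.
    if is_declining(fico) and is_declining(deposits):
        alerts.append("financial_health_deteriorating")
    # Latest outstanding debt above the (assumed) threshold set by the business.
    if history[-1].outstanding_debt > debt_threshold:
        alerts.append("debt_above_threshold")
    return alerts
```

The threshold is passed in as a parameter rather than hard-coded, since each line of business would likely want its own value.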

b. Data Governance and Enterprise Risk Management: Enterprise risk can be better managed and organization-wide exposure can be adequately identified. For example, fraud detection can be centralized and set up as a shared service across the different LoBs. In most banking and financial services organizations, this is NOT yet centralized, as it is very difficult to amass so much data on a single platform to perform the right analytics.

c. Analytics & Reporting: Pre-determined scenarios can be quickly programmed in Hadoop, and appropriate views can be built by an internal ETL framework in Hadoop / HBase / Hive. These views would in turn support business reporting and raise alerts when an event requiring attention occurs.

By building self-learning algorithms to analyze the huge data, data analysts and engineers would be able to automate several analyses, and the system would in turn be self-sufficient in discovering, quantifying and predicting business conditions.

d. IT - Cost and Time to Market: IT system stability is achieved quickly, as the investment is minimal compared to databases like Teradata, Netezza, Oracle, SQL Server or DB2, which would otherwise require a lot of scaling up for continuous data growth. In the big data architecture, the existing DW platforms can still be leveraged, and Hadoop-based products need not be integrated into the core architecture. They can remain external, act as service providers to internal IT systems and be leveraged on a need basis. The major DW platforms like Teradata, Oracle, DB2 and SQL Server have started adopting the MapReduce framework, and they can be integrated with Hadoop-based products using Sqoop (an open source tool that transfers data between relational databases and Hadoop) or custom connectors. (There is a Teradata Hadoop connector which ships free of cost with Teradata Asterdata to connect with the Cloudera Hadoop-based solution. For SQL Server, a Hadoop connector is being made available as well.)

On the reporting side, major BI tools such as Microsoft BI, MicroStrategy and Cognos also offer ODBC driver connectivity to Hadoop/Hive so that reporting becomes seamless.

If enterprises are able to build a strategy in which the reporting structures pull modeled data from the warehouse platform while leveraging Hadoop-based platforms for sourcing transactional data, the overall cost of managing data would come down drastically. With tools available on both the ETL and reporting sides for Hadoop-based platforms, the right migration strategy for moving data onto Hadoop-based Operational Data Stores would help provide the latest insight into the data for business reporting.

There are challenges in implementing Hadoop products and integrating them into the current architecture. The open source setup as it stands does not suit big enterprises out of the box, as they would have to customize the product to fit their enterprise architecture strategy. Enterprises would have to develop on top of Hadoop to ensure it complies with enterprise standards for security, data modeling, workload management, job scheduling etc. However, this skillset is rare, and the technology resources (architects, developers and testers) to lead the implementation would be difficult to find.

The alternative is to identify and implement enterprise-compliant products (like Teradata Asterdata, Hortonworks, EMC) and have them integrate with the current DW/DM architecture. The transactional data would have to be migrated from other databases to the Hadoop products, which requires investment in resources and knowledge acquisition.

The technical know-how for implementing Hadoop-based solutions is critical to success. Taking the existing IT staff and training them on Hadoop products is definitely a challenge, but with the right strategy in place it can be achieved.

Security and governance would require a radically different approach, as the volume is huge and conventional security and governance strategies might not work. Data profiling and metadata management would have to change and scale up for the large volume of data being stored and managed. Metadata tools would find it difficult to continuously scan the data for updates and populate the metadata dictionary. Given that Hadoop / Hive would store transactional data and be subject to rapidly changing source applications, data integrity would also be a challenge. Data quality monitors, which many DW/DMs have started installing to raise an alert when the data goes out of sync and loses its business value, would also become difficult to implement owing to the size of the data. A few solutions like Splunk and Ganglia help with data monitoring, but the suitability of these tools in a particular business environment would have to be decided by a proof of concept.
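As a rough illustration of what a data quality monitor checks, the sketch below compares per-table row counts reported by a source system against those on the hub and flags tables that have drifted out of sync. It is a simplification under stated assumptions: the function name, the table names and the use of row counts with a fractional tolerance are all hypothetical, and a real monitor at this scale would need sampling or checksums as well.

```python
from typing import Dict, List


def find_out_of_sync(
    source_counts: Dict[str, int],
    hub_counts: Dict[str, int],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the tables whose hub row count differs from the source
    by more than `tolerance` (a fraction of the source count)."""
    flagged: List[str] = []
    for table, expected in source_counts.items():
        # A table missing from the hub is treated as having zero rows there.
        actual = hub_counts.get(table, 0)
        allowed = expected * tolerance
        if abs(expected - actual) > allowed:
            flagged.append(table)
    return sorted(flagged)
```

For example, with a tolerance of 0.05, a source table of 50 rows that shows 45 rows on the hub is flagged, while an exact match is not. The counts themselves would have to come from the source applications and from the hub (for instance via a Hive query), which is outside this sketch.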

While building a plan to transition to big data infrastructure, the CIO would face the daunting task of deciding when and how to leverage the opportunities created by transforming the business processes. The challenges would include selecting the right opportunity, gathering LoB support, assessing ROI, choosing the right Hadoop vendor, resource training and skill upgrades, and change management.

Even from the line of business perspective, with Hadoop as the technology solution serving as the data store, another problem, that of data deluge, starts to take shape. With data sourced into a single platform, people who have sufficient business understanding, intuition and insight to make sense of it are tough to find. But the way Google, Amazon and Capital One have been able to discover the true essence of the data they hold proves beyond doubt that it can be accomplished. Again, this needs a well thought out strategy and persistence specific to the enterprise's overall vision and culture.


The mortgage business is among the oldest lines of business for banks, and hence legacy issues still persist, as the opportunity cost of converting to newer architectures has been high throughout. But given the changing business landscape and the urgency to improve time to market, there is a dire need to improve the overall Information Management technology strategy to meet the needs of the business.

Introducing Big Data products is a major IT overhaul; there are bound to be challenges, and every situation is unique to the respective group. Even within an organization, an IT solution or architecture that works for one group may not fit another. Even within the mortgage business, it may not be possible to have all the transactional data sourced into a single platform because of complex business interdependencies. However, CIOs would have to understand how these challenges interplay and ensure that business and technology teams collaborate to deliver true value to the organization.


