Apache Sqoop provides a framework to move data between HDFS and relational databases in parallel using Hadoop’s MapReduce framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources, such as mainframe datasets, into Hadoop. Possible reasons include:
- HDFS is used simply as an archival medium for historical data living on the mainframe. It is cost effective to store data in HDFS.
- Organizations want to move some processing workloads to Hadoop to free up CPU cycles on the mainframe.
The open source contribution provides an implementation of the new Mainframe Import tool. It lets the user specify a directory on the mainframe and move all the files located in that directory to Hadoop in parallel. The files can be stored in any format supported by Sqoop, and the user can control the level of parallelism with the existing Sqoop option for the number of mappers. The command-line options are described in the SQOOP-1272 design overview.
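To make the parallelism concrete, here is a minimal sketch of one way the files in a mainframe directory could be divided among mappers. It assumes a simple round-robin assignment; the class `DatasetSplitSketch` and the method `assign` are hypothetical names used only for illustration, not the logic shipped in SQOOP-1272.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not Sqoop's actual split logic. */
class DatasetSplitSketch {

    /** Distributes dataset names round-robin across at most numMappers groups. */
    static List<List<String>> assign(List<String> datasets, int numMappers) {
        if (numMappers < 1) {
            throw new IllegalArgumentException("numMappers must be >= 1");
        }
        // Never create more groups than there are datasets to read.
        int groups = Math.min(numMappers, datasets.size());
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < datasets.size(); i++) {
            splits.get(i % groups).add(datasets.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up file names standing in for the contents of a mainframe directory.
        List<String> members = List.of("MEMBER1", "MEMBER2", "MEMBER3", "MEMBER4", "MEMBER5");
        List<List<String>> splits = assign(members, 2);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("mapper " + i + " -> " + splits.get(i));
        }
    }
}
```

With five files and two mappers, the first mapper receives three files and the second receives two. Asking for more mappers than there are files simply yields one file per mapper in this sketch.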
Once we decided to make a contribution, I started working on the functional specification with the Syncsort engineering team. In terms of functionality, we wanted to start with small but fundamental steps toward the objective of mainframe connectivity via Sqoop. Our goals were very modest:
- Connect to the mainframe using open source software.
- Support datasets that can be accessed using sequential access methods.
At the top level, the design involves implementing a new Sqoop tool class (MainframeImportTool) and a new connection manager class (MainframeManager). If you dig a little deeper, there are support classes such as a mainframe-specific Mapper implementation (MainframeDatasetImportMapper), an InputFormat implementation (MainframeDatasetInputFormat), an InputSplit implementation (MainframeDatasetInputSplit), a RecordReader implementation (MainframeDatasetRecordReader), and so on.
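The relationship between a split and its record reader can be sketched without any Hadoop dependency. In the illustration below, `DatasetSplit`, `DatasetSource`, and `DatasetReader` are hypothetical stand-ins, not the Sqoop classes named above. Only the method names `nextKeyValue` and `getCurrentValue` are borrowed from Hadoop's RecordReader API. The sketch assumes a split carries a list of dataset names and that the reader walks through them one after another.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, dependency-free sketch; these are not the Sqoop classes themselves. */
class SplitReaderSketch {

    /** Stand-in for an input split: the datasets assigned to one mapper. */
    static final class DatasetSplit {
        final List<String> datasets;

        DatasetSplit(List<String> datasets) {
            this.datasets = List.copyOf(datasets);
        }
    }

    /** Assumed seam: a real source would fetch records from the mainframe. */
    interface DatasetSource {
        Iterator<String> open(String datasetName);
    }

    /** Stand-in for a record reader: reads each dataset in the split sequentially. */
    static final class DatasetReader {
        private final Iterator<String> names;
        private final DatasetSource source;
        private Iterator<String> current = Collections.emptyIterator();
        private String value;

        DatasetReader(DatasetSplit split, DatasetSource source) {
            this.names = split.datasets.iterator();
            this.source = source;
        }

        /** Advances to the next record, moving to the next dataset when one is exhausted. */
        boolean nextKeyValue() {
            while (!current.hasNext()) {
                if (!names.hasNext()) {
                    return false; // every dataset in the split has been read
                }
                current = source.open(names.next());
            }
            value = current.next();
            return true;
        }

        String getCurrentValue() {
            return value;
        }
    }

    public static void main(String[] args) {
        // In-memory records standing in for mainframe datasets.
        Map<String, List<String>> fake = new LinkedHashMap<>();
        fake.put("DS1", List.of("r1", "r2"));
        fake.put("DS2", List.of());
        fake.put("DS3", List.of("r3"));
        DatasetSource source = name -> fake.get(name).iterator();

        DatasetSplit split = new DatasetSplit(List.of("DS1", "DS2", "DS3"));
        DatasetReader reader = new DatasetReader(split, source);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentValue());
        }
    }
}
```

Keeping the mainframe access behind a small interface is a design choice of this sketch. It lets the reader logic run against in-memory data, including the empty dataset in the middle of the split.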
Next came the most difficult part, the actual implementation. Members of Syncsort engineering played a vital role in the detailed design and implementation. Since it is impossible to connect to a real mainframe in the Apache testing environment, we decided to contribute unit tests based on Java mock objects. It was amazing to discover that Sqoop had never used mock objects for testing before! We ran end-to-end tests against a real mainframe in-house to verify the correctness of the implementation.
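As a rough illustration of the mock-object approach, the sketch below tests a small piece of logic against a hand-written mock rather than a live mainframe. Everything in it is hypothetical: the `DatasetLister` interface, the `MockLister` class, the `usableDatasets` method, and the dataset names are assumptions made for this example and do not come from the Sqoop test suite. A mocking library would play the same role as the hand-written class used here to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of testing against a mock instead of a real mainframe. */
class MockListingSketch {

    /** Assumed seam: the only code that would need a live mainframe connection. */
    interface DatasetLister {
        List<String> list(String parentDataset);
    }

    /** Hand-written mock: returns canned names and records each call it receives. */
    static final class MockLister implements DatasetLister {
        final List<String> calls = new ArrayList<>();
        private final List<String> canned;

        MockLister(List<String> canned) {
            this.canned = canned;
        }

        @Override
        public List<String> list(String parentDataset) {
            calls.add(parentDataset);
            return canned;
        }
    }

    /** Code under test: trims the listed names and drops blank entries. */
    static List<String> usableDatasets(DatasetLister lister, String parent) {
        List<String> result = new ArrayList<>();
        for (String name : lister.list(parent)) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MockLister mock = new MockLister(List.of(" JAN ", "", "FEB"));
        List<String> names = usableDatasets(mock, "SALES.HISTORY");

        // The checks a unit test would make, with no mainframe involved.
        if (!names.equals(List.of("JAN", "FEB"))) {
            throw new AssertionError("unexpected names: " + names);
        }
        if (!mock.calls.equals(List.of("SALES.HISTORY"))) {
            throw new AssertionError("unexpected calls: " + mock.calls);
        }
        System.out.println("ok: " + names);
    }
}
```

The mock both supplies canned input and records how it was called, so a test can check the result and the interaction in one place.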
The design and implementation of SQOOP-1272 allow anyone to extend its functionality. Users can extend the MainframeManager class to provide more complex functionality while making use of the current implementation.