Building a Productive Domain-Specific Cloud for Big Data Processing and Analytics Service

Abstract

As a disruptive technology, Cloud Computing provides a dynamic, elastic, and promising computing environment for tackling the challenges of big data processing and analytics. Hadoop and MapReduce are widely used open source frameworks in Cloud Computing for storing and processing big data in a scalable fashion. Spark is a more recent parallel computing engine that works together with Hadoop and exceeds MapReduce performance through its in-memory computing and high-level programming features. In this paper, we present the design and implementation of a productive, domain-specific big data analytics cloud platform built on top of Hadoop and Spark. To increase users' productivity, we created a variety of data processing templates that simplify programming. We evaluated the platform's productivity and performance using a few basic but representative data processing algorithms from the petroleum industry. Geophysicists can use the platform to design and implement scalable seismic data processing algorithms without handling the details of data management or the complexity of parallelism. The cloud platform generates a complete data processing application from the user's kernel program and simple configurations, allocates resources, and executes the application in parallel on top of Spark and Hadoop.
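The template idea described in the abstract can be illustrated with a minimal sketch. It is not the platform's actual API: the names `per_trace_template`, `remove_dc`, and the `partitions` configuration key are hypothetical, a trace is simplified to a list of floats, and a local thread pool stands in for the Spark executors. The point is only to show how a user-written kernel plus a small configuration could be expanded into a complete parallel job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Trace = List[float]  # one seismic trace as a list of samples (simplified)


def per_trace_template(kernel: Callable[[Trace], Trace]):
    """Hypothetical template: wraps a per-trace kernel into a full job.

    The returned function handles partitioning, parallel execution, and
    reassembly, so the kernel author only writes the per-trace logic.
    """
    def run(traces: Sequence[Trace], config: Dict[str, int]) -> List[Trace]:
        n_parts = max(1, config.get("partitions", 4))
        # Split the input into roughly equal contiguous partitions.
        size = max(1, -(-len(traces) // n_parts))  # ceiling division
        parts = [traces[i:i + size] for i in range(0, len(traces), size)]

        def process(part):
            return [kernel(t) for t in part]

        # Assumption: a thread pool stands in for the cluster executors.
        with ThreadPoolExecutor(max_workers=n_parts) as pool:
            results = pool.map(process, parts)  # preserves partition order
        return [t for part in results for t in part]

    return run


# User-supplied kernel: remove the mean (DC offset) from a single trace.
def remove_dc(trace: Trace) -> Trace:
    mean = sum(trace) / len(trace)
    return [s - mean for s in trace]


job = per_trace_template(remove_dc)
output = job([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0], [0.0, 2.0]], {"partitions": 2})
print(output)
```

In this sketch the kernel author never touches partitioning or scheduling. Replacing the thread pool with a cluster backend would change only the body of `run`, not the kernel.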

Share and Cite:

Yan, Y., Hanifi, M., Yi, L. and Huang, L. (2015) Building a Productive Domain-Specific Cloud for Big Data Processing and Analytics Service. Journal of Computer and Communications, 3, 107-117. doi: 10.4236/jcc.2015.35014.

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.