At a recent webinar, "Java in the database-is it really useful? Solving impossible big data challenges," my colleague Mark Sweeney and I took a deep dive into embedded in-database analytics and how it can help you quickly get the information you need out of the vast amounts of data now available.
Benefits of embedded analytics
I started off (1:00) by talking about the benefits of embedded analytics. On average, 60-70 percent of the time is spent on data extraction; with embedded analytics, extraction is faster and more efficient thanks to real-time on-demand calls, no synchronization delays, and models that run on the fly. Embedded analytics also provides a simpler user experience, enables users to analyze larger data sets, and has lower overall opportunity and maintenance costs. Data is also more secure, as it never leaves the database when used as input for analytics.
Taxonomies for database analytics compared
Database analytics can be classified in several ways, as explained in the webinar starting at 8:25.
- Proprietary platform: Third-party software (SAS, MATLAB, etc.)
- Multitier: Client/server or separate analytics database
- Distributed platform: Hadoop or Cassandra-Spark
- Database: SAP HANA, Oracle Advanced Analytics
- In-database JMSL Numerical Libraries: Analytics run on the database's internal JVM, with JMSL classes stored as database objects
Mark compared and analyzed these five options against eight criteria: non-proprietary language, efficient data transfer, distributed/scalable, secure, portable/reusable, algorithm coverage, performance and low cost of setup.
So as you can see (and hear) in the webinar (13:37) there are numerous advantages to adopting a 100 percent Java codebase for in-database analytics. Security is doubly enhanced by performing all analytics in the database. The code is highly portable as the identical Java classes that run in the database will run on any client with any operating system. Plus the modern paradigm of taking the algorithms to the data is elegantly achieved with minimal effort.
What is the biggest challenge you find when it comes to data analytics?
Before diving further into our solution, we polled the webinar audience (13:50) to see what they found to be the biggest challenge either from an analysis, data assembly, or post-processing perspective.
We aren’t surprised that performance and data quality are the top challenges. We have seen that typically 70-80 percent of time is spent on cleaning up data, which underscores the importance of data quality. Getting the answer quickly is always important to organizations.
Advantages of in-database analytics
At 16:28 in the webinar, Mark takes a more in-depth look at the advantages of using the in-database analytics solution. Until now, a single Java solution with all these qualities wasn’t available.
- Faster results.
- Higher accuracy and better data quality: Data cleaning routines are available, and data staging before loading is eliminated. Less network traffic reduces the risk of data corruption, and keeping data and formulas in one place avoids user errors.
- Higher security: Data stays in the database, and user privileges can be fine-tuned so that users can analyze reports but never have access to the underlying data.
- Greater accessibility: Easy to use; developers only need to write SQL/Java interfaces to JMSL routines.
- Trusted technology: Java and SQL are established, widely used technologies. There is no new ecosystem to learn and no need to pick up the latest, greatest programming language. The only requirement is the JVM.
- Minimal risk: Works with many platforms without modification.
The technical solution
Mark begins looking into the finer details of the solution at 22:27 in the webinar, starting with a look under the JMSL hood, the architecture set-up and code samples for installation.
Combining JMSL and SQL is not a paradigm shift. Java and relational databases go well together, leveraging stable and familiar technologies. By using SQL for queries, data definition, and data manipulation, and JMSL for advanced analytics, you draw on the strengths of both to create a dynamic solution.
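To make the "SQL/Java interface" idea concrete, here is a minimal sketch of the kind of static Java method that a database with an internal JVM could expose as a SQL function. It is not taken from the webinar and does not use the JMSL API: the class name `InDbSketch` and the methods `zScore` and `isOutlier` are hypothetical stand-ins for a data cleaning routine, written in plain Java so the example runs anywhere. How the method is registered with the database is vendor-specific and is not shown.

```java
// Hypothetical stand-in for an in-database analytics routine; NOT part of JMSL.
// Declared package-private so the sketch compiles in any source file; a database
// class loader may require the class to be public.
final class InDbSketch {
    private InDbSketch() {}

    /**
     * Standard score of a value given a mean and standard deviation.
     * Returns 0.0 when the spread is zero, negative, NaN, or infinite,
     * so a SQL caller never receives NaN or Infinity from a degenerate column.
     */
    public static double zScore(double value, double mean, double stdDev) {
        if (stdDev <= 0.0 || Double.isNaN(stdDev) || Double.isInfinite(stdDev)) {
            return 0.0;
        }
        return (value - mean) / stdDev;
    }

    /** Flags a value whose absolute z-score exceeds the given threshold. */
    public static boolean isOutlier(double value, double mean, double stdDev, double threshold) {
        return Math.abs(zScore(value, mean, stdDev)) > threshold;
    }

    // The same class runs unchanged on a client JVM, which is the portability
    // argument made for a pure-Java code base.
    public static void main(String[] args) {
        System.out.println(zScore(12.0, 10.0, 2.0));
        System.out.println(isOutlier(25.0, 10.0, 2.0, 3.0));
    }
}
```

The methods are static and take only primitive arguments because scalar values are the simplest to map between SQL and Java types. Under that assumption, a query could pass a column value together with SQL-computed aggregates such as the average and standard deviation, so the rows never leave the database.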
By using JMSL, you get a suite of algorithms with routines for predictive analytics, data mining, regression, forecasting, and data cleaning. JMSL is scalable and can be used in Hadoop MapReduce applications. Now, JMSL makes Java in the database more than useful – it makes it unbeatable.
- Read this technical tutorial for more specific details on embedding analytics into a database using JMSL.
- Learn how JMSL can be used in Hadoop MapReduce applications.
- Find out more about stochastic gradient boosting with JMSL 7.2.
- Visit us at Supercomputing 2015 - booth 1324.