Saturday, August 9, 2014

Enabling Scalable Realtime Healthcare Analytics with Apache Spark


Modern and massively parallel computing platforms can process humongous amounts of data in real time to obtain actionable insights for effective clinical decision making. In this blog, I discuss an emerging Big Data platform called Apache Spark and its application to remote real-time healthcare monitoring using data from medical devices and wearable sensors. The goal is to provide effective remote care for an increasingly aging population as well as public health surveillance.


The Apache Spark Framework


Apache Spark has emerged during the last couple of years as an innovative platform for Big Data and in-memory cluster computing, capable of running programs up to 100x faster than traditional Hadoop MapReduce. Apache Spark is written in Scala, a functional programming language (see my previous post titled Navigating in Scala land). Spark also offers Java and Python APIs. The Scala API allows developers to interact with Spark using very concise and expressive Scala code.
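
To give a feel for that conciseness, here is a minimal sketch of the canonical word count, as it could be typed into spark-shell (where sc is the shell's pre-defined SparkContext; the input file name is made up):

    val counts = sc.textFile("notes.txt")            // hypothetical input file
      .flatMap(line => line.split("\\s+"))           // tokenize each line
      .map(word => (word, 1))                        // pair each word with a count of 1
      .reduceByKey(_ + _)                            // sum the counts per word, in parallel

    counts.take(10).foreach(println)                 // peek at ten of the results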

The Spark stack also includes the following integrated tools:

  • Spark SQL, which allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark through a data abstraction called SchemaRDD. Supported data sources include Parquet files (a columnar storage format for Hadoop), JSON datasets, and data stored in Apache Hive. (A minimal sketch follows this list.)

  • Spark Streaming, which enables fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, or plain TCP sockets. The ingested data can be processed directly with Spark's built-in Machine Learning algorithms.

  • MLlib (Machine Learning Library), which provides practical Machine Learning algorithms including support vector machines (SVM), logistic regression, decision trees, naive Bayes, and k-means clustering.

  • GraphX, which provides graph-parallel computation for graph analytics applications such as social network analysis.
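
As a minimal sketch of the Spark SQL bullet above (assuming Spark 1.1, where SchemaRDD and registerTempTable are available, and running in spark-shell with its pre-defined SparkContext sc; the file name and its layout are made up):

    import org.apache.spark.sql.SQLContext

    case class Patient(id: String, age: Int)         // assumed record layout

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD                // implicit RDD[Patient] -> SchemaRDD

    val patients = sc.textFile("patients.csv")
      .map(_.split(','))
      .map(fields => Patient(fields(0), fields(1).trim.toInt))

    patients.registerTempTable("patients")           // expose the SchemaRDD to SQL
    val seniors = sqlContext.sql("SELECT id FROM patients WHERE age >= 65")
    seniors.collect().foreach(println)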


Apache Spark also plays nicely with other frameworks in the Hadoop ecosystem. For example, it can run standalone, on Hadoop 2's YARN cluster manager, on a Mesos cluster manager, or on Amazon EC2. Spark can also read data from HDFS, HBase, Cassandra, or any other Hadoop data source. Other noteworthy integrations include:

  • SparkR, an R package allowing the use of Spark from R, a very popular open source software environment for statistical computing with more than 5,800 packages, including Machine Learning packages; and

  • H2O-Sparkling, which provides an integration with the H2O platform through in-memory sharing with Tachyon, a memory-centric distributed file system for data sharing across cluster frameworks. This allows Spark applications to leverage advanced distributed Machine Learning algorithms supported by the H2O platform, such as emerging Deep Learning algorithms.

 

Wearable Sensors for Remote Healthcare Monitoring 


Three factors are contributing to the availability of massive amounts of clinical data: the rising adoption of EHRs by providers, thanks in part to the Meaningful Use incentive program; the increasing use of medical devices, including wearable sensors used by patients outside of healthcare facilities; and the growing body of medical knowledge (for example, the medical research literature).

One promising area in Healthcare Informatics where Big Data architectures like the one provided by Apache Spark can make a difference is in applications using data from wearable health monitoring sensors for anomaly detection, care alerting, diagnosis, care planning, and prediction. For example, anomaly detection can be performed at scale using the k-means clustering machine learning algorithm in Spark.
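
Here is a hedged sketch of that idea in spark-shell (the HDFS path and the column layout of the readings are assumptions): cluster historical vital-sign vectors with MLlib's k-means, then measure how far a reading falls from its nearest centroid.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Historical readings: one comma-separated vector of vitals per line (assumed layout).
    val readings = sc.textFile("hdfs:///vitals/history.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    val model = KMeans.train(readings, 5, 20)        // k = 5 clusters, 20 iterations

    // Distance from a reading to its nearest centroid; unusually large values
    // are candidates for anomaly alerts.
    def distanceToCentroid(v: Vector): Double = {
      val center = model.clusterCenters(model.predict(v))
      math.sqrt(v.toArray.zip(center.toArray).map { case (x, c) => (x - c) * (x - c) }.sum)
    }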

These sensors and devices are part of a larger trend called the "Internet of Things". They enable new capabilities such as remote health monitoring for personalized medicine and chronic care management for an increasingly aging population as well as public health surveillance for outbreaks and epidemics.

Wearable sensors can collect vital signs data like weight, temperature, blood pressure (BP), heart rate (HR), blood glucose (BG), respiratory rate (RR), electrocardiogram (ECG), oxygen saturation (SpO2), and photoplethysmography (PPG). Spark Streaming can be used to perform real-time stream processing on sensor data, and the data can be processed and analyzed using the Machine Learning algorithms available in MLlib and the other integrated frameworks like R and H2O. What makes Spark particularly suitable for this type of application is that sensor data meet the Big Data criteria of volume, velocity, and variety.
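
Continuing the k-means sketch from the previous section (reusing model and distanceToCentroid, with a made-up gateway host, port, and alert threshold), live readings could then be scored as they arrive:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.mllib.linalg.Vectors

    val ssc = new StreamingContext(sc, Seconds(5))   // 5-second micro-batches

    val alerts = ssc.socketTextStream("sensor-gateway", 9999)   // hypothetical sensor feed
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .filter(v => distanceToCentroid(v) > 3.0)      // hypothetical alert threshold

    alerts.print()   // a real system would push alerts to a queue or notification service
    ssc.start()
    ssc.awaitTermination()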

Researchers predict that internet use on mobile phones will increase 20-fold in Africa in the next five years. The number of mobile subscriptions in sub-Saharan Africa is expected to reach 635 million by the end of this year. This unprecedented level of connectivity (fueled in part by the historical lack of landline infrastructure) provides opportunities for effective public health surveillance and disease management in the developing world.

Apache Spark is the type of open source computing infrastructure that is needed for distributed, scalable, and real-time healthcare analytics aimed at reducing healthcare costs and improving outcomes.

Sunday, November 10, 2013

Toward Polyglot Programming on the JVM

In my previous post titled Treating Javascript as a first class language, I wrote about how the Java Virtual Machine (JVM) is evolving with new languages and frameworks like Groovy, Grails, Scala, Akka, and the Play Framework. In this post, I report on my experience in learning and evaluating these emerging technologies and their roles in the Java ecosystem.

A KangaRoo on the JVM


On a previous project, I used Spring Roo to jumpstart the software development process. Spring Roo was created by Ben Alex, an Australian engineer who is also the creator of Spring Security. Spring Roo was a big productivity boost and generated a significant amount of code and configuration based on the specification of the domain model. Spring Roo automatically generated the following:

  • The domain entities with support for JPA annotations.
  • Repository and service layers. In addition to JPA, Spring Roo also supports NoSQL persistence for MongoDB based on the Spring Data repository abstraction.
  • A web layer with Spring MVC controllers and JSP views with support for Tiles-based layout, theming, and localization. The JSP views were subsequently replaced with a combination of Thymeleaf (a next-generation server-side HTML5 template engine) and Twitter Bootstrap to support a Responsive Web Design (RWD) approach. Roo also supports GWT and JSF.
  • REST and JSON remoting for all domain types.
  • Basic configuration for Spring Security, Spring Web Flow, Spring Integration, JMS, Email, and Apache Solr.
  • Entity mocking, automatic generation of test data ("Data on Demand"), in-container integration testing, and end-to-end Selenium integration tests.
  • A Maven build file for the project and full integration with Spring STS.
  • Deployment to Cloud Foundry.
Roo also supports other features such as database reverse engineering and Ajax. Another benefit of using Roo is that it helped enforce Spring best practices and other architectural concerns such as proper application layering.

For my future projects, I am looking forward to taking developer productivity and innovation to the next level. I have several criteria in mind:

  • Being able to do more with less. This means being able to write code that is concise, expressive, requires less configuration and boilerplate coding, and is easier to understand and maintain (particularly for difficult concerns like concurrency which is a key factor in scalability).
  • Interoperability with the Java language and being able to run on the JVM, so that I can take advantage of the larger and rich Java ecosystem of tools and frameworks.
  • Lastly, my interest in responsive, massively scalable, and fault-tolerant systems has picked up recently.


Getting Groovy


Maven has been a very powerful build system for several projects that I have worked on. My goal now is to support continuous delivery pipelines as a pattern for achieving high quality software. Large open source projects like Hibernate, Spring, and Android have already moved to Gradle. Gradle builds are written in a Groovy DSL and are more concise than Maven POM files which are based on a more verbose XML syntax. Gradle supports Java, Groovy, and Scala out-of-the box. It also has other benefits like incremental builds, multi-project builds, and plugins for other essential development tools like Eclipse, Jenkins, SonarQube, Ivy, and Artifactory.

Grails is a full-stack framework based on Groovy, leveraging its concise syntax (which includes closures), dynamic language features, metaprogramming, and DSL support. The core principle of Grails is "convention over configuration". Grails also integrates well with existing and popular Java projects like Spring Security, Hibernate, and Sitemesh. Roo generates code at development time and makes use of AOP; Grails, on the other hand, generates code at run-time, allowing the developer to do more with less code. The scaffolding mechanism is very similar in Roo and Grails.

Grails has its own view technology called Groovy Server Pages (GSP) and its own ORM implementation called Grails Object Relational Mapping (GORM) which uses Hibernate under the hood. There is also decent support for REST/JSON and URL routing to controller actions. This makes it easy to use Grails together with Javascript MVC frameworks like AngularJS in creating more responsive user experiences based on the Single Page Application (SPA) architectural pattern.

There are many factors that can influence the decision to use Roo vs. Grails (e.g., the learning curve associated with Groovy and Grails for a traditional Java team). There is also a new high-productivity framework called Spring Boot that is emerging as part of the soon-to-be-released Spring Framework 4.0.


Becoming Reactive


I am also interested in massively scalable and fault-tolerant systems. This is no longer a requirement solely for big internet players like Google, Twitter, Yahoo, and LinkedIn that need to scale to millions of users. These requirements (including response time and uptime) are also essential in mission-critical applications such as healthcare.

The recently published "Reactive Manifesto" makes the case for a new breed of applications called "Reactive Applications". According to the manifesto, the Reactive Application architecture allows developers to build "systems that are event-driven, scalable, resilient, and responsive." That is the premise of the other two prominent languages on the JVM: Scala and Clojure. They are based on a different programming paradigm (than traditional OOP) called Functional Programming that is becoming very popular in the multi-core era.

Twitter uses Scala and has open-sourced some of their internal Scala resources like "Effective Scala" and "Scala School". One interesting framework based on Scala is Akka, a concurrency framework built on the Actor Model.
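
To give a taste of the Actor Model, here is a minimal, hypothetical Akka 2.x example: each actor owns its mutable state and touches it only while processing one message at a time, so no explicit locking is needed.

    import akka.actor.{Actor, ActorSystem, Props}

    // A counter whose state is confined to the actor; messages are the only way in.
    class Counter extends Actor {
      private var count = 0

      def receive = {
        case "increment" => count += 1
        case "get"       => sender() ! count         // reply to the asker asynchronously
      }
    }

    object Main extends App {
      val system = ActorSystem("demo")
      val counter = system.actorOf(Props[Counter], "counter")

      counter ! "increment"                          // fire-and-forget message sends
      counter ! "increment"
      system.shutdown()                              // Akka 2.x-era shutdown
    }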

The Play Framework 2 is a full-stack web application framework based on Scala and currently used by LinkedIn (which has over 225 million registered users worldwide). In addition to its elegant design, Play's unique benefits include:

  • An embedded Java NIO (New I/O) non-blocking server based on JBoss Netty, providing the ability to call collaborating services asynchronously without relying on thread pools to handle I/O. This new breed of servers is called "Evented Servers" (NodeJS is another implementation) as opposed to the old "Threaded Servers". Older frameworks like Spring MVC use a threaded and synchronous approach, which is more difficult to scale.
  • The ability to make changes to the source code and just refresh the browser page to see the changes (this is called hot reload).
  • Type-safe Scala templates (errors are displayed in the browser during development).
  • Integrated support for Akka, which provides (among other benefits) fault tolerance, i.e., the ability to recover quickly from failure.
  • Asynchronous responses (based on the concepts of "Future" and "Promise", also found in AngularJS), caching, iteratees (for processing large streams of data), and support for real-time push-based technologies like WebSockets and Server-Sent Events. (A minimal async action sketch follows this list.)
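
Here is a hedged sketch of such an asynchronous action using Play 2.2's Scala API (the upstream URL is made up): the action returns a Future of a result, so no thread is parked while the web service call is in flight.

    import play.api.libs.concurrent.Execution.Implicits.defaultContext
    import play.api.libs.ws.WS
    import play.api.mvc.{Action, Controller}

    object Quotes extends Controller {

      // Action.async takes a Future[Result]; Netty's event loop stays free meanwhile.
      def latest = Action.async {
        WS.url("http://example.org/api/quotes/latest")   // hypothetical upstream service
          .get()
          .map(response => Ok(response.json))
      }
    }
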
The biggest challenge in moving to Scala is that the shift to Functional Programming can be a significant learning curve for developers with a traditional OOP background in Java. Functional Programming is not new: languages like Lisp and Haskell are functional programming languages, and more recently, XML processing languages like XSLT and XQuery have adopted functional programming ideas.


Bringing Clojure to the JVM


Clojure is a dialect of Lisp and a dynamically typed functional programming language that compiles to JVM bytecode. Clojure supports multithreaded programming and immutable data structures. One interesting application of Clojure is Incanter, a statistical computing and data visualization environment enabling big data analysis on the JVM.

Sunday, April 28, 2013

How I Make Technology Decisions

The open source community has responded to the increasing complexity of software systems by creating many frameworks which are supposed to facilitate the work of developing software. Software developers spend a considerable amount of time researching, learning, and integrating these frameworks to build new software products. Selecting the wrong technology can cost an organization millions of dollars. In this post, I describe my approach to selecting these frameworks. I also discuss the frameworks that have made it to my software development toolbox.

Understanding the Business


The first step is to build a strong understanding of the following:

  • The business goals and challenges of the organization. For example, the healthcare industry is currently shifting to a value-based payment model in an increasingly tightening regulatory environment. Healthcare organizations are looking for a computing infrastructure that supports new demands such as the Accountable Care Organization (ACO) model, patient-centered outcomes, patient engagement, care coordination, quality measures, bundled payments, and Patient-Centered Medical Homes (PCMH).

  • The intended buyers and users of the system and their concerns. For example: What are their pain points? Which devices are they using? And what are their security and privacy concerns?

  • The standards and regulations of the industry.

  • The competitive landscape in the industry. To build a system that is relevant, it is important to have some idea about the following: Who is the competition? What are the current capabilities of their systems? What is on their roadmap? And what are customers saying about their products? This knowledge can help shape a Blue Ocean Strategy.

  • Emerging trends in technologies.

This type of knowledge comes with industry experience and a habit of continuously paying attention to these issues. For example, on a daily basis, I read industry news as well as scientific and technical publications. As a member of the American Medical Informatics Association (AMIA), I receive the latest issue of the Journal of the American Medical Informatics Association (JAMIA), which gives me access to cutting-edge research in medical informatics. I speak at industry conferences when possible, which allows me not only to hone my presentation skills but also to attend all sessions for free or at a discounted price. For the latest in software development, I turn to publications like InfoQ, DZone, and TechCrunch.

To better understand the users and their needs and concerns, I perform early usability testing (using sketches, wireframes, or mockups) to test design ideas and obtain feedback before actual development starts. For generating innovative design ideas, I recommend the following book: Universal Methods of Design: 100 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions by Bruce Hanington and Bella Martin.

 

Architecting the Solution


Armed with a solid understanding of the business and technological landscape as well as the domain, I can start creating a solution architecture. Software development projects can be chaotic. Based on my experience working on many software development projects across industries, I found that Domain Driven Design (DDD) can help foster a disciplined approach to software development. For more on my experience with DDD, see my previous post entitled How Not to Build A Big Ball of Mud, Part 2.

Frameworks evolve over time. So, I make sure that the architecture is framework-agnostic and focused on supporting the domain. This allows me to retrofit the system in the future with new frameworks as they emerge.


 

Due Diligence


Software development is a rapidly evolving field. I keep my eyes on the radar and try not to drink the vendors' Kool-Aid. For example, not all vendors have a good track record in supporting standards, interoperability, and cross-platform solutions.

The ThoughtWorks Technology Radar is an excellent source of information and analysis on emerging trends in software. Its contributors include software thought leaders like Martin Fowler and Rebecca Parsons. I also look at surveys of the developer community to determine the popularity, community size, and usage statistics of competing frameworks and tools. Sites like InfoQ often conduct these types of surveys, such as the recent InfoQ survey on Top JavaScript MVC Frameworks. I also like Matt Raible's Comparing JVM Web Frameworks.

I value the opinion of recognized experts in the field of interest. I read their books and blogs, and watch their presentations. Before formulating my own position, I make sure that I read expert opinions on opposing sides of the argument. For example, in deciding on a pure Java EE vs. Spring Framework approach, I read arguments by experts on both sides (experts like Arun Gupta, Java EE evangelist at Oracle, and Adrian Colyer, CTO at SpringSource).

Finally, consider a peer review of the architecture using a methodology like the Architecture Tradeoff Analysis Method (ATAM). Simply going through the exercise of explaining the architecture to stakeholders and receiving feedback can significantly help in improving it.


Rapid Prototyping 

 

It's generally a good idea to create a rapid prototype to quickly learn and demonstrate the capabilities and value of the framework to the business. This can also generate excitement in the development team, particularly if the framework can enhance the productivity of developers and make their life easier.

 

The Frameworks I've Selected


The Spring Framework

I am a big fan of the Spring Framework. I believe it is really designed to support the needs of developers from a productivity standpoint. In addition to dependency injection (DI), Aspect Oriented Programming (AOP), and Spring MVC, I like the Spring Data repository abstraction for JPA, MongoDB, Neo4J, and Hadoop. Spring supports Polyglot Persistence and Big Data today. I use Spring Roo for rapid application development, which allows me to focus on modeling the domain. I use the Roo scaffolding feature to generate a lot of Spring configuration and Java code for the domain, repository (Roo supports JPA and MongoDB), service, and web layers (Roo supports Spring MVC, JSF, and GWT). Spring also supports unit and integration testing with the recent release of Spring MVC Test.

I use Spring Security, which allows me to use AOP and annotations to secure methods and supports advanced features like Remember Me and regular expressions for URLs. I think that JAAS is too low-level. Spring Security allows me to meet all OWASP Top Ten requirements (see my previous post entitled Application-Level Security in Health IT Systems: A Roadmap).

Spring Social makes it easy to connect a Spring application to social network sites like Facebook, Twitter, and LinkedIn using the OAuth2 protocol. From a tooling standpoint, Spring STS supports many Spring features and I can deploy directly to Cloud Foundry from Spring STS. I look forward to evaluating Grails and the Play Framework which use convention over configuration and are built on Groovy and Scala respectively.

Thymeleaf, Twitter Bootstrap, and jQuery

I use Twitter Bootstrap because it is based on HTML5, CSS3, jQuery, and LESS, and also supports a Responsive Web Design (RWD) approach. The size of the components library and the community is quite impressive.

Thymeleaf is an HTML5 templating engine and a replacement for traditional JSP. It is well integrated with Spring MVC and supports a clear division of labor between back-end and front-end developers. Twitter Bootstrap and Thymeleaf work well together.


AngularJS

For Single Page Applications (SPA), my definitive choice is AngularJS. It provides everything I need, including a clean MVC pattern implementation, directives, view routing, deep linking (for bookmarking), dependency injection, two-way data binding, and BDD-style unit testing with Jasmine. AngularJS has its own dedicated debugging tool called Batarang. There are also several learning resources (including books) on AngularJS.

Check this page comparing the performance of AngularJS vs. KnockoutJS, and this survey of the popularity of Top JavaScript MVC Frameworks.

 

D3.js 

D3.js is my favorite for data visualization in data-intensive applications. It is based on HTML5, SVG, and Javascript. For simple charting and plotting, I use jqPlot, which is based on jQuery. See my previous post entitled Visual Analytics for Clinical Decision Making.

 

I use R for statistical computing, data analysis, and predictive analytics. See my previous post entitled Statistical Computing and Data Mining with R.


Development Tools


My development tools include: Git (Distributed Version Control), Maven or Gradle (build), Jenkins (Continuous Integration), Artifactory (Repository Manager), and Sonar (source code quality management). My testing toolkit includes Mockito, DBUnit, Cucumber JVM, JMeter, and Selenium.

Sunday, January 20, 2013

Application-Level Security in Health IT Systems: A Roadmap

An investigative report titled "Health-care sector vulnerable to hackers, researchers say" published last month in the Washington Post on the state of cybersecurity reveals that:

"...health care is among the most vulnerable industries in the country, in part because it lags behind in addressing known problems."

When it comes to application-level security, the healthcare industry is indeed lagging when compared to other industries that handle consumer sensitive information. The Payment Card Industry Data Security Standard (PCI DSS) is an information security standard for organizations that handle cardholder information. The PCI DSS certification includes requirements for security code reviews, penetration testing, and compliance validation by an external Qualified Security Assessor (QSA).

This week, the Department of Health and Human Services (HHS) issued a final omnibus rule on the HIPAA Privacy, Security, Enforcement, and Breach Notification Rules. The rules impose the following:

  • Increased and tiered civil money penalty structure for security breaches depending on "reasonable diligence", "willful neglect", and "timely correction". The penalty amount varies from $100 to $50,000 per violation, with a maximum penalty of $1.5 million annually for all violations of an identical provision.
  • Expansion of accountability and liability for Business Associates (BAs) and subcontractors.
  • Increased privacy protections under the Genetic Information Nondiscrimination Act (GINA).

Furthermore, the Security and Privacy Tiger Team of the US Office of the National Coordinator (ONC) for health IT released a set of recommendations related to the Meaningful Use (MU) Stage 2 requirements for patient access to health record portals. The need for patient engagement as a prerequisite to a successful transformation of healthcare means that particular attention should be paid to the security needs of consumer-facing web applications.


Security in the Software Development Life Cycle (SDLC)


Unfortunately, security, being a non-functional requirement, is often relegated to an afterthought in the software development life cycle (SDLC). As an afterthought, security is added to the software late or at the end of the development cycle, at which point adding adequate security is difficult and costly, requiring significant rework. In some cases, penetration testing is not performed at all before the application is deployed into production.

This situation can be exacerbated by an interpretation of the Agile methodology that puts the emphasis on the early and frequent demonstrations to the customer of functional (as opposed to non-functional) features of the system under development.

Another issue is that developers and architects often over-rely on 3rd-party security infrastructure, as opposed to (1) developing a Threat Model for the application they are building and (2) creating a security implementation approach to address the Threat Model. 3rd-party security infrastructure can be helpful, but it should serve the security implementation strategy as opposed to driving it. As Bruce Schneier, a well-known cryptographer and computer security specialist, said in an article titled "Computer Security: Will We Ever Learn?":
"Security is a process, not a product. Products provide some protection, but the only way to effectively do business in an insecure world is to put processes in place that recognize the inherent insecurity in the products. The trick is to reduce your risk of exposure regardless of the products or patches."


Understanding Potential Security Vulnerabilities


Application Security is a mature discipline. Developers and architects should build a deep understanding of web application security vulnerabilities as opposed to completely relying on 3rd-party security infrastructure for addressing security concerns. The following are well documented bodies of knowledge on security vulnerabilities:

  1. The OWASP Top 10 Web Application Security Risks (cheat sheets explaining each of those vulnerabilities and how to address them are available on the OWASP web site):

    A1: Injection
    A2: Cross-Site Scripting (XSS)
    A3: Broken Authentication and Session Management
    A4: Insecure Direct Object References
    A5: Cross-Site Request Forgery (CSRF)
    A6: Security Misconfiguration
    A7: Insecure Cryptographic Storage
    A8: Failure to Restrict URL Access
    A9: Insufficient Transport Layer Protection
    A10: Unvalidated Redirects and Forwards.

  2. The CWE/SANS Top 25 Most Dangerous Software Errors, the result of collaboration between the SANS Institute, MITRE, and many top software security experts in the US and Europe.
  3. Programming language-specific vulnerabilities such as those listed in the CERT Oracle Secure Coding Standard for Java.
  4. Well-documented security vulnerabilities introduced by the use of 3rd-party open source application development frameworks.
  5. The National Vulnerability Database
  6. The Common Weakness Enumeration (CWE), which is currently maintained by the MITRE Corporation with support from the National Cyber Security Division (DHS). The CWE web site includes a diagram showing a portion of the CWE hierarchical structure.
  7. Obviously, developers should be on the lookout for new and emerging threats to web application security.

Application Threat Modelling


Armed with a deep understanding of potential vulnerabilities, developers and architects can build a Security Policy (who has what type of access to which resource in the system) and a Threat Model including:

  • An analysis of the attack surface of the application.
  • Identification of potential threats and attackers (both inside and outside the organization and its business associates and subcontractors) and their characteristics, tactics, and motivations. A threat categorization methodology such as STRIDE can be used. STRIDE defines the following threat categories: Spoofing of user identity, Tampering, Repudiation, Information disclosure (privacy breach or data leak), Denial of Service (DoS), and Elevation of privilege.
  • The consequences of those attacks for patients and the healthcare organization serving them.
  • Countermeasures and a risk mitigation strategy. The Application Security Frame (ASF) defines the following categories of countermeasures: Authentication, Authorization, Configuration Management, Data Protection in Storage and Transit, Data Validation/Parameter Validation, Error Handling and Exception Management, User and Session Management, and Auditing and Logging.
  • How the deployment environment will impact privacy and security. NIST and the Cloud Security Alliance (CSA) provide specific security guidance for cloud deployment.
  • New software architectures like the Single Page Application (SPA) approach present new challenges in securing web applications. Single Page Applications are subject to common web application vulnerabilities like Cookie Snooping, Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and JSON Injection. Security is mainly the responsibility of the server, although client-side frameworks like AngularJS also provide some features to enhance the security of Single Page Applications.

 

Developing a Security Implementation Strategy


To address the issues of secure software development in the context of Agile, the Software Assurance Forum for Excellence in Code (SAFECode) published a guide titled "Practical Security Stories and Tasks for Agile Development Environment".


Secure Coding Standards, Static Analysis, and Security Code Review


Many developers are aware of coding conventions (such as the Code Conventions for the Java Programming Language) and the benefits of peer code reviews and static code analysis (using tools like Checkstyle, PMD, FindBugs, and Sonar). These practices should be expanded to cover secure coding as well. The following resources can help:

  • The CERT Oracle Secure Coding Standard for Java.
  • The OWASP Code Review Guide.
  • The "Fundamental Practices for Secure Software Development: A Guide to the Most Effective Secure Development Practices in Use Today", published by the Software Assurance Forum for Excellence in Code (SAFECode).
  • The Payment Card Industry Data Security Standard (PCI DSS) "Information Supplement: Requirement 6.6 Code Reviews and Application Firewalls Clarified" is an example of secure code review requirements in an industry vertical.

There are secure code static analysis tools that can be particularly useful when used in combination with a secure code review process. If possible, the static code analysis should be integrated into the build and continuous integration process to provide specific secure code metrics as well as the evolution of those metrics over time.


Penetration Testing


Finally, the application should go through penetration testing before it is deployed into production. Application-level penetration testing should be done in addition to network-level penetration testing. OWASP provides a detailed Testing Guide and a number of open source and commercial penetration testing tools are available as well.