A 685 KB PDF of this article as it appeared in the magazine, complete with images, is available by clicking HERE
I was recently engaged in a discussion regarding topics to be considered for a technical seminar. Naturally, the subject of Big Data (it is only appropriate that Big Data would use big letters, right?) came up. After all, our industry associates Big Data with the collection and processing of remotely sensed information of every ilk. As the discussion progressed, my mind drifted off into thinking about what these things mean from a contextual point of view.
Big Data is one of those terms that has been applied to so many different areas of processing and analysis that it has no specific meaning. It now shows up in marketing material, one would assume as a differentiator: "We know Big Data!" This is akin to the roofing contractor’s "We know roofs!" Expected, but not very helpful!
Our industry has been handling large digital data sets in one form or another since the dawn of computing. A good example is the use of Terrain Contour Matching (TERCOM) in guidance systems. This technology was developed in the late 1970s with a requirement for managing and accessing an on-board elevation database in real time, performing correlations to a preplanned mission and making navigational course corrections in three dimensions. For the technology of the day, this wasn’t Big Data, it was Huge Data! Today the algorithms and all data would comfortably fit and run on an iPhone. My point is that viewing data as small, medium or large is a function of not only the problem but the context in which it must run and, of course, the performance expectations. A quick look at Wikipedia (where else?) illustrates the inadequacy of the definition: "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate." As is typical, this speaks only to the data, ignoring completely the other critical elements of the problem.
I think to qualify as Big Data, the application must be considered in the scope of context, question span, size, response time and lastly (and probably most importantly) budget. Let’s illustrate with an example. Figure 1 depicts an airborne LIDAR data set (USGS Quality Level 2) of Davidson County, Tennessee. These data comprise about 3.9 billion points that occupy (uncompressed LAS) about 110 GB of storage. Let’s have a look at two scenarios.
The first is that of a data producer who wishes to run an automatic ground classification algorithm on the data set. Examining my loose categories, we have:
Context–The data are on an engineering workstation’s local solid-state disk (SSD) and do not have to leave this context during the entire process.
Question Span–If a point is in class G (say a ground point), this knowledge can be used in classifying "near" points. For typical airborne LIDAR data, this "near" means out to about 100 to 500 meters from the known point.
Size–3.9 billion points occupying 110 GB
Response Time–I’d like to get this done in 8 hours
Budget–I do not want to use more than 2 workstations
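As a rough sanity check on the size figure, the storage works out to about 28 bytes per point, which is consistent with an uncompressed LAS point record (the exact record size depends on the LAS point format in use; this is an illustrative calculation, not part of the original project accounting):

```python
# Back-of-the-envelope check on the numbers above (illustrative only;
# "GB" is taken as 10^9 bytes, and LAS record sizes vary by point format).
points = 3.9e9            # 3.9 billion LIDAR returns
storage_bytes = 110e9     # 110 GB of uncompressed LAS
bytes_per_point = storage_bytes / points
print(f"{bytes_per_point:.1f} bytes per point")  # 28.2 bytes per point
```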
To anyone familiar with LIDAR data processing, this is a small to very small problem. The key consideration here (as it often is) is the question span. Since the fact that a particular point is in the ground class only influences decisions regarding other points out to a radius of 500 meters (more or less), I can process this problem in fairly small chunks. Since the context is a local workstation with an SSD, access to those chunks is a non-issue. In fact, I would argue that if my project size expands to the state of Tennessee, I can scale quite easily by adding workstations with no need to change my algorithms. Of course, if I cannot expand my original constraints, the problem can become Big. For example, suppose I need to do ground classification of QL2 data for the entire state of Tennessee in 8 hours on this same workstation!
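The chunking idea can be sketched as a simple tiler: partition the project extent into tiles, and have each tile read an apron of neighboring points equal to the question span, so classification decisions near a tile edge see the same neighborhood they would in a single monolithic run. The function name, tile size and apron width below are illustrative, not any particular product’s workflow:

```python
# A minimal tiling sketch (assumed names and sizes, for illustration only).
# Each tile has "core" bounds, whose points it is responsible for
# classifying, and larger "read" bounds that include an apron equal to
# the question span (500 m here), so edge decisions are unaffected.

def make_tiles(min_x, min_y, max_x, max_y, tile_size, apron):
    """Return (core_bounds, read_bounds) pairs covering the extent."""
    tiles = []
    y = min_y
    while y < max_y:
        x = min_x
        while x < max_x:
            core = (x, y, min(x + tile_size, max_x), min(y + tile_size, max_y))
            read = (max(core[0] - apron, min_x), max(core[1] - apron, min_y),
                    min(core[2] + apron, max_x), min(core[3] + apron, max_y))
            tiles.append((core, read))
            x += tile_size
        y += tile_size
    return tiles

# A 10 km x 10 km block in 2 km tiles with a 500 m apron: 25 independent
# jobs, easily spread across workstations without changing the algorithm.
tiles = make_tiles(0, 0, 10_000, 10_000, 2_000, 500)
print(len(tiles))  # 25
```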
I can, however, make processing with this same data set a Big Data problem by asking a different question. An example of such a question might be "construct a seamless topologically and hydrologically correct drainage model from the LIDAR elevation data." The span of the question has dramatically changed as compared to the simple classification question. For any particular watershed, the entire area must be involved in the modeling of each individual drainage feature. If I want to solve this problem in constrained time, I may have to resort to more exotic hardware. For example, I may need a few terabytes of internal random access memory.
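To see why the drainage question has such a large span, consider flow accumulation on a gridded elevation model: the flow count at an outlet cell depends on every upstream cell, however distant, so no fixed-radius apron suffices. The following is a toy D8-style sketch under assumed names and toy data; real DEM hydrology additionally requires pit filling, flat resolution and much more:

```python
# Toy D8-style flow accumulation (illustrative only). Each cell drains to
# its steepest-descent neighbor; accumulation at a cell counts itself plus
# everything upstream, so the dependency chain can span the whole grid.

def flow_accumulation(elev):
    rows, cols = len(elev), len(elev[0])
    downhill = {}  # cell -> steepest lower neighbor, or None for a sink
    for r in range(rows):
        for c in range(cols):
            best = None
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= nr < rows and 0 <= nc < cols:
                        if elev[nr][nc] < elev[r][c] and (
                                best is None or elev[nr][nc] < elev[best[0]][best[1]]):
                            best = (nr, nc)
            downhill[(r, c)] = best
    # Visit cells from highest to lowest; each passes its count downstream.
    acc = {cell: 1 for cell in downhill}
    for (r, c) in sorted(downhill, key=lambda cell: -elev[cell[0]][cell[1]]):
        d = downhill[(r, c)]
        if d is not None:
            acc[d] += acc[(r, c)]
    return acc

acc = flow_accumulation([[4, 3, 2, 1]])  # a simple slope draining rightward
print(acc[(0, 3)])  # 4: the outlet's value depends on every upstream cell
```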
It has been my experience that question (or "analytic") span and the time budget are the two areas to examine to decide if you are facing a "Big Data" problem. We can think of question span as a "ripple" effect. Network scheduling is a classic example of a Big Data problem (I usually let the issue of simultaneous access be part of my question span criteria). Changing a reservation on a four-leg set of connected flights ripples throughout a complex web of connected schedules for thousands of transactions; it has enormous span.
An example closer to home is that of radiometrically adjusting a large area composed of a mosaic of orthophotos. This is a classic "whack-a-mole" problem. Perfectly matching a spatial subset of a few hundred of the contributing images completely unbalances many other areas. This is a problem that remains unsolved in image processing. We typically count on limited viewing scope and perhaps some on-the-fly adjustments to do a "good enough" job.
In summary, Big Data is not about the physical quantity of data that is being input to an analysis process. If the span is small (and remember, I count the number of inquiries per second in span) and the response time is reasonable, the problem is most likely not a Big Data problem.
Lewis Graham is the President and CTO of GeoCue Corporation. GeoCue is North America’s largest supplier of LIDAR production and workflow tools and consulting services for airborne and mobile laser scanning.