Dealing With Large Scale Data: Theory And Practice


Ian Soboroff’s “Pseudo” Posterous blog entry considers the problem of big data, which he defines as “data that’s bigger than what you can comfortably store and process right now”. He offers background on the “big data” problem and considers various cloud-oriented techniques, as well as readily available (and free) Linux-based solutions for setting up your own cloud:

“What is “big data”, anyway? The generations of yore (like, 20 years ago) measured it in gigabytes. In 1991, TREC brought the information retrieval community (the researchers who studied what would come to be called “search engines”) from working with megabytes to a two gigabyte text corpus. This caused tremendous engineering havoc in the research world; we might assume that industrial groups of the time already worked with data an order of magnitude or more larger. Today’s largest web collection made generally available to researchers (at cost! thanks to NSF), ClueWeb09, is 25 terabytes of raw web pages, approximately equivalent to the top tier of a commercial web search engine. Social media datasets, such as the Twitter archive to be housed at LOC, exist on different scales; the text in such a collection might be hundreds of gigabytes or a few terabytes, but occupy a graph structure of billions of nodes.” [Read More]
