E M E R G I C . o r g
Rajesh Jain's Weblog on Emerging Technologies, Enterprises and Markets
TECH TALK: Constructing the Memex

Monday, April 21, 2003
TECH TALK: Constructing the Memex: First Memories

For much of human civilisation, we have had a single memory device – our own brain. The world around us was what we remembered, or at best, what others around us remembered. Part of the reason was that it was, arguably, a much less complex world, but more importantly, it was difficult to connect with others beyond a geographical vicinity. Our memory has served us well, in general. Of course, the problem is that we will never know what we aren’t aware of or cannot remember.

In the past decade, the Internet has extended our world by making accessible a vast quantity of information that was unimaginable earlier in our lives. It started in the early days with bulletin boards and newsgroups, pooling together a collection of documents on a single server, and then the ease of hyperlinking combined with directories and search engines made the physical location of information irrelevant. If it was out there on the Internet, it could, in theory, be found.

For many of us, our first Internet website memories are probably linked to Yahoo. Navigating through its hierarchy of categories or doing a search across it helped us get to what we were looking for. Altavista and Excite started providing search within pages, allowing us to type a word or phrase and know that there were tens of thousands of matching documents.
Google then came along and refined the process to perfection by using its PageRank algorithm, giving us results very much relevant to what we were looking for. In effect, Google became our other memory.

This is an important development. Google’s relevance and consistency in returning search results have ensured that we no longer need to communicate web addresses to each other in order to find specific information. If it is out there, Google will find it for us. Just as My Yahoo helped personalise news, stock quotes and the weather for us and thus became a utility for many, Google has become a utility when it comes to helping us find information that is hidden in the billions of pages of the Web.

Google does a great job in searching the Web. But there are still some things which it does not cover. Our own information space, comprising emails and documents, is still hard to search – the irony being that it is easier to find information out on the Internet than on our own hard disk! We do have tools to search our space, but these are not integrated. There are also a large number of news sites like the New York Times and Wall Street Journal which have restricted access via subscription or registration and which Google does not search.

In its efforts to provide uniformity and consistency, Google has become a mass-market search utility which is a good starting point for becoming “our other memory”. But it is not enough. What is missing is the context that each of us has – this is embedded in the web we browse, the documents we choose to save (or email to ourselves), and the subject-matter experts we know (or would like to know).

First, let us survey the current state of the search industry.

Tomorrow: From Yahoo…

Tech Talk | PermaLink

Tuesday, April 22, 2003
TECH TALK: Constructing the Memex: From Yahoo…

Let us begin by taking a look at how information management has evolved in the past decade thanks to the Internet.

In the Yahoo days, the directory was at the centre of the world. Websites were categorised by human editors into appropriate categories. The taxonomy was at the heart of finding pages. One had to drill down through multiple levels of categories to reach the one that seemed to match what we were interested in. Then, we clicked through to the website and began our search for information there. When we came across good sites, we bookmarked them in our browser, so the next time we did not have to go through the directory once again. Hard to believe, but this was how we navigated the Web maybe 5-6 years ago.

Red Herring’s October 1994 issue had this to say about Yahoo: “Yahoo!'s value is obvious to anyone who's surfed the Web, because it categorizes and creates paths to all the pages that are fit to read. As a vital directory, it's virtually the operating system of the Internet.” It is interesting to read what Yahoo’s founders, Jerry Yang and David Filo, said in an interview then:


The volume of information on the Internet is for practical purposes an infinite problem, because not only is the content itself exploding, but the existing content is changing all the time. If you don't have a committed way of doing it, you can throw any amount of money at it and not solve the problem. Therefore, nobody can be the final solution, and we are just one alternative. The goal, which is fairly modest, is to make the Internet intuitive for the user and to act as a starting point, not an end. It's kind of a discovery experience. Our vision is to provide different ways of viewing that content, whether it's through hierarchy or through a search or through customization.

Ultimately, the best tool is the human brain. Obviously, leveraging our users will be the best form of artificial intelligence… The search part of it, whether visible or invisible, will be a big part of our operations…Our goal is to create an artificial intelligence library and list sites with different degrees of relevance, instead of just alphabetically. So sites that are definitely relevant are listed first, whereas others that may not be as relevant come after. But that's going to be a manual editorial process over time, because I think that no amount of artificial intelligence can establish the inference needed. The more context you have, the better it is over time, but we're not building context for context's sake. If it's one of those categories no one ever visits, why build context for it? The context-sensitive retrieval is very powerful if you can get it to work, but you have to manage people's expectations.


News.com provides a historical perspective:

Conceived by co-founders Jerry Yang and David Filo in a Stanford trailer in 1994, much of Yahoo's popularity was built on the directory's ability to give order and organization to the unruly Web. As legend has it, Yahoo was developed by Yang and Filo as a way to categorize their favorite sumo wrestling Web sites. Even the company name--originally the acronym "Yet Another Hierarchical Officious Oracle"--highlighted its directory roots.

Unlike the other search competitors that emerged in the mid-1990s, such as Excite, Lycos, Infoseek and AltaVista, Yahoo did not develop its technology to crawl through millions of Web sites. Instead, it hired humans to manually search the Web to find, organize and review sites about thousands of topics. Yahoo's editorial team became an emblem of the Internet's rise where legions of college graduates would do the heavy lifting to help Web newbies find what they want.


Into this world came Google.

Tomorrow: …to Google

Wednesday, April 23, 2003
TECH TALK: Constructing the Memex: …to Google

As Yahoo prospered, along came search engines like Altavista, Excite, Lycos and Webcrawler. They had programs which crawled the web, bringing back whole sets of pages. They would index the words in the pages. Now, the granularity went from searching for a site to searching for a page, which helped get us to the content we were looking for much faster. Or at least that was how it should have been. But faced with a result pool of tens of thousands or even millions of pages which had our search terms, it was difficult to know where to begin (or end). And so, even as the Web grew, the search industry stagnated, buckling under its own weight.

Into this world came Google. Google also crawled the web. What was different about Google was the way it ranked the results. It used a technique called PageRank analysis, which ranked each page based on its incoming links and on the importance of the pages that linked to it. If an important page pointed to another page, then it was likely that the page pointed to was also quite authoritative. This is somewhat akin to people giving references to others. Who gives the reference carries a lot of weight.
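The idea of importance flowing along links can be sketched in a few lines of Python. This is only an illustrative toy, not Google's actual implementation: the damping factor of 0.85, the fixed number of iterations, the function name and the four-page "web" are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated redistribution of rank along links.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values for this sketch.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nothing links to "c", so it should rank lowest.
web = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub", "a"],
    "c": ["a"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Page "c" links out but receives no links, so it ends up with only the baseline share, while "a" outranks "b" because it collects references from more pages.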

The turning point in the industry came when Yahoo decided to replace its outsourced search provider Inktomi with Google in June 2000. Inktomi’s stock fell 18% on the news. It is interesting to read a Red Herring article of that time:


Up until Monday, Inktomi was the lead search engine provider to the top four Internet portals. Inktomi still provides primary search engine capabilities to Microsoft’s MSN, America Online and Lycos. And Inktomi has been in a similar situation with one of those companies before.

"I wouldn't be surprised if Yahoo bought Google," says Tomas Isakowitz, an analyst with Janney Montgomery Scott. But despite the acquisition rumors, Mr. Isakowitz thinks that Yahoo switched to Google just to distinguish themselves from the competition.


The story three years hence is very different. Yahoo’s 2003 revenues from all its services are expected to grow nominally to about USD 1.2 billion. Inktomi was bought by Yahoo recently. Privately held Google’s 2003 revenues are expected to be USD 750 million, a 150% increase from 2002.

The passing of the baton from Yahoo to Google over the past few years is symbolic of the evolution of search. This is a point made by Danny Sullivan, editor of Search Engine Watch: "In October 2002, Yahoo made the directory secondary to Google. Suddenly the value of getting listed in Yahoo seemed to disappear. Now, if you're not listed with Yahoo, it may not matter." Elwyn Jenkins of Microdocs makes a similar point: “Yahoo rested on its laurels as a great Internet Directory, not thinking that search would overtake a directory service at some time. However, what searchers seem to want is the immediacy of search rather than the hand documented web that Yahoo gives.”

News.com highlights the transition from directory to search engines as the navigation norm on the Internet.


Once the primary road signs to navigating the Internet, directories have moved to the shoulder. They are being displaced by algorithmic search tools and commercial services that many people now believe do a better job in satisfying Web surfers and advertisers. The transformation is bringing to an end an altruistic era of human editors, who once wielded significant clout in driving traffic to Web sites through recommendations made without regard for commercial considerations.

How did Google come to dominate the search industry?

Tomorrow: Google’s Domination

Thursday, April 24, 2003
TECH TALK: Constructing the Memex: Google’s Domination

Google has barely spent any money on advertising. It has focused on search and on providing the best results, fast and free of clutter. It belongs to a rare breed of companies that put technology above everything else. It launched when the category was seemingly stagnant. Wired (October 2001) takes up the story:


Everyone loves Google, and therein lies its dilemma. The phenomenally popular search engine - it now performs more than 100 million searches a day - achieved much of its early success by being resolutely uncommercial. As other search engines were selling banner ads and turning into portals to make a buck off what had become a commodity service, Google just did search. Its stripped-down interface (only three elements: a text-entry box, a Search button, and an "I'm Feeling Lucky" link that takes you straight to the top-ranking result) trades looks for speed. And it does search brilliantly, using a unique technique that ranks pages by how many other pages link to them.

To understand Google’s success, it is important to first understand its PageRank technology. Its site has an explanation:

PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.


Adds the New York Times:

Google's rise initially flowed from a single software innovation: a formula to retrieve pages ordered by their relevance to a Web surfer's request.

The basic idea, known as "link analysis," was not new. But in 1996, Sergey Brin and Larry Page, then graduate students in computer science at Stanford, began applying it to the global links that connect Web pages. Their idea was to exploit existing human intelligence by tracking the popularity of billions of different Web pages. Two years later, the two men would found Google.

Applied to the explosively growing thicket of electronic pointers that make up the World Wide Web, the approach — simultaneously being explored at an IBM research laboratory in San Jose — created a technical breakthrough.

Google now employs 800 people, yet it handles 200 million searches of the Web each day, a staggering one-third of the estimated daily total. To keep up with that torrent, Google has essentially built a home-brew supercomputer that is distributed across eight data centers.


The result of Google’s innovative algorithms was that the most relevant pages, as perceived by analyzing the link structure of the web, started showing up on top when we did searches. We suddenly found sense in search – even though the results still showed a huge number of matching pages, more often than not the information we were looking for was in the first few pages listed in the results of the Google search. This relevance focus (along with the simplicity of its design) has helped Google occupy centrestage in our lives. Search has come back into fashion. Perhaps too much so.

Tomorrow: Google’s Domination (continued)

Friday, April 25, 2003
TECH TALK: Constructing the Memex: Google’s Domination (Part 2)

To convert its technological superiority into commercial success, Google stuck to its simplicity rule by creating “web advertising that actually works”, according to a Fortune article by David Kirkpatrick: “For all the flash and animation that marketers have put into building Internet ads, the geeks have figured out the real trick: Relevance is more important than style. We're turning to the Internet more and more in the ordinary course of our lives. Whether I'm researching a person or a company, finding the distance between Phoenix and Santa Fe for next week's vacation, seeking a movie review, buying a book, or learning about bird watching, I turn to Google first, then move out. The marketer that can reach me with a relevant message while I'm searching will win.”

Google has, reportedly, over 100,000 advertisers. It takes only a few minutes to set up an advertising program on Google – and it can all be done online through a do-it-yourself mechanism. Wrote the Wall Street Journal recently: “Google's site has become the prime battleground because of its unprecedented power over the Web. Barely four years old, Google has grown largely by word of mouth to become the place where most people start to look for something on the Internet. Three-quarters of all online searches use Google or sites that use Google's search results, according to WebSideStory Inc… Because of its importance, Google can make or break businesses that sell over the Web. It's the new ‘location, location, location’ for online retailers, for whom ranking at the top of a Google search is the Web equivalent of landing a choice corner on Miracle Mile or Fifth Avenue.”

Adds Business Week: “Advertisers love Google. They supply two-thirds of its revenue by purchasing keywords on Google.com and Google's network of affiliates, including America Online. Owning a keyword allows the advertiser to place simple text spots on pages returned for searches containing that keyword. The ads on Google.com are unobtrusive. No Flash player or screen effects are allowed, and ads are confined to a small box on the side of the screen and a handful of slots at the very top. Still, according to Google, the barebones format is effective enough to drive click-through rates several times those of standard Web ads.”

Google has become the eBay of information, in the words of Mary Meeker. For advertisers, it is important to be part of the Google Economy. Wrote the New York Times: “Much as eBay spawned an army of entrepreneurial auctioneers, Google has become enough of a Web gatekeeper that its leads now prop up plenty of commercial sites.”

Next Week: Constructing the Memex (continued)

Monday, April 28, 2003
TECH TALK: Constructing the Memex: Overture

One of the amazing commercial success stories in Internet search so far has been that of Overture, which has focused on paid search placements, and ended 2002 with revenues of USD 668 million. More from a News.com story:


Overture began as the brainchild of Bill Gross, whose start-up investment company, Idealab, incubated one-time Internet highfliers like eToys. He founded the company as GoTo.com in September 1997 and a year later, launched its search advertising service with results appearing on GoTo.com and partners including Netscape Communications.

The company compares its service to the Yellow Pages, the phone book that offers a useful resource even as it serves the marketing goals of its advertisers.
The company claims its goal is to create a win-win situation for customers and Web surfers, enforced by self-interest. Because advertisers are required to pay a fee each time someone clicks on one of their links--a practice known as pay for performance--companies are discouraged from misleading readers.


In a world driven by advertising revenues, the importance of what Overture started and others have followed is highlighted by a recent Business Week story:

Placing ads near search results offers the simple appeal of the Yellow Pages, but with different economics. Search-engine companies such as Overture, Google, Ask Jeeves, and LookSmart charge most advertisers by the click. These ads can be presented among the search results, looking like any of the other Web links that have been rounded up. That's known as paid inclusion. More often, other search-related ads are featured as "sponsored listings" at the top or side of the search results. Advertisers say that search-related ads, whether overt or camouflaged, attract far more interest than regular scattershot Internet ads. Why so? They give people what they're already looking for.

Search advertising is also cheap. At an average of 35 cents a click, paid search undercuts the $1-per-lead average for Yellow Pages ads. The money is split between the portal, which generates the traffic, and its search-advertising provider.

Changes in Internet usage also power this trend. As Web surfers grow more sophisticated, they focus on specific tasks, such as checking mail or finding a recipe. More are using search engines to hurry through their to-do lists. The percentage of Web site visitors who arrived via search engines nearly doubled in the past year, to 13%, says analytics firm WebSideStory. Increasingly, says Jupiter Research analyst Gary Stein, "people are tuned out on banner ads and tuned in to search results."


A May 2002 Fortune article on Google, now Overture’s main competitor, puts the battle between the two for revenues in perspective:

Overture and then Google started selling something called sponsored links, which is a fancy name for a classified ad with an Internet link. Sponsored links cost nothing to produce, load easily through a narrowband connection, and make a more subtle pitch than banner ads. They're also more popular with advertisers, which pay based on how many times people actually click on the ad. With banners, advertisers have to pay based on how many times the ads were displayed, which gives no indication of how the ad is doing. Google took the model a step further, marrying the text-based ads with its search results, something Overture did not have. In other words, if you do a search on Google for, say, Botox, an ad and link for Laserlightrx.com comes up alongside your search results. The upshot was something that Website operators had been trying to accomplish since the beginning of the Internet: meaningful search results accompanied by relevant advertising.
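The difference between paying per click and paying per display is easy to see with a little arithmetic. The sketch below is illustrative only: the 35-cents-a-click figure comes from the Business Week excerpt above, while the banner price (USD 10 per thousand displays), the 100,000 displays and the 500 clicks are made-up numbers, and the function names are my own. Amounts are kept in cents to avoid rounding surprises.

```python
def click_billing_cents(clicks, cents_per_click):
    """Sponsored links: the advertiser pays only when someone clicks."""
    return clicks * cents_per_click


def impression_billing_cents(impressions, cents_per_thousand):
    """Banner ads: the advertiser pays for displays, clicked or not."""
    return impressions * cents_per_thousand // 1000


def effective_cents_per_click(total_cents, clicks):
    """What each click really cost, whatever the billing model."""
    return total_cents / clicks if clicks else float("inf")


# Hypothetical campaign: 100,000 displays that produce 500 clicks.
IMPRESSIONS = 100_000
CLICKS = 500

banner_cost = impression_billing_cents(IMPRESSIONS, cents_per_thousand=1000)
sponsored_cost = click_billing_cents(CLICKS, cents_per_click=35)
banner_per_click = effective_cents_per_click(banner_cost, CLICKS)

print(f"Banner campaign:  USD {banner_cost / 100:.2f}")
print(f"Sponsored links:  USD {sponsored_cost / 100:.2f}")
print(f"Banner cost per click: USD {banner_per_click / 100:.2f}")
```

Under these assumed numbers the banner advertiser pays the same amount even if nobody clicks, whereas the sponsored-link advertiser's bill falls to zero.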

Tomorrow: DMOZ and Microsoft

Tuesday, April 29, 2003
TECH TALK: Constructing the Memex: DMOZ and Microsoft

Another development in the search and directory industry has been one which is diametrically opposite to Overture in terms of process and business model.

DMOZ (also called the Open Directory Project or ODP) is, according to the website, “the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors… It provides the means for the Internet to organize itself. As the Internet grows, so do the number of net-citizens. These citizens can each organize a small portion of the web and present it back to the rest of the population, culling out the bad and useless and keeping only the best content.”

ODP has over 3.8 million sites, 56,429 editors and over 460,000 categories. It “is the most widely distributed data base of Web content classified by humans. Its editorial standards body of net-citizens provide the collective brain behind resource discovery on the Web. The Open Directory powers the core directory services for the Web's largest and most popular search engines and portals, including Netscape Search, AOL Search, Google, Lycos, HotBot, DirectHit, and hundreds of others.”

At present, according to Business Week, “Yahoo boasts the biggest audience, Overture the most advertising, and Google has the leading search technology.” The stage is set for a battle royal, with Microsoft as the dark horse. Microsoft Research has been looking at ways to improve search, according to News.com:


While search tools exist today, a major focus of Microsoft's research will be to allow for a freer flow of associations between data and to expand how searches can take place. Currently, data on computers is largely stored in a hierarchical fashion: A picture or document gets a file name and is stuffed into a folder. To find a document, people largely hunt and peck, a technique that also gets used on search engines.

People, however, don't think that way, Rashid said. To find a vacation shot from Australia using newer tools, for example, a person could ask a computer to pull up pictures that feature an ocean background or family members. A search engine inside an application would then comb through the visual images to get matches.

"The problem with hierarchies is this conceit that all knowledge has a place, but no single thing fits in one space," he said. "They become very cumbersome."
Microsoft's "Sapphire," another lab experiment, exemplifies the difference. The application lists associations with a word in a document. Scroll over a person's e-mail address, and Sapphire will pop up a balloon listing the person's instant message address, work title, recent publications, and lists of e-mail exchanges and meetings you've had with this person.


Tomorrow: A Personal View

Wednesday, April 30, 2003
TECH TALK: Constructing the Memex: A Personal View

For much of the period from 1997 to 1999, I too was a player in the directory and search business. My company, IndiaWorld Communications, had launched India’s first search engine, appropriately titled khoj, in March 1997. (Since November 1999, khoj has been part of Sify, following its acquisition of IndiaWorld.)

The problem I set out to solve in March 1997 was that of India-centricity in search. Yahoo was then the de facto king. It would take a long time to get sites registered in its directory. When one did a search, it was difficult to get India-centric results – Yahoo covered the world, but there were times when one wanted to limit the results to one’s local context. I also realised then that search was one of the key attractors on the Internet. As new people came online, they needed to know which sites to visit. As new sites were launched, their owners needed a place to list them so that surfers could find them. This is what khoj set out to solve.

We launched khoj on the second anniversary of the launch of IndiaWorld. We positioned it as the Indian alternative to Yahoo. Here’s an extract from our press release (sourced from Google Groups):


Finding Indian Web sites just got easier. IndiaWorld, India's largest Web site, has launched khoj, an online directory of over 800 India-related Web sites. khoj catalogues the Web sites into 11 primary categories, and has a multi-level classification system for business, education, entertainment, news and government. "Khoj" is a Hindi word which means "search."

"Think of khoj as the Indian alternative to Yahoo. It will help people worldwide find Indian resources, information and companies much more easily. khoj is the first Asian venture on such a large scale," said Rajesh Jain.


I remember sitting up for about two weeks prior to the launch going through a catalogue of Indian sites and classifying them one-by-one on a slow link to the Internet. In fact to make classification easier, we had written a program to get the top pages of various sites and store them offline in our office so that classification did not necessarily need a real-time connection to the Internet.

It was this crawling of pages that gave us the idea to add a search engine to the khoj directory. This way, people had three ways to find sites: navigate the directory, search the website descriptions in the directory, and get results from the actual cached pages of the Indian sites. This combination is what helped khoj become extremely popular and made it the top-ranked Indian search engine.
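The three routes to a site can be pictured with a small sketch. This is not khoj's actual code, which is not reproduced here; the site list, the cached page text, the example.in addresses and the function names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    url: str
    category: str
    description: str


# Hypothetical directory entries, classified by hand.
SITES = [
    Site("http://news.example.in", "News", "Daily headlines from India"),
    Site("http://cricket.example.in", "Entertainment", "Cricket scores and commentary"),
    Site("http://railways.example.in", "Government", "Train timetables"),
]

# Hypothetical top pages fetched by the crawler and stored offline.
CACHED_PAGES = {
    "http://news.example.in": "Top stories today: monsoon forecast, cricket results, budget",
    "http://cricket.example.in": "Live scorecard and match schedule",
    "http://railways.example.in": "Reservation rules and fares",
}


def browse(category):
    """Route 1: navigate the directory by category."""
    return [site.url for site in SITES if site.category == category]


def search_descriptions(term):
    """Route 2: search the editor-written site descriptions."""
    needle = term.lower()
    return [site.url for site in SITES if needle in site.description.lower()]


def search_pages(term):
    """Route 3: search the text of the cached pages themselves."""
    needle = term.lower()
    return [url for url, text in CACHED_PAGES.items() if needle in text.lower()]


print(browse("News"))
print(search_descriptions("cricket"))
print(search_pages("cricket"))
```

In this made-up data, the description search for "cricket" finds the cricket site, while the page search turns up the news site, whose description never mentions cricket. That is the kind of gap the cached pages filled.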

Tomorrow: What’s Missing

Thursday, May 1, 2003
TECH TALK: Constructing the Memex: What’s Missing

In our daily quest for information, a few years ago in the early days of the Internet, we used to go to Yahoo, navigating through the multiple levels of its directory to reach the site(s) we wanted. As time passed, we started using search engines - first Altavista and Excite, and now Google, which has become our “other memory”. It can, in fact, be thought of as a “knowledge operating system”, according to Elwyn Jenkins of Microdoc News:


In general terms, an operating system is a management system. The operating system that runs your computer manages the demands that each of the different programs you are running at the same time, handles your filing system, hard drives, printers and more. Applying the concept of "operating system" to Google, a Knowledge Operating System (KOS) manages your knowledge activity on the Internet. Google, as a KOS, manages your requests for information, indexes your web pages, responds to applications you may be running on your computer that interface to it via the Google APIs, and integrates knowledge and information from millions of computers into a single large managed database.

Website owners and webmaster who build more static websites do not gain the same degree of operating system-ness of Google, as do bloggers who have a closer relationships with Google. I can write a page today, and have my page indexed and readily available for recall in the Google Database within a day.

This is like a massive disk drive directory -- only there is a time lag between when I saved the file and when it is accessible. As Google becomes more adept at sending Googlebot around to collect new pages, this sense of "saving something to disk" will increase, thus making Google not only indispensable for others to find my pages, but also, a great tool for me to locate my own pages.

Already I use Google as a bookmark manager. No longer do I remember URLs - it is much simpler to remember how to obtain a site's listing by remembering a word to locate that site…I go to all my favorite sites with a single word or two-word combination.

What are the benefits of considering Google an operating system? From a user perspective, it places Google in a position of centrality to my tasks. It is where my knowledge is indexed, it is where I locate new knowledge, and it is the system that underlies my writing in Word, preparation of weblogs, and so on.


Yahoo and Google, in some ways, represent the two extremes. Navigating through directories like Yahoo has its limitations. There is a single global directory (or at best, country-level directories). Also, directories do not take us to the document - they leave us at the site's home page. Most of the directories are also not scalable because of their centralisation and manual update process. In fact, this is what created the opportunity for automatons like Google - the web had simply grown too big.

In relying on Google so extensively now, we are also losing out on something important. Of course, it is reasonably accurate in finding what we are looking for most of the time. Or at least that is what we think, because we have no way to tell. But the results are the same irrespective of who does the search. We do not have an easy way of specifying clusters of documents to search, or a time period. In short, what is missing is a "context" for the search.

Google has centralised search, which is good, because we do need a single place to turn to. But the Web and the people who have built it are much more complex and distributed. Documents and websites have associated people and ideas with them. As search has become narrower and we have focused on Google to provide our results, the wider view of the world which a directory used to offer has been somewhat lost.

What the Web and Google have done is expose us to the amazing richness and depth of information that is out there. This has only made us hungrier for creating a memory which extends our own – and is our own.

Tomorrow: Imagine

Friday, May 2, 2003
TECH TALK: Constructing the Memex: Imagine

We have our own memory and we have Google as our other memory. (We also have the option of the Yahoo and DMOZ directories.) Now imagine if we could bridge the chasm between directories and search engines, making search much more customized to our likes and to the trails that we leave as we surf the Internet, and also taking into account all that we write in emails, blogs or otherwise.

Imagine a system that uses our memory and knowledge as the starting point. We begin by outlining our interest areas - the topics that form the ecosystem of our lives. This is akin to the Yahoo or DMOZ directory of topics – only, much more relevant to us. For example, in my case the main categories of this list would be something like this: Affordable Computing, ICT for Development, Emerging Markets, Enterprise Software, Information Management, New Technologies and India.

If one were to search these topics in Google, the resulting set of links would be helpful only to a small degree and only for the first few times that we did the search (since the results would be nearly the same each time in a short span of time).

These topics are wide, and need to be narrowed down. What is needed is a taxonomy for each of the topics, which helps in further refining our interests. The Google search results, perhaps the Yahoo (or DMOZ) directory, and our own knowledge form the basis of this hierarchy. For example, my outline for Affordable Computing could look like this: Hardware (Thin Clients, Refurbished PCs, PDAs), Software (Linux, Applications, Language Computing), Communications (Ethernet, WiFi, WLL, VSAT).
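One way to picture such an outline is as a small nested structure. The following is only an illustrative sketch in Python: the variable `outline` and the helper `paths` are hypothetical names, and the only real data is the set of topics listed above.

```python
# Hypothetical representation of a personal topic outline.
# The topic names come from the example above; the structure is an assumption.
outline = {
    "Affordable Computing": {
        "Hardware": ["Thin Clients", "Refurbished PCs", "PDAs"],
        "Software": ["Linux", "Applications", "Language Computing"],
        "Communications": ["Ethernet", "WiFi", "WLL", "VSAT"],
    },
}


def paths(node, prefix=()):
    """Yield every leaf topic as a tuple path from the root of the outline."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from paths(child, prefix + (name,))
    else:  # a list of leaf topics
        for leaf in node:
            yield prefix + (leaf,)


# Print each leaf topic with its full path, e.g. for use as a search context.
for path in paths(outline):
    print(" > ".join(path))
```

Each path (for instance, Affordable Computing > Hardware > Thin Clients) could then serve as the "context" attached to a page, an email or a search.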

This hierarchy of topics serves as the basis for our interests. It gives a unique lens and context to the information that we browse on the Web, write in emails and receive as attachments. These topics will evolve as our interests change and as we come across experts who may have done a better job in building out a certain part of the information ecosystem.

This is an evolving information base – built not by a centralised organization, but in a distributed manner by each of us. We all have expertise in specific areas. This was manifested in the early days of the Web through the millions of home pages created on Geocities and Tripod. At that time, the only way to build out these pages was through explicit and time-consuming personal involvement – something few of us were prepared to do. (Basically, the web was good for reading, but not as friendly for writing.)

So, now, imagine if each of us could build out these personal directories – outlines of topics and connections to other directories, people and documents. Much of this would happen automatically as we browsed and marked pages of interest, embellishing them with our comments. When we search, the system would first scan our own world of relevant information rather than the whole world wide web of documents.

In other words, each of us would have a microcosm of the information space, created and updated continuously by what we did. It would ensure that our ideas would have a context, that we would never forget something, and that we could leverage similar work done by millions of others like us. This is the real two-way web – linking not just documents, but people, ideas and information.

Vannevar Bush imagined just such a system – in 1945. He called it the Memex.

Next Week: Constructing the Memex (continued)


Monday, May 5, 2003
TECH TALK: Constructing the Memex: Vannevar Bush...

Write Randall Packer and Ken Jordan in their introduction to Vannevar Bush’s paper in their book “Multimedia: From Wagner to Virtual Reality”:


Vannevar Bush rose to prominence during World War II as chief scientific advisor to Franklin Roosevelt and director of the government’s Office of Scientific Research and Development, where he supervised the research that led to the creation of the atomic bomb and other military technologies. By orchestrating this ambitious collaboration between the military, scientific, and academic communities, Bush is considered the founder of what came to be known as the military-industrial complex. His contribution to the evolution of the computer ranges far and wide: from the invention in 1930 of the Differential Analyzer, one of the first automatic electronic computers, to his concept of the “memex”, the prototypical hypermedia machine.

Adds Adam Brate in his book “Technomanifestos: Visions from the Information Revolutionaries”:

Bush’s immense administrative burden – the daily strain of sorting, allocating, researching, analyzing, synthesizing, crosslinking, and filing – spurred his idea for an invention that would perform this work for people. Bush popularized the idea that machines could solve the problem of information overload.

Bush wondered whether all the sprigs of scientific wisdom, if not somehow preserved, would fall from the tree of knowledge. Information must somehow be connected to be relevant, lest it become forgotten. Knowledge accumulated and stored in massive filing cabinets under lock and key would languish. An idea developed today might not be relevant until some point in the future. What happens, though, if it is forgotten? Application of all new knowledge would require some means of keeping it available, accessible, and relevant.

Bush saw purposeful communication and feedback as a means to fight entropy. Information that is unused and unorganized will disperse into the known. Bush wanted to liberate information from its Byzantine card catalogs, musty libraries, and research facilities. He wanted specialists to draw connections between their work and that of others in different disciplines. He wanted them to forge new alloys in science, mixing engineering with the abstract powers of mathematics, the solutions of chemistry, the vitalism of biology. Scientists weren’t the only ones suffering under the burden of specialization and information overload. So were lawyers, historians, businesspeople, and administrators. The world, this “greatest of apparatus men” proclaimed, is becoming increasingly complex.


So, it was in July 1945, as the Second World War was drawing to a close, that Bush published his ideas in The Atlantic Monthly. The essay was entitled “As We May Think”. In fact, Bush had originally written it in 1939 and waited until the war was nearly over to publish it, perhaps feeling that there would be less interest in his ideas during wartime.

Tomorrow: …and the Memex


Tuesday, May 6, 2003
TECH TALK: Constructing the Memex: …and the Memex

Write Randall Packer and Ken Jordan in their introduction to Vannevar Bush’s paper in their book “Multimedia: From Wagner to Virtual Reality”:


Bush [proposed] a solution to what he considered the paramount challenge of the day: how information would be gathered, stored, and accessed in an increasingly information-saturated world…Although he addresses the subject from the vantage point of the 1940s technology – relying on film processing, microfilm storage, and mechanical retrieval – Bush introduces many of the concepts central to hypermedia. The machine that he proposes, the memex, is a new approach to the storing and sharing of information – a “memory extender” (hence memex) that could organize diverse materials according to an individual’s own personal associations. Conceived as a vast encyclopedia of text, images and sounds that is able to mimic the mind’s capability to link between ideas freely, the memex would effectively remember the leaps of thought someone had while researching a particular topic, and then make that trail of associations available to others. Bush never used the word hyperlink, but in his essay he invented that notion.

Adds Adam Brate in his book “Technomanifestos: Visions from the Information Revolutionaries”:

Bush imagined the memex to be an “enlarged, intimate supplement” of the human memory. The user would store in the computer’s memory magazines, newspapers, photographs, manuscripts, books, and letters. He or she would establish links – “trails” – between implicitly related documents. The memex philosophy:

· We should no longer organize information in classes, subclasses and sub-subclasses.
· Information should be organized by association. When an item is selected, the device should jump to the next item, and then to a third, and so on. These trails are like synapses in the brain.
· Like those of memory, these trails should bifurcate, cross other trails, and become complex.
· If items are used, such trails should be emphasized. If not used, they should fade out.
· The machine should be fast – faster and more intuitive than any existing means of information retrieval.

Bush imagined that the memex would revolutionise not only the organization of information, but its use and form. New encyclopedias and newspapers would contain built-in associative trails. Lawyers would be able to tie one case to the rest in legal history. Scientists and technologists could develop projects by building on the pieces of past projects and finding associations between different disciplines. The problem with specialization would diminish as users found links that transcended time, place and discipline…Users could hop, skip, and jump along trails, finding easy, intuitive ways to draw parallels and patterns. All information could be expressed as pattern and path.

In the best of worlds, the memex would empower the individual as well as the community in which the individual works. Colleagues could share trails…Yet each station would also be unique, incorporating the user’s own trails and personal documents.


Tomorrow: As We May Think


Wednesday, May 7, 2003
TECH TALK: Constructing the Memex: As We May Think

Here are a few extracts from Vannevar Bush’s 1945 essay “As We May Think”:


[The human mind] operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.

[The memex] consists of a desk, and while it can presumably be operated from a distance, it is primarily the piece of furniture at which he works. On the top are slanting translucent screens, on which material can be projected for convenient reading. There is a keyboard, and sets of buttons and levers. Otherwise it looks like an ordinary desk.

In one end is the stored material. The matter of bulk is well taken care of by improved microfilm. Only a small part of the interior of the memex is devoted to storage, the rest to mechanism. Yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so he can be profligate and enter material freely.

Most of the memex contents are purchased on microfilm ready for insertion. Books of all sorts, pictures, current periodicals, newspapers, are thus obtained and dropped into place. Business correspondence takes the same path. And there is provision for direct entry. On the top of the memex is a transparent platen. On this are placed longhand notes, photographs, memoranda, all sorts of things. When one is in place, the depression of a lever causes it to be photographed onto the next blank space in a section of the memex film, dry photography being employed.

[Associative indexing is] the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.

When the user is building a trail, he names it, inserts the name in his code book, and taps it out on his keyboard. Before him are the two items to be joined, projected onto adjacent viewing positions. At the bottom of each there are a number of blank code spaces, and a pointer is set to indicate one of these on each item. The user taps a single key, and the items are permanently joined. In each code space appears the code word. Out of view, but also in the code space, is inserted a set of dots for photocell viewing; and on each item these dots by their positions designate the index number of the other item.

Presumably man's spirit should be elevated if he can better review his shady past and analyze more completely and objectively his present problems. He has built a civilization so complex that he needs to mechanize his records more fully if he is to push his experiment to its logical conclusion and not merely become bogged down part way there by overtaxing his limited memory. His excursions may be more enjoyable if he can reacquire the privilege of forgetting the manifold things he does not need to have immediately at hand, with some assurance that he can find them again if they prove important.


Vannevar Bush wrote his essay in 1945 – before we had the computer, Internet, Web, Yahoo and Google. Even today, we struggle with information overload. The memex could be the panacea in our info-centric world. So, the challenge before us is: can we leverage all the recent developments in technology to construct the memex? People have been thinking about it a lot of late.

Tomorrow: Google, Blogger and Memex


Thursday, May 8, 2003
TECH TALK: Constructing the Memex: Google, Blogger and Memex

Recent interest in the Memex was sparked off by Google’s purchase of Pyra Labs. By itself, the purchase of a small, private company (Pyra had all of five people) would not have garnered much attention. But Pyra is important for an especially crucial section of people on the web. Pyra runs Blogger.com, which hosts more than half a million bloggers. Bloggers are the trailblazers of a new world and an especially vocal lot – leading a writing revolution in a largely read-only web. So it was only natural that speculation mounted on the motives of Google’s purchase.

It was Matt Webb who made the association between Google, Pyra and Memex. This is what he wrote (on his blog):


[Google have] got one-to-one connections. Links. Now they've realised – like Ted Nelson - that the fundamental unit of the web isn't the link, but the trail. And the only place that's online is... weblogs.

There are two levels to the trail:

1 - what you see
2 - what you do
("And what you feel on another track" -- what song is that?)

And the trail is, in its simplest form, organised chronologically. Later it gets more complex. Look to see Google introduce categories based on DMOZ as a next step.

So, the Google Toolbar tracks everything you do on the web, giving you low-level anonymous trails tying the web together. These are analogous to the strings of physics, or the rows and columns of Excel. This is 1, what you see.

Now there's the semantics, the meaning extracted from these, and that's done with the human mind. This is 2, what you do. What you choose to elevate. Now these trails are the basic units.

The combination of the two is startling.

Oh, and you can analyse how people search to add extra data. Stop and start points.

Imagine, searching at Google, and then:
- this trail is highly followed
- do you only want to see what people suggest, or where people went?
- here's a worn track in the interweb. Follow the Google Pixie!
- this trail is uncommon, but made by someone we see (by your weblog) that you value

And next, it's the true Memex. The Google appliance based on microfiche, punchcards and cameras...


Matt Webb mentions the Google Toolbar. This is a small application which anyone can download and install on their own computer. It provides a direct Google search box as part of the browser, and also provides information on the page displayed in the browser. More importantly, it gives Google the ability to see how we surf – the trails that we follow as we click on links to navigate from one page to another.

This ability to access the trails that people follow is also available in another downloadable application – Alexa, which provides information on related sites. (Alexa is now owned by Amazon.) One’s own history as captured by the browser is another such place where this information is stored.

The challenge is to connect up the information from many people. The trails collected by the Google Toolbar and Alexa are available only to the two organisations that own those applications. This is where bloggers come in – they are now putting up on their pages links to pages they like. While not capturing the entire browsing history, blogs are collecting links to articles that the blog author likes (or dislikes). Taken over thousands of people, it now becomes possible to envision a system that can start building associations. This can serve as a starting point for constructing the Memex.

Tomorrow: Google, Blogger and Memex (continued)


Friday, May 9, 2003
TECH TALK: Constructing the Memex: Google, Blogger and Memex (Part 2)

Steven Johnson, the author of “Emergence”, then picked up the Google-Blogger story in an article in Slate:


Google has not yet ventured into managing the information and surfing history of individual users. If Google went in this direction with the Blogger acquisition, it would hearken back to one of the seminal documents of the computing age: Vannevar Bush's “As We May Think” essay, which envisioned a new tool to augment human memory. Bush's imaginary device, called the Memex, would help manage the ever-accelerating explosion of information in the world.

Bush imagined the Memex as a machine of connected documents that from one angle looks a great deal like the modern, Web-enabled computer. But in one crucial respect, Bush's vision differed from today's Web: He placed great importance on the trails created as the user moved through information space, assuming that a record of those trails would be of great use in amplifying the signal of human memory. In many ways, our networked computers have wildly exceeded Bush's vision, but our trail-recording tools are still woefully undernourished.

By acquiring Blogger, Google gets access to the user base, thousands of individuals who are already sold on the premise of storing their Web actions for posterity. How might Google's tools improve the existing Blogger technology?

One feature might work like this: Each time I search for something on Google, a list of URLs is generated. When I click on one of those URLs, the page I've selected is automatically blogged for me: storing for posterity the text and location of the document. If I were an exhibitionist sort, I could choose to publish this list to the world, but more likely I'd keep it as a private archive, visible only to me. It would be a kind of outsourced memory, but one capable of making new connections on its own. Google could easily generate a list of all the pages that linked to the pages in my archive, or notify me if a page I discovered two years ago suddenly grew popular. I'd have the option of searching just my personal archive, instead of the entire Web—or searching the archive's extended family: both the pages I've surfed through, and the Web sites that link to those pages.

This idea of personalized link collections, augmented by software, is straight from the pages of "As We May Think": "Wholly new forms of encyclopedias will appear," Bush predicted, "ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. … There is a new profession of trail blazers, those who find delight in the task of establishing useful trails through the enormous mass of the common record."
Google is the encyclopedia of the connected age, and bloggers are the trailblazers. If Google simply uses Blogger to update its database more rapidly, it won't change the Web experience as we know it in any profound way. But a genuine trailblazing device would be a way of preserving—and widening—the paths that our lives have carved through information space.


A story that began more than half a century ago with Vannevar Bush’s article on the Memex, long before the era of information technology as we know it now, is reaching its climax. The problems related to information overload that Bush outlined are more in evidence now than ever before. And yet, for the first time, there are also solutions in sight. The question is: do we wait for Google to construct the Memex? Or, can we – lots of us – build it in an emergent fashion?

Next Week: Constructing the Memex (continued)


Monday, May 12, 2003
TECH TALK: Constructing the Memex: Information Overload

The problem of information overload has existed for a long time. Vannevar Bush referred to it in his essay on the Memex, yet even today's technology does not offer adequate solutions to the problem. In fact, if anything, the situation has got worse in the past decade.

Compare life a decade ago to now. We have probably seen at least a 10x increase in much of what we are doing (or should be doing). Email, instant messaging and cellphones (with SMS) have made us more reachable, increasing the circle of people who can reach us – anytime, anywhere. Email, especially at the workplace, has made it easier for us to be “in the loop” on many more things – dramatically increasing the number of ongoing threads that we are aware of or involved in.

Decision-making time has reduced since everything needs to be “in real-time” – after all, if one gets access to information in real-time, how can decisions take longer? Thanks to the Web, the information available to us is greater than before. Today, no website is inaccessible, no book is more than a few clicks away, no person is unreachable.

What has not changed is time – the number of hours in a day is a universal, perpetual constant. What has also not changed much is our own cognitive ability – we still need time to process information. What has increased is the number of “context switches” we need to make in a day, as so many different tasks seek to grab our attention; what has not decreased is the time that each such switch takes. Technology may follow Moore’s Law; humans don’t. We are in a world which is not just accelerating; even the rate of acceleration seems to be increasing.

The basic productivity tools we have in front of us are two – one is our human brain, and the other is the personal computer. For much of our lives, it was the first tool that we relied on. The second tool – the PC – may have grown by leaps and bounds in what it can do, but we still use a small fraction of its power. There have been other developments around us – storage costs have fallen, bandwidth has become cheaper and more ubiquitous. Search engines have complemented our memory by being able to find anything “if it is out there”.

Complexity in the world that we see around us will not decrease. What needs to change is how we fit ourselves into it and gain greater control over the activities that we do. The challenge before us, therefore, is to see how we can use recent technology developments to augment our memory, so that we are able to amplify our ability to process information manifold. The building blocks are at hand to bring to life the vision of the Memex that Vannevar Bush outlined more than half a century ago.

Tomorrow: Memex Objectives


Tuesday, May 13, 2003
TECH TALK: Constructing the Memex: Memex Objectives

Let us begin by outlining the objectives of what we want our Memex to do.

Gathering and Storing: Information – and Insight – comes to us through various sources – email, print, web pages, CDs, webcams, sensors, conversations, and our own thinking. The Memex needs to be able to aggregate all of the information coming in, and store it in a manner where it is easily accessible. One issue to consider is that at times, it may not be enough to just store links to articles. For example, in sites like Wall Street Journal and New York Times, articles in the archives are charged extra after a month or so. So, it may make sense to store a copy of the full-text of these articles locally.

Annotating: As the information items are stored, we may want to annotate them with our comments. This is akin to what many bloggers are doing with links to articles that they like – this helps give a context to why that piece of information is useful, even though it may not have an immediate relevance.

Indexing: Vannevar Bush talks about associative indexing – connecting a set of related ideas together by association, rather than classes, much the way our memory works. Either way, an indexing system is important so one can find information faster.

Publishing: We may want to selectively publish the information that we have on the Intranet or the Web, once again similar to what bloggers are doing. By making information public, we are contributing back to the system – that is, we are making the transition from an information consumer to an information producer. By itself, this may make us appear to be a cog in the wheel. In fact, we are more like ants building an anthill. Our local actions help create the global system. This is emergence at work.

Accessing: This is perhaps at the heart of all that we are doing. Most information is not something that can be processed and consumed right away. So, we are storing it with our annotations. The reason is that we want to access it at a later date. This process of information retrieval needs to ensure that we get the right information at the right time. Multiple devices may be available for access – from desktops to PDAs to cellphones. It should also be possible to have layers of information – for example, first, the search is performed within the constellation of documents that one has stored; next, the search is done on documents within two degrees (think of this as a “friend of a friend” approach); and finally, the search could be done on the global database that Google and other search engines have.
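The layered access described above can be sketched roughly as follows. This is a minimal illustration, assuming each layer is just a list of text snippets and that a "match" is a simple substring test; the function name `layered_search` and the sample data are made up for the example.

```python
def layered_search(query, layers):
    """Search each layer in order and return hits from the first layer that has any.

    `layers` is a list of (label, documents) pairs, ordered from most personal
    to most global. Documents are plain strings here for simplicity.
    """
    needle = query.lower()
    for label, documents in layers:
        hits = [doc for doc in documents if needle in doc.lower()]
        if hits:
            return label, hits
    return None, []


# Made-up sample data standing in for the three layers described above.
layers = [
    ("my documents", ["Notes on thin clients for schools"]),
    ("two degrees", ["A friend's post on WiFi mesh networks"]),
    ("global index", ["WiFi standards overview", "History of microfilm"]),
]

print(layered_search("thin clients", layers))
print(layered_search("wifi", layers))
```

A real system would probably merge and rank results across layers rather than stop at the first layer with a hit; stopping early simply keeps the personal-first ordering easy to see.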

Recording Trails: One of the key elements of the Memex as Bush has described it is its ability to record the “leaps” that people make. This is the one weak area in the infrastructure of the Web as we know it (and a point echoed by Steven Johnson in his idea of “personalised link collections”). Two ways to record the trails of websites visited would be to use the local history of the browser, or to get the data from a proxy server. Either way, the trails – links – that we select are the decisions we make which need to feed back into the system.
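As a rough sketch of what recording trails might involve, assume the browser history (or proxy log) has already been reduced to an ordered list of visited URLs. The class name `TrailStore`, its `decay` and `floor` parameters and the sample URLs are all assumptions for illustration; the strengthening and fading follow the memex idea quoted earlier that used trails should be emphasised and unused ones should fade.

```python
from collections import defaultdict


class TrailStore:
    """Weighted links between pages: used trails strengthen, unused ones fade."""

    def __init__(self, decay=0.5, floor=0.1):
        self.decay = decay    # factor applied to every trail on each fade step
        self.floor = floor    # trails weaker than this are dropped
        self.weights = defaultdict(float)  # (from_url, to_url) -> strength

    def record_visits(self, urls):
        """Treat each consecutive pair in a visit history as one followed link."""
        for src, dst in zip(urls, urls[1:]):
            self.weights[(src, dst)] += 1.0

    def fade(self):
        """Weaken all trails and forget those that fall below the floor."""
        for edge in list(self.weights):
            self.weights[edge] *= self.decay
            if self.weights[edge] < self.floor:
                del self.weights[edge]

    def next_pages(self, url):
        """Pages reached from `url`, strongest trail first."""
        out = [(dst, w) for (src, dst), w in self.weights.items() if src == url]
        return sorted(out, key=lambda item: -item[1])


# Hypothetical visit histories from three browsing sessions.
store = TrailStore()
store.record_visits(["a.example", "b.example", "c.example"])
store.record_visits(["a.example", "b.example"])
store.record_visits(["a.example", "d.example"])
print(store.next_pages("a.example"))
```

Calling `fade()` periodically (say, once a week) would let rarely used trails disappear while frequently followed ones persist.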

Tomorrow: Memex Objectives (continued)


Wednesday, May 14, 2003
TECH TALK: Constructing the Memex: Memex Objectives (Part 2)

Learning and Recommending: The Memex needs to learn from all that we do – the types of searches, the links we click on. This learning can make it more efficient in its recommendations. Today, we are seeing Bayesian analysis being used to detect and filter spam from our inbox. We also see recommendations of books at Amazon based on our prior history. This needs to apply much more to the information that we access.
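To make the Bayesian idea concrete, here is a very small naive Bayes text classifier of the kind used in spam filters, applied to deciding whether an item is worth keeping. It is a simplified sketch, not how any particular filter works: the labels "keep" and "skip", the class name and the training sentences are invented, and it assumes at least one training example per label.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny two-class naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"keep": Counter(), "skip": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Count one document and its words under the given label."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score(self, text, label):
        """Log of P(label) times the product of smoothed P(word | label)."""
        vocab = set(self.word_counts["keep"]) | set(self.word_counts["skip"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[label][word] + 1  # add-one smoothing
            log_prob += math.log(count / (total_words + len(vocab)))
        return log_prob

    def classify(self, text):
        """Return whichever label has the higher score for this text."""
        return max(("keep", "skip"), key=lambda label: self.score(text, label))


# Invented training data: items the user kept versus items the user ignored.
model = NaiveBayes()
model.train("cheap linux thin client for schools", "keep")
model.train("wifi rollout in rural india", "keep")
model.train("win a free prize now", "skip")
model.train("free offer click now", "skip")

print(model.classify("linux wifi in schools"))
print(model.classify("free prize offer"))
```

In a Memex, the training signal would come from what we actually do (links clicked, items annotated or ignored) rather than from hand-labelled examples.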

Making Connections: Linking us to people, ideas and information is one of the most important aspects of the Memex. As the sources of information and its quantum increase, we will increasingly rely on experts – specialists whom we trust to make the right judgments. The Memex will help in identifying these experts and connecting us to their ideas. Think of these as “shortcuts” that we are building in the information network.

Alerting: This makes the difference between Push and Pull. Today, we are used to pulling in information from all kinds of sources. What would be good is to have a system pushing relevant information our way, and alerting us to items of relevance for us. The last mile to the user has been bridged with always-on wireless devices like cellphones.

Personalising: The Memex needs to take into account our context, and thus provide a custom view of the information space. In its efforts to maintain consistency, Google has forsaken the individualised view. By being able to remember the trails one took and the information gathered, it should be able to create distinctive views of the information space.

Visualising: The Memex needs to use the new developments in presentation, especially in visualization, to present richer views of the information space. It needs to, like video games, provide an integrated query-and-response space.

In a sense, the Memex is more about assimilation than aggregation, more value-added integration than scanning silos, more amplification than just presentation. It needs to work silently in the background, rather than making us change dramatically the way we do our normal activities. (Of course, some change in the way we interact with our information sources will be inevitable.) It should attempt to augment, not try and replace, our memory. It should be able to widen the information net that we are able to access, and yet, specialise it to just what we need.

It may appear that what is being attempted is the Holy Grail of Information (and Knowledge) Management. It may seem like a Mission Impossible. Far from it! As we shall see, the tools and technologies to build the Memex are now at hand. The interesting thing is that, as individuals, if we do our information-related activities just a little differently, we can be active participants in an emergent system which will help build out our very own Memex.

Tomorrow: Building Blocks: Blogs


Thursday, May 15, 2003
TECH TALK: Constructing the Memex: Building Blocks: Blogs

Let us begin by taking a look at the building blocks for the Memex. Later, we will see how these can be combined together to construct the Memex. The building blocks can be classified under three categories: Blogs, RSS and OPML. A number of technologies can be thought of as coming together in each of these three ecosystems to enable the construction of the Memex. We’ll begin with the Blogs Ecosystem.

Weblogs are personal journals, with links, comment and analysis. They represent the individual’s likes (or dislikes). A blogger is making decisions about what to include on the blog, and where to link to. Links can be to other blogs as part of a “blogroll” or to specific articles from news media sites and blog posts as part of a blog entry. In each of these cases, a blog has a certain structure, and its unit of granularity is the blog post. A blog is created by using a blogging tool or service, like Userland’s Radio, SixApart’s MovableType or Blogger.com of Pyra Labs (now owned by Google). Every blog post has a “permalink” and can thus be referred to by someone else.

Unlike websites which are self-standing and exist on their own, blogs are part of an ecology – think of it as the blogosphere. Blogs point to other blogs. This enables us to think in terms of the “neighbourhood” of a blog – a collection of blogs linked directly or indirectly within two degrees of separation. The analogy here is that we have friends, and these friends in turn have friends. A term used in this context is FOAF – friend of a friend. With blogs, it is therefore possible to scan the blogosphere to search for both friends and FOAF for a given blog. This is what BlogStreet does – here is an example of the neighbourhood of my blog.
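The idea of a neighbourhood within two degrees can be sketched as a breadth-first walk over blogrolls. This is an illustration only and not a description of how BlogStreet computes its neighbourhoods; the function name `neighbourhood` and the sample blogrolls are invented.

```python
from collections import deque


def neighbourhood(blogrolls, start, max_degrees=2):
    """Return {blog: degrees of separation} for blogs within `max_degrees` links."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        blog = queue.popleft()
        if distance[blog] == max_degrees:
            continue  # do not expand beyond the limit
        for linked in blogrolls.get(blog, []):
            if linked not in distance:
                distance[linked] = distance[blog] + 1
                queue.append(linked)
    del distance[start]  # the starting blog is not part of its own neighbourhood
    return distance


# Made-up blogrolls: each blog maps to the blogs it links to.
blogrolls = {
    "me": ["alice", "bob"],
    "alice": ["carol", "bob"],
    "bob": ["dave"],
    "carol": ["erin"],
}
print(neighbourhood(blogrolls, "me"))
```

Here "alice" and "bob" are friends (one degree), "carol" and "dave" are friends of friends (two degrees), and "erin" falls outside the neighbourhood.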

Why is this important? Just as we are more likely to listen to or turn to friends for advice and recommendations, the blog neighbourhood can be an important consideration when it comes to searching and finding appropriate content. It is a set of people we are more likely to “trust” than any other.

What is now required from each of us is to create a personal blog. For a start, it could just provide links to articles that we read and like, along with a blogroll. As a next step, it could fetch the articles that we like from some sources we know will not be available later. For example, stories from the New York Times or the Wall Street Journal become available for additional fees (even if one is a subscriber) after a specified time (7 or 30 days). What the personal blog tool could do is fetch the stories and archive them locally, so that they are always available, by posting them to the personal blog using the MetaWeblog API.
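As a hedged sketch of that last step, the snippet below builds the XML-RPC request for the MetaWeblog API's `metaWeblog.newPost` method using Python's standard `xmlrpc.client` module. The helper name `build_new_post_request`, the credentials, the title and the URL are placeholders; the request is only constructed and printed, not sent.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, body, link):
    """Return the XML-RPC request body for a metaWeblog.newPost call."""
    post = {
        "title": title,
        "description": body,  # MetaWeblog uses "description" for the post body
        "link": link,
    }
    publish = True  # publish immediately rather than saving as a draft
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


request_xml = build_new_post_request(
    "1", "user", "secret",  # placeholder blog id and credentials
    "Archived: example article",
    "Full text of the article would go here.",
    "http://example.com/article",
)
print(request_xml)
```

To actually post, one would call the same method through `xmlrpc.client.ServerProxy(endpoint)`, where `endpoint` is the XML-RPC URL of the blogging tool in use; that URL varies by tool and is not specified here.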

The potential of blogs was highlighted by Steven Johnson in an article in Salon about a year ago: “The true revolution promised by the rise of bloggerdom is not about journalism. It's about information management. The bloggers have the potential to do something far more original than offer up packaged opinions on the news of the day; they can actually help organize the Web in ways tailored to your minute-by-minute needs. Often dismissed as self-obsessed `vanity sites', the bloggers actually have an important collective role to play on the Web. But they're not challengers to the throne of the New York Times and the Wall Street Journal. They're challengers to the throne of Google.”

Tomorrow: Building Blocks: Blogs (continued)

Tech Talk | PermaLink

Friday, May 16, 2003
TECH TALK: Constructing the Memex: Building Blocks: Blogs (Part 2)

Steven Johnson developed the idea of using blogs for information management further in his Salon article of May 2002:


The beautiful thing about most information captured by the bloggers is that it has an extensive shelf life. The problem is that it's being featured on a rotating shelf…I don't always want to know what über-blogger Jason Kottke happens to be thinking about this morning -- I want to know what he thinks about the page I'm currently reading, or the paragraph I just wrote. If I stumble across a page 10 weeks after Jason wrote up a description of it on Kottke.org, his description is just as valuable to me as it was 10 weeks before -- in fact, it's probably more valuable, because I've come across the page on my own personal journey. But as it stands now, to figure out if Jason's referenced the page I have to copy the URL and paste it into the search engine on Kottke.org. If I've got 20 or 30 bloggers that I'm following, I've got to paste that URL into 20 separate input fields.

But the bloggers needn't be anchored to the headline-news mentality. Think of them as less like a newspaper substitute and more a kind of guardian angel, hovering over your shoulder as you surf. Punch up a URL and if Jason, or Andrew Sullivan, or Sopsy has an opinion about that page, you see their comments in a floating window alongside your main browser window. It's a simple enough trick: Sites like Blogdex are already tracking blog-borne references to different URLs. All your browser would have to do is send an additional request to a database of blogged URLs anytime you pulled up a page: If there's a match -- if one of the bloggers you're following has referenced the URL -- their comments get sent back to your machine and appear in the floating palette.

You define a few "guardian" Bloggers, perhaps by checking a box when you visit their site. You also instruct your software to watch the activity on sites maintained by "friends" of those key bloggers. You tell the software that you want a medium level of intrusiveness: In other words, you want the system to point out useful information to you, but you don't want it constantly bombarding you with data at every turn. And then you start using your computer as you normally do: surfing, writing e-mail, drafting Word documents.


The first steps in this direction are already being taken by blog analysis sites like Blogdex, Daypop, Feedster, Technorati and our own BlogStreet. No single one of them has the answer, but it is possible, for example, to use BlogStreet’s neighbourhood analysis tool to limit the search space on Google to a specified list of blogs. Or better still, imagine if one of these sites started building up a database of the posts being made by bloggers. That could provide a central repository of blog posts to be searched with a neighbourhood as filter. This could even be extended to a peer-to-peer approach if the various blog tools offered a web service via XML and SOAP that searched for a specific word or phrase and returned the results in a form which could be aggregated. To make this even more effective, one could even set up an RSS feed on specific search terms for a blog.
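The idea of searching a repository of posts with a neighbourhood as filter can be sketched in a few lines. This assumes a hypothetical in-memory list of posts, each with `blog`, `url` and `text` fields; a real service would sit on a proper index, and plain substring matching stands in for whatever search the site actually offers.

```python
def search_neighbourhood(posts, phrase, neighbourhood):
    """Return posts from blogs in the neighbourhood whose text mentions phrase."""
    wanted = phrase.lower()
    allowed = set(neighbourhood)
    return [
        post for post in posts
        if post["blog"] in allowed and wanted in post["text"].lower()
    ]

# Hypothetical repository of blog posts.
posts = [
    {"blog": "friend-1", "url": "http://example.org/a", "text": "Notes on the Memex and RSS"},
    {"blog": "stranger", "url": "http://example.org/b", "text": "Memex history"},
    {"blog": "foaf-1", "url": "http://example.org/c", "text": "Gardening tips"},
]

hits = search_neighbourhood(posts, "memex", ["friend-1", "foaf-1"])
print([post["url"] for post in hits])
```

Only the post from a blog inside the neighbourhood is returned; the matching post from the unknown blog is filtered out, which is the whole point of using the neighbourhood as a trust filter.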

The point is that the blogging ecosystem is now ripe for harvesting. Over the past two years, there has been a critical mass of bloggers who have mapped out the information space. Even as they have done their work (and continue to do it) individually, the tools and technologies are now available to provide each of us personalized maps and paths to navigate the world of information.

Next Week: Constructing the Memex (continued)

Monday, May 19, 2003
TECH TALK: Constructing the Memex: Building Blocks: RSS

It is all too easy to say that we should all become bloggers, setting up pages with links of stories that we like and which are relevant to our interests. How do we enable this without getting totally consumed by the time it would take to do this? This is where the second ecosystem comes in: this one is built around RSS.

RSS (Rich Site Summary) is an XML file format, with a standardised way to represent a story, so that a software program can easily identify the title, description (or contents) and a link to it. The newer version of RSS also enables categories to be specified. Here is a sample RSS feed for my weblog. What you will see is a lot of tags – to make greater sense of it, do a “View Source” in your browser on the page, and then compare with the newest entries posted on the blog.

An RSS feed serves as the input to a special program called the RSS (or News) Aggregator (or Reader), which parses the feed into its constituent items for display. We can now navigate through these items without having to actually go visit the website to find out “what’s new” on the site or blog. The News Reader works on the publish-subscribe principle — content providers publish RSS feeds for their content, which can be subscribed to by users. There are various News Readers (some free, some paid for) which are available.
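For readers who want to see what "parsing the feed into its constituent items" looks like, here is a small sketch using Python's standard XML library. The feed below is a made-up RSS 2.0 document, not the actual feed of this weblog, and `parse_items` is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with two items.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://example.org/</link>
    <description>A made-up feed</description>
    <item>
      <title>First post</title>
      <link>http://example.org/2003/05/first</link>
      <description>Hello, world</description>
      <category>Tech</category>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2003/05/second</link>
      <description>More thoughts</description>
    </item>
  </channel>
</rss>"""

def parse_items(feed_xml):
    """Split an RSS feed into a list of dicts, one per item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "categories": [c.text for c in item.findall("category")],
        })
    return items

for entry in parse_items(SAMPLE_FEED):
    print(entry["title"], "->", entry["link"])
```

A News Reader does essentially this for every subscribed feed, and then remembers which items it has already shown.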

[A more elaborate discussion on RSS and its wider implications is available in one of my earlier Tech Talk series: RSS, Blogs and Beyond.]

This is where it gets interesting. Imagine if, instead of setting up a separate program as a News Reader, the email client itself could work as one. It already has a three-pane view, with the left panel showing the folders, the right top showing the list of items, and the bottom right showing the item details. Each of the items will have a “permalink” which the user can click on to get to the site for additional details on the story.

A centralized service – on the local network or a hosted service on the Internet – can offer to fetch RSS feeds from subscribed sites and create emails out of the incoming feeds. There is one email for every item. These items are then sent into the user’s mailbox. This is ideally a separate mail account – think of it as an RSS IMAP Mailbox. The user can then set up filters, if required, to manage the incoming feeds.
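The conversion of a feed item into an email can be sketched with the standard `email` package. The sender address, the `X-RSS-Feed` header and the item fields are assumptions made for this example; delivery into the IMAP mailbox (via SMTP or an IMAP append) is left out.

```python
from email.message import EmailMessage

def item_to_email(item, feed_title, mailbox):
    """Turn one feed item into one email message addressed to the RSS mailbox."""
    msg = EmailMessage()
    msg["From"] = "rss-gateway@example.org"   # placeholder sender address
    msg["To"] = mailbox
    msg["Subject"] = item["title"]
    msg["X-RSS-Feed"] = feed_title            # made-up header, handy for filters
    msg.set_content(item["description"] + "\n\nPermalink: " + item["link"])
    return msg

# A hypothetical item, as an aggregator might have parsed it from a feed.
item = {
    "title": "First post",
    "link": "http://example.org/2003/05/first",
    "description": "Hello, world",
}

message = item_to_email(item, "Example Weblog", "me-rss@example.org")
print(message["Subject"])
print(message.get_content())
```

Because the feed name travels in a header, the user's mail filters can sort items into one folder per feed without any extra software.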

The use of the email client itself as the News Reader eliminates the need for the use of a separate program that needs to be downloaded and installed. Everyone knows how to use an email client, so no additional learning is required. This will make the use of RSS much more mass-market than it currently is. Of course, the drawback is that now, instead of the user’s computer working as the RSS Aggregator, a centralized service needs to do the same.

Tomorrow: Building Blocks: RSS (continued)

Tuesday, May 20, 2003
TECH TALK: Constructing the Memex: Building Blocks: RSS (Part 2)

One additional utility will bridge the world of RSS and blogs. What is needed is the creation of a special folder in the RSS IMAP mail account – let us call it “blog”. Any mail moved into this should get posted on to the user’s blog. What this does is to make the act of posting to a blog as easy as drag-and-drop. This simple enhancement is an important one because it bridges two worlds – the world of blogs and RSS, and the world of emails. A user can also now post personal emails to the blog in the same way. By doing so, an email gets a “permalink” which can be used for cross-referencing at a later stage.

One issue to be tackled is that of availability of RSS feeds. Many sites still do not have RSS feeds – in fact, some of the news sites do not even have “permalinks” to refer to stories. This needs to be addressed. While there are sites like NewsIsFree and Syndic8 which offer RSS feeds for some of the news sites, one needs to go further. There should be a “nano-blog” for each of the popular news sites. This blog should list out the stories, give each story a permalink, and then generate an RSS feed for others to subscribe to.

What this nano-blog does is also address another drawback: it is difficult in most news sites to see stories chronologically or by issue. So, while a current issue or day’s newspaper may have its Table of Contents (ToC), it is difficult to get to the ToC for an older issue. Thus, creating a blog-like format for a news site can help in navigating the archives as well as provide permalinks for linking to the stories.
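A nano-blog of this kind boils down to two things: assigning each story a stable permalink and emitting the list as RSS, newest first. The sketch below assumes a hypothetical permalink scheme of site URL plus date plus slug, and story dictionaries with `title`, `date` and `slug` fields; neither comes from any existing service.

```python
import xml.etree.ElementTree as ET

def build_feed(site_title, site_url, stories):
    """Build an RSS 2.0 document for a news site, newest story first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Nano-blog of " + site_title
    for story in sorted(stories, key=lambda s: s["date"], reverse=True):
        # Hypothetical permalink scheme: site URL / date / slug.
        permalink = f"{site_url}/{story['date']}/{story['slug']}"
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = story["title"]
        ET.SubElement(item, "link").text = permalink
        ET.SubElement(item, "guid").text = permalink
    return ET.tostring(rss, encoding="unicode")

# Made-up stories from a made-up news site.
stories = [
    {"title": "Older story", "date": "2003-05-19", "slug": "older-story"},
    {"title": "Newer story", "date": "2003-05-20", "slug": "newer-story"},
]

print(build_feed("Example News", "http://news.example.org", stories))
```

Since the dates are part of the permalinks, the same data also gives a chronological archive, which addresses the Table of Contents problem described above.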

As we shall see soon, these individual actions taken across tens of thousands (or even millions) of individuals can help in ferreting out useful content based on what we and our friends are reading.

Once the RSS Ecosystem is in place – from subscription to a feed, to receiving it in one’s mail client, to being able to post an item to a blog, which can in turn generate an RSS feed for redistribution – there is no limitation on what type of feeds can be handled. The calendar we use as part of our desktop could put out an RSS feed. So could various enterprise programs. Search engines could offer their results as RSS feeds. Because RSS is a standard and it is fairly easy to create, content publishers and enterprise software programs could use it to distribute news, information and events. Interested users can subscribe to these feeds and have the updates pushed to them on the desktop (or for that matter, to an IM client or a cellphone or PDA).

Wednesday, May 21, 2003
TECH TALK: Constructing the Memex: OPML

OPML (Outline Processor Markup Language) is the third element in the foundation of the Memex. What OPML enables is the creation of outlines, which in turn enables the creation of personal directories. Why do we need personal directories? Isn’t Yahoo or DMOZ good enough? The short answer is no. Here’s the long answer.

There are two ways to navigate the Web today: search via a search engine like Google, or navigating a hierarchical directory like Yahoo. Both are impersonal. Both lack context. I had discussed this in a blog post entitled: “The Missing Link In Information Management”:


Let us consider Google Search. Of course, it is reasonably accurate in what we are looking for most of the time. Or at least that is what we think because we have no way to tell. But the results are the same irrespective of who does the search. We do not have an easy way of specifying clusters of documents to search, or a time period. In short, what is missing is a "context" for the search.

Navigating through directories like Yahoo also has its problems. There is a single global directory (or at best, country-level directories). Also, they do not take us to the document - they will leave us at the site's home page. Most of the directories are also not scalable because of their centralisation and manual updation process. In fact, this is what created the opportunity for automatons like Google - the web had simply grown too big.

Into this Search Engine and Directory world have come bloggers. Think of them as a collection of ants, each of which makes its local decisions, and yet as a collective creates structures which no single ant would have been able to "command and control". In other words, bloggers are creating an emergent system with their individual decisions of what to link to (and what not to link to). Bloggers are putting their own brains, their own knowledge at the centre and creating a nano-version of the Internet around their area of expertise.

There is a problem, though. What we see as a blog is actually a "what's new" page - this is because it is organised reverse chronologically (by time, the newest entries on top). Yes, many blogs have categories, which is good, but even there, the entries are by date and time of post. What's missing - even though it's there embedded within the blog - is the overall context and perspective that is the blogger's expertise. What's missing is an Outline, or in other words, a blogger's directory of the posts which are there.

Why is this important? When I go to a blog, I am not going there just for finding new links and comments on specific areas. I'd like to get a wider and deeper perspective, because I trust the blogger's expertise. We like talking to experts because they help in putting things in context, like a good book. There is an introduction, there is a set of key ideas, each of which can be explored further, and there is also an overview of the latest developments. Today, most blogs and bloggers only make visible the last of these - the most recent ideas and news. As a reader, I want more.

As a reader, I want every blog to have an outline, a directory of the posts which provide the context. So, if there is an event or news item, I can now place it in the wider view of things, by just seeing where it is in the directory of items. The blogger has this mental map, it is just not visible on blogs today. The result is that it can make blogs and bloggers' viewpoints hard to understand quickly - one is just seeing a snapshot. It is like reading a page of a book at random, without having the benefit of a Table of Contents.


Tomorrow: OPML (continued)

Thursday, May 22, 2003
TECH TALK: Constructing the Memex: Building Blocks: OPML (Part 2)

A more extended discussion on OPML comes from Dave Winer [1 2]:


Imagine a new format, like HTML, but for hierarchies. It's called OPML, an XML-based format I designed in Y2K. You edit OPML files with an outliner. Several of them support the format now, including the one that UserLand includes in Radio. Eventually, I believe (and hope) all outliners and many other kinds of programs, ones that create and understand hierarchies, will support the format.

You can save OPML files to the Web, just like HTML files, and browse them in lots of interesting ways…Another thing outlines are good for is authoring directories, like Yahoo and DMOZ. Everyone can edit their own outlines.
Millions of people can [create directories]. It's not hard. That's key, because what we want to do is enable people who have deep knowledge of important areas to gather resources, organize them, and reorganize, as the world changes.

OPML directories can link to other directories, they can even (theoretically) link into other directories [this is called transclusion]. When this happens, the linked-to directory is "included" in the other. At the bottom of the page, the author's name is different, and the suggest-a-link feature sends an email to the included directory's author, but most readers won't notice. It's almost seamless.

Now, instead of having two or three all-encompassing directories, anyone with an outliner and some server space can compete to be the authority on any subject.

There's no single root of the Web, so why should directories (like Yahoo, DMOZ, Looksmart) have single roots? And therein lies the problem with directories, and why we're not effectively cataloging the knowledge of our species on the Internet.

A case in point. Last week I pointed to a great directory of RSS aggregators. So why not also have it available in a format that allows it to be included in other directories? I should be able to include it in the directory I keep for RSS developers. Why should I have to reinvent the wheel? Would he want me to? And maybe it fits into a directory of tools that are useful for librarians, alongside book inventory software; or in a directory for lawyers, alongside legal databases. See the point? There is no single address for a directory, every directory is a sub-directory of something, yet all the directories we build on the Internet try to put everything in exactly one place, which leads to some really ludicrous placements. My Windows software is categorized under Mac software because we were only available on Mac when it was first categorized. This one-category-for-all-information approach is a vestige of paper catalogs, not a limit of computer-managed catalogs.

I'm burning to get this idea broadly implemented. When we do, the Web will grow by another order of magnitude.

The challenge: Put all that we know on the Internet and give people the tools to present it in a myriad of ways. Let a thousand flowers bloom. No one owns the keys to knowledge. That's Jeffersonian software. The Web, of course, was modeled after the printed page, with all its limits. This new Web is modeled after the mind of man.


Dave Winer also has written about how to implement an OPML Directory Browser.
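To give a feel for how one directory can be "included" in another, here is a small sketch that flattens an OPML outline and pulls in linked outlines in place. It is not Dave Winer's implementation. It assumes that an included directory is marked with `type="include"` and a `url` attribute, that outlines are fetched by a caller-supplied `fetch` function, and the two sample documents are invented.

```python
import xml.etree.ElementTree as ET

def expand(opml_xml, fetch, level=0, seen=None):
    """Flatten an OPML document into (depth, text) rows, inlining included outlines."""
    seen = set() if seen is None else seen
    rows = []
    body = ET.fromstring(opml_xml).find("body")
    if body is None:
        return rows

    def walk(node, depth):
        for child in node.findall("outline"):
            rows.append((depth, child.get("text", "")))
            url = child.get("url")
            # Assumption: type="include" marks a directory to be transcluded.
            if child.get("type") == "include" and url and url not in seen:
                seen.add(url)  # guards against directories that include each other
                rows.extend(expand(fetch(url), fetch, depth + 1, seen))
            walk(child, depth + 1)

    walk(body, level)
    return rows

# Two made-up directories: the first includes the second.
MAIN = """<opml version="1.1"><head><title>RSS developers</title></head><body>
<outline text="Specifications"><outline text="RSS 2.0"/></outline>
<outline text="Aggregators" type="include" url="http://example.org/aggregators.opml"/>
</body></opml>"""

AGGREGATORS = """<opml version="1.1"><head><title>Aggregators</title></head><body>
<outline text="Reader A"/><outline text="Reader B"/>
</body></opml>"""

def fetch(url):
    """Stand-in for an HTTP fetch; returns canned OPML for the example URL."""
    return {"http://example.org/aggregators.opml": AGGREGATORS}[url]

for depth, text in expand(MAIN, fetch):
    print("  " * depth + text)
```

The reader of the combined directory sees one continuous outline, even though the "Aggregators" branch is maintained by someone else at a different address.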

Taken together, the ecosystems built around Blogs, RSS and OPML help solve the problem of organising unstructured content.

Tomorrow: Unstructured Content

Friday, May 23, 2003
TECH TALK: Constructing the Memex: Unstructured Content

The problem of information overload has been with us for a long time, and is getting worse. Ray Ozzie puts the situation in context:


Just as the first generation of personal computers was mostly about personal productivity, the first generation of the Internet has largely been about centralized Web sites, used for publishers, transactions and e-mail. For the most part, all seems well and good. At a personal level, however, many of us are overwhelmed. We're chained to e-mail and the Web, drowning in an information flood that leaves us feeling more and more like human message-processing machines.

Unfortunately, mainstay tools are falling behind our needs. Software was conceived in an era with substantially different requirements. For example, e-mail emerged 30 years ago, when computer viruses, spam and e-mail overload weren't even on the radar screen. That era could not conceive of a future in which we'd deal daily with online documents and presentations, e-mail and instant messages, Web sites and blogs.

Each of us will soon face hundreds, thousands, or tens of thousands of "inputs" that we'll need to continuously absorb and coordinate. A world with complex social, economic, organizational and personal interdependencies is inevitable. And as we near this linked future, systems and technologies must evolve or we will simply be unable to cope.


Ozzie believes that “Personal productivity tools will become joint productivity tools designed for online use instead of a paper-only world. A rich cadre of collaborative online writing, media management, presentation and consumption tools will move to the forefront of our daily electronic lives.”

The problem the Memex solves is that of rapid retrieval of relevant content from a humungous pool of unstructured content on the Web. Esther Dyson puts this in perspective in the January issue of Release 1.0:


To start, let’s just consider how the Web’s unstructured information can be organized. The two leading approaches are exemplified by Yahoo! and Google. Yahoo! has created a single, very broad taxonomy; although it has not in fact organized everything (!), it offers a directory (taxonomy) structure that in theory should be able to classify any content that shows up. By contrast, Google organizes the Web dynamically: Tell us what you want, and we’ll put it at the center of the world and find you the surrounding information…There’s a trade-off between depth and breadth; the directory offers fine-grained, carefully vetted material, while the search engine offers access to everything else.

Yahoo’s Srinija Srinivasan says: “Directories make most sense when you are browsing, when you want to discover something. Whereas you use search when you know what you are looking for…We can’t possibly manage the entire range of what people might be looking for. The directory was never intended to cover every word of every page out there.”

Google arose from the perspective…that the Web is simply too vast for anyone to define or structure it properly: Best to let each query define its own neighborhood, and to start each search from the query outwards, rather than from some mythical top down, to where the answer lives.

As Yahoo!’s Srinivasan notes, users have turned from browsing directories to searching, from exploring to going after specific results.


Between the two extremes of the centralised approaches of Yahoo’s directory and Google’s search is the individual, ant-like, emergent Memex. Constructing the Memex needs the active participation of each of us. As we have seen, the tools to bring to life Vannevar Bush’s 1945 vision are only now becoming available. As writing and self-publishing become easier, individuals are starting to provide a shape and form to information on the Web, and embellishing it with their thoughts and ideas. This is creating a richer, two-way web, built around the blogs, RSS and OPML ecosystems.

Next Week: Constructing the Memex (continued)

Monday, May 26, 2003
TECH TALK: Constructing the Memex: Connecting Blogs, Search and Personal Directories

Outlines - or Personal Directories - are the missing link in the information milieu that we see today. Imagine if each of us bloggers could create a set of pages which put our writings in context like a directory. So, now, if I wanted to find out more about WiFi or the Digital Divide and if I know that there is an expert in this area, then I can go to that person's blog, knowing that I will get a complete perspective through the outline and links, rather than just what are the new developments. The blogger already has a mental map - a taxonomy, a context - of the space. With transclusion (the ability to connect and show outlines in place), all these individual outlines could be independently linked together to create paths through the web which a search engine or a directory can never do.

What's missing? The language - OPML - is already there. What's missing is a mass-market outlining tool which can be integrated with blogging. Radio Userland has an outliner. But what's needed is integration at the blog post level - so that when I am doing a post, besides categorising it, I can also place it appropriately in my directory. Into this ecosystem of personal directories should then come search, and the ability to narrow searches - in a way the RSS search engines are now doing to blogs. They still do not cover verticals or trusted blogs, but that can be expected soon enough.

What Personal Directories will do is provide a context for viewing information. Instead of just seeing news items as individual specks, we will start seeing the landscape as a whole - through the eyes of the experts. This will create a richer overlay on the world that already exists. The time for a million, linked directories has now come.

Let’s think about a world with personal directories. Imagine we were doing a paper on the Memex. The first step we would do (as I did when I started thinking about this topic) is go to Google and type “memex”. This is the result we would get. It is a good starting point but considering that others have probably also explored this topic in great depth, wouldn’t it be useful to be (a) pointed to experts in this area, and (b) get connected to their outlines of the topic?

What is missing in the blogging world is a directory of experts. For now, we could perhaps use Google itself, though it is a short step from where we are to building this. Imagine if, when I search for a specific topic, it could point me to people who have written extensively on that topic, and whom others perhaps consider experts. This information could be gleaned by doing a semantic indexing of blog posts, along with seeing what others turn to the blogger for (for example, which of a blogger’s posts have the most inward links).
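One crude way to picture such a directory of experts is to score each blogger by how much they have written on a topic and how many inward links those posts have attracted. The sketch below is only that: plain substring matching stands in for real semantic indexing, the scoring rule (one point per post plus one per inward link) is my own assumption, and the data is invented.

```python
from collections import Counter

def rank_experts(posts, topic):
    """Rank bloggers on a topic by number of matching posts plus their inward links."""
    wanted = topic.lower()
    scores = Counter()
    for post in posts:
        # Substring match is a stand-in for semantic indexing of the post.
        if wanted in post["text"].lower():
            scores[post["blogger"]] += 1 + post["inbound_links"]
    return scores.most_common()

# Made-up posts with made-up inward link counts.
posts = [
    {"blogger": "blogger-a", "text": "WiFi hotspots in cities", "inbound_links": 5},
    {"blogger": "blogger-a", "text": "Why WiFi matters", "inbound_links": 3},
    {"blogger": "blogger-b", "text": "My first WiFi router", "inbound_links": 1},
    {"blogger": "blogger-c", "text": "Notes on outliners", "inbound_links": 9},
]

print(rank_experts(posts, "wifi"))
```

A blogger with many inward links on an unrelated topic (blogger-c here) does not appear at all, which is the difference between general popularity and expertise on the topic being searched.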

Basically, this creates a third alternative to finding information: Yahoo’s directory gives us information on websites, Google’s search gives us information on actual web pages, while our blog search gives us information on experts (who also maintain a blog). If bloggers started maintaining personal directories of the content space they have expertise in, it will provide a mapping of the blogosphere which is richer and more insightful and updated than anything we have seen before. By taking ideas from ants, brains, memes, and small worlds, the Memex can weave magic.

Tomorrow: Of Stigmergy and Memes

Tuesday, May 27, 2003
TECH TALK: Constructing the Memex: Of Stigmergy and Memes

It was one of those serendipitous discoveries that led me to a note by Joe Gregorio on Stigmergy. I was following a link from a Mike Bedan post on Memex. Mike’s blog had shown up tops in a search I had done on Google for Memex RSS Blogs OPML. I had put that combination of words in Google after many previous efforts. (One of my posts shows up tops in the search on Google.) Hopefully, it is this accidental discovery and click-and-try process that the Memex will address!

Back to Stigmergy and Joe Gregorio. Joe quotes E. Bonabeau, M. Dorigo, and G. Theraulaz in giving a definition of Stigmergy: “Self-Organization in social insects often requires interactions among insects: such interactions can be direct or indirect. Direct interactions are the "obvious" interactions: antennation, trophallaxis (food or liquid exchange), mandibular contact, visual contact, chemical contact (the odor of nearby nestmates), etc. Indirect interactions are more subtle: two individuals interact indirectly when one of them modifies the environment and the other responds to the new environment at a later time. Such an interaction is an example of stigmergy.”

While Joe does not explicitly talk about the Memex (the connection between Stigmergy and the Memex was made by Mike Bedan), he does talk of Weblogs, Neighbourhoods, and Google. And Memes. No, that’s not a typo. Memes are, according to Joe, “a unit of intellectual or cultural information that survives long enough to be recognized as such, and which can pass from mind to mind. They can be carried by word of mouth, dead trees, e-mail, or the web. On the web, in particular on weblogs, memes are tracked by links to particular sites or stories.” In other words, Memes are mind viruses.

A small diversion as we elaborate a little on Memes. To quote Richard Dawkins: “Memes should be regarded as living structures, not just metaphorically but technically. When you plant a fertile meme in my mind you literally parasitize my brain, turning it into a vehicle for the meme’s propagation in just the way that a virus may parasitize the genetic mechanism of a host cell.”

Not only is the word Meme very similar to the Memex that we are talking of constructing, Memes are what a lot of our ideas are about. When we interact with each other, we are transmitting our ideas and thoughts. These stick and grow. This is, in some ways, how writing happens. And as we read what others write, memes are transmitted. What weblogs do is enable the transmission of memes without the need for direct contact. In a way, they provide the shortcuts for meme propagation. And this is a key concept of the “Small Worlds” theory as articulated by Duncan Watts, which we will consider shortly. For now, suffice it to say that our personal Memex, in the form of blogs and personal directories, works as a meme-propagating vehicle.

Tomorrow: Of Stigmergy and Memes (continued)

Wednesday, May 28, 2003
TECH TALK: Constructing the Memex: Of Stigmergy and Memes (Part 2)

Joe Gregorio then connects the threads of the Web, Blogs, Google, Neighbourhoods, Memes and Stigmergy together:


The World-Wide Web is the first stigmeric communication medium for humans. The telephone and email don't count as stigmeric communication since they are only readable by the people on either end of the phone call, or the e-mail. In order for an environment to support stigmeric communication the messages must be readable by everyone. Radio and TV don't count since they are a read-only medium as far as most people are concerned. In order for an environment to support stigmery everyone has to be able to not only read it but to be able to write into it also.

Oh sure, we have had books and newspapers, but for the vast majority of people the only avenue they have to 'write-back' into that environment is in the 'letter-to-the-editors' department. Now we have Yahoo Groups, K5, Slashdot and weblogs. All avenues for anyone to enter into the conversation.
Now that we know web is a stigmeric communication medium and that we've seen some of the power that nature has gotten out of stigmergy the answers to our earlier questions become rather easy.

Why does communicating through a weblog work? Stigmergy. Using a weblog is communicating through stigmergy. Just like an ant, as I blog I leave a trail of information and links to other information I find interesting.

Why is Google's PageRank algorithm so good? It is just following the Ant Trails. If links represent a dropping of pheromone then Google is just following the trails laid down to the tastiest morsels.

Why do Neighborhoods form? Ant Corpse Piles. Just like Ant Corpse Piles, if I link to you and you link to me that brings our weblogs closer together. The more we talk about similar stuff the more likely we are to cross link to each other. The more links to each other and the more links from us to similar material on the web the closely Google thinks we are related. The habit of 'welcoming' new bloggers with similar interests by linking to their site with a welcome message only grows the pile.

Why do memes spread so effectively on the web? Stigmergy. Because they are travelling through a stigmeric medium. They can live on the internet where anyone can find them either intentionally, by using Google to follow the trail, or serendipitously by the idea moving into a receptive neighborhood.


Joe concludes: “The World-Wide Web is human stigmergy. The web and it's ability to let anyone read anything and also to write back to that environment allows stigmeric communication between humans. Some of the most powerful forces on the web today, Google and weblogs are fundamentally driven by stigmeric communication and their behaviour follows similar natural systems like Ant Trails and Nest Building that are accomplished using stigmergy.”

What Joe left unsaid, and what Mike did, is make the connection between Stigmergy and the Memex, an emergent system for information management based on the individual, collective efforts of all of us.

Tomorrow: Emergence

Thursday, May 29, 2003
TECH TALK: Constructing the Memex: Emergence

What is most interesting about the Memex is that it is an emergent system made up of local decisions made by a large number of individuals. Each of us is just going about our normal course of (blogging) life – making decisions on what content we like, whom to link to, what taxonomy to use for our personal directory, and so on. But out of these local decisions comes a bottom-up system that is beyond what a Yahoo or Google can ever hope to create – both because it cannot be cached and because it is continually evolving.

Writes Steven Johnson in his book “Emergence”: “If you’re building a system designed to learn from the ground, a system where macrointelligence and adaptability derive from local knowledge, there are five fundamental principles you need to follow.” Steven Johnson discusses the principles in the context of harvester ants. We will apply these principles in the context of the Memex.

The first principle is: More is different. “It is only by observing the entire system that the global behavior becomes apparent.” Individuals (think bloggers) do not know the big picture as they keep doing their routine of linking, commenting and outlining – it is as if they are working at the street level, with little understanding of the topology of the city.

The second principle is: Ignorance is useful. “Better to build a densely interconnected system with simple elements, and let the more sophisticated behavior trickle up.” Bloggers do their bit in terms of the simple acts of categorising and connecting, without resorting to any complex algorithms or top-down instructions.

The third principle is: Encourage random encounters. “These encounters are individually arbitrary, but because there are so many individuals in the system, they allow the individuals to gauge and alter the macrostate of the system itself.” In the world of bloggers, this translates to the people or content they connect to via search engines or the ones who land up at the blogs. This is over and beyond the ones that are “friends” or “friends of friends”. This opens up new content worlds and ideas.

The fourth principle is: Look for patterns in the signs. Just as ants look for patterns in pheromone secretions, bloggers can look for patterns in sites like Blogdex and Daypop, which provide an idea of the popular memes. Technorati’s links to new and promising bloggers are another example. “This knack for pattern detection allows metainformation to circulate through the mind: signs about signs.”

The fifth principle is: Pay attention to your neighbours. “Local information can lead to global wisdom.” Bloggers are not putting up a random collection of links; they are basing their decisions on what their neighbourhood does. This provides a feedback mechanism into the system.

Thus, local decisions made by bloggers are what enable the formation of the global Memex. This is emergence at work.

Tomorrow: Small Worlds

Tech Talk | PermaLink

Friday, May 30, 2003
TECH TALK: Constructing the Memex: Small Worlds

The Memex is about connecting people, ideas and information. In a way, it creates a “small world” out of the unstructured content that is out there on the Web. Writes Duncan Watts in his book “Small Worlds”:


The small-world phenomenon formalises the anecdotal notion that “you are only ever ‘six degrees of separation’ away from anybody else on the planet.” Almost everyone is familiar with the sensation of running into a complete stranger at a party or in some public arena and, after a short conversation, discovering that they know somebody unexpected in common. “Well, it’s a small world”, they exclaim. The small-world phenomenon is a generalised version of this experience, the claim that even when people do not have a friend in common, they are separated by only a short chain of intermediaries.

Adds Mark Buchanan in his book “Nexus”:

These small-world networks work magic. From a conceptual point of view, they reveal how it is possible to wire up a social world so as to get only six degrees of separation, while still permitting the richly clustered and intertwined social groups and communities that we see in the real world. Even a tiny fraction of weak links – long-distance bridges within the social world – has an immense influence on the number of degrees of separation…The long-distance social short-cuts that make the world small are mostly invisible in our ordinary social networks.

So, can this truth about social networks be extended to ideas and memes using the Memex?

What the Memex does is create a small world out of the content that is out there. So, in theory, a few clicks should be all that is required to take us from one page to another. The invisible short-cuts are created by bloggers. Just as people have some “weak ties” which shorten the distance in the social world, blogs, because they represent people’s interests, also make connections through some weak ties to other blogs.
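To make the effect of weak ties concrete, here is a minimal sketch in Python. The ring of twenty blogs, the single shortcut and the function names are all illustrative assumptions and not part of any Memex implementation. It counts the clicks on the shortest path between two blogs, before and after one long-distance link is added.

```python
from collections import deque

def ring_of_blogs(n):
    """Hypothetical blogosphere: each blog links only to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def add_link(graph, a, b):
    """Add a two-way link (a 'weak tie') between blogs a and b."""
    graph[a].add(b)
    graph[b].add(a)

def clicks_between(graph, start, goal):
    """Breadth-first search: fewest clicks from start to goal, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

blogs = ring_of_blogs(20)
before = clicks_between(blogs, 0, 10)  # only neighbour-to-neighbour links
add_link(blogs, 1, 9)                  # one long-distance shortcut
after = clicks_between(blogs, 0, 10)
print(before, after)                   # prints "10 3"
```

In this toy network a single extra link cuts the distance between the two farthest blogs from ten clicks to three, which is the kind of short-cut a blogger's occasional off-topic link provides.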

So, while my blog may mostly cover new technologies and ideas relevant to emerging markets like India, I also write about a few other topics that are of interest to me – like Memex or Entrepreneurship, for example. These are the weak ties that connect me to other people who would probably have been outside the gamut of the reading I would have done in the normal course of events. What the Memex does is make these weak ties visible.

The Memex makes it possible to connect us not just to information and ideas, but ultimately to people. In the year-long existence of my blog, I have interacted with many people I would probably never have interacted with otherwise. Blogging means putting a part of one’s persona and brain in public. The Memex then makes the connections, creating short-cuts through weak links to people (and memes) of whom one could not otherwise possibly have been aware. The Memex makes the world smaller and more connected.

This is important because in a world of plentiful information, we need a refinery to convert the raw, unstructured content “ores” into the gold of Knowledge and Insight. This is ultimately the challenge and hidden promise of the Memex.

Next Week: Constructing the Memex (continued)

Tech Talk | PermaLink

Monday, June 2, 2003
TECH TALK: Constructing the Memex: Three Elements

As we have seen, the three primary building blocks for the Memex are the ecologies around weblogs, RSS and OPML. Let us now put these elements together with some of the ideas from ants (stigmergy and emergence), social relationships (small worlds) and biology (memes) to put together the Memex.

The Memex ecosystem actually comprises three elements:

  • MyMemex: This is by, for, and of the individual – a personal knowledge management system. It consists of the person’s blog and directory. It can also have a visualization engine for a richer display of the embedded relationships and easier navigation.

  • OurMemex: We all belong to groups – be it in social circles or in enterprises. This Memex is constructed jointly by members of the group. Another way to create it is by simply specifying clusters of bloggers, in which case the result is an aggregate of the individual Memexes.

  • MemexCentral: This is the back-office of the Memex ecosystem. This is where the analytics takes place. It can also play host to the personal and group Memexes. It should be able to offer its services using the web services protocols.

    We can think of an analogy in the messaging ecosystem. MyMemex is equivalent to our individual mailboxes. We can have separate IDs for personal and for work purposes. OurMemex is akin to the group mailing lists. These could be simply aliases which bounce messages to various people (in our case, a Memex created by specifying a collection of bloggers) or specially created mailing lists like Yahoo Groups (akin to a Memex jointly and explicitly created by members).

    MemexCentral finds its counterpart in the messaging world in the various software tools (like Microsoft Exchange, Lotus Notes and Groove), the hosted services (Hotmail and YahooMail) and accessories (the spam filters). The role of web services is played by the SMTP protocol which allows for the exchange of messages between various mail servers.

    The blogging world offers analogies (no surprise, since blogs are one of the key cornerstones of the Memex). The blog creation tools like the desktop-centric Radio, server-centric MovableType and hosted Blogger.com mirror the possible approaches that can be taken for the creation of MyMemex. Group blogging platforms like Slashdot, Traction and TypePad allow multiple people to build up content collaboratively.

    The MemexCentral mirror comes in the form of aggregation services like Weblogs.com, which lists recently updated blogs, and sites like Blogdex, Daypop, BlogStreet, Technorati, Popdex and BlogShares, all of which capture some flavour of the blogosphere.

    Tomorrow: MyMemex

    Tech Talk | PermaLink

    Wednesday, June 4, 2003
    TECH TALK: Constructing the Memex: MyMemex (Part 2)

    Page Archiver, to fetch and store pages as specified, so that articles from sites which may restrict access at a later date can be archived locally and given a unique (local) URL for permanent reference.

    Summariser, to take a page specified by the user, and create a brief summary, extracting the essential ideas from the page. This would be especially useful when doing a search. (It could also be a web service offered by MemexCentral.)

    Search, which needs to be supported by a web services API to ensure not just full-text local search, but also to ensure that other Memexes can request a search. This is where the interlinkages start happening. An innovative idea proposed by Maciej Ceglowski is peer-to-peer Semantic Indexing.

    Visualiser, for a better display of the Memex and its relationships. In recent times, there has been extensive development of visualising tools like Grokker, MindMaps and TouchGraph, which can represent networks in a more intuitive manner.

    Digital Dashboard, to integrate all the information that is coming in on to a single screen. It can allow for a writing space to enable quick searches and additions to the blog, or an “events horizon” which shows the new feeds as they come in.

    PIM Connector, so as to capture information from the calendar and address book. We want to make sure that there are no silos of information, so the ability to have a 2-way linkage with the likes of Outlook and Evolution will be important.

    IM/SMS Integration, so that the user can receive alerts on different devices. A user should be able to set up filters on the type of events that will need to be tracked.

    Trail Tracking, which can be done by either capturing the user’s browsing history from the local computer or via the proxy server. Being able to show the pages surfed and the trail followed is an important indication of interest and should be preserved for future reference. Think of this as a Personal Panopticon.

    Google API Key, so the user can integrate and leverage searches using the web services provided by Google. By using Google as a web service, the results can be better integrated into what the user sees, rather than going off on to a separate page. The Google API can also be used to restrict searches to sites that more closely match our interests.
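As an illustration of one component from this list, the Page Archiver, here is a minimal sketch in Python. It assumes the page has already been fetched. The function name, the hash-based file naming and the example URL are hypothetical choices made for this sketch, not a scheme described in this series. The idea is only that the same article always maps to the same local reference.

```python
import hashlib
import tempfile
from pathlib import Path

def archive_page(url, content, archive_dir):
    """Store a fetched page locally and return a permanent local URL.

    The file name is derived from a hash of the original URL, so archiving
    the same article twice yields the same local reference (an assumption
    of this sketch).
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = archive_dir / name
    path.write_text(content, encoding="utf-8")
    return path.resolve().as_uri()

# Usage with a made-up URL and page body; fetching is deliberately left out.
store = tempfile.mkdtemp()
local_url = archive_page("http://example.com/article",
                         "<html>saved copy</html>", store)
print(local_url)
```

The returned `file://` address can then be used in a blog post or directory entry in place of the original link.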

    What’s missing in this picture? The Mirror Blog, a constantly updating view of the world and information space around us. But first, before we talk about the Mirror Blog, we will take a small detour into a remarkable concept outlined more than a decade ago.

    Tomorrow: Mirror Worlds

    Tech Talk | PermaLink

    Tuesday, June 3, 2003
    TECH TALK: Constructing the Memex: MyMemex

    MyMemex is the personal Memex. It is our knowledge management system. It comprises a weblog and a directory. The actions that we need to take at the individual level to help in the construction of the Memex are:

  • Each of us maintains a personal blog and directory. The directory outlines our interest areas, and can “transclude” other directories. The blog publishes an RSS feed, to which others can subscribe.

  • We also have an RSS IMAP Mailbox, so that we can subscribe to RSS feeds from different content sources and see them in our existing email client. This also enables us to post items from our regular mailbox, thus giving a permalink to mails.

  • All that we have to do is to keep blogging and ensure we keep our directory updated as we add new posts. (We may need a feature to specify posts as public, private or for a particular group.)
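The RSS-to-mailbox idea in the list above can be sketched roughly as follows in Python. The sample feed, the recipient address and the `X-Feed-Title` and `X-Item-Link` headers are invented for illustration. Delivery into an actual IMAP folder is left out, so this shows only the conversion of feed items into mail messages.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage

# A made-up RSS 2.0 feed standing in for a subscribed blog.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Hello, Memex.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More notes.</description>
    </item>
  </channel>
</rss>"""

def rss_to_messages(rss_text, recipient):
    """Turn each RSS item into an email message addressed to recipient."""
    channel = ET.fromstring(rss_text).find("channel")
    feed_title = channel.findtext("title", default="(untitled feed)")
    messages = []
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        msg = EmailMessage()
        msg["To"] = recipient
        msg["Subject"] = item.findtext("title", default="(no title)")
        msg["X-Feed-Title"] = feed_title   # hypothetical custom header
        msg["X-Item-Link"] = link          # hypothetical custom header
        msg.set_content(item.findtext("description", default="") + "\n\n" + link)
        messages.append(msg)
    return messages

msgs = rss_to_messages(SAMPLE_RSS, "me@example.com")
for m in msgs:
    print(m["Subject"], "->", m["X-Item-Link"])
```

Each message carries the item's link, so a mail client showing it still points back to the original post.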

    We may also be participants in some group blogs – these could be for various communities or associations to which we belong, or within the enterprise. In each case, the same actions with respect to blogging and maintaining the directory need to be taken. We will need to specify a blog post or a sub-directory as being public, private or visible only to specific groups.

    The browser that we use should have a bookmarklet feature, allowing us to easily post elements of a page that we are reading. The bookmarklet simply opens up a new window and prompts us for the appropriate information to create a blog entry. Ideally, a single click should be able to capture the page details on the personal blog.

    The MyMemex can be run on the desktop or hosted centrally for the individual version, or on the user’s LAN, in the case of the enterprise version. The components that comprise the MyMemex are:

    Blogging Tool, to create and manage the blog. It should also generate an RSS feed.

    News Reader, to aggregate RSS feeds and display them. Two useful features in the News Reader would be (a) RSS2Mail, thus enabling the display of the items in the email client, and (b) Mail2Blog, enabling the posting of items directly to the blog using the MetaWeblog API.

    Directory Manager, to create and manage the OPML-based personal directory. It should also support transclusion. Thus, if the user specifies transclusion of degree 2, it should show in place sub-directories up to two levels deep. The Directory Manager can be constructed on top of an Outliner, and would need an OPML Browser and Editor. A key requirement here is for each sub-directory to have a unique permalink, thus enabling it to be transcluded in another directory. The Directory Manager should integrate with the blogging tool to allow the user to update the directory at the time of posting.
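Transclusion by degree, as described for the Directory Manager, might look something like the following Python sketch. Several things here are assumptions made for illustration: the `memex://` permalinks, the use of a `url` attribute on an outline node to point at another directory, the in-memory `DIRECTORIES` table standing in for fetching over the network, and the reading of "degree" as the number of nested links that are followed.

```python
import xml.etree.ElementTree as ET

# Hypothetical directories keyed by permalink; a real system would fetch them.
DIRECTORIES = {
    "memex://alice/tech": """<opml version="1.0"><body>
        <outline text="Emerging Tech">
          <outline text="Weblogs" url="memex://bob/blogs"/>
        </outline></body></opml>""",
    "memex://bob/blogs": """<opml version="1.0"><body>
        <outline text="Blog Tools">
          <outline text="Aggregators" url="memex://carol/rss"/>
        </outline></body></opml>""",
    "memex://carol/rss": """<opml version="1.0"><body>
        <outline text="RSS Readers"/></body></opml>""",
}

def expand(permalink, degree, indent=0):
    """Return indented lines for a directory, following links `degree` deep."""
    body = ET.fromstring(DIRECTORIES[permalink]).find("body")
    lines = []

    def walk(node, depth):
        for child in node.findall("outline"):
            lines.append("  " * depth + child.get("text", ""))
            target = child.get("url")
            # Transclude the linked directory in place while degree remains.
            if target and degree > 0 and target in DIRECTORIES:
                lines.extend(expand(target, degree - 1, depth + 1))
            walk(child, depth + 1)

    walk(body, indent)
    return lines

for line in expand("memex://alice/tech", 2):
    print(line)
```

With degree 0 only the user's own two entries appear. With degree 2 the linked directory and the one it links to are both shown in place.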

    Tomorrow: MyMemex (continued)

    Tech Talk | PermaLink

    Thursday, June 5, 2003
    TECH TALK: Constructing the Memex: Mirror Worlds

    In 1991, David Gelernter wrote a landmark book called “Mirror Worlds”. Here’s an extract from an article about the book from Sohodojo:


    Mirror Worlds is the most important book about the Internet that you can read. What is even more amazing? Mirror Worlds isn't supposed to be about the Internet.

    Ten years after its publication, the really impressive thing about Mirror Worlds is what Gelernter and all the rest of us didn't foresee. The Mirror World is a magical Looking Glass; a transforming two-way mirror. The rapid growth of the Internet and its associated impact on the emerging global economy means that the model has become the system itself. The outside world is changing to reflect our lives inside the wired, network world we live in... not the other way around.

    In Mirror Worlds Gelernter envisioned us mustering the resources and implementation efficiencies to allow us to build grand software simulations of government, economic and social systems. Then, by cleverly instrumenting the simulations to be real-time reflections of the system being modeled... you get a BIG BANG!

    The simulation becomes something qualitatively different. It is a Mirror World. As more and more of our value exchanges and communication take place purely in cyberspace, the model is the system... we don't have to build the simulation and instrument it... the model and the system are one and the same.


    Steven Johnson wrote recently about Gelernter’s vision in a slightly different context:

    In 1991, computer scientist David Gelernter of Yale University predicted in his book Mirror Worlds that advances in computing power and connectivity would lead to the creation of virtual cities: micro versions of the real world built out of data streams and algorithms instead of bricks and concrete...Fast-forward a decade, and evidence of Gelernter's prescience abounds. Millions of people are active participants in virtual worlds that possess the economic and creative vitality of actual communities. The Net denizens who have built a homestead in massively multiplayer games like The Sims Online are the digital world's equivalent of the postwar immigration to California. The worlds are so vivid that the players now take the virtual objects that they've accumulated in these games—swords