TL;DR: This paper surveys research in the area of Web mining, points out some confusion regarding the usage of the term Web mining, suggests three Web mining categories, and situates some of the research with respect to these categories.
Abstract: With the huge amount of information available online, the World Wide Web is a fertile area for data mining research. Web mining research is at the crossroads of research from several communities, such as database, information retrieval, and, within AI, especially the sub-areas of machine learning and natural language processing. However, there is a lot of confusion when comparing research efforts from different points of view. In this paper, we survey the research in the area of Web mining, point out some confusion regarding the usage of the term Web mining, and suggest three Web mining categories. Then we situate some of the research with respect to these three categories. We also explore the connection between the Web mining categories and the related agent paradigm. For the survey, we focus on representation issues, on the process, on the learning algorithm, and on the application of the recent works as the criteria. We conclude the paper with some research issues.
TL;DR: Four factors that are critical to Web site success in EC were identified: information and service quality, system use, playfulness, and system design quality.
TL;DR: The ability to track users’ browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before, and it is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.
Abstract: The ease and speed with which business transactions can be carried out over the Web have been a key driving force in the rapid growth of electronic commerce. Business-to-business e-commerce is the focus of much attention today, mainly due to its huge volume. While there are certainly gains to be made in this arena, most of it is the implementation of much more efficient supply management, payments, etc. On the other hand, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users’ browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.
TL;DR: It is found that while successful search performance requires the combination of the two types of expertise, specific strategies directly related to Web experience or domain knowledge can be identified.
Abstract: Searching for relevant information on the World Wide Web is often a laborious and frustrating task for casual and experienced users. To help improve searching on the Web based on a better understanding of user characteristics, we investigate what types of knowledge are relevant for Web-based information seeking, and which knowledge structures and strategies are involved. Two experimental studies are presented, which address these questions from different angles and with different methodologies. In the first experiment, 12 established Internet experts are first interviewed about search strategies and then perform a series of realistic search tasks on the World Wide Web. From this study a model of information seeking on the World Wide Web is derived and then tested in a second study. In the second experiment two types of potentially relevant types of knowledge are compared directly. Effects of Web experience and domain-specific background knowledge are investigated with a series of search tasks in an economics-related domain (introduction of the Euro currency). We find differential and combined effects of both Web experience and domain knowledge: while successful search performance requires the combination of the two types of expertise, specific strategies directly related to Web experience or domain knowledge can be identified.
TL;DR: This framework has been applied to study how different groups of companies are using the Web for commercial purposes; on average, larger Web sites seem to be ‘richer’ and more advanced.
TL;DR: In this article, the authors apply content analysis techniques to the World Wide Web and find that this stable research technique can be applied to a dynamic environment. However, the rapid growt...
Abstract: Analysis of nineteen studies that apply content analysis techniques to the World Wide Web found that this stable research technique can be applied to a dynamic environment. However, the rapid growt...
TL;DR: The PageGather algorithm, which automatically identifies candidate link sets to include in index pages based on user access logs, is presented, and it is demonstrated experimentally that PageGather outperforms the Apriori data mining algorithm on this task.
TL;DR: Techniques for automatically computing the geographical scope of web resources, based on the textual content of the resources, as well as on the geographical distribution of hyperlinks to them are introduced.
Abstract: Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are relevant primarily to web users in geographical proximity to these locations. In contrast, other information resources are relevant to a broader geographical community. For instance, an on-line newspaper may be relevant to users across the United States. Unfortunately, current web search engines largely ignore the geographical scope of web resources. In this paper, we introduce techniques for automatically computing the geographical scope of web resources, based on the textual content of the resources, as well as on the geographical distribution of hyperlinks to them. We report an extensive experimental evaluation of our strategies using real web data. Finally, we describe a geographically aware search engine that we have built to showcase our techniques.
TL;DR: This paper presents a meta-anatomy of the web, which aims to explain the web in the context of 21st Century society and investigates its role in promoting human rights and democracy.
Abstract: 1. Enquire within upon everything 2. Tangles, links and webs 3. info.cern.ch 4. Protocols: Simple rules for global systems 5. Going global 6. Browsing 7. Changes 8. Consortium 9. Competition and consensus 10. Web of people 11. Privacy 12. Mind to mind 13. Machines and the web 14. Weaving the web
TL;DR: How a comprehensive and flexible strategy for building and maintaining a high-value community Web portal has been conceived and implemented based on an ontology as a semantic backbone for accessing information on the portal, for contributing information, as well as for developing and maintaining the portal is discussed.
Abstract: Community Web portals serve as portals for the information needs of particular communities on the Web. We here discuss how a comprehensive and flexible strategy for building and maintaining a high-value community Web portal has been conceived and implemented. The strategy includes collaborative information provisioning by the community members. It is based on an ontology as a semantic backbone for accessing information on the portal, for contributing information, as well as for developing and maintaining the portal. We have also implemented a set of ontology-based tools that have facilitated the construction of our showcase: the community Web portal of the knowledge acquisition community.
TL;DR: WebStudies as mentioned in this paper is a collection of studies of the Web's artistic and creative possibilities with political, economic and international perspectives, focusing on everyday Web life, art and culture, web business, and global web communities, politics and protest.
Abstract: From the Publisher:
Beginning with an introduction to the Web and how it works, followed by the theories and methods of cyberculture studies, WebStudies moves on to consider everyday Web life, art and culture, Web business, and global Web communities, politics and protest. Topics covered range from personal and fan websites, cyber-sexualities, webcams and Web-based art and entertainment, to global capitalism and the fight for Web domination, cybercrime and internet propaganda. Uniquely, the book combines studies of the Web's artistic and creative possibilities with political, economic and international perspectives. Each chapter includes suggestions for ways in which students can use the Web to further their own research; there are also lists of useful websites, a glossary and a bibliography.
TL;DR: An architecture and system for the analysis and prediction of user behavior and Web site usability is presented that integrates research on human information foraging theory, a reference model of information visualization and Web data-mining techniques, and new Web usability metrics.
Abstract: Designers and researchers of users' interactions with the World Wide Web need tools that permit the rapid exploration of hypotheses about complex interactions of user goals, user behaviors, and Web site designs. We present an architecture and system for the analysis and prediction of user behavior and Web site usability. The system integrates research on human information foraging theory, a reference model of information visualization and Web data-mining techniques. The system also incorporates new methods of Web site visualization (Dome Tree, Usage Based Layouts), a new predictive modeling technique for Web site use (Web User Flow by Information Scent, WUFIS), and new Web usability metrics.
TL;DR: This survey describes two successful link analysis algorithms and the state-of-the art of the field.
Abstract: The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state-of-the art of the field.
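As an illustration of what a link analysis algorithm computes, here is a minimal sketch of PageRank by power iteration. The abstract does not name the two algorithms it covers; PageRank is used here only as a widely known example. The function name `pagerank`, the damping value 0.85, the fixed iteration count, and the toy link graph are assumptions made for this sketch, not details taken from the survey.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a small link graph.

    links maps each page to the list of pages it links to.
    (Hypothetical helper for illustration only.)
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (assumed data): both "a" and "b" link to "c"; "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links)
```

In the toy graph, "c" collects rank from two in-links and passes most of it on to "a", while "b" has no in-links and keeps only the baseline share.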
TL;DR: Results indicate that many educational Web sites are still predominantly text-based and do not yet exhibit evidence of current pedagogical approaches (e.g., use of inquiry-based activities, application of constructivist learning principles, and use of alternative evaluation methods).
Abstract: The Web is a firmly established, though virtual, reality. Educators, well aware of the potential of Web technology, have adopted it for creating new Web-based learning environments. This article presents a study of the characteristics of Web sites as teaching and learning environments. The major questions addressed in this study were: 1. What characterizes educational Web sites at the content, teaching, learning, and communication levels? 2. How do key teaching and learning issues appearing on educational Web sites relate to educators’ expectations from the new technology? 3. What can a consideration of the current state of affairs teach us about further development and implementation of educational Web sites? To answer these questions we developed a classification scheme (the Taxonomy of WBLE); implemented it for the study of 436 educational Web sites focusing on mathematics, science, and technology learning; and elaborated on practical implications of the study’s results. The overall picture we have...
TL;DR: What "current" means for Web search engines and how often they must reindex the Web to keep current with its changing pages and structure are quantified.
Abstract: Most information depreciates over time, so keeping Web pages current presents new design challenges. This article quantifies what "current" means for Web search engines and estimates how often they must reindex the Web to keep current with its changing pages and structure.
TL;DR: A set of tools for extracting data from web sites and transforming it into a structured data format, such as XML, so that the resulting data can be used to build new applications without having to deal with unstructured data.
Abstract: A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.
TL;DR: Two techniques, based on clustering of user transactions and clustering of pageviews, are presented and experimentally evaluated in order to discover overlapping aggregate profiles that can be effectively used by recommender systems for real-time personalization.
Abstract: Web usage mining, possibly used in conjunction with standard approaches to personalization such as collaborative filtering, can help address some of the shortcomings of these techniques, including reliance on subjective user ratings, lack of scalability, and poor performance in the face of high-dimensional and sparse data. However, the discovery of patterns from usage data by itself is not sufficient for performing the personalization tasks. The critical step is the effective derivation of good quality and useful (i.e., actionable) "aggregate usage profiles" from these patterns. In this paper we present and experimentally evaluate two techniques, based on clustering of user transactions and clustering of pageviews, in order to discover overlapping aggregate profiles that can be effectively used by recommender systems for real-time personalization. We evaluate these techniques both in terms of the quality of the individual profiles generated, as well as in the context of providing recommendations as an integrated part of a personalization engine.
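To make the idea of an aggregate usage profile concrete, the sketch below derives a weighted pageview profile from an already-formed cluster of user transactions and matches it against an active session. It is a simplified illustration under stated assumptions, not the procedure evaluated in the paper: the clustering step itself is omitted, and the function names, the 0.5 weight threshold, the overlap-based scoring, and the sample pageview ids are all hypothetical.

```python
from collections import Counter


def aggregate_profile(cluster, threshold=0.5):
    """Build a profile from a cluster of transactions (sets of pageview ids).

    Each pageview's weight is the fraction of transactions containing it;
    pageviews below the threshold are dropped. (Hypothetical helper.)
    """
    if not cluster:
        return {}
    counts = Counter(pv for transaction in cluster for pv in set(transaction))
    size = len(cluster)
    return {pv: c / size for pv, c in counts.items() if c / size >= threshold}


def recommend(profiles, active_session, top_k=3):
    """Suggest unvisited pageviews from profiles that overlap the session."""
    active = set(active_session)
    scores = Counter()
    for profile in profiles:
        # Match score: total weight of the profile's pageviews already visited.
        overlap = sum(w for pv, w in profile.items() if pv in active)
        if overlap == 0:
            continue
        for pv, weight in profile.items():
            if pv not in active:
                scores[pv] = max(scores[pv], weight * overlap)
    return [pv for pv, _ in scores.most_common(top_k)]


# Assumed sample cluster of three transactions.
sample_cluster = [{"home", "cart"}, {"home", "cart", "pay"}, {"home"}]
sample_profile = aggregate_profile(sample_cluster)
sample_recs = recommend([sample_profile], ["home"])
```

Because a pageview can pass the threshold in more than one cluster, profiles built this way may overlap, which is the property the abstract highlights.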
TL;DR: In this article, an n-gram-based model is proposed that uses path profiles of users from very large data sets to predict users' future requests on the Web.
Abstract: As an increasing number of users access information on the Web, there is a great opportunity to learn from the server logs about the users' probable actions in the future. We present an n-gram based model to utilize path profiles of users from very large data sets to predict the users' future requests. Since this is a prediction system, we cannot measure the recall in a traditional sense. We, therefore, present the notion of applicability to give a measure of the ability to predict the next document. Our model is based on a simple extension of existing point-based models for such predictions, but our results show that for n-gram based prediction, when n is greater than three, we can increase precision by 20% or more for two realistic Web logs. We also present an efficient method that can compress our model to 30% of its original size so that the model can be loaded in main memory. Our result can potentially be applied to a wide range of applications on the Web, including pre-sending, pre-fetching, enhancement of recommendation systems as well as Web caching policies. Our tests are based on three realistic Web logs. Our algorithm is implemented in a prediction system called WhatNext, which shows a marked improvement in precision and applicability over previous approaches.
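The n-gram idea described above can be sketched in a few lines: count which request follows each length-n path in the training sessions, then predict the most frequent follower. This is a minimal illustration, not the WhatNext implementation. The function names and sample sessions are hypothetical, and the definitions used here (applicability as the share of positions where any prediction is made, precision as the share of predictions that are correct) are assumptions for the sketch that may differ from the paper's exact definitions.

```python
from collections import Counter, defaultdict


def train_ngram(sessions, n):
    """Map each length-n path of requests to a Counter of what followed it."""
    model = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - n):
            context = tuple(session[i:i + n])
            model[context][session[i + n]] += 1
    return model


def predict_next(model, recent, n):
    """Return the most frequent follower of the last n requests, or None."""
    followers = model.get(tuple(recent[-n:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def evaluate(model, sessions, n):
    """Return (applicability, precision) over held-out sessions."""
    total = predicted = correct = 0
    for session in sessions:
        for i in range(n, len(session)):
            total += 1
            guess = predict_next(model, session[:i], n)
            if guess is not None:
                predicted += 1
                correct += guess == session[i]
    applicability = predicted / total if total else 0.0
    precision = correct / predicted if predicted else 0.0
    return applicability, precision


# Assumed toy sessions standing in for paths extracted from server logs.
train_sessions = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["a", "b", "c", "d"]]
ngram_model = train_ngram(train_sessions, n=2)
test_scores = evaluate(ngram_model, [["a", "b", "c", "d"], ["q", "r", "s"]], n=2)
```

A longer context makes each prediction more specific but leaves more paths unseen, which is why a measure of how often the model can predict at all is reported alongside precision.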
TL;DR: The authors derive a novel conceptual framework that describes three interdependent phases of the competitive intelligence generation process: (1) organizing for competitive intelligence, (2) searching for information, and (3) sense-making.
Abstract: Marketing strategy begins with customer and competitive intelligence. However, in sharp contrast to customer intelligence, there is little research on how competitive intelligence (CI) is actually generated within an organization. The absence of this knowledge makes it difficult to identify ways to improve the CI generation process. Drawing on both depth interviews with full-time personnel who conduct competitive intelligence and academic literature in related fields, the authors derive a novel conceptual framework that describes three interdependent phases of the competitive intelligence generation process: (1) organizing for competitive intelligence, (2) searching for information, and (3) sense-making. Dimensions of efficacy at each phase are also identified, and they are posited to be influenced by factors pertaining to: (1) the intelligence network, (2) the business environment, (3) the information environment, and (4) analyst characteristics. This framework departs from the existing literature by identifying core components of the competitive intelligence generation process, highlighting its iterative nature, and identifying variables germane to its success. The emergent framework's implications for managing the competitive intelligence generation process are discussed and future research directions are suggested.
TL;DR: What the tension between information abundance and attention scarcity implies for the diversity of information accessible to users of the World Wide Web is examined, and theories about media content diversity are delineated to suggest what to expect with respect to content diversity online.
TL;DR: In this paper, the authors examine Web-based education and argue that it can successfully simulate face-to-face teaching models, while adding some unique features made possible by the technology.
Abstract: The Internet is changing the very nature of society in ways unparalleled since the industrial revolution. It is affecting local, national and global economies and their infrastructures. Information is available at any time from any place to any Internet user. This is creating tremendous opportunities for universities to provide a learning environment that is accessible to all. The "same time, same place, only some people" traditional educational environment is giving way to "anytime, anyplace and anybody" instructional models. For universities, the question becomes how to preserve and expand the desirable aspects of face-to-face teaching models when translating them into the new environment of Web-based education (WBE). This challenge is made even more complex when seen in the context of other trends in education: the transition from passive classroom lectures to hands-on, student-centered, interactive learning; the perception of students as "customers," with increased control over the learning process; a higher education market where traditional universities have to compete with for-profit enterprises. This chapter examines Web-based education and argues that it can successfully simulate face-to-face teaching models, while adding some unique features made possible by the technology. To be successful, however, this simulation requires adjustments in many areas, including student assessment, faculty training and expectations, and student expectations and motivation. In addition, the chapter examines several critical aspects of Web
TL;DR: A way to use a user’s personal arrangement of concepts to navigate the Web using the characterizations created by the OBIWAN system and the mapping of the reference ontology to the personal ontology is shown to have a promising level of correctness and precision.
Abstract: The publicly indexable Web contains an estimated 800 million pages; however, it is estimated that the largest search engine contains only 300 million of these pages. As the number of Internet users and the number of accessible Web pages grow, it is becoming increasingly difficult for users to find documents that are relevant to their particular needs. Often users must browse through a large hierarchy of categories to find the information for which they are looking. To provide the user with the most useful information in the least amount of time, we need a system that uses each user’s view of the world for classification. This paper explores a way to use a user’s personal arrangement of concepts to navigate the Web. This system is built by using the characterizations for a particular site created by the Ontology Based Informing Web Agent Navigation (OBIWAN) system and mapping from them to the user’s personal ontologies. OBIWAN allows users to explore multiple sites via the same browsing hierarchy. This paper extends OBIWAN to allow users to explore multiple sites via their own browsing hierarchy. The mapping of the reference ontology to the personal ontology is shown to have a promising level of correctness and precision.
TL;DR: It is shown that the use of Web mining for educational purposes is of great interest, and the emerging trends in distance education are facilitating its usability on the Internet.
Abstract: Application of data mining techniques to the WWW (World Wide Web), referred to as Web mining, has been the focus of several research projects and papers. One of several possibilities can be its application to distance education. Taken as a whole, the emerging trends in distance education are facilitating its usability on the Internet. With the explosive growth of information sources available on the WWW, Web mining has become suitable for keeping pace with the trends in education, such as mass customization. In this paper, we define Web mining and present an overview of distance education. We describe the possibilities of application of Web mining to distance education, and, consequently, show that the use of Web mining for educational purposes is of great interest.
TL;DR: The article defines a new research field, namely Web Intelligence (WI), by giving a complete picture of WI related topics for systematic study on advanced Web technology and developing Web based intelligent information systems.
Abstract: The 21st century is the age of the Internet and the World Wide Web. The Web revolutionizes the way we gather, process, and use information. At the same time, it also redefines the meanings and processes of business, commerce, marketing, finance, publishing, education, research, development, as well as other aspects of our daily life. The revolution is just beginning. Although individual Web based information systems are constantly being deployed, advanced issues and techniques for developing and for benefiting from Web intelligence still remain to be systematically studied. The article defines a new research field, namely Web Intelligence (WI) by giving a complete picture of WI related topics for systematic study on advanced Web technology and developing Web based intelligent information systems. Roughly speaking, WI exploits AI and advanced information technology on the Web and Internet. It is the key and the most urgent research field of IT for business intelligence.
TL;DR: The OIL language extends the RDF schema standard to provide just such a layer, which combines the most attractive features of frame based languages with the expressive power, formal rigour and reasoning services of a very expressive description logic.
Abstract: Exploiting the full potential of the World Wide Web will require semantic as well as syntactic interoperability. This can best be achieved by providing a further representation and inference layer that builds on existing and proposed web standards. The OIL language extends the RDF schema standard to provide just such a layer. It combines the most attractive features of frame based languages with the expressive power, formal rigour and reasoning services of a very expressive description logic.
TL;DR: A General Methodological Framework for the Development of Web-Based Information Systems and an Example-Based Environment for Wrapper Generation.
Abstract: Towards Ontology-Based Harmonization of Web Content Standards.- The M*-COMPLEX Approach to Enterprise Modeling, Engineering, and Integration.- Conceptual Design of Electronic Product Catalogs Using Object-Oriented Hypermedia Modeling Techniques.- Generic Linear Business Process Modeling.- Business Modelling Is Not Process Modelling.- Modeling Electronic Workflow Markets.- Building Multi-device, Content-Centric Applications Using WebML and the W3I3 Tool Suite.- Abstraction and Reuse Mechanisms in Web Application Models.- From Web Sites to Web Applications: New Issues for Conceptual Modeling.- Using Webspaces to Model Document Collections on the Web.- Modeling Interactions and Navigation in Web Applications.- A General Methodological Framework for the Development of Web-Based Information Systems.- Managing RDF Metadata for Community Webs.- An Example-Based Environment for Wrapper Generation.- Flexible Category Structure for Supporting WWW Retrieval.
TL;DR: This paper describes the experience of building a general place manager infrastructure useful for creating web representations for physical places, leveraging a general web presence architecture for building all different types of web presence.
Abstract: We believe that the future consists of nomadic people depending upon mobile appliances using World Wide Web protocols to communicate with services offered in real world places. Use of web protocols provides a ubiquitous communication infrastructure and allows interaction with the multitude of existing web-based services. Part of the challenge to realize our vision is to bridge the physical and virtual worlds by creating web representations for people, places, and things that interact virtually as they interact physically. We believe that an interesting set of new services can be provided by bridging the virtual and physical worlds in this way. This paper describes our experience with building a general place manager infrastructure useful for creating web representations for physical places. Although we leverage a general web presence architecture for building all different types of web presence, this paper focuses on the specific needs for building web representations for places.