Google 2.0 embraces Semantic Web
By Joab Jackson
The recently announced revisions to the Google search service may require agency Webmasters to do more preparation to get their agency’s content properly indexed by the site. Agencies that undertake this effort, however, should enjoy significantly increased exposure as a result, noted Steve Arnold, a search consultant.
“It’s rule-changing,” Arnold said of Google’s new approach to search. “This will be a different game going forward with how information will be presented and what IT people will be expected to do.”
Earlier this week, Google announced that it was revamping its Internet search engine. Google promised that when people search on its site, they should start getting a wider range of results, including more links to videos, images, news, maps and books.
Among the changes, according to a report issued by the equity research firm Bear, Stearns & Co., is additional capability in semantic reasoning about the material the site indexes. The company has applied for a number of patents around a technology it calls the Programmable Search Engine, which will look for metadata posted on Web sites that defines the material on those sites.
According to the report, PSE will allow Webmasters to program an Internet search engine to categorize site content in very specific ways. This will allow Google and other search engines to identify the content as belonging to specific domains of expertise as well as to identify complex relationships with other sources of information. PSE is expected to be in use by the end of the year.
Google is, in effect, attempting to change the pecking order of the Web, from having search engines scan entire sites themselves to asking Webmasters “to tell us what you got,” Arnold said.
Arnold said that preparing the PSE metadata will require some work, however. The work should be easy for agencies that have already participated in the Google SiteMaps initiative, in which agency Webmasters provided a list of links to database queries so that they could be indexed by the search engine.
Those unfamiliar with the conventions of sitemaps will find the task more difficult, Arnold said.
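For readers who have not worked with sitemaps, the following is a minimal sketch of generating one in Python. It assumes the sitemaps.org 0.9 XML format (the article does not say which schema version agencies used), and the example.gov URLs are hypothetical placeholders for the database-query links mentioned above.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol (assumed format for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical links to database-backed pages that a crawler would not find on its own.
sitemap_xml = build_sitemap([
    "http://www.example.gov/records?id=1",
    "http://www.example.gov/records?id=2",
])
print(sitemap_xml)
```

The resulting string would normally be saved as a file on the site and submitted to the search engine.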
Overall, the job of preparing metadata for PSE will be large enough that agencies will need to devote additional funding to get it done, Arnold predicted. The Bear, Stearns report notes that “For large Internet sites … we think their management will need to assess the implications of Google’s PSE on their business. … If Google’s dominance grows, for Webmasters, we believe they will have little choice but to play ball with Google.”
Agencies are finding that more and more of their virtual visitors arrive by way of a search engine rather than by typing the agency’s Web address into the browser, so preparing the site for search engines is becoming increasingly important. “Visibility is important,” Arnold said. “Smart Web sites will get the clicks.”
Although Google garners the lion’s share of Internet searches, industry observers predict that other search services offered by Yahoo!, Microsoft and others will use the PSE format for their own services as well.
Arnold, who is head of the ArnoldIT consulting firm, is completing a book on Google, “Google 2.0: The Subtle Predator.” He provided some of the information to the Bear, Stearns report.
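Abstract: We describe a use case for a simple business system, then introduce several implementation technologies and compare them. I have not built a real information system based on Semantic Web (SW) technology, so this article is only conjecture. I hope it starts a discussion and draws advice from others; additions are welcome. Thank you!
Note that the subject of discussion is the traditional MIS, such as a library catalog query system or a student management system.
Use case: The user selects query conditions in a search interface, for example searching forum posts by subject, author, and time range.
Implementation: Following the common B/S (browser/server) framework, the system is divided into three layers: presentation layer, business logic layer, and data layer.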
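Commonly used technical approach today:
1) Presentation layer: Web forms, using JSP/servlet.
2) Business logic layer: receives the request, fetches data from the data layer, and returns the result to the presentation layer.
3) Data layer: data is stored in a relational database. Metadata such as subject and author are the field names of the data; in other words, the metadata is the database schema.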
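As a rough illustration of the relational approach for the forum-post use case, here is a minimal sketch using Python's built-in sqlite3 module rather than the JSP/servlet stack named above. The table name `posts`, its columns, and the sample rows are assumptions made for this example only.

```python
import sqlite3

# Data layer: the schema (column names) is the metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, subject TEXT, author TEXT, posted_on TEXT)"
)
conn.executemany(
    "INSERT INTO posts (subject, author, posted_on) VALUES (?, ?, ?)",
    [
        ("RDF basics", "alice", "2005-03-01"),
        ("XML Schema tips", "bob", "2005-04-15"),
        ("RDF query engines", "alice", "2005-06-20"),
    ],
)


def search_posts(conn, subject=None, author=None, since=None):
    """Business logic layer: turn the form's conditions into a parameterized query."""
    clauses, params = [], []
    if subject is not None:
        clauses.append("subject LIKE ?")
        params.append(f"%{subject}%")
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if since is not None:
        clauses.append("posted_on >= ?")  # ISO dates compare correctly as text
        params.append(since)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT subject, author, posted_on FROM posts" + where + " ORDER BY posted_on"
    return conn.execute(sql, params).fetchall()


# Conditions as they might arrive from the Web form.
results = search_posts(conn, subject="RDF", author="alice", since="2005-04-01")
print(results)
```

Note that the mapping from form fields to column names lives in the program, which is the conversion work discussed in the comparison below.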
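XML-based technical approach:
1) Presentation layer: Web forms, using XML+XSLT or XForms.
2) Business logic layer: receives the request, fetches data from the data layer, and returns the result to the presentation layer (using an XML query engine).
3) Data layer: data is represented in XML and stored in an XML database. The metadata is the XML tags; in other words, the metadata is expressed with XML Schema/DTD.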
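The same forum query can be sketched against XML data. This example uses Python's standard xml.etree.ElementTree as a stand-in for an XML query engine; the element names `forum`, `post`, `subject`, `author`, and `date` are assumptions for illustration, not a published schema.

```python
import xml.etree.ElementTree as ET

# Data layer: the tags (subject, author, date) play the role of metadata.
FORUM_XML = """
<forum>
  <post>
    <subject>RDF basics</subject>
    <author>alice</author>
    <date>2005-03-01</date>
  </post>
  <post>
    <subject>XML Schema tips</subject>
    <author>bob</author>
    <date>2005-04-15</date>
  </post>
  <post>
    <subject>RDF query engines</subject>
    <author>alice</author>
    <date>2005-06-20</date>
  </post>
</forum>
"""


def search_posts_xml(root, subject=None, author=None, since=None):
    """Return the subjects of posts matching all of the given conditions."""
    matches = []
    for post in root.findall("post"):
        post_subject = post.findtext("subject", default="")
        post_author = post.findtext("author", default="")
        post_date = post.findtext("date", default="")
        if subject is not None and subject.lower() not in post_subject.lower():
            continue
        if author is not None and post_author != author:
            continue
        if since is not None and post_date < since:  # ISO dates compare as text
            continue
        matches.append(post_subject)
    return matches


root = ET.fromstring(FORUM_XML.strip())
xml_results = search_posts_xml(root, subject="RDF", author="alice", since="2005-04-01")
print(xml_results)
```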
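RDF-based technical approach:
1) Presentation layer: Web forms. There are two options: a) keep the interface unchanged and hide the RDF Schema from the user, so the user experience is the same as with the two approaches above; b) let the user see the RDF Schema (ontology) and assemble query conditions from the relationships in it, as in the SHOE interface.
2) Business logic layer: receives the request, fetches data from the data layer, and returns the result to the presentation layer (using an RDF query engine).
3) Data layer: data is represented in RDF and stored in an RDF database. The metadata is RDF itself plus RDF Schema.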
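To make the RDF data layer concrete, the sketch below models data and schema as plain (subject, predicate, object) triples and answers queries by pattern matching. It is a toy in pure Python, not a real RDF store or query engine; the identifiers `post1`, `ex:author`, `ex:Post`, and so on are hypothetical.

```python
# Data and schema live in the same set of triples.
triples = {
    # Schema-level statements (metadata expressed as RDF Schema).
    ("ex:subject", "rdfs:domain", "ex:Post"),
    ("ex:author", "rdfs:domain", "ex:Post"),
    ("ex:date", "rdfs:domain", "ex:Post"),
    # Instance-level statements (the forum posts themselves).
    ("post1", "ex:subject", "RDF basics"),
    ("post1", "ex:author", "alice"),
    ("post1", "ex:date", "2005-03-01"),
    ("post2", "ex:subject", "XML Schema tips"),
    ("post2", "ex:author", "bob"),
    ("post2", "ex:date", "2005-04-15"),
    ("post3", "ex:subject", "RDF query engines"),
    ("post3", "ex:author", "alice"),
    ("post3", "ex:date", "2005-06-20"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)
    ]


# Query the data: which posts were written by alice?
posts_by_alice = sorted(s for s, _, _ in match(triples, p="ex:author", o="alice"))

# Query the schema with the same mechanism: which properties describe a Post?
post_properties = sorted(s for s, _, _ in match(triples, p="rdfs:domain", o="ex:Post"))

print(posts_by_alice)
print(post_properties)
```

Because the schema is queried the same way as the data, a user interface of type b) could build its query form from `post_properties` instead of hard-coded field names.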
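Comparison:
1) User experience: perhaps little changes. Presenting the RDF Schema to users is not a good idea.
2) Developer experience:
(a) Conversion from the conceptual level to the data level: the query interface shown to the user is based on the system's conceptual model, while the underlying relational database is a data model, so developers must do a lot of conversion work between the two. RDF is itself at the conceptual level and hides many syntax-level details, so representing data in RDF keeps this conversion work to a minimum.
(b) Hard-coding of semantic information: our world needs semantic information, but traditional MIS and XML systems carry none. To let such a system solve real-world problems, the semantic information is hard-coded into the program. For example, to let a user look up information about fruit, the developer must first know that apples and bananas are fruit and then query apples and bananas separately. RDF technology has semantics and inference, so the RDF engine can tell the developer that bananas and apples are fruit. When the inference becomes more complex, the human mind cannot keep up, so the program easily becomes incomplete, and this incompleteness cannot be detected by the program itself. This is also why software for systems such as spacecraft and nuclear power plants requires formal methods.
3) System interoperability:
XML has been heavily promoted as a way to solve the interoperability difficulties of traditional systems, but XML solves only the syntax-level problem. XML itself has no formal semantics, so semantic information still has to be hard-coded. It is easy to foresee the result: you hard-code yours and I hard-code mine, and if we understand some item differently, our two systems will run into problems when they work together. For example, if your system treats salary as pre-tax and mine treats it as after-tax, and this is hard-coded in the programs, the two systems will not interoperate correctly.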
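The fruit example in point 2(b) can be sketched as a small subclass inference. This is a simplified pure-Python illustration of RDFS-style subClassOf reasoning, not the behavior of any particular RDF engine; the class names and item identifiers are invented for the example.

```python
# Schema: (child class, parent class) pairs, playing the role of rdfs:subClassOf.
subclass_of = {
    ("Apple", "Fruit"),
    ("Banana", "Fruit"),
    ("Fruit", "Food"),
    ("Carrot", "Vegetable"),
    ("Vegetable", "Food"),
}

# Data: (item, class) pairs, playing the role of rdf:type.
instance_of = {
    ("apple_42", "Apple"),
    ("banana_7", "Banana"),
    ("carrot_3", "Carrot"),
}


def subclasses(cls, subclass_of):
    """Return cls together with all of its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in subclass_of:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls, subclass_of, instance_of):
    """Return every item whose class is cls or one of its subclasses."""
    classes = subclasses(cls, subclass_of)
    return sorted(item for item, item_class in instance_of if item_class in classes)


# The caller asks for "Fruit" and never names Apple or Banana.
fruit_items = instances_of("Fruit", subclass_of, instance_of)
food_items = instances_of("Food", subclass_of, instance_of)
print(fruit_items)
print(food_items)
```

Adding a new kind of fruit then means adding one pair to the schema, with no change to the query code, which is the point the comparison makes about hard-coded semantics.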
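Conclusions:
1) Any MIS can be built with SW technology. RDF can be seen as XML data with semantics, so RDF can be used anywhere XML can. Is there anywhere left today where XML cannot be used?
2) Closed, standalone MIS do not need to be reengineered with SW technology.
3) Semantic Web technology can make developers happier and give systems good interoperability.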
1. Applications of RDF
Mozilla: XUL (XML User Interface Language)
http://www.mozilla.org/rdf/doc/faq.html#xul_templates
IBM: ORIENT (by IBM CRL)
http://www.alphaworks.ibm.com/aw.nsf/reqs/semanticstk
IBM: Ontology-based Web Services for Business Integration
http://www.alphaworks.ibm.com/tech/owsbi
Intellidimension: RDFJDBC
http://www.intellidimension.com/default.rsp?topic=/pages/rdfgateway/reference/client/jdbc.rsp
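Adobe: its tools add an RDF-based metadata description to file formats and use it in Adobe's systems.
Boeing: its data integration management system (Data integration by Boeing) uses RDF/RDFS as the intermediate language for data exchange between systems. There are many similar applications, for example Artiste.
OntoWeb: ontology-based information exchange for knowledge management and electronic commerce, an EU project.
W3C website content management: document information on the W3C site is stored and retrieved as RDF, and the same information can be browsed through different views.
CC/PP (Composite Capabilities/Preference Profiles) is a standard proposed by the W3C for describing the capabilities of mobile devices and the preferences of their users, so that a content server can decide how to organize content based on both. When a device issues a service request over some communication protocol (such as HTTP), its CC/PP profile is sent along with the request, and the server can filter and transform the content accordingly to meet the device's needs.
PRISM: see the earlier material.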
2. Existing RDF APIs:
Java: developed by Stanford
http://www-db.stanford.edu/~melnik/rdf/api.html
HP (RDF, DAML+OIL, OWL): RDF API based on the Stanford API
http://www.hpl.hp.com/semweb/
C/C++ :RDFStore
http://rdfstore.sourceforge.net/
Python: pyrple
http://infomesh.net/pyrple/
PHP:RAP
http://www.wiwiss.fu-berlin.de/suhl/bizer/rdfapi/
Graphical editors
IsaViz (Xerox Research/W3C) or RDF Author (Univ. of Bristol)
RDF application development environments (parsers, etc.)
Jena (for Java, includes OWL Lite reasoning), RDFLib (for Python)
Redland (in C, with interfaces to Tcl, Java, PHP, Perl, …)
Sesame (RDF Schema-based storage and querying)
RDF validator
http://www.w3.org/RDF/Validator/
RDF to OWL converter
http://www.mindswap.org/2002/owl.html
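3. A fairly complete application of RDF in China
Semantics-based Traditional Chinese Medicine (TCM) data grid:
Goal: to build on the group of TCM scientific and technical information databases that are distributed across the country and were developed by the China TCM Literature Retrieval Center of the State Administration of Traditional Chinese Medicine and its branch centers, with application requirements drawn from TCM data and knowledge management, clinical decision support for individualized diagnosis and treatment, and computer-aided development of new Chinese medicines.
Results:
1. CaseML, a semantic graph description language based on the RDF data model
2. Support for ontology-based semantic graph browsing
3. A rule-based inference engine plug-in
Passed a Zhejiang provincial-level appraisal: the research results as a whole reached a leading level within China, and reached an internationally advanced level in areas such as the semantics-based database grid service protocol and Semantic Web-based semantic browsing.
For knowledge management: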
http://www-128.ibm.com/developerworks/cn/xml/rdf/part8/index.html
For social networks:
http://www-128.ibm.com/developerworks/cn/xml/x-watch/part3/index.html
RDF for storing Web metadata
http://scholar.ilib.cn/Abstract.aspx?A=qbxb200302009
Firefox's JavaScript has built-in support for RDF (XUL)
Applications in other areas