Posts Tagged ‘Search’

Bobo browse

Thursday, December 24th, 2009

Bobo Browse is a faceted search implementation written purely in Java, built as an extension of Apache Lucene.

Another fine resource on Google Code, at:-

http://code.google.com/p/bobo-browse/

Finding files on a Linux machine

Saturday, December 19th, 2009

Searching by file size:-

find / -type f -size +20000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'

http://snippets.dzone.com/posts/show/1491

http://www.secguru.com/article/quick_tips_find_files_linux_file_system
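
The ls/awk pipeline above assumes the file name is the ninth column of the ls -l output, so a name containing spaces gets cut short. Here is a minimal sketch of an alternative, assuming GNU find (for -printf). The sandbox directory and file names are made up for illustration, and the search runs against the sandbox rather than /:

```shell
# Build a small sandbox so the example does not scan the whole system.
sandbox=$(mktemp -d)
# One 2 MiB file (with a space in its name) and one tiny file.
dd if=/dev/zero of="$sandbox/big file.bin" bs=1024 count=2048 2>/dev/null
echo "small" > "$sandbox/small.txt"

# -size +1024k matches files larger than 1024 KiB.
# -printf (GNU find only) prints the path and the size in bytes without ls.
big_files=$(find "$sandbox" -type f -size +1024k -printf '%p: %s bytes\n')
echo "$big_files"

rm -rf "$sandbox"
```

Only the large file is listed, and its full name survives because nothing is split on whitespace.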

Search by file name

find {dir-name} -name {file-name}

http://www.cyberciti.biz/faq/how-do-i-search-my-linuxunix-server-for-a-file/

eg:-

# find / -name foo.txt

A fuller explanation of the syntax is here:-

http://content.hccfl.edu/pollock/unix/findcmd.htm

Locating Files:

The find command is used to locate files on a Unix or Linux system.  find will search any set of directories you specify for files that match the supplied search criteria.  You can search for files by name, owner, group, type, permissions, date, and other criteria.  The search is recursive in that it will search all subdirectories too.  The syntax looks like this:

find where-to-look criteria what-to-do

All arguments to find are optional, and there are defaults for all parts.  (This may depend on which version of find is used.  Here we discuss the freely available GNU version of find, which is the version available on YborStudent.)  For example where-to-look defaults to . (that is, the current working directory), criteria defaults to none (that is, show all files), and what-to-do (known as the find action) defaults to -print (that is, display the names of found files to standard output).  Technically the criteria and actions are all known as find primaries.

For example:

find

will display the pathnames of all files in the current directory and all subdirectories.  The commands

find . -print
find -print
find .

do the exact same thing.  Here’s an example find command using a search criterion and the default action:

find / -name foo

This will search the whole system for any files named foo and display their pathnames.  Here we are using the criterion -name with the argument foo to tell find to perform a name search for the filename foo.  The output might look like this:

/home/wpollock/foo
/home/ua02/foo
/tmp/foo

If find doesn’t locate any matching files, it produces no output.
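
To try these defaults without touching the real file system, here is a small sketch that runs in a throwaway directory. The directory layout and file names are invented for the example. It always passes an explicit starting point (.), since leaving it out relies on the GNU behaviour described above:

```shell
# Work in a throwaway tree so the results are predictable.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/a/b"
touch "$sandbox/foo" "$sandbox/a/foo" "$sandbox/a/b/bar"

cd "$sandbox" || exit 1

# With no criteria, every file and directory under . is listed;
# -print is the default action, so both forms produce the same list.
all_default=$(find . | sort)
all_explicit=$(find . -print | sort)

# -name matches the base name; the search descends into a/ as well.
named=$(find . -name foo | sort)
echo "$named"

# No match means no output at all.
missing=$(find . -name no-such-file)

cd / && rm -rf "$sandbox"
```

The two foo files are reported with paths relative to the starting point, and the search for no-such-file leaves its variable empty.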

The find / -name foo example above searches the whole system, because it names the root directory (/) as the place to look.  If you don’t run this command as root, find will display an error message for each directory on which you don’t have read permission.  This can be a lot of messages, and the matching files that are found may scroll right off your screen.  A good way to deal with this problem is to redirect the error messages so you don’t have to see them at all:

find / -name foo 2>/dev/null
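
The effect of the redirect can be checked on a small scale. This is only a sketch: the sandbox and its open/ and locked/ directories are hypothetical, and when run as root the locked directory is still readable, so no error would be produced in the first place:

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/open" "$sandbox/locked"
touch "$sandbox/open/foo"
# Remove all permissions so a non-root user cannot read this directory.
chmod 000 "$sandbox/locked"

# 2>/dev/null discards only stderr (the "Permission denied" messages);
# matches on stdout are unaffected. find exits non-zero after such
# errors, so "|| true" keeps a script running under "set -e".
found=$(find "$sandbox" -name foo 2>/dev/null || true)
echo "$found"

# Restore permissions so the sandbox can be removed.
chmod 700 "$sandbox/locked"
rm -rf "$sandbox"
```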

You can specify as many places to search as you wish:

find /tmp /var/tmp . $HOME -name foo
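
A quick sketch of the same idea with two temporary directories standing in for the real locations (the names first and second are just for illustration):

```shell
first=$(mktemp -d)
second=$(mktemp -d)
touch "$first/foo" "$second/foo" "$second/other"

# Each starting point is searched in turn, in the order given.
found=$(find "$first" "$second" -name foo)
echo "$found"

# Count the matches: one per starting point here.
count=$(printf '%s\n' "$found" | wc -l)

rm -rf "$first" "$second"
```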

Plone Solr integration

Thursday, December 17th, 2009

collective.solr is an approach to integrating the Solr search engine with the Plone content management system.

The Solr Apache based search engine

Thursday, December 17th, 2009

Solr on Wikipedia

Sites using Solr

The features of Solr are:-

Solr in a Nutshell

Solr is a standalone enterprise search server with a web-services like API. You put documents in it (called “indexing”) via XML over HTTP. You query it via HTTP GET and receive XML results.

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces – XML, JSON and HTTP
  • Comprehensive HTML Administration Interfaces
  • Server statistics exposed over JMX for monitoring
  • Scalability – Efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture
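
The "XML over HTTP" round trip can be sketched with curl. Everything specific here is an assumption rather than something taken from the feature list: the id and title fields are hypothetical, SOLR_URL would be something like http://localhost:8983/solr, and the /update and /select paths are the ones used by the stock example configuration of that era. Nothing is sent unless SOLR_URL is set, otherwise the script just prints what it would send:

```shell
# Hypothetical document fields (id, title); a real schema defines its own.
add_xml='<add><doc><field name="id">doc-1</field><field name="title">Hello Solr</field></doc></add>'

# Query string for the HTTP GET; "title:hello" is an illustrative query.
query='q=title:hello'

if [ -n "${SOLR_URL:-}" ]; then
    # Index the document, then commit so it becomes searchable.
    curl -s "$SOLR_URL/update" -H 'Content-Type: text/xml' --data-binary "$add_xml"
    curl -s "$SOLR_URL/update" -H 'Content-Type: text/xml' --data-binary '<commit/>'
    # Query via HTTP GET and print whatever the server returns.
    curl -s "$SOLR_URL/select?$query"
else
    # Dry run: show what would be sent.
    echo "POST /update: $add_xml"
    echo "GET  /select?$query"
fi
```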

Solr Uses the Lucene Search Library and Extends it!

  • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
  • Powerful Extensions to the Lucene Query Language
  • Faceted Search and Filtering
  • Advanced, Configurable Text Analysis
  • Highly Configurable and User Extensible Caching
  • Performance Optimizations
  • External Configuration via XML
  • An Administration Interface
  • Monitorable Logging
  • Fast Incremental Updates and Index Replication
  • Highly Scalable Distributed search with sharded index across multiple hosts
  • XML, CSV/delimited-text, and binary update formats
  • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
  • Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
  • Multiple search indices

Detailed Features

Schema

  • Defines the field types and fields of documents
  • Can drive more intelligent processing
  • Declarative Lucene Analyzer specification
  • Dynamic Fields enable on-the-fly addition of new fields
  • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
  • Explicit types eliminate the need for guessing types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
  • Many additional text analysis components including word splitting, regex and sounds-like filters

Query

  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby, PHP, Velocity, binary)
  • Sort by any number of fields
  • Advanced DisMax query parser for high relevancy results from user-entered queries
  • Highlighted context snippets
  • Faceted Searching based on unique field values, explicit queries, or date ranges
  • Multi-Select Faceting by tagging and selectively excluding filters
  • Spelling suggestions for user queries
  • More Like This suggestions for given document
  • Function Query – influence the score by user specified complex functions of numeric fields or query relevancy scores.
  • Range filter over Function Query results
  • Date Math – specify dates relative to “NOW” in queries and updates
  • Dynamic search results clustering using Carrot2
  • Numeric field statistics such as min, max, average, standard deviation
  • Combine queries derived from different syntaxes
  • Auto-suggest functionality
  • Allow configuration of top results for a query, overriding normal scoring and sorting
  • Performance Optimizations
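
Several of the query features above are switched on with plain request parameters. The sketch below only builds and prints a query URL; it does not contact a server. The base URL and the field names (title, price, category, published) are assumptions for illustration, and the parameter names are the standard Solr ones as far as I know:

```shell
# Assumed local Solr URL and hypothetical field names.
base='http://localhost:8983/solr/select'

params='q=laptop'
# DisMax query parser, searching the title field.
params="$params&defType=dismax&qf=title"
# Sort by a field (the space is URL-encoded as %20).
params="$params&sort=price%20asc"
# Facet counts for each unique value of category.
params="$params&facet=true&facet.field=category"
# Date math in a filter query: published:[NOW-7DAYS TO NOW]
params="$params&fq=published:%5BNOW-7DAYS%20TO%20NOW%5D"
# Highlighted context snippets from the title field.
params="$params&hl=true&hl.fl=title"
# Ask for a JSON response instead of XML.
params="$params&wt=json"

url="$base?$params"
echo "$url"
```

The printed URL can be pasted into a browser or passed to curl against a running Solr instance whose schema has matching fields.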

Core

  • Dynamically create and delete document collections without restarting
  • Pluggable query handlers and extensible XML data format
  • Pluggable user functions for Function Query
  • Customizable component based request handler with distributed search support
  • Document uniqueness enforcement based on unique key field
  • Duplicate document detection, including fuzzy near duplicates
  • Custom index processing chains, allowing document manipulation before indexing
  • User configurable commands triggered on index changes
  • Ability to control where docs with the sort field missing will be placed
  • “Luke” request handler for corpus information

Caching

  • Configurable Query Result, Filter, and Document cache instances
  • Pluggable Cache implementations, including a lock free, high concurrency implementation
  • Cache warming in background
    • When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
  • Autowarming in background
    • The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
  • Fast/small filter implementation
  • User level caching with autowarming support

Replication

  • Efficient distribution of index parts that have changed
  • Pull strategy allows for easy addition of searchers
  • Configurable distribution interval allows tradeoff between timeliness and cache utilization
  • Replication and automatic reloading of configuration files

Admin Interface

  • Comprehensive statistics on cache utilization, updates, and queries
  • Interactive schema browser that includes index statistics
  • Replication monitoring
  • Full logging control
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web Query Interface w/ debugging output
    • parsed query output
    • Lucene explain() document score detailing
    • explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher.

Website Search Applications

Thursday, December 17th, 2009

A rare and useful comparison of Open Source Search Engines

Another comparison of free search engines

A comparison of Solr and Sphinx