In a prior post, I described how to set up Elasticsearch with Apache2. In this post, we will look at how to cache a set of files from a Windows share onto your web server and index them.
To do this, we need to perform the following steps:
- Initialize the index the first time.
- Mount a share.
- Rsync the data between the machines.
- Get the files that exist on the SMB share.
- Read what has been indexed.
- Diff the file list and the index list.
- Index the new files on the share.
- Delete the files (both the index entry and the cached file) that no longer exist on the share.
Note that these scripts are written in Python 2.7 (as opposed to the Python 3.x used in some of my other posts).
Initialize the Index
The following script “resets” the index: it deletes any existing index and creates it fresh. Note that the attachment mapping relies on the Elasticsearch attachment-type (mapper-attachments) plugin being installed.
#! /usr/bin/python
import httplib
import socket

import hostinfo

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    # Drop any existing index, then recreate it with one shard and no replicas.
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX,
        '{ "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}')
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s')
    # Map the attachment type so the file contents are indexed with term vectors.
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping',
        '{ "attachment" : { "properties" : { "file" : { "type" : "attachment", '
        '"fields" : { "title" : { "store" : "yes" }, '
        '"file" : { "term_vector":"with_positions_offsets", "store":"yes" } } } } }}')

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

socket.setdefaulttimeout(15)
conn = httplib.HTTPConnection(hostinfo.HOST)
connInitialize(conn)
connRefresh(conn)
Mount an SMB Share
On Ubuntu, you will need to install cifs-utils: “sudo apt-get install cifs-utils”.
Once that is done, you can mount the share with the following command. Choose your own mount point, obviously, and be prepared to enter your domain password.
sudo mount -t cifs //10.0.4.240/General /mnt/cifs -ousername=maksym.shyte,ro
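If you want the share mounted automatically at boot, a minimal /etc/fstab sketch is below. The credentials file path is an assumption I am making here; it should contain username= and password= lines and be readable only by root.
# /etc/fstab entry -- mount the share read-only at boot
//10.0.4.240/General  /mnt/cifs  cifs  ro,credentials=/root/.smbcredentials  0  0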
Rsync Between Server and SMB Share
The easiest way to do this is to build a list of the files you want searchable and then hand that list to rsync. This copies the files efficiently and leaves you with a text file listing the files on the SMB share.
function addToList {
    # Append the document types we care about to the list file ($2),
    # skipping .AppleDouble metadata and temporary/backup files.
    find "$1" -name \*.pdf -o -name \*.doc -o -name \*.docx \
        -o -name \*.xls -o -name \*.xlsx -o -name \*.ppt \
        -o -name \*.pptx -o -name \*.txt \
        | grep -v ".AppleDouble" | grep -v "~$" >> "$2"
}

currentPath=$(pwd)                  # where the list file is written
: > "$currentPath/rsynclist.txt"    # start each run with a fresh list

cd /mnt/cifs
addToList . $currentPath/rsynclist.txt
#addToList ./Some\ Directory $currentPath/rsynclist.txt

rsync -av --files-from="$currentPath/rsynclist.txt" /mnt/cifs /var/www/search/data
Read the Index
To read the index, the following script pulls the entries out and writes them to a file, one line per document with its location and key. You will then need to resolve the paths in the file list from the previous step against this index list, as the two are related by the source and destination directories passed to rsync.
#! /usr/bin/python
import httplib
import json
import sys
import os
import codecs

import hostinfo

argc = len(sys.argv)
if argc != 2:
    print os.path.basename(sys.argv[0]), "<index list file>"
    sys.exit(-1)

indexFileName = sys.argv[1]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

conn = httplib.HTTPConnection(hostinfo.HOST)

# Open a scan/scroll search that returns only the stored location field.
data = json.loads(connRequest(conn, 'GET',
    hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10',
    '{"query":{"match_all" :{}}, "fields":["location"]}'))
print data
total = data["hits"]["total"]

# Scroll session id, used to request the next batch of data.
scrollId = data["_scroll_id"]
counter = 0

data = json.loads(connRequest(conn, 'GET',
    hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
#print data

f = codecs.open(indexFileName, "w", "utf8")
while len(data["hits"]["hits"]) > 0:
    # Write one "location,id" line per indexed document.
    for item in data["hits"]["hits"]:
        f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
    f.flush()
    counter = counter + len(data["hits"]["hits"])
    print "Reading Index:", counter, "of", total
    scrollId = data["_scroll_id"]
    resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
    #print resp
    data = json.loads(resp)
f.close()
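Exactly what path rewriting is needed depends on the source and destination directories you passed to rsync. As one hypothetical sketch: if the only mismatch is the “./” prefix that find puts on the rsync list entries, a small helper like this could rewrite the list so its keys compare equal to the locations stored in the index:
#! /usr/bin/python
# Hypothetical helper: strip the leading "./" that find(1) adds to the
# rsync list so its entries match the "location" field in the index list.
# Usage: normalize.py <list in> <list out>
import sys

f = open(sys.argv[1])
lines = f.readlines()
f.close()

f = open(sys.argv[2], 'w')
for line in lines:
    line = line.replace('\n', '')
    if line.startswith('./'):
        line = line[2:]
    f.write(line + '\n')
f.close()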
Diff the File List and the Index List
Next we need to diff the two lists so we know which files to index and which to delete. The following script does that (presuming the lists have been modified to point at the same directory, i.e. /var/www/search/data). The output is an “add” text file and a “delete” text file. For example, if a.pdf appears only in the file list and c.txt only in the index list, a.pdf lands in the add file and c.txt (with its key) in the delete file.
#! /usr/bin/python
import sys
import os

argc = len(sys.argv)
if argc != 5:
    print os.path.basename(sys.argv[0]), "<file list> <index list> <add file> <delete file>"
    sys.exit(-1)

def createMap(filename):
    # Map the path (the first comma-separated field) to the full line.
    ret = {}
    f = open(filename)
    lines = f.readlines()
    f.close()
    for line in lines:
        line = line.replace('\n','')
        split = line.split(',', 1)
        key = split[0]
        ret[key] = line
    return ret

fileMap = createMap(sys.argv[1])
indexMap = createMap(sys.argv[2])

# If an entry is in fileMap but not indexMap, it goes into the add file.
# Whatever remains in indexMap afterwards goes into the delete file.
add = {}
for key in fileMap:
    if indexMap.has_key(key):
        del indexMap[key]
    else:
        add[key] = fileMap[key]

f = open(sys.argv[3], "w")
for key in add:
    f.write(add[key] + '\n')
f.close()

f = open(sys.argv[4], "w")
for key in indexMap:
    f.write(indexMap[key] + '\n')
f.close()
Add to the Index
Next we iterate through the files in the “add” list, base64-encode each one, and POST it to the index as an attachment.
#! /usr/bin/python
import httplib
import binascii
import json
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<add file> <root data dir>"
    sys.exit(-1)

rootFsDir = sys.argv[2]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

# Kept from the initialize script; only needed to rebuild the index from scratch.
def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX,
        '{ "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}')
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s')
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping',
        '{ "attachment" : { "properties" : { "file" : { "type" : "attachment", '
        '"fields" : { "title" : { "store" : "yes" }, '
        '"file" : { "term_vector":"with_positions_offsets", "store":"yes" } } } } }}')

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connAddFile(conn, filename, rootFsDir):
    title = os.path.basename(filename)
    location = filename[len(rootFsDir):]
    with open(filename, 'rb') as f:
        data = f.read()
    if len(data) > hostinfo.LARGEST_BASE64_ATTACHMENT:
        print 'Not indexing because the file is too large', len(data)
    else:
        print 'Indexing file size', len(data)
        base64Data = binascii.b2a_base64(data)[:-1]
        # Build the body with json.dumps so quotes in titles or paths are escaped.
        attachment = json.dumps({ "file": base64Data, "title": title, "location": location })
        print connRequest(conn, 'POST', hostinfo.INDEX + '/attachment/', attachment)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)
#connInitialize(conn)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0
rootFsDir = rootFsDir + '/'
for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    filename = rootFsDir + line
    print idx, filename
    try:
        connAddFile(conn, filename, rootFsDir)
    except Exception, e:
        # Report the error (e.g. a timeout) and reconnect before moving on.
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)

connRefresh(conn)
Delete the Files Not Needed
Finally, we delete the index entries and the cached files that are no longer needed.
#! /usr/bin/python
import httplib
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<delete file> <root data dir>"
    sys.exit(-1)

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connDeleteFile(conn, index):
    print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0
for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    split = line.split(',')
    filename = split[0]
    index = split[1]
    print "Delete:", idx, filename, index
    try:
        connDeleteFile(conn, index)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)
    # Remove the cached copy as well; ignore files that are already gone.
    try:
        os.remove(sys.argv[2] + '/' + filename)
    except:
        pass

connRefresh(conn)
There it is. I have all these steps, including resolving the paths between the file list and the index list. One further thing to note: the hostinfo module referenced by the Python scripts looks like this:
#! /usr/bin/python
HOST = '127.0.0.1:9200'
SITE = ''
INDEX = SITE + '/basic'
LARGEST_BASE64_ATTACHMENT = 50000000
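To tie the cycle together, here is a minimal driver sketch. All of the script names (rsynclist.sh, normalize.py, read_index.py, diff_lists.py, add_files.py, delete_files.py) are names I am choosing here for the steps above; substitute whatever you saved them as. It assumes the share is already mounted and the index has been initialized once.
#! /usr/bin/python
# Hypothetical driver running one sync-and-reindex cycle.
import subprocess

DATA = '/var/www/search/data'

def run(args):
    print ' '.join(args)
    subprocess.check_call(args)    # stop the cycle if a step fails

run(['./rsynclist.sh'])                                   # sync the share and build the file list
run(['./normalize.py', 'rsynclist.txt', 'filelist.txt'])  # align paths with the index locations
run(['./read_index.py', 'indexlist.txt'])                 # dump what is already indexed
run(['./diff_lists.py', 'filelist.txt', 'indexlist.txt', 'add.txt', 'delete.txt'])
run(['./add_files.py', 'add.txt', DATA])                  # index the new files
run(['./delete_files.py', 'delete.txt', DATA])            # drop stale entries and cached copies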