
Infix to Postfix – but Better

Summary

There are many examples of the shunting-yard algorithm that takes infix and converts it to postfix (or RPN).  None of them did everything I wanted.  I wanted something full featured, sort of like Excel.  I wanted:

  1. Negative numbers need to be handled; they are often ignored in converters.
  2. Single and multi-character operators (e.g. <<, >>) need to be supported.
  3. I wanted support for multiple-argument functions and even no-argument functions.
  4. I wanted solid error messages and an indication of where the error was.
  5. I wanted an optimization step where constants are evaluated and reduced prior to execution.
  6. You could name your variables anything you wanted.

The code is here:  InfixParse

Motivation

The motivation was a project where we need a thin PC client to process an infix equation and translate it down to a set of opcodes and data for a small processor (limited memory and horsepower).

The first step is getting the conversion to work in high-level constructs.

The second step, which will happen later, is to convert the high-level language constructs to a set of opcodes and data.

Background

There are numerous examples of equation processing, but I never found any that did everything sort of like Excel does.

This one is pretty good: http://interactivepython.org/runestone/static/pythonds/BasicDS/InfixPrefixandPostfixExpressions.html

Maybe someone else did a full implementation but I never found it.

Tools

PyCharm and Python 3.6.

Definitions

operator – This is a symbol such as +, -, *, /.  It can be more than one character.  Technically, the ( is also an operator.

function – This is an operation in the form of FUNC(…) with a certain number of arguments defined for that function.  Each argument is separated by a comma.  If there are no arguments (for example, RND for random), then no parentheses are allowed.  Functions can have zero to “n” arguments.

variable – This is a textual name for a variable (or input) in the equation.  It can be one or more characters.

constant – This is a number in the infix equation.

Steps

As we work through the steps, refer to this equation:

2 * (1 + -3) + abs(-round(rnd, 2))

In words, this is 2 times the sum of 1 and -3, plus the absolute value of a random number rounded to the 2nd decimal place and negated.

The first step is to convert the infix to a set of stack items.  We are effectively splitting on operators.  Then each stack item is evaluated for the type of information between the operators.  After this step, the infix stack looks like this.

[<2, constant>, <*, operator>, <(, operator>, <1, constant>, <+, operator>, <-, operator>, <3, constant>, <), close>, <+, operator>, <ABS, function>, <(, operator>, <-, operator>, <ROUND, function>, <(, operator>, <RND, function>, <,, operator>, <2, constant>, <), close>, <), close>]

Notice there are two subtraction operators that really should be negation.  So the second step is to determine if there is a valid operator in front of the subtraction and a function, constant, or variable immediately after it.  After this check, here is the infix stack.  (I used the ! as a negation symbol.)

[<2, constant>, <*, operator>, <(, operator>, <1, constant>, <+, operator>, <!, operator>, <3, constant>, <), close>, <+, operator>, <ABS, function>, <(, operator>, <!, operator>, <ROUND, function>, <(, operator>, <RND, function>, <,, operator>, <2, constant>, <), close>, <), close>]
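As a flavor of how this pass can work, here is a minimal sketch.  It assumes tokens are (text, kind) tuples like the stacks shown above; the names are illustrative and not the actual InfixParse code.

def mark_negation(tokens):
    # Rewrite a binary '-' as unary '!' when it follows an operator (or starts
    # the expression) and a constant, variable, or function comes right after.
    out = []
    for i, (text, kind) in enumerate(tokens):
        prev_is_op = not out or out[-1][1] == 'operator'
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if (text == '-' and kind == 'operator' and prev_is_op
                and nxt is not None
                and nxt[1] in ('constant', 'variable', 'function')):
            out.append(('!', 'operator'))
        else:
            out.append((text, kind))
    return out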

The next 3 steps check the validity of the infix equation.  First, we make sure that we do not have operators next to each other.  Next, we check for mismatched parentheses.  Finally, we check for the correct number of arguments in functions.

If anything fails, the Python script will tell where the error occurred and what the error was.  This can certainly be beautified into whatever final UI is chosen.  The information is there.
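As a flavor of how an error position can be produced, here is a sketch of the parenthesis check, using the same assumed token shape as above:

def check_parens(tokens):
    # Returns the index of the offending token, or None if balanced.
    depth = 0
    for pos, (text, kind) in enumerate(tokens):
        if text == '(':
            depth += 1
        elif kind == 'close':
            depth -= 1
            if depth < 0:
                return pos                       # unmatched ')'
    return None if depth == 0 else len(tokens) - 1   # unmatched '('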

If there are no errors, the stack is shunted into a postfix stack and the commas and parentheses are removed.

[<2, constant>, <1, constant>, <3, constant>, <!, operator>, <+, operator>, <*, operator>, <RND, function>, <2, constant>, <ROUND, function>, <!, operator>, <ABS, function>, <+, operator>]
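For reference, the core of such a shunting pass might look like the sketch below.  The precedence table and token handling are assumptions chosen to reproduce the example, not the actual InfixParse implementation.

PREC = {'!': 4, '*': 3, '/': 3, '+': 2, '-': 2}

def shunt(tokens):
    output, ops = [], []
    for text, kind in tokens:
        if kind in ('constant', 'variable'):
            output.append((text, kind))
        elif kind == 'function' or text == '(':
            ops.append((text, kind))
        elif text == ',':
            while ops and ops[-1][0] != '(':
                output.append(ops.pop())   # flush the current argument
        elif kind == 'close':
            while ops and ops[-1][0] != '(':
                output.append(ops.pop())
            ops.pop()                      # discard the '('
            if ops and ops[-1][1] == 'function':
                output.append(ops.pop())   # the call owns these parentheses
        else:                              # ordinary operator
            while (ops and ops[-1][0] != '(' and ops[-1][1] != 'function'
                   and PREC.get(ops[-1][0], 0) >= PREC.get(text, 0)):
                output.append(ops.pop())
            ops.append((text, kind))
    while ops:
        output.append(ops.pop())
    return output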

Now, there are constants that can be combined.  While this example is contrived, there are cases where a couple of constants can be combined.  This example simplifies to this.

[<-4.0, constant>, <RND, function>, <2, constant>, <ROUND, function>, <!, operator>, <ABS, function>, <+, operator>]
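The folding pass can be done by partially executing the postfix stack, reducing an operator only when everything it consumes is a constant.  A sketch, with an illustrative operator table:

BINOPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
          '*': lambda a, b: a * b, '/': lambda a, b: a / b}

def fold_constants(postfix):
    out = []
    for text, kind in postfix:
        top_const = out and out[-1][1] == 'constant'
        if text == '!' and top_const:
            # negation of a known constant
            out.append((str(-float(out.pop()[0])), 'constant'))
        elif (kind == 'operator' and text in BINOPS and len(out) >= 2
                and top_const and out[-2][1] == 'constant'):
            # both operands are constants, so evaluate now
            b = float(out.pop()[0])
            a = float(out.pop()[0])
            out.append((str(BINOPS[text](a, b)), 'constant'))
        else:
            out.append((text, kind))
    return out

Run over the postfix stack above, this folds 2 * (1 + -3) down to the single -4.0 constant while leaving the RND-dependent part untouched.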

Executing this yields a result between negative 4 and negative 3 such as -3.81 or -3.66.

Future

This code will likely be translated into C++ with the functions in C so that it can be shared with the embedded system.  The functions used for optimization and execution must be exactly the same, and this is a case where we will want to share source.

The Python script could be better organized into several modules, but this is a sample for the final implementation, which will likely be in C/C++.

Raspberry Pi and the Lego EV3 connected by Bluetooth

Executive summary

Here is how to interact between a Raspberry Pi 3 and a Lego EV3 using Bluetooth.  There will be a Python 3.6 script presented and sample EV3 programs.  All of this is stock stuff on both platforms.

The key is how to interpret the bytes.  The script presented takes care of all of that.  Here is the EV3BT Script.

The fun stuff

So, I could possibly have a family and I could indeed possibly like to enjoy time with them.  I play with Lego in the interest of my family.  I am a good dad…  unless explosives are involved.

Don’t judge.

So the project was to determine if we could use a Raspberry Pi with OpenCV and a camera to drive a Lego robot around the floor using the EV3, trying to do cool things.

There are several communication options, but I didn’t want to use any sort of wired or wireless Ethernet, nor is USB an option from what I can tell.  I also didn’t want to change the EV3’s operating system.

I decided to play with Bluetooth.  There are probably other and perhaps better ways to do this, but this is what I chose.

Enough disclaimers about the effectiveness of my decision-making skills, especially regarding the explosives and related minors.

There are many sites that explain how to connect the EV3 and even program it.  This guy’s is pretty good.

If you are unfamiliar with pairing the Raspberry Pi 3 and the EV3, google it.  It is a bit cryptic, but once paired, you can connect to the EV3 and it appears as a serial device (i.e. /dev/rfcomm0).

The trick, and the purpose of the EV3BT script, is interpreting the bytes that travel over that serial connection.

Sending to the EV3

The EV3 program simply waits for a Bluetooth message (in this case, a text message), prints it on the LCD screen, and beeps.  The Python below shows the sending of a message to the EV3.  The trick of the EV3BT class is to simply encode and decode the bits.  The serial module takes care of the transmission.

#! /usr/bin/env python3
import serial
import time
import EV3BT

EV3 = serial.Serial('/dev/rfcomm0')
s = EV3BT.encodeMessage(EV3BT.MessageType.Text, 'abc', 'Eat responsibly')
print(EV3BT.printMessage(s))
EV3.write(s)
time.sleep(1)
EV3.close()

You have options for what sort of messages you can send.  These are Text, Numeric, and Logic.  In programmer speak: strings, floating point values (IEEE 754), and Boolean (literally a 1 and a 0).

The other key is the mailbox.  The ‘abc’ is the mailbox name shown on the receiving block.  You must pass the mailbox in a form that matches the block so that the block receives the message.  In case it isn’t obvious, you can send messages to multiple mailboxes in your script and control several EV3 loops just by using different mailboxes.

Reading from the EV3

Reading is just a little more complicated in that you probably need some sort of loop with blocking logic.  As I say that, every experienced developer reading this is rolling their eyes (back in old school Linux days, should we use select or poll?).

In this example, the EV3 will wait for a touch sensor to be activated and send a message.  To be precise, the message ‘Raspberry PI’ will be sent to the mailbox ‘abc’ on the Raspberry Pi.

The key is decodeMessage.  You tell the method the string to parse and the message type expected (as this affects the length of the transfer, and there is no identifying mark in the message about the type).  It will return the mailbox, the value, and any remnant bytes left over.  The buffering is not stellar in this example but it gets the point across.

#! /usr/bin/env python3
import serial
import time

import EV3BT

EV3 = serial.Serial('/dev/rfcomm0')
print("Listening for EV3 Bluetooth messages, press CTRL C to quit.")
try:
    while 1:
        n = EV3.inWaiting()
        if n != 0:
            s = EV3.read(n)
            mail, value, s = EV3BT.decodeMessage(s, EV3BT.MessageType.Logic)
            print(mail, value)
        else:
            # No data is ready to be processed
            time.sleep(0.1)
except KeyboardInterrupt:
    pass
EV3.close()

Moving a robot

So to put it together, I could make a robot move under Python control on the Raspberry Pi with this script and program.

The EV3 was made as a tank with two treads, a left and a right.  Essentially, the script is sending a numeric to each tank tread.  A positive number moves forward, a negative number moves back.  The larger the number, the faster the rotation.

#! /usr/bin/env python3
import serial
import time
import EV3BT

EV3 = serial.Serial('/dev/rfcomm0')
left = 10
right = 10
s = EV3BT.encodeMessage(EV3BT.MessageType.Numeric, 'left', left)
EV3.write(s)
s = EV3BT.encodeMessage(EV3BT.MessageType.Numeric, 'right', right)
EV3.write(s)
EV3.close()

Areas of improvement

When reading, there is no identifying mark in the packet concerning the type.  You also cannot assume the number of bytes on the wire identifies the type; a 3-character string with its null terminator is the same size as a float.

Application-specific code to handle the case where you want the EV3 to send a float and a Boolean should be simple to add.  A string?  Not sure of the worth of that, really, from the EV3 to the Raspberry Pi.

Determine differences in lists in Python

Sometimes you just want to take two lists, like a previous one and a new one, and get back information about what is new, what is old, and what is the same.  Here is one way to do it.

def compareLists(old, new):
  oldset = set(old)
  addlist = [x for x in new if x not in oldset]
  newset = set(new)
  dellist = [x for x in old if x not in newset]
  samelist = [x for x in old if x in newset]
  return (addlist, dellist, samelist)
old = ['One', 'Two', 'Three', 'Four']
new = ['One', 'Two', 'Five']

(addlist, dellist, samelist) = compareLists(old, new)
print("addlist=", addlist)
print("dellist=", dellist)
print("samelist=", samelist)

What you get is:

('addlist=', ['Five'])
('dellist=', ['Three', 'Four'])
('samelist=', ['One', 'Two'])

With this, you can take “instantiation” action with the “add” list, take “removal” action with the “del” list, and simply update data with the “same” list.

Starveless Priority Queue in Python

The Python priority queue treats priorities in absolute terms.  The most important element will always be popped.  The least important element may never be popped.

Sometimes when a message based data path is serialized to something (like a serial port based piece of hardware), all messages have some importance.  Unlike the priority queue, we would rather not use such absolute terms.  We would rather say that most of the time, the important messages are sent.  However, we guarantee that some of the time, the less important messages are sent.

The language is more gray and the queue needs to be a bit more…  “cooperative”.  (We could say “round robin”.)

Let’s implement our own.  But first, a quick back story.  I have some knowledge of Motorola’s (at the time) Time Processor Unit (TPU) on some of their embedded processors.  The TPU had an event-driven model where certain sections of microcode would execute based on priority, but lower priority microcode would still operate periodically.  Let’s base our design on this philosophy.  (See section 3 for more information on the TPU scheduler.)

Python has a pretty good queue framework that you derive your queue from, implementing several methods.  If you ever look at the priority queue code in Python, it is pretty simplistic as it relies on heapq.  There is much work done in the Queue class, including thread synchronization.

Here is our proposed round robin priority queue.

#! /usr/bin/python

import Queue

class RRPriorityQueue(Queue.Queue):
  '''Variant of Queue that retrieves open entries in priority order (lowest first)
  but allows highest priorities to be skipped.

  '''

  def _init(self, priorityOrder):
    # Queue.Queue.__init__ passes its 'maxsize' argument straight through to
    # _init, so the priority order list is supplied as the constructor argument.
    if priorityOrder is None:
      raise TypeError

    self._queue = []

    self._priorityOrder = priorityOrder
    # start past the end so the first _get wraps the index back to 0
    self._orderIdx = len(self._priorityOrder) + 1

  def _qsize(self, len=len):
    return len(self._queue)

  def _put(self, item):
    self._queue.append(item)

  def _get(self):
    if (len(self._queue) == 0):
      raise IndexError

    retVal = None

    # change the order index for the priority order list
    self._orderIdx = self._orderIdx + 1
    if self._orderIdx >= len(self._priorityOrder):
      self._orderIdx = 0

    # the priority we are interested in and the highest value
    # in case we don't find one at our priority
    filterPriority = self._priorityOrder[self._orderIdx]
    highest = self._queue[0]

    # look for one of our priority but saving highest priority (lowest number)
    for item in self._queue:
      priority = item.priority
      if priority == filterPriority:
        # we found our item
        retVal = item
        break
      if highest.priority > item.priority:
        highest = item

    # if we didn't find our item, the return the highest
    if retVal is None:
      retVal = highest

    self._queue.remove(retVal)
    return retVal

The _qsize and _put methods are basic.  The _init method requires a priority order.  This specifies the priority guaranteed to pop from the queue for the corresponding _get call.  The _orderIdx is used to know which step of the priority order the algorithm is on.  It increments or resets on every _get call.

The _get call is where the magic happens.  The first step is to increment or reset the _orderIdx value.  Then we are going to get our filterPriority from the _priorityOrder list based on the _orderIdx.  Then we iterate through the queue.  If we find an entity that matches our priority, return it.

However, what happens if we don’t find a message with our priority?  That is ok, we simply return the highest priority item first found in the queue.  The highest variable is a reference to the first item in the queue with the highest priority and is returned if we don’t find a match.

The only thing we require from the entities in the queue is that they implement a priority member (or method) that returns the priority number.  This deviates from the priority queue in that the queue wasn’t concerned about the priority number, just that an entity in the queue could compare to others.  In our case, we need to know the priority to filter the entities the way we want to.

Like the priority queue, the lower the priority number, the more important it is.

We can create our round robin queue like this:

self.q = RRPriorityQueue([1,2,1,3,1,2,1])

In this case, priority 1 messages are guaranteed to pop 4 out of 7 times, priority 2 messages 2 out of 7, and priority 3 messages 1 out of 7.

You can make whatever pattern with whatever priorities you want.  Just understand that if the priority of an entity is not in the priority order, it can be starved.  It will only be popped when it is the most important message and a message of the current filter priority isn’t found.
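To see the behavior, here is a quick demo.  Msg is a stand-in for whatever entities you queue; as noted above, the only requirement is a priority attribute.  It assumes the RRPriorityQueue class above is in scope.

import collections

# A stand-in entity; anything with a 'priority' attribute works.
Msg = collections.namedtuple('Msg', ['priority', 'payload'])

q = RRPriorityQueue([1, 2, 1, 3, 1, 2, 1])
for p in (3, 1, 2, 1, 3, 2, 1, 1):
    q.put(Msg(p, 'job'))

# Priority 1 pops most often, but priorities 2 and 3 still get their turns.
while not q.empty():
    print(q.get().priority)   # prints 1 2 1 3 1 2 1 3, one per line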


Two Important SQLite and Python Lessons

When using SQLite3 and Python (2.x), there are two important lessons that are not obvious (at least they were not to me).

1. Dictionaries and TimeStamps

Ideally, I would like to do two things.  First, access data from a dictionary and not a list.  It is far more intuitive to access by column name (or query name substitution) than by list index.  Second, the datetime values should be correctly coerced even though SQLite has no implicit timestamp type.  I believe these two simple requirements are expected by most people, especially those familiar with Microsoft SQL and ADO.NET.

Good news, Python supports this!  However, there are some switches to set so to speak.

After importing sqlite3, start with the following connect statement:

conn = sqlite3.connect(dbname, detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)

The PARSE_DECLTYPES flag empowers Python to coerce values based on the type each column was declared with in the table schema.  For dictionary-style access by column name, additionally set conn.row_factory = sqlite3.Row.

The PARSE_COLNAMES flag lets the query itself dictate the conversion through column names of the form “name [type]”.  For example, if you do this:

cur.execute('select mytime from sometable')

You will get mytime as a string.  However, if you do this:

cur.execute('select mytime as "[timestamp]" from sometable')

You will get mytime as a datetime type in python.  This is very useful.

One step further; let’s say you want a timestamp but substitute the column name in the query.  Do this:

cur.execute('select mytime as "mytime [timestamp]" from sometable')

Not only will the data be returned as a datetime object, the dictionary will contain the column name substitution provided in the query.  Beware: if you don’t do this on aggregate functions, the Python sqlite library will attempt to add to the dictionary with an empty string key.  (Not sure why this is, but beware.)

The adapters and converters live in the file dbapi2.py in your Python library installation directory for sqlite3.

Refer to the documentation here.
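Tying it together, here is a minimal sketch; the table and column names are made up for illustration.

#! /usr/bin/python

import sqlite3
import datetime

conn = sqlite3.connect(':memory:',
                       detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES)
conn.row_factory = sqlite3.Row   # rows accessible by column name

cur = conn.cursor()
cur.execute('create table sometable (mytime timestamp)')
cur.execute('insert into sometable values (?)', (datetime.datetime.now(),))

cur.execute('select mytime as "mytime [timestamp]" from sometable')
row = cur.fetchone()
print type(row['mytime'])   # <type 'datetime.datetime'>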

2. Execute and Tuples

This is really a lesson on declaring one-element tuples.

The execute method on a cursor is great because it will safely convert strings for use in the database (reducing the possibility of SQL injection).  It takes a query string and a substitution tuple.

However, I got caught again by the example below, as will many who follow me.

This works:

cur.execute('insert into mytable values(?, ?, ?, ?)', (1, 2, 3, 4))

This doesn’t work and throws an exception:

cur.execute('insert into mytable values(?)', (1))

This works:

cur.execute('insert into mytable values(?)', (1,))

The comma after the one declares a tuple of one element.  Argh!!!

I hope this helps.


Simple Python JSON server based on jsonrpclib

I needed a simple Python JSON server executing in its own thread but easily extensible.  Let’s get right to the base class code (or super class for those who build down).

#! /usr/bin/python

import threading

import jsonrpclib
import jsonrpclib.SimpleJSONRPCServer

class JsonServerThread (threading.Thread):
  def __init__(self, host, port):
    threading.Thread.__init__(self)
    self.daemon = True
    self.stopServer = False
    self.host = host
    self.port = port
    self.server = None

  def _ping(self):
    pass

  def stop(self):
    self.stopServer = True
    jsonrpclib.Server("http://" + str(self.host) + ":" + str(self.port))._ping()
    self.join()

  def run(self):
    self.server = jsonrpclib.SimpleJSONRPCServer.SimpleJSONRPCServer((self.host, self.port))
    self.server.logRequests = False
    self.server.register_function(self._ping)

    self.addMethods()

    while not self.stopServer:
      self.server.handle_request()
    self.server = None

  # derived class definitions

  def addMethods(self):
    pass

So the idea is simple.  Derive a new class from this and implement the addMethods method and the methods themselves.

#! /usr/bin/python

import jsonServer

class JsonInterface(jsonServer.JsonServerThread):
  def __init__(self, host, port):
    jsonServer.JsonServerThread.__init__(self, host, port)

  def addMethods(self):
    self.server.register_function(self.doOneThing)
    self.server.register_function(self.doAnother)

  def doOneThing(self, obj):
    return obj

  def doAnother(self):
    return "why am I doing something else?"

In the derived class, implement the methods and register them in addMethods.  That is all.  Now we can worry simply about implementation.  Be aware of any threading synchronization your methods need.  jsonrpclib takes care of exception handling as well and converts exceptions into JSON errors.

One last item of note.  In the base class, the stop method is interesting.  Since handle_request() is a blocking call in the thread, we need to set the “stop” flag and make a simple request.  The _ping method does this for us.  Then we join on the thread waiting for it to end gracefully.

The jsonrpclib is a very useful library and well done.  By the way, this example is for Python 2.7.  On Ubuntu 14.04, you can install this using “apt-get install python-jsonrpclib”.
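Putting it together, a hypothetical end-to-end session might look like this sketch (the host, port, and half-second settle time are made up):

#! /usr/bin/python

import time

import jsonrpclib

# assumes the JsonInterface class from above is in scope
server = JsonInterface('localhost', 8080)
server.start()
time.sleep(0.5)   # give the server thread a moment to bind

client = jsonrpclib.Server('http://localhost:8080')
print client.doAnother()            # "why am I doing something else?"
print client.doOneThing({'a': 1})   # echoes the object back

server.stop()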

Pulling Documents for Searching

In a prior post, I noted how to set up elasticsearch with apache2.  In this post, we will look at how to cache a set of files on your web server from a Windows share and index them.

To do this, we need to do the following steps:

  1. Initialize the index the first time.
  2. Mount a share.
  3. Rsync the data between the machines.
  4. Get the files that exist on the SMB share.
  5. Read what has been indexed.
  6. Diff the lists from steps 4 and 5.
  7. Index the new files on the share.
  8. Delete (the index and file) the files that no longer exist on the share.

By the way, this was all done in Python 2.7 (as opposed to Python 3.x in some of my other posts).

Initialize the Index

The following script will “reset” the index and create it anew.

#! /usr/bin/python

import httplib 
import binascii
import os
import glob
import socket

import hostinfo

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

socket.setdefaulttimeout(15)
conn = httplib.HTTPConnection(hostinfo.HOST)
connInitialize(conn)
connRefresh(conn)

Mount an SMB Share

On Ubuntu, you will need to install cifs-utils:  “sudo apt-get install cifs-utils”.

Once done, you can mount it by using the following command.  Choose your own mount point obviously and be prepared with your domain password.

sudo mount -t cifs //10.0.4.240/General /mnt/cifs -ousername=maksym.shyte,ro

Rsync Between Server and SMB Share

The easiest way to do this is to create a list of the files that you want to search.  Then use that list to rsync with.  This leaves you with an efficient copy and a text file list of the files on the SMB share.

#!/bin/bash

# capture where we started so the file list has a stable home
currentPath=$(pwd)

function addToList {
  find "$1" -name \*.pdf -o -name \*.doc -o -name \*.docx -o -name \*.xls -o -name \*.xlsx -o -name \*.ppt -o -name \*.pptx -o -name \*.txt | grep -v ".AppleDouble" | grep -v '~\$' >> "$2"
}

cd /mnt/cifs

addToList . $currentPath/rsynclist.txt
#addToList ./Some\ Directory $currentPath/rsynclist.txt

rsync -av --files-from="$currentPath/rsynclist.txt" /mnt/cifs /var/www/search/data

Read the Index

To read the index, the following script will pull the indexes out and write them to a file.  This will include the name of the document and the key.  You will need to take the step of resolving the path from the previous file list with this index, as they are related by the source and destination directory passed to rsync.

#! /usr/bin/python

import httplib 
import json
import sys
import os
import codecs

import hostinfo

argc = len(sys.argv)
if argc != 2:
    print os.path.basename(sys.argv[0]), "<indexFileName>"
    sys.exit(-1)

indexFileName = sys.argv[1]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

conn = httplib.HTTPConnection(hostinfo.HOST)
data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))

print data
total = data["hits"]["total"]

#scroll session id, used to request the next batch of data
scrollId = data["_scroll_id"]
counter = 0

data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))

#print data

f = codecs.open(indexFileName, "w", "utf8")

while len(data["hits"]["hits"]) > 0:
    for item in data["hits"]["hits"]: 
        f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
        f.flush()

    counter = counter + len(data["hits"]["hits"])
    print "Reading Index:", counter, "of", total

    scrollId = data["_scroll_id"]
    resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
    #print resp
    data = json.loads(resp)

f.close()

Diff the File List and the Index List

Next we need to diff the two.  We want to know the files we need to index and the files we want to delete.  The following script does that (presuming that the lists have been modified to point at the same directory – i.e. /var/www/search/data).  Out comes an “add” text file and a “delete” text file.

#! /usr/bin/python

import sys
import os

argc = len(sys.argv)
if argc != 5:
    print os.path.basename(sys.argv[0]), "<fileList> <indexList> <addFile> <deleteFile>"
    sys.exit(-1)

def createMap(filename):
    ret = {}
    f = open(filename)
    lines = f.readlines()
    f.close()
    for line in lines:
        line = line.replace('\n','')
        split = line.split(',', 1)
        key = split[0]
        ret[key] = line
    return ret

fileMap = createMap(sys.argv[1])
indexMap = createMap(sys.argv[2])

# if the entry is in fileMap but not indexMap, it goes into the add file
# if the entry is in indexMap but not fileMap, it goes into the delete file
add = {}

for key in fileMap:
    if indexMap.has_key(key):
        del indexMap[key]
    else:
        add[key] = fileMap[key]

f = open(sys.argv[3], "w")
for key in add:
    f.write(add[key] + '\n');
f.close()

f = open(sys.argv[4], "w")
for key in indexMap:
    f.write(indexMap[key] + '\n');
f.close()

Add to the Index

Next we iterate through all the files in the “add” list.

#! /usr/bin/python

import httplib 
import binascii
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<addList> <rootFsDir>"
    sys.exit(-1)

rootFsDir = sys.argv[2] 

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connAddFile(conn, filename, rootFsDir):
    title = os.path.basename(filename)
    location = filename[len(rootFsDir):]

    with open(filename, 'rb') as f:
        data = f.read()

    if len(data) > hostinfo.LARGEST_BASE64_ATTACHMENT:
        print 'Not indexing because the file is too large', len(data)
    else:
        print 'Indexing file size', len(data)
        base64Data = binascii.b2a_base64(data)[:-1]
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', hostinfo.INDEX + '/attachment/', attachment)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)
#connInitialize(conn)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

rootFsDir = rootFsDir + '/'

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    filename = rootFsDir + line
    print idx, filename
    try:
        connAddFile(conn, filename, rootFsDir)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

connRefresh(conn)

Delete the Files Not Needed

Finally, we delete the index and physical files no longer needed.

#! /usr/bin/python

import httplib 
import binascii
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<deleteList> <rootFsDir>"
    sys.exit(-1)

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connDeleteFile(conn, index):
    print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    split = line.split(',')
    filename = split[0]
    index = split[1]
    print "Delete:", idx, filename, index
    try:
        connDeleteFile(conn, index)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

    try:
        os.remove(sys.argv[2] + '/' + filename)
    except:
        pass

connRefresh(conn)

There it is.  I have all these steps, including resolving the path between the file list and the index list.  One further thing to note is that the hostinfo file referenced by the Python scripts looks like this:

#! /usr/bin/python

HOST = '127.0.0.1:9200'
SITE = ''

INDEX = SITE + '/basic'

LARGEST_BASE64_ATTACHMENT = 50000000


A Search Engine for Office Documents

Have you ever worked at a place where there was a mass of files and documents on a share, and even the old timers forget where important documents are?

Search by file name stinks, and SharePoint has been another excuse to dump stuff that gets lost.

So I decided to figure out an easy way to get a content search engine up, looking through the files on a share.  I found a solution.  It isn’t pristine, for these reasons.

  1. Browsers can’t link to files on a share for obvious security reasons.
  2. For reason one, the decision was made to copy searchable documents onto the web server.  This is time consuming to transfer and duplicates information but the documents are served successfully.
  3. For reason two, it would be possible to add a server plugin that reads and delivers a file on a share.  Just haven’t done that yet.

So we will start with what we have and consider changing it later.

The basis for this will be Ubuntu 12.04 LTS.  Why?  Because I have such a machine handy and it is 9 years old.  This will be based on all the wonderful work of elasticsearch and Lucene.

So, here are the steps.  Remember, this is a bit hacky.

  1. Install apache2.  (In the case of Ubuntu, it is “sudo apt-get install apache2”.)
  2. Install openjdk-7-jre-headless.  (“sudo apt-get install openjdk-7-jre-headless”).
  3. Download elasticsearch (from elasticsearch.org – the .com site takes you to pay-for products).  Because I am using Ubuntu, I thought I would use the apt repository.
  4. Follow the steps to start elasticsearch – in my case, listed on the web site.  Be advised that elasticsearch binds to all interfaces to a free port between 9200 and 9300.  We will assume that the port is 9200 as it is in my case.  However, it probably should only bind to a port on localhost, or at least the security should be evaluated to make sure it complies with what you need.
  5. We will need two plugins.  You can install them from your elasticsearch/bin location.  In my case it was /usr/share/elasticsearch/bin/plugin.
    bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.0.0
    bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/1.0.1-2.0.0

    Restart elasticsearch. (“sudo service elasticsearch restart”).  You will also need to verify the versions of these plugins.

  6. For apache2, make sure to enable the proxy, proxy_http, and ssl modules.  On Ubuntu, the “a2enmod” is an easy utility to do this.
  7. In my Apache setup, I added a new file called “elasticsearch” inside /etc/apache2/conf.d.  (Note that 13.10 doesn’t use a conf.d directory.  It could be added to the bottom of apache2.conf, although I am sure there is a more “pristine” location.)  The contents are below.
    <IfModule proxy_module>
    <IfModule proxy_http_module>
    
    <Proxy *>
    <Limit GET > 
        allow from all 
    </Limit>
    
    <Limit POST PUT DELETE>
        order deny,allow 
        deny from all 
    </Limit>
    </Proxy>
    
    ProxyPreserveHost On
    ProxyRequests Off
    LogLevel debug
    ProxyPass /es http://localhost:9200/
    ProxyPassReverse /es http://localhost:9200/
    
    </IfModule>
    </IfModule>

    The application depends on the /es directory under web root. This can be changed along with the web pages that use it.

  8. Restart apache2.  (“sudo service apache2 restart”)
  9. Download the HTML and Javascript for the search pages from here:  Search HTML and Javascript.  It uses jQuery, jQueryUI, and AJAX to perform the searching and suggestions.  Unzip and place in the web directory where you want it.  For me, I wanted a search subdirectory, so I placed mine in /var/www/search.
  10. So, the last thing is to show how to index the files.  I am a fan of Python, so this is Python code making HTTP requests to elasticsearch adding the information.  The script below deletes the index, recreates it, and starts adding content to it – from files in a directory.
    #! /usr/bin/python
    
    import httplib 
    import binascii
    import os
    
    HOST = 'localhost:9200'
    INDEX = '/basic'
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    def connAddFile(conn, filename, rootFsDir, httpPrefix):
        with open(filename, 'rb') as f:
            base64Data = binascii.b2a_base64(f.read())[:-1]
    
        title = os.path.basename(filename)
        location = httpPrefix + filename[len(rootFsDir):]
    
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', INDEX + '/attachment/', attachment)
    
    conn = httplib.HTTPConnection(HOST)
    
    print connRequest(conn, 'DELETE', INDEX)
    
    print connRequest(conn, 'PUT', INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    
    print connRequest(conn, 'PUT', INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )
    
    # Add files here repeatedly
    rootFsDir = '/var/www/search/data/'
    searchDir = ''          # This is for recursion through the directories
    httpPrefix = 'data/'
    # Make this recursive some day
    for file in os.listdir(rootFsDir + searchDir):
        connAddFile(conn, rootFsDir + searchDir + file, rootFsDir, httpPrefix)
    
    print connRequest(conn, 'POST', '/_refresh')
  11. If you decide to get more creative and add only new files and delete the old ones, we need to understand how to get the list of existing files that are indexed.  Then you just have to correlate the current state of the files on disk with the index list.  This script gets the indexes and the files associated with them.
    #! /usr/bin/python
    
    import httplib 
    import json
    import sys
    import os
    
    import hostinfo
    
    argc = len(sys.argv)
    if argc != 2:
        print os.path.basename(sys.argv[0]), "<indexFileName>"
        sys.exit(-1)
    
    indexFileName = sys.argv[1]
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    conn = httplib.HTTPConnection(hostinfo.HOST)
    data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))
    
    total = data["hits"]["total"]
    
    #scroll session id, used to request the next batch of data
    scrollId = data["_scroll_id"]
    counter = 0
    
    data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
    
    f = open(indexFileName, 'w')
    
    while len(data["hits"]["hits"]) > 0:
        for item in data["hits"]["hits"]:
            f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
            f.flush()
    
        counter = counter + len(data["hits"]["hits"])
        print "Reading Index:", counter, "of", total
    
        scrollId = data["_scroll_id"]
        resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
        #print resp
        data = json.loads(resp)
    
    f.close()
  12. To delete files, the python snippet looks like this where index is the id for the file we want indexing deleted for.
    def connDeleteFile(conn, index):
        print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

So there we have it.  All we have to do is figure out where we are getting our data from and copy it to the “data” directory.  One particular way I have done this is with rsync across an SMB share.

This is by no means meant to be a lesson on elasticsearch.  There can be some improvement here.

However, this is a quick way to set up searching documents for information you never knew existed.  (Side note:  I have had 10 ms search times across 2500 documents.)


Grepping any type of file encoding in Python

Let’s take handling any encoding of files one step further.

We need to look for specific text in files in a directory, regardless of encoding.  Here is one way in Python.

#! /usr/bin/python
import sys
import os.path
import os
import re
import fnmatch

def DecodeBytes(byteArray, codecs=['utf-8', 'utf-16']):
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except:
      pass

def ReadLinesFromFile(filename):
  file = open(filename, "rb")
  rawbytes = file.read()
  file.close()
  content = DecodeBytes(rawbytes)
  if content is not None:
    return content.split(os.linesep)

# this came from http://stackoverflow.com/questions/1863236/grep-r-in-python
# with a substitution of ReadLinesFromFile and a file name match filter
def RecursiveGrep(pattern, dir, match):
  r = re.compile(pattern)
  for parent, dnames, fnames in os.walk(dir):
    fnames = fnmatch.filter(fnames, match)
    for fname in fnames:
      filename = os.path.join(parent, fname)
      if os.path.isfile(filename):
        lines = ReadLinesFromFile(filename)
        if lines is not None:
          idx = 0
          for line in lines:
            if r.search(line):
              yield filename + "|" + str(idx) + "|" + line.strip()
            idx += 1   # count every line, not just matches

lines = RecursiveGrep("needle", r"\yourpath", "*.cs")

This will recurse all subdirectories, looking in all .cs files to find “needle”, returning the data in this format (pipe separated):

full file path|line number|line content
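Note that RecursiveGrep is a generator, so nothing is searched until you iterate the results:

for hit in RecursiveGrep("needle", r"\yourpath", "*.cs"):
  print(hit)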

Very useful on Windows with multilingual files.

Getting lines of a file of any encoding type in Python

I really don’t want to know the encoding.  I only want the data.  In other words, I don’t want to think.  I don’t want to open Notepad++ and convert between types of encoding.

My old standby doesn’t work on various file encodings that aren’t ANSI (ascii, cp1252, whatever):

f = open("poo.txt", "r")
lines = f.readlines()
f.close()
for line in lines:
  dosomething(line)

I have had enough.  (I am also venturing into Python 3, as I have been on Python 2 forever, but that is a different story.)

The following code will read a file of any of several encodings and split it into lines:

import os

def DecodeBytes(byteArray, codecs=['utf-8', 'utf-16']):
  # try each codec in turn; returns None if none of them can decode the bytes
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except:
      pass

def ReadLinesFromFile(filename):
  # read raw bytes so we can attempt the decode ourselves
  file = open(filename, "rb")
  rawbytes = file.read()
  file.close()
  content = DecodeBytes(rawbytes)
  if content is not None:
    return content.split(os.linesep)

lines = ReadLinesFromFile("poo.txt")
for line in lines:
  dosomething(line)

If you need to add encodings, simply add them to the codecs default assignment (or make it more elegant as you deem fit).
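For example, to also try UTF-8 with a BOM and Windows cp1252, pass a broader list at the call site.  The codec names are standard Python codecs; note that cp1252 accepts almost any byte sequence, so keep it last.

content = DecodeBytes(rawbytes, codecs=['utf-8-sig', 'utf-8', 'utf-16', 'cp1252'])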