In a prior post, I described how to set up Elasticsearch with Apache2. In this post, we will look at how to cache a set of files from a Windows share onto your web server and index them.
To do this, we need to perform the following steps:
- Initialize the index the first time.
- Mount a share.
- Rsync the data between the machines.
- Get the files that exist on the SMB share.
- Read what has been indexed.
- Diff the file list and the index list.
- Index the new files on the share.
- Delete the files (both the index entry and the cached file) that no longer exist on the share.
Note that these scripts are written in Python 2.7 (as opposed to the Python 3.x used in some of my other posts).
Initialize the Index
The following script “resets” the index: it deletes any existing index and creates it fresh. Note that the attachment mapping relies on the Elasticsearch attachment-type (mapper-attachments) plugin being installed.
#! /usr/bin/python
import httplib
import socket

import hostinfo

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    # Drop any existing index, then recreate it with one shard and no replicas.
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX,
        '{ "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}')
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s')
    # Map the attachment type so the file contents are indexed with term vectors.
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping',
        '{ "attachment" : { "properties" : { "file" : { "type" : "attachment", '
        '"fields" : { "title" : { "store" : "yes" }, '
        '"file" : { "term_vector":"with_positions_offsets", "store":"yes" } } } } }}')

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

socket.setdefaulttimeout(15)
conn = httplib.HTTPConnection(hostinfo.HOST)
connInitialize(conn)
connRefresh(conn)
Mount an SMB Share
On Ubuntu, you will need to install cifs-utils: “sudo apt-get install cifs-utils”.
Once that is done, you can mount the share with the following command. Choose your own mount point, obviously, and be prepared to enter your domain password.
sudo mount -t cifs //10.0.4.240/General /mnt/cifs -ousername=maksym.shyte,ro
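If you want the share mounted automatically at boot, a minimal /etc/fstab sketch is below. The credentials file path is an assumption I am making here; it should contain username= and password= lines and be readable only by root.
# /etc/fstab entry -- mount the share read-only at boot
//10.0.4.240/General  /mnt/cifs  cifs  ro,credentials=/root/.smbcredentials  0  0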
Rsync Between Server and SMB Share
The easiest way to do this is to build a list of the files you want searchable and then hand that list to rsync. This copies the files efficiently and leaves you with a text file listing the files on the SMB share.
function addToList {
    # Append the document types we care about to the list file ($2),
    # skipping .AppleDouble metadata and temporary/backup files.
    find "$1" -name \*.pdf -o -name \*.doc -o -name \*.docx \
        -o -name \*.xls -o -name \*.xlsx -o -name \*.ppt \
        -o -name \*.pptx -o -name \*.txt \
        | grep -v ".AppleDouble" | grep -v "~$" >> "$2"
}

currentPath=$(pwd)                  # where the list file is written
: > "$currentPath/rsynclist.txt"    # start each run with a fresh list

cd /mnt/cifs
addToList . $currentPath/rsynclist.txt
#addToList ./Some\ Directory $currentPath/rsynclist.txt

rsync -av --files-from="$currentPath/rsynclist.txt" /mnt/cifs /var/www/search/data
Read the Index
To read the index, the following script pulls the entries out and writes them to a file, one line per document with its location and key. You will then need to resolve the paths in the file list from the previous step against this index list, as the two are related by the source and destination directories passed to rsync.
#! /usr/bin/python
import httplib
import json
import sys
import os
import codecs

import hostinfo

argc = len(sys.argv)
if argc != 2:
    print os.path.basename(sys.argv[0]), "<index list file>"
    sys.exit(-1)

indexFileName = sys.argv[1]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

conn = httplib.HTTPConnection(hostinfo.HOST)

# Open a scan/scroll search that returns only the stored location field.
data = json.loads(connRequest(conn, 'GET',
    hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10',
    '{"query":{"match_all" :{}}, "fields":["location"]}'))
print data
total = data["hits"]["total"]

# Scroll session id, used to request the next batch of data.
scrollId = data["_scroll_id"]
counter = 0

data = json.loads(connRequest(conn, 'GET',
    hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
#print data

f = codecs.open(indexFileName, "w", "utf8")
while len(data["hits"]["hits"]) > 0:
    # Write one "location,id" line per indexed document.
    for item in data["hits"]["hits"]:
        f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
    f.flush()
    counter = counter + len(data["hits"]["hits"])
    print "Reading Index:", counter, "of", total
    scrollId = data["_scroll_id"]
    resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
    #print resp
    data = json.loads(resp)
f.close()
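Exactly what path rewriting is needed depends on the source and destination directories you passed to rsync. As one hypothetical sketch: if the only mismatch is the “./” prefix that find puts on the rsync list entries, a small helper like this could rewrite the list so its keys compare equal to the locations stored in the index:
#! /usr/bin/python
# Hypothetical helper: strip the leading "./" that find(1) adds to the
# rsync list so its entries match the "location" field in the index list.
# Usage: normalize.py <list in> <list out>
import sys

f = open(sys.argv[1])
lines = f.readlines()
f.close()

f = open(sys.argv[2], 'w')
for line in lines:
    line = line.replace('\n', '')
    if line.startswith('./'):
        line = line[2:]
    f.write(line + '\n')
f.close()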
Diff the File List and the Index List
Next we need to diff the two lists so we know which files to index and which to delete. The following script does that (presuming the lists have been modified to point at the same directory, i.e. /var/www/search/data). The output is an “add” text file and a “delete” text file. For example, if a.pdf appears only in the file list and c.txt only in the index list, a.pdf lands in the add file and c.txt (with its key) in the delete file.
#! /usr/bin/python
import sys
import os

argc = len(sys.argv)
if argc != 5:
    print os.path.basename(sys.argv[0]), "<file list> <index list> <add file> <delete file>"
    sys.exit(-1)

def createMap(filename):
    # Map the path (the first comma-separated field) to the full line.
    ret = {}
    f = open(filename)
    lines = f.readlines()
    f.close()
    for line in lines:
        line = line.replace('\n','')
        split = line.split(',', 1)
        key = split[0]
        ret[key] = line
    return ret

fileMap = createMap(sys.argv[1])
indexMap = createMap(sys.argv[2])

# If an entry is in fileMap but not indexMap, it goes into the add file.
# Whatever remains in indexMap afterwards goes into the delete file.
add = {}
for key in fileMap:
    if indexMap.has_key(key):
        del indexMap[key]
    else:
        add[key] = fileMap[key]

f = open(sys.argv[3], "w")
for key in add:
    f.write(add[key] + '\n')
f.close()

f = open(sys.argv[4], "w")
for key in indexMap:
    f.write(indexMap[key] + '\n')
f.close()
Add to the Index
Next we iterate through the files in the “add” list, base64-encode each one, and POST it to the index as an attachment.
#! /usr/bin/python
import httplib
import binascii
import json
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<add file> <root data dir>"
    sys.exit(-1)

rootFsDir = sys.argv[2]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

# Kept from the initialize script; only needed to rebuild the index from scratch.
def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX,
        '{ "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}')
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s')
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping',
        '{ "attachment" : { "properties" : { "file" : { "type" : "attachment", '
        '"fields" : { "title" : { "store" : "yes" }, '
        '"file" : { "term_vector":"with_positions_offsets", "store":"yes" } } } } }}')

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connAddFile(conn, filename, rootFsDir):
    title = os.path.basename(filename)
    location = filename[len(rootFsDir):]
    with open(filename, 'rb') as f:
        data = f.read()
    if len(data) > hostinfo.LARGEST_BASE64_ATTACHMENT:
        print 'Not indexing because the file is too large', len(data)
    else:
        print 'Indexing file size', len(data)
        base64Data = binascii.b2a_base64(data)[:-1]
        # Build the body with json.dumps so quotes in titles or paths are escaped.
        attachment = json.dumps({ "file": base64Data, "title": title, "location": location })
        print connRequest(conn, 'POST', hostinfo.INDEX + '/attachment/', attachment)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)
#connInitialize(conn)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0
rootFsDir = rootFsDir + '/'
for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    filename = rootFsDir + line
    print idx, filename
    try:
        connAddFile(conn, filename, rootFsDir)
    except Exception, e:
        # Report the error (e.g. a timeout) and reconnect before moving on.
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)

connRefresh(conn)
Delete the Files Not Needed
Finally, we delete the index entries and the cached files that are no longer needed.
#! /usr/bin/python
import httplib
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<delete file> <root data dir>"
    sys.exit(-1)

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connDeleteFile(conn, index):
    print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0
for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    split = line.split(',')
    filename = split[0]
    index = split[1]
    print "Delete:", idx, filename, index
    try:
        connDeleteFile(conn, index)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)
    # Remove the cached copy as well; ignore files that are already gone.
    try:
        os.remove(sys.argv[2] + '/' + filename)
    except:
        pass

connRefresh(conn)
There it is. I have all these steps, including resolving the paths between the file list and the index list. One further thing to note: the hostinfo module referenced by the Python scripts looks like this:
#! /usr/bin/python
HOST = '127.0.0.1:9200'
SITE = ''
INDEX = SITE + '/basic'
LARGEST_BASE64_ATTACHMENT = 50000000
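To tie the cycle together, here is a minimal driver sketch. All of the script names (rsynclist.sh, normalize.py, read_index.py, diff_lists.py, add_files.py, delete_files.py) are names I am choosing here for the steps above; substitute whatever you saved them as. It assumes the share is already mounted and the index has been initialized once.
#! /usr/bin/python
# Hypothetical driver running one sync-and-reindex cycle.
import subprocess

DATA = '/var/www/search/data'

def run(args):
    print ' '.join(args)
    subprocess.check_call(args)    # stop the cycle if a step fails

run(['./rsynclist.sh'])                                   # sync the share and build the file list
run(['./normalize.py', 'rsynclist.txt', 'filelist.txt'])  # align paths with the index locations
run(['./read_index.py', 'indexlist.txt'])                 # dump what is already indexed
run(['./diff_lists.py', 'filelist.txt', 'indexlist.txt', 'add.txt', 'delete.txt'])
run(['./add_files.py', 'add.txt', DATA])                  # index the new files
run(['./delete_files.py', 'delete.txt', DATA])            # drop stale entries and cached copies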