Automating Recalculation of Excel Files in Python And Databricks

By
Sergey Nikonov
December 29, 2020
Files in Python And Databricks

On a weekly basis, we exchange large amounts of data with our clients. This can for instance be information on different SKUs, such as how long they have been in stock, how they are selling, and so on.  

For our customers, it may be easier to receive this data in Excel, as excel is a tool a lot of people are familiar with. Most of us already have it installed on our computer, which means no additional solutions or licenses have to be bought. A benefit Excel has over CSV is that it has a user interface and you do not need to spend a lot time developing. On top of this, we can also use different styles of formatting, colors and visualizations in Excel, so that it looks better and makes the data easier to understand for our clients.  

Taking into account that we want to build some kind of a product, we actually need to automate the process of Excel file creation, calculation of the formulas and saving the files on the client side. As for some customers there are lots of different departmens receiving these types of files, we need to automate this process of using Excel and VBA (the programming language in Excel), because doing it manually would take too long. The automation would entail the process of Excel file creation, calculation of the formulas and saving the files on the client side.

Challenges

At aioneers we use Databricks for the automation of data science processes, and therefore we wanted to do the automation of Excel file creation on Databricks.  

Databricks, however, has some additional challenges that are specific to the platform. The first thing that was immediately evident was that it is on Linux. This means that we don‘t have any version of Excel on the platform, because Excel does not run on Linux. However, there is a possibility to use open source software that we can use to calculate the Excel formulas in the Excel sheets. As we worked on this project, we came across some challenges that we did not expect at the start. The code we’re sharing in this article is only relevant for the working solution that we found. The code that will be written here will only be relevant to the solution that works.

How to make it work

At first we followed the instructions to install Open Office on a Databricks server. However, we soon figured out that OpenOffice uses Python 2 instead of Python 3. Python 3 is the current standard, and we did not want to use an old and unsupported version of Python. This made us switch to LibreOffice.

All the code listed below is written and executed in Databricks notebooks.

Installation of LibreOffice:  

%sh

sudo apt-get install libreoffice -y

In order to start LibreOffice from Python we needed to install additional package python3-uno:

%sh

sudo apt-get --yes --force-yes install  python3-uno

After that we checked what packages we have installed:

%sh

dpkg -l python3-uno libreoffice libreoffice-calc o

This command confirmed that we have all the needed packages installed. The output of the command:

||/ Name             Version                   Architecture Description

+++-================-=========================-============-========================================

ii  libreoffice      1:6.0.7-0ubuntu0.18.04.10 amd64        office productivity suite (metapackage)

ii  libreoffice-calc 1:6.0.7-0ubuntu0.18.04.10 amd64        office productivity suite -- spreadsheet

ii  python3-uno      1:6.0.7-0ubuntu0.18.04.10 amd64        Python-UNO bridge

According to the manual, after installing python3-uno packages, the uno package should be available in Python. But executing import uno tells us that the library is not found.  

Why did it happen? It appears that there are two different Python installations on the system. One is the one that is used by Databricks (/databricks/python/bin/python) and the other one, installed by the system by default (/usr/bin/python3). The installation of python3-uno package only installs the uno package for /usr/bin/python3, and not /databricks/python/bin/python that is actually used on Databricks.

After thinking about this for a while, we decided to create a Python script file and execute it with the right Python (/usr/bin/python3). The Idea of the following script is to open the Excel file, recalculate all the formulas and close it.

We start this script by defining our variables and imports:

temp_file = "temp.xlsx"

py_file_string = """

temp_file = '""" + temp_file + """'

inputFile = temp_file

import time

import os

import uno

from com.sun.star.beans import PropertyValuecode>

import subprocess  

Then we start LibreOffice and wait 3 seconds for it to be up and running:

os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"')

time.sleep(3)  # Sleep for 3 seconds

Then we need to get the desktop class. Contrary to what is written in most manuals for LibreOffice, desktop should be assigned from a Service Manager.

# get the uno component context from the PyUNO runtime

localContext = uno.getComponentContext()

# create the UnoUrlResolver

resolver = localContext.ServiceManager.createInstanceWithContext(

   "com.sun.star.bridge.UnoUrlResolver", localContext)

# connect to the running office

context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")

smgr = context.ServiceManager

desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context)

Now we need to create the URL that LibreOffice understands from the filename that we have:

def urlify(path):

   return uno.systemPathToFileUrl(os.path.realpath(path))

thedocurl = urlify(temp_file)

# the new filename URL

newdocurl = urlify(temp_file[:-5]+"_new.xlsx")  

print(thedocurl)

Now open the document from the URL:

document = desktop.loadComponentFromURL(thedocurl,"_blank",0,())

Recalculate all the formulas:

document.calculateAll()

Store as Excel .xlsx format:

p = PropertyValue()

p.Name = 'FilterName'

p.Value = 'Calc Office Open XML'

document.storeAsURL(newdocurl, (p,))

Close the document in LibreOffice:

document.dispose()

After executing all of the lines above the Python in Databricks froze and never finished executing. The solution to this problem was to kill the process of LibreOffice. This was the final step that finally released Python from an endless loop:

subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8')

Finishing the code with ending the multiline string and running the process:

"""

with open("init.py", "w") as f:

   f.write(py_file_string)

   

# function to run and print output from shell

def run_process_func(exe):

   p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

   while True:

       # returns None while subprocess is running

       retcode = p.poll()

       line = p.stdout.readline().decode("utf8")

       yield line

       if retcode is not None:

           break

def run_process(exe):

   print(f"running cmd: {exe}")

   for line in run_process_func(shlex.split(exe)):

       print(line, end='')

run_process("/usr/bin/python3 init.py")

In the end the code allowed us to open the Excel code, recalculate all the formulas and save the file.  

Final Code

First cell

%sh

sudo apt-get install libreoffice -y

sudo apt-get --yes --force-yes install  python3-uno

Second cell

temp_file = "temp.xlsx"

py_file_string = """

temp_file = '""" + temp_file + """'

inputFile = temp_file

import time

import os

import uno

from com.sun.star.beans import PropertyValue  

import subprocess

os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"')

time.sleep(3)  # Sleep for 3 seconds

# get the uno component context from the PyUNO runtime

localContext = uno.getComponentContext()  

# create the UnoUrlResolver

resolver = localContext.ServiceManager.createInstanceWithContext(

   "com.sun.star.bridge.UnoUrlResolver", localContext)

# connect to the running office

context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")

smgr = context.ServiceManager

desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context)

def urlify(path):

   return uno.systemPathToFileUrl(os.path.realpath(path))

thedocurl = urlify(temp_file)

# the new filename URL

newdocurl = urlify(temp_file[:-5]+"_new.xlsx")

print(thedocurl)

document = desktop.loadComponentFromURL(thedocurl,"_blank",0,())

document.calculateAll()

p = PropertyValue()

p.Name = 'FilterName'

p.Value = 'Calc Office Open XML'

document.storeAsURL(newdocurl, (p,))

document.dispose()

subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8')

"""

with open("init.py", "w") as f:

  f.write(py_file_string)

   

# function to run and print output from shell

def run_process_func(exe):

   p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

   while True:

       # returns None while subprocess is running

       retcode = p.poll()

       line = p.stdout.readline().decode("utf8")

       yield line

       if retcode is not None:

           break

def run_process(exe):

   print(f"running cmd: {exe}")

   for line in run_process_func(shlex.split(exe)):

       print(line, end='')

run_process("/usr/bin/python3 init.py")

Meet the Writer
LinkedIn Logo
Sergey Nikonov
Sergey has 8 years of experience in data science and machine learning in supply chain and operations. He likes working with data and believes that AI can change the world we live in drastically.

Categories

Popular Posts

Schedule a meeting

Win at the Supply Chain Transformation Game With AIO SCCC