On a weekly basis, we exchange large amounts of data with our clients. This can for instance be information on different SKUs, such as how long they have been in stock, how they are selling, and so on.
For our customers, it may be easier to receive this data in Excel, as excel is a tool a lot of people are familiar with. Most of us already have it installed on our computer, which means no additional solutions or licenses have to be bought. A benefit Excel has over CSV is that it has a user interface and you do not need to spend a lot time developing. On top of this, we can also use different styles of formatting, colors and visualizations in Excel, so that it looks better and makes the data easier to understand for our clients.
Taking into account that we want to build some kind of a product, we actually need to automate the process of Excel file creation, calculation of the formulas and saving the files on the client side. As for some customers there are lots of different departmens receiving these types of files, we need to automate this process of using Excel and VBA (the programming language in Excel), because doing it manually would take too long. The automation would entail the process of Excel file creation, calculation of the formulas and saving the files on the client side.
Challenges
At aioneers we use Databricks for the automation of data science processes, and therefore we wanted to do the automation of Excel file creation on Databricks.
Databricks, however, has some additional challenges that are specific to the platform. The first thing that was immediately evident was that it is on Linux. This means that we don‘t have any version of Excel on the platform, because Excel does not run on Linux. However, there is a possibility to use open source software that we can use to calculate the Excel formulas in the Excel sheets. As we worked on this project, we came across some challenges that we did not expect at the start. The code we’re sharing in this article is only relevant for the working solution that we found. The code that will be written here will only be relevant to the solution that works.
How to make it work
At first we followed the instructions to install Open Office on a Databricks server. However, we soon figured out that OpenOffice uses Python 2 instead of Python 3. Python 3 is the current standard, and we did not want to use an old and unsupported version of Python. This made us switch to LibreOffice.
All the code listed below is written and executed in Databricks notebooks.
Installation of LibreOffice:
%sh
sudo apt-get install libreoffice -y
In order to start LibreOffice from Python we needed to install additional package python3-uno:
%sh
sudo apt-get --yes --force-yes install python3-uno
After that we checked what packages we have installed:
%sh
dpkg -l python3-uno libreoffice libreoffice-calc o
This command confirmed that we have all the needed packages installed. The output of the command:
||/ Name Version Architecture Description
+++-================-=========================-============-========================================
ii libreoffice 1:6.0.7-0ubuntu0.18.04.10 amd64 office productivity suite (metapackage)
ii libreoffice-calc 1:6.0.7-0ubuntu0.18.04.10 amd64 office productivity suite -- spreadsheet
ii python3-uno 1:6.0.7-0ubuntu0.18.04.10 amd64 Python-UNO bridge
According to the manual, after installing python3-uno packages, the uno package should be available in Python. But executing import uno tells us that the library is not found.
Why did it happen? It appears that there are two different Python installations on the system. One is the one that is used by Databricks (/databricks/python/bin/python) and the other one, installed by the system by default (/usr/bin/python3). The installation of python3-uno package only installs the uno package for /usr/bin/python3, and not /databricks/python/bin/python that is actually used on Databricks.
After thinking about this for a while, we decided to create a Python script file and execute it with the right Python (/usr/bin/python3). The Idea of the following script is to open the Excel file, recalculate all the formulas and close it.
We start this script by defining our variables and imports:
temp_file = "temp.xlsx"
py_file_string = """
temp_file = '""" + temp_file + """'
inputFile = temp_file
import time
import os
import uno
from com.sun.star.beans import PropertyValuecode>
import subprocess
Then we start LibreOffice and wait 3 seconds for it to be up and running:
os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"')
time.sleep(3) # Sleep for 3 seconds
Then we need to get the desktop class. Contrary to what is written in most manuals for LibreOffice, desktop should be assigned from a Service Manager.
# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()
# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext(
"com.sun.star.bridge.UnoUrlResolver", localContext)
# connect to the running office
context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
smgr = context.ServiceManager
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context)
Now we need to create the URL that LibreOffice understands from the filename that we have:
def urlify(path):
return uno.systemPathToFileUrl(os.path.realpath(path))
thedocurl = urlify(temp_file)
# the new filename URL
newdocurl = urlify(temp_file[:-5]+"_new.xlsx")
print(thedocurl)
Now open the document from the URL:
document = desktop.loadComponentFromURL(thedocurl,"_blank",0,())
Recalculate all the formulas:
document.calculateAll()
Store as Excel .xlsx format:
p = PropertyValue()
p.Name = 'FilterName'
p.Value = 'Calc Office Open XML'
document.storeAsURL(newdocurl, (p,))
Close the document in LibreOffice:
document.dispose()
After executing all of the lines above the Python in Databricks froze and never finished executing. The solution to this problem was to kill the process of LibreOffice. This was the final step that finally released Python from an endless loop:
subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8')
Finishing the code with ending the multiline string and running the process:
"""
with open("init.py", "w") as f:
f.write(py_file_string)
# function to run and print output from shell
def run_process_func(exe):
p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
while True:
# returns None while subprocess is running
retcode = p.poll()
line = p.stdout.readline().decode("utf8")
yield line
if retcode is not None:
break
def run_process(exe):
print(f"running cmd: {exe}")
for line in run_process_func(shlex.split(exe)):
print(line, end='')
run_process("/usr/bin/python3 init.py")
In the end the code allowed us to open the Excel code, recalculate all the formulas and save the file.
Final Code
First cell
%sh
sudo apt-get install libreoffice -y
sudo apt-get --yes --force-yes install python3-uno
Second cell
temp_file = "temp.xlsx"
py_file_string = """
temp_file = '""" + temp_file + """'
inputFile = temp_file
import time
import os
import uno
from com.sun.star.beans import PropertyValue
import subprocess
os.popen('/usr/lib/libreoffice/program/soffice --headless --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager"')
time.sleep(3) # Sleep for 3 seconds
# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()
# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext(
"com.sun.star.bridge.UnoUrlResolver", localContext)
# connect to the running office
context = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
smgr = context.ServiceManager
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",context)
def urlify(path):
return uno.systemPathToFileUrl(os.path.realpath(path))
thedocurl = urlify(temp_file)
# the new filename URL
newdocurl = urlify(temp_file[:-5]+"_new.xlsx")
print(thedocurl)
document = desktop.loadComponentFromURL(thedocurl,"_blank",0,())
document.calculateAll()
p = PropertyValue()
p.Name = 'FilterName'
p.Value = 'Calc Office Open XML'
document.storeAsURL(newdocurl, (p,))
document.dispose()
subprocess.run(['/usr/bin/killall', 'soffice.bin'], stdout=subprocess.PIPE).stdout.decode('utf-8')
"""
with open("init.py", "w") as f:
f.write(py_file_string)
# function to run and print output from shell
def run_process_func(exe):
p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
while True:
# returns None while subprocess is running
retcode = p.poll()
line = p.stdout.readline().decode("utf8")
yield line
if retcode is not None:
break
def run_process(exe):
print(f"running cmd: {exe}")
for line in run_process_func(shlex.split(exe)):
print(line, end='')
run_process("/usr/bin/python3 init.py")