I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):
=HYPERLINK("/doc", "ref to doc")
but these are rare. Most of them looks like this:
=HYPERLINK(R246,"ref to doc")
or
=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))
Unfortunately. cell.getHyperlink()
always returns null. I can extract the formula text from the cell, with cell.getCellFormula()
, so I decided to get the hyperlink manually from the formula. I created a regular expression for it:
Pattern pattern = Patternpile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();
This results in exception java.lang.IllegalStateException: No match found.
Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?
I am processing Excel spreadsheet in Java with Apache POI. Some columns in this spreadsheet contain hyperlinks to the documents stored on the web, which I need to extract. There are several different type of hyperlinks. Some looks like this (if open Excel cell):
=HYPERLINK("https://web.address/doc", "ref to doc")
but these are rare. Most of them looks like this:
=HYPERLINK(R246,"ref to doc")
or
=IF(I257="#undef","undefined",HYPERLINK(U257,"ref to doc"))
Unfortunately. cell.getHyperlink()
always returns null. I can extract the formula text from the cell, with cell.getCellFormula()
, so I decided to get the hyperlink manually from the formula. I created a regular expression for it:
Pattern pattern = Pattern.compile("HYPERLINK\\((.*?),");
Matcher matcher = pattern.matcher(cellFormula);
String val = matcher.group();
This results in exception java.lang.IllegalStateException: No match found.
Can somebody either point out what is wrong with my regex or suggest some better way to get those hyperlinks?
Parsing Excel formulas using regular expression is not what I would suggest. Even if you get it work, you will get either "\"https://web.address/doc\""
or "R246"
or "U257"
for your examples. The first result is usable. But what to do with the second and the third result then?
Fortunately Apache POI provides a FormulaParser. To parse formulas I always suggest using that.
Complete example:
import java.io.FileInputStream;
import org.apache.poi.ss.formula.*;
import org.apache.poi.ss.formula.ptg.*;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
import org.apache.poi.ss.util.CellAddress;
public class ExcelHyperlinkDestinationFromFormula {
private static String getHyperlinkFormulaDestination(XSSFSheet sheet, String hyperlinkFormula) {
String result = "HyperlinkFormulaDestination not found";
XSSFEvaluationWorkbook workbookWrapper = XSSFEvaluationWorkbook.create((XSSFWorkbook) sheet.getWorkbook());
Ptg[] ptgs = FormulaParser.parse(hyperlinkFormula, workbookWrapper, FormulaType.CELL, sheet.getWorkbook().getSheetIndex(sheet));
for (int i = 0; i < ptgs.length; i++) {
//System.out.println(ptgs[i]);
if (ptgs[i] instanceof FuncVarPtg) {
FuncVarPtg funcVarPtg = (FuncVarPtg)ptgs[i];
//System.out.println(funcVarPtg);
if ("HYPERLINK".equals(funcVarPtg.getName())) {
int firstOperand = funcVarPtg.getNumberOfOperands();
Ptg firstOperandPtg = ptgs[i-firstOperand]; //Reverse Polish notation: first operand is number of operands before funcVarPtg
//System.out.println(firstOperandPtg);
if (firstOperandPtg instanceof StringPtg) {
result = ((StringPtg)firstOperandPtg).getValue();
} else if (firstOperandPtg instanceof RefPtg) {
Cell cell = sheet.getRow(((RefPtg)firstOperandPtg).getRow()).getCell(((RefPtg)firstOperandPtg).getColumn());
DataFormatter dataFormatter = new DataFormatter();
dataFormatter.setUseCachedValuesForFormulaCells(true);
result = dataFormatter.formatCellValue(cell);
} else if (firstOperandPtg instanceof FuncVarPtg) {
result = "HyperlinkFormulaDestination behind FuncVarPtg not parsed"; //ToDo
} else if (firstOperandPtg instanceof NamePtg) {
result = "HyperlinkFormulaDestination behind NamePtg not parsed"; //ToDo
}
}
}
}
return result;
}
public static void main(String[] args) throws Exception {
XSSFWorkbook workbook = (XSSFWorkbook)WorkbookFactory.create(new FileInputStream("./test.xlsx"));
XSSFSheet sheet = workbook.getSheetAt(0);
for (Row row : sheet) {
for (Cell cell : row) {
if (cell.getCellType() == CellType.FORMULA) {
CellAddress source = cell.getAddress();
String formula = cell.getCellFormula();
if (formula.toLowerCase().contains("hyperlink(")) {
System.out.println(source + "=" + formula);
String hyperlinkFormulaDestination = getHyperlinkFormulaDestination(sheet, formula);
System.out.println(hyperlinkFormulaDestination);
}
}
}
}
workbook.close();
}
}
The code should be able getting the hyperlink formula destination of your examples. ToDos are to extend the code so it is able parsing functions and/or names being the first operand in HYPERLINK
formula too.
org.apache.poi.ss.formula.functions.Hyperlink
from the cell (not theorg.apache.poi.ss.usermodel.XSSFHyperlink
, as you are trying). The value in the cell=HYPERLINK("https://web.address/doc", "ref to doc")
represents an Excel function. – prasad_ Commented Feb 1 at 7:54matcher.matches()
is false as the cellFormula does not end with,
. Instead usematcher.find()
thenmatcher.group()
– DuncG Commented Feb 1 at 11:30matcher.group()
is whatfind()
matched, you would need to look at the capture group 1 of your regexmatcher.group(1)
. You still have a problem that this could be an expression (such as another cell reference R246) or string argument"https://web.address/doc"
so further processing is required. – DuncG Commented Feb 2 at 12:33