review: perf: Precompile patterns for identifier checks #4072

SirYwell · 2021-07-31T15:38:27Z

@I-Al-Istannen and I started looking into ways to improve performance of https://github.com/I-Al-Istannen/JavadocApi.

When looking at the profiling results, I noticed that Pattern.compile() had a very high invocation count (in fact, it was the highest of all traced methods). I tracked down the callers and ended up in methods of CtReferenceImpl where regular expressions were used in combination with String#matches(), String#replaceAll() and String#split(). Precompiling these patterns seems to have a very high impact in our case. Some of the comparisons we've done with and without this patch:
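To illustrate the difference, here is a minimal sketch. The class, method, and constant names are made up for this example and are not taken from Spoon. String#matches() compiles its regex argument on every call, whereas a static final Pattern is compiled once and only a new Matcher is created per call.

```java
import java.util.regex.Pattern;

final class PrecompiledPatternDemo {
    // Hypothetical constant: compiled once, when the class is initialized.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // String#matches delegates to Pattern.matches, which compiles the regex on every call.
    static boolean isNumberSlow(String s) {
        return s.matches("\\d+");
    }

    // Reuses the compiled pattern; only a Matcher is allocated per call.
    static boolean isNumberFast(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumberSlow("12345") + " " + isNumberFast("12345"));
        System.out.println(isNumberSlow("12a45") + " " + isNumberFast("12a45"));
    }
}
```

Both methods return the same results; only the amount of work per call differs.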

From @I-Al-Istannen with the linked JavadocApi:

/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1612,75s user 3,70s system 690% cpu 3:54,06 total
/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1042,61s user 3,32s system 673% cpu 2:35,21 total

(scanning JavaFX, upper is without patch, lower is with patch)

/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1276,71s user 11,40s system 424% cpu 5:03,47 total
/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   978,21s user 10,06s system 452% cpu 3:38,48 total

(scanning OpenJDK 16, upper is without patch, lower is with patch)

With YourKit, I profiled the CtGenerationTest. This is a screenshot of the differences, focused on the checkIdentifierForJLSCorrectness method:

slarse

Hi @SirYwell,

This is great! I left one suggestion for how we might be able to squeeze out even more performance.

On a side note, if this turns out to still be a bottleneck after this optimization, we might want to try not using regex at all as it's kind of slow. It's probably even faster to just write custom search functions that operate directly on the string.

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

slarse

Let's ignore the optimization I suggested earlier for now. See my two new comments.

slarse · 2021-08-09T14:09:51Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

+	private static final Pattern IS_ARRAY_OR_INSTANCE = Pattern.compile("\\[\\]|@");
+	private static final Pattern IS_INNER_OR_GENERIC = Pattern.compile("\\.|<|>");


These two patterns are not used to identify arrays or generics, so I think the names are a bit misleading. Could you rename them appropriately? E.g. IS_INNER_OR_GENERIC would be more fittingly named NESTED_OR_GENERIC_SPLITTER or something like that.
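A small sketch of why a "splitter" style name fits better. It assumes the first pattern is used to strip array brackets and `@` via replaceAll and the second to split the remainder into parts; the constant names and the `parts` helper are hypothetical, only the two regexes come from the diff.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitterNamingDemo {
    // Hypothetical names; the regexes are the ones shown in the diff above.
    private static final Pattern ARRAY_OR_INSTANCE_MARKER = Pattern.compile("\\[\\]|@");
    private static final Pattern NESTED_OR_GENERIC_SPLITTER = Pattern.compile("\\.|<|>");

    // Assumed usage: remove "[]" and "@", then split at ".", "<" and ">".
    static String[] parts(String simpleName) {
        String stripped = ARRAY_OR_INSTANCE_MARKER.matcher(simpleName).replaceAll("");
        return NESTED_OR_GENERIC_SPLITTER.split(stripped);
    }

    public static void main(String[] args) {
        // prints [Hi, T, R]
        System.out.println(Arrays.toString(parts("Hi<T.R>[]")));
    }
}
```

Neither pattern answers an "is it ...?" question here; both only mark where to cut the string.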

slarse · 2021-08-09T14:18:40Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

@@ -101,9 +105,9 @@ private void checkIdentiferForJLSCorrectness(String simplename) {
 		 */
 		//JDTTreeBuilderHelper.computeAnonymousName returns "$numbers$Name" so we have to skip them if they start with numbers
 		//allow empty identifier because they are sometimes used.
-		if (!simplename.matches("<.*>|\\d.*|^.{0}$")) {
+		if (!IS_INNER_OR_GENERIC_OR_EMPTY.matcher(simplename).matches()) {


The more I look at this, the more I'm thinking it's completely unnecessary to use a regex for it. The regex only actually checks the first and last characters so it seems redundant to use a regex at all. Seems to me like something like this should be faster and work just as well:

private static boolean isEmptyOrLocalOrGeneric(String identifier) {
	return identifier.isEmpty()
			|| Character.isDigit(identifier.charAt(0))
			|| (identifier.startsWith("<") && identifier.endsWith(">"));
}

WDYT?
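As a quick sanity check, the suggested method can be compared against the original regex on a handful of inputs. This is only a sketch: the class name, the `oldCheck` helper, and the sample strings are chosen for illustration. The two are not identical for every possible string, since `\d` matches only ASCII digits by default while Character.isDigit also accepts other Unicode digits, and `.` does not match line terminators.

```java
import java.util.regex.Pattern;

final class RegexFreeCheckDemo {
    // The regex from the removed line of the diff.
    private static final Pattern OLD = Pattern.compile("<.*>|\\d.*|^.{0}$");

    static boolean oldCheck(String identifier) {
        return OLD.matcher(identifier).matches();
    }

    // The method suggested in the review comment.
    static boolean isEmptyOrLocalOrGeneric(String identifier) {
        return identifier.isEmpty()
                || Character.isDigit(identifier.charAt(0))
                || (identifier.startsWith("<") && identifier.endsWith(">"));
    }

    public static void main(String[] args) {
        // Illustrative samples only.
        String[] samples = {"", "<init>", "1Local", "Foo", "<", "<>", "a<b>"};
        for (String s : samples) {
            System.out.println("'" + s + "': regex=" + oldCheck(s)
                    + ", handwritten=" + isEmptyOrLocalOrGeneric(s));
        }
    }
}
```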

oh nice, a human, static compilation of the regexp :) 👍

This was a misunderstanding on my side, it's actually not about generics but about names like <init> there. I re-used your code there, the check for anonymous/local classes was moved to the other part of the code.

SirYwell · 2021-08-09T20:10:15Z

I'll look into it in a few days again, thanks for your feedback so far.


SirYwell · 2021-08-14T10:29:09Z

I spent a little bit more time on this and rewrote the logic without any regular expressions. I wrote microbenchmarks with JMH (https://github.com/SirYwell/spoon-benchmark, can be run with ./gradlew jmh).

I changed the ContractOnSettersparametrizedTest, as it set the simple name of a CtPackage to spoon.support.reflect.declaration.CtPackageImpl@1, which isn't a valid identifier (and it's the only case where an @ was inserted). I'm not sure if the change there is appropriate, but it makes the logic a little bit easier.

In general, the new implementation is more restrictive (e.g. <<>>...@@@ was allowed as simple name before).

I also tried to add dense documentation to avoid further confusion.
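For readers who have not opened the diff, a loop-based check can look roughly like the sketch below. It is a simplified illustration, not the code from this PR: it handles only dot-separated parts, skips the keyword check, generics and array brackets, and (unlike Spoon) rejects the empty string. All names are hypothetical.

```java
final class LoopIdentifierCheckDemo {
    // Simplified sketch: every dot-separated part must be a Java identifier.
    static boolean isValidQualifiedName(String name) {
        int start = 0; // index where the current part begins
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '.') {
                if (i == start) {
                    return false; // empty part, e.g. ".a" or "a..b"
                }
                start = i + 1;
            } else if (i == start) {
                if (!Character.isJavaIdentifierStart(c)) {
                    return false;
                }
            } else if (!Character.isJavaIdentifierPart(c)) {
                return false;
            }
        }
        // Rejects the empty string and a trailing '.'.
        return start < name.length();
    }

    public static void main(String[] args) {
        System.out.println(isValidQualifiedName("a.b.C"));
        System.out.println(isValidQualifiedName("<<>>...@@@"));
    }
}
```

A single pass like this also makes it obvious why a string such as <<>>...@@@ no longer slips through: every character is checked explicitly.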

The benchmark results when running it on my machine (I rearranged the lines):

Benchmark                                                                            (stringToCheck)  Mode  Cnt     Score     Error  Units
SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl                                      Hi<T.R>[]  avgt    4   103,862 ±   2,957  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                                                  Hi<T.R>[]  avgt    4  2034,576 ±  95,341  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl                                   Hi<T.R>[]  avgt    4   759,309 ±  43,301  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl                                   Hi<T.R>[]  avgt    4   669,038 ±  24,191  ns/op

SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl                                     HelloWorld  avgt    4    38,164 ±   1,884  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                                                 HelloWorld  avgt    4  1195,557 ±  43,976  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl                                  HelloWorld  avgt    4   495,497 ±  63,751  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl                                  HelloWorld  avgt    4   562,554 ±  13,842  ns/op

SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl     VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4   320,571 ±   7,270  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                 VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  4163,707 ± 159,896  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl  VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  2255,912 ± 106,559  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl  VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  2096,540 ±  51,277  ns/op

slarse

Starting to look very good! This is significantly more explicit than using a regex. As we don't have any good performance regression tests at the moment, this PR becomes more valuable when it also improves the code itself. Very nice restructuring of the entire thing: we already iterated over the entire string in checkIdentifierChars in addition to matching with a regex, so your restructuring is simply much better.

I have a few comments on the final code but after that I think we can merge.

As a general comment, I'd recommend toning down the use of inline comments. In many cases in this PR, they either cover up insufficiently descriptive code (e.g. the comment saying that 0 means "expect nothing in particular" can be replaced with a well-named variable) or state the obvious (e.g. that a keyword is not allowed in an identifier, when the code literally says "if this is a keyword, return false").

slarse · 2021-08-16T07:17:23Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

+					if (start == i) { // first char of a part
+						if (!Character.isJavaIdentifierStart(name.charAt(i))) {
+							return false;
+						}
+					} else {
+						if (!Character.isJavaIdentifierPart(name.charAt(i))) {
+							return false;
+						}
+					}


Collapse?

Suggested change

-					if (start == i) { // first char of a part
-						if (!Character.isJavaIdentifierStart(name.charAt(i))) {
-							return false;
-						}
-					} else {
-						if (!Character.isJavaIdentifierPart(name.charAt(i))) {
-							return false;
-						}
-					}
+					if (i == start && !Character.isJavaIdentifierStart(name.charAt(i))
+							|| !Character.isJavaIdentifierPart(name.charAt(i))) {
+						return false;
+					}
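The collapsed condition is only equivalent to the nested one because every character accepted by Character.isJavaIdentifierStart is also accepted by Character.isJavaIdentifierPart. The sketch below (class and method names are hypothetical) compares both forms over all char values to make that assumption explicit.

```java
final class CollapsedConditionDemo {
    // Nested form, as in the original diff: true means "reject this character".
    static boolean rejectNested(boolean firstCharOfPart, char c) {
        if (firstCharOfPart) {
            if (!Character.isJavaIdentifierStart(c)) {
                return true;
            }
        } else {
            if (!Character.isJavaIdentifierPart(c)) {
                return true;
            }
        }
        return false;
    }

    // Collapsed form, as in the suggested change.
    static boolean rejectCollapsed(boolean firstCharOfPart, char c) {
        return firstCharOfPart && !Character.isJavaIdentifierStart(c)
                || !Character.isJavaIdentifierPart(c);
    }

    // Compares both forms for every char value and both positions.
    static boolean agreeOnAllChars() {
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            char ch = (char) c;
            if (rejectNested(true, ch) != rejectCollapsed(true, ch)
                    || rejectNested(false, ch) != rejectCollapsed(false, ch)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAllChars());
    }
}
```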

slarse · 2021-08-16T07:21:02Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

+			i++;
+		}
+		int start = i; // used to mark the beginning of a part
+		char expectNext = 0; // 0 = do not expect anything


Avoid unnamed magic values :)

Suggested change

-		char expectNext = 0; // 0 = do not expect anything
+		final char anything = 0;
+		char expectNext = anything;

slarse · 2021-08-16T07:21:22Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

+		int start = i; // used to mark the beginning of a part
+		char expectNext = 0; // 0 = do not expect anything
+		for (; i < name.length(); i++) {
+			if (expectNext != 0) {


idem

Suggested change

-			if (expectNext != 0) {
+			if (expectNext != anything) {

slarse · 2021-08-16T07:21:47Z

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

+				if (name.charAt(i) != expectNext) {
+					return false;
+				} else if (name.charAt(i) == expectNext) {
+					expectNext = 0; // reset


Suggested change

-					expectNext = 0; // reset
+					expectNext = anything;
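To show the named sentinel in a complete loop, here is a small sketch that validates an array suffix such as "[][]". It is an illustration under assumptions, not the PR's code: the method, the class, and the constant name NOTHING_EXPECTED are hypothetical.

```java
final class ExpectNextDemo {
    // Hypothetical sentinel: no particular character is required next.
    private static final char NOTHING_EXPECTED = 0;

    // Returns true if the suffix consists only of complete "[]" pairs (possibly none).
    static boolean isArraySuffix(String suffix) {
        char expectNext = NOTHING_EXPECTED;
        for (int i = 0; i < suffix.length(); i++) {
            char c = suffix.charAt(i);
            if (expectNext != NOTHING_EXPECTED) {
                if (c != expectNext) {
                    return false;
                }
                expectNext = NOTHING_EXPECTED; // the expected character arrived
            } else if (c == '[') {
                expectNext = ']';
            } else {
                return false;
            }
        }
        // An unclosed '[' leaves an expectation pending.
        return expectNext == NOTHING_EXPECTED;
    }

    public static void main(String[] args) {
        System.out.println(isArraySuffix("[][][]"));
        System.out.println(isArraySuffix("[[]"));
    }
}
```

With the sentinel named, the comparisons read as intent rather than as a bare 0, and the reset after a match needs no separate branch.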


slarse

> I spent a little bit more time on this and rewrote the logic without any regular expressions. I wrote microbenchmarks with JMH (https://github.com/SirYwell/spoon-benchmark, can be run with ./gradlew jmh).

Completely missed this comment, but those numbers look very promising! I'm looking forward to trying this out with some of my dependent projects. Based on some ad-hoc tests I ran, it looks like a pretty substantial performance improvement.

Anyway, this all looks good to me now!

slarse · 2021-08-16T09:48:32Z

Many thanks @SirYwell, this is an excellent contribution.

Precompile patterns for keyword checks

5e14f98

andre15silva mentioned this pull request Jul 31, 2021

fix: Fix compliance level support for old code #4068

Merged

Make regexes more consistent

ee676fe

slarse suggested changes Aug 9, 2021

src/main/java/spoon/support/reflect/reference/CtReferenceImpl.java

slarse changed the title ~~review: refactor: Precompile patterns for identifier checks~~ review: perf: Precompile patterns for identifier checks Aug 9, 2021

slarse suggested changes Aug 9, 2021

monperrus mentioned this pull request Aug 10, 2021

chore: add performance regression test infrastructure. #4084

Open

SirYwell added 3 commits August 14, 2021 09:13

Merge remote-tracking branch 'upstream/master' into refactor/precompi…

ce8737f

…le-patterns

Rewrite identifier validation

3252fbc

add parentheses to ifs

546066d

slarse suggested changes Aug 16, 2021

Apply suggestions

101f064

- manually, because Github does not seem to allow additional changes

slarse approved these changes Aug 16, 2021

slarse merged commit a08a57e into INRIA:master Aug 16, 2021

monperrus mentioned this pull request Aug 19, 2021

release Spoon 9.1.0 #4104

Merged

woutersmeenk pushed a commit to woutersmeenk/spoon that referenced this pull request Aug 29, 2021

perf: Replace regex in identifier checks with loop (INRIA#4072)

a40434c
