Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

review: perf: Precompile patterns for identifier checks #4072

Merged
merged 6 commits into from
Aug 16, 2021

Conversation

SirYwell
Copy link
Collaborator

@I-Al-Istannen and I started looking into ways to improve performance of https://github.com/I-Al-Istannen/JavadocApi.

When looking into profilings, I noticed that Pattern.compile() had a very high invocation count (in fact, it was the highest of all traced methods). I tracked down the callees and ended up in in methods of CtReferenceImpl where regular expressions were used in combination with String#matches(), String#replaceAll() and String#split(). Precompiling these patterns seems to have a very high impact in our case. Some of the comparisons we've done with and without this patch:

From @I-Al-Istannen with the linked JavadocApi:

/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1612,75s user 3,70s system 690% cpu 3:54,06 total
/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1042,61s user 3,32s system 673% cpu 2:35,21 total

(scanning JavaFX, upper is without patch, lower is with patch)

/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   1276,71s user 11,40s system 424% cpu 5:03,47 total
/usr/lib/jvm/java-16-adoptopenjdk/bin/java -Xmx8G -jar _target/JavadocApi.jar   978,21s user 10,06s system 452% cpu 3:38,48 total

(scanning OpenJDK 16, upper is without patch, lower is with patch)

With YourKit, I profiled the CtGenerationTest, this is a screenshot of the differences focused on the checkIdentifierForJLSCorrectness method:
yourkit result

Copy link
Collaborator

@slarse slarse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @SirYwell,

This is great! I left one suggestion for how we might be able to squeeze out even more performance.

On a side note, if this turns out to still be a bottleneck after this optimization, we might want to try not using regex at all as it's kind of slow. It's probably even faster to just write custom search functions that operate directly on the string.

@slarse slarse changed the title review: refactor: Precompile patterns for identifier checks review: perf: Precompile patterns for identifier checks Aug 9, 2021
Copy link
Collaborator

@slarse slarse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ignore the optimization I suggested earlier for now. See my two new comments.

Comment on lines 35 to 36
private static final Pattern IS_ARRAY_OR_INSTANCE = Pattern.compile("\\[\\]|@");
private static final Pattern IS_INNER_OR_GENERIC = Pattern.compile("\\.|<|>");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two patterns are not used to identify arrays or generics, so I think the names are a bit misleading. Could you rename them appropriately? E.g. IS_INNER_OR_GENERIC would be more fittingly named NESTED_OR_GENERIC_SPLITTER or something like that.

@@ -101,9 +105,9 @@ private void checkIdentiferForJLSCorrectness(String simplename) {
*/
//JDTTreeBuilderHelper.computeAnonymousName returns "$numbers$Name" so we have to skip them if they start with numbers
//allow empty identifier because they are sometimes used.
if (!simplename.matches("<.*>|\\d.*|^.{0}$")) {
if (!IS_INNER_OR_GENERIC_OR_EMPTY.matcher(simplename).matches()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at this, the more I'm thinking it's completely unnecessary to use a regex for it. The regex only actually checks the first and last characters so it seems redundant to use a regex at all. Seems to me like something like this should be faster and work just as well:

	private static boolean isEmptyOrLocalOrGeneric(String identifier) {
		return identifier.isEmpty()
				|| Character.isDigit(identifier.charAt(0))
				|| (identifier.startsWith("<") && identifier.endsWith(">"));
	}

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice, a human, static compilation of the regexp :) 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a misunderstanding on my side, it's actually not about generics but about names like <init> there. I re-used your code there, the check for anonymous/local classes was moved to the other part of the code.

@SirYwell
Copy link
Collaborator Author

SirYwell commented Aug 9, 2021

I'll look into it in a few days again, thanks for your feedback so far.

@SirYwell
Copy link
Collaborator Author

SirYwell commented Aug 14, 2021

I spend a little bit more time on this and rewrote the logic without any regular expressions. I wrote microbenchmarks with JMH (https://github.com/SirYwell/spoon-benchmark, can be run with ./gradlew jmh).

I changed the ContractOnSettersparametrizedTest, as it set the simple name of a CtPackage to spoon.support.reflect.declaration.CtPackageImpl@1 which isn't a valid identifier (and it's the only case were a @ was inserted). I'm not sure if the change there is appropriate but it makes the logic a little bit easier.

In general, the new implementation is more restrictive (e.g. <<>>...@@@ was allowed as simple name before).

I also tried to add dense documentation to avoid further confusion.

The bechmark result when running it on my machine (I rearranged the lines):

Benchmark                                                                            (stringToCheck)  Mode  Cnt     Score     Error  Units
SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl                                      Hi<T.R>[]  avgt    4   103,862 ±   2,957  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                                                  Hi<T.R>[]  avgt    4  2034,576 ±  95,341  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl                                   Hi<T.R>[]  avgt    4   759,309 ±  43,301  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl                                   Hi<T.R>[]  avgt    4   669,038 ±  24,191  ns/op

SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl                                     HelloWorld  avgt    4    38,164 ±   1,884  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                                                 HelloWorld  avgt    4  1195,557 ±  43,976  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl                                  HelloWorld  avgt    4   495,497 ±  63,751  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl                                  HelloWorld  avgt    4   562,554 ±  13,842  ns/op

SpoonCheckIdentifierBenchmark.testHandMadeCheckerImpl     VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4   320,571 ±   7,270  ns/op
SpoonCheckIdentifierBenchmark.testOldImpl                 VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  4163,707 ± 159,896  ns/op
SpoonCheckIdentifierBenchmark.testPrecompiledPatternImpl  VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  2255,912 ± 106,559  ns/op
SpoonCheckIdentifierBenchmark.testThreadLocalMatcherImpl  VeryLongValidKeyword123<A.B.C.D.E.F>[][][]  avgt    4  2096,540 ±  51,277  ns/op

Copy link
Collaborator

@slarse slarse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to look very good! This is significantly more explicit than using a regex. As we don't have any good performance regression tests at the moment, this PR becomes more valuable when it also improves the code itself. Very nice restructuring of the entire thing, as we already iterated over the entire string in checkIdentifierChars in addition to matching with regex, your restructuring is simply much better.

I have a few comments on the final code but after that I think we can merge.

As a general comment, I'd recommend toning down the use of inline comments. In many cases in this PR, they either cover up insufficiently detailed code (e.g. saying that 0 is a character to expect nothing in particular can be rewritten with a variable name) or state the obvious (e.g. that a keyword is not allowed in an identifier, when the code literally says if this is a keyword return false).

Comment on lines 158 to 166
if (start == i) { // first char of a part
if (!Character.isJavaIdentifierStart(name.charAt(i))) {
return false;
}
} else {
if (!Character.isJavaIdentifierPart(name.charAt(i))) {
return false;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collapse?

Suggested change
if (start == i) { // first char of a part
if (!Character.isJavaIdentifierStart(name.charAt(i))) {
return false;
}
} else {
if (!Character.isJavaIdentifierPart(name.charAt(i))) {
return false;
}
}
if (i == start && !Character.isJavaIdentifierStart(name.charAt(i))
|| !Character.isJavaIdentifierPart(name.charAt(i))) {
return false;
}

i++;
}
int start = i; // used to mark the beginning of a part
char expectNext = 0; // 0 = do not expect anything
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid unnamed magic values :)

Suggested change
char expectNext = 0; // 0 = do not expect anything
final char anything = 0;
char expectNext = anything;

int start = i; // used to mark the beginning of a part
char expectNext = 0; // 0 = do not expect anything
for (; i < name.length(); i++) {
if (expectNext != 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

Suggested change
if (expectNext != 0) {
if (expectNext != anything) {

if (name.charAt(i) != expectNext) {
return false;
} else if (name.charAt(i) == expectNext) {
expectNext = 0; // reset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expectNext = 0; // reset
expectNext = anything;

- manually, because Github does not seem to allow additional changes
Copy link
Collaborator

@slarse slarse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spend a little bit more time on this and rewrote the logic without any regular expressions. I wrote microbenchmarks with JMH (https://github.com/SirYwell/spoon-benchmark, can be run with ./gradlew jmh).

Completely missed this comment, but those numbers look very promising! I'm looking forward to trying this out with some of my dependent projects, based on some ad-hoc tests I ran it looks like a pretty substantial performance improvement.

Anyway, this all looks good to me now!

@slarse slarse merged commit a08a57e into INRIA:master Aug 16, 2021
@slarse
Copy link
Collaborator

slarse commented Aug 16, 2021

Many thanks @SirYwell, this is an excellent contribution.

@monperrus monperrus mentioned this pull request Aug 19, 2021
woutersmeenk pushed a commit to woutersmeenk/spoon that referenced this pull request Aug 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
  NODES
COMMUNITY 2
Note 1
Project 4
USERS 1