Package ca.corbett.extras.io
Class TextFileDetector
java.lang.Object
ca.corbett.extras.io.TextFileDetector
This utility class can be used to quickly "guess" if a file is likely a text file
based on the presence of non-printable characters in a sample of its content.
It uses a simple heuristic that counts the number of non-printable characters
in the first N bytes of the file, and if the ratio of non-printable characters
exceeds a certain threshold, it classifies the file as binary.
This detector works well for ASCII-compatible encodings (ASCII, UTF-8, ISO-8859-1, etc.) but will classify UTF-16 and UTF-32 encoded files as binary, because text in those encodings typically contains embedded null bytes. For more comprehensive encoding detection, consider using a library like Apache Tika or ICU4J.
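To illustrate why UTF-16 content trips a null-byte check, here is a minimal, self-contained sketch. The class name Utf16NullByteDemo and the helper containsNullByte are hypothetical and are not part of TextFileDetector; they only show how the same ASCII-range string looks at the byte level in two encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of swing-extras.
final class Utf16NullByteDemo {

    /** Returns true if any byte in the array is 0x00. */
    static boolean containsNullByte(byte[] data) {
        for (byte b : data) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sample = "plain text";

        // UTF-8 encodes each ASCII character as one non-zero byte.
        System.out.println(containsNullByte(sample.getBytes(StandardCharsets.UTF_8)));    // prints false

        // UTF-16LE encodes each ASCII character as two bytes, the second being 0x00.
        System.out.println(containsNullByte(sample.getBytes(StandardCharsets.UTF_16LE))); // prints true
    }
}
```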
The detection algorithm:
- Reads a sample of bytes from the beginning of the file
- Immediately rejects files containing null bytes (0x00)
- Counts non-printable control characters (excluding common whitespace)
- Classifies as text if non-printable ratio is below threshold
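The steps above can be sketched roughly as follows. This is an illustrative approximation, not the actual TextFileDetector source: the class name TextHeuristicSketch and its methods are hypothetical, and several details are assumptions (tab, line feed, and carriage return count as "common whitespace"; DEL (0x7F) counts as non-printable; an empty sample is treated as text; a ratio exactly equal to the threshold is treated as binary).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the heuristic, not the library's implementation.
final class TextHeuristicSketch {

    /** Applies the heuristic to the first {@code length} bytes of {@code sample}. */
    static boolean looksLikeText(byte[] sample, int length, double threshold) {
        if (length == 0) {
            return true; // assumption: an empty sample is treated as text
        }
        int nonPrintable = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF; // read the byte as an unsigned value
            if (b == 0x00) {
                return false; // a null byte rejects the sample immediately
            }
            // Assumption: tab, LF, and CR are the "common whitespace" exclusions.
            boolean commonWhitespace = b == '\t' || b == '\n' || b == '\r';
            if ((b < 0x20 && !commonWhitespace) || b == 0x7F) {
                nonPrintable++;
            }
        }
        // Assumption: text only if the ratio is strictly below the threshold.
        return (double) nonPrintable / length < threshold;
    }

    /** Reads up to {@code sampleSize} bytes from the start of the file and applies the heuristic. */
    static boolean looksLikeText(Path file, int sampleSize, double threshold) throws IOException {
        byte[] buffer = new byte[sampleSize];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            read = in.readNBytes(buffer, 0, sampleSize); // requires Java 9 or later
        }
        return looksLikeText(buffer, read, threshold);
    }
}
```

Bytes at 0x80 and above are not counted here, which is why multi-byte UTF-8 sequences and ISO-8859-1 accented characters pass as text in this sketch.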
USAGE: There are three ways to use this class:
- Use the static isTextFile(File) method with default settings.
- Use the static isTextFile(File, int, double) method with custom sample size and threshold.
- Use the Builder class to configure and perform detection.
Example usage of the Builder class:
boolean isTextFile = new TextFileDetector.Builder()
        .sampleSize(16384)  // set sample size to 16 KB
        .threshold(0.03)    // set non-printable threshold to 3%
        .detect(testFile);  // run the detection and report the result
- Since: swing-extras 2.6
- Author: claude.ai
Nested Class Summary
Nested Classes
- static class TextFileDetector.Builder: Builder for configurable text file detection.
Constructor Summary
Constructors
- TextFileDetector()
Method Summary
- static boolean isTextFile(File file): Detects if a file is likely a text file using default settings.
- static boolean isTextFile(File file, int sampleSize, double nonPrintableThreshold): Detects if a file is likely a text file with configurable parameters.
Constructor Details
TextFileDetector
public TextFileDetector()
Method Details
isTextFile
public static boolean isTextFile(File file) throws IOException
Detects if a file is likely a text file using default settings.
Note: This method is optimized for ASCII-compatible encodings (ASCII, UTF-8, etc.) and will classify UTF-16/UTF-32 files as binary.
- Parameters:
file - the file to check
- Returns:
true if the file appears to be a text file
- Throws:
IOException - if an I/O error occurs
isTextFile
public static boolean isTextFile(File file, int sampleSize, double nonPrintableThreshold) throws IOException
Detects if a file is likely a text file with configurable parameters.
Note: This method is optimized for ASCII-compatible encodings (ASCII, UTF-8, etc.) and will classify UTF-16/UTF-32 files as binary.
- Parameters:
file - the file to check
sampleSize - number of bytes to read for analysis
nonPrintableThreshold - maximum ratio of non-printable characters (0.0 to 1.0)
- Returns:
true if the file appears to be a text file
- Throws:
IOException - if an I/O error occurs