Class TextFileDetector

java.lang.Object
ca.corbett.extras.io.TextFileDetector

public class TextFileDetector extends Object
This utility class can be used to quickly "guess" whether a file is likely a text file, based on the presence of non-printable characters in a sample of its content. It uses a simple heuristic: it counts the non-printable characters in the first N bytes of the file and, if their ratio exceeds a configurable threshold, classifies the file as binary.

This detector works well for ASCII-compatible encodings (ASCII, UTF-8, ISO-8859-1, etc.) but will classify UTF-16 and UTF-32 encoded files as binary due to their embedded null bytes. For more comprehensive encoding detection, consider using a library like Apache Tika or ICU4J.
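
The null-byte limitation can be seen by encoding the same ASCII string in different charsets. The sketch below is illustrative only; NullByteDemo and countNullBytes are hypothetical names that are not part of this class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of swing-extras.
class NullByteDemo {

    /** Counts the 0x00 bytes produced when text is encoded with the given charset. */
    static int countNullBytes(String text, Charset charset) {
        int count = 0;
        for (byte b : text.getBytes(charset)) {
            if (b == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "hello";
        // UTF-8 encodes ASCII characters as single non-zero bytes.
        System.out.println(countNullBytes(text, StandardCharsets.UTF_8));    // 0
        // UTF-16 uses two bytes per ASCII character, one of which is 0x00.
        System.out.println(countNullBytes(text, StandardCharsets.UTF_16LE)); // 5
    }
}
```

Because any null byte in the sample causes immediate rejection, even a short UTF-16 file would be reported as binary.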

The detection algorithm:

  • Reads a sample of bytes from the beginning of the file
  • Immediately rejects files containing null bytes (0x00)
  • Counts non-printable control characters (excluding common whitespace)
  • Classifies as text if non-printable ratio is below threshold
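
The steps above can be sketched roughly as follows. This is a minimal illustration of the heuristic, not the library's actual source. The names TextHeuristicSketch and looksLikeText are hypothetical, and several details are assumptions: that "common whitespace" means tab, line feed, carriage return, and form feed; that DEL (0x7F) counts as non-printable; that bytes 0x80 and above count as printable; and that an empty file is treated as text.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the heuristic; not the TextFileDetector source.
class TextHeuristicSketch {

    /** Applies the heuristic to the first length bytes of an in-memory sample. */
    static boolean looksLikeText(byte[] sample, int length, double threshold) {
        if (length == 0) {
            return true; // assumption: an empty file is treated as text
        }
        int nonPrintable = 0;
        for (int i = 0; i < length; i++) {
            int b = sample[i] & 0xFF; // read the byte as an unsigned value
            if (b == 0x00) {
                return false; // a null byte rejects the file immediately
            }
            // Assumption: these four count as "common whitespace".
            boolean commonWhitespace = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            // Assumption: control characters below 0x20 and DEL are non-printable;
            // bytes >= 0x80 are accepted so UTF-8 and ISO-8859-1 text passes.
            if ((b < 0x20 && !commonWhitespace) || b == 0x7F) {
                nonPrintable++;
            }
        }
        // Text only if the non-printable ratio is below the threshold.
        return (double) nonPrintable / length < threshold;
    }

    /** Reads up to sampleSize bytes from the start of the file, then applies the heuristic. */
    static boolean looksLikeText(File file, int sampleSize, double threshold) throws IOException {
        byte[] buffer = new byte[sampleSize];
        int length;
        try (InputStream in = new FileInputStream(file)) {
            length = in.readNBytes(buffer, 0, sampleSize); // requires Java 9 or later
        }
        return looksLikeText(buffer, length, threshold);
    }
}
```

Checking for null bytes before counting anything else lets most binary formats be rejected after only a few bytes of the sample.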

USAGE: There are three ways to use this class:

  1. Use the static isTextFile(File) method with default settings.
  2. Use the static isTextFile(File, int, double) method with custom sample size and threshold.
  3. Use the Builder class to configure and perform detection.

Example usage of the Builder class:

   boolean isTextFile = new TextFileDetector.Builder()
              .sampleSize(16384) // set sample size to 16KB
              .threshold(0.03)   // set non-printable threshold to 3%
              .detect(testFile); // run the detection and report result
 
Since:
swing-extras 2.6
Author:
claude.ai
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    TextFileDetector.Builder
    Builder for configurable text file detection.
  • Constructor Summary

    Constructors
    Constructor
    Description
    TextFileDetector()
  • Method Summary

    Modifier and Type
    Method
    Description
    static boolean
    isTextFile(File file)
    Detects if a file is likely a text file using default settings.
    static boolean
    isTextFile(File file, int sampleSize, double nonPrintableThreshold)
    Detects if a file is likely a text file with configurable parameters.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • TextFileDetector

      public TextFileDetector()
  • Method Details

    • isTextFile

      public static boolean isTextFile(File file) throws IOException
      Detects if a file is likely a text file using default settings.

      Note: This method is intended for ASCII-compatible encodings (ASCII, UTF-8, etc.) and will classify UTF-16/UTF-32 files as binary.

      Parameters:
      file - the file to check
      Returns:
      true if the file appears to be a text file
      Throws:
      IOException - if an I/O error occurs
    • isTextFile

      public static boolean isTextFile(File file, int sampleSize, double nonPrintableThreshold) throws IOException
      Detects if a file is likely a text file with configurable parameters.

      Note: This method is intended for ASCII-compatible encodings (ASCII, UTF-8, etc.) and will classify UTF-16/UTF-32 files as binary.

      Parameters:
      file - the file to check
      sampleSize - number of bytes to read for analysis
      nonPrintableThreshold - maximum ratio of non-printable characters (0.0 to 1.0)
      Returns:
      true if the file appears to be a text file
      Throws:
      IOException - if an I/O error occurs
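
To get a feel for how sampleSize and nonPrintableThreshold interact, the sketch below works out how many non-printable bytes a full sample can contain before it stops counting as text. It assumes a strict "below threshold" comparison, as described in the class overview, and that the ratio is taken over the bytes actually read. ThresholdArithmetic and maxTolerated are hypothetical names, not part of this class.

```java
// Hypothetical helper for illustration; not part of swing-extras.
class ThresholdArithmetic {

    /** Largest non-printable count for which count / sampleLength is still below threshold. */
    static int maxTolerated(int sampleLength, double threshold) {
        int count = 0;
        // Keep raising the count while one more non-printable byte stays below the threshold.
        while (count < sampleLength && (double) (count + 1) / sampleLength < threshold) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 0.03 * 16384 = 491.52, so 491 bytes pass and 492 do not.
        System.out.println(maxTolerated(16384, 0.03)); // 491
    }
}
```

With the values from the Builder example (a 16KB sample and a 3% threshold), a full sample would tolerate at most 491 non-printable bytes under these assumptions.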