Is it possible to classify text with Regex?


#1

I’m trying to build a decision-tree-style model that receives a series of strings.

I’m using WEKA with the J48 classifier and StringToWordVector as a filter.

As far as I know, many classifiers (regression, for example) work on numbers rather than strings, and for now I don’t want to map strings to numbers and back.

I’ve created .arff files for the training data and the test data.

Training data:

@relation test

@attribute class-att {OUTPUT_1,OUTPUT_2,OUTPUT_3}
@attribute Text1 string
@attribute Text2 string
@attribute Text3 string
@attribute Text4 string
@attribute Text5 string

@data 
OUTPUT_1,'a','b','c','d','e'
OUTPUT_2,'a','b','c','d','?' 
OUTPUT_2,'a','b','?','?','?'

OUTPUT_3,'f','g','h','i','j'
OUTPUT_3,'f','g','h','i','?'

% Wherever '?' appears above, I want it to act as a regex wildcard (match any character or string).

Test data:

@relation test

@attribute class-att {OUTPUT_1,OUTPUT_2,OUTPUT_3}
@attribute Text1 string
@attribute Text2 string
@attribute Text3 string
@attribute Text4 string
@attribute Text5 string

@data
?,'a','b','c','d','e'
?,'a','b','c','d','x'
?,'a','b','q','w','r'
?,'f','g','h','i','j'
?,'f','g','h','i','x'

How can I get the classifier to treat ‘?’ as a regex-style wildcard wherever it appears?

Any suggestions, please? 🙂


#2

Just to clarify, are you referring to regex as a plain string (one that could be renamed to any other word, say “unknown”), or to regex as the concept of a regular expression?


#3

Yes, as a wildcard: wherever ‘?’ appears in the training data, that position can be any character or word…
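
As a rough illustration of that wildcard behaviour, here is a minimal sketch in plain Java that uses only `java.util.regex` and makes no WEKA calls. The class name `WildcardRows`, its methods, and the tie-break rule (when several training rows match, prefer the one with the fewest wildcards) are assumptions made for this example; this is not what J48 or StringToWordVector does internally. It also assumes the tokens never contain a comma.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of WEKA.
class WildcardRows {

    // Training rows from the .arff above: class label first, then Text1..Text5.
    // "?" is read as "any non-empty token in this position".
    static final String[][] RULES = {
        {"OUTPUT_1", "a", "b", "c", "d", "e"},
        {"OUTPUT_2", "a", "b", "c", "d", "?"},
        {"OUTPUT_2", "a", "b", "?", "?", "?"},
        {"OUTPUT_3", "f", "g", "h", "i", "j"},
        {"OUTPUT_3", "f", "g", "h", "i", "?"},
    };

    // Turn one training row into a regex over a comma-joined test row.
    static Pattern toPattern(String[] rule) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < rule.length; i++) {
            if (i > 1) sb.append(',');
            // Wildcard becomes "one or more non-comma characters";
            // every other token is matched literally.
            sb.append(rule[i].equals("?") ? "[^,]+" : Pattern.quote(rule[i]));
        }
        return Pattern.compile(sb.toString());
    }

    // Return the label of the most specific matching row, or null if none match.
    static String classify(String... tokens) {
        String row = String.join(",", tokens);
        String best = null;
        int bestWildcards = Integer.MAX_VALUE;
        for (String[] rule : RULES) {
            int wildcards = 0;
            for (int i = 1; i < rule.length; i++) {
                if (rule[i].equals("?")) wildcards++;
            }
            // Strict "<" keeps the earliest row when wildcard counts tie.
            if (wildcards < bestWildcards && toPattern(rule).matcher(row).matches()) {
                best = rule[0];
                bestWildcards = wildcards;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The five rows from the test file.
        String[][] tests = {
            {"a", "b", "c", "d", "e"},
            {"a", "b", "c", "d", "x"},
            {"a", "b", "q", "w", "r"},
            {"f", "g", "h", "i", "j"},
            {"f", "g", "h", "i", "x"},
        };
        for (String[] t : tests) {
            System.out.println(String.join(",", t) + " -> " + classify(t));
        }
    }
}
```

With the five test rows above, this sketch yields OUTPUT_1, OUTPUT_2, OUTPUT_2, OUTPUT_3 and OUTPUT_3 in that order; a row that matches no training row returns `null`. The "fewest wildcards wins" rule is needed because `a,b,c,d,e` matches three training rows at once.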