#REGEX FOR NUMBER GREATER THAN ZERO#
I don't know how MVC is relevant, but if your ID is an integer, this BRE should do:

^[1-9][0-9]*$

If you want to match real numbers (floats) rather than integers, you need to handle the case above, along with normal decimal numbers (i.e. 2.5 or 3.3̅), cases where your pattern is between 0 and 1 (i.e. 0.25), as well as the case where your pattern has a decimal part that is 0. And while we're at it, we'll add support for leading zeros on integers (i.e. 005 = 5):

^(0*[1-9][0-9]*(\.[0-9]+)?|0+\.[0-9]*[1-9][0-9]*)$

Note that this second one is an Extended RE. The same thing can be expressed in Basic RE, but almost everything understands ERE these days.

Let's break the expression down into parts that are easier to digest.

^( The caret matches the null at the beginning of the line, so preceding your regex with a caret anchors it to the beginning of the line. The opening parenthesis is there because of the or-bar, below.

The first part: 0*[1-9][0-9]*(\.[0-9]+)? This matches any integer or floating point number greater than or equal to 1. So our 2.0 would be matched, but 0.25 would not. The 0* at the start handles leading zeros, so 005 = 5.

| The pipe character is an "or-bar" in this context. For purposes of evaluation of this expression, it has lower precedence than everything else, and effectively joins two regular expressions together. Parentheses are used to group multiple expressions separated by or-bars.

And the second part: 0+\.[0-9]*[1-9][0-9]* This matches any number that starts with one or more 0 characters (replace + with * to match zero or more zeros, i.e. .25), followed by a period, followed by a string of digits that includes at least one that is not a 0. So this matches everything above 0 and below 1.

)$ And finally, we close the parentheses and anchor the regex to the end of the line with the dollar sign, just as the caret anchors to the beginning of the line.

Of course, if you let your programming language evaluate something numerically rather than try to match it against a regular expression, you'll save headaches and CPU.
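As a quick sanity check, here is a minimal sketch exercising both patterns with Python's re module, which accepts these ERE constructs unchanged; the test strings are arbitrary examples, not from the original answer.

```python
import re

integer_gt_zero = re.compile(r"^[1-9][0-9]*$")              # the integer BRE
number_gt_zero = re.compile(                                # the float ERE
    r"^(0*[1-9][0-9]*(\.[0-9]+)?|0+\.[0-9]*[1-9][0-9]*)$")

# Everything strictly greater than zero should match; 0, 0.0, and negatives
# should not. ".25" fails because the second branch demands at least one
# leading zero (swap the + for a * to allow it).
for s in ["2.0", "005", "0.25", "3.3", "0", "0.0", ".25", "-1"]:
    print(s, bool(number_gt_zero.match(s)))
```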
On Thu, at 11:02 AM, aatv wrote:

> I want to start using PySpark MLlib pipelines, but I don't understand how/where preprocessing fits into the pipeline. My preprocessing steps are generally in the following form:
>
> 1) Load log files (from S3) and parse them into a Spark DataFrame with columns user_id, event_type, timestamp, etc.
> 2) Group by a column, then pivot and count another column
>    - e.g. df.groupby("user_id").pivot("event_type").count()
>    - We can think of the columns this creates besides user_id as features, where the count of each event type is a different feature.
> 3) Join the data from step 1 with other metadata, usually stored in Cassandra, then perform a transformation similar to the one from step 2, where the column that is pivoted and counted came from the data stored in Cassandra.
>
> After this preprocessing, I would use transformers to create other features and feed it into a model, let's say logistic regression for example.
>
> I would like to make at least step 2 a custom transformer and add that to a pipeline, but it doesn't fit the transformer abstraction: it takes a single input column and outputs multiple columns, and it has a different number of input rows than output rows due to the group-by operation. Given that, how do I fit this into an MLlib pipeline? And if it doesn't fit as part of a pipeline, what is the best way to include it in my code so that it can easily be reused both for training and testing, as well as in production?

On Fri, at 9:10 PM, Yanbo Liang wrote:

> Hi Adrian,
>
> Did you try SQLTransformer? Your preprocessing steps are SQL operations and can be handled by SQLTransformer in MLlib pipeline scope.
>
> Thanks
> Yanbo

SQLTransformer is a good solution if all operators are combined with SQL. By the way, if you like to get your hands dirty, writing a transformer in Scala is not hard, and multiple output columns are valid in such a case.
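For concreteness, here is a minimal sketch of the SQLTransformer suggestion. It assumes a fixed, known-in-advance set of event types ('click', 'view', and 'buy' are hypothetical placeholders) and that every input row carries its user's label; the conditional counts stand in for the groupby/pivot/count, since pivot output columns depend on the data, while a reusable pipeline needs a stable schema.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import SQLTransformer, VectorAssembler

# Step 2 rewritten as SQL: __THIS__ is SQLTransformer's placeholder for the
# incoming DataFrame. The per-event-type conditional sums emulate
# df.groupby("user_id").pivot("event_type").count() with a fixed column set.
pivot_counts = SQLTransformer(statement="""
    SELECT user_id,
           FIRST(label) AS label,
           SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS n_click,
           SUM(CASE WHEN event_type = 'view'  THEN 1 ELSE 0 END) AS n_view,
           SUM(CASE WHEN event_type = 'buy'   THEN 1 ELSE 0 END) AS n_buy
    FROM __THIS__
    GROUP BY user_id
""")

# Collect the per-event-type counts into the single vector column that
# LogisticRegression expects.
assembler = VectorAssembler(
    inputCols=["n_click", "n_view", "n_buy"], outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit once, reuse everywhere:
#   model = pipeline.fit(train_df)          # train_df: user_id, event_type, label
#   predictions = model.transform(test_df)  # same code path in production
pipeline = Pipeline(stages=[pivot_counts, assembler, lr])
```

Note that because SQLTransformer simply executes its statement against the incoming DataFrame, the group-by's change in row count, which breaks the usual one-row-in, one-row-out transformer intuition, is not a problem here.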