String Handling and Regular Expressions
Inside C#, Second Edition: String Handling and Regular Expressions Part 1
By Tom Archer | 14 May 2002 | Article
.NET1.0 C# Windows Dev Intermediate
This article will examine the String class, some of its simple methods, and its range of formatting specifiers.
Publisher: Microsoft Press
Published: Apr 2002
ISBN: 0735616485
Price: US 49.99
Pages: 912
http://www.codeproject.com/Articles/2270/Inside-C-Second-Edition-String-Handling-and-... 5/13/2012
Inside C#, Second Edition: String Handling and Regular Expressions Part 1 - CodeProject Page 2 of 19
    public static void Main(string[] args)
    {
        string a = "strong";

        // Replace all 'o' with 'i'
        string b = a.Replace('o', 'i');
        Console.WriteLine(b);

        string c = b.Insert(3, "engthen");
        string d = c.ToUpper();
        Console.WriteLine(d);
    }
}
The String class has a range of comparison methods, including Compare and overloaded operators, as this continuation of the previous example shows:
    if (d == c) // Different
    {
        Console.WriteLine("same");
    }
    else
    {
        Console.WriteLine("different");
    }
Note that the string variable a in the second-to-last example isn't changed by the Replace operation: Replace returns a new string and leaves the original untouched. However, you can always reassign a string variable if you choose. For example:
    string q = "Foo";
    q = q.Replace('o', 'i');
    Console.WriteLine(q);
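The point about Replace not changing its target can be checked directly. The following is a minimal sketch; the class name ImmutabilitySketch and the helper ReplaceDemo are illustrative names that do not appear in the original listing.

```csharp
using System;

public class ImmutabilitySketch
{
    // Returns the original and the replaced string so the caller can compare them.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string[] ReplaceDemo(string original)
    {
        // Replace does not modify 'original'; it returns a new string.
        string replaced = original.Replace('o', 'i');
        return new string[] { original, replaced };
    }

    public static void Main(string[] args)
    {
        string[] result = ReplaceDemo("strong");
        Console.WriteLine(result[0]);   // strong
        Console.WriteLine(result[1]);   // string
    }
}
```

Only an explicit reassignment, as in q = q.Replace(...), makes a variable refer to the new value.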
You can combine string objects with conventional char arrays and even index into a string in the conventional manner:
    string e = "dog" + "bee";
    e += "cat";
    string f = e.Substring(1,7);
    Console.WriteLine(f);

    for (int i = 0; i < f.Length; i++)
    {
        Console.Write("{0,-3}", f[i]);
    }
If you want a null string, declare one and assign null to it. Subsequently, you can reassign it with another string, as shown in the following example. Because the assignment to g from f.Remove is in a conditional block, the compiler will reject the Console.WriteLine(g) statement unless g has been assigned either null or some valid string value.
    string g = null;
    if (f.StartsWith("og"))
    {
        g = f.Remove(2,3);
    }
    Console.WriteLine(g);
If you're familiar with the Microsoft Foundation Classes (MFC) CString, the Windows Template Library (WTL) CString, or the Standard Template Library (STL) string class, the String.Format method will come as no surprise. Furthermore, Console.WriteLine uses the same format specifiers as the String class, as shown here:
    int x = 16;
    decimal y = 3.57m;
    string h = String.Format(
        "item {0} sells at {1:C}", x, y);
    Console.WriteLine(h);
If you have experience with Microsoft Visual Basic, you won't be surprised to find that you can concatenate a string with any other data type using the plus sign (+). This is because all types have at least inherited object.ToString. Here's the syntax:
    string t = "item " + 12 + " sells at " + '\xA3' + 3.45;
    Console.WriteLine(t);
String.Format has a lot in common with Console.WriteLine. Both methods include an overload that takes an open-ended (params) array of objects as the last argument. The following two statements will now produce the same output:
    // This works because last param is a params object[].
    Console.WriteLine(
        "Hello {0} {1} {2} {3} {4} {5} {6} {7} {8}",
        123, 45.67, true, 'Q', 4, 5, 6, 7, '8');

    // This also works.
    string u = String.Format(
        "Hello {0} {1} {2} {3} {4} {5} {6} {7} {8}",
        123, 45.67, true, 'Q', 4, 5, 6, 7, '8');
    Console.WriteLine(u);
String Formatting
Both String.Format and WriteLine formatting are governed by the same formatting rules: the format parameter is embedded with zero or more format specifications of the form "{ N [, M ][: formatString ]}", arg1, ... argN, where:
    N is a zero-based integer indicating the argument to be formatted.
    M is an optional integer indicating the width of the region to contain the formatted value, padded with spaces. If M is negative, the formatted value is left-justified; if M is positive, the value is right-justified.
    formatString is an optional string of formatting codes.
    argN is the expression to use at the equivalent position inside the quotes in the string.
If argN is null, an empty string is used instead. If formatString is omitted, the ToString method of the argument specified by N provides formatting. For example, the following three statements produce the same output:
    public class TestConsoleApp
    {
        public static void Main(string[] args)
        {
            Console.WriteLine(123);
            Console.WriteLine("{0}", 123);
            Console.WriteLine("{0:D3}", 123);
        }
    }
Therefore: The comma (,M) determines the field width and justification. The colon (:formatString) determines how to format the data (such as currency, scientific notation, or hexadecimal), as shown here:
    Console.WriteLine("{0,5} {1,5}", 123, 456);      // Right-aligned
    Console.WriteLine("{0,-5} {1,-5}", 123, 456);    // Left-aligned
    Console.WriteLine("{0,-10:D6} {1,-10:D6}", 123, 456);
Of course, you can combine them, putting the comma first, then the colon:
Console.WriteLine("{0,-10:D6} {1,-10:D6}", 123, 456);
We could use these formatting features to output data in columns with appropriate alignment, for example:
    Console.WriteLine("\n{0,-10}{1,-3}", "Name","Salary");
    Console.WriteLine("----------------");
    Console.WriteLine("{0,-10}{1,6}", "Bill", 123456);
    Console.WriteLine("{0,-10}{1,6}", "Polly", 7890);
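The width and format parts of a specification can also be combined when building the column string itself. The following is a minimal sketch under stated assumptions: the class name ColumnSketch, the helper Row, the column widths, and the use of the invariant culture (so that the group separator does not depend on regional settings) are all choices made for illustration, not part of the original listing.

```csharp
using System;
using System.Globalization;

public class ColumnSketch
{
    // Formats one table row: the name is left-aligned in 10 columns and the
    // salary is right-aligned in 8 columns with group separators (N0).
    // (Hypothetical helper, introduced only for this illustration.)
    public static string Row(string name, int salary)
    {
        return String.Format(CultureInfo.InvariantCulture,
            "{0,-10}{1,8:N0}", name, salary);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine("{0,-10}{1,8}", "Name", "Salary");
        Console.WriteLine(Row("Bill", 123456));
        Console.WriteLine(Row("Polly", 7890));
    }
}
```

Because every row uses the same widths, each line is exactly 18 characters long and the salary column lines up on the right.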
Format Specifiers
Standard numeric format strings are used to return strings in commonly used formats. They take the form X0, in which X is the format specifier and 0 is the precision specifier. The format specifier can be one of the nine built-in format characters that define the most commonly used numeric format types, as shown in Table 10-1.

Table 10-1 - String and WriteLine Format Specifiers

    Character   Interpretation
    C or c      Currency
    D or d      Decimal (decimal integer; don't confuse with the .NET Decimal type)
    E or e      Exponent
    F or f      Fixed point
    G or g      General
    N or n      Number
    P or p      Percentage
    R or r      Round-trip (for floating-point values only); guarantees that a numeric value converted to a string will be parsed back into the same numeric value
    X or x      Hex
Let's see what happens if we have a string format for an integer value using each of the format specifiers in turn. The comments in the following code show the output.
    public class FormatSpecApp
    {
        public static void Main(string[] args)
        {
            int i = 123456;
            Console.WriteLine("{0:C}", i); // 123,456.00
            Console.WriteLine("{0:D}", i); // 123456
            Console.WriteLine("{0:E}", i); // 1.234560E+005
            Console.WriteLine("{0:F}", i); // 123456.00
            Console.WriteLine("{0:G}", i); // 123456
            Console.WriteLine("{0:N}", i); // 123,456.00
            Console.WriteLine("{0:P}", i); // 12,345,600.00 %
            Console.WriteLine("{0:X}", i); // 1E240
        }
    }
The precision specifier controls the number of significant digits or zeros to the right of a decimal:
    Console.WriteLine("{0:C5}", i); // 123,456.00000
    Console.WriteLine("{0:D5}", i); // 123456
    Console.WriteLine("{0:E5}", i); // 1.23456E+005
    Console.WriteLine("{0:F5}", i); // 123456.00000
    Console.WriteLine("{0:G5}", i); // 1.2346E+05
    Console.WriteLine("{0:N5}", i); // 123,456.00000
    Console.WriteLine("{0:P5}", i); // 12,345,600.00000 %
    Console.WriteLine("{0:X5}", i); // 1E240
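For the integer-oriented specifiers D and X, the precision specifier acts as a minimum digit count, while for F and N it sets the number of decimal places. The following is a minimal sketch of that difference; the class name PrecisionSketch and the helper Fmt are illustrative names, and the invariant culture is assumed so that the separators do not depend on the machine's regional settings.

```csharp
using System;
using System.Globalization;

public class PrecisionSketch
{
    // Formats an integer with a standard format string under the invariant culture.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string Fmt(int value, string format)
    {
        return value.ToString(format, CultureInfo.InvariantCulture);
    }

    public static void Main(string[] args)
    {
        int i = 123456;
        Console.WriteLine(Fmt(i, "D8"));  // 00123456  (padded to 8 digits)
        Console.WriteLine(Fmt(i, "X8"));  // 0001E240  (padded to 8 hex digits)
        Console.WriteLine(Fmt(i, "F2"));  // 123456.00 (2 decimal places)
        Console.WriteLine(Fmt(i, "N1"));  // 123,456.0 (1 decimal place, grouped)
    }
}
```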
The R (round-trip) format works only with floating-point values: the value is first tested using the general format, with 15 digits of precision for a Double and seven digits of precision for a Single. If the value is successfully parsed back to the same numeric value, it's formatted using the general format specifier. On the other hand, if the value isn't successfully parsed back to the same numeric value, the value is formatted using 17 digits of precision for a Double and nine digits of precision for a Single. Although a precision specifier can be appended to the round-trip format specifier, it's ignored.
    double d = 1.2345678901234567890;
    Console.WriteLine("Floating-Point:\t{0:F16}", d); // 1.2345678901234600
    Console.WriteLine("Roundtrip:\t{0:R16}", d);      // 1.2345678901234567
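The round-trip guarantee can be exercised by formatting a value with R and parsing the text back. The following is a minimal sketch; the class name RoundTripSketch and the helper SurvivesRoundTrip are illustrative names, and the invariant culture is assumed for both formatting and parsing so that the decimal separator matches.

```csharp
using System;
using System.Globalization;

public class RoundTripSketch
{
    // Formats a double with "R", parses the text back, and reports
    // whether the original value was recovered exactly.
    // (Hypothetical helper, introduced only for this illustration.)
    public static bool SurvivesRoundTrip(double value)
    {
        string text = value.ToString("R", CultureInfo.InvariantCulture);
        double parsed = double.Parse(text, CultureInfo.InvariantCulture);
        return parsed == value;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(SurvivesRoundTrip(1.2345678901234567890));
        Console.WriteLine(SurvivesRoundTrip(0.1 + 0.2));
    }
}
```

The exact digits printed by R can differ between runtime versions, which is why the sketch compares parsed values rather than strings.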
If the standard formatting specifiers aren't enough for you, you can use picture format strings to create custom string output. Picture format definitions are described using placeholder strings that identify the minimum and maximum number of digits used, the placement or appearance of the negative sign, and the appearance of any other text within the number, as shown in Table 10-2.

Table 10-2 - Custom Format Specifiers

    Format Character      Purpose                     Description
    0                     Display zero placeholder    Results in a nonsignificant zero if a number has fewer digits than there are zeros in the format
    #                     Display digit placeholder   Replaces the pound symbol (#) with only significant digits
    .                     Decimal point               Displays a period (.)
    ,                     Group separator             Separates number groups, as in 1,000
    %                     Percent notation            Displays a percent sign (%)
    E+0 E-0 e+0 e-0       Exponent notation           Formats the output in exponent notation
    \                     Literal character           Used with traditional formatting sequences such as \n (newline)
    'ABC' "ABC"           Literal string              Displays any string within quotes or apostrophes literally
    ;                     Section separator           Specifies different output if the numeric value to be formatted is positive, negative, or zero
Let's see the strings that result from a set of customized formats, using first a positive integer, then using the negative value of that same integer, and finally using zero:
    int i = 123456;
    Console.WriteLine();
    Console.WriteLine("{0:#0}", i);               // 123456
    Console.WriteLine("{0:#0;(#0)}", i);          // 123456
    Console.WriteLine("{0:#0;(#0);<zero>}", i);   // 123456
    Console.WriteLine("{0:#%}", i);               // 12345600%

    i = -123456;
    Console.WriteLine();
    Console.WriteLine("{0:#0}", i);               // -123456
    Console.WriteLine("{0:#0;(#0)}", i);          // (123456)
    Console.WriteLine("{0:#0;(#0);<zero>}", i);   // (123456)
    Console.WriteLine("{0:#%}", i);               // -12345600%

    i = 0;
    Console.WriteLine();
    Console.WriteLine("{0:#0}", i);               // 0
    Console.WriteLine("{0:#0;(#0)}", i);          // 0
    Console.WriteLine("{0:#0;(#0);<zero>}", i);   // <zero>
    Console.WriteLine("{0:#%}", i);               // %
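The section separator is the part of the picture format that most often benefits from a small helper. The following is a minimal sketch; the class name SectionSketch and the helper Accounting are illustrative names, and the invariant culture is assumed so that the result is the same on every machine.

```csharp
using System;
using System.Globalization;

public class SectionSketch
{
    // Three sections separated by ';' : positive ; negative ; zero.
    private const string Picture = "#0;(#0);<zero>";

    // Formats a value accounting-style: negatives in parentheses,
    // zero as the literal text <zero>.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string Accounting(int value)
    {
        return value.ToString(Picture, CultureInfo.InvariantCulture);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(Accounting(123456));    // 123456
        Console.WriteLine(Accounting(-123456));   // (123456)
        Console.WriteLine(Accounting(0));         // <zero>
    }
}
```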
} }
From the foregoing code, you can see that the statement
Console.WriteLine(a);
is the same as
Console.WriteLine(a.ToString());
The reason for this equivalence is that the ToString method has been overridden in the Int32 type to produce a string representation of the numeric value. By default, however, ToString will return the name of the object's type (the same as GetType), a name composed of the enclosing namespace or namespaces and the class name. This equivalence is clear when we call ToString on our Thing reference. We can, and should, override the inherited ToString for any nontrivial user-defined type:
    public class Thing
    {
        public int i = 2;
        public int j = 3;

        public override string ToString()
        {
            return String.Format("i = {0}, j = {1}", i, j);
        }
    }
} }
Certain non-digit characters in an input string are allowed by default, including leading and trailing spaces, commas and decimal points, and plus and minus signs. Therefore, the following Parse statements are equivalent:
    string t = " -1,234,567.890 ";
    //double g = double.Parse(t); // Same thing
    double g = double.Parse(t,
        NumberStyles.AllowLeadingSign |
        NumberStyles.AllowDecimalPoint |
        NumberStyles.AllowThousands |
        NumberStyles.AllowLeadingWhite |
        NumberStyles.AllowTrailingWhite);
    Console.WriteLine("g = {0:F}", g);
Note that to use NumberStyles you must add a using statement for System.Globalization. You can then either combine the various NumberStyles enum values with the bitwise OR operator (|) or use NumberStyles.Any for all of them. If you also want to accommodate a currency symbol, you need the third Parse overload, which takes a NumberFormatInfo object as a parameter. You then set the CurrencySymbol property of the NumberFormatInfo object to the expected symbol before passing it as the third parameter to Parse, which modifies the Parse behavior:
    string u = " -1,234,567.890 ";
    NumberFormatInfo ni = new NumberFormatInfo();
    ni.CurrencySymbol = "";
    double h = Double.Parse(u, NumberStyles.Any, ni);
    Console.WriteLine("h = {0:F}", h);
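To see the currency-symbol overload at work with a visible symbol, consider the following minimal sketch. The dollar sign, the sample amount, the class name CurrencyParseSketch, and the helper ParseAmount are all assumptions made for illustration; they are not taken from the original listing.

```csharp
using System;
using System.Globalization;

public class CurrencyParseSketch
{
    // Parses text such as " $1,234.50 " using an explicitly configured
    // NumberFormatInfo. The "$" symbol is an assumption for this sketch.
    // (Hypothetical helper, introduced only for this illustration.)
    public static double ParseAmount(string text)
    {
        NumberFormatInfo nfi = new NumberFormatInfo();
        nfi.CurrencySymbol = "$";
        // NumberStyles.Any permits white space, signs, group separators,
        // a decimal point, and the currency symbol.
        return double.Parse(text, NumberStyles.Any, nfi);
    }

    public static void Main(string[] args)
    {
        double amount = ParseAmount(" $1,234.50 ");
        Console.WriteLine(amount.ToString("F2", CultureInfo.InvariantCulture)); // 1234.50
    }
}
```

If the input might be malformed, double.TryParse with the same arguments avoids the FormatException that Parse throws.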
In addition to NumberFormatInfo, we can use the CultureInfo class. CultureInfo represents information about a specific culture, including the names of the culture, the writing system, and the calendar used, as well as access to culture-specific objects that provide methods for common operations, such as formatting dates and sorting strings. The culture names follow the RFC 1766 standard in the format <languagecode2>-<country/regioncode2>, in which <languagecode2> is a lowercase two-letter code derived from ISO 639-1 and <country/regioncode2> is an uppercase two-letter code derived from ISO 3166. For example, U.S. English is "en-US", UK English is "en-GB", and Trinidad and Tobago English is "en-TT". We could, for instance, create a CultureInfo object for English in the United States and convert an integer value to a string based on this CultureInfo:
    int k = 12345;
    CultureInfo us = new CultureInfo("en-US");
    string v = k.ToString("c", us);
    Console.WriteLine(v);
$12,345.00
Note that we're using a ToString overload that takes a format string as its first parameter and an IFormatProvider interface implementation (in this case, a CultureInfo reference) as its second parameter. Here's another example, this time for Danish in Denmark:
    CultureInfo dk = new CultureInfo("da-DK");
    string w = k.ToString("c", dk);
    Console.WriteLine(w);
DateTime values are formatted using standard or custom patterns stored in the properties of a DateTimeFormatInfo instance. To modify how a value is displayed, the DateTimeFormatInfo instance must be writeable so that custom patterns can be saved in its properties.
    using System.Globalization;

    public class DatesApp
    {
        public static void Main(string[] args)
        {
            DateTime dt = DateTime.Now;
            Console.WriteLine(dt);
            Console.WriteLine("date = {0}, time = {1}\n",
                dt.Date, dt.TimeOfDay);
        }
    }
Table 10-3 lists the standard format characters for each standard pattern and the associated DateTimeFormatInfo property that can be set to modify the standard pattern.

Table 10-3 - DateTime Formatting

    Format Character   Format Pattern                   Associated Property/Description
    d                  MM/dd/yyyy                       ShortDatePattern
    D                  dddd,MMMM dd,yyyy                LongDatePattern
    g                  MM/dd/yyyy HH:mm                 General (short date and short time)
    G                  MM/dd/yyyy HH:mm:ss              General (short date and long time)
    m,M                MMMM dd                          MonthDayPattern
    r,R                ddd, dd MMM yyyy HH:mm:ss GMT    RFC1123Pattern
    s                  yyyy-MM-ddTHH:mm:ss              SortableDateTimePattern
    t                  HH:mm                            ShortTimePattern
    T                  HH:mm:ss                         LongTimePattern
    u                  yyyy-MM-dd HH:mm:ssZ             UniversalSortableDateTimePattern
    U                  dddd,MMMM dd,yyyy,HH:mm:ss       Full date and time, using universal time
    y,Y                MMMM,yyyy                        YearMonthPattern
The DateTimeFormatInfo.InvariantInfo property gets the default read-only DateTimeFormatInfo instance that's culture independent (invariant). You can also create custom patterns. Note that InvariantInfo isn't necessarily the same as the current locale info: Invariant equates to U.S. standard. Also, if you pass null as the second parameter (the format provider) to DateTime.ToString, the DateTimeFormatInfo will default to CurrentInfo, as in:
    Console.WriteLine(dt.ToString("d", dtfi));
    Console.WriteLine(dt.ToString("d", null));
    Console.WriteLine();
    Console.WriteLine(dt.ToString("F", dtfi));
    Console.WriteLine(dt.ToString("g", dtfi));
    Console.WriteLine(dt.ToString("G", dtfi));
    Console.WriteLine(dt.ToString("m", dtfi));
    Console.WriteLine(dt.ToString("r", dtfi));
    Console.WriteLine(dt.ToString("s", dtfi));
    Console.WriteLine(dt.ToString("t", dtfi));
    Console.WriteLine(dt.ToString("T", dtfi));
    Console.WriteLine(dt.ToString("u", dtfi));
    Console.WriteLine(dt.ToString("U", dtfi));
    Console.WriteLine(dt.ToString("d", dtfi));
    Console.WriteLine(dt.ToString("y", dtfi));
    Console.WriteLine(dt.ToString("dd-MMM-yy", dtfi));
12:55 12:55:03
GMT
12:55:03
    [I]nvariant or [C]urrent Info?: C
    03/01/2002
    03/01/2002
    03 January 2002
    03 January 2002 12:55
    03 January 2002 12:55:47
    03/01/2002 12:55
    03/01/2002 12:55:47
    03 January
    Thu, 03 Jan 2002 12:55:47 GMT
    2002-01-03T12:55:47
    12:55
    12:55:47
    2002-01-03 12:55:47Z
    03 January 2002 12:55:47
    03/01/2002
    January 2002
    03-Jan-02
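The output above reflects the current culture of the machine that produced it. For comparison, the following minimal sketch formats a fixed date with DateTimeFormatInfo.InvariantInfo so that the results do not depend on regional settings. The class name DateFormatSketch, the helper Fmt, and the fixed timestamp (chosen to match the sample output) are assumptions made for illustration.

```csharp
using System;
using System.Globalization;

public class DateFormatSketch
{
    // Formats a DateTime with the culture-independent (invariant) patterns.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string Fmt(DateTime value, string format)
    {
        return value.ToString(format, DateTimeFormatInfo.InvariantInfo);
    }

    public static void Main(string[] args)
    {
        DateTime dt = new DateTime(2002, 1, 3, 12, 55, 47);
        Console.WriteLine(Fmt(dt, "d"));          // 01/03/2002
        Console.WriteLine(Fmt(dt, "s"));          // 2002-01-03T12:55:47
        Console.WriteLine(Fmt(dt, "r"));          // Thu, 03 Jan 2002 12:55:47 GMT
        Console.WriteLine(Fmt(dt, "dd-MMM-yy"));  // 03-Jan-02
    }
}
```

Note that the invariant short date puts the month first, whereas the sample output above (03/01/2002) puts the day first.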
Encoding Strings
The System.Text namespace offers an Encoding class. Encoding is an abstract class, so you can't instantiate it directly. However, it does provide a range of methods and properties for converting arrays and strings of Unicode characters to and from arrays of bytes encoded for a target code page. These properties actually resolve to returning an implementation of the Encoding class. Table 10-4 shows some of these properties.

Table 10-4 - String Encoding Classes

    Property            Encoding
    ASCII               Encodes Unicode characters as single, 7-bit ASCII characters. This encoding supports only character values between U+0000 and U+007F.
    BigEndianUnicode    Encodes each Unicode character as two consecutive bytes, using big endian (code page 1201) byte ordering.
    Unicode             Encodes each Unicode character as two consecutive bytes, using little endian (code page 1200) byte ordering.
    UTF7                Encodes Unicode characters using the UTF-7 encoding. (UTF-7 stands for UCS Transformation Format, 7-bit form.) This encoding supports all Unicode character values and can be accessed as code page 65000.
    UTF8                Encodes Unicode characters using the UTF-8 encoding. (UTF-8 stands for UCS Transformation Format, 8-bit form.) This encoding supports all Unicode character values and can be accessed as code page 65001.
For example, you can convert a simple sequence of bytes into a conventional ASCII string, as shown here:
    class StringEncodingApp
    {
        static void Main(string[] args)
        {
            byte[] ba = new byte[] {72, 101, 108, 108, 111};
            string s = Encoding.ASCII.GetString(ba);
            Console.WriteLine(s);
        }
    }
If you want to convert to something other than ASCII, simply use one of the other Encoding properties. The following example has the same output as the previous example:
    byte[] bb = new byte[] {0,72, 0,101, 0,108, 0,108, 0,111};
    string t = Encoding.BigEndianUnicode.GetString(bb);
    Console.WriteLine(t);
The System.Text namespace also includes several classes derived from, and therefore implementing, the abstract Encoding class. These classes offer similar behavior to the properties in the Encoding class itself:
    ASCIIEncoding
    UnicodeEncoding
    UTF7Encoding
    UTF8Encoding
You could achieve the same results as those from the previous example with the following code:
    ASCIIEncoding ae = new ASCIIEncoding();
    Console.WriteLine(ae.GetString(ba));
    UnicodeEncoding bu = new UnicodeEncoding(true, false);
    Console.WriteLine(bu.GetString(bb));
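The conversion also works in the other direction, from a string to bytes, using GetBytes. The following is a minimal sketch of a round trip through the big-endian Unicode encoding; the class name EncodingSketch and its two helpers are illustrative names, not part of the original listing.

```csharp
using System;
using System.Text;

public class EncodingSketch
{
    // Encodes a string as two bytes per character, high byte first.
    // (Hypothetical helper, introduced only for this illustration.)
    public static byte[] ToBigEndian(string s)
    {
        return Encoding.BigEndianUnicode.GetBytes(s);
    }

    // Decodes big-endian two-byte characters back into a string.
    public static string FromBigEndian(byte[] bytes)
    {
        return Encoding.BigEndianUnicode.GetString(bytes);
    }

    public static void Main(string[] args)
    {
        byte[] bytes = ToBigEndian("Hello");
        Console.WriteLine(BitConverter.ToString(bytes)); // 00-48-00-65-00-6C-00-6C-00-6F
        Console.WriteLine(FromBigEndian(bytes));         // Hello
    }
}
```

The hexadecimal values 48, 65, 6C, and 6F correspond to the decimal values 72, 101, 108, and 111 used in the byte arrays above.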
Note that, as with most other types, you can easily convert from a StringBuilder to a String:
    string s = sb.ToString().ToUpper();
    Console.WriteLine(s);
Splitting Strings
The String class does offer a Split method for splitting a string into substrings, with the splits determined by arbitrary separator characters that you supply to the method. For example:
    class SplitStringApp
    {
        static void Main(string[] args)
        {
            string s = "Once Upon A Time In America";
            char[] seps = new char[]{' '};
            foreach (string ss in s.Split(seps))
                Console.WriteLine(ss);
        }
    }
The separators parameter to String.Split is an array of char; therefore, we can split a string based on multiple delimiters. However, we have to be careful about special characters such as the backslash (\) and single quote ('). The following code produces the same output as the previous example did:
    string t = "Once,Upon:A/Time\\In\'America";
    char[] sep2 = new char[]{ ' ', ',', ':', '/', '\\', '\''};
    foreach (string ss in t.Split(sep2))
        Console.WriteLine(ss);
Note that the Split method is quite simple and not too useful if we want to split substrings that are separated by multiple instances of some character. For example, if we have more than one space between any of the words in our string, we'll get these results:
    string u = "Once Upon A Time In America";
    char[] sep3 = new char[]{' '};
    foreach (string ss in u.Split(sep3))
        Console.WriteLine(ss);
Upon A Time In
America
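The effect of repeated separators is easiest to see by counting the entries that Split returns. The following is a minimal sketch; the class name SplitSketch, its helpers, and the sample string are illustrative. It also assumes a framework version that provides the StringSplitOptions overload of Split, which is not available in .NET 1.0.

```csharp
using System;

public class SplitSketch
{
    // Splits on single spaces; consecutive spaces produce empty entries.
    // (Hypothetical helper, introduced only for this illustration.)
    public static string[] SplitKeepEmpty(string s)
    {
        return s.Split(new char[] { ' ' });
    }

    // Same split, but empty entries are dropped.
    // Assumes the StringSplitOptions overload is available.
    public static string[] SplitDropEmpty(string s)
    {
        return s.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main(string[] args)
    {
        // Two spaces after "Once" and three spaces after "A".
        string u = "Once" + "  " + "Upon A" + "   " + "Time";

        Console.WriteLine(SplitKeepEmpty(u).Length);  // 7 (four words, three empty strings)
        Console.WriteLine(SplitDropEmpty(u).Length);  // 4
    }
}
```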
In the second article of this two-part series, we'll consider the regular expression classes in the .NET Framework, and we'll see how to solve this particular problem and many others.
Extending Strings
In libraries before the .NET era, it became common practice to extend the String class found in the library with enhanced features. Unfortunately, the String class in the .NET Framework is sealed; therefore, you can't derive from it. On the other hand, it's entirely possible to provide a series of encapsulated static methods that process strings. For example, the String class does offer the ToUpper and ToLower methods for converting to uppercase or lowercase, respectively, but this class doesn't offer a method to convert to proper case (initial capitals on each word). Providing such functionality is simple, as shown here:
    public class StringEx
    {
        public static string ProperCase(string s)
        {
            s = s.ToLower();
            string sProper = "";
            char[] seps = new char[]{' '};
            foreach (string ss in s.Split(seps))
            {
                // Skip the empty entries that Split returns for
                // consecutive spaces; ss[0] would throw on them.
                if (ss.Length == 0) continue;

                sProper += char.ToUpper(ss[0]);
                sProper += (ss.Substring(1, ss.Length - 1) + ' ');
            }
            return sProper;
        }
    }

    class StringExApp
    {
        static void Main(string[] args)
        {
            string s = "the qUEEn wAs in HER parLOr";
            Console.WriteLine("Initial String:\t{0}", s);
            string t = StringEx.ProperCase(s);
            Console.WriteLine("ProperCase:\t{0}", t);
        }
    }
This will produce the output shown here. (In the second part of this two-part series, we'll see how to achieve the same results with regular expressions.)
    Initial String: the qUEEn wAs in HER parLOr
    ProperCase:     The Queen Was In Her Parlor
Another classic operation that doubtless will appear again is a test for a palindromic string, that is, a string that reads the same backwards and forwards:
    public static bool IsPalindrome(string s)
    {
        int iLength, iHalfLen;
        iLength = s.Length - 1;
        iHalfLen = iLength / 2;
        for (int i = 0; i <= iHalfLen; i++)
        {
            if (s.Substring(i, 1) != s.Substring(iLength - i, 1))
            {
                return false;
            }
        }
        return true;
    }

    static void Main(string[] args)
    {
        Console.WriteLine("\nPalindromes?");
        string[] sa = new string[]{
            "level", "minim", "radar",
            "foobar", "rotor", "banana"};

        foreach (string v in sa)
            Console.WriteLine("{0}\t{1}",
                v, StringEx.IsPalindrome(v));
    }
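The same test can be written with the string indexer instead of Substring, which avoids creating two one-character strings per comparison. The following is a minimal sketch of that variation, not the book's method; the class name PalindromeSketch is an illustrative name.

```csharp
using System;

public class PalindromeSketch
{
    // Compares characters from both ends, moving inward.
    // No substrings are allocated; the indexer returns a char.
    public static bool IsPalindrome(string s)
    {
        for (int left = 0, right = s.Length - 1; left < right; left++, right--)
        {
            if (s[left] != s[right])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main(string[] args)
    {
        string[] words = new string[] { "level", "foobar", "rotor" };
        foreach (string w in words)
        {
            Console.WriteLine("{0}\t{1}", w, IsPalindrome(w));
        }
    }
}
```

The loop stops when the two indexes meet, so a middle character in an odd-length string is never compared with itself.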
For more complex operations (such as conditional splitting or joining, extended parsing or tokenizing, and sophisticated trimming) in which the String class doesn't offer the power you want, you can turn to the Regex class. That's what we'll look at next in the follow-up article to this one.
String Interning
One of the reasons strings were designed to be immutable is that this arrangement allows the system to intern them. During the process of string interning, all the constant strings in an application are stored in a common place in memory, thus eliminating unnecessary duplicates. This practice clearly saves space at run time but can confuse the unwary. For example, recall that the equivalence operator (==) will test for value equivalence for value types and for address (or reference) equivalence for reference types. Therefore, in the following application, when we compare two reference type objects of the same class with the same contents, the result is False. However, when we compare two string objects with the same contents, the result is True:
    class StringInterningApp
    {
        public class Thing
        {
            private int i;
            public Thing(int i) { this.i = i; }
        }

        static void Main(string[] args)
        {
            Thing t1 = new Thing(123);
            Thing t2 = new Thing(123);
            Console.WriteLine(t1 == t2);    // False

            string a = "Hello";
            string b = "Hello";
            Console.WriteLine(a == b);      // True
        }
    }
OK, but both strings are actually constants or literals. Suppose we have another string that's a variable? Again, given the same contents, the string equivalence operator will return True:
    string c = String.Copy(a);
    Console.WriteLine(a == c);      // True
Now suppose we force the run-time system to treat the two strings as objects, not strings, and therefore use the most basic reference type equivalence operator. This time we get False:
Console.WriteLine((object)a == (object)c);
Time to look at the underlying Microsoft intermediate language (MSIL), as shown in Figure 10-1.
Figure 10-1 - MSIL for string equivalence and object equivalence.

The crucial differences are as follows:
    For the first comparison (t1==t2), having loaded the two Thing object references onto the evaluation stack, the MSIL uses opcode ceq (compare equal), thus clearly comparing the references, or address values.
    However, when we load the two strings onto the stack for comparison with ldstr, the MSIL for the second comparison (a==b) is a call operation. We don't just compare the values on the stack; instead, we call the String class equivalence operator method, op_Equality.
    The same process happens for the third comparison (a==c).
    For the fourth comparison, (object)a==(object)c, we're back again to ceq. In other words, we compare the values on the stack, in this case the addresses of the two strings.

Note that Chapter 13 of Inside C# illustrates exactly how the String class can have its own equivalence operator method via operator overloading. For now, it's enough to know that the system will compare strings differently than other reference types. What happens if we compare the two original string constants and force the use of the most primitive equivalence operator? Take a look:
Console.WriteLine((object)a == (object)b);
You'll find that the output from this is True. Proof, finally, that the system is interning strings: the MSIL opcode used is again ceq, but this time it results in equality because the two strings were assigned a constant literal value that was stored only once. In fact, the Common Language Infrastructure guarantees that the result of two ldstr instructions referring to two metadata tokens with the same sequence of characters return precisely the same string object.
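The same observations can be made without casts by using object.ReferenceEquals, and a string built at run time can be mapped back to the pooled instance with String.Intern. The following is a minimal sketch; the class name InternSketch and the helper Run are illustrative names, and the run-time string is built with the String(char[]) constructor rather than String.Copy.

```csharp
using System;

public class InternSketch
{
    // Returns four comparison results, in the order printed by Main.
    // (Hypothetical helper, introduced only for this illustration.)
    public static bool[] Run()
    {
        string a = "Hello";
        string b = "Hello";
        // Built at run time from characters, so it is a distinct object.
        string c = new string(a.ToCharArray());

        return new bool[]
        {
            object.ReferenceEquals(a, b),                 // one interned literal
            a == c,                                       // same characters
            object.ReferenceEquals(a, c),                 // different objects
            object.ReferenceEquals(a, String.Intern(c))   // Intern returns the pooled instance
        };
    }

    public static void Main(string[] args)
    {
        foreach (bool result in Run())
        {
            Console.WriteLine(result);   // True, True, False, True
        }
    }
}
```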
Summary
In this article, we examined the String class and a range of ancillary classes that modify and support string operations. We explored the use of the String class methods for searching, sorting, splitting, joining, and otherwise returning modified strings. We also saw how many other classes in the .NET Framework support string processing (including Console, the basic numeric types, and DateTime) and how culture information and character encoding can affect string formatting. Finally, we saw how the system performs sneaky string interning to improve runtime efficiency. In the next article, you'll discover the Regex class and its supporting classes Match, Group, and Capture for encapsulating regular expressions.
License
This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below. A list of licenses authors might use can be found here